Bug #2683
closedHTTP requests with long GET are getting dropped
Description
Hi Guys,
i want to report a serious bug that probably is going to be unnoticed for a long time. Basically for high traffic websites (100-200 req/s) the server is getting stalled on http requests with long _GET (4k-5k size). We use lighty as backend for our web service, which serves a lot of these.
It's not easy to reproduce it seems that some URLs are triggering it and some not, also after adding `connection: close` header it seems to fix some URLs while breaking others. Even when you catch the URL that seems to be more likely to cause problems than others - the error only happens like 10-15% of the time, and only when many concurrent connections are issued.
For our test code we picked one URL that produced error, and when we run at like 200 requests/s we had like 20 or 30 timed out, per second.
This happens in 1.4.37... we switched back to 1.4.35, compiled on same machine, and everything is back to normal.
Here you can see the bug, basically when working using production data the service was timing out for 30-40 requests per minute. But there are some _GET strings (maybe depends on the length) which are problematic, and for those number of broken connections gets much higher.
http://screencast.com/t/4aGO5neJi462
As you see, server is stalled on READ, it works the same way when we point it to eg. TXT file instead of PHP. Reporting it as immediate as something in core might be broken...
Here is same code running on .35
http://screencast.com/t/H2Y0yguiFM7v
Files
Updated by nate about 9 years ago
Possibly worth noting HTTP GET's are a bit of a sticky situation. 8KB~ is about the maximum I've ever seen a browser use, but some browsers will truncate everything after a certain amount. IE and I think Safari truncate after 2KB worth of content, Chrome I think maxes out at 4KB, etc.
There's a sort of unofficial rule of thumb to avoid ever going over 2KB (2048 bytes) in a GET request, if anything since it's horrible looking, at worst because it may break things for a lot of people.
Probably worth posting what browser's you've tested on and/or what user agents you've seen logged that hang up on the request as I'd put money on it not really being a lighttpd issue necessarily.
Updated by nitrox about 9 years ago
Are you doing requests via php to external ressources? And you´re sure its not php related?
Updated by slawomir.pryczek about 9 years ago
@nitrox: It's not PHP-related, because as mentioned the same thing is going on when accessing static file this way. It makes no sense of course but when you replace file.php with file.txt and leave same GET string, the timeouts are still here.
@nate: It doesn't seem client-related as it's done using PHP Curl (7.43.0), sure this error doesn't seem to be harmful for normal websites, im talking about situation when the server is used as backend for internal services. We actually use GET so it's easier to see what's going on (even with very long query strings).
This service isn't accessible from outside, it's internal service to calculate some things. Otherwise i won't be able to see that timeouts are there and are caused by lighttpd. If the server is able to accept 4-8K HTTP/GET it shouldn't stall on connections at random. I mean it should drop it with some HTTP error code than "lock-up" the request indefinitely.
Will try to put it in some VMWare container but im preety sure (90%) it's bug in lighty. Because:
a. The client NEVER changes (always Curl 7.43.0)
b. It seems related to size of GET because when pointed to 0-size test.txt instead, it's behaving the same way
c. Linux TCP and file limit settings were tweaked to be very high to be sure it isn't caused by that
d. Internal resources, using private network and dedicated switch, so networkin can't be the issue
e. (most imporant) The bug goes away after switching to older lighty version
To explain what we're trying to do... we have a web application that is querying several servers in private network to compute final output which requires a lot of processing. That's why it's split to several machines, after updating lighty the system became unstable.
Updated by slawomir.pryczek about 9 years ago
Ok under debian:
apt-get install apache2-utils
echo abc > test.json (in HTTP dir)
Run the attached bug.sh file (edit IP Address first)
I also prepared the VMware container for you. Actually after i made it i understood that the error is so easy to trigger that probably you won't need it, but anyway. It's using debian, apache bench, and static file. No single request is getting served for long GETs. So i think this bug is for sure related to lighty, as previously i was experiencing this bug on newest fedora and CURL was HTTP client.
Here is screenshot:
http://screencast.com/t/YENHrgTfgdp
To run long request go to /home/bug.sh
To run same request with short query string go to /home/correct.sh
You can download the VM from http://liczniki.org/_other/lighty.zip
Updated by stbuehler about 9 years ago
- Priority changed from Immediate to Normal
Hi,
thanks for the virtual machine (although you might want to tell others the credentials they need to log in :) ).
I attached strace (strace -s 8000 -p $(pidof lighttpd)
) and got this (and similar output for all the other connections) when running bug.sh
:
accept(4, {sa_family=AF_INET, sin_port=htons(38667), sin_addr=inet_addr("192.168.56.101")}, [16]) = 206 fcntl(206, F_SETFD, FD_CLOEXEC) = 0 fcntl(206, F_SETFL, O_RDWR|O_NONBLOCK) = 0 ioctl(206, FIONREAD, [2047]) = 0 read(206, "GET /test.json?aaaaaaaaaaaaa[...]aaaaaaaaa", 2111) = 2047 epoll_ctl(6, EPOLL_CTL_ADD, 206, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=206, u64=206}}) = 0
You can see lighttpd adding the filedescriptor to the event loop at the bottom, and the event never fires. This means "ab" is only sending 2047 bytes, and the header is never sent completely - a bug in "ab". Then I ran "ab" from my own machine (debian testing) with the same URL and it worked - so apparently they fixed it in the meantime.
This doesn't solve your original problem, but it shows that we cannot easily reproduce the original issue with "ab".
Updated by slawomir.pryczek almost 9 years ago
Hi, right, sorry credentials are root/root.
Will try to reproduce the issue with some short PHP script then, it seems indeed that AB have some issues, it doesn't work in .35 too (while the production code we run had problems only when we switched lighty to .37). Any ideas what might be wrong?
Just for double check i tried same thing in apache and it seems that ab is indeed buggy, so this is bad test case. Hopefully i can come up with something that can be used to reproduce the bug. I tested on debian 7, probably they have old ab in repo.
Updated by slawomir.pryczek almost 9 years ago
- File debug-1-txt.txt debug-1-txt.txt added
- File lighty-stall-worker.php lighty-stall-worker.php added
Ok i was able to create script that'll trigger the bug, it'll happen after 20-100 requests, there'll be some request that'll stall.
Confirmed it works with .35 but breaks .37.
It always breaks when URL length is 5746/5745 bytes, but randomly. I tested with fixed size URLs (5746b for all requests) and it was working correctly. It seems there need to be some shorter / longer URLs for it to break. Please test the PHP code attached, i'll upload new VM if needed.
When i attach strace to process, it's strange but it's not breaking, so im not able to trigger the bug in such case.
Updated by stbuehler almost 9 years ago
- Priority changed from Normal to Immediate
Actually, this sounds like #2670
Updated by stbuehler almost 9 years ago
- Priority changed from Immediate to Normal
wtf... didn't touch priority :)
Updated by slawomir.pryczek almost 9 years ago
Ok i setup new VM... http://liczniki.org/_other/lighty-2.zip
To test please open one console go to /home and run one of web servers
./test.sh -> compile and run lighty .37
./test-35.sh -> compile and run lighty .35
Then open another console, go to /home, type ./bug.sh
http://screencast.com/t/VGfjqbzWw
Interesingly URL sizes needs to be different when the code that access URLs is on same host as server (around 4111b-4114b), maybe related to TCP packet size...
For lighty .35 all works normally.
Updated by slawomir.pryczek almost 9 years ago
Okay i got that fixed...
Just by copying CRLF detection code from .35 to .37... recompiled, both local and remote code that triggered the bug is no longer "working" :)
http://screencast.com/t/C5QYHCcPd
Any reason you haven't copied when you knew that the new code is buggy? I mean not able to detect request end is huge issue i think, even if you moved it to 4k offset by that fix, and old code was already tested... does the old one contains any other issues?
Can we use the old code instead, in .37 or we'll need to fix new one.
Updated by stbuehler almost 9 years ago
If you follow the link to #2670 you'll see that it is fixed in current SVN head (or git, whatever you prefer); the patch inlined in the issue is not what I committed in the end. I'm still working on mod_secdownload (I want to provide alternatives to md5), that is why I didn't prepare a release yet.
Updated by slawomir.pryczek almost 9 years ago
Yes i found it in connection.c, it's c <-> cc typo, hard to spot. I see it's fixed in new repo. Thanks for the help ;)
Updated by stbuehler almost 9 years ago
- Is duplicate of Bug #2670: connection hangs when final cr/lf is in seperate packet added
Updated by stbuehler almost 9 years ago
- Status changed from New to Duplicate
- Target version deleted (
1.4.x)
I assume this was the problem in the end.
Also available in: Atom