Bug #2796
closedmod_fastcgi can fail to read entire response from server
Description
I've encountered a problem with mod_fastcgi where it can occasionally fail
to fully read a response from a FastCGI server. I can reproduce the
problem, but not every time. The problem appears to have an element of
chance to it.
I noticed the problem on lighttpd 1.4.45. However, I've also tested it with
the latest git version.
I have a patch that fixes the problem. I'm not sure it will be acceptable
as is, but I thought it might help to include it.
I'll describe how to reproduce the problem and what I found.
Steps to reproduce:
1. Run a FastCGI server listening on a TCP socket.
Configure the FastCGI server to send a large response. I've found various
size responses to cause it, but it definitely needs to be above a certain
size. I can reliably reproduce it with a body size of 1,000,000 bytes (plus
the overhead of the FastCGI records).
I reproduced it with two different FastCGI server implementations.
One implementation I wrote (in Go) while trying to debug this is here:
https://github.com/horgh/fastcgi-server
Using this server like so reliably reproduces the problem:
./fastcgi-server -body-size 1000000
I also built a test application using a different FastCGI library and
caused the problem as well. I initially noticed a problem in an application
using the Go standard library FastCGI library.
2. Configure lighttpd to connect to the FastCGI server:
fastcgi.server = (
"/fcgitest" => (
(
"host" => "192.168.1.3",
"port" => 9901,
"check-local" => "disable",
)
),
)
3. Configure lighttpd's stream-response-body to 2.
server.stream-response-body = 2
4. Make a request for the resource. I've done this using cURL:
curl --fail -v http://192.168.1.3:3000/fcgitest > /dev/null
5. Periodically see the request hang. cURL will sit there and never fully
receive the response. Eventually the FastCGI server will lose the
connection to lighttpd with connection reset by peer. On the FastCGI server
side, it will be blocked trying to write to lighttpd.
I debugged this in lighttpd and I discovered that it stops reading the
response from the FastCGI server partway through, and never completes
reading it. The amount lighttpd reads varies, but frequently it does not
complete. I have some strace output if you like.
My solution:
I noticed there were calls to stop listening for the FDEVENT_IN event.
These occur only when stream-response-body is 2, so that option is crucial
to causing this behaviour.
The solution in my patch is to continue listening for FDEVENT_IN events.
I'm not sure the implications of this. I see there is a comment above one
of the calls I remove in my patch, but I don't know enough about the code
to know the full effect.
Background:
I'm not sure how relevant this is, but what originally drew my attention to
there being a problem here was that lighttpd reported responses from the
FastCGI server were incomplete:
2017-03-05 08:47:01: (mod_fastcgi.c.2424) unexpected end-of-file (perhaps the fastcgi process died): pid: 0 socket: tcp:127.0.0.1:9901
2017-03-05 08:47:01: (mod_fastcgi.c.3180) response already sent out, but backend returned error on socket: tcp:127.0.0.1:9901 for /gorse?sort-order=date-asc&user-id=1&read-state=unread, terminating connection
From what I gathered debugging this aspect, the ioctl() call to determine
how much to read was reporting 0 and then aborting the request. It is my
belief there was/is a race condition where there would be more to read but
ioctl() happened before any was ready.
After applying my patch I cannot reproduce this problem.
Aside:
The reason I had stream-response-body set to 2 was I have an audio stream
application that relies on the response continually streaming to clients.
Please let me know if I can provide any more information or help more.
Thanks for your time!
Files
Updated by gstrauss over 7 years ago
Thanks for the details on how to reproduce. I won't have a chance to look into this further for a few days, but will update when I do.
FYI: your suggested patch makes server.stream-response-body = 2
equivalent to server.stream-response-body = 1
, so that is not the correct fix for server.stream-response-body = 2
. Curious: why are you using server.stream-response-body = 2
?
Updated by horgh over 7 years ago
Thanks for the reply!
That is interesting that my patch makes the behaviour the same as stream-response-body = 1. I did not realize that.
I started using stream-response-body = 2 as I found the streaming behaviour changed when I updated from 1.4.39. The default of 0 broke an app I have that streams continuously. I actually did not try stream-response-body = 1, primarily because 2 sounded more like what I wanted ("minimal buffering"). Given what you said, it sounds like server-response-body = 1 would work just fine for me.
I suppose I was/am not clear about the intended difference of the two settings (1 vs. 2). Is it supposed to be that setting it to 2 avoids writing temporary files, and that is all?
Thanks again. I'd be pleased to help more if I can!
Updated by gstrauss over 7 years ago
Thanks to your detailed description of the issue, I was able to quickly narrow in on what I think is the problem.
This is what I see in the code around line 3678:
/* optimistic read from backend, which might re-enable FDEVENT_IN */ handler_t rc = fcgi_recv_response(srv, hctx); /*(might invalidate hctx)*/
which can theoretically result in the subsequent ioctl() returning 0 (success), while reporting bytes-ready of 0 bytes if none are ready to be read. Normally, if lighttpd receives FDEVENT_IN from the backend, ioctl() reporting 0 bytes ready would likely indicate EOF. In the optimistic read case, we have not been checking for FDEVENT_IN, so we do not know if any bytes are ready to be read. Here is an untested patch against lighttpd 1.4.45 which might handle the optimistic read case for you:
--- a/src/mod_fastcgi.c +++ b/src/mod_fastcgi.c @@ -2421,6 +2421,10 @@ static int fcgi_demux_response(server *srv, handler_ctx *hctx) { } } if (0 == r) { + if (!(fdevent_event_get_interest(srv->ev, hctx->fd) & FDEVENT_IN)) { + fdevent_event_add(srv->ev, &(hctx->fde_ndx), hctx->fd, FDEVENT_IN); + return 0; + } log_error_write(srv, __FILE__, __LINE__, "ssdsb", "unexpected end-of-file (perhaps the fastcgi process died):", "pid:", proc->pid,
(If this works for you, then a similar patch also needs to be applied in mod_proxy)
Updated by gstrauss over 7 years ago
To answer your question about the difference between server.stream-response-body = 1
and server.stream-response-body = 2
:
lighttpd is typically faster than the client and typically faster than the backend. Users who stream a large response will most likely want server.stream-response-body = 1
so that lighttpd buffers (to tempfiles) the response from the backend as quickly as the backend produces it, even if lighttpd was unable to send the response as quickly to a (far away) client. Buffering the response from the backend allows the backend to finish sending its response quickly, and then to go on serving other requests. This is quite advantageous when the backend is a fat PHP client compared to the slim lighttpd.
server.stream-response-body = 2
is more appropriate when the backend is producing what might be an extremely large stream and it is desired that the backend be blocked from sending more response data when lighttpd is unable to send response data just as quickly to the client. While this avoids the tempfiles, it also blocks the backend producer from sending more data, which might be desirable for some use cases, though probably not for the most common use cases. This might be appropriate for large generated responses from a backend to lighttpd on a small embedded system which does not have a large amount of disk space for temporary files.
server.stream-response-body = 0
just means that lighttpd should buffer the entire response from the backend before starting to send the response to the client. This currently allows mod_deflate (if configured) to compress the response before sending it to the client. In most cases, the response from the backend is small enough, the backend is located on the same machine as the lighttpd server, and the backend produces the response quickly enough that there is not a big perceived difference between server.stream-response-body = 0
and server.stream-response-body = 1
. However, the tunable exists for those who do have use cases which suggest a different tuning from the default.
In your case, it would seem that server.stream-response-body = 1
will be best for you, though I would very much appreciate some feedback if you get a chance to test the above patch with server.stream-response-body = 2
.
I'll be off the grid for the next few days, so won't be able to respond to further posts until next week.
Updated by gstrauss over 7 years ago
- Status changed from New to Patch Pending
- Target version changed from 1.4.x to 1.4.46
Updated by horgh over 7 years ago
Thank you for the patch! And thanks for the explanation about how the different server-response-body
options work and their purpose! That is very clear.
I added your explanation to the wiki page about the option. I hope that is okay! I figure it might help others understand in the future too which might be the best for them (https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stream-response-bodyDetails).
I also tested your patch against 1.4.45. Unfortunately the issue still occurs. My client hangs receiving varying amounts from lighttpd.
One other thing I noticed: On the FastCGI server, the amount it sends to lighttpd varies too (as well as what the client sees). In some cases it reports sending the entire response to lighttpd. In others, it blocks part way through sending to lighttpd.
Regarding the case on line 2423 (ioctl()
returning 0): I believe if this happens then lighttpd closes the connection to the FastCGI server and reports unexpected EOF. I've seen this error, but it's not happening in my test case currently.
I've been thinking about this after what you said about blocking the backend. Maybe wherever the buffer gets drained (con->write_queue
?) we should add back interest in FDEVENT_IN
(as in your patch). I'm not sure where that would be though!
Updated by horgh over 7 years ago
- File patch-v2.patch patch-v2.patch added
I've looked at this some more.
I discovered that the module's subrequest function gets called after writing from the write queue (CON_STATE_WRITE
in connection_state_machine()
). This is the function that has the optimistic read.
I found that if I start listening for FDEVENT_IN
again here instead of trying the optimistic read, then the problem goes away. The entire response from the FastCGI server is read by lighttpd.
As well, from what you said, it seems like the optimistic read could lead to that spurious EOF error which I originally encountered.
I attached a patch showing this change. What do you think?
Thanks again!
Updated by gstrauss over 7 years ago
I attached a patch showing this change. What do you think?
Since the behavior occurs only when server.stream-response-body = 2
, adding back FDEVENT_IN interest was missing in another location. While I found that location, the place where you add it in your patch is a better place. In the patch I will push shortly, I kept the optimistic read, but if the request is not finished (and not an error) after the optimistic read, FDEVENT_IN interest is added (as in your patch). Thanks!
Updated by gstrauss over 7 years ago
- Status changed from Patch Pending to Fixed
- % Done changed from 0 to 100
Applied in changeset e4bb56222fa7c1e315ea7d05cb4ea2f0de781534.
Updated by horgh over 7 years ago
Awesome!
I've tested with the latest git changes, including this patch, and everything is working well.
Thank you for fixing it so quickly!
Updated by gstrauss over 7 years ago
- Related to Feature #2802: Could the error-log be improved a tiny bit in regards to "Permission denied" errors propagated from the filesystem (on Linux at the least)? added
Also available in: Atom