Bug #2217
closedmod_cgi : Race condition leaving response not forwarded to HTTP client.
Description
We're having this race condition occurring (50% of tests fails) since we moved to kernel version 2.6.34.
The problem happens in mod_cgi when the cgi program writes an HTTP response to stdout (i.e. the IPC pipe with lighttpd's mod_cgi) and the cgi programs quits correctly.
Here's the problematic sequence:
1. cgi program initialization is all normal.
) mod_cgi polls on the pipe for response (nothing).
) ...
) mod_cgi polls on the pipe for response (nothing).
2. cgi program writes the HTTP response.
) mod_cgi polls on the pipe for response (got something on pipe).
) mod_cgi reads all response (109 bytes).
) mod_cgi reads again, but gets EAGAIN (normal since it's a non-blocking IO).
) mod_cgi exits read loop for now.
3. cgi program exits cleanly.
) mod_cgi's mod_cgi_handle_subrequest function is called, reaps the exited cgi program child and never finishes HTTP transaction with HTTP client.
When this happens, I have:con->file_started 1
con->file_finished 0
On other times, depending on timing, the second read from mod_cgi returns 0 (instead of EAGAIN) and then the HTTP transaction is completed successfully. It seems it depends on how fast the cgi program exits.
Can someone else corroborate this race condition, or at least verify it?
Thanks
Files
Updated by cnepveu over 14 years ago
Our fix for now is in the cgi program.
1. We close the pipe on our side (fclose stdout).
2. Sleep 2 seconds before exiting process.
This makes sure that lighttpd fd event poll detects the closed pipe, invokes mod_cgi's read loop and finishes the HTTP transaction, before reaping the child.
Updated by joshuah over 14 years ago
I have seen the same issue. I'm currently testing a patch that calls the same cleanup code performed on FDEVENT_HUP in mod_cgi.c before closing the descriptor in the subrequest handler. I'll try to post it here if it turns out to work well.
Updated by joshuah over 14 years ago
I'm attaching a patch against 1.4.23. I think it was actually tested with a derivative of 1.4.24 or 1.4.25.
Updated by cnepveu over 14 years ago
Thanks for the code. I had tried something very similar, but wasn't sure of implications or other side effects. So I elected to implement a workaround in the cgi program instead, and wait for an official patch to be released.
Let me know how your fix works. Also, what kernel version are you using?
Thanks!
Charles
Updated by joshuah over 14 years ago
I am using a kernel based on 2.6.27.44. I did not see this problem after 50 tries on 2.6.27.31 but that could just mean that it's rarer. It seemed to take less than 20 tries to reproduce it with the later kernel. With the patch I ran curl 500+ times on the problematic page without seeing the problem, so I think the patches fixes the problem, but it has not been thoroughly tested.
Updated by pi3orama over 14 years ago
I use gentoo stable kernel 2.6.32-gentoo-r7 and lighttpd-1.4.23 (built from source) and also meet this problem:
The cgi script is tiny:
#!/bin/sh echo "Content-type: text/html" echo "" echo "Plain text cxx!"
I use ab to test it:
$ /usr/sbin/ab -n 1000 -c 20 http://localhost:60002/cgi-bin/test
And this is the result:
Benchmarking localhost (be patient) ... ... Completed 800 requests Completed 900 requests apr_poll: The timeout specified has expired (70007) Total of 998 requests completed
One or more request always timeout.
Was my problem same as this one?
(PS: I have apply the patch but timeout problem still exists)
Updated by stbuehler over 14 years ago
- Target version changed from 1.4.x to 1.4.27
I couldn't reproduce it myself, but i hope my commit fixes the issue(s).
It would be nice if you could test it, thx!
Updated by stbuehler over 14 years ago
- Status changed from New to Fixed
- % Done changed from 0 to 100
Applied in changeset r2737.
Updated by pi3orama over 14 years ago
We are the authors of Snitchaser . We are trying to find the root cause of this issue for showing the effectiveness of our tool. Snitchaser is a record-replay tool which allow developers to record non-deterministic events during execution, and replay non-deterministic bugs using our replay tool. We want to share our debugging results and experience here.
The root cause of this issue is race condition in mod_cgi.c.
In cgi_demux_response()
, lighttpd loop read()
from cgi process' pipe to retrieve results. In normal situation, the last read()
returns 0, which indicates the finish of data transfer. Sometimes read()
returns EAGAIN
because of non-blocking IO. In this situation, when the pipe becomes readable or broken (cgi process terminates), the main loop in service.c will get an fdevent (FDEVENT_IN
or FDEVENT_HUP
), cgi_handle_fdevent()
will be called again. If the cgi process terminates, cgi_handle_fdevent()
will receive FDEVENT_HUP
and the con->file_finished
will be set to "1".
However, mod_cgi_handle_subrequest()
may be called before the fdevent_poll()
. mod_cgi_handle_subrequest()
use waitpid()
to get the state of cgi process. If the process terminates, it will do the cleanup work and remove pipe's fd from fdevent. Therefore, lighttpd never gets the following FDEVENT_HUP
event, so never set con->file_finished
to "1". As a result, in the connection state machine, the buggy connection won't get closed correctly, which causes a timeout of an http transaction.
To solve this problem, a small patch is enough:
--- b/src/mod_cgi.c 2009-04-29 02:29:17.000000000 +0800 +++ a/src/mod_cgi.c 2010-07-27 17:21:19.675916561 +0800 @@ -1348,10 +1348,13 @@ * check if we already got the response */ - if (!con->file_started) return HANDLER_WAIT_FOR_EVENT; + hctx->pid = 0; + + if (con->file_started) return HANDLER_FINISHED; if (WIFEXITED(status)) { /* nothing */ + return HANDLER_WAIT_FOR_EVENT; } else { log_error_write(srv, __FILE__, __LINE__, "s", "cgi died ?");
This patch doesn't remove pipe from fdevent when cgi process exit normally. Which allows cgi_handle_fdevent()
to receive FDEVENT_HUP and do the correct processing.
Snitchaser shows its power on debugging such non-deterministic issue. We first generate a buggy trace which contain 10 requests and 1 timeout, then replay the execution under gdb control over and over again. The following work is just like traditional cyclic debugging.
- By tracing
accept()
andclose()
, we realize that one of sockets never get closed. This is the direct reason of timeout; - We identify the buggy connection using its address (addresses are same in different replay), and trace its processing, and witness the EAGAIN returned by
read()
; - Compare with some correct execution, we find that in this situation,
cgi_handle_fdevent()
should receive anFDEVENT_HUP
event later and thecon->file_finished
will be set to "1"; - After some hacking and replay, we realize that the problem is caused by the removal of pipe's fd from fdevent.
- After we find the root cause, to fix this issue is straightforward.
Snitchaser has shown its ability of converting non-deterministic bug (Heisenbug) to traditional Bohrbug in this case study. Currently it is under fast development. It is unstable and may have many problems. However, we hope you to try it and report their problems and experience to us to help us improve this tool. We also want you to tell us about some non-deterministic bugs in modern open source software. We will try to fix them under the help of Snitchaser.
Thank you for your reading.
Updated by stbuehler over 14 years ago
Yes, pretty good description. Just have a look at my commit and you will find a very similar diff (i hope i fixed some more problems) :)
Also available in: Atom