https://redmine.lighttpd.net/https://redmine.lighttpd.net/favicon.ico?13667327412010-06-02T14:28:30Zlighty labsLighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=69902010-06-02T14:28:30Zcnepveu
<ul></ul><p>Our fix for now is in the cgi program.<br />1. We close the pipe on our side (fclose stdout).<br />2. Sleep 2 seconds before exiting process.</p>
<p>This makes sure that lighttpd fd event poll detects the closed pipe, invokes mod_cgi's read loop and finishes the HTTP transaction, before reaping the child.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70332010-06-14T16:47:11Zjoshuah
<ul></ul><p>I have seen the same issue. I'm currently testing a patch that calls the same cleanup code performed on FDEVENT_HUP in mod_cgi.c before closing the descriptor in the subrequest handler. I'll try to post it here if it turns out to work well.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70342010-06-14T19:02:06Zjoshuah
<ul><li><strong>File</strong> <a href="/attachments/1157">lighttpd-1.4.23-diff.patch</a> <a class="icon-only icon-download" title="Download" href="/attachments/download/1157/lighttpd-1.4.23-diff.patch">lighttpd-1.4.23-diff.patch</a> added</li></ul><p>I'm attaching a patch against 1.4.23. I think it was actually tested with a derivative of 1.4.24 or 1.4.25.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70412010-06-16T15:51:44Zcnepveu
<ul></ul><p>Thanks for the code. I had tried something very similar, but wasn't sure of implications or other side effects. So I elected to implement a workaround in the cgi program instead, and wait for an official patch to be released.</p>
<p>Let me know how your fix works. Also, what kernel version are you using?<br />Thanks!</p>
<p>Charles</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70422010-06-16T16:11:31Zjoshuah
<ul></ul><p>I am using a kernel based on 2.6.27.44. I did not see this problem after 50 tries on 2.6.27.31 but that could just mean that it's rarer. It seemed to take less than 20 tries to reproduce it with the later kernel. With the patch I ran curl 500+ times on the problematic page without seeing the problem, so I think the patches fixes the problem, but it has not been thoroughly tested.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70492010-06-29T16:37:23Zpi3orama
<ul></ul><p>I use gentoo stable kernel 2.6.32-gentoo-r7 and lighttpd-1.4.23 (built from source) and also meet this problem:</p>
<p>The cgi script is tiny:<br /><pre>
#!/bin/sh
echo "Content-type: text/html"
echo ""
echo "Plain text cxx!"
</pre></p>
<p>I use ab to test it:</p>
<pre>
$ /usr/sbin/ab -n 1000 -c 20 http://localhost:60002/cgi-bin/test
</pre>
<p>And this is the result:</p>
<pre>
Benchmarking localhost (be patient)
...
...
Completed 800 requests
Completed 900 requests
apr_poll: The timeout specified has expired (70007)
Total of 998 requests completed
</pre>
<p>One or more request always timeout.</p>
<p>Was my problem same as this one?</p>
<p>(PS: I have apply the patch but timeout problem still exists)</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70682010-07-04T12:38:32Zstbuehler
<ul><li><strong>Target version</strong> changed from <i>1.4.x</i> to <i>1.4.27</i></li></ul><p>I couldn't reproduce it myself, but i hope my commit fixes the issue(s).</p>
<p>It would be nice if you could test it, thx!</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=70702010-07-04T12:40:03Zstbuehler
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Fixed</i></li><li><strong>% Done</strong> changed from <i>0</i> to <i>100</i></li></ul><p>Applied in changeset r2737.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=71412010-07-27T12:30:24Zpi3orama
<ul></ul><p>We are the authors of <a href="http://gitorious.org/snitchaser" class="external">Snitchaser</a> . We are trying to find the root cause of this issue for showing the effectiveness of our tool. Snitchaser is a record-replay tool which allow developers to record non-deterministic events during execution, and replay non-deterministic bugs using our replay tool. We want to share our debugging results and experience here.</p>
<p>The root cause of this issue is race condition in mod_cgi.c.</p>
<p>In <code>cgi_demux_response()</code>, lighttpd loop <code>read()</code> from cgi process' pipe to retrieve results. In normal situation, the last <code>read()</code> returns 0, which indicates the finish of data transfer. Sometimes <code>read()</code> returns <code>EAGAIN</code> because of non-blocking IO. In this situation, when the pipe becomes readable or broken (cgi process terminates), the main loop in service.c will get an fdevent (<code>FDEVENT_IN</code> or <code>FDEVENT_HUP</code>), <code>cgi_handle_fdevent()</code> will be called again. If the cgi process terminates, <code>cgi_handle_fdevent()</code> will receive <code>FDEVENT_HUP</code> and the <code>con->file_finished</code> will be set to "1".</p>
<p>However, <code>mod_cgi_handle_subrequest()</code> may be called before the <code>fdevent_poll()</code>. <code>mod_cgi_handle_subrequest()</code> use <code>waitpid()</code> to get the state of cgi process. If the process terminates, it will do the cleanup work and remove pipe's fd from fdevent. Therefore, lighttpd never gets the following <code>FDEVENT_HUP</code> event, so never set <code>con->file_finished</code> to "1". As a result, in the connection state machine, the buggy connection won't get closed correctly, which causes a timeout of an http transaction.</p>
<p>To solve this problem, a small patch is enough:<br /><pre>
--- b/src/mod_cgi.c 2009-04-29 02:29:17.000000000 +0800
+++ a/src/mod_cgi.c 2010-07-27 17:21:19.675916561 +0800
@@ -1348,10 +1348,13 @@
* check if we already got the response
*/
- if (!con->file_started) return HANDLER_WAIT_FOR_EVENT;
+ hctx->pid = 0;
+
+ if (con->file_started) return HANDLER_FINISHED;
if (WIFEXITED(status)) {
/* nothing */
+ return HANDLER_WAIT_FOR_EVENT;
} else {
log_error_write(srv, __FILE__, __LINE__, "s", "cgi died ?");
</pre></p>
<p>This patch doesn't remove pipe from fdevent when cgi process exit normally. Which allows <code>cgi_handle_fdevent()</code> to receive FDEVENT_HUP and do the correct processing.</p>
<p><a href="http://gitorious.org/snitchaser" class="external">Snitchaser</a> shows its power on debugging such non-deterministic issue. We first generate a buggy trace which contain 10 requests and 1 timeout, then replay the execution under gdb control over and over again. The following work is just like traditional cyclic debugging.</p>
<ol>
<li>By tracing <code>accept()</code> and <code>close()</code>, we realize that one of sockets never get closed. This is the direct reason of timeout;</li>
<li>We identify the buggy connection using its address (addresses are same in different replay), and trace its processing, and witness the EAGAIN returned by <code>read()</code>;</li>
<li>Compare with some correct execution, we find that in this situation, <code>cgi_handle_fdevent()</code> should receive an <code>FDEVENT_HUP</code> event later and the <code>con->file_finished</code> will be set to "1";</li>
<li>After some hacking and replay, we realize that the problem is caused by the removal of pipe's fd from fdevent.</li>
<li>After we find the root cause, to fix this issue is straightforward.</li>
</ol>
<p><a href="http://gitorious.org/snitchaser" class="external">Snitchaser</a> has shown its ability of converting non-deterministic bug (Heisenbug) to traditional Bohrbug in this case study. Currently it is under fast development. It is unstable and may have many problems. However, we hope you to try it and report their problems and experience to us to help us improve this tool. We also want you to tell us about some non-deterministic bugs in modern open source software. We will try to fix them under the help of Snitchaser.</p>
<p>Thank you for your reading.</p> Lighttpd - Bug #2217: mod_cgi : Race condition leaving response not forwarded to HTTP client.https://redmine.lighttpd.net/issues/2217?journal_id=71622010-08-05T23:15:56Zstbuehler
<ul></ul><p>Yes, pretty good description. Just have a look at my commit and you will find a very similar diff (i hope i fixed some more problems) :)</p>