Project

General

Profile

Bug #2827

POST to mod_cgi sometimes hangs

Added by davidm 8 days ago. Updated 4 days ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
Category:
core
Target version:
Start date:
2017-10-12
Due date:
% Done:

100%

Estimated time:
Missing in 1.5.x:

Description

We have been seeing cases where lighttpd 1.4.45 would occasionally hang after POSTing all bytes to a CGI binary. Specifically, what happens is that cgi_write_request() would correctly detect that cq->bytes_out == con->request.content_length and as a result it would call cgi_connection_close_fdtocgi(). This in turn calls cgi_connection_close_fdtocgi() to schedule closing of the file descriptor. So far so good.

The problem is that fedevent_sched_run() then never gets called, so the file descriptor never actually gets closed and the CGI binary never sees an EOF. I think the reason this is happening is that the final call to cgi_write_request() was triggered through connection_state_machine() (after the fdevent_poll() loop). In this case, if there are no further HTTP requests, fdevent_poll() will never return a pending event and thus will never run fdevent_sched_run().

We have:

server.stream-request-body = 2
server.stream-response-body = 2

in case that matters.

Associated revisions

Revision 8ed588ce (diff)
Added by gstrauss 8 days ago

[core] handle fds pending close after poll timeout (fixes #2827)

handle fds pending close whether or not new events are triggered

(thx davidm)

x-ref:
"POST to mod_cgi sometimes hangs"
https://redmine.lighttpd.net/issues/2827

Revision 7f82ddab (diff)
Added by gstrauss 3 days ago

[core] remove fdevent_sched_run from fdevent_libev (#2827)

remove fdevent_sched_run from fdevent_libev.c
(redundant since commit 8ed588ce)

x-ref:
"POST to mod_cgi sometimes hangs"
https://redmine.lighttpd.net/issues/2827

History

#1 Updated by gstrauss 8 days ago

  • Status changed from New to Patch Pending
  • Target version changed from 1.4.x to 1.4.46

Technically speaking, CGI should check (and validate) CONTENT_LENGTH before reading input, and should read CONTENT_LENGTH of input, instead of reading until EOF.

Still, you have quite thoroughly described an edge case in lighttpd which should be fixed. The easiest fix is to move the call to fdevent_sched_run() to be outside that conditional block. lighttpd will still wait 1 second if no new events arrive, but will then proceed as expected.

While I could add some logic so that lighttpd polls for events immediately (instant timeout) if there are pending fds to be closed, closes the fds pending close, and then polls for a longer timeout, I think the simple fix is sufficient, and if you do not want to wait 1 second for your CGI to continue, then you should follow the CGI spec and use CONTENT_LENGTH.

https://tools.ietf.org/html/rfc3875

4.2.  Request Message-Body
   [...]
   A request-body is supplied with the request if the CONTENT_LENGTH is
   not NULL.  The server MUST make at least that many bytes available
   for the script to read.  The server MAY signal an end-of-file
   condition after CONTENT_LENGTH bytes have been read or it MAY supply
   extension data.  Therefore, the script MUST NOT attempt to read more
   than CONTENT_LENGTH bytes, even if more data is available.  However,
   it is not obliged to read any of the data.

#2 Updated by gstrauss 8 days ago

--- a/src/server.c
+++ b/src/server.c
@@ -2085,13 +2085,14 @@ static int server_main (server * const srv, int argc, char **argv) {
                                        (*handler)(srv, context, revents);
                                }
                        } while (--n > 0);
-                       fdevent_sched_run(srv, srv->ev);
                } else if (n < 0 && errno != EINTR) {
                        log_error_write(srv, __FILE__, __LINE__, "ss",
                                        "fdevent_poll failed:",
                                        strerror(errno));
                }

+               if (n >= 0) fdevent_sched_run(srv, srv->ev);
+
                for (ndx = 0; ndx < srv->joblist->used; ndx++) {
                        connection *con = srv->joblist->ptr[ndx];
                        connection_state_machine(srv, con);

#3 Updated by davidm 8 days ago

Wouldn't it be better to do fdevent_sched_run() unconditionally? Otherwise, if epoll_wait() gets interrupted by a signal, we wouldn't call fdevent_sched_run() even though there might be pending work from the connection_state_machine(), right?

#4 Updated by gstrauss 6 days ago

Wouldn't it be better to do fdevent_sched_run() unconditionally?

Not if there are things intended to be sent to the kernel which were interrupted before being sent to the kernel. (This might not be entirely relevant in this call to poll, but is relevant in other event loops I have written elsewhere.)

Otherwise, if epoll_wait() gets interrupted by a signal, we wouldn't call fdevent_sched_run() even though there might be pending work from the connection_state_machine(), right?

This code is in a loop. If interrupted, it will be tried again on the next loop. On any busy system, there should be actual fd events ready before interrupts, and on mostly idle systems, there is unlikely to be a large number of syscall interrupts (without fd events) which might interrupt the loop many times before it completes at least once (with either an event or a 1 second timeout without any event).

I have already pointed out that the root cause of your problem is the insecure CGI script which does not follow the CGI standard and might read an arbitrarily large amount of data.

#5 Updated by gstrauss 5 days ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100

#6 Updated by stbuehler 4 days ago

Why not just set the timeout to 1 (ms) when there are close requests pending (e.g. in fdevent_poll)? Also why is fdevent_sched_run called in fdevent_libev_poll ?

#7 Updated by gstrauss 4 days ago

Also why is fdevent_sched_run called in fdevent_libev_poll ?

See eb37615a

Also available in: Atom