Project

General

Profile

Bug #2878

fastcgi and stream-response-body=2 hangs on last chunk

Added by abelbeck 4 months ago. Updated 4 months ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
Category:
core
Target version:
Start date:
2018-03-17
Due date:
% Done:

100%

Estimated time:
Missing in 1.5.x:

Description

Our project has been using lighttpd for many years since version 1.4.29, Thanks!

For the past year we have been using 1.4.45 (server.stream-response-body = 2) with this patch:
https://redmine.lighttpd.net/projects/lighttpd/repository/revisions/e4bb56222fa7c1e315ea7d05cb4ea2f0de781534/diff

We decided it was time to upgrade to 1.4.49 (server.stream-response-body = 2), usually almost any size of fastcgi/PHP download hangs around the last chunk. The only recourse is to reboot the box.

All things equal, server.stream-response-body = 0 and server.stream-response-body = 1 work perfectly, (albeit w/temp files).

We tried 1.4.46-49, all exhibit the same problem, basically since gw_backend.c was created.

Versions:

lighttpd/1.4.49 (ssl)
Note: 1.4.46-48 behave the same way

PHP/5.6.34

Linux/3.16.54 x86_64

Config snippet:

server.stream-response-body = 2

fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket" => "/var/run/php-fastcgi.socket",
      "bin-path" => "/usr/bin/php-cgi",
      "max-procs" => 2
    )
  )
)

PHP snippet:

} else {
  header('Content-Type: application/octet-stream');
  header('Content-Disposition: attachment; filename="'.$tmpfile.'"');
  header('Content-Transfer-Encoding: binary');
  header('Content-Length: '.filesize($prefix.$tmpfile));
  ob_clean();       
  flush();                   
  @readfile($prefix.$tmpfile);
  @unlink($prefix.$tmpfile);
  exit;
}

I have googled, studied the source, tried various config tweaks and have not been able to put my finger on the problem.

Any help would be appreciated.

Lonnie
AstLinux Project

lighttpd-V.txt (648 Bytes) lighttpd-V.txt lighttpd -V output abelbeck, 2018-03-17 18:23

Associated revisions

Revision 210b5770 (diff)
Added by gstrauss 4 months ago

[core] fix rare race condition from backends (fixes #2878)

fix rare race condition from backends with server.stream-response-body=2

(thx abelbeck)

x-ref:
"fastcgi and stream-response-body=2 hangs on last chunk"
https://redmine.lighttpd.net/issues/2878

History

#1

Updated by gstrauss 4 months ago

It looks like your PHP is sending trailers in the response. Is your PHP also sending "Transfer-Encoding: chunked"? Have you tried disabling "Transfer-Encoding: chunked" in PHP and letting lighttpd do the chunking? (If you do, I realize that your Trailer will end up in the response body, so you might temporarily disable those trailers as well during testing.)

Would you please provide some more information about your setup, e.g. lighttpd -f /etc/lighttpd/lighttpd.conf -p ? You seem to imply that this is occurring over an SSL connection. Do you see problems over the SSL connection? Do you see the same problem if not using SSL? While you are correct that lighttp 1.4.46 introduced gw_backend.c to consolidate code across lighttpd backend gateways, lighttpd 1.4.46 also introduced mod_openssl to separate SSL code from the lighttpd core.

(Aside: CGI shouldn't send "Transfer-Encoding: chunked" as that is not part of the CGI spec, though you could argue that NPH CGI should allow it.)

#2

Updated by gstrauss 4 months ago

Another thing to try to help narrow this down: does it work or not if you run the PHP as a CGI script (using mod_cgi) rather than as FastCGI (using mod_fastcgi)?

#3

Updated by abelbeck 4 months ago

gstrauss wrote:

Another thing to try to help narrow this down: does it work or not if you run the PHP as a CGI script (using mod_cgi) rather than as FastCGI (using mod_fastcgi)?

I switched to mod_cgi as you suggested ...

cgi.assign             = ( ".php" => "/usr/bin/php-cgi" )

And that works perfectly with stream-response-body=2 ! albeit slower.

Does that suggest the problem is in mod_fastcgi.c ?

#4

Updated by gstrauss 4 months ago

Does that suggest the problem is in mod_fastcgi.c ?

Not definitively, though there is a comment in the code about backends that use framing (i.e. FastCGI) and server.stream-response-body = 2.

Are you using ssl.read-ahead = "disable"? (Edited) That was added in lighttpd 1.4.45 and you should be using it if you are on an slow embedded system which can not keep up with encrypting and sending the response faster than the response is produced by the backend.

#5

Updated by abelbeck 4 months ago

Back using mod_fastcgi, here is the response debug log when a download is triggered, notice no "chunked" encoding (possibly my "chunk" in the title is a misnomer)

Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.388) -- splitting Request-URI 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.389) Request-URI     :  /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.390) URI-scheme      :  https 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.391) URI-authority   :  pbx4.abelbeck.org 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.392) URI-path (raw)  :  /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.393) URI-path (clean):  /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.394) URI-query       :   
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (mod_access.c.159) -- mod_access_uri_handler called 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.532) -- before doc_root 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.533) Doc-Root     : /var/www 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.534) Rel-Path     : /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.535) Path         :  
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.587) -- after doc_root 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.588) Doc-Root     : /var/www 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.589) Rel-Path     : /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.590) Path         : /var/www/admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.607) -- logical -> physical 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.608) Doc-Root     : /var/www 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.609) Basedir      : /var/www 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.610) Rel-Path     : /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.611) Path         : /var/www/admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.623) -- handling physical path 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.624) Path         : /var/www/admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.631) -- handling subrequest 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.632) Path         : /var/www/admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.633) URI          : /admin/system.php 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (response.c.634) Pathinfo     :  
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (mod_access.c.159) -- mod_access_uri_handler called 
Mar 17 18:15:08 pbx4 daemon.err lighttpd[31657]: (gw_backend.c.2406) handling it in mod_gw 
Mar 17 18:15:09 pbx4 daemon.err lighttpd[31657]: (response.c.122) Response-Header: \nHTTP/1.1 200 OK\r\nX-Powered-By: PHP/5.6.34\r\nContent-Type: application/octet-stream\r\nContent-Disposition: attachment; filename="pbx4-basic-2018-03-17.tar.gz"\r\nContent-Transfer-Encoding: binary\r\nContent-Length: 380692\r\nDate: Sat, 17 Mar 2018 23:15:08 GMT\r\nServer: lighttpd/1.4.49\r\n\r\n 
<<lighttpd hangs, system reboot required>>

As such, 208 KB out of 372 KB got transfered when it hung.

#6

Updated by abelbeck 4 months ago

Are you using ssl.read-ahead = "enable"? That was added in lighttpd 1.4.45 and you should be using it if you are on an slow embedded system which can not keep up with encrypting and sending the response faster than the response is produced by the backend.

I was not, but enabling ssl.read-ahead did not help.

Interesting, here is says to disable ssl.read-ahead which is the default now.
https://redmine.lighttpd.net/projects/lighttpd/wiki/Server_stream-response-bodyDetails

#7

Updated by gstrauss 4 months ago

Yes, sorry. I meant ssl.read-ahead = "disable". I'll edit my comment above.

I am interested in tracking this down, though it may be a little bit more difficult to duplicate on my system, which has a faster CPU.

In the interest of providing a workaround, why are you using server.stream-response-body = 2 to avoid temporary files when you are using PHP, which likely keeps most of the response in memory? If your system has the memory for your PHP, you are probably better off letting lighttpd handle buffering to disk to allow PHP to finish up more quickly, hopefully freeing its memory sooner. If lighttpd is able to send the response quickly enough, then the temporary file will get cleaned up, potentially before even being flushed to disk for the first time. Similarly, if you are on a small embedded system, then your fastcgi.server config should set "max-procs" => "1" and you should probably also set in fastcgi.server

      "bin-environment" => ( 
         "PHP_FCGI_CHILDREN" => "1",
         "PHP_FCGI_MAX_REQUESTS" => "20" ),

in order to free up memory more frequently by restarting the PHP backend.
https://redmine.lighttpd.net/projects/1/wiki/FrequentlyAskedQuestions#How-many-php-CGI-processes-will-lighttpd-spawn

If the PHP is serving files as content, then you might also consider using "X-Sendfile" rather than reading the file into PHP to send to lighttpd. See Docs_ModFastCGI

#8

Updated by abelbeck 4 months ago

I am interested in tracking this down, though it may be a little bit more difficult to duplicate on my system, which has a faster CPU.

We run on a variety of hardware, the current lowest supported is PC Engines ALIX (500 MHz, 256 MB RAM) to a more typical 2-4 core, 1-2 GHz, 1-4 GB RAM.

The boxes tested here with this problem are a 4-core, 1GHz, 4 GB RAM and a 4-core, 1.8 GHz, 2 GB RAM, both x86_64. So not real low end.

In the interest of providing a workaround, why are you using server.stream-response-body = 2 to avoid temporary files when you are using PHP, which likely keeps most of the response in memory? If your system has the memory for your PHP, you are probably better off letting lighttpd handle buffering to disk to allow PHP to finish up more quickly, hopefully freeing its memory sooner.

Typically we are handing a web interface, providing proxy's to Fossil, Monit, Darkstat and Shellinabox and IP Phone provisioning. Typically the data transferred is relatively small. But on occasion 20 GB and larger files may be downloaded via fastcgi. It seems better to stream these large files than try to guess what a /var/tmp tmpfs max size should be. Plus, we have always have done it this way. :-)

If we have to re-think using server.stream-response-body = 2 we can, but it still seems like the best choice for our application.

If the PHP is serving files as content, then you might also consider using "X-Sendfile" rather than reading the file into PHP to send to lighttpd. See Docs_ModFastCGI

I looked into that, but "X-Sendfile" is not a general solution for us without a fair amount of re-writing.

Thanks for all your comments and suggestions so far, much appreciated.

#9

Updated by gstrauss 4 months ago

When you say "<<lighttpd hangs, system reboot required>>", are you unable to restart lighttpd? It might be useful to know what is happening that would result in the entire system hanging. That sounds like a resource issue somewhere. Is lighttpd or PHP using lots of CPU during this time? Is there available free memory? It would be useful to get a few seconds of 'strace' output on the lighttpd process to see what is happening. If lighttpd is spinning at 100% CPU, or on the other hand, lighttpd is simply poll()ing connections once a second, then either of those behaviors would point to a race condition not handled in the state transitions in lighttpd which regulate reading/writing to backend and reading/writing to the client when using server.stream-response-body = 2

#10

Updated by abelbeck 4 months ago

When you say "<<lighttpd hangs, system reboot required>>", are you unable to restart lighttpd? It might be useful to know what is happening that would result in the entire system hanging. That sounds like a resource issue somewhere. Is lighttpd or PHP using lots of CPU during this time? Is there available free memory?

lighttpd is using 100% of a core, memory is reasonable, and a reboot comes back clean, no collateral damage.

  1  [|                                                   0.9%]     Tasks: 45, 70 thr; 2 running
  2  [                                                    0.0%]     Load average: 0.90 0.37 0.14 
  3  [|                                                   0.5%]     Uptime: 03:19:41
  4  [||||||||||||||||||||||||||||||||||||||||||||||||||100.0%]
  Mem[|||||                                         110/3880MB]
  Swp[                                                   0/0MB]

CPU   PID USER      PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command                                                          
  4  2575 root       20   0 47280  3636  1896 R 99.8  0.1  2:08.03 lighttpd -f /etc/lighttpd.conf                                   

The PHP processes were idling ...

  3  2612 root       20   0 45544  5288  3112 S  0.0  0.1  0:00.03 /usr/bin/php
  4  2614 root       20   0 45544  5288  3112 S  0.0  0.1  0:00.03 /usr/bin/php
  3  2615 root       20   0 45544  2492   208 S  0.0  0.1  0:00.00 /usr/bin/php
  4  2617 root       20   0 46100  5224  2192 S  0.0  0.1  0:00.18 /usr/bin/php

lighttpd can't be restarted since the previous listen socket is busy.

Starting lighttpd...
2018-03-17 21:59:40: (network.c.313) can't bind to socket: 0.0.0.0:80 Address already in use 

I performed an strace of the 100% CPU lighttpd process via htop ...

ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0
ioctl(15, FIONREAD, [104827])           = 0

the ioctl's filled the screen in a blink of an eye.

#11

Updated by abelbeck 4 months ago

Here is a strace data just before it hangs ...

ioctl(15, FIONREAD, [70076])            = 0
read(15, "\245$\34\362\371\254\360\222\246\34\354\f\342_\361\331\213+j\342\331G\222~:\237\276\"\220\201\245\324"..., 16119) = 16119
ioctl(15, FIONREAD, [53957])            = 0
read(15, "\274Y6J\2038\16\2415\230j\237\275\324\304S\346\3\225\vws\236Y\375u\272\0\27\302\263\217"..., 16119) = 16119
ioctl(15, FIONREAD, [37838])            = 0
read(15, "\370a\244T\2720\226K\244\22)\371S\236\302\22\3206-_`\373\4\37\311\277\355BFp\231\220"..., 16119) = 16119
epoll_ctl(9, EPOLL_CTL_MOD, 15, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=15, u64=15}}) = 0
epoll_wait(9, [{EPOLLOUT, {u32=14, u64=14}}, {EPOLLIN, {u32=15, u64=15}}], 1025, 1000) = 2
write(14, "\27\3\3@\30\5+A\232\27\23\301\274V\340P\3Vl\271\246\221)\214\"+\305o`\204J-"..., 16413) = 16413
ioctl(15, FIONREAD, [169303])           = 0
epoll_ctl(9, EPOLL_CTL_MOD, 15, {EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=15, u64=15}}) = 0
epoll_wait(9, [{EPOLLOUT, {u32=14, u64=14}}, {EPOLLHUP|EPOLLRDHUP, {u32=15, u64=15}}], 1025, 1000) = 2
write(14, "\27\3\3@\30\5+A\232\27\23\301\2752e\235]\344\305$I\333j\2716\376Q\254\356\2212\207"..., 16413) = 16413
ioctl(15, FIONREAD, [169335])           = 0
ioctl(15, FIONREAD, [169335])           = 0
ioctl(15, FIONREAD, [169335])           = 0

#12

Updated by abelbeck 4 months ago

For completeness, I disabled optimizations with -O0 at the end of CFLAGS, same problem.

BTW, the stripped lighttpd binary size went from 233056 bytes to 345200 bytes compiled with -O0

#13

Updated by gstrauss 4 months ago

Thank you. That's useful. With that, I should be able to track this down tonight or tomorrow. (It's getting late here.)

#14

Updated by gstrauss 4 months ago

Looks like things got stuck in a loop attempting to make progress on packets already read into mod_fastcgi buffers. Please try this:

--- a/src/http-header-glue.c
+++ b/src/http-header-glue.c
@@ -1214,7 +1214,11 @@ handler_t http_response_read(server *srv, connection *con, http_response_opts *o
                   : HANDLER_GO_ON;    /* optimistic read; data not ready */
               #else
                 if (!(fdevent_event_get_interest(srv->ev, fd) & FDEVENT_IN))
-                    return HANDLER_GO_ON; /* optimistic read; data not ready */
+                    return NULL != opts->parse
+                        && (con->conf.stream_response_body
+                            & FDEVENT_STREAM_RESPONSE_BUFMIN)
+                      ? opts->parse(srv, con, opts, b, ~(size_t)0u)
+                      : HANDLER_GO_ON; /* optimistic read; data not ready */
                 toread = 4096; /* let read() below indicate if EOF or EAGAIN */
               #endif
             }

I am not yet convinced that this is the right fix, but if it appears to work for you, then at least we'll know that we're looking in the right area.

#15

Updated by gstrauss 4 months ago

On second thought, if you attach a debugger to the running process (which will be tough if the system is unresponsive) and type 'where', I think you'll find things spinning in a loop in gw_backend.c:gw_handle_fdevent() around line 2171. I am still tracing the code to see how we might get here even if the response from the backend was already finished. It is possible that there are sets of multiple events received at the same time and that the code should check if con->file_finished is already set. If that is the case, then this might do it:

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -2167,6 +2167,7 @@ static handler_t gw_handle_fdevent(server *srv, void *ctx, int revents) {
              * since event loop will spin on fd FDEVENT_HUP event
              * until unregistered. */
             handler_t rc;
+            if (con->file_finished) return HANDLER_FINISHED;
             do {
                 rc = gw_recv_response(srv,hctx); /*(might invalidate hctx)*/
             } while (rc == HANDLER_GO_ON);       /*(unless HANDLER_GO_ON)*/

#16

Updated by abelbeck 4 months ago

Sadly, neither patch had any effect on the issue.

#17

Updated by abelbeck 4 months ago

Bingo ! This hack/patch solves the problem !

--- lighttpd-1.4.49/src/gw_backend.c.orig    2018-03-18 07:57:42.108035259 -0500
+++ lighttpd-1.4.49/src/gw_backend.c    2018-03-18 07:59:30.343341030 -0500
@@ -2166,7 +2166,7 @@
             handler_t rc;
             do {
                 rc = gw_recv_response(srv,hctx); /*(might invalidate hctx)*/
-            } while (rc == HANDLER_GO_ON);       /*(unless HANDLER_GO_ON)*/
+            } while (0);
             return rc; /* HANDLER_FINISHED or HANDLER_ERROR */
         } else {
             gw_proc *proc = hctx->proc;

Tested on both test boxes.

Possibly the /*(might invalidate hctx)*/ comment is related :-)

#18

Updated by gstrauss 4 months ago

Progress! That's the loop that it where things are getting stuck for you.

commit e4bb5622 for #2796 is for an issue reported with server.stream-response-body = 2. Later, this code was moved to a new routine, but the return value was not updated appropriately. That was done in commit 748fc860 and exposed this condition. It appears to be specific to mod_fastcgi.c and server.stream-response-body = 2, as you have found.

The fix is probably something that should be done in mod_fastcgi.c, not your disabling of that loop. Perhaps mod_fastcgi needs to track that HANDLER_FINISHED had previously been returned from this routine, and then avoid the trace and return HANDLER_FINISHED instead of HANDLER_GO_ON at the spot in commit 748fc860.

As an aside:

lighttpd can't be restarted since the previous listen socket is busy.

Starting lighttpd...
2018-03-17 21:59:40: (network.c.313) can't bind to socket: 0.0.0.0:80 Address already in use 

How are you attempting to kill lighttpd? kill -9? Try kill -QUIT instead.
#19

Updated by abelbeck 4 months ago

Your explanation makes perfect sense. I'll await your production patch. Much Thanks !

As an aside:

Yes, using kill -QUIT does stop the 100% CPU process an allow it to be restarted.

#20

Updated by gstrauss 4 months ago

Slightly longer patch than I'd like, but I expect this to handle your use case. With this patch, if you start getting "unexpected end-of-file (perhaps the fastcgi process died):", then your backend PHP is not properly ending the FastCGI request.

I still want to reproduce this myself to better understand the condition(s) under which this might occur.

diff --git a/src/fdevent.h b/src/fdevent.h
index 44dbbb9d..32a45abe 100644
--- a/src/fdevent.h
+++ b/src/fdevent.h
@@ -25,8 +25,9 @@ typedef handler_t (*fdevent_handler)(struct server *srv, void *ctx, int revents)
 #define FDEVENT_STREAM_REQUEST_BUFMIN   BV(1)
 #define FDEVENT_STREAM_REQUEST_POLLIN   BV(15)

-#define FDEVENT_STREAM_RESPONSE         BV(0)
-#define FDEVENT_STREAM_RESPONSE_BUFMIN  BV(1)
+#define FDEVENT_STREAM_RESPONSE           BV(0)
+#define FDEVENT_STREAM_RESPONSE_BUFMIN    BV(1)
+#define FDEVENT_STREAM_RESPONSE_POLLRDHUP BV(15)

 int fdevent_config(server *srv);
 const char * fdevent_show_event_handlers(void);
diff --git a/src/gw_backend.c b/src/gw_backend.c
index d681cf9a..1abe97bc 100644
--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -2167,9 +2167,13 @@ static handler_t gw_handle_fdevent(server *srv, void *ctx, int revents) {
              * since event loop will spin on fd FDEVENT_HUP event
              * until unregistered. */
             handler_t rc;
+            const unsigned short flags = con->conf.stream_response_body;
+            con->conf.stream_response_body &= ~FDEVENT_STREAM_RESPONSE_BUFMIN;
+            con->conf.stream_response_body |= FDEVENT_STREAM_RESPONSE_POLLRDHUP;
             do {
                 rc = gw_recv_response(srv,hctx); /*(might invalidate hctx)*/
             } while (rc == HANDLER_GO_ON);       /*(unless HANDLER_GO_ON)*/
+            con->conf.stream_response_body = flags;
             return rc; /* HANDLER_FINISHED or HANDLER_ERROR */
         } else {
             gw_proc *proc = hctx->proc;
diff --git a/src/http-header-glue.c b/src/http-header-glue.c
index 0e448090..6135c1da 100644
--- a/src/http-header-glue.c
+++ b/src/http-header-glue.c
@@ -1213,8 +1213,11 @@ handler_t http_response_read(server *srv, connection *con, http_response_opts *o
                   ? HANDLER_FINISHED  /* read finished */
                   : HANDLER_GO_ON;    /* optimistic read; data not ready */
               #else
-                if (!(fdevent_event_get_interest(srv->ev, fd) & FDEVENT_IN))
-                    return HANDLER_GO_ON; /* optimistic read; data not ready */
+                if (!(fdevent_event_get_interest(srv->ev, fd) & FDEVENT_IN)) {
+                    if (!(con->conf.stream_response_body
+                          & FDEVENT_STREAM_RESPONSE_POLLRDHUP))
+                        return HANDLER_GO_ON;/*optimistic read; data not ready*/
+                }
                 toread = 4096; /* let read() below indicate if EOF or EAGAIN */
               #endif
             }
diff --git a/src/mod_cgi.c b/src/mod_cgi.c
index dd788308..047b3d83 100644
--- a/src/mod_cgi.c
+++ b/src/mod_cgi.c
@@ -425,9 +425,13 @@ static handler_t cgi_handle_fdevent(server *srv, void *ctx, int revents) {
                         * since event loop will spin on fd FDEVENT_HUP event
                         * until unregistered. */
                        handler_t rc;
+                       const unsigned short flags = con->conf.stream_response_body;
+                       con->conf.stream_response_body &= ~FDEVENT_STREAM_RESPONSE_BUFMIN;
+                       con->conf.stream_response_body |= FDEVENT_STREAM_RESPONSE_POLLRDHUP;
                        do {
                                rc = cgi_recv_response(srv,hctx);/*(might invalidate hctx)*/
                        } while (rc == HANDLER_GO_ON);           /*(unless HANDLER_GO_ON)*/
+                       con->conf.stream_response_body = flags;
                        return rc; /* HANDLER_FINISHED or HANDLER_COMEBACK or HANDLER_ERROR */
                } else if (!buffer_string_is_empty(hctx->response)) {
                        /* unfinished header package which is a body in reality */
diff --git a/src/mod_fastcgi.c b/src/mod_fastcgi.c
index 4ad4f83d..b442214e 100644
--- a/src/mod_fastcgi.c
+++ b/src/mod_fastcgi.c
@@ -368,7 +368,10 @@ static handler_t fcgi_recv_parse(server *srv, connection *con, struct http_respo
        int fin = 0;

        if (0 == n) {
-               if (!(fdevent_event_get_interest(srv->ev, hctx->fd) & FDEVENT_IN)) return HANDLER_GO_ON;
+               if (-1 == hctx->request_id) return HANDLER_FINISHED; /*(flag request ended)*/
+               if (!(fdevent_event_get_interest(srv->ev, hctx->fd) & FDEVENT_IN)
+                   && !(con->conf.stream_response_body & FDEVENT_STREAM_RESPONSE_POLLRDHUP))
+                       return HANDLER_GO_ON;
                log_error_write(srv, __FILE__, __LINE__, "ssdsb",
                                "unexpected end-of-file (perhaps the fastcgi process died):",
                                "pid:", hctx->proc->pid,
@@ -436,6 +439,7 @@ static handler_t fcgi_recv_parse(server *srv, connection *con, struct http_respo

                        break;
                case FCGI_END_REQUEST:
+                       hctx->request_id = -1; /*(flag request ended)*/
                        fin = 1;
                        break;
                default:
#21

Updated by gstrauss 4 months ago

abelbeck, I have been unable to reproduce what you're seeing. Any more hints? What version of PHP? Would you consider providing your full config, with anything private 'XXXXX' out to be redacted?

I am curious how widespread this issue actually is. You note that this behavior has been present since lighttpd 1.4.46, which was released almost 5 months ago, and you're the first person reporting this behavior. (lighttpd shouldn't spin like that, so, yes this is a bug that needs to be fixed.)

Looking at the minimal PHP code you've shown, I question the use of ob_clean() and flush(). Why are they needed? (You should flush output buffer to $tmpfile before reading the file back in, and close the output file.) Additionally, you should probably use X-Sendfile to send the tempfile. Once lighttpd opens the file, which will be shortly after you send the headers from your PHP, it does not matter if you unlink the file, so you could have a background job which checks for files over a minute old (or even a few seconds old) in your temp directory and unlinks them. Or you could just sleep a few seconds in the PHP before unlink() and exit.

#22

Updated by abelbeck 4 months ago

Your latest patch works perfectly !

I tested a fair amount on two boxes, trying to make it fail for fastcgi downloads. With your latest patch I say this bug is fixed !

#23

Updated by abelbeck 4 months ago

gstrauss wrote:

abelbeck, I have been unable to reproduce what you're seeing. Any more hints? What version of PHP?

We use PHP/5.6.34 .

Our kernel is configured using PREEMPT, that may be unique to your testing ...

# uname -a
Linux pbx4 3.16.54-astlinux #1 SMP PREEMPT Thu Mar 15 19:10:18 CDT 2018 x86_64 GNU/Linux

I question the use of ob_clean() and flush().

At one time they were needed. IIRC stray PHP generated characters could be inserted before the download, ideally there should not be any, but they are easy to add accidentally.

Since this requires server.stream-response-body = 2 that greatly reduces the number of people that may run across this.

#24

Updated by abelbeck 4 months ago

Also interesting fact is Lighttpd 1.4.45's mod_fastcgi.c had the same } while (rc == HANDLER_GO_ON); loop, but that never gave us any problem for over a year.

Possibly we just need to be happy this edge-condition was discovered and now fixed.

It was truly a pleasure working with you "gstrauss".

#25

Updated by gstrauss 4 months ago

  • Category changed from mod_fastcgi to core
  • Status changed from New to Patch Pending
  • Target version changed from 1.4.x to 1.4.50

Likewise. :) Thank you for your technical assistance in tracking this down and testing patches.

#26

Updated by gstrauss 4 months ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100

Also available in: Atom