[Solved] chunked encoding from a fastcgi application seems broken since 1.4.40

Added by dieter.ro over 7 years ago

We have a web application, written on top of web.py, which we run as a FastCGI app. It lets users download files generated on the fly, and we are using chunked Transfer-Encoding as in the example here: http://webpy.org/cookbook/streaming_large_files
Unfortunately, this no longer works with lighttpd 1.4.41 (and also 1.4.40); it results in complaints about broken chunked encoding (from curl and Chrome) or nothing at all (Firefox). The response sent to the client has the Transfer-Encoding header (not surprising, because it is set by our application), but also a Content-Length, and no chunk length before the response body.
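
(For reference, a correctly framed chunked body prefixes each chunk with its size in hex and ends with a zero-length chunk, e.g.:

  7\r\n
  Mozilla\r\n
  9\r\n
  Developer\r\n
  0\r\n
  \r\n

None of that framing appears in the responses we get.)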

The change where it stopped working is this one:

commit 5863cb57522e46cde5a35784f4ce83459a8561f4
Date:   Sun Jun 19 21:05:25 2016 -0400

    defer choosing "Transfer-Encoding: chunked" 

    defer choosing "Transfer-Encoding: chunked" until response header
    is about to be written

Reverting it (or in our case even only the hunk in mod_fastcgi) makes it work again. I'd file a bug but I'm not sure if maybe we're doing something wrong, given that nobody else complained about this. Any insights are appreciated.


Replies (13)

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

maybe we're doing something wrong

Yes, but let me briefly explain, to put that in context.

lighttpd mod_proxy sends HTTP/1.0 requests to backends. If mod_proxy were to send HTTP/1.1 requests, it would need to be able to handle Transfer-Encoding: chunked, a requirement of HTTP/1.1. Chunked transfer-encoding is part of the HTTP/1.1 protocol; it is not described by the CGI protocol or FastCGI protocol, each of which defines its own way for the web server and backend to communicate.

CGI-like responses (CGI, FastCGI, SCGI, uwsgi, etc.) are parsed by the web server, and web servers do not expect a Transfer-Encoding header from CGI-like backends, unless the backend is configured as an NPH (non-parsed header) script, where the web server supports and has been configured not to parse the response headers from the backend. In Apache, this may be done by prefixing the CGI script name with 'nph-'. Apparently, lighttpd provided NPH-like behavior in versions prior to 1.4.40, but changes in lighttpd 1.4.40 -- made to fix potentially unbounded memory use by lighttpd -- appear to have broken the accidental(?) NPH-like behavior you were relying on.
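
To illustrate the difference: a parsed CGI response carries only CGI header fields, and the server synthesizes the HTTP status line and connection-level headers; an NPH script emits the complete HTTP response itself, hop-by-hop headers included. Schematically:

  # parsed (normal) CGI response -- the server adds the status line
  Status: 200 OK
  Content-Type: application/octet-stream

  ...body...

  # NPH response -- the script speaks raw HTTP itself
  HTTP/1.1 200 OK
  Content-Type: application/octet-stream
  Transfer-Encoding: chunked

  ...chunked body...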

However, if your FastCGI backend must control the Transfer-Encoding (a hop-by-hop connection header) of responses to clients, then your backend would do well to actually speak HTTP/1.1 and either be the web server itself, or be a backend that a web server proxies to over HTTP/1.1. In the latter case, the web server speaking HTTP/1.1 to your backend need not necessarily send chunks back to the client. Again, Transfer-Encoding is hop-by-hop: the web server might decode the chunked response and send it back to an HTTP/1.0 client.

Taking a wider view, please ask yourself what you are attempting to accomplish. You are trying to do something (not described here by you), and the solution you found that seemed to work was to have your backend send Transfer-Encoding: chunked. Why? If your FastCGI backend simply omitted Transfer-Encoding and Content-Length, then the connection to the client would be closed at the end of the response, effectively disabling keep-alive. Perhaps that was what you were trying to avoid?

In any case, I am going to guess that you were attempting to stream a response to the client. Have a look at the lighttpd 1.4.40 directive

  server.stream-response-body = 1

See the commit message 695c8f4e for a description.
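
In brief (paraphrasing; the commit message and wiki have the authoritative description), the accepted values behave roughly as follows:

  # 0 => buffer entire response body before sending to client (default)
  # 1 => stream response body to client as it arrives from the backend
  #      (excess may be buffered, e.g. to temp files, if the client is slow)
  # 2 => stream with minimal buffering, throttling the backend instead
  server.stream-response-body = 1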

tl;dr: I don't think it is in your best interest for your backend to send Transfer-Encoding: chunked unless your backend is speaking HTTP/1.1. There are better, modern libraries you can use to have your backend speak HTTP/1.1 instead of relying on NPH-like behavior from CGI or FastCGI. What are you actually trying to accomplish?

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

CGI spec https://tools.ietf.org/html/rfc3875#section-6.3.4

   The script MUST NOT return any header fields that relate to
   client-side communication issues and could affect the server's
   ability to send the response to the client.

FastCGI already has a protocol (named, unsurprisingly, the FastCGI protocol) for sending responses back to the web server using the structured FCGI_Header with type FCGI_STDOUT. Transfer-Encoding, an HTTP hop-by-hop header, is not something a FastCGI backend should be sending. Ever.
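
For reference, that record header is 8 bytes, defined in the FastCGI spec; response data travels in FCGI_STDOUT records, with framing and length carried by the record header rather than by HTTP headers:

  typedef struct {
      unsigned char version;          /* FCGI_VERSION_1 */
      unsigned char type;             /* FCGI_STDOUT == 6 for response data */
      unsigned char requestIdB1;
      unsigned char requestIdB0;
      unsigned char contentLengthB1;  /* payload length, big-endian */
      unsigned char contentLengthB0;
      unsigned char paddingLength;
      unsigned char reserved;
  } FCGI_Header;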

You might try setting server.protocol-http11 = "disable" in the scope of that FastCGI script to work around the FastCGI script being too clever by half.
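
For completeness, a sketch of scoping that override (the URL prefix is hypothetical; adjust for your setup):

  $HTTP["url"] =~ "^/downloads/" {
      server.protocol-http11 = "disable"
  }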

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by dieter.ro over 7 years ago

Thanks for your elaborate replies. I have followed your advice and changed the lighttpd config to enable stream-response-body, and changed our code to not set the Transfer-Encoding header. (Just so you know, we were not trying to be smart here; we just followed the web.py recipe I linked to in my original post.)

Unfortunately, this doesn't seem to work reliably in our setup. The download often stops prematurely; some testing suggests this happens when the client downloads faster than the back-end application can provide the data. (I can make it work more reliably by downloading with curl --limit-rate.) If the download stops early, messages like the following are logged:

2016-10-25T10:48:50+00:00 daemon[err] lighttpd[5195] (network_write_no_mmap.c.57) file was shrinked: /var/lighttpd/uploads/lighttpd-upload-cX0yej 
2016-10-25T10:48:50+00:00 daemon[err] lighttpd[5195] (connections.c.553) connection closed: write failed on fd 10

Apparently I didn't do a very good job of explaining what we're trying to achieve in my original post, so I'll try again: we have a web application whose front-end lets users download log files, backups, database dumps, etc. These downloads can grow to hundreds of megabytes, and it can take quite some time for the back-end to produce the full data. Obviously we don't want the client to wait until the whole result is ready; we want to start streaming the data as soon as any of it is available. (If this is still unclear, let me know what kind of details I need to provide.)

Thanks and kind regards.

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

Thanks for the details. I do not know why you might be getting the "file was shrinked" (sic) error. Still, in looking into this, I saw an opportunity to optimize the code and elide the fstat() when operating on temp files created by lighttpd. Are you able to test the patch at
https://redmine.lighttpd.net/projects/lighttpd/repository?utf8=%E2%9C%93&rev=b2ab1c8d0eadf3b85bf5a777bc8cb59ae5277cb8 ?
(https://git.lighttpd.net/lighttpd/lighttpd1.4.git on branch personal/gstrauss/master)

I am interested if that works for you.

As an aside, I would not trust http://webpy.org/cookbook/streaming_large_files. It does not check whether the client supports HTTP/1.1 before sending Transfer-Encoding: chunked, and this blanket statement is patently false: "but you need to make sure you add the Transfer-Encoding chunked header for it to display properly. Otherwise the browser will buffer all data before displaying it to you." False.

Most modern browsers will have no trouble displaying a streaming response. Even if some buffer by default, you can use JavaScript and XMLHttpRequest() to process a streaming response and update the page. YMMV by browser type and browser version.

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by dieter.ro over 7 years ago

Interesting… I applied your patch, and the failure mode is now different. Now, when a download breaks off, the log lines are:

2016-10-26T08:44:00+00:00 daemon[err] lighttpd[17109] (network_openssl.c.156) SSL (error): 5 0 0 Success 
2016-10-26T08:44:00+00:00 daemon[err] lighttpd[17109] (connections.c.553) connection closed: write failed on fd 11

Based on this, I tried over plain HTTP and that seems to work every time. It also works without the patch over HTTP.

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

Thanks for the update. My patch will be in the next release of lighttpd (1.4.43) since I think it is an improvement, even if it does not address this issue at hand.

I am still staring at network_openssl.c and the code looks to be doing the right things in a thorough manner, including ERR_clear_error() prior to SSL_write(), and then checking SSL_get_error() and ERR_get_error(). I had a hunch that maybe SSL_write() was being called with 0-length data, but it appears that case should be handled in the code, though that could probably use some verification.
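
For readers following along, the pattern under discussion looks roughly like this (a simplified sketch, not the actual lighttpd code):

  #include <errno.h>
  #include <openssl/err.h>
  #include <openssl/ssl.h>

  /* simplified sketch of the write path being discussed */
  static int tls_write(SSL *ssl, const char *buf, int len) {
      ERR_clear_error();                /* empty the error queue first */
      int r = SSL_write(ssl, buf, len);
      if (r > 0) return r;              /* wrote r bytes */
      switch (SSL_get_error(ssl, r)) {
      case SSL_ERROR_WANT_READ:
      case SSL_ERROR_WANT_WRITE:
          return 0;                     /* not an error; retry later */
      case SSL_ERROR_SYSCALL:
          /* the logged "SSL (error): 5 0 0 Success" is this case
           * (SSL_ERROR_SYSCALL == 5) with an empty error queue
           * (ERR_get_error() == 0) and errno == 0, which is what made
           * me suspect a 0-length SSL_write() */
          return -1;
      default:
          return -1;
      }
  }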

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

@dieter.ro: what are you using for server.event-handler? Are you using "libev"?
If so, please try server.event-handler = "linux-sysepoll" (if on Linux, or anything else other than "libev" on other platforms)

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

(network_openssl.c.156) SSL (error): 5 0 0 Success

This is still somewhat baffling to me. What version of openssl are you using? (If libressl, what version of libressl?)

Would you mind adding the additional patch below, which does not treat this case as a fatal connection error? thanks.

diff --git a/src/network_openssl.c b/src/network_openssl.c
index 9a9138e..23e3dad 100644
--- a/src/network_openssl.c
+++ b/src/network_openssl.c
@@ -153,9 +153,10 @@ int network_write_chunkqueue_openssl(server *srv, connection *con, SSL *ssl, chu
                                        }
                                } else {
                                        /* neither error-queue nor errno ? */
-                                       log_error_write(srv, __FILE__, __LINE__, "sddds", "SSL (error):",
+                                       log_error_write(srv, __FILE__, __LINE__, "sdddds", "SSL (error):", data_len,
                                                        ssl_r, r, errno,
                                                        strerror(errno));
+                                       return 0; /* try again later */
                                }
                                break;

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

@dieter.ro, can I get some pointers on how to reproduce this?

I am unable to reproduce the error, and have tried building lighttpd with ./configure --enable-mmap and without --enable-mmap. My backend FastCGI runs a loop that writes some data, sleeps 1 second, and repeats, while the client is curl, downloading as fast as it can. I am testing on localhost over a TCP socket. I have also tried varying the amount of data written each time through the loop. (Note: I am testing lighttpd with the first patch above, not with the patch directly above, and am using openssl 1.0.2j-fips.)
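
A backend along those lines can be sketched as follows, using the fcgiapp library (illustrative only; sizes and counts are made up, and my actual test harness may differ):

  #include <fcgiapp.h>
  #include <unistd.h>

  /* emit blocks of data with a 1-second pause between them */
  int main(void) {
      FCGX_Request req;
      FCGX_Init();
      FCGX_InitRequest(&req, 0, 0);  /* fd 0 is the socket from the server */
      while (0 == FCGX_Accept_r(&req)) {
          FCGX_FPrintF(req.out, "Content-Type: text/plain\r\n\r\n");
          for (int i = 0; i < 60; ++i) {
              for (int j = 0; j < 1000; ++j)
                  FCGX_FPrintF(req.out, "block %d line %d\n", i, j);
              FCGX_FFlush(req.out);   /* push the block to the server */
              sleep(1);
          }
          FCGX_Finish_r(&req);
      }
      return 0;
  }

(Build with -lfcgi and run it under spawn-fcgi or via lighttpd's "bin-path".)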

Would you describe how much data your backend produces, and at what (approximate) rate?

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by dieter.ro over 7 years ago

Hi,

First, thanks for all your efforts.

I'll try and cook up something to easily reproduce this, but I think the issue may be timing-dependent (see below), so it may not be easy. (Also, I probably won't have time for this in the coming few days.)

Meanwhile, here are some answers and observations.
We're not using libev: we don't set server.event-handler, and libev is not even installed.
We use OpenSSL 1.0.2j. Every time the SSL "error" appears, the logged length is indeed 0.

More interestingly, I have straced both the FastCGI application and lighttpd to try to find out where things fail first. I am almost certain there is something wrong in the way lighttpd manages its temp files. I cannot attach the full strace (it's 11M with xz compression), but a script which summarizes the open, write, seek, read, and close operations from it prints this:

/var/lighttpd/uploads/lighttpd-upload-qvanKL [13]

W[13]: 1025040
S[13]: 0
R[13]: 65536
W[13]: 24574
C[13]: 0

/var/lighttpd/uploads/lighttpd-upload-qvanKL [13]
S[13]: 0
R[13]: 65536
S[13]: 32768
...

The numbers after R and W are the sums of all reads and writes, respectively, that happened without another of these operations in between. Note that the same tempfile is opened twice; this doesn't happen for any other temp file. It is also not unlinked before the close (not visible in the summary, but I checked), which is also atypical. Probably more importantly: the second batch of writes (the 4th line in the output above) happens at a moment when the file offset still points at unread data. Later, the output of my script becomes:

R[13]: 25616
S[13]: 1015808
R[13]: 9232
S[13]: 1025040
S[13]: 1025040
S[13]: 1025040
S[13]: 1025040
S[13]: 1025040
S[13]: 1025040
...

which goes on until I killed it. A full strace fragment of the repetition is this:

24295 10:10:07.767862 setsockopt(10, SOL_TCP, TCP_CORK, [0], 4) = 0
24295 10:10:07.767902 epoll_wait(9, [{EPOLLOUT, {u32=10, u64=10}}], 1025, 1000) = 1
24295 10:10:07.767922 setsockopt(10, SOL_TCP, TCP_CORK, [1], 4) = 0
24295 10:10:07.767938 lseek(13, 1025040, SEEK_SET) = 1025040
24295 10:10:07.767952 read(13, "", 24574) = 0
24295 10:10:07.767990 sendto(8, "<27>Oct 27 10:10:07 lighttpd[24295]: (network_openssl.c.156) SSL (error): 0 5 0 0 Success ", 90, MSG_NOSIGNAL, NULL, 0) = 90
24295 10:10:07.768093 nanosleep({0, 100000000}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal

(I amended your patch with a usleep call.)

The reason I think it's timing-related is that it became much harder to reproduce while stracing.

Thanks and kind regards.

fops.txt.gz (103 KB): strace summary on file ops

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by gstrauss over 7 years ago

Thank you! You've identified the problem. The problem occurs when lighttpd is still writing to a temp file and sends some (but not all) of the temp file to the client. In that case, the position in the temp file is incorrect and the file can become corrupt/truncated on the next write(). lighttpd should seek back to the end of the temp file if it does not finish reading the entire temp file. Alternatively, when writing to a temp file, lighttpd should first seek to the end of the file. Of course, that is what O_APPEND is for.

Here is a patch to replace the patch above (the one that merely continued through that SSL error for 0-length data_len, which occurs due to the improperly truncated file).

--- a/src/chunk.c
+++ b/src/chunk.c
@@ -480,6 +480,9 @@ static chunk *chunkqueue_get_append_tempfile(chunkqueue *cq) {
                return NULL;
        }

+       fd_close_on_exec(fd);
+       fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_APPEND);
+
        c = chunkqueue_get_unused_chunk(cq);
        c->type = FILE_CHUNK;
        c->file.fd = fd;
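
To see why O_APPEND is the right tool: with it set, every write() atomically appends at end-of-file regardless of the current file offset, so the read path can lseek() around freely without later writes landing mid-file. A minimal standalone demonstration (hypothetical file name):

  #include <fcntl.h>
  #include <unistd.h>

  int main(void) {
      /* hypothetical path, for illustration only */
      int fd = open("/tmp/o_append_demo",
                    O_RDWR|O_CREAT|O_TRUNC|O_APPEND, 0600);
      char buf[2];
      write(fd, "aaaa", 4);       /* appended at EOF: file is "aaaa" */
      lseek(fd, 0, SEEK_SET);     /* read path rewinds the shared offset */
      read(fd, buf, sizeof buf);  /* offset is now 2 */
      write(fd, "bbbb", 4);       /* O_APPEND ignores the offset: file is
                                   * "aaaabbbb"; without it, this write
                                   * would clobber bytes 2..5 */
      close(fd);
      return 0;
  }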

RE: chunked encoding from a fastcgi application seems broken since 1.4.40 - Added by dieter.ro over 7 years ago

Great! I can confirm it works perfectly after applying your commit. :)
