Bug #3111
closedHTTP/2 requests sometimes take very long (missing last chunk)
Description
Using the current lighttpd version 1.4.60 I see sometimes HTTP/2 requests take very long:
- dynamic request to a backend connected with mod_proxy on a unix domain socket
- curl shows, that the response is received fast, but partially
- always the same last chunk is missing (byte exact)
- curl hangs waiting the for the last data
- after some time curl shows the following message:
* TLSv1.3 (IN), TLS alert, close notify (256): * Closing connection 0 * TLSv1.3 (OUT), TLS alert, close notify (256):
- if I press Ctrl+C to terminate prematurely, the log message in the access log shows the length of a full download
- the same url may succeed or fail in subsequent tests, the log message is always the same: return code 200, same length
Only some urls are affected, but I think the problem is not the URL but certain response sizes.
Forcing an HTTP/1.1 request with curl always succeeds, no failure in 10 attempts.
I can't say for sure if the change to version 1.4.60 is the cause or if the problem has existed for a longer period of time.
Updated by gstrauss about 3 years ago
might be related to the (also recently submitted) https://redmine.lighttpd.net/boards/2/topics/10063
Updated by gstrauss about 3 years ago
Do you have a non-TLS port on which lighttpd is listening? Does the issue occur with curl --http-prior-knowledge
sent to the non-TLS (clear) port? I am trying to narrow down if this might possibly related to HTTP/2 + TLS buffering.
Updated by flynn about 3 years ago
For tests I enabled a non-TLS port and tested with curl --http2-prior-knowledge
: no failure in 10 attempts.
Exact at the byte position, where the TLS connection sometimes hangs, curl in verbose mode reports
Connection #0 to host xxx left intact
and continues.
Updated by flynn about 3 years ago
According to my observations the content length of the response should be a little below server.chunkqueue-chunk-sz
, so that the whole response with headers is a slightly larger than server.chunkqueue-chunk-sz
.
Updated by gstrauss about 3 years ago
Thanks. I'll have more time to dig into this later this evening.
Updated by gstrauss about 3 years ago
I have not been able to reproduce this with a trivial lighttpd backend listening on a unix domain socket, but am still trying.
How is the backend responding? HTTP/1.0 or HTTP/1.1? Is Content-Length
provided? Is it using Transfer-Encoding: chunked
? (You might try proxy.header += ("force-http10" => "enable")
)
Does the backend send the response in full, or are there pauses between writes?
Is server.stream-response-body non-zero? (I am assuming that response streaming is enabled.)
Thinking aloud:
Default server.chunkqueue-chunk-sz
is 8k, and lighttpd tries to send data to TLS modules in 16k blocks, so a HTTP/2 frames for response headers and response body should fit into a single 16k block, and should be sent in a single TLS record (if all data has been received from backend)
If you have alternate TLS modules installed, you might test with lighttpd mod_gnutls to see if that makes a difference.
What version of openssl libs are you using? 1.1.1l or 3.0.0?
Updated by gstrauss about 3 years ago
I found one issue, but there might be more.
--- a/src/http-header-glue.c +++ b/src/http-header-glue.c @@ -1329,6 +1329,7 @@ handler_t http_response_read(request_st * const r, http_response_opts * const op /* accumulate response in b until headers completed (or error)*/ if (r->resp_body_started) { buffer_clear(b); + if (0 == r->resp_body_scratchpad) r->resp_body_finished = 1; /* enable simple accumulation of small reads in some situations * no Content-Length (will read to EOF) * Content-Length (will read until r->resp_body_scratchpad == 0)
lighttpd 1.4.60 schedules together the processing of front-end and backend-end events for requests on the same client connection. This reduces duplicative checks and improves buffer reuse, but may have missed detecting if the response body was already fully received along with the headers when the backend provided
Content-Length
with the response (instead of Transfer-Encoding: chunked
or neither).Updated by gstrauss about 3 years ago
A few patches have been pushed to my dev branch.
Updated by gstrauss about 3 years ago
So far, I have been unable to reproduce the issue with response body at varying amounts less than, but close to, 8k.
Would you share some trace with debug.log-state-handling = "enable"
for the hanging stream id?
if I press Ctrl+C to terminate prematurely, the log message in the access log shows the length of a full download
That suggests that the front-end lighttpd has read the data from the backend and has applied HTTP/2 framing when transferring the data from the request write queue to the connection write queue. Does the response have Content-Length
? Does curl report that it was missing bytes? Or perhaps there is some reason that lighttpd is not sending HTTP/2 END_STREAM frame after lighttpd finishes sending the response body data?
Updated by flynn about 3 years ago
- openssl version is from Debian buster: 1.1.1d
- the Debian package currently does not include the gnutls module, I added this myself, test results re reported later
- using the nss TLS module the problem occurs too, but curl now reports immediately:
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
- the responses all have a
Content-Length
- usually it takes some time to trigger the problem, after a restart usually all is fine. Also the affected urls may change after restart ...
I'll test your new patches now.
Updated by flynn about 3 years ago
Share some trace with debug.log-state-handling = "enable" on a high volume server is not so easy, I cannot filter out the request in question.
Any ideas, how to limit thee debug messges to a certian request (type). As far as I know this option is always global, I cannot limit the scope with $HTTP[...]...
- (mod_nss.c.2197) NSS: mod_nss_read_err: (PR_END_OF_FILE_ERROR) Encountered end of file
Updated by gstrauss about 3 years ago
Any ideas, how to limit thee debug messges to a certian request (type). As far as I know this option is always global, ...
Most of the debug.*
options are not global. Unfortunately, debug.log-state-handling
and debug.log-request-header-on-error
are global, since they are connection-level, or may occur before the request has been parsed (and is valid).
I'll put together a patch (below) to enable debug.log-state-handling
after request headers have been read. (It does not make as much sense to do so for debug.log-request-header-on-error
)
Aside: lighttpd mod_gnutls will be included in Debian in a future version of lighttpd. It was not included with lighttpd 1.4.56 (released Nov 2020) since at the time, gnutls libraries in Debian were very old and did not have support for TLS v1.3. gnutls 3.6.5 with support for TLS v1.3 was released Dec 2018, almost 2 years prior to lighttpd 1.4.56, but Debian still had an older version of gnutls at the time. (Debian updated to use newer gnutls libraries in Jan 2021)
Separate issue:
(mod_nss.c.2197) NSS: mod_nss_read_err: (PR_END_OF_FILE_ERROR) Encountered end of file
I guess I should add PR_END_OF_FILE_ERROR
to those ignored unless debug.log-ssl-noise = "enable"
Quick patch to allow debug.log-state-handling
to be toggled after request has started:
--- a/src/configfile.c +++ b/src/configfile.c @@ -192,6 +192,9 @@ static void config_merge_config_cpv(request_config * const pconf, const config_p case 32:/* server.breakagelog */ if (cpv->vtype == T_CONFIG_LOCAL) pconf->serrh = cpv->v.v; break; + case 33:/* debug.log-state-handling */ + pconf->log_state_handling = (0 != cpv->v.u); + break; default:/* should not happen */ return; } @@ -717,7 +720,7 @@ static int config_insert_srvconf(server *srv) { T_CONFIG_SCOPE_SERVER } ,{ CONST_STR_LEN("debug.log-state-handling"), T_CONFIG_BOOL, - T_CONFIG_SCOPE_SERVER } + T_CONFIG_SCOPE_CONNECTION } ,{ CONST_STR_LEN("server.feature-flags"), T_CONFIG_ARRAY_KVANY, T_CONFIG_SCOPE_SERVER } @@ -1005,6 +1008,9 @@ static int config_insert(server *srv) { ,{ CONST_STR_LEN("server.breakagelog"), T_CONFIG_STRING, T_CONFIG_SCOPE_CONNECTION } + ,{ CONST_STR_LEN("debug.log-state-handling"), + T_CONFIG_BOOL, + T_CONFIG_SCOPE_CONNECTION } ,{ NULL, 0, T_CONFIG_UNSET, T_CONFIG_SCOPE_UNSET }
Updated by flynn about 3 years ago
The initial tests with the two patches in your personal repo on this ticket look promising, but I want to test again tomorrow after some runtime.
Updated by flynn about 3 years ago
Until now I cannot reproduce this error anymore, the patches https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/a9a3f2f0b97766b7a84ce19fe7ff88383a3f3a97 and https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/71798d46f9c8d1ba3c71c4867edaac52c88f7d20 solves this issue for me.
Updated by gstrauss about 3 years ago
Thank you for testing and for following up. lighttpd 1.4.61 (including these fixes) is slated to be released later this week.
Updated by gstrauss about 3 years ago
- Status changed from Patch Pending to Fixed
Applied in changeset 0757d71e1444b7bde1426de38dbdba123514ee82.
Updated by gstrauss about 3 years ago
There is another bug in lighttpd HTTP/2 support where HTTP/2 END_STREAM might not be sent if the entire response has been sent to the client, and then some time later the backend ends the response (or closes the pipe in the case of CGI). This bug is present in lighttpd 1.4.59 (and possible earlier), but may occur slightly more frequently in lighttpd 1.4.60 due to lighttpd 1.4.60 scheduling together the processing of front-end and backend-end events for requests on the same client connection, reducing the number of calls to the backend processing code.
https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/90f062cb07d362e346792964cf76aac65120d375
Updated by flynn about 3 years ago
We had an issue with an etherpad backend, this patch solves the issue, the etherpad backend has a reproducable response timing now.
Also available in: Atom