Bug #3111: HTTP/2 requests sometimes take very long (missing last chunk) - Lighttpd - lighty labs

Actions

Copy link

Bug #3111

closed

HTTP/2 requests sometimes take very long (missing last chunk)

Added by flynn over 3 years ago. Updated over 3 years ago.

Status:

Fixed

Priority:

Normal

Category:

core

Target version:

1.4.61

ASK QUESTIONS IN Forums:

Description

Using the current lighttpd version 1.4.60 I see sometimes HTTP/2 requests take very long:

dynamic request to a backend connected with mod_proxy on a unix domain socket
curl shows, that the response is received fast, but partially
always the same last chunk is missing (byte exact)
curl hangs waiting the for the last data

after some time curl shows the following message:

* TLSv1.3 (IN), TLS alert, close notify (256):
* Closing connection 0
* TLSv1.3 (OUT), TLS alert, close notify (256):

if I press Ctrl+C to terminate prematurely, the log message in the access log shows the length of a full download
the same url may succeed or fail in subsequent tests, the log message is always the same: return code 200, same length

Only some urls are affected, but I think the problem is not the URL but certain response sizes.

Forcing an HTTP/1.1 request with curl always succeeds, no failure in 10 attempts.

I can't say for sure if the change to version 1.4.60 is the cause or if the problem has existed for a longer period of time.

Actions

Copy link

Updated by gstrauss over 3 years ago

might be related to the (also recently submitted) https://redmine.lighttpd.net/boards/2/topics/10063

Actions

Copy link

Updated by gstrauss over 3 years ago

Do you have a non-TLS port on which lighttpd is listening? Does the issue occur with curl --http-prior-knowledge sent to the non-TLS (clear) port? I am trying to narrow down if this might possibly related to HTTP/2 + TLS buffering.

Actions

Copy link

Updated by flynn over 3 years ago

For tests I enabled a non-TLS port and tested with curl --http2-prior-knowledge: no failure in 10 attempts.

Exact at the byte position, where the TLS connection sometimes hangs, curl in verbose mode reports

Connection #0 to host xxx left intact

and continues.

Actions

Copy link

Updated by flynn over 3 years ago

According to my observations the content length of the response should be a little below server.chunkqueue-chunk-sz, so that the whole response with headers is a slightly larger than server.chunkqueue-chunk-sz.

Actions

Copy link

Updated by gstrauss over 3 years ago

Thanks. I'll have more time to dig into this later this evening.

Actions

Copy link

Updated by gstrauss over 3 years ago

I have not been able to reproduce this with a trivial lighttpd backend listening on a unix domain socket, but am still trying.

How is the backend responding? HTTP/1.0 or HTTP/1.1? Is Content-Length provided? Is it using Transfer-Encoding: chunked? (You might try proxy.header += ("force-http10" => "enable"))
Does the backend send the response in full, or are there pauses between writes?

Is server.stream-response-body non-zero? (I am assuming that response streaming is enabled.)

Thinking aloud:

Default server.chunkqueue-chunk-sz is 8k, and lighttpd tries to send data to TLS modules in 16k blocks, so a HTTP/2 frames for response headers and response body should fit into a single 16k block, and should be sent in a single TLS record (if all data has been received from backend)

If you have alternate TLS modules installed, you might test with lighttpd mod_gnutls to see if that makes a difference.
What version of openssl libs are you using? 1.1.1l or 3.0.0?

Actions

Copy link

Updated by gstrauss over 3 years ago

I found one issue, but there might be more.

--- a/src/http-header-glue.c
+++ b/src/http-header-glue.c
@@ -1329,6 +1329,7 @@ handler_t http_response_read(request_st * const r, http_response_opts * const op
             /* accumulate response in b until headers completed (or error)*/
             if (r->resp_body_started) {
                 buffer_clear(b);
+                if (0 == r->resp_body_scratchpad) r->resp_body_finished = 1;
                 /* enable simple accumulation of small reads in some situations
                  * no Content-Length (will read to EOF)
                  * Content-Length (will read until r->resp_body_scratchpad == 0)

lighttpd 1.4.60 schedules together the processing of front-end and backend-end events for requests on the same client connection. This reduces duplicative checks and improves buffer reuse, but may have missed detecting if the response body was already fully received along with the headers when the backend provided Content-Length with the response (instead of Transfer-Encoding: chunked or neither).

Actions

Copy link

Updated by gstrauss over 3 years ago

A few patches have been pushed to my dev branch.

Actions

Copy link

Updated by gstrauss over 3 years ago

So far, I have been unable to reproduce the issue with response body at varying amounts less than, but close to, 8k.

Would you share some trace with debug.log-state-handling = "enable" for the hanging stream id?

if I press Ctrl+C to terminate prematurely, the log message in the access log shows the length of a full download

That suggests that the front-end lighttpd has read the data from the backend and has applied HTTP/2 framing when transferring the data from the request write queue to the connection write queue. Does the response have Content-Length? Does curl report that it was missing bytes? Or perhaps there is some reason that lighttpd is not sending HTTP/2 END_STREAM frame after lighttpd finishes sending the response body data?

Actions

Copy link

#10

Updated by gstrauss over 3 years ago

Category set to core

Actions

Copy link

#11

Updated by flynn over 3 years ago

Some answers:

openssl version is from Debian buster: 1.1.1d
the Debian package currently does not include the gnutls module, I added this myself, test results re reported later
using the nss TLS module the problem occurs too, but curl now reports immediately:
```
curl: (92) HTTP/2 stream 0 was not closed cleanly: INTERNAL_ERROR (err 2)
```
the responses all have a Content-Length
usually it takes some time to trigger the problem, after a restart usually all is fine. Also the affected urls may change after restart ...

I'll test your new patches now.

Actions

Copy link

#12

Updated by flynn over 3 years ago

Share some trace with debug.log-state-handling = "enable" on a high volume server is not so easy, I cannot filter out the request in question.

Any ideas, how to limit thee debug messges to a certian request (type). As far as I know this option is always global, I cannot limit the scope with $HTTP[...]...

The nss module reports one error very often:

(mod_nss.c.2197) NSS: mod_nss_read_err: (PR_END_OF_FILE_ERROR) Encountered end of file

Actions

Copy link

#13

Updated by gstrauss over 3 years ago

Any ideas, how to limit thee debug messges to a certian request (type). As far as I know this option is always global, ...

Most of the debug.* options are not global. Unfortunately, debug.log-state-handling and debug.log-request-header-on-error are global, since they are connection-level, or may occur before the request has been parsed (and is valid).

I'll put together a patch (below) to enable debug.log-state-handling after request headers have been read. (It does not make as much sense to do so for debug.log-request-header-on-error)

Aside: lighttpd mod_gnutls will be included in Debian in a future version of lighttpd. It was not included with lighttpd 1.4.56 (released Nov 2020) since at the time, gnutls libraries in Debian were very old and did not have support for TLS v1.3. gnutls 3.6.5 with support for TLS v1.3 was released Dec 2018, almost 2 years prior to lighttpd 1.4.56, but Debian still had an older version of gnutls at the time. (Debian updated to use newer gnutls libraries in Jan 2021)

Separate issue:

(mod_nss.c.2197) NSS: mod_nss_read_err: (PR_END_OF_FILE_ERROR) Encountered end of file

I guess I should add PR_END_OF_FILE_ERROR to those ignored unless debug.log-ssl-noise = "enable"

Quick patch to allow debug.log-state-handling to be toggled after request has started:

--- a/src/configfile.c
+++ b/src/configfile.c
@@ -192,6 +192,9 @@ static void config_merge_config_cpv(request_config * const pconf, const config_p
       case 32:/* server.breakagelog */
         if (cpv->vtype == T_CONFIG_LOCAL) pconf->serrh = cpv->v.v;
         break;
+      case 33:/* debug.log-state-handling */
+        pconf->log_state_handling = (0 != cpv->v.u);
+        break;
       default:/* should not happen */
         return;
     }
@@ -717,7 +720,7 @@ static int config_insert_srvconf(server *srv) {
         T_CONFIG_SCOPE_SERVER }
      ,{ CONST_STR_LEN("debug.log-state-handling"),
         T_CONFIG_BOOL,
-        T_CONFIG_SCOPE_SERVER }
+        T_CONFIG_SCOPE_CONNECTION }
      ,{ CONST_STR_LEN("server.feature-flags"),
         T_CONFIG_ARRAY_KVANY,
         T_CONFIG_SCOPE_SERVER }
@@ -1005,6 +1008,9 @@ static int config_insert(server *srv) {
      ,{ CONST_STR_LEN("server.breakagelog"),
         T_CONFIG_STRING,
         T_CONFIG_SCOPE_CONNECTION }
+     ,{ CONST_STR_LEN("debug.log-state-handling"),
+        T_CONFIG_BOOL,
+        T_CONFIG_SCOPE_CONNECTION }
      ,{ NULL, 0,
         T_CONFIG_UNSET,
         T_CONFIG_SCOPE_UNSET }

Actions

Copy link

#14

Updated by flynn over 3 years ago

The initial tests with the two patches in your personal repo on this ticket look promising, but I want to test again tomorrow after some runtime.

Actions

Copy link

#15

Updated by gstrauss over 3 years ago

Status changed from New to Patch Pending

Actions

Copy link

#16

Updated by flynn over 3 years ago

Until now I cannot reproduce this error anymore, the patches https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/a9a3f2f0b97766b7a84ce19fe7ff88383a3f3a97 and https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/71798d46f9c8d1ba3c71c4867edaac52c88f7d20 solves this issue for me.

Actions

Copy link

#17

Updated by gstrauss over 3 years ago

Thank you for testing and for following up. lighttpd 1.4.61 (including these fixes) is slated to be released later this week.

Actions

Copy link

#18

Updated by gstrauss over 3 years ago

Status changed from Patch Pending to Fixed

Applied in changeset 0757d71e1444b7bde1426de38dbdba123514ee82.

Actions

Copy link

#19

Updated by gstrauss over 3 years ago

There is another bug in lighttpd HTTP/2 support where HTTP/2 END_STREAM might not be sent if the entire response has been sent to the client, and then some time later the backend ends the response (or closes the pipe in the case of CGI). This bug is present in lighttpd 1.4.59 (and possible earlier), but may occur slightly more frequently in lighttpd 1.4.60 due to lighttpd 1.4.60 scheduling together the processing of front-end and backend-end events for requests on the same client connection, reducing the number of calls to the backend processing code.

https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/90f062cb07d362e346792964cf76aac65120d375

Actions

Copy link

#20

Updated by flynn over 3 years ago

We had an issue with an etherpad backend, this patch solves the issue, the etherpad backend has a reproducable response timing now.

Actions

Copy link

Also available in: Atom

Project

General

Profile

Lighttpd

Custom queries

Bug #3111

HTTP/2 requests sometimes take very long (missing last chunk)

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by flynn over 3 years ago

Updated by flynn over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by flynn over 3 years ago

Updated by flynn over 3 years ago

Updated by gstrauss over 3 years ago

Updated by flynn over 3 years ago

Updated by gstrauss over 3 years ago

Updated by flynn over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by gstrauss over 3 years ago

Updated by flynn over 3 years ago