Project

General

Profile

Actions

Bug #3102

closed

Chrome gives random net::ERR_HTTP2_PROTOCOL_ERROR

Added by stenvaag over 2 years ago. Updated over 2 years ago.

Status:
Fixed
Priority:
Normal
Category:
core
Target version:
ASK QUESTIONS IN Forums:
No

Description

When trying with https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/master I get random error from Chrome.

net::ERR_HTTP2_PROTOCOL_ERROR

It is very difficult to reproduce the error. I have to work in a web application that I have created that have a lot of GET and PUT for some time to get the error.

When turning on

debug.log-request-header = "enable"
debug.log-response-header = "enable"
debug.log-request-handling = "enable"

I'm unable to get the error.

I've tried to add log_error lines in h2.c in all places that could produce the error. (I get "normal" output with "h2_send_refused_stream" when trying 12 PUT from my application as expected.) I'm unsure if this (more than 8 PUTs) is required prior to getting the protocol error. When restarting the application (refresh in browser) I get h2_send_goaway with H2_E_NO_ERROR. But when the error "net::ERR_HTTP2_PROTOCOL_ERROR" comes in the browser there is no additional lines in the error log.

I need some more advice on how to track this down.


Files

error_log.zip (726 KB) error_log.zip stenvaag, 2021-09-15 10:15
error_log_timout_1.zip (167 KB) error_log_timout_1.zip stenvaag, 2021-09-15 10:57
Actions #1

Updated by gstrauss over 2 years ago

I need some more advice on how to track this down.

Well, I need some sleep :), so here are some ideas to start.

The debug logging would probably be helpful, but you've tried that.

There are quite a few places in src/h2.c which might send HTTP/2 PROTOCOL_ERROR (lighttpd enum H2_E_PROTOCOL_ERROR), but we do not know if lighttpd is sending it, or if lighttpd is doing something else and Chrome is reporting net::ERR_HTTP2_PROTOCOL_ERROR

While you could build lighttpd with --with-libunwind and annotate src/h2.c to print a stack trace (ck_bt() from src/ck.h) in the h2_send_*() funcs if H2_E_PROTOCOL_ERROR is the error code, that might not be worth the effort unless you find a way to reproduce the error more easily.

Instead, I would suggest starting with lighttpd 1.4.59 -- perhaps add the final result of the patches for #3093 and #3100 -- and see if you can reproduce the error with that. As you suggested, perhaps the issue is exposed in combination with > 8 PUTs at the beginning of the HTTP/2 connection. That would be my hunch, too. After all, #3100 modified the patch for #3093 specifically to restore lighttpd 1.4.59 behavior to handle a large burst of requests received immediately at the beginning of the HTTP/2 connection. The behavior was modified for #3093 (and then again in #3100) for the flurry of PUT requests, and sends REFUSED_STREAM for the excess PUTs beyond what lighttpd is willing to handle in parallel for that HTTP/2 connection.

If you find that you can trigger the bug a little bit more easily (not currently the case), you might build and test lighttpd for each revision where src/h2.c has been modified since lighttpd 1.4.59, as there are not too many changes, and most are fairly small changes.

Actions #2

Updated by gstrauss over 2 years ago

You might test with a burst of requests at connect time (after waiting for lighttpd to close the HTTP/2 connection after idle timeout (server.max-read-idle))
The default for server.max-read-idle is 60 seconds, so you probably want to tune that down to something like 2 or 3 (seconds) during testing.
Send a burst of GET requests followed by PUTs followed by GETs, and test PUTs followed by GETs, varying the first amount by < 8 or > 8.

Actions #3

Updated by stenvaag over 2 years ago

I managed to get the error with request and response logging enabled. I'm still unsure if this is my fcgi backend or lighttpd.

I observe that id:9 is reused prior to the last one getting a response. Prior to this error I am sending the usual 12 PUTs in a burst, then some GETs, and finally a small burst of PUT's where the failing one is the first PUT. The PUT is from a ajax call the GET is from image loading in the map component in my application.

2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :method: PUT
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :authority: gollum.agromatic.no
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :scheme: https
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :path: /tstwst/amweb/data/maps/22951?_dc=1631641308834&ID=FDTFMV64HFAGN5O6TRN4CMDT08&TID=1
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: content-length: 123
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93" 
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua-mobile: ?0
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua-platform: "Windows" 
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: content-type: application/json
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept: */*
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: origin: https://gollum.agromatic.no
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-site: same-origin
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-mode: cors
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-dest: empty
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: referer: https://gollum.agromatic.no/amweb/?ID=FDTFMV64HFAGN5O6TRN4CMDT08
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept-encoding: gzip, deflate, br
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept-language: nb,no;q=0.9,da;q=0.8,sv;q=0.7,en;q=0.6
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: cookie: G_ENABLED_IDPS=google
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :method: GET
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :authority: gollum.agromatic.no
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :scheme: https
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: :path: /tstwst/amweb/wms?ID=FDTFMV64HFAGN5O6TRN4CMDT08&LAYERS=sptema&SRS=EPSG%3A32632&FORMAT=image%2Fpng&TRANSPARENT=TRUE&SERVICE=WMS&VERSION=1.1.1&REQUEST=GetMap&STYLES=&_OLSALT=0.4320661165774786&BBOX=608481,6744861,609158,6745538&WIDTH=256&HEIGHT=256
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93" 
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua-mobile: ?0
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-ch-ua-platform: "Windows" 
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept: image/jxl,image/avif,image/webp,image/apng,image/svg+xml,image/*,*/*;q=0.8
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-site: same-origin
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-mode: no-cors
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: sec-fetch-dest: image
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: referer: https://gollum.agromatic.no/amweb/?ID=FDTFMV64HFAGN5O6TRN4CMDT08
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept-encoding: gzip, deflate, br
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: accept-language: nb,no;q=0.9,da;q=0.8,sv;q=0.7,en;q=0.6
2021-09-14 19:41:55: h2.c.1264) fd:11 id:9 rqst: cookie: G_ENABLED_IDPS=google
2021-09-14 19:41:55: h2.c.1888) fd:11 id:9 resp: :status: 200
2021-09-14 19:41:55: h2.c.1877) fd:11 id:9 resp: content-type: image/png
2021-09-14 19:41:55: h2.c.1877) fd:11 id:9 resp: content-length: 116
2021-09-14 19:41:55: h2.c.1877) fd:11 id:9 resp: accept-ranges: bytes
2021-09-14 19:41:55: h2.c.1888) fd:11 id:9 resp: date: Tue, 14 Sep 2021 17:41:55 GMT
2021-09-14 19:41:55: h2.c.1888) fd:11 id:9 resp: server: lighttpd/1.4.60

Actions #4

Updated by gstrauss over 2 years ago

That raises (at least) two questions:
  • Is Chrome erroneously repeating the same id?
  • Why is lighttpd not rejecting the repeated id?
Actions #5

Updated by stenvaag over 2 years ago

There is probably a "goaway" here that is not logged. I removed all my logging (of goaway). I could try to reinsert, but it is difficult to get the error.

I will try to modify my fcgi backend to return an empty response on a request. If this creates a similar result this could be the error.

Actions #6

Updated by gstrauss over 2 years ago

lighttpd src/h2.c currently ignores HTTP trailers. Here's a patch to detect and reject if an HTTP/2 HEADERS frame is treated as HTTP trailers, and contains an HTTP/2 pseudo-header (starting with ":").

--- a/src/h2.c
+++ b/src/h2.c
@@ -1265,9 +1265,12 @@ h2_parse_headers_frame (request_st * const restrict r, const unsigned char *psrc
                   "fd:%d id:%u rqst: %.*s: %.*s", r->con->fd, r->h2id,
                   (int)hpctx.klen, hpctx.k, (int)hpctx.vlen, hpctx.v);

-            r->http_status = http_request_parse_header(r, &hpctx);
-            if (0 != r->http_status)
+            const int http_status = http_request_parse_header(r, &hpctx);
+            if (__builtin_expect( (0 != http_status), 0)) {
+                if (r->http_status == 0) /*might be set if processing trailers*/
+                    r->http_status = http_status;
                 break;
+            }
         }
       #if 0 /*(see catch-all below)*/
         /* Send GOAWAY (further below) (decoder state not maintained on error)
@@ -1319,7 +1322,7 @@ h2_parse_headers_frame (request_st * const restrict r, const unsigned char *psrc
     rq->bytes_in  += (off_t)hpctx.hlen;
     rq->bytes_out += (off_t)hpctx.hlen;

-    if (0 == r->http_status && LSHPACK_OK == rc) {
+    if (0 == r->http_status && LSHPACK_OK == rc && !trailers) {
         if (hpctx.pseudo)
             r->http_status =
               http_request_validate_pseudohdrs(r, hpctx.scheme,
--- a/src/request.c
+++ b/src/request.c
@@ -749,6 +749,9 @@ http_request_parse_header (request_st * const restrict r, http_header_parse_ctx
         }
     }
     else { /*(trailers)*/
+        if (*k == ':')
+            return http_request_header_line_invalid(r, 400,
+              "invalid pseudo-header in trailers -> 400");
         /* ignore trailers (after required HPACK decoding) if streaming
          * request body to backend since headers have already been sent
          * to backend via Common Gateway Interface (CGI) (CGI, FastCGI,

Actions #7

Updated by gstrauss over 2 years ago

id:9 for the PUT request is the 5th client stream on the HTTP/2 connection (1,3,5,7,9), so you might check the logs and match up the file descriptors (e.g. fd:11) to see the sequence of requests on the connection that had the issue (e.g. 5 PUTs and then GET?)

For testing purposes to try to reproduce the issue, have you tried tuning down the read idle timeout? server.max-read-idle = 1

Actions #8

Updated by stenvaag over 2 years ago

With the above patches for trailers I imediately got the protocol error from the first PUT of the 12, the rest succeeded.

Actions #9

Updated by stenvaag over 2 years ago

Here is the complete error.log from yestday.

Actions #10

Updated by gstrauss over 2 years ago

Looking at error_log.zip and trying to match up against what you posted above, it looks like you may have selected fd:11 id:9 from multiple sections of the log file, and those look to be on different connections.
debug.log-state-handling = "enable" will show connection open and close, as well as request state (so might get a little noisy).

I do see that the one PUT you selected did not get an HTTP response, though it may have gotten an HTTP/2 error on the stream.

With the above patches for trailers I imediately got the protocol error from the first PUT of the 12, the rest succeeded.

Are you including trailers in your HTTP/2 requests? The patches I posted above should only affect HTTP/2 connections when trailers are sent on the client request stream.

The prior patches for #3093 and #3100 should reject the 9th PUT request on the stream (id 17) if all initial requests on the connection are PUT, and all are sent prior to the client sending a SETTINGS ACK to lighttpd. After the client sends the SETTINGS ACK to lighttpd, the client "should not" send requests that would result in more than 8 active streams at a time, since lighttpd sends SETTINGS_MAX_CONCURRENT_STREAMS 8 in SETTINGS as part of the HTTP/2 Connection Preface that lighttpd sends to the client.

If you can now more easily reproduce this, would you attach a log with debugging enabled and include debug.log-state-handling = "enable"?

Also, is the behavior with the above patches (for trailers) what you expect?

Actions #11

Updated by stenvaag over 2 years ago

I have no knowledge about trailers.

Actions #12

Updated by stenvaag over 2 years ago

Attached is a log with server.max-read-idle = 1. A grep for "driftsenheter" in the log, the last one at "12:49:19" fails with protocol error. Looks like there is a new "fd" opened just after.

Actions #13

Updated by stenvaag over 2 years ago

I'm unable to get the error with debug.log-state-handling = "enable"

Actions #14

Updated by gstrauss over 2 years ago

The connection with the GET for "driftsenheter" in the log, the last one at "12:49:19" looks like it (the connection) was opened a few seconds earlier at 12:49:13. The logs do not contain every HTTP/2 frame sent back and forth (that would be noisy!) so we don't see what is happening at that layer.

Is there anything in the lighttpd error log around that time?

I'll try to put together some patches later today to add some additional instrumentation. I hope to have them ready for you by tomorrow morning (your timezone).

Actions #15

Updated by stenvaag over 2 years ago

Is there anything in the lighttpd error log around that time?

You have a complete copy of the error.log.

Looks like the connection is closed and opened again here

2021-09-15 12:47:44: h2.c.1888) fd:10 id:259 resp: server: lighttpd/1.4.60
2021-09-15 12:49:13: h2.c.1264) fd:10 id:1 rqst: :method: GET

Actions #16

Updated by stenvaag over 2 years ago

I modified connections.c to only include "connection closed" messages. Then I was able to get the protocol error:

...
2021-09-15 13:55:28: h2.c.1888) fd:10 id:277 resp: server: lighttpd/1.4.60
2021-09-15 13:55:34: connections.c.1458) connection closed - keep-alive timeout: 10
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: :method: PUT
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: :authority: gollum.agromatic.no
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: :scheme: https
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: :path: /tstwst/amweb/data/maps/22951?_dc=1631706927914&ID=A3P03BA18GKPTR2ENTRJFH20D8&TID=1
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: content-length: 118
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93" 
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-ch-ua-mobile: ?0
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-ch-ua-platform: "Windows" 
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: content-type: application/json
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: accept: */*
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: origin: https://gollum.agromatic.no
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-fetch-site: same-origin
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-fetch-mode: cors
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: sec-fetch-dest: empty
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: referer: https://gollum.agromatic.no/amweb/?ID=A3P03BA18GKPTR2ENTRJFH20D8
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: accept-encoding: gzip, deflate, br
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: accept-language: nb,no;q=0.9,da;q=0.8,sv;q=0.7,en;q=0.6
2021-09-15 13:55:34: h2.c.1264) fd:10 id:279 rqst: cookie: G_ENABLED_IDPS=google
2021-09-15 13:55:34: connections.c.109) connection closed for fd 10
2021-09-15 13:55:34: h2.c.1264) fd:10 id:1 rqst: :method: PUT
...

Actions #17

Updated by gstrauss over 2 years ago

At first glance, that appears to be proper behavior.
2021-09-15 13:55:34: connections.c.1458) connection closed - keep-alive timeout: 10
After that, lighttpd will have sent GOAWAY to the client, marking stream id:277 as the last handled request, so when stream id:279 comes in, lighttpd will parse the HPACK and then discard the new request (see end of h2_recv_headers())

Actions #18

Updated by gstrauss over 2 years ago

Some additional patches which might help your tracing. There are all optional and for testing only. The patch to log_buffer_prepare() adds a high-precision timestamp. lighttpd sends SETTINGS to client at the beginning of a new connection.

--- a/src/h2.c
+++ b/src/h2.c
@@ -313,6 +313,7 @@ h2_send_rst_stream_id (uint32_t h2id, connection * const con, const request_h2er
     rst_stream.u[3] = htonl(e);
     chunkqueue_append_mem(con->write_queue,  /*(+3 to skip over align padding)*/
                           (const char *)rst_stream.c+3, sizeof(rst_stream)-3);
+log_error(con->srv->errh, __FILE__, __LINE__, "fd:%d id:%u RST_STREAM", con->fd, h2id);
 }

@@ -396,6 +397,7 @@ h2_send_goaway (connection * const con, const request_h2error_t e)
     goaway.u[4] = htonl(e);
     chunkqueue_append_mem(con->write_queue,  /*(+3 to skip over align padding)*/
                           (const char *)goaway.c+3, sizeof(goaway)-3);
+log_error(con->srv->errh, __FILE__, __LINE__, "fd:%d GOAWAY last-stream-id:%u", con->fd, h2c->h2_cid);
 }

@@ -1803,6 +1805,7 @@ h2_init_con (request_st * const restrict h2r, connection * const restrict con, c
     /*h2r->h2_rwin += 65537;*//*(enable if WINDOWS_UPDATE is sent above)*/
     chunkqueue_append_mem(con->write_queue,
                           (const char *)h2settings, sizeof(h2settings));
+log_error(con->srv->errh, __FILE__, __LINE__, "fd:%d SETTINGS", con->fd);

     if (!h2_recv_client_connection_preface(con)) {
         /*(alternatively, func ptr could be saved in an element in (h2con *))*/
--- a/src/log.c
+++ b/src/log.c
@@ -103,6 +103,19 @@ static int log_buffer_prepare(const log_error_st *errh, const char *filename, un
                if (-1 == errh->fd) return -1;
                /* cache the generated timestamp */
                if (__builtin_expect( (tlast != log_epoch_secs), 0)) {
+                       if (1) { /*(do not cache if high-precision; do not update tlast)*/
+                               struct timespec ts = { 0, 0 };
+                               log_clock_gettime_realtime(&ts);
+                               buffer_clear(b);
+                               /* alt: buffer_append_strftime(b, "%F %T", &tm); */
+                               buffer_append_int(b, TIME64_CAST(ts.tv_sec));
+                               buffer_append_string_len(b, CONST_STR_LEN(".000000000: ("));
+                               char n[LI_ITOSTRING_LENGTH];
+                               const size_t nlen =
+                                 li_utostrn(n, sizeof(n), (unsigned long)ts.tv_nsec);
+                               memcpy(b->ptr+buffer_clen(b)-nlen-3, n, nlen);
+                               break;
+                       }
                        struct tm tm;
                        tlast = log_epoch_secs;
                        tlen = (uint32_t)

Actions #19

Updated by stenvaag over 2 years ago

1631777393.420149005: (h2.c.1891) fd:10 id:71 resp: server: lighttpd/1.4.60
1631777398.812910701: (connections.c.1458) connection closed - keep-alive timeout: 10
1631777398.812985521: (h2.c.1266) fd:10 id:73 rqst: :method: PUT
1631777398.812993524: (h2.c.1266) fd:10 id:73 rqst: :authority: gollum.agromatic.no
1631777398.812998263: (h2.c.1266) fd:10 id:73 rqst: :scheme: https
1631777398.813004961: (h2.c.1266) fd:10 id:73 rqst: :path: /tstwst/skplweb/data/skiftevekster.json?_dc=1631777396335&ID=SQSAT1JH6DGCUNLID1L65A0HBG&TID=1
1631777398.813008681: (h2.c.1266) fd:10 id:73 rqst: content-length: 325
1631777398.813012768: (h2.c.1266) fd:10 id:73 rqst: sec-ch-ua: "Google Chrome";v="93", " Not;A Brand";v="99", "Chromium";v="93" 
1631777398.813034670: (h2.c.1266) fd:10 id:73 rqst: sec-ch-ua-mobile: ?0
1631777398.813038789: (h2.c.1266) fd:10 id:73 rqst: user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.63 Safari/537.36
1631777398.813043807: (h2.c.1266) fd:10 id:73 rqst: sec-ch-ua-platform: "Windows" 
1631777398.813047609: (h2.c.1266) fd:10 id:73 rqst: content-type: application/json
1631777398.813051198: (h2.c.1266) fd:10 id:73 rqst: accept: */*
1631777398.813054996: (h2.c.1266) fd:10 id:73 rqst: origin: https://gollum.agromatic.no
1631777398.813058436: (h2.c.1266) fd:10 id:73 rqst: sec-fetch-site: same-origin
1631777398.813061696: (h2.c.1266) fd:10 id:73 rqst: sec-fetch-mode: cors
1631777398.813064918: (h2.c.1266) fd:10 id:73 rqst: sec-fetch-dest: empty
1631777398.813068162: (h2.c.1266) fd:10 id:73 rqst: referer: https://gollum.agromatic.no/amweb/?ID=SQSAT1JH6DGCUNLID1L65A0HBG
1631777398.813071560: (h2.c.1266) fd:10 id:73 rqst: accept-encoding: gzip, deflate, br
1631777398.813076091: (h2.c.1266) fd:10 id:73 rqst: accept-language: nb,no;q=0.9,da;q=0.8,sv;q=0.7,en;q=0.6
1631777398.813079753: (h2.c.1266) fd:10 id:73 rqst: cookie: G_ENABLED_IDPS=google
1631777398.813118960: (h2.c.400) fd:10 GOAWAY last-stream-id:73
1631777398.813123670: (h2.c.316) fd:10 id:73 RST_STREAM
1631777398.814291420: (connections.c.109) connection closed for fd 10
1631777398.814907361: (h2.c.1803) fd:10 SETTINGS
1631777398.826927622: (h2.c.1266) fd:10 id:1 rqst: :method: PUT
Actions #20

Updated by gstrauss over 2 years ago

It looks like there is a short period of time after lighttpd decides to close the idle connection, and during which new data can arrive on the connection, before lighttpd sends GOAWAY. Since lighttpd read the new stream request, it sends a RST_STREAM, which the browser might not retry. Please try this, which may result in lighttpd sending GOAWAY with the previous stream id (in the example it would be 71) before reading and then having to send RST_STREAM for the PUT in id 73. Chrome should then resend that request.

--- a/src/connections.c
+++ b/src/connections.c
@@ -1460,6 +1460,7 @@ static void connection_check_timeout (connection * const con, const unix_time64_
                               con->fd);
                 }
                 connection_set_state(r, CON_STATE_RESPONSE_END);
+                con->is_readable = 0;
                 changed = 1;
             }
         }

Actions #21

Updated by stenvaag over 2 years ago

I'm now unable to get the error.

...
1631780629.700954970: (h2.c.1891) fd:10 id:2639 resp: server: lighttpd/1.4.60
1631780635.707591418: (connections.c.1458) connection closed - keep-alive timeout: 10
1631780635.707629042: (h2.c.400) fd:10 GOAWAY last-stream-id:2639
1631780635.708262156: (connections.c.109) connection closed for fd 10
1631780638.661998649: (h2.c.1803) fd:10 SETTINGS
1631780638.668634915: (h2.c.1266) fd:10 id:1 rqst: :method: PUT
...
1631780648.773547098: (h2.c.1891) fd:10 id:251 resp: server: lighttpd/1.4.60
1631780654.780188118: (connections.c.1458) connection closed - keep-alive timeout: 10
1631780654.780226409: (h2.c.400) fd:10 GOAWAY last-stream-id:251
1631780654.780949803: (connections.c.109) connection closed for fd 10
1631780656.944193513: (h2.c.1803) fd:10 SETTINGS
1631780656.949804056: (h2.c.1266) fd:10 id:1 rqst: :method: PUT
...
1631780661.073669198: (h2.c.1891) fd:10 id:59 resp: server: lighttpd/1.4.60
1631780666.757065202: (connections.c.1458) connection closed - keep-alive timeout: 10
1631780666.757160603: (h2.c.400) fd:10 GOAWAY last-stream-id:59
1631780666.758526292: (connections.c.109) connection closed for fd 10
1631780666.759218003: (h2.c.1803) fd:10 SETTINGS
1631780666.767826790: (h2.c.1266) fd:10 id:1 rqst: :method: GET

Actions #22

Updated by gstrauss over 2 years ago

  • Status changed from New to Patch Pending

Great! Thank you very much for your time and persistence in helping track this down.

This tightens up the code a bit, which was not expecting to have a short race receiving data after a read timeout. After all, the read idle timer (measured in seconds) expired!

These patches will be on lighttpd git master shortly, and will be part of lighttpd 1.4.60, expected to be released this month.


BTW, you might want to double-check your FastCGI backend and how it handles reading errors or truncated reads, e.g. unexpected EOF. Before the above patch, lighttpd may have started the connection to your FastCGI backend, but then aborted by closing the connection. Your FastCGI probably read EOF and got 0 bytes, and if you looked in your read buffer for the FastCGI id without checking that you at least read enough bytes to fill the FastCGI header, the buffer would contain whatever garbage was previously there.

Actions #23

Updated by gstrauss over 2 years ago

  • Status changed from Patch Pending to Fixed
Actions #24

Updated by stenvaag over 2 years ago

Thanks for your tip about my FastCGI backend. You where correct about a random byte was checked when the connection was closed.

Actions

Also available in: Atom