Project

General

Profile

Actions

Bug #3152

closed

Random Segfaults with version 1.4.64 w/ mod_sockproxy and ALPN h2

Added by ultimator over 2 years ago. Updated over 2 years ago.

Status:
Fixed
Priority:
Normal
Category:
mod_sockproxy
Target version:
ASK QUESTIONS IN Forums:
No

Description

Since the update to version 1.4.64 (from 1.4.59), lighttpd crashes randomly with SIGSEGV. I also added mod_sockproxy around this time so this might be related.
Sometimes it takes only a few hours, sometimes a few days until a crash occurs. I can't reproduce them.
OS is Gentoo Linux. Arch arm64. Compiled with brotli ipv6 lua nettle pcre2 openssl xxhash attr zlib.
CFLAGS="-march=armv8-a+crc+simd -mtune=cortex-a72 -ftree-vectorize -O2 -pipe -fomit-frame-pointer"
Here is a gdb stacktrace for two of the crashes

gdb --args /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
...
Program received signal SIGSEGV, Segmentation fault.                                                                                                                                                               
0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
1209    connections.c: Datei oder Verzeichnis nicht gefunden.

(gdb) bt full                                                                                                                                                                                                      
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
        h2c = 0x0
        resched = 127
#1  0x0000005555566f0c in connection_state_machine (con=0x555570a520) at connections.c:1388
        r = 0x555570a520
#2  0x0000005555561194 in server_run_con_queue (joblist=0x555570a520, sentinel=0x55555d9f20 <log_con_jqueue>) at server.c:1938
        con = 0x555570a520
        jqnext = 0x55555d9f20 <log_con_jqueue>
#3  0x0000005555561324 in server_main_loop (srv=0x55555db520) at server.c:1991
        mono_ts = 1392894
        sentinel = 0x55555d9f20 <log_con_jqueue>
        joblist = 0x555570a520
        last_active_ts = 1392894
#4  0x000000555556154c in main (argc=4, argv=0x7ffffff168) at server.c:2065
        srv = 0x55555db520
        rc = 1
gdb --args /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
...
Program received signal SIGSEGV, Segmentation fault.
0x00000055555668f4 in connection_state_machine_h2 (h2r=0x55556a8de0, con=0x55556a8de0) at connections.c:1209
1209    connections.c: Datei oder Verzeichnis nicht gefunden.

(gdb) bt full
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x55556a8de0, con=0x55556a8de0) at connections.c:1209
        h2c = 0x0
        resched = 127
#1  0x0000005555566f0c in connection_state_machine (con=0x55556a8de0) at connections.c:1388
        r = 0x55556a8de0
#2  0x0000005555561194 in server_run_con_queue (joblist=0x55556a8de0, sentinel=0x55555d9f20 <log_con_jqueue>) at server.c:1938
        con = 0x55556a8de0
        jqnext = 0x55555d9f20 <log_con_jqueue>
#3  0x0000005555561324 in server_main_loop (srv=0x55555db520) at server.c:1991
        mono_ts = 1437908
        sentinel = 0x55555d9f20 <log_con_jqueue>
        joblist = 0x55556a8de0
        last_active_ts = 1437908
#4  0x000000555556154c in main (argc=4, argv=0x7ffffff148) at server.c:2065
        srv = 0x55555db520
        rc = 1
Last logged connection before the crash:
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: PUT /_matrix/federation/v1/send/1648990922236 HTTP/1.1
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Content-Length: 253
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: User-Agent: Synapse/1.55.2
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Content-Type: application/json
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Authorization: X-Matrix origin=kif.rocks,key="<redacted>",sig="<redacted>" 
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Host: matrix.pygos.space:443
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: 
2022-04-05 18:30:47: (response.c.402) -- parsed Request-URI
2022-04-05 18:30:47: (response.c.404) Request-URI     : /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.406) URI-scheme      : https
2022-04-05 18:30:47: (response.c.408) URI-authority   : matrix.pygos.space
2022-04-05 18:30:47: (response.c.410) URI-path (clean): /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.412) URI-query       : 
2022-04-05 18:30:47: (gw_backend.c.2682) handling the request using proxy
2022-04-05 18:30:47: (response.c.490) -- logical -> physical
2022-04-05 18:30:47: (response.c.492) Doc-Root     : /var/www/servers/matrix.pygos.space/htdocs/
2022-04-05 18:30:47: (response.c.494) Basedir      : /var/www/servers/matrix.pygos.space/htdocs/
2022-04-05 18:30:47: (response.c.496) Rel-Path     : /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.498) Path         : /var/www/servers/matrix.pygos.space/htdocs/_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.164) fd:18 resp: HTTP/1.1 200 OK
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Server: Synapse/1.55.2
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Date: Tue, 05 Apr 2022 16:30:48 GMT
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Content-Type: application/json
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Cache-Control: no-cache, no-store, must-revalidate
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Origin: *
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Methods: GET, HEAD, POST, PUT, DELETE, OPTIONS
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Headers: X-Requested-With, Content-Type, Authorization, Date
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Strict-Transport-Security: max-age=15768000; includeSubdomains; preload
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Content-Length: 11
2022-04-05 18:30:47: (response.c.164) fd:18 resp: 
2022-04-05 18:30:47: (gw_backend.c.2682) handling the request using sockproxy


Files

lighttpd.conf (23 KB) lighttpd.conf ultimator, 2022-04-06 12:23
Actions #1

Updated by gstrauss over 2 years ago

Thanks for the thorough info. It looks like h2c has been cleaned up, and the SIGSEGV occurs on a NULL pointer dereference. I need to look into why...

(gdb) bt full
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
        h2c = 0x0

Actions #2

Updated by gstrauss over 2 years ago

So far, nothing obvious has jumped out to me as an error. Given the stack trace you provided, the connection is still in the job queue after being retired (which can happen), but the connection should not have con->request.http_version set to HTTP_VERSION_2 after being retired, and so should not reach connection_state_machine_h2()

connection_handle_response_end_state() is the only place which calls h2_retire_con(), which is the only place which sets con->h2 = NULL. (con->h2 is allocated when con->request.http_version is set to HTTP_VERSION_2.)

Further along in connection_handle_response_end_state(), request_reset() is called (either directly or through the call to connection_handle_shutdown() which calls connection_reset(), which calls request_reset()), and request_reset() changes con->request.http_version to HTTP_VERSION_UNSET.

I'll have to look more into this later. In the meantime, are you able to share your lighttpd config (lighttpd -f /etc/lighttpd/lighttpd.conf -p) so that I can see what other modules might be involved?

While the following might be a workaround (and might need further minor adjustments to be sufficient), I would like to track down why this is occurring, so the following is a bandaid, and not a solution.

--- a/src/connections.c
+++ b/src/connections.c
@@ -1205,6 +1205,7 @@ static void
 connection_state_machine_h2 (request_st * const h2r, connection * const con)
 {
     h2con * const h2c = con->h2;
+    if (NULL == h2c) return;

     if (h2c->sent_goaway <= 0
         && (chunkqueue_is_empty(con->read_queue) || h2_parse_frames(con))

Actions #3

Updated by gstrauss over 2 years ago

Here is a guess. Would you please try this patch if you can? Perhaps ALPN negotiates h2 in a TLS module before mod_sockproxy picks up the connection.

--- a/src/mod_sockproxy.c
+++ b/src/mod_sockproxy.c
@@ -163,6 +163,7 @@ static handler_t mod_sockproxy_connection_accept(connection *con, void *p_d) {
                hctx->create_env = sockproxy_create_env_connect;
                hctx->response = chunk_buffer_acquire();
                r->http_status = -1; /*(skip HTTP processing)*/
+               r->http_version = HTTP_VERSION_UNSET;
        }

        return HANDLER_GO_ON;

Actions #4

Updated by ultimator over 2 years ago

Thank you for looking into this.
lighttpd -f /etc/lighttpd/lighttpd.conf -p is attached.
I'll try your patch and report back.

Actions #5

Updated by ultimator over 2 years ago

It's been over a day now with the patch applied and no crash so far.
Looks promising :)
Thank you

I'll continue the testing and report back in a few days if sth changes

Actions #6

Updated by gstrauss over 2 years ago

  • Status changed from New to Patch Pending
  • Target version changed from 1.4.xx to 1.4.65

Thanks for the update. To confirm: Are you using only the one-line patch to mod_sockproxy? (and not the bandaid to connections.c?)

(If you still have core files from crash, you could print h2r->handler_module->name and (if h2r->handler_module is not NULL), then I would expect to see "sockproxy".)

Actions #7

Updated by gstrauss over 2 years ago

  • Subject changed from Random Segfaults with version 1.4.64 to Random Segfaults with version 1.4.64 w/ mod_sockproxy and ALPN h2
  • Category set to mod_sockproxy
Actions #8

Updated by ultimator over 2 years ago

Yes I only used the patch to mod_sockproxy.
Unfortunately I have no core dumps. However if the patch turns out to be working, I could remove it, wait for a crash and then check this if that's desired.

Actions #9

Updated by gstrauss over 2 years ago

Unfortunately I have no core dumps. However if the patch turns out to be working, I could remove it, wait for a crash and then check this if that's desired.

Thanks, but not necessary. The simple mod_sockproxy patch fixes something that I overlooked when adding HTTP/2 support to lighttpd.

Actions #10

Updated by gstrauss over 2 years ago

  • Status changed from Patch Pending to Fixed
Actions #11

Updated by ultimator over 2 years ago

Bad news. It just happened again with the exact same stacktrace as before. However I think the patch seems to work somehow because it took a lot longer this time.
I also inadvertently closed the debugger so I have to wait for the next crash for a core dump.

Actions #12

Updated by gstrauss over 2 years ago

Please save the core dump. If I can access the core file and the associated lighttpd executable file, I can examine the program state outside the stack trace.

Actions #13

Updated by gstrauss over 2 years ago

If the TLS ClientHello is received or processed after lighttpd mod_sockproxy has been initialized for the connection, I think that might explain what you are seeing. I think one answer is to avoid selecting a protocol using TLS ALPN if mod_sockproxy will be handling the connection. However, doing so will break someone using mod_sockproxy to connect to an HTTP/2 backend, and desiring to use HTTP/2. For that, an alternative patch (further below) might be better. I am leaning towards applying the second patch below, and applying a similar patch to other lighttpd TLS modules.

--- a/src/mod_openssl.c
+++ b/src/mod_openssl.c
@@ -1874,6 +1874,8 @@ mod_openssl_alpn_select_cb (SSL *ssl, const unsigned char **out, unsigned char *
     handler_ctx *hctx = (handler_ctx *) SSL_get_app_data(ssl);
     unsigned short proto;
     UNUSED(arg);
+    if (hctx->r->handler_module) /*(e.g. mod_sockproxy)*/
+        return SSL_TLSEXT_ERR_NOACK;

     for (unsigned int i = 0, n; i < inlen; i += n) {
         n = in[i++];

Alternative patch

--- a/src/mod_openssl.c
+++ b/src/mod_openssl.c
@@ -1883,7 +1883,8 @@ mod_openssl_alpn_select_cb (SSL *ssl, const unsigned char **out, unsigned char *
             if (in[i] == 'h' && in[i+1] == '2') {
                 if (!hctx->r->conf.h2proto) continue;
                 proto = MOD_OPENSSL_ALPN_H2;
-                hctx->r->http_version = HTTP_VERSION_2;
+                if (hctx->r->handler_module == NULL)/*(e.g. not mod_sockproxy)*/
+                    hctx->r->http_version = HTTP_VERSION_2;
                 break;
             }
             continue;

Actions #14

Updated by ultimator over 2 years ago

It crashed again. I dumped the core but I can't upload it here due to filesize restrictions. So I put it on Google Drive:
https://drive.google.com/file/d/1JmruFxcxVg92QWGT7oY51KWloz94x4Xg/view?usp=drivesdk

I'll also apply the second patch now and see if that works.

Actions #15

Updated by gstrauss over 2 years ago

Thank you. I can see from the core that there are 4 connections, and the crash occurs on a connection being handled by mod_sockproxy before any non-TLS bytes are read. (I can't as easily look into mod_openssl.so state since you did not include that in the .tar.) Still, the core supports my hunch that the connection is negotiating h2 via ALPN, but that mod_sockproxy has already claimed the connection, so the http_version is not being reset. I think the latest patch will address your issue.

As an additional safeguard, my dev branch contains the following patch, which should (independently) avoid the issue.

--- a/src/connections.c
+++ b/src/connections.c
@@ -1378,11 +1380,11 @@ connection_state_machine_h1 (request_st * const r, connection * const con)
 void
 connection_state_machine (connection * const con)
 {
     request_st * const r = &con->request;
-    if (r->http_version == HTTP_VERSION_2)
+    if (con->h2)
         connection_state_machine_h2(r, con);
     else /* if (r->http_version <= HTTP_VERSION_1_1) */
         connection_state_machine_h1(r, con);
 }

As an aside should others find this, a short-term workaround is to disable h2 support in lighttpd, unless HTTP/2 support is critical to your needs
server.feature-flags += ("server.h2proto" => "disable")

Actions #16

Updated by ultimator over 2 years ago

gstrauss wrote in #note-15:

I can't as easily look into mod_openssl.so state since you did not include that in the .tar.

I added all the *.mod files to the archive. You can use the link from comment #note-14

Actions #17

Updated by gstrauss over 2 years ago

@ultimator: The core contains sensitive info. Please change your TLS certificate(s) and revoke the current one(s), if necessary.

Actions #18

Updated by gstrauss over 2 years ago

The mod_openssl.c patch you are now testing should address the issue. My guess is that some scanner is connecting to port 853 and trying TLS ALPN h2. Later this weekend or on Mon, I'll try to reproduce this and confirm that it is fixed with either of the mod_openssl.c patches above.

Actions #19

Updated by ultimator over 2 years ago

gstrauss wrote in #note-17:

@ultimator: The core contains sensitive info. Please change your TLS certificate(s) and revoke the current one(s), if necessary.

Thanks, already done.

Actions

Also available in: Atom