Bug #3152 (Closed)
Random Segfaults with version 1.4.64 w/ mod_sockproxy and ALPN h2
Description
Since the update to version 1.4.64 (from 1.4.59), lighttpd crashes randomly with SIGSEGV. I also added mod_sockproxy around this time, so this might be related.
Sometimes it takes only a few hours, sometimes a few days, until a crash occurs. I can't reproduce the crashes on demand.
OS: Gentoo Linux, arch arm64.
Compiled with: brotli ipv6 lua nettle pcre2 openssl xxhash attr zlib
CFLAGS="-march=armv8-a+crc+simd -mtune=cortex-a72 -ftree-vectorize -O2 -pipe -fomit-frame-pointer"
Here are gdb stack traces for two of the crashes:
gdb --args /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
...
Program received signal SIGSEGV, Segmentation fault.
0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
1209    connections.c: No such file or directory.
(gdb) bt full
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
        h2c = 0x0
        resched = 127
#1  0x0000005555566f0c in connection_state_machine (con=0x555570a520) at connections.c:1388
        r = 0x555570a520
#2  0x0000005555561194 in server_run_con_queue (joblist=0x555570a520, sentinel=0x55555d9f20 <log_con_jqueue>) at server.c:1938
        con = 0x555570a520
        jqnext = 0x55555d9f20 <log_con_jqueue>
#3  0x0000005555561324 in server_main_loop (srv=0x55555db520) at server.c:1991
        mono_ts = 1392894
        sentinel = 0x55555d9f20 <log_con_jqueue>
        joblist = 0x555570a520
        last_active_ts = 1392894
#4  0x000000555556154c in main (argc=4, argv=0x7ffffff168) at server.c:2065
        srv = 0x55555db520
        rc = 1
gdb --args /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
...
Program received signal SIGSEGV, Segmentation fault.
0x00000055555668f4 in connection_state_machine_h2 (h2r=0x55556a8de0, con=0x55556a8de0) at connections.c:1209
1209    connections.c: No such file or directory.
(gdb) bt full
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x55556a8de0, con=0x55556a8de0) at connections.c:1209
        h2c = 0x0
        resched = 127
#1  0x0000005555566f0c in connection_state_machine (con=0x55556a8de0) at connections.c:1388
        r = 0x55556a8de0
#2  0x0000005555561194 in server_run_con_queue (joblist=0x55556a8de0, sentinel=0x55555d9f20 <log_con_jqueue>) at server.c:1938
        con = 0x55556a8de0
        jqnext = 0x55555d9f20 <log_con_jqueue>
#3  0x0000005555561324 in server_main_loop (srv=0x55555db520) at server.c:1991
        mono_ts = 1437908
        sentinel = 0x55555d9f20 <log_con_jqueue>
        joblist = 0x55556a8de0
        last_active_ts = 1437908
#4  0x000000555556154c in main (argc=4, argv=0x7ffffff148) at server.c:2065
        srv = 0x55555db520
        rc = 1

Last logged connection before the crash:
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: PUT /_matrix/federation/v1/send/1648990922236 HTTP/1.1
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Content-Length: 253
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: User-Agent: Synapse/1.55.2
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Content-Type: application/json
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Authorization: X-Matrix origin=kif.rocks,key="<redacted>",sig="<redacted>"
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst: Host: matrix.pygos.space:443
2022-04-05 18:30:47: (connections.c.770) fd:18 rqst:
2022-04-05 18:30:47: (response.c.402) -- parsed Request-URI
2022-04-05 18:30:47: (response.c.404) Request-URI     : /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.406) URI-scheme      : https
2022-04-05 18:30:47: (response.c.408) URI-authority   : matrix.pygos.space
2022-04-05 18:30:47: (response.c.410) URI-path (clean): /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.412) URI-query       :
2022-04-05 18:30:47: (gw_backend.c.2682) handling the request using proxy
2022-04-05 18:30:47: (response.c.490) -- logical -> physical
2022-04-05 18:30:47: (response.c.492) Doc-Root : /var/www/servers/matrix.pygos.space/htdocs/
2022-04-05 18:30:47: (response.c.494) Basedir  : /var/www/servers/matrix.pygos.space/htdocs/
2022-04-05 18:30:47: (response.c.496) Rel-Path : /_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.498) Path     : /var/www/servers/matrix.pygos.space/htdocs/_matrix/federation/v1/send/1648990922236
2022-04-05 18:30:47: (response.c.164) fd:18 resp: HTTP/1.1 200 OK
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Server: Synapse/1.55.2
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Date: Tue, 05 Apr 2022 16:30:48 GMT
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Content-Type: application/json
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Cache-Control: no-cache, no-store, must-revalidate
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Origin: *
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Methods: GET, HEAD, POST, PUT, DELETE, OPTIONS
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Access-Control-Allow-Headers: X-Requested-With, Content-Type, Authorization, Date
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Strict-Transport-Security: max-age=15768000; includeSubdomains; preload
2022-04-05 18:30:47: (response.c.164) fd:18 resp: Content-Length: 11
2022-04-05 18:30:47: (response.c.164) fd:18 resp:
2022-04-05 18:30:47: (gw_backend.c.2682) handling the request using sockproxy
Updated by gstrauss over 2 years ago
Thanks for the thorough info. It looks like h2c has been cleaned up, and the SIGSEGV occurs on a NULL pointer dereference. I need to look into why...
(gdb) bt full
#0  0x00000055555668f4 in connection_state_machine_h2 (h2r=0x555570a520, con=0x555570a520) at connections.c:1209
        h2c = 0x0
Updated by gstrauss over 2 years ago
So far, nothing obvious has jumped out at me as an error. Given the stack trace you provided, the connection is still in the job queue after being retired (which can happen), but a retired connection should not have con->request.http_version set to HTTP_VERSION_2, and so should not reach connection_state_machine_h2().

connection_handle_response_end_state() is the only place which calls h2_retire_con(), which is the only place which sets con->h2 = NULL. (con->h2 is allocated when con->request.http_version is set to HTTP_VERSION_2.)

Further along in connection_handle_response_end_state(), request_reset() is called (either directly, or through the call to connection_handle_shutdown(), which calls connection_reset(), which calls request_reset()), and request_reset() changes con->request.http_version to HTTP_VERSION_UNSET.
I'll have to look more into this later. In the meantime, are you able to share your lighttpd config (lighttpd -f /etc/lighttpd/lighttpd.conf -p) so that I can see what other modules might be involved?
While the following might be a workaround (and might need further minor adjustments to be sufficient), I would like to track down why this is occurring; the following is a bandaid, not a solution.
--- a/src/connections.c
+++ b/src/connections.c
@@ -1205,6 +1205,7 @@ static void
 connection_state_machine_h2 (request_st * const h2r, connection * const con)
 {
     h2con * const h2c = con->h2;
+    if (NULL == h2c) return;
     if (h2c->sent_goaway <= 0
         && (chunkqueue_is_empty(con->read_queue) || h2_parse_frames(con))
Updated by gstrauss over 2 years ago
Here is a guess. Would you please try this patch if you can? Perhaps ALPN negotiates h2 in a TLS module before mod_sockproxy picks up the connection.
--- a/src/mod_sockproxy.c
+++ b/src/mod_sockproxy.c
@@ -163,6 +163,7 @@ static handler_t mod_sockproxy_connection_accept(connection *con, void *p_d) {
         hctx->create_env = sockproxy_create_env_connect;
         hctx->response = chunk_buffer_acquire();
         r->http_status = -1; /*(skip HTTP processing)*/
+        r->http_version = HTTP_VERSION_UNSET;
     }
     return HANDLER_GO_ON;
Updated by ultimator over 2 years ago
- File lighttpd.conf added
Thank you for looking into this. The output of lighttpd -f /etc/lighttpd/lighttpd.conf -p is attached.
I'll try your patch and report back.
Updated by ultimator over 2 years ago
It's been over a day now with the patch applied and no crash so far.
Looks promising :)
Thank you
I'll continue testing and report back in a few days if something changes.
Updated by gstrauss over 2 years ago
- Status changed from New to Patch Pending
- Target version changed from 1.4.xx to 1.4.65
Thanks for the update. To confirm: Are you using only the one-line patch to mod_sockproxy? (and not the bandaid to connections.c?)
(If you still have core files from the crash, you could print h2r->handler_module->name (if h2r->handler_module is not NULL); I would expect to see "sockproxy".)
Updated by gstrauss over 2 years ago
- Subject changed from Random Segfaults with version 1.4.64 to Random Segfaults with version 1.4.64 w/ mod_sockproxy and ALPN h2
- Category set to mod_sockproxy
Updated by ultimator over 2 years ago
Yes, I only used the patch to mod_sockproxy.
Unfortunately I have no core dumps. However, if the patch turns out to be working, I could remove it, wait for a crash, and then check this if that's desired.
Updated by gstrauss over 2 years ago
Unfortunately I have no core dumps. However if the patch turns out to be working, I could remove it, wait for a crash and then check this if that's desired.
Thanks, but not necessary. The simple mod_sockproxy patch fixes something that I overlooked when adding HTTP/2 support to lighttpd.
Updated by gstrauss over 2 years ago
- Status changed from Patch Pending to Fixed
Applied in changeset e5dc98faf3b873704290f4b6df3f631784ac294f.
Updated by ultimator over 2 years ago
Bad news: it just happened again, with the exact same stack trace as before. However, the patch does seem to help in some way, because it took a lot longer this time.
I also inadvertently closed the debugger, so I have to wait for the next crash for a core dump.
Updated by gstrauss over 2 years ago
Please save the core dump. If I can access the core file and the associated lighttpd executable file, I can examine the program state outside the stack trace.
Updated by gstrauss over 2 years ago
If the TLS ClientHello is received or processed after lighttpd mod_sockproxy has been initialized for the connection, I think that might explain what you are seeing. I think one answer is to avoid selecting a protocol using TLS ALPN if mod_sockproxy will be handling the connection. However, doing so will break someone using mod_sockproxy to connect to an HTTP/2 backend, and desiring to use HTTP/2. For that, an alternative patch (further below) might be better. I am leaning towards applying the second patch below, and applying a similar patch to other lighttpd TLS modules.
--- a/src/mod_openssl.c
+++ b/src/mod_openssl.c
@@ -1874,6 +1874,8 @@ mod_openssl_alpn_select_cb (SSL *ssl, const unsigned char **out, unsigned char *
     handler_ctx *hctx = (handler_ctx *) SSL_get_app_data(ssl);
     unsigned short proto;
     UNUSED(arg);
+    if (hctx->r->handler_module) /*(e.g. mod_sockproxy)*/
+        return SSL_TLSEXT_ERR_NOACK;
     for (unsigned int i = 0, n; i < inlen; i += n) {
         n = in[i++];
Alternative patch:
--- a/src/mod_openssl.c
+++ b/src/mod_openssl.c
@@ -1883,7 +1883,8 @@ mod_openssl_alpn_select_cb (SSL *ssl, const unsigned char **out, unsigned char *
             if (in[i] == 'h' && in[i+1] == '2') {
                 if (!hctx->r->conf.h2proto) continue;
                 proto = MOD_OPENSSL_ALPN_H2;
-                hctx->r->http_version = HTTP_VERSION_2;
+                if (hctx->r->handler_module == NULL)/*(e.g. not mod_sockproxy)*/
+                    hctx->r->http_version = HTTP_VERSION_2;
                 break;
             }
             continue;
Updated by ultimator over 2 years ago
It crashed again. I dumped the core, but I can't upload it here due to file size restrictions, so I put it on Google Drive:
https://drive.google.com/file/d/1JmruFxcxVg92QWGT7oY51KWloz94x4Xg/view?usp=drivesdk
I'll also apply the second patch now and see if that works.
Updated by gstrauss over 2 years ago
Thank you. I can see from the core that there are 4 connections, and the crash occurs on a connection being handled by mod_sockproxy before any non-TLS bytes are read. (I can't as easily look into mod_openssl.so state since you did not include that in the .tar.) Still, the core supports my hunch that the connection is negotiating h2 via ALPN, but that mod_sockproxy has already claimed the connection, so the http_version is not being reset. I think the latest patch will address your issue.
As an additional safeguard, my dev branch contains the following patch, which should (independently) avoid the issue.
--- a/src/connections.c
+++ b/src/connections.c
@@ -1378,11 +1380,11 @@ connection_state_machine_h1 (request_st * const r, connection * const con)
 void
 connection_state_machine (connection * const con)
 {
     request_st * const r = &con->request;
-    if (r->http_version == HTTP_VERSION_2)
+    if (con->h2)
         connection_state_machine_h2(r, con);
     else /* if (r->http_version <= HTTP_VERSION_1_1) */
         connection_state_machine_h1(r, con);
 }
As an aside, should others find this: unless HTTP/2 support is critical to your needs, a short-term workaround is to disable h2 support in lighttpd:

server.feature-flags += ("server.h2proto" => "disable")
Updated by gstrauss over 2 years ago
@ultimator: The core contains sensitive info. Please change your TLS certificate(s) and revoke the current one(s), if necessary.
Updated by gstrauss over 2 years ago
The mod_openssl.c patch you are now testing should address the issue. My guess is that some scanner is connecting to port 853 and trying TLS ALPN h2. Later this weekend or on Mon, I'll try to reproduce this and confirm that it is fixed with either of the mod_openssl.c patches above.
Updated by ultimator over 2 years ago
gstrauss wrote in #note-17:
@ultimator: The core contains sensitive info. Please change your TLS certificate(s) and revoke the current one(s), if necessary.
Thanks, already done.