Bug #3118: Segfault after updating to version 1.4.62
Status: closed
Description
After updating to version 1.4.62 I get random segfaults. I could not produce a coredump, but here is a gdb backtrace:
#0  0x00007ffff7df8504 in __GI___libc_write (fd=49, buf=0x5555557f10e3, nbytes=24) at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x00007ffff79d9b65 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#2  0x00007ffff79d4dea in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007ffff79d3e43 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007ffff79d42e3 in BIO_write () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007ffff7c23877 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007ffff7c247b1 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007ffff7c2df99 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007ffff7c2c095 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007ffff7c3717f in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007ffff7cc59fd in mod_openssl_close_notify (hctx=hctx@entry=0x5555557d9320) at mod_openssl.c:3340
#11 0x00007ffff7cc5ca8 in mod_openssl_handle_con_shut_wr (con=<optimized out>, p_d=<optimized out>) at mod_openssl.c:3320
#12 0x00005555555765a7 in plugins_call_fn_con_data (con=con@entry=0x5555557cd400, e=e@entry=9) at plugin.c:303
#13 0x000055555558bc0a in plugins_call_handle_connection_shut_wr (con=con@entry=0x5555557cd400) at plugin.c:361
#14 0x0000555555579e3f in connection_handle_shutdown (con=0x5555557cd400) at connections.c:221
#15 connection_handle_response_end_state (con=<optimized out>, r=<optimized out>) at connections.c:221
#16 connection_state_machine_loop (r=0x5555557cd400, con=0x5555557cd400) at connections.c:1063
#17 0x000055555557957f in connection_state_machine_h1 (con=0x5555557cd400, r=0x5555557cd400) at connections.c:1378
#18 connection_state_machine (con=0x5555557cd400) at connections.c:1396
#19 0x00005555555763bf in server_run_con_queue (sentinel=<optimized out>, joblist=<optimized out>) at server.c:1918
#20 server_main_loop (srv=0x5555555af560) at server.c:1971
#21 0x000055555556314a in main (argc=4, argv=0x7fffffffe288) at server.c:2045
This did not happen with version 1.4.61.
stderr shows different malloc error messages:
- lighttpd: corrupted size vs. prev_size
- lighttpd: lighttpd: malloc.c:4028: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
- lighttpd: free(): invalid pointer
- lighttpd: malloc(): smallbin double linked list corrupted
I'm not sure, but it may be caused by a POST request (etherpad).
This is a Debian 10 (buster) system; the libssl version is 1.1.1d-0+deb10u7.
Updated by gstrauss almost 3 years ago
- Priority changed from Normal to High
In lighttpd 1.4.62, memory allocation for regexes changed to add pcre2 support, and a similar change was made for pcre support, so my initial guess is that the bug is related to that change. To help narrow things down, are you building lighttpd --with-pcre or --with-pcre2?
Updated by flynn almost 3 years ago
I read the changelog and changed my build to --with-pcre2.
Updated by gstrauss almost 3 years ago
Trying to narrow things down, and operating on the hunch that the memory corruption is related to allocations for regex matching:
lighttpd 1.4.62 does not allocate memory per-request for regex captures if the captures are not used. Do you use %0, %1, or %2 in url.rewrite or url.redirect rules?
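For readers unfamiliar with these substitutions, here is a minimal sketch of the kind of rule being asked about (the hostname and paths are hypothetical): in a url.rewrite/url.redirect target, %N refers to a capture from the enclosing regex condition, while $N refers to a capture from the rule's own pattern.

# hypothetical lighttpd.conf fragment: %1 is a capture from the enclosing
# "=~" condition, $1 is a capture from the url.rewrite pattern itself
$HTTP["host"] =~ "^([^.]+)\.example\.org$" {
    url.rewrite-once = ( "^/(.*)$" => "/sites/%1/$1" )
}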
Since you recompiled with --with-pcre2, is the issue reproducible? If not, would you try a clean build and see if the issue is reproducible building --with-pcre (and not --with-pcre2)?
Updated by flynn almost 3 years ago
At the moment the issue is reproducible just by opening a pad in etherpad.
A clean build with --with-pcre crashes the same way.
I use url.rewrite, but I do not use %0/%1/%2 substitutions.
Also, I have a lot of regular expressions in the configuration using the operators =~ and !~.
Updated by gstrauss almost 3 years ago
I use url.rewrite but I do not use %0/1/2 substitutions.
If that is the case, then the memory used for captures is shared and should not be freed.
Also I have a lot of regular expressions in the configuration using the operators =~ and !~.

That's fine.
I'll have to try to run some scenarios under valgrind to see if I can catch what changed.
Updated by gstrauss almost 3 years ago
--- a/src/h2.c
+++ b/src/h2.c
@@ -2573,7 +2573,7 @@ h2_init_stream (request_st * const h2r, connection * const con)
   #ifdef HAVE_PCRE
     if (srv->config_captures)
         memcpy(r->cond_match, h2r->cond_match,
-               srv->config_captures * sizeof(cond_match_t));
+               srv->config_captures * sizeof(cond_match_t *));
   #endif
     /*(see request_config_reset() and request_reset_ex())*/
     r->server_name = h2r->server_name;
Updated by flynn almost 3 years ago
This patch only partially solves my problem:
- no more malloc messages
- crashes are no longer easily reproducible; it takes longer to trigger them
I still get segfaults with the backtrace above, but also the following backtrace:
Program received signal SIGSEGV, Segmentation fault.
0x0000555555565c9f in request_free_data (r=r@entry=0x5555557f8d20) at reqpool.c:235
235     reqpool.c: No such file or directory.
(gdb) bt
#0  0x0000555555565c9f in request_free_data (r=r@entry=0x5555557f8d20) at reqpool.c:235
#1  0x0000555555565cfb in request_pool_free () at reqpool.c:263
#2  0x00005555555615d2 in server_handle_sigalrm (srv=0x5555555af560, mono_ts=2321664, last_active_ts=2321659) at server.c:1873
#3  0x0000555555576138 in server_main_loop (srv=0x5555555af560) at server.c:1942
#4  0x00005555555630fa in main (argc=4, argv=0x7fffffffe268) at server.c:2045
Updated by gstrauss almost 3 years ago
- Category set to core
Bug: r->con->srv->config_captures might not be valid when reqpool is periodically freed, since r->con is overloaded when request_st is connected to reqpool.
diff --git a/src/reqpool.c b/src/reqpool.c
--- a/src/reqpool.c
+++ b/src/reqpool.c
@@ -67,6 +67,7 @@ request_init_data (request_st * const r, connection * const con, server * const
   #ifdef HAVE_PCRE
     if (srv->config_captures) {/*(save 128b per con if no regex conditions)*/
+        r->cond_captures = srv->config_captures;
         r->cond_match = calloc(srv->config_captures, sizeof(cond_match_t *));
         force_assert(NULL != r->cond_match);
         r->cond_match_data = calloc(srv->config_captures, sizeof(cond_match_t));
@@ -232,7 +233,7 @@ request_free_data (request_st * const r)
     free(r->cond_cache);
   #ifdef HAVE_PCRE
     if (r->cond_match_data) {
-        for (int i = 0, used = r->con->srv->config_captures; i < used; ++i) {
+        for (int i = 0, used = r->cond_captures; i < used; ++i) {
   #ifdef HAVE_PCRE2_H
             if (r->cond_match_data[i].match_data)
                 pcre2_match_data_free(r->cond_match_data[i].match_data);
diff --git a/src/request.h b/src/request.h
--- a/src/request.h
+++ b/src/request.h
@@ -194,6 +194,7 @@ struct request_st {
     struct chunkqueue reqbody_queue; /*(might use tempfiles)*/
     struct stat_cache_entry *tmp_sce; /*(value valid only in sequential code)*/
+    int cond_captures;
 };
Updated by gstrauss almost 3 years ago
I had prepared this patch before your post above; the additional data from you suggests this is the primary issue.
Would you please verify with lighttpd -f /etc/lighttpd/lighttpd.conf -p? The intention of the code is that you should not reach the reqpool free code for r->cond_match_data if you do not have %0, %1, %2, etc. in url.rewrite* or url.redirect substitution targets.
Updated by gstrauss almost 3 years ago
- Has duplicate Bug #3119: Segfault on closing connections added
Updated by flynn almost 3 years ago
I applied both patches and it seems stable now; I cannot trigger a crash any more (--with-pcre).
You are right: lighttpd -f /etc/lighttpd/lighttpd.conf -p shows that I use %0 for the http->https url.redirect.
Sorry, my fault, I only searched for url.rewrite.
A build with pcre2 (--with-pcre2) looks stable too.
Updated by gstrauss almost 3 years ago
- Status changed from New to Patch Pending
Updated by gstrauss almost 3 years ago
- Status changed from Patch Pending to Fixed
Applied in changeset ef9608f307205a312891d1afaf90a003548e4ff0.
Updated by gstrauss almost 3 years ago
Thank you for reporting the error and verifying the patches. lighttpd 1.4.63 has been released with the fixes.
The bug was reproducible with a longer-running test which ran the periodic connection and reqpool cleanup jobs, something not part of my unit tests. I'll look into adding something for longer-running integration tests.
lighttpd -f /etc/lighttpd/lighttpd.conf -p shows that I use %0 for the http->https url.redirect.

FYI: A more efficient example without %0 can be found in HowToRedirectHttpToHttps.
Also I have a lot of regular expressions in the configuration using the operators =~ and !~.
With lighttpd 1.4.63 (and lighttpd 1.4.62), you should be able to measure a slight reduction in memory use. Previously, approximately 128 bytes were allocated per connection and per active HTTP/2 stream for each regex in the configuration (128 * num_regexes_in_lighttpd_conf * (num_connections + num_HTTP2_streams)). Since lighttpd 1.4.62, allocation for regex captures is done only for conditions containing url.rewrite* or url.redirect rules which use %X substitutions.
Updated by flynn almost 3 years ago
You are faster than I!
I already saw the memory reduction in my monitoring; it is about 1MB regardless of the server configuration.
This means that a simple lighttpd without dynamic backends can now run with 1MB of private dirty memory, even with HTTP/2 enabled.
Great work!
Updated by flynn almost 3 years ago
gstrauss wrote in #note-14:
Thank you for reporting the error and verifying the patches. lighttpd 1.4.63 has been released with the fixes.
The bug was reproducible with a longer-running test which ran the periodic connection and reqpool cleanup jobs, something not part of my unit tests. I'll look into adding something for longer-running integration tests.
Feel free to ask me for prerelease real-world tests ...
lighttpd -f /etc/lighttpd/lighttpd.conf -p shows that I use %0 for the http->https url.redirect.
FYI: A more efficient example without %0 can be found in HowToRedirectHttpToHttps.
Replacing "https://%0$0"
by "https://${url.authority}${url.path}${qsa}"
saves 1MB again - thank you for the hint.
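For context, a sketch of what the two variants might look like in lighttpd.conf; the enclosing conditions are assumptions and may differ from the actual configuration discussed here:

# capture-based variant: %0/$0 force per-request regex capture allocation
$HTTP["scheme"] == "http" {
    $HTTP["host"] =~ ".*" {
        url.redirect = ( ".*" => "https://%0$0" )
    }
}

# variant without captures, along the lines of HowToRedirectHttpToHttps
$HTTP["scheme"] == "http" {
    url.redirect = ( "" => "https://${url.authority}${url.path}${qsa}" )
}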
Updated by HenrikHolst almost 3 years ago
I have applied all 3 patches but am still experiencing segfaults; here is the GDB output from the latest one:
Core was generated by `/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  mod_openssl_handle_con_accept (con=0x55c7001b8aa0, p_d=0x55c7001ef320) at mod_openssl.c:3267
3267    mod_openssl.c: No such file or directory.
(gdb) bt
#0  mod_openssl_handle_con_accept (con=0x55c7001b8aa0, p_d=0x55c7001ef320) at mod_openssl.c:3267
#1  0x000055c6fe256318 in plugins_call_fn_con_data (con=con@entry=0x55c7001b8aa0, e=e@entry=8) at plugin.c:302
#2  0x000055c6fe26c0ee in plugins_call_handle_connection_accept (con=con@entry=0x55c7001b8aa0) at plugin.c:360
#3  0x000055c6fe2592d6 in connection_accepted (srv=srv@entry=0x55c7001b7550, srv_socket=srv_socket@entry=0x55c700222750, cnt_addr=cnt_addr@entry=0x7ffd05ebd000, cnt=13) at connections.c:952
#4  0x000055c6fe25cedc in network_server_handle_fdevent (context=0x55c700222750, revents=<optimized out>) at network.c:78
#5  0x000055c6fe25fa84 in fdevent_linux_sysepoll_poll (ev=<optimized out>, timeout_ms=<optimized out>) at fdevent_impl.c:384
#6  0x000055c6fe25fb55 in fdevent_poll (ev=0x55c700228080, timeout_ms=<optimized out>) at fdevent_impl.c:342
#7  0x000055c6fe256144 in server_main_loop (srv=0x55c7001b7550) at server.c:1973
#8  0x000055c6fe242bd2 in main (argc=4, argv=0x7ffd05ebd288) at server.c:2045
(gdb)
Updated by HenrikHolst almost 3 years ago
And here is a run with valgrind:
==2801666== Memcheck, a memory error detector
==2801666== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2801666== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==2801666== Command: /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
==2801666==
==2801666== Invalid read of size 1
==2801666==    at 0x5534B77: mod_openssl_handle_con_accept (mod_openssl.c:3267)
==2801666==    by 0x12B317: plugins_call_fn_con_data (plugin.c:302)
==2801666==    by 0x12E2D5: connection_accepted (connections.c:952)
==2801666==    by 0x131EDB: network_server_handle_fdevent (network.c:78)
==2801666==    by 0x134A83: fdevent_linux_sysepoll_poll (fdevent_impl.c:384)
==2801666==    by 0x134B54: fdevent_poll (fdevent_impl.c:342)
==2801666==    by 0x12B143: server_main_loop (server.c:1973)
==2801666==    by 0x117BD1: main (server.c:2045)
==2801666==  Address 0x4100000082 is not stack'd, malloc'd or (recently) free'd
==2801666==
==2801666==
==2801666== Process terminating with default action of signal 11 (SIGSEGV)
==2801666==  Access not within mapped region at address 0x4100000082
==2801666==    at 0x5534B77: mod_openssl_handle_con_accept (mod_openssl.c:3267)
==2801666==    by 0x12B317: plugins_call_fn_con_data (plugin.c:302)
==2801666==    by 0x12E2D5: connection_accepted (connections.c:952)
==2801666==    by 0x131EDB: network_server_handle_fdevent (network.c:78)
==2801666==    by 0x134A83: fdevent_linux_sysepoll_poll (fdevent_impl.c:384)
==2801666==    by 0x134B54: fdevent_poll (fdevent_impl.c:342)
==2801666==    by 0x12B143: server_main_loop (server.c:1973)
==2801666==    by 0x117BD1: main (server.c:2045)
==2801666==  If you believe this happened as a result of a stack
==2801666==  overflow in your program's main thread (unlikely but
==2801666==  possible), you can try to increase the size of the
==2801666==  main thread stack using the --main-stacksize= flag.
==2801666==  The main thread stack size used in this run was 8388608.
==2801666==
==2801666== HEAP SUMMARY:
==2801666==     in use at exit: 633,487 bytes in 6,549 blocks
==2801666==   total heap usage: 15,973 allocs, 9,424 frees, 1,647,548 bytes allocated
==2801666==
==2801666== LEAK SUMMARY:
==2801666==    definitely lost: 0 bytes in 0 blocks
==2801666==    indirectly lost: 0 bytes in 0 blocks
==2801666==      possibly lost: 0 bytes in 0 blocks
==2801666==    still reachable: 633,487 bytes in 6,549 blocks
==2801666==         suppressed: 0 bytes in 0 blocks
==2801666== Rerun with --leak-check=full to see details of leaked memory
==2801666==
==2801666== For lists of detected and suppressed errors, rerun with: -s
==2801666== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==2801666== could not unlink /tmp/vgdb-pipe-from-vgdb-to-2801666-by-root-on-???
==2801666== could not unlink /tmp/vgdb-pipe-to-vgdb-from-2801666-by-root-on-???
==2801666== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-2801666-by-root-on-???
Segmentation fault (core dumped)
Updated by HenrikHolst almost 3 years ago
Sorry about this, being a dumbass I of course forgot to also move over the modules from the new build... so just ignore what I wrote.