Project

General

Profile

Actions

Bug #3118

closed

Segfault after updating to version 1.4.62

Added by flynn about 2 months ago. Updated about 1 month ago.

Status:
Fixed
Priority:
High
Category:
core
Target version:
ASK QUESTIONS IN Forums:
No

Description

After updating to version 1.4.62 I get random segfaults. T could not produce a coredump, but a gdb backtrace:

#0  0x00007ffff7df8504 in __GI___libc_write (fd=49, buf=0x5555557f10e3, nbytes=24) at ../sysdeps/unix/sysv/linux/write.c:26
#1  0x00007ffff79d9b65 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#2  0x00007ffff79d4dea in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#3  0x00007ffff79d3e43 in ?? () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#4  0x00007ffff79d42e3 in BIO_write () from /lib/x86_64-linux-gnu/libcrypto.so.1.1
#5  0x00007ffff7c23877 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#6  0x00007ffff7c247b1 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#7  0x00007ffff7c2df99 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#8  0x00007ffff7c2c095 in ?? () from /lib/x86_64-linux-gnu/libssl.so.1.1
#9  0x00007ffff7c3717f in SSL_shutdown () from /lib/x86_64-linux-gnu/libssl.so.1.1
#10 0x00007ffff7cc59fd in mod_openssl_close_notify (hctx=hctx@entry=0x5555557d9320) at mod_openssl.c:3340
#11 0x00007ffff7cc5ca8 in mod_openssl_handle_con_shut_wr (con=<optimized out>, p_d=<optimized out>) at mod_openssl.c:3320
#12 0x00005555555765a7 in plugins_call_fn_con_data (con=con@entry=0x5555557cd400, e=e@entry=9) at plugin.c:303
#13 0x000055555558bc0a in plugins_call_handle_connection_shut_wr (con=con@entry=0x5555557cd400) at plugin.c:361
#14 0x0000555555579e3f in connection_handle_shutdown (con=0x5555557cd400) at connections.c:221
#15 connection_handle_response_end_state (con=<optimized out>, r=<optimized out>) at connections.c:221
#16 connection_state_machine_loop (r=0x5555557cd400, con=0x5555557cd400) at connections.c:1063
#17 0x000055555557957f in connection_state_machine_h1 (con=0x5555557cd400, r=0x5555557cd400) at connections.c:1378
#18 connection_state_machine (con=0x5555557cd400) at connections.c:1396
#19 0x00005555555763bf in server_run_con_queue (sentinel=<optimized out>, joblist=<optimized out>) at server.c:1918
#20 server_main_loop (srv=0x5555555af560) at server.c:1971
#21 0x000055555556314a in main (argc=4, argv=0x7fffffffe288) at server.c:2045

This did not happen with version 1.4.61.

stderr shows different malloc error messages:
- lighttpd: corrupted size vs. prev_size
- lighttpd: lighttpd: malloc.c:4028: _int_malloc: Assertion `(unsigned long) (size) >= (unsigned long) (nb)' failed.
- lighttpd: free(): invalid pointer
- lighttpd: malloc(): smallbin double linked list corrupted

I'M not sure, but maybe caused by a POST-request (etherpad).

This is debian 10/buster system, libssl version is 1.1.1d-0+deb10u7.


Related issues

Has duplicate Bug #3119: Segfault on closing connectionsDuplicateActions
Actions #1

Updated by gstrauss about 2 months ago

  • Priority changed from Normal to High

In lighttpd 1.4.62, memory allocation changed for regexes for pcre2 support and a similar change was made for pcre support, so my initial guess is that the bug is somewhere related to that change. To help narrow things down, are you building lighttpd --with-pcre or --with-pcre2?

Actions #2

Updated by flynn about 2 months ago

I read the changelog and changed my build to --with-pcre2.

Actions #3

Updated by gstrauss about 2 months ago

Trying to narrow things down, and operating on the hunch that the memory corruption is related to allocations for regex matching:
lighttpd 1.4.62 does not allocate memory per-request for regex captures if the captures are not used. Do you use %0 or %1 or %2 in url.rewrite or url.redirect rules?

Since you recompiled with --with-pcre2, is the issue reproducible? If not, would you try a clean build and see if the issue is reproducible building --with-pcre (and not --with-pcre2)?

Actions #4

Updated by flynn about 2 months ago

At the moment the issue is reprodicable by just calling a pad from etherpad.

A clean build with --with-pcre crashes the same way.

I use url.rewrite but I do not use %0/1/2 substitutions.

Also I have lot of regular epressions in the configuration using the operators =~ and !~.

Actions #5

Updated by gstrauss about 2 months ago

I use url.rewrite but I do not use %0/1/2 substitutions.

If that is the case, then the memory used for captures is shared and should not be freed.

Also I have lot of regular epressions in the configuration using the operators =~ and !~.

That's fine.

.

I'll have to try to run some scenarios under valgrind to see if I can catch what changed.

Actions #6

Updated by gstrauss about 1 month ago

--- a/src/h2.c
+++ b/src/h2.c
@@ -2573,7 +2573,7 @@ h2_init_stream (request_st * const h2r, connection * const con)
   #ifdef HAVE_PCRE
     if (srv->config_captures)
         memcpy(r->cond_match, h2r->cond_match,
-               srv->config_captures * sizeof(cond_match_t));
+               srv->config_captures * sizeof(cond_match_t *));
   #endif
     /*(see request_config_reset() and request_reset_ex())*/
     r->server_name = h2r->server_name;
Actions #7

Updated by flynn about 1 month ago

This patch solves the my problem only partial:
- no more malloc messages
- crashes are not reproducables any more, it takes longer to trigger

I still get segfaults with the backtrace above, but also the following backtrace:

Program received signal SIGSEGV, Segmentation fault.
0x0000555555565c9f in request_free_data (r=r@entry=0x5555557f8d20) at reqpool.c:235
235     reqpool.c: No such file or directory.
(gdb) bt
#0  0x0000555555565c9f in request_free_data (r=r@entry=0x5555557f8d20) at reqpool.c:235
#1  0x0000555555565cfb in request_pool_free () at reqpool.c:263
#2  0x00005555555615d2 in server_handle_sigalrm (srv=0x5555555af560, mono_ts=2321664, last_active_ts=2321659) at server.c:1873
#3  0x0000555555576138 in server_main_loop (srv=0x5555555af560) at server.c:1942
#4  0x00005555555630fa in main (argc=4, argv=0x7fffffffe268) at server.c:2045

Actions #8

Updated by gstrauss about 1 month ago

  • Category set to core

Bug: r->con->srv->config_captures might not be valid when reqpool is periodically freed, since r->con is overloaded when request_st is connected to reqpool.

diff --git a/src/reqpool.c b/src/reqpool.c
--- a/src/reqpool.c
+++ b/src/reqpool.c
@@ -67,6 +67,7 @@ request_init_data (request_st * const r, connection * const con, server * const

   #ifdef HAVE_PCRE
     if (srv->config_captures) {/*(save 128b per con if no regex conditions)*/
+        r->cond_captures = srv->config_captures;
         r->cond_match = calloc(srv->config_captures, sizeof(cond_match_t *));
         force_assert(NULL != r->cond_match);
         r->cond_match_data = calloc(srv->config_captures, sizeof(cond_match_t));
@@ -232,7 +233,7 @@ request_free_data (request_st * const r)
     free(r->cond_cache);
   #ifdef HAVE_PCRE
     if (r->cond_match_data) {
-        for (int i = 0, used = r->con->srv->config_captures; i < used; ++i) {
+        for (int i = 0, used = r->cond_captures; i < used; ++i) {
           #ifdef HAVE_PCRE2_H
             if (r->cond_match_data[i].match_data)
                 pcre2_match_data_free(r->cond_match_data[i].match_data);
diff --git a/src/request.h b/src/request.h
--- a/src/request.h
+++ b/src/request.h
@@ -194,6 +194,7 @@ struct request_st {
     struct chunkqueue reqbody_queue; /*(might use tempfiles)*/

     struct stat_cache_entry *tmp_sce; /*(value valid only in sequential code)*/
+    int cond_captures;
 };

Actions #9

Updated by gstrauss about 1 month ago

Had prepared this patch before your post above, so the additional data from you suggests this is the primary issue.

Would you please verify lighttpd -f /etc/lighttpd/lighttpd.conf -p? The intention of the code is that you should not be reaching the reqpool free code for r->cond_match_data if you do not have %0, %1, %2, etc in url.rewrite* or url.redirect substitution targets.

Actions #10

Updated by gstrauss about 1 month ago

  • Has duplicate Bug #3119: Segfault on closing connections added
Actions #11

Updated by flynn about 1 month ago

I applied both patches and seems stable now, I cannot trigger a crash now (--with-pcre).

You are right:
lighttpd -f /etc/lighttpd/lighttpd.conf -p shows, that I use %0 for http->https url.redirect.
Sorry, my fault, I only searched for url.rewrite.

A build with pcre2 (--with-pcre2) looks stable too.

Actions #12

Updated by gstrauss about 1 month ago

  • Status changed from New to Patch Pending
Actions #13

Updated by gstrauss about 1 month ago

  • Status changed from Patch Pending to Fixed
Actions #14

Updated by gstrauss about 1 month ago

Thank you for reporting the error and verifying the patches. lighttpd 1.4.63 has been released with the fixes.

The bug was reproducible with a longer-running test which ran the periodic connection and reqpool cleanup jobs, something not part of my unit tests. I'll look into adding something for longer-running integration tests.

lighttpd -f /etc/lighttpd/lighttpd.conf -p shows, that I use %0 for http->https url.redirect.

FYI: A more efficient example without %0 can be found in HowToRedirectHttpToHttps

Also I have lot of regular expressions in the configuration using the operators =~ and !~.

With lighttpd 1.4.63 (and lighttpd 1.4.62), you should be able to measure a slight reduction in memory use. Previously, approx 128 bytes were allocated per connection + per active HTTP/2 stream for each regex in the configuration (128 * num_regexes_in_lighttpd_conf * (num_connections + num_HTTP2_streams). Since lighttpd 1.4.62, allocation for regex captures is done only for conditions containing url.rewrite* or url.redirect which utilize %X substitutions.

Actions #15

Updated by flynn about 1 month ago

You are faster than I!

I already saw the memory reduction in my monitoring, it is regardless the server configuration about 1MB.

This means that a simple lighttpd without dynamic backends can run now with 1MB private dirty memory, even if with HTTP/2 enabled.

Great work!

Actions #16

Updated by flynn about 1 month ago

gstrauss wrote in #note-14:

Thank you for reporting the error and verifying the patches. lighttpd 1.4.63 has been released with the fixes.

The bug was reproducible with a longer-running test which ran the periodic connection and reqpool cleanup jobs, something not part of my unit tests. I'll look into adding something for longer-running integration tests.

Feel free to ask me for prerelease real world tests ...

lighttpd -f /etc/lighttpd/lighttpd.conf -p shows, that I use %0 for http->https url.redirect.

FYI: A more efficient example without %0 can be found in HowToRedirectHttpToHttps

Replacing "https://%0$0" by "https://${url.authority}${url.path}${qsa}" saves 1MB again - thank you for the hint.

Actions #17

Updated by HenrikHolst about 1 month ago

I have applied all 3 patches but are still experiencing segfaults, here is a GDB output of the latest one:

Core was generated by `/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  mod_openssl_handle_con_accept (con=0x55c7001b8aa0, p_d=0x55c7001ef320) at mod_openssl.c:3267
3267    mod_openssl.c: No such file or directory.
(gdb) bt
#0  mod_openssl_handle_con_accept (con=0x55c7001b8aa0, p_d=0x55c7001ef320) at mod_openssl.c:3267
#1  0x000055c6fe256318 in plugins_call_fn_con_data (con=con@entry=0x55c7001b8aa0, e=e@entry=8) at plugin.c:302
#2  0x000055c6fe26c0ee in plugins_call_handle_connection_accept (con=con@entry=0x55c7001b8aa0) at plugin.c:360
#3  0x000055c6fe2592d6 in connection_accepted (srv=srv@entry=0x55c7001b7550, srv_socket=srv_socket@entry=0x55c700222750, cnt_addr=cnt_addr@entry=0x7ffd05ebd000, cnt=13) at connections.c:952
#4  0x000055c6fe25cedc in network_server_handle_fdevent (context=0x55c700222750, revents=<optimized out>) at network.c:78
#5  0x000055c6fe25fa84 in fdevent_linux_sysepoll_poll (ev=<optimized out>, timeout_ms=<optimized out>) at fdevent_impl.c:384
#6  0x000055c6fe25fb55 in fdevent_poll (ev=0x55c700228080, timeout_ms=<optimized out>) at fdevent_impl.c:342
#7  0x000055c6fe256144 in server_main_loop (srv=0x55c7001b7550) at server.c:1973
#8  0x000055c6fe242bd2 in main (argc=4, argv=0x7ffd05ebd288) at server.c:2045
(gdb) 

Actions #18

Updated by HenrikHolst about 1 month ago

And here is a run with valgrind:

==2801666== Memcheck, a memory error detector
==2801666== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==2801666== Using Valgrind-3.15.0 and LibVEX; rerun with -h for copyright info
==2801666== Command: /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
==2801666== 
==2801666== Invalid read of size 1
==2801666==    at 0x5534B77: mod_openssl_handle_con_accept (mod_openssl.c:3267)
==2801666==    by 0x12B317: plugins_call_fn_con_data (plugin.c:302)
==2801666==    by 0x12E2D5: connection_accepted (connections.c:952)
==2801666==    by 0x131EDB: network_server_handle_fdevent (network.c:78)
==2801666==    by 0x134A83: fdevent_linux_sysepoll_poll (fdevent_impl.c:384)
==2801666==    by 0x134B54: fdevent_poll (fdevent_impl.c:342)
==2801666==    by 0x12B143: server_main_loop (server.c:1973)
==2801666==    by 0x117BD1: main (server.c:2045)
==2801666==  Address 0x4100000082 is not stack'd, malloc'd or (recently) free'd
==2801666== 
==2801666== 
==2801666== Process terminating with default action of signal 11 (SIGSEGV)
==2801666==  Access not within mapped region at address 0x4100000082
==2801666==    at 0x5534B77: mod_openssl_handle_con_accept (mod_openssl.c:3267)
==2801666==    by 0x12B317: plugins_call_fn_con_data (plugin.c:302)
==2801666==    by 0x12E2D5: connection_accepted (connections.c:952)
==2801666==    by 0x131EDB: network_server_handle_fdevent (network.c:78)
==2801666==    by 0x134A83: fdevent_linux_sysepoll_poll (fdevent_impl.c:384)
==2801666==    by 0x134B54: fdevent_poll (fdevent_impl.c:342)
==2801666==    by 0x12B143: server_main_loop (server.c:1973)
==2801666==    by 0x117BD1: main (server.c:2045)
==2801666==  If you believe this happened as a result of a stack
==2801666==  overflow in your program's main thread (unlikely but
==2801666==  possible), you can try to increase the size of the
==2801666==  main thread stack using the --main-stacksize= flag.
==2801666==  The main thread stack size used in this run was 8388608.
==2801666== 
==2801666== HEAP SUMMARY:
==2801666==     in use at exit: 633,487 bytes in 6,549 blocks
==2801666==   total heap usage: 15,973 allocs, 9,424 frees, 1,647,548 bytes allocated
==2801666== 
==2801666== LEAK SUMMARY:
==2801666==    definitely lost: 0 bytes in 0 blocks
==2801666==    indirectly lost: 0 bytes in 0 blocks
==2801666==      possibly lost: 0 bytes in 0 blocks
==2801666==    still reachable: 633,487 bytes in 6,549 blocks
==2801666==         suppressed: 0 bytes in 0 blocks
==2801666== Rerun with --leak-check=full to see details of leaked memory
==2801666== 
==2801666== For lists of detected and suppressed errors, rerun with: -s
==2801666== ERROR SUMMARY: 1 errors from 1 contexts (suppressed: 0 from 0)
==2801666== could not unlink /tmp/vgdb-pipe-from-vgdb-to-2801666-by-root-on-???
==2801666== could not unlink /tmp/vgdb-pipe-to-vgdb-from-2801666-by-root-on-???
==2801666== could not unlink /tmp/vgdb-pipe-shared-mem-vgdb-2801666-by-root-on-???
Segmentation fault (core dumped)

Actions #19

Updated by HenrikHolst about 1 month ago

Sorry about this, being a dumbass I of course forgot to also move over the modules from the new build... so just ignore what I wrote.

Actions

Also available in: Atom