Bug #3058

closed

rare spontaneous segfaults

Added by mgottinger 9 months ago. Updated 9 months ago.

Status: Fixed
Priority: Normal
Category: core
Target version:
ASK QUESTIONS IN Forums: No

Description

It happened first a week ago with version 1.4.57 (Arch Linux) installed, compiled with support for brotli (Arch Linux package does not ship with brotli enabled by default).

I saw that lighttpd was running at 100% load on one core (default: 1 worker). I forcefully restarted the service (kill -9); clients started reconnecting and the load went back to 100%, again without any request being answered.
I downgraded to 1.4.56: same behaviour. I upgraded to 1.4.58: same behaviour.
There were no log entries indicating any problem in advance; this systemd log message is the only thing I have.

Stack trace of thread 1531:
#0  0x0000562e210b63dc n/a (lighttpd + 0x1a3dc)
#1  0x0000562e210b69c8 n/a (lighttpd + 0x1a9c8)
#2  0x0000562e210d25ea gw_handle_trigger (lighttpd + 0x365ea)
#3  0x0000562e210d49ff plugins_call_handle_trigger (lighttpd + 0x389ff)
#4  0x0000562e210bd093 n/a (lighttpd + 0x21093)
#5  0x0000562e210bc6a5 n/a (lighttpd + 0x206a5)
#6  0x0000562e210aa5e6 main (lighttpd + 0xe5e6)
#7  0x00007f559509c152 __libc_start_main (libc.so.6 + 0x28152)
#8  0x0000562e210bcc8e _start (lighttpd + 0x20c8e)

I disabled various parts of the configuration - the origin seemed to be Nextcloud.
I did not see any php-fpm process with high load, nothing.
After almost 2 hours of searching for the cause, it all started working again, as if nothing had happened.

Today I have had the same issue, I restarted the service - this time everything was working normally again after crashing only twice.
(I also updated certificates this weekend and restarted the service, so it didn't run for a long time. The brotli package has not been updated since last year.)

Can I do anything to get more information, when this happens?
(lighttpd-angel does restart lighttpd but it does not help, if it is stuck again immediately...)

Actions #1

Updated by gstrauss 9 months ago

> Can I do anything to get more information, when this happens?

Do you know if any specific type of request triggers the behavior?

debug.log-request-header = "enable"
debug.log-response-header = "enable"

Can you share your config? lighttpd -f /etc/lighttpd/lighttpd.conf -p

Since you're compiling lighttpd, would you try this patch? The stack trace (missing info) looks like a different issue, but let's try anyway:

--- a/src/connections.c
+++ b/src/connections.c
@@ -523,7 +523,7 @@ static int connection_handle_write_state(request_st * const r, connection * cons
         }
     } while (r->http_version <= HTTP_VERSION_1_1
              && (!chunkqueue_is_empty(&r->write_queue)
-                 ? con->is_writable
+                 ? con->is_writable > 0
                  : r->resp_body_finished));

     return CON_STATE_WRITE;

Actions #2

Updated by gstrauss 9 months ago

mgottinger: you titled this post "rare spontaneous segfaults". Are you seeing lighttpd crash? Or are you seeing lighttpd spin at 100% CPU usage? Those are different things.

Actions #3

Updated by mgottinger 9 months ago

Unfortunately not, that was one of my first guesses, but I did and do not see any specific request.

I have set debug.log-...
Config is still the same as in #3048 - I'm back :-D

I have added your patch and compiled it - now I got to wait...
Thanks for the quick response.

Actions #4

Updated by mgottinger 9 months ago

Both: it crashes with segfault -> lighttpd-angel restarts the process -> it is almost immediately going 100% for some time and crashes some time later with same stack trace to be restarted again...

Actions #5

Updated by gstrauss 9 months ago

> Both: it crashes with segfault -> lighttpd-angel restarts the process -> it is almost immediately going 100% for some time and crashes some time later with same stack trace to be restarted again...

Is this before or after the patch above?

Is the signal a SIGSEGV or SIGABRT? If lighttpd is calling abort() resulting in SIGABRT, then you can have lighttpd print a stack trace if you build with ./configure --with-libunwind

From the stack you have shared, that looks like some sort of corruption. That is hard to track down without any information about the requests leading up to the corruption.

You mentioned that you downgraded to lighttpd 1.4.56 and saw the same problem. Were you running lighttpd 1.4.56 without brotli support before you upgraded to lighttpd 1.4.57 with brotli support a week ago? If you've been running lighttpd 1.4.56 for about a month and did not see this issue, and then started seeing the issue after upgrading to lighttpd 1.4.57 a week ago, and now also see the issue with lighttpd 1.4.56, what changed between lighttpd 1.4.56 from a month ago to lighttpd 1.4.56 now?

Actions #6

Updated by gstrauss 9 months ago

> I disabled various parts of the configuration - the origin seemed to be Nextcloud.

Which parts? Did you try disabling mod_deflate and mod_status? (Both are optional and their absence should not disrupt regular clients.)

Actions #7

Updated by mgottinger 9 months ago

It is SIGSEGV:

ANOM_ABEND auid=4294967295 uid=33 gid=33 ses=4294967295 pid=1531 comm="lighttpd" exe="/usr/bin/lighttpd" sig=11 res=1

Before - hasn't happened since the patch. (It ran fine for 1.5 weeks until today.)

I had 1.4.56 without brotli, then with brotli, updated to 1.4.57 with brotli on December 27th and updated to 1.4.58 with brotli on January 8th (the day I had first encountered this segfault problem.)

I only disabled $HTTP["host"] parts for subdomains, because I thought some user broke something and caused a loop or whatever...

I did not try disabling mod_deflate or mod_status.

I did remove brotli in deflate.allowed-encodings = ( "brotli", "gzip", "deflate" ) that day, but it didn't change anything.
I also emptied the deflate.cache-dir directory, just in case, but that didn't change anything either.

Actions #8

Updated by mgottinger 9 months ago

What changed: I did install updates, restarted the VM...
Looking at the dependencies (https://archlinux.org/packages/extra/x86_64/lighttpd/) only libxml2 (2.9.10-7 -> 2.9.10-8) changed.

Do you want me to list all packages?

Actions #9

Updated by gstrauss 9 months ago

> Do you want me to list all packages?

Not at this time. I am asking questions that will hopefully lead you in the right direction to help narrow down where the problem might be.

From what you have said, it sounds like configuring mod_deflate was one change that can be undone: remove "brotli" and/or disable mod_deflate, then see if the problem recurs. You noted that the problem still occurred after removing "brotli" from deflate.allowed-encodings, so mod_deflate is probably not the issue, since brotli support was already available in lighttpd 1.4.56.

Actions #10

Updated by gstrauss 9 months ago

BTW, it is not surprising that lighttpd 1.4.56 crashed for you since #3048 was fixed in lighttpd 1.4.57. When you tried lighttpd 1.4.56, did you try it with the patch from #3048? Did lighttpd 1.4.56 crash for you with SIGSEGV or with ldap failed assertion?

Actions #11

Updated by gstrauss 9 months ago

Among the changes in lighttpd 1.4.56 are improvements to mod_proxy. If your lighttpd instance crashes again, please try again with server.feature-flags += ("proxy.force-http10" => "enable") and then see if that makes things more stable.

(The goal is still to narrow down where the problem might be. mod_proxy? mod_fastcgi? elsewhere?)

If the crash is happening with any frequency, then I might ask you to run the server under valgrind in order to detect where the corruption is happening. Running under valgrind will be noticeably slower, so you likely do not want to run that way for an extended period of time.

Actions #12

Updated by mgottinger 9 months ago

Looking at the logs once again, I think this segfault only happens with 1.4.58. (Correlating installation/update times and system log.)
The process used 100% with 1.4.56 and 1.4.57, but only crashed with the stack trace after I updated to 1.4.58. (And I still have 1.4.58.)

Back then I applied https://redmine.lighttpd.net/projects/lighttpd/repository/14/revisions/2565ad1b861db9872f3162248a81fe03178f3528 to 1.4.56; that calloc change fixed the problem with mod_auth and LDAP in 1.4.56.
(I also used some patches from #3044, which made HTTP2 stable enough for me to stay with 1.4.56.)

Unfortunately no frequency at all, might be another 1.5 weeks until next time.
And I suspect the 100% utilisation isn't lighttpd's fault.

Actions #13

Updated by mgottinger 9 months ago

I will try server.feature-flags += ("proxy.force-http10" => "enable") next time and get myself educated on valgrind.
(There is an old Apache, which I use mod_proxy for...)

I will report back.

Actions #14

Updated by gstrauss 9 months ago

lighttpd should not use 100% CPU unless lighttpd is really busy doing productive work or unless you have a custom module loaded into lighttpd (or custom lua code) which is spinning.

Please use pstack or gdb to get a stack trace if you find lighttpd at 100% CPU use.

There is one bug recently reported (by someone else) which might result in 100% CPU use for HTTP/1.1 requests if you have traffic limits configured in lighttpd. A fix for that will be pushed to master branch shortly. However, to run into it, you need to have connection.kbytes-per-second or server.kbytes-per-second configured in lighttpd.conf.

Actions #15

Updated by mgottinger 9 months ago

As it turns out I'm using traffic limits on exactly that proxied Apache server.
Where can I find this fix?

Sorry, but I really do not feel comfortable sharing the whole configuration.

Actions #16

Updated by gstrauss 9 months ago

See the one-line patch at the tip of lighttpd git master.

I understand that you need to keep parts of your configuration private. At the same time, it is more difficult to try to troubleshoot when I am blind to what parts and features of lighttpd you are using. Of course, I do not need or want to see passwords in your config, but it would be useful to know which modules and directives you are using. If you like, you may send a private email to me with your config encrypted with my public key (the one used to sign lighttpd releases)

Actions #17

Updated by gstrauss 9 months ago

  • Category set to core
  • Status changed from New to Fixed
  • Target version changed from 1.4.x to 1.4.59

The "rare spontaneous segfault" may have already been fixed in commit 8faa456f. See #3052.
The 100% CPU use with traffic limits is fixed in commit 471ab4dd.

Actions #18

Updated by mgottinger 9 months ago

Both fixes sound reasonable to me.
Thank you for your time.
