Bug #3059

closed

Connections stuck in CLOSE_WAIT causing 100% CPU usage

Added by mitd over 3 years ago. Updated about 3 years ago.

Status:
Fixed
Priority:
Normal
Category:
core
Target version:
1.4.59
ASK QUESTIONS IN Forums:
No

Description

Hello folks,

I've run into an odd issue with some of the simple web servers we run (they serve some text in index.html). Specifically, it seems like some connections get stuck in CLOSE_WAIT and eat up 100% of the CPU.
Note that these are single-core VMs.

This issue started when we updated from lighttpd 1.4.55-2 to 1.4.57-1. I'll provide full context on what was done in case it helps. Right after the upgrade, I noticed that the service failed to start, and I saw these errors:
2021-01-13 20:21:21: (configfile.c.2269) server.upload-dirs doesn't exist: /var/cache/lighttpd/uploads
2021-01-13 20:21:21: (plugin.c.195) dlopen() failed for: /usr/lib/lighttpd/mod_openssl.so /usr/lib/lighttpd/mod_openssl.so: cannot open shared object file: No such file or directory
2021-01-13 20:21:21: (server.c.1238) loading plugins finally failed

and:
2021-01-13 20:21:21: (configfile.c.253) Warning: please add "mod_openssl" to server.modules list in lighttpd.conf. A future release of lighttpd 1.4.x will not automatically load mod_openssl and lighttpd will not use SSL/TLS where your lighttpd.conf contains ssl.* directives

So I edited /etc/lighttpd/lighttpd.conf to add mod_openssl and installed the lighttpd-mod-openssl package. That resolved the issue with the service not starting.
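
For reference, the change amounted to roughly the following (the exact placement of mod_openssl within server.modules may differ in your config):

server.modules += ( "mod_openssl" )

plus, on our Debian-based hosts, something like:

apt install lighttpd-mod-openssl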

After this point, the servers ran normally for a while, then we got an alert about CPU usage. Taking a look, it seemed like lighttpd was constantly consuming the entire core. Luckily enough, it still seemed to be accepting connections normally (as far as we could tell from access.log, and error.log didn't show anything out of the ordinary).

At this point, lsof showed the following (with some entries redacted, of course):

mitd:~$ sudo lsof -p 157648
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
lighttpd 157648 www-data cwd DIR 254,1 4096 2 /
lighttpd 157648 www-data rtd DIR 254,1 4096 2 /
lighttpd 157648 www-data txt REG 254,1 375136 1966696 /usr/sbin/lighttpd
lighttpd 157648 www-data mem REG 254,1 23160 1967130 /usr/lib/x86_64-linux-gnu/libnss_cache.so.2.0
lighttpd 157648 www-data mem REG 254,1 51696 1968527 /usr/lib/x86_64-linux-gnu/libnss_files-2.31.so
lighttpd 157648 www-data mem REG 254,1 14408 1844237 /usr/lib/lighttpd/mod_staticfile.so
lighttpd 157648 www-data mem REG 254,1 30792 1841268 /usr/lib/lighttpd/mod_dirlisting.so
lighttpd 157648 www-data mem REG 254,1 3076960 1967430 /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1
lighttpd 157648 www-data mem REG 254,1 593696 1967535 /usr/lib/x86_64-linux-gnu/libssl.so.1.1
lighttpd 157648 www-data mem REG 254,1 22600 1835392 /usr/lib/lighttpd/mod_accesslog.so
lighttpd 157648 www-data mem REG 254,1 55520 1839916 /usr/lib/lighttpd/mod_openssl.so
lighttpd 157648 www-data mem REG 254,1 14408 1843016 /usr/lib/lighttpd/mod_redirect.so
lighttpd 157648 www-data mem REG 254,1 14408 1835607 /usr/lib/lighttpd/mod_alias.so
lighttpd 157648 www-data mem REG 254,1 149608 1968567 /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
lighttpd 157648 www-data mem REG 254,1 1839792 1967267 /usr/lib/x86_64-linux-gnu/libc-2.31.so
lighttpd 157648 www-data mem REG 254,1 43048 1966278 /usr/lib/x86_64-linux-gnu/libxxhash.so.0.8.0
lighttpd 157648 www-data mem REG 254,1 257528 1966472 /usr/lib/x86_64-linux-gnu/libnettle.so.8.0
lighttpd 157648 www-data mem REG 254,1 18688 1968044 /usr/lib/x86_64-linux-gnu/libdl-2.31.so
lighttpd 157648 www-data mem REG 254,1 464848 1967316 /usr/lib/x86_64-linux-gnu/libpcre.so.3.13.3
lighttpd 157648 www-data mem REG 254,1 14408 1835248 /usr/lib/lighttpd/mod_access.so
lighttpd 157648 www-data mem REG 254,1 14408 1842895 /usr/lib/lighttpd/mod_indexfile.so
lighttpd 157648 www-data mem REG 254,1 177928 1966929 /usr/lib/x86_64-linux-gnu/ld-2.31.so
lighttpd 157648 www-data 0u CHR 1,3 0t0 7949 /dev/null
lighttpd 157648 www-data 1u CHR 1,3 0t0 7949 /dev/null
lighttpd 157648 www-data 2u unix 0xffff98ddb1b31800 0t0 1393254 type=STREAM
lighttpd 157648 www-data 3w REG 0,23 7 1393271 /run/lighttpd.pid
lighttpd 157648 www-data 4u IPv4 1393272 0t0 TCP *:http (LISTEN)
lighttpd 157648 www-data 5u IPv6 1393273 0t0 TCP *:http (LISTEN)
lighttpd 157648 www-data 6u IPv4 1393274 0t0 TCP *:https (LISTEN)
lighttpd 157648 www-data 7u IPv6 1393275 0t0 TCP *:https (LISTEN)
lighttpd 157648 www-data 8w REG 254,1 3423240 3670309 /var/log/lighttpd/error.log
lighttpd 157648 www-data 9w REG 254,1 117697465 3670308 /var/log/lighttpd/access.log
lighttpd 157648 www-data 10u a_inode 0,13 0 7868 [eventpoll]
lighttpd 157648 www-data 11u IPv4 1586902 0t0 TCP #####:https->#####.com:52059 (ESTABLISHED)
lighttpd 157648 www-data 12u IPv4 1435873 0t0 TCP ME.com:https->#####.com:41960 (CLOSE_WAIT)
lighttpd 157648 www-data 13u IPv4 1586953 0t0 TCP .com:https->:59153 (ESTABLISHED)
lighttpd 157648 www-data 19u IPv4 1415541 0t0 TCP ME.com:http->***********.com:43470 (CLOSE_WAIT)
lighttpd 157648 www-data 20r REG 254,1 27 3670597 /var/www/html/index.html

You can see that 12u and 19u are both stuck in CLOSE_WAIT. They stayed in that state for hours. An strace of the lighttpd process showed messages similar to the following being spammed repeatedly:

epoll_ctl(10, EPOLL_CTL_MOD, 12, {EPOLLERR|EPOLLHUP, {u32=1349789952, u64=94078313441536}}) = 0
epoll_ctl(10, EPOLL_CTL_MOD, 12, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=1349789952, u64=94078313441536}}) = 0
getsockopt(21, SOL_TCP, TCP_INFO, "\10\0\0\0\0\4x\0\200 \5\0@\234\0\0\264\5\0\0\264\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\374\231c\1\0\0\0\0\30\231c\1\234\257b\1\334\5\0\0\350\301\0\0007\275\1\0\233\336\0\0\377\377\377\177\n\0\0\0\264\5\0\0\3\0\0\0\0\0\0\0\20r\0\0\0\0\0\0", [104]) = 0

At this point I was able to close fds 12 and 19 specifically using the following, after which CPU usage returned to normal and strace stopped showing the messages above:
gdb -p <lighttpd pid>
(gdb) call (int)close(12)
(gdb) call (int)close(19)
(gdb) quit

To mitigate this in the short term, I have set up a cron job that checks CPU usage and restarts the service if it is above a given percentage.
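
Roughly along these lines (the 90% threshold and the use of systemctl here are just illustrative):

#!/bin/sh
# crude mitigation: restart lighttpd if its total CPU usage exceeds a threshold
THRESHOLD=90
CPU=$(ps -C lighttpd -o %cpu= | awk '{s+=$1} END {printf "%d", s}')
if [ "${CPU:-0}" -gt "$THRESHOLD" ]; then
    systemctl restart lighttpd
fi
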
Any help to get to the bottom of this bug would be appreciated, let me know what other info you may need.


Files

scan.txt (1.7 MB), added by mitd, 2021-01-26 22:45
#1

Updated by gstrauss over 3 years ago

  • Category set to core
  • Target version changed from 1.4.x to 1.4.59

Any help to get to the bottom of this bug would be appreciated, let me know what other info you may need.

Sharing your config (minus any passwords) would be helpful. You can print the processed config with: lighttpd -f /etc/lighttpd/lighttpd.conf -p

A bug fix was recently pushed to lighttpd git master for a 100% spin when traffic limits are active. To run into the bug, you need to have connection.kbytes-per-second or server.kbytes-per-second configured in lighttpd.conf. See also #3058, which might be the same issue.

--- a/src/connections.c
+++ b/src/connections.c
@@ -523,7 +523,7 @@ static int connection_handle_write_state(request_st * const r, connection * cons
         }
     } while (r->http_version <= HTTP_VERSION_1_1
              && (!chunkqueue_is_empty(&r->write_queue)
-                 ? con->is_writable
+                 ? con->is_writable > 0 && 0 == con->traffic_limit_reached
                  : r->resp_body_finished));

     return CON_STATE_WRITE;
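
(For reference, the traffic limits that trigger that code path are set with directives along the lines of server.kbytes-per-second = 128 or connection.kbytes-per-second = 32; those values are only illustrative.)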
#2

Updated by mitd over 3 years ago

Here's the config:
https://pastebin.com/7vKE72zU

We don't have traffic limits set in our config, so I don't believe that is the issue.
As for #3058, which you linked: I looked at journalctl -u lighttpd.service and didn't spot any stack traces. Let me know if you need me to look elsewhere.

Thanks.

#3

Updated by gstrauss over 3 years ago

That's a fairly straightforward config. Thank you for sharing it.

Do you know if the stuck requests are HTTP/1.1 or HTTP/2?

Let me know if you need me to look elsewhere.

If you see lighttpd at 100% CPU usage, please use pstack or gdb to obtain a stack trace to help narrow down where I need to look.
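
Something like the following should do (substitute the actual PID if pidof returns more than one):

pstack $(pidof lighttpd)

or:

gdb -p $(pidof lighttpd)
(gdb) bt
(gdb) detach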

There was a bug fixed in lighttpd 1.4.58 (commit 37ae9423), so if you can test with lighttpd 1.4.58, that would be appreciated. lighttpd 1.4.56 was a huge release representing over a year of development, and unfortunately some bugs apparently remain to be squashed.

Unrelated: please check your ssl.cipher-list. I see that you include RC4 and disable AESGCM. You might have a look at lighttpd TLS docs for some recommendations.

#4

Updated by mitd over 3 years ago

I will confirm whether they are HTTP/1.1 or HTTP/2, try to use pstack/gdb to obtain a stack trace, and update this ticket when I can.

I forgot to mention that I did in fact update to 1.4.58, but unfortunately the issue persists.

And thank you, I'll take a look at the documentation.

Appreciate it,
Mit

#5

Updated by gstrauss over 3 years ago

getsockopt(21, SOL_TCP, TCP_INFO, "\10\0\0\0\0\4x\0\200 \5\0@\234\0\0\264\5\0\0\264\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\374\231c\1\0\0\0\0\30\231c\1\234\257b\1\334\5\0\0\350\301\0\0007\275\1\0\233\336\0\0\377\377\377\177\n\0\0\0\264\5\0\0\3\0\0\0\0\0\0\0\20r\0\0\0\0\0\0", [104]) = 0

The call to getsockopt() for SOL_TCP TCP_INFO in fdevent_is_tcp_half_closed() tells me that lighttpd is detecting that the socket was closed (or at least half-closed). I'll dig in further to see if there is a reason why that does not result in the connection being closed soon afterwards. Once that is received, lighttpd ideally should not have to check that again for that connection (unless lighttpd has not yet drained what is in the kernel recv socket buffer for that connection).
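
For the curious, the idea behind that check is roughly the following (a simplified sketch, not the exact lighttpd code). The leading \10 (octal for 8) in the buffer above is tcpi_state == TCP_CLOSE_WAIT:

#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_CLOSE_WAIT
#define TCP_CLOSE_WAIT 8  /* Linux TCP state number for CLOSE_WAIT */
#endif

/* simplified sketch: ask the kernel for the TCP state of the socket and
 * treat CLOSE_WAIT (the peer has already sent FIN) as "half-closed" */
static int is_tcp_half_closed(int fd)
{
    struct tcp_info tcpi;
    socklen_t len = sizeof(tcpi);
    return 0 == getsockopt(fd, IPPROTO_TCP, TCP_INFO, &tcpi, &len)
        && tcpi.tcpi_state == TCP_CLOSE_WAIT;
}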

#6

Updated by gstrauss over 3 years ago

Are you able to reproduce this error with any frequency?

The code which reaches fdevent_is_tcp_half_closed() is fairly small. In clear-text (non-TLS), lighttpd will read() and receive EOF. I am wondering if the underlying TLS library might be in a state which does not handle this.

You're using mod_openssl. Would you be willing to upgrade or downgrade your version of openssl? Would you be willing to try mod_gnutls, mod_mbedtls, mod_wolfssl, or mod_nss as a substitute for mod_openssl? To use the alternative TLS modules, you might have to modify ssl.cipher-list (or comment it out and use the lighttpd default, ssl.cipher-list = "HIGH:!aNULL:!eNULL:!EXP"; in lighttpd 1.4.56 and later the default also sets "MinProtocol" => "TLSv1.2").
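
For example, switching to wolfSSL would look roughly like this (module names as listed above; the cipher list shown is the default mentioned above):

# in server.modules, load the alternative module in place of "mod_openssl"
server.modules += ( "mod_wolfssl" )
# either comment out ssl.cipher-list or fall back to the default
ssl.cipher-list = "HIGH:!aNULL:!eNULL:!EXP"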

#7

Updated by mitd over 3 years ago

Some new info: the connections seem to be coming from one of our security scanners, so this happens each time the servers get scanned.
It uses HTTP/1.0 for the majority of the connections I saw.

Here is a pastebin of a longer strace that I had captured:
https://pastebin.com/3yMKzqud

I used gdb to attach to the process and ran bt, which printed the following:
(gdb) bt
#0 0x00007fb3b4914f1a in getsockopt () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x000055c4b35156c8 in fdevent_is_tcp_half_closed ()
#2 0x000055c4b34f2762 in ?? ()
#3 0x000055c4b3502c9c in ?? ()
#4 0x000055c4b34f0f2e in main ()

I'll attempt to use a different TLS module tomorrow.

#8

Updated by gstrauss over 3 years ago

The following patch is a worthwhile improvement, but I do not think it is sufficient to address the issue here. It should prevent repeated calls to getsockopt() after POLLRDHUP is received on the connection.

--- a/src/connections.c
+++ b/src/connections.c
@@ -1218,7 +1224,9 @@ connection_set_fdevent_interest (request_st * const r, connection * const con)
     int n = 0;
     switch(r->state) {
       case CON_STATE_READ:
-        n = FDEVENT_IN | FDEVENT_RDHUP;
+        n = FDEVENT_IN;
+        if (!(r->conf.stream_request_body & FDEVENT_STREAM_REQUEST_POLLRDHUP))
+            n |= FDEVENT_RDHUP;
         break;
       case CON_STATE_WRITE:
         if (!chunkqueue_is_empty(con->write_queue)

#9

Updated by gstrauss over 3 years ago

That's an interesting strace. Thanks.
I don't see anything trying to read() after the event is received, and I need to track down why that is.

#10

Updated by gstrauss over 3 years ago

If this is reproducible after a scan, I'd be curious what the request is.

debug.log-request-header = "enable" 
debug.log-response-header = "enable" 

If you can attach gdb to the lighttpd process that is using 100% CPU, step through it (repeating from one epoll_wait() to the next), and share the result, that should help pinpoint what is going on.
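
For example (symbol availability will depend on how lighttpd was built; these commands are just a starting point):

gdb -p $(pidof lighttpd)
(gdb) break epoll_wait
(gdb) continue
(gdb) finish
(gdb) next        # repeat next/step until execution comes back around to epoll_wait()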

#11

Updated by gstrauss over 3 years ago

debug.log-state-handling = "enable" will generate a lot of noise in the error logs on a busy site, but would also be very useful to help identify where things are spinning.

#12

Updated by mitd over 3 years ago

I attempted to step through using gdb today and generated the following:
https://pastebin.com/niNTr7Ja
I'm unsure if I did it correctly (I'm unfamiliar with gdb, unfortunately).

I've added:
debug.log-request-header = "enable"
debug.log-response-header = "enable"

And I'll add:
debug.log-state-handling = "enable"
afterwards.
I'm in the process of gaining the rights necessary to run the scans on demand so I can reproduce the error state easily. Switching TLS modules will also hopefully be done soon. Once I get the rights, I'll see if I can get a dump of the error log around that time frame.

#13

Updated by gstrauss over 3 years ago

It does not look like lighttpd was built with full debugging symbols, or maybe the debugging symbols are not available on the system where you are running gdb against lighttpd.

The calls to BIO_write() tell me that you're getting into mod_openssl and into the underlying openssl libraries. I don't know why your strace does not show a read() or write() system call from mod_openssl, or why openssl is not detecting that the connection has closed.

#14

Updated by gstrauss over 3 years ago

The most recent trace that you shared shows BIO_write() and write() along with SIGPIPE, which is blocked, but the write() should fail with EPIPE.

It is curious that you did not see write() in your original strace.

In any case, it is still a mystery why this error is not being propagated through lighttpd. I'm still looking through the code...

#15

Updated by gstrauss over 3 years ago

Only because I haven't yet asked: is there any trace in the lighttpd error log?

#16

Updated by gstrauss over 3 years ago

What version of the openssl libraries are installed on your system?
openssl version

#17

Updated by gstrauss about 3 years ago

@mitd: I am interested in helping to track this down. Do you have an estimate when you might have time to come back to this?

It would be useful to know what commands are being sent by the scanner. From your traces, two of the connections seem to end up in this problem state.

Also, please share openssl version. Thanks.

#18

Updated by mitd about 3 years ago

Sorry about the delay; some things came up. I'm still looking into it.
What I've tried so far: I've installed wolfSSL and am testing now with the config changes you mentioned above.
openssl version
OpenSSL 1.1.1i 8 Dec 2020

I don't see a trace in the error logs. I've gotten some access to run the scans (all I can do is initiate them, not much else), but with that I can hopefully capture the scanner's requests in the access.log. I'll reply once I have that info; I've initiated a scan now but am waiting for it to actually start.

#19

Updated by gstrauss about 3 years ago

Great. Thanks. If you reproduce the CLOSE_WAIT, you may have to attach with gdb and issue call (int)close(xxxx) before the request will end up in the access.log, since logging occurs after the request ends. debug.log-request-header = "enable" writes the request headers to the error log as soon as they are received. If they do not show up there, then it is possible that things are stuck somewhere in the TLS negotiation between the scanner and lighttpd, before the HTTP request is received by lighttpd.

#20

Updated by mitd about 3 years ago

Seems like the issue still happens with wolfSSL.
I've also attached a (very long) log file with the calls from the Qualys security scanner.

Weirdly enough, when I attempted to call close this time, I got $1 = -1 and the fd didn't close. I'm going to attempt this on another server that still has mod_openssl installed.

#21

Updated by gstrauss about 3 years ago

14561 lines of requests is, ... um ..., a lot.
Would you share your server.feature-flags settings, if you have set any? (I am specifically interested in "server.h2proto" and "server.h2c")

#22

Updated by mitd about 3 years ago

Unfortunately yes, it sends tons of requests.
Here are the flags:

server.feature-flags += ("server.h2proto" => "enable")
server.feature-flags += ("server.h2c" => "enable")

I attempted to close the connections, and here is what I got in the error log:
https://pastebin.com/zuQAkKe8
That was pulled a few seconds after closing the connection.

#23

Updated by gstrauss about 3 years ago

In an effort to bound where the problem is, does the problem occur if you (temporarily) disable those feature flags and then trigger a scan? If it does, does the problem occur if you set only server.feature-flags += ("server.h2c" => "disable") and leave server.feature-flags += ("server.h2proto" => "enable")?
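
(For the first test, that would mean changing the two lines you shared to "disable", e.g.:

server.feature-flags += ("server.h2proto" => "disable")
server.feature-flags += ("server.h2c" => "disable")

and then re-running the scan.)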

#24

Updated by mitd about 3 years ago

Looks like it still occurs with both flags disabled. Will try it with h2c disabled and h2proto enabled once this scan is finished.

#25

Updated by mitd about 3 years ago

Issue persists with:
server.feature-flags += ("server.h2c" => "disable") and server.feature-flags += ("server.h2proto" => "enable")

#26

Updated by gstrauss about 3 years ago

That is expected, as most of the requests in your scan access log are HTTP/1.0 with a few HTTP/1.1. There was one HTTP/2.0 "prior knowledge" request, and if the problem still happened when you disabled "server.h2c" and disabled "server.h2proto" (which also disables "server.h2c"), then the issue is much less likely to be in the HTTP/2 code in lighttpd.

The scan access log you shared has the HTTP request-line, but not the headers that went along with each request. Having those headers may help me reproduce the issue.

So far, running all of the request lines against lighttpd does not trigger the issue on my test box, and I tried with and without HTTPS.

I don't have any good hunches right now. If you can temporarily run a scan without port 80 enabled, we can establish whether or not the issue occurs with HTTPS (it probably does). To do so, temporarily change server.port = 80 to server.port = 443 and comment out the $SERVER["socket"] == "[::]:80" section in your config.

If you have enabled request header logging with debug.log-request-header = "enable", then the error log will contain a log line with the file descriptor, followed by the request headers. When you trigger the error and the server has that fd in CLOSE_WAIT, the last instance in the error log which uses that fd (and the subsequent request headers) is likely the request that triggered the issue. (This is before you attach with gdb to try to close() the fd.) That might give us a starting point. There is a chance that a sequence of requests causes the problem, but it would still be useful to see the request that gets stuck.

#27

Updated by mitd about 3 years ago

I attempted to bind to port 443, but had to comment out:
include_shell "/usr/share/lighttpd/use-ipv6.pl " + server.port
Once that was commented out, I was able to start the service. However, looking at the access log, it started throwing 400s. I think this is due to our anycast setup (I'm not 100% familiar with it) and the requests themselves.

In order to continue down that route, I'll have to remove one server from the published anycast config.
That may also help clean up the access and error logs for debugging. There are some steps involved in changing that, so I'll try to get started ASAP.

#28

Updated by gstrauss about 3 years ago

In order to continue down that route, I'll have to remove one server from the published anycast config.

I am not sure that is worth the effort. You might instead modify the access log format to include the server IP with %A.
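
For example (the rest of the format string here is roughly the lighttpd default; adjust to taste):

accesslog.format = "%A %h %V %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\""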

We do not yet know if the trigger for the error condition is due to a specific HTTP request or due to a TCP behavior.

I do know that it also takes effort to scrub your error log before possibly sharing it.

Do you think I might be able to reproduce this using the Qualys Community Edition?

#29

Updated by mitd about 3 years ago

The Community Edition may work, but I'm not 100% sure. A quick search seems to indicate that the scanning capabilities are the same between Community and Express Lite at least (not sure which variant we use internally), so that's somewhat promising. Since the scan is initiated through an RPC call on my end, it's somewhat opaque; all I can tell is that it's a "default scan", which runs all plugins at a normal testing rate but without credential brute-forcing.
I'll still look into the possibility of removing one server from our anycast config, just in case.

#30

Updated by gstrauss about 3 years ago

  • Status changed from New to Patch Pending

Using the Qualys Community scanner, I was not able to reproduce the error with the patch posted last week in #3059#note-8 (presumably that patch resulted in the request eventually timing out after hitting the read timeout).

However, I was able to reproduce the error once without that patch, and the problem occurs for requests with oversized headers read across multiple calls to read(). (I must not have tested splitting large headers across multiple reads.) The cause is a typo I introduced in development in Sep 2019, which ended up in lighttpd 1.4.56. The fix will be part of lighttpd 1.4.59:

--- a/src/connections.c
+++ b/src/connections.c
@@ -626,7 +626,7 @@ static chunk * connection_read_header_more(connection *con, chunkqueue *cq, chun

     if (cq->first != cq->last && 0 != olen) {
         const size_t clen = chunkqueue_length(cq);
-        size_t block = (olen + (16384-1)) & (16384-1);
+        size_t block = (olen + (16384-1)) & ~(16384-1);
         block += (block - olen > 1024 ? 0 : 16384);
         chunkqueue_compact_mem(cq, block > clen ? clen : block);
     }

Thanks for your help in tracking this down.

#31

Updated by mitd about 3 years ago

Awesome, happy to hear that.

Thanks for your help and patience through this. My team and a few others will be quite happy about this, as I know a few folks had just gone ahead and disabled lighttpd in the meantime.

#32

Updated by gstrauss about 3 years ago

  • Status changed from Patch Pending to Fixed