Project

General

Profile

Actions

Bug #3249

closed

lighttpd-1.4.74/1.4.75 do not close sockets in mod_sockproxy

Added by ultimator 6 months ago. Updated 6 months ago.

Status:
Fixed
Priority:
Normal
Category:
mod_sockproxy
Target version:
ASK QUESTIONS IN Forums:
No

Description

I noticed that my server starts hanging after 3-4 days of uptime.
My monitoring showed an increasing number of used tcp sockets (/proc/net/sockstat).

After the limit of around 400 sockets is reached, the server hangs I have to manually restart it.
IIRC ss -tupn also showed lots of connections with state CLOSE_WAIT to another host where lighttpd is a reverse proxy for (sockproxy.server).

The problem occurs with version 1.4.74 and 1.4.75.
The problem does not occur with version 1.4.73.

I'll enable debug logs and test this further.

OS: Gentoo Linux
Arch: arm64
USE: brotli lua nettle pcre ssl xattr zlib zstd
CFLAGS: -march=armv8-a+crc+simd -mtune=cortex-a72 -ftree-vectorize -O2 -pipe -fomit-frame-pointer


Files

Actions #1

Updated by ultimator 6 months ago

Update: it's sockproxy not proxy

ss -tupn | grep CLOSE

tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:50282 users:(("lighttpd",pid=25677,fd=19))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:54738 users:(("lighttpd",pid=25677,fd=25))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:45374 users:(("lighttpd",pid=25677,fd=23))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:43702 users:(("lighttpd",pid=25677,fd=17))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:55522 users:(("lighttpd",pid=25677,fd=26))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:36140 users:(("lighttpd",pid=25677,fd=18))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:47640 users:(("lighttpd",pid=25677,fd=20))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:49532 users:(("lighttpd",pid=25677,fd=16))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:39818 users:(("lighttpd",pid=25677,fd=29))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:42926 users:(("lighttpd",pid=25677,fd=21))
tcp   CLOSE-WAIT 1      0          internal-ip:853         external-ip:50500 users:(("lighttpd",pid=25677,fd=24))

Relevant part of config
$SERVER["socket"] == "0.0.0.0:853" {
    ssl.engine    = "enable" 

    # ssl params
    ...

    sockproxy.server = ( "" => ( ( "host" => "my-internal-host", "port" => 9000 ) ) )
}

Actions #2

Updated by ultimator 6 months ago

Bisecting gave me

bc873072ba8e23d5f07c5a9b0e58bde88dada722 is the first bad commit
commit bc873072ba8e23d5f07c5a9b0e58bde88dada722
Author: Glenn Strauss <gstrauss@gluelogic.com>
Date:   Mon Dec 11 01:51:06 2023 -0500

    [core] quicker server graceful shutdown of websockets

 src/connections.c |  9 +++++++++
 src/gw_backend.c  |  4 +++-
 src/h2.c          | 23 +++++++++++++++++++++--
 3 files changed, 33 insertions(+), 3 deletions(-)

I reverted this for now and that fixes the issue.

Actions #3

Updated by gstrauss 6 months ago · Edited

  • Subject changed from >=lighttpd-1.4.74 does not close sockets to lighttpd-1.4.74/1.4.75 do not close sockets in mod_sockproxy
  • Category set to mod_sockproxy
  • Target version changed from 1.4.xx to 1.4.76

Thanks for the report and troubleshooting. I may have to add a special-case for mod_sockproxy in the conditions in commit bc873072

Actions #4

Updated by gstrauss 6 months ago · Edited

sockproxy.server = ( "" => ( ( "host" => "my-internal-host", "port" => 9000 ) ) )

Is your backend service receiving TCP_FIN from lighttpd calling shutdown(fd, SHUT_WR)? Is your backend then calling close() on the socket?
(The backend service likely called close() since the socket is in CLOSE_WAIT state)

Actions #5

Updated by gstrauss 6 months ago

  • Description updated (diff)
Actions #6

Updated by ultimator 6 months ago

The socket on the backend is closed.
I did a capture on the backend until I noticed the first socket in close-wait state on the proxy host.
It's like this:

Lighttpd host  --- Backend host
             Syn -> 
                <- Syn Ack
             Ack ->
         Fin Ack ->
                <- Ack
                <- Fin Ack
             Ack ->
------------- 5 minutes later -------------------
             Syn ->
                <- Syn Ack
             Rst ->

As far as I understand, lighttpd sends RST but keeps the socket open.

Actions #7

Updated by gstrauss 6 months ago

Would you please test if this addresses your issue?

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -1924,7 +1924,11 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con
     if (r->http_version <= HTTP_VERSION_1_1)
         r->con->is_readable = 0;
     shutdown(hctx->fd, SHUT_WR);
-    fdevent_fdnode_event_clr(hctx->ev, hctx->fdn, FDEVENT_OUT);
+    int events = fdevent_fdnode_interest(hctx->fdn) & ~FDEVENT_OUT;
+    if (!(events & FDEVENT_IN))
+        hctx->read_ts = log_monotonic_secs;
+    events |= FDEVENT_IN|FDEVENT_RDHUP;
+    fdevent_fdnode_event_set(hctx->ev, hctx->fdn, events);
 }

 static handler_t gw_write_refill_wb(gw_handler_ctx * const hctx, request_st * const r) {

Actions #8

Updated by ultimator 6 months ago

gstrauss wrote in #note-7:

Would you please test if this addresses your issue?
[...]

Does not fix the issue unfortunately. Same behavior as before.

Actions #9

Updated by gstrauss 6 months ago

Thanks for testing. I'll see if I can find some time to reproduce this. If not today, it may have to wait for next weekend. Sorry.

Actions #10

Updated by gstrauss 6 months ago · Edited

I was unable to reproduce the issue using a trivial mod_sockproxy backend.

Would you be able to get an strace() of lighttpd while it handles a request to one of your mod_sockproxy backends?

Actions #11

Updated by gstrauss 6 months ago

I am still unable to reproduce this.

I have been reviewing commit bc873072

How frequently do you restart lighttpd? Are you sending SIGUSR1 to lighttpd for graceful restart?
Are the requests that are not closing the sockets HTTP/1.x or HTTP/2?

What is the rest of your lighttpd.conf, such as timeout settings?
What is the behavior of your client? Is it actively uploading data?
What is the behavior of your backend behind mod_sockproxy? Is it actively sending data?

Actions #12

Updated by gstrauss 6 months ago

Here is a modified patch to the patch above

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -1920,11 +1920,17 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con

     /* propagate shutdown SHUT_WR to backend if TCP half-close on con->fd */
     r->conf.stream_request_body |= FDEVENT_STREAM_REQUEST_BACKEND_SHUT_WR;
-    r->conf.stream_request_body &= ~FDEVENT_STREAM_REQUEST_POLLIN;
+    r->conf.stream_request_body &=
+      ~(FDEVENT_STREAM_REQUEST_BUFMIN|FDEVENT_STREAM_REQUEST_POLLIN);
     if (r->http_version <= HTTP_VERSION_1_1)
         r->con->is_readable = 0;
     shutdown(hctx->fd, SHUT_WR);
-    fdevent_fdnode_event_clr(hctx->ev, hctx->fdn, FDEVENT_OUT);
+    int events = fdevent_fdnode_interest(hctx->fdn) & ~FDEVENT_OUT;
+    if (!(events & FDEVENT_IN)) {
+        hctx->read_ts = log_monotonic_secs;
+        events |= FDEVENT_IN|FDEVENT_RDHUP;
+    }
+    fdevent_fdnode_event_set(hctx->ev, hctx->fdn, events);
 }

 static handler_t gw_write_refill_wb(gw_handler_ctx * const hctx, request_st * const r) {

Actions #13

Updated by gstrauss 6 months ago

  • Status changed from New to Need Feedback
Actions #14

Updated by gstrauss 6 months ago

Awaiting feedback.

Please confirm that the issue does not occur with lighttpd 1.4.75 and with the following set to hctx->host->tcp_fin_propagate = 0;.

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -1852,7 +1852,7 @@ void gw_set_transparent(gw_handler_ctx *hctx) {
             /*(error, but not critical)*/
         }
     }
-    hctx->host->tcp_fin_propagate = 1; /*(force setting enabled for host)*/
+    hctx->host->tcp_fin_propagate = 0; /*(test: force disabled)*/
     hctx->wb_reqlen = -1;
     gw_set_state(hctx, GW_STATE_WRITE);
 }

Actions #15

Updated by gstrauss 6 months ago

Please try this patch by itself:

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -1915,6 +1915,8 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con
     if (!chunkqueue_is_empty(&hctx->wb))return;
     if (!hctx->host->tcp_fin_propagate) return;
     if (hctx->gw_mode == GW_AUTHORIZER) return;
+    if (hctx->state == GW_STATE_CONNECT_DELAYED)
+        return;
     if (r->conf.stream_request_body & FDEVENT_STREAM_REQUEST_BACKEND_SHUT_WR)
         return;


Maybe lighttpd is receiving TCP_FIN from client before "my-internal-host" completes the connect()

Actions #16

Updated by ultimator 6 months ago · Edited

Sorry for the absensce.

How frequently do you restart lighttpd?

Usually it's running for maybe 2-3 weeks until I restart due to updates.

Are you sending SIGUSR1 to lighttpd for graceful restart?

I'm using the gentoo openrc service of lighttpd. The stop command is

start-stop-daemon --stop --quiet --pidfile "${LIGHTTPD_PID}" 

There is also a graceful stop that does the following but I don't use it

start-stop-daemon --quiet --pidfile "${LIGHTTPD_PID}" --signal INT  

Are the requests that are not closing the sockets HTTP/1.x or HTTP/2?

I'll check this.

What is the rest of your lighttpd.conf, such as timeout settings?

I have no timeouts specified so they should all be set to default values.
I enabled HTTP/2 with

server.feature-flags = (                                                                                                  
    "server.h2proto" => "enable",                                                                                                   
    "server.h2c"     => "enable",                                                                                                   
)

What is the behavior of your client? Is it actively uploading data?
What is the behavior of your backend behind mod_sockproxy? Is it actively sending data?

I can't really tell. I'm basically serving DOT but there is no client yet that uses it (at least none I'm aware of). The backend is DNS resolver. Lighttpd strips the TLS and forwards to the resolver. A proper DOT client would encapsulate a DNS packet but the transmitted packages do not contain data, only tcp metadata. I don't even know if there really is a client or if it's just some check for the upstream server by lighttpd itself. I can provide the capture file if needed.

I'll test the patches now.

Actions #17

Updated by ultimator 6 months ago · Edited

Patch from #note-12 does not work. Same behavior as before.
Patch from #note-14 with

hctx->host->tcp_fin_propagate = 0;

seems to fix the problem.
The server is running for around an hour now and the issue did not occur with the patch applied.
Patch from #note-15 also seems to fix the problem.

Tested with lighttpd version 1.4.75.

Actions #18

Updated by ultimator 6 months ago

I dumped the traffic of a connection on the lighttpd host where the socket stays open.
No patches applied. Pcap file attached.

The client is 5.83.178.7.
The lighttpd server is 192.168.157.12 as well as 192.168.157.9 (internal).
The backend is 192.168.157.10.

Client   ---  Lighttpd host  --- Backend host
   Syn   ->
          <-  Syn, Ack
   Ack   ->
Fin, Ack ->
                         Syn ->
                              <- Syn Ack
                         Rst ->
          <- Ack        
Actions #19

Updated by gstrauss 6 months ago

  • Status changed from Need Feedback to Patch Pending

Patch from #note-15 also seems to fix the problem.

Thank you for testing. #note-15 appears to be the proper solution. As you showed in #note-18, the client is sending Fin before lighttpd is even connecting to the backend, and so lighttpd was removing socket event interest in POLLOUT to indicate that the connect() had completed, but the lighttpd TCP fin propagation from the client was removing socket event interest in POLLOUT before the connect() had completed, and so the socket was left with no socket interest until timing out.

Actions #20

Updated by gstrauss 6 months ago

  • Status changed from Patch Pending to Fixed
Actions

Also available in: Atom