Bug #3249
closedlighttpd-1.4.74/1.4.75 do not close sockets in mod_sockproxy
Description
I noticed that my server starts hanging after 3-4 days of uptime.
My monitoring showed an increasing number of used tcp sockets (/proc/net/sockstat).
After the limit of around 400 sockets is reached, the server hangs I have to manually restart it.
IIRC ss -tupn
also showed lots of connections with state CLOSE_WAIT to another host where lighttpd is a reverse proxy for (sockproxy.server
).
The problem occurs with version 1.4.74 and 1.4.75.
The problem does not occur with version 1.4.73.
I'll enable debug logs and test this further.
OS: Gentoo Linux
Arch: arm64
USE: brotli lua nettle pcre ssl xattr zlib zstd
CFLAGS: -march=armv8-a+crc+simd -mtune=cortex-a72 -ftree-vectorize -O2 -pipe -fomit-frame-pointer
Files
Updated by ultimator 8 months ago
Update: it's sockproxy
not proxy
ss -tupn | grep CLOSE
tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:50282 users:(("lighttpd",pid=25677,fd=19)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:54738 users:(("lighttpd",pid=25677,fd=25)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:45374 users:(("lighttpd",pid=25677,fd=23)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:43702 users:(("lighttpd",pid=25677,fd=17)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:55522 users:(("lighttpd",pid=25677,fd=26)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:36140 users:(("lighttpd",pid=25677,fd=18)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:47640 users:(("lighttpd",pid=25677,fd=20)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:49532 users:(("lighttpd",pid=25677,fd=16)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:39818 users:(("lighttpd",pid=25677,fd=29)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:42926 users:(("lighttpd",pid=25677,fd=21)) tcp CLOSE-WAIT 1 0 internal-ip:853 external-ip:50500 users:(("lighttpd",pid=25677,fd=24))
Relevant part of config
$SERVER["socket"] == "0.0.0.0:853" { ssl.engine = "enable" # ssl params ... sockproxy.server = ( "" => ( ( "host" => "my-internal-host", "port" => 9000 ) ) ) }
Updated by ultimator 8 months ago
Bisecting gave me
bc873072ba8e23d5f07c5a9b0e58bde88dada722 is the first bad commit commit bc873072ba8e23d5f07c5a9b0e58bde88dada722 Author: Glenn Strauss <gstrauss@gluelogic.com> Date: Mon Dec 11 01:51:06 2023 -0500 [core] quicker server graceful shutdown of websockets src/connections.c | 9 +++++++++ src/gw_backend.c | 4 +++- src/h2.c | 23 +++++++++++++++++++++-- 3 files changed, 33 insertions(+), 3 deletions(-)
I reverted this for now and that fixes the issue.
Updated by gstrauss 8 months ago · Edited
- Subject changed from >=lighttpd-1.4.74 does not close sockets to lighttpd-1.4.74/1.4.75 do not close sockets in mod_sockproxy
- Category set to mod_sockproxy
- Target version changed from 1.4.xx to 1.4.76
Thanks for the report and troubleshooting. I may have to add a special-case for mod_sockproxy in the conditions in commit bc873072
Updated by gstrauss 8 months ago · Edited
sockproxy.server = ( "" => ( ( "host" => "my-internal-host", "port" => 9000 ) ) )
Is your backend service receiving TCP_FIN from lighttpd calling shutdown(fd, SHUT_WR)
? Is your backend then calling close()
on the socket?
(The backend service likely called close()
since the socket is in CLOSE_WAIT
state)
Updated by ultimator 8 months ago
The socket on the backend is closed.
I did a capture on the backend until I noticed the first socket in close-wait state on the proxy host.
It's like this:
Lighttpd host --- Backend host Syn -> <- Syn Ack Ack -> Fin Ack -> <- Ack <- Fin Ack Ack -> ------------- 5 minutes later ------------------- Syn -> <- Syn Ack Rst ->
As far as I understand, lighttpd sends RST but keeps the socket open.
Updated by gstrauss 8 months ago
Would you please test if this addresses your issue?
--- a/src/gw_backend.c +++ b/src/gw_backend.c @@ -1924,7 +1924,11 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con if (r->http_version <= HTTP_VERSION_1_1) r->con->is_readable = 0; shutdown(hctx->fd, SHUT_WR); - fdevent_fdnode_event_clr(hctx->ev, hctx->fdn, FDEVENT_OUT); + int events = fdevent_fdnode_interest(hctx->fdn) & ~FDEVENT_OUT; + if (!(events & FDEVENT_IN)) + hctx->read_ts = log_monotonic_secs; + events |= FDEVENT_IN|FDEVENT_RDHUP; + fdevent_fdnode_event_set(hctx->ev, hctx->fdn, events); } static handler_t gw_write_refill_wb(gw_handler_ctx * const hctx, request_st * const r) {
Updated by gstrauss 8 months ago
I am still unable to reproduce this.
I have been reviewing commit bc873072
How frequently do you restart lighttpd? Are you sending SIGUSR1 to lighttpd for graceful restart?
Are the requests that are not closing the sockets HTTP/1.x or HTTP/2?
What is the rest of your lighttpd.conf, such as timeout settings?
What is the behavior of your client? Is it actively uploading data?
What is the behavior of your backend behind mod_sockproxy? Is it actively sending data?
Updated by gstrauss 8 months ago
Here is a modified patch to the patch above
--- a/src/gw_backend.c +++ b/src/gw_backend.c @@ -1920,11 +1920,17 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con /* propagate shutdown SHUT_WR to backend if TCP half-close on con->fd */ r->conf.stream_request_body |= FDEVENT_STREAM_REQUEST_BACKEND_SHUT_WR; - r->conf.stream_request_body &= ~FDEVENT_STREAM_REQUEST_POLLIN; + r->conf.stream_request_body &= + ~(FDEVENT_STREAM_REQUEST_BUFMIN|FDEVENT_STREAM_REQUEST_POLLIN); if (r->http_version <= HTTP_VERSION_1_1) r->con->is_readable = 0; shutdown(hctx->fd, SHUT_WR); - fdevent_fdnode_event_clr(hctx->ev, hctx->fdn, FDEVENT_OUT); + int events = fdevent_fdnode_interest(hctx->fdn) & ~FDEVENT_OUT; + if (!(events & FDEVENT_IN)) { + hctx->read_ts = log_monotonic_secs; + events |= FDEVENT_IN|FDEVENT_RDHUP; + } + fdevent_fdnode_event_set(hctx->ev, hctx->fdn, events); } static handler_t gw_write_refill_wb(gw_handler_ctx * const hctx, request_st * const r) {
Updated by gstrauss 8 months ago
Awaiting feedback.
Please confirm that the issue does not occur with lighttpd 1.4.75 and with the following set to hctx->host->tcp_fin_propagate = 0;
.
--- a/src/gw_backend.c +++ b/src/gw_backend.c @@ -1852,7 +1852,7 @@ void gw_set_transparent(gw_handler_ctx *hctx) { /*(error, but not critical)*/ } } - hctx->host->tcp_fin_propagate = 1; /*(force setting enabled for host)*/ + hctx->host->tcp_fin_propagate = 0; /*(test: force disabled)*/ hctx->wb_reqlen = -1; gw_set_state(hctx, GW_STATE_WRITE); }
Updated by gstrauss 8 months ago
Please try this patch by itself:
--- a/src/gw_backend.c +++ b/src/gw_backend.c @@ -1915,6 +1915,8 @@ static void gw_conditional_tcp_fin(gw_handler_ctx * const hctx, request_st * con if (!chunkqueue_is_empty(&hctx->wb))return; if (!hctx->host->tcp_fin_propagate) return; if (hctx->gw_mode == GW_AUTHORIZER) return; + if (hctx->state == GW_STATE_CONNECT_DELAYED) + return; if (r->conf.stream_request_body & FDEVENT_STREAM_REQUEST_BACKEND_SHUT_WR) return;
Maybe lighttpd is receiving TCP_FIN from client before "my-internal-host" completes the
connect()
Updated by ultimator 7 months ago · Edited
Sorry for the absensce.
How frequently do you restart lighttpd?
Usually it's running for maybe 2-3 weeks until I restart due to updates.
Are you sending SIGUSR1 to lighttpd for graceful restart?
I'm using the gentoo openrc service of lighttpd. The stop command is
start-stop-daemon --stop --quiet --pidfile "${LIGHTTPD_PID}"
There is also a graceful stop that does the following but I don't use it
start-stop-daemon --quiet --pidfile "${LIGHTTPD_PID}" --signal INT
Are the requests that are not closing the sockets HTTP/1.x or HTTP/2?
I'll check this.
What is the rest of your lighttpd.conf, such as timeout settings?
I have no timeouts specified so they should all be set to default values.
I enabled HTTP/2 with
server.feature-flags = ( "server.h2proto" => "enable", "server.h2c" => "enable", )
What is the behavior of your client? Is it actively uploading data?
What is the behavior of your backend behind mod_sockproxy? Is it actively sending data?
I can't really tell. I'm basically serving DOT but there is no client yet that uses it (at least none I'm aware of). The backend is DNS resolver. Lighttpd strips the TLS and forwards to the resolver. A proper DOT client would encapsulate a DNS packet but the transmitted packages do not contain data, only tcp metadata. I don't even know if there really is a client or if it's just some check for the upstream server by lighttpd itself. I can provide the capture file if needed.
I'll test the patches now.
Updated by ultimator 7 months ago · Edited
Patch from #note-12 does not work. Same behavior as before.
Patch from #note-14 with
hctx->host->tcp_fin_propagate = 0;
seems to fix the problem.
The server is running for around an hour now and the issue did not occur with the patch applied.
Patch from #note-15 also seems to fix the problem.
Tested with lighttpd version 1.4.75.
Updated by ultimator 7 months ago
- File lighttpd.pcapng lighttpd.pcapng added
I dumped the traffic of a connection on the lighttpd host where the socket stays open.
No patches applied. Pcap file attached.
The client is 5.83.178.7.
The lighttpd server is 192.168.157.12 as well as 192.168.157.9 (internal).
The backend is 192.168.157.10.
Client --- Lighttpd host --- Backend host Syn -> <- Syn, Ack Ack -> Fin, Ack -> Syn -> <- Syn Ack Rst -> <- Ack
Updated by gstrauss 7 months ago
- Status changed from Need Feedback to Patch Pending
Patch from #note-15 also seems to fix the problem.
Thank you for testing. #note-15 appears to be the proper solution. As you showed in #note-18, the client is sending Fin before lighttpd is even connecting to the backend, and so lighttpd was removing socket event interest in POLLOUT to indicate that the connect()
had completed, but the lighttpd TCP fin propagation from the client was removing socket event interest in POLLOUT before the connect()
had completed, and so the socket was left with no socket interest until timing out.
Updated by gstrauss 7 months ago
- Status changed from Patch Pending to Fixed
Applied in changeset 42c8a1ac941c667ecd329fe43bfa942c0c3767fe.
Also available in: Atom