Bug #3086

closed

sockets disabled, out-of-fds with proxy module

Added by flynn almost 3 years ago. Updated over 2 years ago.

Status:
Fixed
Priority:
Normal
Category:
mod_proxy
Target version:
ASK QUESTIONS IN Forums:
No

Description

One of my lighttpd instances ran out of sockets. The cause was a virtual host with a simple reverse proxy configuration with two backends like this:

      proxy.server = ( "/" => ("blabla" =>
        ("host" => server, "port" => 8002),
        ("host" => server, "port" => 8003)
      ),)

server is a variable containing a VPN-connected IP.
On the remote host, a NAT rule is configured that points to a TCP port on the localhost device.

The remote server is down! But the TCP connections are not refused (e.g. with ICMP unreachable); they are kept open (at the moment I do not know the exact TCP state), either by the VPN, the NAT, or the combination of both.
Every new connection creates a new socket until server.max-fds is reached and lighttpd stops working.
There are NO log messages, neither in the access log nor in the error log.

It took more than a day to hit server.max-fds.

I expect that lighttpd kills open/stale TCP connections after a certain time, even in the connecting phase.

Am I wrong?

If needed, I can try to reproduce this to get the exact state of the TCP connection with netstat or ss.


Related issues: 1 (0 open, 1 closed)

Has duplicate Bug #3103: Constant growing memory usage (w/ hung backend; no timeout) (Duplicate)
Actions #1

Updated by gstrauss almost 3 years ago

I expect that lighttpd kills open/stale TCP connections after a certain time, even in the connecting phase.

lighttpd uses the OS system defaults for TCP, and waits for the kernel to timeout the socket connect() attempt.

The network behavior you describe of no response is not very friendly. lighttpd (reasonably) expects to receive notification from the kernel TCP networking stack that the connection succeeded or failed. What (reasonable) behavior would you like to see here? lighttpd currently does not have a separate set of timeouts for connecting to backends, as the front-end read and write timeouts are applicable after the backend connection has been established (or fails). Also, if the client disconnects, then lighttpd should abort the connection to the backend. Is that not happening?

A band-aid/workaround that you might apply is to increase server.max-fds from the default (4096). I generally run with server.max-fds = 16384. In my development environment, that is more than enough to avoid running out of fds during my typical benchmarking tests.
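
As a rough sketch of that workaround in lighttpd.conf (16384 is simply the figure mentioned above; the right value depends on your traffic and on the process's open-file limit):

  # band-aid: raise the fd limit so stale backend sockets take longer to exhaust lighttpd
  server.max-fds = 16384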

I'll look into what it would take to apply the front-side read or write timeout to connections to the backend. The front-end read timeout (waiting to read from the client) is not in effect when lighttpd is not reading from the client, which can occur while lighttpd is connecting to the backend (e.g. when lighttpd is configured with server.stream-request-body = 2 and has read a sufficient amount that it temporarily suspends waiting for read events from the client).

Actions #2

Updated by gstrauss almost 3 years ago

Are the client connections HTTP/2? And are there other request streams being made on the connection during the attempt to connect to the backend proxy? Maybe additional retries on the same HTTP/2 connection, but without cancelling the prior request streams? Activity on the client connection would reset the front-end connection read and write timeouts. (See the XXX comments in connection.c:connection_check_timeout() where this could be better handled)

Actions #3

Updated by flynn almost 3 years ago

Because I have no access logs, I can only estimate from current and past logs: the monitoring shows constantly growing socket usage over 36h, so this can only be the HTTP/1.1 requests from a monitoring tool (monit).

Other settings are:

server.stream-response-body = 1
server.stream-request-body = 1

Actions #4

Updated by gstrauss almost 3 years ago

That's expected, too. On an HTTP/1.1 connection, when lighttpd finishes reading a request, the read timeout is also disabled, regardless of the server.stream-request-body setting.

lighttpd should add independent timeout setting(s) for backend connections.

Actions #5

Updated by gstrauss almost 3 years ago

If needed, I can try to reproduce this to get the exact state of the TCP connection with netstat or ss.

That might be helpful. Please do so when you have a chance. I have a hunch that the socket connection appears "open". If you set proxy.debug = 3 in lighttpd.conf, you'll be able to see if lighttpd thinks that the TCP connect() to the backend succeeded, and you might see what happens if you make a test request to lighttpd after temporarily taking down the service behind the VPN and NAT.
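
For example, a minimal sketch of that debug setting, placed next to the proxy.server block from the description (level 3 is the value suggested above):

  # verbose mod_proxy trace; shows whether lighttpd thinks connect() to the backend succeeded
  proxy.debug = 3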

I asked:

Also, if the client disconnects, then lighttpd should abort the connection to the backend. Is that not happening?

flynn: is the client connection (from your monit instance) hanging, too? lighttpd should receive POLLRDHUP when the client disconnects, if the client does so cleanly by sending TCP FIN, and lighttpd should then clean up the connection to the backend. What is the TCP connection state of that client connection? If lighttpd is not receiving POLLRDHUP when the client disconnects (and you're running on a Linux kernel which supports POLLRDHUP), then that may be a separate issue that I would like to track down.

Actions #6

Updated by flynn almost 3 years ago

I reproduced the problem using wget:
- the client hangs
- I tested for only a few minutes and then terminated with Ctrl+C (this simulates the timeout in the original monit request)
- the error log shows re-enabling attempts, but only on the first backend on port 8002:

2021-07-13 14:53:30: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:53:32: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:55:02: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:55:04: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:55:07: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:55:09: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:57:24: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:57:26: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:57:29: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:57:31: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:59:47: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:59:49: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 14:59:52: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 14:59:54: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:02:10: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:02:12: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:02:15: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:02:17: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:04:33: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:04:35: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:04:38: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:04:40: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:06:56: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:06:58: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 
2021-07-13 15:07:01: gw_backend.c.238) establishing connection failed: socket: tcp:192.168.107.4:8002: Connection refused
2021-07-13 15:07:03: gw_backend.c.315) gw-server re-enabled: tcp:192.168.107.4:8002 192.168.107.4 8002 

- netstat -an shows hanging/lost sockets on the second backend on port 8003; the number of lost sockets corresponds to the number of re-enabling attempts.
tcp        0      0 192.168.107.1:47946     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:48692     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:49686     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:49334     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:47966     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:48332     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:48314     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:47590     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:49324     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:47578     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:49706     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:48682     192.168.107.4:8003      ESTABLISHED                                                   
tcp        0      0 192.168.107.1:47410     192.168.107.4:8003      ESTABLISHED 

Is there a problem with re-enabling and multiple backends?

Actions #7

Updated by gstrauss almost 3 years ago

The default configuration for proxy.server (and other *.server) is "disable-time" => 1, so if a backend has an issue, it is disabled for 1-2 seconds. To set it to 10 seconds, for example:

proxy.server = ( "/" => ("blabla" =>
  ("host" => server, "port" => 8002, "disable-time" => 10),
  ("host" => server, "port" => 8003, "disable-time" => 10)
),)

After a backend is disabled, there are typically 0 active connections to it, so most of the balancing options will once again find the first listed host as the first candidate to which to connect.

lighttpd treats each "host" as an independent host for disabling/enabling, since the issue could be with the process listening on <host>:8002, even if there are other configured ports, e.g. <host>:9999. "Connection refused" does not indicate that a host and all its ports are unreachable; something like "Network unreachable" would (but lighttpd does not special-case that, and that is not what we appear to be seeing here).

Is there a problem with re-enabling and multiple backends?

It is strange that you are getting Connection refused for 8002, but nothing for 8003. I'll review the code and will think more on this.

Actions #8

Updated by gstrauss almost 3 years ago

If lighttpd is getting "Connection refused" from the backend on port 8002, but the backend is reporting ESTABLISHED connections on port 8003, why is there a difference? Is one port redirected through the VPN and NAT on the backend, but not the other?

If the connections to target port 8003 are in the ESTABLISHED state, then nothing is being reported as "wrong", even though the backend is down. lighttpd does not consider ESTABLISHED to be any type of failure, and is therefore not disabling/re-enabling the backend on that target port.

Yes, lighttpd can and should improve how timeouts are applied to backends in order to better recover from this situation, but I don't think the behavior of the backend VPN/NAT can be considered proper or optimal.

Actions #9

Updated by gstrauss almost 3 years ago

I pushed some patches to the tip of my dev branch (personal/gstrauss/master). The top commit sets a hard-coded connect() timeout of 3 seconds (and that commit will be removed before the feature is delivered).

Trace is issued to error log for connect() timeouts, but trace is issued for read or write timeouts only if (proxy).debug is enabled. Is that reasonable? Should all backend timeouts be logged to error log?

Not yet done: interface/directives to configure timeouts in lighttpd.conf. While I could make it a requirement to configure timeouts per-host (and repeat in lighttpd.conf per-host), I am considering a global setting to make the timeouts inheritable. Per-host settings seem to be a good idea since one backend host might be localhost, and others might be remote, with different timeouts desirable.

The new timeout code applies to all backends that listen on sockets. Notably this does not include mod_cgi. I might extend the read/write timeouts to mod_cgi with same or different config directives. (A connect() timeout does not apply to mod_cgi.)

Actions #10

Updated by gstrauss almost 3 years ago

  • Status changed from New to Patch Pending

I updated the patches on the tip of my dev branch (personal/gstrauss/master) and added to 'host' params:
"connect-timeout"
"write-timeout"
"read-timeout"
defaults will be '0' for each (no timeout)

For those worried about repeating params, I recommend using a lighttpd variable in lighttpd.conf, e.g.

var.host_policy = (
  "check-local"     => "disable", # sample
  "connect-timeout" => 5,         # sample
  "write-timeout"   => 15,        # sample
  "read-timeout"    => 30,        # sample
)

var.host_h1  = ( "host" => "192.168.1.101", "port" => 10000 )
var.host_h1 += host_policy
var.host_h2  = ( "host" => "192.168.1.102", "port" => 10000 )
var.host_h2 += host_policy

proxy.server = ("/" => ( host_h1, host_h2 ) )

Actions #11

Updated by gstrauss over 2 years ago

@flynn
Before committing, I tested the timeouts on my system and they worked as expected.
I hope this helps you to work around your misbehaving backend system.

Actions #12

Updated by flynn over 2 years ago

Tested and works!

On error I now get a return value of 500, but I would expect a 504 (Gateway Timeout).

Actions #13

Updated by gstrauss over 2 years ago

On error I now get a return value of 500, but I would expect a 504 (Gateway Timeout).

If I recall correctly, I had debated whether or not to change this or to leave prior behavior (e.g. results triggered by existing lighttpd timeouts from client side), in case someone was specifically looking for 500 errors.

However, since this is a new set of timeouts, and these timeouts must be explicitly enabled in the config, this patch should give you a 504 unless the response has already started.

--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -2675,6 +2675,8 @@ static void gw_handle_trigger_hctx_timeout(gw_handler_ctx * const hctx, const ch
         } /* else "read" */
     }
     gw_backend_error(hctx, r);
+    if (r->http_status == 500 && !r->resp_body_started && !r->handler_module)
+        r->http_status = 504; /*Gateway Timeout*/
 }

 __attribute_noinline__

Actions #14

Updated by gstrauss over 2 years ago

  • Status changed from Patch Pending to Fixed
Actions #15

Updated by gstrauss over 2 years ago

  • Has duplicate Bug #3103: Constant growing memory usage (w/ hung backend; no timeout) added