Bug #1825: Don't disable backend when overloaded - Lighttpd - lighty labs

Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.

/b.php
usleep(100000);

ab -c 200 -n 1000 http://localhost/b.php

error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket

Why does this keep getting pushed back? It's killing us.

Actions

Copy link

#6

Updated by stbuehler over 15 years ago

Status changed from New to Fixed

This should be fixed with #2040 (r2597)

Actions

Copy link

#7

Updated by Olaf-van-der-Spek over 15 years ago

Status changed from Fixed to Reopened

r2597 reduces the disable time, it doesn't remove it, does it?

Actions

Copy link

#8

Updated by stbuehler over 15 years ago

Status changed from Reopened to Fixed
% Done changed from 0 to 100

Applied in changeset r2632.

Actions

Copy link

#9

Updated by azilber over 15 years ago

Status changed from Fixed to Reopened
Assignee set to stbuehler

using ab -c 1000 -n 100000 on 1.4.23 with this mod_fastcgi patch applied lighttpd will error 500 and never recover with disable-time 0
Stock 1.4.23 backend overloads, disables, then recovers.

Actions

Copy link

#10

Updated by stbuehler over 15 years ago

Missing in 1.5.x set to No

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

Actions

Copy link Download all files

#11

Updated by azilber over 15 years ago

File r2632.mod_fastcgi.patch r2632.mod_fastcgi.patch added
File logsendtest.php logsendtest.php added

stbuehler wrote:

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

Definitely. Stock 1.4.23 was built with the attached patch using: --with ssl --with lua --with ldap
I can reproduce this error on 32 and 64bit.
Lighttpd backend config was reduced to 4 children so backend would overload:

fastcgi.server             = ( ".php" => ( (
                                   "socket" => "/tmp/php-fastcgi.socket" + var.PID,
                                   "bin-path" => "/usr/bin/php-cgi",
                                   "max-procs" => 2,
                                   "disable-time" => 0,
                                   "bin-environment" => (
                                        "PHP_FCGI_CHILDREN" => "2",
                                        "PHP_FCGI_MAX_REQUESTS" => "500" 
                                   ),
                                   "idle-timeout" => 30,
                                   "bin-copy-environment" => ( "PATH", "SHELL", "USER"),
                                   "broken-scriptfilename" => "enable" 
                                 )
                               )
                            )

Note these two consecutive tests:

date
Thu Oct 15 16:26:34 CDT 2009
ab -c 300 -n 10000 http://localhost/logsendtest.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 300
Time taken for tests: 0.929053 seconds
Complete requests: 10000
Failed requests: 260
(Connect: 0, Length: 260, Exceptions: 0)
Write errors: 0
Non-2xx responses: 9892
Total transferred: 5275968 bytes
HTML transferred: 3620508 bytes
Requests per second: 10763.65 [#/sec] (mean)
Time per request: 27.872 [ms] (mean)
Time per request: 0.093 [ms] (mean, across all concurrent requests)
Transfer rate: 5545.43 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 5.0 1 35
Processing: 1 22 49.8 16 396
Waiting: 0 14 48.1 6 384
Total: 1 26 52.5 18 414

Percentage of the requests served within a certain time (ms)
50% 18
66% 22
75% 29
80% 37
90% 42
95% 46
98% 234
99% 395
100% 414 (longest request)

date
Thu Oct 15 16:27:39 CDT 2009
ab -c 10 -n 10000 http://localhost/logsendtest.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 10
Time taken for tests: 1.271489 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Non-2xx responses: 10000
Total transferred: 5340000 bytes
HTML transferred: 3690000 bytes
Requests per second: 7864.79 [#/sec] (mean)
Time per request: 1.271 [ms] (mean)
Time per request: 0.127 [ms] (mean, across all concurrent requests)
Transfer rate: 4100.70 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.4 0 1
Processing: 0 0 0.7 1 1
Waiting: 0 0 0.0 0 0
Total: 0 0 0.8 1 2
WARNING: The median and mean for the processing time are not within a normal deviation
These results are probably not that reliable.
WARNING: The median and mean for the total time are not within a normal deviation
These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 2 (longest request)

---
As you can see, in the first test, there was an overload with 300 concurrent connections. For the 2nd test, I just did 10 concurrent connection, of which every single one failed with a 503 error. Hitting that url with a browser also yields a 503 error. Lighty never recovers with a disable-time of 0 set. When it's set to any other positive value, it is able to recover.

Actions

Copy link

#12

Updated by azilber over 15 years ago

% Done changed from 100 to 90

I'm also attaching the rpm spec file. Grab the 1.4.23 source, drop it into /usr/src/redhat/SOURCES directory along with the r2632 patch above, the spec file into SPECS above it, then in specs do:

rpmbuild -ba --with ssl --with lua --with ldap lighttpd.spec

Then test away.

Actions

Copy link

#13

Updated by azilber over 15 years ago

File lighttpd.spec lighttpd.spec added

rpm spec file 1.4.23 with disable-time patch

Actions

Copy link

#14

Updated by stbuehler over 15 years ago

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

Actions

Copy link

#15

Updated by azilber over 15 years ago

stbuehler wrote:

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

Mine is set to 128 (default). I'm using Centos 5.2 on AWS.

Actions

Copy link

#16

Updated by stbuehler over 15 years ago

Status changed from Reopened to Fixed
% Done changed from 90 to 100

Applied in changeset r2657.

Actions

Copy link

#17

Updated by stbuehler over 15 years ago

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

Actions

Copy link

#18

Updated by Olaf-van-der-Spek over 15 years ago

Status changed from Fixed to Reopened
Target version changed from 1.4.24 to 1.4.26

stbuehler wrote:

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

Thanks for the updates!

By default backends still get disabled. Could this be changed?

2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.3595) all handlers for /browse.php?c69=1&page=2 on .php are down.
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket

Actions

Copy link

#19

Updated by over 15 years ago

Status changed from Reopened to Patch Pending

Actions

Copy link

#20

Updated by stbuehler over 15 years ago

Status changed from Patch Pending to Reopened

Don't see a patch for that part.

I'm not sure whether i really want to completely remove the disable time by default - i guess there was a good reason this was introduced in the first time.

Actions

Copy link

#21

Updated by stbuehler over 15 years ago

Target version changed from 1.4.26 to 1.4.x

Ok, there are some things why i don't want to make 0 the default disable-time:

if your bin-path applications keeps crashing on startup, you'll burn too much cpu power (ok, we could handle this with different default values for disable-time)
if you have only one backend, lighty will try to reconnect 5 times before giving up (but reconnecting i a useful thing if you have more than one backend)
if you have more than one backend, it will never try the second backend if the first failed (balancer problem)
you'll flood the log with failed connects

Now, all these problems could (and probably should) be fixed, but they are not just some small changes imho.

Actions

Copy link

#22

Updated by Olaf-van-der-Spek over 15 years ago

I'm not requesting disable-time = 0.
I'm requesting to not invoke disable-time when the backend is overloaded.

Actions

Copy link

#23

Updated by gstrauss about 9 years ago

Status changed from Reopened to Need Feedback

The error message "backend is overloaded" occurs in mod_fastcgi when connect() fails with EAGAIN, or a delayed connect() when the subsequent getsockopt returns an error in socket_error (getsockopt(hctx->fd, SOL_SOCKET, SO_ERROR, &socket_error, &socket_error_len)).

These are both definitely hard errors and a proper solution requires tuning parameters, such as (Linux) net.core.somaxconn = 65535 in /etc/sysctl.conf, or reducing the server.max-connections in-flight on the frontend. (Of course, reducing the number of concurrent connections attempted by the benchmark tool (e.g. ab) is also a tuning option.)

EAGAIN on connect() would indicate overloaded. For delayed connect(), the socket_error from getsockopt() SOL_SOCKET SO_ERROR would need to be evaluated to see if the error probably indicates overloaded or might be some other type of error, which would be a different hard error from "overloaded".

In the overloaded condition, what should be the behavior? If there are other backends, we want to try those backends. However, we also want to make sure that we do not spin attempting to reconnect to a backend that just had a connection failure.

Since the backend is, indeed overloaded (or has some other hard error) when mod_fastcgi reports "backend is overloaded", it seems reasonable that the backend be disabled for a short period of time to avoid needless spinning and thrashing.

I predict that this ticket will eventually be withdrawn with a recommendation to tune (Linux) kernel option net.core.somaxconn, or to set a lower value in lighttpd config option server.max-connections.

I'm requesting to not invoke disable-time when the backend is overloaded.

Here's an UNTESTED patch to do that. Not recommended. Buyer beware:
Marking this ticket as "Needs Feedback". Please test the following patch and report back.

diff --git a/src/mod_fastcgi.c b/src/mod_fastcgi.c
index e55c06d..91b672b 100644
--- a/src/mod_fastcgi.c
+++ b/src/mod_fastcgi.c
@@ -2744,11 +2744,13 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                                                "socket:", hctx->proc->connection_name);
                        }

+                       #if 0
                        fcgi_host_disable(srv, hctx);
                        log_error_write(srv, __FILE__, __LINE__, "sdssdsd",
                                "backend is overloaded; we'll disable it for", hctx->host->disable_time, "seconds and send the request to another backend instead:",
                                "reconnects:", hctx->reconnects,
                                "load:", host->load);
+                       #endif

                        fastcgi_status_copy_procname(p->statuskey, hctx->host, hctx->proc);
                        buffer_append_string_len(p->statuskey, CONST_STR_LEN(".died"));
@@ -2833,6 +2835,7 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                        /* cool down the backend, it is overloaded
                         * -> EAGAIN */

+                       #if 0
                        if (hctx->host->disable_time) {
                                log_error_write(srv, __FILE__, __LINE__, "sdssdsd",
                                        "backend is overloaded; we'll disable it for", hctx->host->disable_time, "seconds and send the request to another backend instead:",
@@ -2843,6 +2846,7 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                                if (hctx->proc->state == PROC_STATE_RUNNING) hctx->host->active_procs--;
                                hctx->proc->state = PROC_STATE_OVERLOADED;
                        }
+                       #endif

                        fastcgi_status_copy_procname(p->statuskey, hctx->host, hctx->proc);
                        buffer_append_string_len(p->statuskey, CONST_STR_LEN(".overloaded"));

Actions

Copy link

#24

Updated by Olaf-van-der-Spek about 9 years ago

gstrauss wrote:

In the overloaded condition, what should be the behavior? If there are other backends, we want to try those backends. However, we also want to make sure that we do not spin attempting to reconnect to a backend that just had a connection failure.

Since the backend is, indeed overloaded (or has some other hard error) when mod_fastcgi reports "backend is overloaded", it seems reasonable that the backend be disabled for a short period of time to avoid needless spinning and thrashing.

The overload situation might go away in the next millisecond.. if there's a single backend IMO the simplest solution would be to return an error to the client and to retry on the next request that needs that backend. No disabling, as simple as possible.
Lowering server.max-connections is NOT a solution, it'd merely move the problem and potentially cause other problems.

Actions

Copy link

#25

Updated by gstrauss about 9 years ago

Status changed from Need Feedback to Invalid

Well, you haven't had enough time to test the patch I provided, and the behavior change you are requesting is specific to using a single backend. If you have load issues, perhaps you should consider changing that to have more backends?

Then again, if your overload condition is caused by the backend needing to do too much work, the current lighttpd behavior of backing off is quite reasonable. As such, the behavior is unlikely to change to what you have requested.

However, if your overload condition is caused by a huge surge in new connections, increasing the size of the listen backlog, both in the kernel and in your application (which might otherwise set it to a lower value such as 1024), might help this specific scenario.

The root cause of your problem appears to be the size of your listen backlog and is manifesting itself with a hard error returned by the kernel when lighttpd attempts to create a new connection to the backend. If you do not wish to lower server.max-connections on your frontend, then the solution to be able to handle a huge surge in new connections is to tune your kernel listen backlog queue to increase its size. On Linux, you can try net.core.somaxconn = 65535 in /etc/sysctl.conf.

Actions

Copy link

#26

Updated by gstrauss about 9 years ago

It bears repeating that in addition to increasing the kernel setting for listen backlog queue size, the application also needs to request a larger listen backlog queue size in the call to listen(). SOMAXCONN is typically 128, which might be too small.

Depending on your use, lighttpd code might need to be updated, too, since lighttpd hard-codes the value 1024 in network.c, mod_fastcgi.c and mod_scgi.c. (A listen backlog queue size requested larger than the kernel maximum is silently reduced to the kernel maximum)

Actions

Copy link

#27

Updated by gstrauss about 9 years ago

Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen() added

Actions

Copy link

#28

Updated by gstrauss about 9 years ago

Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen() added

Actions

Copy link

#29

Updated by gstrauss about 9 years ago

Related to deleted (Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen())

Actions

Copy link

#30

Updated by Olaf-van-der-Spek about 9 years ago

It's not really related to that one as it's the FastCGI server that's listening in this case, not Lighttpd.

Actions

Copy link

#31

Updated by gstrauss about 9 years ago

It took me about 5 lines of code each to add separate listen-backlog configuration options to server.listen-backlog, mod_fastcgi, and mod_scgi.
https://github.com/lighttpd/lighttpd1.4/pull/50

Actions

Copy link

#32

Updated by stbuehler almost 9 years ago

Assignee deleted (~~stbuehler~~)

Actions

Copy link

#33

Updated by stbuehler almost 9 years ago

Target version deleted (~~1.4.x~~)

r2632.mod_fastcgi.patch (19.1 KB) r2632.mod_fastcgi.patch	mod_fastcgi patch to 1.4.23	azilber, 2009-10-15 21:30
logsendtest.php (281 Bytes) logsendtest.php	Fake connection test php	azilber, 2009-10-15 21:30
lighttpd.spec (2.5 KB) lighttpd.spec		azilber, 2009-10-15 21:37

Related to Feature #2040: Reduce disable-time	Fixed	2009-07-23	Actions
Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen()	Fixed	2009-12-10	Actions
Has duplicate Bug #1854: Lighty 1.4.20 leaks memory and stability problem with backend	Duplicate	2008-12-19	Actions
Has duplicate Bug #575: high-time connections in handle-req impact fastcgi overload calculation	Duplicate		Actions

Project

General

Profile

Lighttpd

Custom queries

Bug #1825

Don't disable backend when overloaded

Updated by Olaf-van-der-Spek over 16 years ago

Updated by icy over 16 years ago

Updated by stbuehler about 16 years ago

Updated by stbuehler almost 16 years ago

Updated by azilber over 15 years ago

Updated by stbuehler over 15 years ago

Updated by Olaf-van-der-Spek over 15 years ago

Updated by stbuehler over 15 years ago

Updated by azilber over 15 years ago

Updated by stbuehler over 15 years ago

Updated by azilber over 15 years ago

Updated by azilber over 15 years ago

Updated by azilber over 15 years ago

Updated by stbuehler over 15 years ago

Updated by azilber over 15 years ago

Updated by stbuehler over 15 years ago

Updated by stbuehler over 15 years ago

Updated by Olaf-van-der-Spek over 15 years ago

Updated by over 15 years ago

Updated by stbuehler over 15 years ago

Updated by stbuehler over 15 years ago

Updated by Olaf-van-der-Spek over 15 years ago

Updated by gstrauss about 9 years ago

Updated by Olaf-van-der-Spek about 9 years ago

Updated by gstrauss about 9 years ago

Updated by gstrauss about 9 years ago

Updated by gstrauss about 9 years ago

Updated by gstrauss about 9 years ago

Updated by gstrauss about 9 years ago

Updated by Olaf-van-der-Spek about 9 years ago

Updated by gstrauss about 9 years ago

Updated by stbuehler almost 9 years ago

Updated by stbuehler almost 9 years ago