Project

General

Profile

Actions

Bug #1825

closed

Don't disable backend when overloaded

Added by Olaf-van-der-Spek about 16 years ago. Updated over 8 years ago.

Status:
Invalid
Priority:
Normal
Category:
-
Target version:
-
ASK QUESTIONS IN Forums:

Description

Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.

/b.php
usleep(100000);

  1. ab -c 200 -n 1000 http://localhost/b.php

error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket


Files

r2632.mod_fastcgi.patch (19.1 KB) r2632.mod_fastcgi.patch mod_fastcgi patch to 1.4.23 azilber, 2009-10-15 21:30
logsendtest.php (281 Bytes) logsendtest.php Fake connection test php azilber, 2009-10-15 21:30
lighttpd.spec (2.5 KB) lighttpd.spec azilber, 2009-10-15 21:37

Related issues 4 (0 open4 closed)

Related to Feature #2040: Reduce disable-time Fixed2009-07-23Actions
Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen()Fixed2009-12-10Actions
Has duplicate Bug #1854: Lighty 1.4.20 leaks memory and stability problem with backendDuplicate2008-12-19Actions
Has duplicate Bug #575: high-time connections in handle-req impact fastcgi overload calculationDuplicateActions
Actions #1

Updated by Olaf-van-der-Spek almost 16 years ago

  • Target version set to 1.4.21
Actions #2

Updated by icy almost 16 years ago

  • Target version changed from 1.4.21 to 1.4.22
Actions #3

Updated by stbuehler over 15 years ago

  • Target version changed from 1.4.22 to 1.4.23
Actions #4

Updated by stbuehler over 15 years ago

  • Target version changed from 1.4.23 to 1.4.24
Actions #5

Updated by azilber about 15 years ago

Olaf-van-der-Spek wrote:

Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.

/b.php
usleep(100000);

  1. ab -c 200 -n 1000 http://localhost/b.php

error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket

Why does this keep getting pushed back? It's killing us.

Actions #6

Updated by stbuehler about 15 years ago

  • Status changed from New to Fixed

This should be fixed with #2040 (r2597)

Actions #7

Updated by Olaf-van-der-Spek about 15 years ago

  • Status changed from Fixed to Reopened

r2597 reduces the disable time, it doesn't remove it, does it?

Actions #8

Updated by stbuehler about 15 years ago

  • Status changed from Reopened to Fixed
  • % Done changed from 0 to 100

Applied in changeset r2632.

Actions #9

Updated by azilber about 15 years ago

  • Status changed from Fixed to Reopened
  • Assignee set to stbuehler

using ab -c 1000 -n 100000 on 1.4.23 with this mod_fastcgi patch applied lighttpd will error 500 and never recover with disable-time 0
Stock 1.4.23 backend overloads, disables, then recovers.

Actions #10

Updated by stbuehler about 15 years ago

  • Missing in 1.5.x set to No

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

Updated by azilber about 15 years ago

stbuehler wrote:

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

Definitely. Stock 1.4.23 was built with the attached patch using: --with ssl --with lua --with ldap
I can reproduce this error on 32 and 64bit.
Lighttpd backend config was reduced to 4 children so backend would overload:

fastcgi.server             = ( ".php" => ( (
                                   "socket" => "/tmp/php-fastcgi.socket" + var.PID,
                                   "bin-path" => "/usr/bin/php-cgi",
                                   "max-procs" => 2,
                                   "disable-time" => 0,
                                   "bin-environment" => (
                                        "PHP_FCGI_CHILDREN" => "2",
                                        "PHP_FCGI_MAX_REQUESTS" => "500" 
                                   ),
                                   "idle-timeout" => 30,
                                   "bin-copy-environment" => ( "PATH", "SHELL", "USER"),
                                   "broken-scriptfilename" => "enable" 
                                 )
                               )
                            )

Note these two consecutive tests:

  1. date
    Thu Oct 15 16:26:34 CDT 2009
  2. ab -c 300 -n 10000 http://localhost/logsendtest.php
    This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 300
Time taken for tests: 0.929053 seconds
Complete requests: 10000
Failed requests: 260
(Connect: 0, Length: 260, Exceptions: 0)
Write errors: 0
Non-2xx responses: 9892
Total transferred: 5275968 bytes
HTML transferred: 3620508 bytes
Requests per second: 10763.65 [#/sec] (mean)
Time per request: 27.872 [ms] (mean)
Time per request: 0.093 [ms] (mean, across all concurrent requests)
Transfer rate: 5545.43 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 5.0 1 35
Processing: 1 22 49.8 16 396
Waiting: 0 14 48.1 6 384
Total: 1 26 52.5 18 414

Percentage of the requests served within a certain time (ms)
50% 18
66% 22
75% 29
80% 37
90% 42
95% 46
98% 234
99% 395
100% 414 (longest request)


  1. date
    Thu Oct 15 16:27:39 CDT 2009
  2. ab -c 10 -n 10000 http://localhost/logsendtest.php
    This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 10
Time taken for tests: 1.271489 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Non-2xx responses: 10000
Total transferred: 5340000 bytes
HTML transferred: 3690000 bytes
Requests per second: 7864.79 [#/sec] (mean)
Time per request: 1.271 [ms] (mean)
Time per request: 0.127 [ms] (mean, across all concurrent requests)
Transfer rate: 4100.70 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.4 0 1
Processing: 0 0 0.7 1 1
Waiting: 0 0 0.0 0 0
Total: 0 0 0.8 1 2
WARNING: The median and mean for the processing time are not within a normal deviation
These results are probably not that reliable.
WARNING: The median and mean for the total time are not within a normal deviation
These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 2 (longest request)

---
As you can see, in the first test, there was an overload with 300 concurrent connections. For the 2nd test, I just did 10 concurrent connection, of which every single one failed with a 503 error. Hitting that url with a browser also yields a 503 error. Lighty never recovers with a disable-time of 0 set. When it's set to any other positive value, it is able to recover.

Actions #12

Updated by azilber about 15 years ago

  • % Done changed from 100 to 90

I'm also attaching the rpm spec file. Grab the 1.4.23 source, drop it into /usr/src/redhat/SOURCES directory along with the r2632 patch above, the spec file into SPECS above it, then in specs do:

rpmbuild -ba --with ssl --with lua --with ldap lighttpd.spec

Then test away.

Actions #13

Updated by azilber about 15 years ago

rpm spec file 1.4.23 with disable-time patch

Actions #14

Updated by stbuehler about 15 years ago

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

Actions #15

Updated by azilber about 15 years ago

stbuehler wrote:

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

Mine is set to 128 (default). I'm using Centos 5.2 on AWS.

Actions #16

Updated by stbuehler about 15 years ago

  • Status changed from Reopened to Fixed
  • % Done changed from 90 to 100

Applied in changeset r2657.

Actions #17

Updated by stbuehler about 15 years ago

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

Actions #18

Updated by Olaf-van-der-Spek almost 15 years ago

  • Status changed from Fixed to Reopened
  • Target version changed from 1.4.24 to 1.4.26

stbuehler wrote:

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

Thanks for the updates!

By default backends still get disabled. Could this be changed?

2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.3595) all handlers for /browse.php?c69=1&page=2 on .php are down.
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket

Actions #19

Updated by almost 15 years ago

  • Status changed from Reopened to Patch Pending
Actions #20

Updated by stbuehler almost 15 years ago

  • Status changed from Patch Pending to Reopened

Don't see a patch for that part.

I'm not sure whether i really want to completely remove the disable time by default - i guess there was a good reason this was introduced in the first time.

Actions #21

Updated by stbuehler almost 15 years ago

  • Target version changed from 1.4.26 to 1.4.x
Ok, there are some things why i don't want to make 0 the default disable-time:
  • if your bin-path applications keeps crashing on startup, you'll burn too much cpu power (ok, we could handle this with different default values for disable-time)
  • if you have only one backend, lighty will try to reconnect 5 times before giving up (but reconnecting i a useful thing if you have more than one backend)
  • if you have more than one backend, it will never try the second backend if the first failed (balancer problem)
  • you'll flood the log with failed connects

Now, all these problems could (and probably should) be fixed, but they are not just some small changes imho.

Actions #22

Updated by Olaf-van-der-Spek almost 15 years ago

I'm not requesting disable-time = 0.
I'm requesting to not invoke disable-time when the backend is overloaded.

Actions #23

Updated by gstrauss over 8 years ago

  • Status changed from Reopened to Need Feedback

The error message "backend is overloaded" occurs in mod_fastcgi when connect() fails with EAGAIN, or a delayed connect() when the subsequent getsockopt returns an error in socket_error (getsockopt(hctx->fd, SOL_SOCKET, SO_ERROR, &socket_error, &socket_error_len)).

These are both definitely hard errors and a proper solution requires tuning parameters, such as (Linux) net.core.somaxconn = 65535 in /etc/sysctl.conf, or reducing the server.max-connections in-flight on the frontend. (Of course, reducing the number of concurrent connections attempted by the benchmark tool (e.g. ab) is also a tuning option.)

EAGAIN on connect() would indicate overloaded. For delayed connect(), the socket_error from getsockopt() SOL_SOCKET SO_ERROR would need to be evaluated to see if the error probably indicates overloaded or might be some other type of error, which would be a different hard error from "overloaded".

In the overloaded condition, what should be the behavior? If there are other backends, we want to try those backends. However, we also want to make sure that we do not spin attempting to reconnect to a backend that just had a connection failure.

Since the backend is, indeed overloaded (or has some other hard error) when mod_fastcgi reports "backend is overloaded", it seems reasonable that the backend be disabled for a short period of time to avoid needless spinning and thrashing.

I predict that this ticket will eventually be withdrawn with a recommendation to tune (Linux) kernel option net.core.somaxconn, or to set a lower value in lighttpd config option server.max-connections.

I'm requesting to not invoke disable-time when the backend is overloaded.

Here's an UNTESTED patch to do that. Not recommended. Buyer beware:
Marking this ticket as "Needs Feedback". Please test the following patch and report back.

diff --git a/src/mod_fastcgi.c b/src/mod_fastcgi.c
index e55c06d..91b672b 100644
--- a/src/mod_fastcgi.c
+++ b/src/mod_fastcgi.c
@@ -2744,11 +2744,13 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                                                "socket:", hctx->proc->connection_name);
                        }

+                       #if 0
                        fcgi_host_disable(srv, hctx);
                        log_error_write(srv, __FILE__, __LINE__, "sdssdsd",
                                "backend is overloaded; we'll disable it for", hctx->host->disable_time, "seconds and send the request to another backend instead:",
                                "reconnects:", hctx->reconnects,
                                "load:", host->load);
+                       #endif

                        fastcgi_status_copy_procname(p->statuskey, hctx->host, hctx->proc);
                        buffer_append_string_len(p->statuskey, CONST_STR_LEN(".died"));
@@ -2833,6 +2835,7 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                        /* cool down the backend, it is overloaded
                         * -> EAGAIN */

+                       #if 0
                        if (hctx->host->disable_time) {
                                log_error_write(srv, __FILE__, __LINE__, "sdssdsd",
                                        "backend is overloaded; we'll disable it for", hctx->host->disable_time, "seconds and send the request to another backend instead:",
@@ -2843,6 +2846,7 @@ static handler_t fcgi_write_request(server *srv, handler_ctx *hctx) {
                                if (hctx->proc->state == PROC_STATE_RUNNING) hctx->host->active_procs--;
                                hctx->proc->state = PROC_STATE_OVERLOADED;
                        }
+                       #endif

                        fastcgi_status_copy_procname(p->statuskey, hctx->host, hctx->proc);
                        buffer_append_string_len(p->statuskey, CONST_STR_LEN(".overloaded"));

Actions #24

Updated by Olaf-van-der-Spek over 8 years ago

gstrauss wrote:

In the overloaded condition, what should be the behavior? If there are other backends, we want to try those backends. However, we also want to make sure that we do not spin attempting to reconnect to a backend that just had a connection failure.

Since the backend is, indeed overloaded (or has some other hard error) when mod_fastcgi reports "backend is overloaded", it seems reasonable that the backend be disabled for a short period of time to avoid needless spinning and thrashing.

The overload situation might go away in the next millisecond.. if there's a single backend IMO the simplest solution would be to return an error to the client and to retry on the next request that needs that backend. No disabling, as simple as possible.
Lowering server.max-connections is NOT a solution, it'd merely move the problem and potentially cause other problems.

Actions #25

Updated by gstrauss over 8 years ago

  • Status changed from Need Feedback to Invalid

Well, you haven't had enough time to test the patch I provided, and the behavior change you are requesting is specific to using a single backend. If you have load issues, perhaps you should consider changing that to have more backends?

Then again, if your overload condition is caused by the backend needing to do too much work, the current lighttpd behavior of backing off is quite reasonable. As such, the behavior is unlikely to change to what you have requested.

However, if your overload condition is caused by a huge surge in new connections, increasing the size of the listen backlog, both in the kernel and in your application (which might otherwise set it to a lower value such as 1024), might help this specific scenario.

The root cause of your problem appears to be the size of your listen backlog and is manifesting itself with a hard error returned by the kernel when lighttpd attempts to create a new connection to the backend. If you do not wish to lower server.max-connections on your frontend, then the solution to be able to handle a huge surge in new connections is to tune your kernel listen backlog queue to increase its size. On Linux, you can try net.core.somaxconn = 65535 in /etc/sysctl.conf.

Actions #26

Updated by gstrauss over 8 years ago

It bears repeating that in addition to increasing the kernel setting for listen backlog queue size, the application also needs to request a larger listen backlog queue size in the call to listen(). SOMAXCONN is typically 128, which might be too small.

Depending on your use, lighttpd code might need to be updated, too, since lighttpd hard-codes the value 1024 in network.c, mod_fastcgi.c and mod_scgi.c. (A listen backlog queue size requested larger than the kernel maximum is silently reduced to the kernel maximum)

Actions #27

Updated by gstrauss over 8 years ago

  • Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen() added
Actions #28

Updated by gstrauss over 8 years ago

  • Related to Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen() added
Actions #29

Updated by gstrauss over 8 years ago

  • Related to deleted (Feature #2116: add server.listen-backlog option instead of hard-coded value (128 * 8) for listen())
Actions #30

Updated by Olaf-van-der-Spek over 8 years ago

It's not really related to that one as it's the FastCGI server that's listening in this case, not Lighttpd.

Actions #31

Updated by gstrauss over 8 years ago

It took me about 5 lines of code each to add separate listen-backlog configuration options to server.listen-backlog, mod_fastcgi, and mod_scgi.
https://github.com/lighttpd/lighttpd1.4/pull/50

Actions #32

Updated by stbuehler over 8 years ago

  • Assignee deleted (stbuehler)
Actions #33

Updated by stbuehler over 8 years ago

  • Target version deleted (1.4.x)
Actions

Also available in: Atom