Bug #1825
Don't disable backend when overloaded
| Status: | Reopened | Start date: | 2008-11-18 | |
|---|---|---|---|---|
| Priority: | Normal | Due date: | ||
| Assignee: | stbuehler | % Done: | 100% | |
| Category: | - | |||
| Target version: | 1.4.x | |||
| Missing in 1.5.x: | No |
Description
Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.
/b.php
usleep(100000);
- ab -c 200 -n 1000 http://localhost/b.php
error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket
Related issues
Associated revisions
mod_fastcgi: don't disable backend if disable-time is 0 (fixes #1825)
mod_fastcgi: Fix host->active_procs counter, return 503 if connect wasn't successful after 5 tries (fixes #1825)
History
#1 Updated by Olaf-van-der-Spek over 4 years ago
- Target version set to 1.4.21
#2 Updated by icy over 4 years ago
- Target version changed from 1.4.21 to 1.4.22
#3 Updated by stbuehler about 4 years ago
- Target version changed from 1.4.22 to 1.4.23
#4 Updated by stbuehler almost 4 years ago
- Target version changed from 1.4.23 to 1.4.24
#5 Updated by azilber over 3 years ago
Olaf-van-der-Spek wrote:
Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.
/b.php
usleep(100000);
- ab -c 200 -n 1000 http://localhost/b.php
error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket
Why does this keep getting pushed back? It's killing us.
#6 Updated by stbuehler over 3 years ago
- Status changed from New to Fixed
#7 Updated by Olaf-van-der-Spek over 3 years ago
- Status changed from Fixed to Reopened
r2597 reduces the disable time, it doesn't remove it, does it?
#8 Updated by stbuehler over 3 years ago
- Status changed from Reopened to Fixed
- % Done changed from 0 to 100
Applied in changeset r2632.
#9 Updated by azilber over 3 years ago
- Status changed from Fixed to Reopened
- Assignee set to stbuehler
using ab -c 1000 -n 100000 on 1.4.23 with this mod_fastcgi patch applied lighttpd will error 500 and never recover with disable-time 0
Stock 1.4.23 backend overloads, disables, then recovers.
#10 Updated by stbuehler over 3 years ago
- Missing in 1.5.x set to No
Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?
#11 Updated by azilber over 3 years ago
- File r2632.mod_fastcgi.patch
added - File logsendtest.php
added
stbuehler wrote:
Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?
Definitely. Stock 1.4.23 was built with the attached patch using: --with ssl --with lua --with ldap
I can reproduce this error on 32 and 64bit.
Lighttpd backend config was reduced to 4 children so backend would overload:
fastcgi.server = ( ".php" => ( (
"socket" => "/tmp/php-fastcgi.socket" + var.PID,
"bin-path" => "/usr/bin/php-cgi",
"max-procs" => 2,
"disable-time" => 0,
"bin-environment" => (
"PHP_FCGI_CHILDREN" => "2",
"PHP_FCGI_MAX_REQUESTS" => "500"
),
"idle-timeout" => 30,
"bin-copy-environment" => ( "PATH", "SHELL", "USER"),
"broken-scriptfilename" => "enable"
)
)
)
Note these two consecutive tests:
- date
Thu Oct 15 16:26:34 CDT 2009 - ab -c 300 -n 10000 http://localhost/logsendtest.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests
Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80
Document Path: /logsendtest.php
Document Length: 369 bytes
Concurrency Level: 300
Time taken for tests: 0.929053 seconds
Complete requests: 10000
Failed requests: 260
(Connect: 0, Length: 260, Exceptions: 0)
Write errors: 0
Non-2xx responses: 9892
Total transferred: 5275968 bytes
HTML transferred: 3620508 bytes
Requests per second: 10763.65 [#/sec] (mean)
Time per request: 27.872 [ms] (mean)
Time per request: 0.093 [ms] (mean, across all concurrent requests)
Transfer rate: 5545.43 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 5.0 1 35
Processing: 1 22 49.8 16 396
Waiting: 0 14 48.1 6 384
Total: 1 26 52.5 18 414
Percentage of the requests served within a certain time (ms)
50% 18
66% 22
75% 29
80% 37
90% 42
95% 46
98% 234
99% 395
100% 414 (longest request)
- date
Thu Oct 15 16:27:39 CDT 2009 - ab -c 10 -n 10000 http://localhost/logsendtest.php
This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Copyright 2006 The Apache Software Foundation, http://www.apache.org/
Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests
Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80
Document Path: /logsendtest.php
Document Length: 369 bytes
Concurrency Level: 10
Time taken for tests: 1.271489 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Non-2xx responses: 10000
Total transferred: 5340000 bytes
HTML transferred: 3690000 bytes
Requests per second: 7864.79 [#/sec] (mean)
Time per request: 1.271 [ms] (mean)
Time per request: 0.127 [ms] (mean, across all concurrent requests)
Transfer rate: 4100.70 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.4 0 1
Processing: 0 0 0.7 1 1
Waiting: 0 0 0.0 0 0
Total: 0 0 0.8 1 2
WARNING: The median and mean for the processing time are not within a normal deviation
These results are probably not that reliable.
WARNING: The median and mean for the total time are not within a normal deviation
These results are probably not that reliable.
Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 2 (longest request)
---
As you can see, in the first test, there was an overload with 300 concurrent connections. For the 2nd test, I just did 10 concurrent connection, of which every single one failed with a 503 error. Hitting that url with a browser also yields a 503 error. Lighty never recovers with a disable-time of 0 set. When it's set to any other positive value, it is able to recover.
#12 Updated by azilber over 3 years ago
- % Done changed from 100 to 90
I'm also attaching the rpm spec file. Grab the 1.4.23 source, drop it into /usr/src/redhat/SOURCES directory along with the r2632 patch above, the spec file into SPECS above it, then in specs do:
rpmbuild -ba --with ssl --with lua --with ldap lighttpd.spec
Then test away.
#13 Updated by azilber over 3 years ago
- File lighttpd.spec added
rpm spec file 1.4.23 with disable-time patch
#14 Updated by stbuehler over 3 years ago
Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).
But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.
#15 Updated by azilber over 3 years ago
stbuehler wrote:
Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).
But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.
Mine is set to 128 (default). I'm using Centos 5.2 on AWS.
#16 Updated by stbuehler over 3 years ago
- Status changed from Reopened to Fixed
- % Done changed from 90 to 100
Applied in changeset r2657.
#17 Updated by stbuehler over 3 years ago
The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!
#18 Updated by Olaf-van-der-Spek over 3 years ago
- Status changed from Fixed to Reopened
- Target version changed from 1.4.24 to 1.4.26
stbuehler wrote:
The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!
Thanks for the updates!
By default backends still get disabled. Could this be changed?
2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.3595) all handlers for /browse.php?c69=1&page=2 on .php are down.
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket
#19 Updated by over 3 years ago
- Status changed from Reopened to Patch Pending
#20 Updated by stbuehler over 3 years ago
- Status changed from Patch Pending to Reopened
Don't see a patch for that part.
I'm not sure whether i really want to completely remove the disable time by default - i guess there was a good reason this was introduced in the first time.
#21 Updated by stbuehler over 3 years ago
- Target version changed from 1.4.26 to 1.4.x
- if your bin-path applications keeps crashing on startup, you'll burn too much cpu power (ok, we could handle this with different default values for disable-time)
- if you have only one backend, lighty will try to reconnect 5 times before giving up (but reconnecting i a useful thing if you have more than one backend)
- if you have more than one backend, it will never try the second backend if the first failed (balancer problem)
- you'll flood the log with failed connects
Now, all these problems could (and probably should) be fixed, but they are not just some small changes imho.
#22 Updated by Olaf-van-der-Spek over 3 years ago
I'm not requesting disable-time = 0.
I'm requesting to not invoke disable-time when the backend is overloaded.
Also available in: Atom