Bug #1825

Don't disable backend when overloaded

Added by Olaf-van-der-Spek over 5 years ago. Updated about 4 years ago.

Status:ReopenedStart date:2008-11-18
Priority:NormalDue date:
Assignee:stbuehler% Done:

100%

Category:-
Target version:1.4.x
Missing in 1.5.x:No

Description

Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.

/b.php
usleep(100000);

  1. ab -c 200 -n 1000 http://localhost/b.php

error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket

r2632.mod_fastcgi.patch Magnifier - mod_fastcgi patch to 1.4.23 (19.1 KB) azilber, 2009-10-15 21:30

logsendtest.php Magnifier - Fake connection test php (281 Bytes) azilber, 2009-10-15 21:30

lighttpd.spec (2.5 KB) azilber, 2009-10-15 21:37


Related issues

Related to Feature #2040: Reduce disable-time Fixed 2009-07-23
Duplicated by Bug #1854: Lighty 1.4.20 leaks memory and stability problem with bac... Duplicate 2008-12-19
Duplicated by Bug #575: high-time connections in handle-req impact fastcgi overlo... Duplicate

Associated revisions

Revision 2632
Added by stbuehler over 4 years ago

mod_fastcgi: don't disable backend if disable-time is 0 (fixes #1825)

Revision 2657
Added by stbuehler over 4 years ago

mod_fastcgi: Fix host->active_procs counter, return 503 if connect wasn't successful after 5 tries (fixes #1825)

History

#1 Updated by Olaf-van-der-Spek over 5 years ago

  • Target version set to 1.4.21

#2 Updated by icy about 5 years ago

  • Target version changed from 1.4.21 to 1.4.22

#3 Updated by stbuehler about 5 years ago

  • Target version changed from 1.4.22 to 1.4.23

#4 Updated by stbuehler almost 5 years ago

  • Target version changed from 1.4.23 to 1.4.24

#5 Updated by azilber over 4 years ago

Olaf-van-der-Spek wrote:

Could you not disable a backend when it's overloaded? Just return an error (504). If you do disable the backend, all requests will fail for 2 seconds, which isn't necessary.

/b.php
usleep(100000);

  1. ab -c 200 -n 1000 http://localhost/b.php

error.log:
2008-11-18 20:14:17: (mod_fastcgi.c.2900) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 131
2008-11-18 20:14:17: (mod_fastcgi.c.3542) all handlers for /b.php on .php are down.
2008-11-18 20:14:20: (mod_fastcgi.c.2673) fcgi-server re-enabled: 0 /tmp/php.socket

Why does this keep getting pushed back? It's killing us.

#6 Updated by stbuehler over 4 years ago

  • Status changed from New to Fixed

This should be fixed with #2040 (r2597)

#7 Updated by Olaf-van-der-Spek over 4 years ago

  • Status changed from Fixed to Reopened

r2597 reduces the disable time, it doesn't remove it, does it?

#8 Updated by stbuehler over 4 years ago

  • Status changed from Reopened to Fixed
  • % Done changed from 0 to 100

Applied in changeset r2632.

#9 Updated by azilber over 4 years ago

  • Status changed from Fixed to Reopened
  • Assignee set to stbuehler

using ab -c 1000 -n 100000 on 1.4.23 with this mod_fastcgi patch applied lighttpd will error 500 and never recover with disable-time 0
Stock 1.4.23 backend overloads, disables, then recovers.

#10 Updated by stbuehler over 4 years ago

  • Missing in 1.5.x set to No

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

#11 Updated by azilber over 4 years ago

stbuehler wrote:

Hm, couldn't reproduce that - could you please share all details: config, how many backends, php script?

Definitely. Stock 1.4.23 was built with the attached patch using: --with ssl --with lua --with ldap
I can reproduce this error on 32 and 64bit.
Lighttpd backend config was reduced to 4 children so backend would overload:

fastcgi.server             = ( ".php" => ( (
                                   "socket" => "/tmp/php-fastcgi.socket" + var.PID,
                                   "bin-path" => "/usr/bin/php-cgi",
                                   "max-procs" => 2,
                                   "disable-time" => 0,
                                   "bin-environment" => (
                                        "PHP_FCGI_CHILDREN" => "2",
                                        "PHP_FCGI_MAX_REQUESTS" => "500" 
                                   ),
                                   "idle-timeout" => 30,
                                   "bin-copy-environment" => ( "PATH", "SHELL", "USER"),
                                   "broken-scriptfilename" => "enable" 
                                 )
                               )
                            )

Note these two consecutive tests:

  1. date
    Thu Oct 15 16:26:34 CDT 2009
  2. ab -c 300 -n 10000 http://localhost/logsendtest.php
    This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 300
Time taken for tests: 0.929053 seconds
Complete requests: 10000
Failed requests: 260
(Connect: 0, Length: 260, Exceptions: 0)
Write errors: 0
Non-2xx responses: 9892
Total transferred: 5275968 bytes
HTML transferred: 3620508 bytes
Requests per second: 10763.65 [#/sec] (mean)
Time per request: 27.872 [ms] (mean)
Time per request: 0.093 [ms] (mean, across all concurrent requests)
Transfer rate: 5545.43 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 3 5.0 1 35
Processing: 1 22 49.8 16 396
Waiting: 0 14 48.1 6 384
Total: 1 26 52.5 18 414

Percentage of the requests served within a certain time (ms)
50% 18
66% 22
75% 29
80% 37
90% 42
95% 46
98% 234
99% 395
100% 414 (longest request)


  1. date
    Thu Oct 15 16:27:39 CDT 2009
  2. ab -c 10 -n 10000 http://localhost/logsendtest.php
    This is ApacheBench, Version 2.0.40-dev <$Revision: 1.146 $> apache-2.0
    Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
    Copyright 2006 The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 1000 requests
Completed 2000 requests
Completed 3000 requests
Completed 4000 requests
Completed 5000 requests
Completed 6000 requests
Completed 7000 requests
Completed 8000 requests
Completed 9000 requests
Finished 10000 requests

Server Software: lighttpd/1.4.23
Server Hostname: localhost
Server Port: 80

Document Path: /logsendtest.php
Document Length: 369 bytes

Concurrency Level: 10
Time taken for tests: 1.271489 seconds
Complete requests: 10000
Failed requests: 0
Write errors: 0
Non-2xx responses: 10000
Total transferred: 5340000 bytes
HTML transferred: 3690000 bytes
Requests per second: 7864.79 [#/sec] (mean)
Time per request: 1.271 [ms] (mean)
Time per request: 0.127 [ms] (mean, across all concurrent requests)
Transfer rate: 4100.70 [Kbytes/sec] received

Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 0 0.4 0 1
Processing: 0 0 0.7 1 1
Waiting: 0 0 0.0 0 0
Total: 0 0 0.8 1 2
WARNING: The median and mean for the processing time are not within a normal deviation
These results are probably not that reliable.
WARNING: The median and mean for the total time are not within a normal deviation
These results are probably not that reliable.

Percentage of the requests served within a certain time (ms)
50% 1
66% 1
75% 1
80% 1
90% 1
95% 1
98% 1
99% 1
100% 2 (longest request)

---
As you can see, in the first test, there was an overload with 300 concurrent connections. For the 2nd test, I just did 10 concurrent connection, of which every single one failed with a 503 error. Hitting that url with a browser also yields a 503 error. Lighty never recovers with a disable-time of 0 set. When it's set to any other positive value, it is able to recover.

#12 Updated by azilber over 4 years ago

  • % Done changed from 100 to 90

I'm also attaching the rpm spec file. Grab the 1.4.23 source, drop it into /usr/src/redhat/SOURCES directory along with the r2632 patch above, the spec file into SPECS above it, then in specs do:

rpmbuild -ba --with ssl --with lua --with ldap lighttpd.spec

Then test away.

#13 Updated by azilber over 4 years ago

rpm spec file 1.4.23 with disable-time patch

#14 Updated by stbuehler over 4 years ago

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

#15 Updated by azilber over 4 years ago

stbuehler wrote:

Please test either svn or 1.4.24-rc1 (see http://blog.lighttpd.net/).

But i could reproduce it now (after i dropped my sysctl net.core.somaxconn value), and i think it has a problem counting the active backend.

Mine is set to 128 (default). I'm using Centos 5.2 on AWS.

#16 Updated by stbuehler over 4 years ago

  • Status changed from Reopened to Fixed
  • % Done changed from 90 to 100

Applied in changeset r2657.

#17 Updated by stbuehler over 4 years ago

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

#18 Updated by Olaf-van-der-Spek over 4 years ago

  • Status changed from Fixed to Reopened
  • Target version changed from 1.4.24 to 1.4.26

stbuehler wrote:

The host->active_count fixed the problem for me - it would be nice if you could test it. Thx for your help!

Thanks for the updates!

By default backends still get disabled. Could this be changed?

2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 0 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.2998) backend is overloaded; we'll disable it for 1 seconds and send the request to another backend instead: reconnects: 1 load: 275
2009-12-17 13:56:45: (mod_fastcgi.c.3595) all handlers for /browse.php?c69=1&page=2 on .php are down.
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket
2009-12-17 13:56:47: (mod_fastcgi.c.2761) fcgi-server re-enabled: 0 /tmp/php.socket

#19 Updated by over 4 years ago

  • Status changed from Reopened to Patch Pending

#20 Updated by stbuehler about 4 years ago

  • Status changed from Patch Pending to Reopened

Don't see a patch for that part.

I'm not sure whether i really want to completely remove the disable time by default - i guess there was a good reason this was introduced in the first time.

#21 Updated by stbuehler about 4 years ago

  • Target version changed from 1.4.26 to 1.4.x
Ok, there are some things why i don't want to make 0 the default disable-time:
  • if your bin-path applications keeps crashing on startup, you'll burn too much cpu power (ok, we could handle this with different default values for disable-time)
  • if you have only one backend, lighty will try to reconnect 5 times before giving up (but reconnecting i a useful thing if you have more than one backend)
  • if you have more than one backend, it will never try the second backend if the first failed (balancer problem)
  • you'll flood the log with failed connects

Now, all these problems could (and probably should) be fixed, but they are not just some small changes imho.

#22 Updated by Olaf-van-der-Spek about 4 years ago

I'm not requesting disable-time = 0.
I'm requesting to not invoke disable-time when the backend is overloaded.

Also available in: Atom