Bug #575
closed
high-time connections in handle-req impact fastcgi overload calculation
Description
This ticket is a summary of details presented to Jan via IRC on 2006-03-10.
In a pool of six lighttpd heads receiving traffic from a load balancer, all six heads reached a terminal overload state from which they could not recover without a restart. According to internal statistics, fastcgi load was 100+ on each head. After lighttpd was restarted on a head and the load balancer picked it up again, fastcgi load stabilized at ~20.
fastcgi.backend.main-php.0.connected: 205994
fastcgi.backend.main-php.0.died: 0
fastcgi.backend.main-php.0.disabled: 0
fastcgi.backend.main-php.0.load: 144
fastcgi.backend.main-php.0.overloaded: 488
fastcgi.backend.main-php.1.connected: 155287
fastcgi.backend.main-php.1.died: 0
fastcgi.backend.main-php.1.disabled: 0
fastcgi.backend.main-php.1.load: 144
fastcgi.backend.main-php.1.overloaded: 488
fastcgi.backend.main-php.load: 288
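As a cross-check on the counters above, here is a minimal Python sketch (not part of the original report) that parses them and confirms the pool-level load is the sum of the per-backend loads. The helper names `parse_counters` and `backend_loads` are hypothetical, and the sketch assumes the counters are available as `name: value` text, one per line.

```python
import re

# Counters copied from the report above.
STATS = """\
fastcgi.backend.main-php.0.connected: 205994
fastcgi.backend.main-php.0.died: 0
fastcgi.backend.main-php.0.disabled: 0
fastcgi.backend.main-php.0.load: 144
fastcgi.backend.main-php.0.overloaded: 488
fastcgi.backend.main-php.1.connected: 155287
fastcgi.backend.main-php.1.died: 0
fastcgi.backend.main-php.1.disabled: 0
fastcgi.backend.main-php.1.load: 144
fastcgi.backend.main-php.1.overloaded: 488
fastcgi.backend.main-php.load: 288
"""


def parse_counters(text):
    """Return {counter_name: int} from 'name: value' lines."""
    counters = {}
    for line in text.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            counters[name.strip()] = int(value)
    return counters


def backend_loads(counters, pool):
    """Return {backend_index: load} for one pool (hypothetical helper)."""
    pattern = re.compile(r"^fastcgi\.backend\." + re.escape(pool) + r"\.(\d+)\.load$")
    loads = {}
    for name, value in counters.items():
        match = pattern.match(name)
        if match:
            loads[int(match.group(1))] = value
    return loads


counters = parse_counters(STATS)
loads = backend_loads(counters, "main-php")
print(loads, sum(loads.values()), counters["fastcgi.backend.main-php.load"])
```

Both backends report the same load here, so the imbalance is not between backends; the whole pool is saturated.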
Confirmed at the load balancer that this was not a high amount of inbound traffic. lighttpd server status showed a reasonable distribution of various pages waiting in handle-req status with high values for the Time column.
338 connections
hWhhhhrhhhhhhhhhWrhhhrhhhhhhhrhWrhrhhhhhhhWhhhhhhh
hhhhhhhhhrhhrhhhhhhhhhhhhhhhhhhhhrhhhhhhhhhhhhrrhh
rhhhhhrWrrrrhhhhhhrhhhhhhhrhhhhhrhhhhhhrhWhhhhrrhr
hhrhhhhhhhhhhhhWhhhrhhhrhhrhhhrhhhWhhhhhhhhhhhrhhh
hhrrhhrhhrhhhrhrrhhhhhWhhhhhhhWhrhrrrhhhrrhhhhrhhh
WWrrhrrrrWrhrhWrrrrrrrhrWhrrhrrhhrhhhhrhrhhhWhrWrr
hrhrhhhhhhhhrhhrhhhWhrhhhrrrrrrhhhhhhh
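To put a number on how much of the connection map is stuck in handle-req, a short Python sketch can tally the state letters. This is only an illustration: the legend (`h` = handle-req, `r` = read, `W` = write) is my reading of the server-status page and should be treated as an assumption, and `count_states` and `handle_req_share` are hypothetical helper names.

```python
from collections import Counter

# Assumed legend for the server-status letters; verify against your own page.
LEGEND = {"h": "handle-req", "r": "read", "W": "write"}


def count_states(status_map):
    """Count connections per state letter, ignoring whitespace and newlines."""
    letters = "".join(status_map.split())
    return Counter(letters)


def handle_req_share(status_map):
    """Fraction of connections in the 'h' state (0.0 for an empty map)."""
    counts = count_states(status_map)
    total = sum(counts.values())
    return counts["h"] / total if total else 0.0


# First row of the map from the report, used as sample input.
first_row = "hWhhhhrhhhhhhhhhWrhhhrhhhhhhhrhWrhrhhhhhhhWhhhhhhh"
for letter, n in count_states(first_row).most_common():
    print(LEGEND.get(letter, letter), n)
```

Pasting the full map into `handle_req_share` gives the overall share of connections waiting on the backend.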
Approximately 150 connections shown in handle-req status have Time of 2756 or higher. Approximately 30-40 connections of this set have Time of 5000 or higher.
The lighttpd error log shows continual overload status, causing a repeating cycle of disable, wait, re-enable. Heads will not recover without a restart, but a head works fine once it has been restarted.
Based on discussion via IRC, as a workaround measure, plan is to add a global timeout for handle-req, such that these long-running connections in handle-req status will be shed.
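The shedding idea can be sketched in a few lines of Python. This is a toy model of the policy only, not lighttpd's implementation and not the patch discussed on IRC; the `Conn` type, the `shed_stale` function, and the 600-second limit are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Conn:
    uri: str
    state: str             # e.g. "handle-req", "read", "write"
    seconds_in_state: int  # the "Time" column in server status


def shed_stale(conns, max_handle_req_seconds):
    """Split connections into (kept, shed) using a global handle-req timeout."""
    kept, shed = [], []
    for conn in conns:
        stale = (conn.state == "handle-req"
                 and conn.seconds_in_state > max_handle_req_seconds)
        (shed if stale else kept).append(conn)
    return kept, shed


# Sample values loosely modelled on the Time figures quoted above.
conns = [
    Conn("/a.php", "handle-req", 2756),
    Conn("/b.php", "handle-req", 5000),
    Conn("/c.php", "handle-req", 12),
    Conn("/d.png", "write", 9000),
]
kept, shed = shed_stale(conns, max_handle_req_seconds=600)
print(len(kept), "kept;", len(shed), "shed")
```

Only connections in handle-req are considered, so slow writes to clients are left to the existing write-idle handling.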
-Jacob
-- moorman
Updated by Anonymous over 18 years ago
I see the same condition with lighttpd-1.4.11. Over time, many php fastcgi processes build up with large handle-req times. These php processes can be successfully killed and are then respawned. I do not, however, see anything in the lighttpd error log corresponding with processes falling into this state. PHP is not segfaulting, nor running out of memory.
The same behavior occurs with identical builds of PHP 5.1.2 and 5.1.6, the latter of which has a completely re-written fastcgi implementation. lighttpd-1.4.11 on AMD64 RHEL4.
-- jbyers
Updated by Anonymous over 17 years ago
I think the problem still persists in 1.4.16.
My log is full of this:
2007-08-08 11:02:46: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:49: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:02:49: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
2007-08-08 11:02:59: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:59: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:03:02: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
...
and while it isn't all locked-up, it fills with:
2007-08-08 11:21:37: (server.c.1165) NOTE: a request for /foo timed out after writing 26280 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle
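For anyone wanting to measure how often the overload / re-enable cycle above repeats, here is a rough Python sketch that counts the relevant entries. It assumes the message texts shown in this comment and tolerates entries that were wrapped across lines when pasted; `join_wrapped` and `summarize` are hypothetical helper names.

```python
import re

# Entries start with a timestamp such as "2007-08-08 11:02:46: ".
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}: ")

SAMPLE = """\
2007-08-08 11:02:46: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and
send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:49: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:02:49: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
2007-08-08 11:02:59: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and
send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:59: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:03:02: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
"""


def join_wrapped(text):
    """Re-join entries whose continuation lines lack a leading timestamp."""
    entries = []
    for line in text.splitlines():
        if TIMESTAMP.match(line) or not entries:
            entries.append(line.rstrip())
        else:
            entries[-1] += " " + line.strip()
    return entries


def summarize(text):
    """Count overload and re-enable events and collect the reported loads."""
    summary = {"overloaded": 0, "re-enabled": 0, "loads": []}
    for entry in join_wrapped(text):
        if "backend is overloaded" in entry:
            summary["overloaded"] += 1
            match = re.search(r"load: (\d+)", entry)
            if match:
                summary["loads"].append(int(match.group(1)))
        elif "fcgi-server re-enabled" in entry:
            summary["re-enabled"] += 1
    return summary


print(summarize(SAMPLE))
```

A load that stays at the same value across cycles, as it does here, suggests the backend is re-enabled without any of the stuck requests having finished.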
-- sblam
Updated by Anonymous about 17 years ago
I still experience this same issue in 1.4.18; after a server reboot it might work for another couple of weeks.
Updated by oschonrock about 17 years ago
We saw what may be a related issue with overloading (to do with PHP not indicating to lighty that it is in fact overloaded):
Have you considered trying to launch the php-fcgi server separately with spawn_fcgi, as described in that issue?
Updated by Anonymous about 17 years ago
We also experience this problem on a regular basis across three web servers under reasonable load (around 1M hits per day each - although the problem does not appear related to load and often occurs well outside of peak times).
We see the problem with the following configurations:
- PHP 4.4.4 (eAccelerator) under spawn_fcgi, lighttpd 1.4.13
- PHP 5.2.5 (XCache/Suhosin) spawned directly by Lighty, lighttpd 1.4.18
I have altered the priority, as this appears to be a show-stopping bug for PHP FastCGI under lighttpd.
Has anyone tried 1.5.x-svn?
-- pat
Updated by Anonymous almost 17 years ago
Same problem here, I was advised to upgrade to 1.5.x branch. I doubt that will change anything.
-- Aleksey Korzun
Updated by Anonymous almost 17 years ago
Same issues here. Has anyone experienced issues with the patch supplied?
I would like to see some action in this "bug" (I know it is basically a PHP-not-obeying-fastcgi-standards-issue).
Thank you!
-- ff
Updated by Anonymous almost 17 years ago
WORKING RESOLUTION:
Given the comment above, given that the 1.5.x branch is now close to release, and given that 1.4.x was causing severe instability in our production environment, it seemed prudent to try 1.5.x to determine whether this would have any effect. I built 1.5.0-r1992 from SVN using the following configuration:
./configure --prefix=/usr --libdir=/usr/lib/lighttpd \
  --with-bzip2 \
  --with-attr \
  --with-linux-aio \
  --with-openssl=/usr/include/openssl
/etc/lighttpd.conf
[...]
proxy-core.balancer = "sqf"
proxy-core.allow-x-sendfile = "enable"
proxy-core.allow-x-rewrite = "enable"
$HTTP["url"] =~ "\.php" {
  proxy-core.protocol = "fastcgi"
  proxy-core.max-pool-size = 4 # (set to same as PHP_FCGI_CHILDREN)
  proxy-core.backends = ( "unix:/tmp/.fcgi-php.socket" )
  proxy-core.rewrite-request = ( "_pathinfo" => ( "\.php(/.*)" => "$1" ) )
}
[...]
This configuration has thus far resolved the PHP lock-up issue that we have been experiencing. We have not experienced server downtime for over 4 days (we were previously experiencing downtime on individual members of our cluster several times per day).
In reference to the above comment (ff@nodomain.cc):
I don't pretend to be an expert (and indeed I know little about the FastCGI protocol); however, several people have suggested that PHP's mis-implementation of the FastCGI protocol does not cause issues when running under spawn-php. I do not know whether this is indeed the case, but I experienced the issue described in this ticket under both configurations (spawn-php or lighttpd-spawned interpreters), as noted in my earlier post. It is therefore possible that these issues are entirely separate, but I am not able to determine this.
If it is of any use to those who may be attempting to debug this issue, it is worth noting that I also experienced this issue using all four of the following configurations (under lighttpd 1.4.x):
- spawn-php over TCP/IP
- spawn-php over unix socket
- lighttpd spawns single PHP process which spawns own children (unix socket)
- lighttpd spawns many individual PHP interpreters (unix socket)
Cheers,
Patrick
-- pat
Updated by Anonymous almost 17 years ago
I've upgraded to 1.5 now and I no longer get a build-up of handle-req; now it's write-content connection times that go into the high thousands. I've set server.max-write-idle to 200 but that hasn't solved anything. Any ideas?
Updated by Anonymous almost 17 years ago
Thanks, Pat.
I will wait until 1.5 is stable to roll it out to production. This looks promising so far!
-- Aleksey Korzun
Updated by georgexsh over 16 years ago
It seems that 1.4.19 + php 5.2.4 + xcache have the same issue.
Updated by Rich over 15 years ago
does the new 1.4.23 release address this?
-- Rich
Updated by azilber over 15 years ago
Rich wrote:
does the new 1.4.23 release address this?
-- Rich
Apparently not, we're still having the same issue. Over a year and still backend overloads. If anything this is the single biggest issue for us in a high volume production environment.
Updated by stbuehler about 15 years ago
- Status changed from New to Duplicate
- Assignee deleted (jan)
- Priority changed from Urgent to Normal
- Missing in 1.5.x set to No
I think this should be fixed in 1.4.24, see #1825.