Bug #575

closed

high-time connections in handle-req impact fastcgi overload calculation

Added by Anonymous almost 19 years ago. Updated about 15 years ago.

Status:
Duplicate
Priority:
Normal
Category:
mod_fastcgi
Target version:
-

Description

This ticket is a summary of details presented to Jan via IRC on 2006-03-10.

In a pool of six lighttpd heads receiving traffic from a load balancer, all six heads reached a terminal overload state from which they could not recover without a restart. Internal statistics showed a fastcgi load of 100+ on each head. After lighttpd was restarted on a head and the load balancer picked it up again, its fastcgi load stabilized at ~20.


fastcgi.backend.main-php.0.connected: 205994
fastcgi.backend.main-php.0.died: 0
fastcgi.backend.main-php.0.disabled: 0
fastcgi.backend.main-php.0.load: 144
fastcgi.backend.main-php.0.overloaded: 488
fastcgi.backend.main-php.1.connected: 155287
fastcgi.backend.main-php.1.died: 0
fastcgi.backend.main-php.1.disabled: 0
fastcgi.backend.main-php.1.load: 144
fastcgi.backend.main-php.1.overloaded: 488
fastcgi.backend.main-php.load: 288

We confirmed at the load balancer that inbound traffic was not unusually high. The lighttpd server status showed a reasonable distribution of pages waiting in handle-req state, with high values in the Time column.


338 connections
hWhhhhrhhhhhhhhhWrhhhrhhhhhhhrhWrhrhhhhhhhWhhhhhhh
hhhhhhhhhrhhrhhhhhhhhhhhhhhhhhhhhrhhhhhhhhhhhhrrhh
rhhhhhrWrrrrhhhhhhrhhhhhhhrhhhhhrhhhhhhrhWhhhhrrhr
hhrhhhhhhhhhhhhWhhhrhhhrhhrhhhrhhhWhhhhhhhhhhhrhhh
hhrrhhrhhrhhhrhrrhhhhhWhhhhhhhWhrhrrrhhhrrhhhhrhhh
WWrrhrrrrWrhrhWrrrrrrrhrWhrrhrrhhrhhhhrhrhhhWhrWrr
hrhrhhhhhhhhrhhrhhhWhrhhhrrrrrrhhhhhhh
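To put a number on "a reasonable distribution", the grid can be tallied per state. This is a minimal Python sketch, assuming the mod_status legend in which `h` is handle-request, `r` is read and `W` is write; the function name is hypothetical, and any other character is counted as "other" rather than guessed at.

```python
from collections import Counter

# Assumed legend, based on lighttpd's mod_status page:
# 'h' = handle-request, 'r' = read, 'W' = write.
STATE_NAMES = {"h": "handle-req", "r": "read", "W": "write"}


def tally_status_grid(grid: str) -> Counter:
    """Count connections per state in a mod_status connection grid."""
    counts = Counter()
    for ch in grid:
        if ch.isspace():
            continue  # row breaks are layout, not connections
        counts[STATE_NAMES.get(ch, "other")] += 1
    return counts


if __name__ == "__main__":
    # Sample input: the last row of the grid above.
    row = "hrhrhhhhhhhhrhhrhhhWhrhhhrrrrrrhhhhhhh"
    counts = tally_status_grid(row)
    print(sum(counts.values()), dict(counts))
```

Pasting all seven rows in as one multi-line string gives the handle-req share of the 338 connections directly.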

Approximately 150 connections shown in handle-req status have Time of 2756 or higher. Approximately 30-40 connections of this set have Time of 5000 or higher.

The lighttpd error log shows a continual cycle of overload, disable, wait and re-enable. The heads will not recover without a restart, but a head works fine once it has been restarted.
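The ticket title suggests why the cycle never ends: connections stuck in handle-req keep counting toward the backend's load, so re-enabling the backend changes nothing. The toy model below illustrates that idea in Python. It is not mod_fastcgi's code; the threshold, the request rate and the share of stuck requests are made up, and only the 2-second disable window is taken from the log excerpts later in this thread.

```python
from dataclasses import dataclass

DISABLE_SECONDS = 2       # the "disable it for 2 seconds" window seen in the error log
OVERLOAD_THRESHOLD = 100  # hypothetical threshold, chosen only for illustration


@dataclass
class Backend:
    load: int = 0             # requests counted as in flight on this backend
    disabled_until: int = 0   # backend refuses requests until this time
    overload_events: int = 0  # how many times it has been disabled

    def dispatch(self, now: int) -> bool:
        """Try to hand one request to the backend; return False if refused."""
        if now < self.disabled_until:
            return False
        if self.load >= OVERLOAD_THRESHOLD:
            # Disable, wait, re-enable. Nothing here resets `load`, so the
            # stuck requests are still counted when the backend comes back.
            self.disabled_until = now + DISABLE_SECONDS
            self.overload_events += 1
            return False
        self.load += 1
        return True

    def complete(self) -> None:
        """A request finished normally and releases its unit of load."""
        self.load -= 1


def simulate(seconds: int, reqs_per_second: int, stuck_every: int) -> Backend:
    """Every `stuck_every`-th accepted request hangs in handle-req forever
    and never calls complete(); all others finish within the same tick."""
    backend = Backend()
    accepted = 0
    for now in range(seconds):
        for _ in range(reqs_per_second):
            if backend.dispatch(now):
                accepted += 1
                if accepted % stuck_every != 0:
                    backend.complete()
    return backend


if __name__ == "__main__":
    b = simulate(seconds=200, reqs_per_second=20, stuck_every=10)
    print(b.load, b.overload_events)  # prints: 100 75
```

Once load reaches the threshold, every re-enable is followed by an immediate overload. Only constructing a fresh `Backend`, the analogue of restarting lighttpd, brings the load back to zero, which matches the observation that heads recover only after a restart.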

Based on the IRC discussion, the plan is to add a global timeout for handle-req as a workaround, so that these long-running connections in handle-req state are shed.
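The planned workaround can be sketched as a periodic reaper over the connection table. This is a minimal Python model, not lighttpd code: the timeout value, the `Connection` and `Server` types, and the assumption that each handle-req connection holds exactly one unit of backend load are all illustrative.

```python
from dataclasses import dataclass, field

HANDLE_REQ_TIMEOUT = 300.0  # hypothetical global timeout, in seconds


@dataclass
class Connection:
    conn_id: int
    state: str           # e.g. "handle-req", "read", "write"
    state_since: float   # time at which the connection entered `state`


@dataclass
class Server:
    backend_load: int = 0                            # units of FastCGI load in use
    connections: dict = field(default_factory=dict)  # conn_id -> Connection

    def reap_stuck(self, now: float, timeout: float = HANDLE_REQ_TIMEOUT) -> list:
        """Shed connections that have sat in handle-req longer than `timeout`.

        Assumes each handle-req connection holds one unit of backend load,
        so shedding it also lets the load counter fall again."""
        stuck = [c.conn_id for c in self.connections.values()
                 if c.state == "handle-req" and now - c.state_since > timeout]
        for conn_id in stuck:
            del self.connections[conn_id]  # a real server would close the socket here
            self.backend_load -= 1
        return stuck
```

Called periodically (for example once per second), any timeout well below the ~2756 seconds reported above would have shed the ~150 long-running handle-req connections and returned their load to the backend.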

-Jacob

-- moorman


Related issues: 1 (0 open, 1 closed)

Is duplicate of Bug #1825: Don't disable backend when overloaded (Invalid, 2008-11-18)
Actions #1

Updated by Anonymous over 18 years ago

I see the same condition with lighttpd-1.4.11. Over time, many PHP FastCGI processes build up with large handle-req times. These PHP processes can be killed successfully and are then respawned. However, I do not see anything in the lighttpd error log corresponding to processes falling into this state. PHP is not segfaulting, nor running out of memory.

The same behavior occurs with identical builds of PHP 5.1.2 and 5.1.6, the latter of which has a completely rewritten FastCGI implementation. This is lighttpd-1.4.11 on AMD64 RHEL4.

-- jbyers

Actions #2

Updated by Anonymous over 17 years ago

I think the problem still persists in 1.4.16.

My log is full of this:

2007-08-08 11:02:46: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:49: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:02:49: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
2007-08-08 11:02:59: (mod_fastcgi.c.2836) backend is overloaded; we'll disable it for 2 seconds and send the request to another backend instead: reconnects: 0 load: 138
2007-08-08 11:02:59: (mod_fastcgi.c.3479) all handlers for /server.php on .php are down.
2007-08-08 11:03:02: (mod_fastcgi.c.2614) fcgi-server re-enabled: 0 /tmp/php-fastcgi.socket
...

and while it isn't completely locked up, it fills with:
2007-08-08 11:21:37: (server.c.1165) NOTE: a request for /foo timed out after writing 26280 bytes. We waited 360 seconds. If this a problem increase server.max-write-idle

-- sblam

Actions #3

Updated by Anonymous about 17 years ago

I still experience this issue in 1.4.18; after a server reboot it may work for another couple of weeks.

Actions #4

Updated by oschonrock about 17 years ago

We saw what may be a related issue with overloading (PHP not indicating to lighty that it is in fact overloaded):

#1488

Have you considered launching the php-fcgi server separately with spawn-fcgi, as described in that issue?
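At the protocol level, the FastCGI 1.0 specification gives an application a way to report that it is overloaded: it ends the request with an FCGI_END_REQUEST record whose protocolStatus is FCGI_OVERLOADED. The Python sketch below builds and recognizes such a record using the header and body layouts from the specification. Whether PHP sends this record, and whether it is the exact mechanism discussed in #1488, is not established in this thread; the function names are illustrative.

```python
import struct

FCGI_VERSION_1 = 1
FCGI_END_REQUEST = 3
# protocolStatus values defined by the FastCGI 1.0 specification
FCGI_REQUEST_COMPLETE = 0
FCGI_CANT_MPX_CONN = 1
FCGI_OVERLOADED = 2
FCGI_UNKNOWN_ROLE = 3

# Record header: version, type, requestId, contentLength, paddingLength, reserved
HEADER = struct.Struct(">BBHHBx")
# FCGI_EndRequestBody: appStatus, protocolStatus, reserved[3]
END_REQUEST_BODY = struct.Struct(">IB3x")


def build_end_request(request_id: int, protocol_status: int, app_status: int = 0) -> bytes:
    """Encode an FCGI_END_REQUEST record (no padding)."""
    body = END_REQUEST_BODY.pack(app_status, protocol_status)
    header = HEADER.pack(FCGI_VERSION_1, FCGI_END_REQUEST, request_id, len(body), 0)
    return header + body


def is_overloaded_reply(record: bytes) -> bool:
    """True if `record` is an FCGI_END_REQUEST reporting FCGI_OVERLOADED."""
    if len(record) < HEADER.size:
        return False
    version, rec_type, _req_id, content_len, _pad = HEADER.unpack_from(record, 0)
    if version != FCGI_VERSION_1 or rec_type != FCGI_END_REQUEST:
        return False
    if content_len != END_REQUEST_BODY.size or len(record) < HEADER.size + content_len:
        return False
    _app_status, protocol_status = END_REQUEST_BODY.unpack_from(record, HEADER.size)
    return protocol_status == FCGI_OVERLOADED
```

A web server that receives this record knows the refusal came from the application, instead of having to infer overload from its own connection counts.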

Actions #5

Updated by Anonymous almost 17 years ago

We also experience this problem regularly across three web servers under reasonable load (around 1M hits per day each), although the problem does not appear to be related to load and often occurs well outside peak times.

We see the problem with the following configurations:

  • PHP 4.4.4 (eAccelerator) under spawn-fcgi, with lighttpd 1.4.13
  • PHP 5.2.5 (XCache/Suhosin) spawned directly by lighty, with lighttpd 1.4.18

I have altered the priority, as this appears to be a show-stopping bug for PHP FastCGI under lighttpd.

Has anyone tried 1.5.x-svn?

-- pat

Actions #6

Updated by Anonymous almost 17 years ago

Same problem here. I was advised to upgrade to the 1.5.x branch, but I doubt that will change anything.

-- Aleksey Korzun

Actions #7

Updated by Anonymous almost 17 years ago

Same issues here. Has anyone experienced issues with the patch supplied?
I would like to see some action on this "bug" (I know it is basically a case of PHP not obeying the FastCGI standard).

Thank you!

-- ff

Actions #8

Updated by Anonymous almost 17 years ago

WORKING RESOLUTION:

Given the comment above, that the 1.5.x branch is now close to release, and that 1.4.x was causing severe instability in our production environment, it seemed prudent to try 1.5.x to see whether it would have any effect. I built 1.5.0-r1992 from SVN using the following configuration:


./configure --prefix=/usr --libdir=/usr/lib/lighttpd \
            --with-bzip2 \
            --with-attr \
            --with-linux-aio \
            --with-openssl=/usr/include/openssl

/etc/lighttpd.conf
[...]
proxy-core.balancer               = "sqf" 
proxy-core.allow-x-sendfile       = "enable" 
proxy-core.allow-x-rewrite        = "enable" 

$HTTP["url"] =~ "\.php" {
  proxy-core.protocol             = "fastcgi" 
  proxy-core.max-pool-size        = 4 # (set to same as PHP_FCGI_CHILDREN)
  proxy-core.backends             = ( "unix:/tmp/.fcgi-php.socket" )
  proxy-core.rewrite-request = (
    "_pathinfo" => ( "\.php(/.*)" => "$1" )
  )
}
[...]

This configuration has thus far resolved the PHP lock-up issue that we have been experiencing. We have not experienced server downtime for over 4 days (we were previously experiencing downtime on individual members of our cluster several times per day).
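One plausible reading of why the two proxy-core settings help is sketched below, assuming "sqf" means shortest-queue-first and that max-pool-size caps concurrent connections per backend. This is a Python model of that idea, not mod_proxy_core's implementation; the `Backend` type and the `submit`/`finish` helpers are hypothetical. The configuration above has a single backend, so in practice the pool cap is the part that matters; two backends are used here only to show the selection rule.

```python
from collections import deque
from dataclasses import dataclass, field

MAX_POOL_SIZE = 4  # plays the role of proxy-core.max-pool-size in the config above


@dataclass
class Backend:
    address: str
    active: int = 0                                # requests holding a pool connection
    waiting: deque = field(default_factory=deque)  # requests queued for a free slot

    def queue_length(self) -> int:
        return self.active + len(self.waiting)


def submit(backends: list, request_id: int) -> Backend:
    """Route a request to the backend with the shortest queue.

    If that backend's pool is full, the request waits in its queue; in this
    model a busy backend is never disabled."""
    target = min(backends, key=lambda b: b.queue_length())
    if target.active < MAX_POOL_SIZE:
        target.active += 1
    else:
        target.waiting.append(request_id)
    return target


def finish(backend: Backend) -> None:
    """A request completed: pass its pool slot to the next waiter, if any."""
    if backend.waiting:
        backend.waiting.popleft()  # the waiter now holds the freed slot
    else:
        backend.active -= 1
```

The design point is that excess requests wait for a pool slot rather than pushing a backend over an overload threshold, so there is no disable/re-enable cycle to get stuck in.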

In reference to the above comment ():

I don't claim to be an expert (indeed, I know little about the FastCGI protocol); however, several people have suggested that PHP's mis-implementation of the FastCGI protocol does not cause issues when running under spawn-php. I do not know whether this is the case, but I experienced the issue described in this ticket under both configurations (spawn-php or lighttpd-spawned interpreters), as noted in my earlier post. These issues may therefore be entirely separate, but I am not able to determine this.

If it is of any use to those who may be attempting to debug this issue, I also experienced it under all four of the following configurations (with lighttpd 1.4.x):

  • spawn-php over TCP/IP
  • spawn-php over unix socket
  • lighttpd spawns single PHP process which spawns own children (unix socket)
  • lighttpd spawns many individual PHP interpreters (unix socket)

Cheers,
Patrick

-- pat

Actions #9

Updated by Anonymous almost 17 years ago

I've upgraded to 1.5 now and I no longer get a build-up of handle-req connections; now it's the write-content connection times that go into the high thousands. I've set server.max-write-idle to 200, but that hasn't solved anything. Any ideas?

Actions #10

Updated by Anonymous almost 17 years ago

Thanks, Pat.

I will wait until 1.5 is stable to roll it out to production. This looks promising so far!

-- Aleksey Korzun

Actions #11

Updated by georgexsh over 16 years ago

It seems that 1.4.19 + PHP 5.2.4 + XCache has the same issue.

Actions #12

Updated by Rich over 15 years ago

Does the new 1.4.23 release address this?

-- Rich

Actions #13

Updated by azilber over 15 years ago

Rich wrote:

Does the new 1.4.23 release address this?

-- Rich

Apparently not; we're still having the same issue. More than a year on, we still see backend overloads. This is the single biggest issue for us in a high-volume production environment.

Actions #14

Updated by stbuehler about 15 years ago

  • Status changed from New to Duplicate
  • Assignee deleted (jan)
  • Priority changed from Urgent to Normal
  • Missing in 1.5.x set to No

I think this should be fixed in 1.4.24, see #1825.
