Bug #1464
closed
lighttpd breaks with higher traffic
Description
(Disclaimer: I am not reporting this on the forums because there are a ton of threads there that are left unanswered, and I have already tried all the "solutions" offered.)
Setup
- Lighttpd 1.4.18 (only necessary modules loaded, access_log not enabled)
- PHP 5.2.4 (fcgi)
- FreeBSD 6.2
- Applications run include OpenAds and in-house apps (PHP)
(We serve on average 12-13 million pages per month, depending on the time of the year, sometimes even more.)
Configuration
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket" => "/var/run/lighttpd/php-fastcgi.socket",
      "bin-path" => "/usr/local/bin/php-cgi",
      "max-procs" => 4,
      "min-procs" => 1,
      "bin-environment" => (
        "PHP_FCGI_CHILDREN" => "4",
        "PHP_FCGI_MAX_REQUESTS" => "5000"
      ),
      "bin-copy-environment" => (
        "PATH", "SHELL", "USER"
      ),
      "broken-scriptfilename" => "enable"
    )
  )
)
Basically, today our traffic went up (after Thanksgiving weekend) and lighttpd started "breaking", delivering "Error 500" to all clients. The errors included "unexpected end of file", "all handlers are down", etc.
2007-11-25 19:23:03: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 0 load: 789
2007-11-25 19:23:03: (mod_fastcgi.c.1731) connect failed: Connection refused on unix:/var/run/lighttpd/php-fastcgi.socket-2
2007-11-25 19:23:03: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 1 load: 789
2007-11-25 19:23:03: (mod_fastcgi.c.1731) connect failed: Connection refused on unix:/var/run/lighttpd/php-fastcgi.socket-1
2007-11-25 19:23:03: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 2 load: 789
2007-11-25 19:23:03: (mod_fastcgi.c.1731) connect failed: Connection refused on unix:/var/run/lighttpd/php-fastcgi.socket-0
2007-11-25 19:23:03: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 3 load: 789
2007-11-25 19:23:03: (mod_fastcgi.c.3496) all handlers for /delivery/ajs.php on .php are down.
2007-11-18 02:41:21: (mod_fastcgi.c.2462) unexpected end-of-file (perhaps the fastcgi process died): pid: 8224 socket: unix:/var/run/lighttpd/php-fastcgi.socket-3
2007-11-26 00:00:51: (mod_fastcgi.c.3269) response already sent out, but backend returned error on socket: unix:/var/run/lighttpd/php-fastcgi.socket-3 for /index.php , terminating connection
2007-11-18 18:55:27: (request.c.1098) GET/HEAD with content-length -> 400
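The "connect failed" entries above follow a regular pattern, so it can help to tally them per socket and per reason before tuning anything. The following is a minimal Python sketch, assuming the error-log format shown in this report; the name `tally_connect_failures` and the sample text are illustrative and not part of lighttpd.

```python
import re
from collections import Counter

# Matches "connect failed" entries in the format shown in this report.
# The pattern is an assumption based on those excerpts only.
CONNECT_FAILED = re.compile(
    r"\(mod_fastcgi\.c\.\d+\) connect failed: (?P<reason>.+?) on (?P<socket>unix:\S+)"
)

def tally_connect_failures(log_text):
    """Count connect failures per (socket, reason) pair."""
    counts = Counter()
    for match in CONNECT_FAILED.finditer(log_text):
        counts[(match.group("socket"), match.group("reason"))] += 1
    return counts

# Two entries copied from the log excerpt above.
sample = (
    "2007-11-25 19:23:03: (mod_fastcgi.c.1731) connect failed: "
    "Connection refused on unix:/var/run/lighttpd/php-fastcgi.socket-2\n"
    "2007-11-25 19:23:03: (mod_fastcgi.c.1731) connect failed: "
    "Connection refused on unix:/var/run/lighttpd/php-fastcgi.socket-1\n"
)

for (sock, reason), n in sorted(tally_connect_failures(sample).items()):
    print(f"{sock}: {n} x {reason}")
```

If every socket shows "Connection refused" at the same timestamp, as it does here, that points at a limit shared by all backends rather than one crashed PHP process.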
Initially I tried to research all of the above ("GET/HEAD ..." didn't bring up anything) and adjusted PHP children, workers, etc. after reading various blog posts and threads on the forum.
None of that helped, so I switched the server to the "external" spawn-fcgi for a couple of hours. That worked for maybe two hours; I have now gotten rid of spawn-fcgi and am back to tweaking the configuration.
While I was using spawn-fcgi, lighttpd continued to break a few times and I had to restart lighttpd (not spawn-fcgi) to "fix" that. The PHP processes seemed to work the whole time, but since I am not an expert with spawn-fcgi I went back to my original setup.
My current configuration:
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket" => "/var/run/lighttpd/php-fastcgi.socket",
      "bin-path" => "/usr/local/bin/php-cgi",
      "max-procs" => 32,
      "max-load-per-proc" => 4,
      "bin-environment" => (
        "PHP_FCGI_CHILDREN" => "16",
        "PHP_FCGI_MAX_REQUESTS" => "4000"
      ),
      "bin-copy-environment" => (
        "PATH", "SHELL", "USER"
      ),
      "broken-scriptfilename" => "enable"
    )
  )
)
(I removed min-procs because, even though it appears in every other example in the wiki, it has not been used since 1.3.x, or so the wiki says.) I now have many PHP processes spawned, maybe too many, but the site seems to stay up.
web01# ps aux|grep php-cgi|wc -l
545
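The process count above matches the usual rule of thumb that each backend lighttpd spawns is one php-cgi parent plus PHP_FCGI_CHILDREN forked workers. A minimal Python sketch of that arithmetic follows; the function name is illustrative, and the "+1 for the parent" model is the assumption being checked.

```python
def expected_php_processes(max_procs, php_fcgi_children):
    # Each spawned backend is assumed to be one php-cgi parent that forks
    # PHP_FCGI_CHILDREN workers, i.e. children + 1 processes per backend.
    return max_procs * (php_fcgi_children + 1)

# Values from the "current configuration" in this report.
total = expected_php_processes(max_procs=32, php_fcgi_children=16)
print(total)  # 544
```

The one extra line in the `wc -l` output (545 instead of 544) is presumably the `grep php-cgi` process itself, which also matches the pattern.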
I would really like to solve this issue once and for all. I have been using lighttpd to serve media content and never ran into this issue; it only happens with FastCGI.
Many people seem to have this issue and go back to Apache - which I would like to avoid. Is there anything I can provide you with to resolve this?
If someone gets back to me, I'd even provide access to one of those high traffic servers so you can look at it.
Updated by till almost 17 years ago
I also wonder what exactly this means:
2007-11-26 03:21:21: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 1 load: 1
2007-11-26 03:21:21: (mod_fastcgi.c.1731) connect failed: No such file or directory on unix:/var/run/lighttpd/php-fastcgi.socket-3
2007-11-26 03:21:21: (mod_fastcgi.c.2885) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 2 load: 1
2007-11-26 03:21:21: (mod_fastcgi.c.1731) connect failed: No such file or directory on unix:/var/run/lighttpd/php-fastcgi.socket-2
I have so many PHP backends running and waiting for "action", so where exactly does this error come from?
Updated by darix almost 17 years ago
You don't have enough PHP backends for your workload. Increase the number of PHP backends.
Updated by till almost 17 years ago
Replying to darix:
You don't have enough PHP backends for your workload. Increase the number of PHP backends.
Do you mean the CHILDREN?
I don't seem to be able to find the right "mix" between workers-procs-children.
Updated by till almost 17 years ago
I upped my backends (as suggested):
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket" => "/var/run/lighttpd/php-fastcgi.socket",
      "bin-path" => "/usr/local/bin/php-cgi",
      "max-procs" => 16,
      "min-procs" => 1,
      "bin-environment" => (
        "PHP_FCGI_CHILDREN" => "4",
        "PHP_FCGI_MAX_REQUESTS" => "5000"
      ),
      "bin-copy-environment" => (
        "PATH", "SHELL", "USER"
      ),
      "broken-scriptfilename" => "enable"
    )
  )
)
(Note: I have 4 GB of RAM, so according to the equation I should be good.)
Whenever I get the following error, that is the end of it. If I restart lighttpd, it keeps going; if not, it just sits there and the box goes idle.
response already sent out, but backend returned error on socket: unix:/var/run/lighttpd/php-fastcgi.socket-12 for /foo , terminating connection
Updated by Anonymous over 16 years ago
I know it's cute to say "sorry I meant backends", but clearly the documentation is a bit sketchy on exactly WTF that means in Lighttpd terms. Is a backend an instance of PHP, is a backend a child of php, is a backend a Lighttpd concept that represents something else?
Updated by Anonymous over 16 years ago
I can reproduce this problem consistently on one of my local dev servers. I use Apache's ab to load a real PHP file (not a 'hello world' script) 1000 times with a concurrency of 400 requests. Any concurrency that exceeds a single PHP server's children will start to cause this failure EXCEPT when PHP_FCGI_CHILDREN is either 1 or 2.
My latest config creates 4 processes (one for each server worker process on this dual core machine) and allows each child to run 10000 times:
fastcgi.server = ( ".php" =>
  ((
    "bin-path" => "/usr/bin/php-cgi",
    "socket" => "/tmp/php.socket",
    "max-procs" => 4,
    "min-procs" => 4,
    "idle-timeout" => 20,
    "bin-environment" => (
      "PHP_FCGI_CHILDREN" => "2",
      "PHP_FCGI_MAX_REQUESTS" => "10000",
    ),
    "bin-copy-environment" => (
      "PATH", "SHELL", "USER"
    ),
    "broken-scriptfilename" => "enable"
  ))
)
I am unable to break this configuration. If I either 1) don't create enough parent processes, or 2) increase the children beyond 2, it will break consistently with my ab test.
I have found that creating more processes does not benefit me under high loads enough to offset the extra memory required for each with its xcache running.
Updated by stbuehler over 16 years ago
Linux kernel problem: just do
sysctl net.core.somaxconn=1024
(The default is 128, so it often "crashes" when the load is a little bit above 128.)
If you need more than 1024 connections to one backend, you need to change the source of lighttpd too (or of spawn-fcgi if you use external spawning); you have to modify the second parameter of "listen(..., 1024)".
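To illustrate why both the sysctl and the listen() argument matter: on Linux, a listen() backlog larger than net.core.somaxconn is silently truncated to that value, so the smaller of the two numbers wins. The Python sketch below models that rule and reads the current limit; the function names are illustrative. This report was filed from FreeBSD, where the comparable setting is, as far as I know, kern.ipc.somaxconn rather than net.core.somaxconn.

```python
def effective_backlog(requested, somaxconn):
    """Backlog the kernel actually uses: a listen() request above
    somaxconn is silently truncated (see Linux listen(2))."""
    return min(requested, somaxconn)

def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Return the current limit on Linux, or None if it cannot be read."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None

# listen(..., 1024) with the old default limit, then after raising it.
print(effective_backlog(1024, 128))   # 128
print(effective_backlog(1024, 1024))  # 1024
print(read_somaxconn())               # current value on this machine, or None
```

Under this model, the "load: 789" in the first log excerpt is well above a 128-entry queue per socket, which would be consistent with the "Connection refused" errors reported there.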
I guess we should document that in the error message too.
Updated by gstrauss over 8 years ago
- Description updated (diff)
- Status changed from New to Fixed
- Assignee deleted (jan)