Bug #897

lighttpd responds 500 to ALL requests after a brief period of overload (it never recovers from the "cool-down" period)

Added by Anonymous almost 8 years ago. Updated almost 5 years ago.

Status: Fixed
Priority: Urgent
Assignee: jan
% Done: 100%
Category: mod_fastcgi
Target version: 1.4.24

Description

I wrote a simple PHP program:


<?php

$r = "";
for ($i = 0; $i < 1000000; ++$i)
    $r .= "sdjkfhsldkjfhsdlf" . time();

?>

then bound lighty to localhost and started ab:


/usr/local/apache2/bin/ab -n 100000 -c 200 http://127.0.0.1/s.php

top shows the PHP processes are very busy:


  PID USERNAME   THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
29035 www          1 125    0 81656K 14240K RUN      0:33  5.27% php
29045 www          1 125    0 90428K 14248K *SYSMA   0:33  5.22% php
29047 www          1 125    0 90428K 14244K *SYSMA   0:33  5.18% php
29034 www          1 124    0 90668K 16776K *SYSMA   0:34  5.13% php
29039 www          1 125    0 90396K 22264K *SYSMA   0:34  5.13% php
29042 www          1 124    0 90492K 23084K RUN      0:33  5.13% php
29037 www          1 124    0 90396K 17300K RUN      0:33  5.13% php
29036 www          1 124    0 90332K 14196K *SYSMA   0:34  5.08% php

After about 30 seconds, lighty.error.log said:


2006-10-25 18:20:15: (mod_fastcgi.c.3423) all handlers for  /s.php on .php are down. 

after which lighty answers 500 to every request.

I interrupted ab and waited until the PHP processes were idle, but lighty still refused to reply.

-- namosys

lighttpd-1.4-fastcgi-never-respawns.diff (1.71 KB) peto, 2009-02-02 22:07

Associated revisions

Revision 2635
Added by stbuehler almost 5 years ago

mod_fastcgi: fix is_local detection, respawn backends if bin-path is set (fixes #897)

History

#1 Updated by Anonymous over 7 years ago

this bug is present in lighty 1.4.13 on FreeBSD 6

#2 Updated by Anonymous over 7 years ago

confirmed on FreeBSD 6.1-RELEASE-p14

any news on this?

#3 Updated by Anonymous over 7 years ago

Replying to anonymous:

confirmed on FreeBSD 6.1-RELEASE-p14

any news on this?

forgot to add -- lighttpd-1.4.13

#4 Updated by jan over 7 years ago

  • Status changed from New to Fixed
  • Resolution set to wontfix

This is the expected behaviour on 1.4.x and is documented in Docs:PerformanceFastCGI in the wiki.

In 1.5.0 we do the scheduling of requests in the module ourselves and send a 502/503 in case we can't connect to the backends. The overall problem stays the same: if you don't have enough backends to handle the load, some requests have to be dropped.

#5 Updated by Anonymous over 7 years ago

  • Status changed from Fixed to Need Feedback
  • Resolution deleted (wontfix)

Hi Jan,

The problem in my case is that after such an "overload" incident, lighty responds with 500 to all php requests. You are forced to restart it to get it back into "regular operation mode."

Is this going to be addressed?

Also, is there something you recommend we do so that backends are available most of the time (if not always)?

Thank you

#6 Updated by jan over 7 years ago

The backends should be reactivated after the cool-down period, right. Can you attach the output of

$ lsof -Pnp <pidofphp> -p <pidoflighty>

We need to see the state of the connections when you see this problem.

#7 Updated by Anonymous over 7 years ago

confirmed on RHEL 5 + lighttpd 1.4.13

-- hutuworm

#8 Updated by Anonymous about 7 years ago

This also occurs on my site, specifically after being massively DDoSed (but probably under the same test case as above, as well) using lighttpd 1.4.11 + FreeBSD 4.7-RELEASE #3

Please note that it does not always occur after a DDoS, but only sometimes, which is even more odd.

I have to kill lighttpd and restart it manually to fix the issue; it never resolves itself. This morning the site was down for almost 2 hours (ouch) with the same exact error as above (all backends are marked busy -> 500 error), even after all the attacks subsided.

-- njaguar

#9 Updated by Anonymous about 7 years ago

Yup, I have the same problem. Mandriva Linux 2007.1, using web.py with fcgi on an old Pentium II 450 MHz. The problem shows up on pages that are pretty heavy (lots of database action and text processing); manually refreshing a couple of times in quick succession causes the server to "lock up" and send 500 for a couple of minutes, long after the Python process has finished. During this time, the lighttpd process is at 99% CPU usage.

-- bb

#10 Updated by Anonymous almost 7 years ago

I see a similar problem where lighttpd stops forwarding any fastcgi requests to my app and returns 500. This does not get resolved automatically - the only option is a restart of lighttpd. Tried a little hack in mod_fastcgi.c and it seemed to solve the problem. More details in the forum post http://forum.lighttpd.net/topic/16057

-- venkatraju

#11 Updated by Anonymous over 6 years ago

I believe I found the problem: take a look at mod_fastcgi.c. On line 2884, a process is assigned the state PROC_STATE_DIED_WAIT_FOR_PID. The next time fcgi_restart_dead_procs is called, the switch statement matches this state (line 2638). However, since the process is still alive, execution ends up on line 2671. At that point the process is not dead, merely overloaded; but since it is left in the old state, the active_procs counter ends up being decremented several times and wraps around to a huge value (it's a size_t, so it can never go negative). Note: the line numbers I referred to are with respect to 1.4.18.

I am posting a patch that worked for me, but I don't recommend using it until a lighttpd developer verifies it since this is my first time looking at and working with the code...


--- mod_fastcgi.c.orig  Mon Mar  3 04:59:34 2008
+++ mod_fastcgi.c       Mon Mar  3 04:59:48 2008
@@ -2669,7 +2669,11 @@
                        }

                        /* fall through if we have a dead proc now */
-                       if (proc->state != PROC_STATE_DIED) break;
+                       if (proc->state != PROC_STATE_DIED) {
+                               proc->state = PROC_STATE_OVERLOADED;
+                               host->active_procs++;
+                               break;
+                       }

                case PROC_STATE_DIED:
                        /* local procs get restarted by us,

#12 Updated by Anonymous over 6 years ago

I have been fighting this bug for the last few weeks. After spending a day in gdb fixing it, I find that someone has already fixed it. Why is there no developer traction on this bug, especially given that there is a patch?

#13 Updated by Anonymous over 6 years ago

Perhaps the title is misleading. I am changing it in the hopes that it will attract some attention from the developers. I have been using the patch posted above for quite a while and haven't had any problems with it. I hope it will get fixed in the next release.

By the way, an extra note to the developers: try the method outlined in the first post. After a while, lighttpd will respond 500 to some requests. This is not the problem. Now stop the ab process and try executing the script from your browser. All subsequent requests will return 500 no matter how long you wait. This is the problem.

#14 Updated by stbuehler over 6 years ago

  1. The patch doesn't convince me:
    - I couldn't find the given line numbers.
    - The proc isn't counted as active in state OVERLOADED, so even if we don't count active_procs correctly, the patch will probably not really fix it.
    - A local proc should only get into that state if it killed the socket, so even if it is not dead, it is assumed it will not work anymore. But perhaps we really shouldn't care about that and just try to use it again.
  2. I tried reproducing it (I used usleep instead of busy loops), but it just worked as it should.

#15 Updated by meto about 6 years ago

This patch has reduced the problem but hasn't removed it entirely... I believe there is some other point that's fragile under overload.

#16 Updated by meto about 6 years ago

I've looked closer at the config side of lighttpd and I believe there is another source of the bug: connection killing. After the 500 errors, restarting lighty left the hung PHP processes running. I found this fastcgi setting:


"kill-signal": By default lighttpd sends SIGTERM(15) to FastCGI processes that were spawned by lighttpd. Applications that link libfcgi need to be killed with SIGUSR1(10). This applies to php <5.2.1, lua-magnet and others. The option is accepted in prior versions, but is only functional in 1.4.14+.

I use PHP 5.2.0 (the stable package on Debian etch) and set kill-signal to 10. It seems OK now.
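For reference, a minimal sketch of where kill-signal goes in a fastcgi.server block (the socket and bin-path values here are illustrative, not taken from this thread):

```
fastcgi.server = ( ".php" =>
  ( "localhost" =>
    (
      "socket"      => "/tmp/php.socket",    # illustrative path
      "bin-path"    => "/usr/bin/php-cgi",   # illustrative path
      "kill-signal" => 10                    # SIGUSR1, for php <5.2.1 and other libfcgi apps
    )
  )
)
```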

#18 Updated by stbuehler almost 6 years ago

Overload problems may be caused by a too-low backlog limit (the number of waiting connections on a socket):

Just do


sysctl net.core.somaxconn=1024

(the default is 128, so it often "crashes" when the load is a little above 128)

That of course doesn't fix any real load problems (and you do have load problems if more than 128 connections are waiting for php), but it may help against peaks.

#19 Updated by stbuehler almost 6 years ago

  • Target version changed from 1.4.20 to 1.4.21

#20 Updated by stu almost 6 years ago

I had this problem with version 1.4.13 on Debian.

It started after I enabled mod_compress, and disabling it fixed the problem. So if you are getting 500 responses with "backend is overloaded" in the error log, try disabling mod_compress.

2008-11-24 15:25:13: (mod_fastcgi.c.3455) all handlers for  /index.php on .php are down. 
2008-11-24 15:25:16: (mod_fastcgi.c.2597) fcgi-server re-enabled:  0 /tmp/php.socket 
2008-11-24 15:25:17: (mod_fastcgi.c.2819) backend is overloaded, we disable it for a 2 seconds and send the request to another backend instead: reconnects: 0 load: 136
...

server.modules              = (
            "mod_access",
            "mod_alias",
            "mod_accesslog",
            "mod_expire",
            "mod_fastcgi",
            "mod_rewrite",
            "mod_redirect",
            "mod_status",
#           "mod_evhost",
            "mod_compress",
#           "mod_usertrack",
#           "mod_rrdtool",
#           "mod_webdav",
#           "mod_flv_streaming",
#           "mod_evasive" 
 )

...

#### compress module
compress.cache-dir          = "/var/tmp/lighttpd-cache/" 
compress.filetype           = ("text/plain", "text/html", "application/x-javascript", "text/css", "text/xml", "image/gif", "image/jpeg", "image/png")

...

#21 Updated by peto over 5 years ago

Patch to 1.4 fixes one major case of this:

If lighttpd has a fastcgi bin-path specified, but the socket is already active and connects successfully, the proc is marked is_local = 0, indicating that the backend wasn't started by lighttpd.

If the backend then goes away, lighttpd should start a new one; that's what specifying bin-path is telling it to do. However, it never gets there: at that point it just waits a few seconds, then tries to connect to the socket as if bin-path wasn't specified.

This can happen accidentally: if for any reason lighttpd connects to a socket when it thinks the backend is dead, it'll permanently start treating that backend as non-local, and from then on will never restart it. If bin-path has been specified, it should always start a new process if an existing one can't be found.

For the benefit of googlers, this particular one manifests in the logs as:

(mod_fastcgi.c.2912) backend died; we'll disable it for 5 seconds and send the request to another backend instead: reconnects: 1 load: 1
(mod_fastcgi.c.2725) fcgi-server re-enabled: unix:/mnt/moe/moe-live/tmp/sockets/fcgi.socket-0
(mod_fastcgi.c.1743) connect failed: Connection refused on unix:/mnt/moe/moe-live/tmp/sockets/fcgi.socket-0

Patch tested going both ways: spawn an fcgi backend manually and start lighttpd (it uses the existing backend, as usual); kill the backend, and lighttpd starts a new one; kill lighttpd's backend and spawn a new one manually and it starts using it.

#22 Updated by stbuehler over 5 years ago

  • Target version changed from 1.4.21 to 1.4.22
  • Pending changed from Yes to No
  • Patch available set to No

Did you try a backend without bin-path in lighttpd? I guess the fork will segfault if you try to spawn.

Yes, the is_local property is broken; but ignoring it doesn't help (PROC_STATE_DIED doesn't mean our spawned process died).

I recommend using spawn-fcgi with daemontools in 1.4 and 1.5 instead of internal spawning...

#23 Updated by peto over 5 years ago

Well, it's trading bugs of similar severity, then (having fastcgi stop working is about as fatal as a segfault for sites using it), so the patch is there for people who are using auto-spawning.

#24 Updated by jayprakash over 5 years ago

I am facing a similar problem with the dispatch.fcgi backend. We are eagerly waiting for a fix. If the 1.4.22 release will take substantial time, can we get a patch for this issue? It would help resolve the problem more quickly.

#25 Updated by peto over 5 years ago

If you only use internal spawning, then my patch above may work for you. I've been using it since then with no problems. (If you use external spawning then you shouldn't have this particular problem anyway, and I wouldn't mix internal and external spawning--that's just asking for trouble...)

#26 Updated by jayprakash over 5 years ago

I have looked at your patch and I am positive it will work for me. However, I was not able to test it as you suggested. Could you let me know how you spawned an fcgi backend manually?

I found the following code in mod_fastcgi.c(lighttpd-1.4.21)

2836         if (hctx->proc->is_local) {
2837                 hctx->proc->state = PROC_STATE_DIED_WAIT_FOR_PID;
2838         } else {
2839                 hctx->proc->state = PROC_STATE_DIED;
2840         }
2841
2842         hctx->proc->state = PROC_STATE_DIED;

Just curious: is this correct?

#27 Updated by peto over 5 years ago

Use spawn-fcgi.

That code looks funny, but I don't know the code well enough to know if it matters.

#28 Updated by stbuehler over 5 years ago

  • Target version changed from 1.4.22 to 1.4.23

#29 Updated by manishpal over 5 years ago

Using spawn-fcgi, let's say we spawn three fcgis and tell lighttpd (1.4.21) to load-balance between them.

fastcgi.server = ( ".fcgi" =>
  (
    ( "host" => "127.0.0.1", "port" => 8081 ),
    ( "host" => "127.0.0.1", "port" => 8082 ),
    ( "host" => "127.0.0.1", "port" => 8083 )
  )
)

Problem is, if I kill the fcgi on port 8083, every third request returns a 500 Internal Server Error. I would like to know the cause, and why the request is being sent to the 8083 fcgi (resulting in a 500 error) even though the 8081 and 8082 fcgis are alive.

#30 Updated by stbuehler about 5 years ago

  • Target version changed from 1.4.23 to 1.4.24

#31 Updated by stbuehler almost 5 years ago

  • Status changed from Need Feedback to Fixed
  • % Done changed from 0 to 100

Applied in changeset r2635.
