Project

General

Profile

Bug #2788

FreeBSD/1.4.45/SSL: requests getting stuck in handle-req state occasionally

Added by Max_nl 9 months ago. Updated 9 months ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
Category:
core
Target version:
Start date:
2017-02-08
Due date:
% Done:

100%

Estimated time:
Missing in 1.5.x:

Description

I am running Lighttpd on a high traffic website, and recently upgraded to 1.4.45, as well as made HTTPS connections mandatory on the site.
I am not sure which of the two changes I did is causing the problem, but now connections occasionally get stuck in the "handle-req" state (shown in the server-status page) serving a PHP page, and never get out of that, even though the connection with the client is lost.

It is hard to reproduce. It can go a day without it happening, and then suddenly it happens so often to the point that it quickly runs out of fds, the webserver is no longer accessible and never recovers from that.

2017-02-08 06:15:36: (server.c.1755) [note] sockets disabled, out-of-fds

netstat shows there are a lot of unix sockets open (which lighttpd did not read the available data from?):

Active UNIX domain sockets
Address  Type   Recv-Q Send-Q    Inode     Conn     Refs  Nextref Addr
ffffff02444f80f0 stream      0      0        0 ffffff031763b3c0        0        0
ffffff031763b3c0 stream      0      0        0 ffffff02444f80f0        0        0
ffffff024445a0f0 stream    912      0        0 ffffff0244988c30        0        0 /tmp/php.socket-0
ffffff0244988c30 stream      0      0        0 ffffff024445a0f0        0        0
ffffff02448ab1e0 stream    904      0        0 ffffff02445e21e0        0        0 /tmp/php.socket-1
ffffff02445e21e0 stream      0      0        0 ffffff02448ab1e0        0        0
ffffff0260f8d0f0 stream   1467      0        0 ffffff0260f8d1e0        0        0 /tmp/php.socket-2
ffffff0260f8d1e0 stream      0      0        0 ffffff0260f8d0f0        0        0
ffffff0260f8d2d0 stream   1075      0        0 ffffff0260f8d3c0        0        0 /tmp/php.socket-3
ffffff0260f8d3c0 stream      0      0        0 ffffff0260f8d2d0        0        0
ffffff0260f8d4b0 stream    760      0        0 ffffff0260f8d5a0        0        0 /tmp/php.socket-0
ffffff0260f8d5a0 stream      0      0        0 ffffff0260f8d4b0        0        0
ffffff0260f8d690 stream   1490      0        0 ffffff0260f8d780        0        0 /tmp/php.socket-1
ffffff0260f8d780 stream      0      0        0 ffffff0260f8d690        0        0
ffffff0260f8d870 stream   1052      0        0 ffffff0260f8d960        0        0 /tmp/php.socket-2
ffffff0260f8d960 stream      0      0        0 ffffff0260f8d870        0        0
ffffff0260f8da50 stream   1079      0        0 ffffff0260f8db40        0        0 /tmp/php.socket-3
[snip]
lighttpd-stripped.conf (6.02 KB) lighttpd-stripped.conf configuration Max_nl, 2017-02-10 12:51
netstat (22.8 KB) netstat Max_nl, 2017-02-14 02:07

Associated revisions

Revision 12440e89 (diff)
Added by gstrauss 9 months ago

[core] use kqueue in level-triggered mode (fixes #2788)

use kqueue in level-triggered mode, not edge-triggered

x-ref:
"FreeBSD/1.4.45/SSL: requests getting stuck in handle-req state occasionally"
https://redmine.lighttpd.net/issues/2788

Revision aa923e05 (diff)
Added by gstrauss 9 months ago

[mod_fastcgi,mod_scgi] backend spawn EINTR retry (#2788)

When spawning backends, retry blocking connect() to backend if EINTR
received when attempting to see if backend is already running. EINTR
might be received if a HUP or USR1 signal is received while connecting
(or SIGCHLD on systems without SA_RESTART)

(expected to occur extremely rarely, but simple to handle properly)

x-ref:
"FreeBSD/1.4.45/SSL: requests getting stuck in handle-req state occasionally"
https://redmine.lighttpd.net/issues/2788

Revision 1dd5cce3 (diff)
Added by gstrauss 9 months ago

[mod_fastcgi,mod_scgi] consolidate backend process accounting (#2788)

consolidate backend process accounting for consistency

x-ref:
"FreeBSD/1.4.45/SSL: requests getting stuck in handle-req state occasionally"
https://redmine.lighttpd.net/issues/2788

History

#1

Updated by gstrauss 9 months ago

Please add attachments instead of posting many kb of log data into the issue.

Please attach your lighttpd configuration, and "xxxxx" out any sensitive info.

Do you have too many backends running? Is lighttpd creating backends as well as PHP FastCGI server?

#2

Updated by gstrauss 9 months ago

  • Description updated (diff)
#3

Updated by gstrauss 9 months ago

If you could share the server status page (as an attachment) that might provide some useful information.

Although undesirable from a security perspective, if you temporarily remove the mandatory HTTPS for a few days, you might be able to help narrow down whether the problem is SSL-related or fastcgi-related in lighttpd.

#4

Updated by gstrauss 9 months ago

Given the repetition of the paths
/tmp/php.socket-0
/tmp/php.socket-1
/tmp/php.socket-2
/tmp/php.socket-3
it might be that lighttpd is trying to restart the PHP after it exits if you have set PHP_FCGI_MAX_REQUESTS. If another process attempts to restart the backend, there might be a competition. Seeing your lighttpd.conf would help. (/usr/sbin/lighttpd -p -f /etc/lighttpd/lighttpd.conf) Are you running multiple lighttpd workers? Are you having lighttpd start the PHP backend(s)? Try running lighttpd without multiple workers and see if it keeps up. Your PHP likely uses much more resources than lighttpd.

#5

Updated by Max_nl 9 months ago

Minimum number of PHP processes spawned from lighttpd, since requests are meant to complete very fast.
It also uses a caching system, and most of the time the PHP script only has to return "X-LIGHTTPD-send-file: /some-cached-page" to lighttpd.

In addition to PHP there is also a custom FastCGI application written in C++ for performance reasons.
I have never seen requests get stuck on that, however that application handles a smaller amount of requests than PHP, so that doesn't say much.

I also noticed that some of the requests it gets stuck on seem to originate from the IP-range of a VPN provider, and some others are from the range of Google Cloud services.
So I suspect these might not be legitimate requests from a normal browser.

Have put Cloudflare in front of the site for now, and haven't seen the issue in the last couple days.

#6

Updated by Max_nl 9 months ago

Are you having lighttpd start the PHP backend(s)? Try running lighttpd without multiple workers and see if it keeps up.

Very basic config
Single Lighttpd worker, lighttpd starts PHP
Plain old fcgi SAPI


$ /usr/local/bin/php -v
PHP 5.6.29 (cgi-fcgi) (built: Jan 19 2017 23:06:25)
Copyright (c) 1997-2016 The PHP Group
Zend Engine v2.6.0, Copyright (c) 1998-2016 Zend Technologies

Not using fpm or anything.

#7

Updated by gstrauss 9 months ago

Not sure about on FreeBSD, but on Linux systems, /usr/bin/php-cgi should (in general) be used for FastCGI PHP instead of /usr/bin/php.

What is the value of PHP_FCGI_CHILDREN in the environment? PHP_FCGI_MAX_REQUESTS? You can set these in the fastcgi.server with "bin-environment" => ("PHP_FCGI_CHILDREN" => "1", "PHP_FCGI_MAX_REQUESTS" => "1000000"), as you already know (#2316).

FYI: default max-procs = 4 if not specified. You might specify 1 if lighttpd should spawn only one backend, if the backend itself spawns its own children (PHP_FCGI_CHILDREN), but you might want max-procs = 4 (or higher) if PHP_FCGI_CHILDREN = 1

Since the issue you are seeing has not been reported by others, you might also experiment with

server.event-handler = "poll"

instead of

server.event-handler = "freebsd-kqueue"

just to narrow down whether or not that setting causes the issue.

Of course, with the multiple mitigations that you put in place, this issue becomes even more difficult to reproduce.

#8

Updated by Max_nl 9 months ago

Not sure about on FreeBSD, but on Linux systems, /usr/bin/php-cgi should (in general) be used for FastCGI PHP instead of /usr/bin/php.

PHP was compiled from source with only FastCGI sapi enabled.
So in my case /usr/local/bin/php = what is php-cgi in other distributions.

What is the value of PHP_FCGI_CHILDREN in the environment? PHP_FCGI_MAX_REQUESTS? You can set these in the fastcgi.server with "bin-environment" => ("PHP_FCGI_CHILDREN" => "1", "PHP_FCGI_MAX_REQUESTS" => "1000000"), as you already know (#2316).

I had that line, but removed it recently to rule out my custom line was the problem.
Currently using default values without setting any environment settings.

I am not comfortable posting full server status page here.
But anonymized it look like this when the issue occures:

Server-Status (lighttpd/1.4.45)

Hostname    www.********* ()
Uptime    9 hours 48 min 44 s
Started at    2017-02-08 14:00:46
absolute (since start)
Requests    1 Mreq
Traffic    10.76 Gbyte
average (since start)
Requests    38 req/s
Traffic    319.51 kbyte/s
average (5s sliding average)
Requests    33 req/s
Traffic    133.79 kbyte/s
284 connections
kkhkkkkkkhhhkhhhhhkkhkkkhhkhhkhhkkkhkkkkhhkkkhkkhh
kkkhkhkkhkkkhkhkhkhhkhhkkhhkkhkkhhkkkkhhkhhkhhhkkh
hkhhhhhhkkhkhkhhkkkkhhhkhhkhhhkkkhhhhkkkkhhhhhkhhh
hhkkkkkkhhhhkhkkkkhkhkkkhkkhkhhhhkhkkkkhkhkhkhkhkh
hhkhkkhhkhhhkkhhhkkkhkkkkkkkkkkhhkkkhkkkkkkhkkkkkk
kkkkkkkkkkkkkkkkkkkkkkkkkrrkrkkkrk

166      k = keep-alive
0      . = connect
0      q = req-start
4      r = read
0      Q = req-end
0      R = readpost
114      h = handle-req
0      s = resp-start
0      W = write
0      S = resp-end
0      E = error
0      C = close

Client IP: Read: Written: State: Time: Host: URI: File:

*.*.*.*    0/0    0/0    keep-alive    2    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    4    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    handle-req    6583    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    2    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    5    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    2    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    0    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    2    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    keep-alive    5    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    handle-req    6576    www.*****    /some-php-page.php
*.*.*.*    0/0    0/0    handle-req    6578    www.*****    /some-php-page.php
[...]

114 requests stuck in handle-req for hours, but it happily continues processing other requests until the problem gets more severe and it runs out of FDs.

Of course, with the multiple mitigations that you put in place, this issue becomes even more difficult to reproduce.

Yeah, reproducibility is the problem.
Even before mitigations it could take quite a long time before any request got stuck. But when it happens others will follow quickly.

#9

Updated by gstrauss 9 months ago

Back to what I mentioned before: lots of repetition of /tmp/php.socket-0 /tmp/php.socket-1 /tmp/php.socket-2 /tmp/php.socket-3 would seem to indicate that lighttpd is trying to respawn the backend. There should be some information in the lighttpd error log about this. When lighttpd attempts to respawn the backend, it will unlink the unix domain socket path (e.g. /tmp/php.socket-0) if connect() fails with something other than ENOENT. Unfortunately, the exact reason is not currently logged. Given lighttpd signal dispositions, EINTR is not expected, but lighttpd should handle it (and currently does not). The socket should be blocking and so the connect() should not return EINPROGRESS.

I would be very interested to find out what the exact error is, which leads to lighttpd unlinking the unix domain socket. If you are willing to build lighttpd with a custom patch, I'll provide a patch to add some logging here.

Now, if lighttpd is unable to connect() to /tmp/php.socket-0, what is happening to the backend PHP server at the time? Why would those backend PHP servers stop responding? Do you have any error log of issues with the backend PHP server?

2017-02-08 06:15:36: (server.c.1755) [note] sockets disabled, out-of-fds

According to what you have noted, this is due to a build-up of connections in handle-req state. These likely have backend connections open to the PHP fastcgi server, and are connected to now-unlinked sockets, e.g. older versions of /tmp/php.socket-0. If the backend PHP server has exited, lighttpd should receive HUP on that socket. Why that is not happening might or might not be related to server.event-handler = "freebsd-kqueue", and likewise why lighttpd is not aborting the backend when the client disconnects.

Again, without any sort of reproducibility, this will be very difficult to diagnose. At the moment, this issue has only be reported by you, on FreeBSD. If your server can handle a fairly large logging load, and it probably can, you can set fastcgi.debug = 1 so that the next time this happens, we'll have some more info. However, fastcgi.debug = 1 can produce very large log files on a busy site, so please do not enable this if it adversely affects your server performance.

#10

Updated by Max_nl 9 months ago

There should be some information in the lighttpd error log about this.

This is the 5 minutes leading up to running out of FDs:

2017-02-08 06:10:00: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:04: (connections.c.234) SSL (error): 5 -1 35 Resource temporarily unavailable 
2017-02-08 06:10:05: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:11: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:20: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:26: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:30: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:35: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:40: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:10:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:11:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:33: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:33: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:33: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:33: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:47: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:54: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:11:55: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:11:55: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:11:55: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:11:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:11:56: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:29: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:59: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:59: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:12:59: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:13:06: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:14:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:14:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-08 06:14:32: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:14:42: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-08 06:15:36: (server.c.1755) [note] sockets disabled, out-of-fds 

Do have some SSL errors there, but those seem to occur all the time, and think they are just related to connection with client being lost or client trying to use an older disabled SSL version.

Currently on the occasionial PHP error PHP sends an Internal server error code, and logs the real error to FastCGI stderr which appears in lighttpd errorlog. E.g.:

2017-02-10 15:23:15: (mod_fastcgi.c.2543) FastCGI-stderr: PHP Fatal error:  Allowed memory size of 134217728 bytes exhausted (tried to allocate 40534469 bytes) in /usr/home/******/include/funcs.php on line 173

But don't think it respawns on such cases, just frees the resources page was using, and serves the next request. Doesn't get stuck in handle-req on those.

Also do not see anything in normal logs that indicates a process may have crashed.
Normally on FreeBSD you get a "process pid x exited on signal y" message in syslog if there is a segmentation fault.

Think I would need to switch to php-fpm to have php backends do external logging instead of sending the message to lighttpd.
But that behaves differently and does start/spawn/kill own processes, so that would add more uncertain factors to the mix.

I am willing to try that as well as custom patches some time in the future, but not right now, as I will be traveling in the coming weeks.
Will leave it behind Cloudflare for now, and see if it doesn't occur then as temporary solution.

#11

Updated by gstrauss 9 months ago

2017-02-08 06:10:04: (connections.c.234) SSL (error): 5 -1 35 Resource temporarily unavailable

On FreeBSD, errno 35 EAGAIN Resource temporarily unavailable
lighttpd should handle that better but the errors above do not point to any reason why lighttpd is running out of fds.

What version of FreeBSD are you running, and what version of openssl?

Would you see if you can run lighttpd with fastcgi.debug = 1 and if you have enough space for the resulting error log? That will log when lighttpd respawns the PHP backend, which almost certainly must be happening for there to be so many of those sockets with the same name.

#12

Updated by gstrauss 9 months ago

  • Category set to core
  • Status changed from New to Patch Pending
  • Target version changed from 1.4.x to 1.4.46

Looking in fdevent_freebsd_kqueue.c, I see that it has used EV_CLEAR since the first version of lighttpd 1.4 entered into svn (and later git). This looks incorrect to me, since all other mechanisms are level-triggered, rather than edge-triggered. Edge-triggered behavior in lighttpd fdevent use of kqueue could possibly result in missing ready events since lighttpd expects level-triggered behavior and might choose not to handle an event until some later time (removing interest and then re-adding interest later).

If this is indeed the case, then server.event-handler = "poll" would make the problem go away, as would server.event-handler = "libev", if libev is available.

Here is a patch which will fix the issue if you'd like to compile lighttpd and test with kqueue.

--- a/src/fdevent_freebsd_kqueue.c
+++ b/src/fdevent_freebsd_kqueue.c
@@ -75,14 +75,14 @@ static int fdevent_freebsd_kqueue_event_set(fdevents *ev, int fde_ndx, int fd, i
        if (events == oevents) return fd;

        if (addevents & FDEVENT_IN)  {
-               EV_SET(&kev[n], fd, EVFILT_READ, EV_ADD|EV_CLEAR, 0, 0, NULL);
+               EV_SET(&kev[n], fd, EVFILT_READ, EV_ADD, 0, 0, NULL);
                n++;
        } else if (delevents & FDEVENT_IN) {
                EV_SET(&kev[n], fd, EVFILT_READ, EV_DELETE, 0, 0, NULL);
                n++;
        }
        if (addevents & FDEVENT_OUT)  {
-               EV_SET(&kev[n], fd, EVFILT_WRITE, EV_ADD|EV_CLEAR, 0, 0, NULL);
+               EV_SET(&kev[n], fd, EVFILT_WRITE, EV_ADD, 0, 0, NULL);
                n++;
        } else if (delevents & FDEVENT_OUT) {
                EV_SET(&kev[n], fd, EVFILT_WRITE, EV_DELETE, 0, 0, NULL);
#13

Updated by Max_nl 9 months ago

If this is indeed the case, then server.event-handler = "poll" would make the problem go away

Negative.
Tried it for almost 3 days.


lighttpd/1.4.45

Server-Features
RegEx Conditionals enabled
Network Engine
fd-Event-Handler poll

Score is 12 requests stuck in handle-req.

0/0    0/0    handle-req    1388
0/0    0/0    handle-req    2229
0/0    0/0    handle-req    2225
0/0    0/0    handle-req    52183
0/0    0/0    handle-req    1388
0/0    0/0    handle-req    1388
0/0    0/0    handle-req    1388
0/0    0/0    handle-req    2261
0/0    0/0    handle-req    59038
0/0    0/0    handle-req    2226
0/0    0/0    handle-req    2229
0/0    0/0    handle-req    925

netstat does look different.
Now has 200 something open fds to just /tmp/php.socket-3 and not to the other ones.

Nothing special in error log.
Haven't tried fastcgi debug yet.

What version of FreeBSD are you running, and what version of openssl?

FreeBSD kernel used is ancient (8.2). Pretty much everything in userspace is compiled from source and new though.
OpenSSL 1.1.0c

#14

Updated by gstrauss 9 months ago

openssl 1.1.0c is pre-release. Don't you think it would be a good idea to report bleeding-edge versions when you post issues?
See #2784 which will at least explain some of the SSL trace you are seeing in the lighttpd error log.

Score is 12 requests stuck in handle-req.
...
netstat does look different.
Now has 200 something open fds to just /tmp/php.socket-3 and not to the other ones.

Nothing special in error log.
Haven't tried fastcgi debug yet.

The fastcgi debug will be useful to confirm that the PHP backend is being restarted. Still curious as to why that might be.
If lighttpd has 12 stuck requests, why are there 200-some-odd fds leaked for /tmp/php.socket-3. Are there PHP backends that are stuck? Or is lighttpd somehow holding onto those 200+ fds? (Does lighttpd have "more-than-normal" number of fds open? ("more-than-normal" for your system load) Who is holding those inodes open (try 'lsof' or 'fuser')?

FreeBSD 8.2 reached end-of-life over 4 and 1/2 years ago (!!!)
https://www.freebsd.org/security/unsupported.html

Aside: I can't say that I am eager to spend more of my time on this specific issue unless the problem also occurs on a currently-supported FreeBSD kernel. Still, I'll take a look if you can provide the fastcgi debug over a period of time and if you can see about providing more info about how well your backend PHP is running, and when how often it is restarted (e.g. log rotation or other scheduled maintenance).

#15

Updated by Max_nl 9 months ago

openssl 1.1.0c is pre-release. Don't you think it would be a good idea to report bleeding-edge versions when you post issues?

Bleeding-edge is relative.
The openssl.org download page calls it stable.

"Note: The latest stable version is the 1.1.0 series of releases."

FreeBSD 8.2 reached end-of-life over 4 and 1/2 years ago (!!!)

Yeah, site is a bit on a tight budget and server isn't replaced until it falls apart.
Does have been running that site without stuck handle-reqs for 6 years though.

Looking at "ps" it does seem the worker php processes are restared fairly often right now:

www# ps aux | grep php-cgi
www    31179 13.2  0.1 30504  9664  ??  S     1:11PM   0:07.51 /usr/local/bin/php (php-cgi)
www    31191 10.0  0.1 30504 11008  ??  S     1:12PM   0:04.04 /usr/local/bin/php (php-cgi)
www    30974  1.5  0.1 30504 11408  ??  S     1:10PM   0:02.62 /usr/local/bin/php (php-cgi)
www     1201  0.0  0.1 28456  8384  ??  Is   Sat04AM   0:00.08 /usr/local/bin/php (php-cgi)
www     1203  0.0  0.1 28456  8384  ??  Is   Sat04AM   0:00.22 /usr/local/bin/php (php-cgi)
www     1205  0.0  0.1 28456  8384  ??  Is   Sat04AM   0:00.54 /usr/local/bin/php (php-cgi)
www     1207  0.0  0.1 28456  8384  ??  Is   Sat04AM   0:00.52 /usr/local/bin/php (php-cgi)

But that is probably just due to the low default built-in PHP_FCGI_MAX_REQUESTS. Will put my custom line back in.

lsof and sockstat shows lighttpd has 43 active connections to /tmp/php.socket-3 open.
The other 160 that have a connection id of 0 in netstat do not show there at all. So maybe just connections that have been recently closed, or a kernel bug.

#16

Updated by gstrauss 9 months ago

openssl 1.1.0c is pre-release. Don't you think it would be a good idea to report bleeding-edge versions when you post issues?

My mistake. openssl 1.1.0c is not pre-release. However, please note that openssl 1.1.0d was released 26 Jan 2017 and openssl 1.1.0e is due out 16 Feb 2017

But that is probably just due to the low default built-in PHP_FCGI_MAX_REQUESTS. Will put my custom line back in.

That will further mitigate the problem, which is probably better for your system, though will make tracking this down more difficult.

lsof and sockstat shows lighttpd has 43 active connections to /tmp/php.socket-3 open.
The other 160 that have a connection id of 0 in netstat do not show there at all. So maybe just connections that have been recently closed, or a kernel bug.

When lighttpd connect()s to a backend, it waits for POLLOUT until the connect() completes, and then registers interest in POLLIN, which is never removed (unless server.stream-response-body = 2, which you do not have set in your config). Now that you are using server.event-handler = "poll", which is level-triggered, it is strange that lighttpd is not detecting when the backend closes the connection. Those 160 unknown connections suggest a kernel bug (probably fixed long ago in later FreeBSD kernel versions). Similarly, lighttpd keeps client connections in the poll set even if lighttpd is not waiting on POLLIN or POLLOUT, and so lighttpd should receive an event such as POLLHUP when the client closes the connection. That this is not happening is also something probably fixed long ago in later FreeBSD kernel versions.

Have you considered attempting to update the server to the latest kernel in the 8.x series (from FreeBSD 8.4)? You should be able to do so without rebuilding the entire system, and should be able to easily revert back to current kernel if any problems arise from the upgrade.

If you restart the lighttpd server, which will send TERM signal to children in the process tree, including the backends (if started by lighttpd), do all the stray fds get cleaned up? At what frequency is the server rebooted?

#17

Updated by Max_nl 9 months ago

You should be able to do so without rebuilding the entire system, and should be able to easily revert back to current kernel if any problems arise from the upgrade.

This is not Linux...

https://www.freebsd.org/doc/en_US.ISO8859-1/books/handbook/makeworld.html

----
The build world process assumes an upgrade from an older FreeBSD version using the source of a newer version that was obtained using the instructions in Section 23.5, “Synchronizing Source”.

In FreeBSD, the term “world” includes the kernel, core system binaries, libraries, programming files, and built-in compiler. The order in which these components are built and installed is important.

For example, the old compiler might have a bug and not be able to compile the new kernel. Since the new kernel should be built with the new compiler, the new compiler must be built, but not necessarily installed, before the new kernel is built.

The new world might rely on new kernel features, so the new kernel must be installed before the new world is installed. The old world might not run correctly on the new kernel, so the new world must be installed immediately upon installing the new kernel.
---

So yes, you can upgrade.
But if it doesn't work out, there is no easy way you get your old world back.

I don't have zfs snapshots or anything like that on this system.

#18

Updated by gstrauss 9 months ago

hah. there is more than one way to do it. glad you know about ZFS snapshots. I didn't think you built the whole system with 'make world' given the age of your kernel. It's possible to live-boot FreeBSD off of USB without 'make world', but I'm sure you already know that.

Back in the 90's (before filesystem snapshots on consumer systems) safe upgrade and rollback was possible to do with multiple partitions (and still is possible). Boot to one system partition. Upgrade the second. Boot to the second partition. Both mount a third partition of data, which might contain software, but is upgraded after the system partitions. There are almost always multiple ways to mitigate risk, and with the speed of USB nowadays, even a 6-year old system can boot and copy an entire system in minutes or less.

#19

Updated by Max_nl 9 months ago

That will further mitigate the problem, which is probably better for your system, though will make tracking this down more difficult

Yes.
Haven't seen it happen over the past days anymore with poll + my old high PHP_FCGI_MAX_REQUESTS.

However, please note that openssl 1.1.0d was released 26 Jan 2017 and openssl 1.1.0e is due out 16 Feb 2017

Just did an upgrade to OpenSSL 1.1.0e and it did not solve the logging of "error" code 0 stuff.

2017-02-17 22:52:58: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 22:53:39: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 22:53:41: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 22:58:14: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 22:59:23: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:00:45: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:01:28: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:01:49: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:04:00: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:10:01: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:12:07: (connections-glue.c.166) SSL: -1 5 54 Connection reset by peer 
2017-02-17 23:13:04: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 23:13:04: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 23:13:06: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 23:13:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
2017-02-17 23:13:07: (connections-glue.c.166) SSL: -1 5 0 Unknown error: 0 
#20

Updated by gstrauss 9 months ago

I attempted to untangle some of the gnarly code in mod_fastcgi and mod_scgi surrounding process management. The code is more consolidated and more consistent now, but some gnarly bits remain. Related patches will be linked to this ticket shortly, and may address the issues you were seeing. More testing is needed since I haven't been able to reproduce what you are seeing.

Separately, I'll look a bit more into the openssl spew.

#21

Updated by gstrauss 9 months ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100

Also available in: Atom