Project

General

Profile

[Solved] Slow request handling in Lighttpd 1.4.60

Added by ThomasDevoogdt about 2 months ago

We're currently using Lighttpd 1.4.55 and are looking to move forward. The bug reported in #3099 prevented us from using Lighttpd 1.4.58/59.
Lighttpd 1.4.60 is just released, so I took it immediately up to try it, at least my reported bug in #3099 seems to be solved, but ...

The problem now is, when I do a lot of requests (using a webpage and javascript), that some requests (GET) are taking ~5s while it should be something like ~20ms.

First I thought that my issue was related to #3108, and waited (and even tried the proposed patches), but no luck.

The problem can be seen in this image:
image browser

Logging can be found in lighttpd-conf-export.txt, the used config in lighttpd-conf-export.txt.

Let me know what additional information is needed to troubleshoot.
Some info: Linux version 4.19.150, Buildroot 2021.02.4.


Replies (33)

RE: Slow request handeling in Lighttpd 1.4.60 - Added by Agossi about 2 months ago

I also saw some slower responses in 1.4.60 therefore I tried to disable HTTP2 (new default) via config:

server.feature-flags += ( "server.h2proto" => "disable" )

this seems to workaround the issue for now.

How is your response time with disabled HTTP2 ?

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 2 months ago

Yes, that seems to solve the issue. Big thanks.

In Docs_Performance, I read this:

    "support for the HTTP/2 protocol (enabled by default in lighttpd 1.4.59) uses more memory than HTTP/1.1;
    low-memory systems might choose to disable HTTP/2 protocol support: server.feature-flags += ("server.h2proto" => "disable")" 

So why do we see this in 1.4.60?

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

Thank you for the info. This sounds related to #3111 (also recently submitted)

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

ThomasDevoogdt

Would you include a trace with debug.log-response-header = "enable"? Are these static assets? If so, there will be Content-Length, and I can estimate if the total sizes line up with the observations so far in #3111. Thanks.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 2 months ago

I've added a trace with debug.log-response-header enabled, I also added an indication on which call it goes wrong.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

I understand why you edited the log that you attached, but please realize that I spelled out why I asked for that info: "Are these static assets? If so, there will be Content-Length, and I can estimate if the total sizes line up with the observations so far in #3111."
Since you omitted data from the log without sharing the length of the data omitted, the added log was unfortunately not of much use to me.

Somewhat related: I tracked down one issue and posted a patch in #3111. I am still reviewing code to see if there are related issues. I do not think the patch I just posted to #3111 applies to your issue since the patch does not apply to mod_fastcgi, and this is intentional for #3106 fixed in lighttpd 1.4.60.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 2 months ago

Hey, the data is generated on the fly, so no static data. I did not remove any data other than <link> and <name>. But I can provide other logs if needed. The patch foreseen in #3111 doesn't seem to solve anything.

What I now know:
  • Disabling HTTP/2 (server.feature-flags += ( "server.h2proto" => "disable" )) seems to solve the extreme long GET request, e.g. 1.2 min as seen in the picture.
  • It seems that my back-end is not blocked anymore by roughly 1000ms by adding debug logs (debug.log-response-header = "enable"), is there a race condition? Our back-end could be blamed for this (and I will investigate that also), but I doubt that since it's a piece of code that isn't touched for years. The weird thing here is that the request is almost handled immediately, but the back-end is blocked by that ~1000ms, so it's like Lighttpd is still expecting something.

Thanks already for the service.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

It seems that my back-end is not blocked anymore by roughly 1000ms by adding debug logs (debug.log-response-header = "enable"), is there a race condition?

There does seem to be an issue with how and when data is received from the backend, so I think the addition of the logging might perturb the timing.

Does the issue occur with server.stream-response-body = 0 ? (and with server.feature-flags += ("server.h2proto" => "enable") and debug.log-response-header = "disable"))

As I mentioned in #3111, lighttpd 1.4.60 schedules together the processing of front-end and backend-end events for requests on the same client connection. While this is more efficient, the change may have uncovered an issue that was hidden by calling the event handling routines multiple times for the connection and the backend. This is what I am trying to track down.

How busy is your system? Would you be able to compile lighttpd with some diagnostic patches (that I will provide) which produce additional debug trace in the error logs?

If the issue goes away with a repeated call to the backend, then that would confirm that we are on the right track. The following is a test patch to check this. While it is not harmful, it is not the correct solution, either, and should be for testing only.

--- a/src/connections.c
+++ b/src/connections.c
@@ -1265,6 +1265,7 @@ connection_state_machine_h2 (request_st * const h2r, connection * const con)
           #endif

             connection_state_machine_loop(r, con);
+            connection_state_machine_loop(r, con);

             if (r->resp_header_len && !chunkqueue_is_empty(&r->write_queue)
                 && max_bytes

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

If you are able to reproduce the issue in a small window, then fastcgi.debug = 3 might provide some useful data to see what is going on.

Separately -- since fastcgi.debug = 3 might perturb timing while logging trace -- an strace of lighttpd while reproducing the issue is likely to be useful.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

I notice in one of your original logs that every request has two sets of syslog timestamps, separated by ~550 to ~750 microseconds. Perhaps syslog is caching, or is reading from its socket and processing in blocks. Still, on my laptop from 2015, those log entries (one call to syslog() per line) occur in < 10 microseconds, not many hundreds of microseconds. Timing variances of hundreds of microseconds are almost certainly playing a role in the conditions causing the behavior you are seeing, but I will need help to reproduce this issue. How many bytes does your redacted <LINKS> represent? Are we talking 100 bytes? 1k? 2k? 4k?

2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1893) fd:10 id:23 resp: :status: 200
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: content-length: 81
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: content-type: application/json
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: link: <LINKS>
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: strict-transport-security: max-age=63072000; includeSubDomains
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: x-frame-options: DENY
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: x-content-type-options: nosniff
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: content-security-policy: script-src 'self';frame-ancestors 'none'
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: referrer-policy: no-referrer
2020-01-01T17:13:21.321997+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: cache-control: no-store
2020-01-01T17:13:21.322714+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: pragma: no-cache
2020-01-01T17:13:21.322714+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1882) fd:10 id:23 resp: expires: Fri, 01 Jan 1990 00:00:00 GMT
2020-01-01T17:13:21.322714+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1893) fd:10 id:23 resp: date: Wed, 01 Jan 2020 17:13:21 GMT
2020-01-01T17:13:21.322714+00:00 <name> daemon.err lighttpd[31539]: (h2.c.1893) fd:10 id:23 resp: server: lighttpd/1.4.60

FYI, in lighttpd 1.4.60, you can enable high-precision timestamps in error logs sent to files or pipes with server.feature-flags += ("server.errorlog-high-precision" => "enable") (syslog() handles its own timestamps)

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 2 months ago

Hello,

I added in the attachment a recapturing of the logs, in the same mode as my original problem I posted here.

  • fastcgi.debug = 3
  • debug.log-response-header = "enable"
  • server.feature-flags += ("server.errorlog-high-precision" => "enable")
  • # server.feature-flags += ("server.h2proto" => "disable") (so enabled)

I didn't hide the links (here 500 bytes, but sometimes more than 1k), as for further thought it's not needed.
The REST request where it goes wrong (hang for minutes) is `dolphin/capabilities`, but sometimes it's the next request, sometimes a request earlier, so which request doesn't say anything.

If I then add that patch, on the same settings, then there is no request anymore that lasts for minutes, but it still goes slowly, so applying the patch is as good as setting server.feature-flags += ("server.h2proto" => "disable"). (my backend still complains about user.warning seabert[1753]: SCHEDULER..> IOTask fastcgiserver ran for 1014 ms, but that might be another problem)

Adding more connection_state_machine_loop(r, con); doesn't do much either.

This timing table is captured while the patch was in place.

The first request where it goes wrong (randomly) is dolphin/hdmiIn2, which takes 1s, concurrent requests are taking respectively 1 second longer. A bit later, it goes smooth again and it goes wrong again at the dolphin/right request.

I will try strace later.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

2020-01-01T00:19:19.880091+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.349) released proc: pid: 0 socket: unix:/tmp/apiv3-fcgi.sock load: 0
2020-01-01T00:19:20.880855+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiv3-fcgi.sock 0 0 0 0
2020-01-01T00:19:20.880855+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/testapi-fcgi.sock 0 0 0 0
2020-01-01T00:19:20.880855+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/thumbnails-fcgi.sock-0 0 1 0 8765
2020-01-01T00:19:20.880855+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiproxy-fcgi.sock 0 0 0 0
2020-01-01T00:19:20.881168+00:00 <name> user.warning seabert[1749]: SCHEDULER..> IOTask fastcgiserver ran for 1003 ms

The released proc shows me that the connection to the backend was closed by lighttpd, so I think that your backend has an issue that is causing it to report SCHEDULER..> IOTask fastcgiserver ran for 1003 ms. An strace should show that lighttpd called close on the socket connection to the backend. If your backend is hanging when it should not, then there is a chance that this results in your backend having a slow response for the next request.

Is the request for hdmiln1 the one for which SCHEDULER..> IOTask fastcgiserver ran for 1003 ms is reported? And the request for hdmiln2 the next request that is slow?

The combination of:
debug.log-request-header = "enable"
debug.log-response-header = "enable"
is useful to match requests with responses and their timing.

The bug reported in #3099 prevented us from using Lighttpd 1.4.58/59

That is a user error in the configuration. If you have a chance, would you please test with lighttpd 1.4.59 to see if the issue still occurs for you with lighttpd 1.4.59?

If I then add that patch, on the same settings, then there is no request anymore that lasts for minutes, but it still goes slowly, so applying the patch is as good as setting server.feature-flags += ("server.h2proto" => "disable").

To which patch are you referring?

Adding more connection_state_machine_loop(r, con); doesn't do much either.

Obviously you do not mean that patch, but that is the only patch I have posted to this post, so I do not know to which patch you are referring with your earlier statement.

Aside: server.feature-flags += ("server.errorlog-high-precision" => "enable") does not do anything when syslog is being used (server.errorlog-use-syslog = "enable"). Your syslog timestamps might show microseconds, but it appears the granularity is actually hundreds of microseconds. For testing purposes, I suggest sending the error trace to a file instead of to syslog, to see if the timestamps of log entries actually show higher precision.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

lighttpd 1.4.60 adds the ability to set (coarse) timeouts to backends, independent from client timeout settings. You might add the following to your backends in fastcgi.server:

"connect-timeout" => 1,
"write-timeout" => 1,
"read-timeout" => 1

and see if lighttpd reports any timeouts to the backends. A value of '1' might be too short for production use, but might be useful for testing. The lighttpd timeout is coarse and a value of '1' is actually 1 - 2 seconds ([1,2))

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

Looking again at your waterfall and your description, it appears that the client sends multiple requests at the same time, something which is more likely in HTTP/2 than in HTTP/1.1. The long requests follow prior requests to the backend, and are made at almost the same exact time. What I do not see is a long request being the first request made to a backend (in a group of requests made about the same time). This suggests that how your backend is handling a burst of requests might be the issue, not lighttpd.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by gstrauss about 2 months ago

When using strace, I recommend strace -s 4096 -tt for a better idea (for you) what is going on, and for high-precision timestamps. If you are concerned with sharing the results, you can use a very small number like -s 20

For your own troubleshooting, please add trace to your application so that you know when a connection is accepted, when processing of each FastCGI request begins, and when processing of each FastCGI request ends.

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 1 month ago

I found where it blocks and what I can do to solve it.

+       FCGX_Detach(m_pFcgxRequest);
        FCGX_Finish_r(m_pFcgxRequest);

If not detached, then finish blocks.
Code under the hood is: https://github.com/FastCGI-Archives/fcgi2/commit/69d807250964425a027716d884644ce02cebc05c

So IDK, if you are something with this info?

RE: Slow request handeling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 1 month ago

To be a bit more complete:

FCGX_Detach(m_pFcgxRequest):
     r->detached = TRUE

FCGX_Finish_r(m_pFcgxRequest):
    FCGX_Free(m_pFcgxRequest, close)
        OS_IpcClose(request->ipcFd, ! request->detached)

So the shutdown part won't be called, and by this, it also won't block on the libfcgiOsClosePollTimeout.

With that fix, I still have to disable HTTP/2 (server.feature-flags += ( "server.h2proto" => "disable" )) but that's probably because the behaviour of my backend is not yet in line with HTTP/2.

So I would say that I had two problems in one forum post. HTTP/2 (which is enabled by default since 1.4.59), and the hangs on FCGX_Finish_r which I can fix with FCGX_Detach (and this seems to be a regression since Lighttpd 1.4.60).

RE: Slow request handling in Lighttpd 1.4.60 - Added by gstrauss about 1 month ago

I don't have the logs that you have, but with what you have presented here, I fail to see how you can logically conclude that this is a problem in lighttpd instead of a problem in your application. If your application is forking and you FCGX_Detach() before fork(), then you have to FCGX_Attach() in one of parent or child after the fork(). If your app is re-attaching in the parent and then the parent accepts another connection and calls fork(), then your application needs to again FCGX_Detach() the original connection in the new child.

With that fix, I still have to disable HTTP/2 (server.feature-flags += ( "server.h2proto" => "disable" )) but that's probably because the behaviour of my backend is not yet in line with HTTP/2.
... and the hangs on FCGX_Finish_r which I can fix with FCGX_Detach (and this seems to be a regression since Lighttpd 1.4.60).

Did you test with lighttpd 1.4.59? What was the result?

If modifying your app fixes the issue, then the race condition(s) was likely in your app, not lighttpd.

Based on your addition of FCGX_Detach(), it sounds to me that your FastCGI application is not written to properly handle forking. Instead of trying to be more clever, I would recommend that you make your application simpler by serializing request handling in your FastCGI backend. Do not accept a new connection until the first connection has finished. Even then, if your application is using fork/exec, then you should not leak the socket fd into multiple processes. If the request is handled in the parent, the socket fd should be marked O_CLOEXEC before fork/exec, or if you fork but do not exec, then the socket fd should be closed in the child. If the request is handled in the child, then the socket should be closed in the parent after the child is forked.

With that fix, I still have to disable HTTP/2 (server.feature-flags += ( "server.h2proto" => "disable" )) but that's probably because the behaviour of my backend is not yet in line with HTTP/2.

Based on the information you have provided, I don't think this has anything to do with HTTP/2. Have you load-tested your app (with HTTP/2 disabled in lighttpd), having multiple clients make requests in parallel, which might simulate the faster requests to your backend now coming from HTTP/2 connections?

RE: Slow request handling in Lighttpd 1.4.60 - Added by gstrauss about 1 month ago

I do not trust the timing in your logs. 2020-01-01 was about 22 months ago. Also, as I mentioned above, the timestamps from syslog seem to be clusters in blocks, so the precision is not very high even though decimal points are followed by 6 digits. Additionally, unless there is a write error to a backend -- which would be reported in the logs (unless you edited them away in what you posted) -- lighttpd reports the status of each backend proc about once a second, not multiple times a second as your logs show. Is time unstable on your system?

 
2020-01-01T00:19:19.262648+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiv3-fcgi.sock 0 0 0 0
2020-01-01T00:19:19.262648+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/testapi-fcgi.sock 0 0 0 0
2020-01-01T00:19:19.262648+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/thumbnails-fcgi.sock-0 0 1 0 8765
2020-01-01T00:19:19.262648+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiproxy-fcgi.sock 0 0 0 0
2020-01-01T00:19:19.877730+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiv3-fcgi.sock 0 0 0 0
2020-01-01T00:19:19.877730+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/testapi-fcgi.sock 0 0 0 0
2020-01-01T00:19:19.877730+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/thumbnails-fcgi.sock-0 0 1 0 8765
2020-01-01T00:19:19.877730+00:00 <name> daemon.err lighttpd[8764]: (gw_backend.c.1096) proc: unix:/tmp/apiproxy-fcgi.sock 0 0 0 0

Edit: I should clarify that in lighttpd 1.4.60, lighttpd uses the system monotonic clock for internal timeouts (not for externally-reported stringified times), so reporting the status of each backend proc occurs about once a second on the system monotonic clock. If the system realtime clock is not also generally stable and increasing, you might have other problems.

RE: [Solved] Slow request handling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 1 month ago

Hi, since that patch is part of Lighttpd 1.4.61, I simply bumped my setup.
Problem is not gone. That FCGX_Detach fix in my back-end code is still needed.
No fork() is done related to FCGX handling, and O_CLOEXEC is set.

About #3099, I was wrong, it still persists in Lighttpd 1.4.60, The crash is related to LibreSSL when an SSL_read call is done. I will solve that by going back to OpenSSL.

A last small remark. It seems that the server.pid-file is not always properly written. The file is touched but no content is written.
If some buildroot user is facing the same problem, then this workaround source:0001-package-lighttpd-S50lighttpd-fix-proc-name.patch can help you.

For me it's fine to close this thread.

Thank you for all the support!

RE: [Solved] Slow request handling in Lighttpd 1.4.60 - Added by gstrauss about 1 month ago

Problem is not gone. That FCGX_Detach fix in my back-end code is still needed.

Unfortunately, there will not be much forward progress until you can help me to reproduce the issue. I would suggest starting with a much simpler FastCGI program (starting with "Hello World") and working from there. Also, please verify if you see this issue with lighttpd 1.4.59 or not, and whether or not you saw this issue with lighttpd 1.4.55. Please confirm that the issue occurs only when you have HTTP/2 enabled in lighttpd. (server.feature-flags += ( "server.h2proto" => "enable" ))

The data you have shared has not been very consistent or trustable (e.g. misleading syslog timestamps). Logs without high precision timestamps are not likely to help track this issue down.

About #3099, I was wrong, it still persists in Lighttpd 1.4.60, The crash is related to LibreSSL when an SSL_read call is done.

That is not actionable information. $SERVER["socket"] must not be nested in other conditions in your lighttpd config. If $SERVER["socket"] is not in a nested condition in lighttpd.conf and lighttpd 1.4.61 mod_openssl built with LibreSSL is crashing in SSL_read(), then a stack trace might help. If you can not get a stack trace, then in #3099 you noted that you added trace to find that the error occurred under SSL_read(). There are many callbacks in the OpenSSL (and LibreSSL) libraries that occur during TLS processing, especially at the start of the connection. These callbacks are called from SSL_read() and SSL_write(), so you'll have to instrument the callbacks (in lighttpd mod_openssl.c) to track down the location of the crash. If you can trigger the crash, it is easier to run lighttpd under a debugger and then to trigger the crash and print a backtrace.

A last small remark. It seems that the server.pid-file is not always properly written. The file is touched but no content is written.

package/lighttpd/S50lighttpd is not from lighttpd and not from this site. You assume that everyone is the same as you and that is not the case.
Linux version 4.19.150, Buildroot 2021.02.4 Whoever wrote package/lighttpd/S50lighttpd for Buildroot probably assumes that procd manages that file. If you also configure lighttpd.conf to write its pid to that file, then you have two processes writing to that file. Two processes writing to that file is a recipe for corrupting the file, which is probably why you see it empty, as lighttpd clears the file upon exit. You should submit your patch to your Linux distro (Buildroot), since your Linux distro (Buildroot) made that mistake.

RE: [Solved] Slow request handling in Lighttpd 1.4.60 - Added by gstrauss about 1 month ago

Problem is not gone. That FCGX_Detach fix in my back-end code is still needed.

Is the problem gone if your code uses FCGX_Detach()? I think you might try to strace your FastCGI application and review what it is doing.

As I posted above:

When using strace, I recommend strace -s 4096 -tt for a better idea (for you) what is going on, and for high-precision timestamps. If you are concerned with sharing the results, you can use a very small number like -s 20

RE: [Solved] Slow request handling in Lighttpd 1.4.60 - Added by ThomasDevoogdt about 1 month ago

FCGX_Detach() solves indeed my problem and was not needed in Lighttpd 1.4.55.

So to go further with your suggestion, I started debugging with strace.

2020-01-07T19:16:32.176544+00:00 mna-440-2530254548 user.warning seabert[18207]: SCHEDULER..> IOTask fastcgiserver ran for 1004 ms
2020-01-07T19:16:32.176544+00:00 mna-440-2530254548 user.notice root: 19:16:32.175033 close(14)               = 0
2020-01-07T19:16:32.185989+00:00 mna-440-2530254548 user.notice root: 19:16:32.175337 epoll_pwait(7, [{EPOLLIN, {u32=806767952, u64=806767952}}], 1025, 1000, NULL, 8) = 1
2020-01-07T19:16:32.185989+00:00 mna-440-2530254548 user.notice root: 19:16:32.185834 ioctl(21, FIONREAD, [224]) = 0
2020-01-07T19:16:32.186433+00:00 mna-440-2530254548 user.notice root: 19:16:32.186092 read(21, "<data>", 8192) = 224
2020-01-07T19:16:32.186642+00:00 mna-440-2530254548 user.notice root: 19:16:32.186341 epoll_ctl(7, EPOLL_CTL_DEL, 21, NULL) = 0
2020-01-07T19:16:32.186934+00:00 mna-440-2530254548 user.notice root: 19:16:32.186561 setsockopt(13, SOL_TCP, TCP_CORK, [1], 4) = 0
2020-01-07T19:16:32.187172+00:00 mna-440-2530254548 user.notice root: 19:16:32.186818 write(13, "<data>", 605) = 605
2020-01-07T19:16:32.187555+00:00 mna-440-2530254548 user.notice root: 19:16:32.187180 setsockopt(13, SOL_TCP, TCP_CORK, [0], 4) = 0
2020-01-07T19:16:32.187764+00:00 mna-440-2530254548 user.notice root: 19:16:32.187439 read(13, 0x301d12a3, 5) = -1 EAGAIN (Resource temporarily unavailable)
2020-01-07T19:16:32.187970+00:00 mna-440-2530254548 user.notice root: 19:16:32.187658 epoll_ctl(7, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=806937536, u64=806937536}}) = 0
2020-01-07T19:16:33.189333+00:00 mna-440-2530254548 user.notice root: 19:16:32.187862 epoll_pwait(7, [], 1025, 1000, NULL, 8) = 0
2020-01-07T19:16:33.190196+00:00 mna-440-2530254548 user.warning seabert[18207]: SCHEDULER..> IOTask fastcgiserver ran for 1004 ms
2020-01-07T19:16:33.190196+00:00 mna-440-2530254548 user.notice root: 19:16:33.189157 close(21)               = 0
2020-01-07T19:16:33.201849+00:00 mna-440-2530254548 user.notice root: 19:16:33.189444 epoll_pwait(7, [{EPOLLIN, {u32=806334944, u64=806334944}}], 1025, 1000, NULL, 8) = 1
2020-01-07T19:16:33.202125+00:00 mna-440-2530254548 user.notice root: 19:16:33.201697 ioctl(15, FIONREAD, [224]) = 0
2020-01-07T19:16:33.202125+00:00 mna-440-2530254548 user.notice root: 19:16:33.201959 read(15, "<data>", 8192) = 224
2020-01-07T19:16:33.202539+00:00 mna-440-2530254548 user.notice root: 19:16:33.202214 epoll_ctl(7, EPOLL_CTL_DEL, 15, NULL) = 0
2020-01-07T19:16:33.202748+00:00 mna-440-2530254548 user.notice root: 19:16:33.202438 setsockopt(18, SOL_TCP, TCP_CORK, [1], 4) = 0
2020-01-07T19:16:33.203020+00:00 mna-440-2530254548 user.notice root: 19:16:33.202692 write(18, "<data>", 606) = 606
2020-01-07T19:16:33.203409+00:00 mna-440-2530254548 user.notice root: 19:16:33.203055 setsockopt(18, SOL_TCP, TCP_CORK, [0], 4) = 0
2020-01-07T19:16:33.203616+00:00 mna-440-2530254548 user.notice root: 19:16:33.203318 read(18, 0x301d12a3, 5) = -1 EAGAIN (Resource temporarily unavailable)
2020-01-07T19:16:33.203821+00:00 mna-440-2530254548 user.notice root: 19:16:33.203539 epoll_ctl(7, EPOLL_CTL_MOD, 18, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=806895520, u64=806895520}}) = 0
2020-01-07T19:16:34.205259+00:00 mna-440-2530254548 user.notice root: 19:16:33.203742 epoll_pwait(7, [], 1025, 1000, NULL, 8) = 0
2020-01-07T19:16:34.206644+00:00 mna-440-2530254548 user.warning seabert[18207]: SCHEDULER..> IOTask fastcgiserver ran for 1004 ms

I first found out that doing this:

diff --git a/src/server.c b/src/server.c
index 3eed0aa0..3286fa52 100644
--- a/src/server.c
+++ b/src/server.c
@@ -1970,7 +1970,7 @@ static void server_main_loop (server * const srv) {
         log_con_jqueue = sentinel;
         server_run_con_queue(joblist, sentinel);

-        if (fdevent_poll(srv->ev, log_con_jqueue != sentinel ? 0 : 1000) > 0)
+        if (fdevent_poll(srv->ev, log_con_jqueue != sentinel ? 0 : 2000) > 0)
             last_active_ts = log_monotonic_secs;
     }
 }

Causes these prints on my side:

2020-01-01T00:08:52.370355+00:00 <name> user.warning seabert[1749]: SCHEDULER..> IOTask fastcgiserver ran for 2001 ms
2020-01-01T00:08:57.361583+00:00 <name> user.warning seabert[1749]: SCHEDULER..> IOTask fastcgiserver ran for 2003 ms
2020-01-01T00:09:02.361994+00:00 <name> user.warning seabert[1749]: SCHEDULER..> IOTask fastcgiserver ran for 2002 ms

Since this seems to be part of the scheduling code, it doesn't learn me a lot. But an obvious question here, why is there no task scheduled to finish my task, so that I don't need a FCGX_Detach(). So I searched further on joblist_append.

So diving a bit deeper, and bee35b48779da46d9effdfc1601189e5d08704bd drew my attention (and of a lot of others before I found this one). I reverted it. & voila, fixed.

I'm not sure what this commit does, but I would think that the "work" now is done directly, and not scheduled again for later. Is there something missing at my end which is causing a problem when the "work" is done directly? I write work in-between "", because I can't fully follow what goes on here.

RE: [Solved] Slow request handling in Lighttpd 1.4.60 - Added by gstrauss about 1 month ago

bee35b48779da46d9effdfc1601189e5d08704bd

mark and temporarily defer event handling of backend events in order
to handle at same time as the associated front-end connection events

That saves the ready events and adds the connection to the job queue. At that point, the only reason I can think of why those events would not get processed would be that lighttpd considered the response completed and cleaned up the backend connection.

fastcgi.debug = 3 in lighttpd.conf will give you information about lighttpd backends.

This tells me that lighttpd was waiting to read from the backend on fd 13:

2020-01-07T19:16:32.187764+00:00 mna-440-2530254548 user.notice root: 19:16:32.187439 read(13, 0x301d12a3, 5) = -1 EAGAIN (Resource temporarily unavailable)
2020-01-07T19:16:32.187970+00:00 mna-440-2530254548 user.notice root: 19:16:32.187658 epoll_ctl(7, EPOLL_CTL_MOD, 13, {EPOLLIN|EPOLLERR|EPOLLHUP|EPOLLRDHUP, {u32=806937536, u64=806937536}}) = 0
2020-01-07T19:16:33.189333+00:00 mna-440-2530254548 user.notice root: 19:16:32.187862 epoll_pwait(7, [], 1025, 1000, NULL, 8) = 0
2020-01-07T19:16:33.190196+00:00 mna-440-2530254548 user.warning seabert[18207]: SCHEDULER..> IOTask fastcgiserver ran for 1004 ms

If lighttpd had already read data from the backend, then it should have processed all the completed FastCGI packets that were received, so lighttpd looks like it is waiting to read from the backend, and after one second your backend issues "SCHEDULER..> IOTask fastcgiserver ran for 1004 ms".

However, if reverting bee35b48779da46d9effdfc1601189e5d08704bd hid the issue for you, then I need to think on this to see if any assumptions may have been violated. At the same time, FCGX_Detach() also hides or fixes the issue for you.

(1-25/33)