[Solved] Lighttpd Worker process stuck, CPU usage at 100%

Added by Grundor almost 6 years ago

After running for a while (about 10 to 15 minutes) and processing requests just fine, a Lighttpd worker process gets stuck in a loop, maxing out the CPU and no longer responding to requests. Once all workers are stuck, the server stops responding to requests entirely.
The images below show the restart, then one worker suddenly jumping to 100% CPU, and a few moments later a second worker doing the same.

Server Info:

OS: CentOS Linux release 7.5.1804 (Core)
CPU: Intel D 1540 8c/16t 2.0 GHz/2.6 GHz
RAM: 32 GB DDR4 ECC 2133 MHz
Disks: 2x 2 TB SATA3, 2x 400 GB SSD
SELinux disabled.

Lighttpd Version: lighttpd/1.4.49 (ssl) - (from yum repository)
Lighttpd Configuration:

Main Custom configs:

server.port = 80
server.stream-response-body = 2
server.stream-request-body = 2 
server.max-worker = 32
server.use-ipv6 = "disable" 
server.event-handler = "linux-sysepoll" 
server.network-backend = "linux-sendfile" 
server.stat-cache-engine = "fam" 

server.max-fds = 2048
server.max-connections = 1024
server.max-keep-alive-idle = 4
server.max-keep-alive-requests = 4
server.max-read-idle = 10
server.max-write-idle = 700

Full Config: http://paste.lighttpd.net/w5#bDXhWu2DPvGhU2Rh2cC2rVwo

RegEx Conditionals enabled
Fd-Event-Handler linux-sysepoll

Loaded Modules:
  • indexfile
  • openssl
  • access
  • alias
  • auth
  • evasive
  • redirect
  • rewrite
  • setenv
  • compress
  • magnet
  • proxy
  • expire
  • secdownload
  • fastcgi
  • accesslog
  • status
  • dirlisting
  • staticfile
  • authn_file

I didn't find anything in the logs except this:

(fdevent_linux_sysepoll.c.39) epoll_ctl failed: No such file or directory, dying
(stat_cache.c.289) no '/' found in the filename:

When I try a restart, it takes 2 to 5 minutes for the server to come back up.

It started after the last update a few days ago.

I don't know if I have the information to provide, please let me know.

Best.


Replies (7)

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by avij almost 6 years ago

For clarity -- what is the process name that eats the CPU? lighttpd, PHP or something else? top and strace -p someprocessid may be informative.

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by Grundor almost 6 years ago

The process name is lighttpd, if you check on the image footer (graph legend) all processes are lighttpd workers.

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by gstrauss almost 6 years ago

> It started after the last update a few days ago.
>
> I don't know if I have the information to provide, please let me know.

Yes, basic information such as "the last update" from what version?

Also, as avij asked, please provide a snippet of what one of the spinning processes is doing by using strace -s 1024 -p someprocessid

$SERVER["socket"] == ":443" is a top-level object, so while you have nested it into $HTTP["host"] entries, it probably does not do exactly what you think it is doing. $SERVER["socket"] == ":443" should be top-level in the config file, and then $HTTP["host"] should be within the $SERVER["socket"] condition if those hosts need to set their own specific ssl.* directives.
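
As an illustration of the layout described above, here is a minimal sketch. It is not the poster's actual config: the hostname and all file paths are hypothetical placeholders.

```
# Sketch only: "www.example.com" and every path below are hypothetical.

# The socket condition sits at the top level of the config file.
$SERVER["socket"] == ":443" {
    ssl.engine  = "enable"
    ssl.pemfile = "/etc/lighttpd/certs/default.pem"

    # Hosts that need their own ssl.* directives are nested INSIDE the socket condition.
    $HTTP["host"] == "www.example.com" {
        ssl.pemfile = "/etc/lighttpd/certs/www.example.com.pem"
    }
}

# Settings that are not ssl.* can stay in a separate top-level host condition.
$HTTP["host"] == "www.example.com" {
    server.document-root = "/var/www/example"
}
```

After rearranging the config, running lighttpd -t -f with the path to the config file checks the syntax before restarting.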

> (fdevent_linux_sysepoll.c.39) epoll_ctl failed: No such file or directory, dying

That suggests there might be a bug somewhere which has already cleaned up this fd. Try this instead of "linux-sysepoll":
server.event-handler = "poll"

If that does not help, then please try
server.stat-cache-engine = "simple"

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by Grundor almost 6 years ago

> Basic information such as "the last update" from what version?

From 1.4.48 to 1.4.49

> please provide a snippet of what one of the spinning processes is doing by using strace -s 1024 -p someprocessid

I'm not very familiar with strace, but I used your command.

On the stuck worker, it's basically an infinite loop of:

ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0
ioctl(25, FIONREAD, [81651])            = 0

On the main process:

strace: Process 7876 attached
wait4(-1,

I have many workers, so I reduced the count to just one in order to trace it before it goes to 100%. Here is the output:
http://paste.lighttpd.net/y5#ZbvRUouEkxbI9pXUT76WzC8Y

It is basically the same: an infinite loop of "ioctl(25, FIONREAD, [81651]) = 0"

> $SERVER["socket"] == ":443" is a top-level object, so while you have nested it into $HTTP["host"] entries, it probably does not do exactly what you think it is doing. $SERVER["socket"] == ":443" should be top-level in the config file, and then $HTTP["host"] should be within the $SERVER["socket"] condition if those hosts need to set their own specific ssl.* directives.

Thanks for the advice, I'll apply the proper fix. But is there any correlation with the problem?

> that suggests there might be a bug somewhere which has already cleaned up this fd. Try this instead of "linux-sysepoll"
> server.event-handler = "poll"
> If that does not help, then please try
> server.stat-cache-engine = "simple"

I made those changes: first only the event handler to "poll", then the stat cache engine to "simple" and also to "disable", and finally both together. The same behavior was observed each time:

...
ioctl(26, FIONREAD, [81632])            = 0
ioctl(26, FIONREAD, [81632])            = 0
ioctl(26, FIONREAD, [81632])            = 0

Finally, I completely disabled the worker options, leaving only one process. The main process then got stuck at 100% CPU in the same ioctl loop.

http://paste.lighttpd.net/z5#XSkvNV0VaSNSuteNvGJRm0D5

Thanks

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by gstrauss almost 6 years ago

> ioctl(25, FIONREAD, [81651])            = 0
> ioctl(25, FIONREAD, [81651])            = 0

Thanks. That's useful to guide us. Might be related to changes made for #2743 in d5d02583.
As a short-term mitigation, please try using

server.stream-response-body = 1
server.stream-request-body = 1

>> $SERVER["socket"] == ":443" is a top-level object, so while you have nested it into $HTTP["host"] entries, it probably does not do exactly what you think it is doing. $SERVER["socket"] == ":443" should be top-level in the config file, and then $HTTP["host"] should be within the $SERVER["socket"] condition if those hosts need to set their own specific ssl.* directives.
>
> Thanks for the advice, I'll apply the proper fix. But is there any correlation with the problem?

No, not related.

RE: Lighttpd Worker process stuck, CPU usage at 100% - Added by gstrauss almost 6 years ago

This looks just like #2878, which is fixed in lighttpd git master and will be part of the upcoming lighttpd 1.4.50. Sorry for the trouble.

Given the specs on your box, you'll probably find this problem is fixed and performance is better and PHP resource usage is lower with

server.stream-response-body = 1
server.stream-request-body = 1

RE: [Solved] Lighttpd Worker process stuck, CPU usage at 100% - Added by Grundor almost 6 years ago

It worked!

Setting the response body to 2 solved an issue related to streamed responses with "Access-Control-Expose-Headers: X-Event" and "Content-Type: text/event-stream" generated by PHP, which is not a main function of the application but is needed.

Thank you @gstrauss.
