Bug #887
closedLighty uses 99% cpu power [strace inside]
Description
I'm on kernel 2.6.16.29 with centos 4.4 installed, it is a SMP kernel and it runs a heavy PHP site. PHP version 5.1.6 with APC 3.0.12p2. This only happens when I use 1.4.12+ (yes even 1.4.13), this prob does not exist in version 1.4.11 which is what I used to use. I'm currently am able to use it fine as I've a dual core so only one core is being used but there are also many other people who have this problem suggested in your blog:
http://blog.lighttpd.net/articles/2006/09/20/pre-release-lighttpd-1-4-12-r1320
Here is the strace:
02:34:04.000304 shutdown(28, 1 /* send /) = 0
02:34:04.000589 shutdown(27, 1 / send /) = 0
02:34:04.000722 shutdown(24, 1 / send /) = 0
02:34:04.000851 shutdown(23, 1 / send /) = 0
02:34:04.000975 shutdown(22, 1 / send /) = 0
02:34:04.001102 shutdown(11, 1 / send /) = 0
02:34:04.001226 shutdown(20, 1 / send /) = 0
02:34:05.000102 shutdown(15, 1 / send /) = 0
02:34:05.000387 shutdown(38, 1 / send /) = 0
02:34:05.000599 shutdown(21, 1 / send /) = 0
02:34:05.000793 shutdown(34, 1 / send /) = 0
02:34:05.000984 shutdown(6, 1 / send /) = 0
02:34:05.001173 shutdown(7, 1 / send /) = 0
02:34:05.001360 shutdown(13, 1 / send /) = 0
02:34:05.001563 shutdown(33, 1 / send /) = 0
02:34:05.001688 shutdown(12, 1 / send /) = 0
02:34:05.739284 shutdown(18, 1 / send /) = 0
02:34:06.000144 shutdown(25, 1 / send /) = 0
02:34:06.000438 shutdown(31, 1 / send /) = 0
02:34:06.000672 shutdown(29, 1 / send /) = 0
02:34:07.000196 shutdown(40, 1 / send /) = 0
02:34:07.000382 shutdown(16, 1 / send /) = 0
02:34:07.000535 shutdown(35, 1 / send /) = 0
02:34:07.000657 shutdown(28, 1 / send /) = 0
02:34:07.000777 shutdown(36, 1 / send /) = 0
02:34:07.000896 shutdown(42, 1 / send /) = 0
02:34:07.001023 shutdown(9, 1 / send /) = 0
02:34:07.001144 shutdown(20, 1 / send /) = 0
02:34:07.001262 shutdown(10, 1 / send /) = 0
02:34:07.001382 shutdown(24, 1 / send /) = 0
02:34:07.001499 shutdown(30, 1 / send /) = 0
02:34:07.474339 shutdown(38, 1 / send /) = 0
02:34:07.871954 shutdown(41, 1 / send /) = 0
02:34:08.000078 shutdown(32, 1 / send /) = 0
02:34:08.000219 shutdown(27, 1 / send /) = 0
02:34:08.000376 shutdown(37, 1 / send /) = 0
02:34:08.000503 shutdown(13, 1 / send /) = 0
02:34:08.000640 shutdown(12, 1 / send /) = 0
02:34:08.000788 shutdown(8, 1 / send /) = 0
02:34:08.000916 shutdown(6, 1 / send /) = 0
02:34:08.001043 shutdown(26, 1 / send /) = 0
02:34:08.001181 shutdown(15, 1 / send /) = 0
02:34:08.001328 shutdown(11, 1 / send /) = 0
02:34:09.000102 shutdown(39, 1 / send /) = 0
02:34:09.000260 shutdown(19, 1 / send /) = 0
02:34:09.000394 shutdown(34, 1 / send /) = 0
02:34:09.000520 shutdown(7, 1 / send /) = 0
02:34:09.000644 shutdown(31, 1 / send /) = 0
02:34:09.000768 shutdown(21, 1 / send /) = 0
02:34:09.000892 shutdown(25, 1 / send /) = 0
02:34:09.001020 shutdown(29, 1 / send /) = 0
02:34:09.001146 shutdown(33, 1 / send /) = 0
02:34:09.001271 shutdown(18, 1 / send /) = 0
02:34:10.000121 shutdown(41, 1 / send /) = 0
02:34:10.000278 shutdown(35, 1 / send /) = 0
02:34:10.000416 shutdown(40, 1 / send /) = 0
02:34:10.000527 shutdown(36, 1 / send /) = 0
02:34:10.000680 shutdown(9, 1 / send /) = 0
02:34:10.000825 shutdown(30, 1 / send /) = 0
02:34:10.000949 shutdown(28, 1 / send /) = 0
02:34:10.001088 shutdown(10, 1 / send /) = 0
02:34:10.001211 shutdown(24, 1 / send /) = 0
02:34:11.000174 shutdown(13, 1 / send /) = 0
02:34:11.000335 shutdown(12, 1 / send /) = 0
02:34:11.000483 shutdown(11, 1 / send /) = 0
02:34:11.000611 shutdown(8, 1 / send /) = 0
02:34:11.000737 shutdown(6, 1 / send /) = 0
02:34:12.000098 shutdown(26, 1 / send /) = 0
02:34:12.000260 shutdown(29, 1 / send /) = 0
02:34:12.000387 shutdown(15, 1 / send /) = 0
02:34:12.000513 shutdown(20, 1 / send /) = 0
02:34:12.000662 shutdown(16, 1 / send /) = 0
02:34:12.000785 shutdown(27, 1 / send /) = 0
02:34:12.000927 shutdown(38, 1 / send /) = 0
02:34:12.001051 shutdown(25, 1 / send /) = 0
02:34:12.001175 shutdown(21, 1 / send /) = 0
02:34:13.000244 shutdown(31, 1 / send /) = 0
02:34:13.000412 shutdown(10, 1 / send /) = 0
02:34:13.000563 shutdown(9, 1 / send /) = 0
02:34:13.000688 shutdown(7, 1 / send /) = 0
02:34:13.000831 shutdown(18, 1 / send /) = 0
02:34:13.000956 shutdown(19, 1 / send /) = 0
02:34:13.998310 shutdown(26, 1 / send /) = 0
02:34:14.000228 shutdown(28, 1 / send /) = 0
02:34:14.000363 shutdown(13, 1 / send /) = 0
02:34:14.000489 shutdown(24, 1 / send /) = 0
02:34:14.000617 shutdown(12, 1 / send /) = 0
02:34:14.000731 shutdown(11, 1 / send /) = 0
02:34:14.664824 shutdown(12, 1 / send /) = 0
02:34:15.000105 shutdown(8, 1 / send /) = 0
02:34:15.000367 shutdown(25, 1 / send /) = 0
02:34:15.000562 shutdown(15, 1 / send /) = 0
02:34:16.000262 shutdown(27, 1 / send /) = 0
02:34:16.000519 shutdown(30, 1 / send /) = 0
02:34:16.000714 shutdown(18, 1 / send /) = 0
02:34:16.000905 shutdown(20, 1 / send /) = 0
02:34:16.001096 shutdown(10, 1 / send /) = 0
02:34:16.001287 shutdown(9, 1 / send /) = 0
02:34:16.001476 shutdown(6, 1 / send /) = 0
02:34:16.001667 shutdown(16, 1 / send /) = 0
02:34:16.001856 shutdown(23, 1 / send /) = 0
02:34:16.002060 shutdown(22, 1 / send /) = 0
02:34:16.002253 shutdown(7, 1 / send /) = 0
02:34:17.000057 shutdown(19, 1 / send /) = 0
02:34:17.000199 shutdown(26, 1 / send /) = 0
02:34:17.000317 shutdown(21, 1 / send /) = 0
02:34:17.000435 shutdown(13, 1 / send /) = 0
02:34:17.080507 shutdown(18, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
02:34:18.000133 shutdown(37, 1 / send /) = 0
02:34:18.000423 shutdown(28, 1 / send /) = 0
02:34:18.000627 shutdown(36, 1 / send /) = 0
02:34:18.000812 shutdown(35, 1 / send /) = 0
02:34:18.000995 shutdown(31, 1 / send /) = 0
02:34:18.001181 shutdown(32, 1 / send /) = 0
02:34:18.001367 shutdown(33, 1 / send /) = 0
02:34:18.001551 shutdown(25, 1 / send /) = 0
02:34:18.001734 shutdown(15, 1 / send /) = 0
02:34:18.001920 shutdown(34, 1 / send /) = 0
02:34:18.002102 shutdown(24, 1 / send /) = 0
02:34:19.000230 shutdown(30, 1 / send /) = 0
02:34:19.000516 shutdown(22, 1 / send /) = 0
02:34:19.000719 shutdown(6, 1 / send /) = 0
02:34:19.000903 shutdown(16, 1 / send /) = 0
02:34:20.000187 shutdown(18, 1 / send /) = 0
02:34:20.000450 shutdown(13, 1 / send /) = 0
02:34:20.000653 shutdown(12, 1 / send /) = 0
02:34:20.000837 shutdown(8, 1 / send /) = 0
02:34:20.001020 shutdown(29, 1 / send /) = 0
02:34:20.001203 shutdown(11, 1 / send /) = 0
02:34:21.000132 shutdown(20, 1 / send /) = 0
02:34:21.000277 shutdown(19, 1 / send /) = 0
02:34:21.000397 shutdown(21, 1 / send /) = 0
02:34:21.000513 shutdown(15, 1 / send /) = 0
02:34:22.002494 shutdown(28, 1 / send /) = 0
02:34:22.002699 shutdown(9, 1 / send /) = 0
02:34:22.002820 shutdown(27, 1 / send /) = 0
02:34:22.002936 shutdown(7, 1 / send /) = 0
02:34:22.003054 shutdown(22, 1 / send /) = 0
02:34:22.003171 shutdown(6, 1 / send /) = 0
02:34:22.003287 shutdown(16, 1 / send /) = 0
02:34:22.003403 shutdown(24, 1 / send /) = 0
02:34:23.000058 shutdown(29, 1 / send /) = 0
02:34:23.000338 shutdown(31, 1 / send /) = 0
02:34:23.000540 shutdown(23, 1 / send /) = 0
02:34:23.000724 shutdown(30, 1 / send /) = 0
02:34:23.000907 shutdown(26, 1 / send /) = 0
02:34:23.001091 shutdown(11, 1 / send /) = 0
02:34:23.001273 shutdown(8, 1 / send /) = 0
02:34:23.001456 shutdown(13, 1 / send /) = 0
02:34:24.000209 shutdown(32, 1 / send /) = 0
02:34:24.000489 shutdown(39, 1 / send /) = 0
02:34:24.000691 shutdown(18, 1 / send /) = 0
02:34:24.000877 shutdown(35, 1 / send /) = 0
02:34:24.001063 shutdown(15, 1 / send /) = 0
02:34:25.000209 shutdown(33, 1 / send /) = 0
02:34:25.000356 shutdown(22, 1 / send /) = 0
02:34:25.000478 shutdown(25, 1 / send /) = 0
02:34:25.000605 shutdown(34, 1 / send /) = 0
02:34:25.000723 shutdown(19, 1 / send /) = 0
02:34:25.000845 shutdown(21, 1 / send /) = 0
02:34:25.000969 shutdown(7, 1 / send /) = 0
02:34:25.001091 shutdown(6, 1 / send /) = 0
02:34:26.000098 shutdown(29, 1 / send /) = 0
02:34:26.000263 shutdown(27, 1 / send /) = 0
02:34:26.000394 shutdown(9, 1 / send /) = 0
02:34:26.000546 shutdown(36, 1 / send /) = 0
02:34:26.000673 shutdown(26, 1 / send /) = 0
02:34:26.000799 shutdown(12, 1 / send /) = 0
02:34:26.000924 shutdown(13, 1 / send /) = 0
02:34:26.001049 shutdown(20, 1 / send /) = 0
02:34:26.001176 shutdown(16, 1 / send /) = 0
02:34:26.001304 shutdown(8, 1 / send /) = 0
02:34:27.000104 shutdown(31, 1 / send /) = 0
02:34:27.000258 shutdown(28, 1 / send /) = 0
02:34:27.000379 shutdown(18, 1 / send /) = 0
02:34:27.000506 shutdown(23, 1 / send /) = 0
02:34:27.000633 shutdown(11, 1 / send /) = 0
02:34:28.000081 shutdown(35, 1 / send /) = 0
02:34:28.000212 shutdown(10, 1 / send /) = 0
02:34:28.000344 shutdown(33, 1 / send /) = 0
02:34:28.000470 shutdown(30, 1 / send /) = 0
02:34:28.000593 shutdown(15, 1 / send /) = 0
02:34:28.000736 shutdown(22, 1 / send /) = 0
02:34:28.000878 shutdown(21, 1 / send /) = 0
02:34:28.414278 shutdown(16, 1 / send */) = 0
Using this strace configuration:
strace -tt -s 4000 -etrace='!open,epoll_wait,epoll_ctl,sendfile64,read,fcntl,write,fcntl64,close,time,stat64,stat,writev,setsockopt,accept,getsockname,ioctl,socket,connect' -p 17591
debug# lighttpd -V
lighttpd-1.4.13 - a light and fast webserver
Build-Date: Oct 9 2006 23:52:55
Event Handlers:
+ select (generic)
+ poll (Unix)
+ rt-signals (Linux 2.4+)
+ epoll (Linux 2.6)
- /dev/poll (Solaris)
- kqueue (FreeBSD)
Network handler:
+ sendfile
Features:
+ IPv6 support
+ zlib support
+ bzip2 support
+ crypt support
- SSL Support
+ PCRE support
- mySQL support
- LDAP support
- memcached support
- FAM support
- LUA support
- xml support
- SQLite support
- GDBM support
I have tried self compiling with the simple ./configure and also using the rpms of fedora. It still does not fix it.
If you have any qs feel free to contact me.
-- Michael Fuita
Files
Updated by Michael about 18 years ago
Additional info:
This does not happen immediately. It happens after around 12 hours of running lighty. When exactly I don't know because I can't monitor it for 12 hours straight but it does happen.
Updated by spillgroup about 18 years ago
We have the same problems on 10+ servers. The weird thing is that the high CPU usage begins on all servers more or less simultaneously. This happens after running more than 24 hours without problems. We're on kernel 2.6.12. If you need input from us, please let us know. Right now we're considering downgrading to 1.4.11.
Updated by spillgroup about 18 years ago
By the way: we are not running any FastCGI processes. It's all static HTML with some mod_rewrite and mod_redirect statements.
Updated by Michael about 18 years ago
It's good to know that this is happening to other people as well. I suspect this could be to do with mod_redirect and mod_rewrite as I have them enabled as well. This is a very broad observation since you and I have it in use.
I have set up a test server for this and will turn off all modules to see if it is the cause of this high cpu usage.
Updated by spillgroup about 18 years ago
It happened again. Made a strace using the same command as mentioned by Michael:
13:18:13.442398 shutdown(274, 1 /* send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.443211 shutdown(203, 1 / send /) = 0
13:18:13.449205 shutdown(41, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.449360 shutdown(365, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.460042 shutdown(175, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.466666 shutdown(79, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.471415 shutdown(483, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.476445 shutdown(75, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.483484 shutdown(315, 1 / send /) = 0
13:18:13.489158 shutdown(639, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.495130 shutdown(108, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.500579 shutdown(243, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.517764 shutdown(162, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.522891 shutdown(290, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.538662 shutdown(238, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.538820 shutdown(427, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.544626 shutdown(483, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.544804 shutdown(144, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.557668 shutdown(333, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.567432 shutdown(41, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.569999 shutdown(51, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.588097 shutdown(373, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.589803 shutdown(225, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.602043 shutdown(177, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.607002 shutdown(489, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.612316 shutdown(495, 1 / send /) = -1 ENOTCONN (Transport endpoint is not connected)
13:18:13.650709 shutdown(290, 1 / send */) = -1 ENOTCONN (Transport endpoint is not connected)
Updated by spillgroup about 18 years ago
Sorry about the formatting. I'll try again:
13:18:13.442398 shutdown(274, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.443211 shutdown(203, 1 /* send */) = 0 13:18:13.449205 shutdown(41, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.449360 shutdown(365, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.460042 shutdown(175, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.466666 shutdown(79, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.471415 shutdown(483, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.476445 shutdown(75, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.483484 shutdown(315, 1 /* send */) = 0 13:18:13.489158 shutdown(639, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.495130 shutdown(108, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.500579 shutdown(243, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.517764 shutdown(162, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.522891 shutdown(290, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.538662 shutdown(238, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.538820 shutdown(427, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.544626 shutdown(483, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.544804 shutdown(144, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.557668 shutdown(333, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.567432 shutdown(41, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.569999 shutdown(51, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.588097 shutdown(373, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.589803 shutdown(225, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.602043 shutdown(177, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.607002 shutdown(489, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.612316 shutdown(495, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected) 13:18:13.650709 shutdown(290, 1 /* send */) = -1 ENOTCONN (Transport endpoint is not connected)
Updated by spillgroup about 18 years ago
Did some more tests. I removed this particular server from my cluster until there were no hits coming in again. Even then Lighttpd stayed using 99% CPU.
A full strace for this situation shows these messages at a very high rate:
epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0 epoll_ctl(7, EPOLL_CTL_MOD, 360, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=360, u64=360}}) = 0 time(NULL) = 1162556716 epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0 epoll_ctl(7, EPOLL_CTL_MOD, 360, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=360, u64=360}}) = 0 time(NULL) = 1162556716 epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0 epoll_ctl(7, EPOLL_CTL_MOD, 360, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=360, u64=360}}) = 0 time(NULL) = 1162556716 epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0 epoll_ctl(7, EPOLL_CTL_MOD, 360, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=360, u64=360}}) = 0 time(NULL) = 1162556716 epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0 epoll_ctl(7, EPOLL_CTL_MOD, 360, {EPOLLIN|EPOLLERR|EPOLLHUP, {u32=360, u64=360}}) = 0 time(NULL) = 1162556716 epoll_wait(7, {{EPOLLIN, {u32=360, u64=360}}}, 8183, 1000) = 1 ioctl(360, FIONREAD, [0]) = 0 read(360, "", 63) = 0
After restarting Lighttpd these messages appear at a very low rate:
Updated by jakabosky almost 18 years ago
Please try 1.4.x branch in svn.
svn checkout svn://svn.lighttpd.net/lighttpd/branches/lighttpd-1.4.x/
I fixed a bug with that same strace output.
Updated by spillgroup almost 18 years ago
I am willing to try the SVN version, but I'm afraid we cannot reproduce the bug. In fact we haven't seen the bug for over a month now. So I'm not sure if we can pinpoint the bug.
Updated by jakabosky almost 18 years ago
I can re-produce with bug with un-patched 1.4.13. It happens when a connection closes in the middle of sending the request headers. There is only one location where ioctl(FIONREAD) can return 0 and still call read(). So I am certain this bug is fixed in 1.4.x
Updated by spillgroup almost 18 years ago
We experienced the bug again on a couple of servers. Would it be possible to isolate the patch, so we can apply it to 1.4.13?
Updated by spillgroup almost 18 years ago
Thanks! I'll try to rollout this patch on my production servers later this week.
Updated by jakabosky almost 18 years ago
- Status changed from New to Fixed
- Resolution set to fixed
fixed in r1473
Also available in: Atom