Project

General

Profile

Actions

Bug #1918

closed

Lighttpd causes high load

Added by LaszloKAROLYI about 12 years ago. Updated about 12 years ago.

Status:
Invalid
Priority:
Normal
Category:
core
Target version:
ASK QUESTIONS IN Forums:

Description

Hi,

I've written a php chat for myself, which would not use php heavily. It uses a 'chatdata.txt', which the client is requesting 5 seconds repeatedly. When someone is writing something to the chat, the php parses the chatdata.txt into chatdata.txt.temp, and then replaces chatdata.txt with a rename().

I don't know how this is, but sometimes lighttpd is running at high CPU usage because of that. What i did was to truss (FreeBSD strace) the process on high load, and now i am attaching this file to the bugreport.

I've tried gamin (a bsd file alteration monitor), but it was not working, it caused high load from the first running with lighttpd.

Is there any possibility to fix this bug?

truss file attached.


Files

truss.log (93.1 KB) truss.log LaszloKAROLYI, 2009-02-25 13:13
strace.txt (107 KB) strace.txt Meskalyn, 2009-02-26 23:18
audrey.guiks.net-cpu-day.png (20.7 KB) audrey.guiks.net-cpu-day.png load Meskalyn, 2009-03-11 01:30
strace.txt (15.9 KB) strace.txt borec, 2009-03-19 11:46
Actions #1

Updated by Meskalyn about 12 years ago

Same for me, problem related in this post (without strace) :
http://redmine.lighttpd.net/boards/2/topics/886

Actions #2

Updated by Meskalyn about 12 years ago

I don't know if this can help or something :

22:48:17.563347 read(73, 0x8a31000, 18437) = -1 EAGAIN (Resource temporarily unavailable)
22:48:17.563384 getpid()                = 57470 (ppid 1)
22:48:17.563426 shutdown(73, 1 /* send */) = 0
22:48:17.563492 kevent(0x9, 0xbfbfe4c0, 0x1, 0, 0, 0xbfbfe4b8) = 0
22:48:17.563538 close(73)               = 0
22:48:17.563582 getpid()                = 57470 (ppid 1)
22:48:17.563619 shutdown(52, 1 /* send */) = 0
22:48:17.563666 kevent(0x9, 0xbfbfe4c0, 0x1, 0, 0, 0xbfbfe4b8) = 0
22:48:17.563707 close(52)               = 0
22:48:17.563746 getpid()                = 57470 (ppid 1)
22:48:17.563795 write(101, "\25\3\1\0 LZ\271Q-\343\2561\0\365d\225g\357\204\372\240\307|Z\301bWl\243\210\344Yx\265\f\270"..., 37) = 37
22:48:17.563860 getpid()                = 57470 (ppid 1)
22:48:17.563897 read(101, 0x8bff000, 18437) = -1 EAGAIN (Resource temporarily unavailable)
22:48:17.563933 getpid()                = 57470 (ppid 1)

I see something into a change with EAGAIN, but I don't know nothing about what it meen, I can submit other strace with better format later (I don't know really how to find the answer of the problem)...

Actions #3

Updated by aevernon about 12 years ago

Are either of you able to attach your code to this ticket? I have a FreeBSD box and would like to see if I can reproduce the bug.

Meskalyn, is your application a PHP one, too?

Actions #4

Updated by Meskalyn about 12 years ago

aevernon wrote:

Are either of you able to attach your code to this ticket? I have a FreeBSD box and would like to see if I can reproduce the bug.

Meskalyn, is your application a PHP one, too?

Yes I don't rewrite what I say into the forum (link "upper"), my english isn't very very good and I try to give you the most possible comprehensible information..

My code is a "torrent repository" with "huge" traffic (arround 1M page view / day), on the torrent tracker itself, we are on Debian with Lighty 1.4.19, we see no problem (arround 5M page "hit" / day / tracker (5 lighttpd server in backend HAProxy)).
The code is really "simple", work with php 4, php 5, this is "just" php site...

I'm not able to do more test (I'm going to holidays) during one week, and I don't want to put an "azardous" solution before I go (I'm alone to maintain servers). Currently, server stay up, never goes down during this 4 high load days...

Actions #5

Updated by Meskalyn about 12 years ago

Just a strace added of php-cgi process (the most loaded at the moment)

Actions #6

Updated by aevernon about 12 years ago

Icy asked me to figure out where getpid() is being called. I attached gdb but could only find it being called from server.c:527 and server.c:880 when it starts up.

Actions #7

Updated by LaszloKAROLYI about 12 years ago

aevernon wrote:

Are either of you able to attach your code to this ticket? I have a FreeBSD box and would like to see if I can reproduce the bug.

Meskalyn, is your application a PHP one, too?

Hi,

what code dou you want me to attach? The main point is to request a file (for which you don't have to run a php code), and, modify that file meanwhile with php. The modifier php code snippet is this:

function addChatText($chatText) {
    $fileArray = file(CHATTEXT_PATH, FILE_TEXT);
    $fileArrayCount = count($fileArray);
    for ($lineCount = $fileArrayCount; $lineCount >= MAXLINES; $lineCount--) {
        $line = array_shift($fileArray);
    }
    array_push($fileArray, "<!-- " . strftime('%Y-%m-%d') . " -->" . $chatText);
    // Create temporary file
    file_put_contents(CHATTEXT_PATH . 'temp.txt', $fileArray);
    rename(CHATTEXT_PATH . 'temp.txt', CHATTEXT_PATH);
}

And the requester javascript object:

function loadChatTextWindow() {
    var url = phpVars.chatTextName;
    refreshSeed();
    var myAjax = new Ajax.Request(url, {method: 'get', parameters: 'seed=' + uniqId + '-' + $F('seed'), onComplete: parseCha
tTextWindow } );
    chatTextRefreshId = setTimeout("loadChatTextWindow()", phpVars.chatWindowRefreshTime);
}

function parseChatTextWindow(loadedData) {
    chatWindowText = loadedData.responseText;
    if (oldChatWindowText != chatWindowText) {
        $('chatWindow').innerHTML = chatWindowText;
        oldChatWindowText = chatWindowText;
        setTimeout('scrollWindowToBottom()', 100);
    }
}
Actions #8

Updated by stbuehler about 12 years ago

  • Target version changed from 1.4.21 to 1.4.22
Actions #9

Updated by stbuehler about 12 years ago

  • Target version changed from 1.4.22 to 1.4.23

The only thing i can see in the truss.log is that lighty somehow gets events for an fd, but there is no data:

kevent(112,0x0,0,{0x4,EVFILT_READ,EV_CLEAR,0,0x1,0x0},11096,{1.000000000}) = 1 (0x1)
accept(4,0xbfbfe9fc,0xbfbfea68)                  ERR#35 'Resource temporarily unavailable'

I have no idea what is wrong there, but i don't think it is lighty's fault.

Actions #10

Updated by LaszloKAROLYI about 12 years ago

stbuehler wrote:

The only thing i can see in the truss.log is that lighty somehow gets events for an fd, but there is no data:
[...]

I have no idea what is wrong there, but i don't think it is lighty's fault.

Is it there, what type of event it is? Maybe it's not the expected type of event. Maybe it is not a read event, something else.

That's just a guess.

Actions #11

Updated by Meskalyn about 12 years ago

I'm not very sure, I just done the upgrade from PHP 5.2.8 to 5.2.9 and the load decrease a lot...
I haven't got the "back" to say if it is OR not a lighttpd bug, but it seem to be a php bug, not a lighttpd but.

I have a heavy usage of strtotime() function : http://bugs.php.net/bug.php?id=46889
or it could be this one : http://bugs.php.net/bug.php?id=45423

It's just "suposition" I'm not very user friendly with PHP...

Actions #12

Updated by LaszloKAROLYI about 12 years ago

Meskalyn wrote:

I'm not very sure, I just done the upgrade from PHP 5.2.8 to 5.2.9 and the load decrease a lot...
I haven't got the "back" to say if it is OR not a lighttpd bug, but it seem to be a php bug, not a lighttpd but.

I have a heavy usage of strtotime() function : http://bugs.php.net/bug.php?id=46889
or it could be this one : http://bugs.php.net/bug.php?id=45423

It's just "suposition" I'm not very user friendly with PHP...

Believe me, it's not about php. I can show the CPU statistics on my BSD box, with random peaks on it, caused by lighttpd.

Actions #13

Updated by stbuehler about 12 years ago

How do you know the peaks are caused by lighty? Until now here is nothing we can work with.

Btw: are you using server.max-worker?

Actions #14

Updated by LaszloKAROLYI about 12 years ago

stbuehler wrote:

How do you know the peaks are caused by lighty? Until now here is nothing we can work with.

Btw: are you using server.max-worker?

'cause i'm running a 'top' in a screen session on the server, and every time it gets overloaded, lighttpd processes are on top, using around 95% CPU.

relevant configuration snippet is:

server.event-handler = "freebsd-kqueue"
#server.stat-cache-engine = "fam"
server.max-worker = 4

(the second line shows i was trying to use fam/gamin, but it caused permanent high load, so i stopped using it.)

do you want me to attach my cpu graphics?

Actions #15

Updated by stbuehler about 12 years ago

  • Status changed from New to Invalid

http://redmine.lighttpd.net/wiki/lighttpd/Server.max-workerDetails

DO NOT REPORT ERRORS OR BUGS IF YOU DID NOT TEST WITHOUT THIS OPTION SET

Actions #16

Updated by LaszloKAROLYI about 12 years ago

  • Status changed from Invalid to Patch Pending

stbuehler wrote:

http://redmine.lighttpd.net/wiki/lighttpd/Server.max-workerDetails

DO NOT REPORT ERRORS OR BUGS IF YOU DID NOT TEST WITHOUT THIS OPTION SET

So, whats the difference? The server has 2 hyperthreaded processors, and in the original lighttpd documentation was to set the max-worker value to two times the number of processors.

I still don't see the reason why this bug had to be set as invalid.

Actions #17

Updated by stbuehler about 12 years ago

  • Status changed from Patch Pending to Invalid

I searched the docs in the current version, and there is nothing mentioned "two times the number of processors"...
And I think the docs are pretty clear: no support for max-worker problems.

Actions #18

Updated by LaszloKAROLYI about 12 years ago

stbuehler wrote:

I searched the docs in the current version, and there is nothing mentioned "two times the number of processors"...
And I think the docs are pretty clear: no support for max-worker problems.

okay, i commented out the max-worker option, and used truss on the one running lighttpd process. There is no high load (so far), but those 'Resource temporarily unavailable' lines are still there. Just one example:

kevent(112,{0x75,EVFILT_READ,EV_ADD|EV_CLEAR,0,0x0,0x0},1,0x0,0,{0.000000000}) = 0 (0x0)
accept(4,{ AF_INET 84.206.43.2:32910 },0xbfbfea58) = 149 (0x95)
fcntl(149,F_SETFD,FD_CLOEXEC)                    = 0 (0x0)
fcntl(149,F_SETFL,O_NONBLOCK|0x2)                = 0 (0x0)
ioctl(149,FIONREAD,0xbfbfe5d4)                   = 0 (0x0)
ioctl(149,FIONREAD,0xbfbfe5d4)                   = 0 (0x0)
read(149,0x28d41000,4159)                        ERR#35 'Resource temporarily unavailable'

and there are lots of lines like that. It seems for me that lighttpd gets a read event on a socket, but there's no real data on it.

I don't know what will happen in the evening when the server usage will get higher, but if that one process will eat up the cpu, my server will stop providing its services. That will be critical.

Any suggestions?

Actions #19

Updated by LaszloKAROLYI about 12 years ago

  • Status changed from Invalid to Reopened
Actions #20

Updated by stbuehler about 12 years ago

EAGAIN itself is not an error - it is just bad if you receive "spurious" readiness notifications (that was one problem with max-worker: every time a connection was established, every worker got a readiness signal, but only one could accept it and the others got EAGAIN). Here you just get EAGAIN after lighty accepted the socket, there was no readiness notification for it.

And another thing: high load is not necessarily a bug... it may be just high load :)
To find the bottle neck you need to compile with debug symbols and run a profiler with it (i don't know what is available on freebsd; valgrind/oprofile/..?)

Actions #21

Updated by borec about 12 years ago

Hello, I'm experiencing something similar to the problem described byLaszloKAROLYI. However, in my case lighty is rapidly getting memory full - 6GB in just few seconds. In strace there is lot of EAGAIN (Resource temporarily unavailable).

Tried version 1.4.20 and 1.4.22. Tried running single and multi processes.

Running Gentoo Linux 64bit on Core i7 920 with 6GB DDR3 mem.

Actions #22

Updated by borec about 12 years ago

Strace log attached.

Actions #23

Updated by stbuehler about 12 years ago

  • Status changed from Reopened to Invalid

borec: high memory usage is not the same as high load. You are probably just sending big files via (fast)cgi or mod_proxy. And EAGAIN is nothing bad. (Please read what i wrote...)

As LaszloKAROLYI didn't report high loads since dropping max-worker, i will close this.

Actions

Also available in: Atom