Bug #2718
closedFastCGI stderr pipe breaks causing termination of script.
Description
Hi,
We have long-running (5-10 minutes) PHP maintenance scripts running every few minutes. We noticed that these scripts would complete only around 40% of the time. After stracing the process, we discovered this completely replicatable error:
When using FastCGI, with either lighty's own manager, or via spawn-fcgi, 60% of the time long running PHP processes will have their STDERR pipe break, causing an execution break in PHP. This doesn't throw any error on PHP's side, just ends the script prematurely.
Temporary resolution: By enabling the error_log directly in php.ini and bypassing STDERR in fcgi, we can avoid this bug, but it's an extra log file to manage.
OS: CentOs 5.10 x86_64
Lighttpd (x86_64) versions tested: 1.4.35, 1.4.39 (both internally managed and using spawn-fcgi PHP (32bit) processes).
Below is the strace snippet of the pipe breaking:
gettimeofday({1456448526, 431339}, NULL) = 0 write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 400) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- gettimeofday({1456448526, 431519}, NULL) = 0 poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout) write(8, "z\0\0\0\3update `DatabaseLock` set `"..., 126) = 126 read(8, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1 Cha"..., 16384) = 52 gettimeofday({1456448526, 438149}, NULL) = 0 flock(6, LOCK_UN) = 0 close(6) = 0 rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0}, {0x1000000000000001, ~[HUP INT QUIT ILL ABRT BUS USR1 SEGV TERM CHLD RT_1 RT_2 RT_3 RT_4 RT_11 RT_15 RT_17], SA_STACK|SA_SIGINFO|0x1f7ab88}, 8) = 0 write(8, "\1\0\0\0\1", 5) = 5 shutdown(8, 2 /* send and receive */) = 0 close(8) = 0 rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0}, {0x1000000000000001, ~[TRAP FPE SEGV USR2 STKFLT CONT TSTP TTIN XCPU XFSZ VTALRM WINCH IO RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_7 RT_8 RT_10 RT_11 RT_13 RT_16 RT_17 RT_21 RT_24 RT_25 RT_26 RT_28 RT_29], 0}, 8) = 0 brk(0xdd7d000) = 0xdd7d000 rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0x3606b0 /* SA_??? */}, {0x1000000000000001, ~[HUP INT ILL ABRT BUS KILL PIPE CHLD TSTP TTIN TTOU URG XCPU XFSZ IO PWR SYS RTMIN RT_1 RT_2 RT_3 RT_5 RT_ 7 RT_15 RT_17], 0}, 8) = 0 write(9, "\1\0\0\0\1", 5) = 5 shutdown(9, 2 /* send and receive */) = 0 close(9) = 0 rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0x3606b0 /* SA_??? */}, {0x1000000000000001, ~[TRAP FPE SEGV USR2 STKFLT CONT TSTP TTIN XCPU XFSZ VTALRM WINCH IO RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_7 RT _8 RT_10 RT_11 RT_13 RT_16 RT_17 RT_21 RT_24 RT_25 RT_26 RT_28 RT_29], 0x3606b0 /* SA_??? */}, 8) = 0 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={30, 0}}, NULL) = 0 rt_sigaction(SIGPROF, {0x10000000081e2840, [], SA_STACK|0x209f870}, {0x10000000081e2840, [ABRT BUS SEGV USR2 TERM STKFLT STOP TSTP URG XCPU RT_1], SA_STACK|SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_SIGIN FO|SA_NOCLDWAIT|0x2f2c330}, 8) = 0 rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0 fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0 fcntl64(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0 fcntl64(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0 fcntl64(4, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0 write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 400) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- brk(0xdc7d000) = 0xdc7d000 brk(0xdafd000) = 0xdafd000 brk(0xda7d000) = 0xda7d000 brk(0xd8fd000) = 0xd8fd000 brk(0xd87d000) = 0xd87d000 brk(0xd7bd000) = 0xd7bd000 brk(0xd73d000) = 0xd73d000 brk(0xd6bd000) = 0xd6bd000 brk(0xd624000) = 0xd624000 brk(0xd5a4000) = 0xd5a4000 brk(0xd524000) = 0xd524000 brk(0xd4a4000) = 0xd4a4000 brk(0xd424000) = 0xd424000 brk(0xd3a4000) = 0xd3a4000 brk(0xd324000) = 0xd324000 brk(0xd2a4000) = 0xd2a4000 brk(0xd224000) = 0xd224000 brk(0xd164000) = 0xd164000 brk(0xd0a4000) = 0xd0a4000 brk(0xcfa4000) = 0xcfa4000 setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0 write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 416) = -1 EPIPE (Broken pipe) --- SIGPIPE (Broken pipe) @ 0 (0) --- shutdown(5, 1 /* send */) = 0 recv(5, "\1\5\0\1\0\0\0\0", 8, 0) = 8 recv(5, "", 8, 0) = 0 close(5)
Files
Updated by gstrauss about 9 years ago
Would you strace -ttt (to include timestamps on the syscalls)? A timeout is probably occurring somewhere.
An alternative to consider is to separate the maintenance tasks from checking the status of the maintenance tasks. Background the maintenance process and include a file lock to avoid accidental concurrency. Potentially run the maintenance periodically from cron. Have the maintenance task output a status page to a temporary file, and then rename into place for access by the webserver. Checking the status of the maintenance task then becomes a simple fetch of the generated status page.
Updated by azilber about 9 years ago
- File 26544-snip.txt 26544-snip.txt added
Attached is the strace -ttt output.
Updated by gstrauss about 9 years ago
All the calls to write(), including the first one, are writing to a broken pipe.
Look back in your trace (before what you have attached) and see where file descriptor 5 gets open()ed or close()d.
Updated by gstrauss about 9 years ago
At the end of the trace, I see
1456899400.445976 recv(5, "\1\5\0\1\0\0\0\0", 8, 0) = 8
1456899400.446000 recv(5, "", 8, 0) = 0
That is a FastCGI STDIN packet of 0-length. What was sent to this script for the request? I don't see your script reading in any FastCGI request from the lighttpd server. I would expect that some time earlier, FCGI_BEGIN_REQUEST was sent. The timestamps in your trace show about 27 milliseconds, which is not a long-running script, so I suspect that information has been omitted from earlier in the request.
I'm sorry, I can't help you without more information. Please try again to include a more complete trace of the request handling by the script.
Updated by azilber about 9 years ago
- File 26544-snip.txt 26544-snip.txt added
gstrauss wrote:
All the calls to write(), including the first one, are writing to a broken pipe.
Look back in your trace (before what you have attached) and see where file descriptor 5 gets open()ed or close()d.
Here's the last ~1k lines of the strace. I've gone through and sanitized the data. This is as much data that I can reasonably send and it contains proprietary information. I don't think it would be too hard to make a php script that loops for 10 minutes and writes to stderr....
Updated by azilber about 9 years ago
gstrauss wrote:
At the end of the trace, I see
1456899400.445976 recv(5, "\1\5\0\1\0\0\0\0", 8, 0) = 8
1456899400.446000 recv(5, "", 8, 0) = 0
That is a FastCGI STDIN packet of 0-length. What was sent to this script for the request? I don't see your script reading in any FastCGI request from the lighttpd server. I would expect that some time earlier, FCGI_BEGIN_REQUEST was sent. The timestamps in your trace show about 27 milliseconds, which is not a long-running script, so I suspect that information has been omitted from earlier in the request.I'm sorry, I can't help you without more information. Please try again to include a more complete trace of the request handling by the script.
The events at 1456899400.445976 are unrelated, that's a health check connection coming in after the script breaks.
Updated by gstrauss about 9 years ago
Sorry you went through that effort. It's still not complete -- you intentionally truncated it again (the exact opposite of what I asked for) -- and therefore not very useful.
Try setting this in your config and restarting the server:
debug.log-timeouts = "enable"
and then look in your error log for "timed out after writing". It is likely that the request is taking too long to respond to the client and it may help if you increase server.max-write-idle, which currently defaults to 360 seconds (6 minutes).
Updated by azilber about 9 years ago
Hi,
Your observation was spot-on. Enabling log-timeouts debug output did
show timeouts, and increasing server.max-write-idle solved the issue. Many
thanks! This avoids me having to filter everything via syslog.
Thanks!
PS: I received the following error when replying to this ticket via email:
Delivery to the following recipient failed permanently: redmine@lighttpd.net Technical details of permanent failure: Google tried to deliver your message, but it was rejected by the server for the recipient domain lighttpd.net by mail.lighttpd.net. [2a01:4f8:100:8082::2]. The error that the other server returned was: 550 5.1.1 <redmine@lighttpd.net>: Recipient address rejected: User unknown in local recipient table
Updated by gstrauss about 9 years ago
- Status changed from New to Invalid
Thanks for the update. Closing ticket. Not a bug in lighttpd.
Thanks, too, for the email error report. I'll pass it along.
Also available in: Atom