Project

General

Profile

Actions

Bug #2718

closed

FastCGI stderr pipe breaks causing termination of script.

Added by azilber about 8 years ago. Updated almost 8 years ago.

Status:
Invalid
Priority:
Normal
Category:
mod_fastcgi
Target version:
-
ASK QUESTIONS IN Forums:

Description

Hi,

We have long-running (5-10 minutes) PHP maintenance scripts running every few minutes.   We noticed that these scripts would complete only around 40% of the time.   After stracing the process, we discovered this completely replicatable error:

When using FastCGI, with either lighty's own manager, or via spawn-fcgi, 60% of the time long running PHP processes will have their STDERR pipe break, causing an execution break in PHP. This doesn't throw any error on PHP's side, just ends the script prematurely.

Temporary resolution: By enabling the error_log directly in php.ini and bypassing STDERR in fcgi, we can avoid this bug, but it's an extra log file to manage.

OS: CentOs 5.10 x86_64
Lighttpd (x86_64) versions tested: 1.4.35, 1.4.39 (both internally managed and using spawn-fcgi PHP (32bit) processes).

Below is the strace snippet of the pipe breaking:

gettimeofday({1456448526, 431339}, NULL) = 0
write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 400) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
gettimeofday({1456448526, 431519}, NULL) = 0
poll([{fd=8, events=POLLIN|POLLPRI}], 1, 0) = 0 (Timeout)
write(8, "z\0\0\0\3update `DatabaseLock` set `"..., 126) = 126
read(8, "0\0\0\1\0\1\0\2\0\1\0(Rows matched: 1  Cha"..., 16384) = 52
gettimeofday({1456448526, 438149}, NULL) = 0
flock(6, LOCK_UN)                       = 0
close(6)                                = 0
rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0}, {0x1000000000000001, ~[HUP INT QUIT ILL ABRT BUS USR1 SEGV TERM CHLD RT_1 RT_2 RT_3 RT_4 RT_11 RT_15 RT_17], SA_STACK|SA_SIGINFO|0x1f7ab88}, 8) = 0
write(8, "\1\0\0\0\1", 5)               = 5
shutdown(8, 2 /* send and receive */)   = 0
close(8)                                = 0
rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0}, {0x1000000000000001, ~[TRAP FPE SEGV USR2 STKFLT CONT TSTP TTIN XCPU XFSZ VTALRM WINCH IO RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_7 RT_8 RT_10 RT_11 RT_13
 RT_16 RT_17 RT_21 RT_24 RT_25 RT_26 RT_28 RT_29], 0}, 8) = 0
brk(0xdd7d000)                          = 0xdd7d000
rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0x3606b0 /* SA_??? */}, {0x1000000000000001, ~[HUP INT ILL ABRT BUS KILL PIPE CHLD TSTP TTIN TTOU URG XCPU XFSZ IO PWR SYS RTMIN RT_1 RT_2 RT_3 RT_5 RT_
7 RT_15 RT_17], 0}, 8) = 0
write(9, "\1\0\0\0\1", 5)               = 5
shutdown(9, 2 /* send and receive */)   = 0
close(9)                                = 0
rt_sigaction(SIGPIPE, {0x1000000000000001, [], 0x3606b0 /* SA_??? */}, {0x1000000000000001, ~[TRAP FPE SEGV USR2 STKFLT CONT TSTP TTIN XCPU XFSZ VTALRM WINCH IO RTMIN RT_1 RT_2 RT_3 RT_4 RT_5 RT_7 RT
_8 RT_10 RT_11 RT_13 RT_16 RT_17 RT_21 RT_24 RT_25 RT_26 RT_28 RT_29], 0x3606b0 /* SA_??? */}, 8) = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={30, 0}}, NULL) = 0
rt_sigaction(SIGPROF, {0x10000000081e2840, [], SA_STACK|0x209f870}, {0x10000000081e2840, [ABRT BUS SEGV USR2 TERM STKFLT STOP TSTP URG XCPU RT_1], SA_STACK|SA_RESTART|SA_INTERRUPT|SA_NODEFER|SA_SIGIN
FO|SA_NOCLDWAIT|0x2f2c330}, 8) = 0
rt_sigprocmask(SIG_UNBLOCK, [PROF], NULL, 8) = 0
fcntl64(3, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl64(3, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl64(4, F_SETLKW, {type=F_WRLCK, whence=SEEK_SET, start=0, len=1}) = 0
fcntl64(4, F_SETLKW, {type=F_UNLCK, whence=SEEK_SET, start=0, len=1}) = 0
write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 400) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
brk(0xdc7d000)                          = 0xdc7d000
brk(0xdafd000)                          = 0xdafd000
brk(0xda7d000)                          = 0xda7d000
brk(0xd8fd000)                          = 0xd8fd000
brk(0xd87d000)                          = 0xd87d000
brk(0xd7bd000)                          = 0xd7bd000
brk(0xd73d000)                          = 0xd73d000
brk(0xd6bd000)                          = 0xd6bd000
brk(0xd624000)                          = 0xd624000
brk(0xd5a4000)                          = 0xd5a4000
brk(0xd524000)                          = 0xd524000
brk(0xd4a4000)                          = 0xd4a4000
brk(0xd424000)                          = 0xd424000
brk(0xd3a4000)                          = 0xd3a4000
brk(0xd324000)                          = 0xd324000
brk(0xd2a4000)                          = 0xd2a4000
brk(0xd224000)                          = 0xd224000
brk(0xd164000)                          = 0xd164000
brk(0xd0a4000)                          = 0xd0a4000
brk(0xcfa4000)                          = 0xcfa4000
setitimer(ITIMER_PROF, {it_interval={0, 0}, it_value={0, 0}}, NULL) = 0
write(5, "\1\7\0\1\0%\3\0Maintenance Script check"..., 416) = -1 EPIPE (Broken pipe)
--- SIGPIPE (Broken pipe) @ 0 (0) ---
shutdown(5, 1 /* send */)               = 0
recv(5, "\1\5\0\1\0\0\0\0", 8, 0)       = 8
recv(5, "", 8, 0)                       = 0
close(5)

Files

26544-snip.txt (5.58 KB) 26544-snip.txt azilber, 2016-03-02 06:29
26544-snip.txt (65.6 KB) 26544-snip.txt ~1k lines of backtrace. azilber, 2016-03-02 10:02
Actions #1

Updated by gstrauss about 8 years ago

Would you strace -ttt (to include timestamps on the syscalls)? A timeout is probably occurring somewhere.

An alternative to consider is to separate the maintenance tasks from checking the status of the maintenance tasks. Background the maintenance process and include a file lock to avoid accidental concurrency. Potentially run the maintenance periodically from cron. Have the maintenance task output a status page to a temporary file, and then rename into place for access by the webserver. Checking the status of the maintenance task then becomes a simple fetch of the generated status page.

Actions #2

Updated by azilber about 8 years ago

Attached is the strace -ttt output.

Actions #3

Updated by gstrauss about 8 years ago

All the calls to write(), including the first one, are writing to a broken pipe.

Look back in your trace (before what you have attached) and see where file descriptor 5 gets open()ed or close()d.

Actions #4

Updated by gstrauss about 8 years ago

At the end of the trace, I see
1456899400.445976 recv(5, "\1\5\0\1\0\0\0\0", 8, 0) = 8
1456899400.446000 recv(5, "", 8, 0) = 0
That is a FastCGI STDIN packet of 0-length. What was sent to this script for the request? I don't see your script reading in any FastCGI request from the lighttpd server. I would expect that some time earlier, FCGI_BEGIN_REQUEST was sent. The timestamps in your trace show about 27 milliseconds, which is not a long-running script, so I suspect that information has been omitted from earlier in the request.

I'm sorry, I can't help you without more information. Please try again to include a more complete trace of the request handling by the script.

Actions #5

Updated by azilber about 8 years ago

gstrauss wrote:

All the calls to write(), including the first one, are writing to a broken pipe.

Look back in your trace (before what you have attached) and see where file descriptor 5 gets open()ed or close()d.

Here's the last ~1k lines of the strace. I've gone through and sanitized the data. This is as much data that I can reasonably send and it contains proprietary information. I don't think it would be too hard to make a php script that loops for 10 minutes and writes to stderr....

Actions #6

Updated by azilber about 8 years ago

gstrauss wrote:

At the end of the trace, I see
1456899400.445976 recv(5, "\1\5\0\1\0\0\0\0", 8, 0) = 8
1456899400.446000 recv(5, "", 8, 0) = 0
That is a FastCGI STDIN packet of 0-length. What was sent to this script for the request? I don't see your script reading in any FastCGI request from the lighttpd server. I would expect that some time earlier, FCGI_BEGIN_REQUEST was sent. The timestamps in your trace show about 27 milliseconds, which is not a long-running script, so I suspect that information has been omitted from earlier in the request.

I'm sorry, I can't help you without more information. Please try again to include a more complete trace of the request handling by the script.

The events at 1456899400.445976 are unrelated, that's a health check connection coming in after the script breaks.

Actions #7

Updated by gstrauss about 8 years ago

Sorry you went through that effort. It's still not complete -- you intentionally truncated it again (the exact opposite of what I asked for) -- and therefore not very useful.

Try setting this in your config and restarting the server:

debug.log-timeouts = "enable"

and then look in your error log for "timed out after writing". It is likely that the request is taking too long to respond to the client and it may help if you increase server.max-write-idle, which currently defaults to 360 seconds (6 minutes).

Actions #8

Updated by azilber about 8 years ago

Hi,

Your observation was spot-on.  Enabling log-timeouts debug output did
show timeouts, and increasing server.max-write-idle solved the issue. Many
thanks! This avoids me having to filter everything via syslog.

Thanks!

PS: I received the following error when replying to this ticket via email:

Delivery to the following recipient failed permanently:

     redmine@lighttpd.net

Technical details of permanent failure:
Google tried to deliver your message, but it was rejected by the server for the recipient domain lighttpd.net by mail.lighttpd.net. [2a01:4f8:100:8082::2].

The error that the other server returned was:
550 5.1.1 <redmine@lighttpd.net>: Recipient address rejected: User unknown in local recipient table

Actions #9

Updated by gstrauss about 8 years ago

  • Status changed from New to Invalid

Thanks for the update. Closing ticket. Not a bug in lighttpd.

Thanks, too, for the email error report. I'll pass it along.

Actions #10

Updated by stbuehler almost 8 years ago

  • Target version deleted (1.4.x)
Actions

Also available in: Atom