mod_cgi does not handle EINTR during write() to cgi child [PATCH]
We are using lighttpd as a file store: users upload files, which are then served statically. Uploads are handled by a Python CGI script that simply reads the POST data and stores it to disk. Under heavy write load (i.e. lots of files being uploaded very rapidly), lighty occasionally sends only a truncated POST body to the CGI script; the truncated content is typically 64k in size.
It appears that when mod_cgi writes a file or memory chunk to the child, the only error it handles is ENOSPC. Every other zero or negative return value from write(), whatever the errno, is taken to mean that no more data can be written to the child, so the remaining data is never sent. However, when a write() is interrupted by a signal before any data has been written, it returns -1 with errno set to EINTR, which is a transient condition and not a reason to stop.
The problem is most likely caused by a (possibly rare) timing issue. The flow seems to go something like the following:
1. Lots of POST requests come in and are handled by individual cgi children.
2. Just as a write to one of the cgi children is about to commence, some other child terminates, sending a SIGCHLD which interrupts the write.
3. EINTR is raised.
4. mod_cgi assumes that any return value less than or equal to 0 means no more data can be written and breaks out of the chunkqueue processing loop.
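The interruption described in steps 2 and 3 can be reproduced in isolation. The following is a minimal sketch, not lighttpd code: the function name `interrupted_write_errno` and the handler `on_sigchld` are made up for illustration, and it assumes a SIGCHLD handler installed without SA_RESTART and a blocking pipe that has been filled so that the next write() must wait.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical no-op handler: its only job is to make SIGCHLD interrupt
 * a blocked system call instead of being ignored. */
static void on_sigchld(int signo) { (void)signo; }

/* Hypothetical demo helper: returns the errno seen by a write() that is
 * blocked on a full pipe when a child exits, 0 if the write unexpectedly
 * succeeds, or -1 if the setup fails. */
static int interrupted_write_errno(void) {
    int p[2];
    struct sigaction sa, old;
    struct timespec delay = {0, 100000000L}; /* 100 ms */
    char byte = 'x';
    int flags, seen;
    ssize_t r;
    pid_t pid;

    if (pipe(p) != 0) return -1;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0; /* deliberately no SA_RESTART */
    if (sigaction(SIGCHLD, &sa, &old) != 0) return -1;

    /* Fill the pipe in non-blocking mode, then switch back to blocking. */
    flags = fcntl(p[1], F_GETFL);
    if (flags < 0 || fcntl(p[1], F_SETFL, flags | O_NONBLOCK) != 0) return -1;
    while (write(p[1], &byte, 1) == 1) { /* keep filling until EAGAIN */ }
    if (fcntl(p[1], F_SETFL, flags) != 0) return -1;

    pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        /* Child: give the parent time to block in write(), then exit. */
        nanosleep(&delay, NULL);
        _exit(0);
    }

    /* Parent: blocks on the full pipe until SIGCHLD arrives. */
    r = write(p[1], &byte, 1);
    seen = (r < 0) ? errno : 0;

    waitpid(pid, NULL, 0);
    sigaction(SIGCHLD, &old, NULL);
    close(p[0]);
    close(p[1]);
    return seen;
}
```

Under these assumptions the call is expected to report EINTR, which is the value mod_cgi appears to treat as "nothing left to write".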
Steps to Reproduce:
1. Create upload.cgi, which reads the request body from stdin (or the equivalent structure in your language), writes the data to a file on disk, and returns the number of bytes written in the response (to verify the write).
2. Upload ~1000 files of sizes between 50 KB and 2 MB using multiple clients (we used 20) simultaneously.
3. Compare response from upload.cgi with the size of the file uploaded.
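For reference, the core of such an upload script might look like the sketch below. It is written in C here, although the script we actually used was Python, and `copy_fd` is a hypothetical helper name. It copies everything from one descriptor to another and returns the byte count that step 3 compares against the uploaded file size.

```c
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: copy in_fd to out_fd until EOF.
 * Returns the number of bytes copied, or -1 on a real I/O error. */
static long long copy_fd(int in_fd, int out_fd) {
    char buf[65536];
    long long total = 0;

    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof(buf));
        ssize_t off = 0;

        if (n == 0) break;                 /* EOF: body fully consumed */
        if (n < 0) {
            if (errno == EINTR) continue;  /* interrupted, try again */
            return -1;
        }
        while (off < n) {
            ssize_t w = write(out_fd, buf + off, (size_t)(n - off));
            if (w < 0 && errno == EINTR) continue;
            if (w <= 0) return -1;         /* real error or no progress */
            off += w;
        }
        total += n;
    }
    return total;
}
```

In a CGI, this would be called as `copy_fd(STDIN_FILENO, fd)` with `fd` opened on the destination file, and the returned count printed in the response body after the headers.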
A patch which fixes the problem is attached. It catches EINTR and sets a flag; if write() returned a negative value and the flag is set, it continues processing the chunks. If the flag is not set, it sets the response status to 500 and breaks out of the loop.
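The general idea can be sketched outside of lighttpd as a write loop that retries on EINTR and reports every other error to the caller. This is not the attached patch: `write_all_retry_eintr` is a hypothetical name, and the real change lives inside mod_cgi's chunkqueue loop, where the error path sets the 500 status.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes from buf to fd.
 * Retries when write() is interrupted by a signal (EINTR).
 * Returns the number of bytes written, or -1 on any other error,
 * which is where a caller such as mod_cgi would answer with a 500. */
static ssize_t write_all_retry_eintr(int fd, const char *buf, size_t len) {
    size_t off = 0;

    assert(buf != NULL || len == 0);

    while (off < len) {
        ssize_t r = write(fd, buf + off, len - off);
        if (r < 0) {
            if (errno == EINTR) continue;  /* transient: retry same data */
            return -1;                     /* real error: give up */
        }
        if (r == 0) break;                 /* no progress: avoid spinning */
        off += (size_t)r;
    }
    return (ssize_t)off;
}
```

The important distinction is that EINTR leads back into the loop with the same offset, so no part of the POST body is dropped, while any other errno still ends the transfer.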