FastCGI requests get stuck sending POST data to backend
I've been trying to track down a case of Ruby processes getting stuck for a long time now. Ruby processes incrementally get stuck, until all backends are frozen and I have to kill them manually to kickstart things again. This happens about one process a day or more, depending on site load.
I've recently narrowed it down:
- It happens when uploading files. The files are almost always large enough to require multiple Lighttpd temp files.
- Lighttpd is always in the middle of sending the uploaded files to the backend; Rails has a partially-written /tmp/CGI* file and Lighttpd has some temporary files left to send.
- Ruby is in read(), in FCGI code.
- The Lighttpd connection is in FCGI_STATE_WRITE, as expected.
- fcgi's hctx->fde_ndx != -1, so the fastcgi context thinks that there's an fdevent installed, as there should be. (FreeBSD, using kevent.)
- srv->ev->kq_bevents for hctx->fd is 0, so fdevent_freebsd_kqueue thinks it's set to wake on writable, as it should be.
It's not a content size mismatch--Rails isn't trying to read more data than we're sending. Lighttpd has more data to send: mod_fastcgi is still in WRITE and hctx->wb->bytes_out < hctx->wb->bytes_in.
When I break in during this state and tracked down the backend's fd, I could do this:
(gdb) p write(hctx->fd, hctx, 1)
(This is just write(fd, 'x', 1), but that doesn't work in gdb in FreeBSD, so I had to use a real pointer in the app to give it something to write.)
It's writing the data, so the FD is writable. fdevent/kevent should be waking up the connection. I stopped on a fdevent_poll call from server.c and examined the results; none of them matched hctx->fde_ndx.
It seems like the kevent is getting desynced from the fdevent, but I can't find any way that could happen. I can't find an API to examine the actual contents of a kqueue, either (this is basic stuff; I hate FreeBSD). Nothing unusual is logged around when this happens. I can't imagine that kqueue is losing events--this seems like a very core API.
I'll probably switch to poll or select and see what happens, since at least their complete state can be examined from a debugger, but I'm leery of just hiding the problem.
Suggestions welcome; I'm tired of banging my head on this.
Updated by icy about 10 years ago
Interesting, I've personally seen some strange behaviour (leaking connections for example) with lighty on FreeBSD regarding kqueue but wasn't able to track it down either...
Do you see these problems only on FreeBSD?
In one of the cases I had observed, switching to select solved it but other problems remained and switching to Linux solved all of them.
If you can confirm that, then the problem must be either in our code for kqueue or in kqueue itself.
Updated by stbuehler about 10 years ago
- change the logging in fdevent_*.c from stderr to log_write, so we actually see errors in the log
- handle EINTR
But i think that kqueue in FreeBSD is not really stable, we received reports about FreeBSD kernel crashes when lighty+kqueue was used...
Updated by peto about 10 years ago
I'm seeing it with select and poll, too. poll() does include the FD to the FCGI unix socket, and it's not returning the socket as writable. One side of a unix socket is in read() waiting for data, the other side isn't polling as writable. That doesn't make any sense; it's got to be a kernel bug. I can't reproduce it in a quick standalone test, though; I'd need to narrow it down incrementally to get a test case that I could send upstream.
http://redmine.lighttpd.net/boards/2/topics/show/141 seems like the same problem.
I can disable sendfile on this one, since it's all FCGI and no heavy file load, but that's a terrible solution for the next victim who gets bit by this and spends forever tracking it down. Disabling sendfile entirely is no good, either; I need threaded-sendfile for performance on a file-heavy 1.5 instance on the same system.
The attached patch seems to be working. It's a hack, but it's straightforward and far better than things are now. I havn't done this on the sendfile-based 1.5.x code yet, since I'm not running FCGI in 1.5, but the same thing should be done.
Also available in: Atom