Bug #3265
closedlighttpd stalls on upload to backend if disk is full
Description
If lighttpd (happened with version 1.4.73) buffers uploads on disk (server.stream-request-body = 1
) to proxied backend e.g. proxy.server = ( "/" => ("bla-service" => ("host" => bla_service, "port" => 5001)),)
and lighttpd cannot write anymore, I get the following error message:
2024-09-10 11:50:37: (chunk.c.848) write() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-kuqoxM failed: No space left on device
The disk ist full, message is correct. But:
Afterwards the listen port (in this case 443) is still open, TCP connection is possible/not refused, but no further requests are handled, lighttpd seems on hold/stuck.
I would expect, that lighttpd continues at least partially:
- either that lighttpd terminates, so it can restarted by a monitoring instance
- or close this connection, so that other requests can be handled
Updated by gstrauss 3 months ago
When this occurs, would you strace
the lighttpd process and see on what system call lighttpd is blocked?
The intention is that if ENOSPC, lighttpd will attempt to use other tempdirs (if configured), or else the operation should fail and return an error. It is bug if that is not the case.
Updated by flynn 3 months ago
The affected server ist not under my control anymore (but has been), the issue was only reported to me and solved by killing lighttpd process and cleaning up the disk.
I have the full configuration and try to reproduce the case in a clean environment, but this may take some time ...
Updated by gstrauss 3 months ago
Modern lighttpd attempts to use operating-system-specific mechanisms to efficiently write files to disk. For example, if clear-text HTTP/1.1, then lighttpd might be trying to splice()
from the socket through a pipe to the temporary file on disk, which is a different code path than if those preconditions did not hold.
In order to narrow down where the blocking might be occurring, do you know if the requests were made using TLS (HTTPS), or if they were clear-text (HTTP)? HTTP/2 or HTTP/1.1? Do you know what Linux kernel the machine in question is running?
Updated by gstrauss 3 months ago
2024-09-10 11:50:37: (chunk.c.848) write() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-kuqoxM failed: No space left on device
From a quick code inspection: after lighttpd issues this error message, there are no further tempdirs available to try, and so lighttpd returns an error from the subroutine, and that error is propagated. It was not immediately apparent if/where there were cases that lighttpd could then get stuck.
I'll try to find some time to attempt to reproduce the error reported, but might not have any time until later next week.
Updated by gstrauss 3 months ago · Edited
- Status changed from New to Need Feedback
So far, I am unable to reproduce reported blocking/hanging error with simple TLS setup proxying to a separate backend. I attempted using HTTP/2 with TLS and, separately, using HTTP/1.1 with TLS. In both cases, I was able to use curl --data-binary ...
uploading a huge file to reproduce (chunk.c.944) write() temp-file /dev/shm/lighttpd-upload-HJrJ3E failed: No space left on device
(using lighttpd git master (pre-1.4.77)), and then subsequent requests continued to work as expected, without blocking/hanging.
I would expect, that lighttpd continues at least partially:
- either that lighttpd terminates, so it can restarted by a monitoring instance
- or close this connection, so that other requests can be handled
After logging "No space left on device", lighttpd responds to the failed upload request with 500 Internal Server Error
and continues serving other requests. This is the expected behavior. lighttpd does not terminate the connection or the lighttpd process.
Updated by flynn 3 months ago
I try to collectd today more log messages ...
Some thoughts:- the process, that fills the disk, is not (only) lighttpd, there is at least one more process uploading files. In this case the second process definitly is the root cause for disk full, not lighttpd.
- access and error log are on the same partition as the temp-file in the error message above. Maybe the logging is blocking?
Updated by gstrauss 3 months ago · Edited
Some thoughts:
- the process, that fills the disk, is not (only) lighttpd, there is at least one more process uploading files. In this case the second process definitely is the root cause for disk full, not lighttpd.
- access and error log are on the same partition as the temp-file in the error message above. Maybe the logging is blocking?
It does not matter who fills the disk. lighttpd attempts to write to filesystems which are full will fail with ENOSPC and will not block (and lighttpd should not retry those write()s unless EINTR). However, if the filesystem is full, then attempts to write subsequent error log entries might be lost.
lighttpd can be configured to use syslog or piped-loggers. The logs could also be moved temporarily to a different location. In any case, strace
or pstack
of lighttpd while blocked/hanging would probably be a useful next step, potentially triggered by a log-watcher which sees error log message matching regex "write\(\) temp-file .* failed: No space left on device"
Please also note that server.upload-dirs may be configured to contain multiple directories on different filesystems, and lighttpd will use the next one in the list if an earlier location fails with ENOSPC.
Updated by gstrauss 3 months ago
If the backend to which lighttpd proxies requests is down, then it may appear that lighttpd is unable to send requests to a backend that is unavailable or unresponsive, and lighttpd may not be able to log that to the error log if the error log is on the same volume which has run out of disk space.
Does lighttpd continue to be unresponsive once disk space is cleared and the backend is once again responsive or has been restarted? Or does lighttpd need to be restarted before lighttpd is responsive again?
Updated by gstrauss 3 months ago · Edited
If the server has lighttpd configured to start the proxy backend, then instead of restarting lighttpd, please try killing only the proxy backend so that lighttpd will restart it.
After the disk full condition has been cleared, An strace
of lighttpd while trying to make new connections should reveal if lighttpd calling accept()
on those new connections or if they are simply getting added to kernel socket backlog queue for the listening socket. It may also reveal if the proxy backend is the process that is stuck.
Also available in: Atom