Project

General

Profile

Actions

Bug #3265

closed

lighttpd stalls on upload to backend if disk is full

Added by flynn 3 months ago. Updated 3 months ago.

Status:
Missing Feedback
Priority:
Normal
Category:
core
Target version:
-
ASK QUESTIONS IN Forums:
No

Description

If lighttpd (happened with version 1.4.73) buffers uploads on disk (server.stream-request-body = 1) to proxied backend e.g. proxy.server = ( "/" => ("bla-service" => ("host" => bla_service, "port" => 5001)),) and lighttpd cannot write anymore, I get the following error message:

2024-09-10 11:50:37: (chunk.c.848) write() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-kuqoxM failed: No space left on device

The disk ist full, message is correct. But:
Afterwards the listen port (in this case 443) is still open, TCP connection is possible/not refused, but no further requests are handled, lighttpd seems on hold/stuck.

I would expect, that lighttpd continues at least partially:
- either that lighttpd terminates, so it can restarted by a monitoring instance
- or close this connection, so that other requests can be handled

Actions #1

Updated by gstrauss 3 months ago

When this occurs, would you strace the lighttpd process and see on what system call lighttpd is blocked?

The intention is that if ENOSPC, lighttpd will attempt to use other tempdirs (if configured), or else the operation should fail and return an error. It is bug if that is not the case.

Actions #2

Updated by flynn 3 months ago

The affected server ist not under my control anymore (but has been), the issue was only reported to me and solved by killing lighttpd process and cleaning up the disk.

I have the full configuration and try to reproduce the case in a clean environment, but this may take some time ...

Actions #3

Updated by gstrauss 3 months ago

Modern lighttpd attempts to use operating-system-specific mechanisms to efficiently write files to disk. For example, if clear-text HTTP/1.1, then lighttpd might be trying to splice() from the socket through a pipe to the temporary file on disk, which is a different code path than if those preconditions did not hold.

In order to narrow down where the blocking might be occurring, do you know if the requests were made using TLS (HTTPS), or if they were clear-text (HTTP)? HTTP/2 or HTTP/1.1? Do you know what Linux kernel the machine in question is running?

Actions #4

Updated by gstrauss 3 months ago

2024-09-10 11:50:37: (chunk.c.848) write() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-kuqoxM failed: No space left on device

From a quick code inspection: after lighttpd issues this error message, there are no further tempdirs available to try, and so lighttpd returns an error from the subroutine, and that error is propagated. It was not immediately apparent if/where there were cases that lighttpd could then get stuck.

I'll try to find some time to attempt to reproduce the error reported, but might not have any time until later next week.

Actions #5

Updated by flynn 3 months ago

Requests were made using TLS (HTTPS).

Linux kernel version from Debian 12 between 6.1.99 and 6.1.106.

HTTP/2 or HTTP/1.1 unknown, but HTTP/1.1 is more probable (scripted requests).

Actions #6

Updated by gstrauss 3 months ago · Edited

  • Status changed from New to Need Feedback

So far, I am unable to reproduce reported blocking/hanging error with simple TLS setup proxying to a separate backend. I attempted using HTTP/2 with TLS and, separately, using HTTP/1.1 with TLS. In both cases, I was able to use curl --data-binary ... uploading a huge file to reproduce (chunk.c.944) write() temp-file /dev/shm/lighttpd-upload-HJrJ3E failed: No space left on device (using lighttpd git master (pre-1.4.77)), and then subsequent requests continued to work as expected, without blocking/hanging.

I would expect, that lighttpd continues at least partially:
- either that lighttpd terminates, so it can restarted by a monitoring instance
- or close this connection, so that other requests can be handled

After logging "No space left on device", lighttpd responds to the failed upload request with 500 Internal Server Error and continues serving other requests. This is the expected behavior. lighttpd does not terminate the connection or the lighttpd process.

Actions #7

Updated by flynn 3 months ago

I try to collectd today more log messages ...

Some thoughts:
  • the process, that fills the disk, is not (only) lighttpd, there is at least one more process uploading files. In this case the second process definitly is the root cause for disk full, not lighttpd.
  • access and error log are on the same partition as the temp-file in the error message above. Maybe the logging is blocking?
Actions #8

Updated by gstrauss 3 months ago · Edited

Some thoughts:

  • the process, that fills the disk, is not (only) lighttpd, there is at least one more process uploading files. In this case the second process definitely is the root cause for disk full, not lighttpd.
  • access and error log are on the same partition as the temp-file in the error message above. Maybe the logging is blocking?

It does not matter who fills the disk. lighttpd attempts to write to filesystems which are full will fail with ENOSPC and will not block (and lighttpd should not retry those write()s unless EINTR). However, if the filesystem is full, then attempts to write subsequent error log entries might be lost.

lighttpd can be configured to use syslog or piped-loggers. The logs could also be moved temporarily to a different location. In any case, strace or pstack of lighttpd while blocked/hanging would probably be a useful next step, potentially triggered by a log-watcher which sees error log message matching regex "write\(\) temp-file .* failed: No space left on device"

Please also note that server.upload-dirs may be configured to contain multiple directories on different filesystems, and lighttpd will use the next one in the list if an earlier location fails with ENOSPC.

Actions #9

Updated by flynn 3 months ago

From the logs:
  • error is caused by POST requests with HTTP/1.1
  • the proxied backend is down at least 2min before the error message, this indicates that the disk has been full also 2min before
  • no more access log entries 2min before the error message
Actions #10

Updated by gstrauss 3 months ago

If the backend to which lighttpd proxies requests is down, then it may appear that lighttpd is unable to send requests to a backend that is unavailable or unresponsive, and lighttpd may not be able to log that to the error log if the error log is on the same volume which has run out of disk space.

Does lighttpd continue to be unresponsive once disk space is cleared and the backend is once again responsive or has been restarted? Or does lighttpd need to be restarted before lighttpd is responsive again?

Actions #11

Updated by flynn 3 months ago

After cleaning the disk lighttpd did not become responsive without restart, we waited at least 2min.
After restart everything operated normal.

Actions #12

Updated by gstrauss 3 months ago

Thus far, I have been unable to reproduce this when filling a local disk volume and simulating out of disk space.

Please try to capture strace, pstack, and gdb bt full the next time this happens. Thank you.

Actions #13

Updated by gstrauss 3 months ago · Edited

If the server has lighttpd configured to start the proxy backend, then instead of restarting lighttpd, please try killing only the proxy backend so that lighttpd will restart it.

After the disk full condition has been cleared, An strace of lighttpd while trying to make new connections should reveal if lighttpd calling accept() on those new connections or if they are simply getting added to kernel socket backlog queue for the listening socket. It may also reveal if the proxy backend is the process that is stuck.

Actions #14

Updated by flynn 3 months ago

From my side, the ticket can currently be closed as not reproducible. I will reopen it when I have more information.

Actions #15

Updated by gstrauss 3 months ago

  • Status changed from Need Feedback to Missing Feedback
  • Target version deleted (1.4.77)

Ok. I am still subscribed to the issue and will see updates.

Marking "Missing Feedback" since "could not reproduce". Will review if/when more detailed observations are shared.

Actions

Also available in: Atom