Bug #760: Random crashing on FreeBSD 6.1 (closed)

Added by Anonymous over 17 years ago. Updated over 7 years ago.

Status: Fixed
Priority: Urgent
Category: core
Target version: 1.4.40

Description

Here is the backtrace. lighttpd crashes randomly about 30-40 times a day on a fairly heavy-traffic website that serves 30-60 MB files.


Files

lighttpd.trace (1.29 KB) -- Wilik Anonymous, 2006-07-23 04:36
lighttpd.strace (22.1 KB) strace file -- Wilik Anonymous, 2006-07-23 16:57
lighttpd.conf (1.74 KB) -- geoff Anonymous, 2007-05-23 00:09
bug760.patch (4.21 KB) Patch to work around problem of crashing on large files -- geoff Anonymous, 2007-06-26 22:39

Related issues: 1 (0 open, 1 closed)

Related to Bug #949: fastcgi, cgi, flush, php5 problem (Fixed)
Actions #1

Updated by wiak over 17 years ago

I have the same problem :/
LightTPD crashes randomly under heavy traffic.

Actions #2

Updated by Anonymous over 17 years ago

This has been a problem for me on FreeBSD 6.x since I first started using Lighty on v1.4.10. It's running under supervise now, so it restarts immediately, but it's still annoying and not very impressive.

-- weird_ed

Actions #3

Updated by about 17 years ago

Upgrade to 1.4.13 and see if it still happens.

Also, what ulimits are you running lighttpd under?

Actions #4

Updated by about 17 years ago

Also, paste your lighttpd config.

Actions #5

Updated by Anonymous almost 17 years ago

I'm lucky to be able to reproduce the bug at will, so once I found this report it was easy to confirm that I'm having the same problem. Even better, it was also trivial to identify the proximate cause.

The problem is a malloc failure in buffer_prepare_copy. That, in turn, is caused by a massive memory leak. Lighty's process size when it died was 3013792 KB, or over 3 GB. Not coincidentally, the file I was downloading is 3.8 GB in size. Clearly, lighty is either trying to cache the entire file internally, or failing to free buffers as the copy progresses.
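
In outline, the failing pattern has roughly this shape (an illustrative sketch of the assert-after-malloc pattern, not the actual buffer.c source; the struct and function names here are invented):

    /* Illustrative only, not lighttpd's buffer.c: the buffer is grown to
     * hold the data being queued, and a failed malloc() trips an assert(),
     * which aborts the whole server. */
    #include <assert.h>
    #include <stdlib.h>

    typedef struct { char *ptr; size_t size; size_t used; } sketch_buffer;

    static int prepare_copy_sketch(sketch_buffer *b, size_t size) {
        if (b->size < size) {
            free(b->ptr);
            b->ptr = malloc(size);   /* fails once the process is a few GB big */
            assert(b->ptr != NULL);  /* abort(): the crash seen in the backtrace */
            b->size = size;
        }
        b->used = 0;
        return 0;
    }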

A test with a smaller file (0.5 GB) revealed that the process remains large after the file has been downloaded. Since the modern malloc often returns freed space to the system, this indicates that it's a plain memory leak. That, in turn, ought to make the bug pretty easy to find.

-- geoff

Actions #6

Updated by darix almost 17 years ago

what is your usage pattern for lighttpd?
i am only aware of one memory leak in lighttpd in combination with mod_proxy. are you using mod_proxy? and can you attach your config? (you can obfuscate stuff if needed)

Actions #7

Updated by Anonymous almost 17 years ago

Replying to darix:

> what is your usage pattern for lighttpd?
> i am only aware of one memory leak in lighttpd in combination with mod_proxy. are you using mod_proxy? and can you attach your config? (you can obfuscate stuff if needed)

No need to obfuscate; I just attached the config file. There's no mod_proxy.

The usage pattern is VERY light (only a few users per day), but essentially all the activity is downloads of huge files over a slow link. I did a bit of code browsing, and my guess is that chunk.c doesn't limit the length of the chunk queue. So the slow link backs up, the chunk queue grows to the size of the file, and lighty runs out of memory.

If that guess (and it's only a guess) is correct, there's no memory leak, just a failure to limit the queue length. I haven't dug deeply into the code yet to see whether that's true, nor to see how hard it will be to add a queue limit.
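
If that guess is right, the fix amounts to a back-pressure check before reading more backend output; a rough sketch of the idea (the type, names, and 4 MB limit below are invented for illustration, not lighttpd's chunk.c identifiers):

    /* Sketch of the missing back-pressure check; invented names, not chunk.c. */
    #include <stddef.h>

    #define WRITE_QUEUE_LIMIT (4u * 1024u * 1024u)  /* e.g. cap 4 MB per connection */

    typedef struct {
        size_t queued_bytes;  /* read from the backend but not yet written out */
    } write_queue_sketch;

    /* Called before reading another block of CGI or file output: if the client
     * link is slow the queue backs up, so stop reading until it drains. */
    static int may_read_more(const write_queue_sketch *q) {
        return q->queued_bytes < WRITE_QUEUE_LIMIT;
    }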

-- geoff

Actions #8

Updated by Anonymous almost 17 years ago

OK, I did a test and it's not a memory leak. I downloaded a 553M-ish file, and lighty went up to 552M in size, then shrank back to a thrifty 26M after the download was done. (Note that it didn't quite get to the size of the file; I think that's because some of the chunks went out over the net while the file was being read in.)

I think this should be easy to fix. I just need to understand how lighty's asynchrony works. Then I could make it stop reading the file when the queue got too big, and come back later.

Oh, one other thing. I keep talking about a file, but the bug is actually related to CGIs. As far as I can tell (I didn't write the Ruby code), our CGI stuffs the file directly to lighty, rather than using X-LIGHTTPD-send-file to get the data to go out. Obviously, that suggests an alternate fix on the Ruby side. But it's still a bug that lighty swallows whatever a broken CGI sends it, without limiting its memory usage.
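
For comparison, the send-file approach has the backend emit only a header and let lighty stream the file from disk itself; a minimal CGI sketch (the path is made up, and depending on lighttpd version and backend type the header may need to be explicitly enabled in the config):

    /* Minimal CGI sketch of the X-LIGHTTPD-send-file approach: instead of
     * pushing gigabytes through the socket/pipe, the backend names the file
     * and lighttpd serves it from disk.  The path below is made up. */
    #include <stdio.h>

    int main(void) {
        printf("Content-Type: application/octet-stream\r\n");
        printf("X-LIGHTTPD-send-file: /data/downloads/big-file.iso\r\n");
        printf("\r\n");   /* empty body; lighttpd sends the file itself */
        return 0;
    }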

(As a somewhat related comment, the crash is due to an assertion failure after a malloc. A web server should never crash due to a malloc failure; at an absolute minimum it should generate a log message, and really it should degrade gracefully. A relatively easy quick fix would be to replace assert with a macro that generated a log message before dying.)
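
Even a small wrapper would be an improvement over a bare assert; a sketch only, not a drop-in patch for lighty's source (a real change would route the message through lighty's own log facility):

    /* Sketch: log before dying instead of a silent assert(). */
    #include <stdio.h>
    #include <stdlib.h>

    #define log_assert(cond)                                              \
        do {                                                              \
            if (!(cond)) {                                                \
                fprintf(stderr, "%s:%d: assertion failed: %s\n",          \
                        __FILE__, __LINE__, #cond);                       \
                abort();                                                  \
            }                                                             \
        } while (0)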

-- geoff

Actions #9

Updated by darix almost 17 years ago

configure lighttpd to use sendfile and the memory usage will be lower.
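
(For context: sendfile lets the kernel move the file data straight to the socket, so the payload never sits in a user-space buffer. It only applies to files lighty serves itself, not to CGI output. A standalone sketch of the FreeBSD call, not lighttpd code:)

    /* Why sendfile keeps memory low: the kernel copies from the file to the
     * socket directly, with no user-space buffer holding the payload.
     * FreeBSD-style call, shown standalone rather than as lighttpd code. */
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    static int send_whole_file(int file_fd, int sock_fd) {
        off_t sent = 0;
        /* nbytes == 0 means "send until end of file" on FreeBSD. */
        if (sendfile(file_fd, sock_fd, (off_t)0, 0, NULL, &sent, 0) == -1)
            return -1;
        return 0;
    }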

Actions #10

Updated by Anonymous over 16 years ago

Unfortunately, configuring sendfile doesn't help because the Rails version I'm using doesn't support it (nor should it, since sendfile is server-specific). In any case, that only works around the bug. It shouldn't be possible for a misbehaving CGI script to crash the server simply by supplying a large amount of output.

Fortunately, I was able to come up with a patch that mitigates the problem. I will attach it after I complete this comment. My change limits the size of the write queue, and stops reading input from the FastCGI script when it becomes excessively large. The downside of my patch is that the entire server process blocks (this is undoubtedly because I don't properly understand lighty's asynchrony mechanisms). However, if you set max_procs to an appropriate value in the fastcgi.server section of your config file, the blocked process won't be problematic because other processes will handle other users. I used max_procs = 10, since my server has few users despite serving very large files.

WARNING: Install this patch with caution. It will not crash your server, but it may make it inaccessible if lots of users are downloading large files at the same time. I doubt that this is the "correct" fix. However, I hope that this patch is useful to some people who are having this problem, and I hope it will help someone more knowledgeable to develop a better patch.

-- geoff

Actions #11

Updated by stbuehler over 15 years ago

  • Status changed from New to Fixed
  • Resolution set to wontfix

OTOH we don't want to block the backend, as backends most often can only handle one request at a time (or need a thread for every request).

So the patch will not go upstream; I doubt we will change this in 1.4. Perhaps someone will fix this for mod_proxy_core in 1.5.

Actions #12

Updated by stbuehler over 15 years ago

  • Status changed from Fixed to Wontfix
Actions #13

Updated by gstrauss almost 8 years ago

  • Related issue added: Bug #949: fastcgi, cgi, flush, php5 problem
Actions #14

Updated by gstrauss almost 8 years ago

  • Description updated (diff)
  • Status changed from Wontfix to Patch Pending
  • Target version set to 1.4.40

New: asynchronous, bidirectional streaming support for request and response
Submitted pull request: https://github.com/lighttpd/lighttpd1.4/pull/66

Included in the pull request is buffering of large responses to temporary files instead of keeping everything in memory.
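
In outline, the spill-to-disk idea looks like this (an illustrative sketch with invented names and an invented 64 KB threshold; this is not the code from the pull request):

    /* Sketch: keep small responses in memory, and once a threshold is crossed
     * append further backend output to an unlinked temporary file instead.
     * spill_fd must start out as -1; short writes are ignored for brevity. */
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    #define MEM_LIMIT (64 * 1024)   /* keep at most 64 KB of a response in RAM */

    typedef struct {
        char   mem[MEM_LIMIT];
        size_t mem_used;
        int    spill_fd;            /* -1 until the response outgrows MEM_LIMIT */
    } response_buf;

    static int response_append(response_buf *r, const char *data, size_t len) {
        if (r->spill_fd == -1 && r->mem_used + len <= MEM_LIMIT) {
            memcpy(r->mem + r->mem_used, data, len);  /* still fits in memory */
            r->mem_used += len;
            return 0;
        }
        if (r->spill_fd == -1) {                      /* first overflow: spill */
            char tmpl[] = "/var/tmp/resp-XXXXXX";
            if ((r->spill_fd = mkstemp(tmpl)) == -1) return -1;
            unlink(tmpl);                             /* file vanishes on close */
            if (write(r->spill_fd, r->mem, r->mem_used) < 0) return -1;
            r->mem_used = 0;
        }
        return write(r->spill_fd, data, len) < 0 ? -1 : 0;
    }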

Actions #15

Updated by gstrauss almost 8 years ago

> Unfortunately, configuring sendfile doesn't help because the Rails version I'm using doesn't support it

BTW, having lighttpd use sendfile() is separate from the Rails backend. Also, a Rails app on the local machine can send the X-Sendfile header back to lighttpd instead of transferring the file over the socket/pipe, and lighttpd can then read the file directly from disk. This is separate from sendfile(), even though the names are similar.

Current HEAD of master contains patches which extend X-Sendfile support to CGI and SCGI, in addition to FastCGI.

Actions #16

Updated by gstrauss over 7 years ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100