Project

General

Profile

Bug #2748

1.4.40/1.4.41 uploads to CGI may cause crash (SIGABRT)

Added by DavidAnderson684 11 months ago. Updated 11 months ago.

Status:
Fixed
Priority:
Normal
Assignee:
-
Category:
core
Target version:
Start date:
2016-08-22
Due date:
% Done:

100%

Estimated time:
Missing in 1.5.x:

Description

Immediately after updating from lighttpd 1.4.39 to 1.4.41 (RedHat CentOS 6 EPEL package), a customer reported that when he attempts to upload files (approximately 10MB in size) via his website (a CodeIgniter install from around 2012, on PHP 5.5, running via CGI), the uploads just hang - they appear to send data at a decent rate, but then nothing happens; there is no error, no outcome.

He repeated this operation and attempt multiple times; always failed.

Upon downgrading the webserver back to 1.4.39 again, the uploads immediately resumed working for him.

Associated revisions

Revision 40f16d52 (diff)
Added by gstrauss 11 months ago

[core] fix crash if ready events on abandoned fd (fixes #2748)

x-ref:
"1.4.40/1.4.41 uploads to CGI may cause crash (SIGABRT)"
https://redmine.lighttpd.net/issues/2748

History

#1 Updated by gstrauss 11 months ago

  • Target version changed from 1.4.x to 1.4.42

Do you have any more information? Any messages in lighttpd error logs? If this is reproducible, is there an strace of lighttpd (and children) available?

There are a couple patches post 1.4.41 which do not affect CGI, so are probably not related to the issue here: #2743 and #2744.
(unrelated to this issue, but necessary to fix Digest auth in 1.4.41 is #2745)

Changes in 1.4.40 and 1.4.41 modified how uploads are stored to temporary files, so please check server.upload-dirs. Would you share the lighttpd.conf (replacing sensitive info with example.com and the like)?

#2 Updated by DavidAnderson684 11 months ago

OK. I have temporarily upgraded the server (which is a live shared webserver, so not easy to have with known bugs for long. I got hit by the Digest auth issue too) back to 1.4.41.

For a test, I've uploaded the basic upload script from here: https://davidwalsh.name/basic-file-uploading-php - modified only by adding error_log('hi') at the top, to log the start of the PHP. It's being run via basic CGI - not FastCGI,

Results: Sometimes works, sometimes not. It's much more likely to not work on a large file (which agrees with what my customer was trying).

When it doesn't work, what happens is:

- Nothing is logged in the lighttpd access log for the request
- The PHP script is never run (and so, there's nothing in the PHP log)
- The HTTP connection just dies, with no response data from the server
- This is logged in the lighttpd error log:

2016-08-22 22:49:52: (mod_cgi.c.749) cgi-FDEVENT_ERR
2016-08-22 22:49:52: (log.c.216) server started
2016-08-22 22:49:52: (server.c.1303) WARNING: unknown config-key: url.redirect-final (ignored)

server.upload-dirs is not set in lighttpd.conf at all, and thus must be defaulting to /var/tmp (according to the manual).

When I look in /var/tmp, every failed attempt has resulted in a file called lighttpd-upload-(random characters). The file size is either slightly more than the file being uploaded (if that file is less than 1MB), or just over 1MB (if it is more).

Unfortunately, it looks really hard to strace the lighttpd process and capture the relevant parts of the output, as the server is very busy.

#3 Updated by gstrauss 11 months ago

Thanks for the thorough info.

Yes, /var/tmp is the default, and lighttpd creates temporary files about 1 MB in size so that they can be removed as they are consumed to free up disk space. Also, the default behavior collects the entire request body before starting the CGI (to keep resource usage low), so that is all expected behavior.

2016-08-22 22:49:52: (mod_cgi.c.749) cgi-FDEVENT_ERR

That would occur only when the pipe to the CGI has an error, in which case, lighttpd should send an error response to the client and should remove the temporary files. Since this is not happening, lighttpd tried to create the CGI script, exec'd something, and went to sleep waiting for events on the pipe, and was woken up with an POLLERR event.

Something else might be causing the pipe error (yet undetermined) since you report that lighttpd does not respond to client, does not write anything to access log, and restarts shortly afterwards.

Is lighttpd crashing? (probably, but why?) (and I presume that something external is restarting lighttpd.)
Is there anything else in the error log prior to the cgi-FDEVENT_ERR?
How complex is the lighttpd.conf? Can you share it (with sensitive info replaced with example.com and such)? Is server.max_worker set?

It's really late at the moment. I'll try to reproduce this error tomorrow.
BTW, the file you uploaded contains &lt;?php<br />, so I probably can't try to run it as-is.

#4 Updated by DavidAnderson684 11 months ago

Yes, lighttpd is crashing, and yes, it is automatically restarted (by init, which is upstart on CentOS 6). Here is my syslog from when I did the update to 1.4.41:

Aug 22 22:42:32 machine yum2413: Updated: lighttpd-1.4.41-1.el6.x86_64
Aug 22 22:42:33 machine yum2413: Updated: lighttpd-fastcgi-1.4.41-1.el6.x86_64
Aug 22 22:45:14 asher init: lighttpd main process (2833) killed by ABRT signal
Aug 22 22:45:14 asher init: lighttpd main process ended, respawning
Aug 22 22:45:19 asher init: lighttpd main process (7253) killed by ABRT signal
Aug 22 22:45:19 asher init: lighttpd main process ended, respawning
Aug 22 22:45:42 asher init: lighttpd main process (7349) killed by ABRT signal
Aug 22 22:45:42 asher init: lighttpd main process ended, respawning
Aug 22 22:49:52 asher init: lighttpd main process (7671) killed by ABRT signal
Aug 22 22:49:52 asher init: lighttpd main process ended, respawning
Aug 22 22:57:34 asher init: lighttpd main process (11542) killed by ABRT signal
Aug 22 22:57:34 asher init: lighttpd main process ended, respawning
Aug 22 22:58:13 asher init: lighttpd main process (17390) killed by ABRT signal
Aug 22 22:58:13 asher init: lighttpd main process ended, respawning

Those times exactly correspond to the times of the failed upload attempts. These are the only instances of lighttpd crashing - i.e. it didn't occur prior to upgrading to 1.4.41, and doesn't occur after downgrading.

Note sure what you mean about "the file you uploaded" - I didn't upload any files. The PHP code in the attached link works - as I say, the crash upon uploading doesn't occur every time, but pretty much every time once you use multi-megabyte files. On files of a few kilobytes, it's only sometimes; on other times, it succeeds. I used that PHP script just as a simplest-possible example of a functioning upload script; the example reported by my client was the upload module in a CodeIgniter site, but it appears to not be necessary to use something that complex.

Is there anything else in the error log prior to the cgi-FDEVENT_ERR?

No - those 3 lines are the only 3 logged on each of the crashes, and are logged each time (note the correspondence between the 3 logged lines in my earlier comment from 1 particular crash, and the corresponding crash logged in the syslog above).

The site runs its PHP scripts via /usr/sbin/suexec . Nothing is logged in the suexec log.

server.max_worker is not set in lighttpd.conf, which is 365Kb long (being a shared hosting server).

Here are hopefully the relevant settings for the site:

server.modules = ( "mod_access", "mod_redirect", "mod_rewrite", "mod_alias", "mod_auth", "mod_magnet", "mod_setenv", "mod_status", "mod_proxy", "mod_cgi", "mod_ssi", "mod_compress", "mod_expire", "mod_secdownload", "mod_accesslog", "mod_extforward" )

$HTTP["host"] =~ "^www\.example.com$|^example.com$" {
    server.document-root = "/var/www/websites/example.com/htdocs" 

    cgi.assign = ( ".php" => "/etc/lighttpd/websites/example.com/suexec-php.sh" )

    server.indexfiles = ( "index.html", "index.htm", "index.php")

    url.rewrite-if-not-file += (
    "^/_min/([a-z]=.*)" => "/_min/index.php?$1",
    "^/_min/([bfg]=.*)" => "/_min/index.php?$1",
    "^/download-legacy/images/(.*)?$" => "/filestore_legacy/images/$1",
    "^/(css/lib/grid)/([0-9]+)_([0-9]+)_([0-9]+).css$" => "/index_wms.php?/grid/$2/$3/$4",
    "^/(images_site|images_admin)(/auto-thumbs)/(.*)$" => "/index_wms.php?/$1$2/$3",
    "^([^\?]+)(\?(.*))?$" => "/index_wms.php/$1?$2" 
    )
}

and the suexec script (suexec-php.sh) has this, which just calls /usr/sbin/suexec (so that each site runs under its own UID):

#!/bin/sh

user=example.com
group=webusers

cd /var/www/websites/example.com/cgi-exec
/usr/sbin/suexec $user $group php.sh

and finally, the php.sh script is:

#!/bin/sh
exec /opt/php55/usr/bin/php-cgi -d short_open_tag=On 2> ../logfiles/example.com-php.log

#5 Updated by DavidAnderson684 11 months ago

P.S. Given the info above, the bug title ought to be changed to "uploads cause crash (SIGABRT)" rather than "uploads hang". I'd missed the crash, because of init restarting lighttpd immediately.

#6 Updated by gstrauss 11 months ago

I was able to reproduce the problem (or one related to it) by having the CGI script send a response without having read all the input when the input was large enough to exceed kernel pipe buffers (so lighttpd was still trying to send input to the CGI script).

If the CGI script reads all the input, the problem does not occur.

If the CGI does not read all the input, and sends a response and exits before lighttpd is woken up to read that input, then lighttpd might get events on two pipes, one to read from the CGI (POLLIN) and one to write to the CGI (POLLERR). However, if lighttpd handles the read and cleans up it's side of the pipes, then by the time the main event loop gets up to the event on the second pipe, the pipe has already been cleaned up and lighttpd performs an explicit abort.

If you were seeing exactly the same issue, I would expect the error log to contain
fdevent.c.214: aborted
However, I am going to fix the core issue, which is a race condition in processing events on two independent pipes in the main event loop, where those two pipes might be cleaned up by a CGI after processing the first event and before events on the second pipe have been handled.

#7 Updated by gstrauss 11 months ago

What is server.event-handler set to in the config file? Or is the default used?
If it is not explicitly "libev", then please try server.event-handler = "libev". If that works, then we have correctly identified the problem.

I am working on a solution and hope to post a patch tomorrow evening.

#8 Updated by gstrauss 11 months ago

  • Subject changed from Uploads hang in 1.4.41 to 1.4.40/1.4.41 uploads to CGI may cause crash (SIGABRT)

#9 Updated by DavidAnderson684 11 months ago

If the CGI script reads all the input, the problem does not occur.

Seems unlikely that PHP's $_FILE handling isn't reading all the input? I could get the error (but not every) time even on small files - I think one was 10KB.

What is server.event-handler set to in the config file? Or is the default used?

Default. I will try the value 'libev' later.

#10 Updated by gstrauss 11 months ago

  • Status changed from New to Patch Pending

I am pushing a patch which fixes the more generic issue of events reported on abandoned fds.

More details: inside the event loop, a batch of ready events might be returned. As lighttpd walks through those events and calls a handler for each event, the handler might close other fds, e.g. other pipes related to handling that CGI request. If those other fds have pending ready events which have not yet been dispatched as lighttpd walks through the ready events, then lighttpd would abort() when it discovered that it was about to dispatch to a handler that had already been cleaned up.

The fix implemented also handles reuse of fds which might occur during event handling if one handler closes an fd and a subsequent handler opens the same fd (A-B-A problem). If lighttpd had a pending event on the fd that was closed by a prior handler, and a subsequent handler did an open() and received the same fd number, then lighttpd might attempt to dispatch the ready event (on fd that was recently closed) to the new open() on the same fd.

@DavidAnderson684: I am still interested in hearing whether or not server.event-handler = "libev" was a valid workaround to the issue you reported. With the patch I am pushing shortly, using the default server.event-handler will one again be the recommended choice.

#11 Updated by gstrauss 11 months ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100

#12 Updated by DavidAnderson684 11 months ago

When setting server.event-handler = "libev" on lighttpd 1.4.41 (CentOS 6 EPEL package), it doesn't start:

(configfile.c.1482) the selected event-handler in unknown or not supported: libev

#13 Updated by gstrauss 11 months ago

(configfile.c.1482) the selected event-handler in unknown or not supported: libev

If "lighttpd -V" doesn't show libev, then it was not built into the CentOS lighttpd package. Sorry.

If you can build from source, the latest lighttpd git master should work for you.
$ git clone https://git.lighttpd.net/lighttpd/lighttpd1.4.git
https://redmine.lighttpd.net/projects/lighttpd/wiki/DevelGit

As long as I can get feedback that this patch fixes the issue, there will probably be a release of lighttpd 1.4.42 towards the end of September. It would be really helpful to find out if this patch works for you. Sorry about the trouble, and thanks for your help in tracking it down.

#14 Updated by DavidAnderson684 11 months ago

Hi, unfortunately, I am hugely busy and don't think I can find time to look into how to compile my own package. I will just report back when 1.4.42 makes its way into CentOS. 1.4.39 is not currently causing me any problems. I can try the 1.4.40 CentOS package, if you suspect that you know which commit caused the problem, and which version it is in - that might help confirm the thesis.

#15 Updated by gstrauss 11 months ago

Sorry. The issue is present in 1.4.40, too, as I noted in the title.

Also available in: Atom