Bug #1913

not fully serving data

Added by ggl almost 6 years ago. Updated over 5 years ago.

Status:FixedStart date:2009-02-23
Priority:HighDue date:
Assignee:stbuehler% Done:

100%

Category:core
Target version:1.4.22
Missing in 1.5.x:

Description

I have upgraded today to 1.4.21 in order to try to resolve some problem with the machine freezing up when I upload files to pixelpost. I haven't even tried to reproduce the problem after the upgrade to 1.4.21 because it has some other bigger issues as it's not even fully serving images. Although the image files are fine lighty is serving them truncated. The system is a FreeBSD 7.1R box. Lighty runs php through mod_fcgi and has mod_simple_vhost, mod_access, mod_auth and mod_userdir enabled. There are no relevant errors in the logs even with debug enabled.

lighttpd.conf.nocomment - sanitized config file (4.1 KB) ggl, 2009-02-23 10:43

ktrace.out (250 KB) aevernon, 2009-02-24 01:58


Related issues

Duplicated by Bug #1920: lighttpd 1.4.21 don't fully support image files Fixed 2009-02-26

Associated revisions

Revision 2403
Added by stbuehler almost 6 years ago

Fix EAGAIN handling for freebsd sendfile (fixes #1913, thx AnMaster for spotting the problem)

Revision 2405
Added by stbuehler over 5 years ago

Treat EPIPE as connection-closed error in network_freebsd_sendfile.c (another fix from #1913)

History

#1 Updated by aevernon almost 6 years ago

I am able to reproduce this on FreeBSD 7.1, PHP 5.2.8, and mod_fastcgi. In my case, Lighty is corrupting static JavaScript files that are about 300 KB. As above, there is nothing relevant in the logs with debug enabled. Rolling back to 1.4.20 or 1.4.19 makes the problem go away.

#2 Updated by stbuehler almost 6 years ago

  • Could you please add a trace of the syscalls (strace/ktrace/whatever...) for such requests?
  • I guess you are using the freebsd-sendfile backend?
  • If you want you could try to revert #1813 (r2389) - that is the only freebsd related change i could find.

#3 Updated by aevernon almost 6 years ago

  • I have attached ktrace.out.
  • Yes, lighttpd -V shows I am using the sendfile network handler.
  • Yes, reverting r2389 does indeed fix the problem.

#4 Updated by dymko almost 6 years ago

  • Target version changed from 1.4.22 to 1.4.21

I have had exactly the same issue with js file (jquery-1.3.1.min.js - 55K) and with images yesterday.
I tried to make target without that diff:
http://redmine.lighttpd.net/projects/lighttpd/repository/diff/branches/lighttpd-1.4.x/src/network_freebsd_sendfile.c?rev=2389&rev_to=2256

And it seems to have worked.

I'm sorry for my mistakes. My English isn't good

FreeBSD 7.0-RELEASE + Lighty + Ramaze with Style(Mongrel) backend.

#5 Updated by nitrox almost 6 years ago

  • Assignee set to stbuehler
  • Priority changed from Urgent to High

#6 Updated by nitrox almost 6 years ago

  • Target version changed from 1.4.21 to 1.4.22

#7 Updated by stbuehler almost 6 years ago

The trace is not really readable (binary form?)

#8 Updated by simmel almost 6 years ago

aevernon wrote:

  • I have attached ktrace.out.
  • Yes, lighttpd -V shows I am using the sendfile network handler.
  • Yes, reverting r2389 does indeed fix the problem.

You need to kdump the ktrace.out, see http://redmine.lighttpd.net/projects/lighttpd/wiki/HowToReportABug#ktrace

#9 Updated by AnMaster almost 6 years ago

I ran into this issue on FreeBSD (FreeBSD 6.4 on x86 in my case) too with 1.4.21. As a temporary workaround I changed to writev for server.network-backend.

Since it is a production box I can't test on it, but if you need help with testing something I guess I could do it on my test server. Just tell me what/how you need tested.

By the way, it seemed to only serve some files incorrectly, mostly the larger ones. While debugging the issue (and before finding this bug report) I dumped the original image and the corrupted variant (downloaded from server with wget) using od and then diffed the results. The differences start at 32768 bytes from the start of the file. The file size was the same for both the original and corrupted file: 192907 bytes.

#10 Updated by AnMaster almost 6 years ago

Further looking at file dumps reveal that in the corrupted file, the different data was a jpeg file header. It seems that there is a jpeg file header every 32768 bytes. Not only that but after 32768 bytes into the file every following "4096 byte block" is just a copy of the first 4096 bytes of the file (with the last such block cut short so the file size is the same as the non-corrupted file).

#11 Updated by AnMaster almost 6 years ago

Looking at the diff for the revision mentioned as causing the issue it seems quite clear:

r (in which number of bytes sent so far is returned from sendfile() in case of EAGAIN) is reset to 0, and is then added to offset: c->offset += r; some lines below.

As the offset to send the next chunk is calculated from this:
offset = c->file.start + c->offset;
it results in sending the same chunk again next time (as far as I can tell from reading the source).

From man sendfile on FreeBSD 6.4:

     [EAGAIN]           The socket is marked for non-blocking I/O and not all
                        data was sent due to the socket buffer being filled.
                        If specified, the number of bytes successfully sent
                        will be returned in *sbytes.

I guess whoever committed r2389 never tested it properly, nor read the relevant man page.

#12 Updated by stbuehler almost 6 years ago

Yes, i did not test it at all - as i have no freebsd running. That is what pre-releases are made for btw.

And yes, i didn't read the man page properly... i didn't expect to get an error if something was successful (most implementations would just return no error and you would see that it wrote less bytes than it could have).

#13 Updated by stbuehler almost 6 years ago

  • Status changed from New to Fixed
  • % Done changed from 0 to 100

Applied in changeset r2403.

#14 Updated by AnMaster almost 6 years ago

By the way the man page also mentions EINTR returns stuff in the same variable:

     [EINTR]            A signal interrupted sendfile() before it could be
                        completed.  If specified, the number of bytes success-
                        fully sent will be returned in *sbytes.

while your commit only mentions:
/* for EAGAIN r still contains the sent bytes */

I'm still not sure the logic is correct though. It seems to not handle EPIPE correctly (assuming it should be handled the way it is handled for writev() in the MEM_CHUNK code in the same file).
     [EPIPE]            The socket peer has closed the connection.

See also http://www.freebsd.org/cgi/man.cgi?query=sendfile

#15 Updated by stbuehler almost 6 years ago

The man page i found didn't event mention EINTR... i guess you see now why its a little bit hard for us to code this :)

So adding EPIPE to the ENOTCONN case (return -2) should be ok?

If you can verify the following patch works, i guess we could still include it in 1.4.22:

diff --git a/src/network_freebsd_sendfile.c b/src/network_freebsd_sendfile.c
index c513bf0..9bdd2a9 100644
--- a/src/network_freebsd_sendfile.c
+++ b/src/network_freebsd_sendfile.c
@@ -167,8 +167,9 @@ int network_write_chunkqueue_freebsdsendfile(server *srv, connection *con, int f
                 switch(errno) {
                 case EAGAIN:
                 case EINTR:
-                    /* for EAGAIN r still contains the sent bytes */
+                    /* for EAGAIN/EINTR r still contains the sent bytes */
                     break; /* try again later */
+                case EPIPE:
                 case ENOTCONN:
                     return -2;
                 default:

#16 Updated by AnMaster almost 6 years ago

The patch seems to work ok (had to apply it manually: it looks like redmine expanded tabs to spaces in the patch you pasted). I haven't tested it's behaviour on EINTR or EPIPE/ENOTCON since I don't know how to reproduce those. But it works for EAGAIN at least.

Against 1.4.21 I ended up with this patch (I guess tabs will get expanded here too):

Index: src/network_freebsd_sendfile.c
===================================================================
--- src/network_freebsd_sendfile.c
+++ src/network_freebsd_sendfile.c
@@ -167,8 +167,9 @@
                                switch(errno) {
                                case EAGAIN:
                                case EINTR:
-                                       r = 0; /* try again later */
-                                       break;
+                                       /* for EAGAIN/EINTR r still contains the sent bytes */
+                                       break; /* try again later */
+                               case EPIPE:
                                case ENOTCONN:
                                        return -2;
                                default:

#17 Updated by aevernon almost 6 years ago

stbuehler wrote:

Yes, i did not test it at all - as i have no freebsd running. That is what pre-releases are made for btw.

I am all for testing pre-releases, but I respectfully disagree with releasing untested code into public releases. This bug was a bad one and made it into the FreeBSD ports tree. If you don't have an appropriate system available for testing, it would be better to delay a patch than risk bringing down people's web servers.

#18 Updated by icy almost 6 years ago

Well, every pre-release is out for quite some time before the final version is released. If noone reports any problems back than we assume it is not bugged.
This is our testing phase. If nobody feels like testing then things like this might happen.
Implementing features, providing support and fixing bugs take a lot of time. And so does testing.
But testing can be done by anyone out there so why not "outsource" it to the community? See it as a way of contributing back to a free product.
But maybe we can "recruit" some dedicated testers with all kinds of OSs, we'll see about that.

#19 Updated by aevernon almost 6 years ago

I understand what you are saying about the project being developed and tested by volunteers. Nevertheless, I still think it is risky for someone to develop an OS-specific patch when he does not have access to that OS and therefore cannot test at all.

In other words, why develop something if you can't test it before committing it?

#20 Updated by icy almost 6 years ago

So you are suggesting we shall drop support for every OS out there that we are not using? Sorry but this is completely irrational thinking.
But please let's not extend this bugreport into a big discussion :)

#21 Updated by aevernon almost 6 years ago

No, I do not suggest dropping support for OSes. I agree that would be irrational.

I am suggesting that all patches committed by core developers should be tested first.

#22 Updated by brad@comstyle.com almost 6 years ago

aevernon wrote:

No, I do not suggest dropping support for OSes. I agree that would be irrational.

I am suggesting that all patches committed by core developers should be tested first.

Where does this magical testing come from?!

#23 Updated by aevernon almost 6 years ago

It should be performed by whoever created the patch.

Test before committing. That's all I'm recommending.

#24 Updated by aevernon almost 6 years ago

To stbuehler: changeset r2403 works fine on my server.

To icy & stbuehler: I can help you test future FreeBSD patches if you want.

#25 Updated by ggl almost 6 years ago

Hello. I'm sure we can find a constructive solution to this testing problem. If you need testing on FreeBSD or any other systems you are not regulary testing on please say so under the prerelease announcement. Also, if you need better testing of X feature that you suspect might break please say so. I'm sure people would gladly test the prerelease rather than risk installing a buggy release labelled as stable and blowing up their servers.

#26 Updated by peto over 5 years ago

It should be performed by whoever created the patch.

I created the patch [1], and the patch I gave has been running in production for months--but the change that was actually applied [2], which I never saw, was entirely different.

[1] http://redmine.lighttpd.net/attachments/677/lighttpd-bad-errno-check.diff
[2] http://redmine.lighttpd.net/projects/lighttpd/repository/revisions/2389/diff/branches

#27 Updated by stbuehler over 5 years ago

I don't think anyone blamed you, peto :)

I now committed the second part of the patch (r2405), and there will be a prerelease soon.

But no, I don't think we will list everyone who should test a prerelease. If you can test it, just test it. Everytime. Just make sure, that the next lighty release will run stable in your environment.

If you want to know where something may have changed (and what you could test additionally), just have a look at the CHANGES list.

And if you reported a bug / proposed a patch, have a look at the fix after it got committed. (blame peto :p)

#28 Updated by aevernon over 5 years ago

I am working on donating a FreeBSD server to the LightTPD project. I'll provide shell access to those who want to help with testing.

Also available in: Atom