https://redmine.lighttpd.net/https://redmine.lighttpd.net/favicon.ico?13667327412020-12-07T12:33:02Zlighty labsLighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121002020-12-07T12:33:02Zflynn
<ul></ul><p>The server has the following backends:<br />- 5 with HTTP-protocol and TCP connection<br />- 1 with HTTP-protocol and UNIX-Domain socket and websockets enabled<br />- 1 with UWSGI-protocol and TCP connection</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121012020-12-07T19:49:49Zgstrauss
<ul></ul><p>This might take some time to track down.</p>
<p>Do you recall if you saw these errors with any earlier 1.4.56-rcX (release candidate) versions of lighttpd?</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121022020-12-08T06:57:16Zgstrauss
<ul></ul><p>When you reported <a class="issue tracker-1 status-5 priority-4 priority-default closed" title="Bug: Memory Growth with PUT and full buffered streams (Fixed)" href="https://redmine.lighttpd.net/issues/3033">#3033</a>, did you notice this issue, too, or were you not testing the lighttpd 1.4.56 release candidate on the same server or same usage?</p>
<p>If you can reproduce this easily, please try flipping on and off some of the features introduced in lighttpd 1.4.56 to help narrow down when the issue(s) occur.<br />Does the problem occur with and without HTTP/2?<br /><code>server.feature-flags += ("server.h2proto" => "disable")</code><br />Does the problem occur if (non-<code>Upgrade</code>) requests are sent with HTTP/1.0 by mod_proxy instead of HTTP/1.1?<br /><code>server.feature-flags += ("proxy.force-http10" => "enable")</code><br />(You'll have to scope "proxy.force-http10" to avoid the condition which use mod_proxy to websockets, or else it will ignore the <code>Upgrade</code> header)</p>
<p>From the errors shared, I can see that one of the errors is likely coming from "1 with HTTP-protocol and UNIX-Domain socket and websockets enabled" unless something else in your config is using unix domain sockets.</p>
<p>However, there is probably more than one issue here, since when a response contains <code>Upgrade</code>, the connection is switched to streaming mode with a (limited) in-memory buffer. Yet, I see some errors in your logs relating to file uploads, which are saved into temporary files. Would you share the (approximate) sizes of file uploads and whether those POST (or PUT or other uploads) are to the TCP backends, to the UWSGI backends, or to both?</p>
<p>The errors on the client connection fds with mod_openssl, as well as some of the other errors, suggests there might be an issue with the fd accounting in some usage scenarios.</p>
<p>I'll start trying to reproducing this tomorrow. Any additional hints to help me narrow it down would be appreciated. Thank you.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121032020-12-08T07:01:50Zflynn
<ul></ul><p>This problem only occurs on one server, which I directly updated from version 1.4.55 to 1.4.56,<br />so I do not have any comparative values with RC versions.</p>
<p>Possbile differences which may cause the problem:<br />- UNIX domain socket instead of TCP connection<br />- many long "running"/streamed responses in parallel</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121042020-12-08T07:24:38Zflynn
<ul></ul><p>Because the server is in use, I will probably not be able to do the tests until the weekend.</p>
<p>Websockets are enabled, but all errors occurs in NOT Upgraded responses, as far as I remember.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121062020-12-08T21:01:55Zflynn
<ul></ul><p>I made some of your suggested tests:<br />- disable HTTP/2 does NOT solve the problem<br />- scope "proxy.force-http10" does not work, I get the follwing error message: <code>(configfile-glue.c.300) DEPRECATED: do not set server options in conditionals, variable: server.feature-flags</code><br />- setting <code>server.feature-flags += ("proxy.force-http10" => "enable")</code> in global scope seems to solve all problems, at least for the two known test cases</p>
<p>Without <code>server.feature-flags += ("proxy.force-http10" => "enable")</code> (= HTTP/1.1 enable for proxy backends) and checking out git over https I get the following error messages from git/curl:<br />- either <code>Illegal or missing hexadecimal sequence in chunked-encoding</code><br />- or <code>Received HTTP code 408 from proxy after CONNECT</code></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121072020-12-08T21:56:10Zgstrauss
<ul></ul><p>I am able to reproduce an error, though there might be additional errors.<br />There is an error in mod_proxy in lighttpd 1.4.56. The error can be avoided with one or both of:<br /><pre>
server.stream-response-body = 0
server.feature-flags += ("proxy.force-http10" => "enable")
</pre><br />Disabling streaming response is undesirable. Forcing "proxy.force-http10" is the same behavior as lighttpd 1.4.55, except that forcing HTTP/1.0 also disables the <code>Upgrade</code>, which was special-cased in lighttpd 1.4.55 (and uses HTTP/1.1 in lighttpd 1.4.55, but only with <code>Upgrade</code>)</p>
<p>Now to track down and fix the error...</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121082020-12-08T22:00:49Zgstrauss
<ul></ul><p>(<a class="user active user-mention" href="https://redmine.lighttpd.net/users/2609">@flynn</a> I had posted the above before seeing your update)</p>
<p>Yes, <code>server.feature-flags</code> is valid only in the global scope, which means that "proxy.force-http10" is not an option for your setup, as doing so would break your websockets proxy.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121112020-12-09T06:26:11Zgstrauss
<ul></ul><p>I need to do additional testing, but here is a <strong>preliminary</strong> patch which fixes the error that I reproduced. Explanation: When the backend sends <code>Transfer-Encoding: chunked</code> (which is HTTP/1.1, not HTTP/1.0) and the client is HTTP/1.1, then lighttpd will transfer the chunked encoding as it is sent from the backend without excess data copying involved in decoding and then re-encoding. The bug is that if the first data chunk (or chunks) is <strong>partially</strong> received at the same time as the response headers, then the chunked encoding gets corrupted. (When I wrote this, I must not have tested with large enough chunks to result in a partial chunk in the <strong>first</strong> packet, or my backend must have written the response headers (unbuffered) separately from the data, so no partial chunk was received with the response headers.)<br /><pre>
diff --git a/src/response.c b/src/response.c
index 837178c0..904c96a3 100644
--- a/src/response.c
+++ b/src/response.c
@@ -932,11 +932,23 @@ http_response_write_prepare(request_st * const r)
/* create initial Transfer-Encoding: chunked segment */
buffer * const b =
chunkqueue_prepend_buffer_open(&r->write_queue);
+ if (r->resp_decode_chunked
+ && 0 != r->gw_dechunk->gw_chunked) {
+ /*(reconstitute initial partially-decoded chunk)*/
+ off_t gw_chunked = r->gw_dechunk->gw_chunked;
+ if (gw_chunked > 2)
+ qlen += gw_chunked - 2;
+ else if (1 == gw_chunked)
+ chunkqueue_append_mem(&r->write_queue,
+ CONST_STR_LEN("\r"));
+ }
+ else {
+ chunkqueue_append_mem(&r->write_queue,
+ CONST_STR_LEN("\r\n"));
+ }
buffer_append_uint_hex(b, (uintmax_t)qlen);
buffer_append_string_len(b, CONST_STR_LEN("\r\n"));
chunkqueue_prepend_buffer_commit(&r->write_queue);
- chunkqueue_append_mem(&r->write_queue,
- CONST_STR_LEN("\r\n"));
}
http_header_response_append(r, HTTP_HEADER_TRANSFER_ENCODING,
CONST_STR_LEN("Transfer-Encoding"),
</pre></p>
<p>Even if this patch resolves your two test cases, I am looking further into how to better handle some of the error conditions, since some of the error trace should not occur, specifically operations on bad file descriptors.</p>
<p>Also, the error case that I reproduced does not occur if the client is HTTP/2 since in that case, lighttpd decodes <code>Transfer-Encoding: chunked</code> from the backend, and re-frames into HTTP/2 frames.<br />[Edit: clarification:] Disabling HTTP/2 for testing did not affect whether or not the issue occurred since the issue occurred only for HTTP/1.1 clients.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121132020-12-09T07:37:18Zflynn
<ul></ul><p>I have now an error case with an HTTP/2 request, that produces the socket error messages, unfortunately it is not stable reproducable:</p>
<pre>
2020-12-09 08:15:54: (chunk.c.952) read: Success
2020-12-09 08:15:54: (connections.c.466) connection closed: write failed on fd 38
2020-12-09 08:15:55: (chunk.c.952) read: Success
2020-12-09 08:15:55: (connections.c.466) connection closed: write failed on fd 47
2020-12-09 08:15:57: (chunk.c.947) lseek: Bad file descriptor
2020-12-09 08:15:57: (connections.c.466) connection closed: write failed on fd 51
</pre>
<p>It is a git (version 2.29.2) checkout with the following error messages:<br /><pre>
error: 6602 bytes of body are still expected
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
warning: unable to unlink '(null)': Bad address
</pre></p>
<p>It seems, that lighttpd closes the connection in some cases prematurely ...</p>
<p>The same request may succeed in further attempts, so it needs several requests to trigger the error.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121142020-12-09T08:19:24Zgstrauss
<ul></ul><p><code>2020-12-09 08:15:54: (chunk.c.952) read: Success</code><br /><code>2020-12-09 08:15:54: (connections.c.466) connection closed: write failed on fd 38</code></p>
<p>Those suggest to me that there is a 0-length data chunk which was not expected and not handled well. Would you <code>strace()</code> to find the args passed to <code>read()</code> or try the following patch?</p>
<pre>
diff --git a/src/chunk.c b/src/chunk.c
index d1e80efb..2af3d3bc 100644
--- a/src/chunk.c
+++ b/src/chunk.c
@@ -942,6 +942,7 @@ chunkqueue_peek_data (chunkqueue * const cq,
off_t toSend = c->file.length - c->offset;
if (toSend > (off_t)space)
toSend = (off_t)space;
+ if (0 == toSend) break;
if (-1 == lseek(c->file.fd, offset, SEEK_SET)) {
log_perror(errh, __FILE__, __LINE__, "lseek");
</pre> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121152020-12-09T10:08:18Zgstrauss
<ul></ul><p>This patch may address some of the <code>Bad file descriptor</code> errors.<br /><pre>
diff --git a/src/chunk.c b/src/chunk.c
index d1e80efb..c35d2a37 100644
--- a/src/chunk.c
+++ b/src/chunk.c
@@ -107,13 +107,13 @@ static void chunk_reset_file_chunk(chunk *c) {
}
else if (c->file.fd != -1) {
close(c->file.fd);
- c->file.fd = -1;
}
if (MAP_FAILED != c->file.mmap.start) {
munmap(c->file.mmap.start, c->file.mmap.length);
c->file.mmap.start = MAP_FAILED;
c->file.mmap.length = c->file.mmap.offset = 0;
}
+ c->file.fd = -1;
c->file.length = 0;
c->type = MEM_CHUNK;
}
</pre></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121162020-12-09T19:40:43Zgstrauss
<ul></ul><p>To reduce log noise, this patch will skip the warning trace reporting error on backend socket if the connection has been upgraded, e.g. to websockets.<br /><pre>
diff --git a/src/gw_backend.c b/src/gw_backend.c
index 711908a5..f4c6845e 100644
--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -33,6 +33,7 @@
#include "buffer.h"
#include "chunk.h"
#include "fdevent.h"
+#include "http_header.h"
#include "log.h"
#include "sock_addr.h"
@@ -2265,7 +2265,7 @@ static handler_t gw_recv_response_error(gw_handler_ctx * const hctx, request_st
"socket: %s for %s?%.*s, closing connection",
(long long)hctx->wb.bytes_out, proc->connection_name->ptr,
r->uri.path.ptr, BUFFER_INTLEN_PTR(&r->uri.query));
- } else {
+ } else if (!light_btst(r->resp_htags, HTTP_HEADER_UPGRADE)) {
log_error(r->conf.errh, __FILE__, __LINE__,
"response already sent out, but backend returned error on "
"socket: %s for %s?%.*s, terminating connection",
</pre></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121172020-12-09T19:54:34Zgstrauss
<ul><li><strong>Status</strong> changed from <i>New</i> to <i>Patch Pending</i></li></ul><p>I believe that the above patches fix the issues (and the secondary effects). You should not need the patch with <code>+ if (0 == toSend) break;</code> after applying the patch with <code>+ c->file.fd = -1;</code></p>
<p>The missing <code>+ c->file.fd = -1;</code> initialization is important enough that I'll be rolling and releasing lighttpd 1.4.57 next week (or sooner) once you confirm that the issues have been addressed.</p>
<p>Thank you.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121192020-12-10T06:40:42Zflynn
<ul></ul><p>Thank you for patches, I will test them on Saturday.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121212020-12-13T22:42:59Zflynn
<ul></ul><p>With these patches (without <code>+ if (0 == toSend) break;</code>) I cannot reproduce the problem, even on repeated attempts and large amount of data transfers. Memory consumption looks good, a maximum of 6MB resident memory.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121222020-12-13T23:56:38Zgstrauss
<ul></ul><p>That's great news. Thank you for your thoroughness.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121252020-12-14T05:29:34Zflynn
<ul></ul><p>I do not know wheter it belongs to this issue, but after some hours later I got this error on same instance:</p>
<p><code>(gw_backend.c.2270) response already sent out, but backend returned error on socket: unix:/var/xxxxxxxxxxxxx for /xz?, terminating connection</code></p>
<p>After this message the light process has 100% CPU and hangs.</p>
<p>strace does not show any systemn calls ...</p>
<p><code>perf top -p ...</code> shows this functions:<br /><pre>
99.73% lighttpd [.] http_chunk_decode_append_data
0.19% lighttpd [.] http_chunk_append_mem
</pre></p>
<p>It seems, that lighty is stuck in an endless loop.</p>
<p>After a restart of lighttpd everything is ok, no restart of the backends needed.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121262020-12-14T06:41:42Zflynn
<ul></ul>Additional observations:
<ul>
<li>first the error message <code>(gw_backend.c.2270) response already sent out, ...</code></li>
<li>monitoring indicates, that the user CPU went up to 100% immediately after the error message, but the time resolution of the monitoring is not precise enough to verify a direct correlation</li>
<li>lighttpd still serves requests, but gets slower</li>
<li>after about 10min the montoring shows, that more and more requests fail, probably due to timeout, there no more lines of the monitoring in the access log until the restart</li>
</ul> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121272020-12-14T10:28:02Zflynn
<ul></ul><p>It happened again and I tried to produce a backtrace in gdb:<br /><pre>
(gdb) bt
#0 0x000055caa0aea546 in http_chunk_decode_append_data (r=0x55caa20b3d60, mem=0x55caa241897f "\r", len=1) at http_chunk.c:350
#1 0x000055caa0af50c3 in http_response_read (r=r@entry=0x55caa20b3d60, opts=opts@entry=0x55caa243cc08, b=b@entry=0x55caa1fc7750,
fdn=0x55caa20be110) at http-header-glue.c:1536
#2 0x000055caa0aed7cf in gw_recv_response (hctx=0x55caa243cb60, r=0x55caa20b3d60) at gw_backend.c:2150
#3 0x000055caa0aedacb in gw_handle_fdevent (ctx=0x55caa243cb60, revents=1) at gw_backend.c:2291
#4 0x000055caa0af23c3 in fdevent_linux_sysepoll_poll (ev=0x55caa1e7ff60, timeout_ms=<optimized out>) at fdevent_linux_sysepoll.c:43
#5 0x000055caa0aead64 in fdevent_poll (ev=0x55caa1e7ff60, timeout_ms=<optimized out>) at fdevent.c:437
#6 0x000055caa0ad9d63 in server_main_loop (srv=0x55caa1dee260) at server.c:1908
#7 0x000055caa0ac8ea7 in main (argc=4, argv=0x7ffffb3e59a8) at server.c:1949
</pre></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121282020-12-14T15:57:58Zgstrauss
<ul></ul><p>Thanks. That is another bug. I carefully reviewed <code>http_chunk_decode_append_data()</code> for any others like that.<br /><pre>
diff --git a/src/http_chunk.c b/src/http_chunk.c
index 28daaa3b..b08e0dcc 100644
--- a/src/http_chunk.c
+++ b/src/http_chunk.c
@@ -437,7 +437,13 @@ http_chunk_decode_append_data (request_st * const r, const char *mem, off_t len)
len -= 2;
te_chunked = 0;
}
- else if (len == 1 && mem[0] != '\r') return -1;
+ else if (len == 1) {
+ if (mem[0] != '\r') return -1;
+ /*++mem;*/
+ /*--len;*/
+ te_chunked = 1;
+ break;
+ }
}
}
else if (1 == te_chunked) {
</pre></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121292020-12-14T17:34:35Zgstrauss
<ul></ul><p>Fix for another edge case.<br /><pre>
--- a/src/http_chunk.c
+++ b/src/http_chunk.c
@@ -420,6 +420,7 @@ http_chunk_decode_append_data (request_st * const r, const char *mem, off_t len)
len -= hsz;
te_chunked += 2; /*(for trailing "\r\n" after chunked data)*/
+ if (0 == len) break;
}
if (te_chunked >= 2) {
</pre></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121302020-12-15T00:16:27Zmgottinger
<ul></ul><p>Thank you very much for implementing HTTP2 - lighty got pretty quick.</p>
<p>I have had messages like:<br /><pre>
(chunk.c.947) lseek: Illegal seek
(connections.c.466) connection closed: write failed on fd 164
</pre></p>
<pre>
(chunk.c.947) lseek: Bad file descriptor
(connections.c.466) connection closed: write failed on fd 169
</pre>
<pre>
(mod_openssl.c.2916) SSL: 5 -1: Bad file descriptor
(mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
(fdevent.c.367) fdevent event_del failed on fd 1853453417: Bad file descriptor
(fdevent.c.400) fdevent event_set failed on fd -1668802746: Bad file descriptor
</pre>
<p>too, they seem to be gone with your patches (excluding "if (0 == toSend) break;" and the one for log noise).<br />I have had most problems with transfering bigger files > 100MB.</p>
<p>I will post any update on this on my end.</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121382020-12-15T07:32:29Zflynn
<ul></ul><p>Test is running with latest patches in <code>src/http_chunk.c</code>, but I cannot trigger the 100% CPU issue. I just need to wait, whether it happens again or not ...</p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121562020-12-16T07:21:19Zgstrauss
<ul></ul><p>I have added a feature to mod_proxy (for lighttpd 1.4.57) to allow specific backends to be forced to use HTTP/1.0<br /><code>proxy.header += ( "force-http10" => "enable" )</code> (default is <code>disable</code>)<br />If any <code>proxy.header</code> values are set, this value (or default <code>disable</code>) overrides the global <code>server.feature-flags += ("proxy.force-http10" => "...")</code></p> Lighttpd - Bug #3044: Socket errors after update to version 1.4.56https://redmine.lighttpd.net/issues/3044?journal_id=121572020-12-16T07:25:07Zgstrauss
<ul><li><strong>Status</strong> changed from <i>Patch Pending</i> to <i>Fixed</i></li></ul><p>Applied in changeset <a class="changeset" title="[core] fix handling chunked response from backend (fixes #3044) (thx flynn) When the backend se..." href="https://redmine.lighttpd.net/projects/lighttpd/repository/14/revisions/d15e2286f2be549798fa81756758a52336913a7e">d15e2286f2be549798fa81756758a52336913a7e</a>.</p>