Bug #3044

closed

Socket errors after update to version 1.4.56

Added by flynn almost 4 years ago. Updated almost 4 years ago.

Status:
Fixed
Priority:
Normal
Category:
mod_proxy
Target version:
ASK QUESTIONS IN Forums:
No

Description

After updating to version 1.4.56 I get severe socket errors on the server side and broken connections on the client side.
On the same failed request I get different error messages, but all in the same code area.
Here are some examples:

2020-12-07 12:04:07: (chunk.c.952) read: Success
2020-12-07 12:04:07: (connections.c.466) connection closed: write failed on fd 41

2020-12-07 12:05:22: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 12:05:22: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 12:05:22: (fdevent.c.367) fdevent event_del failed on fd 1856042016: Bad file descriptor
2020-12-07 12:05:22: (connections.c.122) (warning) close: 32: Bad file descriptor

2020-12-07 11:00:09: (chunk.c.947) lseek: Bad file descriptor
2020-12-07 11:00:09: (connections.c.466) connection closed: write failed on fd 32

2020-12-07 10:38:19: (gw_backend.c.2269) response already sent out, but backend returned error on socket: unix:/var/lib/socket/xxx for /xxx?, terminating connection
2020-12-07 10:42:59: (chunk.c.947) lseek: Illegal seek
2020-12-07 10:42:59: (connections.c.466) connection closed: write failed on fd 43
2020-12-07 10:43:02: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:43:02: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:43:02: (fdevent.c.367) fdevent event_del failed on fd 39: Bad file descriptor
2020-12-07 10:43:02: (connections.c.122) (warning) close: 39: Bad file descriptor
2020-12-07 10:43:26: (chunk.c.647) write() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-C7wcqJ failed: Bad file descriptor
2020-12-07 10:43:26: (chunk.c.659) close() temp-file /var/cache/lighttpd/uploads/lighttpd-upload-C7wcqJ failed: Bad file descriptor
2020-12-07 10:43:26: (gw_backend.c.2269) response already sent out, but backend returned error on socket: unix:/var/lib/socket/xxx for /xxx?, terminating connection
2020-12-07 10:49:29: (chunk.c.952) read: Success
2020-12-07 10:49:29: (connections.c.466) connection closed: write failed on fd 27
2020-12-07 10:51:56: (mod_openssl.c.3089) SSL: 1 error:1408F10B:SSL routines:ssl3_get_record:wrong version number
2020-12-07 10:52:32: (chunk.c.952) read: Success
2020-12-07 10:52:32: (connections.c.466) connection closed: write failed on fd 27
2020-12-07 10:52:34: (chunk.c.952) read: Success
2020-12-07 10:52:34: (connections.c.466) connection closed: write failed on fd 41
2020-12-07 10:52:39: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:52:39: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:52:39: (fdevent.c.367) fdevent event_del failed on fd 40: Bad file descriptor
2020-12-07 10:52:39: (connections.c.122) (warning) close: 40: Bad file descriptor
2020-12-07 10:52:55: (fdevent.c.367) fdevent event_del failed on fd 0: Operation not permitted
2020-12-07 10:53:11: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:53:11: (mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
2020-12-07 10:53:11: (fdevent.c.367) fdevent event_del failed on fd 0: Operation not permitted

All errors are related to reverse proxy configurations using TCP and UNIX domain sockets with

server.stream-response-body = 1
server.stream-request-body = 0

Reverting to version 1.4.55 with the same configuration solves all problems.

Actions #1

Updated by flynn almost 4 years ago

The server has the following backends:
- 5 with HTTP-protocol and TCP connection
- 1 with HTTP-protocol and UNIX-Domain socket and websockets enabled
- 1 with UWSGI-protocol and TCP connection
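
For context, a rough sketch of what such backends could look like in lighttpd configuration (the URL prefixes, socket path and ports below are made up for illustration and are not taken from the actual config):

# HTTP backend over TCP
proxy.server = ( "/app" => ( ( "host" => "127.0.0.1", "port" => 8081 ) ) )

# HTTP backend over a UNIX domain socket, with Upgrade/websockets allowed
$HTTP["url"] =~ "^/ws" {
  proxy.server = ( "" => ( ( "host" => "/var/lib/socket/app.sock" ) ) )
  proxy.header = ( "upgrade" => "enable" )
}

# uwsgi backend over TCP (mod_scgi speaking the uwsgi protocol)
scgi.protocol = "uwsgi"
scgi.server   = ( "/py" => ( ( "host" => "127.0.0.1", "port" => 3031 ) ) )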

Actions #2

Updated by gstrauss almost 4 years ago

This might take some time to track down.

Do you recall if you saw these errors with any earlier 1.4.56-rcX (release candidate) versions of lighttpd?

Actions #3

Updated by gstrauss almost 4 years ago

When you reported #3033, did you notice this issue, too, or were you not testing the lighttpd 1.4.56 release candidate on the same server or same usage?

If you can reproduce this easily, please try flipping on and off some of the features introduced in lighttpd 1.4.56 to help narrow down when the issue(s) occur.
Does the problem occur with and without HTTP/2?
server.feature-flags += ("server.h2proto" => "disable")
Does the problem occur if (non-Upgrade) requests are sent with HTTP/1.0 by mod_proxy instead of HTTP/1.1?
server.feature-flags += ("proxy.force-http10" => "enable")
(You'll have to scope "proxy.force-http10" to avoid the condition which uses mod_proxy for websockets, or else mod_proxy will ignore the Upgrade header.)

From the errors shared, I can see that one of the errors is likely coming from "1 with HTTP-protocol and UNIX-Domain socket and websockets enabled" unless something else in your config is using unix domain sockets.

However, there is probably more than one issue here, since when a response contains Upgrade, the connection is switched to streaming mode with a (limited) in-memory buffer. Yet, I see some errors in your logs relating to file uploads, which are saved into temporary files. Would you share the (approximate) sizes of file uploads and whether those POST (or PUT or other uploads) are to the TCP backends, to the UWSGI backends, or to both?

The errors on the client connection fds with mod_openssl, as well as some of the other errors, suggest there might be an issue with the fd accounting in some usage scenarios.

I'll start trying to reproduce this tomorrow. Any additional hints to help me narrow it down would be appreciated. Thank you.

Actions #4

Updated by flynn almost 4 years ago

This problem only occurs on one server, which I updated directly from version 1.4.55 to 1.4.56,
so I do not have any comparative data from the RC versions.

Possible differences which may cause the problem:
- UNIX domain socket instead of TCP connection
- many long-running/streamed responses in parallel

Actions #5

Updated by flynn almost 4 years ago

Because the server is in use, I will probably not be able to do the tests until the weekend.

Websockets are enabled, but all errors occur in responses that were NOT upgraded, as far as I remember.

Actions #6

Updated by flynn almost 4 years ago

I ran some of the tests you suggested:
- disabling HTTP/2 does NOT solve the problem
- scoping "proxy.force-http10" does not work; I get the following error message: (configfile-glue.c.300) DEPRECATED: do not set server options in conditionals, variable: server.feature-flags
- setting server.feature-flags += ("proxy.force-http10" => "enable") in the global scope seems to solve all problems, at least for the two known test cases

Without server.feature-flags += ("proxy.force-http10" => "enable") (i.e. HTTP/1.1 enabled for proxy backends), checking out git over https produces the following error messages from git/curl:
- either Illegal or missing hexadecimal sequence in chunked-encoding
- or Received HTTP code 408 from proxy after CONNECT

Actions #7

Updated by gstrauss almost 4 years ago

I am able to reproduce an error, though there might be additional errors.
There is an error in mod_proxy in lighttpd 1.4.56. The error can be avoided with one or both of:

server.stream-response-body = 0
server.feature-flags += ("proxy.force-http10" => "enable")

Disabling streaming responses is undesirable. Forcing "proxy.force-http10" gives the same behavior as lighttpd 1.4.55, except that forcing HTTP/1.0 also disables Upgrade, which was special-cased in lighttpd 1.4.55 (and used HTTP/1.1 in lighttpd 1.4.55, but only with Upgrade).

Now to track down and fix the error...

Actions #8

Updated by gstrauss almost 4 years ago

(@flynn I had posted the above before seeing your update)

Yes, server.feature-flags is valid only in the global scope, which means that "proxy.force-http10" is not an option for your setup, as doing so would break your websockets proxy.

Actions #9

Updated by gstrauss almost 4 years ago

I need to do additional testing, but here is a preliminary patch which fixes the error that I reproduced. Explanation: When the backend sends Transfer-Encoding: chunked (which is HTTP/1.1, not HTTP/1.0) and the client is HTTP/1.1, then lighttpd will transfer the chunked encoding as it is sent from the backend without excess data copying involved in decoding and then re-encoding. The bug is that if the first data chunk (or chunks) is partially received at the same time as the response headers, then the chunked encoding gets corrupted. (When I wrote this, I must not have tested with large enough chunks to result in a partial chunk in the first packet, or my backend must have written the response headers (unbuffered) separately from the data, so no partial chunk was received with the response headers.)
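
For reference, the wire format that gets corrupted is the standard HTTP/1.1 chunked framing (this is the generic layout, not captured from the failing responses): each chunk is a hexadecimal size line terminated by CRLF, followed by that many bytes of data and another CRLF, and the body ends with a zero-size chunk:

4\r\n
Wiki\r\n
0\r\n
\r\n

If lighttpd prepends its own size line while part of a chunk (or its trailing CRLF) from the backend is still pending, the framing no longer parses, which matches the "Illegal or missing hexadecimal sequence in chunked-encoding" error reported by curl above.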

diff --git a/src/response.c b/src/response.c
index 837178c0..904c96a3 100644
--- a/src/response.c
+++ b/src/response.c
@@ -932,11 +932,23 @@ http_response_write_prepare(request_st * const r)
                     /* create initial Transfer-Encoding: chunked segment */
                     buffer * const b =
                       chunkqueue_prepend_buffer_open(&r->write_queue);
+                    if (r->resp_decode_chunked
+                        && 0 != r->gw_dechunk->gw_chunked) {
+                        /*(reconstitute initial partially-decoded chunk)*/
+                        off_t gw_chunked = r->gw_dechunk->gw_chunked;
+                        if (gw_chunked > 2)
+                            qlen += gw_chunked - 2;
+                        else if (1 == gw_chunked)
+                            chunkqueue_append_mem(&r->write_queue,
+                                                  CONST_STR_LEN("\r"));
+                    }
+                    else {
+                        chunkqueue_append_mem(&r->write_queue,
+                                              CONST_STR_LEN("\r\n"));
+                    }
                     buffer_append_uint_hex(b, (uintmax_t)qlen);
                     buffer_append_string_len(b, CONST_STR_LEN("\r\n"));
                     chunkqueue_prepend_buffer_commit(&r->write_queue);
-                    chunkqueue_append_mem(&r->write_queue,
-                                          CONST_STR_LEN("\r\n"));
                 }
                 http_header_response_append(r, HTTP_HEADER_TRANSFER_ENCODING,
                                             CONST_STR_LEN("Transfer-Encoding"),

Even if this patch resolves your two test cases, I am looking further into how to better handle some of the error conditions, since some of the error trace should not occur, specifically operations on bad file descriptors.

Also, the error case that I reproduced does not occur if the client is HTTP/2 since in that case, lighttpd decodes Transfer-Encoding: chunked from the backend, and re-frames into HTTP/2 frames.
[Edit: clarification:] Disabling HTTP/2 for testing did not affect whether or not the issue occurred since the issue occurred only for HTTP/1.1 clients.

Actions #10

Updated by flynn almost 4 years ago

I now have an error case with an HTTP/2 request that produces the socket error messages; unfortunately it is not reliably reproducible:

2020-12-09 08:15:54: (chunk.c.952) read: Success
2020-12-09 08:15:54: (connections.c.466) connection closed: write failed on fd 38
2020-12-09 08:15:55: (chunk.c.952) read: Success
2020-12-09 08:15:55: (connections.c.466) connection closed: write failed on fd 47
2020-12-09 08:15:57: (chunk.c.947) lseek: Bad file descriptor
2020-12-09 08:15:57: (connections.c.466) connection closed: write failed on fd 51

It is a git (version 2.29.2) checkout with the following error messages:

error: 6602 bytes of body are still expected
fatal: the remote end hung up unexpectedly
fatal: early EOF
fatal: index-pack failed
warning: unable to unlink '(null)': Bad address

It seems that lighttpd closes the connection prematurely in some cases ...

The same request may succeed in further attempts, so it needs several requests to trigger the error.

Actions #11

Updated by gstrauss almost 4 years ago

2020-12-09 08:15:54: (chunk.c.952) read: Success
2020-12-09 08:15:54: (connections.c.466) connection closed: write failed on fd 38

Those suggest to me that there is a 0-length data chunk which was not expected and not handled well. Would you run strace to find the args passed to read(), or try the following patch?
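
(If attaching a tracer is easier than patching, something along these lines should show the arguments passed to read() and lseek() on the running process; the process selection is just an example:)

strace -tt -f -p "$(pidof lighttpd)" -e trace=read,lseek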

diff --git a/src/chunk.c b/src/chunk.c
index d1e80efb..2af3d3bc 100644
--- a/src/chunk.c
+++ b/src/chunk.c
@@ -942,6 +942,7 @@ chunkqueue_peek_data (chunkqueue * const cq,
                 off_t toSend = c->file.length - c->offset;
                 if (toSend > (off_t)space)
                     toSend = (off_t)space;
+                if (0 == toSend) break;

                 if (-1 == lseek(c->file.fd, offset, SEEK_SET)) {
                     log_perror(errh, __FILE__, __LINE__, "lseek");

Actions #12

Updated by gstrauss almost 4 years ago

This patch may address some of the Bad file descriptor errors.

diff --git a/src/chunk.c b/src/chunk.c
index d1e80efb..c35d2a37 100644
--- a/src/chunk.c
+++ b/src/chunk.c
@@ -107,13 +107,13 @@ static void chunk_reset_file_chunk(chunk *c) {
        }
        else if (c->file.fd != -1) {
                close(c->file.fd);
-               c->file.fd = -1;
        }
        if (MAP_FAILED != c->file.mmap.start) {
                munmap(c->file.mmap.start, c->file.mmap.length);
                c->file.mmap.start = MAP_FAILED;
                c->file.mmap.length = c->file.mmap.offset = 0;
        }
+       c->file.fd = -1;
        c->file.length = 0;
        c->type = MEM_CHUNK;
 }

Actions #13

Updated by gstrauss almost 4 years ago

To reduce log noise, this patch skips the warning trace reporting an error on the backend socket if the connection has been upgraded, e.g. to websockets.

diff --git a/src/gw_backend.c b/src/gw_backend.c
index 711908a5..f4c6845e 100644
--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -33,6 +33,7 @@
 #include "buffer.h" 
 #include "chunk.h" 
 #include "fdevent.h" 
+#include "http_header.h" 
 #include "log.h" 
 #include "sock_addr.h" 

@@ -2265,7 +2265,7 @@ static handler_t gw_recv_response_error(gw_handler_ctx * const hctx, request_st
               "socket: %s for %s?%.*s, closing connection",
               (long long)hctx->wb.bytes_out, proc->connection_name->ptr,
               r->uri.path.ptr, BUFFER_INTLEN_PTR(&r->uri.query));
-        } else {
+        } else if (!light_btst(r->resp_htags, HTTP_HEADER_UPGRADE)) {
             log_error(r->conf.errh, __FILE__, __LINE__,
               "response already sent out, but backend returned error on " 
               "socket: %s for %s?%.*s, terminating connection",

Actions #14

Updated by gstrauss almost 4 years ago

  • Status changed from New to Patch Pending

I believe that the above patches fix the issues (and the secondary effects). You should not need the patch with + if (0 == toSend) break; after applying the patch with + c->file.fd = -1;

The missing + c->file.fd = -1; initialization is important enough that I'll be rolling and releasing lighttpd 1.4.57 next week (or sooner) once you confirm that the issues have been addressed.

Thank you.

Actions #15

Updated by flynn almost 4 years ago

Thank you for the patches, I will test them on Saturday.

Actions #16

Updated by flynn almost 4 years ago

With these patches (without + if (0 == toSend) break;) I cannot reproduce the problem, even with repeated attempts and large amounts of transferred data. Memory consumption looks good, a maximum of 6MB resident memory.

Actions #17

Updated by gstrauss almost 4 years ago

That's great news. Thank you for your thoroughness.

Actions #18

Updated by flynn almost 4 years ago

I do not know whether it belongs to this issue, but some hours later I got this error on the same instance:

(gw_backend.c.2270) response already sent out, but backend returned error on socket: unix:/var/xxxxxxxxxxxxx for /xz?, terminating connection

After this message the lighttpd process sits at 100% CPU and hangs.

strace does not show any system calls ...

perf top -p ... shows these functions:

    99.73%  lighttpd       [.] http_chunk_decode_append_data
     0.19%  lighttpd       [.] http_chunk_append_mem

It seems that lighttpd is stuck in an endless loop.

After a restart of lighttpd everything is ok, no restart of the backends needed.

Actions #19

Updated by flynn almost 4 years ago

Additional observations:
  • first the error message (gw_backend.c.2270) response already sent out, ...
  • monitoring indicates that the user CPU went up to 100% immediately after the error message, but the time resolution of the monitoring is not precise enough to verify a direct correlation
  • lighttpd still serves requests, but gets slower
  • after about 10 minutes the monitoring shows that more and more requests fail, probably due to timeouts; there are no more lines from the monitoring in the access log until the restart

Actions #20

Updated by flynn almost 4 years ago

It happened again and I tried to produce a backtrace in gdb:

(gdb) bt
#0  0x000055caa0aea546 in http_chunk_decode_append_data (r=0x55caa20b3d60, mem=0x55caa241897f "\r", len=1) at http_chunk.c:350      
#1  0x000055caa0af50c3 in http_response_read (r=r@entry=0x55caa20b3d60, opts=opts@entry=0x55caa243cc08, b=b@entry=0x55caa1fc7750,   
    fdn=0x55caa20be110) at http-header-glue.c:1536
#2  0x000055caa0aed7cf in gw_recv_response (hctx=0x55caa243cb60, r=0x55caa20b3d60) at gw_backend.c:2150                             
#3  0x000055caa0aedacb in gw_handle_fdevent (ctx=0x55caa243cb60, revents=1) at gw_backend.c:2291                                    
#4  0x000055caa0af23c3 in fdevent_linux_sysepoll_poll (ev=0x55caa1e7ff60, timeout_ms=<optimized out>) at fdevent_linux_sysepoll.c:43
#5  0x000055caa0aead64 in fdevent_poll (ev=0x55caa1e7ff60, timeout_ms=<optimized out>) at fdevent.c:437                             
#6  0x000055caa0ad9d63 in server_main_loop (srv=0x55caa1dee260) at server.c:1908                                                    
#7  0x000055caa0ac8ea7 in main (argc=4, argv=0x7ffffb3e59a8) at server.c:1949  

Actions #21

Updated by gstrauss almost 4 years ago

Thanks. That is another bug. I carefully reviewed http_chunk_decode_append_data() for any others like that.

diff --git a/src/http_chunk.c b/src/http_chunk.c
index 28daaa3b..b08e0dcc 100644
--- a/src/http_chunk.c
+++ b/src/http_chunk.c
@@ -437,7 +437,13 @@ http_chunk_decode_append_data (request_st * const r, const char *mem, off_t len)
                     len -= 2;
                     te_chunked = 0;
                 }
-                else if (len == 1 && mem[0] != '\r') return -1;
+                else if (len == 1) {
+                    if (mem[0] != '\r') return -1;
+                    /*++mem;*/
+                    /*--len;*/
+                    te_chunked = 1;
+                    break;
+                }
             }
         }
         else if (1 == te_chunked) {

Actions #22

Updated by gstrauss almost 4 years ago

Fix for another edge case.

--- a/src/http_chunk.c
+++ b/src/http_chunk.c
@@ -420,6 +420,7 @@ http_chunk_decode_append_data (request_st * const r, const char *mem, off_t len)
             len -= hsz;

             te_chunked += 2; /*(for trailing "\r\n" after chunked data)*/
+            if (0 == len) break;
         }

         if (te_chunked >= 2) {

Actions #23

Updated by mgottinger almost 4 years ago

Thank you very much for implementing HTTP/2 - lighty got pretty fast.

I have had messages like:

(chunk.c.947) lseek: Illegal seek
(connections.c.466) connection closed: write failed on fd 164

(chunk.c.947) lseek: Bad file descriptor
(connections.c.466) connection closed: write failed on fd 169
(mod_openssl.c.2916) SSL: 5 -1: Bad file descriptor
(mod_openssl.c.3256) SSL (error): 5 -1: Bad file descriptor
(fdevent.c.367) fdevent event_del failed on fd 1853453417: Bad file descriptor
(fdevent.c.400) fdevent event_set failed on fd -1668802746: Bad file descriptor

These, too, seem to be gone with your patches (excluding "if (0 == toSend) break;" and the one for the log noise).
I have had most problems with transferring bigger files > 100MB.

I will post any updates on this from my end.

Actions #24

Updated by flynn almost 4 years ago

A test is running with the latest patches to src/http_chunk.c, but I cannot trigger the 100% CPU issue. I just need to wait and see whether it happens again or not ...

Actions #25

Updated by gstrauss almost 4 years ago

I have added a feature to mod_proxy (for lighttpd 1.4.57) to allow specific backends to be forced to use HTTP/1.0:

proxy.header += ( "force-http10" => "enable" )   (default is disable)

If any proxy.header values are set, this value (or the default "disable") overrides the global server.feature-flags += ("proxy.force-http10" => "...")
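
As a sketch of the intended usage (the URL prefix below is hypothetical, not from this issue): the new option can be scoped to the backend(s) that need HTTP/1.0, leaving the websocket backend on HTTP/1.1, e.g.

$HTTP["url"] =~ "^/legacy-app" {
  proxy.header += ( "force-http10" => "enable" )
}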

Actions #26

Updated by gstrauss almost 4 years ago

  • Status changed from Patch Pending to Fixed