Bug #3084 (closed)

Memory fragmentation with HTTP/2 enabled

Added by ZivHung over 3 years ago. Updated about 3 years ago.

Status:
Fixed
Priority:
Normal
Category:
core
Target version:
1.4.60
ASK QUESTIONS IN Forums:
Yes

Description

Hello,

We use lighttpd 1.4.59 to receive files on our Linux server. In our test process we repeatedly transfer a 4 KB file every 10 milliseconds; the memory of the lighttpd process increases by around 3000 KB after 20 minutes (about 7500 KB after one hour). However, when we run the same test against lighttpd 1.4.58, memory does not increase at all, and after one hour of testing the memory footprint is even smaller than it was an hour earlier.

Both experiments use the same lighttpd.conf and the same Linux server, so we believe there is a memory leak in 1.4.59.
Please check our lighttpd.conf setting in attached file.


Files

lighttpd.conf (5.13 KB) lighttpd.conf ZivHung, 2021-06-01 02:14
lighttpd.conf (5.15 KB) lighttpd.conf ZivHung, 2021-06-02 05:44
lighttpd.conf (5.05 KB) lighttpd.conf ZivHung, 2021-06-02 14:13
lighttpd-memory.png (87.3 KB) lighttpd-memory.png Lighttpd memory usage for one week flynn, 2021-06-03 13:03
lighttpd-memory.log (59.7 KB) lighttpd-memory.log flynn, 2021-07-16 10:36
Actions #1

Updated by gstrauss over 3 years ago

Thank you for providing your lighttpd.conf.

Would you please help narrow down what behavior is causing the memory increase?
What requests?
FastCGI requests?
WSTunnel requests?
CGI requests? I see CGI directives in your lighttpd.conf, but you have commented out "mod_cgi" in server.modules.

Does the issue go away if you change the streaming parameters? (one or both)

server.stream-request-body = 1
server.stream-response-body = 1

or with
server.stream-request-body = 0
server.stream-response-body = 0

Is there a reason you have disabled TLS session tickets (-SessionTicket)?
Is there a reason you are using server.network-backend = "writev" instead of the default sendfile?

BTW, if you want lighttpd to omit the Server response header, use server.tag = ""

I see that you have disabled ETag generation. If the responses should never be cached, you might also disable the stat cache and see if that makes a difference in memory use. server.stat-cache-engine = "disable"

Actions #2

Updated by gstrauss over 3 years ago

Another difference between lighttpd 1.4.58 and lighttpd 1.4.59 is that HTTP/2 is enabled by default in 1.4.59.

Please test with server.feature-flags = ( "server.h2proto" => "disable" ) to see if that makes a difference for you.

Actions #3

Updated by ZivHung over 3 years ago

Hello,

Thanks for the tip. After we tested lighttpd 1.4.59 with “server.feature-flags = ( "server.h2proto" => "disable" )” added, the memory size no longer climbed (matching our expectation).

Regarding your questions: we use FastCGI to handle CGI commands. Changing the server.stream-request-body / server.stream-response-body settings does not change the memory growth, and neither does server.stat-cache-engine = "disable"; however, we will keep that setting, since we indeed do not need to cache the responses (please check our latest lighttpd.conf).

As for your last question, “Is there a reason you have disabled TLS session tickets? (-SessionTicket)”: the documentation suggests that this setting applies to versions before lighttpd 1.4.56, so if we use a later version, is this setting still necessary?

Actions #4

Updated by gstrauss over 3 years ago

Thanks for the tip. After we tested lighttpd 1.4.59 with “server.feature-flags = ( "server.h2proto" => "disable" )” added, the memory size no longer climbed (matching our expectation).

Depending on usage, lighttpd memory use might increase slightly from its level at startup, but memory use should stabilize as the memory is reused. After two hours, has memory use grown from one hour prior (when "server.h2proto" => "enable")? Can you share some details about the client that you are using to upload the 4k file? I would like to try to reproduce the issue. Thanks.

As for your last question, “Is there a reason you have disabled TLS session tickets? (-SessionTicket)”: the documentation suggests that this setting applies to versions before lighttpd 1.4.56, so if we use a later version, is this setting still necessary?

The setting is neither necessary nor recommended with lighttpd 1.4.56 and later, which have built-in logic to rotate the session ticket encryption key; earlier versions did not. Therefore, for earlier versions, it is recommended to use -SessionTicket or to restart lighttpd daily (which results in openssl regenerating the session ticket encryption key).

BTW, to slightly reduce lighttpd memory footprint on embedded systems, you might comment out modules in server.modules that you are not using, as you have done with "mod_cgi". In your config, I do not see you using "mod_access", "mod_authn_file", "mod_auth", "mod_proxy", or "mod_redirect". Each module adds a minimum of 20k to memory footprint, and modules such as "mod_authn_file" and "mod_proxy" add more.
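(As rough arithmetic: five unused modules at a minimum of 20k each is at least ~100k of footprint saved.)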

Actions #5

Updated by ZivHung over 3 years ago

Hello,
Thank you for the advice about the lighttpd configuration settings; indeed, several modules in “server.modules” are no longer needed by our service. Please check our latest configuration file to see whether any inappropriate settings remain.

Unfortunately, we cannot directly share our test details, but we will try to simplify the conditions and provide the test flow if we find a simpler way. Thank you.

Actions #6

Updated by gstrauss over 3 years ago

If your site is producing files on disk which change frequently, and do not have unique filenames (e.g. for generated responses and using X-Sendfile), then server.stat-cache-engine = "disable" is appropriate.

If you are serving static content, such as I see in your aliases to css, images, js, and videos, then you will benefit from server.stat-cache-engine = "simple" (the default if omitted)

In short, if your site requires that the stat cache be disabled for proper behavior, then you should disable the stat cache. Otherwise, the default is to enable the stat cache because it provides faster responses to static resources.

I had suggested disabling the stat cache as a troubleshooting step to help isolate which lighttpd subsystem might be affecting the memory usage you are seeing. The stat cache is enabled by default in lighttpd because it is so useful and generally safe.

Actions #7

Updated by flynn over 3 years ago

I have a similar issue with memory, but I could not make it reproducible enough to file a ticket.

But I have a memory monitoring for lighttpd and I made the following observations:

- the memory is freed; lighttpd returns very close to the memory used after start
- but: it may take days (!) to free the memory; see the attached graph (lighttpd was started Friday at 14:37; look only at the last 6 days)

Maybe freeing the memory is not triggered often enough.

Actions #8

Updated by gstrauss over 3 years ago

All or part of this might be due to memory fragmentation. lighttpd allocates memory for its chunkqueue in chunks, and those chunks are pushed onto a queue to be reused. Every 64 seconds, lighttpd frees the queue of unused chunks. However, the order in which they are freed may differ from the order in which they were allocated, and some long-running connections may still be using later-allocated chunks. After (some) chunks are freed, it is still up to the memory allocator (libc malloc/calloc/realloc unless another is preloaded) to decide when to release the memory back to the system.
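As a minimal sketch of that reuse pattern (hypothetical names and simplified logic, not the actual chunk.c):

#include <stdlib.h>

typedef struct chunk { struct chunk *next; char *mem; } chunk;
static chunk *unused_chunks; /* queue of released chunks kept for reuse */

static void example_chunk_release (chunk *c) {
    c->next = unused_chunks;   /* push for reuse instead of calling free() */
    unused_chunks = c;
}

static void example_chunk_pool_clear (void) { /* run every 64 secs */
    for (chunk *c; (c = unused_chunks); ) {
        unused_chunks = c->next;
        free(c->mem);
        free(c);  /* the allocator decides if/when pages return to the OS */
    }
}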

It is a good sign to me that the memory (at least in flynn's case) is eventually returned to the system, as that suggests the memory is not leaked. Also, when I tested some scenarios with valgrind some months back, I did not find any leaks.

Actions #9

Updated by gstrauss over 3 years ago

  • Category set to core

BTW, the chunk size defaults to 8k and is configurable with server.chunkqueue-chunk-sz = 8192. Minimum size is 1024, and server.chunkqueue-chunk-sz is rounded up to the closest power of 2.
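As a hypothetical illustration (not the lighttpd source), the rounding amounts to:

static unsigned int example_round_chunk_sz (unsigned int sz) {
    unsigned int p = 1024;        /* enforce the 1024 minimum */
    while (p < sz) p <<= 1;       /* round up to a power of 2 */
    return p;
}
/* e.g. example_round_chunk_sz(8192) == 8192; example_round_chunk_sz(5000) == 8192 */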

Actions #10

Updated by flynn over 3 years ago

I started a test with libjemalloc as allocator and some days later I'll test different values of server.chunkqueue-chunk-sz ...

Actions #11

Updated by gstrauss over 3 years ago

Another thought occurred to me. In lighttpd 1.4.59, HTTP/2 is enabled by default in lighttpd.

Starting in lighttpd 1.4.56, lighttpd keeps a pool of struct request_st (and associated buffers), which are used by HTTP/2 connections, one for each stream. The size of this pool can grow to server.max-connections. Each HTTP/2 connection can have up to 8 simultaneous streams (a hard-coded limit in lighttpd), and request_st allocations beyond the pool size are freed immediately upon release.

Depending on the access patterns that your server sees, and on whether the streams are handled quickly or have a long lifetime, you might want to reduce server.max-connections, but not so much that you unnecessarily delay lighttpd from answering bursts of connections. Of course, this tuning is necessary only on small, memory restricted servers, which are likely resource-constrained in other ways, too. On larger servers, 30M lighttpd memory use for a very busy server is unlikely to be an issue.

If it turns out that the request_st allocations -- which occur at runtime during bursty activity and then are saved to the request pool -- are causing excessive fragmentation and preventing release of other memory, then I can consider pre-allocating the request pool at startup, up to server.max-connections. Doing so would increase lighttpd's initial memory footprint for the base case (everyone), so I am leaning against this unless the data proves that it would make a drastic difference. That said, the initial graph provided by flynn suggests that this is not the case, since the memory is eventually released (which does not currently happen with the request pool until server shutdown or restart). Another option would be to periodically clear the request pool, as is currently done every 64 seconds with the chunk queue. Since ZivHung reported this increased memory use with HTTP/2 enabled in lighttpd, I might consider tweaking the lighttpd reqpool behavior if the data supports doing so.

BTW, if not specified, server.max-connections is 1/3 server.max-fds. If not specified, server.max-fds is currently 4096, making the default server.max-connections 1365.

I am pondering freeing up to 16 request_st from the pool every 64 seconds. After a busy burst of activity that filled the pool, freeing 16 request_st every 64 seconds would clear the pool in ~90 mins (with the default server.max-connections).
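(Rough arithmetic with the default server.max-connections of 1365: 1365 / 16 ≈ 86 passes, and 86 × 64 s ≈ 5,500 s, i.e. roughly 90 minutes.)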

Actions #12

Updated by flynn over 3 years ago

Using jemalloc as the allocator definitely improves the memory usage:

  • base memory usage (4-5MB) is reached more often
  • peaks in memory usage (more than 10MB) are lower and shorter

I consider this memory usage (with jemalloc) to be normal.

So the interaction between lighttpd memory buffers and glibc malloc can be improved, maybe by reducing the number of memory arenas to 1. I'll test this later.

I had a similar issue with java and glibc malloc, where reducing the glibc memory arenas improved the situation, but jemalloc was the best solution to achieve the lowest memory usage in production.

Actions #13

Updated by gstrauss over 3 years ago

@flynn: thank you for the update

ZivHung: try export MALLOC_ARENA_MAX=2 in the same shell in which you start the lighttpd daemon (and do not disable "server.h2proto")

lighttpd is single-threaded and (in many common use cases) does not heavily use the memory allocator at runtime, so while MALLOC_ARENA_MAX=1 is an option, too, MALLOC_ARENA_MAX=2 is probably a good default as a tradeoff, just in case some custom modules are threaded. If MALLOC_ARENA_MAX is not set in the environment, then I might consider using mallopt to set M_ARENA_MAX to 1 during startup and config processing, which is single threaded and does heavily use the memory allocator, and then setting M_ARENA_MAX to 2 after setup.

[Edit:]
I would prefer to find a reasonable value of MALLOC_ARENA_MAX instead of periodically running the glibc-specific malloc_trim.
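As a minimal sketch of the mallopt idea above (illustrative only, not committed lighttpd code):

#include <malloc.h>

int main (void) {
  #if defined(M_ARENA_MAX)
    mallopt(M_ARENA_MAX, 1); /* single-threaded startup and config parsing */
  #endif
    /* ... parse config, initialize modules ... */
  #if defined(M_ARENA_MAX)
    mallopt(M_ARENA_MAX, 2); /* steady state; headroom for threaded modules */
  #endif
    return 0;
}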

Actions #14

Updated by gstrauss over 3 years ago

  • Subject changed from Memory leak problem in Lighttpd 1.4.59 to Memory fragmentation with HTTP/2 enabled
Actions #15

Updated by flynn over 3 years ago

In my tests over several days I cannot see/measure any significant change with MALLOC_ARENA_MAX=1.

Only with jemalloc I get significant changes in memory usage.

gstrauss: shall I test different values of server.chunkqueue-chunk-sz, e.g. 4096 or 16384?

Actions #16

Updated by gstrauss over 3 years ago

Hmmmm. I was really hoping that MALLOC_ARENA_MAX=1 (or MALLOC_ARENA_MAX=2) would be a good change for the single-threaded lighttpd server, and I might still consider it if there is no indication that doing so causes harm.

The server.chunkqueue-chunk-sz tunable was originally intended for use on severely memory-constrained systems, where a chunk size of 1k might more fully utilize the allocated memory, at the slight cost of increasing the number of memory chunks used.

My guess is that server.chunkqueue-chunk-sz will not make any real difference with regard to memory fragmentation. On the other hand, if the chunk size exceeds the threshold for mmap chunks, that could change the way chunks are allocated and released back to the system. At the other end of the spectrum, a very small chunk size of 1k might be allocated from a different pool in the arena. I have a hunch that jemalloc has a better algorithm for segmenting the allocation of blocks of the different orders of magnitude of size used by lighttpd.

While playing with chunkqueue-chunk-sz might yield a difference, I think the best next step would be to try to identify the source(s) of memory fragmentation. Is it the cache of chunkqueue chunks? Is it the stat_cache? Is it the request_st pool (used by lighttpd HTTP/2 support)? Is it something else, or a combination thereof? Since this issue seems to appear with lighttpd HTTP/2 support enabled, it is probably not the stat_cache. In chunk.c, there are chunk_release() and chunk_buffer_release(), which could be modified to disable the caching and always take the branch that frees the object. Similarly, in reqpool.c, request_release() could be modified to always free the object.

If you have a decent idea on the number of active connections at any one time (including occasional peaks), and have a good idea how many are brief and how many are long-running (and for how long), then you might lower server.max-connections to see if doing so affects the memory fragmentation over time.

I am also open to other ideas if you have any suggestions.

Actions #17

Updated by gstrauss over 3 years ago

To collect a bit of data, here is a small patch to lighttpd-1.4.59 to write out malloc statistics.

--- a/src/server.c
+++ b/src/server.c
@@ -68,6 +68,8 @@ static const buffer default_server_tag = { CONST_STR_LEN(PACKAGE_DESC)+1, 0 };
 # include <sys/prctl.h>
 #endif

+#include <malloc.h>
+
 #include "sys-crypto.h" 
 #if defined(USE_OPENSSL_CRYPTO) \
  || defined(USE_MBEDTLS_CRYPTO) \
@@ -1826,6 +1828,10 @@ static void server_handle_sigalrm (server * const srv, time_t min_ts, time_t las
                                if (0 == (min_ts & 0x3f)) { /*(once every 64 secs)*/
                                        /* free excess chunkqueue buffers every 64 secs */
                                        chunkqueue_chunk_pool_clear();
+
+                                       if (0 == (mono_ts & 0xff))
+                                               malloc_stats();
+
                                        /* attempt to restart dead piped loggers every 64 secs */
                                        if (0 == srv->srvconf.max_worker)
                                                fdevent_restart_logger_pipes(min_ts);

The patch will write something like the following out to stderr every 256 seconds (4 min 16 seconds)

Arena 0:
system bytes     =     946176
in use bytes     =     419920
Total (incl. mmap):
system bytes     =    1282048
in use bytes     =     755792
max mmap regions =          2
max mmap bytes   =     335872

ZivHung or flynn, would you mind running with this patch for an hour (during which you see the memory use increase), and share the results? Since MALLOC_ARENA_MAX=1 did not have a noticeable effect for flynn, then it is likely that we will see a larger number of mmap regions.

If a small number of bytes from each mmap region remain in use for long periods of time, then the glibc allocator will not be able to return those regions to the OS until a mmap region is completely unused. As the memory usage is smoother with jemalloc, one possible mitigation might be for me to allocate multiple pools of chunks and to prefer reusing chunks from the original pool, releasing excess chunks from later overflow pools back to the OS. The libc memory allocator would still be leveraged, not replaced. Then again, we have not yet established that the chunks are the source of the issue, though their allocations may contribute.

As an aside, I have already committed code earlier this year to lighttpd git master which reduces memory fragmentation by doubling buffer sizes when reallocating, rather than the previous historical method in lighttpd of adding needed space and rounding up to nearest 64 byte boundary. This change in behavior should reduce the number of memory reallocations as well as play more nicely with memory allocator buckets. (Note: lighttpd git master should be functional for testing in a staging environment, but I strongly recommend against pushing lighttpd git master straight into production without more thorough testing. Substantial code changes have been made since lighttpd 1.4.59 was released.)
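To make the two growth policies concrete, here is a hedged illustration (hypothetical helpers, not the lighttpd source):

#include <stddef.h>

/* historical policy: needed size rounded up to the next 64-byte boundary */
static size_t grow_round64 (size_t needed) {
    return (needed + 63) & ~(size_t)63;
}

/* new policy: double the capacity until it fits; fewer realloc calls and
 * sizes that map more cleanly onto allocator size classes */
static size_t grow_double (size_t cur, size_t needed) {
    size_t sz = cur ? cur : 64;
    while (sz < needed) sz <<= 1;
    return sz;
}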

Actions #18

Updated by flynn over 3 years ago

I tested different values of chunkqueue-chunk-sz and cannot measure a significant effect on memory usage.

I'll try your memory patches later ...

Actions #19

Updated by flynn over 3 years ago

The patch does not compile:

server.c: In function ‘server_handle_sigalrm’:
server.c:1832:51: error: ‘mono_ts’ undeclared (first use in this function); did you mean ‘min_ts’?
                                         if (0 == (mono_ts & 0xff))
                                                   ^~~~~~~
                                                   min_ts

There is no mono_ts variable in lighttpd 1.4.59.

Actions #20

Updated by gstrauss over 3 years ago

Yes, that should be min_ts in lighttpd-1.4.59.

(My development branch has changes which move to using a monotonic clock for that bit of code, which is why on my dev branch the variable is mono_ts)

Actions #21

Updated by flynn over 3 years ago

I activated the patch on one server and see the messages every 256 seconds, but at the moment the memory is at a low level. I'll send a log when the memory is at a higher level.

Actions #22

Updated by flynn over 3 years ago

Because I need uploads with chunked encoding, I switched my servers back to server.stream-request-body = 0. From my current workloads I can see:
- the value of server.stream-response-body (0/1/2) has no significant impact on memory usage
- server.stream-request-body = 0 has no memory problems; memory allocations are freed in minutes and memory returns to low values
- server.stream-request-body = 2 has memory problems; at least, it takes very long (days) to return the memory to the OS
- server.stream-request-body = 1 may have the same problems

I will try to do some memory debugging with the patch above on a separate server, but this will take some time.

Actions #23

Updated by gstrauss over 3 years ago

Hmmm. Thanks. Useful info -- not what I expected. Your observations suggest that there may be an issue (and maybe not the only issue) with chunks and chunk buffer usage.

To make sure I understand what you posted, are you saying that you also see the memory issues with HTTP/1.1 chunked encoding, in addition to the HTTP/2 discussion in the original bug report?

Actions #24

Updated by flynn over 3 years ago

I mentioned the chunked-encoding "thing" only to make clear that I cannot run the memory debug patch on a big server, where the memory grows faster.
I did not want to cite it as a possible cause for this ticket.

Most of the requests are HTTP/2, and chunked requests are rare (but users complain if they do not work).
Most chunked-encoding uploads are triggered by git/LFS or WebDAV, usually over HTTP/2.

Actions #25

Updated by flynn over 3 years ago

Here is a log with a big allocation today with server.stream-response-body = 2 and server.stream-request-body = 0.

Actions #26

Updated by gstrauss over 3 years ago

Thanks.

Jul 16 10:36:16 lighttpd[2322689]: Arena 0:
Jul 16 10:36:16 lighttpd[2322689]: system bytes     =   12554240
Jul 16 10:36:16 lighttpd[2322689]: in use bytes     =    5148400
Jul 16 10:36:16 lighttpd[2322689]: Total (incl. mmap):
Jul 16 10:36:16 lighttpd[2322689]: system bytes     =   12554240
Jul 16 10:36:16 lighttpd[2322689]: in use bytes     =    5148400
Jul 16 10:36:16 lighttpd[2322689]: max mmap regions =          4
Jul 16 10:36:16 lighttpd[2322689]: max mmap bytes   =   12804096
Jul 16 10:40:32 lighttpd[2322689]: Arena 0:
Jul 16 10:40:32 lighttpd[2322689]: system bytes     =  114401280
Jul 16 10:40:32 lighttpd[2322689]: in use bytes     =    5295216
Jul 16 10:40:32 lighttpd[2322689]: Total (incl. mmap):
Jul 16 10:40:32 lighttpd[2322689]: system bytes     =  114401280
Jul 16 10:40:32 lighttpd[2322689]: in use bytes     =    5295216
Jul 16 10:40:32 lighttpd[2322689]: max mmap regions =          4
Jul 16 10:40:32 lighttpd[2322689]: max mmap bytes   =   12804096

A single arena.
A stable use of 4 for max mmap regions.
A large jump from ~12 MB to ~114 MB in a 4 min period is a large jump in sbrk, likely across multiple sbrk calls.
This debunks my theory about many mmap regions.

I know a 4 min period might include many requests on a large server, but is there any noticeable change in the type of requests that come in during that period that could help me reproduce the issue? It is likely to be some combination of requests and load on the server. While some memory is released back to the system a few mins later, > ~90 MB remains with the process, even though most of it is not reported as in use. Is there a spike in the number of simultaneous connections in that time window?

I wonder if a periodic call to malloc_trim(131072) would make a difference.

+                                       if (0 == (min_ts & 0xff)) {
+                                               malloc_stats();
+                                               malloc_trim(131072);
+                                               malloc_stats();
+                                       }

If we can approximate reproducing this in a non-production environment, I would like to see if lighttpd git master behaves better (or worse), as changes on lighttpd git master should reduce memory fragmentation by doubling buffer sizes when reallocating, rather than the previous historical method in lighttpd of adding needed space and rounding up to nearest 64 byte boundary.

Actions #27

Updated by flynn over 3 years ago

I applied the modified patch above, test is running.
I can already see a change: the memory curve goes down significantly every 256 sec, but the average goes up slightly, about 1 MB per hour. This may stop/saturate later ...

I have already tried several times to identify the request(s) that make the memory grow, but I could not identify them. There are about 2600 requests in 4 min:
  • GET: 1105 (920 HTTP/1.1, 185 HTTP/2.0)
  • POST: 1466 (1447 HTTP/1.1, 19 HTTP/2.0)
  • PRI: 55
  • PATCH: 2, both HTTP/1.1

Most requests are delegated with mod_proxy to an HTTP/1.1 backend over a unix domain socket.

Should I disable HTTP/2.0 for a test?

Actions #28

Updated by flynn over 3 years ago

As indicated, the average memory consumption has now stabilized around 6 MB.

We should wait at least one day, better 2-3 days, to see whether the big memory allocations still happen without being released again shortly afterwards.

Actions #29

Updated by gstrauss over 3 years ago

If malloc_trim() makes such a huge difference, but memory still grows slightly and stabilizes at 6 MB, then lighttpd's memory use with HTTP/2 enabled does not appear to be trapping all the excess memory allocated by sbrk.
A question for me is why glibc is holding on to the excess memory for so long. (Just to be sure: are you using glibc or musl or something else for libc?)

Should I disable HTTP/2.0 for a test?

I am not sure it would be useful. lighttpd memory usage behavior does change some with HTTP/2, and this usage change likely contributes to how glibc malloc is handling memory allocations. However, you found that using jemalloc gives better behavior and using malloc_trim gives better behavior, so I do not think lighttpd is doing anything "wrong". Still, something lighttpd is doing is resulting in suboptimal behavior from the glibc allocator.

The next version of lighttpd changes lighttpd memory allocation algorithms, so I'd love if you could test with lighttpd git master to see if the problem gets better (or worse). (I'd like to release the next version of lighttpd before end of Q3 and am currently working on timeouts to backends for #3086 )

With or without testing lighttpd git master, you can use environment variables to tune the glibc allocator. man mallopt By default, the glibc allocator uses a dynamic mmap threshold approach to setting some limits, and if you use mallopt or environment variables to set any of M_TRIM_THRESHOLD, M_TOP_PAD, M_MMAP_THRESHOLD, or M_MMAP_MAX (or env var equivalents), then the dynamic mmap threshold approach is disabled.

I wonder how the memory usage would be affected by disabling the dynamic mmap threshold approach and setting environment variables: MALLOC_TRIM_THRESHOLD_=524288 (512KB) and MALLOC_MMAP_THRESHOLD_=327680 (320KB) in the shell that starts lighttpd.
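For reference, those environment variables correspond roughly to the following mallopt() calls (per mallopt(3); setting either also disables the dynamic mmap threshold heuristic):

#include <malloc.h>

int main (void) {
    mallopt(M_TRIM_THRESHOLD, 524288);  /* MALLOC_TRIM_THRESHOLD_=524288 (512KB) */
    mallopt(M_MMAP_THRESHOLD, 327680);  /* MALLOC_MMAP_THRESHOLD_=327680 (320KB) */
    return 0;
}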

Actions #30

Updated by flynn over 3 years ago

These values are from glibc version 2.28 malloc.

I configured another server to use the malloc trim environment variables.

A freshly compiled lighttpd from your personal git repository results in 400 on php/Davical requests, so I cannot use it (other PHP applications work). At the moment I do not have time to debug it; the PHP log reports a weird content-type problem on PROPFIND requests:

Application sent content-type of "text/xml" instead of "text/xml" 
Actions #31

Updated by gstrauss over 3 years ago

Application sent content-type of "text/xml" instead of "text/xml"

That's not a very enlightening error message, eh? :)

I tested lighttpd mod_webdav and it passes litmus tests as expected.
Is php/Davical running as CGI or as FastCGI? (I would presume FastCGI via FPM)

Actions #32

Updated by flynn over 3 years ago

php/Davical is running as FastCGI via FPM.

I ran strace on both versions and found a 4-byte difference in the fastcgi buffer:
  • REQUEST_METHODPROPFIND (version 1.4.59)
  • REQUEST_METHODPOST (current git version), this results in the error message above and a return value 400

Davical does not expect a POST request; the error message is misleading.

I hope this helps ...

Actions #33

Updated by gstrauss over 3 years ago

I temporarily added a quick test to tests/mod-fastcgi.t to send a PROPFIND request and check what the (test) fastcgi server received and that appears fine, as would be expected since other PHP scripts work. I'll have to see if something else is getting overwritten/truncated into the FastCGI environment buffer.

--- a/tests/mod-fastcgi.t
+++ b/tests/mod-fastcgi.t
@@ -7,7 +7,7 @@ BEGIN {
 }

 use strict;
-use Test::More tests => 24;
+use Test::More tests => 25;
 use LightyTest;

 my $tf = LightyTest->new();
@@ -25,6 +25,13 @@ SKIP: {
        $tf->{CONFIGFILE} = 'fastcgi-responder.conf';
        ok($tf->start_proc == 0, "Starting lighttpd with $tf->{CONFIGFILE}") or die();

+       $t->{REQUEST}  = ( <<EOF
+PROPFIND /prefix.fcgi/foo/bar?env=REQUEST_METHOD HTTP/1.0
+EOF
+ );
+       $t->{RESPONSE} = [ { 'HTTP-Protocol' => 'HTTP/1.0', 'HTTP-Status' => 200, 'HTTP-Content' => 'PROPFIND' } ];
+       ok($tf->handle_http($t) == 0, 'REQUEST_METHOD');
+
        $t->{REQUEST}  = ( <<EOF
 GET /prefix.fcgi-nonexistent HTTP/1.0
 Host: www.example.org

Actions #34

Updated by gstrauss over 3 years ago

I haven't been able to find any reason why the code would behave that way. The code is shared by the base lighttpd and many modules; it's not a unique code path.

Is it possible that the lighttpd.conf is being parsed differently and that a POST request is getting misdirected to php/Davical?

Actions #35

Updated by gstrauss over 3 years ago

HTTP/2 has special short-codes for GET and POST methods. In my development branch, I have started using LSHPACK_HDR_METHOD_POST from the ls-hpack library which lighttpd uses for HPACK. Does php/Davical work if you disable HTTP/2 with my dev branch?

Actions #36

Updated by gstrauss over 3 years ago

Looks like my trying to use the HTTP/2 short-codes for GET and POST is causing the issue for Davical. At first glance, this appears to be an issue in ls-hpack marking LSHPACK_HDR_METHOD_GET for every method that is not POST, and correctly marking LSHPACK_HDR_METHOD_POST for POST.

curl --http2-prior-knowledge -X PROPFIND "http://127.0.0.1/..." got marked with LSHPACK_HDR_METHOD_GET (not LSHPACK_HDR_UNKNOWN and not LSHPACK_HDR_METHOD_POST), and -X POST got marked as LSHPACK_HDR_METHOD_POST, which is not exactly what you saw, but close.

In any case, a workaround is:

--- a/src/request.c
+++ b/src/request.c
@@ -610,6 +610,7 @@ http_request_parse_header (request_st * const restrict r, http_header_parse_ctx
               case HTTP_HEADER_H2_METHOD_GET:
                 if (__builtin_expect( (HTTP_METHOD_UNSET != r->http_method),0))
                     break; /*(error out in switch() below)*/
+                break; /*(workaround)*/
                 r->http_method = HTTP_METHOD_GET;
                 return 0;
               case HTTP_HEADER_H2_METHOD_POST:

Actions #37

Updated by gstrauss over 3 years ago

It turns out to be an implementation detail in ls-hpack with an overloaded static_table: for HPACK decoding, the hpack_index was only valid for keys, not values, although both keys and values are used for HPACK encoding.
I force-pushed a fix to my dev branch.

[Edit] correction: so the hpack_index was typically only valid for keys, unless the lsxpack_flag LSXPACK_VAL_MATCHED was also set

Actions #38

Updated by flynn over 3 years ago

Just for confirmation:

  • the PROPFIND problem happens only in the browser, I cannot reproduce it with curl on the cmdline
  • if I disable HTTP/2 in the browser, it works with HTTP/1.1!
Actions #39

Updated by flynn over 3 years ago

Intermediate result:
  • I tried different values of the environment variables MALLOC_TRIM_THRESHOLD_ and MALLOC_MMAP_THRESHOLD_, with no significant change in the memory usage; it is still (too) high.
  • the explicit call of malloc_trim results in the expected memory usage for a server: a base line of 6-8MB with some short-time peaks

Maybe lighttpd triggers too less malloc_trim calls in glibc malloc ...

Actions #40

Updated by gstrauss over 3 years ago

Thanks for the update.

I'm working on reducing, where possible, long-lived memory allocations that occur after lighttpd startup. This may reduce lighttpd memory usage a bit below your current baseline, so the changes are beneficial in that (resource) direction. There is a small performance cost to changing the algorithms, resulting in some more random writes and possible memory cache misses. I hope to have the code ready for testing (on my dev branch) by the beginning of next week.

Maybe lighttpd triggers too less malloc_trim calls in glibc malloc ...

FYI: I think you meant "Maybe lighttpd triggers too few calls to malloc_trim()..." ("less" versus "fewer"/"few")

Actions #41

Updated by flynn over 3 years ago

Yes, I meant too few calls...

I will just wait until I can test a git tree with:
  • PROPFIND fixed for HTTP/2 calls
  • reworked memory allocations
Actions #42

Updated by gstrauss over 3 years ago

I have pushed changes to my dev branch which clear and free the connection and request pools every 64 seconds, along with other maintenance that occurs at that interval.

Periodically clearing the pools not only frees up memory used by the connection and request structures, but also for the buffers and arrays contained within those structures. Setting a bound on how long lighttpd holds onto this memory (when that memory is not actively being used or reused) should help lighttpd reduce its internal memory use after bursts of activity. It remains to be seen whether or not the glibc memory allocator behaves better on your systems with this approach, though I believe this should help.

Actions #43

Updated by gstrauss over 3 years ago

FYI: I just pushed a small update to fix a busy-wait that I introduced when changing the joblist to a linked list, so please re-pull if you pulled my dev branch in the last 24 hours.

Actions #44

Updated by flynn over 3 years ago

With the current git version I get a core dump when testing the configuration files (/usr/sbin/lighttpd -tt -f /etc/lighttpd/lighttpd.conf):

Program received signal SIGSEGV, Segmentation fault.
0x000055555558a8e8 in array_get_buf_ptr (a=a@entry=0x0, k=k@entry=0x55555559852e "server.modules", klen=klen@entry=14) at array.c:418
418        int32_t ipos = array_get_index(a, k, klen);
(gdb) bt
#0  0x000055555558a8e8 in array_get_buf_ptr (a=a@entry=0x0, k=k@entry=0x55555559852e "server.modules", klen=klen@entry=14) at array.c:418
#1  0x0000555555572859 in array_set_key_value (vlen=0, v=0x5555555929ad "", klen=14, k=0x55555559852e "server.modules", a=0x0) at array.h:123
#2  config_plugin_values_init (srv=srv@entry=0x5555555ad520, p_d=p_d@entry=0x7fffffffe7a0, cpk=cpk@entry=0x5555555a9fc0 <cpk>, 
    mname=mname@entry=0x555555597ed9 "global") at configfile-glue.c:296
#3  0x0000555555569a96 in config_insert_srvconf (srv=0x5555555ad520) at configfile.c:738
#4  config_read (srv=srv@entry=0x5555555ad520, fn=<optimized out>) at configfile.c:2546
#5  0x000055555556144c in server_main_setup (srv=srv@entry=0x5555555ad520, argc=argc@entry=4, argv=argv@entry=0x7fffffffeb38)
    at server.c:1120
#6  0x0000555555562f64 in main (argc=4, argv=0x7fffffffeb38) at server.c:2029
Actions #45

Updated by gstrauss over 3 years ago

It is strange that srv->srvconf.config_touched would be NULL. That is initialized in config_init().
Would you try to reproduce the issue with a smaller lighttpd.conf (hopefully something that can be shared)?

Actions #46

Updated by gstrauss over 3 years ago

I should mention that for some reason with this batch of commits on my dev branch, I got some linker errors for unrecognized linker flags and needed to clean the build area for a proper build: make clean; ccache -C; ./autogen.sh; ./configure ...; make
You might try a clean build if you have been pulling from my dev branch.

Actions #47

Updated by gstrauss over 3 years ago

I pushed another set of changes to my dev branch to use linked lists in mod_cgi and mod_uploadprogress. With that, I believe that I have replaced all the places in the most popular lighttpd modules where memory is allocated long-term after startup. There are, of course, exceptions, such as mod_magnet where lua scripts are reloaded.

If the changes on my dev branch are not sufficient for the glibc memory allocator to return memory to the OS, I will consider adding malloc_trim() to periodic maintenance tasks. Any suggestions for frequency? Every 64 seconds? Other desired frequency?

BTW, I still have no idea why flynn would see a NULL ptr crash at startup, on the first directive processed in the global config ("server.modules"), so I hope that a clean build will resolve that mystery. In my own builds, I had to remove autotools config.cache and re-run ./configure ... in my dev branch to resolve conflict caused by caching.

Actions #48

Updated by flynn over 3 years ago

Sorry for the delay...

With a clean build of the current git version I no longer get a segfault when testing the configuration.

But I cannot test until the PROPFIND issue is solved with HTTP/2 ... (the code around the suggested workaround above has changed too much since then).

Actions #49

Updated by gstrauss over 3 years ago

But I cannot test until the PROPFIND issue is solved with HTTP/2 ... (the code around the suggested workaround above has changed too much since then).

The PROPFIND issue was only on my dev branch and has been fixed on my dev branch. (#3084#note-37)

Actions #50

Updated by flynn over 3 years ago

I still get 400 on PROPFIND requests with the git version; I tested this morning and will try to debug later ...

Actions #51

Updated by gstrauss over 3 years ago

FYI: I have tested with a simple CGI script which prints the environment, and forced HTTP/2 and request method PROPFIND:
curl --http2-prior-knowledge -X PROPFIND -v http://127.0.0.1:8080/env.pl

Actions #52

Updated by flynn over 3 years ago

Given my understanding of the problem, your test will succeed because you do not check whether the PROPFIND method reaches the CGI script.

From my debugging/strace last time, the following happens: the PROPFIND method is requested but converted by lighttpd into a POST request for the backend, which is not expected. The backend sends a 400 ...

Did you verify that the CGI script gets a PROPFIND too?

Actions #53

Updated by gstrauss over 3 years ago

Yes, the CGI script gets the PROPFIND

#!/usr/bin/perl
print "Status: 200\n\n$ENV{REQUEST_METHOD}\n";

Actions #54

Updated by gstrauss over 3 years ago

Did you install the clean build of lighttpd and restart lighttpd? Or is lighttpd -tt failing and lighttpd not actually being restarted?

Actions #55

Updated by flynn over 3 years ago

In tests like this I always run lighttpd in the foreground (option -D) to get all log messages immediately.

I can confirm that PROPFIND reaches the CGI script with your example above, but it does not work for FastCGI.

Can you please test FastCGI too? In the strace output I still see POST requests instead of PROPFIND.

Actions #56

Updated by flynn over 3 years ago

I tested more backend protocols:
  • scgi is ok; PROPFIND is reported as REQUEST_METHOD.
  • reverse proxy is ok too

From my tests, only fastcgi still does NOT pass the correct request method through to the backend.

Actions #57

Updated by gstrauss over 3 years ago

Thanks for continuing to look and narrowing things down. The backends all share a substantial amount of code with slightly different protocol filters, but the protocol differences are not specific to HTTP request method. Let me dig in.

Actions #58

Updated by gstrauss over 3 years ago

I am not seeing an issue with HTTP/2 and PHP served via FastCGI.

foo.php

<?php
echo getenv("SERVER_PROTOCOL"),"\n";
echo getenv("REQUEST_METHOD"),"\n";
?>

$ curl --http2-prior-knowledge -X PROPFIND http://127.0.0.1:8080/foo.php
HTTP/2.0
PROPFIND

server.modules += ("mod_fastcgi")
fastcgi.server += (
        ".php" => (
                (
                        "host" => "127.0.0.1",
                        "port" => 10000,
                        "bin-path" => "/usr/bin/php-cgi",
                        "max-procs" => 1,
                ),
        ),
)

Actions #59

Updated by gstrauss over 3 years ago

By any chance are you running multiple lighttpd instances, maybe proxying one to another, and maybe one instance is running an old build of my dev branch, one which contains the bug with HTTP/2 request method?

While I did find (and fix) a bug in the lighttpd attempt to optimize use of ls-hpack, and that bug was related to marking all non-POST request methods as GET with HTTP/2, I haven't been able to reproduce the specific issue you are seeing where HTTP/2 requests with PROPFIND are mismarked as POST.

Actions #60

Updated by gstrauss over 3 years ago

Let's try a different approach. Maybe the bug is elsewhere and PROPFIND/POST is how you are observing it.
In your reproduction, are you able to see the error sending a single request? Or are you sending multiple requests? What is generating those requests? Are some of those requests POST requests?

Actions #61

Updated by flynn over 3 years ago

At the moment I am trying to reduce/minimize my setup ...

The PROPFIND request is produced by InfCloud, a pure-JavaScript calendar/contacts web application. PROPFIND is the first request to the php-Davical backend.

Actions #62

Updated by flynn over 3 years ago

Partial success: with a minimized setup I see PROPFIND as the request method with your PHP example above.

I have to search for the configuration or modules that are causing my problem.

Actions #63

Updated by gstrauss over 3 years ago

Most lighttpd backends use src/http_cgi.c:http_cgi_headers(), including mod_cgi, mod_fastcgi, and mod_scgi, so if one of them gets REQUEST_METHOD correct, they all should. mod_proxy constructs the request-line differently, but everything uses r->http_method which is assigned in very few places. I have reviewed those places and none are specific to mod_fastcgi.

On the off-chance that there is something in the request that causes corruption, are you able to share the request headers of that PROPFIND request from infcloud to php-Davical? If you do not want to share the PROPFIND request body, would you please let me know the size? With that information, I might be able to reproduce the issue.

Actions #64

Updated by flynn over 3 years ago

Request body is:

<D:propfind xmlns:D="DAV:"><D:prop><D:current-user-principal/></D:prop></D:propfind>

Header as curl options (I removed/blurred Authorization, Origin, Cookie and Referrer):
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0'
-H 'Accept: application/xml, text/xml, */*; q=0.01'
-H 'Accept-Language: de,en-US;q=0.7,en;q=0.3' --compressed
-H 'Content-Type: text/xml; charset=utf-8'
-H 'Authorization: Basic xxxx'
-H 'X-client: InfCloud 0.13.2rc1 (Inf-IT CalDAV/CardDAV Web Client)'
-H 'Depth: 0'
-H 'X-Requested-With: XMLHttpRequest'
-H 'Content-Length: 122'
-H 'Connection: keep-alive'
-H 'Sec-Fetch-Dest: empty'
-H 'Sec-Fetch-Mode: cors'
-H 'Sec-Fetch-Site: same-origin'
-H 'TE: trailers'

Actions #65

Updated by gstrauss over 3 years ago

I tried your request, but did not reproduce the issue. Things worked fine for me after changing Content-Length: 122 to Content-Length: 84 to match curl -X PROPFIND --data-raw '<D:propfind xmlns:D="DAV:"><D:prop><D:current-user-principal/></D:prop></D:propfind>' ...

Actions #66

Updated by flynn over 3 years ago

More progress:
- curl on the cmdline works if I remove the Content-Length header
- Firefox seems to cause the problem; as you already found, the Content-Length is calculated incorrectly

Actions #67

Updated by gstrauss over 3 years ago

If it helps, take those 84 chars. Add \r\n after each >, including the final line. Change < to &lt; and change > to &gt;. That becomes 122 chars.

While I do not know exactly what is happening on the client, the difference in lengths suggests somebody, somewhere is using a different encoding, resulting in a different length. Is something mis-applying application/x-www-form-urlencoded? The Content-Type is specified as Content-Type: text/xml; charset=utf-8, but is something else xml-encoding the payload, including encoding the tags, as if the payload were untrusted user data intended to go between tags, rather than an XML construction including the tags?

Actions #68

Updated by flynn over 3 years ago

It works with Chromium; it seems to be a pure Firefox problem, maybe a detail in the HTTP/2 headers, flags, or handling.

Any hints how to debug this?

Actions #69

Updated by flynn over 3 years ago

Using debug.log-request-header = "enable", the biggest difference between Firefox and Chromium is the TE http-header, used only by Firefox:

2021-08-11 14:10:26: h2.c.1223) fd:15 id:15 rqst: te: trailers

I did not find a setting to disable this in Firefox.

Actions #70

Updated by gstrauss over 3 years ago

Any hints how to debug this?

Firefox has some nice debugging tools in the Web Developer Tools, but I am by no means an expert on clients.

You can verify what lighttpd is receiving in the header:
debug.log-request-header = "enable"

Firefox sends HTTP/2 over TLS, so strace on lighttpd is not going to yield human-readable results unless you man-in-the-middle the cert.
You can patch lighttpd to dump some of the request body. This is a kludge, coded quickly, for testing only. It might work, but only for small request bodies:
In src/h2.c:h2_recv_data():

--- a/src/h2.c
+++ b/src/h2.c
@@ -974,6 +974,9 @@ h2_recv_data (connection * const con, const uint8_t * const s, const uint32_t le
     else
         chunkqueue_steal(dst, cq, (off_t)alen);

+chunk *dc = dst->first;
+log_error(r->conf.errh, __FILE__, __LINE__, "%.*s", (int)(buffer_clen(dc->mem)-dc->offset), dc->mem->ptr+dc->offset);
+
     if (pad)
         chunkqueue_mark_written(cq, pad);
     return 1;

As to the TE header, is php-Davical attempting to send trailers in response? I did not think we were getting that far in request/response.
debug.log-response-header = "enable"

Actions #71

Updated by flynn over 3 years ago

Running h2spec on version 1.4.59, I get 2 errors.

Running h2spec on the current git version, I get 6 errors.

Can you reproduce this?

Actions #72

Updated by gstrauss over 3 years ago

I cannot reproduce what you are seeing with h2spec. What is the target of h2spec that you are using?

If I remember correctly, h2spec tests might fail if payload is <= 16k and lighttpd handles the response in a single DATA frame, where some h2spec tests expect multiple frames.
(Edit: yes: target file must be at least 16k+1 in size so that lighttpd sends multiple DATA frames, and target file must be at least 32k+1 with lighttpd 1.4.60)

I use /index.html, created from 'a' characters so that it is 64k in size.
I temporarily patch lighttpd with the patch further below.
Then, I execute h2spec with clear-text and TLS tests
h2spec -P /index.html -p 8080 -S
h2spec -P /index.html -p 8443 -S -k -t

Last year, I posted some patches to h2spec. Others agree that h2spec should handle some tests a bit differently, but the patches haven't been looked at by the author.

With a 64k /index.html handled as a static file by lighttpd, and the patch to lighttpd below, the h2spec tests pass 100% for TLS and with one failure for clear-text, and the reason for that is explained in
https://github.com/summerwind/h2spec/issues/122
https://github.com/summerwind/h2spec/issues/112

For h2spec, patch lighttpd temporarily with:

--- a/src/h2.c
+++ b/src/h2.c
@@ -1445,16 +1445,16 @@ h2_recv_headers (connection * const con, uint8_t * const s, uint32_t flen)
         /* RFC 7540 Section 8. HTTP Message Exchanges
          * 8.1.2.6. Malformed Requests and Responses
          *   For malformed requests, a server MAY send an HTTP
          *   response prior to closing or resetting the stream.
          * However, h2spec expects stream PROTOCOL_ERROR.
          * (This is unfortunate, since we would rather send
          *  400 Bad Request which tells client *do not* retry
          *  the bad request without modification)
          * https://github.com/summerwind/h2spec/issues/120
          * https://github.com/summerwind/h2spec/issues/121
          * https://github.com/summerwind/h2spec/issues/122
          */
-      #if 0
+      #if 1
         if (400 == r->http_status && !trailers) {
             h2_send_rst_stream(r, con, H2_E_PROTOCOL_ERROR);
             h2_retire_stream(r, con);

Actions #73

Updated by gstrauss over 3 years ago

For the PROPFIND/POST issue, could you try with debug.log-request-header = "enable" and the patch further above to print the request body received?

Is infcloud on the client side getting some sort of error with the initial request and falling back to retrying to send a POST request? Does Content-Type change?

Actions #74

Updated by gstrauss over 3 years ago

  • Status changed from New to Need Feedback

@ZivHung:
Are you able to test with my lighttpd development branch? Have you noticed any change in memory usage?

Actions #75

Updated by flynn over 3 years ago

I still have not found the cause of the PROPFIND problem in Firefox; I'll try to debug tomorrow. Without working PROPFIND I cannot test.

Actions #76

Updated by gstrauss over 3 years ago

@flynn:
Are you able to check with a vanilla Firefox on a vanilla OS installation? Corporate malware and/or Firefox third-party plugins might play a role.
https://support.mozilla.org/en-US/kb/identify-problems-third-party-modules-firefox-windows

-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:90.0) Gecko/20100101 Firefox/90.0'
Based on the above, you're using Firefox/90.0. FYI, I am using Firefox/91.0 and am going to try to reproduce what you are seeing.

Would you confirm that lighttpd is receiving a PROPFIND request from Firefox? Or is lighttpd receiving a POST request?
debug.log-request-header = "enable"

As we troubleshoot, it would be nice to narrow down if this issue is on the client or on the server. Based on the Content-Length disparity, I have a hunch that the client is sending a POST.

Actions #77

Updated by flynn over 3 years ago

I already tried a vanilla Pale Moon, a Firefox derivative, which has the same problem.

debug.log-request-header = "enable" shows the PROPFIND request:

2021-08-23 17:55:53: h2.c.1247) fd:15 id:13 rqst: :method: PROPFIND

I'll debug this tomorrow ...

Actions #78

Updated by gstrauss over 3 years ago

@flynn

If the issue did not occur with lighttpd 1.4.59, but does occur with HTTP/2 with my dev branch, you might take src/h2.c from lighttpd 1.4.59, apply it to my dev branch, and then see if the problem still occurs. The changes to src/h2.c should be standalone, so it should be safe and functional to take src/h2.c from lighttpd 1.4.59. (The same is not true for src/request.c, which is where I fixed the bug in my dev branch where anything other than POST was treated as GET for HTTP/2.) If your scenario works in lighttpd 1.4.59, but not in my dev branch, then there might be a bug in src/h2.c or src/request.c on my dev branch.

Actions #79

Updated by flynn over 3 years ago

Exchanging only src/h2.c from version 1.4.59 does not fix the problem.

I found that this commit causes the problem: https://git.lighttpd.net/lighttpd/lighttpd1.4/commit/5b2a3103e54757eb501e3bbdd647953e334503af

Any build before this commit works; builds after this commit do NOT work.

Actions #80

Updated by flynn over 3 years ago

There is a difference between the parsed method string (PROPFIND) and the stored enum value. (I added a log message in src/h2.c after the call to http_request_parse_header:)

  • Firefox
    2021-08-24 23:54:55: h2.c.1247) fd:15 id:15 rqst: :method: PROPFIND
    2021-08-24 23:54:55: h2.c.1252) r->http_method = 2, hpctx.id = -4, lsx.hpack_index = 3
  • Chromium
    2021-08-24 23:57:56: h2.c.1247) fd:15 id:7 rqst: :method: PROPFIND
    2021-08-24 23:57:56: h2.c.1252) r->http_method = 26, hpctx.id = -3, lsx.hpack_index = 2

The value 2 represents the POST method that I see in PHP via FastCGI.

The problem seems to be inside the function lshpack_dec_decode ...

Actions #81

Updated by gstrauss over 3 years ago

--- a/src/request.c
+++ b/src/request.c
@@ -651,6 +651,7 @@ http_request_parse_header (request_st * const restrict r, http_header_parse_ctx
                 http_request_header_set_Host(r, v, vlen);
                 return 0;
               case HTTP_HEADER_H2_METHOD_GET:  /*(any method, not only "GET")*/
+              case HTTP_HEADER_H2_METHOD_POST:
                 if (__builtin_expect( (HTTP_METHOD_UNSET != r->http_method), 0))
                     break;
                 r->http_method = get_http_method_key(v, vlen);
@@ -658,11 +659,6 @@ http_request_parse_header (request_st * const restrict r, http_header_parse_ctx
                     return http_request_header_line_invalid(r, 501,
                       "unknown http-method -> 501");
                 return 0;
-              case HTTP_HEADER_H2_METHOD_POST: /*(exactly "POST")*/
-                if (__builtin_expect( (HTTP_METHOD_UNSET != r->http_method), 0))
-                    break;
-                r->http_method = HTTP_METHOD_POST;
-                return 0;
               case HTTP_HEADER_H2_PATH:            /*(any path, not only "/")*/
               case HTTP_HEADER_H2_PATH_INDEX_HTML: /*(exactly "/index.html")*/
                 if (__builtin_expect( (!buffer_is_blank(&r->target)), 0))
Actions #82

Updated by flynn over 3 years ago

This patch fixes the problem in Firefox; PROPFIND works now!

Is it experimental or final?

Actions #83

Updated by gstrauss over 3 years ago

I'll be updating my dev branch (with a force push) momentarily.

That patch is final for this issue. It seems I had misunderstood some of the detailed internals of HPACK. The comment at the top of the switch() explains that the ls-hpack internals indicate whether the id matches only the key (e.g. ":method") or both key and value (":method" and "POST"). I did not know that the client could select this, and apparently (not verified) Firefox is sending the id for ":method" and "POST" with a flag indicating the match is only for the key (":method").

The other pseudo-headers in request.c:http_request_parse_header() all treat the id as a key-only match, and this patch does the same for :method, making them all consistent.

Thank you for digging into this!

Actions #84

Updated by gstrauss over 3 years ago

  • Status changed from Need Feedback to Patch Pending
  • Target version changed from 1.4.x to 1.4.60
Actions #85

Updated by flynn over 3 years ago

I started some memory tests on a low volume server:
  • base line is now below 2MB (1.4.59: 3MB)
  • memory intensive tasks like big uploads, reverse proxy requests or wsgi requests do not return the additional memory after some time
    (tested at least for 30min)
  • explicit calls to malloc_trim are still the best solution; the memory returns very close to the base line after some minutes.
Actions #86

Updated by gstrauss over 3 years ago

explicit calls to malloc_trim are still the best solution; the memory returns very close to the base line after some minutes.

Since malloc_trim is effective, the good news is that lighttpd's allocator usage is not trapping excessive memory in a way that prevents the allocator from releasing it.

I wonder how the memory usage would be affected by disabling the dynamic mmap threshold approach and setting environment variables: MALLOC_TRIM_THRESHOLD_=524288 (512KB) and MALLOC_MMAP_THRESHOLD_=327680 (320KB) in the shell that starts lighttpd.

Are there any changes if you set MALLOC_TRIM_THRESHOLD_ (note the trailing '_') with my dev branch?

In #3084#note-47 I wrote

If the changes on my dev branch are not sufficient for the glibc memory allocator to return memory to the OS, I will consider adding malloc_trim() to periodic maintenance tasks. Any suggestions for frequency? Every 64 seconds? Other desired frequency?

I am leaning towards every 64 secs (1m 4s) or every 256 secs (4m 16s).

Do you have a suggestion for the size that the allocator should keep available on its heap?
malloc_trim(131072); // 128KB
malloc_trim(262144); // 256KB
malloc_trim(524288); // 512KB
I will probably use 128KB unless you have a different preference.

Actions #87

Updated by gstrauss over 3 years ago

From man malloc_trim: "NOTES: This function is automatically called by free(3) in certain circumstances; see the discussion of M_TOP_PAD and M_TRIM_THRESHOLD in mallopt(3)."

From man mallopt: "When dynamic mmap thresholding is in effect, the threshold for trimming the heap is also dynamically adjusted to be twice the dynamic mmap threshold."

Since flynn explicitly calling malloc_trim() successfully releases memory back to the system, glibc must not be triggering this behavior on its own.

Explicitly setting the below should disable the glibc dynamic mmap threshold, and should be equivalent to explicitly calling malloc_trim(131072);, since the default M_TRIM_THRESHOLD is 128k (131072) and the default M_TOP_PAD is 128k (131072). If this works instead of explicitly calling malloc_trim(), then we may consider setting M_TRIM_THRESHOLD to 256k or 512k. I am leaning towards 512k (which was the intent of testing with the environment variable MALLOC_TRIM_THRESHOLD_=524288).

--- a/src/server.c
+++ b/src/server.c
@@ -1985,6 +1985,9 @@ int main (int argc, char **argv) {
   #if defined(HAVE_MALLOPT) && defined(M_ARENA_MAX)
     mallopt(M_ARENA_MAX, 2); /*(ignore error, if any)*/
   #endif
+  #if defined(HAVE_MALLOPT) && defined(M_TRIM_THRESHOLD)
+    mallopt(M_TRIM_THRESHOLD, 131072); /*(ignore error, if any)*/
+  #endif

     /* for nice %b handling in strftime() */
     setlocale(LC_TIME, "C");
Actions #88

Updated by gstrauss over 3 years ago

I force-pushed some updates to my dev branch to run malloc_trim(524288) every 64 seconds, on systems where it is available (which is checked at runtime at startup, just in case an alternative memory allocator is used).
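
A minimal sketch of that runtime check, assuming a dlsym() lookup (the actual dev-branch code may differ):

#define _GNU_SOURCE   /* for RTLD_DEFAULT on glibc */
#include <dlfcn.h>    /* dlsym(); may need -ldl on older glibc */
#include <stddef.h>
#include <stdint.h>

static int (*malloc_trim_fn)(size_t);

static void detect_malloc_trim (void)
{
    /* resolve malloc_trim at startup; stays NULL with allocators
     * (e.g. jemalloc) that do not provide it, making the periodic
     * trim a silent no-op */
    malloc_trim_fn = (int (*)(size_t))(intptr_t)
      dlsym(RTLD_DEFAULT, "malloc_trim");
}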

If someone needs a smaller trim threshold, then they can set MALLOC_TRIM_THRESHOLD_=131072 in the environment before starting lighttpd.

For the moment, I have not included mallopt(M_TRIM_THRESHOLD, ...) in the code, but I am still considering it.

@flynn wrote

memory-intensive tasks like big uploads, reverse proxy requests or wsgi requests do not return the additional memory after some time

During the memory-intensive tasks, to what size does lighttpd memory usage grow? If it grows from 2-3 MB to 10-15 MB, then I am not too concerned. On the other hand, if the memory use grows to 20 MB or more, then that is something I would like to explore further. For a single connection doing a large upload or reverse proxy, memory use should not grow that much (< 1 MB) since lighttpd will use temporary files on disk for large request or response bodies. If there is a burst of many connections doing large transfers, then memory use is expected to grow more than that, but not for a single connection.

If lighttpd memory growth is large (> 1 MB) for a single connection, then that would suggest there are still improvements that should be made in lighttpd.

I have tested lighttpd with h2load for varying sizes of downloads of static files, and see that memory chunks are properly reused. I have done the same with a front-end lighttpd proxying to a back-end lighttpd. Perhaps I will try with some dynamic content with uploads and downloads, too.

Actions #89

Updated by flynn over 3 years ago

Some answers:
  • at the moment I use malloc_trim(131072) in some tests.
  • the maximum used memory in my tests was 11.3MB with 5 parallel big uploads (each > 100MB) and a wsgi-request to a searx-instance (without explicit malloc_trim)
  • in my older tests some weeks ago, setting the malloc environment variables like MALLOC_TRIM_THRESHOLD_ did not reduce the memory as much as the explicit calls to malloc_trim, but I can repeat the tests next week

I updated to the current git version and started a new test, but will not be able to report before Tuesday.

Actions #90

Updated by gstrauss over 3 years ago

Thanks! You're up late! I had hoped you would see this tomorrow.

I expect that the current set of patches will address the issue of a sustained high memory footprint with glibc.
You already established that jemalloc had better behavior. Hopefully these patches will approach the jemalloc behavior, but with glibc.

Separately, I am planning to add some instrumentation to see if the chunk reuse is working as expected and if it can be improved with large uploads. To help establish the test environment, I am presuming that all connections are over TLS and HTTP/2. Does that match your environment? Do you recall how much memory use increased in the past with TLS and HTTP/1.1, before lighttpd supported HTTP/2? (If your current environment is using HTTP/1.1 uploads, are they chunked or is Content-Length provided? If chunked, then I should be looking more closely at how lighttpd is handling HTTP/1.1 Transfer-Encoding: chunked on request bodies.)

Actions #91

Updated by gstrauss over 3 years ago

FYI: I fixed a bug in my dev branch which could result in duplicated data in the response under certain conditions: response > 64k, which spills to tempfiles, and then additional packets received in smaller batches < 32k.

Additional commits have been made to further reduce memory usage for responses > 64k.

Actions #92

Updated by flynn over 3 years ago

I updated my git copy and started new tests: baseline around 2MB; big parallel uploads push it up to 10MB, but it returns close to 2MB within minutes. Overall the memory usage has improved compared to last week.

I ran short tests without malloc_trim, using MALLOC_TOP_PAD_=131072 MALLOC_TRIM_THRESHOLD_=131072 MALLOC_MMAP_THRESHOLD_=131072, with the same results as some weeks ago: no significant freeing of memory after big parallel uploads; usage stays between 7-10MB.

I will report again after 1-2 days.

Actions #93

Updated by flynn over 3 years ago

After one week of testing the baseline stabilized at exactly 3MB. The memory requirement rises and falls with use, with a short delay, as it should.

The difference between malloc_trim(131072) and malloc_trim(524288) is about 200-300kB, but the test with the lower value is still running. Maybe you could offer a possibility to change this value (environment variable, setting in configure script, ...) for people who are very memory constrained.

I also started a test on a bigger server, which previously had memory usage up to 40-70MB. If this server stays e.g. below 10MB until Friday, this issue can be considered solved from my side.

Actions #94

Updated by gstrauss over 3 years ago

After one week of testing the baseline stabilized at exactly 3MB. The memory requirement rises and falls with use, with a short delay, as it should.

That's good news! :)

Maybe you could offer a possibility to change this value (environment variable, setting in configure script, ...) for people who are very memory constrained.

This patch reuses the MALLOC_TOP_PAD_ environment variable for one of its intended purposes. (I have already trivially tested it, so no need to restart your tests.)

--- a/src/server.c
+++ b/src/server.c
@@ -74,6 +74,7 @@ static const buffer default_server_tag = { CONST_STR_LEN(PACKAGE_DESC)+1, 0 };
 #include <malloc.h>
 #if defined(HAVE_MALLOC_TRIM)
 static int(*malloc_trim_fn)(size_t);
+static size_t malloc_top_pad;
 #endif
 #endif

@@ -1882,7 +1883,7 @@ static void server_handle_sigalrm (server * const srv, unix_time64_t mono_ts, un
                                        request_pool_free();
                                        connections_pool_clear(srv);
                                  #if defined(HAVE_MALLOC_TRIM)
-                                       if (malloc_trim_fn) malloc_trim_fn(524288);
+                                       if (malloc_trim_fn) malloc_trim_fn(malloc_top_pad);
                                  #endif
                                        /* attempt to restart dead piped loggers every 64 secs */
                                        if (0 == srv->srvconf.max_worker)
@@ -2011,6 +2012,14 @@ static int main_init_once (void) {
   #endif

   #if defined(HAVE_MALLOC_TRIM)
+    malloc_top_pad = 524288;
+    {
+        const char * const top_pad_str = getenv("MALLOC_TOP_PAD_");
+        if (top_pad_str) {
+            unsigned long top_pad = strtoul(top_pad_str, NULL, 10);
+            if (top_pad != ULONG_MAX) malloc_top_pad = (size_t)top_pad;
+        }
+    }
   #ifdef LIGHTTPD_STATIC
     malloc_trim_fn = malloc_trim;
   #else

Actions #95

Updated by gstrauss over 3 years ago

  • Status changed from Patch Pending to Fixed
Actions #96

Updated by flynn over 3 years ago

The memory usage on the bigger server differs a little bit:
  • the baseline is at 6-7MB; this is reasonable, because it has many more backends (proxy, fastcgi, scgi, ...) and lua scripts
  • the memory usage looks like a sawtooth wave, e.g. it rises from 6 to 14MB in 2h and goes back to 6MB in seconds
  • peak value at the moment: 26MB after 1:40h

Maybe freeing memory is still not triggered often enough.

Overall it is much better than version 1.4.59:
  • lower peak values
  • lower baseline
  • faster and more frequent returns to the baseline (it does NOT stay at high values for hours)
Actions #97

Updated by gstrauss over 3 years ago

With what I pushed to lighttpd git master, lighttpd is releasing memory from various object pools every 64 seconds. If the server is idle for 64 seconds (no open connections, including no idle HTTP/2 connections), then memory should return close to the baseline. If there are open connections, then resources associated with those open connections may (temporarily) prevent some memory from being released.

What do your access logs suggest about how long HTTP/2 connections are kept open? (Check for the "PRI *" entries.)
If you're running mod_status, you can take a look there to get a glimpse at open connections, but ls -l /proc/<pid>/fd on the lighttpd <pid> might also be useful.

Actions #98

Updated by flynn over 3 years ago

I monitor both:
  • mod_status shows values between 2 and 20, average at 4
  • open file handles average at 60 (38 of which are log file handles) with a maximum at 100

Because I have so many log files, it takes more time to follow the PRI entries; I hope I can spend some time this evening.

I just discovered that the same log file has multiple file handles, one for each accesslog.filename = "same log filename" configuration line; maybe you want to pool this.

Actions #99

Updated by gstrauss over 3 years ago

Because I have so many log files, it takes more time to follow the PRI entries; I hope I can spend some time this evening.

I wouldn't spend too much time here. It probably won't tell you anything you don't already know from mod_status.

I just discovered that the same log file has multiple file handles, one for each accesslog.filename = "same log filename" configuration line; maybe you want to pool this.

Noted; added to the wishlist. I'll look to see if it's easy to do.
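
If it turns out to be easy, one possible approach might look like this (a minimal sketch with hypothetical names; not lighttpd's actual implementation): cache open log fds keyed on path and share them among duplicate accesslog.filename entries.

#include <fcntl.h>   /* open() */
#include <stdlib.h>  /* malloc(), free() */
#include <string.h>  /* strcmp(), strdup() */

/* hypothetical pool of shared log file descriptors, keyed on path */
struct log_fd_entry { char *path; int fd; struct log_fd_entry *next; };
static struct log_fd_entry *log_fds;

static int log_fd_open_shared (const char * const path)
{
    for (struct log_fd_entry *e = log_fds; e; e = e->next)
        if (0 == strcmp(e->path, path))
            return e->fd;                    /* reuse existing handle */
    const int fd = open(path, O_WRONLY|O_APPEND|O_CREAT, 0644);
    if (fd < 0) return -1;
    struct log_fd_entry * const e = malloc(sizeof(*e));
    if (NULL == e) return fd;                /* usable, just not pooled */
    e->path = strdup(path);
    if (NULL == e->path) { free(e); return fd; }
    e->fd = fd;
    e->next = log_fds;
    log_fds = e;
    return fd;
}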

Actions #100

Updated by flynn over 3 years ago

I must terminate my testing, because I get errors with the etherpad backend: it is connected via the proxy module and I get the following error messages:

2021-09-09 17:50:57: http_chunk.c.354) chunked header invalid chars
2021-09-09 17:50:57: gw_backend.c.2357) response already sent out, but backend returned error on socket: tcp:xyz.xyz.xyz.xyz:xyz for /javascripts?callback=require.define, terminating connection

All requests are HTTP/2. Switching back to version 1.4.59 with no (configuration) changes solves this; etherpad works again.

Actions #101

Updated by gstrauss over 3 years ago

@flynn

Would you please file a new bug for that? proxy.header += ("force-http10" => "enable") might also work for requests to that backend.

Actions #102

Updated by gstrauss over 3 years ago

I found an issue with commit 39a57798 and am working on a fix.

Actions #103

Updated by gstrauss over 3 years ago

Preliminary patch. When lighttpd is dechunking from the backend but not sending chunks to the client, the input buffer is consumed and must be cleared before returning. (Code elsewhere which accumulates small reads in the buffer does not clear the input buffer.)

--- a/src/http_chunk.c
+++ b/src/http_chunk.c
@@ -482,6 +482,8 @@ int http_chunk_decode_append_buffer(request_st * const r, buffer * const mem)
         r->resp_send_chunked = 1;
         return rc;
     }
+    else
+        buffer_clear(mem);

     return 0;
 }

Actions #104

Updated by flynn over 3 years ago

I updated my git copy and the etherpad backend works again; I restarted my memory testing.

Actions #105

Updated by flynn over 3 years ago

It seems that this patch also improved the memory usage: I see fewer, shorter and smaller "sawtooth waves".

Actions #106

Updated by gstrauss about 3 years ago

FYI: There is a new set of patches on my dev branch which dramatically increases buffer reuse when processing requests to backends (e.g. proxy and fastcgi), reducing memory footprint when lighttpd is under load. With these patches, you should see lower peak memory use. These will likely become part of an upcoming lighttpd 1.4.60 release candidate.

Actions #107

Updated by flynn about 3 years ago

@gstrauss: after all these fixes regarding HTTP/2, I started tests without periodic calls to malloc_trim and it looks good: memory goes up and down and returns to the same baseline as with malloc_trim. I'll run more tests on more servers and report again in about one week (if not, feel free to remind me).

Maybe the calls to malloc_trim can be omitted now ...

Actions #108

Updated by gstrauss about 3 years ago

Glad to hear that memory use is smoother.

BTW, since malloc_trim is run only once every 64 seconds, I see little reason to remove it. It might help in some cases, and is not likely to cause any harm. ISTR the calls to malloc_trim did make a positive difference following my (intentionally) abusive load testing.

Actions #109

Updated by flynn about 3 years ago

Here is my update running without malloc_trim:
  • the situation improved a lot, no more long-lasting high peaks of memory usage
  • the average memory usage is about 1-2MB higher, with a base value of 5 or 6MB.

Do you know how long the call to malloc_trim takes? Is it (a)synchronous?
Is the request that triggers malloc_trim delayed (or the request after that one)?
My concern is that requests coinciding with the 64-sec interval are delayed and do not have the same timing as the others.

Actions #110

Updated by gstrauss about 3 years ago

  • the average memory usage is about 1-2MB higher, with a base value of 5 or 6MB.

As you noted, that is for your specific system (with specific modules loaded) and without malloc_trim(). When explicitly using malloc_trim(), lighttpd server.c keeps 512k in its call to malloc_trim(), so the baseline will increase by at least that amount (from the startup baseline) after moderate usage of the server. lighttpd looks for the environment variable MALLOC_TOP_PAD_ and will keep that amount instead, if MALLOC_TOP_PAD_ is set in the env before lighttpd startup.

Do you know how long the call to malloc_trim takes? Is it (a)synchronous?
Is the request that triggers malloc_trim delayed (or the request after that one)?
My concern is that requests coinciding with the 64-sec interval are delayed and do not have the same timing as the others.

malloc_trim() is not asynchronous. When there is trimming work to be done by malloc_trim(), strace of the process shows calls to madvise() MADV_DONTNEED (IIRC), so depending on the number of pages being released, this can take some number of microseconds. The glibc man page for mallopt() notes that the heap may be trimmed when free() is called, so malloc_trim()-like behavior may happen at any time. When glibc "dynamic mmap thresholding" is in effect, the point at which dynamic trimming occurs can change. As we originally saw, dynamic trimming did not seem to be happening much with the preallocated blocks of lighttpd struct connection and others, which grew with (high) load but were never released. Now, lighttpd keeps a pool of unused struct connection and struct request (and some others), but lighttpd periodically free()s the unused elements in those pools (before malloc_trim()) to potentially release the memory back to the OS.
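
Condensed, the periodic sequence described above looks roughly like this (names taken from the server.c patch in #94; a sketch, not a verbatim excerpt):

/* free unused pooled objects first, so that the subsequent
 * malloc_trim() can return the freed pages to the OS */
static void periodic_malloc_trim (server * const srv)
{
    request_pool_free();          /* unused struct request objects */
    connections_pool_clear(srv);  /* unused struct connection objects */
  #if defined(HAVE_MALLOC_TRIM)
    if (malloc_trim_fn) malloc_trim_fn(malloc_top_pad);
  #endif
}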

My concern is that requests coinciding with the 64-sec interval are delayed and do not have the same timing as the others.

After an abusive load test has ended and the server is idle, that is when I see malloc_trim() do some work. If the server is still busy, then more of that memory is in use and there is less work to do. Similar to other periodic maintenance in lighttpd, there might be a small blip in how long it takes to handle requests during that periodic maintenance. This blip might measure a few microseconds to maybe low 10s of microseconds, but I have not seen more than that even after abusive load tests. If you did measure a longer delay, then please let me know. If that case were to materialize -- which I do not think occurs in typical usage -- then one solution would be to run malloc_trim() incrementally more frequently (e.g. every 32 seconds or every 16 seconds) so that potentially less work is done each interval.

In short, based on my abusive load tests, I do not think this is something to worry about, but if you have numbers that suggest otherwise then we can revisit.
