Bug #3084 (open)

Memory fragmentation with HTTP/2 enabled

Added by ZivHung 19 days ago. Updated 2 days ago.

Status: New
Priority: Normal
Category: core
Target version: -
ASK QUESTIONS IN Forums: Yes

Description

Hello,

We use lighttpd 1.4.59 to receive files on our Linux server. In our test process we repeatedly upload a 4 KB file every 10 milliseconds; the memory of the lighttpd process grows by around 3000 KB after 20 minutes (and by about 7500 KB after one hour). However, when we run the same test against lighttpd 1.4.58, memory does not grow at all; after one hour of testing, the memory footprint is even smaller than it was an hour earlier.

In both experiments the lighttpd.conf and the Linux server are identical, so we believe there is a memory leak in 1.4.59.
Please check our lighttpd.conf settings in the attached file.


Files

lighttpd.conf (5.13 KB) ZivHung, 2021-06-01 02:14
lighttpd.conf (5.15 KB) ZivHung, 2021-06-02 05:44
lighttpd.conf (5.05 KB) ZivHung, 2021-06-02 14:13
lighttpd-memory.png (87.3 KB) "Lighttpd memory usage for one week" flynn, 2021-06-03 13:03
Actions #1

Updated by gstrauss 19 days ago

Thank you for providing your lighttpd.conf.

Would you please help narrow down what behavior is causing the memory increase?
What requests?
  • FastCGI requests?
  • WSTunnel requests?
  • CGI requests? I see CGI directives in your lighttpd.conf, but you have commented out "mod_cgi" in server.modules.

Does the issue go away if you change the streaming parameters? (one or both)

  server.stream-request-body = 1
  server.stream-response-body = 1

or with

  server.stream-request-body = 0
  server.stream-response-body = 0

Is there a reason you have disabled TLS session tickets (-SessionTicket)?
Is there a reason you are using server.network-backend = "writev" instead of the default "sendfile"?

BTW, if you want lighttpd to omit the Server response header, use server.tag = ""

I see that you have disabled ETag generation. If the responses should never be cached, you might also disable the stat cache and see if that makes a difference in memory use. server.stat-cache-engine = "disable"

Actions #2

Updated by gstrauss 19 days ago

Another difference between lighttpd 1.4.58 and lighttpd 1.4.59 is enabling HTTP/2 by default.

Please test with server.feature-flags = ( "server.h2proto" => "disable" ) to see if that makes a difference for you.

Actions #3

Updated by ZivHung 18 days ago

Hello,

Thanks for the tip. After we tested lighttpd 1.4.59 with "server.feature-flags = ( "server.h2proto" => "disable" )" added, the memory size no longer climbed (matching our expectation).

Regarding your questions: we use FastCGI to pass the CGI commands. Changing the server.stream-request-body and server.stream-response-body settings did not stop the memory growth, and neither did server.stat-cache-engine = "disable"; however, we will keep that setting, since we indeed do not need to cache the responses (please check our latest lighttpd.conf).

One last question about "Is there a reason you have disabled TLS session tickets (-SessionTicket)?": we saw documentation suggesting that this setting should be added for versions before lighttpd 1.4.56. If we use a later version, is the setting still necessary?

Actions #4

Updated by gstrauss 18 days ago

> Thanks for the tip. After we tested lighttpd 1.4.59 with "server.feature-flags = ( "server.h2proto" => "disable" )" added, the memory size no longer climbed (matching our expectation).

Depending on usage, lighttpd memory use might increase slightly from use at startup, but memory use should stabilize as the memory is reused. After two hours (with "server.h2proto" => "enable"), has memory use grown from one hour prior? Can you share some details about the client that you are using to upload the 4k file? I would like to try to reproduce the issue. Thanks.

> One last question about "Is there a reason you have disabled TLS session tickets (-SessionTicket)?": we saw documentation suggesting that this setting should be added for versions before lighttpd 1.4.56. If we use a later version, is the setting still necessary?

The setting is neither necessary nor recommended with lighttpd 1.4.56 and later, which have built-in logic to rotate the session ticket encryption key; earlier versions did not. Therefore, for earlier versions, it is recommended to use -SessionTicket or to restart lighttpd daily (which results in openssl regenerating the session ticket encryption key).

BTW, to slightly reduce lighttpd memory footprint on embedded systems, you might comment out modules in server.modules that you are not using, as you have done with "mod_cgi". In your config, I do not see you using "mod_access", "mod_authn_file", "mod_auth", "mod_proxy", or "mod_redirect". Each module adds a minimum of 20k to memory footprint, and modules such as "mod_authn_file" and "mod_proxy" add more.

Actions #5

Updated by ZivHung 18 days ago

Hello,
Thank you for the advice about the lighttpd configuration settings. There are indeed several entries in "server.modules" that are no longer needed for our service; please check our latest configuration file to see whether any inappropriate settings remain.

Unfortunately, we cannot directly share our test details, but we will try to simplify the conditions and provide the test flow if we find a simpler way to reproduce the issue. Thank you.

Actions #6

Updated by gstrauss 18 days ago

If your site is producing files on disk which change frequently, and do not have unique filenames (e.g. for generated responses and using X-Sendfile), then server.stat-cache-engine = "disable" is appropriate.

If you are serving static content, such as I see in your aliases to css, images, js, and videos, then you will benefit from server.stat-cache-engine = "simple" (the default if omitted).

In short, if your site requires that the stat cache be disabled for proper behavior, then you should disable the stat cache. Otherwise, the default is to enable the stat cache because it provides faster responses to static resources.

I had suggested disabling the stat cache as a troubleshooting step to help isolate which lighttpd subsystem might be affecting the memory usage you are seeing. The stat cache is enabled by default in lighttpd because it is so useful and generally safe.

Actions #7

Updated by flynn 17 days ago

I have a similar issue with memory, but I could not make it reproducible enough to write a ticket.

But I have memory monitoring for lighttpd, and I made the following observations:

  • the memory is freed; lighttpd returns very close to the memory used after start
  • but it may take days (!) to free the memory; see the attached graph (lighttpd was started at Friday 14:37, look only at the last 6 days)

Maybe freeing the memory is not triggered often enough.

Actions #8

Updated by gstrauss 17 days ago

This might be due, in whole or in part, to memory fragmentation. lighttpd allocates memory for its chunkqueue in chunks, and those chunks are pushed onto a queue to be reused. Every 64 seconds, lighttpd frees the queue of unused chunks. However, the order in which they are freed may differ from the order in which they were allocated, and some long-running connections may still be using later-allocated chunks. After (some) chunks are freed, it is still up to the memory allocator (libc malloc/calloc/realloc unless another is preloaded) to decide when to release the memory back to the system.
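
For illustration, here is a minimal sketch of that reuse pattern in C (simplified stand-ins for what chunk.c does, not lighttpd's actual code):

#include <stdlib.h>

struct chunk { struct chunk *next; char *mem; size_t size; };

static struct chunk *unused_chunks; /* queue of released, reusable chunks */

/* on release, push the chunk onto the reuse queue instead of freeing */
static void chunk_release_sketch(struct chunk *c) {
    c->next = unused_chunks;
    unused_chunks = c;
}

/* every 64 seconds, free the entire queue of unused chunks; chunks still
 * held by long-running connections are untouched, so freed addresses can
 * be interleaved with live ones, which is the fragmentation pattern
 * described above */
static void chunk_pool_clear_sketch(void) {
    struct chunk *c;
    while ((c = unused_chunks) != NULL) {
        unused_chunks = c->next;
        free(c->mem);
        free(c);
    }
}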

It is a good sign to me that the memory (at least in flynn's case) is eventually returned to the system, as that suggests the memory is not leaked. Also, when I tested some scenarios with valgrind some months back, I did not find any leaks.

Actions #9

Updated by gstrauss 17 days ago

  • Category set to core

BTW, the chunk size defaults to 8k and is configurable with server.chunkqueue-chunk-sz = 8192. Minimum size is 1024, and server.chunkqueue-chunk-sz is rounded up to the closest power of 2.
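
In other words, the configured value is normalized roughly like the following sketch (my illustration, not the actual lighttpd code):

/* round n up to the next power of 2, with a floor of 1024,
 * e.g. 3000 -> 4096 and 8192 -> 8192 (illustrative sketch) */
static unsigned int chunkqueue_chunk_sz_normalize(unsigned int n) {
    unsigned int sz = 1024; /* minimum chunk size */
    while (sz < n) sz <<= 1;
    return sz;
}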

Actions #10

Updated by flynn 16 days ago

I have started a test with libjemalloc as the allocator; some days later I'll test different values of server.chunkqueue-chunk-sz ...

Actions #11

Updated by gstrauss 16 days ago

Another thought occurred to me. In lighttpd 1.4.59, HTTP/2 is enabled by default.

Starting in lighttpd 1.4.56, lighttpd keeps a pool of struct request_st (and associated buffers), which are used by HTTP/2 connections, one for each stream. This pool can grow to server.max-connections entries. Each HTTP/2 connection can have up to 8 simultaneous streams (a hard-coded limit in lighttpd), and any additional request_st allocated beyond the pool limit is freed immediately upon release.

Depending on the access patterns that your server sees, and on whether the streams are handled quickly or have a long lifetime, you might want to reduce server.max-connections, but not so much that you unnecessarily delay lighttpd from answering bursts of connections. Of course, this tuning is necessary only on small, memory-restricted servers, which are likely resource-constrained in other ways, too. On larger servers, 30M of lighttpd memory use for a very busy server is unlikely to be an issue.

If it turns out that the request_st allocations -- which occur at runtime during bursty activity and then are saved to the request pool -- are causing excessive fragmentation and preventing release of other memory, then I can consider pre-allocating the request pool at startup, up to server.max-connections. Doing so would increase lighttpd's initial memory footprint for the base case (everyone), so I am leaning against this unless the data proves that it would make a drastic difference. That said, the initial graph provided by flynn suggests that this is not the case, since the memory is eventually released (which does not currently happen with the request pool until server shutdown or restart). Another option would be to periodically clear the request pool, as is currently done every 64 seconds with the chunk queue. Since ZivHung reported this increased memory use with HTTP/2 enabled in lighttpd, I might consider tweaking the lighttpd reqpool behavior if the data supports doing so.

BTW, if not specified, server.max-connections is 1/3 server.max-fds. If not specified, server.max-fds is currently 4096, making the default server.max-connections 1365.

I am pondering freeing up to 16 request_st from the pool every 64 seconds. After a busy burst of activity which filled the pool, freeing 16 request_st every 64 seconds would clear the pool in ~90 mins (with the default server.max-connections).
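
A rough sketch of what that periodic trimming might look like (all names here are illustrative stand-ins, not reqpool.c code; with the default server.max-connections of 1365, draining a full pool takes 1365/16 ~ 86 passes, i.e. ~91 minutes at one pass per 64 seconds):

#include <stdlib.h>

/* stand-in for lighttpd's request_st; the real struct carries buffers */
typedef struct request_st { struct request_st *next; } request_st;

static request_st *request_pool; /* pooled, currently unused requests */

/* hypothetical helper: free up to 16 pooled request_st per pass */
static void reqpool_clear_some(void) {
    for (int i = 0; i < 16 && request_pool != NULL; ++i) {
        request_st * const r = request_pool;
        request_pool = r->next;
        free(r); /* the real code would free associated buffers, too */
    }
}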

Actions #12

Updated by flynn 14 days ago

Using jemalloc as the allocator definitely improves the memory usage:

  • base memory usage (4-5MB) is reached more often
  • high memory usage (more than 10MB) is lower and shorter

I consider this memory usage (with jemalloc) to be normal.

So the interaction between lighttpd memory buffers and glibc malloc can be improved, maybe by reducing the memory arenas to 1. I'll test this later.

I had a similar issue with java and glibc malloc, where reducing the glibc memory arenas improved the situation, but jemalloc was the best solution to achieve the lowest memory usage in production.

Actions #13

Updated by gstrauss 14 days ago

flynn: thank you for the update

ZivHung: try export MALLOC_ARENA_MAX=2 in the same shell in which you start the lighttpd daemon (and do not disable "server.h2proto")

lighttpd is single-threaded and (in many common use cases) does not heavily use the memory allocator at runtime, so while MALLOC_ARENA_MAX=1 is an option, too, MALLOC_ARENA_MAX=2 is probably a good default as a tradeoff, just in case some custom modules are threaded. If MALLOC_ARENA_MAX is not set in the environment, then I might consider using mallopt to set M_ARENA_MAX to 1 during startup and config processing (which is single-threaded and does heavily use the memory allocator), and then setting M_ARENA_MAX to 2 after setup.

[Edit:]
I would prefer to find a reasonable value of MALLOC_ARENA_MAX instead of periodically running the glibc-specific malloc_trim
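
For illustration, that mallopt() idea might look like the following sketch (glibc-specific; where exactly the calls would sit in lighttpd's startup path is my assumption):

#include <malloc.h> /* glibc: mallopt(), M_ARENA_MAX */

int main(void) {
    /* config parsing is single-threaded and allocation-heavy:
     * keep everything in one arena during setup */
    mallopt(M_ARENA_MAX, 1);

    /* ... load and process the configuration ... */

    /* allow a second arena afterwards, in case custom modules
     * spawn threads at runtime */
    mallopt(M_ARENA_MAX, 2);

    /* ... enter the main event loop ... */
    return 0;
}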

Actions #14

Updated by gstrauss 14 days ago

  • Subject changed from Memory leak problem in Lighttpd 1.4.59 to Memory fragmentation with HTTP/2 enabled
Actions #15

Updated by flynn 8 days ago

In my tests over several days I cannot see/measure any significant change with MALLOC_ARENA_MAX=1.

Only with jemalloc I get significant changes in memory usage.

gstrauss: shall I test different values of server.chunkqueue-chunk-sz, e.g. 4096 or 16384?

Actions #16

Updated by gstrauss 8 days ago

Hmmmm. I was really hoping that MALLOC_ARENA_MAX=1 (or MALLOC_ARENA_MAX=2) would be a good change for the single-threaded lighttpd server, and I might still consider it if there is no indication that it does harm.

The server.chunkqueue-chunk-sz tunable was originally intended for use on severely memory-constrained systems, where a chunk size of 1k might more fully utilize the allocated memory, at the slight cost of increasing the number of memory chunks used.

My guess is that server.chunkqueue-chunk-sz will not make any real difference with regard to memory fragmentation. On the other hand, if the chunk size exceeds the allocator's threshold for mmap chunks, that could change the way chunks are allocated and released back to the system. At the other end of the spectrum, a very small chunk size of 1k might be allocated from a different pool in the arena. I have a hunch that jemalloc has a better algorithm for segmenting the allocation of blocks across the different orders of magnitude of size used by lighttpd.

While playing with chunkqueue-chunk-sz might yield a difference, I think the best next step would be to try to identify the source(s) of memory fragmentation. Is it the cache of chunkqueue chunks? Is it the stat_cache? Is it the request_st pool (used by lighttpd HTTP/2 support)? Is it something else, or a combination thereof? Since this issue seems to appear with lighttpd HTTP/2 support enabled, it is probably not the stat_cache. In chunk.c, there are chunk_release() and chunk_buffer_release(), which could be modified to disable the caching and to always take the branch that frees the object. Similarly, in reqpool.c, request_release() could be modified to always free the object.
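
As a concrete illustration of that diagnostic, the modified release path might look like this sketch (the chunk struct and chunk_free_sketch() are simplified stand-ins; the real chunk.c internals differ):

#include <stdlib.h>

/* simplified stand-ins for the chunk type and free path in chunk.c */
typedef struct chunk { struct chunk *next; void *mem; } chunk;

static void chunk_free_sketch(chunk *c) { free(c->mem); free(c); }

/* diagnostic version: always free instead of pushing the chunk onto
 * the reuse queue, to test whether the chunk cache is the source of
 * the fragmentation */
static void chunk_release(chunk * const c) {
    chunk_free_sketch(c);
}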

If you have a decent idea of the number of active connections at any one time (including occasional peaks), and a good idea of how many are brief and how many are long-running (and for how long), then you might lower server.max-connections to see if doing so affects the memory fragmentation over time.

I am also open to other ideas if you have any suggestions.

Actions #17

Updated by gstrauss 6 days ago

To collect a bit of data, here is a small patch to lighttpd-1.4.59 to write out malloc statistics.

--- a/src/server.c
+++ b/src/server.c
@@ -68,6 +68,8 @@ static const buffer default_server_tag = { CONST_STR_LEN(PACKAGE_DESC)+1, 0 };
 # include <sys/prctl.h>
 #endif

+#include <malloc.h>
+
 #include "sys-crypto.h" 
 #if defined(USE_OPENSSL_CRYPTO) \
  || defined(USE_MBEDTLS_CRYPTO) \
@@ -1826,6 +1828,10 @@ static void server_handle_sigalrm (server * const srv, time_t min_ts, time_t las
                                if (0 == (min_ts & 0x3f)) { /*(once every 64 secs)*/
                                        /* free excess chunkqueue buffers every 64 secs */
                                        chunkqueue_chunk_pool_clear();
+
+                                       if (0 == (mono_ts & 0xff))
+                                               malloc_stats();
+
                                        /* attempt to restart dead piped loggers every 64 secs */
                                        if (0 == srv->srvconf.max_worker)
                                                fdevent_restart_logger_pipes(min_ts);

The patch will write something like the following out to stderr every 256 seconds (4 min 16 seconds)

Arena 0:
system bytes     =     946176
in use bytes     =     419920
Total (incl. mmap):
system bytes     =    1282048
in use bytes     =     755792
max mmap regions =          2
max mmap bytes   =     335872

ZivHung or flynn, would you mind running with this patch for an hour (during which you see the memory use increase), and share the results? Since MALLOC_ARENA_MAX=1 did not have a noticeable effect for flynn, then it is likely that we will see a larger number of mmap regions.

If a small number of bytes from each mmap region remain in use for long periods of time, then the glibc allocator will not be able to return those regions to the OS until an mmap region is completely unused. As the memory usage is smoother with jemalloc, one possible mitigation might be for me to allocate multiple pools of chunks, preferring to reuse chunks from the original pool and releasing excess chunks from later overflow pools back to the OS. The libc memory allocator would still be leveraged, not replaced. Then again, we have not yet established that the chunks are the source of the issue, though their allocations may contribute.

As an aside, I have already committed code earlier this year to lighttpd git master which reduces memory fragmentation by doubling buffer sizes when reallocating, rather than the historical lighttpd method of adding the needed space and rounding up to the nearest 64-byte boundary. This change should reduce the number of memory reallocations as well as play more nicely with memory allocator buckets. (Note: lighttpd git master should be functional for testing in a staging environment, but I strongly recommend against pushing lighttpd git master straight into production without more thorough testing. Substantial code changes have been made since lighttpd 1.4.59 was released.)
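
The difference between the two growth strategies, roughly (an illustrative sketch, not the actual buffer code):

#include <stddef.h>

/* historical strategy: grow to exactly what is needed, rounded up to
 * the next 64-byte boundary; frequent small reallocations */
static size_t buf_newsize_round64(size_t needed) {
    return (needed + 63) & ~(size_t)63;
}

/* git master strategy: double the current size until the request fits;
 * far fewer reallocations, and sizes map more cleanly onto allocator
 * size classes */
static size_t buf_newsize_double(size_t cur, size_t needed) {
    size_t sz = cur ? cur : 64;
    while (sz < needed) sz <<= 1;
    return sz;
}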

Actions #18

Updated by flynn 2 days ago

I tested different values of chunkqueue-chunk-sz and cannot measure a significant effect on memory usage.

I'll try your memory patches later ...
