Bug #3103: Constant growing memory usage (w/ hung backend; no timeout) - Lighttpd - lighty labs

Bug #3103

closed

Constant growing memory usage (w/ hung backend; no timeout)

Added by flynn over 3 years ago. Updated over 3 years ago.

Status: Duplicate
Priority: Normal
Category: core
Target version: 1.4.60

Description

After fixing ticket #3084, the memory usage was perfect for some days.

On Monday morning I updated my git tree, and since then the memory usage has grown constantly, by about 2 MB per hour.

I tried reverting some changes, but without success.

Files

memory.png (84.4 KB)

Memory usage last week

flynn, 2021-09-15 05:28

Related issues 1 (0 open — 1 closed)

Updated by gstrauss over 3 years ago

ick. that's not good.

My guess is that there must be some leak in the patches for #3101, since that is the big change since last week.

Updated by gstrauss over 3 years ago

That memory use appears to be a step function. By chance, are you rotating logs every half-hour? (Do you send SIGHUP to lighttpd every half-hour?) Are you using log files? piped loggers? syslog?

Would you try commenting out the call in src/server.c to fdlog_flushall()?

BTW, you should notice that there is now only one fd used by each log file even if you have duplicated accesslog.filename in lighttpd.conf with the same target.

If you wouldn't mind, please do a "clean" build of lighttpd as a sanity check. Thanks.

Updated by flynn over 3 years ago

I noticed your changes regarding the multiple log files; they work for me, and the number of file handles is reduced.

I do NOT rotate the log files every 30 minutes, only once a day.

On suspicion, I reverted to a point before the check-in of the dedup changes (243510dbb4d79a3866c288a7d6530f6015c5b537), so I cannot comment out fdlog_flushall(); the problem remains.

It must be a change, that came through git-rebasing ... any idea?

Updated by flynn over 3 years ago

I have two servers in test:

- the low-volume server still has very good memory usage with the current git version
- on the high-volume server, the memory usage grows constantly

So it is not a general problem, but one tied to a certain request type that I have not been able to identify so far.

Updated by gstrauss over 3 years ago

> It must be a change, that came through git-rebasing ... any idea?

Hmmm. I have a copy of my dev repo on a volume I use for occasionally building on Windows under Cygwin. Comparing that to my current dev branch, the only changes made in history prior to "Tue Aug 31 03:28:45 2021 -0400 [core] remove redundant waitpid() on each backend" were to change the status code on backends to 504 upon timeout, and to set malloc_top_pad from the environment, if set. Are you setting MALLOC_TOP_PAD_ in the environment before starting lighttpd?

Note: This is in my git working copy from 1 Sep.

$ git diff HEAD..f364c8ef36c4b65071f5bd8fef44b20e59ed8b7b
diff --git a/src/gw_backend.c b/src/gw_backend.c
index 698ac096d..6c324c915 100644
--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -2675,6 +2675,8 @@ static void gw_handle_trigger_hctx_timeout(gw_handler_ctx * const hctx, const ch
         } /* else "read" */
     }
     gw_backend_error(hctx, r);
+    if (r->http_status == 500 && !r->resp_body_started && !r->handler_module)
+        r->http_status = 504; /*Gateway Timeout*/
 }

 __attribute_noinline__
diff --git a/src/server.c b/src/server.c
index 7485f04f2..5336b0591 100644
--- a/src/server.c
+++ b/src/server.c
@@ -74,6 +74,7 @@ static const buffer default_server_tag = { CONST_STR_LEN(PACKAGE_DESC)+1, 0 };
 #include <malloc.h>
 #if defined(HAVE_MALLOC_TRIM)
 static int(*malloc_trim_fn)(size_t);
+static size_t malloc_top_pad;
 #endif
 #endif

@@ -1882,7 +1883,7 @@ static void server_handle_sigalrm (server * const srv, unix_time64_t mono_ts, un
                                        request_pool_free();
                                        connections_pool_clear(srv);
                                  #if defined(HAVE_MALLOC_TRIM)
-                                       if (malloc_trim_fn) malloc_trim_fn(524288);
+                                       if (malloc_trim_fn) malloc_trim_fn(malloc_top_pad);
                                  #endif
                                        /* attempt to restart dead piped loggers every 64 secs */
                                        if (0 == srv->srvconf.max_worker)
@@ -2011,6 +2012,14 @@ static int main_init_once (void) {
   #endif

   #if defined(HAVE_MALLOC_TRIM)
+    malloc_top_pad = 524288;
+    {
+        const char * const top_pad_str = getenv("MALLOC_TOP_PAD_");
+        if (top_pad_str) {
+            unsigned long top_pad = strtoul(top_pad_str, NULL, 10);
+            if (top_pad != ULONG_MAX) malloc_top_pad = (size_t)top_pad;
+        }
+    }
   #ifdef LIGHTTPD_STATIC
     malloc_trim_fn = malloc_trim;
   #else
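
The MALLOC_TOP_PAD_ handling in the diff above relies on strtoul(), which signals overflow by returning ULONG_MAX but returns 0 for empty or non-numeric input. As a minimal sketch only (this is not lighttpd's code, and `parse_top_pad` is a hypothetical helper name), a stricter parse that falls back to a default on any malformed value might look like this:

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper (not part of lighttpd): parse a decimal byte count.
 * Returns `fallback` when the string is missing, empty, does not start with
 * a digit, has trailing characters, or overflows unsigned long. */
static size_t parse_top_pad(const char *str, size_t fallback) {
    if (str == NULL || str[0] < '0' || str[0] > '9')
        return fallback;              /* rejects NULL, "", signs, whitespace */

    char *end = NULL;
    errno = 0;
    unsigned long value = strtoul(str, &end, 10);

    if (errno == ERANGE || *end != '\0')
        return fallback;              /* overflow or trailing garbage */

    return (size_t)value;
}
```

A caller could then write `parse_top_pad(getenv("MALLOC_TOP_PAD_"), 524288)` to keep the 524288-byte default from the diff whenever the environment variable is unset or unusable.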

Updated by gstrauss over 3 years ago

Category set to core

valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=all --track-fds=yes --log-file=/var/tmp/vg.log lighttpd -D -f /etc/lighttpd/lighttpd.conf

FYI: I do not see any memory leaks with a simple static file test or with lighttpd using mod_proxy to another lighttpd serving a static file, using a 1 MB static file and using h2load to generate parallel requests for HTTP/2 loading.
(Check /var/tmp/vg.log after Ctrl-C of the valgrind instance)

Updated by flynn over 3 years ago

I found the issue; it is related to ticket #3086.

I tested the new timeouts and disabled them again to be able to switch back.

Because the backend became faulty again, lighttpd accumulated open sockets, and each connection held on to allocated memory.
But this shows how important backend timeouts can be for a reliable, long-running service.
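
To illustrate why a hung backend without timeouts leads to steadily growing resource use, here is a deliberately simplified model. It is not lighttpd's implementation: `struct pending_conn` and `sweep_timeouts` are hypothetical names, and a real server would close the socket and free buffers where the comment indicates.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical, simplified model of a request waiting on a backend. */
struct pending_conn {
    int    active;   /* 1 while still waiting on the backend */
    time_t started;  /* time at which the wait began */
};

/* Release every active connection that has waited longer than `timeout`
 * seconds. A timeout of 0 means "no timeout": nothing is released, so
 * connections to a hung backend accumulate. Returns the number released. */
static size_t sweep_timeouts(struct pending_conn *conns, size_t n,
                             time_t now, time_t timeout) {
    size_t released = 0;
    if (timeout == 0)
        return 0;
    for (size_t i = 0; i < n; ++i) {
        if (conns[i].active && now - conns[i].started > timeout) {
            conns[i].active = 0;  /* real code: close socket, free buffers */
            ++released;
        }
    }
    return released;
}
```

With `timeout` set to 0, repeated sweeps never reclaim anything, which matches the behaviour described above: each hung connection keeps its socket and memory until the server is restarted.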

Updated by gstrauss over 3 years ago

> I tested the new timeouts and disabled them again to be able to switch back.

I am glad you were able to track this down. I have been going through diffs trying to figure out what might have changed recently.

Are the timeouts not working? Were the timeouts enabled when you had the issue? Is there something lighttpd can do better here? How many connections were collected each 30 mins? (A single connection that hangs until a timeout should not consume 2 MB of memory.)

Updated by gstrauss over 3 years ago

Potential improvement: I could set the default for "connect-timeout" to be something like 5 or 10 or 15 seconds.

While setting non-zero defaults for "write-timeout" or "read-timeout" (where there previously was no specific timeout) has a higher possibility of breaking someone's existing usage, setting a reasonable connect timeout might help to avoid the effects of stacking errors, such as what happened with hanging connects to your backends.

I think 5 seconds is a long time to connect, but on some slow systems that are "momentarily" busy, this might be okay to some people. Therefore, I am leaning more towards 10 or 15 seconds as the default "connect-timeout", since this is a behavior change and did not occur before. What do you think?

Updated by flynn over 3 years ago

Sorry, I have not expressed myself clearly enough:

- The timeouts are working and DO solve the problem.
- After testing ticket #3086, I disabled the timeouts in the configuration, but I did not restart lighttpd to activate the change.
- The new configuration (timeouts disabled) only became active with the next restart, after the git update, which gave the false impression that the git update was the cause.

The question, whether backend timeouts should be enabled by default, is difficult to answer:

- On the one hand, it is definitely an improvement for a stable, long-running service: lower and stable memory usage, and no server faults caused by too many open file handles.
- On the other hand, it is a change in behaviour and may result in new bug reports, because people will notice their slow or faulty backends for the first time.

But I vote for enabling at least a connect timeout by default; in haproxy I typically set it to 5 seconds.
