Bug #3103

closed

Constant growing memory usage (w/ hung backend; no timeout)

Added by flynn over 2 years ago. Updated over 2 years ago.

Status: Duplicate
Priority: Normal
Category: core
Target version:

Description

After fixing ticket #3084 the memory usage was perfect for some days.

On Monday morning I updated my git tree, and since then memory usage has grown steadily at about 2 MB per hour.

I tried reverting some changes, but that did not help.


Files

memory.png (84.4 KB): Memory usage last week (flynn, 2021-09-15 05:28)

Related issues 1 (0 open, 1 closed)

Is duplicate of Bug #3086: sockets disabled, out-of-fds with proxy module (Fixed)

Actions #1

Updated by gstrauss over 2 years ago

ick. that's not good.

My guess is that there must be some leak in the patches for #3101, since that is the big change since last week.

Actions #2

Updated by gstrauss over 2 years ago

That memory use appears to be a step function. By chance, are you rotating logs every half-hour? (Do you send SIGHUP to lighttpd every half-hour?) Are you using log files? piped loggers? syslog?

Would you try commenting out the call in src/server.c to fdlog_flushall()?

BTW, you should notice that there is now only one fd used by each log file even if you have duplicated accesslog.filename in lighttpd.conf with the same target.

If you wouldn't mind, please do a "clean" build of lighttpd as a sanity check. Thanks.

Actions #3

Updated by flynn over 2 years ago

I noticed your changes regarding the duplicate log files; they work for me, and the number of file handles is reduced.

I do NOT rotate the log files every 30min, only once a day.

On suspicion, I reverted to the commit before the dedup changes (243510dbb4d79a3866c288a7d6530f6015c5b537), so I cannot comment out fdlog_flushall(); the problem remains.

It must be a change, that came through git-rebasing ... any idea?

Actions #4

Updated by flynn over 2 years ago

I have two servers in test:
  • the low volume server still has very good memory usage with current git version
  • on the high volume server the memory constantly grows

So it is not a general problem, but one tied to a certain request type that I have not been able to identify so far.

Actions #5

Updated by gstrauss over 2 years ago

> It must be a change, that came through git-rebasing ... any idea?

Hmmm. I have a copy of my dev repo on a volume I use for occasionally building on Windows under Cygwin. Comparing that to my current dev branch, the only changes made in history prior to "Tue Aug 31 03:28:45 2021 -0400 [core] remove redundant waitpid() on each backend" were to change the status code on backends to 504 upon timeout, and to set malloc_top_pad from the environment, if set. Are you setting MALLOC_TOP_PAD_ in the environment before starting lighttpd?

Note: This is in my git working copy from 1 Sep.

$ git diff HEAD..f364c8ef36c4b65071f5bd8fef44b20e59ed8b7b
diff --git a/src/gw_backend.c b/src/gw_backend.c
index 698ac096d..6c324c915 100644
--- a/src/gw_backend.c
+++ b/src/gw_backend.c
@@ -2675,6 +2675,8 @@ static void gw_handle_trigger_hctx_timeout(gw_handler_ctx * const hctx, const ch
         } /* else "read" */
     }
     gw_backend_error(hctx, r);
+    if (r->http_status == 500 && !r->resp_body_started && !r->handler_module)
+        r->http_status = 504; /*Gateway Timeout*/
 }

 __attribute_noinline__
diff --git a/src/server.c b/src/server.c
index 7485f04f2..5336b0591 100644
--- a/src/server.c
+++ b/src/server.c
@@ -74,6 +74,7 @@ static const buffer default_server_tag = { CONST_STR_LEN(PACKAGE_DESC)+1, 0 };
 #include <malloc.h>
 #if defined(HAVE_MALLOC_TRIM)
 static int(*malloc_trim_fn)(size_t);
+static size_t malloc_top_pad;
 #endif
 #endif

@@ -1882,7 +1883,7 @@ static void server_handle_sigalrm (server * const srv, unix_time64_t mono_ts, un
                                        request_pool_free();
                                        connections_pool_clear(srv);
                                  #if defined(HAVE_MALLOC_TRIM)
-                                       if (malloc_trim_fn) malloc_trim_fn(524288);
+                                       if (malloc_trim_fn) malloc_trim_fn(malloc_top_pad);
                                  #endif
                                        /* attempt to restart dead piped loggers every 64 secs */
                                        if (0 == srv->srvconf.max_worker)
@@ -2011,6 +2012,14 @@ static int main_init_once (void) {
   #endif

   #if defined(HAVE_MALLOC_TRIM)
+    malloc_top_pad = 524288;
+    {
+        const char * const top_pad_str = getenv("MALLOC_TOP_PAD_");
+        if (top_pad_str) {
+            unsigned long top_pad = strtoul(top_pad_str, NULL, 10);
+            if (top_pad != ULONG_MAX) malloc_top_pad = (size_t)top_pad;
+        }
+    }
   #ifdef LIGHTTPD_STATIC
     malloc_trim_fn = malloc_trim;
   #else
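
For readers who want to try the environment override outside of lighttpd, here is a minimal standalone sketch of the parsing step. It is not the code from the diff above: the function names `parse_top_pad` and `top_pad_from_env` are hypothetical, and the sketch is deliberately stricter than the diff (it rejects signs, trailing garbage, and out-of-range values), which is an assumption about what one might want rather than what lighttpd does.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a decimal byte count such as the value of MALLOC_TOP_PAD_.
 * Returns fallback when str is NULL, does not start with a digit,
 * has trailing non-digit characters, or does not fit in unsigned long.
 * (Hypothetical helper; stricter than the check in the diff above.) */
size_t parse_top_pad(const char *str, size_t fallback) {
    if (str == NULL || *str < '0' || *str > '9') return fallback;
    char *end = NULL;
    errno = 0;
    unsigned long v = strtoul(str, &end, 10);
    if (errno == ERANGE || *end != '\0') return fallback;
    return (size_t)v;
}

/* Read MALLOC_TOP_PAD_ from the environment, defaulting to 524288 bytes,
 * the value that was previously hard-coded in the malloc_trim() call. */
size_t top_pad_from_env(void) {
    return parse_top_pad(getenv("MALLOC_TOP_PAD_"), 524288);
}
```

The result of `top_pad_from_env()` would then be passed as the pad argument to glibc's `malloc_trim()`, as in the server.c hunk.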

Actions #6

Updated by gstrauss over 2 years ago

  • Category set to core

valgrind --tool=memcheck --leak-check=full --track-origins=yes --show-leak-kinds=all --track-fds=yes --log-file=/var/tmp/vg.log lighttpd -D -f /etc/lighttpd/lighttpd.conf

FYI: I do not see any memory leaks with a simple static file test or with lighttpd using mod_proxy to another lighttpd serving a static file, using a 1 MB static file and using h2load to generate parallel requests for HTTP/2 loading.
(Check /var/tmp/vg.log after Ctrl-C of the valgrind instance)

Actions #7

Updated by flynn over 2 years ago

I found the issue, it's related to ticket #3086.

I tested the new timeouts and disabled them again to be able to switch back.

Because the backend became faulty again, lighttpd accumulated open sockets, and each connection held on to allocated memory.
But this shows how important backend timeouts can be for a reliable, long-running service.
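
To illustrate why disabled timeouts lead to this kind of accumulation, here is a simplified sketch of a periodic sweep over per-request backend state. The `backend_ctx` struct and `sweep_expired` function are hypothetical and are not lighttpd's `gw_handler_ctx` or its trigger code; the point is only that a context with its timeout set to 0 is never reclaimed while the backend hangs.

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical per-request backend state; not lighttpd's gw_handler_ctx. */
struct backend_ctx {
    int    fd;           /* socket to the backend, -1 when the slot is free */
    time_t last_active;  /* time of the last successful connect/read/write */
    time_t timeout;      /* allowed seconds of inactivity; 0 disables the check */
};

/* Walk the table once and return how many contexts exceeded their timeout.
 * Expired slots are only marked free here; real code would also close the
 * socket and release the buffers attached to the request. */
size_t sweep_expired(struct backend_ctx *tbl, size_t n, time_t now) {
    size_t expired = 0;
    for (size_t i = 0; i < n; ++i) {
        struct backend_ctx *c = &tbl[i];
        if (c->fd < 0 || c->timeout == 0) continue; /* free slot, or no timeout set */
        if (now - c->last_active > c->timeout) {
            c->fd = -1;
            ++expired;
        }
    }
    return expired;
}
```

With `timeout == 0` on every entry, the sweep never frees anything, so each hung backend connection keeps its socket and memory until the backend responds or the server restarts.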

Actions #8

Updated by gstrauss over 2 years ago

> I tested the new timeouts and disabled them again to be able to switch back.

I am glad you were able to track this down. I have been going through diffs trying to figure out what might have changed recently.

Are the timeouts not working? Were the timeouts enabled when you had the issue? Is there something lighttpd can do better here? How many connections were collected each 30 mins? (A single connection that hangs until a timeout should not consume 2 MB of memory.)

Actions #9

Updated by gstrauss over 2 years ago

Potential improvement: I could set the default for "connect-timeout" to be something like 5 or 10 or 15 seconds.

While setting non-zero defaults for "write-timeout" or "read-timeout" (where there previously was no specific timeout) has a higher possibility of breaking someone's existing usage, setting a reasonable connect timeout might help to avoid the effects of stacking errors, such as what happened with hanging connects to your backends.

I think 5 seconds is a long time to connect, but on some slow systems that are "momentarily" busy, this might be okay to some people. Therefore, I am leaning more towards 10 or 15 seconds as the default "connect-timeout", since this is a behavior change and did not occur before. What do you think?
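
As a rough illustration of what a connect timeout bounds, the sketch below performs a non-blocking `connect()` and waits for completion with `poll()`. The function name `connect_with_timeout` is hypothetical; lighttpd drives backend connects from its own event loop, so this is a standalone sketch of the general technique, not lighttpd's implementation.

```c
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Connect fd to addr, giving up after timeout_ms milliseconds.
 * Returns 0 on success, -1 on failure with errno set (ETIMEDOUT on timeout).
 * The socket is left in non-blocking mode.
 * Simplification: if poll() is interrupted by a signal, the wait restarts
 * with the full timeout instead of the remaining time. */
int connect_with_timeout(int fd, const struct sockaddr *addr,
                         socklen_t len, int timeout_ms) {
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) return -1;

    if (connect(fd, addr, len) == 0) return 0; /* connected immediately */
    if (errno != EINPROGRESS) return -1;       /* hard failure */

    struct pollfd pfd = { .fd = fd, .events = POLLOUT, .revents = 0 };
    int rc;
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);
    if (rc == 0) { errno = ETIMEDOUT; return -1; } /* backend did not answer in time */
    if (rc < 0) return -1;

    /* The socket became writable; check whether the connect succeeded. */
    int err = 0;
    socklen_t errlen = sizeof(err);
    if (getsockopt(fd, SOL_SOCKET, SO_ERROR, &err, &errlen) < 0) return -1;
    if (err != 0) { errno = err; return -1; }
    return 0;
}
```

A caller would pass the timeout in milliseconds, for example 10000 for the 10-second default discussed above, and close the socket when the function returns -1.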

Actions #10

Updated by flynn over 2 years ago

Sorry, I did not express myself clearly enough:
  • the timeouts are working and DO solve the problem
  • after testing ticket #3086, I disabled the timeouts in the configuration, but I did not restart lighttpd to make the change active
  • the new (timeouts disabled) configuration became active with the next restart after the git update, which gave the wrong impression that the git update was the cause

The question of whether backend timeouts should be enabled by default is difficult to answer:
  • on the one side, it is definitely an improvement for a stable long-running service: lower and stable memory usage, and no server fault caused by too many open file handles
  • on the other side, it is a change of behaviour and may result in new bug reports, because people will notice their slow or faulty backends for the first time

But I vote for enabling at least a connect timeout by default, in haproxy I typically set it to 5 seconds.

Actions #11

Updated by gstrauss over 2 years ago

  • Is duplicate of Bug #3086: sockets disabled, out-of-fds with proxy module added
Actions #12

Updated by gstrauss over 2 years ago

  • Status changed from New to Duplicate

Thanks.

> But I vote for enabling at least a connect timeout by default, in haproxy I typically set it to 5 seconds.

I'll compromise and set the "connect-timeout" to 8 seconds by default. :) It is configurable in lighttpd.conf by the admin.

Actions #13

Updated by gstrauss over 2 years ago

  • Subject changed from Constant growing memory usage to Constant growing memory usage (w/ hung backend; no timeout)