Bug #3123

closed

child exit causes time jump false positive when server.max-workers is non-zero

Added by praiskup over 2 years ago. Updated over 2 years ago.

Status:
Fixed
Priority:
Normal
Category:
core
Target version:
ASK QUESTIONS IN Forums:
No

Description

We have some historical reasons to run logrotate together with lighttpd with max-workers.
The "max-workers" setting also pushed us towards using "cronolog", not for log rotation,
but to have properly synchronized logs from the multiple processes (according to
the mod_accesslog documentation).

The problem, though, is that since we use logrotate, we cannot easily signal
the cronolog processes to reopen the rotated files. Cronolog never
reopens a file unless the filename changes.

The easiest thing would be to kill the cronolog process from the logrotate config,
but that means that all the server processes are eventually restarted (the
SIG{HUP/USR1} logic in the master process thinks that the killed cronolog
process was actually a Lighty worker process, and that this is a
signal event situation). So it calculates the time between the last interrupt
and the actual cronolog termination (a long time period), and that leads to
a graceful, then forced, restart of all processes.
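For context, the workaround described above would look roughly like the following logrotate stanza. This is an illustration only, not our actual config; the paths are hypothetical:

```
# /etc/logrotate.d/lighttpd -- hypothetical example
/var/log/lighttpd/*.log {
    daily
    missingok
    postrotate
        # cronolog has no "re-open logs" signal, so the only option is to
        # kill it; lighttpd's master then respawns it, which triggers the
        # restart problem described above
        pkill -x cronolog || true
    endscript
}
```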

Long story short, it is not possible to use "logrotate" with "max-workers"
easily, as another tool (like cronolog, but not cronolog itself) is needed.

Any one of these things would help a lot:

- having a way to restart the piped logger using SIGHUP to Lighty's main process,
similarly to how we can reopen the rotated log files
- implementing the pipe (atomic-writes) logger directly in Lighty's master process,
instead of relying on an external utility (cronolog), and turning it on
by default with the max-workers option
- do not handle random log-piping processes as a reason for full restart

Background: https://pagure.io/copr/copr/issue/2001

Actions #1

Updated by praiskup over 2 years ago

do not handle random log-piping processes as a reason for full restart

Do not handle random log-piping process exit as a reason for full restart:
https://github.com/lighttpd/lighttpd1.4/blob/fae1c0519187cb263259990f7b2aa8fdbc2c374a/src/server.c#L1617-L1640

FTR, we use this helper for now:
https://pagure.io/fedora-infra/ansible/blob/013344529aa33e133e95a4fd96703635cab5231f/f/roles/copr/backend/files/copr-lighty-logger
Since it goes through a pipe, we get atomic writes from Lighty and nice logs,
and the helper also handles the SIGHUP from logrotate.

Actions #2

Updated by gstrauss over 2 years ago

  • Status changed from New to Invalid
  • Priority changed from Normal to Low
  • Target version deleted (1.4.xx)

I don't think that you understand the difference between a "bug" and a "feature". You filed this as a bug.

I don't think that you understand how to use a pipe logger.

Please do not post further here. Please see How to get support

Actions #3

Updated by gstrauss over 2 years ago

If you have the ability to patch lighttpd, and you can write a shell script to act as a pipe logger, why not simply patch cronolog (or any other piped logger) to re-open your logs?

We have some historical reasons to run logrotate together with lighttpd with max-workers.

There are very few situations in which lighttpd with max-workers is actually necessary. Have you tested server.max-worker = 0 with the latest lighttpd release, currently lighttpd 1.4.63?

Actions #4

Updated by praiskup over 2 years ago

gstrauss wrote in #note-2:

I don't think that you understand the difference between a "bug" and a "feature". You filed this as a bug.

This is a bug, really. It's unreasonably complicated (maybe impossible)
to use 'logrotate' with lighttpd when 'max-workers' is used. And if anything,
users should be discouraged from using 'logrotate' in the documentation.

I don't think that you understand how to use a pipe logger.

It seems you haven't read the post. Why do you think so?

Please do not post further here. Please see How to get support

I don't need support here. I just wanted to give you feedback that
the configuration experience is, frankly, bad. I thought it could be a bit
more useful than just blogging like:
https://www.ducea.com/2009/03/03/lighttpd-servermax-worker-option-is-this-a-joke/

Actions #5

Updated by gstrauss over 2 years ago

BTW, it is already possible to restart piped loggers (which do not handle it themselves) using a graceful restart of lighttpd.

IFF backend services are run independently from lighttpd (not run from lighttpd via "bin-path"), then it is safe to enable zero-latency graceful restart in lighttpd:
server.feature-flags += ( "server.graceful-restart-bg" => "enable" )
server.systemd-socket-activation = "enable"


I don't think that you understand how to use a pipe logger.

It seems you haven't read the post. Why do you think so?

Having a separate process manage piped logs isolates management of the logs to that separate process, so any desired change in behavior managing the logs should be handled by that separate process.
cronolog supports --symlink to allow you to have a stable name for the latest log, while allowing cronolog to rotate logs according to how cronolog was configured. I haven't looked to see why you are using logrotate with cronolog.
Since cronolog does not support a signal to re-open logs, I took a quick look at the code, and doing so would be a fairly small patch.
Even if cronolog does not support a signal, the running of cronolog could be wrapped by a small shell script which catches a signal and restarts its child, i.e. cronolog.
So, you see, everything I just wrote is about management of logs by the piped logger process, or its wrapper, and is independent of lighttpd.
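Such a wrapper could look roughly like the following minimal sketch. It is an illustration only, not anyone's production script: `cat` stands in for cronolog, and the fd-3 trick keeps the accesslog pipe open across restarts.

```shell
#!/bin/sh
# Hypothetical SIGHUP-restarting wrapper for a piped logger.  "cat" stands in
# for cronolog in this sketch; lighttpd would start the wrapper via something
# like accesslog.filename = "|/path/to/wrapper --run".
logger="${LOGGER:-cat}"

start_logger() {
    # explicit stdin redirect from fd 3: background jobs otherwise read
    # /dev/null in non-interactive shells
    $logger <&3 &
    child=$!
}

on_hup() {
    # logrotate sends SIGHUP here after rotating; the freshly started child
    # re-opens its output files
    kill "$child" 2>/dev/null
    wait "$child" 2>/dev/null
    start_logger
}

run_wrapper() {
    exec 3<&0            # keep the accesslog pipe on fd 3 across restarts
    start_logger
    trap on_hup HUP
    # stay alive while the logger child runs; wait is interrupted by SIGHUP,
    # after which on_hup has already replaced $child
    while kill -0 "$child" 2>/dev/null; do
        wait "$child" 2>/dev/null || :
    done
}

# run for real only when asked, so the functions above are safe to source
if [ "${1-}" = "--run" ]; then run_wrapper; fi
```

The logrotate postrotate step would then send SIGHUP to the wrapper's PID instead of killing the logger itself.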

Actions #6

Updated by gstrauss over 2 years ago

  • Subject changed from The mod_accesslog pipe terminated / cronolog to child exit causes time jump false positive when server.max-workers is non-zero
  • Category set to core
  • Status changed from Invalid to New
  • Priority changed from Low to Normal
  • Target version set to 1.4.64

I see in https://pagure.io/copr/copr/issue/2001 you mentioned:

The fact that SIGHUP was actually never delivered to Lighty, only to
cronolog caused that Lighty's log_epoch_secs was too old (actually
initialized with the "main" process, at the initial process tree start).

Therefore the "clock jumped 4106 seconds" (and other, even larger values).
This should be measured between SIGHUP (to Lighty) and the actual child
termination (if it doesn't happen quickly enough, child processes are
force-restarted).

That looks like a bug in lighttpd when server.max-workers is used. I misunderstood your original post above containing:

So it calculates the time between the last interrupt and the actual cronolog termination (a long time period), and that leads to a graceful, then forced, restart of all processes.

Therefore, your request:

- do not handle random log-piping processes as a reason for full restart

is the right answer, but the bug is actually to change how time jumps are detected when server.max-workers is used.

Actions #7

Updated by gstrauss over 2 years ago

but that means that all the server processes are eventually restarted (the
SIG{HUP/USR1} logic in the master process thinks that the killed cronolog
process was actually a Lighty worker process... and that it is actually a
signal event situation.

The observation is correct, but your assessment is not. lighttpd does not mistake the cronolog exit for a lighttpd worker process exiting.

You can disable lighttpd time jump detection and restart by configuring lighttpd:
server.feature-flags += ("server.clock-jump-restart" => 0)

Actions #8

Updated by gstrauss over 2 years ago

  • Status changed from New to Patch Pending

The following patch adjusts how time jumps are detected in the lighttpd parent process when server.max-worker is non-zero.

--- a/src/server.c
+++ b/src/server.c
@@ -247,18 +247,21 @@ server_monotonic_secs (void)
 }

 static unix_time64_t
-server_epoch_secs (server * const srv)
+server_epoch_secs (server * const srv, unix_time64_t mono_ts_delta)
 {
     const unix_time64_t cur_ts = log_epoch_secs;
     const unix_time64_t new_ts = TIME64_CAST(time(NULL));
+    const unix_time64_t new_ts_adj = new_ts - mono_ts_delta;
     /* attempt to detect large clock jump */
-    if (new_ts < cur_ts || new_ts - cur_ts > 300) { /*(5 mins)*/
+    if (new_ts_adj < cur_ts || new_ts_adj - cur_ts > 300) { /*(5 mins)*/
         log_error(srv->errh, __FILE__, __LINE__,
           "warning: clock jumped %lld secs",
-          (long long)((int64_t)new_ts - (int64_t)cur_ts));
+          (long long)((int64_t)new_ts_adj - (int64_t)cur_ts));
         int delta =                             /*(30 mins default)*/
           config_feature_int(srv, "server.clock-jump-restart", 1800);
-        if (delta && (new_ts > cur_ts ? new_ts-cur_ts : cur_ts-new_ts) > delta){
+        if (delta && (new_ts_adj > cur_ts
+                      ? new_ts_adj-cur_ts
+                      : cur_ts-new_ts_adj) > delta) {
             log_error(srv->errh, __FILE__, __LINE__,
               "attempting graceful restart in < ~5 seconds, else hard restart");
             srv->graceful_expire_ts = log_monotonic_secs + 5;
@@ -1613,10 +1616,12 @@ static int server_main_setup (server * const srv, int argc, char **argv) {
                                }
                        } else {
                                int status;
+                               unix_time64_t mono_ts;

                                if (-1 != (pid = fdevent_waitpid_intr(-1, &status))) {
+                                       mono_ts = log_monotonic_secs;
                                        log_monotonic_secs = server_monotonic_secs();
-                                       log_epoch_secs = server_epoch_secs(srv);
+                                       log_epoch_secs = server_epoch_secs(srv, log_monotonic_secs - mono_ts);
                                        if (plugins_call_handle_waitpid(srv, pid, status) != HANDLER_GO_ON) {
                                                if (!timer) alarm((timer = 5));
                                                continue;
@@ -1640,8 +1645,9 @@ static int server_main_setup (server * const srv, int argc, char **argv) {
                                } else {
                                        switch (errno) {
                                        case EINTR:
+                                               mono_ts = log_monotonic_secs;
                                                log_monotonic_secs = server_monotonic_secs();
-                                               log_epoch_secs = server_epoch_secs(srv);
+                                               log_epoch_secs = server_epoch_secs(srv, log_monotonic_secs - mono_ts);
                                                /**
                                                 * if we receive a SIGHUP we have to close our logs ourself as we don't 
                                                 * have the mainloop who can help us here
@@ -1840,7 +1846,7 @@ static void server_handle_sigalrm (server * const srv, unix_time64_t mono_ts, un
                                plugins_call_handle_trigger(srv);

                                log_monotonic_secs = mono_ts;
-                               log_epoch_secs = server_epoch_secs(srv);
+                               log_epoch_secs = server_epoch_secs(srv, 0);

                                /* check idle time limit, if enabled */
                                if (idle_limit && idle_limit < mono_ts - last_active_ts && !graceful_shutdown) {
@@ -1923,7 +1929,7 @@ __attribute_hot__
 __attribute_noinline__
 static void server_main_loop (server * const srv) {
        unix_time64_t last_active_ts = server_monotonic_secs();
-       log_epoch_secs = server_epoch_secs(srv);
+       log_epoch_secs = server_epoch_secs(srv, 0);

        while (!srv_shutdown) {
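The idea behind the patch can be seen in isolation: subtract the time that legitimately elapsed on the monotonic clock (e.g. while the parent was blocked in waitpid()) before comparing the fresh wallclock reading against the cached epoch time. A minimal sketch, with illustrative names rather than lighttpd's:

```c
#include <stdint.h>

/* Sketch of the time-jump check from the patch above.  cur_ts is the cached
 * wallclock time (log_epoch_secs), new_ts the fresh time(NULL) reading, and
 * mono_ts_delta the seconds that elapsed on the monotonic clock since cur_ts
 * was cached.  Only the difference remaining after subtracting the monotonic
 * delta can be a real clock jump. */
static int clock_jumped(int64_t cur_ts, int64_t new_ts,
                        int64_t mono_ts_delta, int64_t threshold)
{
    int64_t new_ts_adj = new_ts - mono_ts_delta;
    int64_t diff = new_ts_adj > cur_ts ? new_ts_adj - cur_ts
                                       : cur_ts - new_ts_adj;
    return diff > threshold;
}
```

With the reporter's numbers: the parent sat in waitpid() for 4106 seconds before cronolog was killed, so new_ts - cur_ts is 4106, but mono_ts_delta is also 4106 and no restart is triggered; before the patch the comparison effectively used a monotonic delta of 0 and fired.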

Actions #9

Updated by praiskup over 2 years ago

gstrauss wrote in #note-8:

The following patch adjusts ..

Thanks for looking at this!

I'm not sure the `mono_ts` is always correctly initialized, but I would have to
look at the code, not just the patch. And today it is too late for me; I'll take
a look later and test. Do you have a pull request somewhere?

Actions #10

Updated by gstrauss over 2 years ago

I'm not sure the `mono_ts` is always correctly initialized

log_monotonic_secs is initialized in server_init()

Do you have a pull request somewhere?

https://git.lighttpd.net/lighttpd/lighttpd1.4/src/branch/personal/gstrauss/master

Actions #11

Updated by gstrauss over 2 years ago

  • Status changed from Patch Pending to Fixed