[Solved] problem during graceful_restart if lighty is running inside docker container and started by bash script using exec

Added by mello7tre about 4 years ago

Hi, first a little background to explain how lighttpd is started.
I have built a docker container that runs lighty with CMD "/usr/bin/lighttpd -D -f /etc/lighttpd/lighttpd.conf".
But as I also need a cron daemon running, to rotate SSL certs, I changed the container CMD to the following bash script:

#!/bin/bash
cron
exec /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf

So bash first has pid 1, then it executes cron (which goes into the background and will have ppid=1), then exec replaces the bash process with the lighty one, so lighty runs with pid=1.

Now the problem:
Every time I send signal USR1 to lighttpd, the server stops but does not restart:

2020-04-13 17:00:13: (server.c.958) [note] graceful shutdown started
2020-04-13 17:00:13: (server.c.2059) server stopped by UID = 0 PID = 220


So I looked at the code and ran some tests.
The root cause is in the main function in server.c, in the line " while (waitpid(-1, NULL, 0) > 0) ; "

To better understand the issue, here is the output of the command ps ax -o pid,ppid,pgrp,user,args in the running container:

  PID  PPID  PGRP USER     COMMAND
    1     0     1 www-data /usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf
    8     1     8 root     cron
   31     1     1 www-data /usr/bin/php-cgi
   32     1     1 www-data /usr/bin/php-cgi
   33     1     1 www-data /usr/bin/php-cgi
   34     1     1 www-data /usr/bin/php-cgi

As you can see, as expected, cron's parent pid is equal to lighttpd's pid.

When we execute waitpid with -1 as the first argument, " we wait for any child process ".
So lighty waits for cron to exit too, but cron will never exit and lighty will wait indefinitely.

I tried rebuilding lighty, replacing -1 with 0 to " wait for any child process whose process group ID is equal to that of the calling process ".
As you can see from the ps output, the process group of php-cgi is 1, equal to lighty's pid, while the process group of cron is different.

I tested it and it successfully restarted after receiving USR1.
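A minimal C sketch of the difference between the two waitpid() selectors, assuming a POSIX system. It is not lighttpd code: reap_children and demo_same_group_reap are hypothetical names, the short-lived child stands in for php-cgi, and the long-lived child in its own process group stands in for cron.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Reap children matching `selector` until waitpid() reports none are left.
 * selector -1: any child; selector 0: children in the caller's process group.
 * Returns how many children were reaped. */
static int reap_children(pid_t selector)
{
    int reaped = 0;
    while (waitpid(selector, NULL, 0) > 0)
        reaped++;
    return reaped;
}

/* Hypothetical demo. Returns the number of children reaped by the
 * waitpid(0, ...) loop, or -1 on error. */
static int demo_same_group_reap(void)
{
    pid_t worker = fork();              /* php-cgi stand-in */
    if (worker < 0) return -1;
    if (worker == 0) _exit(0);          /* same process group, exits at once */

    pid_t longlived = fork();           /* cron stand-in */
    if (longlived < 0) return -1;
    if (longlived == 0) {
        setpgid(0, 0);                  /* leave the parent's process group */
        sleep(30);                      /* outlives the reap loop below */
        _exit(0);
    }
    setpgid(longlived, longlived);      /* also set it here to avoid a race */

    /* With -1 instead of 0, this loop would block until the long-lived
     * child exits. With 0 it stops once the same-group child is reaped. */
    int reaped = reap_children(0);

    kill(longlived, SIGKILL);           /* clean up the stand-in */
    waitpid(longlived, NULL, 0);
    return reaped;
}
```

With selector 0 the loop reaps only the same-group child and then ends, because waitpid() fails with ECHILD when no matching child remains, even though the other-group child is still running.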

Now I am asking you:
Should it be safe to use the process group to identify lighty's children?
In that case, can we use it in place of the current method (replacing waitpid(-1, NULL, 0) with waitpid(0, NULL, 0))?

Best Regards, Alberto.


Replies (6)

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by gstrauss about 4 years ago

How about without the exec? The bash process will stick around while things are running, but the rest might work for you. The shell script could be enhanced to propagate signals, if needed.

#!/bin/bash
cron
/usr/sbin/lighttpd -D -f /etc/lighttpd/lighttpd.conf

Alternatively, why not daemonize 'cron' into a new process group?

A hack might be to have lighttpd start up cron as a "backend" to lighttpd, even though you should make sure no requests actually go there. Then, if you signalled lighttpd to restart, it would signal its children to exit before restarting itself (lighttpd) and its backends.

Regarding waitpid(-1, NULL, 0) vs waitpid(0, NULL, 0), that would be a behavioral change which might affect current usage, however unlikely. The idea is that lighttpd should reap all of its children that have not detached, which is a best-effort to give things a chance to restart cleanly.

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by mello7tre about 4 years ago

Thanks for the fast reply.

I would like to keep the "system" as simple as possible, and with docker in particular I would like to have the main program running with pid 1 (with no other processes around).

About "why not daemonize 'cron' into a new process group": I do not understand. cron is already running in a different process group (8 instead of 1); the problem is its parent pid. Maybe you mean something else...

About the backend hack: cron would probably be executed as www-data, not as root (so other problems would occur; I know there can be workarounds for this, but as I already said I want to keep the "system simple").

I think that all of lighty's children will have the same process group as lighttpd, so using 0 should be safe.
But as you said, this is a behavioral change, so some tests should probably be conducted to be sure, and I do not want to force you.

For now I will keep using the recompiled version, thanks anyway.

Regards, Alberto.

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by gstrauss about 4 years ago

Sorry, I am not as familiar with the Docker environment. I get the impression that your pursuit of perfection has led you to discard other reasonable, slightly less minimal, solutions, e.g. having an extra process, or running cron as the same user as lighttpd for access to the certificates, as long as you also configure any backends such as FastCGI to run under user accounts different than lighttpd. A container, with an isolated namespace and a restricted set of processes running is more well-defined than a general purpose system with many unrelated accounts and services.

> I think that all of lighty's children will have the same process group as lighttpd, so using 0 should be safe.
> But as you said, this is a behavioral change, so some tests should probably be conducted to be sure, and I do not want to force you.

I already know the answer to that: It "should be safe" for you and your usage, but that is not necessarily true for others, even though I expect it to have no ill-effects in the common case. You're welcome to make that change to the lighttpd source code and if it works well for you, that's great, and I see no issues with that change for your usage.

I'll consider the change request, but please don't hold your breath. Even if I choose to make such a change, there is no imminent release of lighttpd that has been scheduled.

[Edit]
An example where this could cause an issue would be if lighttpd spawns a backend, and that backend puts itself in its own process group (setsid()). If that backend was slow to exit after being signalled by lighttpd (for graceful restart), lighttpd would not wait, and lighttpd might restart and then fail to respawn that backend, since it hadn't exited, or have multiple instances which corrupt files used by the backend.

That said, the current behavior of lighttpd is to send a SIGINT to all processes in lighttpd's process group (kill(0, SIGINT)) when lighttpd is performing a graceful restart, so it would be more consistent if lighttpd waited only for those processes, too.
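The process-group scope of kill(0, SIGINT) can be sketched in C as follows. This only illustrates the system call, not lighttpd's restart code; the function name is hypothetical and a POSIX system is assumed. The demo runs inside a forked process that first moves into its own process group, so the signal cannot reach the caller's group.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical demo of kill(0, SIGINT): returns 1 if the signal terminated
 * the same-group child and spared the child in another group, 0 otherwise,
 * -1 on error. */
static int group_signal_spares_other_group(void)
{
    pid_t leader = fork();
    if (leader < 0) return -1;
    if (leader == 0) {
        setpgid(0, 0);                  /* private group for the demo */
        signal(SIGINT, SIG_DFL);        /* children inherit the default */

        pid_t same = fork();            /* backend stand-in, same group */
        if (same < 0) _exit(2);
        if (same == 0) { sleep(30); _exit(0); }

        pid_t other = fork();           /* stand-in with its own group */
        if (other < 0) { kill(same, SIGKILL); _exit(2); }
        if (other == 0) { setpgid(0, 0); sleep(30); _exit(0); }
        setpgid(other, other);          /* set it here too, avoiding a race */

        signal(SIGINT, SIG_IGN);        /* the sender must survive */
        kill(0, SIGINT);                /* every process in our group */

        int status = 0;
        waitpid(same, &status, 0);
        int same_hit = WIFSIGNALED(status) && WTERMSIG(status) == SIGINT;
        int other_alive = waitpid(other, NULL, WNOHANG) == 0;

        kill(other, SIGKILL);           /* clean up the stand-in */
        waitpid(other, NULL, 0);
        _exit(same_hit && other_alive ? 0 : 1);
    }
    int status = 0;
    if (waitpid(leader, &status, 0) < 0) return -1;
    if (!WIFEXITED(status) || WEXITSTATUS(status) == 2) return -1;
    return WEXITSTATUS(status) == 0;
}
```

A child that has moved to another process group is therefore never signalled by kill(0, ...), yet a waitpid(-1, ...) loop would still wait for it.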

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by stbuehler about 4 years ago

I think you really should use a "proper" init if you want to run multiple daemons in a container (as you do). docker run even seems to provide an --init option, which is based on https://github.com/krallin/tini.

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by mello7tre about 4 years ago

Thanks, this advice is really precious, as it solves multiple problems at once.

Normally I avoid running multiple processes in the same container and tend to split them into multiple linked containers, but in this case the simplest way to signal/reload lighttpd after a new https cert has been issued was to have both cron and lighttpd in the same container.

I just tried it and it worked as expected.

Just for info:
for people using the AWS ECS service (based on docker), the --init option is available too, mapped to the option LinuxParameters/InitProcessEnabled.

RE: problem during graceful_restart if lighty is running inside docker container and started by bash script using exec - Added by gstrauss over 3 years ago

Thanks for your update @mello7tre

As recommended by stbuehler, the proper solution is to use docker run --init or an equivalent.

As to lighttpd's use of waitpid(-1, NULL, 0) in main(): various lighttpd modules might start up backend processes through gw_backend.c. Such backend processes may (at their discretion) call setsid() to start a new process group but remain a child of lighttpd, as opposed to detaching (fork(), setsid(), fork()). During graceful shutdown, these processes get signalled and should be reaped by lighttpd. All children still attached to the lighttpd process should be reaped before lighttpd gracefully restarts by returning to the top of the loop in main() to re-init the lighttpd server. This is good, intended behavior in lighttpd, and so I do not plan to change it.

If you must start another process in the same process group, then the shell script should background that process and put it into its own process group, detached from the original process group: the equivalent of fork(), setsid(), fork() and execve() of the grandchild, with both parent and grandparent exiting.
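The fork(), setsid(), fork() sequence mentioned above could look roughly like this in C. It is a sketch under assumptions, not code from lighttpd or from any startup script: spawn_detached is a hypothetical name, and here the caller stays alive and only the intermediate process exits.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] detached via fork(), setsid(), fork()
 * and execvp(). Returns 0 once the intermediate child has been reaped,
 * -1 on error. */
static int spawn_detached(char *const argv[])
{
    pid_t child = fork();
    if (child < 0) return -1;
    if (child == 0) {
        if (setsid() < 0) _exit(127);   /* new session and process group */
        pid_t grandchild = fork();
        if (grandchild < 0) _exit(127);
        if (grandchild > 0) _exit(0);   /* intermediate parent exits */
        execvp(argv[0], argv);          /* grandchild becomes the program */
        _exit(127);                     /* only reached if exec failed */
    }
    int status = 0;
    if (waitpid(child, &status, 0) < 0) return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Once the intermediate process has exited, the orphaned grandchild is adopted by pid 1 of its PID namespace (or by the nearest subreaper). So this removes the process from the caller's children only when the caller is not itself pid 1, which is worth keeping in mind for the container setup described at the top of this thread.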
