Bug #2940
closedmod_authn_ldap/mod_cgi race condition, "Can't contact LDAP server"
Description
There seems to be a race condition bug somewhere between mod_authn_ldap and mod_cgi which manifests itself in
(mod_authn_ldap.c.449) ldap: ldap_sasl_bind_s(): Can't contact LDAP server
messages from lighttpd. This happens when lighttpd gets hit with multiple parallel requests on URLs requiring LDAP auth and being served by CGI scripts.
I noticed this problem when doing git clone from a cgit instance served by lighttpd. The clone operation would fail on a seemingly random object, with HTTP 401 Unauthorized error.
The problem is reproducible with lighttpd v1.4.45, 1.4.51 and master branch (commit 9232145024ae "[core] poll: fdarray uses fd as index, not fde_ndx"). These are all the versions I've tested.
What I've done:
- Online search: "Can't contact LDAP server mod_cgi mod_authn_ldap lighttpd". Doesn't seem relevant.
- Searched this bug tracker: found issue about "mod_auth caching" (implementing this might perhaps workaround this issue).
After that I set out to reproduce and narrow down the bug. I created the following scenarios and have gotten reliable results:
git clone http://server/static-no-auth/repo1.git # success
git clone http://server/static-auth/repo1.git # success
git clone http://server/cgi-no-auth/repo1.git # success
git clone http://server/cgi-auth/repo1.git # failure
repo1.git is a dummy repo with 100 commits. With too few commits (say, 10), there is a pretty good chance of the clone completing. I've never seen the clone succeed with 100 commits.
Alternatively, instead of git clone, running a bunch of curl's in parallel will also trigger the bug.
Even though I've spent many hours on this, I've been unable to write a proper patch. My best "fix" so far is to add a 100ms delay before lighttpd calls ldap_sasl_bind_s(). I've looked at openldap/slapd and lighttpd log files, run under gdb etc.
I set up a git repo for reproducing this bug, it can be found here: https://github.com/bjornfor/lighttpd-auth-ldap-issue.
Updated by gstrauss almost 6 years ago
Thank you for taking the time to try to narrow this down.
lighttpd is single-threaded and mod_authn_ldap is blocking. It is also independent of mod_cgi. However, mod_authn_ldap holds open the connection to the ldap server for reuse. My first thought is that ldap is not setting FD_CLOEXEC on its connection fd, or lighttpd should be doing something additional when mod_cgi calls fork().
Updated by gstrauss over 5 years ago
bjornfor: my apologies for taking so long to get to this after you put in a substantial effort to provide instructions how to reproduce the issue. Thank you very much.
Leveraging your instructions, I have reproduced the issue.
When lots of requests are made in parallel, the "ldap_sasl_bind_s(): Can't contact LDAP server" failure occurs when ldap_sasl_bind_s() aborts after being interrupted by SIGCHLD from exiting CGI processes.
Here is a short-term workaround while I look around to see if there are better solutions to get the desired behavior from ldap_sasl_bind_s():
--- a/src/mod_authn_ldap.c +++ b/src/mod_authn_ldap.c @@ -8,6 +8,7 @@ #include "plugin.h" #include <errno.h> +#include <signal.h> #include <stdlib.h> #include <string.h> @@ -444,11 +445,18 @@ static int mod_authn_ldap_bind(server *srv, LDAP *ld, const char *dn, const char /* RFE: add functionality: LDAP_SASL_EXTERNAL (or GSS-SPNEGO, etc.) */ + sigset_t nsigs, osigs; + sigemptyset(&nsigs); + sigaddset(&nsigs, SIGCHLD); + sigprocmask(SIG_BLOCK, &nsigs, &osigs); + ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL); if (ret != LDAP_SUCCESS) { mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret); } + sigprocmask(SIG_SETMASK, &osigs, NULL); + return ret; }
Adding a caching layer (#2805) would likely reduce the occurrence of interruption by SIGCHLD, but would not eliminate it.
Another option: if I manually repeat the attempt to ldap_sasl_bind_s() for one retry if it fails, then the problem did not recur for me. However, there is still a chance that it would fail on a busy server handling lots of CGI.
--- a/src/mod_authn_ldap.c +++ b/src/mod_authn_ldap.c @@ -446,7 +446,10 @@ static int mod_authn_ldap_bind(server *srv, LDAP *ld, const char *dn, const char ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL); if (ret != LDAP_SUCCESS) { - mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret); + ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL); + if (ret != LDAP_SUCCESS) { + mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret); + } } return ret;
I am not against doing both of the above, though I would prefer if there were a way for Cyrus SASL to be configured to handle EINTR. I am sure there are good reasons for the way the blocking Cyrus SASL routines handle EINTR, e.g. so that it can be interrupted/cancelled by Ctrl-C in a client program, but in this case, that behavior is not desirable. lighttpd is configured to set SIGCHLD handler with SA_RESTART but ldap_sasl_bind_s() must be in a non-restartable system call when the SIGCHLD is received.
Updated by bjornfor over 5 years ago
gstrauss: Thanks for the update and preliminary patches!
Updated by gstrauss over 5 years ago
- Status changed from New to Patch Pending
- Target version changed from 1.4.x to 1.4.54
Solution: ldap_set_option LDAP_OPT_RESTART to handle EINTR on SIGCHLD from CGI
(ldap uses poll(), which is not restartable with sigaction SA_RESTART)
--- a/src/mod_authn_ldap.c +++ b/src/mod_authn_ldap.c @@ -404,6 +404,9 @@ static LDAP * mod_authn_ldap_host_init(server *srv, plugin_config *s) { return NULL; } + /* restart ldap functions if interrupted by a signal, e.g. SIGCHLD */ + ldap_set_option(ld, LDAP_OPT_RESTART, LDAP_OPT_ON); + if (s->auth_ldap_starttls) { /* if no CA file is given, it is ok, as we will use encryption * if the server requires a CAfile it will tell us */
Updated by gstrauss over 5 years ago
- Status changed from Patch Pending to Fixed
- % Done changed from 0 to 100
Applied in changeset ae9cafecea3ca0786dfad260ca064fc824e5ccc9.
Updated by bjornfor over 5 years ago
Thank you!
(I've tested lighttpd v1.4.54 in my bug-reproduction-framework (https://github.com/bjornfor/lighttpd-auth-ldap-issue) and cannot provoke the failure anymore.)
Also available in: Atom