mod_authn_ldap/mod_cgi race condition, "Can't contact LDAP server"
There seems to be a race condition bug somewhere between mod_authn_ldap and mod_cgi which manifests itself in
(mod_authn_ldap.c.449) ldap: ldap_sasl_bind_s(): Can't contact LDAP server
messages from lighttpd. This happens when lighttpd gets hit with multiple parallel requests on URLs requiring LDAP auth and being served by CGI scripts.
I noticed this problem when running git clone against a cgit instance served by lighttpd. The clone operation would fail on a seemingly random object, with an HTTP 401 Unauthorized error.
The problem is reproducible with lighttpd v1.4.45, 1.4.51 and the master branch (commit 9232145024ae "[core] poll: fdarray uses fd as index, not fde_ndx"). These are all the versions I've tested.
What I've done:
- Online search: "Can't contact LDAP server mod_cgi mod_authn_ldap lighttpd". None of the results seemed relevant.
- Searched this bug tracker: found an issue about "mod_auth caching" (implementing that might work around this issue).
After that I set out to reproduce and narrow down the bug. I created the following scenarios and have gotten reliable results:
git clone http://server/static-no-auth/repo1.git # success
git clone http://server/static-auth/repo1.git # success
git clone http://server/cgi-no-auth/repo1.git # success
git clone http://server/cgi-auth/repo1.git # failure
repo1.git is a dummy repo with 100 commits. With too few commits (say, 10), there is a pretty good chance of the clone completing. I've never seen the clone succeed with 100 commits.
Alternatively, instead of git clone, running several curl commands in parallel also triggers the bug.
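The parallel-request approach could also be scripted. Below is a minimal Python sketch, not the script used in this report: the helper names `fetch_status` and `hammer`, the credentials, and the request counts are all hypothetical. It sends concurrent requests with HTTP Basic auth and tallies the status codes, so intermittent 401 responses show up in the counts.

```python
import base64
import urllib.error
import urllib.request
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def fetch_status(url, user, password, timeout=10):
    """Return the HTTP status code for one GET request with Basic auth."""
    req = urllib.request.Request(url)
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses (e.g. 401) are raised as HTTPError.
        return err.code


def hammer(fetch, n=50, workers=10):
    """Call fetch() n times across `workers` threads and count the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(lambda _: fetch(), range(n)))
```

For example, `hammer(lambda: fetch_status("http://server/cgi-auth/repo1.git", "user", "secret"))` would return a counter of status codes. A mix of 200 and 401 for valid credentials would match the behaviour described here.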
Even though I've spent many hours on this, I've been unable to write a proper patch. My best "fix" so far is to add a 100ms delay before lighttpd calls ldap_sasl_bind_s(). I've looked at the openldap/slapd and lighttpd log files, run lighttpd under gdb, etc.
I set up a git repo for reproducing this bug; it can be found here: https://github.com/bjornfor/lighttpd-auth-ldap-issue.
Thank you for taking the time to try to narrow this down.
lighttpd is single-threaded and mod_authn_ldap is blocking. It is also independent of mod_cgi. However, mod_authn_ldap holds open the connection to the ldap server for reuse. My first thought is that ldap is not setting FD_CLOEXEC on its connection fd, or lighttpd should be doing something additional when mod_cgi calls fork().
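To illustrate the FD_CLOEXEC hypothesis (it is only a hypothesis at this point, not a confirmed diagnosis), here is a small POSIX-only Python sketch. It is not lighttpd or libldap code; `set_cloexec` and `child_sees_fd` are hypothetical helper names. It shows that a descriptor without FD_CLOEXEC stays open in a fork+exec'd child, which is what would happen to the LDAP connection fd in a CGI process if the flag were not set.

```python
import fcntl
import subprocess
import sys


def set_cloexec(fd, enabled):
    """Set or clear the FD_CLOEXEC flag on fd."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    if enabled:
        flags |= fcntl.FD_CLOEXEC
    else:
        flags &= ~fcntl.FD_CLOEXEC
    fcntl.fcntl(fd, fcntl.F_SETFD, flags)


def child_sees_fd(fd):
    """Return True if a fork+exec'd child process still has fd open."""
    probe = (
        "import os, sys\n"
        "try:\n"
        "    os.fstat(int(sys.argv[1]))\n"
        "except OSError:\n"
        "    sys.exit(1)\n"
    )
    # close_fds=False: descriptors without FD_CLOEXEC survive the exec.
    rc = subprocess.call([sys.executable, "-c", probe, str(fd)],
                         close_fds=False)
    return rc == 0
```

If the ldap connection fd behaves like the first case (flag cleared, child sees it), then every CGI child would hold a copy of the socket to the LDAP server, which would be worth ruling out.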