Project

General

Profile

Actions

Bug #2940

closed

mod_authn_ldap/mod_cgi race condition, "Can't contact LDAP server"

Added by bjornfor over 5 years ago. Updated over 5 years ago.

Status:
Fixed
Priority:
Normal
Category:
mod_auth
Target version:
ASK QUESTIONS IN Forums:

Description

There seems to be a race condition bug somewhere between mod_authn_ldap and mod_cgi which manifests itself in

(mod_authn_ldap.c.449) ldap: ldap_sasl_bind_s(): Can't contact LDAP server

messages from lighttpd. This happens when lighttpd gets hit with multiple parallel requests on URLs requiring LDAP auth and being served by CGI scripts.

I noticed this problem when doing git clone from a cgit instance served by lighttpd. The clone operation would fail on a seemingly random object, with HTTP 401 Unauthorized error.

The problem is reproducible with lighttpd v1.4.45, 1.4.51 and master branch (commit 9232145024ae "[core] poll: fdarray uses fd as index, not fde_ndx"). These are all the versions I've tested.

What I've done:

  • Online search: "Can't contact LDAP server mod_cgi mod_authn_ldap lighttpd". Doesn't seem relevant.
  • Searched this bug tracker: found issue about "mod_auth caching" (implementing this might perhaps workaround this issue).

After that I set out to reproduce and narrow down the bug. I created the following scenarios and have gotten reliable results:

git clone http://server/static-no-auth/repo1.git # success
git clone http://server/static-auth/repo1.git # success
git clone http://server/cgi-no-auth/repo1.git # success
git clone http://server/cgi-auth/repo1.git # failure

repo1.git is a dummy repo with 100 commits. With too few commits (say, 10), there is a pretty good chance of the clone completing. I've never seen the clone succeed with 100 commits.

Alternatively, instead of git clone, running a bunch of curl's in parallel will also trigger the bug.

Even though I've spent many hours on this, I've been unable to write a proper patch. My best "fix" so far is to add a 100ms delay before lighttpd calls ldap_sasl_bind_s(). I've looked at openldap/slapd and lighttpd log files, run under gdb etc.

I set up a git repo for reproducing this bug, it can be found here: https://github.com/bjornfor/lighttpd-auth-ldap-issue.

Actions #1

Updated by gstrauss over 5 years ago

Thank you for taking the time to try to narrow this down.

lighttpd is single-threaded and mod_authn_ldap is blocking. It is also independent of mod_cgi. However, mod_authn_ldap holds open the connection to the ldap server for reuse. My first thought is that ldap is not setting FD_CLOEXEC on its connection fd, or lighttpd should be doing something additional when mod_cgi calls fork().

Actions #2

Updated by gstrauss over 5 years ago

  • Category set to mod_auth
Actions #3

Updated by gstrauss over 5 years ago

bjornfor: my apologies for taking so long to get to this after you put in a substantial effort to provide instructions how to reproduce the issue. Thank you very much.

Leveraging your instructions, I have reproduced the issue.

When lots of requests are made in parallel, the "ldap_sasl_bind_s(): Can't contact LDAP server" failure occurs when ldap_sasl_bind_s() aborts after being interrupted by SIGCHLD from exiting CGI processes.

Here is a short-term workaround while I look around to see if there are better solutions to get the desired behavior from ldap_sasl_bind_s():

--- a/src/mod_authn_ldap.c
+++ b/src/mod_authn_ldap.c
@@ -8,6 +8,7 @@
 #include "plugin.h" 

 #include <errno.h>
+#include <signal.h>
 #include <stdlib.h>
 #include <string.h>

@@ -444,11 +445,18 @@ static int mod_authn_ldap_bind(server *srv, LDAP *ld, const char *dn, const char

     /* RFE: add functionality: LDAP_SASL_EXTERNAL (or GSS-SPNEGO, etc.) */

+    sigset_t nsigs, osigs;
+    sigemptyset(&nsigs);
+    sigaddset(&nsigs, SIGCHLD);
+    sigprocmask(SIG_BLOCK, &nsigs, &osigs);
+
     ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL);
     if (ret != LDAP_SUCCESS) {
         mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret);
     }

+    sigprocmask(SIG_SETMASK, &osigs, NULL);
+
     return ret;
 }

Adding a caching layer (#2805) would likely reduce the occurrence of interruption by SIGCHLD, but would not eliminate it.

Another option: if I manually repeat the attempt to ldap_sasl_bind_s() for one retry if it fails, then the problem did not recur for me. However, there is still a chance that it would fail on a busy server handling lots of CGI.

--- a/src/mod_authn_ldap.c
+++ b/src/mod_authn_ldap.c
@@ -446,7 +446,10 @@ static int mod_authn_ldap_bind(server *srv, LDAP *ld, const char *dn, const char

     ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL);
     if (ret != LDAP_SUCCESS) {
-        mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret);
+        ret = ldap_sasl_bind_s(ld,dn,LDAP_SASL_SIMPLE,&creds,NULL,NULL,NULL);
+        if (ret != LDAP_SUCCESS) {
+            mod_authn_ldap_err(srv, __FILE__, __LINE__, "ldap_sasl_bind_s()", ret);
+        }
     }

     return ret;

I am not against doing both of the above, though I would prefer if there were a way for Cyrus SASL to be configured to handle EINTR. I am sure there are good reasons for the way the blocking Cyrus SASL routines handle EINTR, e.g. so that it can be interrupted/cancelled by Ctrl-C in a client program, but in this case, that behavior is not desirable. lighttpd is configured to set SIGCHLD handler with SA_RESTART but ldap_sasl_bind_s() must be in a non-restartable system call when the SIGCHLD is received.

Actions #4

Updated by bjornfor over 5 years ago

gstrauss: Thanks for the update and preliminary patches!

Actions #5

Updated by gstrauss over 5 years ago

  • Status changed from New to Patch Pending
  • Target version changed from 1.4.x to 1.4.54

Solution: ldap_set_option LDAP_OPT_RESTART to handle EINTR on SIGCHLD from CGI

(ldap uses poll(), which is not restartable with sigaction SA_RESTART)

--- a/src/mod_authn_ldap.c
+++ b/src/mod_authn_ldap.c
@@ -404,6 +404,9 @@ static LDAP * mod_authn_ldap_host_init(server *srv, plugin_config *s) {
         return NULL;
     }

+    /* restart ldap functions if interrupted by a signal, e.g. SIGCHLD */
+    ldap_set_option(ld, LDAP_OPT_RESTART, LDAP_OPT_ON);
+
     if (s->auth_ldap_starttls) {
         /* if no CA file is given, it is ok, as we will use encryption
          * if the server requires a CAfile it will tell us */
Actions #6

Updated by gstrauss over 5 years ago

  • Status changed from Patch Pending to Fixed
  • % Done changed from 0 to 100
Actions #7

Updated by bjornfor over 5 years ago

Thank you!

(I've tested lighttpd v1.4.54 in my bug-reproduction-framework (https://github.com/bjornfor/lighttpd-auth-ldap-issue) and cannot provoke the failure anymore.)

Actions

Also available in: Atom