Bug #3074: Looooong list of regexp results in assert() in all versions >= 1.4.56 - Lighttpd - lighty labs

Bug #3074

closed

Added by anrxc about 4 years ago. Updated about 2 months ago.

Status: Invalid
Priority: Normal
Category: core
Target version: -

Description

Hello. I have two large lists of IPs and spam referrers (sourced from nginx ultimate bad bot project) included at the bottom of lighttpd.conf via the include directive.

Every version I tried after 1.4.55, from 1.4.56 to 1.4.59, crashes:

    $ lighttpd -f /etc/lighttpd/lighttpd.conf -tt
    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used
    zsh: abort (core dumped)  lighttpd -f /etc/lighttpd/lighttpd.conf -tt

The core seems to contain the entire configuration and private information, so I am not eager to attach it here.

I am instead hoping this is reproducible if I attach the offending files.

Files

lighttpd-V.txt (914 Bytes): lighttpd -V output, added by anrxc, 2021-03-16 23:58
badbotipv4.conf (647 KB): large bot IP list, added by anrxc, 2021-03-16 23:59
spamreferrers.conf (459 KB): large spam referrer list, added by anrxc, 2021-03-16 23:59

Updated by gstrauss about 4 years ago

No need to include the core.

The assertion is an explicit assertion in lighttpd:

    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used

This is the code leading to configfile-glue.c line 273:

    unsigned char matches[4096];   /*directives matches (4k is way too many!)*/
    unsigned short contexts[4096]; /*conditions matches (4k is way too many!)*/
    uint32_t n = 0;
    int rc = 1; /* default is success */
    force_assert(sizeof(matches) >= srv->config_context->used);

The code sets an arbitrary limit of 4096 conditions to set an upper bound on stack usage for this routine.
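
As a rough pre-flight check, the following Python sketch estimates how many config contexts a set of config files would create and compares that with the 4096 limit quoted above. It is only a heuristic under stated assumptions: it counts the global context plus one context per `$HTTP[...]`-style condition it finds, and it ignores `else` branches and nested includes. The names `CONDITION_LIMIT`, `CONDITION_RE` and `count_conditions` are made up for this illustration and are not part of lighttpd.

```python
import re

# Assumed limit, taken from the size of the matches[] array quoted above.
CONDITION_LIMIT = 4096

# Heuristic: each $HTTP["..."], $SERVER["..."], etc. opens one condition block.
CONDITION_RE = re.compile(r'\$[A-Z_]+\["[^"]*"\]')

def count_conditions(config_text: str) -> int:
    """Approximate the number of contexts: the global one plus one per condition."""
    count = 1  # the global context
    for line in config_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # skip comment lines
        count += len(CONDITION_RE.findall(stripped))
    return count

# Synthetic input: 5000 one-line referer conditions, similar in shape to the
# attached lists (the host names here are invented).
sample = "\n".join(
    '$HTTP["referer"] =~ "spam%d\\.example" { url.access-deny = ( "" ) }' % i
    for i in range(5000)
)
total = count_conditions(sample)
print(total, "contexts;", "over limit" if total > CONDITION_LIMIT else "within limit")
```

Running something like this against the concatenated config before `lighttpd -tt` gives an early hint that a generated list is too long to be expressed as one condition per entry.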

While you can safely double the limit, such a large number of conditions is not expected, and the memory usage of lighttpd per connection is likely much larger than with expected usage of a much smaller set of conditions.

All that is to say that there are probably better ways to achieve what you are trying to do.

One option is to combine the regexes into one large regex with alternations (|).
Another option is to use mod_magnet and check all those regexes in custom Lua code.

Looking at spamreferrers.conf, I would recommend the first option. Doing so should result in a measurable reduction in memory usage in lighttpd per connection, as well as a performance increase from going through the regex engine once rather than 6948 times (the number of lines in spamreferrers.conf).

While you could do similar for badbotipv4.conf (9985 lines), here I would recommend the second option: implement the policy comparison in custom Lua code and run it with mod_magnet.

Create two tables in Lua, one for referrers and another for IPs, and then look up the referrer and IP in them. Doing so would likely be much faster than what you are doing now. It appears that you are interested in exact string matches rather than regexes.

I'll think further about your use case and may post more later.

[edit]
update: concatenating referrers into a single string hits default limits for PCRE. While this could be done in a handful of PCRE expressions with slightly shorter (but still very long) strings, a better answer is lua tables.

Updated by anrxc about 4 years ago

Thank you for the explanation and suggestions. This is hosting some personal git repositories so performance was furthest from my mind. I lazily converted the lists and 1.4.55 just happened to accept them.

Updated by gstrauss about 4 years ago

Subject changed from Long list of regexp crashing all version post 1.4.55 to Looooong list of regexp results in assert() in all versions >= 1.4.56
Status changed from New to Invalid
Target version deleted (1.4.x)

Since you're not interested in performance, here is some quick Lua code. I am not an expert, but your inquiry sparked my curiosity on the performance.

lighttpd.conf:

    server.modules += ( "mod_magnet" )
    magnet.attract-raw-url-to = ("/path/to/spamrefs.lua")

/path/to/spamrefs.lua:

    local bot_ips = {
      ["255.255.255.255"]=1,  -- sample IP
    }

    local bad_refs = {
      ["example.com"]=1,      -- sample Referer
    }

    -- Referer
    local ref = lighty.request["referer"]
    if (ref ~= nil) then
      ref = ref:lower()           -- lowercase
      ref = ref:gsub("%.+$", "")  -- remove trailing dots
      -- suffix match from longest to shortest name, splitting on dot
      local dot = 1
      while (dot) do
        if (bad_refs[ref] ~= nil) then
          return 403
        end
        dot = ref:find(".", 1, true)
        if dot then ref = ref:sub(dot+1) end
      end
    end

    -- Remote IP
    local ip = lighty.env["request.remote-ip"]
    if (bot_ips[ip] ~= nil) then  -- exact match on IP string (normalized)
      return 403
    end

    return 0
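
For checking a list offline before deploying it, the suffix walk in the Lua above can be mirrored in a few lines of Python. This is a sketch, not part of the posted solution: `matches_bad_referer` is a hypothetical name, and like the Lua it assumes the value being tested is a bare host name rather than a full URL.

```python
def matches_bad_referer(ref: str, bad_refs: set) -> bool:
    """Return True if ref or any of its dot-separated suffixes is in bad_refs."""
    ref = ref.lower().rstrip(".")  # lowercase, remove trailing dots
    while True:
        if ref in bad_refs:
            return True
        dot = ref.find(".")
        if dot == -1:
            return False  # no shorter suffix left to try
        ref = ref[dot + 1:]  # drop the leftmost label and retry

bad = {"example.com"}  # sample entry, as in the Lua table
print(matches_bad_referer("WWW.Example.COM.", bad))
print(matches_bad_referer("notexample.com", bad))
```

Because each step drops a whole label, "notexample.com" is not treated as a match for "example.com", whereas any subdomain of a listed name is.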

You can fill in the tables for a working solution, though it is much slower than I had expected when filled in. A quick test environment can respond to 18,000+ requests per second with one entry in each of those tables (as in the sample code above). When I put the 9985 lines from badbotipv4.conf and the 6948 lines from spamreferrers.conf into the script, the throughput dropped to 100 requests per second. It did not improve much when I tried luajit libraries instead of my system lua libraries. I guess Lua is building those tables as it processes each request. For kicks and giggles, I might experiment with putting the data into an external database which is accessed by lua. That should be faster.

Then again, a quick test with lighttpd 1.4.55 with your original config files pushed only 150 requests per second and resulted in 295 MB resident memory usage (!), whereas the lua solution with lighttpd 1.4.59 used < 9 MB resident memory after the same test and pushed 100 requests per second.

In any case, while I might still post here, or I might create a topic in the Forums (see Forums in the menu bar at top of page), this is not a bug in lighttpd so I am going to close this issue.

Updated by gstrauss about 4 years ago

lighttpd 1.4.59 with some very large regexes (large, but still short enough to be accepted by the PCRE library) can push > 24000 requests per second in the same basic tests I used above, using slightly more than 7 MB resident memory. By comparison, your original (ab)use of the lighttpd config syntax took 40x (4000%) the memory and was much slower. The combined regexes are 160x (16000%) faster.
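
The ratios quoted above follow from the figures reported in this thread (295 MB and 150 requests per second for the original config on 1.4.55; roughly 7 MB and 24000 requests per second for the combined regexes). A quick check, treating "slightly more than 7 MB" and "> 24000" as the round numbers 7 and 24000:

```python
# Figures as reported in this thread; the rounded values are an approximation.
mem_old_mb, mem_new_mb = 295, 7
rps_old, rps_new = 150, 24000

mem_ratio = mem_old_mb / mem_new_mb   # how much more memory the original used
speed_ratio = rps_new / rps_old       # how much faster the combined regexes are

print(f"memory: about {mem_ratio:.0f}x, throughput: {speed_ratio:.0f}x")
```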

I took the very large lists of referrers and IPs and batched them into multiple very long regexes (but not too long). I then used include "spamre" and include "botsre" in my lighttpd.conf. YMMV.

    # (note: implementation details below assume no '|' in input data since we scan and split large string on '|')

    perl -e '@a = <>; chomp @a; $str = join("|", map { quotemeta($_) } @a); while ($str ne "") { $pos = index($str, "|", 34000); if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); } else { $part = $str; $str = ""; } print "\$HTTP[\"referer\"] =~ \"(?:\\.|^)(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n"; }' spamreferrers.list > spamre

    perl -e '@a = <>; chomp @a; $str = join("|", map { quotemeta($_) } @a); while ($str ne "") { $pos = index($str, "|", 34000); if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); } else { $part = $str; $str = ""; } print "\$HTTP[\"remote-ip\"] =~ \"^(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n"; }' badbotipv4.list > botsre
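
For readers who find the one-liners hard to follow, here is a rough Python sketch of the same batching idea: escape each entry, join entries with `|`, and start a new rule whenever the alternation would grow past a length budget. It is not a drop-in replacement. `build_rules` and `format_rule` are hypothetical names, `re.escape` does not escape exactly the same characters as Perl's `quotemeta`, and the sketch keeps each batch under `max_len` whereas the Perl cuts at the first `|` at or after offset 34000. It also assumes the input contains no `"` characters.

```python
import re

def format_rule(field, prefix, batch):
    """Render one lighttpd condition that denies access when the regex matches."""
    return '$HTTP["%s"] =~ "%s%s)$" { url.access-deny = ( "" ) }' % (
        field, prefix, "|".join(batch))

def build_rules(entries, field="referer", prefix=r"(?:\.|^)(?i:", max_len=34000):
    """Yield lighttpd conditions, each holding one batch of escaped alternatives."""
    batch, size = [], 0
    for entry in entries:
        entry = entry.strip()
        if not entry:
            continue  # skip blank lines
        escaped = re.escape(entry)
        # Start a new rule once this alternation would exceed the length budget.
        if batch and size + len(escaped) + 1 > max_len:
            yield format_rule(field, prefix, batch)
            batch, size = [], 0
        batch.append(escaped)
        size += len(escaped) + 1  # +1 for the "|" separator
    if batch:
        yield format_rule(field, prefix, batch)

# Tiny demonstration with invented host names and a deliberately small budget.
for rule in build_rules(["spam.example", "bad-site.example"], max_len=20):
    print(rule)
```

For the IP list, the same function could be called with `field="remote-ip"` and `prefix="^(?i:"` to match the second one-liner.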

Updated by gstrauss about 4 years ago

Description updated (diff)

Updated by gstrauss about 4 years ago

The combined regex alternations are the performance winner for quick solutions.

Native C structures could be much faster but are less flexible, as a lighttpd module would need to be written for this special purpose. (Not difficult, just not as easy for immediate use.)

For a flexible solution, I coded up some lua at AbsoLUAtion and linked reject-bad-actors.lua. Using an mcdb constant database for lookups, I was able to achieve > 14000 requests per second. Separating the data into a database allows it to be updated, or for lighttpd to query the updated data without the need to restart lighttpd (see comments in reject-bad-actors.lua). Using lua allows arbitrary policy extensions to be applied by the lua script writer, and can be maintained independently from lighttpd.
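
The point of moving the data out of the script can be illustrated with a small Python sketch. It swaps in SQLite from the Python standard library as a stand-in for the mcdb constant database, so it says nothing about mcdb's API or performance; `build_db` and `is_blocked` are hypothetical names, and the entries are the sample values from the earlier Lua tables.

```python
import sqlite3

def build_db(conn, bad_ips, bad_refs):
    """Load the block lists into one keyed table (stand-in for a constant database)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS blocked "
        "(kind TEXT, value TEXT, PRIMARY KEY (kind, value))"
    )
    conn.executemany("INSERT OR IGNORE INTO blocked VALUES ('ip', ?)",
                     [(ip,) for ip in bad_ips])
    conn.executemany("INSERT OR IGNORE INTO blocked VALUES ('ref', ?)",
                     [(ref,) for ref in bad_refs])
    conn.commit()

def is_blocked(conn, kind, value):
    """Exact-match lookup; the request handler only queries, it never loads the lists."""
    row = conn.execute(
        "SELECT 1 FROM blocked WHERE kind = ? AND value = ?", (kind, value)
    ).fetchone()
    return row is not None

conn = sqlite3.connect(":memory:")
build_db(conn, ["255.255.255.255"], ["example.com"])
print(is_blocked(conn, "ip", "255.255.255.255"), is_blocked(conn, "ip", "192.0.2.1"))
```

The design choice being illustrated is that the per-request code performs keyed lookups only, so the lists can be rebuilt by a separate process without touching the handler.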

Updated by gstrauss about 4 years ago

I posted a messy and long perl one-liner to the mod_access documentation to provide sample code to parse https://raw.githubusercontent.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/master/conf.d/globalblacklist.conf into lighttpd rules. Feedback appreciated.

Updated by flynn about 2 months ago

The last perl one-liner on mod_access may produce broken regular expressions if the user agents contain slashes; they must be escaped: / -> \/. Not a fault, but it would be better if the single dot were escaped too: . -> \.

Updated by gstrauss about 2 months ago · Edited

Why do you think the / needs to be escaped? Things need to be quoted properly and lighttpd delineates the regex in "", so " needs to be escaped.

There are two instances of $str .= $_."|"; which might be changed to be the same as the last $str .= quotemeta($_)."|";

[Edit] I took another look. The data in the input file is already in regex format, except for the IPs, which is why quotemeta() is not used in the first two instances. Yes, the input file should backslash-escape dots \. in the input regexes for user-agent. They are already escaped in referer. The perl one-liner could safely be extended to backslash-escape dots which are not already backslash-escaped in the user-agent.
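
A minimal sketch of that last idea, in Python rather than in the perl one-liner: escape every dot that is not already preceded by an escaping backslash. `escape_bare_dots` is a hypothetical helper, and it assumes every bare dot in the input is meant literally; a pattern that uses `.` as a wildcard (for example `.*`) would be changed in meaning.

```python
import re

def escape_bare_dots(regex: str) -> str:
    """Backslash-escape dots that are not already escaped."""
    def fix(match):
        token = match.group(0)
        # A backslash followed by any character is an existing escape: keep it.
        return token if token != "." else r"\."
    # Alternation order matters: consume escape pairs first, then bare dots.
    return re.sub(r"\\.|\.", fix, regex, flags=re.DOTALL)

print(escape_bare_dots("Firefox/7.0"))
print(escape_bare_dots(r"already\.escaped"))
```

Scanning escape pairs as a unit avoids the mistake a simple look-behind would make on input such as `\\.`, where the backslash before the dot is itself escaped.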

Updated by flynn about 2 months ago

I wanted to evaluate this solution and created a separate logfile for every hit, e.g. url.access-deny = ( "" ) -> accesslog.filename = "/var/log/lighttpd/ultimate-bad-bot-blocker-agent.log". This logfile remained empty and I used https://regex101.com/ as regex validator, which reported errors in the regex. After escaping the slashes it works as expected.

Updated by flynn about 2 months ago

Maybe lighttpd should report broken regex in the error log.

Updated by gstrauss about 2 months ago

> I used https://regex101.com/ as regex validator, which reported errors in the regex. After escaping the slashes it works as expected.

That part looks to me like a misunderstanding of https://regex101.com/ which expects the regex to be between the two / characters it presents in its UI.
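
To illustrate the point about delimiters: in the regex language itself `/` is an ordinary character, and only tools that wrap the pattern in `/.../` require it to be escaped. The sketch below uses Python's `re` module rather than PCRE2, which behaves the same way for this particular case. `quote_for_config` is a hypothetical helper, and its use of `\"` for an embedded double quote is an assumption about how the config string would be written.

```python
import re

# "/" needs no escaping inside the pattern itself.
pattern = re.compile(r"(?i:Firefox/7\.0)")
print(bool(pattern.search("Mozilla/5.0 firefox/7.0")))

def quote_for_config(regex: str) -> str:
    """Hypothetical helper: wrap a regex in double quotes, escaping embedded quotes."""
    return '"' + regex.replace('"', '\\"') + '"'

# Only the double quote is touched; the slash passes through unchanged.
print(quote_for_config('bad"agent/1\\.0'))
```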

> Maybe lighttpd should report broken regex in the error log.

lighttpd does report regex which fails to compile.

Regarding the original issue of the regex not matching input as desired, I'll take a look to see if I can reproduce. These are pushing the limits of regex. What version of the pcre2 regex library are you using? Which platform?

Updated by gstrauss about 2 months ago

I was unable to quickly reproduce. It looks like it works, using the perl script at the end of mod_access to create /path/to/rejections.

I will note that the regexes for User-Agent and Referer start with an empty pattern and that does not appear to be correct, since it will reject everything when those headers exist. Once the leading (?i:| is replaced with (?i: it works as expected.

    server.document-root = "/dev/shm"
    server.port = 8080
    server.modules += ("mod_access")
    server.modules += ("mod_accesslog")
    accesslog.filename = "/dev/stderr"
    include "/path/to/rejections"

Access log entries with 403 Forbidden are printed for randomly chosen User-Agents, one with a / and one without a /:

    curl -H "User-Agent: FacebookBot" http://localhost:8080/
    curl -H "User-Agent: Firefox/7.0" http://localhost:8080/

My test system is Fedora 41, running lighttpd (tip of master branch). Installed package for PCRE2 is pcre2-10.44-1.fc41.1.x86_64

I don't have time at this very moment to validate removing the | from the leading (?i:| and updating mod_access but will try to do so in the next few days.
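
The effect of the empty leading alternative noted above is easy to demonstrate. The sketch uses Python's `re` module instead of PCRE2 (both treat an empty alternative as matching the empty string), and the `curl/8.0` User-Agent is just an invented example of a client that should not be rejected.

```python
import re

# Leading "|" creates an empty alternative, which matches at any position.
broken = re.compile(r"(?i:|FacebookBot|Firefox/7\.0)")
# Without it, only the listed names match.
fixed = re.compile(r"(?i:FacebookBot|Firefox/7\.0)")

harmless = "curl/8.0"  # hypothetical User-Agent that is not on the list

print("broken matches harmless UA:", broken.search(harmless) is not None)
print("fixed matches harmless UA: ", fixed.search(harmless) is not None)
print("fixed matches listed UA:   ", fixed.search("facebookbot") is not None)
```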

Updated by gstrauss about 2 months ago · Edited

I modified the perl at the bottom of mod_access to omit the empty initial values.

Updated by gstrauss about 2 months ago

> I wanted to evaluate this solution and created a separate logfile for every hit, e.g. url.access-deny = ( "" ) -> accesslog.filename = "/var/log/lighttpd/ultimate-bad-bot-blocker-agent.log".

Works fine for me after changing each line of 'rejections' from { url.access-deny = ( "" ) } to { url.access-deny = ( "" ) accesslog.filename = "/var/log/lighttpd/ultimate-bad-bot-blocker-agent.log" }

Perhaps your config is overwriting accesslog.filename with a condition later in the config file. The last config directive (inside conditions matching the request) wins, so put these rejections at the end of your config file, or at the end of your config file includes. See Configuration: File Syntax, section Conditional Configuration Merging.
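
The "last matching directive wins" behaviour can be pictured with a toy model. This is a deliberately simplified sketch and does not reproduce lighttpd's actual merging rules: `merge_config` is a hypothetical function, each block is a (predicate, directives) pair, and the `/var/log/lighttpd/access.log` path is invented to stand for some later block in the config.

```python
def merge_config(blocks, request):
    """Apply every matching block in file order; later values overwrite earlier ones."""
    settings = {}
    for matches, directives in blocks:
        if matches(request):
            settings.update(directives)
    return settings

BOT_LOG = "/var/log/lighttpd/ultimate-bad-bot-blocker-agent.log"

def is_bot(req):
    return "FacebookBot" in req["user-agent"]

def any_request(req):
    return True

# Rejections placed before another block that also sets accesslog.filename.
rejections_first = [
    (is_bot, {"url.access-deny": [""], "accesslog.filename": BOT_LOG}),
    (any_request, {"accesslog.filename": "/var/log/lighttpd/access.log"}),
]
# Same blocks with the rejections moved to the end of the file.
rejections_last = list(reversed(rejections_first))

request = {"user-agent": "FacebookBot"}
print(merge_config(rejections_first, request)["accesslog.filename"])
print(merge_config(rejections_last, request)["accesslog.filename"])
```

In this model the request is denied in both orderings, but the dedicated log file is only used when the rejections come last, which is consistent with an empty bot log when a later block overrides the filename.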
