Bug #3074
closed
Looooong list of regexp results in assert() in all versions >= 1.4.56
Description
Hello. I have two large lists of IPs and spam referrers (sourced from nginx ultimate bad bot project) included at the bottom of lighttpd.conf via the include directive.
Every version I tried above 1.4.55 crashes (1.4.56 through 1.4.59):
    $ lighttpd -f /etc/lighttpd/lighttpd.conf -tt
    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used
    zsh: abort (core dumped)  lighttpd -f /etc/lighttpd/lighttpd.conf -tt
The core seems to contain the entire configuration and private information, so I am not eager to attach it here.
I am instead hoping this is going to be reproducible if I attach the offending files.
Updated by gstrauss about 4 years ago
No need to include the core.
The assertion is an explicit assertion in lighttpd.
    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used
This is the code leading to configfile-glue.c line 273:
    unsigned char matches[4096];   /*directives matches (4k is way too many!)*/
    unsigned short contexts[4096]; /*conditions matches (4k is way too many!)*/
    uint32_t n = 0;
    int rc = 1; /* default is success */
    force_assert(sizeof(matches) >= srv->config_context->used);
The code sets an arbitrary limit of 4096 conditions to set an upper bound on stack usage for this routine.
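To get a rough idea of whether a configuration is anywhere near that limit, one could count the condition expressions in the config text. The sketch below is only an approximation and not how lighttpd itself counts contexts: the pattern for what a condition looks like, the helper name `count_conditions`, and the sample config are all assumptions made for illustration, and `include` directives are not followed.

```python
import re

LIMIT = 4096  # size of matches[] in the code quoted above

# Assumption: a condition looks like $HTTP["referer"] =~ "..." or
# $SERVER["socket"] == "..."; comment lines are skipped and include
# directives are NOT followed, so pass the already concatenated text.
CONDITION_RE = re.compile(r'\$[A-Z_]+\["[^"]*"\]\s*(?:==|!=|=~|!~)')

def count_conditions(config_text):
    """Return a rough count of condition expressions in config_text."""
    total = 0
    for line in config_text.splitlines():
        if line.lstrip().startswith("#"):
            continue  # ignore commented-out lines
        total += len(CONDITION_RE.findall(line))
    return total

# Hypothetical sample, not taken from the reporter's files.
sample = '''
server.port = 8080
$HTTP["referer"] =~ "example\\.com" { url.access-deny = ( "" ) }
$HTTP["remote-ip"] == "255.255.255.255" { url.access-deny = ( "" ) }
# $HTTP["referer"] =~ "commented out" { }
'''

n = count_conditions(sample)
print(n, "conditions;", "over" if n > LIMIT else "within", "the limit of", LIMIT)
```

With one condition per list entry, the two lists mentioned in this report (6948 and 9985 lines) would each exceed 4096 on their own.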
While you can safely double the limit, such a large number of conditions is not expected, and lighttpd's memory usage per connection is likely much larger than it would be with the expected, much smaller set of conditions.
All that is to say that there are probably better ways to achieve what you are trying to do.
- One option is to combine the regexes into a large regex with alternations (|)
- Another option is to use mod_magnet and check all those regexes in custom lua code.
Looking at spamreferrers.conf, I would recommend the first option. Doing so should result in a measurable reduction in memory usage in lighttpd per connection, as well as a performance increase from going through the regex engine once rather than 6948 times (spamreferrers.conf).
While you could do similar for badbotipv4.conf (9985 lines), here I would recommend the second option: implement the policy comparison in custom lua code and run it with mod_magnet.
Create two tables in lua, one for referrers and another for IPs, and then look up the referrer and IP. Doing so would likely be much faster than what you are doing now. It appears that you are interested in exact string matches instead of regexes.
I'll think further about your use case and may post more later.
[edit]
update: concatenating referrers into a single string hits default limits for PCRE. While this could be done in a handful of PCRE expressions with slightly shorter (but still very long) strings, a better answer is lua tables.
Updated by anrxc about 4 years ago
Thank you for the explanation and suggestions. This is hosting some personal git repositories so performance was furthest from my mind. I lazily converted the lists and 1.4.55 just happened to accept them.
Updated by gstrauss about 4 years ago
- Subject changed from Long list of regexp crashing all version post 1.4.55 to Looooong list of regexp results in assert() in all versions >= 1.4.56
- Status changed from New to Invalid
- Target version deleted (1.4.x)
Since you're not interested in performance, here is some quick Lua code. I am not an expert, but your inquiry sparked my curiosity about the performance.
lighttpd.conf
    server.modules += ( "mod_magnet" )
    magnet.attract-raw-url-to = ("/path/to/spamrefs.lua")
/path/to/spamrefs.lua
    local bot_ips = {
      ["255.255.255.255"]=1, -- sample IP
    }
    local bad_refs = {
      ["example.com"]=1, -- sample Referer
    }

    -- Referer
    local ref = lighty.request["referer"]
    if (ref ~= nil) then
      ref = ref:lower()           -- lowercase
      ref = ref:gsub("%.+$", "")  -- remove trailing dots
      -- suffix match from longest to shortest name, splitting on dot
      local dot = 1
      while (dot) do
        if (bad_refs[ref] ~= nil) then
          return 403
        end
        dot = ref:find(".", 1, true)
        if dot then
          ref = ref:sub(dot+1)
        end
      end
    end

    -- Remote IP
    local ip = lighty.env["request.remote-ip"]
    if (bot_ips[ip] ~= nil) then  -- exact match on IP string (normalized)
      return 403
    end

    return 0
You can fill in the tables for a working solution, though it is much slower than I had expected when filled in. While a quick test environment can respond to 18,000+ requests per second with one entry in each of those tables (as in the sample code above), when I put the 9985 lines from badbotipv4.conf and the 6948 lines from spamreferrers.conf into the script, the throughput dropped to 100 requests per second. It did not improve much when I tried luajit libraries instead of my system lua libraries. I guess Lua is building those tables as it processes each request. For kicks and giggles, I might experiment with putting the data into an external database which is accessed by lua. That should be faster.
Then again, a quick test with lighttpd 1.4.55 with your original config files pushed only 150 requests per second and resulted in 295 MB resident memory usage (!), whereas the lua solution with lighttpd 1.4.59 used < 9 MB resident memory after the same test and pushed 100 requests per second.
In any case, while I might still post here, or I might create a topic in the Forums (see Forums in the menu bar at top of page), this is not a bug in lighttpd so I am going to close this issue.
Updated by gstrauss about 4 years ago
lighttpd 1.4.59 with some very large regexes (but not too long) accepted by PCRE library can push > 24000 requests per second in the same basic tests I used above, and using slightly more than 7 MB resident memory. By comparison, your original (ab)use of the lighttpd config syntax took 40x (4000%) the memory and was much slower. The combined regexes are 160x (16000%) faster.
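For reference, the ratios above follow from the figures quoted in this thread. The snippet below just redoes that arithmetic; the variable names are mine, and the inputs are the rounded numbers from the comments (so the memory ratio comes out near 42 and is presumably rounded to 40x in the text).

```python
# Figures quoted in this thread (resident memory in MB, throughput in requests/second).
orig_mem_mb, orig_rps = 295, 150      # lighttpd 1.4.55, one condition per list entry
regex_mem_mb, regex_rps = 7, 24000    # lighttpd 1.4.59, batched regexes

mem_ratio = orig_mem_mb / regex_mem_mb    # roughly 42
speed_ratio = regex_rps / orig_rps        # 160

print(f"memory: ~{mem_ratio:.0f}x, throughput: ~{speed_ratio:.0f}x")
```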
I took the very large lists of referrers and IPs and batched them into multiple very long regexes (but not too long). I then used include "spamre" and include "botsre" in my lighttpd.conf. YMMV.
    # (note: implementation details below assume no '|' in input data since we scan and split large string on '|')
    perl -e '
      @a = <>; chomp @a;
      $str = join("|", map { quotemeta($_) } @a);
      while ($str ne "") {
        $pos = index($str, "|", 34000);
        if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); }
        else { $part = $str; $str = ""; }
        print "\$HTTP[\"referer\"] =~ \"(?:\\.|^)(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n";
      }' spamreferrers.list > spamre
    perl -e '
      @a = <>; chomp @a;
      $str = join("|", map { quotemeta($_) } @a);
      while ($str ne "") {
        $pos = index($str, "|", 34000);
        if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); }
        else { $part = $str; $str = ""; }
        print "\$HTTP[\"remote-ip\"] =~ \"^(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n";
      }' badbotipv4.list > botsre
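For readers who find the Perl one-liners above hard to follow, here is a rough Python rendering of the same batching idea: escape each entry, join with `|`, then cut the long string at the first `|` found at or after a given offset. It is a sketch, not a drop-in replacement: the function names are made up, and `re.escape` is used as a stand-in for Perl's `quotemeta`, which escapes a somewhat larger set of characters. Like the Perl, it assumes the input contains no `|`.

```python
import re

def batch_alternations(items, min_len=34000):
    """Join escaped items with '|' and cut the result into chunks at the
    first '|' found at or after offset min_len (cf. index($str, "|", 34000))."""
    joined = "|".join(re.escape(item) for item in items)
    parts = []
    while joined:
        pos = joined.find("|", min_len)
        if pos != -1:
            parts.append(joined[:pos])
            joined = joined[pos + 1:]   # drop the '|' we split on
        else:
            parts.append(joined)
            joined = ""
    return parts

def referer_rules(items, min_len=34000):
    """Format one lighttpd condition per chunk, mirroring the first one-liner."""
    return ['$HTTP["referer"] =~ "(?:\\.|^)(?i:%s)$" { url.access-deny = ( "" ) }' % part
            for part in batch_alternations(items, min_len)]

# Tiny offset only so that the splitting is visible on a short list.
for rule in referer_rules(["a.com", "b.org", "c.net"], min_len=5):
    print(rule)
```

Each chunk ends up a little longer than `min_len` characters, since the cut happens at the next `|` after that offset.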
Updated by gstrauss about 4 years ago
The combined regex alternations are the performance winner for quick solutions.
Native C structures could be much faster but are less flexible, as a lighttpd module would need to be written for this special purpose. (Not difficult, just not as easy for immediate use.)
For a flexible solution, I coded up some lua at AbsoLUAtion and linked reject-bad-actors.lua. Using an mcdb constant database for lookups, I was able to achieve > 14000 requests per second. Separating the data into a database allows it to be updated, or for lighttpd to query the updated data without the need to restart lighttpd (see comments in reject-bad-actors.lua). Using lua allows arbitrary policy extensions to be applied by the lua script writer, and can be maintained independently from lighttpd.
Updated by gstrauss about 4 years ago
I posted a messy and long perl one-liner to mod_access documentation to provide sample code to parse https://raw.githubusercontent.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/master/conf.d/globalblacklist.conf into lighttpd rules. Feedback appreciated.
Updated by flynn 2 days ago
The last perl one-liner on mod_access may produce broken regular expressions: if the user agents contain slashes, they must be escaped: / -> \/. Not a fault, but it would be better if the single dot were escaped too: . -> \.
Updated by gstrauss 2 days ago · Edited
Why do you think the / needs to be escaped? Things need to be quoted properly and lighttpd delineates the regex in "", so " needs to be escaped.
There are two instances of $str .= $_."|"; which might be changed to be the same as the last $str .= quotemeta($_)."|";
[Edit] I took another look. The data in the input file is already in regex format, except for the IPs, which is why quotemeta() is not used in the first two instances. Yes, the input file should backslash-escape dots \. in the input regexes for user-agent. They are already escaped in referer. The perl one-liner could safely be extended to backslash-escape dots which are not already backslash-escaped in the user-agent.
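As an illustration of that last point, escaping only the dots that are not already backslash-escaped could look roughly like this. It is a Python sketch rather than the actual change to the perl one-liner; the function name is hypothetical, and it assumes every bare dot in the user-agent entries is meant literally (a deliberate wildcard such as `.*` or a dot inside a character class would be escaped as well).

```python
import re

def escape_bare_dots(pattern):
    """Backslash-escape dots that are not already escaped.

    The regex matches either an existing escape pair (backslash plus the
    next character) or a bare dot; escape pairs are kept as they are, so
    an escaped backslash followed by a dot is handled correctly too.
    """
    return re.sub(
        r"\\.|\.",
        lambda m: m.group(0) if m.group(0)[0] == "\\" else "\\.",
        pattern,
    )

print(escape_bare_dots("Firefox/7.0"))      # bare dot gets escaped
print(escape_bare_dots(r"example\.com"))    # already escaped, left unchanged
```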
Updated by flynn 1 day ago
I wanted to evaluate this solution and created a separate logfile for every hit, e.g. url.access-deny = ( "" ) -> accesslog.filename = "/var/log/lighttpd/ultimate-bad-bot-blocker-agent.log". This logfile remained empty and I used https://regex101.com/ as regex validator, which reported errors in the regex. After escaping the slashes it works as expected.
Updated by gstrauss 1 day ago
> I used https://regex101.com/ as regex validator, which reported errors in the regex. After escaping the slashes it works as expected.

That part looks to me like a misunderstanding of https://regex101.com/ which expects the regex to be between the two / characters it presents in its UI.

> Maybe lighttpd should report broken regex in the error log.

lighttpd does report regex which fails to compile.
Regarding the original issue of the regex not matching input as desired, I'll take a look to see if I can reproduce. These are pushing the limits of regex. What version of the pcre2 regex library are you using? Which platform?
Updated by gstrauss 1 day ago
I was unable to quickly reproduce. Looks like it works, using the perl script at the end of mod_access to create /path/to/rejections
I will note that the regex for User-Agent and Referer start with an empty pattern and that does not appear to be correct, since it will reject everything when those headers exist. Once the leading (?i:| is replaced with (?i: it works as expected.
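The effect of that leading empty alternative can be seen with a shortened pattern. The sketch below uses Python's `re` module rather than PCRE2, and the two-entry pattern is a made-up stand-in for the generated regex, but an empty alternative behaves the same way in both engines: it matches the empty string, so every input matches.

```python
import re

# Hypothetical, shortened stand-ins for the generated User-Agent regex.
broken = re.compile(r"(?i:|FacebookBot|Firefox/7\.0)")  # leading empty alternative
fixed = re.compile(r"(?i:FacebookBot|Firefox/7\.0)")

for ua in ("FacebookBot", "Firefox/7.0", "curl/8.0"):
    print(ua, "broken:", bool(broken.search(ua)), "fixed:", bool(fixed.search(ua)))
```

Only the fixed pattern lets `curl/8.0` through. Note also that the `/` in `Firefox/7\.0` is left unescaped and still matches; the slash is an ordinary character here.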
    server.document-root = "/dev/shm"
    server.port = 8080
    server.modules += ("mod_access")
    server.modules += ("mod_accesslog")
    accesslog.filename = "/dev/stderr"
    include "/path/to/rejections"
Access log entries with 403 Forbidden are printed for randomly chosen User-Agents, one with a / and one without a /
curl -H "User-Agent: FacebookBot" http://localhost:8080/
curl -H "User-Agent: Firefox/7.0" http://localhost:8080/
My test system is Fedora 41, running lighttpd (tip of master branch). Installed package for PCRE2 is pcre2-10.44-1.fc41.1.x86_64
I don't have time at this very moment to validate removing the | from the leading (?i:| and updating mod_access but will try to do so in the next few days.
Updated by gstrauss about 23 hours ago
I modified the perl at the bottom of mod_access to omit the empty initial values.