Bug #3074

closed

Looooong list of regexp results in assert() in all versions >= 1.4.56

Added by anrxc over 3 years ago. Updated over 3 years ago.

Status: Invalid
Priority: Normal
Category: core
Target version: -
ASK QUESTIONS IN Forums: No

Description

Hello. I have two large lists of IPs and spam referrers (sourced from nginx ultimate bad bot project) included at the bottom of lighttpd.conf via the include directive.

Every version I tried above 1.4.55 crashes, 1.4.56 to 1.4.59:

    $ lighttpd -f /etc/lighttpd/lighttpd.conf -tt
    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used
    zsh: abort (core dumped)  lighttpd -f /etc/lighttpd/lighttpd.conf -tt

The core seems to contain entire configuration and private information so I am not eager to attach it here.

I am instead hoping this is going to be reproducible if I attach the offending files.


Files

lighttpd-V.txt (914 Bytes): lighttpd -V (anrxc, 2021-03-16 23:58)
badbotipv4.conf (647 KB): Large bot IP list (anrxc, 2021-03-16 23:59)
spamreferrers.conf (459 KB): Large spam referrer list (anrxc, 2021-03-16 23:59)
Actions #1

Updated by gstrauss over 3 years ago

No need to include the core.

The failure is an explicit assertion in lighttpd:

    configfile-glue.c.273: assertion failed: sizeof(matches) >= srv->config_context->used

This is the code leading to configfile-glue.c line 273:

    unsigned char matches[4096];   /*directives matches (4k is way too many!)*/
    unsigned short contexts[4096]; /*conditions matches (4k is way too many!)*/
    uint32_t n = 0;
    int rc = 1; /* default is success */
    force_assert(sizeof(matches) >= srv->config_context->used);

The code sets an arbitrary limit of 4096 conditions to set an upper bound on stack usage for this routine.
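
A rough way to see how close a configuration comes to that limit is to count its condition blocks. The sketch below is only an approximation written in Python, not lighttpd's parser. It assumes that every condition opens with `$HTTP[...]`, `$SERVER[...]`, `$REQUEST_HEADER[...]` or `$PHYSICAL[...]` followed by an operator, and that the global scope counts as one context. The names `count_contexts`, `COND` and `LIMIT` are hypothetical, and 4096 is taken from the array sizes quoted above.

```python
import re

# 4096 mirrors the size of matches[] in the snippet quoted above.
LIMIT = 4096

# Approximation: a condition starts with $NAME["key"] followed by an operator.
# This is NOT lighttpd's real parser; comments and strings are handled naively.
COND = re.compile(
    r'\$(?:HTTP|SERVER|REQUEST_HEADER|PHYSICAL)\["[^"]*"\]\s*(?:==|!=|=~|!~)'
)

def count_contexts(config_text):
    """Return an estimate of config contexts: 1 (global) + condition blocks."""
    total = 1
    for line in config_text.splitlines():
        stripped = line.lstrip()
        if stripped.startswith("#"):
            continue  # skip whole-line comments
        total += len(COND.findall(stripped))
    return total

sample = '''
server.port = 80
$HTTP["referer"] =~ "spam\\.example" { url.access-deny = ( "" ) }
# $HTTP["referer"] =~ "commented-out" { }
$HTTP["remote-ip"] == "192.0.2.1" { url.access-deny = ( "" ) }
'''
n = count_contexts(sample)
print(n, "contexts;", "over limit" if n > LIMIT else "within limit")
```

Running the same function over the concatenated text of lighttpd.conf and its included files would give a ballpark figure to compare against the limit.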

While you can safely double the limit, lighttpd does not expect such a large number of conditions, and its memory usage per connection is likely to be much larger than with the much smaller set of conditions it is normally used with.

All that is to say that there are probably better ways to achieve what you are trying to do.

  • One option is to combine the regexes into a large regex with alternations (|)
  • Another option is to use mod_magnet and check all those regexes in custom lua code.

Looking at spamreferrers.conf, I would recommend the first option. Doing so should measurably reduce lighttpd's memory usage per connection and improve performance, since each request goes through the regex engine once rather than 6948 times (the number of entries in spamreferrers.conf).

While you could do the same for badbotipv4.conf (9985 lines), here I would recommend the second option: implement the policy comparison in custom Lua code and run it with mod_magnet.

Create two tables in Lua, one for referrers and another for IPs, and then look up each request's referrer and IP in them. Doing so would likely be much faster than what you are doing now. It appears that you are interested in exact string matches instead of regexes.

I'll think further about your use case and may post more later.

[edit]
update: concatenating referrers into a single string hits default limits for PCRE. While this could be done in a handful of PCRE expressions with slightly shorter (but still very long) strings, a better answer is lua tables.

Actions #2

Updated by anrxc over 3 years ago

Thank you for the explanation and suggestions. This is hosting some personal git repositories so performance was furthest from my mind. I lazily converted the lists and 1.4.55 just happened to accept them.

Actions #3

Updated by gstrauss over 3 years ago

  • Subject changed from Long list of regexp crashing all version post 1.4.55 to Looooong list of regexp results in assert() in all versions >= 1.4.56
  • Status changed from New to Invalid
  • Target version deleted (1.4.x)

Since you're not interested in performance, here is some quick Lua code. I am not an expert, but your inquiry sparked my curiosity about the performance.

lighttpd.conf:

    server.modules += ( "mod_magnet" )
    magnet.attract-raw-url-to = ("/path/to/spamrefs.lua")

/path/to/spamrefs.lua:

    local bot_ips = {
      ["255.255.255.255"]=1,  -- sample IP
    }

    local bad_refs = {
      ["example.com"]=1,      -- sample Referer
    }

    -- Referer
    local ref = lighty.request["referer"]
    if (ref ~= nil) then
      ref = ref:lower()           -- lowercase
      ref = ref:gsub("%.+$", "")  -- remove trailing dots
      -- suffix match from longest to shortest name, splitting on dot
      local dot = 1
      while (dot) do
        if (bad_refs[ref] ~= nil) then
          return 403
        end
        dot = ref:find(".", 1, true)
        if dot then ref = ref:sub(dot+1) end
      end
    end

    -- Remote IP
    local ip = lighty.env["request.remote-ip"]
    if (bot_ips[ip] ~= nil) then  -- exact match on IP string (normalized)
      return 403
    end

    return 0
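
To make the suffix-matching loop in the Lua script easier to follow, here is a small Python sketch of the same walk from the longest name to the shortest. It is an illustration, not a translation of the script: the function names are hypothetical, and unlike the Lua above it also assumes the Referer header may hold a full URL and extracts the host name first.

```python
from urllib.parse import urlsplit

def host_suffixes(host):
    """Yield host, then each shorter suffix obtained by dropping the leftmost label."""
    host = host.lower().rstrip(".")  # lowercase, remove trailing dots
    while host:
        yield host
        _, _, host = host.partition(".")  # drop everything up to the first dot

def is_bad_referer(referer, bad_refs):
    """True if the Referer's host, or any parent domain of it, is in bad_refs."""
    # Assumption: the header holds a full URL; fall back to the raw value otherwise.
    host = urlsplit(referer).hostname or referer
    return any(suffix in bad_refs for suffix in host_suffixes(host))

bad_refs = {"example.com"}  # sample Referer, as in the Lua table
print(list(host_suffixes("www.Spam.Example.com.")))
print(is_bad_referer("https://www.example.com/page", bad_refs))
print(is_bad_referer("https://example.org/", bad_refs))
```

Each candidate costs one set lookup, so a host with k labels needs at most k lookups regardless of how many entries the blocklist holds.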

You can fill in the tables for a working solution, though it is much slower than I had expected when filled in. A quick test environment can respond to 18,000+ requests per second with one entry in each of those tables (as in the sample code above), but when I put the 9985 lines from badbotipv4.conf and the 6948 lines from spamreferrers.conf into the script, the throughput dropped to 100 requests per second. It did not improve much when I tried luajit libraries instead of my system lua libraries. I guess Lua is building those tables as it processes each request. For kicks and giggles, I might experiment with putting the data into an external database which is accessed by Lua. That should be faster.

Then again, a quick test with lighttpd 1.4.55 with your original config files pushed only 150 requests per second and resulted in 295 MB resident memory usage (!), whereas the Lua solution with lighttpd 1.4.59 used < 9 MB resident memory after the same test and pushed 100 requests per second.

In any case, while I might still post here, or I might create a topic in the Forums (see Forums in the menu bar at top of page), this is not a bug in lighttpd so I am going to close this issue.

Actions #4

Updated by gstrauss over 3 years ago

lighttpd 1.4.59 with some very large regexes (but not too long to be accepted by the PCRE library) can push > 24000 requests per second in the same basic tests I used above, while using slightly more than 7 MB resident memory. By comparison, your original (ab)use of the lighttpd config syntax took 40x (4000%) the memory and was much slower. The combined regexes are 160x (16000%) faster.
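
For reference, the ratios can be checked against the figures reported in this thread. The short Python snippet below only redoes that arithmetic. It assumes 7 MB for "slightly more than 7 MB", so the memory ratio is an upper estimate of the true value.

```python
old_mem_mb, new_mem_mb = 295, 7   # resident memory: original config vs. combined regexes
old_rps, new_rps = 150, 24000     # requests per second: original config vs. combined regexes

mem_ratio = old_mem_mb / new_mem_mb  # a little over 42, quoted as roughly 40x
speedup = new_rps / old_rps          # 160

print(f"memory: {mem_ratio:.0f}x, throughput: {speedup:.0f}x")
```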

I took the very large lists of referrers and IPs and batched them into multiple very long regexes (but not too long). I then used include "spamre" and include "botsre" in my lighttpd.conf. YMMV.

    # (note: implementation details below assume no '|' in input data since we scan and split large string on '|')

    perl -e '@a = <>; chomp @a; $str = join("|", map { quotemeta($_) } @a); while ($str ne "") { $pos = index($str, "|", 34000); if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); } else { $part = $str; $str = ""; } print "\$HTTP[\"referer\"] =~ \"(?:\\.|^)(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n"; }' spamreferrers.list > spamre

    perl -e '@a = <>; chomp @a; $str = join("|", map { quotemeta($_) } @a); while ($str ne "") { $pos = index($str, "|", 34000); if ($pos != -1) { $part = substr($str, 0, $pos); substr($str, 0, $pos+1, ""); } else { $part = $str; $str = ""; } print "\$HTTP[\"remote-ip\"] =~ \"^(?i:", $part, ")\$\" { url.access-deny = ( \"\" ) }\n"; }' badbotipv4.list > botsre
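
The one-liners above are dense, so here is a more readable sketch of the same idea in Python. It is not a line-by-line port. Instead of joining everything and then splitting the long string on `|`, it accumulates whole escaped entries until a batch reaches the length threshold, which sidesteps the "no `|` in input data" caveat. The function names and the `max_len` parameter are hypothetical. The 34000 default is the offset used in the one-liners, and `re.escape` stands in for Perl's `quotemeta` (the two do not escape exactly the same set of characters).

```python
import re

def batch_alternations(entries, max_len=34000):
    """Group escaped entries into '|'-joined strings, closing a batch once it reaches max_len."""
    batches, current, size = [], [], 0
    for entry in entries:
        escaped = re.escape(entry)
        current.append(escaped)
        size += len(escaped) + 1  # +1 for the '|' separator
        if size >= max_len:
            batches.append("|".join(current))
            current, size = [], 0
    if current:
        batches.append("|".join(current))
    return batches

def referer_rules(domains, max_len=34000):
    """Return one lighttpd condition line per batch, in the shape the one-liner prints."""
    template = '$HTTP["referer"] =~ "(?:\\.|^)(?i:%s)$" { url.access-deny = ( "" ) }'
    return [template % part for part in batch_alternations(domains, max_len)]

# A tiny threshold forces one rule per entry so the batching is visible.
for line in referer_rules(["spam.example", "junk.example"], max_len=10):
    print(line)
```

Writing the returned lines to a file and pulling that file in with `include` would mirror how `spamre` is used above. As with the Perl version, the threshold may need tuning against the limits of the PCRE build in use.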

Actions #5

Updated by gstrauss over 3 years ago

  • Description updated

Actions #6

Updated by gstrauss over 3 years ago

Combining the regexes into alternations is the performance winner among the quick solutions.

Native C structures could be much faster but are less flexible, as a lighttpd module would need to be written for this special purpose. (Not difficult, just not as easy for immediate use.)

For a flexible solution, I coded up some Lua and linked it from AbsoLUAtion as reject-bad-actors.lua. Using an mcdb constant database for lookups, I was able to achieve > 14000 requests per second. Separating the data into a database allows the data to be updated and lets lighttpd query the updated data without needing a restart (see comments in reject-bad-actors.lua). Using Lua allows arbitrary policy extensions to be applied by the Lua script writer, and the script can be maintained independently from lighttpd.
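
The general shape of "data in an external database, looked up per request" can be sketched as follows. This is an illustration only: it uses Python's built-in `sqlite3` module as a stand-in for mcdb and does not reflect how reject-bad-actors.lua is implemented. The table name, file name and function names are hypothetical, and opening a connection for every lookup is done for simplicity rather than speed.

```python
import os
import sqlite3
import tempfile

def build_db(path, bad_ips):
    """Write the blocklist to a fresh database file, then swap it into place."""
    tmp = path + ".tmp"
    if os.path.exists(tmp):
        os.remove(tmp)
    con = sqlite3.connect(tmp)
    con.execute("CREATE TABLE bad_ip (ip TEXT PRIMARY KEY)")
    con.executemany("INSERT OR IGNORE INTO bad_ip VALUES (?)", [(ip,) for ip in bad_ips])
    con.commit()
    con.close()
    os.replace(tmp, path)  # rename over the old file so lookups never see a half-written list

def is_blocked(path, ip):
    """Look up one key; the list is read from the database, not rebuilt per request."""
    con = sqlite3.connect(path)
    try:
        row = con.execute("SELECT 1 FROM bad_ip WHERE ip = ?", (ip,)).fetchone()
        return row is not None
    finally:
        con.close()

db_path = os.path.join(tempfile.mkdtemp(), "bad-actors.db")
build_db(db_path, ["192.0.2.1", "198.51.100.7"])
print(is_blocked(db_path, "192.0.2.1"), is_blocked(db_path, "203.0.113.9"))

# Updating the data means rebuilding the file; the lookup code does not change.
build_db(db_path, ["203.0.113.9"])
print(is_blocked(db_path, "192.0.2.1"), is_blocked(db_path, "203.0.113.9"))
```

The design point carried over from the comment above is that the blocklist lives outside the script, so it can be regenerated on its own schedule while the server keeps running.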

Actions #7

Updated by gstrauss over 3 years ago

I posted a messy and long perl one-liner to the mod_access documentation as sample code for parsing https://raw.githubusercontent.com/mitchellkrogza/nginx-ultimate-bad-bot-blocker/master/conf.d/globalblacklist.conf into lighttpd rules. Feedback appreciated.
