Allowing multiple "header_regexp" matchers

1. Output of caddy version:

v2.6.2 h1:wKoFIxpmOJLGl3QXoo6PNbYvGW4xLEgo32GPBEjWL8o=

2. How I run Caddy:

caddy.exe run --config Caddyfile

a. System environment:

Any

b. Command:

caddy.exe run --config Caddyfile

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

http://127.0.0.1:8080 {
        @old {
                header_regexp User-Agent (?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)
                header_regexp User-Agent (?i)Firefox/(9|8|7|6|5|4|3|2)
                header_regexp User-Agent (?i)NT\s(6|5|4|3)\.
        }

        respond @old 403

        file_server /static/*
}

3. The problem I’m having:

Unfortunately, as expected, caddy.exe exits with an error.

4. Error messages and/or full log output:

`Error: adapting config using caddyfile: Caddyfile:5 - Error during parsing: header_regexp matcher can only be used once per named matcher, per header field: User-Agent`

5. What I already tried:

Not much. I know this isn’t supported, since the docs say so, but that’s why I’m creating this topic: I would like it to be supported. I don’t see the purpose of the limitation other than the performance hit, but IMO the decision to take that hit should be up to me.

Caddy v1 can OR together multiple matcher tokens of any type with “if_op or” (plus a “rewrite hack” to send a 403). I want the same for v2, except there would obviously be no need for the rewrite hack.

Now I know even if I could do multiple header_regexp it wouldn’t be the most efficient way to block old clients, but please let’s not focus on that.

I guess what I’m trying to achieve is something like the “apache ultimate bad bot blocker” list in Caddy. I’d have the @old matcher that naively blocks old clients, and another that blocks bad bots that annoyingly request my site, though with no more than 60 or so matches.

That is impossible without being able to add multiple header regex matches.

Thanks for your time, and apologies in advance if I’m out of my lane asking to add support for this.

6. Links to relevant resources:

Apache Blacklist for reference

A quick workaround is to blend those regular expressions into a single one:

header_regexp User-Agent (?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)|Firefox/(9|8|7|6|5|4|3|2)|NT\s(6|5|4|3)\.

Here it is at work: regex101: build, test, and debug regex


Thanks for the proposal.
I was already aware I could do this, but it’s just not feasible for some 60 different regex matches; it would be incredibly unreadable and unmanageable. That’s why I’d like to be able to OR multiple matcher tokens of any token type together like in v1.

You could also use the expression matcher instead:

@old expression header_regexp('User-Agent', '(?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)') || header_regexp('User-Agent', '(?i)Firefox/(9|8|7|6|5|4|3|2)')

A good proposal, but I knew I could do this too. Perhaps I should have listed the solutions I knew of but found inadequate. Sorry, everyone.

The only problem is that it’s not multiline, so it would quickly turn unreadable with many regex-matching tokens.

But that is multiple different header regex matches being OR’ed, so it’s already possible in a way. It would just be amazing if this could also be done the way I showed in my Caddyfile, with a way to set matching to OR like in v1. I can’t see that as a breaking change as long as the default is AND.

And sorry for continuously mentioning v1, I’m sure everyone wants to move on from it and I do too, but I just can’t with this (artificial?) limitation.

After some investigation…

On Line 1010

// If there's already a pattern for this field
// then we would end up overwriting the old one

But the generated JSON config for the “match” is an array. If I manually put those 3 “header_regexp” matchers in the “match” array, Caddy doesn’t complain; it works as intended and matches any one of the regex patterns. So I don’t understand this limitation of the Caddyfile.

Working handle/match snippet:

{
   "handle":[
      {
         "handler":"static_response",
         "status_code":403
      }
   ],
   "match":[
      {
         "header_regexp":{
            "User-Agent":{
               "pattern":"(?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)"
            }
         }
      },
      {
         "header_regexp":{
            "User-Agent":{
               "pattern":"(?i)Firefox/(9|8|7|6|5|4|3|2)"
            }
         }
      },
      {
         "header_regexp":{
            "User-Agent":{
               "pattern":"(?i)NT\\s(6|5|4|3)\\."
            }
         }
      }
   ]
}

But the goal is to be able to do this in the Caddyfile alone, so this doesn’t help me too much.

After taking a look at the monstrous Apache Blacklist for reference, I think you’re tackling the problem from the wrong angle and we have an XY problem. Look into combining the map handler with the respective placeholder to label requests, then you can match on the var containing the label. This is close to how the file in your link works. The hint is in the good_bot, bad_bot, and spam_ref sprinkled at the end of every line.
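
As a rough sketch of that idea, the site block below labels each request with the map directive and then matches on the label with the vars matcher. The placeholder name {client_label} and the labels "old" and "ok" are made up for illustration, and the snippet has not been run against v2.6.2, so treat it as a starting point rather than a verified config:

```
http://127.0.0.1:8080 {
	# Label the request based on its User-Agent header.
	# Inputs prefixed with ~ are treated as regular expressions.
	# {client_label} is a hypothetical placeholder name.
	map {header.User-Agent} {client_label} {
		~(?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2) old
		~(?i)Firefox/(9|8|7|6|5|4|3|2)        old
		~(?i)NT\s(6|5|4|3)\.                  old
		default                               ok
	}

	# Match on the label instead of on the header itself.
	@old vars {client_label} old
	respond @old 403

	file_server /static/*
}
```

Each pattern sits on its own line, so a list of 60 or so entries should stay readable, and further labels (for example one for bad bots) could be added as extra output values.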


I just knew someone would have to mention the XY problem. Like I said in the original post, I know this isn’t the most efficient way to block bots, but it’s the most convenient way for me and my website that only I ever use, so I don’t see any problem whatsoever in wanting to do it that way (I’m only doing ~60 regex matches).

But besides all that, it seems that the map directive will cover my use case.

You can write expressions multi-line.

@foo `
	header_regexp('User-Agent', '(?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)')
	|| header_regexp('User-Agent', '(?i)Firefox/(9|8|7|6|5|4|3|2)')
`

We have no plans to work on the matcher syntax to make it possible to define multiple matcher sets on a single directive in the Caddyfile right now. We tried, but it’s… very difficult.

I talked about this during my conference talk last month and why we chose to go this route:

Also, FWIW, I’d just like to point out that the language and tone you’re using in your replies is off-putting to us trying to help you find a solution. Please keep that in mind. If you come to us with a negative mindset, it’s not enjoyable for us to try to help.


It’s not enjoyable to be told in the docs that I can’t do something I could do in v1, without being told why or being given an equivalent way of doing it. I hope you can see how that would be frustrating. Bringing up the XY problem after I asked not to focus on why I want to do it that way wasn’t very cool either, and it felt like I was being talked down to.

Even so, I’m grateful for the attempts to help that I received, so I’m sorry if my frustration leaked into the post. Thank you.


Caddy v2 is a complete rewrite from v1. To assume that you can do everything from v1 in v2 is a mistake. It’s not the same.

The way matchers work in v2 is very different than v1 from a design standpoint.

In v1, each directive was expected to perform its own matching, which meant that each directive had slightly different syntax, or duplicated implementation for matching requests.

In v2, matching was split out into a generalized concept, and matchers can be applied to directives. The JSON config came first (because JSON is the real underlying config language that Caddy understands) and Caddyfile syntax was added afterwards. There’s no such thing as if_op and such in JSON config, because Matt chose to go the route of having implicit ANDs and ORs via JSON structure. If you want to OR then you define separate matcher sets for each condition to OR together.

With named matchers in the Caddyfile, there’s no obvious way to define multiple matcher sets, because a single named matcher is actually a single matcher set. We tried a bunch of ideas, and none of them passed muster. They all had significant problems that made them too hard to implement or had UX problems. For example https://github.com/caddyserver/caddy/pull/4264 was probably the biggest attempt.
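
To illustrate the "one named matcher is one matcher set" point, here is a sketch of how the original config could be split so that each pattern gets its own named matcher and its own respond. The matcher names are invented for this example, and the snippet is untested:

```
http://127.0.0.1:8080 {
	# One header_regexp per named matcher, so each matcher
	# is its own matcher set.
	@old_chrome  header_regexp User-Agent (?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)
	@old_firefox header_regexp User-Agent (?i)Firefox/(9|8|7|6|5|4|3|2)
	@old_windows header_regexp User-Agent (?i)NT\s(6|5|4|3)\.

	# A request matching any one of them gets a 403,
	# which amounts to an OR across the three patterns.
	respond @old_chrome 403
	respond @old_firefox 403
	respond @old_windows 403

	file_server /static/*
}
```

This stays within the existing syntax, but the directive has to be repeated once per pattern, which gets verbose for long lists.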

Now that the built-in matchers are available as functions in CEL, expressions are by far the best solution for writing complex boolean logic. You can write it with code, so it’s a lot easier to understand conceptually (for those comfortable with &&, || and ( ) syntax of course, but that’s not exactly a big ask to be honest).
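
As a hedged example of that kind of logic, the sketch below ORs two of the User-Agent patterns from this thread and then ANDs the result with a negated path check. Exempting /static/* is only an illustrative policy, not something anyone here asked for, and the snippet is untested:

```
# Block old Chrome or Firefox clients, except on /static/*.
@old `
	(
		header_regexp('User-Agent', '(?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)')
		|| header_regexp('User-Agent', '(?i)Firefox/(9|8|7|6|5|4|3|2)')
	)
	&& !path('/static/*')
`
respond @old 403
```

The parentheses make the grouping explicit, which is the part that is awkward to express with plain matcher sets.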

The problem is your tone. You wrote this:

You could instead say:

Thanks for answering. But I did consider the pros and cons of solving the problem this way and I chose to do it this way because it’s the most convenient way for me.

It’s not productive to have that kind of reaction to people trying to provide help. It doesn’t make us want to continue to help. Especially when we’re providing free support, as volunteers.

Thanks for understanding.


You could instead say:

Thanks for answering. But I did consider the pros and cons of solving the problem this way and I chose to do it this way because it’s the most convenient way for me.

… I did

I don’t think you’re understanding the point I’m trying to make. It’s about tone.

Saying:

And:

Have an issue with tone. It comes off as dismissive and stubborn.


We tried, but it’s… very difficult

It does not have to be difficult at all; you just have to be a bit more formal when solving the problem. If we apply language design 101, we can end up with something like this:

@old {
    or {
        header_regexp User-Agent (?i)(Chrome|CriOS)/(9|8|7|6|5|4|3|2)
        header_regexp User-Agent (?i)Firefox/(9|8|7|6|5|4|3|2)
        header_regexp User-Agent (?i)NT\s(6|5|4|3)\.
    }
}

Easy, future-proof, and would do the job. Still, you may have some internal limitations we are not aware of, but from a syntax standpoint it’s easy. Of course, it’s always easier said than done, but still a small data point for your future consideration.

And how would you do that in JSON config? That’s the key. It needs to work in JSON.

We can definitely craft syntax that would work in Caddyfile, but mapping that to something sensible in JSON is not trivial.

Either way, CEL expressions provide way more power overall, with more terse syntax. We don’t really need more ways to do the same thing. Just use CEL expressions if you need complex boolean logic.


And how would you do that in JSON config? That’s the key. It needs to work in JSON.

I might be missing something, but the “match” key in the JSON is already an array, so just append instead of overwriting. In my JSON example above, that worked without any issue.

That’s why I’m having a hard time understanding why an or option is omitted from a matcher when the underlying JSON fully supports it.

It’s an array of MatcherSet types. Named matchers in the Caddyfile currently can only encode a single MatcherSet.

Honestly, this is the kind of situation where I feel I need to say “if you think it’s so easy, try to contribute the code change”. If it was easy, we would’ve done it already.

It’s more complicated than it sounds at the surface level.


Why was it made to be so complicated at code level to append something to an array in the first place?

IMO, it seems like bad code design to not be able to use an array as an… array.

And I could be wrong, but I vaguely remember someone on GitHub trying to help with this limitation (or something similar), but his PRs kept getting denied, even when they didn’t break backwards compatibility, so I won’t be going down that route.

I’m gonna lock this thread. This is getting counterproductive. Your continued use of an accusatory tone is not appreciated.