This branch resolves several inconsistencies across Caddy's HTTP facilities regarding URI encodings in paths.
I am not entirely sure, but breaking changes are possible if users relied on buggy behavior that was only recently identified and is being remedied here.
This PR mainly affects the `path` matcher and the `rewrite` middleware (including both the `rewrite` and `uri` Caddyfile directives). These are extremely commonly-used Caddy features.
## Background
URIs (essentially the part of the URL after the scheme and authority/host, e.g. `/foo/bar?a=b#frag` -- though servers don't really deal with `#fragment` components) are famous for being inconsistently encoded and parsed. Differences in parsing/handling between servers, proxies, and applications often lead to bugs and security vulnerabilities. For example, a path of `//foo/bar` might be considered equivalent to `/foo/bar` by one piece of infrastructure, and different by another. Similarly, `/foo%2Fbar` might or might not be the same as `/foo/bar`. To a router, they could be different. To an application, they could be the same.
A web server like Caddy is between a rock and a hard place, because it finds itself between untrusted clients who send all manner of inconsistent requests, and other servers or applications who expect the request URI to be _just right_. Caddy is often expected to route requests of all varieties and rewrite/transform them into something the backend application (even if that's just the built-in static file server) can use without confusion. The problem is the requirements and expectations vary widely!
Caddy has had several issues over the years where some users expect a URI like `/foo%2Fbar` to be transformed into `/foo/bar` before being proxied. Some want `/foo/bar` to match `/foo%2Fbar`, while others don't. Some want a matcher like `/secret/*` to match URIs like `//secret/*` or `/secret//*` because they put it behind authentication, and if it doesn't match, auth could be bypassed! Windows treats `/file.php . ..` the same as `/file.php` -- even though they technically have different suffixes and file extensions, [causing routing blunders](https://github.com/caddyserver/caddy/pull/2917). Then imagine a path prefix like `/bands/*/*/` that should match `/bands/Pink/Try/` as well as `/bands/AC%2FDC/T.N.T` -- but if the path matcher normalizes (decodes) URIs before matching, the first URI would work but the second would become `/bands/AC/DC/T.N.T` which doesn't match the pattern anymore. To make matters worse, any given URI has multiple valid encoded forms. `%2F%66%6F%6F%2F%62%61%72` can be decoded to `/foo/bar` just as well as `/foo%2Fbar` can, and everything in between can, too. If routers matched on non-normalized URIs, there would be plenty more security bugs to deal with: a pattern of `/foo/*`, which is expected to be authenticated, would no longer match `/foo%2Fbar` even though they are, according to ratified RFCs, _equivalent_.
In other words, encodings are significant to applications, but normalizing URIs to a consistent form is critical for maintaining security.
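To illustrate the "multiple valid encoded forms" point, here is a minimal Go sketch. `samePath` is a hypothetical helper written for this example (it is not part of Caddy); it only relies on `url.PathUnescape` from the standard library.

```go
package main

import (
	"fmt"
	"net/url"
)

// samePath (hypothetical helper) reports whether two escaped paths
// decode to the same bytes. Invalid escape sequences never compare equal.
func samePath(a, b string) bool {
	da, errA := url.PathUnescape(a)
	db, errB := url.PathUnescape(b)
	return errA == nil && errB == nil && da == db
}

func main() {
	// Different spellings of what a normalizing server sees as one path.
	forms := []string{
		"/foo/bar",
		"/foo%2Fbar",
		"/f%6Fo%2fbar",
		"%2F%66%6F%6F%2F%62%61%72",
	}
	for _, f := range forms {
		fmt.Println(f, samePath(f, "/foo/bar"))
	}
}
```

Every form above decodes to `/foo/bar`, which is why a router that compares raw strings can be bypassed by simply re-encoding a character.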
Let me restate here [what I wrote for the Laravel community](https://github.com/laravel/framework/issues/22125#issuecomment-1208810581) when I started working on this (with minor changes to make sense out of context):
---
RFC 9110, "HTTP Semantics," has a section on HTTP URI normalization, [which says](https://www.rfc-editor.org/rfc/rfc9110.html#name-https-normalization-and-com):
> Two HTTP URIs that are equivalent after normalization (using any method) can be assumed to identify the same resource, and any HTTP component MAY perform normalization. As a result, distinct resources SHOULD NOT be identified by HTTP URIs that are equivalent after normalization.
In other words, `/foo%2Fbar` and `/foo/bar` are equivalent after normalizing, and thus they _SHOULD NOT_ be used for distinct resources. So if you are encoding application data into the path, and that data could possibly have reserved characters / delimiters (like `/`), consider redesigning your API: it is not robust in the harsh HTTP environment.
Note that several RFCs, notably RFCs 3986 and 9110, continually repeat that URI parsing is dependent upon scheme. That's one other problem: we all use the `http://` or `https://` scheme and yet expect applications to handle URIs differently. So of course there's going to be head-butting: we're fighting the design.
To clarify, it is definitely possible for a URI path such as `/band/AC%2fDC/T.N.T` to be "properly" handled by a server application. For this case, simply write a server that decodes everything after `/band/` except `%2f`. :man_shrugging: The problem is that this is difficult _in general_. Depending on the situations in which you do this, you may be opening yourself up to bugs and security holes. This is why Caddy currently handles URIs solely in the unescaped space: it's the "one true" representation of a URI, and normalized HTTP URIs are more or less clearly defined nowadays.
Others might propose a solution to double-encode application data in the path; in other words, have the client send a URI with a path of `/bands/AC%252fDC/T.N.T`. This will probably work, but it's a hack and it [violates spec](https://www.rfc-editor.org/rfc/rfc3986#section-2.4):
> Implementations **must not** percent-encode or decode the same string more than once, as decoding an already decoded string might lead to misinterpreting a percent data octet as the beginning of a percent-encoding, or vice versa in the case of percent-encoding an already percent-encoded string.
Beware of the non-conforming behavior and highlight it very prominently in documentation so you can avoid bugs.
Laravel user @alcaitiff made a comment that some of you may be thinking:
> The router should resolve the route and after that decode parameters, but it does decode the url parameters before resolving the route.
I can't speak for Laravel or what it's doing, but the Go standard library (what Caddy uses), for example, **does** do URL parsing correctly and still has this problem. Go does exactly what you and the spec recommend: it splits the URI into its components and _then_ decodes reserved characters after parsing. It preserves the original, "raw" path in the `RawPath` field and offers the decoded path in the `Path` field. Its [`EscapedPath()`](https://pkg.go.dev/net/url#URL.EscapedPath) method uses `RawPath` _if it is a valid encoding of `Path`_, which is interesting because any given path has multiple valid encodings as I noted above. So if I want to truly "normalize" the URI in Go, I have to re-escape `req.URL.Path` myself and ignore `RawPath` entirely (AFAIK), for example by calling `EscapedPath()` on a URL that only has `Path` set (`url.PathEscape` is meant for a single segment and would escape every `/`). And guess what, this converts `/foo/bar` to... `/foo/bar`. In other words, decoding `/foo%2Fbar` is not reversible without loss of precision. (Unless the HTTP server knows your business logic, more on that in a moment.)
We can write our own logic, though, that uses `RawPath` as a "hint" (as the Go docs say) to maybe replace `/` with `%2f`, but if we've manipulated/rewritten the URI at all, this becomes infeasible because we don't know where or if that instance still exists in the string.
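Here is a small sketch of that Go behavior. `pathForms` is a hypothetical helper and `example.com` is a placeholder host; the re-encoding step uses `EscapedPath()` on a URL that only has `Path` set, since `url.PathEscape` targets a single segment and would also escape `/`.

```go
package main

import (
	"fmt"
	"net/url"
)

// pathForms (hypothetical helper) parses rawURL and returns the decoded
// path, the raw path hint, Go's preferred escaped path, and the path
// re-encoded from the decoded form alone (ignoring RawPath).
func pathForms(rawURL string) (path, rawPath, escaped, reencoded string, err error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return "", "", "", "", err
	}
	// A URL with only Path set has no RawPath hint, so EscapedPath
	// falls back to the default encoding of the decoded path.
	reencoded = (&url.URL{Path: u.Path}).EscapedPath()
	return u.Path, u.RawPath, u.EscapedPath(), reencoded, nil
}

func main() {
	p, raw, esc, re, err := pathForms("http://example.com/bands/AC%2fDC/T.N.T")
	if err != nil {
		panic(err)
	}
	fmt.Println(p)   // /bands/AC/DC/T.N.T
	fmt.Println(raw) // /bands/AC%2fDC/T.N.T
	fmt.Println(esc) // /bands/AC%2fDC/T.N.T
	fmt.Println(re)  // /bands/AC/DC/T.N.T
}
```

Once `RawPath` is discarded, the `%2f` is gone for good: nothing in the decoded path says which `/` used to be data.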
[RFC 3986 section 2](https://www.rfc-editor.org/rfc/rfc3986#section-2.2) states:
> URI producing applications should percent-encode data octets that
> correspond to characters in the reserved set unless these characters
> are specifically allowed by the URI scheme to represent data in that
> component. If a reserved character is found in a URI component and
> no delimiting role is known for that character, then it must be
> interpreted as representing the data octet corresponding to that
> character's encoding in US-ASCII.
The `/` is in the reserved set. Thus it is up to the implementation to determine whether it is data or a delimiter. I guess Laravel doesn't know, and it's frankly safer to assume it's a delimiter and treat it in its normalized form.
So yes, this issue is frustrating. As a web server author, I feel like I need to write software that can read people's minds: is this slash data or is it a delimiter? The router needs more information, because both are very valid ways of interpreting a URI!
---
## The solution
I think the key to this problem is trying to read the developer's mind: is this character supposed to be a delimiter (part of the path) or data? Should we collapse repeated slashes or no?
The answers depend on the context. For routing / path matching, the answer may be one way, for rewriting it may be another, and for proxying it may be yet another depending on the applications being proxied to.
Nginx, Apache, and Caddy all merge slashes by default when matching. However, Nginx and Apache have options to disable that behavior and preserve the slashes, which can lead to security vulnerabilities. All three do path matching (or routing) in the normalized space to mitigate bugs, but as we saw with Laravel, this makes it difficult or impossible to route requests with application data that decodes as path-significant characters like `%2F` (`/`), leaving many developers frustrated.
This PR introduces a somewhat novel solution that allows the developer to convey their intent to the server when doing matching and rewriting.
**Simply put, our solution is to interpret encoded characters and multiple slashes in the configuration as a literal conveyance of the developer's intent.** In other words, we don't blanket-unescape the whole URI every time. We do it byte-for-byte in lock-step with the configured pattern to match, and only unescape if the match pattern is not escaped at that position. Similarly, if a configured path has double slashes `//` in it, we do not merge slashes when comparing paths, because we infer the user's intent is to match repeated slashes.
### Path matching
Path matching (aka routing) is still done in the normalized space. That means if you configure a path matcher of `/foo/bar`, it will match `/foo/bar`, `/foo%2Fbar` and even `%2F%66%6F%6F%2F%62%61%72` because we normalize the URI. This is unchanged from before.
But now if you have a path matcher of `/foo%2Fbar`, it will match `/foo%2Fbar` exactly (the escape sequence is case-insensitive), whereas previously it would have only matched `/foo%252Fbar` (i.e. `%` as data). Now, `/foo%2Fbar` will NOT match `/foo/bar` or `%2Ffoo/bar` because we infer intent from seeing escape sequences in the match pattern as application data, not path delimiters.
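The lock-step idea can be sketched in Go as follows. This is a simplified illustration, not the code in this PR: `matchLiteral`, `hasEscape`, `isHex`, and `unhex` are hypothetical names, and the sketch handles only literal patterns (no wildcards, no slash merging).

```go
package main

import (
	"fmt"
	"strings"
)

// isHex reports whether c is a hexadecimal digit.
func isHex(c byte) bool {
	return ('0' <= c && c <= '9') || ('a' <= c && c <= 'f') || ('A' <= c && c <= 'F')
}

// unhex converts a hexadecimal digit to its numeric value.
// It must only be called on bytes for which isHex is true.
func unhex(c byte) byte {
	switch {
	case '0' <= c && c <= '9':
		return c - '0'
	case 'a' <= c && c <= 'f':
		return c - 'a' + 10
	default:
		return c - 'A' + 10
	}
}

// hasEscape reports whether s has a valid %XX sequence at index i.
func hasEscape(s string, i int) bool {
	return i+2 < len(s) && s[i] == '%' && isHex(s[i+1]) && isHex(s[i+2])
}

// matchLiteral (hypothetical) walks the pattern and the escaped request
// path in lock-step. Where the pattern is escaped, the request must be
// escaped the same way; everywhere else the request byte is decoded first.
func matchLiteral(pattern, escapedPath string) bool {
	i, j := 0, 0 // cursors into pattern and escapedPath
	for i < len(pattern) && j < len(escapedPath) {
		if hasEscape(pattern, i) {
			// Escaped in the pattern: compare in raw space,
			// treating the hex digits case-insensitively.
			if !hasEscape(escapedPath, j) ||
				!strings.EqualFold(pattern[i:i+3], escapedPath[j:j+3]) {
				return false
			}
			i, j = i+3, j+3
			continue
		}
		// Plain byte in the pattern: compare in normalized space.
		c, width := escapedPath[j], 1
		if hasEscape(escapedPath, j) {
			c, width = unhex(escapedPath[j+1])<<4|unhex(escapedPath[j+2]), 3
		}
		if pattern[i] != c {
			return false
		}
		i, j = i+1, j+width
	}
	return i == len(pattern) && j == len(escapedPath)
}

func main() {
	fmt.Println(matchLiteral("/foo/bar", "/foo%2Fbar"))   // true
	fmt.Println(matchLiteral("/foo%2Fbar", "/foo%2fbar")) // true
	fmt.Println(matchLiteral("/foo%2Fbar", "/foo/bar"))   // false
}
```

The point of the design is that the pattern, not the request, decides which space each byte is compared in, so a client cannot change the outcome by re-encoding characters the pattern treats as plain.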
**This logic handily extends to wildcards, too.** Referring to the previous example from our Laravel discussion, if you want to use `/bands/*/*` it is impossible to match a URI of `/bands/AC%2fDC/T.N.T` (in Laravel, too). But with this change, you can use special "escape-wildcard" characters: `/bands/%*/%*` to indicate that the span matched by the wildcard should _not_ be URI-decoded and should be kept in the escaped/raw space.
So now, if you want to allow band names to have a `/` in them, you can simply write `/bands/%*/%*`.
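A rough way to picture the difference between `*` and `%*` is the Go sketch below. It is a deliberate simplification for illustration, not this PR's algorithm: `matchSegments` is a hypothetical function that compares one slash-separated segment at a time and assumes each wildcard stands for exactly one segment.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// matchSegments (hypothetical) matches pattern against the escaped
// request path segment by segment. "%*" accepts a segment in raw space;
// "*" and literal segments are compared in decoded (normalized) space.
func matchSegments(pattern, escapedPath string) bool {
	pats := strings.Split(pattern, "/")
	segs := strings.Split(escapedPath, "/")
	if len(pats) != len(segs) {
		return false
	}
	for i, p := range pats {
		if p == "%*" {
			// Escape-wildcard: keep the span escaped, so an encoded
			// slash (%2F) stays application data.
			continue
		}
		decoded, err := url.PathUnescape(segs[i])
		if err != nil {
			return false
		}
		if strings.Contains(decoded, "/") {
			// In normalized space this slash acts as a delimiter,
			// so the segment no longer lines up with the pattern.
			return false
		}
		if p != "*" && p != decoded {
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(matchSegments("/bands/*/*", "/bands/Pink/Try"))
	fmt.Println(matchSegments("/bands/*/*", "/bands/AC%2fDC/T.N.T"))
	fmt.Println(matchSegments("/bands/%*/%*", "/bands/AC%2fDC/T.N.T"))
}
```

Under these assumptions the plain-wildcard pattern rejects the AC/DC URI while the escape-wildcard pattern accepts it.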
### Double slashes
Similar to escape sequences, we now disable slash merging automatically if the configured pattern has repeated slashes. Previously, it was impossible to match `//foo` because all URIs were normalized. Now, a path matcher of `//foo` will preserve multiple slashes. (A matcher of `/foo` will still match `//foo`.)
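As a minimal sketch of that rule (the function names `mergeSlashes` and `pathForMatching` are hypothetical, not Caddy's API), the decision could look like this in Go:

```go
package main

import (
	"fmt"
	"strings"
)

// mergeSlashes collapses every run of consecutive slashes into one.
func mergeSlashes(p string) string {
	var sb strings.Builder
	prevSlash := false
	for i := 0; i < len(p); i++ {
		if p[i] == '/' {
			if prevSlash {
				continue // skip repeated slash
			}
			prevSlash = true
		} else {
			prevSlash = false
		}
		sb.WriteByte(p[i])
	}
	return sb.String()
}

// pathForMatching (hypothetical) returns the request path to compare
// against pattern: slashes are merged unless the pattern itself contains
// repeated slashes, which is read as intent to match them literally.
func pathForMatching(pattern, reqPath string) string {
	if strings.Contains(pattern, "//") {
		return reqPath
	}
	return mergeSlashes(reqPath)
}

func main() {
	fmt.Println(pathForMatching("/foo", "//foo"))  // /foo
	fmt.Println(pathForMatching("//foo", "//foo")) // //foo
}
```

With a pattern of `/foo`, the request `//foo` is compared as `/foo` and still matches; with a pattern of `//foo`, the request is left untouched so the repeated slash is significant.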
### Rewriting
A common rewriting task is to strip a path prefix or suffix. The logic explained above has also been implemented for these operations, so you can use escaped characters and multiple slashes in your prefix and suffix patterns, and Caddy will rewrite more intuitively and correctly.
For example, if you want to strip a prefix of `//prefix` from `//prefix/foo`, it will work, whereas before it wouldn't find the prefix because it would look at a fully-normalized URI.
Similarly, you can strip prefixes or suffixes with encoded characters. For example, a prefix of `/foo%2Fbar` will rewrite a URI of `/foo%2Fbar/asdf` into `/asdf`, whereas before it wouldn't find the prefix.
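The prefix case can be sketched in Go along the same lock-step lines. `stripPrefix` is a hypothetical function for illustration only; unlike the matcher described above, this sketch compares bytes exactly (including hex digit case) and does no slash merging.

```go
package main

import (
	"fmt"
	"net/url"
)

// stripPrefix (hypothetical) removes prefix from the escaped request path.
// An escape sequence in the path is decoded only where the prefix is not
// itself escaped at that position. If the prefix is not found, the path
// is returned unchanged.
func stripPrefix(escapedPath, prefix string) string {
	i, j := 0, 0 // cursors into escapedPath and prefix
	for i < len(escapedPath) && j < len(prefix) {
		ch := escapedPath[i : i+1]
		if ch == "%" && prefix[j] != '%' && i+3 <= len(escapedPath) {
			decoded, err := url.PathUnescape(escapedPath[i : i+3])
			if err != nil {
				return escapedPath // malformed escape: leave the path alone
			}
			ch = decoded
			i += 2 // an escape sequence is two bytes longer than a plain byte
		}
		if ch != prefix[j:j+1] {
			return escapedPath
		}
		i++
		j++
	}
	if j < len(prefix) {
		return escapedPath // path ended before the whole prefix was seen
	}
	return escapedPath[i:]
}

func main() {
	fmt.Println(stripPrefix("//prefix/foo", "//prefix"))      // /foo
	fmt.Println(stripPrefix("/foo%2Fbar/asdf", "/foo%2Fbar")) // /asdf
}
```

Because the remainder is sliced from the escaped path, any encoding after the prefix is passed through untouched.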
## Is it perfect?
Probably not. Are there bugs? Probably. Have I overlooked things? Almost certainly yes. I'm pretty sure there are nooks and crannies within Caddy where I missed implementing this. Please file a bug report if you need it to work but it doesn't work like you expect.
I'm pretty happy with this approach though. I think it's very useful and I don't know of other mainstream servers or frameworks that implement this behavior. In true Caddy fashion, this should _just work_.
- fixes #4801
- fixes #4923
- fixes #4743