Proxying to multiple upstreams with failover

1. The problem I’m having:

I have a public-facing server running Caddy v2, and another Caddy v2 server in my homelab.

There are two ways to reach the homelab Caddy server: over its public IP, and over a WireGuard tunnel (both reachable from the public-facing machine running Caddy).

I want the public-facing Caddy server to proxy traffic to my homelab Caddy server. If the homelab is reachable over its public IP, it should use that; if that is not working, it should fall back to the WireGuard tunnel.

2. Error messages and/or full log output:

May 13 04:11:21 delbgp caddy[124535]: {"level":"info","ts":1683931281.082274,"logger":"http.handlers.reverse_proxy.health_checker.active","msg":"HTTP request failed","host":"10.0.50.3:443","error":"Get \"https://10.0.50.3:443\": context deadline exceeded"}
May 13 04:11:21 delbgp caddy[124535]: {"level":"info","ts":1683931281.082322,"logger":"http.handlers.reverse_proxy.health_checker.active","msg":"HTTP request failed","host":"43.230.197.97:443","error":"Get \"https://<public-address>:443\": net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)"}

This does not work no matter what I try. It keeps using the public IP and eventually responds with a 502 error. If I enable health checks, it reports that the HTTP request failed for both endpoints, which isn't true at all: I can reach 10.0.50.3:443 with curl just fine.

3. Caddy version:

v2.6.4 h1:2hwYqiRwk1tf3VruhMpLcYTg+11fCdr8S3jhNAdnPy8=

4. How I installed and ran Caddy:

Binary downloaded from Caddy's website; I enabled the layer4 module.

a. System environment:

Arch Linux machine
systemd: 253.4-1
Arch: x64

b. Command:

/opt/caddy/caddy run --environ --config /etc/caddy/caddy.json

c. Service/unit/compose file:


d. My complete Caddy config:

{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "routes": [
            {
              "match": [
                {
                  "host": ["jellyfin.domain"]
                }
              ],
              "handle": [
                {
                  "handler": "subroute",
                  "routes": [
                    {
                      "handle": [
                        {
                          "handler": "reverse_proxy",
                          "upstreams": [
                            {
                              "dial": "<public-addr>:443"
                            },
                            {
                              "dial": "10.0.50.3:443"
                            }
                          ],
                          "transport": {
                            "protocol": "http",
                            "tls": {
                              "server_name": "jellyfin.domain"
                            }
                          },
                          "health_checks": {
                            "active": {
                              "path": "/",
                              "interval": 30
                            },
                            "passive": {
                              "max_fails": 0,
                              "fail_duration": 5
                            }
                          }
                        }
                      ]
                    }
                  ]
                }
              ],
              "terminal": true
            }
          ]
        }
      }
    }
  }
}

5. Links to relevant resources:

Caddy's documentation is quite confusing here. upstreams is an array, but how does Caddy pick a URL from that array? Does it go from first to last?

Here is a config that works reasonably well:

{
  "match": [
    {
      "host": ["git.domain"]
    }
  ],
  "handle": [
    {
      "handler": "subroute",
      "routes": [
        {
          "handle": [
            {
              "handler": "reverse_proxy",
              "upstreams": [
                {
                  "dial": "<public-addr>:443"
                },
                {
                  "dial": "10.0.50.3:443"
                }
              ],
              "load_balancing": {
                "selection_policy": {
                  "policy": "first"
                },
                "retries": 2,
                "try_duration": "3s",
                "try_interval": "10ms"
              },
              "transport": {
                "protocol": "http",
                "tls": {
                  "server_name": "git.domain"
                },
                "compression": true
              },
              "health_checks": {
                "passive": {
                  "max_fails": 1,
                  "fail_duration": "600s",
                  "unhealthy_status": [500, 502, 504, 501],
                  "unhealthy_latency": "5s",
                  "unhealthy_request_count": 1
                }
              }
            }
          ]
        }
      ]
    }
  ],
  "terminal": true
}

I want to use active health checks, but how do I specify multiple acceptable status codes? expect_status only takes a single number.

Ideally, I only want it to fail over if there is an I/O timeout or if the public address is unreachable.

Relying on healthy and unhealthy status codes is just not going to work.

In the homelab Caddy instance, I can configure another listener that simply responds with a 200 status code.

This could be used for active health checks in the public-facing instance, but what happens when those active health checks fail? For how long is that endpoint disabled? I can't rely on passive health checks and status codes.

By default, it picks a random upstream. I see you found the selection_policy config, and yes, first will pick the first available upstream, in array order.

You can set it to "2xx" to allow 200-299. The default is only 200. We don’t have more complex matching for active health checks right now. What are you trying to do exactly?

Why not? Why can’t you have a /health endpoint or something which always returns 200? This is pretty standard practice.

It will stay down until another iteration of the active health check notices it’s back up.


You can set it to "2xx" to allow 200-299. The default is only 200. We don’t have more complex matching for active health checks right now. What are you trying to do exactly?

I tried this, but it complained that expect_status is an integer and I am giving it a string.

Why not? Why can’t you have a /health endpoint or something which always returns 200? This is pretty standard practice.

Sorry, I should've phrased this better. I can't use this with passive health checks. Sometimes a service may respond with a 5xx status code, and the public Caddy instance will then disable that upstream, thinking the service is completely unavailable when it isn't. (Tuning max_fails isn't a reliable solution either: if a service returns 5xx on some paths, Caddy will switch to the second upstream, eventually disable all upstreams, and return 502. That's undesirable behavior, because all the other endpoints on that service may be working perfectly fine.)

It will stay down until another iteration of the active health check notices it’s back up.

Noted

Here is what my final config looks like:

{
  "match": [
    {
      "host": ["git.ishanjain.me"]
    }
  ],
  "handle": [
    {
      "handler": "subroute",
      "routes": [
        {
          "handle": [
            {
              "handler": "reverse_proxy",
              "upstreams": [
                {
                  "dial": "<public-addr>:443"
                },
                {
                  "dial": "10.0.50.3:443"
                }
              ],
              "load_balancing": {
                "selection_policy": {
                  "policy": "first"
                },
                "retries": 2,
                "try_duration": "3s",
                "try_interval": "10ms"
              },
              "transport": {
                "protocol": "http",
                "tls": {
                  "server_name": "git.ishanjain.me"
                },
                "compression": true
              },
              "health_checks": {
                "active": {
                  "uri": "/",
                  "interval": "300s",
                  "timeout": "10s",
                  "expect_status": 200
                }
              }
            }
          ]
        }
      ]
    }
  ],
  "terminal": true
}

Ideally, I should have a completely separate listener in the homelab Caddy instance to use as the health check target for that instance.
I also have some other problems with Vaultwarden behind Caddy; I'll look into that next.

Ah right, you can set it to 2 when using JSON config to match any 2xx status. The 2xx string can be used in the Caddyfile as a special syntax.
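For reference, a minimal sketch of what that could look like in the JSON form (same field names as your configs above; the interval and timeout values here are just placeholders):

"health_checks": {
  "active": {
    "uri": "/",
    "interval": "10s",
    "timeout": "5s",
    "expect_status": 2
  }
}

With expect_status set to 2, any 200-299 response should count as healthy.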


In my homelab Caddy instance, I added this:

 "srv1": {
          "listen": [":9001"],
          "routes": [
            {
              "match": [
                {
                  "host": [
                    "irc.ishanjain.me",
                    "dash.ishanjain.me",
                    "git.ishanjain.me",
                    "jellyfin.ishanjain.me",
                    "ldap.ishanjain.me",
                    "qbit.ishanjain.me"
                  ]
                }
              ],
              "handle": [
                {
                  "handler": "static_response",
                  "status_code": 200,
                  "body": "OK"
                }
              ]
            }
          ]
        }

(You have to configure domains and use HTTPS on that listener, because Caddy doesn't send plain HTTP health check requests from an HTTPS server block.)

May 13 13:38:47 delbgp caddy[149152]: {"level":"info","ts":1683965327.1157558,"logger":"http.handlers.reverse_proxy.health_checker.active","msg":"HTTP request failed","host":"10.0.50.3:9001","error":"Get \"https://10.0.50.3:9001/\": http: server gave HTTP response to HTTPS client"}

In the public Caddy instance, I changed the active health check settings to:

 "health_checks": {
                            "active": {
                              "uri": "/",
                              "port": 9001,
                              "interval": "300s",
                              "timeout": "10s",
                              "expect_status": 200
                            }
                          }

The two Caddy instances were adding the Server: Caddy header twice. I added this alongside the "handler": "reverse_proxy" block:

{
  "handler": "headers",
  "response": {
    "deferred": true,
    "delete": ["Server"]
  }
},

but this erases both headers. Oh well, I'm not going to put more time into getting this right.

This is normal and expected. It allows you to actually see that it hit both your servers to produce the response.

I’m confused. I thought you were proxying to port 443 over HTTPS.

That’s a very long interval. That means if the upstream is marked unhealthy, it’ll take up to 5 minutes before it gets marked healthy again if it comes back. I’d recommend using a much lower interval like 5-10s.

I’m confused. I thought you were proxying to port 443 over HTTPS.

I am. I wanted to use port 9001 for health checks and only use HTTP on that port. I want to do this because:

  1. I only want to know whether the homelab Caddy instance is available over the public address and the WireGuard address.
  2. There is much less data transfer this way.
  3. I know that if the service is healthy it will send a 200 status code, so I don't have to track whether it sends a 2xx or a 3xx code. (Although it looks like if the service sends a 3xx code, Caddy follows the redirect, so setting expect_status to 2xx would ultimately work.)

I updated the active health check settings to send health checks to this port, and that's when I got the error I posted above.

I guess that from an HTTPS server block, it sends HTTPS health checks to https://10.0.50.3:9001 (with SNI set to the hostname for that block) and expects an HTTPS response, but the homelab Caddy was sending back a plain HTTP response.

That’s a very long interval. That means if the upstream is marked unhealthy, it’ll take up to 5 minutes before it gets marked healthy again if it comes back. I’d recommend using a much lower interval like 5-10s.

Noted, I'll reduce it. I set it so high to reduce data transfer, but performing much more frequent health checks isn't an issue now.

Meh, it’s negligible.

Yeah. You can enable TLS on port 9001 (or whatever other port) in your Caddy config, then that would work more naturally. The active checks use the transport configuration as well, because if they didn't, they wouldn't be representative health checks.
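As a rough sketch of the homelab side, assuming an empty tls_connection_policies entry is enough to turn on TLS for that server and that the homelab already has certificates for these hostnames (route body is the same static_response you already have, with only one hostname shown):

"srv1": {
  "listen": [":9001"],
  "tls_connection_policies": [{}],
  "routes": [
    {
      "match": [
        {
          "host": ["git.ishanjain.me"]
        }
      ],
      "handle": [
        {
          "handler": "static_response",
          "status_code": 200,
          "body": "OK"
        }
      ]
    }
  ]
}

The health checks from the public instance should then be able to negotiate TLS on :9001, with the SNI taken from the proxy's transport config.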

Your other option is to add a /health route in your sites (or some other path that you know will not conflict with something in the upstream apps like /health-with-some-random-extra-text-for-uniqueness) that responds with status 200. Just put it as the first route on the :443 server.
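As a sketch, that could be a single route placed first in the homelab's :443 server, with /health standing in for whatever unique path you pick (no host matcher, so it answers for all of your sites):

{
  "match": [
    {
      "path": ["/health"]
    }
  ],
  "handle": [
    {
      "handler": "static_response",
      "status_code": 200,
      "body": "OK"
    }
  ],
  "terminal": true
},

The active health check on the public side would then set "uri": "/health" and wouldn't need a separate port at all.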

Meh, it’s negligible.

On my fiber connection, sure. On the LTE failover connection it adds up, since some of the services have large home/root pages :slight_smile:

Yeah. You can enable TLS on port 9001 (or whatever other port) in your Caddy config,

Yes, this is what I did

That's why you'd use a special URL instead of the root, then. But I didn't realize you were using wireless networks for your failover; that seems strange :man_shrugging:

It's LTE failover (and eventually 5G, once it's available) for my home network.
There are two load-balanced fiber connections, but sometimes there are fiber cuts, and the LTE failover is super helpful in that period. I want all my stuff to be accessible at all times, even if it's a little slow during outages.

Both fiber connections went down at around the same time yesterday, a lot of my stuff was inaccessible, and I decided to fix it :slight_smile:
