Trying multiple upstream nodes/IP's before failing with 502

1. Caddy version (caddy version):

2.4.3

d. My complete Caddyfile or JSON config:

{
  "apps": {
    "tls": {
      "certificates": {
        "automate": [
          "sub2.example.com",
          "sub1.example.com"
        ]
      }
    },
    "http": {
      "servers": {
        "6182eba7ee0f66000a2cc665": {
          "listen": [
            ":443"
          ],
          "routes": [
            {
              "match": [
                {
                  "host": [
                    "sub2.example.com",
                    "sub1.example.com"
                  ]
                }
              ],
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "health_checks": {
                    "active": {
                      "uri": "/caddy_api/v1/health_checks",
                      "interval": "30s",
                      "timeout": "5s"
                    }
                  },
                  "upstreams": [
                    {
                      "dial": "11.11.11.111:3000"
                    },
                    {
                      "dial": "22.22.222.222:3000"
                    }
                  ]
                }
              ]
            }
          ]
        }
      }
    }
  }
}

3. The problem I’m having:

We have a setup, where multiple upstreams exists. An offline upstream node can with our settings receive requests for up to 30 + 5 seconds before it gets kicked out. But for our users even a single failed request is problematic.

So decreasing the interval of health checks is not entirely solving our problem.

A failed GET / request results in the client getting a 502 response, and our log throwing this:

{
   "level":"error",
   "ts":1636022945.3532488,
   "msg":"no upstreams available",
   "request":{
      "remote_addr":"11.11.11.111:57724",
      "proto":"HTTP/1.1",
      "method":"GET",
      "host":"sub1.example.com",
      "uri":"/",
      "tls":{
         "resumed":false,
         "version":772,
         "cipher_suite":4865,
         "proto":"",
         "proto_mutual":true,
         "server_name":"sub1.example.com"
      }
   },
   "duration":0.000066667,
   "status":502,
   "err_id":"xhi0angd1",
   "err_trace":"reverseproxy.statusError (reverseproxy.go:857)"
}

Is it possible, rather than throwing the 502 to iterate over the upstream nodes? So if 1 fails, try a few more before giving up and returning 302.

For this, you should use configure the load_balancing options, specifically try_duration, which specifies an amount of time during which Caddy should try to dial another backend if the first connection failed. Make sure to also configure dial_timeout to something low enough (I recommend somewhere between 2s and 5s) and make sure try_duration is longer than the dial_timeout.

Please upgrade to v2.4.5!

1 Like

Thanks a lot, very helpful indeed. So I upgraded the Caddy version to v2.4.5 and added a load_balancing configuration. I also removed the health_checks to make sure the offline upstream nodes are kept in the config.

The load balancing seems to work as intended approx. 80% of the time. I began with a try_duration of 10s and dial_timeout of 3s. I then took 2 of 3 nodes offline. In my calculation this should give Caddy 6 seconds to work its way through the 2 offline nodes, after which it would still have 4 seconds to connect to the online node.

Since I still got occasional 502s (always referencing one of the offline nodes in the log) I increased try_duration to 20s and decreased dial_timeout to 2s. But it just caused Caddy to spend 20 seconds before throwing 502.

My config:

"http": {
    "servers": {
      "6182eba7ee0f66000a2cc665": {
        "listen": [
          ":443"
        ],
        "routes": [
          {
            "match": [
              {
                "host": [
                  "sub2.example.com",
                  "sub1.example.com"
                ]
              }
            ],
            "handle": [
              {
                "handler": "reverse_proxy",
                "transport": {
                  "protocol": "http",
                  "dial_timeout": "2s"
                },
                "load_balancing": {
                  "selection_policy": {
                    "policy": "random"
                  },
                  "try_duration": "20s"
                },
                "upstreams": [
                  {
                    "dial": "11.11.11.111:3000"
                  },
                  {
                    "dial": "22.22.222.222:3000"
                  },
                  {
                    "dial": "33.33.333.333:3000"
                  }
                ]
              }
            ]
          }
        ]
      }
    }
  }
}

And here is a log entry:

{
  "level":"error",
  "ts":1636114193.7555304,
  "msg":"dial tcp 11.111.111.11:3000: i/o timeout",
  "request":{
    "remote_addr":"0.000.000.00:64467",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{
    },
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":20.016139766,
  "status":502,
  "err_id":"avxnzx7hw",
  "err_trace":"reverseproxy.statusError (reverseproxy.go:858)"
}

You actually shouldn’t remove the health checks. Active health checks will actually help the retry logic to work more reliably.

If an upstream is considered unhealthy, is isn’t removed from the config, but rather just marked unhealthy in-memory, until the next time the health check succeeds. This will exclude it from being tried in the first place, increasing the likelihood of finding a healthy backend faster.

You can also enable passive health checks which can also help mark an upstream as unhealthy quicker than with active checks alone (depending on the traffic rates).

I do find it weird that Caddy doesn’t find a backend within 20 seconds though. Please enable debug logging, which will reveal more information about what Caddy is trying in the meantime.

OK I have reinstated health checks, I only removed it because it gave me more time to test the situation where a node turns unhealthy.

I tried changing the log level, but it seems to be ignored regardless what I do. I still see logs with only error and info levels.

I have tried:

{
  "logging":{
    "sink":{
      "writer":{
        "output":"stdout"
      }
    },
    "logs":{
      "default":{
        "exclude":[
          "http.log.access"
        ]
      },
      "app-logger":{
        "level":"DEBUG",
        "writer":{
          "output":"stdout"
        },
        "include":[
          "http.log.access"
        ]
      }
    }
  },
  "apps":{
    "http":{
      "servers":{
        "server-name":{
          "logs":{
            "default_logger_name":"app-logger"
          }
        }
      }
    }
  }
}

and:

{
  "logging": {
    "logs": {
      "default": {
        "level": "DEBUG"
      }
    }
  }
}

Am I doing something wrong there?

Btw I should mention another observation. If I shut down 2 of 3 upstream nodes the client hangs in like 15 min. (rather than the 20 seconds of try_duration ) before it gets a 502. And the log throws this:

{
  "level":"error",
  "ts":1636382668.9349544,
  "logger":"http.log.error.app-logger",
  "msg":"dial tcp 11.11.111.11:3000: i/o timeout",
  "request":{
    "remote_addr":"0.000.000.00:53094",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":true,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":941.400664088,
  "status":502,
  "err_id":"rm703d0h1",
  "err_trace":"reverseproxy.statusError (reverseproxy.go:858)"
}

There was 2 of my runs where I only turn off 1 of 3 upstream nodes and got the expected behaviour. Approx. 2 of 3 requests went through to a healthy node immediately, and 1 of 3 had a 2-second delay, which indicates that the proxy waited the configured 2 seconds before moving on to a healthy node. And the log looks like this:

{
  "level":"error",
  "ts":1636385003.1495538,
  "logger":"http.log.access.app-logger",
  "msg":"handled request",
  "request":{
    "remote_addr":"0.000.000.00:50724",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":true,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "common_log":"0.000.000.00 - - [08/Nov/2021:15:23:23 +0000] \"GET / HTTP/2.0\" 200 1564",
  "user_id":"",
  "duration":2.255837571,
  "size":1564,
  "status":200,
  "resp_headers":{
    "Server":[
      "Caddy"
    ],
    "X-Request-Id":[
      "ecc33bf5-5e44-4395-9f91-3f40e3c32600"
    ],
    "X-Runtime":[
      "0.001976"
    ],
    "Content-Length":[
      "1564"
    ],
    "Content-Type":[
      "text/html; charset=UTF-8"
    ]
  }
}

This went on for about 60 seconds without failures, all the way until the health check stepped in and bumped the offline node to the back of the queue.

I only got to see this behaviour 2 out of maybe 10 test rounds, but it’s interesting that it did happen. During one of the rounds that well I shut down a second node half way (so 2 of 3 nodes were offline) and the old behaviour of never finding an upstream host was then observed immediately after.

Not sure what to make of it…

Access logs will never contain DEBUG level logs. You’re meant to put DEBUG level on the default logger.

{
  "logging": {
    "logs": {
      "default": {
        "level": "DEBUG"
      }
    }
  }
}

This will output more detailed logs to stderr.

You didn’t completely fill out the help topic template, you removed sections. So I don’t know how you’re running Caddy. Next time, please completely fill it out. It only makes us guess and waste time when questions are left unanswered.

I can’t tell you how to find your logs because I don’t know how you’re actually running Caddy.

And that’s with dial_timeout set? That shouldn’t be the case… :thinking:

Apologies, should have filled all items of the form.

We use a custom docker image based at caddy 2.4.5. This is our Dockerfile:

FROM caddy:2.4.5-builder AS builder
RUN xcaddy build --output ./caddy \
    --with github.com/caddyserver/format-encoder

FROM caddy:2.4.5
COPY --from=builder /usr/bin/caddy /usr/bin/caddy

We start our service like this:

docker run --rm --name='proxy' \
--mount type=bind,source=/etc/caddy_config.json,destination=/caddy_config.json \
--publish 80:80 \
--publish 443:443 \
custom/caddy:2.4.5 caddy run --config /caddy_config.json

I have changed logging as per your instructions and compiled a few log entries.

Here is one where an offline upstream node is hit first. The proxy behaves as intended and moves on to a healthy node after 2 seconds:

{
  "level":"debug",
  "ts":1636466502.3905368,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"11.11.111.11:3000",
  "request":{
    "remote_addr":"000.00.000.000:10295",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":2.001191542,
  "error":"dial tcp 11.11.111.11:3000: i/o timeout"
}

{
  "level":"debug",
  "ts":1636466502.6490903,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"22.22.222.222:3000",
  "request":{
    "remote_addr":"000.00.000.000:10295",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "headers":{
    "Content-Type":[
      "text/html; charset=UTF-8"
    ],
    "X-Request-Id":[
      "4e47d9a6-6a3a-4d54-a798-6bfd35977c81"
    ],
    "X-Runtime":[
      "0.001762"
    ],
    "Content-Length":[
      "1564"
    ]
  },
  "status":200
}

Here is one where the proxy needs to try to offline nodes before moving on to the third. What confuses me here is that the first upstream node with IP 22.22.222.22 has 3 log entries, each of them with 2 seconds between them:

{
  "level":"debug",
  "ts":1636467323.549891,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"22.22.222.22:3000",
  "request":{
    "remote_addr":"000.00.000.000:31768",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":2.000809116,
  "error":"dial tcp 22.22.222.22:3000: i/o timeout"
}

{
  "level":"debug",
  "ts":1636467325.8050127,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"22.22.222.22:3000",
  "request":{
    "remote_addr":"000.00.000.000:31768",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":2.000451613,
  "error":"dial tcp 22.22.222.22:3000: i/o timeout"
}

{
  "level":"debug",
  "ts":1636467328.0595753,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"22.22.222.22:3000",
  "request":{
    "remote_addr":"000.00.000.000:31768",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":2.000397228,
  "error":"dial tcp 22.22.222.22:3000: i/o timeout"
}

{
  "level":"debug",
  "ts":1636467330.3136094,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"33.333.333.33:3000",
  "request":{
    "remote_addr":"000.00.000.000:31768",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":2.000392293,
  "error":"dial tcp 33.333.333.33:3000: i/o timeout"
}

{
  "level":"debug",
  "ts":1636467330.571421,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"23.88.117.149:3000",
  "request":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "headers":{
    "X-Xss-Protection":[
      "1; mode=block"
    ],
    "X-Content-Type-Options":[
      "nosniff"
    ],
    "Content-Type":[
      "text/html"
    ],
    "X-Runtime":[
      "0.002176"
    ],
    "X-Frame-Options":[
      "SAMEORIGIN"
    ],
    "X-Download-Options":[
      "noopen"
    ],
    "X-Permitted-Cross-Domain-Policies":[
      "none"
    ],
    "Referrer-Policy":[
      "strict-origin-when-cross-origin"
    ],
    "Cache-Control":[
      "no-cache"
    ],
    "X-Request-Id":[
      "0b24f050-d8c0-4868-a090-b3f5c981bb81"
    ]
  },
  "status":200
}

And the last is one where the request times out after 15 minutes. The 15 minute duration of timed out requests seems to be consistant:

{
  "level":"debug",
  "ts":1636472779.2905798,
  "logger":"http.handlers.reverse_proxy",
  "msg":"upstream roundtrip",
  "upstream":"11.11.111.11:3000",
  "request":{
    "remote_addr":"000.00.000.000:24845",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":946.405977244,
  "error":"dial tcp 11.11.111.11:3000: i/o timeout"
}

{
  "level":"error",
  "ts":1636472779.2957606,
  "logger":"http.log.error",
  "msg":"dial tcp 11.11.111.11:3000: i/o timeout",
  "request":{
    "remote_addr":"000.00.000.000:24845",
    "proto":"HTTP/2.0",
    "method":"GET",
    "host":"sub1.example.com",
    "uri":"/",
    "headers":{},
    "tls":{
      "resumed":false,
      "version":772,
      "cipher_suite":4865,
      "proto":"h2",
      "proto_mutual":true,
      "server_name":"sub1.example.com"
    }
  },
  "duration":946.411582797,
  "status":502,
  "err_id":"d0ihj6i7b",
  "err_trace":"reverseproxy.statusError (reverseproxy.go:858)"
}

Since the default load balancing selection policy is random, it might be selecting the same backend multiple times in a row.

I don’t think there’s any logic built-in to skip ones already chosen in previous iterations of the retry loop. That might be something worth adding though. I have a discussion issue open with some thoughts/ideas on some areas for improvement with the retry logic Proxy load balancing retry issues · Issue #4245 · caddyserver/caddy · GitHub

I think you should also turn on passive health checks though. Using passive health checks, the attempts at grabbing an upstream should discount a failing upstream from being chosen earlier, which might avoid choosing the same twice in a row by having marked the one that failed as unavailable.

This, I’m very confused about though. So strange that it takes so long. It must be hitting a different code path. I’m not sure what’s going on :frowning:

The i/o timeout error comes from stdlib, it’s named ErrDeadlineExceeded, but I’m not sure which one is the issue here, cause it’s in low level network stuff:

OK, I have changed the load balancing selection policy to round_robin, think it may have helped a little bit. It’s difficult to make sense of the results, they seem to be different on each test.

I have enabled passive health checks, you are correct it has a positive effect on the selection process because at least only a few requests get stuck before the offline upstream node is bumped back in the queue. But am I correct that the passive health checks makes the active ones obsolete? I tried with this config:

{
  "active": {
    "uri": "/caddy_api/v1/health_checks",
    "interval": "60s",
    "timeout": "5s"
  },
  "passive": {
    "fail_duration": "15m",
    "max_fails": 1,
    "unhealthy_latency": "2s"
  }
}

to test if I could count on the passive checks to blacklist offline nodes and active checks to whitelist them again. My thinking is that each new passive check would cause a delay/stall for the client, so I would run the active checks more frequently than the passive. But in the above example, if the blacklisted node comes back online after e.g. 2 min., it will still remain blacklisted for 15 minutes, even though approx. 13 active health checks must have been performed in the meantime. Is this expected behaviour?

In any case, thanks a lot for your help! I will keep an eye on issue #4245.

No, they go hand-in-hand. They have different purposes, with the same goal.

Passive health checks basically just look at the status code on the request or whether the connection attempt failed, and increments a “fails” counter on that upstream, and each counter has a timer to later decrement it down.

So with your config, any kind of failure will cause the upstream to have its fails counter increased for 15 minutes, then decremented after that time elapsed. I think 15 minutes is way too long probably, doesn’t give much room for agility when spinning back up that upstream. I’d reduce it to something like 30 seconds maybe.

The unhealthy_latency option means “if the request takes more than 2 seconds at all (including successful responses) then increment the fail counter”. Here, 2s is also probably too aggressive, since legitimately successful requests would get marked as a failure. Depends on your system though, if you think that’s truly too long, but I don’t think it’s out of the ordinary for some database lock to make a request take a few seconds :man_shrugging:

Passive health checks only do anything when requests are coming in, so if you have a period of downtime, none of the upstreams will be marked unhealthy. That’s where active health checks come in, they give the backend an opportunity to self report whether it things it’s ready to go. You can also check if your database is up and running, other dependent services are good to go, before reporting back to Caddy that the health is good :+1:

Active health checks won’t make an upstream marked unhealthy from passive health checks as healthy again, they keep separate state. Another reason for not using such a high fail duration like 15 minutes.

1 Like

I see, makes perfect sense.

In our use case, slow responses would not be caused by the individual node, but rather slow response time of external resources used by the upstream node.

In other words, a slow response would not indicate problems with the upstream node itself. So I would like only to consider non-responses as fails.

Does this mean that passive health checks is a bad fit for me? Or could I avoid counting slow responses as fails by setting an arbitrary long unhealthy_latency , e.g. "5m" (or any value longer than my dial_timeout)? Or I could set unhealthy_status to something non-existing like 555.

On that note, is an initial connection/handshake established between Caddy and the upstream node? Or is Caddy unaware of an upstream nodes health while the upstream is processing the request? If, for example, my dial_timeout is "2s", but the upstream node takes 3 seconds to process the request, will Caddy then move on after 2 seconds, or will it wait for the response because an initial connection was established?

If the dial_timeout measures response time rather than some initial exchange, this all means I would need to make sure that my upstream nodes provide a response within the dial_timeout limit, otherwise it would be counted as failed and another upstream node would be attempted, resulting in the request to be submitted and processed twice.

No, you can just omit unhealthy_latency and unhealthy_status to make those checks not apply. They’re optional. For passive health checks to be enabled, you just need fail_duration at minimum.

dial_timeout is just for the “dialing” phase, i.e. the initial TCP connection to the upstream, before Caddy sends over the request. There’s other timeouts you can configure like handshake_timeout for the TLS phase, and response_header_timeout for waiting for the upstream to start writing the response headers. There’s not currently a “global” timeout that covers the entire request-response flow right now.

1 Like

Got it. Thanks a lot for your help, the config that came out os this is much better than our previous. Many thanks.

1 Like

This topic was automatically closed after 30 days. New replies are no longer allowed.