Reverse proxy configuration gives a "no upstreams available" error

1. Caddy version (caddy version):

v2.4.6

2. How I run Caddy:

a. System environment:

Ubuntu 20.04, Caddy installed from the official Caddy apt repository (https://dl.cloudsmith.io/public/caddy/stable/deb/debian any-version/main amd64 Packages)

b. Command:

systemctl start caddy

c. Service/unit/compose file:

# caddy.service
#
# For using Caddy with a config file.
#
# Make sure the ExecStart and ExecReload commands are correct
# for your installation.
#
# See https://caddyserver.com/docs/install for instructions.
#
# WARNING: This service does not use the --resume flag, so if you
# use the API to make changes, they will be overwritten by the
# Caddyfile next time the service is restarted. If you intend to
# use Caddy's API to configure it, add the --resume flag to the
# `caddy run` command or use the caddy-api.service file instead.

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddyfile or JSON config:

rpc-evm-testnet.venidium.io:80 {
	reverse_proxy http://127.0.0.1:8641 http://127.0.0.1:8642 http://127.0.0.1:8643 http://127.0.0.1:8644 http://127.0.0.1:8645 {
		lb_policy first
		lb_try_duration 1s
		lb_try_interval 250ms

		fail_duration 2s
		max_fails 1
		unhealthy_status 5xx
		unhealthy_latency 1s
		unhealthy_request_count 1
	}
}

3. The problem I’m having:

Most of the time the request-response cycle works, but often enough a request fails and the Caddy logs show a "no upstreams available" error.

4. Error messages and/or full log output:

$ curl -v -s -X POST --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":83}' https://rpc-evm-testnet.venidium.io/ | jq

*   Trying 104.21.68.13...
* TCP_NODELAY set
* Connected to rpc-evm-testnet.venidium.io (104.21.68.13) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
} [241 bytes data]
* TLSv1.2 (IN), TLS handshake, Server hello (2):
{ [100 bytes data]
* TLSv1.2 (IN), TLS handshake, Certificate (11):
{ [2328 bytes data]
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
{ [114 bytes data]
* TLSv1.2 (IN), TLS handshake, Server finished (14):
{ [4 bytes data]
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
} [37 bytes data]
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
} [1 bytes data]
* TLSv1.2 (OUT), TLS handshake, Finished (20):
} [16 bytes data]
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
{ [1 bytes data]
* TLSv1.2 (IN), TLS handshake, Finished (20):
{ [16 bytes data]
* SSL connection using TLSv1.2 / ECDHE-ECDSA-CHACHA20-POLY1305
* ALPN, server accepted to use h2
* Server certificate:
*  subject: C=US; ST=California; L=San Francisco; O=Cloudflare, Inc.; CN=sni.cloudflaressl.com
*  start date: May 10 00:00:00 2021 GMT
*  expire date: May  9 23:59:59 2022 GMT
*  subjectAltName: host "rpc-evm-testnet.venidium.io" matched cert's "*.venidium.io"
*  issuer: C=US; O=Cloudflare, Inc.; CN=Cloudflare Inc ECC CA-3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x138010a00)
> POST / HTTP/2
> Host: rpc-evm-testnet.venidium.io
> User-Agent: curl/7.64.1
> Accept: */*
> Content-Length: 64
> Content-Type: application/x-www-form-urlencoded
>
* Connection state changed (MAX_CONCURRENT_STREAMS == 256)!
} [64 bytes data]
* We are completely uploaded and fine
< HTTP/2 502
< date: Fri, 04 Mar 2022 14:22:47 GMT
< content-length: 0
< cf-cache-status: DYNAMIC
< expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
< report-to: {"endpoints":[{"url":"https:\/\/a.nel.cloudflare.com\/report\/v3?s=ngVYM7QY1rTXFiZQBjfdylbFIztCwvtOIk%2BFsUTnFxbIFd8YAJ7F0qO3S3gYsxzJOp%2FBl0leOJu%2B%2BVqf5MT2e6Vq%2B%2FLP1SHRLVYtKCWZoVclY68WD6xMFLa2MkZr0KIUJXETX5FNgBG5e%2FsHa7U%3D"}],"group":"cf-nel","max_age":604800}
< nel: {"success_fraction":0,"report_to":"cf-nel","max_age":604800}
< server: cloudflare
< cf-ray: 6e6b44183ffcd2f0-LCA
< alt-svc: h3=":443"; ma=86400, h3-29=":443"; ma=86400
<
{ [0 bytes data]
* Connection #0 to host rpc-evm-testnet.venidium.io left intact
* Closing connection 0
{"level":"error","ts":1646403767.1132739,"logger":"http.log.error","msg":"no upstreams available","request":{"remote_addr":"172.68.171.138:19302","proto":"HTTP/1.1","method":"POST","host":"rpc-evm-testnet.venidium.io","uri":"/","headers":{"Connection":["Keep-Alive"],"Accept-Encoding":["gzip"],"Cdn-Loop":["cloudflare"],"Content-Type":["application/x-www-form-urlencoded"],"Cf-Ipcountry":["CY"],"X-Forwarded-Proto":["https"],"User-Agent":["curl/7.64.1"],"Accept":["*/*"],"X-Forwarded-For":["66.205.75.30"],"Cf-Ray":["6e6b44183ffcd2f0-LCA"],"Cf-Visitor":["{\"scheme\":\"https\"}"],"Content-Length":["64"],"Cf-Connecting-Ip":["66.205.75.30"]}},"duration":8.8462e-05,"status":502,"err_id":"63qvc3agt","err_trace":"reverseproxy.statusError (reverseproxy.go:886)"}

5. What I already tried:

I tried tweaking lb_try_duration, lb_try_interval, fail_duration, unhealthy_latency, and unhealthy_request_count, increasing all of them while keeping their ratios roughly the same (see the sketch below). The behavior didn’t change.
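
For illustration only (these aren’t the exact numbers I used, just the general shape of what I tried), one of the scaled-up variants looked roughly like this:

rpc-evm-testnet.venidium.io:80 {
	reverse_proxy http://127.0.0.1:8641 http://127.0.0.1:8642 http://127.0.0.1:8643 http://127.0.0.1:8644 http://127.0.0.1:8645 {
		lb_policy first
		# Everything scaled up ~10x from the original values
		lb_try_duration 10s
		lb_try_interval 2500ms

		fail_duration 20s
		max_fails 1
		unhealthy_status 5xx
		unhealthy_latency 10s
		unhealthy_request_count 10
	}
}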

6. Links to relevant resources:

The default dial timeout in Caddy is 10s (in v2.4.6; it is being reduced to 3s in v2.5.0), so your lb_try_duration of 1s is too short to actually attempt to connect to a second upstream during the retry loop.

The lb_try_duration option configures the total amount of time Caddy will keep trying to connect to an upstream after receiving the request. If it takes longer than 1s for the first connection attempt to fail (which is very likely, given the 10s dial timeout), Caddy never gets to try another upstream.
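
As a sketch (untested against your setup, values illustrative), you could either raise lb_try_duration well past the dial timeout, or lower the dial timeout itself via the transport options so that a failed connection attempt is detected quickly enough for the retry loop to move on:

rpc-evm-testnet.venidium.io:80 {
	reverse_proxy http://127.0.0.1:8641 http://127.0.0.1:8642 http://127.0.0.1:8643 http://127.0.0.1:8644 http://127.0.0.1:8645 {
		lb_policy first
		lb_try_duration 30s
		lb_try_interval 250ms

		transport http {
			# Give up on an unreachable upstream after 2s instead of the
			# default 10s, so another upstream can be tried within lb_try_duration.
			dial_timeout 2s
		}
	}
}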

unhealthy_latency 1s is also very aggressive; depending on all kinds of factors, it would not be uncommon for requests to take longer than 1s to complete. But that depends on your app.

unhealthy_request_count 1 is also extremely aggressive, because it only allows a single simultaneous request to each upstream; as soon as more than one request is in flight to the same upstream, it gets marked unhealthy.

fail_duration 2s means that if an upstream is marked unhealthy for any of the above reasons, it is remembered as unhealthy for 2s, so if you send enough simultaneous requests, it is very easy for all of the upstreams to end up marked unhealthy at the same time. That is exactly when you get the no upstreams available error.
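
Something like this would be far more forgiving (an illustrative sketch only; tune the numbers for your app):

rpc-evm-testnet.venidium.io:80 {
	reverse_proxy http://127.0.0.1:8641 http://127.0.0.1:8642 http://127.0.0.1:8643 http://127.0.0.1:8644 http://127.0.0.1:8645 {
		lb_policy first
		lb_try_duration 15s
		lb_try_interval 250ms

		# Passive health checks with more headroom (illustrative values):
		fail_duration 30s
		max_fails 3
		unhealthy_status 5xx
		unhealthy_latency 5s
		# unhealthy_request_count is omitted: at 1, any two concurrent
		# requests to the same upstream mark it unhealthy.
	}
}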
