I am load balancing 8 gunicorn/FastAPI server processes using Caddy. Under light load it works fine, but under “heavy” load, 502 errors are occasionally thrown.
To trigger one of these errors, I need to run around 1000 simultaneous requests; the failure rate is around 1%.
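Roughly, the Caddy side looks like the sketch below (simplified; the site name and upstream addresses are placeholders, not my real config):

```
example.com {
	reverse_proxy 127.0.0.1:8001 127.0.0.1:8002 127.0.0.1:8003 127.0.0.1:8004 127.0.0.1:8005 127.0.0.1:8006 127.0.0.1:8007 127.0.0.1:8008 {
		# Explicit round-robin across the 8 gunicorn workers,
		# to match the nginx setup this is replacing.
		lb_policy round_robin
	}
}
```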
This is probably the upstream dropping requests when it can’t keep up.
Caddy doesn’t drop any requests; if it’s under pressure, it will just hold onto requests until it can catch up. Many other servers don’t do that, though, and will instead try to shed load by dropping connections. Nginx does this by default, for example, as far as I understand.
Is there any particular reason you’re using round_robin? It might not be the most efficient way to handle this; you might want to try least_conn instead to better distribute the load.
You could also consider enabling retries with lb_try_duration or lb_retries, but it depends on the kinds of requests you’re getting (a GET can be retried, but a POST generally can’t, for example, because the request body is consumable).
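Something along these lines inside your reverse_proxy block, as a sketch (the upstream addresses, durations, and retry count are placeholders to tune for your setup):

```
example.com {
	reverse_proxy 127.0.0.1:8001 127.0.0.1:8002 {
		# Prefer the upstream with the fewest active connections.
		lb_policy least_conn

		# Keep trying to find an available upstream for up to 5s,
		# checking every 250ms, with at most 2 retries per request.
		lb_try_duration 5s
		lb_try_interval 250ms
		lb_retries 2

		# Note: requests whose body has already been consumed
		# (e.g. most POSTs) won't be retried, as mentioned above.
	}
}
```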
This was my understanding based on what I read here.
I am trying to replicate a working nginx reverse proxy, which uses round robin. The nginx version works without any changes to the upstreams.
This error occurs on POST, so I guess retries won’t save me here.
Is there any other log info that might help me debug? For what it’s worth, the 502 responses do not always have the same msg in the logs. Here is another one; note msg: "handled request"