Reverse proxy - occasional 502 errors under load

1. The problem I’m having:

I am load balancing 8 gunicorn/FastAPI server processes with Caddy. Under light load it works fine, but under heavy load Caddy occasionally returns 502 errors.

To trigger such an error, I need to run around 1000 simultaneous requests; the failure rate is then around 1%.

2. Error messages and/or full log output:

Here is the log entry for a failure:

{
    "level": "error",
    "ts": 1692155150.8127933,
    "logger": "http.log.error.log0",
    "msg": "read tcp 192.168.101.41:34838->192.168.101.142:9006: read: connection reset by peer",
    "request":
    {
        "remote_ip": "192.168.101.128",
        "remote_port": "39504",
        "proto": "HTTP/1.1",
        "method": "POST",
        "host": "c3-a1:8000",
        "uri": "/fm",
        "headers":
        {
            "Accept-Encoding":
            [
                "gzip, deflate, br"
            ],
            "Accept":
            [
                "*/*"
            ],
            "Connection":
            [
                "keep-alive"
            ],
            "Content-Length":
            [
                "1161"
            ],
            "Content-Type":
            [
                "application/json"
            ],
            "User-Agent":
            [
                "python-requests/2.31.0"
            ]
        }
    },
    "duration": 29.10034914,
    "status": 502,
    "err_id": "ajjtkc6ba",
    "err_trace": "reverseproxy.statusError (reverseproxy.go:1299)"
}

3. Caddy version:

v2.6.4 h1:2hwYqiRwk1tf3VruhMpLcYTg+11fCdr8S3jhNAdnPy8=

4. How I installed and ran Caddy:

Installed via conda, from the conda-forge channel.

a. System environment:

CentOS 8, not Docker.

b. Command:

caddy

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

Here is the Caddyfile:

:8000 {
	log {
		output stdout
	}
	reverse_proxy * {
		to ws-04:9000 ws-04:9001 ws-04:9002 ws-04:9003 ws-04:9004 ws-04:9005 ws-04:9006 ws-04:9007 ws-04:9008
		lb_policy round_robin
	}
}

5. Links to relevant resources:

N/A

This is probably the upstream dropping connections when it can't keep up.

Caddy doesn't drop any requests; if it's under pressure, it will just hold onto them until it can catch up. Many other servers don't do that, though: they try to shed load by dropping connections. Nginx does this by default, for example, as far as I understand.

Is there any particular reason you're using round_robin? It might not be the most efficient policy for this workload; you might want to try least_conn instead, which should distribute the load more evenly.
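That's a one-line change in your reverse_proxy block (a sketch based on the Caddyfile you posted):

reverse_proxy * {
	to ws-04:9000 ws-04:9001 ws-04:9002 ws-04:9003 ws-04:9004 ws-04:9005 ws-04:9006 ws-04:9007 ws-04:9008
	# least_conn picks the upstream with the fewest active requests,
	# instead of cycling through them in a fixed order
	lb_policy least_conn
}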

You could also consider enabling retries with lb_try_duration or lb_retries, but whether that helps depends on the kinds of requests you're getting (a GET can be retried but a POST can't, for example, because the request body is consumed on the first attempt).
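If you do want to try retries, it could look something like this (a sketch; the durations and counts are arbitrary examples, and I believe lb_retries requires Caddy v2.6.3 or newer, which your v2.6.4 satisfies):

reverse_proxy * {
	to ws-04:9000 ws-04:9001 ws-04:9002 ws-04:9003 ws-04:9004 ws-04:9005 ws-04:9006 ws-04:9007 ws-04:9008
	lb_policy least_conn
	# keep trying to select an available upstream for up to 5s
	# before giving up and returning an error
	lb_try_duration 5s
	# attempt up to 2 retries within that window
	lb_retries 2
}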

Thanks @francislavoie!

This was my understanding based on what I read here.

I am trying to replicate a working nginx reverse proxy, which uses round robin. The nginx version works without any changes to the upstreams.

This error occurs on POST, so I guess retries won’t save me here.


Is there any other log info that might help me debug? For what it's worth, the 502 responses do not always have the same msg in the logs. Here is another, from the access log this time; note the msg: "handled request".

{
    "level": "error",
    "ts": 1692196157.7682438,
    "logger": "http.log.access.log0",
    "msg": "handled request",
    "request":
    {
        "remote_ip": "192.168.101.94",
        "remote_port": "58322",
        "proto": "HTTP/1.1",
        "method": "POST",
        "host": "c7-a6:8000",
        "uri": "/fm",
        "headers":
        {
            "User-Agent":
            [
                "python-requests/2.31.0"
            ],
            "Accept-Encoding":
            [
                "gzip, deflate, br"
            ],
            "Accept":
            [
                "*/*"
            ],
            "Connection":
            [
                "keep-alive"
            ],
            "Content-Length":
            [
                "1161"
            ],
            "Content-Type":
            [
                "application/json"
            ]
        }
    },
    "user_id": "",
    "duration": 18.110046461,
    "size": 0,
    "status": 502,
    "resp_headers":
    {
        "Server":
        [
            "Caddy"
        ]
    }
}

Yes, enable the debug global option. It should show why it wasn’t able to select an upstream and returned 502.
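It's a global option, so it goes in an options block at the top of the Caddyfile:

{
	# global options block; debug enables verbose logging
	debug
}

:8000 {
	# ... rest of the config unchanged
}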
