Caddy always uses 100% CPU

1. The problem I’m having:

I’m running reverse_proxy on a 32-core EC2 instance with traffic of more than 10,000 requests per second.
I don’t know why Caddy always uses 100% of the CPU.

2. Error messages and/or full log output:

N/A

3. Caddy version:

latest, running in a Docker container

4. How I installed and ran Caddy:

caddy:latest via docker run

a. System environment:

Ubuntu 22.04

b. Command:

docker run -d -p 80:80 -p 443:443 -p 2019:2019 -v ./Caddyfile:/etc/caddy/Caddyfile -v ./data:/data caddy

c. Service/unit/compose file:

N/A

d. My complete Caddy config:

{
        admin :2019
}

domain {
        encode gzip
        header {
                Access-Control-Allow-Headers *
                Access-Control-Allow-Origin *
                Access-Control-Allow-Methods "GET, POST, PUT, PATCH, DELETE, OPTIONS"
                Access-Control-Max-Age "3600"
                defer
        }

        reverse_proxy http://ip:3000 {
                transport http {
                        keepalive 5m
                }
        }
}

5. Links to relevant resources:

I have used pprof to inspect which functions are using CPU.

Please fill out the help topic template as per the forum rules. We’re missing too much detail here to help. Post your profiles as well if you collected them. Just a screenshot like that doesn’t tell us enough. We need the full context.
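
For reference, Caddy’s admin API exposes Go’s pprof endpoints, so with admin :2019 in your config you can likely capture a CPU profile with something like this (run it while the CPU is pegged):

# Capture a 30-second CPU profile from Caddy's admin endpoint
curl -o pprof.pb.gz "http://localhost:2019/debug/pprof/profile?seconds=30"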

1 Like

Thanks, I have edited the format.

Where’s the rest of the profile?

But 100% CPU is not too surprising for loads like that, especially with gzip enabled. You’re paying for the cores, might as well use them. It’d be senseless to throttle back just because. The real question is, what kind of latency is there?
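
To put a rough number on that, you could time a request through Caddy from another machine, for example (domain is the placeholder from your Caddyfile):

# Rough end-to-end latency through the proxy (connect vs. total time)
curl -so /dev/null -w 'connect: %{time_connect}s  total: %{time_total}s\n' https://domain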

3 Likes

With gzip disabled, it also used 100% CPU.

And a large number of requests are not arriving at the backend that reverse_proxy is serving.

pprof shows that Caddy is constantly invoking DialContext.

Unless we have the actual profile, we can only guess as well as you can. Please post the full profile.

2 Likes

pprof.pb.gz

Please download the profile.

Thanks!

Wow, that’s super interesting. Looks like a lot of time is spent on dialing. But very little CPU time. Just slow dialing. What happens when you run time curl -v http://ip:3000 ?

2 Likes

At that time, CPU usage is 100%; I think all cores are stuck in slow dialing.

curl -v http://ip:3000 returns immediately and the backend is working well. I don’t know what would happen if I ran this command on the Caddy server while its CPU usage is at 100%.

Please keep this thread open. I’ll post any updates ASAP.

Was the curl run inside the container or on the host? If on the host, can you try from inside the container? I also found this, which may be relevant.

1 Like

Caddy is inside the container; curl was run on the host.

The backend is on another host and is not inside a container.

Then try curl from inside the Caddy container.
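
For example (the official caddy image is Alpine-based, so busybox wget should be available even if curl isn’t; the container name caddy and the target are assumptions):

# Hit the backend directly from inside the Caddy container's network namespace
docker exec caddy wget -qO- http://ip:3000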

1 Like

curl is working inside the Caddy container for http://internal_ip:port, but not working for the HTTPS domain.

It looks like the main problem is slow dialing; how can I fix it?

The profile you shared shows 93% of the time is spent on the connect syscall. You might be experiencing packet loss or something in your environment slowing down the syscall. You’ll have to do packet capture with tshark and look at that.
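
For example, something like this on the Caddy host would capture the connection traffic to the backend for a minute (interface name and port are assumptions, adjust for your environment):

# Capture 60 seconds of traffic to/from the backend port and write it to a file
tshark -i eth0 -f "tcp port 3000" -a duration:60 -w backend.pcap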

3 Likes

You can try disabling the userland-proxy in the Docker daemon config. On a Linux host you would usually do this:

/etc/docker/daemon.json:

{
  "userland-proxy": false
}

Then restart the Docker daemon: systemctl restart docker

This disables some conveniences, mostly related to localhost-to-container routing. I documented the differences here some time ago, with a handy true/false table.

Disabling the proxy like this will remove a notable amount of network overhead if there was a performance issue there (quite an observable difference with network perf tests, at least).


You can run into a variety of other gotchas that depend on the environment and how you’ve configured it.

Another big performance gotcha you can run into is with file descriptors. This shouldn’t affect Caddy, but if the issue is possibly tied to the service you’re connecting to, then you should consider checking the ulimit -Sn output from inside the container. If that number is not 1024, it will likely be approximately 1 million or 1 billion.
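
For example, assuming the container in question is named caddy (substitute the backend’s container name if that’s the one affected):

# Soft file descriptor limit as seen inside the running container
docker exec caddy sh -c 'ulimit -Sn'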

Either of those larger values is technically a bug and should finally be resolved in a Docker release later this year or in 2025. 1 million usually isn’t noticeable for the regressions it causes, but 1 billion often is. Some software will also appear to randomly crash if the value is above 1024, due to legacy select() syscall usage.

To rule this one out, if you have a value higher than 1024, configure the container ulimit setting (CLI, Compose) to use 1024 as the soft limit and 524288 as the hard limit; the soft limit is the important one.
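
A minimal sketch with the docker run CLI (applied to your Caddy command here just as an illustration; run it against whichever container actually has the inflated limit):

docker run -d --ulimit nofile=1024:524288 -p 80:80 -p 443:443 -p 2019:2019 -v ./Caddyfile:/etc/caddy/Caddyfile -v ./data:/data caddy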

A common bug this resolves is high CPU activity from processes that iterate through the entire file descriptor range; 1024 is a small range versus a billion syscalls. I’ve come across various software where that turned operations that should take less than a second into ones taking 10 minutes or even beyond an hour.


HTTPS doesn’t normally support IP addresses in certificates from public CAs.

Just query the actual FQDN, and if the cert is provisioned internally but you don’t have it in your trust store, you can skip verification with curl --insecure https://example.com (you should also be able to do https://ip too, since --insecure skips cert verification).

Since you mentioned the backend service you’re connecting to is on another host, and is the IP you’re connecting to, if that runs in Docker too you should apply the ulimit advice above, and possibly the userland-proxy change, to that host instead. It shouldn’t really matter for the Caddy container, AFAIK.

If anything, I would suggest trying to verify whether you can reproduce this with Caddy and the backend service on the same server. There are a variety of network gotchas that can occur with Docker, especially if preserving the original client IP is important.
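
A rough Compose sketch for that kind of test, assuming a hypothetical backend image listening on port 3000 (names and image are placeholders; with both services on one Compose network, Caddy would reverse_proxy to backend:3000 instead of an external IP):

services:
  caddy:
    image: caddy:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - ./data:/data
  backend:
    image: your-backend-image   # placeholder for the service currently on the other host
    expose:
      - "3000"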

3 Likes