tcurdt
(Torsten Curdt)
September 24, 2025, 1:15pm
1
This is a continuation of How to debug high CPU. Since the setup has not changed (see the link), I'll jump straight to the debugging.
Thanks for the replies in the other thread; somehow I did not get notified.
Back in August it felt like a reporting problem, but now it is having an impact.
With a peak of about 50-60 req/s, I don't feel we should have a CPU load like this.
It’s higher than the database and more than the backends themselves.
Caddy is acting pretty much only as a reverse proxy handling the TLS termination.
Is it just me or does this look excessive?
go tool pprof cpu.prof
File: caddy
Build ID: 8a32851908a303960579c849e7121ac8eb2325d5
Type: cpu
Time: 2025-09-24 14:48:25 CEST
Duration: 30.17s, Total samples = 82.67s (274.05%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top10
Showing nodes accounting for 74.88s, 90.58% of 82.67s total
Dropped 516 nodes (cum <= 0.41s)
Showing top 10 nodes out of 54
flat flat% sum% cum cum%
70.04s 84.72% 84.72% 70.04s 84.72% internal/runtime/syscall.Syscall6
2.46s 2.98% 87.70% 2.46s 2.98% runtime.futex
0.43s 0.52% 88.22% 1.08s 1.31% runtime.exitsyscall
0.36s 0.44% 88.65% 0.72s 0.87% runtime.stealWork
0.33s 0.4% 89.05% 3.32s 4.02% runtime.netpoll
0.29s 0.35% 89.40% 38.65s 46.75% crypto/tls.(*Conn).Write
0.29s 0.35% 89.75% 7.59s 9.18% runtime.findRunnable
0.27s 0.33% 90.08% 38.05s 46.03% crypto/tls.(*Conn).writeRecordLocked
0.26s 0.31% 90.40% 73.53s 88.94% io.copyBuffer
0.15s 0.18% 90.58% 0.80s 0.97% crypto/internal/fips140/aes/gcm.seal
(pprof) top20 -cum
Showing nodes accounting for 71.80s, 86.85% of 82.67s total
Dropped 516 nodes (cum <= 0.41s)
Showing top 20 nodes out of 54
flat flat% sum% cum cum%
0 0% 0% 73.53s 88.94% io.Copy (inline)
0.26s 0.31% 0.31% 73.53s 88.94% io.copyBuffer
0 0% 0.31% 71.84s 86.90% github.com/caddyserver/caddy/v2/modules/caddyhttp.(*hijackedConn).ReadFrom
0 0% 0.31% 71.84s 86.90% github.com/caddyserver/caddy/v2/modules/caddyhttp/reverseproxy.switchProtocolCopier.copyFromBackend
70.04s 84.72% 85.04% 70.04s 84.72% internal/runtime/syscall.Syscall6
0.01s 0.012% 85.05% 68.97s 83.43% internal/poll.ignoringEINTRIO (inline)
0.09s 0.11% 85.16% 68.75s 83.16% syscall.Syscall
0.10s 0.12% 85.28% 67.25s 81.35% syscall.RawSyscall6
0.29s 0.35% 85.63% 38.65s 46.75% crypto/tls.(*Conn).Write
0.27s 0.33% 85.96% 38.05s 46.03% crypto/tls.(*Conn).writeRecordLocked
0.03s 0.036% 85.99% 36.83s 44.55% net.(*conn).Write
0.04s 0.048% 86.04% 36.80s 44.51% net.(*netFD).Write
0.10s 0.12% 86.16% 36.71s 44.41% internal/poll.(*FD).Write
0.04s 0.048% 86.21% 36.31s 43.92% syscall.Write (inline)
0.04s 0.048% 86.26% 36.27s 43.87% syscall.write
0.14s 0.17% 86.43% 35.92s 43.45% crypto/tls.(*Conn).write
0.04s 0.048% 86.48% 33.30s 40.28% net.(*conn).Read
0.06s 0.073% 86.55% 33.26s 40.23% net.(*netFD).Read
0.10s 0.12% 86.67% 33.19s 40.15% internal/poll.(*FD).Read
0.15s 0.18% 86.85% 33.04s 39.97% net/http.(*readWriteCloserBody).Read
Is there a way to see what URLs contribute the most? (we don’t really have an access log atm)
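I guess I could just turn on an access log per site; a minimal sketch of what that might look like (site address, backend and log path are placeholders):
example.com {
    # placeholder site; JSON logs are easy to grep/aggregate by URI later
    log {
        output file /var/log/caddy/access.log
        format json
    }
    reverse_proxy backend:8080
}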
How would you dig deeper?
What else would be of interest? I have this available:
curl -s "http://${CADDY_IP}:2019/debug/pprof/profile" > $DIR/cpu.prof
curl -s "http://${CADDY_IP}:2019/debug/pprof/goroutine?debug=1" > $DIR/go1.txt
#curl -s "http://${CADDY_IP}:2019/debug/pprof/goroutine?debug=2" > $DIR/go2.txt
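For digging deeper I would probably also grab a heap profile next to the CPU profile, and look at the CPU profile in the web UI; just a sketch using the same variables as above:
# heap profile, same admin endpoint as above
curl -s "http://${CADDY_IP}:2019/debug/pprof/heap" > $DIR/heap.prof

# browse cpu.prof in the browser (flame graph view); or, inside interactive
# pprof, `peek io.copyBuffer` shows who is feeding the copy loop
go tool pprof -http=:8081 cpu.prof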
matt
(Matt Holt)
September 26, 2025, 3:04pm
2
Can you post the profile? And yeah the goroutine dump is good too.
Also see:
This will help you troubleshoot on your own while waiting for others to help.
1 Like
tcurdt
(Torsten Curdt)
September 29, 2025, 1:42pm
3
@matt I had to wait for a busier time … but attached is the profile.
For some context: Caddy has about 10-15x the CPU usage of our database.
To me it looks like an IO bottleneck, but what would that be for a reverse proxy?
Any suggestion much appreciated.
https://files.vafer.org/profile.tgz
matt
(Matt Holt)
October 3, 2025, 9:13pm
4
Yeah man, that’s weird. 60 seconds spent deep in a syscall.
Something configured weirdly in your kernel? There’s a LOT of CPU time just spent writing buffers, in other words, I/O.
How’s your I/O hardware? Anything failing?
2 Likes
tcurdt
(Torsten Curdt)
October 4, 2025, 10:41am
5
It's a NixOS server running as a virtual machine on Hetzner (via a hypervisor). Caddy is running in k3s.
TBH I am not even sure where to start digging with all these layers.
Sometimes I really miss bare metal.
Right now we have little traffic so it’s a bit pointless … but this is what I would look at on the host system:
nix-shell -p iotop sysstat
iostat --human
Linux 6.12.19-hardened1 (blue-c-1) 10/04/2025 _x86_64_ (16 CPU)
avg-cpu: %user %nice %system %iowait %steal %idle
2.9% 0.0% 2.6% 0.1% 0.0% 94.3%
Device tps kB_read/s kB_wrtn/s kB_dscd/s kB_read kB_wrtn kB_dscd
sda 155.19 15.6M 220.4k 114.8k 159.3T 2.2T 1.1T
iotop -o
shows around 160 K/s
vmstat 1
shows close to idle
sudo smartctl -a /dev/sda
all virtual
I am wondering if the syscall time could actually be waiting rather than busy CPU?
But I guess I will have to wait for some traffic to dig deeper.
To test in this direction I added
servers {
    timeouts {
        read_body 10s
        read_header 5s
        write 30s
    }
}
to tighten timeouts.
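As far as I understand, the Go CPU profiler only samples on-CPU time, so a blocked syscall should not show up there, but to double-check I would compare user vs kernel CPU of the Caddy process itself with pidstat (from the sysstat package mentioned above; the process name in pgrep is an assumption and may differ inside k3s):
# user (%usr) vs kernel (%system) CPU of the caddy process, 10 samples at 1s
pidstat -u -p "$(pgrep -x caddy | head -1)" 1 10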
I am wondering if websockets could be the main driver for the behaviour.
But without dissecting this during high traffic times that’s just speculation.
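One way to get past the speculation even outside peak hours: re-enable the debug=2 goroutine dump from the script above and count the goroutines sitting in the websocket copy loops seen in the profile (there should be roughly two copier goroutines per proxied websocket connection):
curl -s "http://${CADDY_IP}:2019/debug/pprof/goroutine?debug=2" > $DIR/go2.txt
# each copier goroutine shows a switchProtocolCopier frame in its stack
grep -c 'switchProtocolCopier' $DIR/go2.txt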
tcurdt
(Torsten Curdt)
October 6, 2025, 9:14am
6
Under load:
avg-cpu: %user %nice %system %iowait %steal %idle
28.8% 0.0% 31.1% 0.1% 0.0% 40.0%
Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz w/s wkB/s wrqm/s %wrqm w_await wareq-sz d/s dkB/s drqm/s %drqm d_await dareq-sz f/s f_await aqu-sz %util
sda 0.00 0.0k 0.00 0.0% 0.00 0.0k 69.00 312.0k 2.00 2.8% 0.96 4.5k 0.00 0.0k 0.00 0.0% 0.00 0.0k 17.00 0.24 0.07 1.1%
$ sudo iotop -o
shows more writes than expected … around 600K/s
$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- -------cpu-------
r b swpd free buff cache si so bi bo in cs us sy id wa st gu
7 0 0 664404 3456648 21303808 0 0 16287 219 9405 6 3 3 94 0 0 0
8 0 0 646264 3456648 21303808 0 0 0 544 54608 47362 22 30 48 0 0 0
13 0 0 641832 3456648 21304204 0 0 0 940 56520 50195 20 32 47 0 0 0
7 0 0 627060 3456648 21304108 0 0 16 1012 50743 51557 23 31 46 0 0 0
12 0 0 636836 3456648 21304152 0 0 24 492 56306 51506 22 31 47 0 0 0
12 0 0 601968 3456648 21304392 0 0 0 536 60360 51603 31 31 38 0 0 0
12 0 0 549912 3456648 21304612 0 0 24 556 56148 48272 25 33 43 0 0 0
8 0 0 578808 3456648 21304440 0 0 0 204 52973 48609 23 30 47 0 0 0
7 0 0 579988 3456648 21304548 0 0 0 436 50653 54147 21 33 46 0 0 0
14 0 0 579132 3456648 21304716 0 0 0 536 64240 55177 35 31 34 0 0 0
8 0 0 576756 3456648 21304860 0 0 0 368 55395 55853 27 30 43 0 0 0
8 0 0 560448 3456648 21305036 0 0 0 516 51363 46865 32 31 37 0 0 0
6 0 0 599976 3456648 21304584 0 0 0 308 53645 46642 35 29 36 0 0 0
6 0 0 593848 3456648 21304640 0 0 8 1068 56547 44477 20 30 49 0 0 0
6 0 0 642732 3456648 21304656 0 0 0 440 53389 49895 18 31 51 0 0 0
8 0 0 635900 3456648 21304664 0 0 0 272 54679 48894 19 31 50 0 0 0
8 0 0 633384 3456648 21304736 0 0 40 1068 56049 46807 19 31 49 0 0 0
but overall I would say…
CPU
20-35% user plus 29-33% system CPU usage feels quite busy.
Only 34-51% idle and not much breathing room.
The high system/kernel CPU matches the profiling result with the system calls.
The 6-14 processes in the run queue indicate CPU contention.
Memory
No swap usage, no memory pressure
Looks OK.
IO
Minimal disk reads.
Moderate disk writes.
Not really IO bound.
System
50k-64k interrupts/sec seems high
44k-55k context switches/sec seems too high
Any thoughts?
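To check how much of that switching is Caddy itself, and whether the interrupts are mostly network softirqs, I would try something like this (pidstat from sysstat again; cswch/s is voluntary, nvcswch/s involuntary switches):
# per-process context switches for the caddy process
pidstat -w -p "$(pgrep -x caddy | head -1)" 1 10

# are the interrupts mostly network softirqs?
watch -n1 "grep -E 'NET_RX|NET_TX' /proc/softirqs"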
tcurdt
(Torsten Curdt)
October 6, 2025, 9:24am
7
The odd thing: when I restart Caddy, CPU drops and does not shoot right back up again, like I would expect if this were purely due to traffic.
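A slow ramp would also fit connections (and their goroutines) piling up rather than raw traffic; a crude way to watch for that, reusing the goroutine endpoint from the script above:
# log the total goroutine count once a minute; a steady climb after a restart
# would point at accumulating connections rather than request volume
while true; do
  printf '%s %s\n' "$(date -Is)" \
    "$(curl -s "http://${CADDY_IP}:2019/debug/pprof/goroutine?debug=1" | head -1)"
  sleep 60
done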
matt
(Matt Holt)
October 6, 2025, 9:55pm
8
So restarting Caddy fixes the problem? The CPU doesn’t come back up again?
I don’t know anything about Nix or Hetzner’s hypervisor. Can you reproduce it without one of those abstractions?
tcurdt
(Torsten Curdt)
October 6, 2025, 10:53pm
9
CPU ramps up slowly after the restart.
If this was caused by the traffic directly, I would expect it to come back much quicker.
I suspect the k3s abstractions are not dealing too well with long-lived websocket connections, but I somehow need a way to confirm this.
Can I see connection stats in the Caddy metrics somehow? I didn't really find much in that direction.
curl -s http://$CADDY:2019/metrics | grep conn
This really makes me want to move Caddy closer to the metal. Still pondering how to share the ACME certs in a good way, though. All I found seems to require xcaddy (consul, redis, other DBs).
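For what it's worth, the first thing I would try without any plugins is pointing both instances at the same storage path with the built-in file_system storage (my understanding is that instances sharing storage coordinate cert management; the path and the exact Caddyfile syntax here are assumptions I would still verify):
{
    storage file_system {
        # assumption: a mount both instances can read and write
        root /shared/caddy
    }
}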
matt
(Matt Holt)
October 7, 2025, 8:44pm
10
There’s Prometheus metrics you can enable: Monitoring Caddy with Prometheus metrics — Caddy Documentation
Just embrace xcaddy (or our download page) and use plugins when you need to
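For the Caddyfile, roughly this global option should enable the per-request metrics, which are then served on the admin endpoint you're already scraping (whether it's needed depends on the Caddy version):
{
    servers {
        # enable per-request HTTP metrics
        metrics
    }
}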
system
(system)
Closed
November 6, 2025, 8:45pm
11
This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.