1. The problem I’m having:
Caddy was OOM-killed during a traffic spike. I’m looking to tune Caddy to prevent this in the future.
This is a follow-up post to: Are there ways to tune Caddy to reduce memory usage?
The OOM killer was invoked 6 times in a short window last night on my server. Each time Caddy was the process killed, likely due to prioritization (oom_score_adj), though also because Caddy was using around/over 1 GB of RAM (on a 2 GB VM). At peak, ~7500 clients were each sending 2 requests to my Caddy server every minute, with each request going to a different reverse_proxy target: one to my own server (ffxiv) and the other to a self-hosted Plausible instance (running on the same server as Caddy).
The domains taking all the traffic are is.xivup.com and pls.xivup.com.
I was luckily (unluckily) up due to a kid, so I logged onto the server quickly and observed some behavior from that time. Some notes:
- All ~7000 people sending requests did so within ~2 seconds of one another.
- Response time for requests from my own browser to the site was ~3 seconds (under normal load, the same requests respond in ~24 ms).
- Between the ffxiv process and caddy, 100% of CPU was being used.
- I ended up stopping Plausible at 3:23:26 (just a few seconds before the last crash), which seemed to resolve the Caddy crash issue. Likely because it gave Caddy more memory headroom, and removed all the CPU work that Plausible was doing with beam/clickhouse/postgres.
BTW: I’m happy to do any analysis requested here. I can dig into the pprofs or whatever; I just need a little guidance about what to look for.
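For reference, this is roughly how I’ve been capturing and inspecting the dumps. It assumes Caddy’s admin endpoint is at its default localhost:2019 (which exposes /debug/pprof/) and that the Go toolchain is installed:

```shell
# Grab heap and goroutine profiles from Caddy's admin endpoint
# (default admin listen address is localhost:2019).
curl -s http://localhost:2019/debug/pprof/heap -o heap.pprof
curl -s http://localhost:2019/debug/pprof/goroutine -o goroutine.pprof

# Summarize the largest in-use heap allocations.
go tool pprof -top -inuse_space heap.pprof
```

(My cron job does essentially the curl step every 5 minutes, writing timestamped files.)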
2. Error messages and/or full log output:
Timestamp for the below syslog is: 2024-01-16T02:59:32.027888-07:00.
I have the syslog oom-kill output for the other 5 entries; I’m excluding them for brevity. Timestamps for those were:
- 2024-01-16T03:03:42.792627-07:00
- 2024-01-16T03:08:36.859366-07:00
- 2024-01-16T03:12:33.576843-07:00
- 2024-01-16T03:17:49.258687-07:00
- 2024-01-16T03:23:32.673244-07:00
As requested in the previous thread, I have Caddy heap/goroutine pprof dumps from around this time period. I don’t have dumps from exactly when it crashed, but close (I took dumps every 5 minutes with a cron). https://jameswendel.com/2024-01-16-pprof.tar contains these for the time period starting at 2:50am until 4:00am.
Note that the RSS values in the task list below are in 4 KB pages, so the Caddy RSS of 287373 pages means it was using 1,149,492 KB (~1.1 GB) of memory.
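Double-checking that conversion (assuming the standard 4 KB page size on x86-64):

```shell
# 287373 pages * 4 KB/page
echo $((287373 * 4))   # 1149492 (KB), ~1.1 GB
```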
top invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
CPU: 0 PID: 18523 Comm: top Not tainted 6.5.0-5-amd64 #1 Debian 6.5.13-1
Hardware name: Linode Compute Instance/Standard PC (Q35 + ICH9, 2009), BIOS Not Specified
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x60
dump_header+0x4a/0x240
oom_kill_process+0xf9/0x190
out_of_memory+0x256/0x540
__alloc_pages_slowpath.constprop.0+0xa11/0xd30
__alloc_pages+0x30b/0x330
folio_alloc+0x1b/0x50
__filemap_get_folio+0xca/0x240
filemap_fault+0x14b/0x9f0
? srso_return_thunk+0x5/0x10
? filemap_map_pages+0x2d7/0x550
__do_fault+0x33/0x130
do_fault+0x248/0x3d0
__handle_mm_fault+0x65b/0xbb0
handle_mm_fault+0x155/0x350
do_user_addr_fault+0x203/0x640
exc_page_fault+0x7f/0x180
asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7fac1f792c9e
Code: Unable to access opcode bytes at 0x7fac1f792c74.
RSP: 002b:00007ffe63001270 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 00007ffe63001290 RCX: 0000000000000001
RDX: 00007ffe63001290 RSI: 0000000000000001 RDI: 00007ffe63001290
RBP: 00007ffe63001290 R08: 0000000000000064 R09: 0000000000000008
R10: 00007fac1f92f240 R11: 0000000000000202 R12: 0000562f2951e01b
R13: 0000000000000008 R14: 0000000000000000 R15: 00007fac1f378808
</TASK>
Mem-Info:
active_anon:83012 inactive_anon:314869 isolated_anon:0
active_file:2 inactive_file:42 isolated_file:0
unevictable:0 dirty:11 writeback:0
slab_reclaimable:13709 slab_unreclaimable:49671
mapped:1914 shmem:2095 pagetables:3133
sec_pagetables:0 bounce:0
kernel_misc_reclaimable:0
free:13554 free_pcp:125 free_cma:0
Node 0 active_anon:332048kB inactive_anon:1259476kB active_file:8kB inactive_file:168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7656kB dirty:44kB writeback:0kB shmem:8380kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 974848kB writeback_tmp:0kB kernel_stack:9648kB pagetables:12532kB sec_pagetables:0kB all_unreclaimable? no
Node 0 DMA free:7984kB boost:0kB min:348kB low:432kB high:516kB reserved_highatomic:0KB active_anon:652kB inactive_anon:5092kB active_file:0kB inactive_file:36kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1910 1910 1910 1910
Node 0 DMA32 free:46008kB boost:0kB min:44704kB low:55880kB high:67056kB reserved_highatomic:2048KB active_anon:331392kB inactive_anon:1254388kB active_file:8kB inactive_file:132kB unevictable:0kB writepending:44kB present:2080616kB managed:1996864kB mlocked:0kB bounce:0kB free_pcp:660kB local_pcp:660kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 3*4kB (UM) 7*8kB (E) 39*16kB (UME) 32*32kB (UE) 22*64kB (UME) 10*128kB (UE) 6*256kB (UME) 0*512kB 0*1024kB 1*2048kB (M) 0*4096kB = 7988kB
Node 0 DMA32: 2818*4kB (MEH) 890*8kB (MEH) 954*16kB (UMEH) 276*32kB (MEH) 53*64kB (MEH) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46008kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
2653 total pagecache pages
440 pages in swap cache
Free swap = 44kB
Total swap = 524284kB
524152 pages RAM
0 pages HighMem/MovableOnly
21096 pages reserved
0 pages hwpoisoned
Tasks state (memory values in pages):
[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
[ 230] 0 230 32178 73 241664 288 -250 systemd-journal
[ 279] 0 279 6702 17 73728 320 -1000 systemd-udevd
[ 553] 0 553 2103 50 53248 928 0 haveged
[ 554] 101 554 22864 32 73728 224 0 systemd-timesyn
[ 559] 997 559 607081 287373 2560000 5312 0 caddy
[ 562] 0 562 1660 32 53248 32 0 cron
[ 563] 104 563 2374 32 65536 160 -900 dbus-daemon
[ 565] 0 565 327022 22457 344064 1755 0 ffxiv
[ 568] 0 568 55476 96 77824 384 0 rsyslogd
[ 570] 0 570 4645 32 77824 256 0 systemd-logind
[ 580] 0 580 1474 32 53248 32 0 agetty
[ 581] 0 581 1379 32 49152 32 0 agetty
[ 583] 0 583 339123 1343 290816 4416 -999 containerd
[ 591] 109 591 6140 62 86016 352 0 opendkim
[ 593] 109 593 65570 49 122880 704 0 opendkim
[ 611] 0 611 4028 64 69632 288 -1000 sshd
[ 615] 0 615 27069 59 114688 2912 0 unattended-upgr
[ 633] 0 633 371827 2362 438272 7808 -500 dockerd
[ 955] 0 955 180018 521 106496 1920 -998 containerd-shim
[ 965] 0 965 328343 5056 221184 2392 -500 docker-proxy
[ 986] 0 986 180082 527 106496 2080 -998 containerd-shim
[ 994] 0 994 328487 6082 241664 2271 -500 docker-proxy
[ 1064] 0 1064 10743 40 77824 160 0 master
[ 1069] 108 1069 10899 32 73728 160 0 qmgr
[ 1075] 101 1075 14041 32 147456 160 0 exim
[ 1083] 101 1083 972303 15228 2867200 19955 0 clickhouse-serv
[ 1099] 0 1099 180082 531 114688 2048 -998 containerd-shim
[ 1101] 0 1101 180082 525 110592 1760 -998 containerd-shim
[ 1147] 1000 1147 392887 53829 1187840 44100 0 beam.smp
[ 1154] 70 1154 42561 96 122880 384 0 postgres
[ 1553] 70 1553 42593 96 139264 416 0 postgres
[ 1554] 70 1554 42578 96 118784 384 0 postgres
[ 1555] 70 1555 42570 96 110592 384 0 postgres
[ 1556] 70 1556 42744 128 122880 480 0 postgres
[ 1557] 70 1557 6274 64 94208 416 0 postgres
[ 1558] 70 1558 42700 96 114688 480 0 postgres
[ 1991] 1000 1991 224 32 40960 0 0 epmd
[ 1994] 1000 1994 203 32 45056 0 0 erl_child_setup
[ 2015] 1000 2015 211 32 36864 0 0 inet_gethost
[ 2016] 1000 2016 211 32 36864 0 0 inet_gethost
[ 2017] 1000 2017 211 32 36864 0 0 inet_gethost
[ 2032] 70 2032 43392 87 147456 1056 0 postgres
[ 2033] 70 2033 43729 93 147456 1248 0 postgres
[ 2034] 70 2034 43443 409 147456 928 0 postgres
[ 2035] 70 2035 43438 310 147456 896 0 postgres
[ 2036] 70 2036 43397 158 143360 1024 0 postgres
[ 2037] 70 2037 43412 249 143360 960 0 postgres
[ 2038] 70 2038 43452 360 143360 928 0 postgres
[ 2039] 70 2039 43454 512 143360 1024 0 postgres
[ 2040] 70 2040 43412 307 143360 960 0 postgres
[ 2041] 70 2041 43451 280 143360 960 0 postgres
[ 2065] 70 2065 42914 96 131072 608 0 postgres
[ 2211] 108 2211 12473 32 90112 352 0 tlsmgr
[ 14064] 0 14064 4645 32 77824 384 0 sshd
[ 14068] 1000 14068 5115 32 81920 448 100 systemd
[ 14072] 1000 14072 5358 15 77824 352 100 (sd-pam)
[ 14095] 1000 14095 4710 83 77824 416 0 sshd
[ 14096] 1000 14096 2440 32 57344 768 0 bash
[ 18523] 1000 18523 3250 160 65536 448 0 top
[ 21886] 108 21886 10853 32 73728 160 0 pickup
[ 23091] 70 23091 42807 480 126976 328 0 postgres
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/caddy.service,task=caddy,pid=559,uid=997
Out of memory: Killed process 559 (caddy) total-vm:2428324kB, anon-rss:1149364kB, file-rss:128kB, shmem-rss:0kB, UID:997 pgtables:2500kB oom_score_adj:0
caddy.service: A process of this unit has been killed by the OOM killer.
caddy.service: Main process exited, code=killed, status=9/KILL
caddy.service: Failed with result 'oom-kill'.
caddy.service: Consumed 1h 7min 17.408s CPU time.
caddy.service: Scheduled restart job, restart counter is at 1.
Starting caddy.service - Caddy...
3. Caddy version:
v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=
4. How I installed and ran Caddy:
Installed following Install — Caddy Documentation, using systemd to start it up.
a. System environment:
- Linode 2 GB (shared 1-cpu 2GB memory, configured with 512MB swap)
- debian-testing
- systemd
b. Command:
/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
c. Service/unit/compose file:
- /lib/systemd/system/caddy.service
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target
[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_BIND_SERVICE
[Install]
WantedBy=multi-user.target
- /etc/systemd/system/caddy.service.d/override.conf
[Unit]
StartLimitIntervalSec=10s
[Service]
Restart=on-failure
RestartSec=5s
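One mitigation I’m considering adding to this override (not yet applied; the specific values are guesses for a 2 GB box, not tested settings): cap Caddy’s memory via systemd cgroup limits, and give the Go runtime a soft limit with GOMEMLIMIT (Go 1.19+) so it garbage-collects more aggressively before hitting the cgroup ceiling:

```ini
# /etc/systemd/system/caddy.service.d/override.conf (sketch, values are guesses)
[Service]
# Soft limit for Go's GC: collect harder as the heap approaches 768 MiB.
Environment=GOMEMLIMIT=768MiB
# cgroup v2 limits: reclaim/throttle at MemoryHigh, hard cap at MemoryMax.
MemoryHigh=900M
MemoryMax=1200M
```

The idea is that GOMEMLIMIT keeps Caddy well under MemoryMax in the common case, and if it is still killed, the OOM kill stays scoped to the caddy cgroup rather than destabilizing the whole VM.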
d. My complete Caddy config:
{
    # auto_https off
    email <redacted>@gmail.com
}

(static) {
    @static {
        file
        path *.ico *.css *.js *.gif *.jpg *.jpeg *.png *.svg *.webp *.woff *.woff2 *.json
    }
    header @static Cache-Control max-age=5184000
}

(security) {
    header {
        # enable HSTS
        Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
        # disable clients from sniffing the media type
        X-Content-Type-Options nosniff
        # keep referrer data off of HTTP connections
        Referrer-Policy no-referrer-when-downgrade
    }
}

(errorfiles) {
    handle_errors {
        @custom_err file /{err.status_code}.html /err.html
        handle @custom_err {
            rewrite * {file_match.relative}
            file_server
        }
        respond "{err.status_code} {err.status_text}"
    }
}

(logs) {
    log {
        output file /var/log/caddy/{args.0}.log
    }
}
www.jwendel.net {
    # TODO
    # import security
    redir https://jwendel.net{uri}
}

http:// {
    respond "Hi!"
}

jwendel.net {
    root * /var/www/jwendel.net/
    encode zstd gzip
    file_server
    # import logs jwendel.net
    import static
    handle_errors {
        @custom_err file /{err.status_code}.html /err.html
        handle @custom_err {
            rewrite * {file_match.relative}
            file_server
        }
        respond "{err.status_code} {err.status_text}"
    }
}

pls.xivup.com,
pls.jwendel.net {
    rewrite /pl.js /js/plausible.js
    reverse_proxy localhost:8000 {
        flush_interval -1
    }
    header /pl.js {
        Cache-Control "max-age=604800, public, must-revalidate" # 7d
    }
}
xivup.com {
    redir https://is.xivup.com{uri}
}

is.xivup.com,
test.xivup.com {
    reverse_proxy localhost:8001
}

www.kyrra.net {
    redir https://kyrra.net{uri}
}

kyrra.net {
    root * /var/www/kyrra.net/
    encode zstd gzip
    file_server
    import static
    import errorfiles
}

www.jameswendel.com {
    redir https://jameswendel.com{uri}
}

jameswendel.com {
    root * /var/www/jameswendel.com/
    encode zstd gzip
    file_server
    import static
    import errorfiles
}

www.nlworks.com {
    redir https://nlworks.com{uri}
}

nlworks.com {
    root * /var/www/nlworks.com/
    encode zstd gzip
    file_server
    import static
    import errorfiles
}
5. Links to relevant resources:
- https://is.xivup.com - website in question
- https://jameswendel.com/2024-01-16-pprof.tar - pprof dumps from around the crashes