Caddy being killed due to OOM with heavy traffic

1. The problem I’m having:

Caddy was oom-kill’ed due to a lot of traffic. I’m looking to tune Caddy to prevent this in the future.

This is a follow-up post to: Are there ways to tune Caddy to reduce memory usage?

OOM-kill was invoked 6 times in a short window last night on my server. Each time Caddy was killed, likely due to prioritization (oom_score_adj), though also because Caddy was using around or over 1GB of RAM (on a 2GB VM). At peak, ~7500 people were each sending 2 requests to my Caddy server every minute, with each request going to a different reverse_proxy target: one to my server (ffxiv) and the other to my self-hosted Plausible instance (running on the same server as Caddy).

The domains taking all the traffic are is.xivup.com and pls.xivup.com.

I was luckily (unluckily) up due to a kid, so I logged onto the server quickly and observed some behavior from that time. Some notes:

  • All 7000 people that were sending requests sent them within ~2 seconds of one another.
  • Response time for requests from my own browser to the site was ~3 seconds (under normal load, the same requests respond in 24ms).
  • Between the ffxiv process and Caddy, 100% of the CPU was being used.
  • I ended up stopping Plausible at 3:23:26 (just a few seconds before the last crash), which seemed to resolve the Caddy crashes, likely because it gave Caddy more memory headroom and removed all the CPU work that Plausible was doing with beam/clickhouse/postgres.

BTW: I’m happy to do any analysis requested here. I can look into the pprofs or whatever, just need a little guidance about what to look for.

2. Error messages and/or full log output:

Timestamp for the below syslog is: 2024-01-16T02:59:32.027888-07:00.

I have the syslog oom-kill output for the other 5 entries; I’m excluding them here for brevity. Timestamps for those were:

  • 2024-01-16T03:03:42.792627-07:00
  • 2024-01-16T03:08:36.859366-07:00
  • 2024-01-16T03:12:33.576843-07:00
  • 2024-01-16T03:17:49.258687-07:00
  • 2024-01-16T03:23:32.673244-07:00

As requested in the previous thread, I have Caddy heap/goroutine pprof dumps from around this time period. I don’t have dumps from the exact moments it crashed, but from around then (I took dumps every 5 minutes with a cron). https://jameswendel.com/2024-01-16-pprof.tar contains these for the time period starting at 2:50am until 4:00am.
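For reference, the dumps come from Caddy’s admin pprof endpoints. A rough sketch of the cron capture (it assumes the admin API is on the default localhost:2019; the output paths and filenames are just illustrative, not my exact setup):

# every 5 minutes, save the heap and goroutine profiles with a timestamped name
*/5 * * * * curl -s -o /var/pprof/heap-$(date +\%H\%M).pprof http://localhost:2019/debug/pprof/heap
*/5 * * * * curl -s -o /var/pprof/goroutine-$(date +\%H\%M).txt "http://localhost:2019/debug/pprof/goroutine?debug=1"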

Note that the RSS values in the log below are in 4 KB pages, so the Caddy RSS of 287373 pages works out to 1,149,492 KB of memory.

top invoked oom-killer: gfp_mask=0x140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
CPU: 0 PID: 18523 Comm: top Not tainted 6.5.0-5-amd64 #1  Debian 6.5.13-1
Hardware name: Linode Compute Instance/Standard PC (Q35 + ICH9, 2009), BIOS Not Specified
Call Trace:
 <TASK>
 dump_stack_lvl+0x47/0x60
 dump_header+0x4a/0x240
 oom_kill_process+0xf9/0x190
 out_of_memory+0x256/0x540
 __alloc_pages_slowpath.constprop.0+0xa11/0xd30
 __alloc_pages+0x30b/0x330
 folio_alloc+0x1b/0x50
 __filemap_get_folio+0xca/0x240
 filemap_fault+0x14b/0x9f0
 ? srso_return_thunk+0x5/0x10
 ? filemap_map_pages+0x2d7/0x550
 __do_fault+0x33/0x130
 do_fault+0x248/0x3d0
 __handle_mm_fault+0x65b/0xbb0
 handle_mm_fault+0x155/0x350
 do_user_addr_fault+0x203/0x640
 exc_page_fault+0x7f/0x180
 asm_exc_page_fault+0x26/0x30
RIP: 0033:0x7fac1f792c9e
Code: Unable to access opcode bytes at 0x7fac1f792c74.
RSP: 002b:00007ffe63001270 EFLAGS: 00010202
RAX: 0000000000000008 RBX: 00007ffe63001290 RCX: 0000000000000001
RDX: 00007ffe63001290 RSI: 0000000000000001 RDI: 00007ffe63001290
RBP: 00007ffe63001290 R08: 0000000000000064 R09: 0000000000000008
R10: 00007fac1f92f240 R11: 0000000000000202 R12: 0000562f2951e01b
R13: 0000000000000008 R14: 0000000000000000 R15: 00007fac1f378808
 </TASK>
Mem-Info:
active_anon:83012 inactive_anon:314869 isolated_anon:0
 active_file:2 inactive_file:42 isolated_file:0
 unevictable:0 dirty:11 writeback:0
 slab_reclaimable:13709 slab_unreclaimable:49671
 mapped:1914 shmem:2095 pagetables:3133
 sec_pagetables:0 bounce:0
 kernel_misc_reclaimable:0
 free:13554 free_pcp:125 free_cma:0
Node 0 active_anon:332048kB inactive_anon:1259476kB active_file:8kB inactive_file:168kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:7656kB dirty:44kB writeback:0kB shmem:8380kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 974848kB writeback_tmp:0kB kernel_stack:9648kB pagetables:12532kB sec_pagetables:0kB all_unreclaimable? no
Node 0 DMA free:7984kB boost:0kB min:348kB low:432kB high:516kB reserved_highatomic:0KB active_anon:652kB inactive_anon:5092kB active_file:0kB inactive_file:36kB unevictable:0kB writepending:0kB present:15992kB managed:15360kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
lowmem_reserve[]: 0 1910 1910 1910 1910
Node 0 DMA32 free:46008kB boost:0kB min:44704kB low:55880kB high:67056kB reserved_highatomic:2048KB active_anon:331392kB inactive_anon:1254388kB active_file:8kB inactive_file:132kB unevictable:0kB writepending:44kB present:2080616kB managed:1996864kB mlocked:0kB bounce:0kB free_pcp:660kB local_pcp:660kB free_cma:0kB
lowmem_reserve[]: 0 0 0 0 0
Node 0 DMA: 3*4kB (UM) 7*8kB (E) 39*16kB (UME) 32*32kB (UE) 22*64kB (UME) 10*128kB (UE) 6*256kB (UME) 0*512kB 0*1024kB 1*2048kB (M) 0*4096kB = 7988kB
Node 0 DMA32: 2818*4kB (MEH) 890*8kB (MEH) 954*16kB (UMEH) 276*32kB (MEH) 53*64kB (MEH) 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 46008kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
2653 total pagecache pages
440 pages in swap cache
Free swap  = 44kB
Total swap = 524284kB
524152 pages RAM
0 pages HighMem/MovableOnly
21096 pages reserved
0 pages hwpoisoned
Tasks state (memory values in pages):
[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
[    230]     0   230    32178       73   241664      288          -250 systemd-journal
[    279]     0   279     6702       17    73728      320         -1000 systemd-udevd
[    553]     0   553     2103       50    53248      928             0 haveged
[    554]   101   554    22864       32    73728      224             0 systemd-timesyn
[    559]   997   559   607081   287373  2560000     5312             0 caddy
[    562]     0   562     1660       32    53248       32             0 cron
[    563]   104   563     2374       32    65536      160          -900 dbus-daemon
[    565]     0   565   327022    22457   344064     1755             0 ffxiv
[    568]     0   568    55476       96    77824      384             0 rsyslogd
[    570]     0   570     4645       32    77824      256             0 systemd-logind
[    580]     0   580     1474       32    53248       32             0 agetty
[    581]     0   581     1379       32    49152       32             0 agetty
[    583]     0   583   339123     1343   290816     4416          -999 containerd
[    591]   109   591     6140       62    86016      352             0 opendkim
[    593]   109   593    65570       49   122880      704             0 opendkim
[    611]     0   611     4028       64    69632      288         -1000 sshd
[    615]     0   615    27069       59   114688     2912             0 unattended-upgr
[    633]     0   633   371827     2362   438272     7808          -500 dockerd
[    955]     0   955   180018      521   106496     1920          -998 containerd-shim
[    965]     0   965   328343     5056   221184     2392          -500 docker-proxy
[    986]     0   986   180082      527   106496     2080          -998 containerd-shim
[    994]     0   994   328487     6082   241664     2271          -500 docker-proxy
[   1064]     0  1064    10743       40    77824      160             0 master
[   1069]   108  1069    10899       32    73728      160             0 qmgr
[   1075]   101  1075    14041       32   147456      160             0 exim
[   1083]   101  1083   972303    15228  2867200    19955             0 clickhouse-serv
[   1099]     0  1099   180082      531   114688     2048          -998 containerd-shim
[   1101]     0  1101   180082      525   110592     1760          -998 containerd-shim
[   1147]  1000  1147   392887    53829  1187840    44100             0 beam.smp
[   1154]    70  1154    42561       96   122880      384             0 postgres
[   1553]    70  1553    42593       96   139264      416             0 postgres
[   1554]    70  1554    42578       96   118784      384             0 postgres
[   1555]    70  1555    42570       96   110592      384             0 postgres
[   1556]    70  1556    42744      128   122880      480             0 postgres
[   1557]    70  1557     6274       64    94208      416             0 postgres
[   1558]    70  1558    42700       96   114688      480             0 postgres
[   1991]  1000  1991      224       32    40960        0             0 epmd
[   1994]  1000  1994      203       32    45056        0             0 erl_child_setup
[   2015]  1000  2015      211       32    36864        0             0 inet_gethost
[   2016]  1000  2016      211       32    36864        0             0 inet_gethost
[   2017]  1000  2017      211       32    36864        0             0 inet_gethost
[   2032]    70  2032    43392       87   147456     1056             0 postgres
[   2033]    70  2033    43729       93   147456     1248             0 postgres
[   2034]    70  2034    43443      409   147456      928             0 postgres
[   2035]    70  2035    43438      310   147456      896             0 postgres
[   2036]    70  2036    43397      158   143360     1024             0 postgres
[   2037]    70  2037    43412      249   143360      960             0 postgres
[   2038]    70  2038    43452      360   143360      928             0 postgres
[   2039]    70  2039    43454      512   143360     1024             0 postgres
[   2040]    70  2040    43412      307   143360      960             0 postgres
[   2041]    70  2041    43451      280   143360      960             0 postgres
[   2065]    70  2065    42914       96   131072      608             0 postgres
[   2211]   108  2211    12473       32    90112      352             0 tlsmgr
[  14064]     0 14064     4645       32    77824      384             0 sshd
[  14068]  1000 14068     5115       32    81920      448           100 systemd
[  14072]  1000 14072     5358       15    77824      352           100 (sd-pam)
[  14095]  1000 14095     4710       83    77824      416             0 sshd
[  14096]  1000 14096     2440       32    57344      768             0 bash
[  18523]  1000 18523     3250      160    65536      448             0 top
[  21886]   108 21886    10853       32    73728      160             0 pickup
[  23091]    70 23091    42807      480   126976      328             0 postgres
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=user.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/caddy.service,task=caddy,pid=559,uid=997
Out of memory: Killed process 559 (caddy) total-vm:2428324kB, anon-rss:1149364kB, file-rss:128kB, shmem-rss:0kB, UID:997 pgtables:2500kB oom_score_adj:0
caddy.service: A process of this unit has been killed by the OOM killer.
caddy.service: Main process exited, code=killed, status=9/KILL
caddy.service: Failed with result 'oom-kill'.
caddy.service: Consumed 1h 7min 17.408s CPU time.
caddy.service: Scheduled restart job, restart counter is at 1.
Starting caddy.service - Caddy...

3. Caddy version:

v2.7.6 h1:w0NymbG2m9PcvKWsrXO6EEkY9Ru4FJK8uQbYcev1p3A=

4. How I installed and ran Caddy:

Installed per the Install page in the Caddy documentation, using systemd to start it up.

a. System environment:

  • Linode 2 GB (shared 1-cpu 2GB memory, configured with 512MB swap)
  • debian-testing
  • systemd

b. Command:

/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile

c. Service/unit/compose file:

  • /lib/systemd/system/caddy.service
[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile --force
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_ADMIN CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target
  • /etc/systemd/system/caddy.service.d/override.conf
[Unit]
StartLimitIntervalSec=10s

[Service]
Restart=on-failure
RestartSec=5s

d. My complete Caddy config:

{
        #       auto_https off
        email <redacted>@gmail.com
}

(static) {
        @static {
                file
                path *.ico *.css *.js *.gif *.jpg *.jpeg *.png *.svg *.webp *.woff *.woff2 *.json
        }
        header @static Cache-Control max-age=5184000
}

(security) {
        header {
                # enable HSTS
                Strict-Transport-Security "max-age=63072000; includeSubDomains; preload"
                # disable clients from sniffing the media type
                X-Content-Type-Options nosniff
                # keep referrer data off of HTTP connections
                Referrer-Policy no-referrer-when-downgrade
        }
}

(errorfiles) {
        handle_errors {
                @custom_err file /{err.status_code}.html /err.html
                handle @custom_err {
                        rewrite * {file_match.relative}
                        file_server
                }
                respond "{err.status_code} {err.status_text}"
        }
}

(logs) {
        log {
                output file /var/log/caddy/{args.0}.log
        }
}

www.jwendel.net {
        # TODO
        # import security
        redir https://jwendel.net{uri}
}

http:// {
        respond "Hi!"
}

jwendel.net {
        root * /var/www/jwendel.net/
        encode zstd gzip
        file_server
        # import logs jwendel.net
        import static
        handle_errors {
                @custom_err file /{err.status_code}.html /err.html
                handle @custom_err {
                        rewrite * {file_match.relative}
                        file_server
                }
                respond "{err.status_code} {err.status_text}"
        }
}

pls.xivup.com,
pls.jwendel.net {
        rewrite /pl.js /js/plausible.js
        reverse_proxy localhost:8000 {
                flush_interval -1
        }
        header /pl.js {
                Cache-Control: "Cache-Control max-age=604800, public, must-revalidate" #7d
        }
}

xivup.com {
        redir https://is.xivup.com{uri}
}

is.xivup.com,
test.xivup.com {
        reverse_proxy localhost:8001
}

www.kyrra.net {
        redir https://kyrra.net{uri}
}

kyrra.net {
        root * /var/www/kyrra.net/
        encode zstd gzip
        file_server
        import static
        import errorfiles
}

www.jameswendel.com {
        redir https://jameswendel.com{uri}
}

jameswendel.com {
        root * /var/www/jameswendel.com/
        encode zstd gzip
        file_server
        import static
        import errorfiles
}

www.nlworks.com {
        redir https://nlworks.com{uri}
}

nlworks.com {
        root * /var/www/nlworks.com/
        encode zstd gzip
        file_server
        import static
        import errorfiles
}

5. Links to relevant resources:


We have a page in the docs which covers how to profile a Caddy app (or any Go app, really).

I think you’ll need to consider increasing to an instance with more CPU/memory. I doubt there will be much you can do to optimize Caddy here.

One thing you might consider turning off is gzip compression, it can use up a lot of CPU and memory. It’s a trade-off with bandwidth.

All of the traffic in this case was on pls.xivup.com and is.xivup.com, both of which are just reverse_proxy configs. On the pls one I have flush_interval -1 set, but I don’t configure any gzip compression within Caddy for these domains. Unless there is some gzip default for reverse_proxy, I’m not doing anything with compression here within Caddy.

(I do have gzip enabled in the downstream server behind Caddy, so the CPU and memory load should be there, not within Caddy.)

Would setting flush_interval -1 on is.xivup.com also be recommended (maybe helping to prevent Caddy from allocating response buffers)?
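For clarity, a minimal sketch of what I mean, just adding the same option my pls block uses to the existing is.xivup.com block (untested on my end):

is.xivup.com,
test.xivup.com {
        reverse_proxy localhost:8001 {
                flush_interval -1
        }
}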

I’m likely going to try converting back to Nginx to see what kind of memory footprint I see there.

The proxy does support gzip compression from the backend, so it may be decompressing content if the backend supports compression. (This can be disabled too if desired.)

We’d know the answer to this with a profile, are you able to collect one?

You can do that, but it won’t help us fix any problems.

I want to clarify this – not quite accurate.

The proxy would only decompress the upstream’s compressed content if the original client did not set the Accept-Encoding header.

All modern browsers set Accept-Encoding, so that effectively never happens except for server-to-server type requests.

The proxy will transparently pass the compressed data from the upstream back to the client if the client wanted compressed content.


Oh! Sorry – Francis just pointed out that you did link to a profile of sorts. I did not see that initially, my apologies.

From looking at the profiles, it looks like the highest consumers of memory are reading HTTP requests. Specifically, allocating for HTTP/2 headers.

Similarly, goroutines show the highest time spent in reading from connections.

What kinds of requests are your clients sending? How big are the headers?

I’d be curious if you tried setting max_header_bytes, but I think this property is only exposed in the JSON config, as it’s a bit obscure/uncommon.
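For illustration, it sits on the HTTP server object in the JSON config, roughly like this (the server name "srv0" and the 16 KB value are just placeholders, and the rest of the server’s routes/TLS config is omitted):

{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "max_header_bytes": 16384
        }
      }
    }
  }
}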

Also, if your server is simply getting a lot of traffic, it’s either not legitimate (DDoS, in which case you need a CDN like Cloudflare, or a firewall), or you simply need more hardware to handle the load.

Btw, I was able to manually recreate the OOM. It looks like I had to send requests to 2 different endpoints at once; it seems Plausible is the bigger offender here for clogging up Caddy.

I think my short-term solution is going to programmatically disable Plausible when I get over a certain number of requests per minute.

Thanks for this question and the earlier ones in this thread; they made me think about the fact that there are 2 requests coming into my server per client. I was effectively getting ~14000 requests to the server in the period of a few seconds: 7000 to is.xivup.com and 7000 to pls.xivup.com. Below is what Chrome generated from the network view.

I’m betting the fact that 7000 of them are POST requests with a body is causing a bit of a headache for Caddy.

curl 'https://is.xivup.com/indexdata' \
  -H 'authority: is.xivup.com' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'referer: https://is.xivup.com/' \
  -H 'sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-origin' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  --compressed

curl 'https://pls.xivup.com/api/event' \
  -H 'authority: pls.xivup.com' \
  -H 'accept: */*' \
  -H 'accept-language: en-US,en;q=0.9' \
  -H 'content-type: text/plain' \
  -H 'origin: https://is.xivup.com' \
  -H 'referer: https://is.xivup.com/' \
  -H 'sec-ch-ua: "Not_A Brand";v="8", "Chromium";v="120", "Google Chrome";v="120"' \
  -H 'sec-ch-ua-mobile: ?0' \
  -H 'sec-ch-ua-platform: "Windows"' \
  -H 'sec-fetch-dest: empty' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-site: same-site' \
  -H 'user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36' \
  --data-raw '{"n":"indexData","u":"https://is.xivup.com/","d":"is.xivup.com","r":"https://is.xivup.com/v2","w":2560,"h":0}' \
  --compressed

The traffic is real, it just tends to arrive in a really short window. Basically, I get a ton of traffic when Final Fantasy 14 (an MMO video game) is down for maintenance; people come to my page to check the server status. On a non-maintenance day I only get ~30 users per minute. When there is a big maintenance, I get ~7000 people, and they want to refresh and get the latest status. I likely could put it behind a CDN, but I was hoping I could serve it myself. :smiley:

I think disabling Plausible is the answer. Those requests are just too intense for the server.


Ah, that’s very interesting.

Sorry again for my laziness above (haven’t had much sleep lately). Your topic was well-written and I didn’t read it super thoroughly.

I’m willing to bet that 3k-14k req/sec is going to use a lot of RAM no matter what server you’re using, because the bytes have to go somewhere (they could go to disk, but that would be disastrously slow). Caddy can definitely handle that load – I’ve clocked it at over 100k req/sec on my desktop workstation that has much more memory than your Linode – it just needs memory for that.

Caddy’s reverse proxy is only making about 7% of the allocations. The vast majority of allocations are coming from the Go std lib moreso than Caddy code specifically, suggesting that merely accepting the connections and decoding the requests simply takes a lot of memory.

No worries at all. I appreciate you even looking at this at all.

I actually still have my old Nginx config around, and from a quick benchmark on a small server like this, Nginx does seem to handle a high number of similar requests better than Caddy does. Though, as you point out, this is likely a limitation of the Go standard library’s HTTP server code rather than Caddy itself.

I am really drawn to the simplicity of Caddy’s config and TLS Cert handling (which is what made me try it from the start). But given my odd traffic patterns and being cheap about my server, Nginx (or using a CDN) may be a better fit.

Note that nginx will drop connections when it gets flooded – you may not even know this is happening – whereas Caddy will not drop requests on the floor, even if it takes longer to answer the clients.

I noticed that last night with a synthetic load:

> bombardier.exe -d 60s "https://is.xivup.com/v2indexdata" --fasthttp -c 7000 -t 60s -a

Caddy:
Bombarding https://is.xivup.com:443/v2indexdata for 1m0s using 7000 connection(s)
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec       460.15     562.16    3863.83
  Latency        13.15s      6.84s     45.00s
  HTTP codes:
    1xx - 0, 2xx - 31275, 3xx - 0, 4xx - 0, 5xx - 3292
    others - 1
  Throughput:     2.96MB/s


Nginx:
Bombarding https://is.xivup.com:443/v2indexdata for 1m0s using 7000 connection(s)
Done!
Statistics        Avg      Stdev        Max
  Reqs/sec      1941.26    1119.63   13996.68
  Latency         3.53s      6.95s      1.00m
  HTTP codes:
    1xx - 0, 2xx - 112435, 3xx - 0, 4xx - 0, 5xx - 0
    others - 9435
  Throughput:     8.29MB/s

Caddy returned 5xx responses while Nginx dropped connections: Nginx dropped ~7% of requests (the “others”), while Caddy returned 5xx on ~9%.


Those 500s may not actually be 500, it’s 5xx – I’d be curious what it really is. If it’s like 502 or 504, it indicates the proxy backend is being slow, but Caddy is working.

What’s the “Others” on the nginx test?

You may be able to tune Caddy’s proxy to perform better as well.

“other” is a mix of these 5 errors:

dial tcp 192.53.160.191:443: connectex: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.
tls: first record does not look like a TLS handshake
tls handshake timed out
read tcp wsarecv: A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond. 
read tcp : wsarecv: An existing connection was forcibly closed by the remote host.

To be clear, the above tests were using HTTP/1.1 on the client, with keep-alive disabled, so there’s a lot of overhead from TCP/TLS handshakes.

If I have the benchmark client use HTTP/2 with keep-alive (which then stresses the proxying ability of Caddy and Nginx), Caddy gains some ground on Nginx. Neither spat out any errors.

> bombardier.exe -d 90s "https://is.xivup.com/v2indexdata" --http2 -c 7000 -t 60s

Caddy:
Bombarding https://is.xivup.com:443/v2indexdata for 1m30s using 7000 connection(s)
Statistics        Avg      Stdev        Max
  Reqs/sec       730.29    1110.81   14807.50
  Latency         9.39s      2.83s     27.34s
  HTTP codes:
    1xx - 0, 2xx - 68944, 3xx - 0, 4xx - 0, 5xx - 884
    others - 0
  Throughput:     1.05MB/s

Nginx:
Bombarding https://is.xivup.com:443/v2indexdata for 1m30s using 7000 connection(s)
Statistics        Avg      Stdev        Max
  Reqs/sec      1218.57     534.81    3725.00
  Latency         5.69s   380.88ms      8.73s
  HTTP codes:
    1xx - 0, 2xx - 113956, 3xx - 0, 4xx - 0, 5xx - 0
    others - 0
  Throughput:     1.87MB/s

The tool I’m using sadly won’t let me print the specific 5xx errors. Running the benchmark multiple times, I sometimes don’t get any 5xx errors. So you’re likely right it’s probably 504.


Beyond throughput, Caddy’s memory usage is my main concern (which is why I started this thread).

Playing with pprof a bit more, I noticed the heap wasn’t all that big (maybe 200MB), but the process still had a 1GB RES in Linux. I dumped the “allocs” profile for Caddy, and it shows 8.33GB of cumulative allocations. It looks like most of the memory allocated (which I assume gets GCed) is deflate/gzip buffers and some other buffers for moving responses around. My downstream proxy app returns gzip’ed responses, which I’m guessing makes Caddy do this work?

(screenshots of “allocs” pprof)
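For reference, this is roughly how I was pulling and viewing the allocs profile (it assumes the admin API is on the default localhost:2019; the output filename is just an example):

# grab the cumulative allocations profile from Caddy's admin endpoint
curl -s -o allocs.pprof "http://localhost:2019/debug/pprof/allocs"
# browse the top allocators / flame graph in pprof's web UI
go tool pprof -http=:8081 allocs.pprof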

After the CPU load goes down some, I see Caddy’s memory usage start to drop back to a normal value. I know in the Java world I would set the -Xmx flag, and the JVM would divert extra CPU to keep doing GCs when it got near the cap.

Question: It looks like Go recently got this kind of limit with GOMEMLIMIT. Maybe I can try setting that to make the GC spend more time keeping memory usage low? (That, or GOGC alongside it.)

Yes, you can do that.

That’s weird. Are your clients not sending Accept-Encoding: gzip? If not, then Go is probably decompressing the response before sending it to the client. If it’s coming from web browsers, it shouldn’t be an issue.

If you’re using your benchmarking tool, make sure it sets Accept-Encoding: gzip, otherwise it’s not a test representative of real-world conditions.
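For example, I believe bombardier lets you add request headers with -H, so something along these lines (flag syntax from memory; double-check against its --help):

bombardier -c 7000 -d 60s -H "Accept-Encoding: gzip" "https://is.xivup.com/v2indexdata"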


Yup, I added the header to the benchmark tool, and that noise went away in the allocs report. Good call! I’m still seeing the same memory pressure, just from other things. But it’s a good reminder that HTTP headers can have a pretty large impact on behavior.

I tested it to be sure, and Caddy respects the GOMEMLIMIT setting. I ran systemctl edit caddy and put in the following setting:

[Service]
Environment=GOMEMLIMIT=500MiB

I then watched top while running my synthetic load. The caddy process ranged around 500-600MB RES in top, which is better than the 1GB it would otherwise grow to. It definitely impacted throughput, but that’s better than the server locking up and oom-killing tasks.

I’ve also spent a bunch of time playing with synthetic benchmark tools, and I’m realizing how little I understand about what’s going on here. When I try to stress the connection-handling part, I see wildly different behavior between bombardier with HTTP/1 and FastHTTP, and also compared with Vegeta. The clients just have different characteristics, and they seem to interact with Nginx and Caddy in different ways. It’s hard to know how much is my test machine, Windows vs. Linux, the client, or the server.

I think I’ll stick with Caddy for now with the memory limit in place, and see how things go. Thanks for all the feedback here.

