Seeking Performance Suggestions for Optimized Caddyfile

I’m not having a problem per se, but I’ll provide the template information for completeness and pose my question at the end.

1. Output of caddy version:

(devel) (I’m on NixOS; the source revision is 2.5.2)

2. How I run Caddy:

a. System environment:

root@nixos:~/ > nixos-version
22.05.20220825.058de38 (Quokka)
root@nixos:~/ > systemctl --version
systemd 250 (250.4)

b. Command:

It’s a systemd service with WantedBy=multi-user.target

c. Service/unit/compose file:

caddy.service

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target
StartLimitBurst=10
StartLimitIntervalSec=14400

[Service]
Type=notify
User=caddy
Group=caddy
Environment="LOCALE_ARCHIVE=/nix/store/fqmdxlbk32miazamkyavwwiwkn146i37-glibc-locales-2.34-210/lib/locale/locale-archive"
Environment="PATH=/nix/store/p643r4aczmzb0dhyrx3dj592f0s5v7xj-coreutils-9.0/bin:/nix/store/7g48ahc3xnmb5b851vw60nbdgvk0wsf8-findutils-4.9.0/bin:/nix/store/ja8bi2cbpm36nwqy1hvklm3y9n7s3247-gnugrep-3.7/bin:/nix/store/lrxxki2m4gr4w3lxw08qpd465skpa04y-gnused-4.8/bin:/nix/store/sj364k7lsr47i87f7iv835lvvn7g4fqm-systemd-250.4/bin:/nix/store/p643r4aczmzb0dhyrx3dj592f0s5v7xj-coreutils-9.0/sbin:/nix/store/7g48ahc3xnmb5b851vw60nbdgvk0wsf8-findutils-4.9.0/sbin:/nix/store/ja8bi2cbpm36nwqy1hvklm3y9n7s3247-gnugrep-3.7/sbin:/nix/store/lrxxki2m4gr4w3lxw08qpd465skpa04y-gnused-4.8/sbin:/nix/store/sj364k7lsr47i87f7iv835lvvn7g4fqm-systemd-250.4/sbin"
Environment="TZDIR=/nix/store/4lh2brlyffixgz1z31p51q5i0y395lqa-tzdata-2022b/share/zoneinfo"
ExecStart=/nix/store/6kq7pfy5yiil6bms3qs4dg0z2bl3z6bi-caddy-2.5.2/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/nix/store/6kq7pfy5yiil6bms3qs4dg0z2bl3z6bi-caddy-2.5.2/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE
ExecReload=
ExecReload=/nix/store/6kq7pfy5yiil6bms3qs4dg0z2bl3z6bi-caddy-2.5.2/bin/caddy reload --config /nix/store/ysmpqr5gps9fl8kx9m0m29s9qfc6zryv-Caddyfile --adapter caddyfile
ExecStart=
ExecStart=/nix/store/6kq7pfy5yiil6bms3qs4dg0z2bl3z6bi-caddy-2.5.2/bin/caddy run --config /nix/store/ysmpqr5gps9fl8kx9m0m29s9qfc6zryv-Caddyfile --adapter caddyfile
ExecStartPre=/nix/store/6kq7pfy5yiil6bms3qs4dg0z2bl3z6bi-caddy-2.5.2/bin/caddy validate --config /nix/store/ysmpqr5gps9fl8kx9m0m29s9qfc6zryv-Caddyfile --adapter caddyfile
Group=caddy
LimitNOFILE=250000:250000
LogsDirectory=caddy
NoNewPrivileges=true
PrivateDevices=true
ProtectHome=true
ReadWriteDirectories=/var/lib/caddy
Restart=on-abnormal
StateDirectory=caddy
User=caddy

[Install]
WantedBy=multi-user.target

d. My complete Caddy config:

:8080 {
  handle_path /html {
    root * /nix/store/carv5x76ywyfhiq1rvqccrbpqnfvxrmb-static-html
    file_server
  }

  handle /synthetic {
    respond "Hello, world!"
  }

  handle /proxy {
    reverse_proxy localhost:8081
  }
}

3. The problem I’m having:

I’m benchmarking Caddy. My Caddyfile works, so I’m primarily interested in whatever performance-related improvements I can make to make each of the three handlers respond speedily.

4. Error messages and/or full log output:

(no errors)

5. What I already tried:

I’ve done some reading both on these forums and on the Internet generally, but I’m looking for more a) up-to-date and b) specific suggestions (if there are any).

6. Links to relevant resources:

(nil)


Hi! I’m the obnoxious user on Twitter who started this thread with whoever runs the Caddy account asking about performance. I’m running… lots of tests in a variety of scenarios against a comparable Nginx configuration.

I’m breaking apart benchmarking scenarios into a few different categories, including using a) default configuration file settings and b) optimized configuration file settings. My Caddyfile is intentionally bare in order to solely test i) hard-coded “synthetic” responses, ii) file serving, and iii) proxied requests.

Given my example Caddyfile, what (if any) performance optimizations are there to be had? My only real finding was to possibly install this caching module in order to cache responses. I’m already tweaking some system settings, like bumping up NOFILE (to 250000), and these sysctls as well just for general web-serving performance:

net.ipv4.ip_local_port_range "1024 65535"
net.ipv4.tcp_tw_reuse 1
net.ipv4.tcp_timestamps 1

Any general guidance? Thank you!


There’s not really anything to tune. The defaults are good for 99% of users.

You might tweak some of the numbers for timeouts and such in reverse_proxy, but that’s entirely dependent on the application you’re serving and the types of patterns you expect from that application.
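As a rough illustration only, timeouts for the proxied handler in the Caddyfile above could be set on the HTTP transport like this. The specific durations are assumptions chosen for the example, not recommendations; suitable values depend on the backend.

```caddyfile
handle /proxy {
	reverse_proxy localhost:8081 {
		transport http {
			# Illustrative values (assumptions), not tuned recommendations.
			# Maximum time to wait when connecting to the backend.
			dial_timeout 2s
			# Maximum time to wait for the backend's response headers.
			response_header_timeout 10s
		}
	}
}
```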


Thank you for the quick feedback, @francislavoie! I was skimming the documentation looking for knobs and dials and was slightly worried I might’ve missed something but if the defaults work well, then that’s fine by me.

I’ll take a look at reverse_proxy parameters but it sounds like there’s probably not much there, either, since my reverse proxy target is a very lightweight lighttpd listener with barely any latency. I appreciate it!


You might try setting flush_interval -1 but I have no idea if that’s going to help or hurt performance for your tests.
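A minimal sketch of that setting, applied to the /proxy handler from the Caddyfile above. Whether it helps is exactly the open question, so it is best treated as one more variable to benchmark:

```caddyfile
handle /proxy {
	reverse_proxy localhost:8081 {
		# A negative value disables response buffering and flushes
		# to the client immediately after each write.
		flush_interval -1
	}
}
```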

Btw while you’re testing, you might be interested in capturing profiles during some tests (but note that collecting a profile will slightly degrade performance). Use http://localhost:2019/debug/pprof to get some basic tools for inspecting heap and CPU/goroutines. Profiles can be captured to help assess what is impacting performance.

Please also make sure you’re using Caddy 2.6 beta 3, released just a few minutes ago: Release v2.6.0-beta.3 (caddyserver/caddy on GitHub).

We’d really like that new version to undergo scrutiny. It doesn’t contain a lot of optimizations but it DOES contain significant optimizations in the php_fastcgi directive, if that happens to be relevant.

I’m not sure what to tell you about tuning your system though. That’s something I have no expertise in.

I’d also be interested in the test results with metrics disabled (or even commented out of the source code entirely). I know metrics have a significant performance impact, and there’s an issue to improve it.

I don’t think nginx emits these kinds of metrics by default, so to be on par with other servers (if you’re going to compare), you’ll probably want to completely disable or rip out the metrics.

Test with different size responses too. Note that I have clocked Caddy at faster than nginx with over 100k req/sec on my commodity laptop hardware without any special tuning (or removing metrics, etc) – just vanilla config files for both servers. There are so many factors that these kinds of tests probably won’t actually tell you much. If anything, it’ll end up being, “Yeah, looks like Go is slightly slower than C because of garbage collection, so you trade off a slight edge in performance for memory safety to become impervious to all the scary exploits of C programs but are still fast enough for any real-world use case.” – i.e. things we already know.

Hey @matt! Thank you so much for the additional detail.

You’re probably correct that the ultimate conclusion here will be “garbage-collected is a little slower, but with reasonable tradeoffs”; my intent is primarily to put some measurable numbers behind that conclusion. I’m striving very hard to keep the configurations and environments alike so that the results come down to pure performance and each measured delta means something generally applicable.

I’ll probably add an appendix with a deeper dive into the golang pprof profiles. I’m already observing golang GC stairstep memory graphs when I reach the higher concurrency numbers and am pretty curious about what the specific malloc/frees are.

I actually didn’t have response size as one of my variables! That’s probably worth factoring in (my axes are default vs. optimized configuration, concurrent requests, and synthetic/HTML/reverse-proxied responses). I’ll probably add a section testing that.

Thanks again for the feedback - I’ll absolutely report back once I’ve completed my benchmarks. Although I’m conducting my tests as impartially as possible, I’m in the same category of “I know nginx is probably a little faster, but I prefer Caddy’s tradeoffs” so my hope is that some comprehensive measurements can alleviate concerns about that gap being wider than it actually is.


Alrighty. Thanks – keep me posted.

You could also fiddle with read_buffer and write_buffer settings on the HTTP transport of the reverse proxy:

reverse_proxy ... {
    ...
    transport http {
        read_buffer R
        write_buffer W
    }
}

You could also try tuning keepalive settings.
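For example, keepalive tuning on the HTTP transport might look roughly like the following for the /proxy handler in the original Caddyfile. The numbers are placeholders picked for illustration (assumptions), and would need to be validated against the actual benchmark:

```caddyfile
handle /proxy {
	reverse_proxy localhost:8081 {
		transport http {
			# Illustrative values (assumptions), not tuned recommendations.
			# How long an idle upstream connection is kept open for reuse.
			keepalive 2m
			# Upper bound on idle connections kept across all upstreams.
			keepalive_idle_conns 512
			# Upper bound on idle connections kept per upstream host.
			keepalive_idle_conns_per_host 256
		}
	}
}
```

With a single upstream, as in this benchmark, the per-host limit is likely the more relevant of the two connection counts.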

Docs here:

@tylerjl Would you please consider this patch in your tests? It is significantly faster when we use sendfile()! Pull Request #5022 by flga (caddyserver/caddy on GitHub): “caddyhttp: ensure ResponseWriterWrapper and ResponseRecorder use ReadFrom if the underlying response writer implements it.”

This is done! I tested most everything I set out to test, though there are certainly more areas I could potentially probe. Happy to answer questions if there are any.


This is probably the most thorough and thoughtful performance comparison in this vein that I’ve ever read! Good job.

I am curious how some of the results might change for reverse_proxy if you configure the HTTP transport’s read and write buffer sizes. Right now they default to 4KB.

My hypothesis is that a larger buffer does better for larger responses, with less pressure on the GC.

In my testing on a single index file (again, the homepage of Caddy’s website), smaller buffers of 1KB or 2KB performed better than the default size, going from about 37k req/sec to about 41k req/sec.
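To try the smaller-buffer case against the /proxy handler from the original Caddyfile, something along these lines should work. The 2KiB figure mirrors the sizes mentioned above; writing it in the binary-unit form is an assumption of this sketch, and the best size probably varies with response size.

```caddyfile
handle /proxy {
	reverse_proxy localhost:8081 {
		transport http {
			# Smaller than the 4KB default; swap in other sizes per test run.
			read_buffer 2KiB
			write_buffer 2KiB
		}
	}
}
```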

One other thing I noticed… you should be seeing a speedup of at least 50% for sendfile, I would think. The fact that they are only very slightly different is surprising to me. Like, caddy-default-html-small – the difference is negligible IMO. How sure are we that sendfile was used on one test and not the other? (can you do an strace to verify?)