The delay is always in “processing”, since I’m running locally on the machine. Results of tests 1 to 3 are close, and the minor variations can probably be disregarded. It’s as if I passed another threshold with the last server.
I’m using a minimal install of CentOS 7 with only Caddy added (using a post-install script to rinse and repeat tests).
Do you have any suggestions on how to improve requests per second, either in Caddy or through server configuration?
Caddy’s scaling with some types of files seems limited, or maybe I’m missing a configuration option?
Could a caching tool help with such files?
I’ve tested with an HTML file as well as a small JPG, and performance seems to scale really well with those types of files.
Server 3 from the previous tests wasn’t available anymore when I tried to provision it. I think the test results for server 1 were a bit of luck, possibly a newly provisioned server that I was the only user on.
Results are about the same. I would say gzip isn’t a factor here, and the minor variations are more a result of cores and SSD drives being shared, so performance can vary a bit depending on “neighbors” on the server.
If you could run a profile of the binary while under load that would tell us where the performance hit is. (Also which version of Caddy are you running?)
Without a profile, it’s almost impossible to know how to make the correct optimizations.
Sure. The profiler will tell us which areas of the code are using the most memory and which functions are taking the most CPU cycles.
You might even be able to get away with using the pprof directive and, while (or just after) benchmarking, you can open the pprof page and view the stack traces.
I set up a new server to complete the ab testing against the first server:
ab -n 100 -c 20 http://###.##.##.###:2015/a.mp4
It confirmed that both servers can deliver 102 Mbps between each other over external connections (guaranteed speed). They received/delivered 8.7 GB in 696.6 s.
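As a quick sanity check on those numbers (assuming GB here means 10^9 bytes), the raw throughput works out to roughly 100 Mbps, consistent with the ~102 Mbps link:

```go
package main

import "fmt"

// throughputMbps converts a transfer size in bytes and a duration in
// seconds to megabits per second (1 Mbps = 1e6 bits/s).
func throughputMbps(bytes, seconds float64) float64 {
	return bytes * 8 / seconds / 1e6
}

func main() {
	// The transfer above: 8.7 GB in 696.6 s.
	fmt.Printf("%.1f Mbps\n", throughputMbps(8.7e9, 696.6)) // prints "99.9 Mbps"
}
```

So the ab run was essentially saturating the link, not hitting a Caddy limit.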
Caddy eventually calls the last line of this function in http/server.go:
func (cr *connReader) startBackgroundRead() {
	cr.lock()
	defer cr.unlock()
	if cr.inRead {
		panic("invalid concurrent Body.Read call")
	}
	if cr.hasByte {
		return
	}
	cr.inRead = true
	cr.conn.rwc.SetReadDeadline(time.Time{})
	go cr.backgroundRead()
}
From this line (from the version I have):
20 goroutines (as many as concurrent users in ab) end up with stacks like this:
I’m not that well-versed in Go, I’m unable to test locally on the same server, and I’m not used to using pprof, so I can’t confirm yet why “Processing” takes so long to complete locally. Maybe I’m hitting I/O limits on the server by running such tests locally?
I could share pprof data; I’m unsure whether it would help.
Hmm, hard to say so far. This is a good start, though. I/O limits are one possible explanation. Can you produce the results of a profile when you have a chance? The tutorial I linked above should be instructive.