Any ideas on how to lower processing time with larger files?

I’m testing Caddy to potentially serve 200MB-1.5GB video files to multiple concurrent users.

I did tests on a 333 MB MP4 video file with ApacheBench (ab).

I tested this 5 times on different servers:
ab -n 100 -c 20 http://localhost:2015/a.mp4

Here are results for “Requests per second” on Cloud VPS:

  1. 1 vCore(s) - 2.4 GHz - 2 GB RAM - 10 GB SSD: 10.55 [#/sec]
  2. 2 vCore(s) - 2.4 GHz - 8 GB RAM - 40 GB SSD: 10.13 [#/sec]
  3. 4 vCore(s) - 3.1 GHz - 8 GB RAM - 100 GB (non-SSD): 10.35 [#/sec]
  4. 32 vCore(s) - 3.1 GHz - 120 GB RAM - 800 GB (SSD): 19.79 [#/sec]

The delay is always in ab’s “Processing” time, since I’m running the benchmark locally on the machine. Results of tests 1 to 3 are close, and the minor variations can probably be disregarded. It’s as if I passed another threshold with the last server.

I’m using CentOS 7 with a minimal install and only Caddy added (I use a post-install script so I can rinse and repeat the tests).

Do you have any suggestions on how to potentially improve requests per second, either in Caddy or through server configuration?

Caddy’s scaling with some types of files seems limited, or maybe I’m missing a configuration option?

Could a caching tool help with such files?

I’ve tested with an HTML file as well as a small JPG, and performance seems to scale really well with those types of files.

What’s your Caddyfile?

I tested with this Caddyfile:

:2015 {
    root /home/centos/caddySites
    gzip
}

HTTPS isn’t enabled on this server.

Try turning off gzip, then re-run the benchmarks.
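Something like this, with the gzip line dropped, should be enough (a minimal sketch based on your Caddyfile above):

:2015 {
    root /home/centos/caddySites
}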

I completed new tests without gzip:

  1. 1 vCore(s) - 2.4 GHz - 2 GB RAM - 10 GB SSD: 13.64 [#/sec]
  2. 2 vCore(s) - 2.4 GHz - 8 GB RAM - 40 GB SSD: 10.88 [#/sec]
  3. 32 vCore(s) - 3.1 GHz - 120 GB RAM - 800 GB (SSD): 20.71 [#/sec]

Server 3 from the previous tests wasn’t available anymore when I tried to provision it. I think the results for server 1 were a bit of luck, possibly a newly provisioned server that I was the only one on.

Results are about the same. I would say gzip isn’t a factor here, and the minor variations are more a result of the cores and SSD drives being shared, so performance can vary a bit depending on the “neighbors” on the server.

I also tried this one separately:

timeouts 0

It didn’t make a noticeable change.
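For context, it sat in the same site block, roughly like this (a sketch of what I ran):

:2015 {
    root /home/centos/caddySites
    timeouts 0
}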

If you could run a profile of the binary while under load that would tell us where the performance hit is. (Also which version of Caddy are you running?)

Without a profile, it’s almost impossible to know how to make the correct optimizations.

I’m running version 0.10.9 for amd64 available here: https://github.com/mholt/caddy/releases

I’ll read the link you mentioned and I’ll read this too: https://caddyserver.com/docs/pprof

Thanks a lot for the suggestion.

Sure. The profiler will tell us which areas of the code are using the most memory and which functions are taking the most CPU cycles.

You might even be able to get away with just using the pprof directive: while (or just after) benchmarking, you can open the pprof page and view the stack traces.
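Something like this should do it (a minimal sketch; the pprof directive exposes Go’s standard profiling endpoints under /debug/pprof):

:2015 {
    root /home/centos/caddySites
    pprof
}

While ab is running, you can open http://localhost:2015/debug/pprof in a browser to see stack traces, or capture a 30-second CPU profile with something like go tool pprof http://localhost:2015/debug/pprof/profile.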

Thanks a lot @matt.

I tested with pprof and it did give me some info.

pprof seemed to block ab tests run like this:

ab -n 100 -c 20 http://localhost/a.mp4

I set up a new server to run the ab testing against the first server:

ab -n 100 -c 20 http://###.##.##.###:2015/a.mp4

It confirmed that both servers can deliver about 102 Mbps between each other over their external connections (the guaranteed speed): they received/delivered 8.7 GB in 696.6 s, which works out to roughly 100 Mbps.

Caddy eventually calls the last line of this function in Go’s net/http (http/server.go):

func (cr *connReader) startBackgroundRead() {
	cr.lock()
	defer cr.unlock()
	if cr.inRead {
		panic("invalid concurrent Body.Read call")
	}
	if cr.hasByte {
		return
	}
	cr.inRead = true
	cr.conn.rwc.SetReadDeadline(time.Time{})
	go cr.backgroundRead()
}

From this line (from the version I have):

20 goroutines (as many as there are concurrent users in ab) end up with stacks like this:

goroutine 58 [IO wait]:
internal/poll.runtime_pollWait(0x7fa98efad1f0, 0x72, 0x0)
	/usr/local/go/src/runtime/netpoll.go:173 +0x57
internal/poll.(*pollDesc).wait(0xc420403498, 0x72, 0xffffffffffffff00, 0xe7e5a0, 0xe783f0)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:85 +0xae
internal/poll.(*pollDesc).waitRead(0xc420403498, 0xc420401c00, 0x1, 0x1)
	/usr/local/go/src/internal/poll/fd_poll_runtime.go:90 +0x3d
internal/poll.(*FD).Read(0xc420403480, 0xc420401c01, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/internal/poll/fd_unix.go:125 +0x18a
net.(*netFD).Read(0xc420403480, 0xc420401c01, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/fd_unix.go:202 +0x52
net.(*conn).Read(0xc42000c928, 0xc420401c01, 0x1, 0x1, 0x0, 0x0, 0x0)
	/usr/local/go/src/net/net.go:176 +0x6d
net/http.(*connReader).backgroundRead(0xc420401bf0)
	/usr/local/go/src/net/http/server.go:660 +0x62
created by net/http.(*connReader).startBackgroundRead
	/usr/local/go/src/net/http/server.go:656 +0xd8

I’m not that well-versed in Go and I’m unable to test locally on the same server; I’m also not used to using pprof, so I can’t confirm yet why “Processing” takes so long to complete locally. Maybe I’m hitting I/O limits on the server by doing such tests locally?

I could share the pprof data, though I’m unsure whether it would help.

Hmm, hard to say so far. This is a good start, though. I/O limits are one possible explanation. Can you produce the results of a profile when you have a chance? The tutorial I linked above should be instructive.

I tried to capture two profiles. I haven’t been able to make use of them, though.

I uploaded them here: http://192.99.9.50/profiles.zip

I used something similar to this to generate them:

http://localhost:2015/debug/pprof/profile
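From what I’ve read, they should be readable with something like this (a sketch, assuming the downloaded file is saved as profile and the caddy binary is in the current directory; with Go 1.9 the binary argument can usually be omitted):

go tool pprof ./caddy profile
(pprof) top10
(pprof) web

top10 lists the functions using the most CPU time, and web renders a call graph (it needs Graphviz installed).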