Debugging long response times


(Lucas) #1

Recently I’ve been looking quite hard at speeding up my web applications, so I’ve been doing things like optimising my code, creating static HTML files instead of generating pages every request etc., and it’s all gone quite well.

I can now get a page in ~50ms after all of the logic and database access in my code has finish executing.

My problem is though, that if I leave the page open for some time, for example, up to a minute (not sure of an exact time for this), and then make a request to the same page again, it suddenly takes a whole second longer to respond. This is based on the TTFB metric in the Chrome network tab for the document itself.

I know that my code is ok, because even when the response takes over a second the timings in my code report ~50ms or less every single time.

So I was wondering how I would go about debugging this.

For context, the server I’m running my code in is Hapi for Node.js, and Caddy just does a transparent proxy to it.

This is the relevant part of my Caddyfile:

proxy / localhost:3001 {
    transparent
}

As far as I can see this could either be Caddy or Hapi, so does anyone have any suggestions as to how I would go about ruling out Caddy as the problem?
I’m having trouble thinking of what to look at, but it always happens after I’ve left the page for a short time. If I keep on refreshing the page the timings never go above a second and stay within the expected response time.

If it helps, I usually build Caddy from source, so if there’s anywhere in the code that would make sense to add some timings then I’m ok with doing that too.


(Lucas) #2

I’ve done some more testing and I think I can say with certainty that this is a Caddy problem.

The way I’ve been debugging this is to try different versions of code, Caddy and Go in different combinations.
The only thing that ever makes a difference to the TTFB metric is the version of Caddy I use.

I tried the following versions:

  • v0.10.10 (max TTFB ~350ms)
  • v0.10.11 (max TTFB ~350ms)
  • v0.10.12 (max TTFB ~350ms)
  • v0.10.13 (max TTFB ~1 second)
  • v0.10.14 (max TTFB ~1 second)
  • v0.11.0 (max TTFB ~1 second)

All of them were tested with Go versions 1.10.3 and 1.11.0.

The Go version made no difference to the numbers for me so I decided to stick with 1.11.0.
My Caddyfile and code I was testing with didn’t change at all during the tests.

After an initial page load it always seems to happen after leaving the page alone for ~10 seconds or more.
If I refresh the page as soon as its finished loading then it doesn’t seem to have any delay.
I’m not sure what changed to make this happen, but for now I’ll have to just stick to using version 0.10.12.


(Lucas) #3

I’ve looked into this a bit more and I’ve found out what’s causing the sudden >= 1 second response times.

It’s the Dial function that’s used in the http.Transport in the NewSingleHostReverseProxy function in the caddyhttp/proxy/reverseproxy.go file.

I’m testing with a modified build of v0.10.12, and I can get it to give me >= 1 second response times if I overwrite the Dial function in the http.Transport with my own one.

The culprit here is the net.Dialer's DualStack field. In the standard library the DefaultTransport sets this to true, which is what I believe Caddy was using up until Caddy v0.10.12. But since Caddy v0.10.13 it looks like a custom dialer was created and DualStack is never set, so defaults to false.

The time I mentioned where I “left the page alone” relates to the TLSHandshakeTimeout value, which in Caddy is set to the value of defaultCryptoHandshakeTimeout, so 10 seconds. I don’t know exactly how the dialer in Go works but I assume after that time it would have to do something again (the handshake?) which explains why I was seeing this behaviour after a certain amount of time.

The ~350ms times I’m getting up until Caddy v0.10.12 relate to the net.Dialer's FallbackDelay field, which when not set (like in the standard library’s DefaultTransport) defaults to 300ms. Given that my application executes in ~50ms this explains where the extra 300ms comes from when refreshing the page after the TLSHandshakeTimeout duration.

Would it be fair to classify this as a bug in Caddy, and would it be fair to say that DualStack should be enabled by default in the new custom dialer?


(Lucas) #4

Since this feels like a major step back to me, and the fix is simple, I created a PR here: https://github.com/mholt/caddy/pull/2305


(Matthew Fay) #5

Thanks so much for your efforts and contribution!


(Matt Holt) #6

Thanks Lucas, looking forward to getting that merged. :slight_smile: