Caddy p99 always 4.95ms

1. The problem I’m having:

I’m plotting the p99 latency of a redirect response, and I find it odd that even under heavy traffic the result is always 4.95 ms.

The Prometheus query I’m using:
histogram_quantile(0.99, sum(rate(caddy_http_request_duration_seconds_bucket[1m])) by (le))

I used k6.io to generate 2k requests per second over a couple of minutes; CPU usage went to 60%, and the value stayed at 4.95 ms.

2. Error messages and/or full log output:

N/A

3. Caddy version:

The caddy:2.7.6-builder Docker container.

4. How I installed and ran Caddy:

a. System environment:

Docker containers.

b. Command:

docker-compose up

c. Service/unit/compose file:

services:
  caddy:
    build:
      dockerfile: ./caddy/caddy.Dockerfile
    container_name: caddy
    environment:
      - REDIS_HOST=redis
      - REDIS_PORT=6379
      - REDIS_USERNAME=default
      - REDIS_PASSWORD=redis
      - HTTP_CHALLENGE_SERVER=http-challenge
    volumes:
      - ./caddy/Caddyfile:/etc/caddy/Caddyfile

d. My complete Caddy config:

{
	on_demand_tls {
		ask http://{$HTTP_CHALLENGE_SERVER}/ask
	}
	storage redis {
		host {$REDIS_HOST}
		port {$REDIS_PORT}
		username {$REDIS_USERNAME}
		password {$REDIS_PASSWORD}
	}
	servers {
		metrics
	}
}

https://* {
	tls internal {
		on_demand
	}
	@www header_regexp nonwww Host ^(www\.)(.*)
	@no-www host www.{args.0}
	handle @www {
		redir https://{re.nonwww.2}{uri}
	}
	@non-www not header_regexp Host ^(www\.)(.*)
	handle @non-www {
		redir https://www.{host}{uri}
	}

	respond 404
}

5. Links to relevant resources:

If you enable access logs (add the log directive), what do you see for duration?
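
Something like this inside your site block would be enough (a minimal sketch; I’m assuming JSON output to stdout here, adjust as needed):

log {
	output stdout
	format json
}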

Can you share a screenshot of the result of that query? Is it exactly 4.95ms, or is there any variation?

Could you share the k6 script?

Can you also try running the test over a longer period of time (maybe 10-20mins)?

And maybe also query for other quantiles too, like 0.5, 0.75, 0.90, and 0.95, just to see if there’s variation at lower quantiles.

The output of the logs, sorted (docker logs <container> | jq '.duration' | sort -rn | head -n 25):

9.9999e-05
9.9999e-05
9.9999e-05
9.9999e-05
9.9999e-05
9.9998e-05
9.9998e-05
9.9998e-05
9.9998e-05
9.9997e-05
9.9997e-05
9.9997e-05
9.9997e-05
9.9997e-05
9.9996e-05
9.9996e-05
9.9996e-05
9.9996e-05
9.9996e-05
9.9995e-05
9.9995e-05
9.9995e-05
9.9995e-05
9.9995e-05
9.9995e-05

k6.js:

import http from 'k6/http';

export const options = {
    insecureSkipTLSVerify: true,
    vus: 1000,
    duration: '10m',
};

export default function () {
    http.get('https://<URL>', {redirects: 0});
};

Output of k6:

❯ k6 run k6.local.js
     scenarios: (100.00%) 1 scenario, 1000 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 1000 looping VUs for 10m0s (gracefulStop: 30s)

     data_received..................: 55 MB  92 kB/s
     data_sent......................: 36 MB  61 kB/s
     http_req_blocked...............: avg=1.24ms   min=0s      med=0s       max=2.44s   p(90)=1µs    p(95)=1µs
     http_req_connecting............: avg=333.58µs min=0s      med=0s       max=1.91s   p(90)=0s     p(95)=0s
     http_req_duration..............: avg=600.1ms  min=55.43ms med=522.27ms max=5.8s    p(90)=1.04s  p(95)=1.3s
       { expected_response:true }...: avg=600.1ms  min=55.43ms med=522.27ms max=5.8s    p(90)=1.04s  p(95)=1.3s
     http_req_failed................: 0.00%  ✓ 0           ✗ 997739
     http_req_receiving.............: avg=1.18ms   min=4µs     med=48µs     max=32.31ms p(90)=4.69ms p(95)=7.17ms
     http_req_sending...............: avg=76.68µs  min=5µs     med=59µs     max=25.45ms p(90)=115µs  p(95)=143µs
     http_req_tls_handshaking.......: avg=857.68µs min=0s      med=0s       max=1.6s    p(90)=0s     p(95)=0s
     http_req_waiting...............: avg=598.83ms min=55.3ms  med=519.43ms max=5.79s   p(90)=1.03s  p(95)=1.3s
     http_reqs......................: 997739 1662.103917/s
     iteration_duration.............: avg=601.56ms min=55.6ms  med=522.47ms max=5.8s    p(90)=1.04s  p(95)=1.31s
     iterations.....................: 997739 1662.103917/s
     vus............................: 1000   min=1000      max=1000
     vus_max........................: 1000   min=1000      max=1000


running (10m00.3s), 0000/1000 VUs, 997739 complete and 0 interrupted iterations
default ✓ [========================] 1000 VUs  10m0s

I got a p90 of 1 second in k6, but the p99 in Prometheus is always 4.95 ms and the p90 is 4.52 ms.
The query on Prometheus: histogram_quantile(0.99, sum(rate(caddy_http_request_duration_seconds_bucket{code=~"302"}[5m])) by (le))

I was expecting measurements closer to what I got in k6 (excluding client-side network latency, of course).
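
One thing I suspect (an assumption on my end, based on how histogram_quantile behaves with the default Prometheus buckets, where the lowest bucket for caddy_http_request_duration_seconds is le="0.005"): if essentially every request lands in that lowest bucket, the quantile is linearly interpolated inside it, which would produce exactly these numbers:

# Bucket counts; if (almost) all observations land in le="0.005"...
sum(rate(caddy_http_request_duration_seconds_bucket{code=~"302"}[5m])) by (le)

# ...then histogram_quantile interpolates linearly between 0 and 0.005 s:
#   p99 ≈ 0.99 × 0.005 s = 4.95 ms
#   p90 ≈ 0.90 × 0.005 s = 4.50 ms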

@francislavoie do you have any insights here? The top duration measurements are always very low; I was expecting more variation under load. Thank you.

These values are not precise enough. Is that really the output of the logs? All the requests likely have different duration values, but they’re so close together that this representation can’t express that.


@matt that’s what the logs are outputting, and those are the highest values.

Caddy is only replying with redirects (www to the root domain), so perhaps that’s why the duration is so low. But it’s also issuing TLS certificates, so I was expecting the first request to take longer.
If I do a curl https://<domain> while the TLS cert is not yet issued, the metrics should be higher and account for the time spent generating the certificate.
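
For what it’s worth, this is the kind of check I have in mind (hypothetical domain; curl’s timing variables split the TLS handshake out of the total time):

curl -o /dev/null -s -w 'tls handshake: %{time_appconnect}s  total: %{time_total}s\n' https://<domain>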

You can adjust the log output’s duration_format — see log (Caddyfile directive) — Caddy Documentation. Try unix_milli_float, which moves the decimal point to the right, so the output shouldn’t need an e exponent.
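
For example, something along these lines in the site block (a sketch; the option name and value are taken from the linked docs, adjust to taste):

log {
	format json {
		duration_format unix_milli_float
	}
}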

I think the metrics only include the time taken for the HTTP part of the request, not the TLS handshake. That’s because the metrics are only initialized after the connection is established and handed off to our ServeHTTP code; the TLS connection handling happens at a lower level, inside the Go stdlib.

