Monitoring "golden signals"

1. Output of caddy version:

$ caddy version
v2.5.2 h1:eCJdLyEyAGzuQTa5Mh3gETnYWDClo1LjtQm2q9RNZrs=

2. How I run Caddy:

a. System environment:

Directly on Debian GNU/Linux 11 (bullseye) with Systemd.

b. Command:

automatically via systemd

c. Service/unit/compose file:

Main Caddyfile

import {{ caddy_sites_conf_dir }}/*.caddyconf

First imported file

neururer.pjsmets.com
{
    encode zstd gzip

    root * /var/www/neururer.pjsmets.com/html
    file_server
}

Second imported file

www.pjsmets.com, pjsmets.com
{
    encode zstd gzip

    root * /var/www/pjsmets.com/html
    file_server
}

d. My complete Caddy config:

;
; Ansible managed
;
; source: https://github.com/mholt/caddy/blob/master/dist/init/linux-systemd/caddy.service
; version: 6be0386
; changes: Set variables via Ansible

[Unit]
Description=Caddy HTTP/2 web server
Documentation=https://caddyserver.com/docs
After=network-online.target
Wants=network-online.target systemd-networkd-wait-online.service
StartLimitIntervalSec=86400
StartLimitBurst=5

[Service]
Restart=on-failure

; User and group the process will run as.
User=www-data
Group=www-data

; Letsencrypt-issued certificates will be written to this directory.
Environment=CADDYPATH=/etc/ssl/caddy

ExecStart="/usr/local/bin/caddy" run --environ --config "/etc/caddy/Caddyfile"
ExecReload="/usr/local/bin/caddy" reload --config "/etc/caddy/Caddyfile"

; Limit the number of file descriptors; see `man systemd.exec` for more limit settings.
LimitNOFILE=1048576

; Use private /tmp and /var/tmp, which are discarded after caddy stops.
PrivateTmp=true
; Use a minimal /dev
PrivateDevices=true
; Hide /home, /root, and /run/user. Nobody will steal your SSH-keys.
ProtectHome=false
; Make /usr, /boot, /etc and possibly some more folders read-only.
ProtectSystem=full
; … except /etc/ssl/caddy, because we want Letsencrypt-certificates there.
;   This merely retains r/w access rights, it does not add any new. Must still be writable on the host!
ReadWriteDirectories=/etc/ssl/caddy /var/log/caddy

; The following additional security directives only work with systemd v229 or later.
; They further restrict privileges that can be gained by caddy.
; Note that you may have to add capabilities required by any plugins in use.
CapabilityBoundingSet=CAP_NET_BIND_SERVICE
AmbientCapabilities=CAP_NET_BIND_SERVICE
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

3. The problem I’m having:

I’m playing around with monitoring my Caddy setup and I’m wondering how to monitor the “golden signals”:

  • Latency/response time
  • Traffic
  • Errors

For the response times, the Caddy docs already give some nice sample queries. For the other two, it’s not so clear to me how to measure them.

  • Traffic: the caddy_http_requests_total metric gives counts per middleware handler. How can this be converted into total incoming traffic on the server? The docs warn about just summing everything up:

    Because all middleware handlers are instrumented, and many requests are handled by multiple handlers, make sure not to simply sum all the counters together.

  • Errors: There is caddy_http_request_errors_total, but this seems to combine 4xx and 5xx errors. Is it somehow possible to split this up?

4. Error messages and/or full log output:

/

5. What I already tried:

6. Links to relevant resources:

See inline links.

/cc @hairyhenderson if you have thoughts

Hi @Herrsubset,

For request rate (traffic), you can take the rate of increase of the caddy_http_request_duration_seconds_count metric. A query like sum(rate(caddy_http_request_duration_seconds_count[5m])) gives you the per-second request rate, averaged over a 5-minute window. You can filter and aggregate further as needed.
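
To make the "per-second rate averaged over a window" idea concrete, here is a minimal Python sketch of what a rate over a counter computes. It is a simplification, not Prometheus's actual implementation: the real rate() also extrapolates to the window boundaries. The function name simple_rate and the sample values are made up for illustration.

```python
from typing import List, Tuple

# (unix timestamp in seconds, counter value)
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the sampled window.

    Simplification (assumption): unlike Prometheus's rate(), this does not
    extrapolate to the window boundaries; it only handles counter resets.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset (e.g. Caddy restarted),
        # so the whole current value counts as new increase.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical scrapes of caddy_http_request_duration_seconds_count,
# taken 60 s apart over a 5-minute window.
scrapes = [
    (0.0, 1000.0),
    (60.0, 1300.0),
    (120.0, 1600.0),
    (180.0, 1900.0),
    (240.0, 2200.0),
    (300.0, 2500.0),
]
print(simple_rate(scrapes))  # 1500 requests over 300 s -> 5.0
```

The counter only ever goes up (apart from resets), so the raw value says little on its own; the rate turns it into requests per second.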

The caddy_http_request_errors_total metric tracks middleware errors, not HTTP error status codes as such. To track error rates, you can again use caddy_http_request_duration_seconds_count: it has a code label that you can filter on. For example, to track the ratio of 5xx responses by handler, you can use a PromQL query like:

sum(rate(caddy_http_request_duration_seconds_count{code=~"^5.*"}[5m])) by (handler)
/
sum(rate(caddy_http_request_duration_seconds_count[5m])) by (handler)
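
Since the original question was about splitting 4xx from 5xx, the same code label can be bucketed by status class. The sketch below does that arithmetic in Python on rows shaped like the result of sum by (handler, code) (rate(caddy_http_request_duration_seconds_count[5m])). The function name error_ratios and the sample rows are hypothetical; in PromQL itself the equivalent would be a separate code matcher per class.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# (handler, HTTP status code, requests per second)
Row = Tuple[str, str, float]


def error_ratios(rows: Iterable[Row]) -> Dict[str, Dict[str, float]]:
    """Share of 4xx and 5xx responses per handler, as fractions of all requests."""
    totals: Dict[str, float] = defaultdict(float)
    by_class: Dict[str, Dict[str, float]] = defaultdict(
        lambda: {"4xx": 0.0, "5xx": 0.0}
    )
    for handler, code, rps in rows:
        totals[handler] += rps
        if code.startswith("4"):
            by_class[handler]["4xx"] += rps
        elif code.startswith("5"):
            by_class[handler]["5xx"] += rps
    return {
        handler: {
            cls: (by_class[handler][cls] / total if total else 0.0)
            for cls in ("4xx", "5xx")
        }
        for handler, total in totals.items()
    }


# Hypothetical per-second rates for one handler.
rows = [
    ("file_server", "200", 9.0),
    ("file_server", "404", 0.75),
    ("file_server", "500", 0.25),
]
print(error_ratios(rows))
```

Keeping the two classes apart is usually worthwhile: 4xx responses mostly reflect client behaviour, while 5xx responses point at the server.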

If you’re interested, some useful queries are available in a preconfigured Caddy integration in Grafana Cloud. See the "Caddy integration" page in the Grafana Cloud documentation for some information. The code behind this integration is open source, too. (Note: I work for Grafana Labs on Grafana Cloud, so I may be biased :wink:)

Thanks for the query examples @hairyhenderson!

I’m actually trying all of this out on Grafana Cloud, but I missed that there’s an integration for Caddy. I configured the Grafana Agent to scrape the endpoint in some other way. Are there benefits to using this integration?

More on-topic though, do you think it makes sense to add these “golden signal monitoring” queries to the Caddy docs? I guess they would be a good start for many people trying to monitor their web server(s)?

I’m actually trying all of this out on Grafana Cloud

Awesome! :tada:

but I missed that there’s an integration for Caddy.

Yeah, there are a lot of integrations now so it’s getting harder to discover them… Our integrations team is working on making things more discoverable, so hopefully this will become easier in future.

I configured the Grafana Agent to scrape the endpoint in some other way. Are there benefits to using this integration?

Mostly just the preconfigured dashboard. Some integrations have more functionality - our k8s integration is quite feature-rich, for example. The Caddy one is fairly basic though - just a dashboard and configuration instructions.

do you think it makes sense to add these “golden signal monitoring” queries to the Caddy docs? I guess they would be a good start for many people trying to monitor their web server(s)?

Yes, totally! Do you feel up to contributing a PR? The documentation is here.

Thanks for all the info!

I’d love to contribute a PR. I’ll try to do it in the coming week, when I should have some time available :slight_smile:
