The Caddy Telemetry Project


(Matt Holt) #1

I have the opportunity to work on Caddy for academic credit until April. The project involves implementing telemetry in Caddy so that we can better understand not only how Caddy is being used, but also the Internet at large. The plan is to enable telemetry by default on all Caddy instances (with an option to turn it off, of course) so that the public can inspect the behavior of Web clients and observe the health of the Internet in near-real-time.

Work has already begun. I am inviting any and all interested researchers or developers to participate! We need feedback on which measurements to collect and which technologies to use, as well as help with the actual implementation. If you would like to contribute to this effort in any way, please reply with an introduction of yourself and your interests and I will invite you to our Slack channel.

A few of us have already drafted a list of almost 100 questions to answer with a data set constructed from web servers operating all over the world, on different networks, and serving a variety of clients.

Some sample questions related to Caddy itself:

  • How many instances are running?
  • What countries are the instances in?
  • What platform are they running on? OS/arch/CPUs, etc.
  • Which Caddy versions are being used?
  • Is Caddy being used locally or in production?

A few sample questions about TLS and Internet security:

  • What is being advertised in TLS ClientHellos?
  • What is being advertised by Caddy’s ServerHellos?
  • What ends up being negotiated for a TLS connection?
  • Why do TLS connections fail? What alerts are raised?
  • How are certificates being managed by Caddy?
  • Which clients fail to adhere to HSTS?
  • How long is the average certificate chain?
  • Which HTTPS connections are being MITM’ed? (What client is being advertised vs. what characteristics does the client actually exhibit?)
  • How often is SNI being used?
  • Which clients are behaving abnormally? Which botnets might be rising? Who is being DDOS’ed or who is performing a DDOS?

A sample of other questions:

  • What HTTP errors are most common?
  • What errors is Caddy having internally?
  • How many connections use IPv6 vs. IPv4?
  • What is the latency in sending responses to clients?

These are just a few of the many, many questions we would like to be able to answer. The full list lives in Google Docs documents shared in our Slack channel, which participants can edit.

And yes, of course privacy, transparency, and applicable laws are all going to be a huge part of our discussions – at the proper time. The first stages are to determine at a high level which measurements to collect and choose technologies to store and produce the data set and make it available.

If you have any questions or would like to participate, reply here and introduce yourself! We’d love to have you participate!

(Keep in mind that extended discussion about telemetry should be focused in our Slack channel for now. We don’t want to lose track of anything.)


(Aaron Ellington) #2

Hello! This sounds like fun, I would be interested in being involved. I run an instance of Caddy on about 8 different machines, some for dev purposes and some for production. aaronellington@gmail.com


(Matt Holt) #3

Great - sent you an invite! Turn on notifications for that channel to stay updated.


(Toby Allen) #4

I would be happy to be involved.

We have 2 instances of Caddy without plugins, happy to share telemetry.


(Matt Holt) #5

Excellent! You are already on our Slack, so feel free to join the #telemetry channel there. The links to the docs in that channel will have the information that we’ve currently laid out.


(Jorge Luiz Correa Bernhard Tautz) #6

Hi, I’m arriving late to the post, but if it’s still possible I would like to help too. Currently we have two instances.


(Matt Holt) #7

Great! I will send an invite to your email address.


(Matt Holt) #8

Exciting day! I have a working implementation of telemetry in Caddy in the diagnostics branch: https://github.com/mholt/caddy/tree/diagnostics

For now, we’re calling it diagnostics for two reasons:

  • It accurately reflects what the data is being used for: answering specific questions related to Internet health, usage, and client behaviors, and diagnosing problems by their symptoms.
  • The term “telemetry” may trigger unnecessary negative reactions because of years of misinformation and assumptions about what telemetry is or does, along with the possible stigma after years of being associated with megacorps having controversial dedication to user privacy. The term “diagnostic” more positively conveys health and understanding, and that’s what we’re going for. (The fears about common telemetry are relatively unfounded; e.g. Firefox and Chrome telemetry is a huge reason the Web has been able to move to HTTPS at the rate it has!)

A few very simple metrics have been added as well. They were chosen for their simplicity and to help prove the system. At startup, we report things like number of sites; number of listeners; OS/arch; basic CPU info such as brand/number of cores/whether it has AES-NI; Caddy version, etc. We bundle User-Agent strings to help us test metrics collection from requests and to understand the true spread of clients. We also count how many certificates are being managed by Caddy with the ACME protocol. (When finished, all metrics will be documented.)
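As a rough illustration of the kind of startup report described above (this is not the code in the diagnostics branch), here is a minimal Go sketch that gathers a few of those values using only the standard library. The `startupMetrics` type, its field names, and `collectStartupMetrics` are hypothetical names chosen for this example; the site count, listener count, and version string are passed in as assumed inputs. CPU brand and AES-NI detection are left out because the standard library does not expose them.

```go
package main

import (
	"encoding/json"
	"fmt"
	"runtime"
)

// startupMetrics is a hypothetical payload for this sketch; the field
// names are illustrative and not taken from the diagnostics branch.
type startupMetrics struct {
	CaddyVersion string `json:"caddy_version"`
	NumSites     int    `json:"num_sites"`
	NumListeners int    `json:"num_listeners"`
	OS           string `json:"os"`
	Arch         string `json:"arch"`
	NumCPU       int    `json:"num_cpu"` // logical CPUs visible to the process
}

// collectStartupMetrics combines values supplied by the caller (version,
// site and listener counts) with platform facts from the runtime package.
func collectStartupMetrics(version string, sites, listeners int) startupMetrics {
	return startupMetrics{
		CaddyVersion: version,
		NumSites:     sites,
		NumListeners: listeners,
		OS:           runtime.GOOS,
		Arch:         runtime.GOARCH,
		NumCPU:       runtime.NumCPU(),
	}
}

func main() {
	// The arguments here are placeholder values for demonstration only.
	m := collectStartupMetrics("0.0.0-example", 2, 1)
	out, err := json.MarshalIndent(m, "", "  ")
	if err != nil {
		fmt.Println("encode error:", err)
		return
	}
	fmt.Println(string(out))
}
```

Running it prints a small JSON object; the OS, architecture, and CPU count depend on the machine it runs on.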

Diagnostics can be disabled with the -no-diagnostics flag.
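To show how an opt-out flag of this shape can work, here is a small Go sketch using the standard `flag` package. Only the flag name `-no-diagnostics` comes from this post; the `diagnosticsEnabled` helper and the flag set name are assumptions made for the example, and Caddy's actual flag handling may differ.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// diagnosticsEnabled is a hypothetical helper: it parses the given
// arguments and reports whether diagnostics should stay on.
// Diagnostics are on by default and off only when -no-diagnostics is set.
func diagnosticsEnabled(args []string) (bool, error) {
	fs := flag.NewFlagSet("caddy-sketch", flag.ContinueOnError)
	noDiag := fs.Bool("no-diagnostics", false, "disable diagnostics reporting")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return !*noDiag, nil
}

func main() {
	enabled, err := diagnosticsEnabled(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("diagnostics enabled:", enabled)
}
```

The point of the design is that the default is "on" and the user has to take one explicit action to turn reporting off.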

Rather than risk this thread going way off-topic from project updates, feel free to join us on Slack to discuss it – let me know if you want an invite! If you join, please participate. Lurking is fine, but it doesn’t do the rest of us much good. :)

When this is released in a few months, it should offer us the first-ever open data set of the health of the Internet from a server perspective, which is really exciting. We’re looking at a delay of only a few minutes for most of the stats, which will help us detect emerging problems or trends and respond quickly, and help developers and researchers make more informed decisions. It will also improve Caddy, since we’ll have an idea of how much certain features are used, what kinds of errors are happening, etc.

Thank you to all who have participated so far! Looking forward to the next steps with you!