The Caddy Telemetry Project

I have the opportunity to work on Caddy for academic credit until April. The project involves implementing telemetry in Caddy so that we can better understand how not only Caddy, but also the Internet at large, is being used. The goal is to enable telemetry by default on all Caddy instances (with an option to turn it off, of course) so that the public can inspect the behavior of Web clients and observe the health of the Internet in near-real-time.

Work has already begun, and I am inviting any and all interested researchers and developers to participate! We need feedback on which measurements to collect and which technologies to use, as well as actual implementation assistance. If you would like to contribute to this effort in any way, please introduce yourself and your interests and I will invite you to our Slack channel.

A few of us have already drafted a list of almost 100 questions to answer with a data set constructed from web servers operating all over the world, on different networks, and serving a wide variety of clients.

Some sample questions related to Caddy itself:

  • How many instances are running?
  • What countries are the instances in?
  • What platform are they running on? OS/arch/CPUs, etc.
  • Which Caddy versions are being used?
  • Is Caddy being used locally or in production?

A few sample questions about TLS and Internet security:

  • What is being advertised in TLS ClientHellos?
  • What is being advertised by Caddy’s ServerHellos?
  • What ends up being negotiated for a TLS connection?
  • Why do TLS connections fail? What alerts are raised?
  • How are certificates being managed by Caddy?
  • Which clients fail to adhere to HSTS?
  • How long is the average certificate chain?
  • Which HTTPS connections are being MITM’ed? (What client is being advertised vs. what characteristics does the client actually exhibit?)
  • How often is SNI being used?
  • Which clients are behaving abnormally? Which botnets might be rising? Who is being DDOS’ed or who is performing a DDOS?

A sample of other questions:

  • What HTTP errors are most common?
  • What errors is Caddy having internally?
  • How many connections use IPv6 vs. IPv4?
  • What is the latency in sending responses to clients?
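
As an illustration of how one of these questions might be answered, here is a minimal sketch of tallying IPv6 vs. IPv4 connections. It is written in Python for brevity (Caddy itself is written in Go), and the function name `count_ip_versions` is hypothetical, not part of Caddy. It assumes the server can hand us each client's remote address as a string, and it counts IPv4-mapped IPv6 addresses as IPv4, which is a design choice rather than a settled decision.

```python
import ipaddress
from collections import Counter


def count_ip_versions(remote_addrs):
    """Tally connections by IP version from an iterable of address strings.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for addr in remote_addrs:
        ip = ipaddress.ip_address(addr)
        # An IPv4-mapped IPv6 address (::ffff:a.b.c.d) is really an IPv4
        # client seen through a dual-stack socket, so count it as IPv4.
        if ip.version == 6 and ip.ipv4_mapped is not None:
            counts["ipv4"] += 1
        else:
            counts[f"ipv{ip.version}"] += 1
    return counts
```

Only the per-version totals would need to be reported, so the addresses themselves never have to leave the server.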

These are just a few of the many, many questions we would like to be able to answer. The full list lives in Google Docs documents that are shared in our Slack channel, and participants can edit them.

And yes, of course privacy, transparency, and applicable laws are all going to be a huge part of our discussions – at the proper time. The first stages are to determine at a high level which measurements to collect and choose technologies to store and produce the data set and make it available.

If you have any questions or would like to participate, reply here and introduce yourself! We’d love to have you participate!

(Keep in mind that extended discussion about telemetry should be focused in our Slack channel for now. We don’t want to lose track of anything.)

Hello! This sounds like fun, I would be interested in being involved. I run an instance of Caddy on about 8 different machines, some for dev purposes and some for production. aaronellington@gmail.com

Great - sent you an invite! Turn on notifications for that channel to stay updated.

I would be happy to be involved.

We have 2 instances of Caddy without plugins, happy to share telemetry.

Excellent! You are already on our Slack, so feel free to join the #telemetry channel there. The links to the docs in that channel will have the information that we’ve currently laid out.

Hi, I’m arriving late to this post, but if it’s still possible I would like to help too. Currently we have two instances.

Great! I will send an invite to your email address.

Exciting day! I have a working implementation of telemetry in Caddy in the telemetry branch: https://github.com/mholt/caddy/tree/telemetry

A few very simple metrics have been added as well. They were chosen for their simplicity and to help prove the system. At startup, we report things like number of sites; number of listeners; OS/arch; basic CPU info such as brand/number of cores/whether it has AES-NI; Caddy version, etc. We bundle User-Agent strings to help us test metrics collection from requests and to understand the true spread of clients. We also count how many certificates are being managed by Caddy with the ACME protocol. (When finished, all metrics will be documented.)
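
To give a rough idea of the shape of these metrics, here is a minimal sketch in Python (Caddy itself is written in Go, and this is not the code in the telemetry branch). All names here, including `startup_report`, `UserAgentBundle`, and the field names, are hypothetical. AES-NI and CPU brand detection are left out because the Python standard library has no portable way to query them.

```python
import os
import platform
from collections import Counter


def startup_report(caddy_version, num_sites, num_listeners):
    """Build a one-time startup payload (hypothetical field names)."""
    return {
        "caddy_version": caddy_version,
        "num_sites": num_sites,
        "num_listeners": num_listeners,
        "os": platform.system().lower(),
        "arch": platform.machine(),
        "cpu_count": os.cpu_count(),
    }


class UserAgentBundle:
    """Collects User-Agent strings and reports them in batches."""

    def __init__(self):
        self._counts = Counter()

    def record(self, user_agent):
        # Store one counter per distinct string instead of one entry
        # per request, so the bundle stays small under heavy traffic.
        self._counts[user_agent] += 1

    def flush(self):
        # Return the current bundle and start a fresh one.
        bundle = dict(self._counts)
        self._counts.clear()
        return bundle
```

A server would call `record` on each request and `flush` whenever it sends a report, so each distinct User-Agent string is transmitted once per report along with its count.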

Rather than risk this thread going way off-topic from project updates, feel free to join us on Slack to discuss it – let me know if you want an invite! If you join, please participate. Lurking is fine, but it doesn’t do the rest of us much good. :slight_smile:

When this is released in a few months, it should offer us the first-ever open data set of the health of the Internet from a server perspective, which is really exciting. We’re looking at a delay of only a few minutes for most of the stats, which will help us detect emerging problems or trends and respond quickly, and will help developers and researchers make more informed decisions. It will also improve Caddy, since we’ll have an idea of how much certain features are used, what kinds of errors are happening, etc.

Thank you to all who have participated so far! Looking forward to the next steps with you!

Latest update:

I have successfully gotten all the pieces working end-to-end: Caddy can report telemetry to a central collection endpoint, which saves the payload to a database. Meanwhile, I’ve made a web page that can request some basic stats from the database and show the results in near-real-time. It’s simple for now, but already pretty cool. :ok_hand:
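
For readers who want a mental model of that pipeline, here is a minimal sketch of the collect-store-query flow in Python with an in-memory SQLite database. It is an illustration under assumptions, not the actual collection endpoint: the table layout and the names `open_store`, `collect`, and `basic_stats` are all hypothetical, and the HTTP layer is omitted entirely.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory database with a hypothetical payload table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE payloads ("
        " instance_id TEXT NOT NULL,"
        " received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,"
        " body TEXT NOT NULL)"
    )
    return conn


def collect(conn, instance_id, payload):
    """What a collection endpoint might do with one reported payload."""
    conn.execute(
        "INSERT INTO payloads (instance_id, body) VALUES (?, ?)",
        (instance_id, json.dumps(payload)),
    )
    conn.commit()


def basic_stats(conn):
    """The kind of query a stats page could run against the store."""
    total, instances = conn.execute(
        "SELECT COUNT(*), COUNT(DISTINCT instance_id) FROM payloads"
    ).fetchone()
    return {"payloads": total, "instances": instances}
```

Because the stats are computed by querying the store directly, a page that polls `basic_stats` every so often would see new payloads shortly after they arrive.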

So here’s the plan for now. I want to release this with version 0.11 next month. It will be a fairly basic implementation of telemetry to help prove its technical aspects. Along with the release, a new portion of the Caddy website will be available where you can go to view some global stats and look up reports for your own instances that you’re running.

I could really use more participants before the release, but the release will go forward anyway. If you want to contribute or be involved in a way that has some weight to the project, now is the time: after the release may be too late to make some changes. I’m especially interested to hear from any researchers who would like certain metrics to be reported! Testers get access to the staging database during this development phase, so you can look through all the data.

If you want to participate in testing, consider this your open invitation. Please let me know and I’ll invite you to our Slack channel where you can discuss it with us. I want to keep the discussion centralized so anything important doesn’t slip through.

You can build Caddy from the telemetry branch here to try it: https://github.com/mholt/caddy/pull/2079

Telemetry can be disabled, of course. Exactly how that is done is still being discussed. So far it looks like it will be a compile-time option, which offers several advantages, including knowing approximately how representative the metrics we do receive actually are of the whole population. A command-line flag would not be able to tell us that.

I’m also going to be working on documentation in the coming days, to be sure that is ready for the release.

Thanks to any and all who help in this project!

I just upgraded my Caddy installation to use the telemetry version. Is it possible to see that website?

Awesome, thanks!

The web page isn’t published yet; it’s still local. I need more feedback on what to put on it. DM me your email address and I will send you an invite to our Slack!

(Closing this thread, in favor of the designated discussion thread and the telemetry pre-release announcement blog post. We’ll keep the discussion as cohesive as possible this way.)