Awesome post, thanks @rugk. I’d like to explore this section a little:
I think this seems born out of (perhaps irrational) fear. In particular, I note that stripping UUIDs costs the telemetry project a lot of helpful per-instance information: for example, the ability to calculate averages of averages (how many requests does the average instance serve per second?). How do we answer those kinds of questions without being able to keep separate instances of Caddy discrete?
That information is helpful not just to everyone who might benefit from public data, but also to those of us who are particularly interested in one of the closing paragraphs of the announcement:
Our hope is that you will find telemetry a useful resource. On top of telemetry, we look forward to providing you with premium monitoring/alerting services, advanced reports, and data export directly to third-party services and tools (note that we would sell these tools/services, not the data itself). If enough people participate in telemetry, we may be able to do away with paid licenses, which is even more appealing.
– https://caddyserver.com/blog/caddy-0_11-telemetry#our-vision-of-telemetry
Which would of course not be possible at all if it were impossible to get all data for a given UUID, as you suggest.
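To make the "average instance" question concrete, here is a minimal sketch in Go. The `Report` type and its field names are my own invention for illustration, not Caddy's actual telemetry payload. It shows that the per-instance average needs some stable instance identifier to group by, while only a pooled figure survives once the IDs are stripped.

```go
package main

// Report is a hypothetical, simplified telemetry sample from one Caddy
// instance. The type and its field names are assumptions for this sketch.
type Report struct {
	InstanceID string  // the per-instance UUID under discussion
	Requests   float64 // requests handled during the reporting window
	Seconds    float64 // length of the reporting window in seconds
}

// MeanRatePerInstance answers "how many requests does the average instance
// serve per second?". It first totals each instance's reports, then averages
// the per-instance rates, so it cannot work without InstanceID.
func MeanRatePerInstance(reports []Report) float64 {
	type totals struct{ requests, seconds float64 }
	byInstance := make(map[string]totals)
	for _, r := range reports {
		t := byInstance[r.InstanceID]
		t.requests += r.Requests
		t.seconds += r.Seconds
		byInstance[r.InstanceID] = t
	}

	var sum float64
	var n int
	for _, t := range byInstance {
		if t.seconds == 0 {
			continue // skip instances with no measured time
		}
		sum += t.requests / t.seconds
		n++
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

// PooledRate is what remains computable without instance IDs: total requests
// over total seconds across all reports. Busy instances dominate this figure.
func PooledRate(reports []Report) float64 {
	var requests, seconds float64
	for _, r := range reports {
		requests += r.Requests
		seconds += r.Seconds
	}
	if seconds == 0 {
		return 0
	}
	return requests / seconds
}
```

With two reports of 1000 requests in 10 seconds from a busy instance and one report of 10 requests in 10 seconds from a quiet one, `MeanRatePerInstance` gives 50.5 requests per second while `PooledRate` gives 67. The two numbers answer different questions, and only the pooled one is available without the UUID.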
I submit that as no individually identifying information is collected, and data is already aggregated by the local Caddy instance, keeping a UUID of that Caddy instance isn’t de-anonymizing. To address the salient point, that the data can’t be correlated in some way to say that user XYZ visited website example.com served by Caddy UUID 1234, I’ll quote the article a little more (apologies for such heavy quoting):
Telemetry does NOT collect personal information. No cookies, no session IDs, no way to identify individual clients connecting to your server. Telemetry is concerned with benign, aggregate counts: successful TLS connections, HTTP requests handled, response latency, etc.; technical characteristics: properties of TLS handshakes, software version, User-Agent strings, MITM assessments, etc.; and timestamps; things like that.
– https://caddyserver.com/blog/caddy-0_11-telemetry#what-is-collected
One nice advantage of server-side telemetry is that the data is naturally aggregated—not just by metric name, but also by entire individual clients/users. Unlike most client-side telemetry implementations, our telemetry server does NOT receive any connections from individual end users (browsers) or information from any one end user.
– https://caddyserver.com/blog/caddy-0_11-telemetry#your-controls-and-privacy
I also wouldn’t mind discussing this point further:
- Also consider that maybe I have strict privacy requirements for one domain, but not for the other. So let me opt-out for some domains while allowing telemetry for others.
I have to admit I really don’t see the usefulness of this one; the telemetry is already aimed at being totally blind to the sites, individual clients, or content served. I see it more as metrics of the Caddy instance itself, rather than being partitioned into sets of statistics for each site.
In terms of the end result of what the telemetry server sees, all this achieves is an arbitrary skew of the overall data your Caddy instance will be aggregating. Do you see much value in that? I am curious to know your answer.
I’d love to see the public stuff be public domain. I believe that the devs want to see this kind of information publicly usable. It’s surely already collected at scale by the likes of Cloudflare, Google et al., just not available to you and me.
This is an interesting point, assuming that User-Agent is considered as personally identifiable as an IP address. Generally not, of course; but any user could set their own User-Agent arbitrarily, which makes it plausible that a European user with a globally unique User-Agent could cause us to violate GDPR, even in aggregate data.
It might be necessary to give Caddy the ability to aggregate counts of certain detectable common client types and throw out the rest.
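As a rough sketch of what that could look like (purely illustrative, not how Caddy does or would do it), the instance could map each User-Agent onto a short allowlist of client families and count everything else under a single "other" bucket, so an arbitrary, globally unique string never leaves the server. The family list, the function names, and the choice to count rather than drop unknown agents are all my assumptions.

```go
package main

import "strings"

// knownClients is an illustrative allowlist of User-Agent tokens and the
// family each maps to. Order matters: Edge's User-Agent also contains
// "Chrome/", and Chrome's also contains "Safari/", so the more specific
// tokens are checked first.
var knownClients = []struct{ token, family string }{
	{"Edg/", "edge"},
	{"Firefox/", "firefox"},
	{"Chrome/", "chrome"},
	{"Safari/", "safari"},
	{"curl/", "curl"},
}

// ClientFamily reduces a raw User-Agent string to a coarse family name.
// Anything not on the allowlist becomes "other", so the raw string is
// never stored or transmitted.
func ClientFamily(userAgent string) string {
	for _, k := range knownClients {
		if strings.Contains(userAgent, k.token) {
			return k.family
		}
	}
	return "other"
}

// CountClientFamilies aggregates a batch of User-Agent strings into
// per-family counts, which is all the telemetry payload would then carry.
func CountClientFamilies(userAgents []string) map[string]int {
	counts := make(map[string]int)
	for _, ua := range userAgents {
		counts[ClientFamily(ua)]++
	}
	return counts
}
```

A hand-crafted agent such as `my-unique-agent-jane-doe` would simply add one to "other". Dropping unknown agents entirely instead of counting them would be a one-line change, at the cost of the family counts no longer summing to the request total.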