Caddy 0.11 Will Have Telemetry - discuss

Awesome post, thanks @rugk. I’d like to explore this section a little:

I think this in particular seems born out of (perhaps irrational) fear; I note that stripping UUIDs costs the telemetry project a lot of helpful per-instance information, for example the ability to calculate averages of averages (how many requests does the average instance serve per second?). How do we answer those kinds of questions without being able to keep separate instances of Caddy discrete?
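To make the "average of averages" point concrete, here is a minimal Python sketch. The report layout, the UUID strings, and the numbers are all invented for illustration; this is not Caddy's actual telemetry schema.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical aggregated reports: (instance_uuid, requests, seconds_covered).
reports = [
    ("uuid-a", 6000, 60),
    ("uuid-a", 3000, 60),
    ("uuid-b", 60, 60),
    ("uuid-c", 120, 60),
]

def per_instance_rates(rows):
    """Requests per second for each instance, grouped by its UUID."""
    totals = defaultdict(lambda: [0, 0])
    for instance_uuid, requests, seconds in rows:
        totals[instance_uuid][0] += requests
        totals[instance_uuid][1] += seconds
    return {key: reqs / secs for key, (reqs, secs) in totals.items()}

# With UUIDs: average the per-instance averages.
rates = per_instance_rates(reports)
average_instance_rate = mean(rates.values())

# Without UUIDs: only the pooled figure across all reports is available.
pooled_rate = sum(r for _, r, _ in reports) / sum(s for _, _, s in reports)
```

With these made-up numbers the average instance serves 26 requests per second, while the pooled figure is 38.25 because one busy instance dominates it. Without a per-instance key, only the second number can be computed.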

Not just helpful to everyone in general who might benefit from public information, but also those of us who are interested in particular in one of the closing paragraphs of the announcement:

Our hope is that you will find telemetry a useful resource. On top of telemetry, we look forward to providing you with premium monitoring/alerting services, advanced reports, and data export directly to third-party services and tools (note that we would sell these tools/services, not the data itself). If enough people participate in telemetry, we may be able to do away with paid licenses, which is even more appealing.

https://caddyserver.com/blog/caddy-0_11-telemetry#our-vision-of-telemetry

Which would of course not be possible at all if it were impossible to get all data for a given UUID, as you suggest.

I submit that as no individually identifying information is collected, and data is already aggregated by the local Caddy instance, keeping a UUID of that Caddy instance isn’t de-anonymizing. To answer the salient point - that the data can’t be correlated in some way to say that XYZ user visited website example.com served by Caddy UUID 1234 - I quote the article a little more (apologies for such heavy quoting):

Telemetry does NOT collect personal information. No cookies, no session IDs, no way to identify individual clients connecting to your server. Telemetry is concerned with benign, aggregate counts: successful TLS connections, HTTP requests handled, response latency, etc.; technical characteristics: properties of TLS handshakes, software version, User-Agent strings, MITM assessments, etc.; and timestamps; things like that.

https://caddyserver.com/blog/caddy-0_11-telemetry#what-is-collected

One nice advantage of server-side telemetry is that the data is naturally aggregated—not just by metric name, but also by entire individual clients/users. Unlike most client-side telemetry implementations, our telemetry server does NOT receive any connections from individual end users (browsers) or information from any one end user.

https://caddyserver.com/blog/caddy-0_11-telemetry#your-controls-and-privacy

I also wouldn’t mind discussing this point further:

  • Also consider that maybe I have strict privacy requirements for one domain, but not for the other. So let me opt-out for some domains while allowing telemetry for others.

I have to admit I really don’t see the usefulness of this one; the telemetry is already aimed at being totally blind to the sites, individual clients, or content served. I see it more as metrics of the Caddy instance itself, rather than being partitioned into sets of statistics for each site.

In terms of the end result of what the telemetry server sees, all this achieves is an arbitrary skew of the overall data your Caddy instance will be aggregating. Do you see much value in that? I am curious to know your answer.

I’d love to see the public stuff be public domain. I believe that the devs want to see this kind of information publicly usable - it’s already collected at scale by the likes of Cloudflare, Google et al for sure, just not available to you and me.

This is an interesting point, assuming that User-Agent is considered as personally identifiable as an IP address. Generally not, of course; but any user could set their own User-Agent arbitrarily, which makes it plausible that a European user with a globally unique User-Agent could cause us to violate GDPR, even in aggregate data.

It might be necessary to give Caddy the ability to aggregate counts of certain detectable common client types and throw out the rest.
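One way such filtering could look, as a rough Python sketch: the list of client families and the matching rules are assumptions made up for this example, not anything Caddy is known to implement.

```python
from collections import Counter

# Hypothetical allowlist of coarse client families (substring -> label).
# Order matters: Chrome's User-Agent string also contains "Safari/".
KNOWN_CLIENTS = [
    ("Firefox/", "firefox"),
    ("Chrome/", "chrome"),
    ("Safari/", "safari"),
    ("curl/", "curl"),
]

def classify(user_agent):
    """Return a coarse client label, or None if the string is unrecognized."""
    for needle, label in KNOWN_CLIENTS:
        if needle in user_agent:
            return label
    return None

def count_clients(user_agents):
    """Count known client families; unrecognized strings are thrown away."""
    counts = Counter()
    for user_agent in user_agents:
        label = classify(user_agent)
        if label is not None:
            counts[label] += 1
    return counts
```

Because only the coarse label is counted, a globally unique User-Agent string never leaves the instance; it either maps to a common family or is discarded.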

2 Likes

Hi Matt,

A question about this: “the telemetry server has the ability to remotely disable (but NOT enable!) telemetry in Caddy instances at any time”.

Would this be abusable, I mean how difficult would it be for a third party to remotely disable the telemetry in a Caddy server?

Blessings,
Peter

From a technical standpoint, I expect this will be done by the second party (telemetry server) when the first party (local Caddy) connects to check in. The local Caddy won’t have any API or equivalent to send a command to in order to disable its telemetry, hence a third party could never send such a command. This would also be why the telemetry server can never re-enable telemetry remotely - once the local Caddy is instructed not to check in again, it can’t be told to start once more.
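A toy Python model of that expectation follows. The class, the `stop` field in the reply, and the `send` callable are hypothetical stand-ins; the real wire format is not described in this thread.

```python
class TelemetryClient:
    """Toy model of a Caddy instance that only ever makes outbound check-ins."""

    def __init__(self, send):
        # `send` stands in for an outbound request to the telemetry server;
        # it takes a payload and returns the server's reply as a dict.
        self._send = send
        self.enabled = True

    def check_in(self, payload):
        """Send one report and obey a 'stop' instruction in the reply."""
        if not self.enabled:
            return False  # Disabled: the server is never contacted again.
        reply = self._send(payload)
        if reply.get("stop"):
            self.enabled = False
        return True
```

The client exposes no listener and no method that sets `enabled` back to `True`, so in this model a stop instruction can only arrive in the reply to a check-in the instance itself started, and it cannot be undone remotely.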

1 Like

Hi Peter – an attacker would have to gain privileged access to the right machine and make specific modifications in order to configure clients to shut off telemetry.

@rugk - Thank you for taking the time to elaborate your thoughts with reasoning! I will consider everything you said, but let me respond to a couple things here:

The problem with run-time options is that we lose information about how reliable and representative the metrics are. For some research questions, that jeopardizes the usefulness of the data set, which may defeat the purpose of collecting telemetry in the first place. We will look into this more as time goes on, if we feel we have enough statistical information about our sample.

I resent that Europe thinks it can make laws and project them onto non-Europeans in other sovereign territories. :stuck_out_tongue: I would be interested to know how much money it would cost them to enforce their laws in the US, especially considering such a small project as Caddy. I guess if they go to all that trouble, it’s their taxpayers’ money, not their own. But as you said, I do not think GDPR applies here, so the point is probably moot.

Like Matthew said, I also do not see the technical justification for this. We would lose essential grouping information. I’ll defer to his post which is much more informative about this!

Other than those main points, I think we’re on the same page. And I agree with you on many of your points which I didn’t highlight here! Anyway, that’s why we’re easing into this rather than unleashing a huge feature set all at once; we’re being conservative.

1 Like

I just updated Visual Studio Code and on the first time I opened it, this was shown:

[Image: Screenshot from 2018-04-09 15-53-12]

Perhaps if telemetry is enabled, show a console message every time Caddy is started stating that it is collecting usage data, with a link to the documentation of how the data is used and a link to the documentation explaining how to turn it off. (And maybe also add a flag to hide this warning.) Something like:

caddy                     # Starts Caddy with telemetry and the console notice.
caddy -telemetry          # Starts Caddy with telemetry. No notice.
caddy -disable-telemetry  # Starts Caddy without telemetry and no notice.

What do you think?
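The three invocations proposed above could be modelled roughly like this in Python. The flag names come from the proposal in this post and are not existing Caddy flags; the notice text and function names are invented for the sketch.

```python
import argparse

# Placeholder wording; the real notice would link to the documentation.
NOTICE = "Telemetry is enabled. See the documentation to learn more or turn it off."

def build_parser():
    parser = argparse.ArgumentParser(prog="caddy")
    group = parser.add_mutually_exclusive_group()
    # Proposed flag: keep telemetry on but suppress the startup notice.
    group.add_argument("-telemetry", dest="telemetry", action="store_true")
    # Proposed flag: turn telemetry off (so there is nothing to announce).
    group.add_argument(
        "-disable-telemetry", dest="disable_telemetry", action="store_true"
    )
    return parser

def startup_plan(argv):
    """Return (telemetry_on, show_notice) for the given command-line flags."""
    args = build_parser().parse_args(argv)
    telemetry_on = not args.disable_telemetry
    show_notice = telemetry_on and not args.telemetry
    return telemetry_on, show_notice
```

Running with no flags gives telemetry plus the notice; passing either flag is treated as an explicit, informed choice, so the notice is skipped.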

3 Likes

The console message: Sure, we can consider that.

But again, telemetry will probably be togglable at compile time, so we can measure how representative it is.

1 Like

I give a big thumbs up to a nice little notice on startup

Sure; and I resent the cases in which the US does the same in reverse (most specifically in claiming legal right of access to data on computers elsewhere in the world).

In any case, my concern would be whether I as a user in the EU (for now!) could possibly be affected.

As for whether the data collected has any privacy concerns at all, I would say that in the end that’s not the issue; if people can see something they can misinterpret, they will do so. Caddy may never draw such people’s attention, but might you not want to have a defence against unjustified complaints in place - in the form of requiring an opt-in?

Oh, and a notice on start-up in practice does nothing useful in the case of systems running as a service…

Paul

2 Likes

It seems like you’re arguing for the Caddy telemetry project to debase itself - that is, to sabotage the integrity of any data produced and thus the main value of the project in the first place - not for the sake of a credible argument supporting that course of action, but out of fear of what misinformed complainants might do if that course of action is not taken.

I believe that would be intellectually dishonest, and I think that any argument based on this line of thought should have no influence whatsoever on the direction the devs choose to take Caddy… on this or any other topic.

Prudence is one thing, and it takes the form of an honest, thoughtful, and comprehensive blog post outlining exactly to what extent you as a user anywhere could possibly be affected.

(The GDPR has been a major part of this thread, yes, but having privacy concerns at the forefront of the telemetry project isn’t just driven by fear of legal repercussions in the EU. It has to be based on an honest belief in privacy for privacy’s sake, for all users, and not just privacy theatre… or it’s just for show, really.)

It’s hard to know whether you are being serious or making a joke.

Are you saying that if the activities of Light Code Labs LLC were in scope of GDPR it would simply ignore it due to the size of the project (“considering such a small project as Caddy”)?

That seems like an uncharitable interpretation of Matt’s comment.

It might be best to refrain from speculating on what the EU will or will not do to enforce the GDPR outside of its own sovereignty. The politics aren’t all that relevant to Caddy from a technical standpoint, and neither is anyone’s personal opinion on that topic, so let’s keep the discussion practical.

Not at all; I think you are deluding yourselves if you feel that opt-in will unbalance the data but opt-out will not. In both cases the dataset is distorted by people’s choices, and sadly this is inevitable. But, if you wish, an opt-in could be advertised as prominently as has been suggested for an opt-out. Or even don’t have a default - make it so that a command line or Caddyfile parameter saying “accept” or “decline” is required before Caddy will run.
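The "no default" variant could be as small as the following Python sketch. The `telemetry` config key and the error message are hypothetical; only the accept/decline idea comes from the post.

```python
VALID_CHOICES = {"accept", "decline"}

def telemetry_enabled(config):
    """Return True or False for an explicit choice; refuse to start otherwise."""
    choice = config.get("telemetry")
    if choice not in VALID_CHOICES:
        # No default on purpose: the operator has to decide either way.
        raise ValueError('set telemetry to "accept" or "decline" before starting')
    return choice == "accept"
```

A server would call this once at startup and exit with the error if no choice was made, so neither opt-in nor opt-out is the silent default.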

As for the rest, I guess we’ll just have to disagree. The likelihood that a project such as Caddy would get into the kind of trouble that Facebook has over sharing data whose owners had allowed it to be public is minuscule. But having worked for the past fifteen years in a field where I had to show powerful regulators how I was preparing for such small risks has made me rather sensitive in this matter.

I’m happy to debate exactly what the value of opt-in data vs opt-out data is. But that was not the important part of my reply to you. This next bit was:

And the reason was this comment:

So to reiterate: are you going to make an argument FOR opt-out, or will you stick with your argument AGAINST opt-in that seems purely based on fear of angry, misinformed people? I believe that if the event comes that Caddy must defend its decision against powerful regulators, and they’re angry and misinformed, there will be an opportunity to make them informed.


If you’re referring to Cambridge Analytica, it’s well understood that only a very small fraction of the data set given to CA had actually used the app in question and agreed to share that data with the app’s developer. The rest were merely friends with the people who agreed.

Um, swap opt-out and opt-in there. I’ve said what I think, and don’t feel the need to try any harder to influence your choice.

1 Like

@omz13 I was expressing a political opinion – nothing more. :slight_smile:

@pwhodges Thanks for your comments, I think they’re reasonable – and I appreciate you taking the time to write them!

I will take these into consideration as we wrap up the initial release here soon.

Took some time for my reply, sorry.

Okay, good point. But I have an easy solution:
Just hash the UUID on the telemetry server (with a strong password hash function)! Then you can still correlate the data, but unless someone submits their UUID to you, you cannot correlate the data to that single server. :smiley:
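A minimal Python sketch of that idea, assuming PBKDF2 as the slow hash and a fixed server-side salt so that the same UUID always maps to the same key. The salt value, iteration count, and in-memory `database` are placeholders, not a recommendation.

```python
import hashlib

# Hypothetical server-side secret. It must stay fixed, otherwise reports
# from the same instance would no longer land under the same key.
SERVER_SALT = b"example-only-salt"

def storage_key(instance_uuid):
    """Derive the database key from an instance UUID with a slow hash."""
    digest = hashlib.pbkdf2_hmac(
        "sha256", instance_uuid.encode("utf-8"), SERVER_SALT, 100_000
    )
    return digest.hex()

# Stand-in for the telemetry database: hashed key -> list of reports.
database = {}

def record(instance_uuid, metrics):
    """Store a report under the hashed UUID; the raw UUID is not kept."""
    database.setdefault(storage_key(instance_uuid), []).append(metrics)

def lookup(instance_uuid):
    """Anyone who presents a UUID gets the reports stored for it."""
    return database.get(storage_key(instance_uuid), [])
```

Reports from one instance still group together because the hash is deterministic, but the stored key alone does not reveal the UUID that produced it.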

That’s your view, but not that of the server admins. Say they have banking-secure.example and banking-website.example. They may enable telemetry for the second, but they may be prohibited by law from enabling stat submissions for the first. Maybe they are not even allowed to keep logs for the first, etc. (This is just an example.)

It’s still opt-out, so no skew. Only few users may . This is a slight change, but if you would want 100% unbiased data, you would have to not include the possibility to opt-out. And you know, this is not a good thing. So keep the opt-out! And make it possible to disable for the server admin:

Depends:

  • Must-have: Opt-out at server admin level. (whoever compiles the software – distro owner or so – should not have the control over whether I – the server admin – want to submit telemetry or not. At least I should be able to override their decision, at least in the negative way to always be able to disable it.)
  • Nice-to-have: Opt-out per domain.

I just think these two can quite easily be combined, by adding a config option. :smiley:
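How the two opt-outs might combine can be sketched in a few lines of Python. The parameter names are invented; the precedence (a negative decision always wins) follows the must-have described above.

```python
def telemetry_allowed(domain, compiled_in, admin_disabled, domain_opt_outs):
    """Decide whether metrics for `domain` may be included in telemetry."""
    if not compiled_in:
        return False  # Built without telemetry: nothing can turn it on.
    if admin_disabled:
        return False  # The server admin can always override in the negative.
    return domain not in domain_opt_outs  # Optional per-domain opt-out.
```

For example, with `domain_opt_outs={"banking-secure.example"}` the sensitive site is excluded while `banking-website.example` still reports, and setting `admin_disabled=True` silences both regardless of how the binary was built.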


So basically you want to force users to have telemetry enabled? That’s stupid and will likely result in a fork or so. You don’t want that…
There is no alternative to a runtime opt-out. I know opt-in skews the statistics (and that’s the only thing you write in your blog), but opt-out does not skew them as much as it seems. Remember that even Firefox and the like provide a runtime opt-out! It would be silly for Mozilla to say: Hey, you got Firefox from your distro? Then telemetry is always disabled. You got it from us? Then telemetry is always enabled.

The emoji does not make that polemic and incorrect statement any better. To explain again: the GDPR applies to European citizens and their data. So when you process data of Europeans or offer a product to them, it applies. Of course, it does not apply to US customers or so.

They would get the money back from you; that’s not hard. The only question would be whether they’d care to really enforce that law for such a small project as Caddy.
But even without any legal requirement… let’s ignore the GDPR (and any potential legal threat) for a moment: It does not make a difference. You should still do it exactly the same, anonymize data, aggregate data.

Actually, if you hash the ID as I just suggested, you could answer/implement such requests. A user can enter their own UUID, the server hashes it, and the user gets all the data.
That does not even need authentication, as the UUID (if properly used with rate limiting/password hashing) acts as the authentication in this case.
So I think that may be a good idea. Especially as the GDPR, again, also includes the right for users to look up their own data. So if this is done, users could do that. :smiley:

As explained, that is not the case. Of course a compile-time option also skews the data, so I’d say this is the “skewness” scale:

  • no option to opt-out: no data skewing, but as users are lost due to fork actually also a kind of wrong data
  • opt-out at compile time: say 10%-15% data skewing (percent does not mean anything, just for the relations) – many distros will likely disable that by default
  • opt-out at compile time & runtime: 15%-20% data skewing
  • opt-out at runtime: 5%-10% data skewing
  • opt-in: likely >60% data skewing

That’s how I would estimate it. Based on how much users/maintainers/installations would likely disable that feature.
Thing is, we have no facts here and never can, because to find out how many installations have that feature disabled, we’d need tracking again. So that does not work. One can only guess.

I mean if you have 3 million users, where 5000 do not send telemetry it also does not matter. Statistics always have some discrepancy, but you can neglect it when you do opt-out IMHO. Few users will adjust such a setting (especially if they feel that you care about privacy and that telemetry is actually useful).
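The size of that effect can be checked with a little arithmetic. In the Python sketch below, the 3 million and 5000 figures come from the post; the request rates for reporting and silent instances are assumptions chosen only to show a pessimistic case.

```python
def true_mean(observed_mean, silent_mean, fraction_silent):
    """Population mean when a fraction of instances does not report."""
    return (1 - fraction_silent) * observed_mean + fraction_silent * silent_mean

# Figures from the post: 5000 silent instances out of 3 million.
fraction_silent = 5_000 / 3_000_000

# Hypothetical rates: reporting instances average 100 requests per second,
# and the silent ones are assumed to be ten times busier.
shift = true_mean(100.0, 1000.0, fraction_silent) - 100.0
```

The silent instances are about 0.17% of the total, so even if they were ten times busier than everyone else, the overall mean would move from 100 to roughly 101.5. The bias is the silent fraction times the gap between the two groups, which stays small as long as few instances opt out.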


Also you seem to ask about opt-in vs opt-out at compile time. I say opt-out at runtime. Is not that a reasonable compromise?
You see many users would argue for opt-in. I see your point that this skews data, but having an opt-out at compile time is not even worth the word “opt-out” (because it’s the wrong people who opt out. I am the server admin, this is my data. So I need an opt-out, not someone who wants to decide that for me.)

No – of course not. That’s contrary to everything we’ve been saying from the beginning. Telemetry can be disabled! It is entirely optional (literally, it is an option).

You will have the ability to make that decision, of course – you’re the sysadmin, you have the ability to control what programs you run with what configuration you want. This isn’t a hosted platform or a social network where we’re forcing you one way or the other behind a walled garden!

I just wanted to highlight that a compile-time opt-out is no opt-out for the server admin. Not all server admins use your pre-compiled binaries or compile it themselves. :smile:

I’m not sure I understand what this will achieve. The UUID is not derived from, nor can it reveal, any identifying information about the server it came from - except for the fact that it’s tied to the data it came with.

Hashing it doesn’t stop someone from requesting all data for a given UUID, it doesn’t protect any sensitive secrets or credentials from the database owner (as the UUID is neither), and it doesn’t stop the database owner from separating out any discrete instance of Caddy.

I suppose you could argue that the database owner can’t take the hash and put it straight into the front-end of the metrics site, but why would they do that when they can just look at the data in the database?

UUIDs are equally meritorious as hashed UUIDs or any other sufficiently random token for authentication purposes.

I’d argue that the data isn’t actually related to the site at all, but to Caddy itself. These aren’t web logs - they’re aggregated Caddy response latencies, MITM counts, TLS implementation, etc. across the entire web server. Once they’re aggregated locally, they’re indistinguishable.
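As a rough illustration of why locally aggregated counts cannot be split by site afterwards, consider this Python sketch. The class and metric names are invented, and it assumes (as argued above) that the instance keeps one set of counters for the whole server.

```python
from collections import Counter

class LocalAggregator:
    """Toy per-instance aggregator that keeps server-wide counts only."""

    def __init__(self):
        self.counts = Counter()

    def observe(self, host, metric):
        # The host is deliberately not stored: events from every site
        # served by this instance land in the same counter.
        self.counts[metric] += 1

    def flush(self):
        """Return the aggregate to be sent and reset the local counters."""
        snapshot = dict(self.counts)
        self.counts.clear()
        return snapshot
```

Once `flush` has produced its snapshot, nothing in it says which site a given request or handshake belonged to, so the only meaningful switch at that point is per instance.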

So if someone higher up dictates that these stats can’t be included in the aggregate, I’d say that the entire server should be telemetry-disabled. There’s no real difference between the stats from a safe site and the stats from a sensitive site, so it would be best for it all to go off.

I don’t want to speak for Matt, but when I refer to the server admin, I mean the person ultimately responsible for the server, with the power to make decisions about what software runs on it. In this case, the person who provisions the server and puts the Caddy binary on it is the server admin, and they have the ability to choose whether they want telemetry. If I were just logging on to administer the web server, I’d be the web admin.

In the end, everyone either uses pre-compiled binaries or compiles it themselves.

2 Likes

As for hashing: Yes, maybe it does indeed not help much.

Are you kidding? There is also the option of installing it from the distro’s packages. And that is always better than both options you mention, as they do not provide automatic updates.
So for security reasons (actually I thought that’s a topic for Caddy) you should really support apt/dnf installation (from distros).

And as such, it must still be configurable by the server admin. They can choose the software, yes, but they are likely to deliberately choose the distro’s version for installing. And if they do so, they still want to be able to opt out.
Otherwise that is no real opt-out and the opt-out is worth nothing.