Caddy 0.11 Will Have Telemetry - discuss

matt · March 31, 2018, 3:59am

Welcome! This thread is the final, pre-release discussion of Caddy Telemetry, which will be shipping with version 0.11 sometime in April.

First, please read the full announcement blog post about telemetry if you haven’t already. It’s really important.

We realize that some of you may have strong feelings about Caddy emitting telemetry. Our goal for this discussion is to raise awareness of this change before the release so as to avoid misunderstandings and the spread of misinformation, as happened with the announcement of commercial licenses last year.

We want your feedback! Some ideas:

What charts/plots/numbers/tables should be shown on the page where you look up your instance?
- For example: requests/sec, top user agents, number of goroutines, maybe whether your dev instance is publicly accessible (by accident?), etc.
Should telemetry be opt-in/out? Why? Discuss the tradeoffs.
- I’ll be blunt: we want telemetry to be on by default. We feel very strongly that this will produce a more useful, less biased data set, and at no cost to you. Those who feel strongly that it should be disabled may of course turn it off quite easily, but we instead encourage you to participate with the rest of the community. If you insist on it being off by default: why? Or if you think it’s fine to be on by default, why?
What would you do with access to the telemetry data? What kind of research makes you excited/intrigued?
- Let’s hear some questions you want to see answered!
What questions do you have about telemetry? We’ll answer them!

We hope that people on both sides of an issue will participate here, rather than only those who oppose the idea of telemetry. If you are indifferent about telemetry, or even in support of it, please join the discussion anyway - we need to hear from everyone!

Also, we expect to read some actual technical arguments backed by reason – not just something like “this will be bad for your project” or “this will be useful to me” – explain why. The point here is to bring the community together on this as much as possible, even though there will not be 100% consensus.

Telemetry will be going out on Caddy 0.11, and now is your last chance to help shape it before the release.

We’re really excited, and hope you are too. Thanks for participating!

omz13 · March 31, 2018, 7:03am

Matt,

#2. Should telemetry be opt-in/out? Why? Discuss the tradeoffs

I am a bit disappointed that you even have to ask this. On the one hand it’s probably a cultural thing (the American default being opt-out, the European default being opt-in), but on the other hand, after the recent Facebook fiasco I would have thought there would be more “sensitivity” to thinking about privacy before jumping in with a de facto opt-in (and the potential that that choice brings).

Clearly, I am going to say that it should be opt-in, and here are a few reasons why (with the last one being the most important):

On a practical level, if there are millions of Caddy instances in the wild, as they upgrade you are going to get hammered with a lot of data. Are you sure your reporting infrastructure can cope? Then again, that’s your problem and not mine.
You need to be very explicit and detail exactly what data you are collecting, what you are going to do with it, how long you are going to keep it, and how people can delete it (you cover the first one, though not in any detail, and the second, but not the last two). I want to see a sample of the telemetry data that you send (digging through the pull request for the Caddy source and looking at 4000+ lines to work out what’s in your JSON payload is not my idea of fun… and pity those who can’t read Go and will have no idea of what they’re looking at).
The law of unintended consequences. With the telemetry data that you are collecting, could it be used for nefarious purposes, now or in the future? You state that “Telemetry does NOT collect personal information”, which may be true, but what it will collect, perhaps directly or indirectly, is the fingerprint of a server running Caddy, and even phoning home to drop off the telemetry data could itself have consequences (unintended or otherwise).

I would therefore strongly urge you to have a rethink… make it opt-in… and for each subset of metrics you are collecting, make them explicitly opt-in (i.e. not only do you have to opt in at the top level, you then have to explicitly opt in for each subset of telemetry data to be sent).

Where is the source code for the telemetrics server? Is this closed or open? If you’re getting all this data wouldn’t it be nice to see how you’re crunching it?

What charts/plots/numbers/tables should be shown on the page where you look up your instance?

I’d be curious to see the MITM metrics. Just how prevalent is this?

What would you do with access to the telemetry data? What kind of research makes you excited/intrigued?

I’d see some interesting trend analysis possibilities.

gmacon · March 31, 2018, 3:32pm

I’m concerned about your use of UUIDs as random identifiers. There have been cases in the past where UUIDs weren’t unguessable, and that could cause problems here, too. I haven’t looked at the underlying implementation, so I don’t know that your UUIDs are guessable, but it would make me feel better if the document had said something along the lines of “We use 128 bits of output from a cryptographically-secure pseudorandom number generator as the unique id.”

I’d also be interested to look at a draft of the new Telemetry documentation page; @omz13 makes a good point about the relative ease of reading prose and code.

Finally, a question: Have you had anyone evaluate some telemetry from the current proposal to see if they can deanonymize it?

matt · March 31, 2018, 6:38pm

Yeah, so am I, but I know that if we don’t have a discussion, the community has more potential to spread misinformation and FUD, rather than actually coming together to solve a great problem.

You are right though – I bet cultural differences play a big role in the perception of this issue.

Did you mean “opt-out”? Either way, I think it’s unfair to claim that we’re not being sensitive to thinking about privacy, given all the effort we’re making to announce this before the release so that nothing comes as an unpleasant surprise to the community. Make sure to give credit where credit is due.

That’s my primary concern, to be honest. The telemetry “protocol” is designed in such a way that if the telemetry server is getting slammed with too much data, it can disable certain metrics in clients or space out updates more so that it receives less data, to help manage the load. I’m mostly concerned about storage costs, which we’ll be watching closely.

We will have very clear, concise documentation about this. The page is already drafted, and it lists every single metric that is collected.

We do cover these in the blog post – it’s just we aren’t sure of the details yet. It depends on realized costs, research needs, etc. That’s why we’re soliciting feedback: how far back is useful/necessary to keep? Obviously we’re incentivized to keep less data because it is cheaper that way. And as the blog post stated, we are investigating ways to permit authorized deletion of data emitted by an instance.

As I said, all the data telemetry sends will be documented on the website and it will be made available for viewing/download.

This is true of everything, including running a web server at all. Remember, this is a web server, not a web client representing a specific user. The dynamics are a little different, even though of course there may be people who use the data in unexpected ways. Rather than dwell on the fear of what could happen and thus paralyze the project and community, I’d much rather have experts constructively review the telemetry system and find concrete attacks against it so we can prevent them. Let’s keep this discussion evidence-based, not FUD-based.

I don’t think this is practical from a usability standpoint. The amount of data being sent is already orders of magnitude less than what Firefox–a personal, client-side application, I should emphasize–collects, for example, and AFAIK Firefox doesn’t have options to toggle individual metrics.

Great! How do you want it broken down? By User-Agent? Over time? A table or a chart?

Sounds good, thanks for letting us know!

UUIDs are not meant to be secret. We do not use them as passwords or keys. Can you identify a specific attack caused by guessing a UUID? If so, we will redesign our use of them. But right now we use them much as Firefox does for each individual end user/client, except that ours is only a less specific, server-side UUID.

Sure, I’ll try to get it up a few days before the release. IMO, it’s much easier to follow than other telemetry projects’ documentation. True to Caddy form, it will be simple to understand.

We have been extending invitations to experts in this field to give it a review, but it’s tricky since we can’t afford an audit; they will have to volunteer their time.

gmacon · March 31, 2018, 7:04pm

I think these UUIDs do need to remain secret. With my UUID, you can get the telemetry from my server from your proposed web interface. Poking around the Firefox telemetry page, I didn’t see an obvious way for a random person knowing my UUID to get the telemetry from my browser. (It looks like a rogue Mozilla employee could do that via one of the ad-hoc query interfaces, but a rogue employee could probably do much worse things anyway.)

matt · March 31, 2018, 8:35pm

Given that the data will eventually be made available in some form anyway, I don’t see why this is a valid attack. The UUID protects nothing that is secret. I’d be more interested if you can guess a specific instance’s UUID, but even then, no personal information is exposed.

Varbin · April 4, 2018, 4:29pm

I would guess UUID4s are meant here - so they contain 122-bit random data.
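
For reference, a version 4 UUID with 122 random bits can be produced from a cryptographically secure generator as in the sketch below. It uses only Go's standard library; `newUUIDv4` is a hypothetical helper written for illustration, and nothing here claims to be how Caddy actually generates its instance IDs.

```go
package main

import (
	"crypto/rand"
	"fmt"
)

// newUUIDv4 returns a random (version 4) UUID string. All 16 bytes come
// from the operating system's CSPRNG; 6 bits are then overwritten with the
// fixed version and variant markers, leaving 122 random bits.
// Hypothetical helper for illustration, not Caddy's implementation.
func newUUIDv4() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // high nibble 0100: version 4
	b[8] = (b[8] & 0x3f) | 0x80 // top two bits 10: RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

func main() {
	id, err := newUUIDv4()
	if err != nil {
		panic(err)
	}
	fmt.Println(id)
}
```

Whether an ID built this way is unguessable depends entirely on the random source, which is why the sketch reads from `crypto/rand` rather than `math/rand`.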

I would guess that the web interface will only show aggregated data (that is just a guess). I would also guess that the UUID will then only be used for what I would call grouping, e.g. 90% of the MITMed connections came from servers in some particular place.

lemmi · April 4, 2018, 7:47pm

I count myself in the privacy camp, so here are some suggestions to further disarm concerns. I fully understand the reason for the opt-out default. I think it is a reasonable choice. But what I would like to see, as a package maintainer, is a compile-time option, an environment flag and/or a command-line option for the default. People choose certain distributions for a reason, so maybe enable the distributions to also make that particular choice for them.

matt · April 4, 2018, 9:43pm

Correct.

Also correct. The stats page does not reveal any UUIDs, and UUIDs would only be used behind-the-scenes for grouping, to be able to compute averages, etc, or for research. The only use of UUID right now is so you can look up your instance data, but we decided that in the long term, having a unique ID is critical in order to obtain useful insights.

Thanks for the feedback! Since telemetry will be a compile-time decision, we are looking into a build tag to enable or disable it, although you can already just change a variable in the source code.

omz13 · April 5, 2018, 6:59am

So now you’re saying telemetries will be compile time and not run time?

Whitestrake · April 5, 2018, 7:02am

Can’t it be both? Include or exclude at compile-time, then opt-in or opt-out at run-time if it was included?

omz13 · April 5, 2018, 7:06am

I am still looking for additional clarity: if it is included at compile time, will there be a runtime option, and will the default be opt-in or opt-out?

Whitestrake · April 5, 2018, 7:19am

I don’t know if that’s been settled yet. I think I’ve seen @matt state elsewhere that he’d prefer it to be a purely compile-time decision from a reliability standpoint, but if you opted to have the telemetry included at compile, I imagine the logical answer for run-time would be an opt-out flag.
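
The "both" arrangement being discussed (include or exclude at compile time, then allow an opt-out at run time) can be sketched as follows. Everything named here is an assumption for illustration: the `telemetryCompiled` variable, the `EXAMPLE_TELEMETRY_OFF` environment variable, and the use of a link-time string override in place of the build tag mentioned earlier in the thread (a build tag would need more than one file). None of it is Caddy's actual mechanism.

```go
package main

import (
	"fmt"
	"os"
)

// telemetryCompiled is the build-time decision. It is a string so that a
// packager could override it at link time, for example:
//   go build -ldflags "-X main.telemetryCompiled=false"
// Hypothetical variable; Caddy's real switch may look different.
var telemetryCompiled = "true"

// telemetryEnabled reports whether telemetry should run: it must have been
// compiled in, and the operator must not have opted out at run time.
func telemetryEnabled(compiled string, optOut bool) bool {
	return compiled == "true" && !optOut
}

func main() {
	// EXAMPLE_TELEMETRY_OFF is a made-up environment variable standing in
	// for whatever run-time opt-out a real build might offer.
	optOut := os.Getenv("EXAMPLE_TELEMETRY_OFF") != ""
	fmt.Println("telemetry enabled:", telemetryEnabled(telemetryCompiled, optOut))
}
```

With this split, a distribution can ship a binary with telemetry compiled out entirely, while users of the default build can still turn it off without recompiling.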

omz13 · April 5, 2018, 1:11pm

Yes, I did. Sorry @matt, fat fingers on my side. Notwithstanding, I think you got what I meant to say… and that we have fundamentally opposite positions regarding opt-out and opt-in.

And I think that by being opt-out you are being insensitive to thinking about privacy because this is a clear case of not implementing privacy by design. Given that Caddy boasts about “security defaults”, this seems to be an ironic position to take.

And it’s hard to really have an open debate because it’s in a draft and not published contemporaneously with your blog post/notice. So it’s been a case of having to trawl through the code and see what you are doing… and becoming rather concerned about the amount (scope) of data that you are gathering, and where it seems to be going.

And the devil is in the details.

Reliability is not really the issue at hand…

According to yours (and @matt’s) logic… but not to me and others.

By implementing telemetrics as an opt-out solution Caddy will:

Be breaking the concept of privacy by design.
Be introducing unexpected behaviour into a server component. Yes, a client browser, like Firefox or whatever, pushes metrics and it’s all very opt-out… but it’s a browser, and it’s an accepted (if perhaps not acceptable) paradigm for it to behave this way. A server component, however, is different. At the moment, if I want to obtain metrics using something like Prometheus it takes several steps: 1) deliberate inclusion of a plugin; 2) deliberate inclusion of a specific directive within each server block that I want to monitor; 3) deliberate exposure of a collections endpoint; 4) deliberate configuration of a server to collect the data. That’s a lot of deliberate steps that need to be undertaken to get at the data. What you are proposing is that by default the server will collect and send data automatically to you (please define where!). Do you not grok how that is such a different and unexpected mechanism from the status quo? This raises a large flag vis-à-vis informed consent.
The metrics you are collecting are very expansive, and you need to come up with a justification for why you are collecting what you do. For example, you gather GOOS and GOARCH data, but then include some very detailed CPU data too. Why do you need to know this? What purpose does it serve, if any? I’d like to see a solid link between the data you are collecting and any “insights” that you think you will be able to obtain from them (to justify the collection)… at the moment it very much looks like you are collecting a lot of data for no very specific purpose because the motives on the announcement post are a bit too general.

matt · April 5, 2018, 1:27pm

So now you’re saying telemetries will be compile time and not run time?

No. Whether telemetry is enabled is compile-time, but actual telemetry is of course run-time.

Lucas · April 5, 2018, 4:01pm

While I don’t have well thought out arguments to post here, I will say that I only just noticed this whole telemetry thing, and the idea of it being opt-out feels a little strange to me, coming from the server that boasts secure defaults etc.

While this feels more like a privacy issue than a security one, the whole opt-out by default thing just leaves me with a taste of “Caddy is the HTTP/2 web server with automatic HTTPS. Now with added spying!”. (I know you’re not spying by the way, but that’s what people will jump to no matter what you do).

I understand that you don’t want to introduce bias in the collected data by default, but really, no one else is going to care about that; only you care about that. All other people are going to care about is the fact that it feels like they’re being spied on in some way, no matter how well you explain yourself.

Maybe that’s a bit too hyperbolic (or maybe not… this is the internet after all), but I hope you get what I’m saying.

Personally, I’ll just rip the whole thing out and compile it all myself anyway, so in the end it doesn’t affect me, but honestly, if you really value people’s privacy and you want them to see that, then it should just be opt-in by default. I would guess that most people would disagree with you, no matter what your arguments for opt-out are, and would want it to be opt-in by default instead.

I do have a question for you, despite the fact I’ll be ripping the whole thing out.

Have you thought about the GDPR and how it might affect people running Caddy?
Have you thought about any privacy laws in all of the countries around the world?
How does opt-out by default affect users of Caddy in relation to existing privacy laws and GDPR?

I haven’t read the links you posted in the original post so I don’t know if that’s answered in there, but you did ask for questions.

anon12668908 · April 5, 2018, 5:33pm

I would like to remark that, beyond any technical arguments, @Lucas is spot on noting that opt-out telemetry simply leaves an unpleasant taste.

Unfortunately, I don’t think that this can be countered by any technical arguments; not with the current privacy zeitgeist.

The comparisons to Firefox are somewhat bizarre, given how much backlash they received for their opt-out SHIELD studies.

matt · April 5, 2018, 8:47pm

Except that I have not yet heard any convincing arguments from you or anyone that telemetry is a privacy violation. Caddy’s goal is to make the Web better: more secure, more free+open, and more private. With telemetry, we get anonymous technical metrics that can be used to make the Web more secure, more free+open, and more private. It’s a win-win, no irony about it.

We’re not publishing the docs yet because it’s not finished and it’s in motion and changing, plain and simple. We don’t want to confuse people more than they already are.

Ooo.

I disagree, on the grounds that no evidence has been presented which shows that telemetry has not been designed for privacy.

Hmmm, sounds arbitrary, wishy-washy at best.

Yes! Caddy is different! That is the point. We’re making the Web better than it is now by taking steps no other Web servers will or can do.

You must absolutely hate projects like CockroachDB then. (Take a look at their docs. Did you know they emit telemetry by default and don’t make the data available?)

We collect less than 1/10 of what client-side programs like Firefox collect.

Would have been useful to know how many Caddy instances were affected by Spectre/Meltdown, for example.

Okay, okay, you caught me: I’m secretly building up an empire that aggregates server metrics for the purposes of global Internet domination, and even though I don’t want anyone to find out, the best explanations I could come up with for putting all this work and effort into it, under the consultation of multiple research and academic institutions, is that “it’s for the good of the Internet” and “we want you to have better insights to your web servers’ activity and experiences.” It’s a lame excuse, but it won’t matter after I reveal all the UUIDs at the end of my sinister plot and suddenly everyone finds out that instance be6f1d2c-386b-4c6d-a4f8-fdcef9f31271 served 61,322 TLS handshakes on July 17, 2018. Then all the power will be mine and no one will stand in my way!

… all facetiousness aside, it’s not like I could have any other motivation. You can question my motives all day long but if you’re that skeptical of me (and the community behind this), why engage in conversation at all? Why use Caddy at all?

Well, there might be one thing we could do. CockroachDB–an insanely popular database startup–emits telemetry by default and they didn’t make a big fuss of it and nobody seems to mind; they have instructions on how to disable it but I doubt it would have this kind of response if they did open it up for community feedback. So I hope, given the alternatives, that people appreciate this instead of turning into skeptical armchair experts.

What do you mean by this? (And why do whatever that is instead of simply disabling it?)

Yes, but I am not a lawyer and I cannot afford one. GDPR is EU law, and I have no presence in the EU. Additionally, GDPR applies specifically to personal information, which telemetry does not collect. Telemetry consists only of technical metrics. So I would say that GDPR does not apply.

I expect people to follow the laws of the land in which they live.

You should ask a lawyer that if you’re that concerned.

Thanks for your thoughtful response, but please go back and read the posts, okay?

Only when you know about it (cf. CockroachDB, already mentioned, which so many people love).

I am quickly learning that people tend to act more on emotion than on logic, for better or for worse. (I have no opinion on the SHIELD studies, and can’t speak for Mozilla.)

Lucas · April 6, 2018, 2:23am

I’ve always avoided projects like CockroachDB specifically for the reason that it emits telemetry by default.

Whenever I can I only use software that doesn’t track anything I’m doing, whether the data is anonymised or not. Being a developer this isn’t always possible since I have to use certain software throughout the day, but I do what I can.

Caddy is a little different in that it hasn’t emitted telemetry up to this point, and it’s the only server with sane default settings, so I don’t mind going through the hassle of disabling it entirely.

Sorry, I was still being a bit too hyperbolic. I mean I’ll just compile it without telemetry (assuming that’s possible) and throw any flags onto the command line that would entirely disable it if required. It’s a real hassle, but Caddy has been so good in every other respect that I don’t mind having to do it, even if I think it should really be opt-in.

Presence in the EU doesn’t matter; it will apply to anyone in any part of the world dealing with any data from anyone in the EU. So even if you’re based in the U.S., if you deal with any data from someone in the EU, then the GDPR applies to you. Like you say though, if you aren’t collecting personal data it probably doesn’t apply to you in this case.

I did have a quick look at the posts you linked to in the original post, and I tried to look for what kind of data you will be collecting. The post just says “aggregate counts”, which isn’t really useful at all, so I had a look at the diff on GitHub, but there’s no way I’m going to look through the whole thing right now so I’ll just ask.

Do you collect IP addresses in any way, for any reason?

The reason I ask is because in some cases GDPR does consider an IP address to be personal information, so if you collect them in any way then you might want to look into it a bit more.

To be honest, even if you don’t think you’re collecting any personal data, and even if you don’t think it applies to you, you should look into the GDPR anyway, since it will apply to you if you end up collecting something that seems benign, but turns out to be considered “personal” by the regulation.

The reason I asked if you thought about how it might affect users of Caddy is because opt-out by default could make following the law in certain countries more of a hassle.

There will of course also be people that don’t realise that Caddy is emitting telemetry and may be violating laws in their own country without realising.

You can always say that the onus is on the user to check their software before they use it, and I would agree, but I thought Caddy was trying to make things easy, right?

It could be that Caddy emitting telemetry wouldn’t put anyone in violation of any laws around the world, but since I don’t know the laws of every country in the world I would personally err on the side of caution and say that opt-in is just the better choice when thinking about the best interests of the user.

I’m not concerned for myself. I believe opt-in is the right choice, but like I said I’ll be disabling the whole thing anyway since I know what I’m doing.

Either way, from your responses so far it seems like you have your heart set on making it opt-out by default, so I doubt you’ll change your mind; this is your project after all.

Having said that, I’m sure that most people would appreciate it being opt-in by default, even just for their own peace of mind, which is a perfectly valid reason that someone might want it to be that way, and should be considered as a serious argument against opt-out by default.

The reason I say this is because really, your only real argument for opt-out by default is to not introduce bias in the data that’s been gathered, which is just as arbitrary and based on personal motivation and emotion as any argument out there for opt-in since there isn’t any technical reason for it.

I wish I could give you some concrete technical arguments for opt-in, but it seems like when it comes to this issue it’s almost entirely based on motivation and emotion, so technical arguments won’t do for either side.