Caddy 0.11 Will Have Telemetry - discuss

Whitestrake · April 6, 2018, 3:59am

Any state is within its power to declare that its laws affect people outside of its sovereignty, but the question that will always be brought up is enforceability. The EU has direct jurisdiction over bodies regulated in the EU, and can issue fines (I believe up to €20 million or 4% of revenue). But unless Light Code Labs open a European branch, or the US government decides to hold US businesses accountable directly for EU law, there’s little difference between that and North Korea declaring that US businesses must pay them fines, as far as I can tell.

I’m not a lawyer though, and I’d probably just do everything in my power not to poke that bear without the advice of one. You bring up a great point about IP addresses as an area of concern. I don’t think the IP addresses, either of the server or its clients, are terribly useful information from a standpoint of technical metrics and I don’t think at present it’s planned to collect any (please do correct me if I’m wrong).

That said, I wonder if the language of Recital 49 could cover the use of IP addresses by the Caddy telemetry server to aggregate and detect patterns of abusive clients, with the intent of ensuring Caddy is resistant to such abuse. The wording is “strictly necessary”, though, so probably not; Cloudflare could probably argue this, but they’re doing far more active network security, while Caddy’s security is all in its implementation.

Is there any merit to having to select between telemetry and non-telemetry, with neither option selected by default? For example, the https://getcaddy.com script aborts if you don’t explicitly specify either “personal” or “commercial” usage for the binary.

Lucas · April 6, 2018, 5:15am

Yes, that is a good point and something I should have also mentioned. Just because a regulation says something doesn’t mean it will be equally enforced everywhere.

Of course none of us here are lawyers, and all we can do is make assumptions based on whatever information we read about it all, but I mostly wanted to bring it up since right now we have no idea how well this whole thing will be enforced until a few companies at least have been made an example of in different situations.

Basically my thinking is that it’s better to avoid any potential problems surrounding the GDPR rather than just taking the risk because it might not be enforced in the US.

Despite my personal preference being that it should be strictly opt-in, this sounds like a fair compromise as far as download options go. It doesn’t eliminate bias in the data set quite as much as opt-out does, but it could result in less bias than opt-in without people feeling like their data is being taken without permission.

Do you think it would be possible to extend that kind of explicit choice to the compilation of Caddy using tags or something like that too? I would be happy with compilation being opt-in or explicit choice, but not with opt-out.

Whitestrake · April 6, 2018, 5:27am

A very fair stance, and the one I’d take, at least until real lawyers are involved and advise. I don’t think it will be difficult to avoid, all things considered.

If Caddy were to avoid defaulting on OR off for telemetry, I imagine both versions would have to be tagged; if only one is, it would be construed as the default, I expect.

rugk · April 8, 2018, 2:37pm

So I am also a privacy advocate, but that does not mean you have to reject any telemetry thing, just because it is called telemetry. I feel that this word alone awakes bad feelings for some…

Also, remember to read the blog post! It really clears things up and explains the whole thing.

So my two cents:

Yes, do opt-out. As you said, everything else is biased and the data may not really be good.
I think this is already the plan, but just to be sure: Of course the opt-out has to be at runtime, i.e. as a server admin I want to opt out in the config file. You can also include an additional opt-out at compile time, as some distros very unlikely will use that. But make sure I can opt out or in, even if I get Caddy from somewhere else. Nothing is worse than having no way to configure such an essential option as a server admin. (And no, not everyone uses your website for downloading, see Linux distros e.g.)
Also consider that maybe I have strict privacy requirements for one domain, but not for the other. So let me opt-out for some domains while allowing telemetry for others.
If you really cannot decide, do it the third way (done by Atom e.g.): Introduce a required configuration option, with no default, so server admins have to decide whether to enable telemetry or not.
AFAIK (IANAL) your telemetry is GDPR-compliant! The GDPR only ever matters if you collect personally identifiable statistics, e.g. IP addresses, user names etc. So basically it does not apply if you have no ID and thus cannot correlate one’s data with a person (even if you have access to IP databases of providers, etc.).
When we take the server admin as a person, we have an ID of course. But yet again it is not the data of the server admin and I also think the collected data is not sensitive.
TL;DR: GDPR allows anonymous collection of data. As long as you do this, this is fine.
Off-topic, but as one said it: Yes the GDPR applies to anyone (also US), as long as you have a market in Europe.
(also GDPR) Only the user agent thing might be kinda problematic as it can be quite unique (so identifying a person is possible). Remember that truly anonymising data is hard.
My suggestion: Anonymize the data on the client (i.e. Caddy here) already. E.g. just let it parse the user agents, so you can send data to the telemetry server about the browser used, the browser version used etc, but not the whole UA string, which could contain anything…
Now the UUID thing: The best would be if you could aggregate the data directly when you receive it at the telemetry server. That means:
- It must not be possible to have a lookup: “Get me all data for UUID 1234…”.
- UUIDs may not be stored on the server. At least not connected with the statistics data. So use them to rate-limit stuff or so, or save the date for the next connection, but not for the general data you submit. As the GDPR says: Aggregated anonymous data can be collected. You “just” need to make sure you cannot use the data and correlate it in some way, so you can find out that user XYZ visited the website ABC, which is running Caddy.
Also, of course, the telemetry server has to have logging (of IP addresses) disabled.
BTW references to Facebook are useless here. They are not at all related. They more or less sold access to the data or did not limit access to personal data. Caddy will do something completely different.
Of course, you have to be transparent and document everything. But that’s also planned as I see.
The downside of the previous point is of course that users will just read “telemetry” and complain, thinking Caddy spies. So if it goes bad, you get bad press coverage. If it goes well, you get reasonable press coverage, where people don’t freak out.
I think the “risk” is worth it though. You just need to explicitly highlight what you do not collect and explain why the data you collect is not sensitive. (Your blog post is already quite good in that sense.)
Again of course: Publish great statistics and make the (aggregated!) data public domain or so (at least a free license) and let everyone download it and use for scientific stuff, whatever…

TL;DR: Do opt-out, if you are sure your implementation cares for privacy. To do this remember the GDPR, aggregate/anonymize data as fast as possible (i.e. on the client, if possible) and make sure the thing you collect is statistical data, not data about persons.

Whitestrake · April 8, 2018, 11:51pm

Awesome post, thanks @rugk. I’d like to explore this section a little:

I think this in particular seems borne out of (perhaps irrational) fear; in particular I note that stripping UUIDs loses the telemetry project a lot of helpful per-instance information. For example, the capability to calculate averages of averages (how many requests does the average instance serve per second?). How do we answer those kinds of questions without being able to keep separate instances of Caddy discrete?
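To make the averages-of-averages point concrete, here is a minimal Go sketch. The `report` type, its fields, and the numbers are invented for illustration and are not Caddy's actual telemetry format. It only shows that a per-instance average needs some stable key per instance, while a pooled figure does not.

```go
package main

import "fmt"

// report is a hypothetical aggregate sent by one Caddy instance.
type report struct {
	instance string  // stable per-instance key (for example a UUID)
	requests float64 // requests handled in the reporting window
	seconds  float64 // length of the reporting window
}

// pooledRate ignores instance identity: total requests over total time.
func pooledRate(rs []report) float64 {
	var req, sec float64
	for _, r := range rs {
		req += r.requests
		sec += r.seconds
	}
	if sec == 0 {
		return 0
	}
	return req / sec
}

// meanInstanceRate groups reports by instance, computes each instance's
// own requests-per-second rate, and then averages those rates.
func meanInstanceRate(rs []report) float64 {
	type totals struct{ req, sec float64 }
	per := make(map[string]*totals)
	for _, r := range rs {
		t := per[r.instance]
		if t == nil {
			t = &totals{}
			per[r.instance] = t
		}
		t.req += r.requests
		t.sec += r.seconds
	}
	var sum float64
	var n int
	for _, t := range per {
		if t.sec > 0 {
			sum += t.req / t.sec
			n++
		}
	}
	if n == 0 {
		return 0
	}
	return sum / float64(n)
}

func main() {
	rs := []report{
		{"instance-a", 6000, 60},
		{"instance-a", 3000, 30},
		{"instance-b", 10, 10},
	}
	fmt.Println("pooled requests/second:", pooledRate(rs))
	fmt.Println("average instance requests/second:", meanInstanceRate(rs))
}
```

With the instance key removed, only the pooled figure can be computed, and one busy instance dominates it.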

Not just helpful to everyone in general who might benefit from public information, but also those of us who are interested in particular in one of the closing paragraphs of the announcement:

Our hope is that you will find telemetry a useful resource. On top of telemetry, we look forward to providing you with premium monitoring/alerting services, advanced reports, and data export directly to third-party services and tools (note that we would sell these tools/services, not the data itself). If enough people participate in telemetry, we may be able to do away with paid licenses, which is even more appealing.

– https://caddyserver.com/blog/caddy-0_11-telemetry#our-vision-of-telemetry

Which would of course not be possible at all if it were impossible to get all data for a given UUID, as you suggest.

I submit that as no individually identifying information is collected, and data is already aggregated by the local Caddy instance, keeping a UUID of that Caddy instance isn’t de-anonymizing. To answer the salient point - that the data can’t be correlated in some way to say that XYZ user visited website example.com served by Caddy UUID 1234 - I quote the article a little more (apologies for such heavy quoting):

Telemetry does NOT collect personal information. No cookies, no session IDs, no way to identify individual clients connecting to your server. Telemetry is concerned with benign, aggregate counts: successful TLS connections, HTTP requests handled, response latency, etc.; technical characteristics: properties of TLS handshakes, software version, User-Agent strings, MITM assessments, etc.; and timestamps; things like that.

– https://caddyserver.com/blog/caddy-0_11-telemetry#what-is-collected

One nice advantage of server-side telemetry is that the data is naturally aggregated—not just by metric name, but also by entire individual clients/users. Unlike most client-side telemetry implementations, our telemetry server does NOT receive any connections from individual end users (browsers) or information from any one end user.

– https://caddyserver.com/blog/caddy-0_11-telemetry#your-controls-and-privacy

I also wouldn’t mind discussing this point further too:

Also consider that maybe I have strict privacy requirements for one domain, but not for the other. So let me opt-out for some domains while allowing telemetry for others.

I have to admit I really don’t see the usefulness of this one; the telemetry is already aimed at being totally blind to the sites, individual clients, or content served. I see it more as metrics of the Caddy instance itself, rather than being partitioned into sets of statistics for each site.

In terms of the end result of what the telemetry server sees, all this achieves is an arbitrary skew of the overall data your Caddy instance will be aggregating. Do you see much value in that? I am curious to know your answer.

I’d love to see the public stuff be public domain. I believe that the devs want to see this kind of information publicly usable - it’s already collected at scale by the likes of Cloudflare, Google et al for sure, just not available to you and me.

This is an interesting point, assuming that User-Agent is considered as personally identifiable as an IP address. Generally not, of course; but any user could set their own User-Agent arbitrarily, which makes it plausible that a European user with a globally unique User-Agent could cause us to violate GDPR, even in aggregate data.

It might be necessary to give Caddy the ability to aggregate counts of certain detectable common client types and throw out the rest.
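A rough Go sketch of that idea, assuming a small hand-picked list of client families; the function names and the substring matching are illustrative only and not how Caddy classifies clients. Anything unrecognised is counted as "other", so an arbitrary or unique User-Agent string never leaves the instance.

```go
package main

import (
	"fmt"
	"strings"
)

// clientFamily maps a raw User-Agent header to a coarse bucket.
// The bucket list is hypothetical. Order matters: Chrome and Edge
// user agents also contain "Safari/", so specific tokens come first.
func clientFamily(ua string) string {
	l := strings.ToLower(ua)
	switch {
	case strings.Contains(l, "curl/"):
		return "curl"
	case strings.Contains(l, "edge/"):
		return "edge"
	case strings.Contains(l, "firefox/"):
		return "firefox"
	case strings.Contains(l, "chrome/"):
		return "chrome"
	case strings.Contains(l, "safari/"):
		return "safari"
	default:
		return "other"
	}
}

// countFamilies aggregates raw User-Agent strings into per-family
// counts; only these counts would be reported, never the raw strings.
func countFamilies(uas []string) map[string]int {
	counts := make(map[string]int)
	for _, ua := range uas {
		counts[clientFamily(ua)]++
	}
	return counts
}

func main() {
	seen := []string{
		"Mozilla/5.0 (X11; Linux x86_64; rv:59.0) Gecko/20100101 Firefox/59.0",
		"curl/7.58.0",
		"my-totally-unique-agent/jane.doe",
	}
	fmt.Println(countFamilies(seen))
}
```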

pepa65 · April 9, 2018, 2:34am

Hi Matt,

A question about this: “the telemetry server has the ability to remotely disable (but NOT enable!) telemetry in Caddy instances at any time”.

Would this be abusable, I mean how difficult would it be for a third party to remotely disable the telemetry in a Caddy server?

Blessings,
Peter

Whitestrake · April 9, 2018, 3:02am

From a technical standpoint, I expect this will be done by the second party (telemetry server) when the first party (local Caddy) connects to check in. The local Caddy won’t have any API or equivalent to send a command to in order to disable its telemetry, hence a third party could never send such a command. This would also be why the telemetry server can never re-enable telemetry remotely - once the local Caddy is instructed not to check in again, it can’t be told to start once more.
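As a sketch of that expectation (an assumption about the design, not Caddy's actual code), the local side could look roughly like this in Go. The `reporter` and `checkInResponse` types are hypothetical. The instance only makes outbound calls, and once told to stop it never calls again, so no later reply can turn it back on.

```go
package main

import "fmt"

// checkInResponse is a hypothetical reply from the telemetry server.
type checkInResponse struct {
	Stop bool // the server asks this instance to stop reporting
}

// reporter models the local Caddy side. It exposes no listener or API;
// it only calls out to the telemetry server on its own schedule.
type reporter struct {
	stopped bool
	calls   int
	checkIn func() checkInResponse // outbound call to the server
}

// tick performs one scheduled check-in unless telemetry was stopped.
// It reports whether a check-in was actually made.
func (r *reporter) tick() bool {
	if r.stopped {
		return false // never contacts the server again
	}
	r.calls++
	if r.checkIn().Stop {
		r.stopped = true
	}
	return true
}

func main() {
	// Simulated server replies: continue, then stop, then "continue"
	// again; the last one is never seen by the instance.
	replies := []checkInResponse{{Stop: false}, {Stop: true}, {Stop: false}}
	i := 0
	r := &reporter{checkIn: func() checkInResponse {
		resp := replies[i]
		i++
		return resp
	}}
	for n := 0; n < 5; n++ {
		fmt.Println("tick", n, "checked in:", r.tick())
	}
}
```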

matt · April 9, 2018, 4:08am

Hi Peter – an attacker would have to gain privileged access to the right machine and make specific modifications in order to configure clients to shut off telemetry.

@rugk - Thank you for taking the time to elaborate your thoughts with reasoning! I will consider everything you said, but let me respond to a couple things here:

The problem with run-time options is that we lose information about how reliable and representative the metrics are. For some research questions, that jeopardizes the usefulness of the data set, which may defeat the purpose of collecting telemetry in the first place. We will look into this more as time goes on, if we feel we have enough statistical information about our sample.

I resent that Europe thinks it can make laws and project them onto non-Europeans in other sovereign territories. I would be interested to know how much money it would cost them to enforce their laws in the US, especially considering such a small project as Caddy. I guess if they go to all that trouble, it’s their taxpayers’ money, not their own. But as you said, I do not think GDPR applies here, so the point is probably moot.

Like Matthew said, I also do not see the technical justification for this. We would lose essential grouping information. I’ll defer to his post which is much more informative about this!

Other than those main points, I think we’re on the same page. And I agree with you on many of your points which I didn’t highlight here! Anyway, it’s why we’re easing into this rather than unleashing a huge feature set all at once, we’re being conservative.

lbguilherme · April 9, 2018, 6:58pm

I just updated Visual Studio Code and on the first time I opened it, this was shown:

[Image: Screenshot from 2018-04-09 15-53-12]

Perhaps if telemetry is enabled, show a console message every time Caddy is started stating that it is collecting usage data, with a link to the documentation of how it is used and a link to documentation explaining how to turn it off. (And maybe also add a flag to hide this warning.) Something like:

caddy                    # Starts Caddy with telemetry and the console notice.
caddy -telemetry         # Starts Caddy with telemetry. No notice.
caddy -disable-telemetry # Starts Caddy without telemetry and no notice.

What do you think?

matt · April 9, 2018, 7:07pm

The console message: Sure, we can consider that.

But again, telemetry will probably be togglable at compile time, so we can measure how representative it is.

tobya · April 9, 2018, 8:06pm

I give a big thumbs up to a nice little notice on startup

pwhodges · April 9, 2018, 8:09pm

Sure; and I resent the cases in which the US does the same in reverse (most specifically in claiming legal right of access to data on computers elsewhere in the world).

In any case, my concern would be whether I as a user in the EU (for now!) could possibly be affected.

As for whether the data collected has any privacy concerns at all, I would say that in the end that’s not the issue; if people can see something they can misinterpret, they will do so. Caddy may never draw such people’s attention, but might you not want to have a defence against unjustified complaints in place - in the form of requiring an opt-in?

Oh, and a notice on start-up in practice does nothing useful in the case of systems running as a service…

Paul

Whitestrake · April 10, 2018, 12:46am

It seems like you’re arguing for the Caddy telemetry project to debase itself - that is, to sabotage the integrity of any data produced and thus the main value of the project in the first place - not for the sake of a credible argument supporting that course of action, but out of fear of what misinformed complainants might do if that course of action is not taken.

I believe that would be intellectually dishonest, and I think that any argument based on this line of thought should have no influence whatsoever on the direction the devs choose to take Caddy… on this or any other topic.

Prudence is one thing, and it takes the form of an honest, thoughtful, and comprehensive blog post outlining exactly to what extent you as a user anywhere could possibly be affected.

(The GDPR has been a major part of this thread, yes, but having privacy concerns at the forefront of the telemetry project isn’t just driven by fear of legal repercussions in the EU. It has to be based on an honest belief in privacy for privacy’s sake, for all users, and not just privacy theatre… or it’s just for show, really.)

omz13 · April 10, 2018, 8:03am

It’s hard to know whether you are being serious or making a joke.

Are you saying that if the activities of Light Code Labs LLC were in scope of GDPR it would simply ignore it due to the size of the project (“considering such a small project as Caddy”)?

Whitestrake · April 10, 2018, 8:31am

That seems like an uncharitable interpretation of Matt’s comment.

It might be best to refrain from speculating on what the EU will or will not do to enforce the GDPR outside of its own sovereignty. The politics aren’t all that relevant to Caddy from a technical standpoint, and neither is anyone’s personal opinion on that topic, so let’s keep the discussion practical.

pwhodges · April 10, 2018, 9:29am

Not at all; I think you are deluding yourselves if you feel that opt-in will unbalance the data but opt-out will not. In both cases the dataset is distorted by people’s choices, and sadly this is inevitable. But, if you wish, an opt-in could be advertised as prominently as has been suggested for an opt-out. Or even don’t have a default - make it so that a command line or Caddyfile parameter saying “accept” or “decline” is required before Caddy will run.
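A minimal Go sketch of the "no default" variant, using the standard `flag` package. The `-telemetry` flag name and its `accept`/`decline` values are hypothetical and only illustrate refusing to start until an explicit choice is made.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"io"
)

// errNoChoice is returned when the admin has not made an explicit choice.
var errNoChoice = errors.New("telemetry choice required: pass -telemetry=accept or -telemetry=decline")

// parseTelemetryChoice reads a hypothetical -telemetry flag that has no
// usable default. It returns true for "accept", false for "decline",
// and an error for anything else, including a missing flag.
func parseTelemetryChoice(args []string) (bool, error) {
	fs := flag.NewFlagSet("caddy-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet on parse errors
	choice := fs.String("telemetry", "", "accept or decline (no default)")
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	switch *choice {
	case "accept":
		return true, nil
	case "decline":
		return false, nil
	default:
		return false, errNoChoice
	}
}

func main() {
	for _, args := range [][]string{{}, {"-telemetry=accept"}, {"-telemetry=decline"}} {
		enabled, err := parseTelemetryChoice(args)
		if err != nil {
			fmt.Println(args, "-> refusing to start:", err)
			continue
		}
		fmt.Println(args, "-> telemetry enabled:", enabled)
	}
}
```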

As for the rest, I guess we’ll just have to disagree. The likelihood that a project such as Caddy would get into the kind of trouble that Facebook has over sharing data whose owners had allowed it to be public is minuscule. But having worked for the past fifteen years in a field where I had to show powerful regulators how I was preparing for such small risks has made me rather sensitive in this matter.

Whitestrake · April 10, 2018, 11:43am

I’m happy to debate exactly what the value of opt-in data vs opt-out data is. But that was not the important part of my reply to you. This next bit was:

And the reason was this comment:

So to reiterate: are you going to make an argument FOR opt-out, or will you stick with your argument AGAINST opt-in that seems purely based on fear of angry, misinformed people? I believe that if the event comes that Caddy must defend its decision against powerful regulators, and they’re angry and misinformed, there will be an opportunity to make them informed.

If you’re referring to Cambridge Analytica, it’s well understood that only a very small fraction of the data set given to CA had actually used the app in question and agreed to share that data with the app’s developer. The rest were merely friends with the people who agreed.

pwhodges · April 10, 2018, 12:08pm

Um, swap opt-out and opt-in there. I’ve said what I think, and don’t feel the need to try any harder to influence your choice.

matt · April 10, 2018, 1:25pm

@omz13 I was expressing a political opinion – nothing more.

@pwhodges Thanks for your comments, I think they’re reasonable – and I appreciate you taking the time to write them!

I will take these into consideration as we wrap up the initial release here soon.

rugk · April 11, 2018, 3:11pm

Took some time for my reply, sorry.

Okay, good point. But I have an easy solution:
Just hash the UUID on the telemetry server (with a strong password hash function)! Then you can still correlate the data, but unless someone submits their UUID to you, you cannot correlate the data to that single server.

That’s your view, but not the one of the server admins. Say they have banking-secure.example and banking-website.example. They may enable telemetry for the second, but possibly they are disallowed by law to enable stat submissions for the first. Maybe they e.g. are not even allowed to keep logs for the first, etc. (This is just an example.)

It’s still opt-out, so no skew. Only few users may . This is a slight change, but if you would want 100% unbiased data, you would have to not include the possibility to opt-out. And you know, this is not a good thing. So keep the opt-out! And make it possible to disable for the server admin:

Depends:

Must-have: Opt-out at server admin level. (whoever compiles the software – distro owner or so – should not have the control over whether I – the server admin – want to submit telemetry or not. At least I should be able to override their decision, at least in the negative way to always be able to disable it.)
Nice-to-have: Opt-out per domain.

I just think these two can quite easily be combined, by adding a config option.
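A small Go sketch of what such a combined option might look like. The `telemetryConfig` type and its field names are made up for illustration and do not correspond to any existing Caddyfile directive.

```go
package main

import "fmt"

// telemetryConfig is a hypothetical runtime configuration controlled
// by the server admin, regardless of how the binary was built.
type telemetryConfig struct {
	Disabled   bool            // instance-wide opt-out (the must-have)
	SiteOptOut map[string]bool // per-domain opt-out (the nice-to-have)
}

// countsFor reports whether requests for host may contribute to the
// aggregated metrics. The instance-wide switch always wins.
func (c telemetryConfig) countsFor(host string) bool {
	if c.Disabled {
		return false
	}
	return !c.SiteOptOut[host]
}

func main() {
	cfg := telemetryConfig{
		SiteOptOut: map[string]bool{"banking-secure.example": true},
	}
	for _, host := range []string{"banking-secure.example", "banking-website.example"} {
		fmt.Println(host, "counted:", cfg.countsFor(host))
	}
}
```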

So basically you want to force users to have telemetry enabled? That’s stupid and will likely result in a fork or so. You don’t want that…
There is no alternative to runtime opt-out. I know opt-in skews the statistics (and that’s the only thing you write in your blog), but opt-out not so much as it seems. Remember that even Firefox and stuff provide a runtime opt-out! It would be silly for Mozilla to say: Hey, you got Firefox from your distro? Then telemetry is always disabled. You got it from us? Then telemetry is always enabled.

The emoji does not make that polemic and incorrect statement any better. Again to explain: The GDPR applies to European citizens and their data. So when you process/offer a product for Europeans, that applies. Of course, it does not apply to US customers or so.

They would get the money back from you, that’s not hard. The only thing would be whether they’d care to really enforce that law for such a small project as Caddy.
But even without any legal requirement… let’s ignore the GDPR (and any potential legal threat) for a moment: It does not make a difference. You should still do it exactly the same, anonymize data, aggregate data.

Actually, when you hash the ID as I just came up with, you could answer/implement such requests. Users can enter their own UUID, the server hashes it and gets all the data.
That does not even need authentication as the UUID (if properly used with rate-limiting/password hashing) acts as the authentication in this case.
So I think that may be a good idea. Especially as, GDPR again, it also includes the right for users to look up their own data. So if this is done, users could do that.
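The hashed-UUID idea can be sketched in Go as follows. Note the swap: to stay within the standard library this uses HMAC-SHA256 keyed with a server-side secret, which is not a password hash; a slow function such as scrypt or Argon2 would be closer to what is suggested above. The `store` type and its methods are hypothetical.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// pseudonym derives a stable storage key from an instance UUID.
// Assumption: HMAC-SHA256 stands in for a strong password hash here.
func pseudonym(serverSecret []byte, uuid string) string {
	mac := hmac.New(sha256.New, serverSecret)
	mac.Write([]byte(uuid))
	return hex.EncodeToString(mac.Sum(nil))
}

// store keeps metrics under the derived key only; the raw UUID is
// never written to it.
type store struct {
	secret []byte
	byKey  map[string][]string
}

func newStore(secret []byte) *store {
	return &store{secret: secret, byKey: make(map[string][]string)}
}

// record files a metric under the pseudonym of the submitting instance.
func (s *store) record(uuid, metric string) {
	k := pseudonym(s.secret, uuid)
	s.byKey[k] = append(s.byKey[k], metric)
}

// lookup lets an admin who presents their own UUID retrieve their data.
func (s *store) lookup(uuid string) []string {
	return s.byKey[pseudonym(s.secret, uuid)]
}

func main() {
	s := newStore([]byte("example-secret-not-for-production"))
	s.record("1234-abcd", "http_requests=42")
	s.record("1234-abcd", "tls_handshakes=7")
	fmt.Println(s.lookup("1234-abcd"))
	fmt.Println(len(s.lookup("some-other-uuid")))
}
```

Because the derivation is deterministic, per-instance grouping still works, while a stored key alone does not reveal which UUID produced it. The lookup would still need rate-limiting, as noted above.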

As explained that is not the case. Of course also a compile-time option skews the data, so I’d say this is the “skewness” scale:

no option to opt-out: no data skewing, but as users are lost due to fork actually also a kind of wrong data
opt-out at compile time: say 10%-15% data skewing (percent does not mean anything, just for the relations) – many distros will likely disable that by default
opt-out at compile time & runtime: 15%-20% data skewing
opt-out at runtime: 5%-10% data skewing
opt-in: likely >60% data skewing

That’s how I would estimate it. Based on how much users/maintainers/installations would likely disable that feature.
Thing is we have no facts here and never can, because to find out how many installations have that feature disabled, we’d need tracking again. So that does not work. One can only guess.

I mean if you have 3 million users, where 5000 do not send telemetry it also does not matter. Statistics always have some discrepancy, but you can neglect it when you do opt-out IMHO. Few users will adjust such a setting (especially if they feel that you care about privacy and that telemetry is actually useful).

Also you seem to ask about opt-in vs opt-out at compile time. I say opt-out at runtime. Is not that a reasonable compromise?
You see many users would argue for opt-in. I see your point that this skews data, but having an opt-out at compile time is not even worth the word “opt-out” (because it’s the wrong people who opt out. I am the server admin, this is my data. So I need an opt-out, not someone who wants to decide that for me.)