Fault tolerance, docker, clustering

mxrlkn · August 14, 2019, 10:47am

Hi. I’m looking into using caddy in a docker swarm cluster with the docker plugin and the consul plugin (or hopefully an s3 plugin later).

I’ve got it set up and it works just as I hoped. The docker plugin reads the docker service labels and starts proxying all of my services.

The issue I’ve run into is that if a single service is configured wrong, caddy will stop proxying all of my services. I guess it’s because the docker plugin generates one big caddyfile and if that caddyfile has errors, caddy won’t start.

I’d like it so that if a service is misconfigured, it doesn’t take everything down.

I don’t know if caddy has an “ignore errors” setting or if it’s the docker plugin I should look into.

I’d love to hear if anyone has any solutions or recommendations.

gamalan · August 14, 2019, 3:29pm

Based on the Caddy docs, it should be able to import additional files or folders. So the docker plugin should be able to split its output into multiple files.

I’m not sure about ignoring config errors, though. It should be possible to check the config syntax before adding it to the loaded configuration.

matt · August 14, 2019, 4:47pm

My honest recommendation would be to not generate a wrong configuration? That sounds like a bug (in the docker plugin, or whatever it is that’s making the invalid Caddyfile).

mxrlkn · August 14, 2019, 5:26pm

The problem is that the caddyfile is generated dynamically by the docker plugin.

So, for example, here are the docker service labels you’d use.

labels:
  caddy.targetport: 8080
  caddy.address: example.com

The plugin then generates:

example.com {
  proxy / dockerservice:8080
}

But if you for example want to proxy websockets and someone on your team misspells websocket:

labels:
  caddy.targetport: 8080
  caddy.address: example.com
  caddy.proxy.websockett: ""

Which generates:

example.com {
  proxy / dockerservice:8080 {
    websockett
  }
}

Caddy will throw an error (CaddyFile error: :40 - Error during parsing: unknown property 'websockett') and will not proxy any of the other services that are correctly configured.

So the docker plugin just creates proxies by reading docker labels without doing any checking (it seems) of whether or not the caddyfile is correct.

Maybe I’m just trying to use the wrong tool for the job, but the new clustering features + the docker plugin make caddy very interesting for my use case.

@gamalan So if the plugin separates each block into files, that would keep caddy running even though there’s an error in one of the files?

matt · August 14, 2019, 7:44pm

You have my attention – because if that’s what makes Caddy useful over other products, I want to be sure we capitalize on that.

But, genuine question that I suppose is obvious: do any other web servers continue on error if the config is invalid? What if it’s unparseable (i.e. the structure itself is malformed or something)? What do you expect the server to do in these cases?

I mean, isn’t the correct answer here to fix the typo? Programs won’t compile if an identifier is misspelled, and I think there’s good reason for that.

How about an alternative idea, which we are developing for Caddy Enterprise: config changes are vetted through a test corpus before being applied to the cluster or instance. Would that do?

mxrlkn · August 14, 2019, 8:17pm

I’ve used Traefik before, which does continue to run even if you feed it wrong settings (with docker labels at least). Traefik does have built-in support for docker though. It’ll log the error but continue to run the other services.

I’m guessing that Traefik validates each service individually and then stitches the working ones together. But I honestly don’t know.

Yeah, I could test the configuration before deploying, but that means feeding the new labels into the docker plugin, then running caddy with that configuration and seeing if it fails. Could be a solution.

matt · August 14, 2019, 8:41pm

Was thinking more like Caddy 2 would do this for you, like on POST /load or something.

Also, did you know that Caddy 1 itself does not actually apply a bad config and stop a running server? If you reload the Caddyfile with SIGUSR1, an invalid config is rejected and Caddy rolls back to the currently working one, without any downtime. That’s what you really should be doing if you’re not already.
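For reference, a minimal sketch of triggering that reload from a script. It assumes Caddy 1 was started with its -pidfile flag so the process ID can be read from a file; the helper names and the pidfile path in the usage note are made up for illustration.

```python
import os
import signal
from typing import Callable


def read_pid(pidfile: str) -> int:
    """Read the process ID from a pidfile (a single integer, whitespace allowed)."""
    with open(pidfile) as f:
        return int(f.read().strip())


def reload_caddy(pidfile: str, send: Callable[[int, int], None] = os.kill) -> int:
    """Send SIGUSR1 to the process named in `pidfile` and return its PID.

    `send` defaults to os.kill; it is a parameter only so the function can be
    exercised without a running Caddy process.
    """
    pid = read_pid(pidfile)
    send(pid, signal.SIGUSR1)  # asks Caddy 1 to reload its Caddyfile
    return pid
```

Usage would be something like reload_caddy("/var/run/caddy.pid"), with the path matching whatever was passed to -pidfile. Whether the reload succeeded still has to be read from Caddy's log; the signal itself returns nothing.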

gamalan · August 14, 2019, 8:41pm

I also have a similar use case, and also just moved from Traefik. Based on https://caddyserver.com/docs/cli Caddy does have a validate function, which can take a file or even a glob string. Which means the docker plugin could theoretically validate a service’s labels before adding them to the Caddyfile it manages. But this would need to be added to the plugin first.
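A rough sketch of that per-service check, in Python just to show the idea (the plugin itself would have to do this in its own code). It assumes Caddy 1's -validate and -conf flags and that a bad Caddyfile gives a non-zero exit code; the function names and the dict of generated blocks are hypothetical, not part of the plugin.

```python
import subprocess
import tempfile
from typing import Callable, Dict, List, Tuple


def caddy_validates(block: str, caddy_bin: str = "caddy") -> bool:
    """Return True if Caddy accepts this site block on its own.

    Assumption: `caddy -validate -conf <file>` exits non-zero on a bad config.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".caddyfile") as f:
        f.write(block)
        f.flush()
        result = subprocess.run(
            [caddy_bin, "-validate", "-conf", f.name],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0


def build_caddyfile(
    blocks: Dict[str, str],
    validate: Callable[[str], bool] = caddy_validates,
) -> Tuple[str, List[str]]:
    """Join only the site blocks that validate; report the services skipped.

    `blocks` maps a service name to the site block generated from its labels.
    """
    good: List[str] = []
    skipped: List[str] = []
    for service, block in blocks.items():
        if validate(block):
            good.append(block)
        else:
            skipped.append(service)  # the caller should log these loudly
    return "\n".join(good), skipped
```

With this, a service with a misspelled directive would be left out of the generated Caddyfile and reported, instead of making the whole file invalid. Validating each block in isolation would miss errors that only show up when blocks are combined (for example duplicate site addresses), so the combined file would still need a final check.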

matt · August 14, 2019, 9:00pm

Even the validate option is not needed if you use USR1.

Whitestrake · August 14, 2019, 11:34pm

Hi @mxrlkn, quick question: Are you stopping and restarting Caddy, or is the Docker plugin itself doing the restarting and bringing sites down?

I’d been under the impression that the Docker plugin used reloads internally, i.e. USR1, but I don’t use it myself (been using docker-gen since before the Docker plugin was made available), so I might be wrong.

mxrlkn · August 15, 2019, 12:17am

Sorry, that’s right. It reloads internally (I guess with USR1) and it still runs with the last working config. My horror scenario is that a bad config goes unnoticed, the caddy container restarts for some reason, and then everything goes down.

@gamalan Interesting. That would probably be the best solution in my case.

matt · August 15, 2019, 3:28am

I’m not sure why that would happen if using USR1. (I guess there could be a bug but it’s pretty clear in the code that we return on error.) As long as USR1 is being used to reload the config, it should be safe.

matt · August 15, 2019, 4:00am

So wait, you’re saying that the bad config is being applied though?

Caddy will throw an error CaddyFile error: :40 - Error during parsing: unknown property 'websockett' and not proxy all the other services that are correctly configured.

So it stops proxying? Does the process get stopped altogether?

I ask because if USR1 is in fact being used and the behavior changes after a failed config reload, that’s something worth looking into, though it would be best to reproduce it without any plugins or Docker stuff.

gamalan · August 15, 2019, 5:34am

I tested this in my development environment (Caddy v1.0.3, docker loader plugin, plus other plugins). It seems @mxrlkn has no need to worry: Caddy keeps running normally, because the reload is done with SIGUSR1:

[INFO] SIGUSR1: Reloading

But an error will indeed be logged, such as:

[ERROR] CaddyFile error: :26 - Error during parsing: unknown property 'transparentt'

matt · August 15, 2019, 5:48am

That’s great news. Seems that things are working as expected then; that indicates the reload has failed, so the invalid config has not been applied. It should leave the server running in the same config as it was before. If in fact the behavior changes, that’d be what I’m interested in. (Although, even if that was the case, it almost certainly isn’t in Caddy 2, which has a much improved config implementation. More efficient and robust and easier to interface with.)

mxrlkn · August 15, 2019, 11:53am

Yes it still works great with USR1 when you update a docker service with bad labels. Caddy logs the error but still runs. All good.

The problem happens if you apply the bad docker labels, USR1 kicks in, everything is running fine (with errors logged), and then you restart the caddy container.

So if you restart caddy it starts with a fresh config, which in this scenario is still misconfigured, and then it won’t start.

@gamalan I just tested it again to make sure. You can try restarting caddy yourself after the config error.

Whitestrake · August 15, 2019, 1:01pm

Aye, that can be a problem. At this point, though, I think that the answer is to check your logs when you push config changes and make sure they were applied correctly, rather than engineer some kind of solution to make Caddy accept a bad config on a fresh start.

gamalan · August 15, 2019, 1:15pm

I see, it does happen, @mxrlkn.

Although I think it’s unlikely that the Caddy docker instance gets restarted at the same time as a service/stack is added or updated. Well, except if there is an update to the Caddy docker image.

But because the docker plugin already knows that there is an error, it could skip adding the bad config. That needs some changes though.

mxrlkn · August 15, 2019, 1:30pm

I think the best way is to implement it into the docker plugin as @gamalan said.

Checking the logs would be a hassle imo. If a bad config is pushed and then 2 weeks later you need to update caddy, change the configuration of the caddy service, or something similar, you need to go through all the logs, which also contain all kinds of other information.

Also, servers do sometimes go down. In the case of docker, containers can restart for all sorts of reasons. Containers are not necessarily meant to live long, so docker can and will restart containers for various reasons. I’ve also had several instances where I needed to restart docker itself.

Whitestrake · August 15, 2019, 11:07pm

Implement what into the Docker plugin, exactly?

What gamalan said was for Caddy to validate the new config before using it. It does this. The issue is that when you start the container fresh, it has no context for what is the new or old config, what is a good config, and what was the last known good config.

Checking the logs is as simple as tailing the log file when you update the other Docker containers. The Docker plugin sees the changes in real time, so you can see straight away whether Caddy succeeded or not.

Even if you couldn’t simply tail the log in another window and for some reason it was only suitable to check much later, instead of checking your work as you push it live - two weeks is trivial for logging. This is what grep is for, along with a myriad of other tools.

I think your best option here is training the users who will be pushing Docker updates to monitor the Caddy log to ensure their work is not faulty. And maybe writing a bash script to send you an email if the Caddy log produces an Error during parsing.
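As a starting point for that kind of check, here is a small sketch (in Python rather than bash) that pulls parse errors out of Caddy log lines. It assumes the log format shown earlier in this thread; the function name is made up, and sending the email is left out.

```python
import re
from typing import Iterable, List

# Matches lines like:
# [ERROR] CaddyFile error: :26 - Error during parsing: unknown property 'transparentt'
PARSE_ERROR = re.compile(r"Error during parsing: (?P<reason>.+)$")


def find_parse_errors(lines: Iterable[str]) -> List[str]:
    """Return the reason text of every 'Error during parsing' log line."""
    reasons: List[str] = []
    for line in lines:
        match = PARSE_ERROR.search(line.strip())
        if match:
            reasons.append(match.group("reason"))
    return reasons
```

It takes any iterable of lines, so it can be fed an open log file or the piped output of the container's logs. A non-empty result means the last pushed labels produced a Caddyfile that Caddy rejected.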

Container lifecycle is defined by the administrator who implemented it. Docker does not randomly restart containers unless it has been configured somehow to do so. My Caddy container’s uptime is exactly the last time I logged in to update the version. My Netdata container, though - that’s been restarted automatically to update because it’s being watched by Watchtower. Restarting Docker for some reason is no different than having to restart a VM host for some reason. What kind of automation do you have that you require your long-life Docker containers to be interruptible?