One site fails on startup - Caddy does not start

osirisguitar · January 29, 2018, 10:45am

Is there any way to configure Caddy to skip faulty site configurations and run those that work? Currently, a single site failing will prevent Caddy from starting and bring down all sites…

I have some very strange problems with TLS for two of my sites (they work fine with HTTP only) and troubleshooting is a pain since it brings down all of my other sites.

If not - how do I troubleshoot my configs in a non-production environment?

osirisguitar · January 29, 2018, 10:46am

I found this topic: All sites down when Caddy can't validate ONE site

But how would I do this in a container (my Caddy is in Docker)?

Lucas · January 29, 2018, 12:11pm

I wonder if you want something similar to a question I asked a while ago: Setting up sites in Caddy when the DNS hasn't been set up yet

The solution to my question in that thread was the max_certs subdirective.

Since you say you are having TLS problems, I’m guessing you want Caddy to start even if a particular site’s TLS isn’t working at the time. The max_certs subdirective basically enables on-demand TLS, so Caddy doesn’t need to obtain every site’s certificate at startup.

matt · January 29, 2018, 3:46pm

Caddy doesn’t “bring down all sites” – if it wasn’t running before, how could it bring them all down?

Using on-demand TLS will allow you to serve the site without verifying it until a request comes in for it. But Caddy simply can’t serve a site over HTTPS unless it can validate your ownership of the domain with a CA like Let’s Encrypt. So deferring that until it’s first needed would be the only way. Still, I recommend not configuring Caddy to do something it cannot do until it can do it.

osirisguitar · January 29, 2018, 4:34pm

I have Caddy deployed in Docker, so changing config means redeploying. That brings all sites down.

If Caddy can’t start with the new config, it means all sites are down.

matt · January 29, 2018, 4:52pm

“Changing config” means you should be using SIGUSR1 to reload it gracefully, not stopping it entirely and then trying to start it again. If you stop the server, that’s your fault, not Caddy’s, because there is a way to update the config without any downtime using signals.

osirisguitar · January 29, 2018, 6:54pm

Ok, cool it with the Linus Torvalds impersonations a bit…

My post was a question, not criticism. I realize the question wasn’t clearly asked, since restarting Caddy on config update is not the recommended way (from these forums I see I am not alone).

Since restarting the service is the normal way of updating a service in a Docker container (even for a config update), I wanted to know whether there is a way to make Caddy start even if one site fails (my sites stop working when Caddy is not running, regardless of whether that is my own fault).

If the answer is no, that’s fine. I don’t hate Caddy, I really like it. Well done for making it!

matt · January 29, 2018, 10:52pm

Sorry. The reason I come down strongly in response to language like this:

“a single site failing will prevent Caddy from starting and bring down all sites”

is because it misleads or confuses other readers when really what is happening is:

“I have Caddy deployed in Docker, so changing config means redeploying. That brings all sites down.”

Your way of using Docker is killing Caddy processes, which brings all sites down. Caddy isn’t bringing the sites down. You’re stopping Caddy and then it’s unable to run after you make a change. That’s much less surprising and much more acceptable from a software quality standpoint.

Since about September there’s been a lot of misinformation about Caddy that I’ve been correcting as thoroughly as I can: misinformation that leads people to make less-than-optimal and even insecure decisions about how to host and serve sites. Many sites were down sporadically this week in central Europe, for example, because site owners read stuff like this, chose Apache or nginx instead of Caddy, and then got bitten by low-quality OCSP implementations when an OCSP responder had trouble. Yet the information they read about Caddy (“it’s not open source” - “it takes all your sites down” - “it’s not fast” - “it’s unstable” - “it’s adware” - “it’s not powerful” - all real examples of common misinformation) is false in the first place. Now, granted, it’s the responsibility of the reader to be critical of what they’re reading, rather than finding what they think is dirt on some software project and then just accepting it because, hey, that’s fun, and it makes them feel more informed. BUT I’m still gonna do what I can to make sure the record is straight, especially within our own community.

So, again, sorry – it’s nothing personal. And I love that you’re using Caddy. And I hope it serves you well. If anything, I just have a vendetta against claims like the one originally posted.

If Docker really has no way of passing signals like USR1 to the processes, then that is an unfortunate technical limitation in Docker and you will have to change the way you serve your sites. The best thing to do is fix the configuration that is broken (or file a bug report, if it is a bug). But since you have not posted any information about log output or error messages, actually giving you help is difficult. We really do want to help you, but without anything to go on, all we can do is wave our hands and suggest generic things.

Whitestrake · January 30, 2018, 3:51am

For anyone interested, there are at least two ways this can be handled.

Firstly, docker kill can be given a --signal or -s flag. For example: docker kill --signal=SIGUSR1 web_caddy_1

Notably, this only works if Caddy is PID 1 in the container, or the PID 1 process is configured to forward USR1 to Caddy.

Secondly - and this is what I usually rely on - docker exec can be used to pkill Caddy within the container, for example: docker exec web_caddy_1 pkill -SIGUSR1 caddy

If pkill is not available in your container, you can use Caddy’s -pidfile flag and read its contents to kill -SIGUSR1 instead. I’ve never seen kill not available in a container before, but if that were ever the case, Caddy would probably be PID 1 anyway and docker kill would work.
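As a rough illustration of the pidfile fallback described above, here is a minimal sketch of a helper that reads a PID from a file and signals it. The function name signal_from_pidfile and any paths are hypothetical; the only assumption about Caddy is that it was started with -pidfile pointing at a file readable from where the helper runs.

```shell
#!/bin/sh
# Sketch only: the helper name and any pidfile path are hypothetical,
# not taken from the thread.

# Send a signal (default USR1) to the PID stored in a pidfile.
signal_from_pidfile() {
  pidfile="$1"
  signal="${2:-USR1}"

  if [ ! -r "$pidfile" ]; then
    echo "pidfile not readable: $pidfile" >&2
    return 1
  fi

  # Command substitution strips the trailing newline, if any.
  pid="$(cat "$pidfile")"
  case "$pid" in
    ''|*[!0-9]*)
      echo "pidfile does not contain a numeric PID: $pidfile" >&2
      return 1
      ;;
  esac

  kill -s "$signal" "$pid"
}
```

Inside a container the same idea collapses to a one-liner along the lines of docker exec web_caddy_1 sh -c 'kill -s USR1 "$(cat /path/to/caddy.pid)"', where /path/to/caddy.pid stands in for whatever path was given to -pidfile.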

docker kill | Docker Docs
docker exec | Docker Docs
https://caddyserver.com/docs/cli#pidfile

Whitestrake · January 30, 2018, 4:19am

On a separate note, if it helps understanding, the decision to have Caddy bail out on a bad config is intentional. The reasoning behind it is that if there’s a problem, it’s important for the admin setting it up to deal with it straight away.

It’s also important to note that there’s a big difference between a service restart and container recycling. When you restart a Docker container, you’re not restarting the service, you’re shutting down the service and starting it again. Caddy can’t offer any graceful handling for configuration changes during this process without abandoning the principle of bailing out on a misconfiguration at startup, since you’ve killed its process and it can’t exactly persist the previous configuration.

If your heart is set on using container recycling to update configuration, you could consider a different architecture, one that’s more resilient to these kinds of configuration failures.

Since the main problem in your case seems to be TLS misconfiguration, you could do something like this, which I wrote about some time ago:

Best practise for multiple tenant, multiple HTTPS domain server?

If I were going to set up a massively multi-tenanted fully-HTTPS shared hosting service, I would probably put Caddy in front with a really simple file:
:80, :443 {
  tls {
    max_certs [some large number]
  }
  proxy / http://haproxy:80 {
    transparent
  }
}
This would make startup pretty fast. I expect I would set [some large number] to the weekly rate limit of Let’s Encrypt and restart Caddy once a week.

A host-permissive, On-Demand TLS setup at the edge has minimal points of failure and starts up regardless of certificate-on-disk or network status - the point of failure for TLS issues would be delayed to client connection time. The configuration would be unlikely to change in the long run, too.

Instead of proxying back to HAProxy, send traffic back to your main Caddy instance - the one with all your individual site configuration - and use HTTP or self-signed over an internal network between the two. You can then muck around with your configuration all you like (assuming the Caddyfile syntax is correct…), and recycle both containers at will without fear of being unable to bring them back up properly due to TLS misconfiguration.
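The syntax caveat above can be partly guarded against by checking the Caddyfile before touching the running server. The sketch below assumes Caddy's -validate flag (which parses the Caddyfile without starting the server), that the edited file is already visible inside the running container (for example via a mounted volume), and that pkill exists there. The function name validate_then_reload and the config path are hypothetical. A syntax check of this kind would likely not catch certificate issuance problems, so it complements the on-demand TLS front end and does not replace it.

```shell
#!/bin/sh
# Sketch only: assumes the edited Caddyfile is already visible inside the
# running container and that both caddy and pkill are on its PATH.

# validate_then_reload CONTAINER CONF_PATH
# Runs a syntax check first; sends SIGUSR1 only if the check passes.
validate_then_reload() {
  container="$1"
  conf="$2"

  if docker exec "$container" caddy -validate -conf "$conf"; then
    docker exec "$container" pkill -SIGUSR1 caddy
  else
    echo "Caddyfile failed validation; running Caddy left untouched" >&2
    return 1
  fi
}
```

A call might look like validate_then_reload web_caddy_1 /etc/Caddyfile, with the container name taken from the earlier examples and the path being a placeholder.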