Multiple running instances with Let's Encrypt on-demand TLS

Hello, we are using Caddy with on-demand Let’s Encrypt TLS generation, since our customers host their own DNS and point domains at us at will.

Our Caddy config is pretty straightforward:

:443 {
    gzip
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/frontend

    tls support@ourdomain.com {
        max_certs 1000
    }

    fastcgi / 127.0.0.1:3001 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

This works well for a single Caddy instance, but a few problems quickly arise when we deploy multiple Caddy instances. First, we run into Let’s Encrypt rate limits, which are quite low (the duplicate certificate limit is 5 certificates per week). Second, each Caddy instance tries to renew the Let’s Encrypt certificates itself, causing lots of duplicate requests and, again, rate-limit errors.

What are common patterns and approaches for resolving this and running multiple Caddy instances with on-demand Let’s Encrypt? Could we use Amazon Elastic File System and point Caddy’s certificate storage at it?

I personally haven’t seen a common pattern emerge yet with respect to solving the Caddy clustering problem.

Using shared persistent storage is one option, but it has caveats: multiple Caddies might try to read and write these files at the same time. This can be mitigated somewhat by staggering instance starts.
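A minimal sketch of that shared-storage approach, assuming a shared file system (such as Amazon EFS) mounted at a hypothetical /mnt/efs — the mount path and SHARED_MOUNT variable are illustrative placeholders, not anything Caddy requires:

```shell
#!/usr/bin/env bash
# Point every Caddy instance at the same certificate store on shared
# storage. SHARED_MOUNT is a placeholder for wherever EFS is mounted.
SHARED_MOUNT="${SHARED_MOUNT:-/mnt/efs}"
export CADDYPATH="$SHARED_MOUNT/caddy"
echo "using certificate store at $CADDYPATH"
# Each instance then starts normally, e.g.:
# caddy -conf /etc/Caddyfile
```

With every instance sharing the same $CADDYPATH, certificates obtained by one instance are visible to the others — but the caveat above about concurrent reads and writes still applies.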

Unfortunately, as you’ve noted, the rate limits mean Let’s Encrypt simply can’t scale massively. One thing that might help would be to tier your hosts a little differently. Have something like this at the front:

:443 {
  tls support@ourdomain.com {
    max_certs 1000
  }
  errors /var/log/caddy/error.log

  proxy / upstream {
    transparent
  }
}

And something like this upstream:

:80 {
  gzip
  errors /var/log/caddy/error.log

  header / Strict-Transport-Security "max-age=15768000;"

  root /var/www/frontend

  fastcgi / 127.0.0.1:3001 php {
    env RDS_ENDPOINT {$RDS_ENDPOINT}
  }
}

By shifting as much processing and file access off your TLS termination proxy as possible, you can reduce your scale requirements for your certificate-handling instances and use them as load balancers instead.

Thank you for the reply. It seems that using a shared file system such as Amazon Elastic File System and pointing the TLS files there should work nicely. As I understand it, though, renewal checks are kicked off by Caddy when the process starts and every 12 hours thereafter. We use Terraform to deploy our instances, and unfortunately they will all start around the same second, +/- 10 seconds, because of the Terraform automation.

Does it make sense for Caddy to not hardcode 12 hours for checking Let’s Encrypt renewals, but a random value between 10 and 14 hours? Would that help?

Well, the only way to guarantee they won’t ever overlap is to stagger them with the same duration. But if your Caddy instances are short-lived (or at least impermanent), randomizing the timing is probably an effective way to mitigate it.
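Randomizing the timing can also be done at the process level, without any changes to Caddy itself, via a small start wrapper. This is just a sketch; the 0–300 second window and the commented-out launch command are placeholders:

```shell
#!/usr/bin/env bash
# Sleep a random number of seconds (0-299) before launching Caddy so
# instances deployed at the same moment don't all start their renewal
# timers simultaneously.
stagger_delay() {
  echo $(( RANDOM % 300 ))   # RANDOM is 0-32767 in bash
}

delay=$(stagger_delay)
echo "staggering start by ${delay}s"
# sleep "$delay"
# exec caddy -conf /etc/Caddyfile
```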

@nodesocket @Whitestrake When https://github.com/mholt/caddy/pull/2015 is merged (in hopefully a few days), Caddy will be much more compatible with a shared certificate environment.

Here’s one thing that change does: Caddy will check on disk before attempting to renew a certificate. If the certificate on disk already has a later expiration date, it will consider the certificate as renewed and simply load it and use it, rather than trying another renewal.

This means that if multiple Caddy instances share the same $CADDYPATH and use the same CA and stuff, they will be able to effectively share certificates without hitting rate limits, as long as they check for renewals not at the exact same time. (We’re talking a window of time as long as it takes to renew a certificate, which is just a few seconds at most, usually.) So if you stagger the execution of your Caddy instances by a few seconds or minutes, it’ll almost always be sure to synchronize successfully with the other ones. (It’s not perfect, but it’s a good stepping stone until we can abstract away the storage mechanism entirely, and include synchronization facilities.)
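The check-disk-first idea boils down to a simple comparison. Here’s a toy sketch of that decision — not Caddy’s actual code; expiry times are passed in as epoch seconds purely for illustration:

```shell
#!/usr/bin/env bash
# Decide whether this instance should renew: if the certificate on
# disk expires later than the one we have in memory, another instance
# already renewed it, so we just reload from disk instead.
should_renew() {
  local disk_expiry=$1  # NotAfter of the cert on disk (epoch seconds)
  local mem_expiry=$2   # NotAfter of the cert in memory (epoch seconds)
  if [ "$disk_expiry" -gt "$mem_expiry" ]; then
    echo "reload"  # fresher cert is already on the shared disk
  else
    echo "renew"   # we appear to be first; go renew
  fi
}

should_renew 1700000000 1690000000   # prints "reload"
```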

So, my advice: wait for the next release (coming right after Go 1.10) and use it; it should fix your issue. Remember to build from source when using for business purposes, or purchase a commercial license (and I mention this in case others read this who are in a similar situation). Even if you build from source, we recommend purchasing a commercial license or extended support when using the software for your business, so you have more direct access to us if things go wrong or you have questions. :slight_smile:


As mentioned in the issue, but I wanted to clarify it here too for readers:

That PR no longer requires you to stagger the execution of your Caddy instances to “safely” share certificates. I’ve implemented locking that can be shared by multiple instances of Caddy as long as they share the same certificate store (e.g. a network drive mounted as a local folder). So, in version 0.10.11, Caddy will be able to share certificates between multiple instances and synchronize renewals of them.
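For anyone curious what file-based locking across instances can look like in general, here’s an illustrative flock(1) sketch — this is not Caddy’s implementation; the lock path and commands are placeholders, and whether flock behaves correctly over your particular shared file system (NFS, EFS, etc.) is worth verifying:

```shell
#!/usr/bin/env bash
# Run a renewal step only if we can take a shared, non-blocking lock.
# In practice the lock file would live in the shared certificate store
# so every instance contends on the same file.
LOCKFILE="${LOCKFILE:-/tmp/caddy-renew.lock}"

with_renew_lock() {
  (
    flock -n 9 || { echo "lock held by another instance; skipping"; exit 0; }
    "$@"   # the renewal work goes here
  ) 9>"$LOCKFILE"
}

with_renew_lock echo "this instance performs the renewal"
```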


This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.