Docker Swarm cluster and high availability

Hello

I’ve been trying to deploy Traefik v2.x for weeks on a Docker Swarm cluster to get a highly available reverse proxy, but it’s been hell so far. It rains complaints about the breaking changes from v1.x to v2.x, so I decided to give something else a try -> Caddy.
On top of that, we got the news that Traefik’s HA feature has been moved into a very expensive enterprise version. Big bummer.

I have read a lot about Caddy so far and want to give it a try, but I would like a little jump start on how Caddy handles this whole concept and which of these features it supports (or not).

Can someone help me out with the basics, or share a basic docker compose file to get started?

I’m using Docker Swarm. I have 3 manager nodes and 6 worker nodes so far (the worker nodes will expand further, up to about 70 nodes, for all the applications I need to migrate).

  • I need a highly available load balancer
  • It needs to be able to handle the Let’s Encrypt certs
  • We will deploy Portainer or Swarmpit afterwards to manage the Docker containers
  • I need to be able to run NGINX for some WordPress websites (because of the fastcgi caching that is required for blazing fast sites)
  • I need to be able to run OpenLiteSpeed for some websites, because of the extremely fast LiteSpeed caching (LSCache) in WordPress
  • Both NGINX and OpenLiteSpeed need to operate in the network while Caddy remains the reverse proxy and load balancer

I was considering Consul to handle the storage of the certs and configuration files, so that if something goes wrong the load balancer keeps working instead of taking 900+ applications down.
If there are better ways to do this with Caddy, I’m all ears.

Thanks!

I’m not sure I’ll be able to answer all your questions, since I haven’t worked on setting up a cluster yet. But here’s some info that could help you get started.

There’s an official docker image for Caddy v2 you can use: https://github.com/caddyserver/caddy-docker

There’s a WIP ingress controller which can be used alongside Caddy v2: https://github.com/caddyserver/ingress. I don’t have any insight on how to use it though; you’ll need to get help from someone else (maybe open an issue on that repo). Edit: my brain went in the wrong direction. You’re using Swarm, not k8s, so I don’t think this is relevant for you.

Caddy’s storage is pluggable and should support Consul, but I’m not aware of a Consul storage plugin for v2 yet. Docs for storage are here: https://caddyserver.com/docs/json/#storage and here https://caddyserver.com/docs/caddyfile/options (global storage option via Caddyfile). Plugins require building Caddy from source with an additional import added. See the caddy-docker README for instructions on using the builder docker image to help with this.

Putting all this together, I’m sure Caddy could fit your use case well. It’s still a bit scattered because v2 is still in beta and there hasn’t been a large amount of work put into the clustering use case yet (mainly because @matt doesn’t have much experience with it and neither do most of the other top contributors).

There’s an implementation here: https://github.com/pteich/caddy-tlsconsul - but it does need a v2 module. Once the implementation is done, these Caddy storage adapter modules are very easy to make (basically just mapping struct fields from one type to another). It would take maybe 20 minutes?

Anyway, @codeagency, we’d love it if you could contribute what you need to the project and then start using it!

Thanks for your feedback.
Great to hear that Caddy could potentially do everything we are looking for.

I have seen that Caddy2 seems to have some options for clustering that can be enabled.
I’ll have a go with it and see how far I can get.

Hello Matt

Also thank you for your feedback.
Well, since v2 is out now, even though it’s in beta, I’m not going to start on v1 for such a short time and then end up with an upgrade mess.
I had enough of that with Traefik v1 > v2 over the past weeks, to the point that I gave up on them completely.

Are there any other K/V solutions that are proven to work well with Caddy?
I also read that Traefik could work with Redis or variants like KeyDB.
Does this also apply to Caddy? Are there any plugins available for this, and do they work with v2?

As a side note:
I got feedback from Containous/Traefik, as there seems to be a lot of confusion regarding the concepts of HA and SSL certs:
The problem with Traefik v2.x and onwards is that they have removed the distributed storage feature from the CE version. Traefik itself is still HA, also in the CE version.
But the EE version is distributed by design, and there the SSL storage comes as a bonus because the core is already distributed.
CE lacks this completely as of v2.x and upwards.
Recently, in v2.2 (beta RC), they reintroduced the K/V providers, but it’s a pain in the *** to get them working.

I hope Caddy has a better workaround/solution for this?
My problem with the whole thing is the following:

We are using Docker Swarm with multiple managers. If Traefik/Caddy is deployed as an HA cluster and one of the manager nodes goes down, we expect that the load balancer/reverse proxy does not take the entire cluster and its applications down with it.
If you have an SSL cert applied to a specific application and the LB goes down but another one takes over, the risk is that the SSL cert is not available: the acme.json file is not reachable on manager node 2 because it does not exist there, and since http > https forwarding is enabled, poof, all your applications are now off the grid.
So we must have something in place to handle the distribution of the SSL certs.
You also don’t want the LB to start fetching new certs over and over each time the LB node goes down, because that could kill your rate limits at Let’s Encrypt.

@matt
Does Caddy also have an automatic service discovery feature?

If you’re comfortable with it, you could use GlusterFS as the backing storage for Caddy and have every Caddy instance use that same storage. Then you won’t need a K/V solution. Caddy is very reliable when it comes to certificate management: as long as all instances share the same storage, it uses files as distributed locks to avoid overlapping work.

Caddy v1 did support SRV DNS records for load balancing the reverse proxy. I’m not sure about the status of that in v2; it might not be in yet.

That was one of the alternatives I was also exploring, except with Ceph instead of GlusterFS (because GlusterFS seems to eat up a lot of CPU, and Ceph is more capable of handling larger shares and more I/O than GlusterFS).

We also have a Rancher setup running, just for testing, and there we tried OpenEBS. Although it looked very promising, it’s a real pain to get a storage layer working in a K8s environment.
I might also check whether OpenEBS or a different one could work for Docker Swarm clusters.

@matt
On your website front page it shows this block:

CLUSTER SUPPORT

Caddy can share managed certificates stored on disk with other instances and synchronize renewals in fleet deployments.

Does this apply to Caddy v1 or also v2 already?
Do you have any documentation on this part please?

Thanks.

Just configure all the storage to be the same for all your Caddy instances: https://caddyserver.com/docs/json/#storage

There are more backends available, just gotta take a few minutes and make them into Caddy 2 plugins.

I will have a look at it. I’m completely new to Caddy, so I have no experience with creating Caddy plugins.
Is there also an option to pay for somebody to create them?
If you say it’s just a few minutes of work, I can imagine anybody could chime in and join forces to create those plugins in a very short time.

Thanks!

Sure, sounds good.

Absolutely – we do have an exclusive partnership with Ardan Labs for Caddy support+development. If you send a quick message with this form, Miguel can work out an arrangement.

(To be quite honest with you, though, this particular task should only require about an hour or two of a developer’s time, at least for any developer who is familiar with Caddy modules.)
