Internal CA - Single point of failure?

Consider the topology diagram in @Rob789’s wiki article Use Caddy for local HTTPS (TLS) between front-end reverse proxy and LAN hosts.

It seems to me that, as long as I have a master copy of the Caddyfile, if the frontend Caddy server were to fail it would not take too much effort to swing the port forwarding on the router across to a new instance of the Caddy server to take over the reverse proxy function for the backend servers. However, the original frontend Caddy server has also been configured as the CA for the internal network. Swinging across to a new frontend Caddy server would break all mTLS links, would it not? These are not so easy to re-establish, as a new root.crt from the CA would have to be copied to every backend server that uses mTLS. Am I correct, or is there a way to duplicate the CA exactly in the second frontend Caddy instance?

If both frontend Caddy instances share the same storage, then they will use the same root CA to issue certs.
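
For illustration, a minimal sketch of the global options that would point both frontend instances at the same storage, assuming a shared mount at /mnt/caddy-shared (the path is just an example):

```
# Sketch only: identical global options on BOTH frontend Caddy instances.
# /mnt/caddy-shared is an assumed shared mount (e.g. NFS); adjust to suit.
{
	storage file_system {
		root /mnt/caddy-shared/caddy
	}
}
```

Because the internal CA’s cert and key live inside that storage (under pki/authorities/local), a second instance pointed at the same storage issues certificates from the same CA, so the root.crt already distributed to the backends keeps working.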

So I’ve studied @matt’s wiki article Load balancing Caddy and, after further research, began to realise that, for a home network, there are a couple of issues:

  1. It is not necessarily a viable proposition to place Caddy behind a h/w or cloud-based load balancer; and
  2. Sharing storage between two Caddy RPs could be problematic as the shared storage itself is a single point of failure.

It’s still worth considering a redundant RP on a separate h/w server, particularly if there are lots of upstream services. I wonder if a second Caddy RP could be set up in a ‘cold standby’ mode and swung into operation on failure of the primary RP? Storage is local to the RP, but could be replicated. Visually, it would look like this…

Is this a reasonable way to proceed? Assuming it is, the question I have is around the storage replication schedule. Let’s say local storage is replicated every 15 minutes. Just before the next replication event, the primary RP fails and the standby RP is swung into operation. There’s a potential loss of 15 minutes of ‘something’ in the replicated storage. What is the impact, if any, on mTLS, or will it just sort itself out through the issuance of new internal certificates?
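
For context, this is roughly how I’d pin the primary RP’s storage to a known local directory so the replication job has something well-defined to copy (the path is illustrative):

```
# Primary RP Caddyfile (sketch only): keep Caddy's state in a fixed local
# directory that the replication job copies across to the standby RP.
{
	storage file_system {
		root /var/lib/caddy-data
	}
}
# Under that root, the internal CA lives in pki/authorities/local/
# (root.crt, root.key, intermediate.crt, intermediate.key) and issued
# certificates live under certificates/.
```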

There are ways around that; Consul or Redis can be run in clustered modes. But yeah, if you load balance, then you need to load balance everything in your control, up to and possibly including your router.
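
As a rough sketch, with a third-party Redis storage module compiled in via xcaddy, the global options would look something like this; the option names are assumptions and vary between plugins, so check the README of whichever one you pick:

```
# Sketch only: requires a third-party Redis storage plugin built in with xcaddy.
# The host/port option names are assumptions; they differ per plugin.
{
	storage redis {
		host 127.0.0.1
		port 6379
	}
}
```

The storage backend then brings its own clustering/replication story, rather than relying on a single shared filesystem.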

You’re right, but for the market segment I’m interested in, which is home and small business, it’s probably out of scope. There are going to be some design compromises. I think the goal is to try to minimise downtime during exceptions.

Key single points of failure are concentrators for upstream services. These include the router, internet connection and reverse proxy server. For the market segment of interest, these are possible mitigation strategies:

  1. router - A swap-out router with the latest config preloaded.
  2. internet connection - A router capable of, and configured to, fail over to the mobile data network.
  3. reverse proxy server - This is what I’m exploring through this thread.

Against this backdrop, I think the question is still meaningful… If there’s a switch to a standby RP, does the slight inconsistency in the replicated storage have a detrimental impact on mTLS, or is mTLS likely to sort itself out through the reissue of internal certificates?

I’d say it’s probably fine, because the main thing re mTLS is the root CA cert/key, since that’s what the upstreams trust. For public TLS certs it can potentially be more of an issue with rate limits if you have a lot of domains, but otherwise a re-issue happening quickly after another is not a big problem.
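
To make that concrete, the trust side of mTLS on a backend boils down to something like this (a sketch only; the hostname and file paths are made up, and newer Caddy versions can express the same thing with trust_pool):

```
# Backend site (sketch): the backend only trusts the internal CA's root.crt,
# so any frontend that issues its client cert from that same root passes mTLS.
backend.internal {
	tls /etc/caddy/backend.crt /etc/caddy/backend.key {
		client_auth {
			mode require_and_verify
			trusted_ca_cert_file /etc/caddy/root.crt
		}
	}
	respond "ok"
}
```

So as long as the standby comes up with the same root CA cert/key, any leaf certs missing from the lagging replica just get re-issued and the backends never notice.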

So I’d say “sure, should be fine, but try to be present in case something goes wrong cause it’s untested”.

Your in-principle acknowledgement that the approach is feasible and could possibly work gives me the confidence to move to a proof of concept. It’s been a long but interesting journey to get to this point.

@Rob789’s wiki article Use Caddy for local HTTPS (TLS) between front-end reverse proxy and LAN hosts was really the catalyst for me to take a closer look at mTLS.

My personal journey with mTLS started on May 4 with mTLS under FreeBSD. The first big hurdle was the seemingly insurmountable system trust store issue for FreeBSD. Next were issues around the stability of mTLS in mTLS: tls internal error, which surprisingly appear to have recently been resolved indirectly through changes in Caddyfile design arising from Load balancing queries. The icing on the cake for me, though, was a solution yesterday for WordPress and mTLS. After weeks of unsuccessfully trying different things to get this to work, I still can’t believe such an elegant solution was dropped in my lap.

There’s been plenty of frustration along the way, but all the mTLS hurdles I faced now appear to have melted away. I could not have done it without your help, @francislavoie. I’m ever so grateful and feel privileged to have had you on this drawn-out journey with me. It’s taken over two months and I’ve come out the other end battered but confident about taking mTLS off the drawing board and moving it into production.
