Caddy behind an AWS network load balancer health check race condition

We are using Caddy behind an AWS network load balancer that sends it both TCP 80 and TCP 443 traffic. The problem is that the load balancer's health check keeps taking the backend out of rotation, and as a result Caddy can’t start because it cannot complete the Let’s Encrypt verification. This is a race condition.

Is anybody using Caddy on-demand TLS behind a network load balancer? How do you make this work? It seems like a /health ping endpoint in Caddy that always responds, even when Caddy cannot get Let’s Encrypt TLS certs, should be possible. Currently Caddy just crashes and fails to start.
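Something close to that /health endpoint is already expressible in the Caddyfile, for what it's worth. A sketch, assuming Caddy v1's `status` directive and a plain-HTTP site (the hostname here is made up; the `http://` prefix disables automatic HTTPS for that site, so it can answer before any certificates exist):

```
# Hypothetical HTTP-only health site: no TLS involved, so it can
# answer the load balancer's probe even while certificate
# acquisition is still failing.
http://health.mydomain.internal:80 {
    status 200 /health
}
```

This only helps if the load balancer's health check targets port 80, of course; a TCP probe on 443 would still fail until Caddy has a certificate to handshake with.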

I’m not sure I understand.

The problem is the health check for the AWS network load balancer keeps on taking out the backend

This sounds like an issue with AWS or your setup. Can you give more details? I think we’re lost somewhere between “as a result” and “this is a race condition.”

I believe @nodesocket means the AWS load balancer removes the Caddy backend server when it detects Caddy is unavailable/down. And Caddy can’t start in on-demand mode because Let’s Encrypt verification through the AWS load balancer doesn’t work: there is no healthy Caddy backend for the load balancer to pass Let’s Encrypt’s request to.

Matt, eva2000 has it. Essentially, assume there is an AWS network load balancer with a single EC2 instance behind it running Caddy. The load balancer has taken the EC2 instance out of rotation because it fails the health checks. The problem is that I cannot start Caddy because the load balancer is not forwarding traffic. You see the problem? I had to hack around this horribly by starting a simple Python HTTP server on ports 80 and 443, waiting until the network load balancer put the EC2 instance back in rotation, then quickly stopping the Python servers and starting Caddy. Surprisingly, this hack worked.

And why is it doing that? :thinking:

Sounds like a catch-22 rather than a race condition.

1. Caddy can’t start with Automatic HTTPS unless it can requisition certificates;
2. Caddy can’t requisition certificates unless the load balancer is routing requests to it;
3. The load balancer won’t route requests to Caddy unless it passes health checks;
4. Caddy won’t pass health checks unless it’s started;
5. GOTO 1

If you can’t configure the load balancer to let Caddy do its job, the likely next best solution is DNS validation.
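Since the domains are on AWS anyway, Route 53 would be the natural provider for that. A sketch, assuming a Caddy v1 build that includes the route53 DNS plugin, with credentials supplied through the usual AWS environment variables (the domain and email are from the Caddyfile posted later in this thread):

```
portal.mydomain.com {
    # Complete the ACME DNS-01 challenge via Route 53 instead of
    # answering an HTTP request routed through the load balancer.
    # Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set
    # in Caddy's environment.
    tls support@mydomain.com {
        dns route53
    }
}
```

With the DNS challenge, certificate acquisition never depends on inbound traffic, so the health-check catch-22 above disappears entirely.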

On-Demand TLS is automatic HTTPS but it defers the acquisition of certificates until handshake-time.

But anyway, the Caddy docs talk about this situation (using Caddy behind a reverse proxy or load balancer) specifically. Using the DNS challenge is a good way to go. Or On-Demand TLS. Or configure the load balancer differently. Point is, there are plenty of ways to handle this situation, and there’s nothing “horrible” about it.

You’re right, as long as there are no regular, valid domains in the Caddyfile, it should just start up. The problem here, though, seems to be that:

This, as far as I’m aware, only ever happens with regular Automatic HTTPS (not On-Demand). Although I suppose it’s possible that the health check is probing an HTTPS endpoint, provoking a verification that never succeeds (another catch-22: Caddy holds the health check up waiting for a verification that requires the health check to succeed before it can proceed…). But while this would result in a botched health check and a backend removal, it wouldn’t stop Caddy from starting.

@nodesocket, would you mind posting your Caddyfile? And do you have logs of the startup failures? Could help us narrow down what needs to be done to get things working as expected (unless you’re happy with your Python solution, of course!).


Whitestrake and Matt,

I believe I sort of worked around this issue by doing the following. However, as you’ll see, it is extremely fragile and requires manual intervention.

1.) Changed the AWS network load balancer listener on TCP port 443 to health check TCP port 80 instead of the default 443. Thus both listeners (TCP 80 and 443) health check port 80.

2.) Started a simple Python HTTP server on TCP port 80 and waited for the AWS network load balancer to put the backend EC2 instance back in the pool (active). Quickly stopped the Python HTTP server and started Caddy.

Here is my Caddyfile. You will notice I am using both regular automatic https and on-demand.

portal.mydomain.com {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/portal

    fastcgi / 127.0.0.1:3000 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

urlf.mydomain.com {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/frontend

    fastcgi / 127.0.0.1:3001 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

:443 {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/frontend

    tls {
        # on-demand
        max_certs 1000
    }

    fastcgi / 127.0.0.1:3001 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

Ahh, yep.

My strong recommendation, in that case, is to use DNS validation for those sites which won’t have On-Demand certificates.

If you do, you can take your Python server out of the mix and change your health checks back to probing both HTTP and HTTPS on those domains; Caddy will be able to start and requisition the certificates it needs, and when it’s done, AWS will be able to see that and start routing traffic to it.

The thing to avoid here would be health checking port 443 for a random host. The health check should be for a domain Caddy will start up with a certificate for.
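One way to make that health-check target concrete, as a sketch using Caddy v1's `status` directive (the /health path is arbitrary, and this is just the portal site from the Caddyfile above, trimmed down):

```
portal.mydomain.com {
    tls support@mydomain.com

    # Always answer 200 at /health so the load balancer's HTTPS
    # probe (sent with Host: portal.mydomain.com) succeeds cheaply,
    # under a hostname Caddy actually holds a certificate for.
    status 200 /health

    root /var/www/portal
}
```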

