Caddy behind an AWS network load balancer health check race condition

We are using Caddy behind an AWS network load balancer that sends it both TCP 80 and TCP 443 traffic. The problem is that the load balancer's health check keeps taking the backend out of rotation, and as a result Caddy can’t start because it cannot complete the Let’s Encrypt verification. This is a race condition.

Is anybody using Caddy on-demand TLS behind a network load balancer? How do you make this work? It seems like a /health ping endpoint in Caddy that always responds, even when Caddy cannot get Let’s Encrypt TLS certs, should be possible. Currently Caddy just crashes and fails to start.
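Something close to that /health endpoint is already expressible in the Caddyfile, for what it's worth. A sketch, assuming Caddy v1's `status` directive and a plain-HTTP site (the hostname here is made up; the `http://` prefix disables automatic HTTPS for that site, so it can answer before any certificates exist):

```
# Hypothetical HTTP-only health site: no TLS involved, so it can
# answer the load balancer's probe even while certificate
# acquisition is still failing.
http://health.mydomain.internal:80 {
    status 200 /health
}
```

This only helps if the load balancer's health check targets port 80, of course; a TCP probe on 443 would still fail until Caddy has a certificate to handshake with.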

I’m not sure I understand.

The problem is the health check for the AWS network load balancer keeps on taking out the backend

This sounds like an issue with AWS or your setup. Can you give more details? I think we’re lost somewhere between “as a result” and “this is a race condition.”

I believe @nodesocket means the AWS load balancer removes the Caddy backend server when it detects Caddy is unavailable/down. And Caddy can’t start in on-demand mode because Let’s Encrypt verification through the AWS load balancer doesn’t work: there is no healthy Caddy backend for the load balancer to pass Let’s Encrypt’s request to.

Matt, eva2000 has it. Essentially, assume there is an AWS network load balancer with a single EC2 instance behind it running Caddy. The load balancer has taken the EC2 instance out of rotation because it fails the health checks. The problem is that I cannot start Caddy because the load balancer is not forwarding traffic. You see the problem? I had to hack around this horribly by starting a simple Python HTTP server on ports 80 and 443, waiting until the network load balancer put the EC2 instance back in rotation, then quickly stopping the Python servers and starting Caddy. Surprisingly, this hack worked.

And why is it doing that? :thinking:

Sounds like a catch-22 rather than a race condition.

1. Caddy can’t start with Automatic HTTPS unless it can requisition certificates;
2. Caddy can’t requisition certificates unless the load balancer is routing requests to it;
3. The load balancer won’t route requests to Caddy unless it passes health checks;
4. Caddy won’t pass health checks unless it’s started;
5. GOTO 1

If you can’t configure the load balancer to let Caddy do its job, the likely next best solution is DNS validation.
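Since the domains are on AWS anyway, Route 53 would be the natural provider for that. A sketch, assuming a Caddy v1 build that includes the route53 DNS plugin, with credentials supplied through the usual AWS environment variables (the domain and email are from the Caddyfile posted later in this thread):

```
portal.mydomain.com {
    # Complete the ACME DNS-01 challenge via Route 53 instead of
    # answering an HTTP request routed through the load balancer.
    # Assumes AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY are set
    # in Caddy's environment.
    tls support@mydomain.com {
        dns route53
    }
}
```

With the DNS challenge, certificate acquisition never depends on inbound traffic, so the health-check catch-22 above disappears entirely.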

On-Demand TLS is automatic HTTPS but it defers the acquisition of certificates until handshake-time.

But anyway, the Caddy docs talk about this situation (using Caddy behind a reverse proxy or load balancer) specifically. Using the DNS challenge is a good way to go. Or On-Demand TLS. Or configure the load balancer differently. Point is, there are plenty of ways to handle this situation, and there’s nothing “horrible” about it.

You’re right, as long as there are no regular, valid domains in the Caddyfile, it should just start up. The problem here, though, seems to be that:

This, as far as I’m aware, only ever happens with regular Automatic HTTPS (not On-Demand). Although I suppose it’s possible that the health check is probing an HTTPS endpoint, provoking a verification that never succeeds (another catch-22: Caddy holds the health check up waiting for a verification that requires the health check to succeed before it can proceed…). But while this would result in a botched health check and a backend removal, it wouldn’t stop Caddy from starting.

@nodesocket, would you mind posting your Caddyfile? And do you have logs of the startup failures? Could help us narrow down what needs to be done to get things working as expected (unless you’re happy with your Python solution, of course!).


Whitestrake and Matt,

I believe I sort of worked around this issue by doing the following. However, as you’ll see, it is extremely fragile and requires manual intervention.

1.) Changed the AWS network load balancer listener on TCP port 443 to health check TCP port 80 instead of the default 443. Thus both listeners (TCP 80 and 443) health check port 80.

2.) Started a simple Python HTTP server on TCP port 80 and waited for the AWS network load balancer to put the backend EC2 instance back in the pool (active). Quickly stopped the Python HTTP server and started Caddy.

Here is my Caddyfile. You will notice I am using both regular automatic https and on-demand.

portal.mydomain.com {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/portal

    fastcgi / 127.0.0.1:3000 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

urlf.mydomain.com {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/frontend

    fastcgi / 127.0.0.1:3001 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

:443 {
    gzip
    tls support@mydomain.com
    errors /var/log/caddy/error.log

    header / Strict-Transport-Security "max-age=15768000;"

    root /var/www/frontend

    tls {
        # on-demand
        max_certs 1000
    }

    fastcgi / 127.0.0.1:3001 php {
        env RDS_ENDPOINT {$RDS_ENDPOINT}
        index index.php
    }
}

Ahh, yep.

My strong recommendation, in that case, is to use DNS validation for those sites which won’t have On-Demand certificates.

If you do, you can take your Python server out of the mix and change your health checks back to probing both HTTP and HTTPS on those domains; Caddy will be able to start and requisition the certificates it needs, and when it’s done, AWS will be able to see that and start routing traffic to it.

The thing to avoid here would be health checking port 443 for a random host. The health check should be for a domain Caddy will start up with a certificate for.
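One way to make that health-check target concrete, as a sketch using Caddy v1's `status` directive (the /health path is arbitrary, and this is just the portal site from the Caddyfile above, trimmed down):

```
portal.mydomain.com {
    tls support@mydomain.com

    # Always answer 200 at /health so the load balancer's HTTPS
    # probe (sent with Host: portal.mydomain.com) succeeds cheaply,
    # under a hostname Caddy actually holds a certificate for.
    status 200 /health

    root /var/www/portal
}
```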

