Random SSL_ERROR_INTERNAL_ERROR_ALERT

Owen_Conti · May 25, 2023, 9:17pm

1. The problem I’m having:

We’re having users and internal team members report random occurrences of an SSL_ERROR_INTERNAL_ERROR_ALERT when trying to connect to our Caddy fleet. Having the user refresh the page seems to fix the issue for them, and the page then loads correctly.

Example URLs:

2. Error messages and/or full log output:

I’m not sure which Caddy logs would be helpful. We have 42,000+ items in our Dynamo certificate storage table. I also haven’t been able to replicate the issue myself, which makes it even more difficult to track down.

Other helpful info is that app.aryeo.com is proxied through Cloudflare, whereas our wildcard subdomains go through Caddy first and then are reverse proxied to app.aryeo.com (so they get the benefit of Cloudflare, but they hit Caddy first).

3. Caddy version:

2.5.2

4. How I installed and ran Caddy:

a. System environment:

Ubuntu 20.04 on EC2, not using Docker

b. Command:

sudo systemctl daemon-reload
sudo systemctl enable caddy
sudo service caddy reload

d. My complete Caddy config:

{
    servers {
        timeouts {
            read_body 25s
            read_header 5s
            write 30s
            idle 60s
        }
    }

    # Ensure we validate that the custom domain exists in Aryeo before
    # trying to obtain a certificate for the domain
    on_demand_tls {
        ask https://app.aryeo.com/ask
    }

    # Store certificates in DynamoDB to share amongst nodes in the cluster
    storage dynamodb caddy_ssl_certificates {
        aws_region us-east-1
    }
    storage_clean_interval 32d
}

:8001 {
    respond /health "I'm healthy!" 200
}

:8002 {
    metrics
}

(modify-headers) {
    # Drop the Caddy identifier header
    header -Server

    # Add a header to identify the region that served the request
    header AryeoRegion "us-east-1"
    header AryeoNode ""
}

(reverse-proxy) {
    import modify-headers

    reverse_proxy https://app.aryeo.com {
        header_up Host app.aryeo.com
        header_up User-Custom-Domain {host}
        header_up X-Forwarded-Host {host}
        header_up X-Forwarded-Port 443
        health_timeout 5s

        lb_try_duration 5s
        lb_try_interval 250ms

        transport http {
            dial_timeout 5s
        }
    }
}

(cloudflare-tls) {
    # When obtaining certificates for any *.aryeo.com domain
    # use the installed Cloudflare module to allow Caddy to
    # create any necessary TXT records for domain validation
    tls {
        dns cloudflare REDACTED
        resolvers 1.1.1.1
    }
}

(access-logs) {
    log {
        output net udp/localhost:10519
        format filter {
            wrap json
            fields {
                request>headers>Accept delete
                request>headers>Accept-Encoding delete
                request>headers>Accept-Language delete
                request>headers>Sec-Fetch-Dest delete
                request>headers>Sec-Fetch-Mode delete
            }
        }
    }
}

(php-app) {
    import access-logs
    import modify-headers

    root * /home/forge/aryeo.com/current/public
    file_server

    php_fastcgi unix//run/php/php8.0-fpm.sock {
        root /home/forge/aryeo.com/current/public

        header_down AryeoStatic false

        # how long to try selecting available backends for each request
        lb_try_duration 10s
        lb_try_interval 500ms

        # how long to wait when connecting to the upstream socket
        dial_timeout 3s

        # how long to wait when reading from the FastCGI server
        read_timeout 30s

        # how long to wait when sending to the FastCGI server
        write_timeout 30s

        # Expose these env vars to PHP for the Datadog trace extension to use
        env DD_SERVICE laravel
        env DD_ENV production
        env DD_TRACE_LARAVEL_ENABLED true

        # Allow the following IPs to proxy to us
        # TODO: Update to Cloudflare's IP range in the future
        trusted_proxies 0.0.0.0/0
    }

    @blocked {
        path */wp-* *wlwmanifest.xml *xmlrpc.php *.php* *.ini* *.html* *.jsp* *.srf* */etc/passwd* */administrator/* *.pem* *.crt* *.key* *.p12* *.csr*
    }
    respond @blocked 403
}

(aryeo-app) {
    import php-app
    import cloudflare-tls
}

www.aryeo.com {
    import aryeo-app
}
app.aryeo.com {
    import aryeo-app
}
api.aryeo.com {
    import aryeo-app
}
webhook.aryeo.com {
    import aryeo-app
}
*.aryeo.com {
    import cloudflare-tls
    import reverse-proxy
    import access-logs
}

https:// {
    tls support@aryeo.com {
        on_demand
    }

    import reverse-proxy
    import access-logs
}

5. Links to relevant resources:

N/A

matt · May 25, 2023, 9:25pm

I would look for anything that contains the subject names where the error is happening.

Add debug to your global options block of your config to reveal more detailed logs.
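
As a minimal sketch of that change, assuming the rest of the global options block from the config above stays exactly as it is, debug is a single line inside that block:

```caddyfile
{
    # Emit debug-level logs, including TLS handshake and certificate details
    debug

    # ...existing global options (servers, on_demand_tls, storage) unchanged
}
```

After a reload, the extra log lines go to the same place as the existing Caddy process logs.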

Even without debug logs, the problem should be clear for an error like this once we see the errors emitted by Caddy.

I recommend upgrading Caddy to the latest version as well.

My current suspicion is the storage medium / plugin. But the logs will help us know for sure.

Owen_Conti · May 25, 2023, 9:46pm

The hard part is that, since I can’t replicate it and we have a fleet of servers serving the requests, I don’t know which logs map to the error happening.

I enabled the debug logs and then grep’d for error and found a few that could be potential causes?

May 25 21:34:45 web-production-ue1-n-1 caddy[753275]: {"level":"debug","ts":1685050485.3904254,"logger":"http.stdlib","msg":"http: TLS handshake error from 174.207.226.108:1822: EOF"}

May 25 21:34:26 web-production-ue1-n-1 caddy[753275]: {"level":"debug","ts":1685050466.9140286,"logger":"http.stdlib","msg":"http: TLS handshake error from 111.90.211.104:47592: read tcp 172.31.83.153:443->111.90.211.104:47592: i/o timeout"}

May 25 21:34:54 web-production-ue1-n-1 caddy[753275]: {"level":"debug","ts":1685050494.1760044,"logger":"http.stdlib","msg":"http: TLS handshake error from 71.161.225.205:60819: tls: client offered only unsupported versions: [301]"}

Not sure if those are all normal or not.

francislavoie · May 25, 2023, 9:48pm

Yeah, definitely upgrade to at least v2.6.4. Even better if you can upgrade to v2.7.0-beta.1 because of what I’m gonna suggest below

You can change header_up Host app.aryeo.com to header_up Host {upstream_hostport}, which avoids repeating the hostname in the config.

Caddy adds X-Forwarded-Host automatically now, so you can remove that header_up line.

The default dial_timeout is now 3s. You can probably remove yours (and along with it the transport block); 3s should probably be plenty anyway.

If lb_try_duration is only 5s and the dial timeout is also 5s, a retry will effectively never happen. I’d suggest increasing the try duration to something like 8s or 10s, maybe along with a shorter dial timeout, so that there’s room for at least one or two retries.

Are you sure you need X-Forwarded-Port? AFAIK Laravel doesn’t care about this one. We pass down X-Forwarded-Proto, which has https as the value and has the same effective meaning.

User-Custom-Domain is kinda redundant because of X-Forwarded-Host, so you can probably drop that one too.

You don’t need to set the root option inside php_fastcgi; it’s redundant because you already used the root directive with the same path.
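
Taken together, the reverse-proxy suggestions above might look roughly like the sketch below. The 10s try duration and 2s dial timeout are assumed values, picked only to leave room for a few retries rather than tested recommendations. Dropping User-Custom-Domain and X-Forwarded-Port assumes the app can rely on X-Forwarded-Host and X-Forwarded-Proto instead.

```caddyfile
(reverse-proxy) {
    import modify-headers

    reverse_proxy https://app.aryeo.com {
        # Reuse the upstream address instead of repeating the hostname
        header_up Host {upstream_hostport}

        health_timeout 5s

        # Assumed values: a dial timeout well below the try duration
        # leaves time for more than one connection attempt
        lb_try_duration 10s
        lb_try_interval 250ms

        transport http {
            dial_timeout 2s
        }
    }
}
```

If the 3s default dial timeout is acceptable, the transport block can be removed entirely, as suggested above.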

Since v2.6.4, you can set trusted proxies in global options to have it apply to all the proxies in the config:

{
	servers {
		trusted_proxies static 0.0.0.0/0
	}
}

But even better is there’s a cloudflare IP source plugin which would allow you to change it to this:

{
	servers {
		trusted_proxies cloudflare
	}
}

And even better than that, v2.7.0-beta.1 has client_ip_headers as a new option alongside trusted_proxies to read the client IP from the Cf-Connecting-Ip header which Cloudflare uses:

{
	servers {
		trusted_proxies cloudflare
		client_ip_headers Cf-Connecting-Ip
	}
}

This will also make the real client IP appear in the access logs which can be useful.

francislavoie · May 25, 2023, 9:50pm

Those could very possibly be just bots/crawlers failing to connect due to trying to find holes. So yeah, it can be hard to tell unless you can cross-ref with an exact timestamp you notice the problem.

Owen_Conti · May 25, 2023, 9:55pm

Thanks @francislavoie. A version upgrade and cleanup are on the TODO list. Unfortunately it’s a bit of a long process, so I need to find some time for it.

Owen_Conti · May 25, 2023, 10:05pm

Side question @matt @francislavoie - what’s the current recommended storage engine for sharing certs across servers?

francislavoie · May 25, 2023, 10:32pm

Honestly, none of them are perfect. But I’ve seen the fewest problems with the Redis one, because it has properly implemented locking and is super cheap to run. But since you’re on AWS, I’m not sure that there’s a good hosted Redis option for you? So Dynamo is fine, but it does seem needlessly expensive because it doesn’t do scans cheaply.

matt · May 25, 2023, 11:18pm

Redis is probably a good bet; or if you have a SQL database:

But let’s narrow down the actual cause first.

The EOF error means the client closed the connection; normal.

The i/o timeout means the connection was idle too long. Normal.

The "client offered only unsupported versions" error means the client only supports an old TLS version. Normal.

All of these are really common and aren’t related to this issue, unfortunately; usually those clients are bots or scripts etc.

What you’d be looking for is something related to the TLS certificate, most likely.

If it helps, I was able to trigger the error today at 11:17 PM UTC for drew-clements.aryeo.com. So maybe if you want to look for that occurrence in your logs, it could be useful.

Owen_Conti · June 14, 2023, 2:33am

Following up here:

We haven’t heard any complaints about the issue after upgrading Caddy to 2.6.4!
