Error "TLS alert, internal error (592)" (again)

1. Caddy version (caddy version):

v2.4.3 h1:Y1FaV2N4WO3rBqxSYA8UZsZTQdN+PwcoOcAiZTM8C0I=

2. How I run Caddy:

a. System environment:

Ubuntu 18.04
systemd
bare metal
Caddy installed as a package from the deb repo

b. Command:

systemctl start caddy 

c. Service/unit/compose file:

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddyfile or JSON config:

(log) {
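    # {args.0} is the log file path passed in via "import log <path>"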
    log {
        output file {args.0} {
            roll_size     100MiB
        }
        level DEBUG
        format json
        format filter {
            wrap json
            fields {
                common_log delete
                resp_headers>Content-Security-Policy delete
                request>remote_addr ip_mask {
                    ipv4 24
                    ipv6 32
                }
            }
        }
    }
}

{
    debug

    #Bind admin endpoint on all interfaces
    admin 0.0.0.0:2019

    on_demand_tls {
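        # Caddy only issues a certificate on demand if this endpoint returns
        #   HTTP 200 for the requested domain; any other status blocks issuance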
        ask http://localhost:8000/api/cname/validate
    }

    # Workaround to prevent on_demand being enabled on all domains
    # see https://github.com/caddyserver/caddy/pull/4128
    email security+caddy@lempire.co

    import log /var/log/caddy/default_req.log

    # Reduce caddy reload grace_period to 3 min (default was 20 min) to improve reload
    # See https://caddy.community/t/hangs-on-reload/12010/35
    # Because of long-lived websocket connections, reloads always took the full grace_period
    grace_period 3m
}

# Redirect any accesses using an IP instead of a domain name
# hackerone report: https://hackerone.com/reports/723126
51.91.128.49:80,
91.121.37.140:80,
147.135.229.201:80,
51.254.243.65:80,
54.36.18.249:80,
46.105.54.36:80
{
    redir https://app.lemlist.com{uri}
}

(cloudflare_matchers) {
    remote_ip 103.21.244.0/22
    remote_ip 103.22.200.0/22
    remote_ip 103.31.4.0/22
    remote_ip 104.16.0.0/13
    remote_ip 104.24.0.0/14
    remote_ip 108.162.192.0/18
    remote_ip 131.0.72.0/22
    remote_ip 141.101.64.0/18
    remote_ip 162.158.0.0/15
    remote_ip 172.64.0.0/13
    remote_ip 173.245.48.0/20
    remote_ip 188.114.96.0/20
    remote_ip 190.93.240.0/20
    remote_ip 197.234.240.0/22
    remote_ip 198.41.128.0/17
    remote_ip 2400:cb00::/32
    remote_ip 2606:4700::/32
    remote_ip 2803:f800::/32
    remote_ip 2405:b500::/32
    remote_ip 2405:8100::/32
    remote_ip 2a06:98c0::/29
    remote_ip 2c0f:f248::/32
}

(front_servers) {
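    # shared pool of Meteor front upstreams with active health checks,
    #   imported by the reverse_proxy blocks below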
    to localhost:42001
    to localhost:42002
    to localhost:42003
    to localhost:42004
    to localhost:42005
    to localhost:42006
    to localhost:42007
    to localhost:42008

    health_uri /api/ping
    health_interval 1s
    health_timeout 5s
}

(badbots_filter) {
    @badbots {
        header_regexp User-Agent "(?i)Fuzz Faster U Fool.*"
    }
    header @badbots {
        X-Cache-Debug "bbf"
        Server ""
    }
    respond @badbots "Access denied" 403
}

#
# Main application
#

app.lemlist.com,
app2.lemlist.com
{
    import log /var/log/caddy/front_req.log

    # the route block is needed to force Caddy to keep our directive
    #   order instead of applying its built-in directive order
    route {

        request_body {
            max_size 20MB
        }

        @from_cloudflare {
            import cloudflare_matchers
        }

        @not_from_cloudflare {
            not {
                import cloudflare_matchers
            }
        }

        ##############################################
        # pass all api lempire (e, i) requests to one
        #   dedicated Meteor app process
        ##############################################
        @api_e_i {
            path /api/lempire/*
        }

        header @api_e_i {
            X-Cache-Debug "app.api_e_i"
            Server ""
        }

        reverse_proxy @api_e_i {
            to localhost:8000
        }

        ##############################################
        # pass all api requests to Meteor api process
        ##############################################
        @api {
            path /api/*
        }

        header @api {
            X-Cache-Debug "app.api"
            Server ""
        }

        reverse_proxy @api {
            to localhost:42100
        }

        ##############################################
        # Caches
        ##############################################
        # must be after /api routes or https://app.lemlist.com/api/froala/files/*.jpg will not be served by meteor
        @cache_app {
            path *.svg *.woff2 *.eot *.ttf *.woff *.jpg *.jpeg *.png *.gif *.ico *.html
        }

        header @cache_app {
            # 1d expiration
            Cache-Control max-age=86400
            X-Cache-Debug "app.cache"
            Server ""
        }

        file_server @cache_app {
            root /opt/lemlist/app/programs/web.browser/app
        }

        @cache_app2 {
            path_regexp ^/lib/.*\.(map|js|css)$
        }

        header @cache_app2 {
            # 1d expiration
            Cache-Control max-age=86400
            X-Cache-Debug "app.cache2"
            Server ""
        }

        file_server @cache_app2 {
            root /opt/lemlist/app/programs/web.browser/app
        }


        @cache_version {
            path_regexp ^/[0-9a-z]+\.(js|css|map).*
        }

        header @cache_version {
            # 1d expiration
            Cache-Control max-age=86400
            X-Cache-Debug "app.cache_version"
            Server ""
        }

        file_server @cache_version {
            root /opt/lemlist/web.browser
        }


        ##############################################
        # Anything else goes to Meteor fronts
        ##############################################
        header @from_cloudflare {
            X-Cache-Debug "app.main.cf"
            Server ""
        }

        reverse_proxy @from_cloudflare {
            header_up X-Real-IP {http.request.header.CF-Connecting-IP}
            header_up X-Forwarded-For {http.request.header.CF-Connecting-IP}
            header_up Lb-Hash {http.request.header.CF-Connecting-IP}:{remote_port}

            import front_servers

            lb_policy header CF-Connecting-IP
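            # hash the CF-Connecting-IP header value so each visitor sticks
            #   to the same upstream even though all requests arrive via Cloudflare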
        }

        header @not_from_cloudflare {
            X-Cache-Debug "app.main"
            Server ""
        }

        reverse_proxy @not_from_cloudflare {
            header_up Lb-Hash {remote_host}:{remote_port}

            import front_servers

            lb_policy ip_hash
        }

        handle * {
            respond 404
        }
    }

    handle_errors {
        @errors {
            expression {http.error.status_code} == '502'
        }
        rewrite @errors /{http.error.status_code}.html
        file_server {
            root /var/www
        }
    }

    #dev only : to work on laptop localhost
    #tls internal
}

#
# Custom domains
#
(custom_domains) {
    import log /var/log/caddy/custom_{args.0}_req.log

    route {
        import badbots_filter

        #######################################
        # render a special page on the root path so the user can see that
        #   their custom domain is correctly set up
        #######################################
        header / {
            X-Cache-Debug "custom.cname_ok"
            Server ""
        }

        file_server / {
            root /var/www/lemlist_cname_ok
        }

        #######################################
        # pass all requests "track open/click" to track
        #######################################
        @track {
            path_regexp ^/api/(t/c|track/open|reply)/
        }

        header @track {
            X-Cache-Debug "custom.track"
            Server ""
        }

        reverse_proxy @track {
            to localhost:42160
        }

        #######################################
        # pass all requests "images + pages" to images
        #######################################

        @images {
            path /*
        }

        header @images {
            X-Cache-Debug "custom.images"
            Server ""
        }

        reverse_proxy @images {
            to localhost:42140
        }

        handle * {
            respond 404
        }

    }
}

# WARNING : DO NOT add any other site address than :443 here
:443
{
    import custom_domains on_demand

    tls {
        on_demand

        issuer acme {
            email security+caddy@lempire.co
            dir https://acme.zerossl.com/v2/DV90
            eab ...
            ...
        }

        issuer acme {
            email security+caddy@lempire.co
            dir https://acme-v02.api.letsencrypt.org/directory
        }
    }
}

pages.lemlist.com
{
    import custom_domains predefined
}

# WARNING : DO NOT add any other site address than :80 here
:80
{
    import custom_domains port80
}

#
# public direct access to service
#
(public_passthrough) {
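    # {args.0} = site address, {args.1} = local upstream port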
    {args.0}
    {
        import log /var/log/caddy/public_passthrough_{args.0}_req.log

        route {
            request_body {
                max_size 20MB
            }

            header * {
                X-Cache-Debug "{args.0}.main"
                Server ""
            }

            reverse_proxy * {
                to localhost:{args.1}
            }
        }
    }
}
import public_passthrough api.lemlist.com 42100
import public_passthrough img.lemlist.com 42150
# https://front.lempire.co -> admin
import public_passthrough front.lem.ovh 8000


#
# private direct access to services
#
(private_passthrough) {
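    # {args.0} = public port to listen on, {args.1} = local upstream port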
    front.lem.ovh:{args.0}
    {
        import log /var/log/caddy/private_passthrough_{args.0}_req.log

        route {
            @cache_version {
                path_regexp ^/[0-9a-z]+\.(js|css|map).*
            }

            header @cache_version {
                # 1d expiration
                Cache-Control max-age=86400
                X-Cache-Debug "direct.cache_version"
                Server ""
            }

            file_server @cache_version {
                root /opt/lemlist/web.browser
            }

            header * {
                X-Cache-Debug "direct.main"
                Server ""
            }

            reverse_proxy * {
                to localhost:{args.1}
            }
        }

        basicauth * {
            zzz ....
        }
    }
}

# Fronts
import private_passthrough 43001 42001
import private_passthrough 43002 42002
import private_passthrough 43003 42003
import private_passthrough 43004 42004
import private_passthrough 43005 42005
import private_passthrough 43006 42006
import private_passthrough 43007 42007
import private_passthrough 43008 42008

# API
import private_passthrough 43100 42100
# Cron
import private_passthrough 43130 42130
# Images Front
import private_passthrough 43150 42150
# Track
import private_passthrough 43160 42160
# Images
import private_passthrough 43140 42140

3. The problem I'm having:

About once a month, after days of running fine, the site is unreachable:

curl -k -vvv https://img.lemlist.com
*   Trying 147.135.229.201...
* TCP_NODELAY set
* Connected to img.lemlist.com (147.135.229.201) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/cert.pem
  CApath: none
* TLSv1.2 (OUT), TLS handshake, Client hello (1):
* TLSv1.2 (IN), TLS alert, internal error (592):
* error:14004438:SSL routines:CONNECT_CR_SRVR_HELLO:tlsv1 alert internal error
* Closing connection 0
curl: (35) error:14004438:SSL routines:CONNECT_CR_SRVR_HELLO:tlsv1 alert internal error

Even though the certificate is valid and less than two months old.

4. Error messages and/or full log output:

When it happens, I get tons of these:

2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.2718267,"logger":"http.stdlib","msg":"http: TLS handshake error from 54.36.63.50:60384: no certificate available for 'img.lemlist.com'"}
2021-08-05 07:54:12	
{"level":"debug","ts":1628142852.7338707,"logger":"http.stdlib","msg":"http: TLS handshake error from 54.36.63.50:47688: no certificate available for 'img.lemlist.com'"}
2021-08-05 07:54:11	
{"level":"debug","ts":1628142851.146231,"logger":"http.stdlib","msg":"http: TLS handshake error from 54.36.63.50:45850: no certificate available for 'img.lemlist.com'"}

After restarting Caddy:

2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0244107,"logger":"http","msg":"starting server loop","address":"[::]:80","http3":false,"tls":false}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0244377,"logger":"http","msg":"starting server loop","address":"[::]:43003","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0244606,"logger":"http","msg":"starting server loop","address":"[::]:43140","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0244842,"logger":"http","msg":"starting server loop","address":"[::]:43100","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0245056,"logger":"http","msg":"starting server loop","address":"[::]:43002","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0245278,"logger":"http","msg":"starting server loop","address":"[::]:43005","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0245743,"logger":"http","msg":"starting server loop","address":"[::]:43006","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0246098,"logger":"http","msg":"starting server loop","address":"[::]:43007","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0246308,"logger":"http","msg":"starting server loop","address":"[::]:43008","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0246575,"logger":"http","msg":"starting server loop","address":"[::]:43001","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0247173,"logger":"http","msg":"starting server loop","address":"[::]:443","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0247524,"logger":"http","msg":"starting server loop","address":"[::]:43004","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0247736,"logger":"http","msg":"starting server loop","address":"[::]:43130","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0248327,"logger":"http","msg":"starting server loop","address":"[::]:43160","http3":false,"tls":true}
2021-08-05 07:54:30	
{"level":"info","ts":1628142870.024842,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["app.lemlist.com","front.lem.ovh","app2.lemlist.com","api.lemlist.com","img.lemlist.com","pages.lemlist.com"]}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.0256736,"logger":"tls","msg":"loading managed certificate","domain":"app.lemlist.com","expiration":1631525890,"issuer_key":"acme-v02.api.letsencrypt.org-directory","storage":"FileStorage:/var/lib/caddy/.local/share/caddy"}
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.1930182,"logger":"tls","msg":"loading managed certificate","domain":"front.lem.ovh","expiration":1633242076,"issuer_key":"acme-v02.api.letsencrypt.org-directory","storage":"FileStorage:/var/lib/caddy/.local/share/caddy"}
...
2021-08-05 07:54:30	
{"level":"debug","ts":1628142870.4712768,"logger":"tls","msg":"loading managed certificate","domain":"img.lemlist.com","expiration":1634652604,"issuer_key":"acme-v02.api.letsencrypt.org-directory","storage":"FileStorage:/var/lib/caddy/.local/share/caddy"}

5. What I already tried:

I enabled debug logging as suggested in a previous thread:

6. Links to relevant resources:

Of course the certificate exists, since the site was working fine:

 /var/li/ca/.l/s/c/c/acme-v02.api.letsencrypt.org-directory/img.lemlist.com $ ls -la                                                                                                                             
total 16K
-rw------- 1 caddy caddy 5.2K Jul 21 17:10 img.lemlist.com.crt
-rw------- 1 caddy caddy  154 Jul 21 17:10 img.lemlist.com.json
-rw------- 1 caddy caddy  227 Jul 21 17:10 img.lemlist.com.key

How many domains/certificates do you have Caddy managing? My first hunch is that the cert for that domain was evicted from the in-memory cache to make room for others. If that happens though, Caddy should still find it on disk, but there might be some kind of edge case that you're hitting.

Ultimately @matt will be the best person to help here.

Indeed, we have a lot of domains (vanity domains). I don't have the exact number, but it's something like 15K :fearful:
So I can understand that we're hitting some kind of limit here :slight_smile:

Shouldn't be a limit… if the certificate isn't in the cache, it should be loaded into the cache.

Does this only happen for domains that are explicitly specified in your config like img.lemlist.com? i.e. not for on-demand certs? If so I have a hunch. Please verify when you can.

I just did a lot of grepping on the Caddy logs. For the last month, there doesn't seem to be any such pattern for on-demand certs.
Every time there was a "no certificate available" error for a non-explicit domain, there was a good reason for it (a non-200 response from the on_demand_tls ask URL).


Ok, I think I've got it: the unexplained "no certificate available" errors occur only for the explicitly-configured domains, correct?

If I got that right, then I might know what is going on. I just want to double-check before I say it and implement a fix, because reproducing this would be a pain, ha.

Exactly, sorry if I was unclear.


No problem, thanks for clarifying.

Alrighty… here's what I think is happening. Maybe you can give me your thoughts on it too as I work on a fix.

When a handshake is in progress, Caddy calls this function to get a certificate to complete it:

This should obtain and load a certificate from storage, if necessary (the two true args).

First, we see if the certificate exists in the in-memory cache:

If so, we simply return it and use it. If not, this logic is invoked:

Notice that we only do any of it if OnDemand is enabled (and this depends on the ServerName, or hostname, of the handshake). If it's not, the function simply gives up, hence an error like "no certificate available."

Of course, getting that error is odd when the domain is explicitly configured: the cert should be obtained and loaded when Caddy starts. So why do those sometimes have "no certificate available" errors, while dynamic certs for domains that aren't in the config work just fine all the time?

My hunch is because the in-memory cache has a maximum capacity:

And when many thousands of certificates are loaded, we have to evict one (which we do at random) to make room for another:

So, it follows that certificates for domains which appear explicitly in your config (and are not managed OnDemand) may work, then stop working if they get evicted, because the logic during the handshake forbids loading them into the cache at handshake-time (since they should have been loaded at startup); whereas the dynamic certs for OnDemand domain names always work since they can be loaded at handshake-time, replacing a cert that was loaded at startup.

Oops? Anyway, @francislavoie is right, you're hitting an edge case. Most users with lots of dynamic certs like yours don't also specify hard-coded domains, and I guess I didn't think about this case.

So we probably need to enable loading a certificate from storage even for non-OnDemand hostnames if the cache is at capacity, would you agree?
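To make that concrete, here is a rough sketch of the flow in Go. This is not the actual CertMagic source; all names (certCache, getCertificate, maxCapacity, and so on) are made up for illustration, and the real code does much more:

// Simplified illustration of the handshake logic described above.
// Not the real CertMagic implementation; names and structure are invented.
package main

import (
	"errors"
	"fmt"
)

type Certificate struct {
	Subject string
}

type certCache struct {
	maxCapacity int
	certs       map[string]Certificate
}

// getCertificate mimics the handshake path: check the cache, and only fall
// back to storage when on-demand TLS is allowed for this hostname.
func (c *certCache) getCertificate(serverName string, onDemandAllowed bool) (Certificate, error) {
	// 1. In-memory cache hit: done.
	if cert, ok := c.certs[serverName]; ok {
		return cert, nil
	}

	// 2. Cache miss for a hostname that is NOT eligible for on-demand TLS.
	//    Explicitly-configured domains were expected to be loaded at startup,
	//    so the handshake gives up -- the "no certificate available" error.
	//    (The proposed fix relaxes this and loads from storage here too when
	//    the cert was evicted because the cache is at capacity.)
	if !onDemandAllowed {
		return Certificate{}, errors.New("no certificate available for '" + serverName + "'")
	}

	// 3. On-demand hostname: load (or obtain) the cert and cache it,
	//    possibly evicting another entry to make room.
	cert := Certificate{Subject: serverName}
	c.store(serverName, cert)
	return cert, nil
}

func (c *certCache) store(name string, cert Certificate) {
	if len(c.certs) >= c.maxCapacity {
		// Evict an arbitrary entry to make room. This is how a cert that was
		// loaded at startup for an explicitly-configured domain can vanish
		// from the cache under heavy on-demand traffic.
		for victim := range c.certs {
			delete(c.certs, victim)
			break
		}
	}
	c.certs[name] = cert
}

func main() {
	cache := &certCache{
		maxCapacity: 1, // tiny on purpose, to show the eviction
		certs: map[string]Certificate{
			"app.lemlist.com": {Subject: "app.lemlist.com"}, // loaded at startup
		},
	}

	// An on-demand handshake for a vanity domain evicts app.lemlist.com...
	cache.getCertificate("vanity.example.com", true)

	// ...and a later handshake for the explicitly-configured domain fails.
	_, err := cache.getCertificate("app.lemlist.com", false)
	fmt.Println(err) // no certificate available for 'app.lemlist.com'
}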

(FWIW I think your hunch aligns with what I was thinking it might be @matt)


I have proposed a fix here:

@alban Any chance you could give that a try? Let me know if you want some pointers on compiling Caddy with the change. You'll need to run go get for that specific version of CertMagic before building.

Thanks @matt !

I have run the following commands:

~/work/caddy# go get -d -v github.com/caddyserver/certmagic@c6323a283019682740b0661488dd80e0d09fcb9b
go: downloading github.com/caddyserver/certmagic v0.14.2-0.20210816194047-c6323a283019
go: downloading github.com/klauspost/cpuid v1.2.0
go: downloading github.com/libdns/libdns v0.2.1
go: downloading github.com/mholt/acmez v0.1.3
go: downloading github.com/klauspost/cpuid/v2 v2.0.6
go: downloading github.com/miekg/dns v1.1.42
go: downloading go.uber.org/zap v1.17.0
go: downloading golang.org/x/crypto v0.0.0-20210513164829-c07d793c2f9a
go: downloading golang.org/x/net v0.0.0-20210525063256-abc453219eb5
go: downloading golang.org/x/sys v0.0.0-20210603081109-ebe580a85c40
go: downloading go.uber.org/atomic v1.7.0
go: downloading go.uber.org/multierr v1.6.0
go: downloading golang.org/x/text v0.3.6
go get: upgraded github.com/caddyserver/certmagic v0.14.0 => v0.14.2-0.20210816194047-c6323a283019
cd cmd/caddy/
go build
...

This build has been running on the prod server since 08:45 AM UTC, Tuesday, August 17, 2021.

caddy version
(devel)

I'm keeping an eye on metrics & logs. :crossed_fingers:


Awesome, thanks so much for trying it out!

I've got some bad news :frowning:
The "no certificate available for app.lemlist.com" error occurred again this weekend, at 1 AM local time, so luckily when traffic was quite low.

I checked that the Caddy version is still (devel) (no unattended upgrade).

So it seems the bug is still there.
Or was my way of building Caddy (described above) with a specific commit of CertMagic not correct?

As an attempt to "mitigate" the issue, I added a cron job to restart Caddy every night :frowning:

Let me know if you see anything I could try or log to help find the root cause.

Dang… that does look right… can you do caddy build-info to verify?

I'm pushing a commit soon that adds debug logging around the cert cache.

Hm, so I don't really know what the problem would be. But I added a lot of new debug-level logs in CertMagic (over a couple of commits), finishing with this one:

Caddy is updated to use that commit now, and will likely be tagged v2.4.4 at this commit: go.mod: Upgrade CertMagic to v0.14.4 · caddyserver/caddy@f43fd6f · GitHub

So, if you could deploy v2.4.4 (releasing today), and be sure to enable debug-level logging (which it looks like you already do, yay) then that should give some clues.

Please disable that restarter cron as well. Thanks!

Hello, thanks a lot,

v2.4.4 is deployed with debug logging enabled.

I just need to set up better alerting, then I will remove the daily restart cron.
I'll keep you updated.


Thanks. If/when it happens again, search the logs for mentions of the domain name for which there is "no certificate available" and then report what you find. :+1:

@alban Just wondering if you have any update before I tag another new release – are you seeing more debug logs?

Hello, indeed, the logs are more verbose about certificates.
So far, no issues, which is not a surprise, as it "only" happens about once a month.

It's always the same 3 lines repeating a lot when grepping for tls & app.lemlist:

	2021-09-02 19:08:46	
{"level":"debug","ts":1630602526.3940926,"logger":"tls.handshake","msg":"default certificate selection results","identifier":"app.lemlist.com","subjects":["app.lemlist.com"],"managed":true,"issuer_key":"acme-v02.api.letsencrypt.org-directory","hash":"b2f55ab76915fce99a7615079b8cf943a2b098b96c0be2f4ea869020c3a83a9d"}
2021-09-02 19:08:46	
{"level":"debug","ts":1630602526.394104,"logger":"tls.handshake","msg":"matched certificate in cache","subjects":["app.lemlist.com"],"managed":true,"expiration":1638114773,"hash":"b2f55ab76915fce99a7615079b8cf943a2b098b96c0be2f4ea869020c3a83a9d"}
2021-09-02 19:08:47	
{"level":"debug","ts":1630602527.0670524,"logger":"tls.handshake","msg":"choosing certificate","identifier":"app.lemlist.com","num_choices":1}