SSL Errors csr_cn_is_invalid and "error while checking if stored certificate is also expiring soon"

skip2networks · September 4, 2024, 4:03am

1. The problem I’m having:

We run Caddy in a cluster with shared S3 storage. We’ve been doing this without any issues or interruptions for more than a year. Certificates are no longer renewing using the ZeroSSL API.

Similar to "ZeroSSL certificate query API error after upgrade to Caddy v2.8.4" (#10 by Antonin), but they ran into a 64-character limit on the ZeroSSL API side, which is not the issue we’re having.

Also similar to "Zerossl issuance error after upgrade to Caddy v2.8.0" (#11 by whizzygeeks), but I haven’t seen much movement there.
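
The 64-character limit mentioned above matches the upper bound that X.509 places on the Common Name field (RFC 5280, ub-common-name). A minimal sketch for ruling that cause out, assuming the CSR's CN is simply the hostname being requested; the helper name is hypothetical and is not part of Caddy or the ZeroSSL API:

```python
# Upper bound on the X.509 Common Name length (RFC 5280, ub-common-name).
MAX_CN_LENGTH = 64


def cn_too_long(hostname: str) -> bool:
    """Return True if the hostname would not fit in a certificate CN."""
    return len(hostname) > MAX_CN_LENGTH


if __name__ == "__main__":
    # Hostnames that appear in the log output in this thread.
    for name in ["www.ginlab.io", "kord5001.pop.skip2.net", "app.itchanged.dev"]:
        print(name, len(name), "too long" if cn_too_long(name) else "ok")
```

All of the hostnames in this thread are far shorter than 64 characters, which is consistent with the length limit not being the cause here.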

2. Error messages and/or full log output:

{
    "dt": "2024-09-04T03:07:30.600294855+00:00",
    "log": {
        "attempt": 2,
        "elapsed": 64.623345006,
        "error": "[www.ginlab.io] Renew: creating certificate: POST https://api.zerossl.com/certificates?access_key=redacted: HTTP 200: API error 2836: csr_cn_is_invalid (details=map[]) (raw={\"success\":false,\"error\":{\"code\":2836,\"type\":\"csr_cn_is_invalid\"}} decode_error=json: unknown field \"success\")",
        "level": "error",
        "logger": "tls.renew",
        "max_duration": 2592000,
        "msg": "will retry",
        "retrying_in": 120,
        "ts": 1725419249.8697977
    }
}
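
The "error" string above embeds the raw ZeroSSL response after "raw=", and that response carries the API error code and type even though the HTTP status is 200. A minimal sketch for pulling those two values out of such a string, assuming the "raw={...} decode_error=" layout seen in this log entry stays the same; the function name is hypothetical:

```python
import json
import re

# Matches the "raw={...} decode_error=" fragment inside the error string.
RAW_RESPONSE = re.compile(r"raw=(\{.*?\})\s+decode_error=")


def extract_api_error(error_text: str):
    """Return (code, type) from the embedded ZeroSSL response, or None."""
    match = RAW_RESPONSE.search(error_text)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict) or not isinstance(payload.get("error"), dict):
        return None
    error = payload["error"]
    return error.get("code"), error.get("type")
```

Pass it the already-decoded value of the "error" field (after the outer log line has been parsed as JSON), not the escaped text as it appears in the log file.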

3. Caddy version:

v2.8.4

4. How I installed and ran Caddy:

Using the package from ports and then replacing with our own binary. FreshPorts -- www/caddy: Fast, cross-platform HTTP/2 web server with automatic HTTPS

Xcaddy generated with:

a. System environment:

FreeBSD 14.0 and 14.1

b. Command:

See the package freebsd-ports/www/caddy/files/caddy.in at main · freebsd/freebsd-ports · GitHub

/usr/bin/su -m www -c /usr/bin/caddy start --config /usr/local/etc/caddy/Caddyfile --pidfile /var/run/caddy/caddy.pid >> /var/caddy/caddy.log

d. My complete Caddy config:

This is not my complete Caddyfile/caddy config as it spans dozens of nested files. Most importantly, it has not changed since the last successful renewal.

{
        order coraza_waf first 
        order cache before rewrite
        storage s3 {
                host "redacted"
                bucket "certs"
                access_id "redacted"
                secret_key "redacted"
                prefix "ssl"
                insecure false #disables SSL if true
        }
        email noc@skip2.net
        cert_issuer zerossl redacted
        log default {
                format json
                level info
                output file /var/log/caddy/caddy.log
        }
        cache {
                cache_name Souin
                log_level info
                key {
                        hide
                }
                redis {
                        url redacted
                }
                allowed_http_verbs GET POST PATCH
                ttl 10s
        }
        servers {
                trusted_proxies static redacted
                client_ip_headers X-Forwarded-For X-Real-IP
                metrics
        }
}
https://www.skip2.net {
    encode zstd br gzip
    log
    import default
    import cto
    import xfo SAMEORIGIN
    import ref same-origin
    import hsts "max-age=90; includeSubDomains"
    redir /whoami https://{system.hostname}.pop.skip2.net/whoami 301
    redir /dashboard https://{system.hostname}.pop.skip2.net/dashboard 301
    reverse_proxy /blog* cname.vercel-dns.com {
        header_up Host skip2.net
        import intercept-errors
    }
    reverse_proxy * skip2.netlify.app {
        header_up Host skip2.netlify.app
        import intercept-errors
    }
    import rm-thirdpty-headers
}

5. Links to relevant resources:

n/a

skip2networks · September 4, 2024, 4:07am

My certificates expired 2 hours ago, no idea when they stopped renewing.

I was able to mostly work around this by doing one or both of the following:

1. Deleting the /var/db/caddy/data folder (the equivalent of /var/lib/caddy on Linux) on every node in the cluster and running ‘caddy restart’ (‘caddy reload’ did not work).
2. Creating a new, empty shared S3 storage destination and using that instead.

After these two steps, certs started rolling in right away for most domains. There are still some in error and I’m also seeing a new error in the logs:

{
    "dt": "2024-09-04T04:13:51.599758603+00:00",
    "log": {
        "error": "file does not exist",
        "identifiers": [
            "kord5001.pop.skip2.net"
        ],
        "level": "warn",
        "logger": "tls.cache.maintenance",
        "msg": "error while checking if stored certificate is also expiring soon",
        "ts": 1725423230.7948484
    }
}

matt · September 4, 2024, 4:06pm

Hmm, odd. I’ll look into it. I might have more questions soon.

skip2networks · September 4, 2024, 4:28pm

Anything you need at all.
I can provide a backup of the /var/db/caddy folder, and I still have the original shared certificate storage preserved.

Still not able to get that kord5001.pop.skip2.net certificate, same error.

Thank you

matt · September 4, 2024, 8:12pm

That error is probably expected – is it still renewing those certs at least?

Curious about that one domain that isn’t working…

I’ll commit a debug log to CertMagic that should emit the contents of the CSR so we can see why the CN is invalid. You’ll have to build with the latest commit of CertMagic (let me know if you would like a sample command) and enable debug logs. They can be quite noisy so you’ll want to enable them for as long as you need then turn them off.

skip2networks · September 5, 2024, 1:28pm

Thank you for your help with this, Matt.

“That error is probably expected – is it still renewing those certs at least?”

Which error? I assume "error": "file does not exist". Either way, both errors resulted in the certificate not being renewed. They would just get requeued for renewal.

Yesterday we ran routine OS patches and package updates across the cluster and rebooted each node. The last 2 expired certificates that were stuck on "error": "file does not exist" started renewing after this reboot.

At this point all certificates have been renewed but I’m gonna have nightmares about seeing csr_cn_is_invalid in the logs again. Of course, ZeroSSL isn’t able to troubleshoot without seeing the CSR or more logs from us.

I’m not sure how to replicate the issue to troubleshoot further since we’re not getting either error anymore. I set up alerts on the error so if it happens again I’ll know before the certs expire. Lemme know if I can do something else to help.
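
An alert like the one described above can be sketched as a small log scanner. This is only an illustration, assuming the log file holds one JSON object per line and that entries may be wrapped in a "log" object as in the samples earlier in this thread; the function name is hypothetical:

```python
import json

# Substrings to alert on, copied from the log entries in this thread.
WATCHED = (
    "csr_cn_is_invalid",
    "error while checking if stored certificate is also expiring soon",
)


def find_alerts(lines):
    """Yield (logger, matched_text) for each JSON log line worth alerting on."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not a JSON log line
        if not isinstance(entry, dict):
            continue
        # Unwrap the "log" object if the entry has one; otherwise use it as-is.
        inner = entry.get("log", entry)
        if not isinstance(inner, dict):
            continue
        text = f'{inner.get("msg", "")} {inner.get("error", "")}'
        for needle in WATCHED:
            if needle in text:
                yield inner.get("logger", ""), needle
                break
```

Feeding it an open file object, for example `find_alerts(open("/var/log/caddy/caddy.log"))`, yields one tuple per matching entry, which can then be forwarded to whatever alerting system is in use.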

matt · September 5, 2024, 1:48pm

Sounds good. Thanks for your patience!

If it happens again (after the next release), I’ll have a better sense of things since it will be in debug logs.

ezitisitis · September 27, 2024, 6:18am

I ran into the same problem.

There was no issue with the initial certificate issuance.

Error messages and/or full log output:

{
  "level": "info",
  "ts": 1727416797.1459706,
  "logger": "tls.issuance.zerossl",
  "msg": "creating certificate",
  "identifiers": [
    "app.itchanged.dev"
  ]
}
{
  "level": "info",
  "ts": 1727416797.2228742,
  "logger": "tls",
  "msg": "certificate is in configured renewal window based on expiration date",
  "subjects": [
    "app.itchanged.dev"
  ],
  "expiration": 1728691200,
  "ari_cert_id": "",
  "next_ari_update": null,
  "renew_check_interval": 600,
  "window_start": -6795364578.8713455,
  "window_end": -6795364578.8713455,
  "remaining": 1274402.777126986
}
{
  "level": "error",
  "ts": 1727416797.8363113,
  "logger": "tls.renew",
  "msg": "could not get certificate from issuer",
  "identifier": "app.itchanged.dev",
  "issuer": "zerossl",
  "error": "creating certificate: POST https://api.zerossl.com/certificates?access_key=redacted: HTTP 200: API error 2836: csr_cn_is_invalid (details=map[]) (raw={\"success\":false,\"error\":{\"code\":2836,\"type\":\"csr_cn_is_invalid\"}} decode_error=json: unknown field \"success\")"
}
{
  "level": "error",
  "ts": 1727416797.8365476,
  "logger": "tls.renew",
  "msg": "will retry",
  "error": "[app.itchanged.dev] Renew: creating certificate: POST https://api.zerossl.com/certificates?access_key=redacted: HTTP 200: API error 2836: csr_cn_is_invalid (details=map[]) (raw={\"success\":false,\"error\":{\"code\":2836,\"type\":\"csr_cn_is_invalid\"}} decode_error=json: unknown field \"success\")",
  "attempt": 1,
  "retrying_in": 60,
  "elapsed": 0.693961665,
  "max_duration": 2592000
}
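
The "remaining" value in the renewal-window entry above is in seconds and lines up with "expiration" minus "ts". A small sketch of that arithmetic, assuming both fields are Unix timestamps in seconds as they appear in the log; the function name is hypothetical:

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def remaining_days(expiration: float, now: float) -> float:
    """Days left between a log timestamp and the certificate's expiration."""
    return (expiration - now) / SECONDS_PER_DAY


if __name__ == "__main__":
    expiration = 1728691200         # "expiration" from the log entry
    logged_at = 1727416797.2228742  # "ts" from the same entry
    print(datetime.fromtimestamp(expiration, tz=timezone.utc).isoformat())
    print(round(remaining_days(expiration, logged_at), 2))
```

With the values from this entry the certificate had a little under 15 days left when the renewal attempt failed.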

Caddy version:

v2.8.4

System environment:

Ubuntu 24.04 LTS

My complete Caddy config:

{
	email valid@email.here # email edited
	on_demand_tls {
		ask http://app.itchanged.dev/api/domain-check
	}

	log {
		output file /var/log/caddy/global_access.log {
			roll_size 10MB
			roll_keep 5
			roll_keep_for 720h
		}
		format json
	}
}

# snippets/laravel-app
# {args[0]} represents the root URL of the app. Example: "example.com".
# {args[1]} represents the root path to the app. Example: "/var/www/html/laravel-app".

(laravel-app) {
	{args[0]} {
		# apply security header
		header {
			# keep referrer data off of HTTP connections
			Referrer-Policy no-referrer-when-downgrade
			# Referrer-Policy "strict-origin-when-cross-origin"

			# enable HSTS
			Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"

			# Enable cross-site filter (XSS) and tell browser to block detected attacks
			X-Xss-Protection "1; mode=block"

			# disable clients from sniffing the media type
			X-Content-Type-Options "nosniff"

			# clickjacking protection
			X-Frame-Options "SAMEORIGIN"

			Content-Security-Policy "upgrade-insecure-requests"

			# hide server name
			-Server Caddy
		}

		tls {
			on_demand
			issuer zerossl 1234567890 # key obviously edited
		}
		# Resolve the root directory for the app
		root * {args[1]}/public

		# Provide Zstd and Gzip compression
		encode zstd gzip

		# Allow caddy to serve static files
		file_server

		# Enable PHP-FPM
		# Change this based on installed php version
		php_fastcgi unix//run/php/php8.3-fpm.sock
	}
}

# Use the "laravel-app" snippet for our site:
import laravel-app app.itchanged.dev /var/www/it-changed-portal
import laravel-app domain.itchanged.dev /var/www/it-changed-portal
import laravel-app *.itchanged.dev /var/www/it-changed-portal

http://app.itchanged.dev {
	handle /api/domain-check* {
		root * /var/www/it-changed-portal/public
		php_fastcgi unix//run/php/php8.3-fpm.sock
	}

	# Fallback to redirect
	handle {
		redir https://{host}{uri} 308
	}
}

app.tidpunkt.com {
	header {
		# keep referrer data off of HTTP connections
		Referrer-Policy no-referrer-when-downgrade
		# Referrer-Policy "strict-origin-when-cross-origin"

		# enable HSTS
		Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"

		# Enable cross-site filter (XSS) and tell browser to block detected attacks
		X-Xss-Protection "1; mode=block"

		# disable clients from sniffing the media type
		X-Content-Type-Options "nosniff"

		# clickjacking protection
		X-Frame-Options "SAMEORIGIN"

		Content-Security-Policy "upgrade-insecure-requests"

		# hide server name
		-Server Caddy
	}

	# Resolve the root directory for the app
	root * /var/www/tidpunkt/public

	# Provide Zstd and Gzip compression
	encode zstd gzip

	# Allow caddy to serve static files
	file_server

	# Enable PHP-FPM
	# Change this based on installed php version
	php_fastcgi unix//run/php/php8.3-fpm.sock
}

:443 {
	# apply security header
	header {
		# keep referrer data off of HTTP connections
		Referrer-Policy no-referrer-when-downgrade
		# Referrer-Policy "strict-origin-when-cross-origin"

		# enable HSTS
		Strict-Transport-Security "max-age=31536000; includeSubDomains; preload"

		# Enable cross-site filter (XSS) and tell browser to block detected attacks
		X-Xss-Protection "1; mode=block"

		# disable clients from sniffing the media type
		X-Content-Type-Options "nosniff"

		# clickjacking protection
		X-Frame-Options "SAMEORIGIN"

		Content-Security-Policy "upgrade-insecure-requests"

		# hide server name
		-Server Caddy
	}

	tls {
		on_demand
		issuer zerossl 1234567890 # key edited
	}
	# Resolve the root directory for the app
	root * /var/www/it-changed-portal/public

	# Provide Zstd and Gzip compression
	encode zstd gzip

	# Allow caddy to serve static files
	file_server

	# Enable PHP-FPM
	# Change this based on installed php version
	php_fastcgi unix//run/php/php8.3-fpm.sock
}

— EDIT1 —

Running caddy stop && rm -rf path/to/certificate/dir && caddy start resolved the issue, but I would consider that a hack, not a solution.

francislavoie · September 27, 2024, 12:05pm

@ezitisitis see matt’s reply above about building with the latest CertMagic commit and enabling debug logs.

We would need those debug logs to know what’s going on.

whizzygeeks · October 7, 2024, 9:48am

@matt @skip2networks I solved csr_cn_is_invalid by doing the following:

1. Ported Caddy to Intel architecture (this probably wasn’t the issue).
2. Removed the rate limit on the domain validator API (it was not required and seems to have been the primary issue).
3. Removed the rate limit on the WAF and initially stopped fail2ban (this seems to be a secondary issue that shows up after a few successful issuances, since ZeroSSL doesn’t provide an IP range for whitelisting).

It took almost 3 months, on and off, to figure this out since the logs were not that detailed.

@matt @francislavoie logging and log format options need serious attention. Please also mark my ticket as solved since I am unable to do it.

francislavoie · October 7, 2024, 10:04am

@whizzygeeks what exactly do you mean by “rate limit” here? Could you be more specific? That could mean a lot of different things.

In what way do logging and log format options need attention? Please be more specific about your concerns. Vague comments like this don’t help us implement any improvements.

whizzygeeks · October 7, 2024, 10:45am

Apologies, I didn’t explain the resolution steps I took.

Regarding point 2:

on_demand_tls {
	ask http://domain-validator.wg.com/px-reg
	interval 1m
	burst 60
}

I commented out the interval and burst directives in the block above. This solved the primary issue (csr_cn_is_invalid), and a lot of ZeroSSL certificates were generated while upgrading from 2.6 to 2.8.

Regarding point 3:

After some time it started failing again with the csr_cn_is_invalid error. I then figured the issue was similar and that the ZeroSSL API was unable to complete the HTTP handshake (an assumption, since the logs had the same output).

I did the following:

1. Disabled the following rate limit Caddy module directive in the config:

	rate_limit {
		zone dynamic_example {
			key {remote_host}
			events 350
			window 1m
			jitter 2.0
			sweep_interval 1m
		}
	}

2. Disabled, and later increased, the AWS WAF rate limit.
3. I was also using fail2ban and found a few blocked IP addresses belonging to the ZeroSSL API, so I disabled that too.

Why did the issue suddenly crop up for us?
Since all ZeroSSL certificates were re-issued due to the ZeroSSL storage format change in the 2.6 to 2.7 upgrade, we were hitting rate limits. More than 5K certificates were being generated, and I believe most people facing this issue with the same error are probably hitting a WAF, IDS, or rate limit.

What I would expect from the logs:

1. After a certain timeout, an unverified token from ZeroSSL should be logged just once, indicating a verification or handshake error.
2. CSR details should be visible in the logs.
3. An option to change the format from JSON to plain text (nginx-like logs) should be available. Transform-encoder has CPU spike and memory leak issues and didn’t work for us.

francislavoie · October 7, 2024, 11:19am

Ah, yeah those (interval, burst) have been deprecated and are slated for removal (as your logs would tell you with a warning). They don’t actually work correctly, and prevent the wrong things.

Fair reminder though, I’ll do a PR now to remove them completely at this point, since they’ve been deprecated long enough. (Done caddytls: Drop `rate_limit` and `burst`, has been deprecated by francislavoie · Pull Request #6611 · caddyserver/caddy · GitHub)

Ah, you had this in your http://domain-validator.wg.com/px-reg route? Yeah you definitely shouldn’t be rate limiting your ask endpoint, so that makes sense to remove.
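
For context, Caddy queries the ask endpoint with a "domain" query parameter and treats a 2xx response as permission to issue. A minimal sketch of such an endpoint without any rate limiting in front of it, assuming a hypothetical in-memory allowlist, port, and names (none of which come from this thread):

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical allowlist; a real deployment would query its own datastore.
ALLOWED_DOMAINS = {"customer.example.com", "shop.example.org"}


def is_allowed(domain: str) -> bool:
    """Decide whether a certificate may be issued for this domain."""
    return domain.lower().rstrip(".") in ALLOWED_DOMAINS


class AskHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Caddy passes the candidate name in the "domain" query parameter.
        query = parse_qs(urlparse(self.path).query)
        domain = query.get("domain", [""])[0]
        # A 2xx status permits issuance; anything else denies it.
        self.send_response(200 if is_allowed(domain) else 403)
        self.end_headers()


if __name__ == "__main__":
    # Bound to localhost so only a Caddy instance on the same host can reach it.
    HTTPServer(("127.0.0.1", 9123), AskHandler).serve_forever()
```

The lookup should be fast and cheap, since it runs for every name that has no certificate yet.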

I’m not sure I understand what you mean here. Can you elaborate?

CSR details in the logs are coming, actually: Debug log when creating CSR · caddyserver/certmagic@80bb9a8 · GitHub

Have you reported those memory leak issues with transform-encoder? That’s the first we’ve heard of that.

What do you mean by “didn’t work for us”? Please be more specific.

You can use the console format which is more human readable: log (Caddyfile directive) — Caddy Documentation.

system · November 6, 2024, 11:20am

This topic was automatically closed 30 days after the last reply. New replies are no longer allowed.