Wildcard leads to "context deadline exceeded" and then "no certificate available" errors

1. Caddy version (caddy version):

v2.2.1 h1:Q62GWHMtztnvyRU+KPOpw6fNfeCD3SkwH7SfT1Tgt2c=

With two plugins: caddy-dns/route53 and a custom/modified version of certmagic-storage-dynamodb.

2. How I run Caddy:

a. System environment:

Amazon Linux 2

b. Command:

caddy run --config /home/ec2-user/caddy.json

c. Service/unit/compose file:

n/a

d. My complete Caddyfile or JSON config:

{
  "admin": {
    "disabled": true
  },
  "logging": {
    "sink": {
      "writer": {
        "output": "stdout"
      }
    },
    "logs": {
      "log2stdout": {
        "writer": {
          "output": "stdout"
        },
        "level": "DEBUG"
      }
    }
  },
  "apps": {
    "http": {
      "servers": {
        "websites": {
          "listen": [
            ":443"
          ],
          "routes": [
            {
              "match": [
                {
                  "host": [
                    "*.plumpeskimos.com"
                  ]
                }
              ],
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "upstreams": [
                    {
                      "dial": "localhost:8080"
                    }
                  ],
                  "flush_interval": -1,
                  "buffer_requests": false
                }
              ],
              "terminal": true
            },
            {
              "match": [
                {
                  "not": [
                    {
                      "header_regexp": {
                        "host": {
                          "name": "disallow_branded_host",
                          "pattern": ".+\\.plumpeskimos\\.com$"
                        }
                      }
                    }
                  ]
                }
              ],
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "upstreams": [
                    {
                      "dial": "localhost:8080"
                    }
                  ],
                  "flush_interval": -1,
                  "buffer_requests": false
                }
              ],
              "terminal": true
            }
          ],
          "strict_sni_host": true
        }
      }
    },
    "tls": {
      "automation": {
        "policies": [
          {
            "subjects": [
              "*.plumpeskimos.com"
            ],
            "issuer": {
              "module": "acme",
              "ca": "https://acme-staging-v02.api.letsencrypt.org/directory",
              "email": "cory@stam.pr",
              "challenges": {
                "dns": {
                  "provider": {
                    "name": "route53"
                  }
                }
              },
              "trusted_roots_pem_files": []
            },
            "storage": {
              "module": "dynamodb",
              "table": "websites-edge-certificate-storage",
              "aws_region": "us-west-2"
            },
            "on_demand": false
          },
          {
            "issuer": {
              "module": "acme",
              "ca": "https://acme-staging-v02.api.letsencrypt.org/directory",
              "email": "cory@stam.pr",
              "challenges": {
                "http": {
                  "disabled": false
                },
                "tls-alpn": {
                  "disabled": false
                }
              },
              "trusted_roots_pem_files": []
            },
            "storage": {
              "module": "dynamodb",
              "table": "websites-edge-certificate-storage",
              "aws_region": "us-west-2"
            },
            "on_demand": true
          }
        ],
        "on_demand": {
          "ask": "http://localhost:8080/.stampr/.system/verify-tenant-ssl-provisioning-status"
        }
      }
    }
  }
}

3. The problem I’m having:

I have a wildcard being automatically generated and stored in dynamodb storage. Sometimes when I visit the site caddy logs an error “no certificate available” and thus doesn’t serve the site.

The server is started using an AWS AMI with caddy already running so the time between when the AMI was built to when it starts is sometimes a long time.

Deleting the ocsp file/record in storage and restarting caddy always fixes the error. Not sure if it’s a problem with caddy or the storage plugin.

4. Error messages and/or full log output:

/home/ec2-user/.pm2/logs/proxy-out.log last 15 lines:
{"level":"debug","ts":1606633449.010966,"logger":"http","msg":"starting server loop","address":"[::]:80","http3":false,"tls":false}
{"level":"info","ts":1606633449.0113878,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["*.plumpeskimos.com"]}
{"level":"info","ts":1606633449.4136097,"logger":"tls","msg":"cleaned up storage units"}
{"level":"info","ts":1606633455.3049645,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc0005101c0"}
2020/11/30 22:53:16 [INFO] Redirected sink to here (stdout)
{"level":"info","ts":1606776796.9810793,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"websites","https_port":443}
{"level":"info","ts":1606776796.981112,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"websites"}
{"level":"info","ts":1606776796.9841342,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0002c0150"}
{"level":"debug","ts":1606776796.9851458,"logger":"http","msg":"starting server loop","address":"[::]:443","http3":false,"tls":true}
{"level":"debug","ts":1606776796.98519,"logger":"http","msg":"starting server loop","address":"[::]:80","http3":false,"tls":false}
{"level":"info","ts":1606776796.9851973,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["*.plumpeskimos.com"]}
{"level":"error","ts":1606776976.9870756,"logger":"tls","msg":"job failed","error":"*.plumpeskimos.com: obtaining certificate: failed storage check: RequestCanceled: request context canceled\ncaused by: context deadline exceeded - storage is probably misconfigured"}
{"level":"info","ts":1606777036.9947038,"logger":"tls","msg":"cleaned up storage units"}
{"level":"debug","ts":1606777934.6477273,"logger":"http.stdlib","msg":"http: TLS handshake error from 167.248.133.38:42308: read tcp 10.2.1.238:443->167.248.133.38:42308: read: connection reset by peer"}
{"level":"debug","ts":1606778004.7133768,"logger":"http.stdlib","msg":"http: TLS handshake error from 167.248.133.38:41938: error checking http://localhost:8080/.stampr/.system/verify-tenant-ssl-provisioning-status to determine if certificate for hostname '10.2.1.238' should be allowed: Get \"http://localhost:8080/.stampr/.system/verify-tenant-ssl-provisioning-status?domain=10.2.1.238\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}

/home/ec2-user/.pm2/logs/proxy-error.log last 15 lines:
{"level":"info","ts":1606633454.8046432,"msg":"shutting down apps then terminating","signal":"SIGTERM"}
{"level":"info","ts":1606633454.944,"msg":"shutting down","signal":"SIGINT"}
{"level":"info","ts":1606633455.3049645,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc0005101c0"}
{"level":"info","ts":1606633455.3050168,"msg":"shutdown done","signal":"SIGTERM"}
{"level":"info","ts":1606776796.976701,"msg":"using provided configuration","config_file":"/home/ec2-user/caddy.json","config_adapter":""}
{"level":"warn","ts":1606776796.9792683,"logger":"admin","msg":"admin endpoint disabled"}
2020/11/30 22:53:16 [INFO] Redirecting sink to: stdout
{"level":"info","ts":1606776796.9810793,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"websites","https_port":443}
{"level":"info","ts":1606776796.981112,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"websites"}
{"level":"info","ts":1606776796.9841342,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0002c0150"}
{"level":"info","ts":1606776796.9851973,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["*.plumpeskimos.com"]}
{"level":"info","ts":1606776857.035941,"msg":"autosaved config","file":"/home/ec2-user/.config/caddy/autosave.json"}
{"level":"info","ts":1606776857.0359654,"msg":"serving initial configuration"}
{"level":"error","ts":1606776976.9870756,"logger":"tls","msg":"job failed","error":"*.plumpeskimos.com: obtaining certificate: failed storage check: RequestCanceled: request context canceled\ncaused by: context deadline exceeded - storage is probably misconfigured"}
{"level":"info","ts":1606777036.9947038,"logger":"tls","msg":"cleaned up storage units"}

Then I try to make a request:

{"level":"debug","ts":1606778935.7608654,"logger":"http.stdlib","msg":"http: TLS handshake error from 98.173.45.78:52018: no certificate available for 'whatever2.plumpeskimos.com'"}
{"level":"debug","ts":1606778935.8069701,"logger":"http.stdlib","msg":"http: TLS handshake error from 98.173.45.78:52019: no certificate available for 'whatever2.plumpeskimos.com'"}

5. What I already tried:

My knowledge of ocsp is superficial, but it might be related to ocsp somehow.

I don’t have concrete proof, but it seems like deleting the “ocsp/wildcard_.plumpeskimos.com-8d4312c0” file/record and restarting caddy always resolves the problem for that server. (No idea about an environment with multiple servers.)

Maybe a timeout? Evidence: I fixed the problem on one instance => terminated it => started a new instance and it was still working. It might’ve been because the AMI was just built.

The error about “context deadline exceeded” seems to also add more weight behind a timeout.

My thoery: It seems like this has something to do with the fact the AMI was created with caddy already running and the ocsp value it created on start while the AMI was being built is then no longer valid when a new instance is launched with that AMI after a certain amount of time.

Any ideas?

6. Links to relevant resources:

None

I see v2.2.2 was released and after updating I don’t think I’m seeing the issue any longer. Yay.

Hmm do you mean 2.3.0-beta.1? There is no 2.2.2.

i was pretty sure it said 2.2.2 but i don’t have the log any longer to confirm. it definitely didn’t say beta. at any rate, i was mistaken the issue is resolved.

2.2.2 isn’t a thing:

But if you’re building from source then the tooling may +1 the minor version number as a “virtual version” (no idea if that’s the right name for it).

How did you rebuild Caddy? What command did you use exactly?

I was using xcaddy so that’s likely what was happening.

This seems to be my fault. I was launching ec2 instances into a private vpc so there was no public internet egress… or something. I would assign a public IP so I could log in and debug and a side effect of that was the instance now wasn’t having a vpc issue. I’m a bit confused why/how on-demand certs were working when the wildcard wasn’t but I’m chalking it up to something weird with my setup and not caddy.

One thing that I think caddy could improve here is retrying on storage failure. Once it failed to load the wildcard cert (automatic, but not on-demand), it gave up permanently until restart and would always return no cert avail. Instead, it should probably retry intermittently when an automatic cert fails to load on start i.e. storage failures should not be treated as permanent.

As for “context deadline exceeded” – This is an error coming from the underlying dynamo ddb client. It was timing out while reading or writing to ddb.

This topic was automatically closed after 30 days. New replies are no longer allowed.