Caddy + DynamoDB: background maintenance taking too long and creating read spikes on DynamoDB

1. Caddy version (caddy version):

v2.4.6

2. How I run Caddy:

a. System environment:

Kubernetes, with image caddy:2.4.6-alpine

b. Service/unit/compose file:

FROM caddy:2.4.6-builder AS builder

RUN xcaddy build \
    --with github.com/silinternational/certmagic-storage-dynamodb

FROM caddy:2.4.6-alpine

COPY --from=builder /usr/bin/caddy /usr/bin/caddy

c. My complete Caddyfile or JSON config:

{
	on_demand_tls {
		ask https://check-domain.internal.endpoint/cname
		interval 1m
		burst 200
	}
	storage_clean_interval 90d
	storage dynamodb caddy-certificates {
		aws_region us-east-1
	}
}

https://
tls {
	on_demand
	issuer zerossl <key> {
		email <email>
		timeout 3m
	}
	issuer acme {
		email <email>
		timeout 3m
	}
}
reverse_proxy hostingapp

3. The problem I’m having:

Caddy is working great so far: it’s serving 25k+ certificates without issues and using few resources.

The problem happens when the background certificate maintenance task kicks in. It has been running for 9 hours now (since the last server start).

When it’s not running, reads on DynamoDB normally sit around 700 per minute, but once the background maintenance task starts they jump to about 40k.

I know the problem is almost certainly not in Caddy itself but in the storage plugin I’m using, but maybe there’s a workaround for this issue?

I was looking for a way to disable the background certificate maintenance task and handle it outside Caddy, with a script running on a Lambda for example.

I’m thinking about disabling the background maintenance task because, reading the logs, it looks like the task is started on each server instance, which could multiply the reads I’m seeing. But still, I’m not sure about that.

4. Error messages and/or full log output:

{"level":"info","ts":1636457588.0270805,"logger":"tls.cache.maintenance","msg":"started backgroun certificate maintenance","cache":"0xc000830fc0"}
{"level":"info","ts":1636457595.3115501,"logger":"tls.cache.maintenance","msg":"started backgroun certificate maintenance","cache":"0xc000426310"}
{"level":"info","ts":1636457579.519036,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000b14620"}

5. What I already tried:

  1. I’ve read the docs and increased the value of storage_clean_interval
  2. I’ve read Cost of this module · Issue #18 · silinternational/certmagic-storage-dynamodb · GitHub
  3. Searched for similar issues on both GitHub and here on the forum

6. Links to relevant resources:

I’m not sure there’s any better solution right now.

Would it be possible to use Redis as your storage backend instead? If you’re able to self-host that, costs would be essentially zero, and doing “scans” for expired certs would probably be nearly instant because those kinds of lookups are much more efficient with Redis.
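Swapping it in would look roughly like this in your global options block. I’m writing this from memory, so treat it as a sketch: the module name redis should be right, but check that plugin’s README for the exact connection options (it can also be configured through environment variables).

{
	storage redis {
		# connection settings (host, port, password, and so on) go here
		# per the plugin's README; I'm not certain of the exact option
		# names off the top of my head
	}
}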

You could also consider Consul. I have less experience with it though.

Really, the problem is that DynamoDB doesn’t seem to have a good way to fetch all the relevant data in one shot, so the plugin does many queries instead. And apparently that’s cost-prohibitive.

:man_shrugging:

Thanks for your answer!

Are you referring to this plugin: GitHub - gamalan/caddy-tlsredis: Redis Storage using for Caddy TLS Data?

Yes, it is possible to use Redis as my storage. I’m going to give it a try next week.

Yep, that one :+1:

It’s not really that DynamoDB can’t fetch the relevant data; it’s the way CertMagic continually does lists/loads, and a list requires a table scan because CertMagic keys are hierarchical.

DynamoDB is capable, but unfortunately its API and billing model make basic filesystem-style operations expensive.

I think this module’s impact could be reduced if the list function cached its results for a period of time, so that a subsequent list or load call could be served from the data a previous list already fetched.
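Something along these lines is what I have in mind. It’s only a sketch, not the plugin’s actual code: the method signature matches the pre-context certmagic.Storage interface that Caddy v2.4.x builds against, and the package, type, and constructor names are made up for illustration.

package dynamodbstorage

import (
	"sync"
	"time"

	"github.com/caddyserver/certmagic"
)

// cachedStorage wraps the real DynamoDB-backed certmagic.Storage and
// memoizes List results for a short period, so the maintenance loop's
// repeated List calls don't each turn into a full table scan.
type cachedStorage struct {
	certmagic.Storage // the underlying DynamoDB storage

	ttl time.Duration

	mu      sync.Mutex
	results map[string]cacheEntry // keyed by prefix plus the recursive flag
}

type cacheEntry struct {
	keys    []string
	fetched time.Time
}

func newCachedStorage(inner certmagic.Storage, ttl time.Duration) *cachedStorage {
	return &cachedStorage{
		Storage: inner,
		ttl:     ttl,
		results: make(map[string]cacheEntry),
	}
}

// List returns a recent cached result when one exists, and only falls
// through to the underlying storage (and its table scan) when the cached
// entry is older than the configured TTL. The trade-off is that keys
// created or deleted within that window may be missed or reported stale.
func (c *cachedStorage) List(prefix string, recursive bool) ([]string, error) {
	cacheKey := prefix
	if recursive {
		cacheKey += "|recursive"
	}

	c.mu.Lock()
	entry, ok := c.results[cacheKey]
	c.mu.Unlock()
	if ok && time.Since(entry.fetched) < c.ttl {
		return entry.keys, nil
	}

	keys, err := c.Storage.List(prefix, recursive)
	if err != nil {
		return nil, err
	}

	c.mu.Lock()
	c.results[cacheKey] = cacheEntry{keys: keys, fetched: time.Now()}
	c.mu.Unlock()
	return keys, nil
}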

Update:

After reaching 30k+ certificates, DynamoDB started throttling reads: the background maintenance task was consuming all of the available read capacity. So this happened next:

  1. When generating a new certificate, Caddy tried to check whether the certificate already existed in the database, but it couldn’t because of the throttling.
  2. So it generated a new one and stored it in the database (writes weren’t being throttled).
  3. Repeat :cold_sweat:

In some cases it generated 50 certificates for the same domain.

In my opinion, if you are planning to handle 20k+ certificates, you should avoid the DynamoDB plugin for now.
Maybe a refactor could disable the background maintenance task and rely on DynamoDB’s TTL feature instead?
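(For reference, DynamoDB TTL is enabled per table on a numeric attribute holding an expiry epoch timestamp, roughly like the command below. The plugin would first have to be refactored to write such an attribute; the attribute name ttl here is only a placeholder.)

aws dynamodb update-time-to-live \
  --table-name caddy-certificates \
  --time-to-live-specification "Enabled=true,AttributeName=ttl"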

Current setup:

  1. Caddy version 2.4.6
  2. 2 Replicas
  3. Filesystem
  4. 30k certificates

Caddyfile

{
	on_demand_tls {
		ask https://check-domain.internal.endpoint/cname
		interval 1m
		burst 200
	}
	storage_clean_interval 90d
}

https://

tls {
	on_demand

	issuer zerossl {env.ZEROSSL_TOKEN} {
		email {env.EMAIL}
		timeout 3m
	}

	issuer acme {
		email {env.EMAIL}
		timeout 3m
	}
}

reverse_proxy hostingapp

Questions:

  1. The background maintenance task is still taking a long time to finish.
    It has been running for 2 hours and hasn’t finished yet. Is this normal?

  2. Does running multiple replicas that share the same storage (filesystem) require any additional configuration?

Thank you!

Yikes

:scream:

Do you just mean that you’re not seeing stopped background certificate maintenance? If so, that’s normal: that message only shows up when the server is being shut down. A ticker runs in the background, starting with the server, and periodically runs the OCSP and certificate renewal jobs.

To clarify, the “storage cleaning” job is not the same as the “background maintenance” job.

Storage cleaning always happens once when the server starts, and then periodically according to storage_clean_interval. Its purpose is to delete old/expired assets (certs, keys, accounts) that weren’t otherwise removed due to being “forgotten”.

“Background maintenance” is all the OCSP and cert renewal jobs, and by default runs every 10 minutes. I guess this was “too fast” for DynamoDB with 30k certs (which is kinda crazy, that should be no issue… sigh). Unfortunately we haven’t made the renew_interval option configurable via the Caddyfile, because nobody’s actually had a need for it and asked us to make it configurable. But it is configurable via JSON right now. See JSON Config Structure - Caddy Documentation. Caddy will do the renewal checks on all the certs it has in memory, which is probably all 30k in your case.
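In the JSON config it sits under the tls app’s automation object, next to storage_clean_interval, so adding it looks something like this (the 6h value is only an example):

{
  "apps": {
    "tls": {
      "automation": {
        "renew_interval": "6h",
        "storage_clean_interval": "90d"
      }
    }
  }
}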

As long as the filesystem storage does atomic operations, you should be okay. Not really anything else to configure since Caddy will just use regular syscalls for file IO.

What are you using to share the storage between your replicas?

Ah yes, sorry for mixing them up :sweat_smile: this is the message:

{"level":"info","ts":1636704464.4828146,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc0002e1ce0"}

AWS EFS

I’m going to replace my Caddyfile with JSON and post back the results!

Thank you so much for all the help! :+1:

I converted my Caddyfile to JSON with the following command: caddy adapt --config Caddyfile --pretty, and added the renew_interval configuration. Caddy has now been running for 48 hours without major issues.
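For anyone following along, that boils down to roughly this (Caddy.json is just the file name I chose):

caddy adapt --config Caddyfile --pretty > Caddy.json
caddy run --config Caddy.json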

Before converting the Caddyfile to JSON and setting renew_interval, Caddy was freezing every 4–5 hours.

Something similar to this report: https://caddy.community/t/caddy-server-huge-drop-of-requests/14180/2

But I wasn’t able to find the root cause. My hypothesis is that the process was running every 10 minutes for 30k+ certificates and getting stuck.

One minor issue I’m having now is that a very small number of certificates are being generated twice. For reference, here is my Caddy.json:

{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [
            ":443"
          ],
          "routes": [
            {
              "handle": [
                {
                  "handler": "reverse_proxy",
                  "upstreams": [
                    {
                      "dial": "hostingapp"
                    }
                  ]
                }
              ]
            }
          ],
          "tls_connection_policies": [
            {}
          ]
        }
      }
    },
    "tls": {
      "automation": {
        "policies": [
          {
            "issuers": [
              {
                "acme_timeout": "30s",
                "api_key": "{env.ZEROSSL_TOKEN}",
                "email": "{env.EMAIL}",
                "module": "zerossl"
              },
              {
                "acme_timeout": "30s",
                "email": "{env.EMAIL}",
                "module": "acme"
              }
            ],
            "on_demand": true
          }
        ],
        "on_demand": {
          "rate_limit": {
            "interval": "1m",
            "burst": 10
          },
          "ask": "https://check-domain.internal.endpoint/cname"
        },
        "storage_clean_interval": "90d",
        "renew_interval": "6h"
      }
    }
  }
}

Any ideas or tips on how to debug this problem?

Thank you!

My only guess is that EFS isn’t fast enough, and two requests hit different Caddy instances at roughly the same time, before the filesystem had a chance to sync between the two. I hope that’s not the case.
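For context on why sync speed matters: issuance for a given name is coordinated through the storage’s lock, and with the file-system storage those locks are just files on disk, so if one replica can’t see the other’s lock file yet, both can go ahead and issue. The relevant bit of CertMagic is roughly this interface (exact signatures vary a little between versions):

package sketch

import "context"

// Locker is the locking half of certmagic.Storage; storage backends
// implement it so only one instance works on a given name at a time.
type Locker interface {
	// Lock acquires the lock for key, blocking until it is obtained
	// or an error occurs.
	Lock(ctx context.Context, key string) error

	// Unlock releases the lock for key.
	Unlock(key string) error
}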

Sorry I didn’t comment earlier; I didn’t have anything more than a hunch. But you marked it as answered, so it seems like it’s stable?

I suppose I’ll start recommending that AWS users use EFS instead of DynamoDB from now on.

Please consider commenting on the GitHub issue on the DynamoDB repo with your latest findings; I’m sure others will find it useful to know that it might not be safe to use with large numbers of certificates.

Hello,

Yes, it’s very stable now; the only remaining minor issue is that a very small percentage of certificates get duplicated. I’m going to enable debug mode and try to pin down the problem.
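(The plan is to add something like this at the top level of the JSON config, which as far as I understand is the JSON equivalent of the Caddyfile debug global option:)

{
  "logging": {
    "logs": {
      "default": {
        "level": "DEBUG"
      }
    }
  }
}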

I’m going to do that! Again, thank you so much for all the help provided! :smiley:
