How do I keep Caddy from consuming so many file descriptors?

I run a web hosting platform, and just replaced our Nginx + custom SSL setup with Caddy v2.2.1 for users who connect their own domain to our service.

Since doing this, we’ve regularly run into “too many open files” errors after about 12-18 hours of uptime. Restarting the Caddy server clears the problem completely (until roughly 12 hours later). We’ve tried raising the file limits available to Caddy, but I’m not sure of the exact setup needed, since this old machine runs Ubuntu 12.04 with Upstart instead of systemd.
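
From what I can tell, Upstart takes a limit stanza in the job definition; here is a minimal sketch, assuming the job file lives at /etc/init/caddy.conf (our actual path may differ):

# /etc/init/caddy.conf  (assumed path for the Upstart job)
# raise the soft and hard open-file limits for the Caddy process
limit nofile 65536 65536

Upstart only re-reads the job file on a full stop and start of the job, so a plain restart won’t pick the new limit up, and I haven’t confirmed yet whether this alone solves the leak.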

For a variety of reasons, we can’t upgrade this server at the moment. So for the short term, I’m trying to figure out how to make Caddy work reliably here. Are there any configuration changes I can make to ensure Caddy isn’t keeping so many file descriptors open?

Full Caddyfile:

{
    on_demand_tls {
        ask      https://-redacted-
        interval 1m
        burst    10
    }
}

https:// {
    tls webmaster@example.com {
        on_demand
    }
    reverse_proxy 127.0.0.1:8888 {
        header_up +X-HTTPS-Protocol https
    }
    log {
        output file /var/log/caddy.log {
            roll_keep 5
            roll_keep_for 336h
        }
        format console
    }
}

Error messages:

2020/12/09 15:59:39.583 ERROR   http.log.error.log0     dial tcp 127.0.0.1:9999: socket: too many open files

Debugging:

$ pidof caddy
757
$ ls -l /proc/757/fd | wc -l
1025
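
For completeness, the effective limit for that process can be read from /proc as well (same PID as above):

$ grep "Max open files" /proc/757/limits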

Are you behind a CDN like Cloudflare?

The majority of the custom domains we’re serving aren’t behind Cloudflare, but some are, yes. Do you think that’s the culprit?

I should mention that during the transition, I noticed that most sites using Cloudflare weren’t able to get a certificate issued. So we started instructing Cloudflare users to switch off their proxy service, and added some entries to our Caddyfile as a stopgap, which seemed to fix the connection issues. For example:

behind.cloudflare.com {
    tls internal
    reverse_proxy 127.0.0.1:8888 {
        header_up +X-HTTPS-Protocol tls
    }
}

Cloudflare is known for holding connections open indefinitely. We’ve already implemented a fix on our end that is available in the latest beta: caddyhttp: New idle_timeout default of 5m · caddyserver/caddy@1438e4d · GitHub

The TLS-ALPN challenge will always fail behind TLS termination, and the HTTP challenge will fail if the challenge request is not proxied through to Caddy. If one of those challenges can succeed, certificates won’t be a problem even behind a CDN, as long as the challenge requests reach the same Caddy cluster that initiated them. There’s also the DNS challenge, but it requires credentials for a DNS provider’s API.
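
As a rough illustration only (assuming a Caddy build that includes a DNS provider module such as caddy-dns/cloudflare and an API token in an environment variable, neither of which a stock v2.2.1 binary has), a DNS-challenge site block would look something like this:

behind.cloudflare.com {
    # assumes a custom build with the github.com/caddy-dns/cloudflare module
    # and the token exported as CLOUDFLARE_API_TOKEN
    tls webmaster@example.com {
        dns cloudflare {env.CLOUDFLARE_API_TOKEN}
    }
    reverse_proxy 127.0.0.1:8888
}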

Thanks, that’s great to know! If I didn’t want to run the beta yet, is there any way to set timeouts in the Caddyfile? Or would I need to switch to JSON config?

Yeah, in my testing, it seems Cloudflare causes the HTTP challenge to fail with “context deadline exceeded”:

2020/12/09 19:11:15.898 tls.issuance.acme.acme_client   deactivating authorization      {"identifier": "behind.cloudflare.com", "authz": "https://acme-v02.api.letsencrypt.org/acme/authz-v3/1111111111", "error": "request to https://acme-v02.api.letsencrypt.org/acme/authz-v3/1111111111 failed after 1 attempts: context deadline exceeded"}

There is, in the beta. :slight_smile: httpcaddyfile: Configure servers via global options (#3836) · caddyserver/caddy@3cfefeb · GitHub
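
To give a sense of the shape, the Caddyfile syntax from that change looks roughly like this (the 5m just mirrors the default the idle_timeout commit above introduced):

{
    servers {
        timeouts {
            idle 5m
        }
    }
}

In your config, that servers block would go inside the existing global options block alongside on_demand_tls, since a Caddyfile only allows one global options block.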

Hmm, that error is about deactivating the authorization, but what about the actual validation results? What are the full logs?
