Caddy becomes unresponsive after /load API

1. The problem I’m having:

For some reason, after adding the NTLM plugin (which may or may not be related), Caddy hangs after I call the /load config API (with or without must-revalidate).

I have only experienced this on production servers (it is not reproducible in dev).

I initially thought it might have something to do with the number of TCP connections, so I set the grace period to 1s, but to no avail.
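For clarity, that is just this fragment of the config (the full config in 4d below uses the equivalent integer form, 1000000000 nanoseconds; as far as I know Caddy also accepts a duration string here):

{
  "apps": {
    "http": {
      "grace_period": "1s"
    }
  }
}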

Steps to reproduce (in my setup):

  1. /load just the certificates when the server starts.
  2. Once the server gets the applications from the backend, it POSTs them to localhost:2019/config/apps/http/servers/srv0/routes/... (see the example requests after this list).
  3. (This runs fine for some time; users start accessing the apps and the debug logs show many connections.)
  4. After X minutes the backend sends the list of new applications again.
  5. Then I “reload” the server by repeating steps 1 and 2.
  6. But when I load the empty config (see 4d below), the /config/ API stops responding, so I cannot add the applications in step 2, causing downtime in production.
  7. The only way I can fix it right now is a script that monitors the /config/ API and, if it does not respond, kills Caddy and starts it again.
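For reference, the two admin calls look roughly like this. The file name, host, and upstream are placeholders, and the .../routes/... path above is abbreviated, so this sketch just POSTs to the routes array (which appends an element):

# step 1: load the base config (certificates only) — the full JSON is in 4d below
curl localhost:2019/load -H "Content-Type: application/json" -d @base-config.json

# step 2 (sketch): append one application's route to srv0; host/upstream are made up
curl -X POST localhost:2019/config/apps/http/servers/srv0/routes \
  -H "Content-Type: application/json" \
  -d '{"match":[{"host":["app1.SLUG-48.gw.app.DOMAIN.io"]}],"handle":[{"handler":"reverse_proxy","upstreams":[{"dial":"127.0.0.1:8080"}]}]}'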

2. Error messages and/or full log output:

{"level":"debug","ts":1713409248.0986166,"logger":"http","msg":"starting server loop","address":"[::]:443","tls":true,"http3":false}
{"level":"info","ts":1713409248.099862,"logger":"http.log","msg":"server running","name":"srv0","protocols":["h1"]}
{"level":"debug","ts":1713409248.1021903,"logger":"http","msg":"[Stop] Stop gracefully shuts down HTTP Server"}
{"level":"info","ts":1713409248.102396,"logger":"http","msg":"servers shutting down; grace period initiated","duration":1}
{"level":"debug","ts":1713409248.1028795,"logger":"http","msg":"[Stop][app.Servers] Inside the for loop"}
{"level":"debug","ts":1713409248.1029367,"logger":"http","msg":"before startedShutdown.Wait()"}
{"level":"debug","ts":1713409248.1029773,"logger":"http","msg":"[App][Stop] Inside stopServer routine"}
{"level":"debug","ts":1713409248.1041775,"logger":"http","msg":"after startedShutdown.Wait()"}
{"level":"debug","ts":1713409248.1046526,"logger":"http","msg":"after caddy.Exiting() finishedShutdown.Wait()"}
{"level":"debug","ts":1713409248.1050546,"logger":"http","msg":"all servers stop hooks called."}
{"level":"debug","ts":1713409255.2162657,"logger":"events","msg":"event","name":"tls_get_certificate","id":"a7143d90-0910-4f26-b190-fdf3e9d711db","origin":"tls","data":{"client_hello":{"CipherSuites":[6682,4865,4866,4867,49195,49199,49196,49200,52393,52392,49171,49172,156,157,47,53],"ServerName":"621de2378fb0100001b01e9a.SLUG-48.gw.app.DOMAIN.io","SupportedCurves":[2570,29,23,24],"SupportedPoints":"AA==","SignatureSchemes":[1027,2052,1025,1283,2053,1281,2054,1537],"SupportedProtos":["h2","http/1.1"],"SupportedVersions":[6682,772,771],"RemoteAddr":{"IP":"106.195.2.171","Port":42990,"Zone":""},"LocalAddr":{"IP":"10.33.186.156","Port":443,"Zone":""}}}}
{"level":"debug","ts":1713409255.216553,"logger":"tls.handshake","msg":"no matching certificates and no custom selection logic","identifier":"621de2378fb0100001b01e9a.SLUG-48.gw.app.DOMAIN.io"}
{"level":"debug","ts":1713409255.2165728,"logger":"tls.handshake","msg":"choosing certificate","identifier":"*.SLUG-48.gw.app.DOMAIN.io","num_choices":1}
{"level":"debug","ts":1713409255.2165983,"logger":"tls.handshake","msg":"default certificate selection results","identifier":"*.SLUG-48.gw.app.DOMAIN.io","subjects":["SLUG-48.gw.app.DOMAIN.io","*.SLUG-48.gw.app.DOMAIN.io"],"managed":false,"issuer_key":"","hash":"504822b710543addb05d07478e425e274bc1ce6e9f19bd0e54770258cad2820e"}
{"level":"debug","ts":1713409255.2166154,"logger":"tls.handshake","msg":"matched certificate in cache","remote_ip":"106.195.2.171","remote_port":"42990","subjects":["SLUG-48.gw.app.DOMAIN.io","*.SLUG-48.gw.app.DOMAIN.io"],"managed":false,"expiration":1719727084,"hash":"504822b710543addb05d07478e425e274bc1ce6e9f19bd0e54770258cad2820e"}

3. Caddy version:

Master as of April 18th, 2024 (the 2.7.6 release plus the commits merged after it)

4. How I installed and ran Caddy:

a. System environment:

Ubuntu

b. Command:

./caddy start

c. Service/unit/compose file:

d. My complete Caddy config:

{
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [
            ":443"
          ],
          "protocols": [
            "h1"
          ],
          "routes": null,
          "tls_connection_policies": [
            {
              "match": {
                "sni": [
                  "*.SLUG-48.gw.app.DOMAIN.io"
                ]
              }
            },
            {}
          ],
          "automatic_https": {
            "disable": true
          }
        }
      },
      "grace_period": 1000000000
    },
    "tls": {
      "certificates": {
        "load_files": [
          {
            "certificate": "/opt/sdpgw/letsencrypt/fullchain.pem",
            "key": "/opt/sdpgw/letsencrypt/privatekey.pem"
          }
        ]
      }
    }
  },
  "logging": {
    "sink": {
      "writer": {
        "output": "file",
        "filename": "/opt/instasafe/caddy.logs",
        "roll": true,
        "roll_size_mb": 100,
        "roll_keep": 10,
        "roll_keep_days": 7
      }
    },
    "logs": {
      "Empty": {
        "writer": {
          "output": "file",
          "filename": "/opt/instasafe/caddy.logs",
          "roll": true,
          "roll_size_mb": 100,
          "roll_keep": 10,
          "roll_keep_days": 7
        },
        "level": "DEBUG"
      }
    }
  }
}

5. Links to relevant resources:

What are those? The Caddy version you mention doesn’t emit logs like that. 🧐

The reason I ask is that if there are code changes that aren’t being disclosed, then troubleshooting is kind of a waste of time until we’re on the same page first.

That’s fair enough.

It is basically me adding a bunch of logs to understand where it might be getting stuck. My initial hunch was the WaitGroups in the servers’ Stop method, but since “all servers stop hooks called” is logged, the issue must be somewhere further down the line.

No other functional changes were made, just some extra logs to aid in debugging 🙂


Ok. Well, can you provide a minimally reproducible test case, so that we can easily see the behavior and troubleshoot it?

Also, a complete log of curl -v commands (their inputs and outputs) as well as the server-side logs of this process would be very helpful.
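For example, something along these lines (hypothetical file name; adjust to the requests you actually send):

# capture the verbose request/response of the reload to a file
curl -v localhost:2019/load -H "Content-Type: application/json" -d @base-config.json 2>&1 | tee load-request.log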


I am not sure how best to reproduce it, as it only occurs in production.

These are the steps:

  1. ./caddy start
  2. Load 200+ applications
  3. Let users start accessing them
“Clear” all apps (reload the config) [the config is provided above]
  5. Load apps again

It never gets to step 5 because it gets stuck at step 4.

Once it goes into this “stale” state, the curl command does not return:

curl localhost:2019/config/


(It stays there indefinitely.)
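For what it’s worth, the watchdog script mentioned in step 7 above is basically a bounded version of that same check, roughly like this (a sketch; the 5-second timeout and the pkill/restart commands are just how I run it as a plain process):

# if the admin API does not answer within 5 seconds, assume the hang and restart Caddy
if ! curl --silent --max-time 5 localhost:2019/config/ > /dev/null; then
    pkill caddy
    ./caddy start
fi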

If any user accesses it during this time, they just get a white screen.

PS: Not sure if it is of any help, but the NTLM module is loaded and the server is running with h1 only.
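In case it is useful, this is how I confirm the module is in the binary (the standard list-modules command, nothing custom):

./caddy list-modules | grep -i ntlm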

Ok. Since I can’t reproduce it, you’re going to have to gather some more information, I guess. How about a profile? While it’s “stuck”, can you get one? On the machine, go to http://localhost:2019/debug/pprof/ and download the heap and goroutine snapshots. More details/instructions:

It’ll be good to know where in the code it is at the time.
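For example, something like this against the admin endpoint (these are the standard Go pprof routes Caddy exposes there):

# full text dump of every goroutine and where it is blocked
curl -o goroutines.txt 'http://localhost:2019/debug/pprof/goroutine?debug=2'

# heap profile in pprof format
curl -o heap.pprof 'http://localhost:2019/debug/pprof/heap'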