Server not reachable after a few days

kristofferf · January 17, 2022, 9:28am

1. Caddy version (`caddy version`):

2.4.6

2. How I run Caddy:

Through Docker Compose, using the image caddy:2.4.6

a. System environment:

Ubuntu 20.04.3 LTS on DigitalOcean
docker-compose version 1.29.2

b. Command:

docker-compose up -d

c. Service/unit/compose file:

docker-compose.yml:

version: '3.7'

networks:
  bonsy:

services:
  caddy:
    image: caddy:2.4.6
    container_name: bonsy_caddy
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./docker/caddy/prod/Caddyfile:/etc/caddy/Caddyfile
      - ./site:/var/www
      - caddy_data:/data
      - caddy_config:/config
    depends_on:
      - php
      - mariadb
    networks:
      - bonsy

  php:
    build:
      args:
        user: ${PHP_USER}
        uid: ${PHP_UID}
      context: .
      dockerfile: ./docker/php/Dockerfile
    container_name: bonsy_php
    restart: unless-stopped
    volumes:
      - ./site:/var/www
    working_dir: /var/www
    links:
      - mariadb
    networks:
      - bonsy

  mariadb:
    image: mariadb:10.6.4
    container_name: bonsy_mariadb
    restart: unless-stopped
    tty: true
    ports:
      - "3306:3306"
    volumes:
      - db_data:/var/lib/mysql
      - ./docker/mariadb/initdb:/docker-entrypoint-initdb.d
      - ./docker/mariadb/prod.cnf:/etc/mysql/conf.d/custom.cnf
    environment:
      MARIADB_ROOT_PASSWORD: ${DB_ROOT_PASSWORD}
      MARIADB_DATABASE: ${DB_DATABASE}
    networks:
      - bonsy

volumes:
  caddy_data:
  caddy_config:
  db_data:
    driver: local

d. My complete Caddyfile or JSON config:

4.225.66.142:80,
bonsy.se,
app.bonsy.se {
    root * /var/www/public
    php_fastcgi /* php:9000
    encode gzip
    file_server
}

www.bonsy.se {
    redir https://bonsy.se{uri}
}

3. The problem I’m having:

Everything runs normally for a few days, sometimes more than a week. Then suddenly I can no longer access the site. Both the domain name and the IP address are down. Although the config above is for one server, I’m experiencing the exact same problem on another DigitalOcean droplet. The only difference is that the other domain seems to last longer before it stops working.

When running docker ps to check services, everything is running as normal.

I’m unsure if this issue is related to Caddy. If I run docker-compose stop caddy and then docker-compose start caddy everything gets back to normal.

4. Error messages and/or full log output:

The last logs before the server is not reachable
caddy_1 | {"level":"info","ts":1642340792.576553,"logger":"tls.cache.maintenance","msg":"advancing OCSP staple","identifiers":["bonsy.se"],"from":1642640398,"to":1642856398}
caddy_1 | {"level":"info","ts":1642340792.7062128,"logger":"tls.cache.maintenance","msg":"advancing OCSP staple","identifiers":["app.bonsy.se"],"from":1642640398,"to":1642856398}
caddy_1 | {"level":"info","ts":1642340792.8394566,"logger":"tls.cache.maintenance","msg":"advancing OCSP staple","identifiers":["www.bonsy.se"],"from":1642640398,"to":1642856398}
The logs after docker caddy service is restarted
caddy_1 | {"level":"info","ts":1642413590.9739013,"msg":"shutting down apps, then terminating","signal":"SIGTERM"}
caddy_1 | {"level":"warn","ts":1642413590.9739437,"msg":"exiting; byeee!! ","signal":"SIGTERM"}
caddy_1 | {"level":"info","ts":1642413590.9779565,"logger":"tls.cache.maintenance","msg":"stopped background certificate maintenance","cache":"0xc000533d50"}
caddy_1 | {"level":"info","ts":1642413590.9792933,"logger":"admin","msg":"stopped previous server","address":"tcp/localhost:2019"}
caddy_1 | {"level":"info","ts":1642413590.9793048,"msg":"shutdown complete","signal":"SIGTERM","exit_code":0}
caddy_1 | {"level":"info","ts":1642413596.872622,"msg":"using provided configuration","config_file":"/etc/caddy/Caddyfile","config_adapter":"caddyfile"}
caddy_1 | {"level":"info","ts":1642413596.8796692,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["localhost:2019","[::1]:2019","127.0.0.1:2019"]}
caddy_1 | {"level":"info","ts":1642413596.8798656,"logger":"http","msg":"server is listening only on the HTTPS port but has no TLS connection policies; adding one to enable TLS","server_name":"srv0","https_port":443}
caddy_1 | {"level":"info","ts":1642413596.8798788,"logger":"http","msg":"enabling automatic HTTP->HTTPS redirects","server_name":"srv0"}
caddy_1 | {"level":"info","ts":1642413596.8798866,"logger":"http","msg":"server is listening only on the HTTP port, so no automatic HTTPS will be applied to this server","server_name":"srv1","http_port":80}
caddy_1 | {"level":"info","ts":1642413596.8829477,"logger":"http","msg":"enabling automatic TLS certificate management","domains":["bonsy.se","app.bonsy.se","www.bonsy.se"]}
caddy_1 | {"level":"info","ts":1642413596.888416,"logger":"tls.cache.maintenance","msg":"started background certificate maintenance","cache":"0xc000539e30"}
caddy_1 | {"level":"info","ts":1642413596.888456,"logger":"tls","msg":"cleaning storage unit","description":"FileStorage:/data/caddy"}
caddy_1 | {"level":"info","ts":1642413596.8890398,"logger":"tls","msg":"finished cleaning storage units"}
caddy_1 | {"level":"info","ts":1642413597.2713878,"msg":"autosaved config (load with --resume flag)","file":"/config/caddy/autosave.json"}
caddy_1 | {"level":"info","ts":1642413597.2715614,"msg":"serving initial configuration"}

5. What I already tried:

Rechecked all settings. Tried the caddy fmt --overwrite command. Searched Google and the forum for similar errors, but couldn’t find other cases where Docker or Caddy stops after a few days.

6. Links to relevant resources:

N/A

francislavoie · January 17, 2022, 11:34am

Interestingly, the forum software also runs on DigitalOcean and sometimes stops working as well, rejecting even SSH connections. It might be an issue with DO.

But the fact that restarting the container fixes it would suggest that it’s not an issue with DO.

Are you able to make a request from that machine itself to Caddy using curl -v when it goes down?

Try turning on the debug global option, and watch the logs when making requests when it “goes down”. It might hopefully explain what’s going on. Add this at the top of your Caddyfile:

{
	debug
}

Maybe you could look at Docker’s own logs to see if something started breaking there at some point.

Docker does often run a userland proxy, you could experiment with turning that off (via a docker daemon option).

mdathersajjad · January 17, 2022, 11:54am

I faced this issue too. It runs perfectly fine for a few days, then stops all of a sudden and I get a barrage of emails saying the site is down. I tried monitoring the logs; all I see is that the request hits, and the response size is 0 and the status is 0. If I kill and restart the process, everything works smoothly again. I tried increasing the open socket limit to more than 200,000 (2 lakh), with the same result.

kristofferf · January 17, 2022, 2:33pm

Thanks for replying! I will test curl next time the server goes down. It’s hard to debug when it happens so infrequently.

I will add debug to the Caddyfile to see if it says anything, and also check the Docker log. I have checked the Docker log for each container, but it doesn’t show any errors. I will check out the userland proxy for Docker too.

kristofferf · January 17, 2022, 2:34pm

Strange, it seems to be a similar problem to mine. Follow along, and I’ll add a reply if I find something that works.

matt · January 17, 2022, 5:04pm

What does “down” mean? No route to host? Connection handshake doesn’t complete? Timeout? Connection refused? Request is read but no response received?
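
A rough way to tell these cases apart from a script, as a sketch only: the `probe` function and its labels are made up for illustration, and it speaks plain HTTP over TCP (port 80 by default), so it does not cover a stalled TLS handshake. curl -v remains the better tool for a real report.

```python
import socket


def probe(host, port=80, timeout=5.0):
    """Classify how a plain-HTTP request to host:port fails (illustrative labels)."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except socket.gaierror:
        return "name resolution failed"
    except ConnectionRefusedError:
        return "connection refused"
    except socket.timeout:
        return "connect timeout"
    except OSError as exc:
        # Includes "no route to host" and "network unreachable".
        return f"connect error: {exc}"

    with sock:
        request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n"
        try:
            sock.sendall(request.encode("ascii"))
            data = sock.recv(1024)
        except socket.timeout:
            return "request sent, read timed out"
        except OSError as exc:
            return f"read error: {exc}"

    # recv() returning no bytes means the peer closed without answering.
    return "got response bytes" if data else "empty reply"

# Example: print(probe("127.0.0.1", 80)) on the affected machine.
```

Running it both on the server itself and from outside would help show whether the problem sits in front of Caddy or inside it.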

mdathersajjad · January 18, 2022, 8:03am

“Down” means the Caddy server is not responding to any request. In the browser the page just keeps loading. In the Caddy logs I see status: 0, size: 0. Since that gets logged, I think Caddy receives the request but doesn’t respond with any content. I will try to post a sample log entry when it happens. I added a few of the hosts I manage to a down detector. After a few days of running, I receive emails that the host is down. If I don’t do anything, it resolves by itself after 5-15 minutes and Caddy starts responding again. If I want to resolve it quickly, I just stop and restart Caddy.

matt · January 18, 2022, 8:07am

Ok, but more detail would be helpful. Don’t use browsers, use curl -v. Then post exact input and output and caddy logs too (with debug mode on). That will give us the best picture. Thanks!

kristofferf · January 18, 2022, 8:42am

In my case I’m unable to open the site in a browser. Neither the domain nor the IP address responds. A GET request with curl gets no response either. But I’m able to SSH into the server. Neither the PHP log nor the Caddy log registers any error or traffic. My site isn’t critical yet, so I will leave it broken next time it happens so I can debug.

kristofferf · January 18, 2022, 8:45am

Not in my case. It stays broken until I restart one of the containers. Yesterday I restarted the PHP container and the site was available again, so it is not clear that this is a Caddy problem. But again, I didn’t have this issue with Nginx.

mdathersajjad · January 18, 2022, 10:02am

I will paste the curl -v response when it does the same thing again. Here is the JSON that gets logged when it’s down or not responding.

{
  "level": "info",
  "ts": 1639460479.0482252,
  "logger": "http.log.access.log0",
  "msg": "handled request",
  "request": {
    "remote_addr": "na",
    "proto": "HTTP/2.0",
    "method": "GET",
    "host": "na",
    "uri": "/",
    "headers": {
      "Accept-Language": ["en-US,en;q=0.9"],
      "Accept-Encoding": ["gzip, deflate, br"],
      "User-Agent": [
        "Mozilla/5.0 (iPhone; CPU iPhone OS 15_1_1 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Mobile/15E148 Safari/604.1"
      ],
      "Accept": [
        "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
      ]
    },
    "tls": {
      "resumed": false,
      "version": 772,
      "cipher_suite": 4865,
      "proto": "h2",
      "proto_mutual": true,
      "server_name": "na"
    }
  },
  "common_log": "Upstream IP - - [14/Dec/2021:05:41:19 +0000] \"GET / HTTP/2.0\" 0 0",
  "duration": 0.62777269,
  "size": 0,
  "status": 0,
  "resp_headers": { "Server": ["Caddy"] }
}
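
For reading entries like this one: `ts` is a Unix timestamp, and `tls.version` / `tls.cipher_suite` are the numeric TLS identifiers (772 is 0x0304, 4865 is 0x1301). A small sketch for decoding them follows. The helper name `describe_entry` and the two lookup tables are illustrative, and the tables cover only the values that appear in this thread.

```python
import json
from datetime import datetime, timezone

# Partial lookup tables: only the values seen in this thread plus TLS 1.2.
TLS_VERSIONS = {0x0303: "TLS 1.2", 0x0304: "TLS 1.3"}
CIPHER_SUITES = {0x1301: "TLS_AES_128_GCM_SHA256"}


def describe_entry(line):
    """Turn one JSON access-log line into a few human-readable fields."""
    entry = json.loads(line)
    tls = entry.get("request", {}).get("tls", {})
    when = datetime.fromtimestamp(entry["ts"], tz=timezone.utc)
    return {
        "time_utc": when.strftime("%Y-%m-%d %H:%M:%S"),
        "tls_version": TLS_VERSIONS.get(tls.get("version"), "unknown"),
        "cipher_suite": CIPHER_SUITES.get(tls.get("cipher_suite"), "unknown"),
        "status": entry.get("status"),
        "size": entry.get("size"),
    }


# Trimmed copy of the entry above, keeping only the fields being decoded.
sample = (
    '{"ts": 1639460479.0482252, '
    '"request": {"tls": {"version": 772, "cipher_suite": 4865}}, '
    '"status": 0, "size": 0}'
)
print(describe_entry(sample))
```

The decoded time lines up with the `common_log` field in the same entry (14/Dec/2021:05:41:19 +0000).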

mdathersajjad · January 18, 2022, 10:05am

My configuration is simple: listening on port 443, with a reverse proxy and on-demand TLS. This is how it looks:

{
    on_demand_tls {
        ask https://askendpoint
        interval 1m
        burst    5
    }
}
:443 {
    tls email {
        on_demand
    }
    reverse_proxy upstream_hostname {
        header_up Host hostname
        header_up Domain {host}
        header_up X-Forwarded-Port {server_port}
        health_timeout 5s
    }
    log {
        output file /var/log/caddy/access.log {
                roll_size 1gb
                roll_keep 4
                roll_keep_for 720h

                format filter {
                        wrap console
                        fields {
                                request>headers>Authorization delete
                                request>resp_headers delete
                                request>headers>Cookie delete
                        }
                }
        }
    }
}

mdathersajjad · January 18, 2022, 10:06am

I have confirmed that when the Caddy server does not respond, the upstream server is up and responding fine. There is no problem with the upstream server. This is the curl output, @matt.

It happened again. I was able to run curl -v. The down detector reported that the connection timed out after 25 seconds. The curl -v output is listed below.

curl -X GET customerdomain -v
Note: Unnecessary use of -X or --request, GET is already inferred.
*   Trying caddyserverip:443...
* TCP_NODELAY set
* Connected to customerdomain (caddyserverip) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256
* ALPN, server accepted to use h2
* Server certificate:
*  subject: CN=customerdomain
*  start date: Jan 13 08:58:11 2022 GMT
*  expire date: Apr 13 08:58:10 2022 GMT
*  subjectAltName: host "customerdomain" matched cert's "customerdomain"
*  issuer: C=US; O=Let's Encrypt; CN=R3
*  SSL certificate verify ok.
* Using HTTP2, server supports multi-use
* Connection state changed (HTTP/2 confirmed)
* Copying HTTP/2 data in stream buffer to connection buffer after upgrade: len=0
* Using Stream ID: 1 (easy handle 0x55a8c0cabe30)
> GET / HTTP/2
> Host: customerdomain
> user-agent: curl/7.68.0
> accept: */*
> 
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* Connection state changed (MAX_CONCURRENT_STREAMS == 250)!



* Empty reply from server
* Connection #0 to host customerdomain left intact
curl: (52) Empty reply from server

francislavoie · January 18, 2022, 2:23pm

Hmm, okay from those last logs, it looks like it does connect, but it seems like you didn’t even receive any response headers? For example I don’t see Server: Caddy in there.

It looks like you didn’t have the debug global option turned on here, though. I think we’ll need to see the debug logs to get an idea of what Caddy is doing. Caddy will print its additional logs to stdout/stderr (I’m not talking about access logs here, I’m talking about runtime logs; the log directive configures access logs).

If Caddy does get to the point that it does write out the access logs, I have to think that it did attempt to write a response because that happens at the end of the request.

mdathersajjad · January 18, 2022, 2:33pm

Yes, it’s not turned on. It’s a production server. I will try adding it to provide additional info. I don’t know exactly when the problem happens, so I will add it and see how many logs Caddy generates.

mdathersajjad · January 19, 2022, 4:47am

@francislavoie I added debug above on_demand_tls, like this:

{
    debug
    on_demand_tls {
        ask https://askendpoint
        interval 1m
        burst    5
    }
}

But I don’t see any log entries in access.log with level debug. Is that fine?

matt · January 19, 2022, 4:51am

Yes, access logs don’t currently write debug-level logs; you’ll see them in your regular / process logs (stderr by default).

matt · January 19, 2022, 4:52am

What does this mean precisely, though? Please post the full command and its full output (use curl -v).

kristofferf · January 19, 2022, 9:23am

Yes, I will when the server goes down again. It usually takes a few days. What I meant was that I have built an API on another DO droplet that is experiencing the same problem. A WordPress plugin that I’ve made fetches data from this droplet using cURL in PHP. The plugin doesn’t get any response from the API server when this problem occurs. But that is probably not the same as using curl -v. I’ll get back with details when my server is down again.

mdathersajjad · January 19, 2022, 1:06pm

@matt @francislavoie

It happened just now: Caddy didn’t respond to any request. I got the debug logs. Here is the pattern for one URL.

{
    "level": "debug",
    "ts": 1642596149.0729373,
    "logger": "http.handlers.reverse_proxy",
    "msg": "upstream roundtrip",
    "upstream": "upstreamserver:443",
    "request": {
      "remote_addr": "REMOTEADDRESS",
      "proto": "HTTP/2.0",
      "method": "GET",
      "host": "upstreamserver",
      "uri": "/",
      "headers": {
        "Sec-Ch-Ua-Mobile": ["?0"],
        "Sec-Fetch-User": ["?1"],
        "Sec-Ch-Ua-Platform": ["\"Linux\""],
        "User-Agent": [
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43"
        ],
        "Sec-Fetch-Dest": ["document"],
        "Accept-Encoding": ["gzip, deflate, br"],
        "X-Forwarded-For": ["XFORWARDEDFOR"],
        "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8"],
        "Upgrade-Insecure-Requests": ["1"],
        "Sec-Fetch-Mode": ["navigate"],
        "Sec-Ch-Ua": [
          "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"96\", \"Microsoft Edge\";v=\"96\""
        ],        
        "Domain": ["userdomain"],
        "X-Forwarded-Port": [""],
        "X-Forwarded-Proto": ["https"],
        "Accept": [
          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
        ],
        "Sec-Fetch-Site": ["none"]
      },
      "tls": {
        "resumed": true,
        "version": 772,
        "cipher_suite": 4865,
        "proto": "h2",
        "proto_mutual": true,
        "server_name": "userdomain"
      }
    },
    "duration": 45.448787874,
    "error": "context canceled"
  },
  {
    "level": "info",
    "ts": 1642596149.0730722,
    "logger": "http.log.access.log0",
    "msg": "handled request",
    "request": {
      "remote_addr": "REMOTEADDRESS",
      "proto": "HTTP/2.0",
      "method": "GET",
      "host": "userdomain",
      "uri": "/",
      "headers": {
        "Accept-Encoding": ["gzip, deflate, br"],
        "Sec-Ch-Ua-Platform": ["\"Linux\""],
        "Upgrade-Insecure-Requests": ["1"],
        "User-Agent": [
          "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.55 Safari/537.36 Edg/96.0.1054.43"
        ],
        "Accept": [
          "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"
        ],
        "Sec-Fetch-Mode": ["navigate"],
        "Sec-Fetch-Dest": ["document"],
        "Sec-Ch-Ua": [
          "\" Not A;Brand\";v=\"99\", \"Chromium\";v=\"96\", \"Microsoft Edge\";v=\"96\""
        ],
        "Sec-Ch-Ua-Mobile": ["?0"],
        "Sec-Fetch-Site": ["none"],
        "Sec-Fetch-User": ["?1"],
        "Accept-Language": ["en-GB,en;q=0.9,en-US;q=0.8"],        
      },
      "tls": {
        "resumed": true,
        "version": 772,
        "cipher_suite": 4865,
        "proto": "h2",
        "proto_mutual": true,
        "server_name": "userdomain"
      }
    },
    "common_log": "XFORWARDEDFOR - - [19/Jan/2022:12:42:29 +0000] \"GET / HTTP/2.0\" 0 0",
    "duration": 45.449036952,
    "size": 0,
    "status": 0,
    "resp_headers": { "Server": ["Caddy"] }
  },

The debug log says “context canceled”. What does that mean? Can you please help? I restarted the server immediately and it started processing requests again.
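
To pull the same pattern out of a larger log, something along these lines might help. It is a sketch that assumes the process log is JSON, one entry per line; the function name `find_stalled` is made up, and the field names (`logger`, `msg`, `status`, `duration`, `error`) are taken from the entries pasted above.

```python
import json


def find_stalled(lines):
    """Yield (logger, duration, error) for entries with status 0 or an error field."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip anything that is not JSON (e.g. console-formatted lines)
        if not isinstance(entry, dict):
            continue
        status_zero = entry.get("msg") == "handled request" and entry.get("status") == 0
        if status_zero or "error" in entry:
            yield entry.get("logger"), entry.get("duration"), entry.get("error")


# Three reduced sample lines modelled on the entries above, plus one healthy request.
sample_lines = [
    '{"logger": "http.handlers.reverse_proxy", "msg": "upstream roundtrip", '
    '"duration": 45.448787874, "error": "context canceled"}',
    '{"logger": "http.log.access.log0", "msg": "handled request", '
    '"duration": 45.449036952, "size": 0, "status": 0}',
    '{"logger": "http.log.access.log0", "msg": "handled request", '
    '"duration": 0.01, "size": 512, "status": 200}',
]
for hit in find_stalled(sample_lines):
    print(hit)
```

Comparing the durations of the matching entries against the client-side timeout (25 seconds for the down detector mentioned earlier) may show which side is giving up first.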