Caddy keeps going inactive

Core Issue

I’ve only had this issue on one of my development servers, but until I understand it, it makes me really paranoid about any production server running Caddy.

I’m using systemd to run Caddy as a service. At seemingly random intervals, the Caddy service simply stops working: sometimes it runs for a week or two before going down, sometimes it goes down within a day of my restarting it.

The service logs are never helpful. Here’s the output from “service caddy status”:

● caddy.service - Caddy HTTP/2 web server
   Loaded: loaded (/etc/systemd/system/caddy.service; enabled; vendor preset: enabled)
   Active: inactive (dead) since Mon 2017-04-24 03:36:13 EDT; 5h 47min ago
     Docs: https://caddyserver.com/docs
  Process: 17870 ExecStart=/usr/local/bin/caddy -log stdout -agree=true -conf=/etc/caddy/Caddyfile -root=/var/tmp (code=killed, signal=PIPE)
 Main PID: 17870 (code=killed, signal=PIPE)

Apr 24 00:36:13 dev-server caddy[17870]: 2017/04/24 00:36:13 [INFO] Done checking OCSP staples
Apr 24 00:44:59 dev-server caddy[17870]: 2017/04/24 00:44:59 http: TLS handshake error from 71.6.202.196:60000: EOF
Apr 24 00:49:26 dev-server caddy[17870]: 2017/04/24 00:49:26 [INFO] www.pj85896.com - No such site at :80 (Remote: 23.247.72.83, Referer: )
Apr 24 01:36:13 dev-server caddy[17870]: 2017/04/24 01:36:13 [INFO] Scanning for stale OCSP staples
Apr 24 01:36:13 dev-server caddy[17870]: 2017/04/24 01:36:13 [INFO] Done checking OCSP staples
Apr 24 01:55:41 dev-server caddy[17870]: 2017/04/24 01:55:41 [INFO] - No such site at :80 (Remote: 222.144.186.189, Referer: )
Apr 24 02:36:13 dev-server caddy[17870]: 2017/04/24 02:36:13 [INFO] Scanning for expiring certificates
Apr 24 02:36:13 dev-server caddy[17870]: 2017/04/24 02:36:13 [INFO] Done checking certificates
Apr 24 02:36:13 dev-server caddy[17870]: 2017/04/24 02:36:13 [INFO] Scanning for stale OCSP staples
Apr 24 02:36:13 dev-server caddy[17870]: 2017/04/24 02:36:13 [INFO] Done checking OCSP staples

Note that “www.pj85896.com” is not a domain I own, nor is it hosted on this server. I see tons of similar requests throughout the day where people attempt to access non-existent websites on my server. I figure it’s either a script trying to fuzz the web server, or this IP address once belonged to those domains and some automated task still hasn’t been updated.

The only clue I have as to what’s going on is that the crash always comes exactly one hour after the last “Done checking OCSP staples” – precisely when the next hourly “Scanning for stale OCSP staples” step would be due – which suggests it’s crashing while trying to perform that scan.
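To pin down the exact second the process died (and confirm it lines up with the scan schedule), I can compare systemd’s own record against the journal – a quick sketch, assuming a systemd recent enough to expose these properties:

systemctl show caddy.service -p ExecMainExitTimestamp -p Result
journalctl -u caddy.service --since "2017-04-24 02:30" --until "2017-04-24 04:00"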

Additional Info

Here is my Caddyfile:

import vhosts/*

And my vhosts/* contains six files for different development subdomains – all things like “client1.clients.example.com” and “client2.clients.example.com”.

Each of these vhosts/client1.clients.example.com files has the same layout:

client1.clients.example.com {
    header / Strict-Transport-Security "max-age=1814400; includeSubDomains; preload"
    header /css/ Cache-Control "max-age=2592000"
    header /js/ Cache-Control "max-age=2592000"
    header /img/ Cache-Control "max-age=2592000"
    root /mnt/dev-code/client1/public
    fastcgi / /var/run/php/php7.0-fpm.sock php
    log /var/log/caddy/access/client1.log
    errors /var/log/caddy/errors/client1.log
    gzip
    rewrite / {
        to {uri} {uri}/ /index.php?{query}
    }
}
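(For the curious: the rewrite block is the standard front-controller pattern – Caddy rewrites to the first target that exists, so it tries the literal {uri}, then the directory {uri}/, and finally falls back to /index.php with the original query string.)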

The access and error logs give no hints around the time of the crash, although, amusingly, they are flooded with requests for /wp-admin.php and the like.

Running journalctl -u caddy also doesn’t give any useful information – the last few lines are always identical to the last few lines of service caddy status, which itself didn’t tell me much.

My dev server has 4 GB of RAM, and even when I’m actively using the web server and downloading multiple files, it almost never exceeds 500 MB in use – most of which is MySQL.

What does “stop working” mean exactly? Is that the full log?

By “stop working” I mean the service status changes to “Inactive (dead)” and the server stops responding to requests on port 80.

The log I posted is the full output of service caddy status – it’s a subset of journalctl -u caddy.service, which I can dump in full if you think it would help. If there are any other logs that might be useful, let me know and I’ll dump them too.
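For reference, this is how I’d dump it (assuming a plain journalctl dump is all that’s wanted):

journalctl -u caddy.service --no-pager > caddy-journal.txt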

I am unaware of any logs created by Caddy beyond the access and error logs, neither of which had any meaningful information in them.

I’m running Caddy v0.9.0 if that affects anything

Hmm, why are you running such an old version? That’s nearly a year old. I can help you better after you upgrade.


It’s the version I’ve had on my dev server since I installed Caddy, and there’s no convenient way to auto-update Caddy that I’m aware of (I can’t just apt-get upgrade).

I’ll bump the version and come back here to comment if it happens again. As I mentioned, sometimes it happens after a day and sometimes after two weeks, so there’s no telling when the next crash will be (if it happens at all).

Sure, just keep me posted. I really can’t remember the nuances of v0.9.0 anymore.

(FYI, there is an easy way to update but it doesn’t float everyone’s boat. But for some it does! Just today: https://twitter.com/amstutzIT/status/856619354259148800)
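(Assuming the tweet is about the official one-line installer script – which is what I’d guess – it amounts to something like:

curl https://getcaddy.com | bash

As always, inspect a script before piping it into bash.)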

Other than that there’s not enough here to triage this issue.

Caddy gets the SIGPIPE signal, exits, doesn’t get restarted, and hence becomes inactive. We’d need to know what causes that signal (or what closes Caddy’s stdout and/or stderr, or the connection it wants to write to).
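To illustrate the mechanism with something contrived (not your setup – just the classic broken-pipe behaviour of any process whose reader goes away):

$ yes 'log line' | head -n 1; echo "writer exit status: ${PIPESTATUS[0]}"
log line
writer exit status: 141

141 is 128 + 13, i.e. killed by SIGPIPE – the same (code=killed, signal=PIPE) systemd reported above.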

In the meantime you could run it with the option RestartForceExitStatus=SIGPIPE and, on the next restart, go through the logs of all services and the kernel (share what you find).
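A drop-in override is enough for that – a sketch, with a hypothetical file name:

# /etc/systemd/system/caddy.service.d/sigpipe.conf
[Service]
RestartForceExitStatus=SIGPIPE

# afterwards: systemctl daemon-reload && systemctl restart caddy

This forces a restart specifically when the main process is killed by SIGPIPE, regardless of the unit’s Restart= setting.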

Chances are we’re looking here at an issue that has already been addressed in a more recent version of systemd, Golang, or Caddy.

@wmark

If it’s getting a signal to exit, I can only assume that signal is being sent by systemd. I updated Caddy, but I’ll make sure systemd is updated as well

So far in the past 6 days it hasn’t gone down with Caddy v0.10, but that doesn’t mean it won’t happen.

Keep an open mind. :-)

As a quick fix that’s okay. It might make it harder to find the cause, though.

stevendesu wrote:

> So far in the past 6 days it hasn’t gone down with Caddy v0.10, but that doesn’t mean it won’t happen.

I recommend you install something which watches events or logs and alerts you to any services going down unexpectedly. Something very crude could be:

# added to caddy.service (or a drop-in for it)
[Unit]
OnFailure=send-email.service

# send-email.service
[Unit]
Description=Send alert
; …

[Service]
Type=oneshot
ExecStart=/bin/bash -c "printf 'From: Z\r\nTo: X\r\nSubject: Y\r\n\r\nCheck the logs.\r\n' | sendmail steve@desu.com"
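After creating both units, reload systemd and, if you like, fire the alert once by hand to verify the mail path works:

systemctl daemon-reload
systemctl start send-email.service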

This topic was automatically closed 90 days after the last reply. New replies are no longer allowed.