EC2 with a Caddy + Gunicorn setup sporadically unreachable

1. Caddy version:

v2.6.3

2. How I installed, and run Caddy:

Installed via apt using the commands from the official docs

a. System environment:

EC2 instance running Ubuntu 22.04

b. Command:

systemctl reload caddy

c. Service/unit/compose file:

Using the stock systemd service unit that ships with the apt package

d. My complete Caddy config:

psymetricstest.com {
    @notStatic {
        not {
            path /staticfiles/*
        }
    }

    handle_path /staticfiles/* {
        file_server
        root * /opt/app_repo/static/
    }

    reverse_proxy @notStatic unix//run/gunicorn.sock {
        header_up Host {host}
    }

    log {
        output file /opt/app_repo/caddy.access.log {
            roll_size 1gb
            roll_keep 5
            roll_keep_for 720h
        }
    }
}

3. The problem I’m having:

I have an EC2 instance that runs a Django app via Gunicorn, with Caddy sitting in front of it. The domain is hosted in Route 53 with an A record pointing to the instance's IP address.

For completeness, here are my Gunicorn unit files as well:

# gunicorn.service
[Unit]
Description=gunicorn daemon
Requires=gunicorn.socket
After=network.target

[Service]
User=root
Group=root
WorkingDirectory=/opt/app_repo
Restart=always
ExecStart=/opt/app_repo/venv/bin/gunicorn \
          --access-logfile /opt/app_repo/gunicorn.access.log \
          --error-logfile /opt/app_repo/gunicorn.error.log \
          --timeout 600 \
          --workers 5 \
          --bind unix:/run/gunicorn.sock \
          --log-level DEBUG \
          --capture-output \
          app_repo.wsgi:application

[Install]
WantedBy=multi-user.target

# gunicorn.socket
[Unit]
Description=gunicorn socket

[Socket]
ListenStream=/run/gunicorn.sock

[Install]
WantedBy=sockets.target

The problem is that the site is reported as unreachable by our monitoring tool (and confirmed by some clients as well) for 5-10 minutes every day, with no apparent pattern. Whenever I SSH back onto the server, the gunicorn and caddy services are up and running (checked via systemctl status). Checking journalctl doesn’t yield any helpful details:

4. Error messages and/or full log output:

$ journalctl -u gunicorn --boot
Feb 14 18:27:50 ip-172-31-3-73 systemd[1]: Started gunicorn daemon.
Feb 15 13:02:26 ip-172-31-3-73 systemd[1]: Stopping gunicorn daemon...
Feb 15 13:02:26 ip-172-31-3-73 systemd[1]: gunicorn.service: Deactivated successfully.
Feb 15 13:02:26 ip-172-31-3-73 systemd[1]: Stopped gunicorn daemon.
Feb 15 13:02:26 ip-172-31-3-73 systemd[1]: gunicorn.service: Consumed 1h 15min 13.075s CPU time.
Feb 15 13:02:26 ip-172-31-3-73 systemd[1]: Started gunicorn daemon.
Feb 15 13:16:52 ip-172-31-3-73 systemd[1]: Stopping gunicorn daemon...
Feb 15 13:16:53 ip-172-31-3-73 systemd[1]: gunicorn.service: Deactivated successfully.
Feb 15 13:16:53 ip-172-31-3-73 systemd[1]: Stopped gunicorn daemon.
Feb 15 13:16:53 ip-172-31-3-73 systemd[1]: gunicorn.service: Consumed 39.035s CPU time.
Feb 15 13:16:53 ip-172-31-3-73 systemd[1]: Started gunicorn daemon.
$ # grepped to when the recent outage happened
$ journalctl -u caddy --boot | grep "Feb 16" | grep "error"
Feb 16 03:10:09 ip-172-31-3-73 caddy[5328]: {"level":"error","ts":1676517009.8251915,"logger":"http.handlers.reverse_proxy","msg":"aborting with incomplete response","error":"http2: stream closed"}

Grepping dmesg for gunicorn and caddy doesn’t yield anything either, as far as I can tell.

$ dmesg | grep caddy
$ dmesg | grep gunicorn
[    2.972213] systemd[1]: Configuration file /etc/systemd/system/gunicorn.socket is marked world-writable. Please remove world writability permission bits. Proceeding anyway.
[    2.984758] systemd[1]: Configuration file /etc/systemd/system/gunicorn.service is marked world-writable. Please remove world writability permission bits. Proceeding anyway.
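
For the next outage window, a broader sweep than grepping for the two service names might be more telling; something like this (the timestamps are just the window around the Feb 16 incident):

$ # any OOM kills or other kernel-level events?
$ dmesg -T | grep -iE 'oom|out of memory|killed process'
$ # everything the kernel logged around the outage window
$ journalctl -k --since "2023-02-16 03:00" --until "2023-02-16 03:20"
$ # warnings and errors from any unit in the same window
$ journalctl --since "2023-02-16 03:00" --until "2023-02-16 03:20" -p warning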

5. What I already tried:

Aside from looking at the logs in step #4, I’ve also been watching htop to see if I can find any clues. Unfortunately, I’ve never had it open while an outage happens because the timeframe varies from day to day.
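
One alternative to watching htop live would be a crude capture loop left running in tmux, so there is at least a timeline to look at after the fact (untested sketch; the log path is made up):

$ # append a timestamped snapshot of load, memory, and socket state every 30 seconds
$ # (log path is arbitrary)
$ while true; do { date; uptime; free -m; ss -s; echo ---; } >> /opt/app_repo/sysstate.log; sleep 30; done &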

I know Caddy is only one of the moving parts in my setup and it could very well be a non-Caddy issue, but at this point I don’t know how to confirm that for sure. Any help is very much appreciated!
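
One way to narrow it down during the next outage (assuming SSH still works) would be to test each hop from the instance itself, for example:

$ # hit Gunicorn directly over its socket, bypassing Caddy
$ curl -s -o /dev/null -w '%{http_code}\n' --unix-socket /run/gunicorn.sock -H 'Host: psymetricstest.com' http://localhost/
$ # hit Caddy locally over HTTPS, bypassing the network path from outside
$ # (-k only because we care about reachability here, not the certificate)
$ curl -sk -o /dev/null -w '%{http_code}\n' --resolve psymetricstest.com:443:127.0.0.1 https://psymetricstest.com/

If both of those return 200 while outside checks are failing, that would point at the network path rather than at Caddy or Gunicorn.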

6. Links to relevant resources:

Can you elaborate on ‘unreachable’?

Specifically, are you getting timeouts trying to connect, or are connections being rejected?

Either way, it does sound like the actual route to the host is being lost here, or there’s an errant firewall along the way.

Is your monitoring tool agentless? Can you put a monitoring agent (a la Sensu) on the erroring host to help indicate whether it’s a network connectivity issue? Maybe implement other monitoring techniques alongside your HTTP checks - ICMP maybe?
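
Even a crude probe run from any box outside AWS would do as a starting point; a rough sketch (hostname taken from your post, interval arbitrary, Linux-style ping flags assumed):

#!/bin/sh
# log every 30s whether ICMP and HTTPS succeed, so failures can be correlated
while true; do
    ts=$(date -u +%FT%TZ)
    if ping -c 1 -W 2 psymetricstest.com > /dev/null 2>&1; then icmp=ok; else icmp=fail; fi
    if curl -s -m 5 -o /dev/null https://psymetricstest.com/; then http=ok; else http=fail; fi
    echo "$ts icmp=$icmp http=$http"
    sleep 30
done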

That’s a good question, and I’m actually not sure! My networking skills are very limited, so please bear with me. We’re using StatusCake, and here’s what they’re reporting to us:

Root Data        Root Value
StatusCode       0
ResponseTime     0
Issue            Request Timeout
Confirmations    2
Trace            == DNS LOOKUP (psymetricstest.com) == * 44.198.160.231
                 == TRACEROUTE TO HOST (psymetricstest.com) == * Could not run a traceroute.
Headers          “No response!”
Additional       After a set timeout rate the site did not respond
ReportingServer  DOUK4 (167.71.143.76)

That’s actually pretty useful. When the HTTP check failed, StatusCake was still able to resolve DNS, but couldn’t run a traceroute against the resulting IP address.

DNS works, but no HTTP and no UDP, so I think we can safely conclude that no packets are reaching your instance at all during this downtime. That either means the instance itself is freezing or going down during these periods, or there’s a routing issue within Amazon EC2 itself.
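
Next time it happens, running these from anywhere outside AWS should pin down which layer is failing (the IP is the one StatusCake resolved):

$ dig +short psymetricstest.com                              # does DNS still resolve?
$ nc -vz -w 5 44.198.160.231 443                             # does a TCP handshake to the instance succeed?
$ mtr -rwc 20 psymetricstest.com                             # where along the route do packets stop?
$ curl -sv -m 10 -o /dev/null https://psymetricstest.com/    # does a full HTTPS request work?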

Appreciate the explanation. Do you have any suggestions on what actions I can take to figure out the root cause?

Not particularly, other than raise a ticket with Amazon.

Putting a push-monitoring agent on the erroring host during a downtime period might help you figure out whether it’s a bi-directional routing failure or whether EC2 is simply preventing any incoming packets from reaching your instance during these events. Ultimately, though, knowing that doesn’t change the fact that packets aren’t reaching the instance when they need to be able to reach it, which I’d call an underlying service issue.
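
Even without a full agent like Sensu, a one-line outbound heartbeat from the instance would show whether traffic keeps flowing out during these windows; the endpoint below is just a placeholder for whatever push/dead-man's-switch service you use:

$ # placeholder URL -- substitute your own heartbeat endpoint
$ ( crontab -l 2>/dev/null; echo '* * * * * curl -fsS -m 10 https://hc.example.com/ping/your-check-id > /dev/null 2>&1' ) | crontab -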

Thanks, I did send a support ticket. Hopefully I can get a helpful response. I appreciate your time!

Update: we experienced the same outage yesterday, and I managed to take a look at CloudWatch. Here’s what I got:

The outage happened around 17:00. Does this tell you anything?
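
For reference, the same metrics can be pulled without the console with something like this (the instance ID and times below are placeholders):

$ # instance ID and time window are placeholders
$ aws cloudwatch get-metric-statistics \
      --namespace AWS/EC2 --metric-name CPUUtilization \
      --dimensions Name=InstanceId,Value=i-0123456789abcdef0 \
      --start-time 2023-02-20T16:30:00Z --end-time 2023-02-20T17:30:00Z \
      --period 300 --statistics Maximum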

The fact that it happens at 5 o’clock on the dot, and that the CPU spikes along with network packets in and out…

There’s something happening on the host itself that’s locking it up for some reason. Something scheduled. Anything in crontab, maybe?
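
A quick sweep of everything scheduled on the box, checking whether anything lines up with 17:00, would be along these lines:

$ crontab -l; sudo crontab -u root -l
$ cat /etc/crontab; ls /etc/cron.d /etc/cron.hourly /etc/cron.daily
$ systemctl list-timers --all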

If it’s not immediately obvious, then further troubleshooting what, exactly, is the problem - assuming Amazon says there’s nothing they can do on their end - is a job for a Linux professional, and a little beyond the scope of a help forum. I wouldn’t even bother troubleshooting this myself; my usual go-to when I get this kind of weirdness is to simply nuke it from orbit: scrap the instance and spin up a new one. Instances are easy to replace.

I forgot to mention that this outage hits 2-3 instances all at once. That’s the only part that makes me hesitant to chalk it up as an underlying service issue; it could still be a configuration error on my end.

AWS also came back to me saying that, since my instance status checks have been healthy ever since the servers were provisioned, they couldn’t do anything on their end.

Really frustrating but interesting at the same time!
