Docker container enters zombie state

1. My Caddy version (caddy version):

v2.0.0-beta.17

2. How I run Caddy:

a. System environment:

Ubuntu running Docker

b. Command:

docker-compose up -d

c. Service/unit/compose file:

  reverse-proxy:
    container_name: reverse-proxy
    image: caddy/caddy:v2.0.0-beta.17
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    user: root
    volumes:
      - ./Caddyfile.prod:/etc/caddy/Caddyfile
      - caddy-config:/root/.config/caddy
      - caddy-data:/root/.local/share/caddy

d. My complete Caddyfile or JSON config:

{
  email admin@29th.org
  # acme_ca https://acme-staging-v02.api.letsencrypt.org/directory
}

29th.org {
  redir https://www.{host}{uri} permanent
}

www.29th.org {
  reverse_proxy homepage:80
}

personnel.29th.org {
  reverse_proxy app:8080
}

api.29th.org {
  reverse_proxy api:80
}

forums.29th.org {
  reverse_proxy forums:80 {
    header_up X-Forwarded-Proto {http.request.scheme}
  }
}

portainer.29th.org {
  reverse_proxy portainer:9000
}

bitwarden.29th.org {
  encode gzip

  reverse_proxy /notifications/hub/negotiate bitwarden:80
  reverse_proxy /notifications/hub bitwarden:3012
  reverse_proxy bitwarden:80
}

3. The problem I’m having:

I run docker-compose in production. It’s worked fine for months, but twice this week the site has gone down with a “Connection refused” error. Upon investigation, it appears the caddy container is in a zombie-like state and no longer handling requests. It also appears that one of the containers being reverse proxied (app) is down. Perhaps that container went down first and reverse-proxy entered a zombie state after failing to reach it? Guessing…

4. Error messages and/or full log output:

root@dockerprod:/usr/local/src# docker-compose ps
    Name                   Command                  State                             Ports
------------------------------------------------------------------------------------------------------------------
api             docker-php-entrypoint apac ...   Up             80/tcp
app             docker-entrypoint.sh npm r ...   Up             8080/tcp
bitwarden       /bitwarden_rs                    Up (healthy)   3012/tcp, 80/tcp
forums          docker-php-entrypoint apac ...   Up             80/tcp
homepage        nginx -g daemon off;             Up             80/tcp
portainer       /portainer --admin-passwor ...   Up             9000/tcp
reverse-proxy   caddy run --config /etc/ca ...   Up             2019/tcp, 0.0.0.0:443->443/tcp, 0.0.0.0:80->80/tcp

(When I began investigating, app had a state of Exit 127, I believe. That output is now gone from my terminal history since I restarted it.)

root@dockerprod:/usr/local/src# docker-compose top reverse-proxy
Traceback (most recent call last):
  File "bin/docker-compose", line 6, in <module>
  File "compose/cli/main.py", line 71, in main
  File "compose/cli/main.py", line 127, in perform_command
  File "compose/cli/main.py", line 941, in top
TypeError: 'NoneType' object is not iterable
[1073] Failed to execute script docker-compose
root@dockerprod:/usr/local/src# docker container top reverse-proxy
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD

Here I am attempting to stop it. I’ve tried docker-compose stop as well, with the same effect; kill likewise does nothing.

root@dockerprod:/usr/local/src# docker container stop reverse-proxy
reverse-proxy
root@dockerprod:/usr/local/src# docker container ps
CONTAINER ID        IMAGE                        COMMAND                  CREATED             STATUS                PORTS                                                NAMES
5cd5e1fd4025        29th/forums:latest           "docker-php-entrypoi…"   2 days ago          Up 2 days             80/tcp                                               forums
a2f964bb67eb        29th/personnel-api:latest    "docker-php-entrypoi…"   2 days ago          Up 2 days             80/tcp                                               api
c6d1d5103282        portainer/portainer          "/portainer --admin-…"   2 days ago          Up 2 days             9000/tcp                                             portainer
1250c050644b        bitwardenrs/server-mysql     "/bitwarden_rs"          2 days ago          Up 2 days (healthy)   80/tcp, 3012/tcp                                     bitwarden
2f024f1d0daf        caddy/caddy:v2.0.0-beta.17   "caddy run --config …"   2 days ago          Up 2 days             0.0.0.0:80->80/tcp, 0.0.0.0:443->443/tcp, 2019/tcp   reverse-proxy
49122372dda6        nginx:1.17.7                 "nginx -g 'daemon of…"   2 days ago          Up 2 days             80/tcp                                               homepage
2902a6eb4170        29th/personnel-app:latest    "docker-entrypoint.s…"   2 days ago          Up 33 minutes         8080/tcp                                             app

The first time this happened, I tried docker rm -f reverse-proxy, which did successfully remove it, but when I tried to bring it back up with docker-compose up -d reverse-proxy I got an error about port 443 already being allocated (presumably by the zombie container process). I had to reboot the server to fix that.
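For what it’s worth, next time this happens it may help to check what is actually holding port 443 before rebooting. These commands are a diagnostic sketch (not from the thread); the usual culprit on a Docker host is a lingering docker-proxy process:

```shell
# Which process is listening on 443? (often an orphaned docker-proxy)
sudo ss -ltnp 'sport = :443'

# Ask Docker for the container's state and PID as it sees them
docker inspect --format '{{.State.Status}} pid={{.State.Pid}}' reverse-proxy

# If a docker-proxy process is orphaned, restarting the Docker daemon
# usually releases the port without a full server reboot
sudo systemctl restart docker
```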

root@dockerprod:/usr/local/src# docker-compose exec reverse-proxy bash
cannot exec in a stopped state: unknown
root@dockerprod:/usr/local/src# docker container update --restart=no reverse-proxy
Error response from daemon: Cannot update container 2f024f1d0dafa6473b557655a9e3685029bd53deae6f8459413b526738ec7243: cannot update a stopped container: unknown

The most recent logs from the reverse-proxy container are from 13 hours ago, right around the time the site went down. But I always see logs like this in production, so I don’t see anything unusual.

reverse-proxy    | 2020/03/21 01:13:34 http: TLS handshake error from 157.55.39.23:7879: no certificate available for ''
reverse-proxy    | 2020/03/21 01:13:34 http: TLS handshake error from 157.55.39.23:8043: tls: client offered only unsupported versions: [302 301]
reverse-proxy    | 2020/03/21 01:13:34 http: TLS handshake error from 157.55.39.23:8100: tls: client offered only unsupported versions: [301]
reverse-proxy    | 2020/03/21 01:13:34 http: TLS handshake error from 157.55.39.23:8137: EOF
reverse-proxy    | 2020/03/21 01:18:07 http: TLS handshake error from 184.105.247.195:34238: no certificate available for ''
reverse-proxy    | 2020/03/21 01:18:46 http: TLS handshake error from 71.175.49.17:49655: EOF
reverse-proxy    | 2020/03/21 01:18:46 http: TLS handshake error from 71.175.49.17:49651: EOF
reverse-proxy    | 2020/03/21 01:19:42 http: TLS handshake error from 71.232.250.251:54046: EOF
reverse-proxy    | 2020/03/21 01:29:55 http: TLS handshake error from 71.175.49.17:49854: EOF
reverse-proxy    | 2020/03/21 01:29:55 http: TLS handshake error from 71.175.49.17:49855: EOF
reverse-proxy    | 2020/03/21 01:29:55 http: TLS handshake error from 71.175.49.17:49856: EOF
reverse-proxy    | 2020/03/21 01:30:20 http: TLS handshake error from 202.107.226.3:23559: no certificate available for 'www.google-analytics.com'
reverse-proxy    | 2020/03/21 01:33:58 http: TLS handshake error from 185.94.219.160:52720: EOF
reverse-proxy    | 2020/03/21 01:34:08 http: TLS handshake error from 186.251.10.90:43489: EOF
reverse-proxy    | 2020/03/21 01:35:44 http: TLS handshake error from 135.23.214.137:50012: EOF
reverse-proxy    | 2020/03/21 01:35:44 http: TLS handshake error from 135.23.214.137:50011: EOF

5. What I already tried:

I was originally using the alpine image (before the official image used versioned tags) from a month or so ago. When this issue happened earlier this week, I switched to the most recent tagged image, v2.0.0-beta.17, and the issue happened again a couple of days later.

To fix it the first time, I force-removed the container and then had to reboot the server because port 443 was still allocated. This time, I rebooted the server without force-killing the container. docker-compose ps then showed the reverse-proxy container in a state of Exit 255. I ran docker-compose restart reverse-proxy and the site came back up.

6. Links to relevant resources:

I found an issue on moby/moby that sounds similar.

At a glance, I think your volume paths are outdated; they were recently changed in the official Docker image to /data and /config for simplicity.
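For reference, the updated mounts would look roughly like this (a sketch based on the current official image paths; service and volume names taken from your compose file):

```yaml
  reverse-proxy:
    container_name: reverse-proxy
    image: caddy/caddy:v2.0.0-beta.17
    restart: unless-stopped
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./Caddyfile.prod:/etc/caddy/Caddyfile
      - caddy-config:/config   # was /root/.config/caddy
      - caddy-data:/data       # was /root/.local/share/caddy
```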

As for the actual issues you’re encountering, I really don’t have a clue what’s going on. It does sound like it might be more of an issue with Docker itself than Caddy.

As for your logs and the TLS handshake errors, I’m pretty sure that’s due to clients that don’t support SNI attempting to connect. This results in Caddy not knowing which certificate to serve, so it quits.

A workaround for this would be to use the default_sni global option (which may have been added in beta 18? I don’t remember which version it was added in, sorry) to force Caddy to pick a domain so it can serve a certificate. (@matt hopefully I’m not spewing lies here?)
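Assuming the option is available in your build, it would go in the global options block at the top of the Caddyfile, something like this (sketch only; pick whichever of your domains makes sense as the fallback):

```
{
  email admin@29th.org
  default_sni www.29th.org
}
```

With that set, clients that don’t send an SNI ServerName would be served the certificate for www.29th.org instead of failing the handshake.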

I’ll try updating the volume paths. Was this change announced anywhere? The Caddy docs still seem to point to the original locations, unless I’ve misread them, and the Docker readme doesn’t mention them at all.

The “no certificate available for” errors are indeed because the client didn’t give a (recognized) ServerName in the handshake. But Caddy will never quit because of that.

Honestly all those TLS handshake errors look benign/normal to me – spammy/bad clients most likely, and really nothing to do with the container problem.

Not exactly. It’s still in beta, so we were less worried about making breaking changes. It was initially done in https://github.com/caddyserver/caddy-docker/pull/39 and later changed again in https://github.com/caddyserver/caddy-docker/pull/48 (the original change had a mistake).

