Race condition when clustering Caddy 2 with caddy run --watch

1. Caddy version (caddy version):

v2.3.0

2. How I run Caddy:

Docker-compose Caddy2 cluster with shared storage mounted at /etc/caddy.

a. System environment:

CentOS 8 with Docker

b. Command:

caddy run --config /etc/caddy/config/caddy/autosave.json --watch

c. Configuration files:

version: "3.7"

services:
  caddy:
    container_name: caddy
    image: caddy:latest
    build:
      context: .
      dockerfile: ./Dockerfile
    restart: unless-stopped
    network_mode: host
    volumes:
      - ./caddy:/etc/caddy
    environment:
      - XDG_CONFIG_HOME=/etc/caddy/config
      - XDG_DATA_HOME=/etc/caddy/data
      - PROXY_HOST=127.0.0.1
      - PROXY_PORT=8080
      - INITIAL_HOSTNAME=mydomain.com
      - ADMIN_API_HOST=0.0.0.0
      - ADMIN_API_PORT=2019

FROM caddy:alpine

RUN apk add gettext

ADD ./caddy-template.json /opt/caddy/caddy-template.json

ADD ./run.sh /opt/run.sh

CMD /opt/run.sh

caddy-template.json


{
  "admin": {
    "listen": "$ADMIN_API_HOST:$ADMIN_API_PORT"
  },
  "apps": {
    "http": {
      "servers": {
        "srv0": {
          "listen": [":443"],
          "routes": [
            {
              "handle": [
                {
                  "handler": "subroute",
                  "routes": [
                    {
                      "handle": [
                        {
                          "handler": "reverse_proxy",
                          "upstreams": [
                            { "dial": "$PROXY_HOST:$PROXY_PORT" }
                          ]
                        }
                      ]
                    }
                  ]
                }
              ],
              "match": [
                {
                  "@id": "default",
                  "host": ["$INITIAL_HOSTNAME"]
                }
              ],
              "terminal": true
            }
          ]
        }
      }
    }
  }
}

run.sh

#!/bin/sh

# mkdir -p is a no-op if the directories already exist
mkdir -p /etc/caddy/config/caddy /etc/caddy/data/caddy

# POSIX sh has no [[ ]]; use a single-bracket test
if [ ! -f /etc/caddy/config/caddy/autosave.json ]; then
  envsubst < /opt/caddy/caddy-template.json > /etc/caddy/config/caddy/autosave.json
fi

caddy run --config /etc/caddy/config/caddy/autosave.json --watch

3. The problem I’m having:

There is a race condition between clustered Caddy nodes when multiple configuration changes are made simultaneously through the admin API. It takes approximately one minute for each node to recognise a configuration change and apply it. Caddy 2 would need to give every node in the cluster time to pick up a change before processing another modification request; otherwise some requests get overridden and never apply.

One node receives an API call and processes it, while the other node receives a different API call and processes that one. Both nodes run with --watch, so it is a race as to which node catches the configuration change first and applies it to itself, thereby overwriting the call it had just processed.

I can reproduce this every time by running the following on both nodes simultaneously (with a different hostname and @id in each call, of course):

curl -X PUT -H "Content-Type: application/json" \
  -d '{"@id":"ryan","host":["ryan.fakedomain.com"]}' \
  "http://127.0.0.1:2019/config/apps/http/servers/srv0/routes/0/match/0"
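For clarity, the two simultaneous calls look roughly like this (the node addresses are placeholders for my two hosts, not my real setup):

```shell
# Placeholder addresses for the two nodes' admin APIs
NODE1=http://node1.example.com:2019
NODE2=http://node2.example.com:2019

# Fire one PUT at each node at (nearly) the same time
curl -X PUT -H "Content-Type: application/json" \
  -d '{"@id":"ryan","host":["ryan.fakedomain.com"]}' \
  "$NODE1/config/apps/http/servers/srv0/routes/0/match/0" &
curl -X PUT -H "Content-Type: application/json" \
  -d '{"@id":"other","host":["other.fakedomain.com"]}' \
  "$NODE2/config/apps/http/servers/srv0/routes/0/match/0" &
wait

# Each node writes its new config to the shared autosave.json; --watch on the
# other node then reloads whichever write landed last, clobbering that node's
# own just-applied change.
```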

Configuration is not clustered. Only storage is.

Caddy does not communicate configuration changes to other instances. It has no way to do so anyways, it doesn’t know about other instances. The storage clustering is only a product of using mutexes for all storage operations, which ensures that they can play together.

OK. Fair enough. Would you have any suggestions on how I could get round this issue please?

You shouldn’t be trying to use autosave.json as a mechanism for clustering. Its purpose is just to preserve a backup of the currently running config, and make the --resume option work to start from the last good config. There’s no locking mechanism at play here, so you’ll get contention if you try to have more than one Caddy instance play with this file.

OK. I did that because my Admin API calls did not survive a restart, since the Caddyfile did not reflect the changes made via the API.

Maybe I have just been doing it wrong. I guess I need to rethink and do more testing, so that I am not using autosave.json directly but my API calls still survive reboots.

Thank you for your support and I will let you know how I get on.

Thanks for the added information (from GitHub issue).

I second what Francis said, and would probably have said just about the same thing myself.

The last loaded config is always stored in autosave.json (unless autosave is explicitly disabled) so that you can use --resume to apply that config at startup. The --resume flag is needed if you intend to start with the last loaded configuration, because otherwise it’s ambiguous when a config file also exists. Our Getting Started tutorial has a whole section about this, explaining that you should have just one source of truth.

Your config changes do survive reboots in the autosave file, but you have to apply that file to use it.
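Concretely, “applying” it can mean either starting with --resume or pushing the file to a running instance’s /load endpoint. A sketch, assuming the admin API listens on 127.0.0.1:2019 and the autosave path from your setup:

```shell
# Option 1: at startup, replay the last autosaved config
caddy run --resume

# Option 2: push the autosaved config to an already-running instance
curl -X POST -H "Content-Type: application/json" \
  -d @/etc/caddy/config/caddy/autosave.json \
  http://127.0.0.1:2019/load
```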

Hi,

OK, I tried with --resume and it worked as expected for surviving restarts. However, with --resume, --watch cannot find the config to watch, because --resume overrides the --config flag that --watch seems to use to know which file to watch. This means the other nodes do not get updated at all, which seems worse than what I had before with just the contention issue.

Also, all my changes once live are made by API calls, and as I understand it I would need to watch autosave.json to make this work anyway.

I really like this software, and am determined to make it work for my use case.

I may just have to find a way to avoid the contention issues and stick with what I had for now.
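One idea I am considering for reducing the contention: serialize my own config updates with a lock file, so only one modification is in flight at a time. A rough sketch, assuming the lock lives on a filesystem that supports flock (NFS may not), and with a hypothetical update_config helper standing in for my real API call:

```shell
#!/bin/sh
# Lock file; in practice this would live on the shared volume
LOCK=/tmp/caddy-admin.lock

# Hypothetical wrapper: hold the lock for the duration of one config change.
# The echo stands in for the real curl PUT against the admin API.
update_config() {
  flock "$LOCK" sh -c "echo applying: $1"
}

update_config ryan.fakedomain.com   # prints "applying: ryan.fakedomain.com"
```

This only prevents two of my own updates from overlapping; it does nothing about the delay before the other node's --watch picks up the change, so it is a mitigation rather than a fix.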

At least I can then add redundancy to perform updates without downtime and to be more resilient to host failure.

Could this be something that might be in the works for the future?

Regards,
Ryan

You can keep an eye on this issue:

OK, thank you :slight_smile: