API getting unresponsive

ReptoxX · November 18, 2021, 1:08pm

1. Caddy version (`caddy version`):

v2.4.6 h1:HGkGICFGvyrodcqOOclHKfvJC0qTU7vny/7FhYp9hNw=

2. How I run Caddy:

I run caddy pretty basic. All there is are two rules.

All other rules are getting dynamically added through a bash script

curl "http://localhost:2019/config/apps/http/servers/srv0/routes/0" -X PUT \
    -H "Content-Type: application/json" \
    -d '{
        "@id": "'"$CLIENT"'-'"$EMP"'",
        "handle":[
        {
            "handler":"subroute",
            "routes":[
            {
                "handle":[
                {
                    "handler":"reverse_proxy",
                    "upstreams":[
                    {
                        "dial":"127.0.0.1:'"$PORT"'"
                    }
                    ]
                }
                ]
            }
            ]
        }
        ],
        "match":[
        {
            "host":[
                "'"$CLIENT"'.'"$EMP"'.reptoxx.de"
            ]
        }
        ],
        "terminal":true
    }'

a. System environment:

Ubuntu 20.04, xcaddy installed with systemd.

b. Command:

curl http://localhost:2019/config/

c. Service/unit/compose file:

# caddy.service
#
# For using Caddy with a config file.
#
# Make sure the ExecStart and ExecReload commands are correct
# for your installation.
#
# See https://caddyserver.com/docs/install for instructions.
#
# WARNING: This service does not use the --resume flag, so if you
# use the API to make changes, they will be overwritten by the
# Caddyfile next time the service is restarted. If you intend to
# use Caddy's API to configure it, add the --resume flag to the
# `caddy run` command or use the caddy-api.service file instead.

[Unit]
Description=Caddy
Documentation=https://caddyserver.com/docs/
After=network.target network-online.target
Requires=network-online.target

[Service]
Type=notify
User=caddy
Group=caddy
ExecStart=/usr/bin/caddy run --environ --config /etc/caddy/Caddyfile
ExecReload=/usr/bin/caddy reload --config /etc/caddy/Caddyfile
TimeoutStopSec=5s
LimitNOFILE=1048576
LimitNPROC=512
PrivateTmp=true
ProtectSystem=full
AmbientCapabilities=CAP_NET_BIND_SERVICE

[Install]
WantedBy=multi-user.target

d. My complete Caddyfile or JSON config:

swing-test.reptoxx.de {
        basicauth * {
           reptoxx <i'm not allowed to share that>
        }
        reverse_proxy {
            to https://127.0.0.1:9081
            header_up Host {host}
            header_up X-Real-IP {remote_host}
            header_up X-Forwarded-For {remote_host}
            transport http {
                versions 1.1
                tls_insecure_skip_verify
            }
        }
}

*.*.reptoxx.de {
        log {
         output stdout
         format console
         level debug
        }
        tls {
                on_demand
        }
        respond "test"
}

3. The problem I’m having:

The API is getting unresponsive once a websocket connection is established. So once i open up the socket through swing-test, i can’t query the api anymore.

4. Error messages and/or full log output:

$ curl -v http://localhost:2019/config/
*   Trying 127.0.0.1:2019...
* TCP_NODELAY set
* Connected to localhost (127.0.0.1) port 2019 (#0)
> GET /config/ HTTP/1.1
> Host: localhost:2019
> User-Agent: curl/7.68.0
> Accept: */*
>

{"level":"info","ts":1637242972.0179756,"logger":"admin.api","msg":"received request","method":"GET","host":"localhost:2019","uri":"/config/","remote_addr":"127.0.0.1:51284","headers":{"Accept":["*/*"],"User-Agent":["curl/7.68.0"]}}

5. What I already tried:

I often just restart caddy to the point of it having no routes anymore. Then i can add a couple more, that i needed and after that establishing the websocket again.
I also tried to debug the issue, but the fact that it’s not responding at all, i don’t know where to start.

matt · November 19, 2021, 3:53am

This could be because changes to the config involve cycling the servers, which incurs a graceful (zero-downtime) shutdown. WS connections stay open a long time, so they eventually get forcefully closed. That should happen after a few seconds, but it should eventually happen. Does it never finish?

ReptoxX · November 19, 2021, 10:37am

No it did never finish. I let it run for about 20 minutes without seeing a result.
But sometimes (can’t make up whats triggering it) it does work even with an open websocket connection.

matt · November 19, 2021, 3:11pm

Do you have specific steps that can reliably reproduce this? If so I’ll look into it. Minimal config would be needed.

ReptoxX · November 19, 2021, 3:45pm

I try my best to support and help in that problem. So while it’s not occurring 100% the time one of us is connected to the websocket, it happens more often then not.

I’m currently trying to get phpstorm working with docker and caddy as a reverse proxy.

docker run

docker run --rm -p 8887:8887 --name phpstorm-ide projectorimages/projector-phpstorm

Caddy config

phpstorm.domain.example {
	reverse_proxy {
	    to http://127.0.0.1:8887
	    header_up Host {host}
	    header_up X-Real-IP {remote_host}
	    header_up X-Forwarded-For {remote_host}
	    transport http {
		versions 1.1
                tls_insecure_skip_verify
	    }
	}
}

Of course SSL is handled through caddy with letsencrypt.

Once you are connected to the ide, you should be able to reproduce it.

curl -v http://localhost:2019/config/

I really hope that this is helpful and you can recreate my issue.

ReptoxX · December 9, 2021, 12:17pm

I don’t want this topic to get closed.
Is there any progress?

Regards

matt · December 9, 2021, 6:12pm

Sorry, been swamped with other things… still on my list though, I have this bookmarked.

I don’t use Docker though. Does this happen with a plain Caddy installation on your native machine? Or is it only through Docker that it happens?

ReptoxX · December 9, 2021, 8:24pm

I’m using the projector installation in a docker container with exposed ports and run a caddy installation native on linux.
Didnt try to use caddy in a docker container to be honest.

matt · December 9, 2021, 8:54pm

Only thing I can think of at this point is to start putting in log prints to see where the request gets hung up. Starting here:

I would put fmt.Println("Got here") (not even joking) at a few places, maybe starting before and after the lock acquisition in readConfig(), to see if it’s able to get the lock. Then just keep going to see how far it gets before it stops.

It’s really quite easy to do this thankfully. Just make sure you have Go installed, then clone the repo and add those lines in various places. Then run go run main.go or go build to try it out.

matt · December 14, 2021, 10:48pm

@ReptoxX Did you have a chance to do that? Any questions I can help with?

ReptoxX · December 15, 2021, 10:09am

I did add the debug outputs to the file you mentioned and only got above readConfig never below.
As i don’t really know go or the project at all i’m not sure where readConfig is otherwise i would add debug output to this aswell.

Edit: Sourcegraph shows it… Feeling dumb right now. Continue to add debug outputs to the correct lines.

Edit 2: It stops right there

Appears that he can’t lock?

matt · December 15, 2021, 4:43pm

Thanks, that’s super helpful. So yeah, apparently something isn’t letting go of the lock around the config.

Something else to try. There are other places where the lock is obtained and then released for currentCfgMu, like here:

It might be worth trying to put some prints just after obtaining a lock and just before releasing the lock at a few places in that file, where you see currentCfgMu.Lock() or currentCfgMu.Unlock(). If you see defer currentCfgMu.Unlock(), then replace it with:

defer func() {
    fmt.Println("Got here, unlocking...")
    currentCfgMu.Unlock()
}()

or something like that, so you can know when it gets unlocked. I’m curious to see where in the code we get a lock but don’t unlock it.

system · December 18, 2021, 1:08pm

This topic was automatically closed after 30 days. New replies are no longer allowed.

matt · December 18, 2021, 9:07pm

@ReptoxX I reopened this topic and I await your reply when you have a chance. Thanks!

matt · December 18, 2021, 11:10pm

Couple other ideas!

Use the /debug/pprof endpoint of the admin API to see a stack trace, and or use Ctrl+\ to send SIGQUIT: https://twitter.com/marcaruel/status/1472342598169813003

ReptoxX · January 11, 2022, 5:05pm

Hey Matt,

I’m sorry that i did not respond to this topic while nye. But now i have time again to try to help you finding the issue.
I added a couple new outputs to the caddy.go.

Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] BEFORE READ CONFIG
Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] RLOCK CALLED - L199
Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] DID IT UNLOCK? - L201
Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] RUNLOCK - ReadConfig - L200
Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] AFTER READ CONFIG
Jan 11 18:00:31 reptoxx-dev caddy[2433237]: [RPTX] changeConfig - LOCK - L132

I’m not able to run the program as the user so i can’t send sigquit to a service? Or am i?

Regards

matt · January 11, 2022, 6:22pm

Thank you!!

Looks like the locking is fine.

What happens if you set a grace_period in your Caddyfile? Global options (Caddyfile) — Caddy Documentation - try like 5s.

ReptoxX · January 13, 2022, 8:06am

This seems to add the config
Before it would just hang forever now it continues after like 10s (it’s not exactly the time i specified).
With 300ms it’s acceptable.

But unfortunately it doesn’t seem to reload the config as its still not available.

Regards

API getting unresponsive

1. Caddy version (caddy version):

2. How I run Caddy:

a. System environment:

b. Command:

c. Service/unit/compose file:

d. My complete Caddyfile or JSON config:

3. The problem I’m having:

4. Error messages and/or full log output:

5. What I already tried:

1. Caddy version (`caddy version`):