I’m only now noticing that your systemd config is non-standard. We ship a recommended one with our Debian package, with additional details here if you want to set it up manually:
It might be good to switch to using caddy run to reduce the variables at play here (most users use caddy run).
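To try that, you could stop the service and run Caddy in the foreground with your existing config. The path below assumes a Caddyfile at the Debian package's default location, so adjust it to your setup:

sudo systemctl stop caddy        # stop the unit so the ports are free
caddy run --config /etc/caddy/Caddyfile   # path assumes the Debian package default; adjust for your setup

That takes systemd out of the equation while we narrow this down.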
File: caddy
Type: cpu
Time: Apr 8, 2021 at 10:16pm (MDT)
Duration: 30mins, Total samples = 24.43s ( 1.36%)
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 14.58s, 59.68% of 24.43s total
Dropped 831 nodes (cum <= 0.12s)
Showing top 10 nodes out of 273
flat flat% sum% cum cum%
3.93s 16.09% 16.09% 4.49s 18.38% syscall.Syscall
3.14s 12.85% 28.94% 3.14s 12.85% crypto/aes.gcmAesEnc
2.01s 8.23% 37.17% 2.01s 8.23% golang.org/x/crypto/blowfish.encryptBlock
2.01s 8.23% 45.40% 2.01s 8.23% runtime.epollwait
1.57s 6.43% 51.82% 1.57s 6.43% runtime.futex
0.58s 2.37% 54.20% 0.58s 2.37% vendor/golang.org/x/crypto/curve25519.ladderstep
0.42s 1.72% 55.91% 0.42s 1.72% runtime.memmove
0.40s 1.64% 57.55% 0.40s 1.64% crypto/sha256.block
0.30s 1.23% 58.78% 4.27s 17.48% runtime.findrunnable
0.22s 0.9% 59.68% 1.30s 5.32% runtime.mallocgc
So over the 30-minute profile, it gathered 24s of CPU time, most of which was spent in system calls and AES/blowfish encryption. I also want to know the answer to Francis’ questions - what kind of hardware are you running this on (specifically)?
Sorry for the late answer; a pressing personal matter came up.
Anyway, the hardware Caddy is running on is a bare-metal, on-premises server: an HP ProLiant DL320e Gen8 with an Intel Xeon E3-1240 V2 CPU at 3.40GHz (4 cores / 8 threads with Hyper-Threading) and 8GB of RAM.
I hope this is not related, but if it is, then… maybe I should find another way to deal with it; I was okay with nginx, so perhaps I have to switch back…?
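For reference, a CPU profile like the one above can be collected straight from Caddy’s admin endpoint, assuming it is listening at the default localhost:2019 and exposes the debug/pprof handlers there:

go tool pprof "http://localhost:2019/debug/pprof/profile?seconds=1800"   # 1800 matches the 30-minute window above; adjust as needed

The seconds parameter controls how long the capture runs before pprof drops you into the interactive prompt shown above.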
Maybe we could jump on a video chat? Not today – probably tomorrow or something – but I can troubleshoot it with you, with you driving. (I won’t take over for liability reasons.)
I wonder if there’s something weird with the VPN connection or something else about the network. Can you reproduce the behavior in any other environment?
As for the network, I do not use a VPN myself; it is a public IP address range with a firewall enabled (because of the special environment I am in, every device on this network is given a public IP address).
I cannot reproduce the same behavior in any other environment (I tested in an AWS VM, but it is fast there).
After discussing on a call, we determined that this is probably because Caddy (actually, the Go standard library) is doing a graceful shutdown of the servers: http package - net/http - Go Packages
Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down. If the provided context expires before the shutdown is complete, Shutdown returns the context’s error, otherwise it returns any error returned from closing the Server’s underlying Listener(s).
By default, no grace period is set, so there is no timeout. If a client holds onto a connection for 20+ minutes (for example, downloading a big file very slowly), the old server keeps that connection open and the reload waits that long to finish. This is an ideal user experience but maybe not so ideal for servers.
The grace period is configurable in JSON: JSON Config Structure - Caddy Documentation - and I bet setting it (say, 5 to 10 seconds, maybe 30 if you really want to) should cause reloads to finish closer to that amount of time.
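To sketch what that looks like in the JSON config (only the relevant part shown; the rest of your http app config stays as it is, and 10s is just an example value):

{
  "apps": {
    "http": {
      "grace_period": "10s"
    }
  }
}

With that set, a reload forcibly closes whatever connections are still active 10 seconds after the shutdown begins, instead of waiting indefinitely.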
It is not currently configurable in the Caddyfile, which would be a nice global option to have.