Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance

1. Caddy version (caddy version):

v2.4.6

2. How I run Caddy:

docker start caddy

a. System environment:

Ubuntu 20.04
Docker 20.10.12
Caddy 2.4.6

b. Command:

CMD ["caddy", "docker-proxy"]

c. Service/unit/compose file:

Dockerfile

ARG CADDY_VERSION=2.4.6

FROM caddy:${CADDY_VERSION}-builder AS builder

RUN xcaddy build \
    --with github.com/lucaslorentz/caddy-docker-proxy/plugin/v2 \
    --with github.com/greenpau/caddy-security \
    --with github.com/greenpau/caddy-trace \
    --with github.com/porech/caddy-maxmind-geolocation \
    --with github.com/caddy-dns/cloudflare \
    --with github.com/kirsch33/realip

FROM caddy:${CADDY_VERSION}-alpine

RUN apk add --no-cache tzdata

COPY --from=builder /usr/bin/caddy /usr/bin/caddy

CMD ["caddy", "docker-proxy"]

docker-compose

  caddy:
    hostname: caddy    
    image: caddy220312 #220216
    container_name: caddy
    restart: always
    environment:
      - TZ=${TZ}
      - CADDY_DOCKER_CADDYFILE_PATH=/data/CaddyFile
      - EMAIL=${EMAIL}
      - CADDY_EMAIL_SECRET=${CADDY_EMAIL_SECRET}
      - SMTP=${SMTP}
      - CRYPTO_KEY=${CRYPTO_KEY}
      - CLOUDFLARE_DOMAIN=${CLOUDFLARE_DOMAIN}
      - CADDY_REG_CODE=${CADDY_REG_CODE}
      - GEO_COUNTRY=${GEO_COUNTRY}
      - GEO_SUBDIVISION=${GEO_SUBDIVISION}
      - GEO_METRO_CODE=${GEO_METRO_CODE}
      - CLOUDFLARE_API=${CLOUDFLARE_API}
      - FQDN=${FQDN}
      - USER=${USER}
    ports:
      - 8069:8069
      - 443:443
      - 2019:2019
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - "${docker}/caddy2/data:/data"
      - "${docker}/caddy2/config:/config"
      - "${docker}:/docker"
      - "${xbmc}/ebooks/Magazines:/magazines"
      - "/var/log:/logs"
      - "/etc/localtime:/etc/localtime"
      - "/etc/timezone:/etc/timezone"
    labels:
      - com.centurylinklabs.watchtower.enable=false
      - traefik.http.routers.auth.rule=Host(`${FQDN}`)
    networks:
      - wg-pia

d. My complete Caddyfile or JSON config:

{"level":"info","ts":1647347192.7575867,"logger":"docker-proxy","msg":"New Caddyfile","caddyfile":"{
        http_port 8069
        https_port 443
        email {$EMAIL}
        debug
        log {
                output file /data/caddy.log
                format json {
                        time_format wall
                        level_format upper
                }
        }
        security {
                credentials {$EMAIL} {
                        username {$EMAIL}
                        password {$CADDY_EMAIL_SECRET}
                }
                messaging email provider gmail {
                        address {$SMTP}
                        protocol smtp
                        sender {$EMAIL}
                        credentials {$EMAIL}
                }
                authentication portal myportal {
                        enable source ip tracking
                        cookie domain {$CLOUDFLARE_DOMAIN}
                        cookie lifetime 86400
                        crypto default token lifetime 86400
                        crypto key sign-verify {$CRYPTO_KEY}
                        backend local /config/caddy/users.json local
                        transform user {
                                match origin local
                                require mfa
                                action add role authp/superadmin authp/user
                        }
                        registration {
                                dropbox /config/caddy/registrations.json
                                title \"User Registration\"
                                code {$CADDY_REG_CODE}
                                require domain mx
                                email provider gmail
                                admin email {$EMAIL}
                        }
                        ui {
                                links {
                                        "My Identity" "/auth/whoami" icon "las la-user"
                                }
                        }
                        transform user {
                                exact match sub {$USER}
                                action add role authp/superadmin
                                ui link "Portal Settings" /auth/settings icon "las la-cog"
                                ui link "Adguard" "/agh" icon "las la-cog" target_blank
                        }
                }
                authorization policy mypolicy {
                        validate source address
                        set auth url /auth/
                        crypto key verify {$CRYPTO_KEY}
                        allow roles authp/superadmin authp/user
                }
        }
}
auth.samcro1967.net {
        tls {
                dns cloudflare {{$CLOUDFLARE_API}}
        }
        route /auth* {
                authenticate * with myportal
        }
        route /xauth* {
                authenticate * with myportal
        }
        route {
                redir https://{hostport}/login/ 302
        }
        log {
                output file /data/access.log
                format json
        }
        redir /agh /agh/
        route /agh/* {
                authorize with mypolicy
                uri strip_prefix /agh
                reverse_proxy 192.168.1.104:8126 {
                        header_down Location /login.html /agh/login.html
                }
        }
}
"}

3. The problem I’m having:

Several hours after restarting caddy I get “Error 525 SSL handshake failed” when accessing auth.samcro1967.net. Restarting the service resolves the issue for a few hours. The only error I see in the logs is “Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance”.

I do have one lock file “issue_cert_auth.samcro1967.net.lock” that looks like this:

{"created":"2022-03-16T03:05:10.80765497-05:00","updated":"2022-03-16T10:35:29.374659462-05:00"}

This started a few weeks ago after I upgraded from caddy-auth-portal to caddy-security, but I don’t think that is related. I am not aware of anything else that changed at the time this started happening.

4. Error messages and/or full log output:

{"level":"info","ts":1647347154.2962456,"logger":"docker-proxy","msg":"Sending configuration to","server":"localhost"}
{"level":"info","ts":1647347154.2994285,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/load","remote_addr":"127.0.0.1:41300","headers":{"Accept-Encoding":["gzip"],"Content-Length":["22129"],"Content-Type":["application/json"],"User-Agent":["Go-http-client/1.1"]}}
{"level":"info","ts":1647347154.3037798,"msg":"redirected default logger","from":"stderr","to":"/data/caddy.log"}
{"level":"info","ts":1647347154.4039662,"logger":"docker-proxy","msg":"Successfully configured","server":"localhost"}
{"level":"info","ts":1647347192.6328156,"logger":"docker-proxy","msg":"Running caddy proxy server"}
{"level":"info","ts":1647347192.6346178,"logger":"admin","msg":"admin endpoint started","address":"tcp/localhost:2019","enforce_origin":false,"origins":["localhost:2019","[::1]:2019","127.0.0.1:2019"]}
{"level":"info","ts":1647347192.6347795,"msg":"autosaved config (load with --resume flag)","file":"/config/caddy/autosave.json"}
{"level":"info","ts":1647347192.6347876,"logger":"docker-proxy","msg":"Running caddy proxy controller"}
{"level":"info","ts":1647347192.6353683,"logger":"docker-proxy","msg":"Start","CaddyfilePath":"/data/CaddyFile","LabelPrefix":"caddy","PollingInterval":30,"ProcessCaddyfile":true,"ProxyServiceTasks":true,"IngressNetworks":"[]"}
{"level":"info","ts":1647347192.635559,"logger":"docker-proxy","msg":"Caddy ContainerID","ID":"7ada9f90bba653f0e23ea2a436d9ba5d88b3f676736914727fb59f0d24da147b"}
{"level":"info","ts":1647347192.63632,"logger":"docker-proxy","msg":"Connecting to docker events"}
{"level":"info","ts":1647347192.6391163,"logger":"docker-proxy","msg":"IngressNetworksMap","ingres":"map[12d99874b921591e07c8c2c0c50f6fb013984cd9045b8b0dda296fea3ce8cc18:true]"}
{"level":"info","ts":1647347192.6546054,"logger":"docker-proxy","msg":"Swarm is available","new":false}
{"level":"info","ts":1647347192.6551454,"logger":"docker-proxy","msg":"Skipping swarm config caddyfiles because swarm is not available"}
{"level":"info","ts":1647347192.708459,"logger":"docker-proxy","msg":"Skipping swarm services because swarm is not available"}
{"level":"info","ts":1647347192.7661257,"logger":"docker-proxy","msg":"Sending configuration to","server":"localhost"}
{"level":"info","ts":1647347192.7672646,"logger":"admin.api","msg":"received request","method":"POST","host":"localhost:2019","uri":"/load","remote_addr":"127.0.0.1:41304","headers":{"Accept-Encoding":["gzip"],"Content-Length":["22129"],"Content-Type":["application/json"],"User-Agent":["Go-http-client/1.1"]}}
{"level":"info","ts":1647347192.771858,"msg":"redirected default logger","from":"stderr","to":"/data/caddy.log"}
{"level":"info","ts":1647347192.8547845,"logger":"docker-proxy","msg":"Successfully configured","server":"localhost"}
2022/03/16 04:38:03 [ERROR] Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance (lockfile: /data/caddy/locks/issue_cert_auth.samcro1967.ne

5. What I already tried:

I found a few similar issues in the forum, but I’m not sure they are relevant.

I am not using AWS EFS like this one.

I already have the Cloudflare DNS challenge added and see no errors in the log indicating that it is failing.

6. Links to relevant resources:

Hmm. Well you can probably just delete /data/caddy/locks/issue_cert_auth.samcro1967.ne for now as a workaround.

I think v2.5.0-beta.1 has changes that should fix this issue.

I’m not sure why it would in this case… but it’s possible, I guess. :thinking:

Anyway, even if you’re not using EFS, this error happens when the file system doesn’t actually sync writes to the disk properly. The file is empty when it should have content. Are you using any sort of mount or virtualized disk?
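For what it’s worth, “unexpected end of JSON input” is exactly the error Go’s encoding/json package returns when it is asked to parse empty input, which is consistent with the lock file being read back as zero bytes at that moment. A minimal illustration of just the standard library behavior, not Caddy’s actual code:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    var meta struct {
        Created string `json:"created"`
        Updated string `json:"updated"`
    }
    // Parsing empty input reproduces the exact error string from the logs.
    err := json.Unmarshal([]byte{}, &meta)
    fmt.Println(err) // unexpected end of JSON input
}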

I have deleted the lock file and it reappears after caddy is restarted. The lock file is not empty; the contents are in the original post above. As far as I can tell, it looks like it should and does not appear to be incomplete.

The Ubuntu machine is a virtual machine. As far as Ubuntu is concerned, it is a local disk. The Ubuntu virtual machine’s vhdx file is on a NAS that is LAN connected via iSCSI at 10 GbE. That config has not changed since before the issue started. I have rebooted the NAS since this started. After 3 or 4 months of uptime on the NAS I do occasionally start to see some wonkiness (technical term), but it is normally related to permission issues, not writes failing to complete. I am able to docker exec into caddy and see the file as well as delete it, so I do not think it is a permissions issue. I can even cp new files from the host to it with no issue, so writes appear to work as expected as well.

I will try the new beta version in the next few days and see if that solves the issue, or maybe gives a different error message that might shed some more light on the underlying issue.

That is so weird. I can only say that I have seen this error happen with network/virtualized file systems (like yours). I don’t know why this happens with those kinds of disks, but it does. I’ve never seen this error reported for the most common, robust file systems.

Thank you for your investigation. I’d love to track down what causes this. If you conduct any more tests or gather more information, please do share!

I have been running the beta version for over 24 hours now and have not had an instance where I could not access the site. I would have had 3 or 4 failures in 24 hours with 2.4.6. I do have 3 “Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance” errors in the logs that occurred about 20 hours after I upgraded, but the site is still accessible. I will let it run a few more days, and if I can keep accessing the site without a restart, I will roll back to 2.4.6 to confirm the issue starts again, which would mean something in the upgrade from 2.4.6 to the beta version at least resolved the site accessibility issue.

I am not sure what it would tell us, but I think I might be able to roll back to version 2.4.6 and caddy-auth-portal to confirm it was not happening there either, though it will take some work. I would need to restore previous versions of my docker-compose and Caddyfile files and then re-add some services to docker-compose that I have deployed since then. If that would be of some value, let me know.

Well, that’s good news, I guess! We did make some minor improvements to error handling with regard to storage, but nothing I know of that would eliminate the error message you’re seeing (especially since I am still convinced, so far, that these are errors with the underlying storage device).

Since rolling back sounds like it’ll be a fair amount of work, I won’t ask that of you. I think the fact that the availability problem seems solved after so long is a good sign.

FYI as some additional information…

I am not seeing any errors in the NAS log, the VM syslog, or the VM host logs related to failed writes. The times in the caddy log do seem to align with busier storage times. I seem to have something running every 30 minutes that causes I/O to spike. These graphs are from the VM’s perspective. The lock refresh error occurred 3 out of 48 times in a 24 hour period. I don’t have a historical view of the NAS, but I will look into setting that up and track down the process causing storage spikes every 30 min. I don’t think that is the underlying issue, but I am curious.

I am running 7 x SSDs in RAID 5 on the NAS, connected via 10 GbE iSCSI. The SSDs are rated at “Max Sequential Read / Write 560 MB/s” and “Random Read/Write (4KB) Max. 11K IOPS”. Even though the storage is busier at the times the error occurs, the peaks are well within the storage performance capabilities.

2022/03/18 03:47:51 [ERROR] Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance (lockfile: /data/caddy/locks/issue_cert_auth.samcro1967.net.lock)
2022/03/18 05:40:25 [ERROR] Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance (lockfile: /data/caddy/locks/issue_cert_auth.samcro1967.net.lock)
2022/03/18 05:49:55 [ERROR] Keeping lock file fresh: unexpected end of JSON input - terminating lock maintenance (lockfile: /data/caddy/locks/issue_cert_auth.samcro1967.net.lock)
1 Like

This is really good to know, thanks.

I’d love to get to the bottom of this issue. I haven’t been able to repro this myself so I really appreciate your efforts here. It all seems to happen over network or virtual storage devices, and our code to write those lock files is using standard calls like Open, Read, Seek, Truncate, Write, and Sync:
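Roughly, the refresh logic looks like this. This is a simplified sketch of the idea, not the actual certmagic source; the refreshLock function and lockMeta type are names made up here just to illustrate the calls involved:

package main

import (
    "encoding/json"
    "io"
    "log"
    "os"
    "time"
)

type lockMeta struct {
    Created time.Time `json:"created"`
    Updated time.Time `json:"updated"`
}

// refreshLock re-reads the lock metadata, bumps the "updated" timestamp,
// and rewrites the file in place using Open, Read, Seek, Truncate, Write,
// and Sync.
func refreshLock(path string) error {
    f, err := os.OpenFile(path, os.O_RDWR, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()

    // If the file comes back empty at this point, decoding fails with
    // "unexpected end of JSON input".
    var meta lockMeta
    if err := json.NewDecoder(f).Decode(&meta); err != nil {
        return err
    }

    meta.Updated = time.Now()
    if _, err := f.Seek(0, io.SeekStart); err != nil {
        return err
    }
    if err := f.Truncate(0); err != nil {
        return err
    }
    if err := json.NewEncoder(f).Encode(&meta); err != nil {
        return err
    }
    return f.Sync() // flush the rewrite to the underlying device
}

func main() {
    if err := refreshLock("/data/caddy/locks/issue_cert_auth.samcro1967.net.lock"); err != nil {
        log.Println(err)
    }
}

In a flow like this, a zero-length read at the decode step is exactly what would produce that error.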

Of course if the storage device fails to implement those calls properly then it will introduce problems. But these are basic operations that should be child’s play to a file system.

I have enabled historical monitoring on the NAS. I should have some good data in about 24 hours. I found what is causing the spike every 30 mins. I am running syncthing every 30 minutes to replicate my docker volumes folder from the VM disk to a folder on the NAS outside of the VM disk. Guessing I should exclude the locks folder (maybe even the certificates folder)? I can certainly see how that would cause an issue if it is being copied at the exact second caddy is trying to update it. The odds seem low, but probably worth trying.
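If I go that route, I believe I can do it with a .stignore file at the root of the synced folder. Something like this, where the paths are only my best guess based on the "${docker}/caddy2/data:/data" volume in my compose file and may need adjusting:

// .stignore at the root of the synced docker volumes folder
// Paths below are assumptions based on my compose volume layout.
caddy2/data/caddy/locks
caddy2/data/caddy/certificates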

Still can’t explain why this started when it did if this is the root cause. I have been running syncthing this way for 2 or 3 years now.

Been meaning to get around to looking at snapshots for the folder, but have not been motivated enough to figure out how to do so on Ubuntu and then do it. I normally just exclude a folder when I encounter an issue with syncthing replication.

This topic was automatically closed after 30 days. New replies are no longer allowed.