Blocking User Agent and logging not working anymore

1. Caddy version (caddy version):

2.3.0

2. How I run Caddy:

a. System environment:

Ubuntu (latest LTS) on a VPS

b. Command:

paste command here

c. Service/unit/compose file:

paste full file contents here

d. My complete Caddyfile or JSON config:

# GLOBAL
{
        # Global options block. Entirely optional, https is on by default
        # Optional email key for lets encrypt

        # Optional staging lets encrypt for testing.
        # acme_ca https://acme-staging-v02.api.letsencrypt.org/directory

        servers {
                timeouts {
                        read_body   10s
                        read_header 10s
                        write       10s
                        idle        2m
                }
                max_header_size 16384
        }

}

# SNIPPETS

(mustheaders) {
        header {
                Strict-Transport-Security "max-age=31536000; includesubdomains; preload"
                X-Content-Type-Options "nosniff"
                X-Frame-Options "SAMEORIGIN"
                Referrer-Policy "same-origin"
                X-Xss-Protection "1; mode=block"
                Feature-Policy "accelerometer 'none'; autoplay 'none'; camera 'none'; encrypted-media 'none'; fullscreen 'self'; geolocation 'none'; gyroscope 'none'; magnetometer 'none'; microphone 'none'; midi 'none'; payment 'none'; picture-in-picture *; sync-xhr 'none'; usb 'none'"
                Expect-CT "max-age=604800"
                -Server
        }
}


(compression) {
        encode zstd gzip
}

(caching) {

        @static {
                file
                path *.css *.js *.ico *.gif *.jpg *.jpeg *.png *.svg *.woff
        }
        handle @static {
                header ?Cache-Control "public, max-age=5184000, must-revalidate"
        }
        handle {
                header ?Cache-Control "no-cache, no-store, must-revalidate"
        }
}

(security) {

        # deny all access to these folders
        @denied_folders path_regexp /(\.github|cache|bin|logs|backup|test)/.*$
        respond @denied_folders "Access to this folder denied" 403

        # deny running scripts inside core system folders
        @denied_system_scripts path_regexp /(core|content|test|system|vendor)/.*\.(txt|xml|md|html|yaml|php|pl|py|cgi|twig|sh|bat|yml|js)$
        respond @denied_system_scripts "Access running scripts denied" 403

        # deny running scripts inside user folder
        @denied_user_folder path_regexp /user/.*\.(txt|md|yaml|php|pl|py|cgi|twig|sh|bat|yml|js)$
        respond @denied_user_folder "Access running scripts denied" 403

        # deny access to specific files in the root folder
        @denied_root_folder path_regexp /(index.php.*|wp-admin.php|wp-login.php|wp-config.php.*|xmlrpc.php|config.production.json|config.development.json|package.json|renovate.json|ghost.js|startup.js|\.editorconfig|\.eslintignore|\.eslintrc.json|\.gitattributes|\.gitignore|\.gitmodules|\.npmignore|Gruntfile.js|LICENSE|MigratorConfig.js|LICENSE.txt|composer.lock|composer.json|nginx.conf|web.config|htaccess.txt|\.htaccess)
        respond @denied_root_folder "Access to the file denied" 403

        # block bad crawlers
        @badbots header User-Agent "AhrefsBot, DotBot, MauiBot, SemrushBot, PetalBot, MJ12bot, Seomoz, SEOstats, aesop_com_spiderman, alexibot, backweb, batchftp, bigfoot, blackwidow, blowfish, botalot, buddy, builtbottough, bullseye, cheesebot, chinaclaw, cosmos, crescent, curl, custo, da, diibot, disco, dittospyder, dragonfly, drip, easydl, ebingbong, erocrawler, exabot, eyenetie, filehound, flashget, flunky, frontpage, getright, getweb, go-ahead-got-it, gotit, grabnet, grafula, harvest, hloader, hmview, httplib, humanlinks, ilsebot, infonavirobot, infotekies, intelliseek, interget, iria, jennybot, jetcar, joc, justview, jyxobot, kenjin, keyword, larbin, leechftp, lexibot, lftp, libweb, likse, linkscan, linkwalker, lnspiderguy, lwp, magnet, mag-net, markwatch, memo, miixpc, mirror, missigua, moget, nameprotect, navroad, backdoorbot, nearsite, netants, netcraft, netmechanic, netspider, nextgensearchbot, attach, nicerspro, nimblecrawler, npbot, openfind, outfoxbot, pagegrabber, papa, pavuk, pcbrowser, pockey, propowerbot, prowebwalker, psbot, pump, queryn, recorder, realdownload, reaper, reget, true_robot, repomonkey, rma, internetseer, sitesnagger, siphon, slysearch, smartdownload, snake, snapbot, snoopy, sogou, spacebison, spankbot, spanner, sqworm, superbot, superhttp, surfbot, asterias, suzuran, szukacz, takeout, teleport, telesoft, thenomad, tighttwatbot, titan, urldispatcher, turingos, turnitinbot, *vacuum*, vci, voideye, libwww-perl, widow, wisenutbot, wwwoffle, xaldon, xenu, zeus, zyborg, anonymouse, *zip*, *mail*, *enhanc*, *fetch*, *auto*, *bandit*, *clip*, *copier*, *master*, *reaper*, *sauger*, *quester*, *whack*, *picker*, *catch*, *vampire*, *hari*, *offline*, *track*, *craftbot*, *download*, *extract*, *stripper*, *sucker*, *ninja*, *clshttp*, *webspider*, *leacher*, *collector*, *grabber*, *webpictures*, *seo*, *hole*, *copyright*, *check*"
        respond @badbots "Access for bad crawlers denied" 403
}

(proxy) {
        header_up X-Forwarded-Proto {scheme}
        header_up X-Forwarded-For {remote}
        header_up X-Real-IP {remote}
        header_down X-Powered-By "the Holy Spirit"
        header_down Server "CERN httpd"
}

(logs) {
        log {
            output file /var/log/caddy/caddy.log
        }
}


# STRIP WWW PREFIX

www.example.com {
        redir * https://{http.request.host.labels.1}.{http.request.host.labels.0}{path} permanent
}

# WEBSITES
:80 {
        respond "Access denied" 403 {
        close
        }
}

example.com {
        import mustheaders
        import caching
        import security
        respond /healthcheck 200
        reverse_proxy 127.0.0.1:2351 {
                import proxy
        }

        import logs
}

3. The problem I’m having:

Caddy stopped writing logs on 18 April; there are no new entries since then. Ghost CMS, however, is still logging.
Caddy also does not block bad crawlers; my Caddyfile config is not working in this respect.

4. Error messages and/or full log output:

5. What I already tried:

I changed my user agent with a Chrome extension to test the blocking.

6. Links to relevant resources:

Does anyone know why the bad-crawler blocking is not working?

This line matches a User-Agent header of exactly that string. Are you sure that’s what you want? You probably want to match any of those, not all of those.
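
For illustration, a trimmed sketch of that matcher (only the first few names shown): the whole quoted list is one value, so it is compared against the User-Agent as a single exact string.

	# Compares the User-Agent against this entire string literally,
	# so it never matches a real crawler:
	@badbots header User-Agent "AhrefsBot, DotBot, MauiBot"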

Of course, any of them. How should it look?

Per the documentation

Different header fields within the same set are AND-ed. Multiple values per field are OR’ed.

So you will have to split them like this

@badbots {
	header "User-Agent" AhrefsBot
	header "User-Agent" *copyright*
	header "User-Agent" *check*
}

Or switch to header_regexp, like so:

	@badbots header_regexp User-Agent "(AhrefsBot|copyright|check)"
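
Dropped back into the (security) snippet from the Caddyfile above, that would look roughly like this (bot list trimmed here for brevity):

	# block bad crawlers
	@badbots header_regexp User-Agent "(AhrefsBot|DotBot|MauiBot|SemrushBot|PetalBot|MJ12bot)"
	respond @badbots "Access for bad crawlers denied" 403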


Do I need to use the star wildcards to make it more flexible, as in my example above, or is that not necessary?

You will not need them, because this is not string matching with wildcards; it is a regular expression. You can learn about regular expressions here:

You can test/validate the regular expressions using this website: regex101. Ensure you pick the “Golang” flavor on the left.
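
As a quick sketch of why the wildcards are not needed: the regular expression match is unanchored, so the pattern is found anywhere inside the header value. For example:

	# Matches a User-Agent such as
	# "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
	# without any leading or trailing wildcards in the pattern.
	@badbots header_regexp User-Agent "(AhrefsBot|copyright|check)"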


It seems to be case-sensitive, and it looks like the wildcards are still needed. I just checked this with the Chrome plugin by removing one character from a blocked bot name, or by using lowercase only.

I will use the .* pattern.

It doesn’t have to be case-sensitive. Check this StackOverflow answer here.

Yes, I can add (?i), but should I insert it in each term like below, or is there a way to insert it once for all of them?

@badbots header_regexp User-Agent "((?i)AhrefsBot|(?i)copyright|(?i)check)"

You only need it once at the beginning. You can always test it to validate the behavior.
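
Combining the two answers, a minimal sketch of the matcher with the flag applied once at the start would be:

	@badbots header_regexp User-Agent "(?i)(ahrefsbot|copyright|check)"
	respond @badbots "Access for bad crawlers denied" 403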
