Blocking User Agent and logging not working anymore

1. Caddy version (caddy version):

2.3.0

2. How I run Caddy:

a. System environment:

Ubuntu (latest LTS) on a VPS

b. Command:

paste command here

c. Service/unit/compose file:

paste full file contents here

d. My complete Caddyfile or JSON config:

# GLOBAL
{
        # Global options block. Entirely optional, https is on by default
        # Optional email key for lets encrypt

        # Optional staging lets encrypt for testing.
        # acme_ca https://acme-staging-v02.api.letsencrypt.org/directory

        servers {
                timeouts {
                        read_body   10s
                        read_header 10s
                        write       10s
                        idle        2m
                }
                max_header_size 16384
        }

}

# SNIPPETS

(mustheaders) {
        header {
                Strict-Transport-Security "max-age=31536000; includesubdomains; preload"
                X-Content-Type-Options "nosniff"
                X-Frame-Options "SAMEORIGIN"
                Referrer-Policy "same-origin"
                X-Xss-Protection "1; mode=block"
                Feature-Policy "accelerometer 'none'; autoplay 'none'; camera 'none'; encrypted-media 'none'; fullscreen 'self'; geolocation 'none'; gyroscope 'none'; magnetometer 'none'; microphone 'none'; midi 'none'; payment 'none'; picture-in-picture *; sync-xhr 'none'; usb 'none'"
                Expect-CT "max-age=604800"
                -Server
        }
}


(compression) {
        encode zstd gzip
}

(caching) {

        @static {
                file
                path *.css *.js *.ico *.gif *.jpg *.jpeg *.png *.svg *.woff
        }
        handle @static {
                header ?Cache-Control "public, max-age=5184000, must-revalidate"
        }
        handle {
                header ?Cache-Control "no-cache, no-store, must-revalidate"
        }
}

(security) {

        # deny all access to these folders
        @denied_folders path_regexp /(\.github|cache|bin|logs|backup|test)/.*$
        respond @denied_folders "Access to this folder denied" 403

        # deny running scripts inside core system folders
        @denied_system_scripts path_regexp /(core|content|test|system|vendor)/.*\.(txt|xml|md|html|yaml|php|pl|py|cgi|twig|sh|bat|yml|js)$
        respond @denied_system_scripts "Access running scripts denied" 403

        # deny running scripts inside user folder
        @denied_user_folder path_regexp /user/.*\.(txt|md|yaml|php|pl|py|cgi|twig|sh|bat|yml|js)$
        respond @denied_user_folder "Access running scripts denied" 403

        # deny access to specific files in the root folder
        @denied_root_folder path_regexp /(index.php.*|wp-admin.php|wp-login.php|wp-config.php.*|xmlrpc.php|config.production.json|config.development.json|package.json|renovate.json|ghost.js|startup.js|\.editorconfig|\.eslintignore|\.eslintrc.json|\.gitattributes|\.gitignore|\.gitmodules|\.npmignore|Gruntfile.js|LICENSE|MigratorConfig.js|LICENSE.txt|composer.lock|composer.json|nginx.conf|web.config|htaccess.txt|\.htaccess)
        respond @denied_root_folder "Access to the file denied" 403

        # block bad crawlers
        @badbots header User-Agent "AhrefsBot, DotBot, MauiBot, SemrushBot, PetalBot, MJ12bot, Seomoz, SEOstats, aesop_com_spiderman, alexibot, backweb, batchftp, bigfoot, blackwidow, blowfish, botalot, buddy, builtbottough, bullseye, cheesebot, chinaclaw, cosmos, crescent, curl, custo, da, diibot, disco, dittospyder, dragonfly, drip, easydl, ebingbong, erocrawler, exabot, eyenetie, filehound, flashget, flunky, frontpage, getright, getweb, go-ahead-got-it, gotit, grabnet, grafula, harvest, hloader, hmview, httplib, humanlinks, ilsebot, infonavirobot, infotekies, intelliseek, interget, iria, jennybot, jetcar, joc, justview, jyxobot, kenjin, keyword, larbin, leechftp, lexibot, lftp, libweb, likse, linkscan, linkwalker, lnspiderguy, lwp, magnet, mag-net, markwatch, memo, miixpc, mirror, missigua, moget, nameprotect, navroad, backdoorbot, nearsite, netants, netcraft, netmechanic, netspider, nextgensearchbot, attach, nicerspro, nimblecrawler, npbot, openfind, outfoxbot, pagegrabber, papa, pavuk, pcbrowser, pockey, propowerbot, prowebwalker, psbot, pump, queryn, recorder, realdownload, reaper, reget, true_robot, repomonkey, rma, internetseer, sitesnagger, siphon, slysearch, smartdownload, snake, snapbot, snoopy, sogou, spacebison, spankbot, spanner, sqworm, superbot, superhttp, surfbot, asterias, suzuran, szukacz, takeout, teleport, telesoft, thenomad, tighttwatbot, titan, urldispatcher, turingos, turnitinbot, *vacuum*, vci, voideye, libwww-perl, widow, wisenutbot, wwwoffle, xaldon, xenu, zeus, zyborg, anonymouse, *zip*, *mail*, *enhanc*, *fetch*, *auto*, *bandit*, *clip*, *copier*, *master*, *reaper*, *sauger*, *quester*, *whack*, *picker*, *catch*, *vampire*, *hari*, *offline*, *track*, *craftbot*, *download*, *extract*, *stripper*, *sucker*, *ninja*, *clshttp*, *webspider*, *leacher*, *collector*, *grabber*, *webpictures*, *seo*, *hole*, *copyright*, *check*"
        respond @badbots "Access for bad crawlers denied" 403
}

(proxy) {
        header_up X-Forwarded-Proto {scheme}
        header_up X-Forwarded-For {remote}
        header_up X-Real-IP {remote}
        header_down X-Powered-By "the Holy Spirit"
        header_down Server "CERN httpd"
}

(logs) {
        log {
            output file /var/log/caddy/caddy.log
        }
}


# STRIP WWW PREFIX

www.example.com {
        redir * https://{http.request.host.labels.1}.{http.request.host.labels.0}{path} permanent
}

# WEBSITES
:80 {
        respond "Access denied" 403 {
        close
        }
}

example.com {
        import mustheaders
        import caching
        import security
        respond /healthcheck 200
        reverse_proxy 127.0.0.1:2351 {
                import proxy
        }

        import logs
}

3. The problem I’m having:

Caddy stopped writing logs on 18 April; there are no new entries since then. Ghost CMS, however, is still logging.
Caddy also does not block bad crawlers; my Caddyfile config is not working in this respect.

4. Error messages and/or full log output:

5. What I already tried:

I changed my user agent with a Chrome extension to test the blocking.

6. Links to relevant resources:

Does anyone know why the bad-crawler blocking is not working?

This line matches a User-Agent header of exactly that string. Are you sure that’s what you want? You probably want to match any of those, not all of those.
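
For illustration, a trimmed sketch of that matcher (only the first few names shown): the whole quoted list is one value, so it is compared against the User-Agent as a single exact string.

	# Compares the User-Agent against this entire string literally,
	# so it never matches a real crawler:
	@badbots header User-Agent "AhrefsBot, DotBot, MauiBot"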

Of course, any of them. How should it look?

Per the documentation

Different header fields within the same set are AND-ed. Multiple values per field are OR’ed.

So you will have to split them like this

@badbots {
	header "User-Agent" AhrefsBot
	header "User-Agent" *copyright*
	header "User-Agent" *check*
}

Or switch to header_regexp, like so:

	@badbots header_regexp User-Agent "(AhrefsBot|copyright|check)"
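
Dropped back into the (security) snippet from the Caddyfile above, that would look roughly like this (bot list trimmed here for brevity):

	# block bad crawlers
	@badbots header_regexp User-Agent "(AhrefsBot|DotBot|MauiBot|SemrushBot|PetalBot|MJ12bot)"
	respond @badbots "Access for bad crawlers denied" 403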


Do I need to use the star wildcards to make it more flexible, as in my example above, or is that not necessary?

You will not need them, because this is not string matching with wildcards; it is a regular expression. You can learn about regular expressions here:

You can test/validate the regular expressions using this website: regex101. Ensure you pick the “Golang” flavor on the left.
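
As a quick sketch of why the wildcards are not needed: the regular expression match is unanchored, so the pattern is found anywhere inside the header value. For example:

	# Matches a User-Agent such as
	# "Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/)"
	# without any leading or trailing wildcards in the pattern.
	@badbots header_regexp User-Agent "(AhrefsBot|copyright|check)"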


It seems to be case-sensitive, and it looks like the wildcards are still needed. I just checked this with the Chrome plugin by removing one character from a blocked bot name, or by using lowercase only.

I will use the .* pattern.

It doesn’t have to be case-sensitive. Check this StackOverflow answer here.

Yes, I can add (?i), but should I insert it in each term like below, or is there a way to insert it once for all of them?

@badbots header_regexp User-Agent "((?i)AhrefsBot|(?i)copyright|(?i)check)"

You only need it once at the beginning. You can always test it to validate the behavior.
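
Combining the two answers, a minimal sketch of the matcher with the flag applied once at the start would be:

	@badbots header_regexp User-Agent "(?i)(ahrefsbot|copyright|check)"
	respond @badbots "Access for bad crawlers denied" 403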
