Caddy / GoAccess log file analysis

DWiskow · August 5, 2020, 2:13pm

I really like GoAccess (https://goaccess.io/) as a tool for convenient and quick analysis of access logs . . . it shares a philosophy, if not its development language, with Caddy in that it is self-contained and stand-alone with no dependencies (and can even generate self-contained access log file reporting in a single HTML file, that you could then auto-deploy on your web site) . . . but, currently, it is difficult to obtain the full benefit from GoAccess with Caddy log files as there is not a log file format shared by both Caddy and GoAccess that passes the full set of Caddy log data that GoAccess can use (other than the somewhat limited, common format).

I believe that implementing JSON log file input is on the GoAccess list/roadmap somewhere, but that doesn’t help us now.

To solve this issue, I have written a python3 script that will convert Caddy JSON log data into a format that maximises the input to GoAccess. This converts all of the relevant available data [that GoAccess can make use of] from Caddy JSON log data into a format that can then be parsed and understood by GoAccess.

The latest version can even support real-time analysis of log files (with html output) by leveraging TCP/IP network sockets (read all the way to the bottom of this thread to see some examples of how to use it). Access the latest version of this python script in my github repository where you can download it . . . CaddyGoAccessDataLoggerConverter

The initial python code example (below)

#!/usr/bin/python3
""" ***************************************************************************
Takes a caddy JSON log file as input and converts it into a format suitable for
analysis by goaccess. To use the output file, goaccess must be run with the
log-format specified as detailed below

   goaccess caddy.log --log-format="%d %t %v %h %m %U %H %s %b %T %R %u" --date-format=%F --time-format=%H:%M:%S >caddy.html

Note: when running caddy the Caddyfile must define "format json" as shown in
the example below

   localhost {
      file_server
      log {
         output file caddy.log {
            roll_local_time true
         }
         format json
      }
   }

*************************************************************************** """
import sys, signal, getopt, json
from datetime import datetime
from time import sleep

interval = 0
inputfile = ''

def shortHelp():
   print()
   print(' logConvert.py -i <interval time in seconds> filename')
   print()

def longHelp():
   print()
   print(' logConvert.py -i <interval time in seconds> filename')
   print()
   print(' If interval is zero (or -i option is ommitted) logConvert will process the')
   print(' input file and then exit, otherwise logConvert will process the input file')
   print(' then sleep for the specified interval before processing any additional')
   print(' entries (and then repeat indefinately until terminated)')
   print()
   print(' output from logConvert can be caputred to a file by adding " >outputfilename"')
   print(' to the end of the command line')
   print()
   print(' To then process the output file with goaccess, use the following command')
   print()
   print(' goaccess caddy.log --log-format="%d %t %v %h %m %U %H %s %b %T %R %u" --date-format=%F --time-format=%H:%M:%S >caddy.html')
   print()

def processArgs(argv):
   global interval
   global inputfile
   try:
      opts, args = getopt.getopt(argv,"hi:",["interval="])
   except getopt.GetoptError:
      longHelp()
      sys.exit(2)
   for opt, arg in opts:
      if opt == '-h':
         longHelp()
         sys.exit()
      elif opt in ("-i", "--interval"):
         try:
            interval = int(arg)
         except:
            print()
            print(' Interval must be a whole number of seconds')
            print()
           sys.exit(2)
     if (len(args) > 0):
         inputfile = args[0]
     else:
         longHelp()
         sys.exit(2)

def main():
   processArgs(sys.argv[1:])
   try:
      with open(inputfile) as f:
         while True:
            line = f.readline()
            if (line != ""):
               data =(json.loads(line))
               ts = str(datetime.fromtimestamp(data['ts']))
               date = ts[0:10]
               time = ts[11:19]
               ip = (data['request']['remote_addr'])
               ip = ip[0:ip.rindex(':')]
               if (ip[0] == '['):
                  ip = ip[1:ip.rindex(']')]
               if "Referer" in data['request']['headers'].keys():
                  referer = '"'+data['request']['headers']['Referer'][0]+'"'
               else:
                  referer = '""'
               if "User-Agent" in data['request']['headers'].keys():
                  user_agent = '"'+data['request']['headers']['User-Agent'][0]+'"'
               else:
                  user_agent = '""'
               latency=data['duration']
               print(
                  date,                               # %d
                  time,                               # %t
                  '"'+(data['request']['host'])+'"',  # %v
                  ip,                                 # %h
                  data['request']['method'],          # %m
                  '"'+(data['request']['uri'])+'"',   # %U
                  data['request']['proto'],           # %H
                  data['status'],                     # %s
                  data['size'],                       # %b
                  latency,                            # %T
                  referer,                            # %R
                  user_agent                          # %u
               )
            elif (interval > 0):
              sleep(interval)
            elif (interval == 0):
               break
   except FileNotFoundError:
      print()
      print(' Input file "{}" not found'.format(inputfile))
      print()

def signal_handler(signal, frame):
   sys.exit(0)

if __name__ == "__main__":
   signal.signal(signal.SIGINT, signal_handler)
   main()

Copy the above text into a file called logConvert.py and use chmod +x logConvert.py to make it executable. The Caddy log file can then be converted by using the following command:

./logConvert.py access.log >access.goaccess.log

To continually monitor and convert the Caddy log file [i.e. stream the logs] use the following command instead:

./logConvert.py -i 300 access.log >access.goaccess.log

This will convert all of the current content of the Caddy log file, write it out into the specified file (access.goaccess.log in this case) and then go to sleep for 300 seconds. Once the sleep period has elapsed, any further entires in the Caddy log file will be processed and added to the output file. By using this approach the output file can be continually updated/synchronised with Caddy log files in near real time (or at whatever interval you choose).

Alternatively, using cron, logConvert.py could be executed in the early hours of the morning each day to process the Caddy log file. If this were also combined with automatically executing goaccess after completion to create a self-contained html output file, a static html file could be generated every day and made available as a page on your web site.

The output from logConvert.py is in the following format

2020-08-03 19:17:37 "example.com" 192.168.1.1 GET "/" HTTP/1.1 200 458 0.005674565 "" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
2020-08-03 18:35:53 "example.com" 192.168.100.3 GET "/wp-login.php" HTTP/1.1 404 0 0.000298749 "http://example.com/wp-login.php" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"
2020-08-03 19:04:35 "example.com" 192.168.200.56 GET "/admin/" HTTP/1.1 404 0 0.000482654 "http://example.com/admin/" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:62.0) Gecko/20100101 Firefox/62.0"

By virtue of including the ‘host’ name in the output (“example.com” as shown above) , logConvert.py is ideally suited to a multiple domain/sub-domain scenario for which GoAccess can also provide analysis.

francislavoie · August 5, 2020, 2:58pm

Very cool! Do you think this could run as a service that accepts logs via a net socket? This way it could stream them instead of doing it in batches I think.

DWiskow · August 5, 2020, 3:05pm

If you run it with the time interval set to some value other than 0 (./logConvert.py -t 300 access.log > access.goaccess.log), in this example 5 minutes, then it will effectively ‘stream’ all additions to the Caddy log file (access.log) into to the converted log file (access.goccess.log) at 5 minute intervals (after an initial pass to convert everything already in the Caddy log file.

This makes it very low overhead, as it simply ‘sleeps’ for the number of seconds specified, checks to see if any new lines have been written to the log and then either goes directly back to sleep or processes the additional lines only and then goes back to sleep.

francislavoie · August 5, 2020, 3:29pm

That’s true, but it’s also not really “real-time” because there’s a day of up to N seconds where goaccess will not be made aware of the logs. It ends up being bursty on a 5 minute interval. With the net approach, you could handle the logs immediately which wouldn’t really be any more expensive (other than the slight overhead of the net socket but that’s quite minimal if it’s running on the same machine).

For reference, here’s the issue where JSON (and other structured logs) support is being tracked on the Goaccess project (as of this writing, unstarted as far as I can tell):

https://github.com/allinurl/goaccess/issues/1768

DWiskow · August 5, 2020, 3:49pm

@francislavoie, you are absolutely correct . . . but, how often do we really look at Caddys log files? . . . a daily or hourly update that then leveraged goaccess to auto-generate a self-contained html web page that we could easily access (as an actual page on the web site) would probably be good enough [IMHO].

Alternatively, the conversion could always be run on demand for those [rare] occasions where a specific need requires it.

DWiskow · August 5, 2020, 4:03pm

REMOVED . . . I was talking nonsense !

francislavoie · August 5, 2020, 4:07pm

Are you sure GoAccess doesn’t just open a read-only stream of the log file?

One of the listed features https://goaccess.io/features is “Completely Real Time”:

All panels and metrics are timed to be updated every 200 ms on the terminal output and every second on the HTML output.

DWiskow · August 5, 2020, 4:21pm

Again, you are correct . . . whilst goaccess can manage realtime updates to its output (terminal or html), the issue of true real time updates from Caddy to goaccess will remain until goaccess is able to directly consume Caddy log files [and process all of the data within them] . . . this is possible today, but the format of Caddy log file that goaccess will accept is quite limited in its data content (so sub-optimal).

If you were to use logConvert.py with the time interval set to 1 second, this would then feed goaccess at 1 second intervals [which aligns with its html file update rate] . . . and, as logConvert is only processing the additional new log records [not the entire file] each time, this should not introduce too much overhead.

DWiskow · August 5, 2020, 4:44pm

with respect to “network sockets” . . . I don’t see any reason why this code couldn’t be adapted to be driven via network sockets rather than an actual log file.

with open(inputfile) as f:
    while True:
        line = f.readline()

This would be relatively easy, assuming the log data is also passed as JSON. Effectively, it would only require replacing the file open & read (lines 88-90 shown above) with opening a socket and waiting for a message . . . are you aware of any further documentation on what Caddy provides beyond the . . .

net

A network socket.
output net <address>

<address>  is the [address](https://caddyserver.com/docs/conventions#network-addresses) to write logs to.

. . . lines provided in the Caddy documentation [i.e. confirmation that it is the entire JSON data set that is output in a single shot].

I ask this because often, when using sockets, only data that is different is transmitted which requires that the listener manages a complete set of data, updates it on an item by item basis and knows what the trigger to write the log is.

francislavoie · August 5, 2020, 8:05pm

Unfortunately I don’t think there’s any more documentation yet.

The implementation is found here, it’s very simple, it just writes a byte array to the socket: caddy/netwriter.go at master · caddyserver/caddy · GitHub

I think the byte array is just whatever the configured encoder outputs, so I think you can assume it to be a JSON string. You might need to try it out to find out.

matt · August 6, 2020, 12:30am

Yep, can confirm, it’s the exact same thing that is written to a file or whatever other output is configured: just to a network socket instead. So if the encoder is configured to be JSON (default for unsupervised executions), the socket will see a stream of JSON objects come in.

DWiskow · August 6, 2020, 4:08pm

Below is a proof of concept of real-time capture of Caddy log data over network sockets. This takes the form of a simple 16 line python script (socketSrv.py) that establishes a network socket server on the specified <ip address><port> and waits for a connection from Caddy

Note: socketSrv.py must run before Caddy, otherwise Caddy will fail to start as it will be unable to connect to the socket specified in its Caddyfile.

#!/usr/bin/python3
""" *******************************************************************************
A simple TCP socket server that listens on the specified <host>:<port> and 
outputs any messages received to stdout
 
    Note: when running caddy the Caddyfile must define "output net <127.0.0.1:55555>"
    and "format json" as shown in the example below

            localhost {
                file_server
                log {
                    output net 127.0.0.1:55555
                    format json
                }   
            }

socketSrv.py must be run before caddy, or caddy will fail to start as it will not
be able to connect to the specified logging server socket
******************************************************************************* """
import socket, sys, signal
server_address = ('127.0.0.1', 55555)

def main():
    with socket.create_server(server_address, family=socket.AF_INET, reuse_port=True) as sock:
        sock.listen()
        connection, client_address = sock.accept()
        with connection:
            while True:
                data = connection.recv(4096).decode()
                if data:
                    print(data)

def signal_handler(signal, frame):
    sys.exit(0)

if __name__ == "__main__":
    signal.signal(signal.SIGINT, signal_handler)
    main()

The output from socketSrv.py can be redirected to a file in the usual way bey adding “>filename” to the command line when sockerSrv.py is executed. An example of the output produced is shown below.

{"level":"info","ts":1596728860.889645,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_addr":"[::1]:52830","proto":"HTTP/2.0","method":"GET","host":"localhost","uri":"/","headers":{"Purpose":["prefetch"],"Sec-Fetch-Site":["none"],"Sec-Fetch-User":["?1"],"Sec-Fetch-Dest":["document"],"Accept-Encoding":["gzip, deflate, br"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"],"Sec-Fetch-Mode":["navigate"],"Accept-Language":["en-GB,en-US;q=0.9,en;q=0.8,fr;q=0.7,la;q=0.6"],"Upgrade-Insecure-Requests":["1"]},"tls":{"resumed":false,"version":772,"ciphersuite":4865,"proto":"h2","proto_mutual":true,"server_name":"localhost"}},"common_log":"::1 - - [06/Aug/2020:16:47:40 +0100] \"GET / HTTP/2.0\" 200 1102","duration":0.006761898,"size":1102,"status":200,"resp_headers":{"Server":["Caddy"],"Content-Encoding":["gzip"],"Content-Type":["text/html; charset=utf-8"],"Vary":["Accept-Encoding"]}}
{"level":"info","ts":1596728874.495338,"logger":"http.log.access.log0","msg":"handled request","request":{"remote_addr":"[::1]:52830","proto":"HTTP/2.0","method":"GET","host":"localhost","uri":"/second","headers":{"Sec-Fetch-Mode":["navigate"],"Sec-Fetch-User":["?1"],"Sec-Fetch-Dest":["document"],"Accept-Encoding":["gzip, deflate, br"],"Accept-Language":["en-GB,en-US;q=0.9,en;q=0.8,fr;q=0.7,la;q=0.6"],"Upgrade-Insecure-Requests":["1"],"Sec-Fetch-Site":["same-origin"],"Referer":["https://localhost/"],"User-Agent":["Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36"],"Accept":["text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9"]},"tls":{"resumed":false,"version":772,"ciphersuite":4865,"proto":"h2","proto_mutual":true,"server_name":"localhost"}},"common_log":"::1 - - [06/Aug/2020:16:47:54 +0100] \"GET /second HTTP/2.0\" 200 1067","duration":0.001129496,"size":1067,"status":200,"resp_headers":{"Server":["Caddy"],"Content-Encoding":["gzip"],"Content-Type":["text/html; charset=utf-8"],"Vary":["Accept-Encoding"]}}

This is clearly recognisable as the standard/complete JSON logging output from Caddy . . .

. . . from here it is a small step to generate Caddy JSON log data in a file format that maximises the input to, and enables real-time log analysis using goaccess.

Using this approach the capture, conversion and output to file of Caddy log data could be run on the same, or a different server, to that which Caddy is run on. This would support minimising resource usage/conflict on the Caddy reverse proxy/web server, if required.

Relocating the Caddy log data conversion and capture to a secondary server would also allow relocation of goaccess to that server, whilst still supporting real-time Caddy log file analysis.

DWiskow · August 8, 2020, 9:06pm

I have now created a version of my python script that can be run as a service or in batch, pseudo real-time batch or TCP/IP network socket real-time modes . . .

. . . the significant increase in lines of code (and effort) in creating this is primarily related to the additional error checking and embedded --help / argument(s) now supported, plus the TCP/IP network socket option.

I have published the code on github in my repository CaddyGoAccessDataLoggerConverter where you can download it.

Below are some examples of how caddyLog.py can be run.

./caddyLog.py -n localhost:55555 -g access.goaccess.log -j access.json

   Sets up a TCP/IP network socket server on IP address localhost:55555
   and outputs any Log data streamed to it by Caddy over the network to
   a file named "access.goaccess.log" (containing Caddy log data converted
   into a format compatible with GoAccess - https://goaccess.io/) AND ALSO
   to an output file named "access.json" (containing the complete Caddy log
   data in JSON format)
                
   Optionally select only the -g [--outputGoAccessFilename] OR the -j
   [--outputJSONfilename] to output a single file of the required
   format

./caddyLog.py -i access.log -g access.goaccess.log

   Reads in the data from the file "access.log" (JSON format) and writes
   out a file named "access.goaccess.log" (containing the Caddy log data
   converted into a format compatible with GoAccess).

./caddyLog.py -i access.log -i 600 -g access.goaccess.log

   Reads in the data from the file "access.log" (in JSON format) and writes
   out a file named "access.goaccess.log" (containing the Caddy log data
   converted into a format compatible with goAccess), then repeatedly
   sleeps for 10 minutes (600 seconds) before checking to see if any
   additional Caddy log data has been written to "access.log". If
   additional data has been added to "access.log", it will convert it and
   append it to "access.goaccess.log" before again sleeping.

Executing caddyLog.py with the argument -h or --help will provide more instructions and detail on how to use caddyLog.py

I would be very interested to get any feedback/comments on this as I believe it is robust and production ready.

louis77 · September 29, 2020, 9:37am

Many thanks for that great script! I didn’t know about Goaccess and was already looking out for some commercial logging services. Thanks to this and your work we now have the perfect setup!

Let me know if and how I can buy you a coffee or something.