Summer Camp 2020 – Log Analysis

Paul Buonopane walks us through how he would solve the Summer Camp 2020 – Log Analysis challenge: Web Access Log.

Some logs are simple and small enough to process with common spreadsheet software like Excel and Google Sheets. While this particular access log is both simple and small enough that it could probably be coerced into a spreadsheet, that usually isn’t a viable solution for access logs in the real world. Web servers often receive thousands or even millions of requests each day, resulting in very large files. Furthermore, it’s not unusual for some entries to contain data that make it difficult to properly separate fields.
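To make the field-separation problem concrete, here is a minimal sketch using a made-up line in Nginx's combined log format (the sample line is my own, not taken from the challenge log). The quoted request and user-agent fields contain spaces, so naively splitting on whitespace, which is essentially what a spreadsheet import does, produces far more columns than the format actually defines.

sample = '203.0.113.5 - - [01/Jul/2020:12:34:56 +0000] "GET /index.html HTTP/1.1" 200 612 "-" "Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"'

# Every space is treated as a delimiter, so the quoted fields get shredded
# into many separate "columns" instead of staying intact.
print(len(sample.split()))  # Many more pieces than the format's logical fields.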

During a competition, writing a fancy script to parse logs is often a last resort because it’s time-consuming. Harder challenges often require scripting or specialized tools; however, it’s a good idea to practice such techniques with easier challenges.

The following script isn’t representative of what I would personally write during a competition; it’s significantly cleaner and has verbose commenting. However, it’s a good demonstration of a relatively simple script to parse and process data that can have an ambiguous structure at times. To run the script, you will need a recent version of Python 3. I tested the script with Python 3.8.2, the latest stable version at the time of writing. The script should be in the same directory as the access.log file.

#!/usr/bin/env python3

import re
import locale
from datetime import datetime

# Since we're parsing dates with English month names, we need to enforce an English locale.
locale.setlocale(locale.LC_ALL, "en_US")

# These regular expressions (regexes) are where most of the parsing magic happens.
entry_regex = re.compile(r"""
    ^      (?P<ip>         [0-9a-f.:]+        )     # Support both IPv4 and IPv6.
    [ ]    (?P<identd>     -                  )     # This is a historical field that Nginx doesn't support; it will always be a single hyphen.
    [ ]    (?P<user>       .*?                )     # The username could contain anything, and, unfortunately, it's not quoted.  We can only have one "anything goes" field like this.
    [ ] \[ (?P<timestamp>  [0-9a-z :/_+-]{26} ) \]  # We don't need to be too strict about the date; we parse it to a proper date later.
    [ ] "  (?P<request>    [^"]*              ) "   # Not all web servers escape log fields, but Nginx does.  Quotes and control codes will never appear.
    [ ]    (?P<status>     \d{3}              )     # Should always be exactly 3 digits.
    [ ]    (?P<bytes_sent> \d+ | -            )     # This field can be a hyphen instead of a number when no response is sent.
    [ ] "  (?P<referrer>   [^"]*              ) "   # The header is "Referer", but that's a misspelling, and newer standards spell it correctly.
    [ ] "  (?P<user_agent> [^"]*              ) "   # This can be just about anything; fortunately, it's quoted and escaped.
    $
""", re.VERBOSE | re.IGNORECASE)

request_regex = re.compile(r"""
    ^   (?P<method>   [\w.:-]+             )   # GET
    [ ] (?P<uri>      .+                   )   # /index.html
    [ ] (?P<protocol> [a-z]+ (?:/[0-9.]+)? )   # HTTP, HTTP/1.1; some clients attempt without the version specifier
    $
""", re.VERBOSE | re.IGNORECASE)


class Entry:
    def __init__(self, line):
        line_match = entry_regex.fullmatch(line)

        if line_match is None:
            raise Exception(f"Line doesn't line_match regex: {line}")

        # Most of the parsing is handled by the regexes, but we have to do a little manual labor for some fields.

        self.ip         = line_match.group("ip")
        self.user       = line_match.group("user")
        self.timestamp  = datetime.strptime(line_match.group("timestamp"), "%d/%b/%Y:%H:%M:%S %z")
        self.request    = line_match.group("request")
        self.status     = int(line_match.group("status"))
        self.bytes_sent = 0 if line_match.group("bytes_sent") == "-" else int(line_match.group("bytes_sent"))
        self.referrer   = line_match.group("referrer")
        self.user_agent = line_match.group("user_agent")

        request_match = request_regex.fullmatch(self.request)

        if request_match is not None:
            self.method   = request_match.group("method")
            self.uri      = request_match.group("uri")
            self.protocol = request_match.group("protocol")
        else:
            self.method   = None
            self.uri      = None
            self.protocol = None


with open("access.log", "r") as f:
    entries = [Entry(line) for line in (line.strip() for line in f) if line != ""]


# Analyze all the data that we've parsed.
answers = [
    ("Total requests"      , len(entries)),
    ("Unique status codes" , len(set(x.status for x in entries))),
    ("Max body size"       , max(x.bytes_sent for x in entries)),
    ("Tunnel attempts"     , len([x for x in entries if x.method == "CONNECT"])),  # Note that methods are technically case-sensitive.
    ("Invalid requests"    , len([x for x in entries if x.method is None and r"\x" in x.request])),  # Raw binary data will typically include a lot of escape sequences.
    ("SSL/TLS attempts"    , len([x for x in entries if x.request.startswith(r"\x16\x03\x01\x00")])),  # A little time with a search engine tells us this is commonly seen at the start of a request line in the relevant scenario.
    ("Unique user agents"  , len(set(x.user_agent for x in entries if x.user_agent not in ("-", "")))),  # Keep in mind that blank fields are often replaced with a hyphen.
    ("Firefox requests"    , len([x for x in entries if "firefox" in x.user_agent.lower()])),
    ("CVE-2020-8515"       , len([x for x in entries if x.uri is not None and x.uri.startswith("/cgi-bin/mainfunction.cgi")])),  # Researching CVE-2020-8515 tells us that we should expect to see this URI.
]

# This is just to format our output nicely and wouldn't be necessary during a competition.
max_width = max(len(q) for q, _ in answers)
answer_format = "{:>" + str(max_width) + "s}: {:>5d}"

for q, a in answers:
    print(answer_format.format(q, a))

When run, we expect to see the following output:

     Total requests:   #
Unique status codes:   #
      Max body size:   #
    Tunnel attempts:   #
   Invalid requests:   #
   SSL/TLS attempts:   #
 Unique user agents:   #
   Firefox requests:   #
      CVE-2020-8515:   #

When you run it, you’ll see actual numbers where the # symbols are, but you’ll have to do a little bit of the work yourself to get the answers.
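If you want to explore the parsed data beyond the answers above, the entries list built by the script is easy to inspect. Here’s a minimal sketch (run it after the script, for example by pasting it into the same interpreter session; the slice is just for illustration):

# Spot-check the first few parsed requests.
for entry in entries[:3]:
    print(entry.timestamp, entry.ip, entry.status, entry.request)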

Paul
