Command-line Log Analysis FOR THE WIN (1/3): How to Approach a Wild Log

By Taisa

Background

If you’ve read my Meet the Ambassadors page, you know that I was mildly traumatized by logs when I sat for the CompTIA exams in 2017. There had been no such things in my books.

My first season of National Cyber League (NCL):
In 2018, I analyzed logs by hand on a browser-only Chromebook. (See my tips for doing that here.) The only Log Analysis “tool” I used at the time was this string of characters that turns a browser into a text editor when you type it into the URL bar:

data:text/html,<html contenteditable>

(Try it! It’s cool. 😀 )

All things considered, I didn’t do too badly.

My second season of NCL:
In 2019, I was the recipient of a hand-me-down equipment upgrade, and this is what I wrote at the top of my notes for the SSH (Easy) Log Analysis challenge in the Gymnasium:

[This challenge] may be “easy,” but Log Analysis is a weak area for me, so we’re going to use this challenge to learn stuff.

My third season of NCL:
At the end of the Spring 2020 Preseason, I ranked #1 on the leaderboard for the Log Analysis module.

I adore Splunk, but I had used nothing but command-line tools during that particular Game.

Most of what I did to teach myself about command-line tools is here in this blog series.



How to Approach a Wild Log

1. Determine Whether the File is a Good Candidate for Command-line Log Analysis

GOOD CANDIDATES

  • Plain text files: Any file with the extension .log, .txt, or .csv will open automatically in NCL’s embedded text viewer. Log files that present well in a plain text viewer are easy to parse on the command line.
  • Repeating entry patterns: Good candidates will also demonstrate a pattern in their log entry format. Each line (or set of lines) should look relatively similar to the majority of other lines (or sets of lines).

POOR CANDIDATES

  • Files that don’t present well in text viewers: What about something like a .sqlite file that doesn’t open automatically in NCL’s embedded viewer? It was probably designed for a specific type of viewer. Google “how to open/read/view” these types of files. Your plan of attack won’t be covered in this guide. (You’ll still have fun, though; I promise!)
  • Varied log entry patterns: Output such as you would get from the dig command or a virus scan will be too varied for automated command-line parsing by us novices. Manual analysis will serve us best when logs are short and the entry formats are highly varied.

2. Understand What the Fields in Each Log Entry Mean

THE PROBLEM

Sometimes you can guess what the fields in a log entry mean just by looking at them. For example, this set of log entries from the Leaping challenge is pretty straightforward:

Sun Mar 19 03:38:38 2017 [pid 24540] CONNECT: Client "59.188.221.110"

Sun Mar 19 03:38:42 2017 [pid 24539] [anonymous] FAIL LOGIN: Client "59.188.221.110"

You see a date, time, probably a year, an identifier of some sort, an event, and there’s an IP address that we know belongs to a client because it’s labeled as such.

Sometimes the meaning of fields is less obvious. Here’s a sample from the Web Access Log challenge:

44.224.22.196 - - [02/May/2020:00:56:42 +0000] "GET / HTTP/1.1" 200 18 "-" "AWS Security Scanner"

What does 200 mean? What does 18 mean? What does "-" mean?

At least the time field is still pretty obvious there, but where is the time field in this log entry?

2 234812349823 eni-02536d9e39b22fea6 108.17.136.219 10.100.200.10 62093 443 6 123 26435 1572654092 1572654105 ACCEPT OK

The question was a little misleading, because there are actually two timestamp fields there. Don’t feel bad if you missed them! You’ll eventually learn to identify alternative methods of logging time, but the answer isn’t obvious if this is your first rodeo.
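Those two long numbers, 1572654092 and 1572654105, are Unix epoch timestamps: a count of the seconds elapsed since January 1, 1970 (UTC). When you suspect a field is an epoch timestamp, you can test the theory with the date utility. Here’s a quick sketch using GNU date (on macOS/BSD, the equivalent switch is date -r):

date -u -d @1572654092
# Sat Nov  2 00:21:32 UTC 2019

The second timestamp works out to 13 seconds later, which fits the idea of an entry that describes a window of traffic rather than a single moment.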

What if you came across log entries that resembled this single-line monstrosity?

79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be awsexamplebucket1 [06/Feb/2019:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /awsexamplebucket1?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" - s9lzHYrFp76ZVxRcpX9+5cjAnEH2ROuNkd2BHfIa6UkFVdtjf5mKR3/eTPFvsiP/XV/VLi31234= SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader awsexamplebucket1.s3.us-west-1.amazonaws.com TLSV1.1

Yes, there have been log entries like that in the NCL Games! (And that one was a Medium challenge!) How would you even begin to answer the question, “What is the average amount of time that requests spent in flight,” for that log?

Easy! You would begin by figuring out which field represents flight time, right?
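(A peek ahead at the tool section: once you do know the field number, computing an average is a one-liner with awk. A minimal sketch, where the field number 15 and the file name s3.log are hypothetical placeholders rather than the real answer:

cat s3.log | awk '{ sum += $15; n++ } END { print sum / n }'   # $15 and s3.log are placeholders

It accumulates the placeholder field across every line, then divides by the number of lines.)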

THE SOLUTION

Look for context clues to discern the kind of log(s) you’re looking at. Check these places:

  • The name of the challenge, e.g. a challenge named VSFTPD, or Wpull, or S3 doesn’t make sense as a description unless it’s the proper name of some kind of software or service.
  • The name of the log file itself, e.g. a log file named cloudfront.log, browser.sqlite, or Windows Update.csv, gives you information—either in the file name or the extension—as to what types of events will be in the log.
  • The scenario, e.g. an instruction to “analyze HTTP error logs” or information that “the log in this challenge was generated by Nginx” makes it clear exactly what sort of log you’re getting yourself into.

Note: It’s possible you’re looking at a log for a “custom application.” You’ll be told as much, if that’s the case. In that event, you’ll need to rely on your intuition to decipher the meaning of the log entries. (Don’t worry! You’ve got this!)

If you have any more information than that, though, turn to Google and search for the log format for that type of log.

If nothing helpful comes up right away, adjust your search. Using our sample challenges as examples:

  • Try narrowing the search to find more precise documentation and samples: The Leaping challenge tells us we are looking at a VSFTPD log. What happens when you Google VSFTPD log format? Nothing useful? How about when you refine the search by appending terms that appear repeatedly within the log, like "CONNECT" and "FAIL LOGIN"? Try this Google search: vsftpd log format connect "fail login" – and notice the result that has "documentation" in the title.
  • Try broadening the search to return simpler introductory information: For the Web Access Log challenge, the scenario tells us this is an Nginx log. What happens when you Google web access log format Nginx? Is that information a little too technical? What if you make the search less refined and just Google web access log format, without "Nginx"? Do you know what 200, 18, and "-" mean now?

ALWAYS MAKE A TABLE

Always make a table out of the log format information that you uncover. As we go through examples in this blog series, you’ll see that we refer back to such tables repeatedly to help us target the right information.

Here’s what my table for a web access log entry would look like:

Sample Log Entry:
35.187.132.245 - - [02/May/2020:00:05:32 +0000] "GET / HTTP/1.1" 200 18 "http://54.236.11.149" "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)"

Field Description | Example
IP of client / remote host | 35.187.132.245
identd (identity of client) | -
userid (remote user) | -
Timestamp in format [day/month/year:hour:minute:second zone] | [02/May/2020:00:05:32 +0000]
Request line in format "METHOD RESOURCE PROTOCOL" | "GET / HTTP/1.1"
Status code | 200
Size of the object returned to the client | 18
Referer | "http://54.236.11.149"
User-agent | "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)"

That information all came from the very first result when I Googled web access log format.

It would be worth Googling further to find out more about that size field—how is that being measured? In bytes? Kilobytes? Refining the search to include “Nginx” again will answer that question, now that we have a better general idea!
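Here’s a small taste of how the table pays off, using the cut utility from the tool list in section 4: since the status code is the ninth space-separated field, you can tally every status code in the log. (A sketch that assumes well-formed entries; malformed request lines can shift the field positions.)

cat access.log | cut -d " " -f 9 | sort | uniq -c

That prints each distinct status code alongside the number of times it appears.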


3. Get a Sense of What “Normal” and “Suspicious” Look Like

The first step to spotting unusual activity is generally to establish a baseline. A baseline represents how things look and behave when everything is normal. When compared to a baseline, weird activity should look, well, weird!

For a small log, like in the Web Access Log challenge (156 lines), just visually skim through the entire log to get a sense of what generally looks normal.
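(Not sure how many lines you’re dealing with? The wc utility from the tool list below will tell you:

wc -l access.log

That prints the line count followed by the file name.)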

For a larger log, like in the Leaping challenge (over 10,000 lines!), try visually inspecting the first and last 50-100 lines or so using the head and tail utilities:

cat leaping.log | head -50

cat leaping.log | tail -100

(Advanced readers, I know you’re looking at those two commands and cringing. Yes, tail -100 leaping.log will do the same thing in fewer characters, and tail -f leaping.log is a cool real-world command for watching the end of a log in real-time. There will be nifty commands for advanced readers in this series, too, but we want to start out simply and establish good habits that will work well every time. As we start piping multiple commands together, catting the file first will make it easier to control and modify our output later.)

If you’re new to the head and tail utilities, they’re great helper commands:

  • They can show you a sample of log entries when you don’t need to see every single entry.
  • You can also use head -1 and tail -1 to quickly isolate the first or last entry in a set.
  • Without any options specified, the head and tail utilities will default to showing 10 results.
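You can also chain them: piping head into tail isolates a single line by its position. For example, to see only line 50 of the Leaping log:

cat leaping.log | head -50 | tail -1

head keeps the first 50 lines, and tail -1 keeps the last of those, which is line 50 itself.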

For now, just by skimming the Leaping log, we already see a ton of failed login attempts. Even though making a baseline from this log alone would suggest that’s normal behavior, we know that failed login attempts are a problem. In the case of the Leaping log, we don’t have the baseline—we have the suspicious activity.

In the Web Access Log challenge, suspicious activity requires more detective work. Retired Player Ambassador John “Mako” McGill wrote a great blog post on using grep -v to strip away uninteresting elements from a log to expose the elements you’re looking for.
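The idea in miniature: each grep -v peels away a category of lines you’ve already accounted for, until only the oddballs remain. A hypothetical sketch (the patterns here are placeholders for illustration, not taken from Mako’s post):

cat access.log | grep -v "GET" | grep -v "POST"

Everything containing GET or POST disappears, leaving the rarer requests behind for closer inspection.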


4. TOOL TIME!!!

WebWitch has a terrific primer introducing a few of these tools in her blog post, Leaping into Log Analysis.

Below is a master list of my favorite command-line tools for NCL:

cat – outputs file content, e.g. cat leaping.log
| – "pipe": takes the output of the command on the left and uses it as input for the command on the right
grep – conducts a case-sensitive search
    -i – makes the search case-insensitive
    -v – excludes lines that match the search term
    -E – adds support for extended regular expressions (regex); we’ll use it to unlock match repetition ("{ }") when hunting for IPs
    -o – extracts matching strings from their lines
    -A – "after": returns the line that matches the search term plus however many lines after it that you indicate, e.g. grep "flag" -A 2 returns 3 adjacent lines starting with the line that matched the search term flag
    -B – "before": returns the line that matches the search term plus however many lines before it that you indicate, e.g. grep "flag" -B 1 returns 2 adjacent lines ending with the line that matched the search term flag
cut – extracts fields (columns that you specify) from lines
    -d – specifies a delimiter other than the default tab; it can only be one character
    -f – denotes which field(s) to display, indicated by their numerical order of appearance; you can select a single field (-f 1), a set of fields (-f 1,3,7), a range of fields (-f 2-7), all fields ’til the end of the line (-f 3-), or any combination (-f 2,4-6,9-)
sort – alphabetizes input, so numbers get ordered as text, e.g. 1, 101, 2
    -n – sorts input numerically instead of alphabetically (so instead of 1, 101, 2 you get 1, 2, 101)
    -r – sorts in reverse order (handy for showing largest numbers first)
    -k – sorts by the field number you indicate, instead of sorting from the first character in the line; e.g. use sort -k 3 to sort by field 3, use sort -k 3,5 to sort on the string that starts in field 3 and ends in field 5, or use sort -k 5 -k 3 to sort first by field 5 and then by field 3
    -t – specifies a delimiter other than the default space; handy when using -k
uniq – deduplicates: detects repeated lines if they are adjacent and removes the duplicates
    -c – prefixes each line with the number of adjacent occurrences, so you know how many were found before the duplicates were removed
wc – "word count": counts the number of lines, words, and characters
    -l – displays the line count only
tr – "translate": a tool for modifying output
    -d – deletes characters, either specific strings or entire classes of characters
    --help – quick, handy way to see the character classes
paste – merges lines of files into parallel columns
    -s – "serial": merges items from a column-style list onto a single horizontal line and separates them by a tab
    -d – specifies a delimiter other than the default tab
bc – a calculator; may need to be installed (sudo apt-get install bc)
head – displays the first 10 lines of input; use head -[number] to change the default number, e.g. head -3
tail – displays the last 10 lines of input; use tail -[number] to change the default number, e.g. tail -25
echo – generates the string you specify as output
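To see how several of these snap together, here’s a classic pipeline that pulls every IPv4-looking string out of a log and ranks the addresses by how often they appear. (A sketch against the Leaping log; the regex is a deliberately loose match, so sanity-check what it catches.)

cat leaping.log | grep -oE "([0-9]{1,3}\.){3}[0-9]{1,3}" | sort | uniq -c | sort -nr | head

Read it left to right: extract the matching strings (grep -oE), group identical ones together (sort), count each group (uniq -c), order the counts largest-first (sort -nr), and show the top 10 (head).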

Pro Tip

Enter the name of any utility followed by --help to see its description and a list of options, e.g. tr --help or wc --help.

Simply remembering --help saves you from having to memorize all those other switches by giving you a way to quickly look them up!


Special Tools

These unique tools and commands have all come in handy for me on more than one occasion during the NCL Games.

If you’re new to command-line, come back to this section after completing the rest of this Command-line Log Analysis 3-part blog series. (Final post drops on 02/17.)

This information will feel like TMI for now, but later it will feel like the legendary missing link.

awk

I’m intimidated by the unbridled power of awk because I find it less intuitive to use, but even I have to admit that some things are just easier to do with awk. WebWitch likes awk for isolating columns (which I prefer to do with cut like a barbarian), but I keep coming back to the elegant simplicity of awk when I want to pick out full log entries based on a match within a specific field.

What does that mean? Let’s use access.log from the Web Access Log challenge to illustrate. Check out this sample log entry:

82.81.142.50 - - [02/May/2020:18:34:54 +0000] "HEAD / HTTP/1.1" 200 0 "-" "Wget/1.15 (linux-gnu)"

Say I want to find log entries like that one, where 0 occurs in the size field. If I try something like this –

cat access.log | grep "0"

– it will return literally every line in the log, because they all contain a zero somewhere!

You can isolate a specific field with cut, but what happens when you cut out the field that you want to search on?

cat access.log | cut -d " " -f 10 | grep "0"

You lose the rest of the line! All you see is a column of zeroes. You could perform recon on those lost lines by telling grep to include line numbers with the -n switch, but then you have to go chasing after those lines.

While there is a one-liner way to grep a specific field, it’s cumbersome and easy to make mistakes with:

cat access.log | grep "^\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+0"

( ^ indicates the start of the line, \S matches non-space characters, and \s matches space characters.)

Compare that to awk performing the exact same operation as simply as this:

cat access.log | awk '$10 == "0"'

Even if you go no further with awk, tuck that one command in your pocket. The $ symbol indicates the position of the field, and the string you want to match goes to the right of the operator in the middle. The == operator returns only exact matches, but you can swap it out for the ~ operator, which returns any match containing the string you specify (under the hood, ~ treats that string as a regular expression).

Let’s try a few commands out on access.log to illustrate. Start by isolating the unique user agent strings to use as a data set:

cat access.log | cut -d '"' -f 6 | sort | uniq

There are two lines of output in particular to keep an eye on for this exercise:

Googlebot/2.X (http://www.googlebot.com/bot.html)
polaris botnet

Append awk as shown below and observe the output:

cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 == "bot"'

Using the command above, there is no output! The == operator will only make a match if the second field (as delimited by the space character) contains only bot and nothing else. You can verify that with this command:

cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 == "botnet"'

One line is returned this time—the one with botnet and nothing but botnet in its second field.

Now try awk with the ~ operator:

cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 ~ "bot"'

This command returns both lines, because both lines contain the string bot somewhere within their second field.

You’re not limited to matching alphanumerical values, by the way. Here’s an example of how to match on a string containing punctuation: "CONNECT (quotation mark included). Notice below that the quotation mark needs to be escaped with a backslash ( \ ). Try running this on access.log, too:

cat access.log | awk '$6 == "\"CONNECT"'
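An alternative worth knowing: like cut -d, awk can change its field delimiter with the -F switch. If you split on the quotation mark instead of on spaces, the entire request line becomes field 2 and nothing needs escaping. A sketch that, for this log, should surface the same CONNECT entries as the command above:

cat access.log | awk -F '"' '$2 ~ "CONNECT"'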

Compare Two Files for Unique Entries

No matter how many times I see this work of genius, it still gives me butterflies:

grep -Fxv -f first-file.txt second-file.txt

Basically, it looks for all lines in second-file.txt that don’t match any line in first-file.txt.

https://stackoverflow.com/questions/11099894/comparing-two-unsorted-lists-in-linux-listing-the-unique-in-the-second-file

Run grep --help to break down what’s going on there: -F treats each pattern as a fixed string rather than a regex, -x requires the whole line to match, -v inverts the match, and -f reads the patterns from the file you name.
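If you’d like to convince yourself before trusting it on a challenge, here’s a throwaway demo (the file names and contents are made up for illustration):

printf "alpha\nbravo\ncharlie\n" > first-file.txt
printf "bravo\ncharlie\ndelta\n" > second-file.txt
grep -Fxv -f first-file.txt second-file.txt

Only delta is printed: it’s the one line of the second file that appears nowhere in the first.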

Now, say you’ve got output from one command, and you want to compare it to the output of another command. What do I mean? Let’s walk through a concrete scenario using the access.log file from the Web Access Log challenge.

While putting together this blog series, I was having trouble figuring out why these two commands –

cat access.log | grep -v "CONNECT\|HEAD\|GET\|POST" | wc -l

cat access.log | awk '$7 == "400"' | wc -l

– were returning different tallies. Don’t worry about what the commands mean. What’s important here is that, in theory, they should have been just two different ways to arrive at the same result. However, the first command counted 11 log entries, and the second counted only 10. What was the lone log entry causing the discrepancy?

Sure, in this case I could have stared at it ’til I found the culprit (and honestly that’s what I did), but while that’s feasible for ~10 entries, it’s a different story when there are ~10,000 entries to compare. That’s where this spectacular command comes in.

Step 1 – Redirect ( > ) each command’s output to its own file, thus:

cat access.log | grep -v "CONNECT\|HEAD\|GET\|POST" > 11-entries.txt

cat access.log | awk '$7 == 400' > 10-entries.txt

Step 2 – Compare:

grep -Fxv -f 10-entries.txt 11-entries.txt

A singular log entry emerges that appears on the second list but not on the first list.

It’s easy to imagine using this command now to compare lists of names or IP addresses to identify a rogue element, wouldn’t you say?


Parse a User Agent String

Sometimes you’re asked to decipher a user agent string.

It’s not always tough. See if you can tell whether this user agent string belongs to a mobile device or a desktop device:

Mozilla/5.0 (Android 10; Mobile; rv:68.0) Gecko/68.0 Firefox/68.0

Other times, it feels like your accuracy depends on how lucky you are today.

What browser produced this user agent string?

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/78.0.3904.70 Safari/537.36

Try inputting both sample strings into this awesome online tool:

https://developers.whatismybrowser.com/useragents/parse/#parse-useragent


What’s Next?!

I learn best from examples and repetition, so next we’ll see how to use these command-line tools to solve the sample Log Analysis challenges hosted here at CryptoKait.com:

  1. Command-line Log Analysis FOR THE WIN (2/3): Walking through “Leaping”
    (Coming 02/15)
  2. Command-line Log Analysis FOR THE WIN (3/3): Untangling a Web Access Log
    (Coming 02/17)

Have a favorite command-line tool or trick not mentioned here? Please tell us about it in the comments, or consider applying for Feature Friday if you could write an entire blog post about it!
