If you’ve read my Meet the Ambassadors page, you know that I was mildly traumatized by logs when I sat for the CompTIA exams in 2017. There had been no such things in my books.
My first season of National Cyber League (NCL):
In 2018, I analyzed logs by hand on a browser-only Chromebook. (See my tips for doing that here.) The only Log Analysis “tool” I used at the time was this string of characters that turns a browser into a text editor when you type it into the URL bar:
(Try it! It’s cool. 😀 )
All things considered, I didn’t do too badly.
My second season of NCL:
In 2019, I was the recipient of a hand-me-down equipment upgrade, and this is what I wrote at the top of my notes for the SSH (Easy) Log Analysis challenge in the Gymnasium:
[This challenge] may be “easy,” but Log Analysis is a weak area for me, so we’re going to use this challenge to learn stuff.
My third season of NCL:
At the end of the Spring 2020 Preseason, I ranked #1 on the leaderboard for the Log Analysis module.
I adore Splunk, but I had used nothing but command-line tools during that particular Game.
Most of what I did to teach myself about command-line tools is here in this blog series.
- Recommended: A Linux-based command-line terminal – such as the one found on Kali Linux. Kali is recommended for use with the NCL Games(*). These blog posts can show you how to set up a Kali virtual machine:
- Building Your Virtual Playground: Setting Up VMs for the National Cyber League Games
- Fresh Coffee… I Mean Kali!
- See the Setting Up a VM blog category for more.
- Alternatively: If you’re using a device that you can’t install tools on –
- If your device is a Chromebook, see if you have the option to set up Linux (Beta) on it.
- All others: This browser-based command-line terminal will allow you to upload log files to it, and it comes with all the tools we’ll be using pre-installed: Fabrice Bellard’s JSLinux.
- Leaping into Log Analysis – Covers the most introductory basics if you’re brand new to command line.
How to Approach a Wild Log
1. Determine Whether the File is a Good Candidate for Command-line Log Analysis
- Plain text files: Any file with the extension .log, .txt, or .csv will already be open in NCL’s embedded text viewer. Log files that present well in a plain text viewer are easy to parse using command line.
- Repeating entry patterns: Good candidates will also demonstrate a pattern in their log entry format. Each line (or set of lines) should look relatively similar to the majority of other lines (or sets of lines).
- Files that don’t present well in text viewers: What about something like a .sqlite file that doesn’t open automatically in NCL’s embedded viewer? It was probably designed for a specific type of viewer. Google “how to open/read/view” these types of files. Your plan of attack won’t be covered in this guide. (You’ll still have fun, though; I promise!)
- Varied log entry patterns: Output such as you would get from the
digcommand or a virus scan will be too varied for automated command-line parsing by us novices. Manual analysis will serve us best when logs are short and the entry formats are highly varied.
2. Understand What the Fields in Each Log Entry Mean
Sometimes you can guess what the fields in a log entry mean just by looking at them. For example, this set of log entries from the Leaping challenge is pretty straightforward:
Sun Mar 19 03:38:38 2017 [pid 24540] CONNECT: Client "126.96.36.199" Sun Mar 19 03:38:42 2017 [pid 24539] [anonymous] FAIL LOGIN: Client "188.8.131.52"
You see a date, time, probably a year, an identifier of some sort, an event, and there’s an IP address that we know belongs to a client because it’s labeled as such.
Sometimes the meaning of fields is less obvious. Here’s a sample from the Web Access Log challenge:
184.108.40.206 - - [02/May/2020:00:56:42 +0000] "GET / HTTP/1.1" 200 18 "-" "AWS Security Scanner"
200 mean? What does
18 mean? What does
At least the time field is still pretty obvious there, but where is the time field in this log entry?
2 234812349823 eni-02536d9e39b22fea6 220.127.116.11 10.100.200.10 62093 443 6 123 26435 1572654092 1572654105 ACCEPT OK
The question was a little misleading, because there are actually two timestamp fields there. Don’t feel bad if you missed them! You’ll eventually learn to identify alternative methods of logging time, but the answer isn’t obvious if this is your first rodeo.
What if you came across log entries that resembled this single-line monstrosity?
79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be awsexamplebucket1 [06/Feb/2019:00:00:38 +0000] 192.0.2.3 79a59df900b949e55d96a1e698fbacedfd6e09d98eacf8f8d5218e7cd47ef2be 3E57427F3EXAMPLE REST.GET.VERSIONING - "GET /awsexamplebucket1?versioning HTTP/1.1" 200 - 113 - 7 - "-" "S3Console/0.4" - s9lzHYrFp76ZVxRcpX9+5cjAnEH2ROuNkd2BHfIa6UkFVdtjf5mKR3/eTPFvsiP/XV/VLi31234= SigV2 ECDHE-RSA-AES128-GCM-SHA256 AuthHeader awsexamplebucket1.s3.us-west-1.amazonaws.com TLSV1.1
Yes, there have been log entries like that in the NCL Games! (And that one was a Medium challenge!) How would you even begin to answer the question, “What is the average amount of time that requests spent in flight,” for that log?
Easy! You would begin by figuring out which field represents flight time, right?
Look for context clues to discern the kind of log(s) you’re looking at. Check these places:
- The name of the challenge, e.g. a challenge named VSFTPD, or Wpull, or S3 doesn’t make sense as a description unless it’s the proper name of some kind of software or service.
- The name of the log file itself, e.g. a log file named cloudfront.log, browser.sqlite, or Windows Update.csv, gives you information—either in the file name or the extension—as to what types of events will be in the log.
- The scenario, e.g. an instruction to “analyze HTTP error logs” or information that “the log in this challenge was generated by Nginx” makes it clear exactly what sort of log you’re getting yourself into.
Note: It’s possible you’re looking at a log for a “custom application.” You’ll be told as much, if that’s the case. In that event, you’ll need to rely on your intuition to decipher the meaning of the log entries. (Don’t worry! You’ve got this!)
If you have any more information than that, though, turn to Google and search for log format for that type of log.
If nothing helpful comes up right away, adjust your search. Using our sample challenges as examples:
- Try narrowing the search to find more precise documentation and samples: The Leaping challenge tells us we are looking at a VSFTPD log. What happens when you Google VSFTPD log format? Nothing useful? How about when you refine the search by appending terms that appear repeatedly within the log, like “CONNECT” and “FAIL LOGIN“? Try this Google search: vsftpd log format connect “fail login”—and notice the result that has “documentation” in the title.
- Try broadening the search to return simpler introductory information: For the Web Access Log challenge, the scenario tells us this is an Nginx log. What happens when you Google web access log format Nginx? Is that information a little too technical? What if you make the search less refined, and just Google web access log format without “Nginx“? Do you know what
ALWAYS MAKE A TABLE
Always make a table out of the log format information that you uncover. As we go through examples in this blog series, you’ll see that we refer back to such tables repeatedly to help us target the right information.
Here’s what my table for a web access log entry would look like:
Sample Log Entry:
18.104.22.168 - - [02/May/2020:00:05:32 +0000] "GET / HTTP/1.1" 200 18 "http://22.214.171.124" "Mozilla/5.0 (Windows; U; MSIE 9.0; Windows NT 9.0; en-US) AppEngine-Google; (+http://code.google.com/appengine; appid: s~virustotalcloud)"
|IP of client / remote host|
|identd (identity of client)|
|userid (remote user)|
|Timestamp in format [day/month/year:hour:minute:second zone]|
|Request line in format “METHOD RESOURCE PROTOCOL”|
|Size of the object returned to the client|
That information all came from the very first result when I Googled web access log format(*).
It would be worth Googling further to find out more about that size field—how is that being measured? In bytes? Kilobytes? Refining the search to include “Nginx” again will answer that question, now that we have a better general idea!
3. Get a Sense of What “Normal” and “Suspicious” Look Like
The first step to spotting unusual activity is generally to establish a baseline. A baseline represents how things look and behave when everything is normal. When compared to a baseline, weird activity should look, well, weird!
For a small log, like in the Web Access Log challenge (156 lines), just visually skim through the entire log to get a sense of what generally looks normal.
For a larger log, like in the Leaping challenge (over 10,000 lines!), try visually inspecting the first and last 50-100 lines or so using the
cat leaping.log | head -50 cat leaping.log | tail -100
(Advanced readers, I know you’re looking at those two commands and cringing. Yes,
tail -100 leaping.log will do the same thing in fewer characters, and
tail -f leaping.log is a cool real-world command for watching the end of a log in real-time. There will be nifty commands for advanced readers in this series, too, but we want to start out simply and establish good habits that will work well every time. As we start piping multiple commands together,
catting the file first will make it easier to control and modify our output later.)
If you’re new to the
tail utilities, they’re great helper commands:
- They can show you a sample of log entries when you don’t need to see every single entry.
- You can also use
tail -1to quickly isolate the first or last entry in a set.
- Without any options specified, the
tailutilities will default to showing 10 results.
For now, just by skimming the Leaping log, we already see a ton of failed login attempts. Even though making a baseline from this log alone would suggest that’s normal behavior, we know that failed login attempts are a problem. In the case of the Leaping log, we don’t have the baseline—we have the suspicious activity.
In the Web Access Log challenge, suspicious activity requires more detective work. Retired Player Ambassador John “Mako” McGill wrote a great blog post on using
grep -v to strip away uninteresting elements from a log to expose the elements you’re looking for.
4. TOOL TIME!!!
WebWitch has a terrific primer introducing a few of these tools in her blog post, Leaping into Log Analysis.
Below is a master list of my favorite command-line tools for NCL:
|outputs file content, e.g. |
|“pipe” – takes the output of the command on the left and uses it as input for the command on the right|
|conducts a case-sensitive search|
|makes the search case-insensitive|
|excludes lines that match the search term|
|adds support for a more extensive regular expression language (regex); we’ll use it to unlock match repetition (“|
|extracts matching strings from their lines|
|“after” – returns the line that matches the search term plus however many lines after it that you indicate, e.g. |
|“before” – returns the line that matches the search term plus however many lines before it that you indicate, e.g. |
|extracts fields (columns that you specify) from lines|
|specifies the use of a delimiter other than the default tab; it can only be one character|
|denotes which field(s) to display, indicated by their numerical order of appearance; you can select a single field (|
|alphabetizes input, e.g. a, b, c, 1, 101, 2|
|sorts input numerically instead of alphabetically (so instead of 1, 101, 2 you get 1, 2, 101)|
|sorts in reverse order (handy for showing largest numbers first)|
|sorts by the field number you indicate, instead of sorting from the first character in the line; e.g. use |
|specifies the use of a delimiter other than the default space; handy when using |
|deduplicates – detects repeated lines if they are adjacent and removes the duplicates (deduplication)|
|prefixes each line with the number of adjacent occurrences, so you know how many were found in that position before the duplicates were removed|
|“word count” – counts the number of lines, words, and characters|
|displays the line count only|
|“translate” – a tool for modifying output|
|deletes characters, either specific strings or entire classes of characters|
|quick, handy way to see the character classes|
|merges lines of files into parallel columns|
|“serial” – merges items from a column-style list onto a single horizontal line and separates them by a tab|
|specifies the use of a delimiter other than the default tab|
|a calculator; may need to be installed (|
|displays the first 10 lines of input; use |
|displays the last 10 lines of input; use |
|generates the string you specify as output|
Enter the name of any utility followed by
--help to see its description and a list of options, e.g.
tr --help or
--help saves you from having to memorize all those other switches by giving you a way to quickly look them up!
These unique tools and commands have all come in handy for me on more than one occasion during the NCL Games.
If you’re new to command-line, come back to this section after completing the rest of this Command-line Log Analysis 3-part blog series. (Final post drops on 02/17.)
This information will feel like TMI for now, but later it will feel like the legendary missing link.
I’m intimidated by the unbridled power of
awk because I find it less intuitive to use, but even I have to admit that some things are just easier to do with
awk. WebWitch likes
awk for isolating columns (which I prefer to do with
cut like a barbarian), but I keep coming back to the elegant simplicity of
awk when I want to pick out full log entries based on a match within a specific field.
What does that mean? Let’s use access.log from the Web Access Log challenge to illustrate. Check out this sample log entry:
126.96.36.199 - - [02/May/2020:18:34:54 +0000] "HEAD / HTTP/1.1" 200 0 "-" "Wget/1.15 (linux-gnu)"
Say I want to find log entries like that one, where
0 occurs in the size field. If I try something like this –
cat access.log | grep "0"
– it will return literally every line in the log, because they all contain a zero somewhere!
You can isolate a specific field with
cut, but what happens when you
cut out the field that you want to search on?
cat access.log | cut -d " " -f 10 | grep "0"
You lose the rest of the line! All you see is a column of zeroes. You could perform recon on those lost lines by telling
grep to include line numbers with the
-n switch, but then you have to go chasing after those lines.
While there is a one-liner way to
grep a specific field, it’s cumbersome and easy to make mistakes with:
cat access.log | grep "^\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+\S\+\s\+0"
^ indicates the start of the line,
\S matches non-space characters, and
\s matches space characters.)
Compare that to
awk performing the exact same operation as simply as this:
cat access.log | awk '$10 == "0"'
Even if you go no further with
awk, tuck that one command in your pocket. The
$ symbol indicates the position of the field, and the string you want to match goes to the right of the operator in the middle. The
== operator will return only exact matches, but you could swap it out for the
~ operator which would return any matches containing the string you specify.
Let’s try a few commands out on access.log to illustrate. Start by isolating the unique user agent strings to use as a data set:
cat access.log | cut -d '"' -f 6 | sort | uniq
There are two lines of output in particular to keep an eye on for this exercise:
awk as shown below and observe the output:
cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 == "bot"'
Using the command above, there is no output! The
== operator will only make a match if the second field (as delimited by the space character) contains only
bot and nothing else. You can verify that with this command:
cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 == "botnet"'
One line is returned this time—the one with
botnet and nothing but
botnet in its second field.
awk with the
cat access.log | cut -d '"' -f 6 | sort | uniq | awk '$2 ~ "bot"'
This command returns both lines, because both lines contain the string
bot somewhere within their second field.
You’re not limited to matching alphanumerical values, by the way. Here’s an example of how to match on a string containing punctuation:
"CONNECT (quotation mark included). Notice below that the quotation mark needs to be escaped with a backslash (
\ ). Try running this on access.log, too:
cat access.log | awk '$6 == "
Compare Two Files for Unique Entries
No matter how many times I see this work of genius, it still gives me butterflies:
grep -Fxv -f first-file.txt second-file.txt
Basically looks for all lines inhttps://stackoverflow.com/questions/11099894/comparing-two-unsorted-lists-in-linux-listing-the-unique-in-the-second-file
second-file.txtwhich don’t match any line in
grep --help to break down what’s going on there.
Now, say you’ve got output from one command, and you want to compare it to the output of another command. What do I mean? Let’s walk through a concrete scenario using the access.log file from the Web Access Log challenge.
While putting together this blog series, I was having trouble figuring out why these two commands –
cat access.log | grep -v "CONNECT\|HEAD\|GET\|POST" | wc -l cat access.log | awk '$7 == "400"' | wc -l
– were returning different tallies. Don’t worry about what the commands mean. What’s important here is that, in theory, they should have been just two different ways to arrive at the same result. However, the first command counted 11 log entries, and the second counted only 10. What was the lone log entry causing the discrepancy?
Sure, in this case I could have stared at it ’til I found the culprit (and honestly that’s what I did), but while that’s feasible for ~10 entries, it’s a different story when there are ~10,000 entries to compare. That’s where this spectacular command comes in.
Step 1 – Redirect (
> ) each command’s output to its own file, thus:
cat access.log | grep -v "CONNECT\|HEAD\|GET\|POST" > 11-entries.txt cat access.log | awk '$7 == 400' > 10-entries.txt
Step 2 – Compare:
grep -Fxv -f 10-entries.txt 11-entries.txt
A singular log entry emerges that appears on the second list but not on the first list.
It’s easy to imagine using this command now to compare lists of names or IP addresses to identify a rogue element, wouldn’t you say?
Parse a User Agent String
Sometimes you’re asked to decipher a user agent string.
It’s not always tough. See if you can tell whether this user agent string belongs to a mobile device or a desktop device:
Mozilla/5.0 (Android 10; Mobile; rv:68.0) Gecko/68.0 Firefox/68.0
Other times, it feels like your accuracy depends on how lucky you are today.
What browser produced this user agent string?
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/78.0.3904.70 Safari/537.36
Try inputting both sample strings in this awesome online tool:
I learn best from examples and repetition, so, next, we’ll see how to use these command-line tools to solve the sample Log Analysis challenges hosted here at CryptoKait.com:
- Command-line Log Analysis FOR THE WIN (2/3): Walking through “Leaping”
- Command-line Log Analysis FOR THE WIN (3/3): Untangling a Web Access Log
Have a favorite command-line tool or trick not mentioned here? Please tell us about it in the comments, or consider applying for Feature Friday if you could write an entire blog post about it!
One thought on “Command-line Log Analysis FOR THE WIN (1/3): How to Approach a Wild Log”