“Sculpture, per se, is the simplest thing in the world. All you have to do is to take a big chunk of marble and a hammer and chisel, make up your mind what you are about to create and chip off all the marble you don’t want.”
– Paris Gaulois
I love this sarcastic oversimplification. It reminds us that what others see as easy is really hard for those actually doing it. It shows us that the artist needs the right tools to do the job. What it neglects, intentionally, is that it does take a small amount of skill, persistence, and patience to achieve what you want.
However, the approach is not wrong – and it does apply to parsing log files as surely as carving a bear from a block of wood. The idea is to remove all the information from the log that we don’t want to see, so that we are left with the obvious answer. Basically, we are cutting away all the parts that are not a bear.
To do that, we need to consider what is normal, and what is expected. If we were using the big axe, grep, for example, we might get many more entries than we want. By chaining statements together with the pipe (|), we can use the -v flag (invert match) to remove entries from our results.
grep StuffWeWant file.foo | grep -v StuffWeDontWant | grep -v MoreStuffWeDontWant
And so on, until we get down to the information we are looking for. This requires that we really understand how to select data. Right now, we are cutting with an axe. It might be good enough to give us what we want, but we really want to learn how to carve, not just cut.
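As a rough sketch of that axe work, here is the same kind of chain run against a made-up log. The log lines, the sshd keyword, and the strings being filtered out are all invented for illustration; swap in whatever counts as "normal" in your own log.

```shell
# Hypothetical sample log, written to a temp file so the example is self-contained.
log=$(mktemp)
cat > "$log" <<'EOF'
Jan 10 03:12:01 host sshd[311]: Accepted password for alice from 10.0.0.5
Jan 10 03:12:09 host sshd[312]: Failed password for root from 203.0.113.7
Jan 10 03:12:15 host cron[400]: session opened for user backup
Jan 10 03:13:02 host sshd[313]: Failed password for invalid user admin from 203.0.113.7
Jan 10 03:14:40 host sshd[314]: Connection closed by 10.0.0.5
EOF

# Keep the sshd lines, then chip away the entries we consider normal.
remaining=$(grep 'sshd' "$log" | grep -v 'Accepted' | grep -v 'Connection closed')
printf '%s\n' "$remaining"

rm -f "$log"
```

With this sample, only the two "Failed password" lines should be left standing.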
Carving requires a decent understanding of regular expressions. When I say that, my back teeth start hurting. Regular expressions can be a really difficult concept to grasp. I will stick to the basics here, but there are entire books written on regular expressions, also called regex. Check out some really good tutorials here.
Regex statements are pattern matching. This is a key concept whether you are using regular expressions in a Linux command line tool, a log parsing tool, or even Excel. (Some of you may laugh, but CryptoKait reviews log files for work and still sometimes uses Excel. Her tutorial on that is here.) You need to be able to look at a line in a log and find what each entry has in common, and what is different. How are the fields separated? A comma, space, tab, or newline character? When you look at a single line entry, how can you chop that line down into pieces? Well, not with an axe.
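To make "pattern matching" a little more concrete, here is a small sketch that describes the shape of an IPv4 address and pulls only the matching text out of each line. The log lines are invented, and the pattern is deliberately simple: it is one reasonable way to write it, not the only one.

```shell
# Hypothetical log lines; the layout and addresses are invented for illustration.
lines='Jan 10 03:12:09 host sshd[312]: Failed password for root from 203.0.113.7 port 4022
Jan 10 03:12:15 host cron[400]: session opened for user backup
Jan 10 03:13:02 host sshd[313]: Failed password for admin from 198.51.100.4 port 5110'

# Pattern: four groups of 1-3 digits separated by literal dots.
# It describes what an IPv4 address looks like; it does not check that each group is <= 255.
ip_pattern='[0-9]{1,3}(\.[0-9]{1,3}){3}'

# -E enables extended regex syntax; -o prints only the matching text, not the whole line.
ips=$(printf '%s\n' "$lines" | grep -oE "$ip_pattern")
printf '%s\n' "$ips"
```

The cron line has nothing that fits the pattern, so it simply drops out.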
For example, you might want all the IP addresses from the entire log, but then narrow it down to only unique IP addresses. Then you need to count those addresses. How do you get rid of all the timestamp data, error codes, messages, and other unrelated parts?
Think of each piece of data as a field in a spreadsheet. So each line in the log is a row, and each data field is a column. See where I am going with this? You want to write a statement that only returns the field that you want, not the entire row.
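Here is a minimal sketch of the row-and-column idea using awk, one of the tools introduced next. The log line is a made-up entry in a web-server-like layout, and the field numbers only hold for that layout; count the columns in your own log before trusting them.

```shell
# Hypothetical log line in a space-separated, web-server-like layout.
line='203.0.113.7 - - [10/Jan/2024:03:12:09 +0000] "GET /login HTTP/1.1" 401 512'

# awk splits each line on whitespace by default; $1 is the first "column" of the row.
ip=$(printf '%s\n' "$line" | awk '{print $1}')

# Fields can be combined: here the client address and the status code (field 9 in this layout).
ip_and_status=$(printf '%s\n' "$line" | awk '{print $1, $9}')

# For comma-separated data, tell awk the delimiter with -F.
second=$(printf 'alice,10.0.0.5,login\n' | awk -F',' '{print $2}')

echo "$ip"
echo "$ip_and_status"
echo "$second"
```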
Introducing awk, sed, uniq, and sort. These are your chisels. These let you really cut the bear out of the outline you cut out with grep. Chaining these tools together is key! I am not going into specific tutorials for these tools, but you should know that they exist and have a basic idea of what each one does – this will help you when you are working on your carving.
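awk: uses pattern matching to select lines, then splits each line into fields so you can print only the pieces you want. Similar to grep, except that awk can also add, modify, or drop data as it passes through. Like grep, awk writes its results to the screen and leaves the original file alone unless you redirect the output over that file (or use an in-place option such as GNU awk's -i inplace), so be careful where you point it. If you need to pull specific columns out of a log or reshape its lines, awk is the right tool.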
sed: think of sed as a search-and-replace tool that uses regex for pattern matching. You can pipe the output of grep into sed and modify it before it displays on the screen.
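uniq: you want to get rid of duplicates? uniq is a great tool for that. You want to count how many times each value appears? uniq does that too, with the -c flag! One catch: uniq only compares adjacent lines, so sort the data first.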
sort: before it gets to your screen you may want to display the data in a specific order, such as high value to low value. Before data goes to the next command, you may want to sort it as well.
There is a really great cheat sheet for all of these commands here.
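Putting the chisels together, here is one possible way to answer the earlier question about unique IP addresses and their counts. The access log below is invented for illustration, and the example assumes the address is the first field on each line.

```shell
# Hypothetical access log; the addresses and requests are invented for illustration.
log=$(mktemp)
cat > "$log" <<'EOF'
203.0.113.7 - - [10/Jan/2024:03:12:09 +0000] "GET /login HTTP/1.1" 401 512
198.51.100.4 - - [10/Jan/2024:03:12:11 +0000] "GET /index.html HTTP/1.1" 200 1043
203.0.113.7 - - [10/Jan/2024:03:12:15 +0000] "GET /login HTTP/1.1" 401 512
192.0.2.33 - - [10/Jan/2024:03:13:02 +0000] "GET /robots.txt HTTP/1.1" 404 0
203.0.113.7 - - [10/Jan/2024:03:13:40 +0000] "GET /admin HTTP/1.1" 403 128
EOF

# awk keeps only the first field, sort groups identical addresses together
# (uniq only collapses adjacent lines), and wc -l counts what is left.
unique_count=$(awk '{print $1}' "$log" | sort | uniq | wc -l | tr -d ' ')

# uniq -c prefixes each address with its number of hits, sort -rn puts the
# busiest address first, and sed strips the leading padding that uniq adds.
hits=$(awk '{print $1}' "$log" | sort | uniq -c | sort -rn | sed 's/^ *//')

echo "$unique_count"
echo "$hits"
rm -f "$log"
```

With this sample there are three distinct addresses, and 203.0.113.7 comes out on top with three hits.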
Consider the following statement:
grep stuffwewant file.foo | sort -k 2 | uniq -f 1 | sort -n | sed 's/\s*[0-9]\+\s\+//'
Start with the axe, then use the chisels to cut each piece away until you get exactly what you are looking for. Start by building the query one piece at a time, and testing it. Make sure that you are getting the exact result you want from that command. Then pipe it to another, and another, until you have it!
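Here is a sketch of that build-and-test habit applied to a statement shaped like the one above. It assumes every line starts with a sequence number, which is what makes sort -n and uniq -f 1 useful; the log lines are invented. The sed expression is written with -E and POSIX character classes instead of the \s and \+ shorthand, on the assumption that this form travels better between different versions of sed.

```shell
# Hypothetical numbered log; the leading number stands in for a line or sequence ID.
log=$(mktemp)
cat > "$log" <<'EOF'
1 ERROR disk full
2 INFO backup started
3 ERROR disk full
4 ERROR fan failure
5 INFO backup started
EOF

# Stage 1: the axe. Check that only the lines we care about survive.
stage1=$(grep 'ERROR' "$log")

# Stage 2: sort on everything from field 2 onward so repeated messages sit together,
# then let uniq skip the first field (the number) when comparing lines.
stage2=$(printf '%s\n' "$stage1" | sort -k 2 | uniq -f 1)

# Stage 3: put the survivors back in numeric order, then strip the leading number.
stage3=$(printf '%s\n' "$stage2" | sort -n | sed -E 's/^[[:space:]]*[0-9]+[[:space:]]+//')

printf '%s\n' "$stage3"
rm -f "$log"
```

Printing each stage variable along the way (stage1, then stage2, then stage3) is an easy way to confirm a step does what you expect before piping it into the next one.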
Is that it?
No, my advanced users, I haven’t forgotten you. For those that are not satisfied with a single, elegant, one-line solution from the command line (that is completely indecipherable to anyone but the person that wrote it), I suggest you use Python. Python is excellent at data stream manipulation and it allows you to go line by line to design your program. Then, with a little modification, you can reuse the program for a similar challenge. The best part? Regular expressions work in Python too! Here is a cheat sheet for your field manual.
Also not to be forgotten are these previous posts on Log Analysis!
Thank you for reading! Good luck, and keep hacking away at it!