Sed & awk examples

Posted on 11 Dec 2010
Tagged with: [ awk ] [ sed ]

Did you know you can write a webserver in awk, or that sed supports conditional jumps? Probably not… These tools (languages, actually) are much more powerful than most people realize. The sed & awk combination gives you massive power IF used correctly. Although most people use them for simple tasks like search/replacing or displaying certain columns of a file, the potential is much higher. I will discuss a few real-life examples I use from time to time…
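
To give you a taste of the sed part: the 't' command jumps to a label, but only if the previous substitution actually changed something. A classic toy example (not needed for the log examples below) is inserting thousands-separators into a number, looping until nothing changes anymore:

echo "1234567" | sed -e ':loop' -e 's/\(.*[0-9]\)\([0-9]\{3\}\)/\1,\2/' -e 't loop'

This prints "1,234,567".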

Our data

All examples in this post use an apache logfile taken from this blog-site. You can try the examples yourself with your own apache logfiles (the more data, the better), or use the anonymized apache log which you can download here. Note that only the IP-addresses are anonymized. All other data you find in there is real data.

Anonymized apache log (zipped: 928Kb)

A quick introduction to awk

Awk uses a concept of records and fields, where every record is by default separated by a newline (\n) and every field by a space (or tab). If you want to change the field separator to, say, a colon (:), you can set "FS=:" in your awk-script. The same goes for the record separator, but you use the "RS" variable for that. Normally records are separated by newlines, since you will usually feed awk line-based data from files, so changing the RS variable is something you probably won't do often. There are other standard variables that you can use or set, but they are not really important now.
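
Just as a contrived illustration of both variables: here the records are separated by a semicolon and the fields by a colon, so awk prints the second field of every record:

echo "a:b:c;d:e:f" | awk 'BEGIN { RS=";"; FS=":" } { print $2 }'

This prints "b" and "e".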

Each awk script consists of several awk-rules that are formatted like:

[searchpattern] { action }

If the search pattern matches the record, the actions will be executed. If your search pattern is /^test/, the actions only run for lines that start with "test". There are 2 special patterns: BEGIN and END. The BEGIN block is executed once at the start of the program, and the END block is always executed last. If you don't specify a pattern, the actions will be executed for every line (record, actually). This is called the default pattern or null-pattern.
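
As a one-line illustration (a made-up snippet), this rule only fires for records that start with "test":

printf "test one\nother two\ntest three\n" | awk '/^test/ { print "matched:", $0 }'

This prints the first and the third line, prefixed with "matched:".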

An example awk-script:

BEGIN { FS=':'; print "This is the first line\n" }
{ print "LINE",NR,$1 }
END { print "This is the last line\n" }</pre>

This program does a few things. On startup, it will execute the BEGIN action. That will set the field-separator to ':', which means that our data should be separated by a ':'. Secondly, it will print "This is the first line". As you can see, you can add multiple commands in an action-block.

Then, for each line you feed into awk, it will print “LINE”, the number of records currently processed, plus the first field of that line (stored in $1). Additional fields are placed in the $2, $3 etc variables. The whole line is placed into $0. If you change the value of $1, you will see this back in the $0 string.
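
You can see that for yourself with a quick test: as soon as you assign to $1, awk rebuilds $0 from the fields (joined by the output field separator, a space by default):

echo "one two three" | awk '{ $1 = "ONE"; print $0 }'

This prints "ONE two three".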

At the end of the input data, when awk has no more records to process, it will execute the END block, which prints “This is the last line”. You can try this for yourself with your /etc/passwd file:

cat /etc/passwd | awk 'BEGIN { FS=":"; print "This is the first line\n" } \
{ print "LINE",NR,$1 } END { print "This is the last line\n" }'

Outside the pattern/action blocks, you can define your own functions and/or use any of the internal functions.
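
For example (a made-up helper, purely to show the syntax), you can define a function outside the blocks and call it from an action:

echo "2048" | awk 'function kb(bytes) { return bytes / 1024 } { print kb($1), "Kb" }'

This prints "2 Kb".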

Now let’s try awk with our logfile. Our record separator is the standard newline and the field separator is the space. We don’t have to change those. A standard line looks like this:

116.191.61.250 - - [06/Dec/2010:00:23:37 +0100] "GET /2010/07/30/creating-a-traceroute-program-in-php/ HTTP/1.1" 206 10473 "-" "Mozilla/5.0 Firefox/3.0.5"

which will be split into the following fields:

116.191.61.250
-
-
[06/Dec/2010:00:23:37
+0100]
"GET
/2010/07/30/creating-a-traceroute-program-in-php/
HTTP/1.1"
206
10473
"-"
"Mozilla/5.0
Firefox/3.0.5"

We will run into trouble when we want to fetch the complete user-agent string. In this record we would need to combine fields $12 and $13, but for other records we might need more (or fewer) fields. We will deal with this problem a bit later.
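
In the meantime, one way around it (just a sketch) is to glue the fields back together yourself, using the built-in NF variable that holds the number of fields in the current record:

cat apache-anon-noadmin.log | awk '{ ua = $12; for (i = 13; i <= NF; i++) ua = ua " " $i; print ua }'

The surrounding double quotes are still attached to the result, which is exactly why changing the field separator (as we do in example 4) is the nicer solution.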

How the data was anonymized

Since I (and you as a visitor) don’t want your IP-address to be spread around the internet, I’ve anonymized the log data. It’s a fairly easy process that is done in 2 steps:

  1. IP’s are translated into random values.
  2. Admin url’s are removed.

Step 1: Translating IP’s

All the IP's are translated into random IP's, but every IP has its own random counterpart. This means that you can still identify users who are browsing through the site. The actual command I have used for this is:

cat apache-anon-noadmin.log | awk 'function ri(n) {  return int(n*rand()); }  \
BEGIN { srand(); }  { if (! ($1 in randip)) {  \
randip[$1] = sprintf("%d.%d.%d.%d", ri(255), ri(255), ri(255), ri(255)); } \
$1 = randip[$1]; print $0  }'

If you read a bit further we will find out what this actually does, but you should already be able to understand most of it (at least the general structure).
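
For readability, here is the same logic written out as a commented awk script, so you can see what each part does:

# ri(n): return a random integer between 0 and n-1
function ri(n) { return int(n * rand()); }
# seed the random number generator once, at startup
BEGIN { srand(); }
{
    # first time we see this IP? make up a random replacement and remember it
    if (! ($1 in randip)) {
        randip[$1] = sprintf("%d.%d.%d.%d", ri(255), ri(255), ri(255), ri(255));
    }
    # overwrite the real IP with its replacement and print the whole line
    $1 = randip[$1];
    print $0;
}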

Step 2: Removing admin url’s

I don't like that everybody can view all the admin-requests I've done on the site. Luckily this is a very simple process. We only have to remove the requests that start with "/wp-admin". This can be done with an inverse grep-command:

cat apache-anon.log | grep -v '/wp-admin' > apache-anon-noadmin.log
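
By the way, the same filter could be written in awk: a pattern with no action defaults to printing the record, and the ! negates the match, just like grep -v does:

cat apache-anon.log | awk '!/\/wp-admin/' > apache-anon-noadmin.log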

Example 1: count http status codes

For now we want to deal with the status-codes, which are found in field $9. The following command prints field $9 for every record in our log:

cat apache-anon-noadmin.log | awk ' { print $9 } '

That's nice, but let's aggregate this data. We want to know how many times each status code was returned. The "uniq" command can count (and display) the number of times it encounters data, but before we can use uniq we have to sort the data, since uniq stops counting as soon as it encounters a different piece of data (try the following line with and without the "sort" to see what I mean).

cat apache-anon-noadmin.log | awk ' { print $9 } ' | sort | uniq -c

And the output should be:

72951 200
  235 206
 1400 301
   38 302
 2911 304
 2133 404
 1474 500

As you see, the 200 (which stands for OK) is returned 72951 times, while a 404 (page not found) was returned 2133 times. Cool…
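
Just to show it can be done inside awk as well: an associative array (the same trick the next examples use) can do the counting. The for-in loop does not guarantee any particular order, so we still pipe the result through sort:

cat apache-anon-noadmin.log | awk '{ count[$9]++ } END { for (s in count) printf "%6d %s\n", count[s], s }' | sort -rn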

Example 2: top 10 of visiting ip’s

Let's try to create some top-10's. The first one is the top 10 of IP's with the most pageviews (my fans, though most probably it would just be me :p)

cat apache-anon-noadmin.log | awk '{ print $1 ; }' | \
sort | uniq -c | sort -n -r | head -n 10

We use awk to print the first field (the IP), then we sort and count them. THEN we sort again, but this time in reversed order and numerically, so 10 is sorted after 9 instead of after 1 (again, remove the second sort to see what I mean). After this, we take the first 10 lines with the head command, which only prints the first 10 lines.

As you can see, I use (a lot of) different unix commands to achieve what I need to do. It MIGHT be possible to do all this with awk itself as well, but by using other commands we get the job done quick and easy.

Example 3: traffic in kilobytes per status code

Let's introduce arrays. Field $10 holds the number of bytes we have sent out, and field $9 the status code. In the null-pattern block (the block without any pattern, which is executed for every line) we add the number of bytes to the array entry indexed by the status code ($9). It will NOT print out any information yet. At the end of the program, we iterate over the "total" array and print each status code and the total sum of bytes divided by 1024, so we get kilobytes. Still pretty easy to understand.

cat apache-anon-noadmin.log  | awk ' { total[$9] += $10 } \
END {  for (x in total) { printf "Status code %3d : %9.2f Kb\n", x, total[x]/1024 } } '
Status code 200 : 329836.22 Kb
Status code 206 :   4649.29 Kb
Status code 301 :    535.72 Kb
Status code 302 :     20.26 Kb
Status code 304 :    572.77 Kb
Status code 404 :   5106.29 Kb
Status code 500 :   2336.42 Kb

Not a lot of redirections, but still: 5 megabytes wasted by serving pages that are not found :(

Let’s expand this example so we get a total sum:

cat apache-anon-noadmin.log  | awk ' { totalkb += $10; total[$9] += $10 } \
END {  for (x in total) { printf "Status code %3d : %9.2f Kb\n", x, total[x]/1024 } \
printf ("\nTotal send      : %9.2f Kb\n", totalkb/1024); } '
Status code 200 : 329836.22 Kb
Status code 206 :   4649.29 Kb
Status code 301 :    535.72 Kb
Status code 302 :     20.26 Kb
Status code 304 :    572.77 Kb
Status code 404 :   5106.29 Kb
Status code 500 :   2336.42 Kb

Total send      : 343056.96 Kb

Example 4: top 10 referrers

We use the " as separator here. We need this because the referrer is inside those quotes. This is how we can deal with request-url’s, the referrers and user-agents without problems. This time we don’t use a BEGIN block to change the FS-variable, but we change it through a command line parameter. Now, most of the referrers are either from our own blog, or a ‘-‘, when no referrer is given. We add additional grep commands to remove those referrers. Again, sorting, doing a unique count, reverse nat sorting and limiting with head gives us a nice result:

cat apache-anon-noadmin.log | awk -F\" ' { print $4 } ' | \
grep -v '-' | grep -v 'http://www.adayinthelife' | sort | \
uniq -c | sort -rn | head -n 10
343 http://www.phpdeveloper.org/news/15544
175 http://www.dzone.com/links/rss/top5_certifications_for_every_php_programmer.html
 71 http://www.dzone.com/links/index.html
 64 http://www.google.com/reader/view/
 54 http://www.phpdeveloper.org/
 50 http://phpdeveloper.org/
 49 http://www.dzone.com/links/r/top5_certifications_for_every_php_programmer.html
 45 http://www.phpdeveloper.org/news/15544?utm_source=twitterfeed&utm_medium=twitter
 22 http://abcphp.com/41578/
 21 http://twitter.com

At least I can quickly see which sites I need to send some Christmas cards to.

Example 5: top 10 user-agents

How simple is this? The user-agent is in column 6 instead of 4 and we don’t need the grep’s, so this one needs no explanation:

cat apache-anon-noadmin.log | awk -F\" ' { print $6 } ' | \
sort | uniq -c | sort -rn | head -n 10
 5891 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 4145 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 3440 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 2338 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 2314 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.0.6) Gecko/2009011912 Firefox/3.0.6
 2001 Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12
 1959 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 1241 Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_5; en-us) AppleWebKit/533.19.4 (KHTML, like Gecko) Version/5.0.3 Safari/533.19.4
 1122 Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.10 (KHTML, like Gecko) Chrome/8.0.552.215 Safari/534.10
 1010 Mozilla/5.0 (Windows; U; Windows NT 6.1; en-GB; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12

A lot of mac-users (yay!), and IE is nowhere to be seen in the top-10 list. But then again, I don't actually know which user-agent string IE uses :)

Conclusion

Awk is probably more powerful than you would have imagined. I've added some basic examples and granted, you COULD do all this with any other language. Awk's main power comes from simplifying things: you don't need to split, explode, join, implode, paste, read, write, open or close data, and it's easy to parse data from files with simple one-liners. It's ideal to add awk to your scripts and, as seen in the examples I've given, you can even use it for complicated tasks. I suggest that you learn it and love it…