
I have a simple awk program, ip.awk, that finds the most frequently occurring IP address in a log file. The IP addresses are in the first column:

$ cat ip.awk

{ ip[$1]++ }
END {
    for (i in ip)
        if (max < ip[i]) {
            max = ip[i]
            maxnumber = i
        }
    print maxnumber, " has accessed ", max, " times.", " $1 is: ", $1
}

And I am using it to parse a file access.log, a few sample entries from which are shown below:

173.13.151.14 - - [11/Sep/2014:23:57:53 +0100] "GET /wp/wp-includes/js/jquery/jquery-migrate.min.js?ver=1.2.1 HTTP/1.1" 200 7404 "http://theurbanpenguin.com/wp/?p=2407" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
173.13.151.14 - - [11/Sep/2014:23:57:53 +0100] "GET /wp/wp-content/themes/twentytwelve/js/navigation.js?ver=20140711 HTTP/1.1" 200 1720 "http://theurbanpenguin.com/wp/?p=2407" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
173.13.151.14 - - [11/Sep/2014:23:57:53 +0100] "GET /wp/wp-content/uploads/2013/11/tailshadow.png HTTP/1.1" 200 11433 "http://theurbanpenguin.com/wp/?p=2407" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
173.13.151.14 - - [11/Sep/2014:23:57:53 +0100] "GET /wp/wp-content/uploads/2014/05/cropped-wp3.png HTTP/1.1" 200 65326 "http://theurbanpenguin.com/wp/?p=2407" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"
173.13.151.14 - - [11/Sep/2014:23:57:53 +0100] "GET /wp/?p=2407 HTTP/1.1" 200 21717 "https://www.google.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.94 Safari/537.36"

The awk script gives what I believe is the correct result:

$ awk -f ip.awk access.log
68.107.81.110  has accessed  311  times.  $1 is:  70.168.57.66

My confusion is about the value of $1 which, as I understand it, should change line by line to the value in the first column of each line as awk moves through access.log.

This agrees with the check I added at the end of the program (" $1 is: ", $1), which prints the IP address of the last line. The log file is 30000+ lines long, so I made a small test file to check that the script was actually working:

$ cat testfile.log
1   apple
2   banana
2   banana
3
3
3
4
4
4
4
5
5   flerb
5   flerb
5   flerb
5   flerb
5   flerb , green - tea
6
7
8   grapes 0 and some more filler to make a long line
9

When I run the script on this file I get the right answer, but I don't get "9" for the value of $1 when I print it. What am I missing?

$ awk -f ip.awk testfile.log
5  has accessed  6  times.  $1 is: 

To eliminate another variable, I used awk to extract the first column of IP addresses into a new file and ran ip.awk on it; as expected, I got exactly the same results as on the full log file. I also feel like I'm missing something fundamental: how can a dotted-decimal IP address be used as an array index? Also, if I use 1.0 2.0... instead of 1 2..., I still get the correct answer but still no $1 value.

Answer: As thecarpy suggested, the problem was that I hit Enter after the last value when typing in my test file. That added a superfluous empty line at the end, and when awk parsed that line it set $1 to an empty string.
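
A minimal sketch of that effect, assuming a made-up four-line input (three values followed by one empty line) rather than the real test file. The variable names `last`, `unguarded` and `guarded` are hypothetical; the value is tracked in the main rule so the demonstration does not depend on what a given awk keeps in $1 inside END:

```shell
# Hypothetical input: the records 5, 5, 9 followed by one empty line.

# Without a guard, the empty last line overwrites "last" with an empty string.
unguarded=$(printf '5\n5\n9\n\n' | awk '{ last = $1 } END { print "[" last "]" }')

# With an NF pattern, records that contain no fields are skipped.
guarded=$(printf '5\n5\n9\n\n' | awk 'NF { last = $1 } END { print "[" last "]" }')

echo "unguarded: $unguarded"
echo "guarded:   $guarded"
```

Putting the same `NF` pattern in front of `{ ip[$1]++ }` should also keep empty lines from being counted under an empty key.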

flerb
  • I am unable to replicate your results. Given your awk script and input file, my output is: 5 has accessed 6 times. $1 is: 9. – DopeGhoti Jul 10 '17 at 22:55
  • That's odd, I've run it a few times and just ran it again to be sure, and I still get the same result. – flerb Jul 10 '17 at 22:56
  • Would you not have a superfluous newline at the end? Sorry, silly question, but I cannot reproduce it either. I have GNU Awk 4.1.4. – thecarpy Jul 10 '17 at 23:05
  • That was it, a superfluous newline at the end. Not a silly question at all, apparently. – flerb Jul 10 '17 at 23:09

1 Answer

In an awk program the END block runs after all the data has been read, so there is no input line left to parse. (You might find that some implementations of awk leave $1 as the first field of the last line. See "Is the AWK END behavior to keep the last line loaded in $0 in the man page".)
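
If the last record's first field is needed in END, one way to avoid relying on that behaviour is to save it in a variable while the input is still being read. This is a sketch rather than a fix taken from the question; the variable name `last` and the four-line sample input are assumptions for illustration:

```shell
# Hypothetical input: the records 1, 5, 5, 9.
result=$(printf '1\n5\n5\n9\n' | awk '
{ ip[$1]++; last = $1 }     # count each key and remember the most recent first field
END {
    for (i in ip)
        if (max < ip[i]) { max = ip[i]; maxnumber = i }
    print maxnumber, "has accessed", max, "times; last $1 was", last
}')
echo "$result"
# prints: 5 has accessed 2 times; last $1 was 9
```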

awk uses associative arrays. This means you can use any string as the index. Numeric arrays work because a[1] is the array a[] subscripted by the string that happens to be the single character 1. It could equally be a["one"] or even a["banana"]. (The quotes matter: an unquoted a[one] would use the value of the variable one as the subscript.) The dotted quad for your IP address is just a string.
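
A small sketch of string subscripts in action. The array name `a`, the helper variable `r1` and the address 10.0.0.1 are made up for the example:

```shell
out=$(awk 'BEGIN {
    a[1] = "x"                          # the number 1 is stored under the string key "1"
    r1 = ("1" in a) ? "yes" : "no"      # so a lookup by the string "1" finds it
    a["10.0.0.1"]++; a["10.0.0.1"]++    # a dotted quad is just another string key
    a["banana"] = "fruit"
    print r1, a["10.0.0.1"], a["banana"]
}')
echo "$out"
# prints: yes 2 fruit
```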

Chris Davies