343

I have an Apache logfile, access.log. How do I count the number of occurrences of each line in that file? For example, the result of cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]' is

a.php
b.php
a.php
c.php
d.php
b.php
a.php

the result that I want is:

3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php 
Kokizzu
  • 9,699

5 Answers

445
| sort | uniq -c

As stated in the comments.

Piping the output into sort arranges the lines into alphabetical/numerical order.

This is a requirement because uniq only matches on repeated consecutive lines, i.e. given the input:

a
b
a

If you use uniq on this text file, it will return the following:

a
b
a

This is because the two a's are separated by the b: they are not consecutive lines. However, if you first sort the data into alphabetical order, like

a
a
b

Then uniq will remove the repeating lines. The -c option of uniq counts the number of duplicates and provides output in the form:

2 a
1 b
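The whole example above can be checked in one line (a minimal sketch, assuming GNU coreutils; printf stands in for the real log data):

```shell
# Feed three lines through the pipeline: sort groups the duplicates,
# uniq -c collapses them and prefixes each with its count.
printf '%s\n' a b a | sort | uniq -c
# Output (uniq -c left-pads the counts):
#   2 a
#   1 b
```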


visudo
  • 4,761
  • 1
    Welcome to Unix & Linux :) Don't hesitate to add more details to your answer and explain why and how this works ;) – John WH Smith Nov 26 '14 at 12:18
  • 1
    printf '%s\n' ①.php ②.php | sort | uniq -c gives me 2 ①.php – Stéphane Chazelas Nov 26 '14 at 12:50
  • @StéphaneChazelas That's because the printf prints php\nphp –  Nov 26 '14 at 13:52
  • 6
    @Jidder, no, that's because ①.php sorts the same as ②.php in my locale, because no sorting order is defined for those characters in my locale. If you want unique values for any byte values (remember file paths are not necessarily text), then you need to fix the locale to C: | LC_ALL=C sort | LC_ALL=C uniq -c. – Stéphane Chazelas Nov 26 '14 at 14:00
  • 4
    In order to have the resulting count file sorted you should consider adding the "sort -nr" as @eduard-florinescu answers below. – Lluís Suñol Mar 26 '18 at 11:41
240
[your command] | sort | uniq -c | sort -nr

The accepted answer is almost complete; you might want to add an extra sort -nr at the end to sort the results with the lines that occur most often first.

uniq options:

-c, --count
       prefix lines by the number of occurrences

sort options:

-n, --numeric-sort
       compare according to string numerical value
-r, --reverse
       reverse the result of comparisons

In the particular case where the lines you are sorting are numbers, you need to use sort -gr instead of sort -nr; see the comments below.
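Putting it together on the question's sample data (a sketch assuming GNU coreutils; printf stands in for the cut/tr pipeline):

```shell
# Count each distinct line, then order by count, highest first.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
    | sort | uniq -c | sort -nr
# 3 a.php and 2 b.php come first; the two 1-count lines
# may appear in either order, as in the question.
```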

  • 3
    Thanks so much for letting me know about -n option. – Sigur Nov 30 '16 at 17:00
  • 5
    Great answer, here's what I use to get a wordcount out of file with sentences: tr ' ' '\n' < $FILE | sort | uniq -c | sort -nr > wordcount.txt. The first command replaces spaces with newlines, allowing for the rest of the command to work as expected. – Bar Jul 20 '17 at 00:08
  • 6
    Using the options above I get " 1" before " 23344". Using sort -gr instead solves this. -g: compare according to general numerical value (instead of -n: compare according to string numerical value). – Peter Jaric Feb 14 '19 at 12:24
  • 1
    @PeterJaric Great catch and very useful to know about -gr but I think the output of uniq -c will be as such that sort -nr will work as intended – Eduard Florinescu Feb 14 '19 at 13:09
  • 7
    Actually, when the data are numbers, -gr works better. Try these two examples, differing only in the g and n flags: echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -nr and echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -gr. The first one sorts incorrectly, but not the second one. – Peter Jaric Feb 15 '19 at 10:31
  • @you are right, this is a corner case but I will mention it in the answer – Eduard Florinescu Feb 15 '19 at 13:06
  • 2
    sort -g and sort -n give me the same output for the given example on coreutils 9.1 (also tested with LC_ALL=C). – TheHardew May 25 '22 at 18:55
26

You can use an associative array in awk and then, optionally, sort the result:

$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort

output:

1 c.php
1 d.php
2 b.php
3 a.php
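Because awk accumulates the counts as lines stream in and only prints at END, the same one-liner also works on piped input, with no sort of the full data; a sketch on the question's sample lines:

```shell
# awk counts each distinct line on the fly; sort -nr then orders
# the (much smaller) count table, highest count first.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
    | awk '{ tot[$0]++ } END { for (i in tot) print tot[i], i }' \
    | sort -nr
# First line of output: 3 a.php
```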
slm
  • 369,824
  • How would you count the number of occurrences as the pipe is sending data? – user123456 Oct 09 '16 at 18:00
  • 8
    This approach is very valuable if the input list is very large, because it does not require reading the entire list into memory and then sorting it. – neirbowj Nov 03 '19 at 18:01
5

You can use the clickhouse-local tool to work with a file as if it were an SQL table, with a single column in this case:

clickhouse-local --query \
"select data, count() from file('access.log', TSV, 'data String') group by data order by count() desc limit 10"

My brief experiment shows it's about 50 times faster than

cat access.log | sort | uniq -c | sort -nr | head -n 10
AdminBee
  • 22,803
0

There is only 1 occurrence of d.php, so you'll get nice output like this:

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      1 d.php
wolf@linux:~$

What happens when there are 4 occurrences of d.php?

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      4 d.php
wolf@linux:~$ 

If you want to sort the output by the number of occurrences, send the stdout to sort again.

wolf@linux:~$ cat file | sort | uniq -c | sort
      1 c.php
      2 b.php
      3 a.php
      4 d.php
wolf@linux:~$ 

Use -r for reverse order:

wolf@linux:~$ cat file | sort | uniq -c | sort -r
      4 d.php
      3 a.php
      2 b.php
      1 c.php
wolf@linux:~$ 

Hope this example helps.

Wolf
  • 1,631