343

I have an Apache logfile, access.log. How do I count the number of occurrences of each line in that file? For example, the result of cut -f 7 -d ' ' | cut -d '?' -f 1 | tr '[:upper:]' '[:lower:]' is

a.php
b.php
a.php
c.php
d.php
b.php
a.php

the result that I want is:

3 a.php
2 b.php
1 d.php # order doesn't matter
1 c.php 
Kokizzu
  • 9,699

5 Answers

445
| sort | uniq -c

As stated in the comments.

Piping the output into sort arranges the lines into alphabetical/numerical order.

This is a requirement because uniq only matches on repeated consecutive lines, i.e. given the input:

a
b
a

If you use uniq on this text file, it will return the following:

a
b
a

This is because the two a's are separated by the b: they are not consecutive lines. However, if you first sort the data into alphabetical order, like

a
a
b

Then uniq will remove the repeating lines. The -c option of uniq counts the number of duplicates and provides output in the form:

2 a
1 b
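The whole example above can be checked in one line (a minimal sketch, assuming GNU coreutils; printf stands in for the real log data):

```shell
# Feed three lines through the pipeline: sort groups the duplicates,
# uniq -c collapses them and prefixes each with its count.
printf '%s\n' a b a | sort | uniq -c
# Output (uniq -c left-pads the counts):
#   2 a
#   1 b
```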


visudo
  • 4,761
  • 1
    Welcome to Unix & Linux :) Don't hesitate to add more details to your answer and explain why and how this works ;) – John WH Smith Nov 26 '14 at 12:18
  • 1
    printf '%s\n' ①.php ②.php | sort | uniq -c gives me 2 ①.php – Stéphane Chazelas Nov 26 '14 at 12:50
  • @StéphaneChazelas That's because the printf prints php\nphp –  Nov 26 '14 at 13:52
  • 6
    @Jidder, no, that's because ①.php sorts the same as ②.php in my locale, because no sorting order is defined for those characters in my locale. If you want unique values for any byte values (remember file paths are not necessarily text), then you need to fix the locale to C: | LC_ALL=C sort | LC_ALL=C uniq -c. – Stéphane Chazelas Nov 26 '14 at 14:00
  • 4
    In order to have the resulting count file sorted you should consider adding the "sort -nr" as @eduard-florinescu answers below. – Lluís Suñol Mar 26 '18 at 11:41
240
[your command] | sort | uniq -c | sort -nr

The accepted answer is almost complete; you might want to add an extra sort -nr at the end to sort the results with the lines that occur most often first.

uniq options:

-c, --count
       prefix lines by the number of occurrences

sort options:

-n, --numeric-sort
       compare according to string numerical value
-r, --reverse
       reverse the result of comparisons

In the particular case where the lines you are sorting are numbers, you need to use sort -gr instead of sort -nr; see the comments below.
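Putting it together on the question's sample data (a sketch assuming GNU coreutils; printf stands in for the cut/tr pipeline):

```shell
# Count each distinct line, then order by count, highest first.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
    | sort | uniq -c | sort -nr
# 3 a.php and 2 b.php come first; the two 1-count lines
# may appear in either order, as in the question.
```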

  • 3
    Thanks so much for letting me know about -n option. – Sigur Nov 30 '16 at 17:00
  • 5
    Great answer, here's what I use to get a wordcount out of file with sentences: tr ' ' '\n' < $FILE | sort | uniq -c | sort -nr > wordcount.txt. The first command replaces spaces with newlines, allowing for the rest of the command to work as expected. – Bar Jul 20 '17 at 00:08
  • 6
    Using the options above I get " 1" before " 23344". Using sort -gr instead solves this. -g: compare according to general numerical value (instead of -n: compare according to string numerical value). – Peter Jaric Feb 14 '19 at 12:24
  • 1
    @PeterJaric Great catch and very useful to know about -gr but I think the output of uniq -c will be as such that sort -nr will work as intended – Eduard Florinescu Feb 14 '19 at 13:09
  • 7
    Actually, when the data are numbers, -gr works better. Try these two examples, differing only in the g and n flags: echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -nr and echo "1 11 1 2" | tr ' ' '\n' | sort | uniq -c | sort -gr. The first one sorts incorrectly, but not the second one. – Peter Jaric Feb 15 '19 at 10:31
  • @you are right, this is a corner case but I will mention it in the answer – Eduard Florinescu Feb 15 '19 at 13:06
  • 2
    sort -g and sort -n give me the same output for the given example on coreutils 9.1 (also tested with LC_ALL=C). – TheHardew May 25 '22 at 18:55
26

You can use an associative array in awk and then, optionally, sort the result:

$ awk ' { tot[$0]++ } END { for (i in tot) print tot[i],i } ' access.log | sort

output:

1 c.php
1 d.php
2 b.php
3 a.php
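Because awk accumulates the counts as lines stream in and only prints at END, the same one-liner also works on piped input, with no sort of the full data; a sketch on the question's sample lines:

```shell
# awk counts each distinct line on the fly; sort -nr then orders
# the (much smaller) count table, highest count first.
printf '%s\n' a.php b.php a.php c.php d.php b.php a.php \
    | awk '{ tot[$0]++ } END { for (i in tot) print tot[i], i }' \
    | sort -nr
# First line of output: 3 a.php
```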
slm
  • 369,824
  • How would you count the number of occurrences as the pipe is sending data? – user123456 Oct 09 '16 at 18:00
  • 8
    This approach is very valuable if the input list is very large, because it does not require reading the entire list into memory and then sorting it. – neirbowj Nov 03 '19 at 18:01
5

You can use the clickhouse-local tool to work with a file as if it were an SQL table, with a single column in this case:

clickhouse-local --query \
"select data, count() from file('access.log', TSV, 'data String') group by data order by count() desc limit 10"

My brief experiment shows it's about 50 times faster than

cat access.log | sort | uniq -c | sort -nr | head -n 10
AdminBee
  • 22,803
0

There is only 1 occurrence of d.php, so you'll get nice output like this:

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      1 d.php
wolf@linux:~$

What happens when there are 4 occurrences of d.php?

wolf@linux:~$ cat file | sort | uniq -c
      3 a.php
      2 b.php
      1 c.php
      4 d.php
wolf@linux:~$ 

If you want to sort the output by the number of occurrences, send the stdout to sort again.

wolf@linux:~$ cat file | sort | uniq -c | sort
      1 c.php
      2 b.php
      3 a.php
      4 d.php
wolf@linux:~$ 

Use -r for reverse order:

wolf@linux:~$ cat file | sort | uniq -c | sort -r
      4 d.php
      3 a.php
      2 b.php
      1 c.php
wolf@linux:~$ 

Hope this example helps.

Wolf
  • 1,631