6

I'm quite new to Linux, and I've found quite a bit of useful information on how to do character counts in a file, but is there a way in Linux/terminal to sort a text file by the number of times a specific character occurs per line?

E.g. given:

baseball
aardvark
a man a plan a canal panama
cat
bat
bill

Sort by the number of occurrences of the letter "a" yielding:

a man a plan a canal panama
aardvark
baseball
cat
bat
bill

Regarding "cat" and "bat", with one occurrence of "a" each: I don't care if the order of lines with equal counts gets reversed. I'm just interested in a general sort of lines by character frequency.

5 Answers

6

The general approach to this kind of task is to use awk or perl to compute the metric you're interested in and prepend it to each line, then feed that to sort and strip the metric from the sorted output. Here gsub returns the number of substitutions it made, which is the number of a characters on the line:

awk '{print gsub("a","a"), $0}' < file | sort -rn | cut -d' ' -f2-
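
The same decorate, sort, undecorate idea can be sketched with the character passed in as a parameter. The function name `sort_by_char` and its arguments are made up for illustration. The character is used as a regular expression, so this assumes a plain letter or digit rather than a regex metacharacter.

```shell
# Hypothetical helper: sort_by_char CHAR [FILE...]
# Sorts lines by how often CHAR occurs in them, most occurrences first.
# Reads standard input when no FILE is given.
sort_by_char() {
    c=$1
    shift
    # Decorate each line with its count (gsub returns the number of matches,
    # and "&" puts each match back unchanged), sort on the count, then
    # strip the count again.
    awk -v c="$c" '{ print gsub(c, "&"), $0 }' "$@" |
        sort -rn |
        cut -d' ' -f2-
}
```

For the example in the question this would be called as `sort_by_char a file`.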
6

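Another Schwartzian transform. With a as the field separator, NF is one more than the number of a characters on any non-empty line, so it sorts the same way as the count itself: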

$ awk -Fa '{print NF,$0}' file | sort -nr | cut -d' ' -f2-
a man a plan a canal panama
aardvark
baseball
cat
bat
bill

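Or, in Perl, counting with tr (splitting with -Fa would drop trailing empty fields and so undercount lines that end in a):

perl -lne 'print tr/a//, " $_"' file | sort -nr | cut -d' ' -f2-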
terdon
3

You can also just sort on the character:

tr -cd a\\n <file | paste - ./file | LC_ALL=C sort -rk1,1 | cut -f2-

Here's what your example looks like after being translated and pasted before it is piped into sort:

aa  baseball
aaa aardvark
aaaaaaaaaa  a man a plan a canal panama
a   cat
a   bat
    bill

Then sort gets it. Every key is a run of a characters, so a shorter key is a prefix of a longer one and would normally sort before it; with -r the order is reversed, so the lines with the most a characters come first:

aaaaaaaaaa  a man a plan a canal panama
aaa aardvark
aa  baseball
a   cat
a   bat
    bill

...and cut strips everything up to and including the first tab:

a man a plan a canal panama
aardvark
baseball
cat
bat
bill
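
One way to make the sort key explicit, shown here only as a sketch, is to tell sort that the field separator is a tab. The key is then exactly the run of a characters, and a line with no a has an empty key, which sorts last under -r. The function name `sort_by_a_run` is hypothetical; like the pipeline above, it needs a real file because the file is read twice.

```shell
# Hypothetical helper: sort_by_a_run FILE
sort_by_a_run() {
    tab=$(printf '\t')
    # tr keeps only "a" and newlines; paste joins that key to each
    # original line with a tab; sort compares the first tab-delimited
    # field only; cut drops the key.
    tr -cd 'a\n' < "$1" |
        paste - "$1" |
        LC_ALL=C sort -t "$tab" -rk1,1 |
        cut -f2-
}
```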
terdon
mikeserv
  • @terdon - hey what happened to the comments? They were good! But now I'm confused - LC...=C is better for speed and edge cases, but how can your sort put b before a in any language? I think maybe it's about the tab. – mikeserv Aug 02 '15 at 08:08
  • I deleted them since I added it to your answer. It's not sorting b before a but space (which is the first field for lines without a) before a. – terdon Aug 02 '15 at 12:06
  • @terdon - there's no space. It's a tab. – mikeserv Aug 02 '15 at 17:18
  • OK, it's still sorted before the a. – terdon Aug 03 '15 at 00:30
  • @terdon - I believe it, but it confuses me. I'm curious about it is all. – mikeserv Aug 03 '15 at 00:31
0
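#!/bin/bash
# Prefix each line with its count of "a", sort numerically (descending),
# then strip the count again.
while IFS= read -r a; do
    b=${a//[^a]}                      # delete every character except "a"
    printf '%s %s\n' "${#b}" "$a"     # count, space, original line
done < input.txt | sort -rn | sed 's/[^ ]* //'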
yaegashi
0

Since the Schwartzian transform has been mentioned, I'm surprised to see that no-one has yet posted a pure Perl implementation of one:

perl -ne 'push @a, $_ }{ print map { $_->[0] } sort { $b->[1] <=> $a->[1] } map { [$_, $_ =~ tr/a//] } @a' file
a man a plan a canal panama
aardvark
baseball
cat
bat
bill

Each line of the file is pushed onto @a. Once the whole file has been read, each line is paired with its count of the character a, the pairs are sorted by that count in descending order, and the original lines are printed.

Since counting the occurrences of a character is cheap, a more concise method is to skip the transform and recompute the count inside the sort comparison:

$ perl -ne 'push @a, $_ }{ print sort { $b =~ tr/a// <=> $a =~ tr/a// } @a' file
a man a plan a canal panama
aardvark
baseball
cat
bat
bill