How to sort file by character occurrences per line?

Question

I'm quite new to Linux, and I've found quite a bit of useful information on how to do character counts in a file, but is there a way in Linux/terminal to sort a text file by the number of times a specific character occurs per line?

E.g. given:

baseball
aardvark
a man a plan a canal panama
cat
bat
bill

Sort by the number of occurrences of the letter "a" yielding:

a man a plan a canal panama
aardvark
baseball
cat
bat
bill

Regarding "cat" and "bat", which have one "a" each: I don't care if the order of lines with equal counts gets reversed. I'm just interested in a general sort of lines by character frequency.

If you know "How to count the number of a specific character in each line?", you could paste the result and the file together, sort by the first column, then delete that column, e.g. tr -d -c 'a\n' <infile | awk '{print length}' | paste - infile | sort -k1nr | cut -f2- – don_crissti, Aug 01 '15 at 20:29

score 6 · Answer 1 · answered Aug 01 '15 at 20:55

The general approach to this kind of task is to use awk or perl to compute the metric you're interested in and prepend it to each line, feed the result to sort, and then strip the metric from the sorted output:

awk '{print gsub("a","a"), $0}' < file | sort -rn | cut -d' ' -f2-
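As a sketch of how the same decorate, sort, undecorate pipeline might be generalized, the function below takes the character to count as an argument. The name sort_by_char is hypothetical. The sort -s flag (stable sort, available in GNU and BSD sort but not required by POSIX) is an addition so that lines with equal counts keep their input order. The character is used as a regular expression, so metacharacters such as . would need escaping, which is not handled here.

```shell
# sort_by_char: hypothetical helper, not from the answer above.
# Usage: sort_by_char CHAR < input
sort_by_char() {
    # Decorate: gsub() returns the number of substitutions it made; replacing
    # each match with "&" (the matched text) leaves the line unchanged.
    awk -v c="$1" '{ n = gsub(c, "&"); print n, $0 }' |
        # Sort on the count only, numerically, largest first; -s keeps ties in input order.
        sort -s -k1,1nr |
        # Undecorate: remove the count and the space after it.
        cut -d' ' -f2-
}
```

With the sample from the question, sort_by_char a < file should give the requested ordering, with "cat" staying ahead of "bat" because ties are kept in input order.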

answered Aug 01 '15 at 20:55

Stéphane Chazelas

This technique is known as a Schwartzian transform – glenn jackman Aug 02 '15 at 01:10

score 6 · Answer 2 · answered Aug 02 '15 at 08:01

Another Schwartzian transform:

$ awk -Fa '{print NF,$0}' file | sort -nr | cut -d' ' -f2-
a man a plan a canal panama
aardvark
baseball
cat
bat
bill

Or, in Perl:

perl -lne 'print tr/a//, " $_"' file | sort -nr | cut -d' ' -f2-
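The awk variant works because, with a as the field separator, a non-empty line containing n occurrences of a splits into n + 1 fields, so NF orders the lines the same way the count itself would. The sketch below, using a hypothetical helper name fields_vs_count, prints NF next to a count obtained independently with gsub() so the relationship can be checked on sample input.

```shell
# fields_vs_count: hypothetical helper for illustration only.
# Prints "NF count line" for every line read from standard input.
fields_vs_count() {
    awk -Fa '{
        nf = NF                 # number of fields when splitting on "a"
        line = $0               # keep the original text
        n = gsub(/a/, "&")      # independent count of "a"
        print nf, n, line
    }'
}

printf 'banana\nbill\n' | fields_vs_count
```

For this input the lines printed are "4 3 banana" and "1 0 bill": awk keeps the empty field after a trailing a, so NF stays at count + 1 even when a line ends in a.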

answered Aug 02 '15 at 08:01

terdon

score 3 · Answer 3 · edited Aug 02 '15 at 08:05

You can also just sort on the character:

tr -cd a\\n <file | paste - ./file | LC_ALL=C sort -rk1,1 | cut -f2-

Here's what your example looks like after being translated and pasted before it is piped into sort:

aa  baseball
aaa aardvark
aaaaaaaaaa  a man a plan a canal panama
a   cat
a   bat
    bill

sort then compares the keys. Because a shorter run of a is a prefix of any longer run, the shorter key would normally sort first; -r reverses that, so the longest runs come first, and its output is...

aaaaaaaaaa  a man a plan a canal panama
aaa aardvark
aa  baseball
a   cat
a   bat
    bill

...and cut just strips away up to the first tab.

a man a plan a canal panama
aardvark
baseball
cat
bat
bill

edited Aug 02 '15 at 08:05

terdon

answered Aug 02 '15 at 06:47

mikeserv

@terdon - hey what happened to the comments? They were good! But now I'm confused - LC...=C is better for speed and edge cases, but how can your sort put b before a in any language? I think maybe it's about the tab. – mikeserv Aug 02 '15 at 08:08
I deleted them since I added it to your answer. It's not sorting b before a but space (which is the first field for lines without a) before a. – terdon Aug 02 '15 at 12:06
@terdon - there's no space. It's a tab. – mikeserv Aug 02 '15 at 17:18
OK, it's still sorted before the a. – terdon Aug 03 '15 at 00:30
@terdon - I believe it, but it confuses me. I'm curious about it is all. – mikeserv Aug 03 '15 at 00:31
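One way to sidestep the question raised in these comments is to tell sort explicitly that the tab is the field separator, so that a line without any a has an empty first field rather than a key that begins with a tab. The sketch below is a variation on the answer's pipeline, not the answer's own command, and the function name sort_by_a_run is hypothetical. In the C locale keys are compared byte by byte, and an empty key sorts before any run of a.

```shell
# sort_by_a_run: hypothetical helper. Usage: sort_by_a_run FILE
sort_by_a_run() {
    tab=$(printf '\t')    # a literal tab character for sort -t
    # Keep only "a" and newlines, giving one run of "a" per input line.
    tr -cd 'a\n' < "$1" |
        # Join each run to its original line with a tab.
        paste - "$1" |
        # Field 1 is the run of "a" (empty when there is none); -r puts the longest run first.
        LC_ALL=C sort -t "$tab" -rk1,1 |
        # Drop the key and the tab.
        cut -f2-
}
```

As in the original command, a file name that begins with a dash should be passed with a ./ prefix so that paste does not read it as an option.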

score 0 · Answer 4 · answered Aug 01 '15 at 20:29

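#!/bin/bash
# Prefix each line with its count of "a", sort by that count, then strip the prefix.
while IFS= read -r line; do
    only_a=${line//[^a]}                      # delete every character that is not "a"
    printf '%s %s\n' "${#only_a}" "$line"     # emit "count line"
done < input.txt | sort -rn | sed 's/[^ ]* //'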

answered Aug 01 '15 at 20:29

yaegashi

score 0 · Answer 5 · answered Aug 02 '15 at 11:12

Since the Schwartzian transform has been mentioned, I'm surprised to see that no-one has yet posted a pure Perl implementation of one:

$ perl -ne 'push @a, $_ }{ print map { $_->[0] } sort { $b->[1] <=> $a->[1] } map { [$_, $_ =~ tr/a//] } @a' file
a man a plan a canal panama
aardvark
baseball
cat
bat
bill

Each line of the file is pushed to @a, then once the file has been read, the count of the character a is used to sort the array.

Since counting the number of occurrences of a character isn't that expensive a function to compute, a more concise method would be just to use sort by itself:

$ perl -ne 'push @a, $_ }{ print sort { $b =~ tr/a// <=> $a =~ tr/a// } @a' file
a man a plan a canal panama
aardvark
baseball
cat
bat
bill
