How to use grep, sort, and uniq to create three fields of output

Question

I am using two files in my command. The first (file1) contains every letter of the alphabet, one per line. The second ($w in my command) is a giant word list. I have to compare the alphabet list against the word list to find the words that contain a given letter exactly twice, then show, for each letter, how many such words there are and one example word. The output would look something like this, but for the entire alphabet:

v 94 bivalve
w 94 awkward
x 3 executrix
y 196 abysmally
z 58 bedazzle

Below is my command and its output

 for i in `cat file1`; do grep $i.*$i $w | sort | uniq -c | head -1; done
  1 aardvark    
  1 abba
  1 acacia
  1 abandoned
  1 abalienate
  1 affability
  1 ageing
  1 aforethought
  1 abalienation
  1 hajj
  1 backstroke
  1 abnormally
  1 accommodate
  1 abalienation
  1 abdominous
  1 agitprop
  1 quinqevalent
  1 aardvark
  1 abbess
  1 abatement
  1 absquatulate
  1 bivalve
  1 awkward
  1 executrix
  1 abysmally
  1 bedazzle

Unless you want file globbing, you better double quote that pattern that you're passing to grep. Also see Why does my shell script choke on whitespace or other special characters? — Wildcard, Nov 08 '16 at 23:35
If any of the answers solved your problem, please accept it by clicking the checkmark next to it. Thank you! — Jeff Schaller, Jan 20 '19 at 13:11

score 3 · Answer 1 · answered Nov 08 '16 at 21:41

Assuming you're using bash, and a relatively new version of it, you should be able to do something like this:

for CHAR in {a..z}
do
    # The unquoted $(...) is deliberate: it splits grep's output into one array element per word.
    WORD_LIST=( $(grep "$CHAR.*$CHAR" "$w") )
    echo "$CHAR ${#WORD_LIST[@]} ${WORD_LIST[0]}"
done

We're making use of a bash array: ${#WORD_LIST[@]} gives the number of elements in the array, and ${WORD_LIST[0]} is its first element.
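As a minimal sketch of those two expansions, here is the same idea on a small made-up word list (the sample words and the variable names sample_words, count and first are assumptions for illustration, not part of the original command):

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the real word list "$w".
sample_words=$(printf '%s\n' aardvark abba acacia banana)

# The unquoted $(...) splits on whitespace, giving one array element per word.
WORD_LIST=( $(printf '%s\n' "$sample_words" | grep "a.*a") )

count=${#WORD_LIST[@]}   # number of elements in the array
first=${WORD_LIST[0]}    # element at index 0

echo "a $count $first"
```

All four sample words contain at least two a's, so the array holds four elements and the first one is aardvark.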

Your example doesn't work because uniq -c counts how many times each distinct line occurs. Every word in the list is distinct, so it prints a count of 1 for each word rather than the total number of words passed to it, and head -1 then keeps only the first of those lines.
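The difference can be seen on a tiny made-up list (the three words and the variable names below are assumptions for illustration only):

```shell
# Hypothetical three-word sample; every word is distinct, as in a dictionary.
words=$(printf '%s\n' abba acacia aardvark)

# uniq -c prefixes each distinct line with its own repeat count (1 here),
# and head -1 then keeps only the first of those lines.
per_word=$(printf '%s\n' "$words" | sort | uniq -c | head -1)

# grep -c counts the matching lines, which is the total being asked for.
total=$(printf '%s\n' "$words" | grep -c "a.*a")

echo "$per_word"
echo "$total"
```

The first echo shows a count of 1 next to a single word, while the second shows the number of matching words.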

Zachary Brady

Oh ok, I've never used an array in bash but I'm starting to see how it works. Now, if I wanted to find only the words containing a letter exactly twice, instead of at least twice, how would I go about that? I know that the y count includes one word containing three y's (syzygy); how would I exclude that from the count? – James Obrien Nov 08 '16 at 22:17
Cool (+1) but the regular expression also accepts more than 2 occurrences. – JJoao Nov 08 '16 at 23:21

JJoao · Answer 2 · 2016-11-09T12:35:59.473

Starting from Zachary Brady's version:

for i in {a..z}
do
  ( echo "$i"
    grep -c   "^[^$i]*$i[^$i]*$i[^$i]*$" "$w"
    grep -m 1 "^[^$i]*$i[^$i]*$i[^$i]*$" "$w"
  ) | paste - - -
done

"^[^$i]*$i[^$i]*$i[^$i]*$" ensures that we get exactly 2 occurrences of $i (for example, ^[^a]*a[^a]*a[^a]*$ for the letter a)
grep -c ... counts the number of matching words
grep -m 1 ... gets the first matching word
paste - - - ... joins the 3 output lines into a single line
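To see what the anchored pattern changes, here is a small sketch comparing it with the looser "at least two" pattern on a made-up list (the sample words and variable names are assumptions for illustration):

```shell
# Hypothetical sample list; syzygy has three y's, the other words exactly two.
words=$(printf '%s\n' abysmally syzygy yearly)

i=y
# Matches any word with at least two occurrences of the letter.
at_least_two=$(printf '%s\n' "$words" | grep -c "${i}.*${i}")
# Matches only words with exactly two occurrences of the letter.
exactly_two=$(printf '%s\n' "$words" | grep -c "^[^${i}]*${i}[^${i}]*${i}[^${i}]*$")

echo "$at_least_two $exactly_two"
```

The looser pattern counts all three words, while the anchored one leaves out syzygy.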

If you prefer a random example word, replace the second grep with

grep "^[^$i]*$i[^$i]*$i[^$i]*$" "$w" | shuf -n 1

Another way to ensure exactly two occurrences is to match the words with at least two a's and then reject those with three or more:

grep 'a.*a' "$w" | grep -vc 'a.*a.*a'
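As a quick sanity check, the two-step filter and the single anchored pattern should agree. A minimal sketch on a made-up list (sample words and variable names are assumptions):

```shell
# Hypothetical sample: banana has three a's, abba and salad two, cat one.
words=$(printf '%s\n' banana abba salad cat)

# Keep words with at least two a's, then count those without a third.
two_step=$(printf '%s\n' "$words" | grep 'a.*a' | grep -vc 'a.*a.*a')
# Single anchored pattern for exactly two a's.
anchored=$(printf '%s\n' "$words" | grep -c '^[^a]*a[^a]*a[^a]*$')

echo "$two_step $anchored"
```

Both counts come out the same here, since only abba and salad contain exactly two a's.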

Jeff Schaller · Answer 3 · 2016-11-09T17:15:58.670

Here are two ways to do it: one more shell-oriented (using mainly grep), and one with awk.

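w=/usr/share/dict/words
sort file1 | uniq | while read -r letter
do
  pattern="^[^$letter]*$letter[^$letter]*$letter[^$letter]*$"
  count=$(grep -ic "$pattern" "$w")
  # No matches: print a zero count and skip, which avoids a division by zero below.
  [ "$count" -gt 0 ] || { printf '%s 0\n' "$letter"; continue; }
  r=$(( (RANDOM % count) + 1 ))   # random line number among the matches
  printf '%s %d %s\n' "$letter" "$count" \
    "$(grep -i "$pattern" "$w" | sed -n "${r}p")"
done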

The initial sort and uniq are unnecessary if file1 was prepared as indicated (one letter per line), but I gratuitously added them to get closer to the "use grep sort and uniq" requirement.

The awk solution:

BEGIN {
  split("abcdefghijklmnopqrstuvwxyz", alphabet, "");
  srand();
}
{
  for (i in alphabet) {
    letter=alphabet[i]
    if (match(tolower($1), "^[^"letter"]*"letter"[^"letter"]*"letter"[^"letter"]*$")) {
      counts[letter]++
      if (wordfor[letter]) {
        if (rand() * counts[letter] >= counts[letter] - 1)
          wordfor[letter]=$1
      } else
        wordfor[letter]=$1
    }
  }
}
END {
  for (i in alphabet)
    print alphabet[i], counts[alphabet[i]], wordfor[alphabet[i]]
}

Save that to a file and use something like:

w=/usr/share/dict/words ## or whatever
awk -f theabove.awk "$w" | sort
