
I am using two files in my command. The first (file1) contains every letter of the alphabet, one per line. The second ($w in my command) is a large word list. I need to compare the alphabet list against the word list to find the words that contain a given letter exactly twice, then show, for each letter, how many such words there are along with one example word. The output should look something like this, but for the entire alphabet:

v 94 bivalve
w 94 awkward
x 3 executrix
y 196 abysmally
z 58 bedazzle

Below is my command and its output:

 for i in `cat file1`; do grep $i.*$i $w | sort | uniq -c | head -1; done
  1 aardvark    
  1 abba
  1 acacia
  1 abandoned
  1 abalienate
  1 affability
  1 ageing
  1 aforethought
  1 abalienation
  1 hajj
  1 backstroke
  1 abnormally
  1 accommodate
  1 abalienation
  1 abdominous
  1 agitprop
  1 quinqevalent
  1 aardvark
  1 abbess
  1 abatement
  1 absquatulate
  1 bivalve
  1 awkward
  1 executrix
  1 abysmally
  1 bedazzle
Jeff Schaller

3 Answers


Assuming you're using a relatively recent version of bash, you should be able to do something like this:

for CHAR in {a..z}
do
    WORD_LIST=( $(grep "$CHAR.*$CHAR" "$w") )
    echo "$CHAR ${#WORD_LIST[@]} ${WORD_LIST[0]}"
done

We're making use of bash arrays: ${#WORD_LIST[@]} gives the number of elements in the array, and ${WORD_LIST[0]} gives its first element.
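As a minimal sketch of the array behavior described above, the loop body can be wrapped in a function and tried on a tiny list. The name count_and_first and the four demo words are made up for illustration; the sketch assumes bash and a word list with one plain word per line.

```shell
#!/usr/bin/env bash
# count_and_first LETTER FILE  (hypothetical helper name)
# Prints LETTER, the number of words containing LETTER at least twice,
# and the first such word.
count_and_first() {
    local letter=$1 file=$2
    local matches
    # The unquoted $(...) is split on whitespace: one array element per word.
    matches=( $(grep "${letter}.*${letter}" "$file") )
    echo "$letter ${#matches[@]} ${matches[0]}"
}

# Demo on a small made-up list standing in for the real word list.
demo=$(mktemp)
printf '%s\n' aardvark abba apple banana > "$demo"
count_and_first a "$demo"    # prints: a 3 aardvark
rm -f "$demo"
```

Because the array is built by word splitting, this only behaves as intended when no line of the list contains whitespace or glob characters.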

Your example doesn't work because uniq -c counts how many times each distinct line occurs, so it gives you a count per word (always 1 in a word list without duplicates) rather than a count of all the words passed to it. head -1 then keeps only the first line of that output.
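To see the difference on a small scale, here is a sketch that runs both approaches on a three-word list invented for the demo. The variable names per_word and total are just illustrative.

```shell
#!/usr/bin/env bash
demo=$(mktemp)
printf '%s\n' bivalve vivid valve > "$demo"

# uniq -c prefixes every distinct line with its own repeat count,
# so each word in a duplicate-free list is reported with a count of 1.
per_word=$(grep 'v.*v' "$demo" | sort | uniq -c)

# grep -c counts the matching lines as a whole.
total=$(grep -c 'v.*v' "$demo")

echo "$per_word"
echo "total: $total"    # prints: total: 3
rm -f "$demo"
```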

  • Oh ok, I've never used an array in bash but I'm starting to see how it works. Now, if I wanted to find only the words containing a letter exactly twice instead of at least twice, how would I go about that? I know that the Y word count includes one word containing 3 y's (syzygy); how would I exclude that from the count? – James Obrien Nov 08 '16 at 22:17
  • Cool (+1) but the regular expression also accepts more than 2 occurrences. – JJoao Nov 08 '16 at 23:21

Starting from Zachary Brady's version:

for i in {a..z} 
 do 
   ( echo $i ;
     grep -c    "^[^$i]*$i[^$i]*$i[^$i]*$" "$w";
     grep -m 1  "^[^$i]*$i[^$i]*$i[^$i]*$" "$w"
   ) | paste - - - 
 done
  • "^[^$i]*$i[^$i]*$i[^$i]*$" is to ensure that we get exactly 2 occurrences of $i (example ^[^a]*a[^a]*a[^a]*$)
  • grep -c ... counts the number of matching words
  • grep -m 1 ... gets the first matching word
  • paste - - - ... joins the 3 output lines in a single line

If you prefer a random example word, replace the second grep with

grep "^[^$i]*$i[^$i]*$i[^$i]*$" "$w" | shuf | head -1

Another way to "ensure exactly two" is to find the words with at least two a's and reject those with three or more:

grep 'a.*a' "$w"  | grep -vc 'a.*a.*a' 
JJoao

Here are two ways to do it: one more shell-oriented (using mainly grep), and one with awk.

w=/usr/share/dict/words
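sort file1 | uniq | while read -r letter
do
  pattern="^[^$letter]*$letter[^$letter]*$letter[^$letter]*$"
  count=$(grep -ic "$pattern" "$w")
  # No matches: report zero and skip the random pick (RANDOM % 0 is an error)
  if [ "$count" -eq 0 ]; then printf "%s %d\n" "$letter" 0; continue; fi
  r=$(( (RANDOM % count) + 1 ))   # random line number in 1..count
  printf "%s %d %s\n" "$letter" "$count" \
    "$(grep -i "$pattern" "$w" | sed -n "${r}p")"
done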

The initial sort and uniq are unnecessary if file1 was prepared as indicated (one letter per line), but I gratuitously added them to get closer to the "use grep sort and uniq" requirement.
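The random example in the shell version comes from picking a line number between 1 and the match count and printing that line with sed. Here is that step in isolation, as a sketch with an invented function name (random_match) and demo list; it assumes bash for RANDOM.

```shell
#!/usr/bin/env bash
# random_match LETTER FILE  (hypothetical helper name)
# Prints one randomly chosen word containing LETTER exactly twice,
# or returns 1 if there is no such word.
random_match() {
    local letter=$1 file=$2
    local re="^[^${letter}]*${letter}[^${letter}]*${letter}[^${letter}]*\$"
    local count r
    # grep -c exits non-zero when the count is 0, hence the || true.
    count=$(grep -ic "$re" "$file") || true
    [ "$count" -gt 0 ] || return 1     # nothing to choose from
    r=$(( RANDOM % count + 1 ))        # line number in 1..count
    grep -i "$re" "$file" | sed -n "${r}p"
}

demo=$(mktemp)
printf '%s\n' gypsy syzygy yearly yes > "$demo"
random_match y "$demo"    # prints either gypsy or yearly
rm -f "$demo"
```

This reads the word list twice per letter, which is simple but slower than the single pass of the awk version on a large list.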

The awk solution:

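BEGIN {
  # An empty separator splits into single characters in common awks such as
  # gawk and mawk; POSIX leaves this case unspecified.
  split("abcdefghijklmnopqrstuvwxyz", alphabet, "");
  srand();
}
{
  for (i in alphabet) {
    letter=alphabet[i]
    # Match words that contain the letter exactly twice
    if (match(tolower($1), "^[^"letter"]*"letter"[^"letter"]*"letter"[^"letter"]*$")) {
      counts[letter]++
      if (wordfor[letter]) {
        # Replace the stored example with probability 1/counts[letter],
        # so every matching word is equally likely to be the final example
        if (rand() * counts[letter] >= counts[letter] - 1)
          wordfor[letter]=$1
      } else
        wordfor[letter]=$1
    }
  }
}
END {
  # for-in order is unspecified, so the output is sorted afterwards
  for (i in alphabet)
    print alphabet[i], counts[alphabet[i]], wordfor[alphabet[i]]
}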

Save that to a file and use something like:

w=/usr/share/dict/words ## or whatever
awk -f theabove.awk "$w" | sort
Jeff Schaller