
I have 10 text files, each containing one chapter of a book. I want to find the pair of words that appears together most often on a line, e.g.:

chapter1:

hello world good boy green sun

good green boy sun world hello

chapter2:

chapter3:

.....etc

Output wanted for chapter1:

hello world (in alphabetical order)
John B

2 Answers

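awk '
  {
    $0 = tolower($0)              # compare words case-insensitively
    for (i = 1; i < NF; i++) {    # walk over each pair of adjacent words
      # put the two words in alphabetical order; $i"" forces a string comparison
      pair = $i"" < $(i+1) ? $i" "$(i+1) : $(i+1)" "$i
      c = ++count[pair]
      if (c > max) max = c        # remember the highest count seen so far
    }
  }
  END {
    # print every pair that reached the highest count (there may be ties)
    for (pair in count)
      if (count[pair] == max)
        print pair
  }'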
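
To run this over all ten chapters, one option is to wrap the program in a shell function and loop over the files. This is only a sketch: the function name `top_pairs` and the file names `chapter1.txt` ... `chapter10.txt` are assumptions, since the question does not say what the files are called. The result is piped through `sort` because the order of `for (pair in count)` in awk is unspecified.

```shell
# Hypothetical helper: print the most frequent adjacent word pair(s) of one file.
top_pairs() {
  awk '
    {
      $0 = tolower($0)
      for (i = 1; i < NF; i++) {
        pair = $i"" < $(i+1) ? $i" "$(i+1) : $(i+1)" "$i
        c = ++count[pair]
        if (c > max) max = c
      }
    }
    END {
      for (pair in count)
        if (count[pair] == max)
          print pair
    }' "$1" | sort   # sort only to make the order of tied pairs predictable
}

# Assumed file names; adjust the glob to match the real chapter files.
for f in chapter*.txt; do
  [ -e "$f" ] || continue        # skip the literal pattern if nothing matches
  printf '== %s ==\n' "$f"
  top_pairs "$f"
done
```

With the sample chapter1 from the question, both "boy green" and "hello world" occur twice, so both are printed. Appending `| head -n 1` after `sort` would keep only the alphabetically first of the tied pairs.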

Try this,

  1. Use awk to print each pair of words.
  2. Use perl to sort the two words within each pair.
  3. Use sort and uniq -c to count the occurrences of each pair.

awk '{for (i=1;i<NF;i++) { print tolower($i)" "tolower($(i+1)) }}' file \
| perl -ane '$,=" "; print sort @F; print "\n";' \
| sort | uniq -c | sort -b -k1nr -k2

Output:

  2 boy green
  2 hello world
  1 boy good
  1 boy sun
  1 good green
  1 good world
  1 green sun
  1 sun world
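
If perl is not available, the ordering of the two words can also be done inside awk itself. A minimal sketch of that variant, where `pair_counts` is just an illustrative function name and not part of the pipeline above:

```shell
# Hypothetical helper: list every adjacent word pair of a file with its count.
pair_counts() {
  awk '{
    for (i = 1; i < NF; i++) {
      a = tolower($i); b = tolower($(i+1))
      # print the two words in alphabetical order so "b a" and "a b" match
      print (a < b ? a " " b : b " " a)
    }
  }' "$1" | sort | uniq -c | sort -b -k1nr -k2
}
```

Calling `pair_counts file` should list the same pairs and counts as the awk and perl pipeline. Adding `| head -n 1` keeps only the top line, which in case of a tie is the alphabetically first pair.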
pLumo