Trying to find the frequency of words in a file using a script

Question

The file I have is called test and it contains the following lines:

This is a test Test test test There are multiple tests.

I want the output to be:

test@3 tests@1 multiple@1 is@1 are@1 a@1 This@1 There@1 Test@1

I have the following script:

 cat $1 | tr ' ' '\n' > temp # put all words to a new line
    echo -n > file2.txt # clear file2.txt
    for line in $(cat temp)  # trace each line from temp file
    do
    # check if the current line is visited
     grep -q $line file2.txt 
     if [ $line==$temp] 
     then
    count= expr `$count + 1` #count the number of words
     echo $line"@"$count >> file2.txt # add word and frequency to file
     fi
    done

score 5 · Answer 1 · answered Apr 01 '18 at 16:45


Use sort | uniq -c | sort -rn to create a frequency table ordered by descending count. A little more tweaking with awk and tr is needed to get the desired format:

 tr ' ' '\n' < "$1" \
 | sort \
 | uniq -c \
 | sort -rn \
 | awk '{print $2"@"$1}' \
 | tr '\n' ' '
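
To see what each stage contributes, here is a minimal sketch of the same pipeline wrapped in a function and fed the sample sentence on stdin instead of a file. The function name word_freq is hypothetical, and the awk step joins the items itself so the line ends with a newline rather than a trailing space. The order of words that share a count depends on the sort implementation and locale, so only the position of the top entry is predictable.

```shell
#!/bin/sh
# Hypothetical helper: reads text on stdin, prints "word@count" pairs on one line.
word_freq() {
    tr -s ' ' '\n' |   # one word per line; -s squeezes the repeated newlines left by runs of spaces
    sort |             # bring identical words next to each other
    uniq -c |          # collapse each group to "count word"
    sort -rn |         # highest count first
    awk '{ printf "%s%s@%s", sep, $2, $1; sep = " " } END { print "" }'
}

printf '%s\n' 'This is a test Test test test There are multiple tests.' | word_freq
```

Note that splitting on spaces alone leaves the final period attached, so the last word is counted as tests. rather than tests.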


choroba

47,233

For the assignment I have to use this format, with a for loop and an if. – Samurai Bale Apr 01 '18 at 17:52

score 3 · Answer 2 · answered Apr 01 '18 at 16:45


grep + sort + uniq + sed pipeline:

grep -o '[[:alnum:]]*' file | sort | uniq -c | sed -E 's/[[:space:]]*([0-9]+) (.+)/\2@\1/'

The output:

a@1
are@1
is@1
multiple@1
test@3
Test@1
tests@1
There@1
This@1


RomanPerekhrest

30,212

score 2 · Answer 3 · answered Apr 01 '18 at 21:39


$ cat >wdbag.py
#!/usr/bin/python

from collections import *
import re, sys

text=' '.join(sys.argv[1:])       

t=Counter(re.findall(r"[\w']+", text.lower()))

for item in t:
  print item+"@"+str(t[item])

$ chmod 755 wdbag.py 

$ ./wdbag.py "This is a test Test test test There are multiple tests."
a@1
tests@1
multiple@1
this@1
is@1
there@1
are@1
test@4

$ ./wdbag.py This is a test Test test test There are multiple tests.
a@1
tests@1
multiple@1
this@1
is@1
there@1
are@1
test@4

Ref: https://stackoverflow.com/a/11300418/3720510


Hannu

494

Add a trailing , at the end of the print line to get the list on a single row. – Hannu Apr 01 '18 at 21:48
why are you using lower() when the desired output is differentiated according to case? – Jeff Mar 27 '21 at 22:03
Because I'm not a Practically Perfect Person: If you look behind the Ref: -link you see that the code is a perfect copy of what you find there. Now: You might consider .lower() to be an option that is easy to remove, but harder to add - if you do not know about it. Therefore the answer 'covers more ground' than without it. – Hannu Mar 28 '21 at 05:17
FWIW: to make this work with Python 3, use python3 on the first line and replace the print line with print( item+"@"+str(t[item]) ) (added parentheses). Optionally add , end='' inside the last ) of print to have the items listed on a single line. – Hannu Mar 28 '21 at 05:25

αғsнιη · Answer 4 · 2018-04-02T02:59:50.813

With awk only:

 awk -v RS='( |\\.|\n)' '{s[$0]++} 
     END{for (x in s) {printf "%s%s", SEP,x"@"s[x]; SEP=" "}; print ""}' infile

This sets the record separator (RS) to a space, a dot, or a newline, so each word becomes its own record. Each record is used as a key in an array called s, and the value stored under that key is incremented every time the word is seen.

In the END block, the loop walks over the array and prints each key x (the word), an @, and its number of occurrences s[x].

The SEP variable puts a space between the entries: it is empty when the first entry is printed and is set to a space for every entry after that.
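
A regular expression in RS is supported by gawk and mawk, but POSIX leaves a multi-character RS unspecified. As a hedged sketch of the same counting idea for other awks, the splitting can be moved into tr; the function name count_words and the array name seen are made up for this example, and the dot handling simply mirrors the RS above.

```shell
#!/bin/sh
# Hypothetical wrapper around the counting idea: stdin in, "word@count ..." out.
count_words() {
    tr -s ' .' '\n\n' |   # turn spaces and dots into newlines, squeezing repeats
    awk '
        length($0) { seen[$0]++ }       # key = the word, value = times seen
        END {
            for (w in seen) {           # iteration order is unspecified in awk
                printf "%s%s@%d", sep, w, seen[w]
                sep = " "               # empty before the first item, a space afterwards
            }
            print ""                    # finish the line with a newline
        }'
}

printf 'This is a test Test test test There are multiple tests.\n' | count_words
```

The length($0) pattern skips any empty line, so a leading space or a trailing dot does not produce an entry with an empty word.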

score 0 · Answer 5 · answered Apr 01 '18 at 17:24


Using grep and awk:

 grep -o '[[:alnum:]]*' file | awk '{ count[$0]++; next}END {ORS=" "; for (x in count)print x"@"count[x];print "\n"}'

tests@1 Test@1 multiple@1 a@1 This@1 There@1 are@1 test@3 is@1


Bharat

814

MiniMax · Answer 6 · 2018-04-01T20:52:18.690

gawk '
{
    for(i = 1; i <= NF; i++) {
        arr[$i]++
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"

    for(i in arr) {
        printf "%s@%s ", i, arr[i]
    }
    print ""
}
' FPAT='[a-zA-Z]+' input.txt

Explanation

PROCINFO["sorted_in"] = "@val_num_desc" - Order by element values in descending order (rather than by indices). Scalar values are compared as numbers. See Predefined Array Scanning Orders.

FPAT='[a-zA-Z]+' - A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of the FS variable as the field separator.
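
Both FPAT and PROCINFO["sorted_in"] are gawk extensions. Where only a POSIX awk is available, a rough equivalent might look like the sketch below: match() pulls out each run of letters in place of FPAT, and an external sort replaces the sorted array scan. The function name count_alpha_words is hypothetical, and the order of words with equal counts depends on the locale used by sort.

```shell
#!/bin/sh
# Hypothetical helper emulating FPAT="[a-zA-Z]+" with POSIX awk match().
count_alpha_words() {
    awk '
    {
        rest = $0
        # Repeatedly pull the next run of letters out of the line.
        while (match(rest, /[a-zA-Z]+/)) {
            word = substr(rest, RSTART, RLENGTH)
            count[word]++
            rest = substr(rest, RSTART + RLENGTH)   # continue after this match
        }
    }
    END {
        for (w in count) print count[w], w          # emit "count word" lines
    }' |
    sort -k1,1nr -k2,2 |   # by count, descending, then by word
    awk '{ printf "%s%s@%s", sep, $2, $1; sep = " " } END { print "" }'
}

line='This is a test Test test test There are multiple tests.'
printf '%s\n' "$line" "$line" "$line" | count_alpha_words
```

With the three-line input from this answer, the first item printed is test@9, since it is the only word with that count.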

Input

This is a test Test test test There are multiple tests.
This is a test Test test test There are multiple tests.
This is a test Test test test There are multiple tests.

Output

test@9 tests@3 Test@3 multiple@3 a@3 This@3 There@3 are@3 is@3

Could you use the code I already posted? I'm restricted to that format. – Samurai Bale, Apr 01 '18 at 20:49
@SamuraiBale No, because of this: "Why is using a shell loop to process text considered bad practice?". You are using an inappropriate approach for this task. It is like writing a game in bash: it can be done (for example, "tetris" in a terminal window), but bash is not the right tool for it. The same goes for text processing: awk is the right tool, a bash loop is not. – MiniMax, Apr 01 '18 at 21:06

score 0 · Answer 7 · edited Apr 11 '18 at 22:37

Since the OP asked for an answer in the same kind of format:

bash-4.1$ cat test.sh
#!/bin/bash

tr ' ' '\n' < "${1}" > temp              # one word per line
while read -r line
do
    count=$(grep -cw "${line}" temp)     # lines containing this word as a whole word
    echo -n "${line}@${count} "
done < temp
echo ""

bash-4.1$ bash test.sh test.txt
This@1 is@1 a@1 test@3 Test@1 test@3 test@3 There@1 are@1 multiple@1 tests.@1

bash-4.1$ cat test.txt
This is a test Test test test There are multiple tests.
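
The output above lists test@3 three times because every line of temp is reported, including repeats. A possible variation that keeps the loop-and-if shape but reports each word only once is sketched below. It assumes the file name is passed as the first argument; the function name count_once and the two temporary files are made up for this example. grep -Fx compares whole lines literally, so a word such as tests. is not treated as a regular expression.

```shell
#!/bin/sh
# Hypothetical function: prints word@count once per distinct word in file "$1".
# The body runs in a subshell ( ... ) so its variables do not leak to the caller.
count_once() (
    words=$(mktemp) || exit 1
    seen=$(mktemp) || exit 1
    tr -s ' ' '\n' < "$1" > "$words"        # one word per line
    sep=''
    while IFS= read -r word; do
        [ -n "$word" ] || continue          # skip empty lines
        # Only report a word the first time it is met.
        if ! grep -Fxq -- "$word" "$seen"; then
            count=$(grep -Fxc -- "$word" "$words")
            printf '%s%s@%s' "$sep" "$word" "$count"
            sep=' '
            printf '%s\n' "$word" >> "$seen"   # remember it as visited
        fi
    done < "$words"
    echo
    rm -f "$words" "$seen"
)

# Usage: count_once test.txt
```

For the test.txt shown above this prints This@1 is@1 a@1 test@3 Test@1 There@1 are@1 multiple@1 tests.@1, in order of first appearance. Sorting by count would still need an extra step such as sort.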
