3

The file I have is called test and it contains the following lines:

This is a test Test test test There are multiple tests.

I want the output to be:

test@3 tests@1 multiple@1 is@1 are@1 a@1 This@1 There@1 Test@1

I have the following script:

    cat $1 | tr ' ' '\n' > temp # put all words to a new line
    echo -n > file2.txt # clear file2.txt
    for line in $(cat temp)  # trace each line from temp file
    do
        # check if the current line is visited
        grep -q $line file2.txt 
        if [ $line==$temp] 
        then
            count= expr `$count + 1` #count the number of words
            echo $line"@"$count >> file2.txt # add word and frequency to file
        fi
    done

7 Answers

5

Use sort | uniq -c | sort -rn to build a frequency table ordered by descending count. A little more tweaking with awk and tr is needed to get the desired format.

 tr ' ' '\n' < "$1" \
 | sort \
 | uniq -c \
 | sort -rn \
 | awk '{print $2"@"$1}' \
 | tr '\n' ' '
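The pipeline above splits on spaces only, so the final word comes out as tests. with its dot still attached. As a minimal sketch (the function name word_freq is made up for illustration, and LC_ALL=C is an assumption that limits "word characters" to ASCII letters and digits), the same approach can strip punctuation and make the tie order explicit:

```shell
#!/bin/sh
# word_freq FILE  (hypothetical helper name, not part of the answer above)
# Prints word@count pairs on one line, highest count first.
word_freq() {
    # Turn every run of non-alphanumeric characters into one newline.
    LC_ALL=C tr -cs '[:alnum:]' '\n' < "$1" |
        LC_ALL=C sort |                      # group identical words together
        uniq -c |                            # prefix each word with its count
        LC_ALL=C sort -k1,1nr -k2,2r |       # count descending, then word in reverse byte order
        awk '{ printf "%s%s@%s", sep, $2, $1; sep = " " } END { print "" }'
}
```

Called as word_freq test on the file from the question, this should print the pairs in the order the question asks for. The reverse byte order for ties is only there to reproduce that order; drop -k2,2r if the order among equal counts does not matter.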
choroba
  • 47,233
3

grep + sort + uniq + sed pipeline:

grep -o '[[:alnum:]]*' file | sort | uniq -c | sed -E 's/[[:space:]]*([0-9]+) (.+)/\2@\1/'

The output:

a@1
are@1
is@1
multiple@1
test@3
Test@1
tests@1
There@1
This@1
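The output above has one pair per line, while the question asks for a single row. A possible sketch of the same pipeline with the lines joined by paste (the function name freq_one_line is hypothetical, grep -o is assumed to be available since it is not required by POSIX, and LC_ALL=C is used here only to make the sort order predictable):

```shell
#!/bin/sh
# freq_one_line FILE  (hypothetical helper name)
# Same grep + sort + uniq + sed idea, joined into a single row.
freq_one_line() {
    grep -oE '[[:alnum:]]+' "$1" |          # one word per line
        LC_ALL=C sort |                     # byte order: uppercase before lowercase
        uniq -c |                           # count adjacent duplicates
        sed -E 's/[[:space:]]*([0-9]+) (.+)/\2@\1/' |   # "  3 test" -> "test@3"
        paste -s -d ' ' -                   # join all lines with single spaces
}
```

The result is ordered by word rather than by count; sorting by count would need an extra sort -rn step before the sed command.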
2
$ cat >wdbag.py
#!/usr/bin/python

from collections import *
import re, sys

text=' '.join(sys.argv[1:])       

t=Counter(re.findall(r"[\w']+", text.lower()))

for item in t:
  print item+"@"+str(t[item])

$ chmod 755 wdbag.py 

$ ./wdbag.py "This is a test Test test test There are multiple tests."
a@1
tests@1
multiple@1
this@1
is@1
there@1
are@1
test@4

$ ./wdbag.py This is a test Test test test There are multiple tests.
a@1
tests@1
multiple@1
this@1
is@1
there@1
are@1
test@4

Ref: https://stackoverflow.com/a/11300418/3720510

Hannu
  • 494
  • Add a trailing , at the end of the print line to get the list on a single row. – Hannu Apr 01 '18 at 21:48
  • why are you using lower() when the desired output is differentiated according to case? – Jeff Mar 27 '21 at 22:03
  • Because I'm not a Practically Perfect Person: If you look behind the Ref: -link you see that the code is a perfect copy of what you find there. Now: You might consider .lower() to be an option that is easy to remove, but harder to add - if you do not know about it. Therefore the answer 'covers more ground' than without it. – Hannu Mar 28 '21 at 05:17
  • FWIW: python3 on the first line, and print( item+"@"+str(t[item]) ) as the replacement (added parentheses), is what is required to make this work with py3. Optionally, add , end='' inside the last ) of print to have the items listed on a single line. – Hannu Mar 28 '21 at 05:25
1

With awk only:

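 awk -v RS='( |\\.|\n)+' '{s[$0]++}
     END{for (x in s) {printf "%s%s", SEP,x"@"s[x]; SEP=" "}; print ""}' infile

This defines the record separator (RS) as a run of one or more spaces, dots, or newlines, so each record is a single word; the + keeps the dot followed by the newline at the end of the line from producing an empty record. A regular expression as RS is supported by gawk and mawk, but POSIX leaves a multi-character RS unspecified. Each word is used as a key in the array s, and the value stored under that key is incremented every time the word is seen, so it ends up holding that word's number of occurrences.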

In the END block, loop over the elements of the array and print each key (word) x, an @, and its count s[x].

SEP is a variable used to put a space between the words: it is empty when the first word is printed and is set to a space for the second and all following words.
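For awk implementations where a regular-expression RS is not available, the same counting and SEP idea can be sketched by working on fields instead of records. This is only an illustration: the function name count_words_awk is made up, and treating everything except ASCII letters and digits as a separator is an assumption.

```shell
#!/bin/sh
# count_words_awk FILE  (hypothetical helper name)
count_words_awk() {
    awk '
    {
        gsub(/[^A-Za-z0-9]+/, " ")            # punctuation becomes spaces; $0 is re-split
        for (i = 1; i <= NF; i++) s[$i]++     # one array slot per distinct word
    }
    END {
        # SEP is empty for the first pair and a space for every later one.
        for (x in s) { printf "%s%s@%d", SEP, x, s[x]; SEP = " " }
        print ""
    }' "$1"
}
```

As with the original, the order in which for (x in s) visits the words is unspecified, so the pairs may come out in any order.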

αғsнιη
  • 41,407
0

Using grep and awk:

 grep -o '[[:alnum:]]*' file | awk '{ count[$0]++; next}END {ORS=" "; for (x in count)print x"@"count[x];print "\n"}'

tests@1 Test@1 multiple@1 a@1 This@1 There@1 are@1 test@3 is@1

Bharat
  • 814
0
gawk '
{
    for(i = 1; i <= NF; i++) {
        arr[$i]++
    }
}
END {
    PROCINFO["sorted_in"] = "@val_num_desc"

    for(i in arr) {
        printf "%s@%s ", i, arr[i]
    }
    print ""
}
' FPAT='[a-zA-Z]+' input.txt

Explanation

PROCINFO["sorted_in"] = "@val_num_desc" - Order by element values in descending order (rather than by indices). Scalar values are compared as numbers. See Predefined Array Scanning Orders.

FPAT='[a-zA-Z]+' - A regular expression describing the contents of the fields in a record. When set, gawk parses the input into fields, where the fields match the regular expression, instead of using the value of the FS variable as the field separator.

Input

This is a test Test test test There are multiple tests.
This is a test Test test test There are multiple tests.
This is a test Test test test There are multiple tests.

Output

test@9 tests@3 Test@3 multiple@3 a@3 This@3 There@3 are@3 is@3 
MiniMax
  • 4,123
0

As OP asked in the same kind of format...

bash-4.1$ cat test.sh
#!/bin/bash

tr ' ' '\n' < "${1}" > temp
while IFS= read -r line
do
    count=$(grep -cw -- "${line}" temp)
    echo -n "${line}@${count} "
done < temp
echo ""

bash-4.1$ bash test.sh test.txt
This@1 is@1 a@1 test@3 Test@1 test@3 test@3 There@1 are@1 multiple@1 tests.@1

bash-4.1$ cat test.txt
This is a test Test test test There are multiple tests.
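The loop above prints a pair for every occurrence, so test@3 shows up three times and the last word keeps its dot. A possible variation that stays close to the same loop style, offered as a sketch only (the function name count_loop is hypothetical, mktemp is assumed to be available, and only ASCII letters and digits are treated as word characters):

```shell
#!/bin/sh
# count_loop FILE  (hypothetical helper name)
count_loop() {
    words=$(mktemp)
    # One word per line, with punctuation and whitespace used as separators.
    tr -cs 'A-Za-z0-9' '\n' < "$1" > "$words"
    # Visit each distinct word once and count the lines that equal it exactly.
    LC_ALL=C sort -u "$words" | while IFS= read -r word; do
        printf '%s@%s ' "$word" "$(grep -cxF -- "$word" "$words")"
    done
    echo
    rm -f "$words"
}
```

Here grep -x -F matches whole lines as fixed strings, so a word such as a is not counted inside are, and characters that are special in regular expressions are taken literally. The file is still scanned once per distinct word, so the sort | uniq -c approaches scale better on large inputs.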
Kamaraj
  • 4,365