Using a single command-line command, how would I search every text file in a database to find the 10 most used words?

Question

This answered question explains how to search and sort the words in a specific file, but how would you accomplish this for an entire directory tree? I have 1 million text files that I need to search for the ten most frequently used words.

database= /data/000/0000000/s##_date/*.txt - /data/999/0999999/s##_data/*txt

Everything I have attempted results in sorting filenames, paths, or directory errors.

I have made some progress with grep, but parts of filenames seem to appear in my results.

grep -r . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head  -10
output:
 1145 
    253 txt
    190 s01
    132 is
    126 of
    116 the
    108 and
    104 test
     92 with
     84 in

The 'txt' and 's01' come from file names and not from the text inside the text file. I know there are ways of excluding common words like "the" but would rather not sort and count file names at all.

When you say "command line" that suggests bash which is probably not best suited to your task. — jdwolf, Jan 22 '18 at 01:41
The link doesn’t work. Are you just looking for the top ten file names, or words in the files? — Guy, Jan 22 '18 at 03:45
@drewbenn, cat would only work for smaller test cases but not the whole database. However, --no-filename worked. Thank you. — dpoiesz, Jan 22 '18 at 22:21

Kusalananda · Accepted Answer · 2018-02-15T21:34:13.760

When more than one file is searched, grep prefixes each matching line with the name of the file it came from. That is what's happening in your case: the path components end up in the stream that tr splits into words.
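A minimal sketch of that behaviour, using a throwaway directory and made-up file names (both are assumptions for illustration only). It also shows the -h option (--no-filename in GNU grep) that the asker reported as working. Note that -r and -h are common in GNU and BSD grep but are not required by POSIX.

```shell
# Throwaway demo directory; the file names are hypothetical.
demo=$(mktemp -d)
printf 'the cat\n' > "$demo/s01_a.txt"
printf 'the dog\n' > "$demo/s01_b.txt"

# Several files searched: every line is prefixed with "path:".
with_names=$(grep -r . "$demo" | sort)

# -h suppresses the file name prefix, leaving only file content.
without_names=$(grep -rh . "$demo" | sort)

printf '%s\n' "$with_names" "$without_names"
rm -rf "$demo"
```

With the prefix present, pieces such as txt and s01 are counted once per matching line, which explains why they rank so high in the question's output.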

Instead of using grep (which is an inspired but slow workaround for not being able to cat all the files on the command line in one go), you can concatenate all the text files and process the result as one big document, like this:

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head

I've added -s to tr so that multiple consecutive newlines are compressed into one, and I change all non-alphanumerics to newlines ([\n*] made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.
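To see what the tr step does on its own, here is a small sketch with a made-up sample string (the string and variable names are assumptions for illustration): -c selects everything that is not alphanumeric, and -s squeezes each run of replacements into a single newline.

```shell
# Hypothetical sample text with punctuation and repeated spaces.
sample='Hello, world!  Hello again...'

# Every run of non-alphanumeric characters becomes exactly one newline,
# so the result is one word per line.
words=$(printf '%s\n' "$sample" | tr -cs '[:alnum:]' '\n')

printf '%s\n' "$words"
```

Without -s, each comma, space, and period would produce its own newline, leaving many empty lines for sort and uniq to count.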

The find command finds all regular files (-type f) anywhere under /data whose names match the pattern *.txt. cat is then invoked on as many of those files at a time as will fit on one command line (this is what -exec cat {} + does). If you have a huge number of files, cat may be invoked many times, but that does not affect the rest of the pipeline, which just reads the combined output stream of find and cat.
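The following sketch runs the same pipeline against a small temporary tree rather than /data. The directory layout, file names, and file contents are invented for the demonstration, and head -n 2 is used only to keep the output short.

```shell
# Build a tiny stand-in for the real tree (all names are hypothetical).
demo=$(mktemp -d)
mkdir -p "$demo/000/s01_data" "$demo/999/s02_data"
printf 'the quick fox and the lazy dog\n' > "$demo/000/s01_data/a.txt"
printf 'the fox\n' > "$demo/999/s02_data/b.txt"
# Not a .txt file, so find should skip it.
printf 'ignored ignored ignored ignored\n' > "$demo/999/s02_data/notes.log"

# Same pipeline as in the answer, limited to the two most frequent words.
top=$(find "$demo" -type f -name '*.txt' -exec cat {} + |
  tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head -n 2)

# uniq -c pads counts differently across implementations, so pull the
# word out with awk instead of relying on exact spacing.
top_word=$(printf '%s\n' "$top" | awk 'NR == 1 {print $2}')

printf '%s\n' "$top"
rm -rf "$demo"
```

Because only file contents reach tr, directory and file names such as s01 or txt never show up in the counts.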

To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
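As a rough illustration of where an empty line can come from, the sketch below uses a made-up sample that starts with a quotation mark, so tr emits a newline before the first word. The sample text and the awk-based counter are assumptions added for the demonstration.

```shell
# Hypothetical input that begins with punctuation.
sample='"Hi," she said. "Hi!"'

# Count empty lines after the tr step alone.
blank_before=$(printf '%s\n' "$sample" | tr -cs '[:alnum:]' '\n' |
  awk '$0 == "" {n++} END {print n + 0}')

# Count empty lines once the sed filter has been added.
blank_after=$(printf '%s\n' "$sample" | tr -cs '[:alnum:]' '\n' |
  sed '/^ *$/d' | awk '$0 == "" {n++} END {print n + 0}')

printf 'before: %s, after: %s\n' "$blank_before" "$blank_after"
```

Placing the sed filter before the first sort means sort has slightly less to do, though either position gives the same counts.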
