
This answered question explains how to search and sort the words in a specific file, but how would you accomplish this for an entire directory tree? I have 1 million text files that I need to search for the ten most frequently used words.

database= /data/000/0000000/s##_date/*.txt - /data/999/0999999/s##_data/*txt

Everything I have attempted results in sorting filenames, paths, or directory errors.

I have made some progress with grep, but parts of filenames seem to appear in my results.

grep -r . * | tr -c '[:alnum:]' '[\n*]' | sort | uniq -c | sort -nr | head  -10
output:
 1145 
    253 txt
    190 s01
    132 is
    126 of
    116 the
    108 and
    104 test
     92 with
     84 in

The 'txt' and 's01' come from file names and not from the text inside the text file. I know there are ways of excluding common words like "the" but would rather not sort and count file names at all.

Jeff Schaller
dpoiesz

1 Answer


When more than one file is searched, grep prefixes each matching line with the name of the file it came from, which is what's happening in your case.
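A minimal sketch of that behaviour, using two hypothetical sample files in a temporary directory (the file names and contents are made up for illustration). It also shows the -h option, which GNU and BSD grep accept for suppressing the filename prefix; check your own grep's manual before relying on it.

```shell
# Hypothetical sample data, not the asker's real files
tmpdir=$(mktemp -d)
printf 'hello world\n' > "$tmpdir/a.txt"
printf 'hello again\n' > "$tmpdir/b.txt"

# Several files searched: every matching line is prefixed with "pathname:"
with_names=$(grep . "$tmpdir"/*.txt)

# -h drops the pathname prefix, leaving only the file contents
without_names=$(grep -h . "$tmpdir"/*.txt)

printf '%s\n' "$with_names"
printf '%s\n' "$without_names"

rm -rf "$tmpdir"
```

The prefixed form is what ends up being split into "words" by tr, which is why fragments such as txt and s01 are counted.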

Instead of using grep (which is an inspired but slow solution to not being able to cat all files on the command line in one go), you may actually cat all the text files together and process the result as one big document, like this:

find /data -type f -name '*.txt' -exec cat {} + |
tr -cs '[:alnum:]' '\n' | sort | uniq -c | sort -nr | head

I've added -s to tr so that multiple consecutive newlines are compressed into one, and I change all non-alphanumerics to newlines ([\n*] made little sense to me). The head command produces ten lines of output by default, so -10 (or -n 10) is not needed.

The find command finds all regular files (-type f) anywhere under /data whose filenames match the pattern *.txt. cat is invoked to concatenate as many of those files as possible at a time (this is what -exec cat {} + does). cat may be invoked many times if you have a huge number of files, but that does not affect the rest of the pipeline, as it just reads the combined output stream from find+cat.
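The batching can be observed with a small sketch. Here cat is swapped for a tiny sh -c script that prints how many pathnames it was given; the directory layout and file names are hypothetical.

```shell
# Hypothetical sample tree, not the asker's real data
tmpdir=$(mktemp -d)
mkdir -p "$tmpdir/000" "$tmpdir/001"
printf 'one two\n'   > "$tmpdir/000/a.txt"
printf 'two three\n' > "$tmpdir/001/b.txt"
printf 'ignored\n'   > "$tmpdir/001/c.log"   # does not match *.txt

# "{} +" passes as many pathnames as possible to each invocation,
# so the two matching files arrive in a single batch
batched=$(find "$tmpdir" -type f -name '*.txt' -exec sh -c 'echo "$#"' sh {} +)

# "{} \;" runs the command once per file instead
per_file=$(find "$tmpdir" -type f -name '*.txt' -exec sh -c 'echo "$#"' sh {} \;)

printf 'batched:\n%s\nper file:\n%s\n' "$batched" "$per_file"

rm -rf "$tmpdir"
```

With a million files, the "+" form would print a handful of large numbers rather than a million ones, which is why it is the faster choice here.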


To avoid counting empty lines, you may want to insert sed '/^ *$/d' just before or just after the first sort in the pipeline.
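As a rough illustration, here is the whole pipeline with the sed filter placed before the first sort, run against two hypothetical files whose contents start with punctuation (which is what produces an empty line after tr). The sample text is invented for the sketch.

```shell
# Hypothetical sample data; each file starts with a non-alphanumeric character
tmpdir=$(mktemp -d)
printf '"the cat" and the dog\n' > "$tmpdir/a.txt"
printf '(the end)\n'             > "$tmpdir/b.txt"

# Without the filter, the leading punctuation becomes one empty line
empties=$(find "$tmpdir" -type f -name '*.txt' -exec cat {} + |
  tr -cs '[:alnum:]' '\n' | grep -c '^$')

# With the filter, empty lines are dropped before sorting and counting
top=$(find "$tmpdir" -type f -name '*.txt' -exec cat {} + |
  tr -cs '[:alnum:]' '\n' | sed '/^ *$/d' |
  sort | uniq -c | sort -nr | head -n 1)

printf 'empty lines before filtering: %s\n' "$empties"
printf 'most frequent word: %s\n' "$top"

rm -rf "$tmpdir"
```

Because tr -s already squeezes runs of newlines, the filter only has to remove the single empty line that can appear at the very start of the stream.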

Kusalananda