
I have a directory with over 150K files. I want to create a list of all the files that contain the text stored in a variable, and store that list of files in another variable.

I first tried:

searchtext="Subject: Your"
files = $(grep "$searchtext" ./* | awk '{print ($1)}' )

While that works for a moderate number of files in the directory, it generates the error "Argument list too long" when run on the directory with 150K files. (The awk print was used to extract just the filename from the grep result.)

I found that files=$(grep "$searchtext" ./* | awk '{print ($1)}') works for the 150K file directory, but it takes almost 90 minutes to run.

If present in the file, the $searchtext string will be located near the beginning of the file. So I thought I could speed this up greatly if the grep was restricted to, say, the first 30 lines of text. Not being sure how that could be done, I found "How do I grep the first 50 lines of each file in a directory recursively?" and tried several of the suggestions there. The one that seemed best suited for my task was:

searchtext="Subject: Your"
find . -type f -exec head -n 30 {} + | grep "$searchtext"

This runs in an acceptable time, but it does not output the names of the files that contain the search text. I tried grep -l, but that results in an error: "find: `head' terminated by signal 13". Somewhere it was suggested that using "\" instead of "+" might be more appropriate. However, that also generates an error: "find: missing argument to `-exec'".

Looking ahead to when the grep result includes the file names, I expect another issue. When I try to assign the grep output to a variable as:

files = $(find . -type f -exec head -n 30 {} + | grep "$searchtext")

I get an error "ut1.sh: line 16: files: command not found ". For some reason, the variable "files" is being interpreted as a command? My script name is ut1.sh . I have assigned variables this way many times before without issue.

My bash version is GNU bash, version 4.1.2(2)-release (x86_64-redhat-linux-gnu)

How to get the job done, and what was wrong with my attempts?

thanks

Mike
    variable assignments in the shell don't take spaces around the =: files = foo bar would run the command files with three arguments. see e.g. here – ilkkachu Jul 04 '17 at 19:36

2 Answers


To get the list of filenames that grep matches, you could use the -l switch to just get the filename, no need to use awk to process the output. This is faster in the case of matching files, too, since grep can stop after the pattern is found once.

grep -le "$searchtext" ./* 

You could put the output from that in a variable, with simple assignment (but filenames with whitespace and glob characters will cause issues):

files=$(grep -le "$searchtext" ./* ) 
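Note that ./* still expands to every filename on a single command line, so on the 150K-file directory it can hit the same "Argument list too long" limit as in the question. A minimal sketch of one way around that, assuming GNU or POSIX find: let find hand the names to grep in batches with {} +. The demo directory and file names below are made up for illustration, and -F (treat the text as a fixed string) is my addition.

```shell
#!/bin/bash
# Hypothetical demo directory; in practice this would be the real directory.
demo=$(mktemp -d)
printf 'Subject: Your order\nbody\n' > "$demo/a.txt"
printf 'Subject: Other\nbody\n'      > "$demo/b.txt"

searchtext="Subject: Your"

# No spaces around "=": 'files = ...' would run a command named "files".
# find passes the names to grep in batches, so no single command line
# has to hold all 150K names at once.
files=$(cd "$demo" && find . -type f -exec grep -lFe "$searchtext" {} +)

printf '%s\n' "$files"
rm -rf "$demo"
```

As with the plain assignment above, a newline-separated string like this is still fragile for filenames containing newlines.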

As for this:

find . -type f -exec head -n 30 {} + | grep "$searchtext"

The pipe here separates the find and the grep, so you're effectively concatenating the first 30 lines of every file (losing track of file names here), and then grepping the result. grep -l can only tell you if there are any matches in the whole input. You'd need to run a shell from within find to combine the head and grep for each file individually:

export searchtext
find . -type f -exec sh -c 'head -n 30 "$1" | grep -q "$searchtext" && echo "$1"' sh {} \;
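With \; that starts one sh per file, which adds up over 150K files. As a rough sketch (not benchmarked), the same head-then-grep check can loop over a batch of names inside each sh by using {} + instead. The demo files and the `matches` variable are assumptions for illustration, and I've used grep -F so the text is matched literally.

```shell
#!/bin/bash
# Hypothetical demo files: one match within the first 30 lines, one after them.
demo=$(mktemp -d)
printf 'Subject: Your order\n' > "$demo/early.txt"
{ seq 1 40; echo 'Subject: Your late'; } > "$demo/late.txt"

export searchtext="Subject: Your"

# Each sh invocation receives many file names and checks them one by one.
matches=$(cd "$demo" && find . -type f -exec sh -c '
  for f do
    if head -n 30 "$f" | grep -qF -e "$searchtext"; then
      printf "%s\n" "$f"
    fi
  done' sh {} +)

printf '%s\n' "$matches"
rm -rf "$demo"
```

Only early.txt is reported here, since the match in late.txt is on line 41.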

But we might as well use awk to do that. This would look for the pattern only on the first 30 lines (GNU awk):

awk -vpattern="$searchtext" 'FNR <= 30 && $0 ~ pattern { print FILENAME; nextfile }' *

or with find:

find . -type f -exec awk -vpattern="$searchtext" 'FNR <= 30 && $0 ~ pattern { print FILENAME; nextfile }' {} +
ilkkachu

With bash 4.4+ and GNU grep:

readarray -td '' files < <(grep -rZFle "$searchtext" .)
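The -d option of readarray needs bash 4.4, and the question mentions bash 4.1.2. A sketch of the usual fallback for older bash, assuming GNU grep for -Z: read the NUL-delimited names one at a time and append them to the array. The demo directory and file names are hypothetical.

```shell
#!/bin/bash
# Hypothetical demo directory, including a file name with a space in it.
demo=$(mktemp -d)
printf 'Subject: Your order\n'   > "$demo/with space.txt"
printf 'Subject: Your invoice\n' > "$demo/plain.txt"
printf 'Subject: Other\n'        > "$demo/other.txt"

searchtext="Subject: Your"

files=()
# grep -Z ends each file name with a NUL byte; read -d '' splits on NUL,
# so spaces and newlines in names survive intact.
while IFS= read -r -d '' f; do
  files+=("$f")
done < <(grep -rZFle "$searchtext" "$demo")

printf 'matched %d files\n' "${#files[@]}"
for f in "${files[@]}"; do
  printf '%s\n' "$f"
done
rm -rf "$demo"
```

Later uses should quote the expansion as "${files[@]}" so each name stays a single argument.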

If it's email files, you probably only want to search in the headers here since you seem to be looking for a subject. With GNU awk:

readarray -td '' files < <(
  SEARCH="$searchtext" find . -type f -exec gawk -v ORS='\0' -v RS='\r?\n' '
    $0 == "" {nextfile}
    index($0, ENVIRON["SEARCH"]) {print FILENAME; nextfile}' {} +
)