How to find all files containing various strings from a long list of string combinations?

Question

I am still very new to command line tools (using my Mac OSX terminal) and hope I haven't missed the answer somewhere else, but I have searched for hours.

I have a text file (let's call it strings.txt) containing 200 combinations of 3 strings. [Edit 2017/01/30] The first five rows look like this:

"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

Note that I can change strings.txt to any other format, as long as the bigrams (2-word phrases) such as surveillance data in line 1 stay together. (That means I can delete the quotes if necessary, as for the answer by @MichaelVehrs below.)

Now I want to search a directory of more than 800 files for those files that contain at least one of the string combinations (anywhere in the file). My original idea was to use egrep with a pattern file like this:

egrep -i -l -r -f strings.txt file_directory

However, I can only get this to work if there is one string per line. This is not desirable, because I need the identified files to contain all three strings of a given pattern. Is there a way to add some kind of AND operator to the grep pattern file? Or is there another way to achieve what I want using another function/tool? Many thanks!

Edit 2017/01/30

The answer by @MichaelVehrs below was very helpful; I edited it to the following:

while read one two three four five six
do grep -ilFr "$one $two" *files* | xargs grep -ilFr "$three $four" |  xargs grep -ilFr "$five $six"
done < *patternfile* | sort -u

This answer works when the pattern file contains the strings without quotes. Sadly, it only seems to match the pattern on the first line of the pattern file. Does anyone know why?

Edit 2017/01/29

A similar question about grepping multiple values has been asked before, but I need the AND logic in order to match one of the three-string-combinations from the pattern file strings.txt in the other files. I realise that the format of strings.txt might have to be changed for the matching to work and would appreciate suggestions.

Note that for the -f flag, each pattern must be on a separate line. So you would have to split your "social order" "government policies" "national security" into 3 lines, or use \| to separate each phrase within double quotes, like "social order\|government policies\|national security" — Sergiy Kolodyazhnyy, Jan 30 '17 at 00:09
Alternatively, if you need AND logic there (for matching multiple patterns within a line), you could switch to using perl or awk. See this for example: http://unix.stackexchange.com/a/177524/85039 — Sergiy Kolodyazhnyy, Jan 30 '17 at 00:12
Thank you @Serg I do need the AND logic and I believe I saw that post earlier, but I'm not sure how to combine the awk or perl statements with my input file... as I would prefer not to type out the 200 combinations. — ViolaW, Jan 30 '17 at 00:31
@ViolaW let me ask you this: does the file contain any regex expressions, or is it only consisting of phrases ? — Sergiy Kolodyazhnyy, Jan 30 '17 at 00:36
@Serg I can add regex expressions if that solves it! I tried something like "social order" & "national security" etc. but couldn't find the right answer.. basically I'm happy to turn it into whatever format is necessary so that I can match them — ViolaW, Jan 30 '17 at 00:38
@don_crissti thanks - are you asking about the format of the strings in the pattern file or in the files where I want to find them? At the moment I have deleted the quotes from the pattern file, because that seemed to match the answer given by MichaelVehrs. I don't mind whether they have quotes or not (in the files to be matched they don't). And they should be anywhere in the file; not necessarily on the same line. — ViolaW, Jan 30 '17 at 13:50
sorry I didn't know what you meant by that @don_crissti I haven't escaped any quotes — ViolaW, Jan 30 '17 at 14:14

Stéphane Chazelas · Answer 1 · 2017-02-01T10:43:52.700

I'd use perl, something like:

perl -MFile::Find -MClone=clone -lne '
  # parse the strings.txt input, here looking for the sequences of
  # 0 or more characters (.*?) in between two " characters
  for (/"(.*?)"/g) {
    # @needle is an array of associative arrays whose keys
    # are the "strings" for each line.
    $needle[$n]{$_} = undef;
  }
  $n++;

  END{
    sub wanted {
      return unless -f; # only regular files
      my $needle_clone = clone(\@needle);
      if (open FILE, "<", $_) {
        LINE: while (<FILE>) {
          # read the file line by line
          for (my $i = 0; $i < $n; $i++) {
            for my $s (keys %{$needle_clone->[$i]}) {
              if (index($_, $s)>=0) {
                # if the string is found, we delete it from the associative
                # array.
                delete $needle_clone->[$i]{$s};
                unless (%{$needle_clone->[$i]}) {
                  # if the associative array is empty, that means we have
                  # found all the strings for that $i, that means we can
                  # stop processing, and the file matches
                  print $File::Find::name;
                  last LINE;
                }
              }
            }
          }
        }
        close FILE;
      }
    }
    find(\&wanted, ".")
  }' /path/to/strings.txt

That means we minimize the number of string searches.

Here, we're processing the files line by line. If the files are reasonably small, you could process them as a whole which would simplify it a bit and might improve performance.

Note that it does expect the list file to be in the:

 "surveillance data" "surveillance technology" "cctv camera"
 "social media" "surveillance techniques" "enforcement agencies"
 "social control" "surveillance camera" "social security"
 "surveillance data" "security guards" "social networking"
 "surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

format, with a number (doesn't have to be 3) of quoted (with double quote) strings on each line. The quoted strings cannot contain double quote characters themselves. The double quote character is not part of the text being searched. That is if the list file contained:

"A" "B"
"1" "2" "3"

that would report the path of all the regular files in the current directory and below that contain either

both A and B
or (being not an exclusive or) all of 1, 2 and 3

anywhere in them.
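For readers without Perl or the Clone module, the same semantics can be sketched in plain shell. This is only an illustration, not the answer's method: the function names has_all_strings and files_matching_any_row are hypothetical, the match is case-sensitive like the Perl index() call, and it assumes the list file lives outside the directory being searched. It runs grep once per string and file, so it is far slower than the Perl version.

```shell
# Hypothetical helper: succeed if FILE contains every quoted string in ROW.
has_all_strings() {
    # $1 = file to search, $2 = one row of the list, e.g. "A" "B"
    printf '%s\n' "$2" | grep -o '"[^"]*"' | tr -d '"' |
    (
        # explicit subshell so that "exit" never leaves the calling shell
        while IFS= read -r needle; do
            grep -qF -- "$needle" "$1" || exit 1
        done
    )
}

# Hypothetical helper: print each regular file under DIR that satisfies
# at least one row of LIST (all strings of that row, anywhere in the file).
files_matching_any_row() {
    # $1 = list file, $2 = directory to search recursively
    find "$2" -type f | while IFS= read -r file; do
        while IFS= read -r row || [ -n "$row" ]; do
            case $row in
                *'"'*'"'*) ;;          # row has at least one quoted string
                *) continue ;;         # skip blank or unquoted rows
            esac
            if has_all_strings "$file" "$row"; then
                printf '%s\n' "$file"
                break                  # one matching row is enough
            fi
        done < "$1"
    done
}
```

It would be called as files_matching_any_row /path/to/strings.txt . to mirror the "." passed to find() above.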

Thank you! I know nothing about perl, so could you please clarify where I have to paste the path to my text file directory (the one to be searched)? — ViolaW, Jan 31 '17 at 18:19
@ViolaW, that's searching in the current directory (the "." argument to find()). The strings to search for are read from the files passed as arguments (like that /path/to/strings.txt, or standard input if no file is given), as per the processing done by -n — Stéphane Chazelas, Jan 31 '17 at 21:53
thank you for explaining this. somehow it's still not working on my machine. Actually I was asked not to pursue this problem any further. I appreciate everyone's help; unfortunately I can't make any of the solutions work (might be my own/ machine's problem); so I'm not sure whether I should somehow close this question? — ViolaW, Jan 31 '17 at 23:48
@ViolaW, no need to close the question. Incidentally, based on the notes on your profile page, you would do extremely well to learn some Perl. (The backronym for Perl is "Practical Extraction and Reporting Language," and it's unparalleled for text processing capabilities.) — Wildcard, Feb 01 '17 at 10:25
@Wildcard yes you're right, I have been told this before; it will be next on my list then :) — ViolaW, Feb 01 '17 at 10:36
@ViolaW, maybe I didn't get the requirements right. See edit for what this perl code is meant to do. — Stéphane Chazelas, Feb 01 '17 at 10:44

score 1 · Answer 2 · answered Jan 30 '17 at 07:38

The problem is a bit awkward, but you could approach it like this:

while read one two three four five six
  do grep -lF "$one $two" *files* | xargs grep -lF "$three $four" | xargs grep -lF "$five $six"
done < patterns | sort -u

This assumes that your pattern file contains exactly six words per line (three patterns of two words each). The logical AND is implemented by chaining three consecutive grep filters. Note that this is not particularly efficient. An awk solution would probably be faster.
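A more defensive variant of the same loop might look like the sketch below. It is an assumption-laden illustration rather than a fix confirmed in this thread: the function name find_all_three is hypothetical, and the guards address two common pitfalls (a last line without a trailing newline, which a plain read skips, and Windows line endings). It also avoids xargs, so file names containing spaces are handled, and it expects the pattern file to sit outside the searched directory.

```shell
# Hypothetical helper: list files under DIR that contain all three
# two-word phrases of at least one line in PATTERNS (case-insensitive).
find_all_three() {
    patterns=$1
    dir=$2
    # strip carriage returns in case the file has Windows line endings
    tr -d '\r' < "$patterns" |
    # "|| [ -n "$one" ]" keeps a final line that lacks a trailing newline
    while read -r one two three four five six || [ -n "$one" ]; do
        [ -n "$six" ] || continue       # skip blank or short lines
        grep -rilF -- "$one $two" "$dir" |
        while IFS= read -r file; do
            # keep the file only if the other two phrases are present too
            grep -qiF -- "$three $four" "$file" &&
            grep -qiF -- "$five $six" "$file" &&
            printf '%s\n' "$file"
        done
    done | sort -u
}
```

Usage would be along the lines of find_all_three patterns file_directory, with patterns holding six unquoted words per line.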

answered Jan 30 '17 at 07:38 by Michael Vehrs

thanks! I tried applying this but I'm so inexperienced that I must be doing something wrong. I need to replace *files* with the files that I want to search and patterns with the pattern file right? somehow I don't get any output :\ does this script need to be placed in a script file or can it be executed just like this? – ViolaW Jan 30 '17 at 11:47
@ViolaW You are correct. And the script can be used as is. Are you sure there are matches to be found? – Michael Vehrs Jan 30 '17 at 12:20
is it possible to adapt the script so that it applies to all lines in the pattern file? – ViolaW Jan 30 '17 at 14:25
@ViolaW What do you mean? The while loop reads all lines of the pattern file. – Michael Vehrs Jan 31 '17 at 06:54
in my test with two patterns (i.e. two lines in the pattern file) it only returns the texts that match the first pattern, not those that match the second one – ViolaW Jan 31 '17 at 18:26
It works for me. Try adding echo "$one $two $three $four $five $six" in the loop to see whether the pattern file is being read correctly. – Michael Vehrs Feb 01 '17 at 06:34
By the way, I read on SO in a similar post that using xargs -r instead of plain xargs ensures that the second grep will not be run when there are no matches: http://stackoverflow.com/questions/41896604/linux-listing-files-that-contain-several-words/41897975#41897975 – George Vasiliou Feb 01 '17 at 08:12
@GeorgeVasiliou Good point. – Michael Vehrs Feb 01 '17 at 08:45
@MichaelVehrs sorry i didn't see your reply. the xargs -r doesn't work on my computer. as the solution from GeorgeVasiliou worked for me I think I'll leave it at that. Many thanks for your help and I'm sure that people who know more about unix and while loops than me can make your method work. – ViolaW Feb 01 '17 at 12:15

George Vasiliou · Answer 3 · 2017-01-30T15:01:04.453

This is another approach that seems to work in my tests.

I copied your strings file data to a file named d1.txt and moved it to a separate directory (i.e. tmp) so that a later grep does not match the strings file itself (d1.txt).

Then insert a semicolon between the search terms in this strings file (d1.txt in my case) with the following command: sed -i 's/" "/";"/g' ./tmp/d1.txt

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"
$ sed -i 's/" "/";"/g' ./tmp/d1.txt
$ cat ./tmp/d1.txt
"surveillance data";"surveillance technology";"cctv camera"
"social media";"surveillance techniques";"enforcement agencies"
"social control";"surveillance camera";"social security"
"surveillance data";"security guards";"social networking"
"surveillance mechanisms";"cctv surveillance";"contemporary surveillance"

Then remove the double quotes using the command sed -i 's/"//g' ./tmp/d1.txt. PS: This may not be really necessary, but I removed the double quotes for testing.

$ sed -i 's/"//g' ./tmp/d1.txt && cat ./tmp/d1.txt
surveillance data;surveillance technology;cctv camera
social media;surveillance techniques;enforcement agencies
social control;surveillance camera;social security
surveillance data;security guards;social networking
surveillance mechanisms;cctv surveillance;contemporary surveillance
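On macOS the bundled BSD sed expects an argument after -i (for example sed -i ''), so the in-place commands above may fail there as written. A small sketch that sidesteps the issue by writing the converted patterns to standard output instead; the function name to_semicolon_patterns is hypothetical:

```shell
# Hypothetical helper: convert lines of "a b" "c d" "e f"
# into a b;c d;e f without modifying the input file.
to_semicolon_patterns() {
    sed -e 's/" "/;/g' -e 's/"//g' "$1"
}
```

For instance, to_semicolon_patterns strings.txt > ./tmp/d1.txt would produce the same d1.txt as the two sed -i steps.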

Now you can grep all files in the current directory with the program agrep, which is designed to provide multi-pattern grep with an AND operation.

agrep requires multiple patterns to be separated by a semicolon (;) in order to be evaluated as AND.

In my tests, I created two sample files with the following contents:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.

The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Running agrep on the current directory returns the correct lines (with AND) and filenames:

$ while IFS= read -r line;do agrep "$line" *;done<./tmp/d1.txt
d2.txt: The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
d3.txt: There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

thank you @GeorgeVasiliou I believe it would work if I wasn't on a Mac which doesn't seem to have agrep available. — ViolaW, Jan 31 '17 at 18:20
@ViolaW OK. If you like you can have a look here on how to make agrep to work on osx: https://github.com/Wikinaut/agrep/issues/2 — George Vasiliou, Jan 31 '17 at 22:42

George Vasiliou · Accepted Answer · 2017-02-01T12:17:18.157

Since agrep does not seem to be present on your system, have a look at this alternative based on sed and awk, which applies grep with an AND operation to patterns read from a local file.

PS: Since you use OS X, I'm not sure whether the awk version you have will support the usage below.

awk can simulate grep with an AND operation on multiple patterns like this:
awk '/pattern1/ && /pattern2/ && /pattern3/'
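Note that awk tests each /pattern/ against the current line, so this form of AND only holds when all patterns occur on the same line. A tiny demonstration with made-up input (the function name and_demo and the words alpha, beta, gamma are placeholders):

```shell
# Only the first input line contains all three words, so only it is printed.
and_demo() {
    printf '%s\n' \
        'alpha beta gamma' \
        'alpha beta' \
        'gamma' |
    awk '/alpha/ && /beta/ && /gamma/'
}
and_demo
```

This prints the single line alpha beta gamma; the second and third lines are dropped even though together they contain all three words.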

So you could transform your pattern file from this:

$ cat ./tmp/d1.txt
"surveillance data" "surveillance technology" "cctv camera"
"social media" "surveillance techniques" "enforcement agencies"
"social control" "surveillance camera" "social security"
"surveillance data" "security guards" "social networking"
"surveillance mechanisms" "cctv surveillance" "contemporary surveillance"

To this:

$ sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' ./tmp/d1.txt
/surveillance data/ && /surveillance technology/ && /cctv camera/
/social media/ && /surveillance techniques/ && /enforcement agencies/
/social control/ && /surveillance camera/ && /social security/
/surveillance data/ && /security guards/ && /social networking/
/surveillance mechanisms/ && /cctv surveillance/ && /contemporary surveillance/

PS: You can redirect the output to another file by appending >anotherfile at the end, or you can use the sed -i option to make in-place changes to the pattern file itself.

Then you just need to feed awk the awk-formatted patterns from this pattern file:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt #d1.txt = my test pattern file

Alternatively, you can leave the original pattern file untouched and apply sed to each line as it is read, like this:

while IFS= read -r line;do 
  line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line")
  awk "$line" *.txt
done <./tmp/d1.txt

Or as a one-liner:

$ while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt

The above commands return the correct AND results for my test files, which look like this:

$ cat d2.txt
This guys over there have the required surveillance technology to do the job.
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.

$ cat d3.txt
All surveillance data are locked.
All surveillance data are locked and guarded by security guards.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Results:

$ while IFS= read -r line;do awk "$line" *.txt;done<./tmp/d1.txt
#or while IFS= read -r line;do line=$(sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g' <<<"$line"); awk "$line" *.txt;done <./tmp/d1.txt
The other guys not only have efficient surveillance technology, but they also gather surveillance data by one cctv camera.
There are several surveillance mechanisms (i.e cctv surveillance, contemporary surveillance, etv)

Update:
The above awk solution prints the matching lines of the txt files, not their names.
If you want to display the filenames instead of the contents, use the following awk command where necessary:

awk "$line""{print FILENAME}" *.txt
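The command above prints the file name once per matching line and still requires all phrases on one line. If the phrases may appear anywhere in a file, a per-file variant could be sketched as follows. This is an illustration under assumptions, not part of the tested answer: the function name files_with_all_three is hypothetical, matching is by fixed string and case-insensitive, and the phrases must not contain backslashes because awk -v interprets escape sequences.

```shell
# Hypothetical helper: print each given file once if it contains all
# three phrases anywhere (not necessarily on the same line).
files_with_all_three() {
    a=$1 b=$2 c=$3
    shift 3
    awk -v a="$a" -v b="$b" -v c="$c" '
        BEGIN { a = tolower(a); b = tolower(b); c = tolower(c) }
        FNR == 1 {
            # a new file starts: report the previous one if it had all three
            if (NR > 1 && fa && fb && fc) print prev
            fa = fb = fc = 0
        }
        {
            line = tolower($0)
            if (index(line, a)) fa = 1
            if (index(line, b)) fb = 1
            if (index(line, c)) fc = 1
            prev = FILENAME
        }
        END { if (fa && fb && fc) print prev }
    ' "$@"
}
```

A call such as files_with_all_three 'surveillance data' 'surveillance technology' 'cctv camera' *.txt would then list each matching file a single time.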

I think this works, thank you @GeorgeVasiliou! The last thing is how can I retrieve the file names of the matching files rather than the text? (because I will have lots of text and I wouldn't know where it's coming from) — ViolaW, Feb 01 '17 at 10:35
@ViolaW This seems to work in my machine: awk "$line""{print FILENAME}" *.txt. Give a try. — George Vasiliou, Feb 01 '17 at 10:59
thanks @GeorgeVasiliou actually I realised the modified pattern text out put looks like this: "cctv cameras/ && /surveillance techniques/ && /cctv policy" so with quotes at the front and back, I don't know why... and the awk "$line""{print FILENAME}" *.txt shows all text files (even the ones that shouldn't match) and all of them twice... sorry this is becoming such a pain — ViolaW, Feb 01 '17 at 11:35
@ViolaW It seems that sed fails for some reason. Try to use this sed instead : sed 's/" "/\/ \&\& \//g; s/^"/\//g; s/"$/\//g'. If this sed succeeds then the print filename should also succeed — George Vasiliou, Feb 01 '17 at 12:04
working now! wonderful! thank you :) @GeorgeVasiliou this is my first question here, should we add this modification to your answer or how do we highlight that this modification made it work? — ViolaW, Feb 01 '17 at 12:08
