17

I'm looking for a way to list all files in a directory that contain the full set of keywords I'm seeking, anywhere in the file.

So, the keywords need not appear on the same line.

One way to do this would be:

grep -l one $(grep -l two $(grep -l three *))

Three keywords is just an example, it could just as well be two, or four, and so on.

A second way I can think of is:

grep -l one * | xargs grep -l two | xargs grep -l three

A third method, that appeared in another question, would be:

find . -type f \
  -exec grep -q one {} \; -a \
  -exec grep -q two {} \; -a \
  -exec grep -q three {} \; -a -print

But that's definitely not the direction I'm going here. I want something that requires less typing, and possibly just one call to grep, awk, perl or similar.

For example, I like how awk lets you match lines that contain all keywords, like:

awk '/one/ && /two/ && /three/' *

Or, print just the file names:

awk '/one/ && /two/ && /three/ { print FILENAME ; nextfile }' *

But I want to find files where the keywords may be anywhere in the file, not necessarily on the same line.


Preferred solutions would be gzip-friendly; for example, grep has the zgrep variant that works on compressed files. The reason I mention this is that some solutions may not work well given this constraint. For example, in the awk example of printing matching files, you can't just do:

zcat * | awk '/pattern/ {print FILENAME; nextfile}'

You need to significantly change the command, to something like:

for f in *; do zcat "$f" | awk -v F="$f" '/pattern/ { print F; nextfile }'; done

So, because of the constraint, you need to call awk many times, even though you could do it only once with uncompressed files. And certainly, it would be nicer to just do zawk '/pattern/ {print FILENAME; nextfile}' * and get the same effect, so I would prefer solutions that allow this.
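
For example, the xargs variant above seems to adapt just by prefixing each grep with a z (a minimal sketch, assuming all the files are compressed and that zgrep accepts the same options as grep here):

zgrep -l one * | xargs zgrep -l two | xargs zgrep -l three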

arekolek
  • 1
    You don't need them to be gzip friendly, just zcat the files first. – terdon Jun 30 '16 at 10:53
  • @terdon I've edited the post, explaining why I mention that files are compressed. – arekolek Jun 30 '16 at 11:06
  • There isn't much difference between launching awk once or many times. I mean, OK, some small overhead but I doubt you would even notice the difference. It is, of course, possible to make the awk/perl whatever script do this itself but that starts to become a full blown program and not a quick one-liner. Is that what you want? – terdon Jun 30 '16 at 11:11
  • @terdon Personally, the more important aspect for me is how complicated the command will be (I guess my second edit came while you were commenting). For example, the grep solutions are easily adaptable just by prefixing grep calls with a z, there's no need for me to also handle file names. – arekolek Jun 30 '16 at 11:19
  • Yes, but that's grep. AFAIK, only grep and cat have standard "z-variants". I don't think you'll get anything simpler than using a for f in *; do zcat -f $f ... solution. Anything else would have to be a full program that checks file formats before opening or uses a library to do the same. – terdon Jun 30 '16 at 11:25
  • @don_crissti ah, yes. I even knew some of those. Thanks. – terdon Jun 30 '16 at 11:45
  • @arekolek I added a little script that will take three (exactly three) search strings and a list of files and will print the name of any files with all three. It can deal with .gz or regular files. You could extend that to take an arbitrary number of strings but that's getting way beyond the scope of this site. – terdon Jun 30 '16 at 12:46
  • @don_crissti It's for interactive use, so definitely below 10 keywords. Looking for fixed strings is sufficient for my purposes. – arekolek Jun 30 '16 at 23:01
  • @arekolek effective use of unix means being a tool builder. sometimes you build tools with pipelines of shell commands (like your find -exec or xargs solutions), other times you build tools with more complete languages. BTW, writing tools in perl is often quite similar to writing in shell in that it has an enormous collection (CPAN) of library modules to do just about any task you want, and you put them together with some wrapper code, just as shell has a collection of tools that you put together with wrapper code. The perl script in my answer is now a new tool that (...cont...) – cas Jul 01 '16 at 01:44
  • (...cont...) gives you a very simple one-liner that lets you search for any number of strings in any number of files (which can be uncompressed or compressed in several different formats). It could be improved with better option handling, but it's a complete standalone tool (save it with a memorable/useful filename in your PATH, e.g. ~/bin) – cas Jul 01 '16 at 01:47

5 Answers

14
awk 'FNR == 1 { f1=f2=f3=0; };

     /one/   { f1++ };
     /two/   { f2++ };
     /three/ { f3++ };

     f1 && f2 && f3 {
       print FILENAME;
       nextfile;
     }' *

If you want to automatically handle gzipped files, either run this in a loop with zcat (slow and inefficient because you'll be forking awk many times in a loop, once for each filename), or rewrite the same algorithm in perl and use the IO::Uncompress::AnyUncompress library module, which can decompress several different kinds of compressed files (gzip, zip, bzip2, lzop), or rewrite it in python, which also has modules for handling compressed files.
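
For illustration, a minimal zcat wrapper around the same awk logic might look like this (just a sketch: each awk invocation reads a single stream, so exit replaces nextfile, and zcat -f passes uncompressed files through unchanged):

for f in *; do
  zcat -f "$f" | awk -v F="$f" '
    /one/   { f1++ }
    /two/   { f2++ }
    /three/ { f3++ }
    f1 && f2 && f3 { print F; exit }'
done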


Here's a perl version that uses IO::Uncompress::AnyUncompress to allow for any number of patterns and any number of filenames (containing either plain text or compressed text).

All args before -- are treated as search patterns. All args after -- are treated as filenames. Primitive but effective option handling for this job. Better option handling (e.g. to support a -i option for case-insensitive searches) could be achieved with the Getopt::Std or Getopt::Long modules.

Run it like so:

$ ./arekolek.pl one two three -- *.gz *.txt
1.txt.gz
4.txt.gz
5.txt.gz
1.txt
4.txt
5.txt

(I won't list files {1..6}.txt.gz and {1..6}.txt here...they just contain some or all of the words "one" "two" "three" "four" "five" and "six" for testing. The files listed in the output above DO contain all three of the search patterns. Test it yourself with your own data)

#! /usr/bin/perl

use strict;
use warnings;
use IO::Uncompress::AnyUncompress qw(anyuncompress $AnyUncompressError) ;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  #my $lc=0;
  my %s = ();
  my $z = new IO::Uncompress::AnyUncompress($f)
    or die "IO::Uncompress::AnyUncompress failed: $AnyUncompressError\n";

  while ($_ = $z->getline) {
    #last if ($lc++ > 100);
    my @matches=( m/($pattern)/og);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      last;
    }
  }
}

The hash %patterns contains the complete set of patterns; a file has to contain at least one match for each of them. $p_string is a string containing the sorted keys of that hash. The string $pattern contains a pre-compiled regular expression, also built from the %patterns hash.

$pattern is compared against each line of each input file (using the /o modifier to compile $pattern only once as we know it won't ever change during the run), and map() is used to build a hash (%s) containing the matches for each file.

Whenever all the patterns have been seen in the current file (determined by checking whether $m_string, built from the sorted keys of %s, equals $p_string), print the filename and skip to the next file.

This is not a particularly fast solution, but is not unreasonably slow. The first version took 4m58s to search for three words in 74MB worth of compressed log files (totalling 937MB uncompressed). This current version takes 1m13s. There are probably further optimisations that could be made.

One obvious optimisation is to use this in conjunction with xargs's -P aka --max-procs to run multiple searches on subsets of the files in parallel. To do that, you need to count the number of files and divide by the number of cores/cpus/threads your system has (and round up by adding 1). e.g. there were 269 files being searched in my sample set, and my system has 6 cores (an AMD 1090T), so:

patterns=(one two three)
searchpath='/var/log/apache2/'
cores=6
filecount=$(find "$searchpath" -type f -name 'access.*' | wc -l)
filespercore=$((filecount / cores + 1))

find "$searchpath" -type f -print0 | 
  xargs -0r -n "$filespercore" -P "$cores" ./arekolek.pl "${patterns[@]}" --

With that optimisation, it took only 23 seconds to find all 18 matching files. Of course, the same could be done with any of the other solutions. NOTE: The order of filenames listed in the output will be different, so may need to be sorted afterwards if that matters.

As noted by @arekolek, multiple zgreps with find -exec or xargs can do it significantly faster, but this script has the advantage of supporting any number of patterns to search for, and is capable of dealing with several different types of compression.
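
For comparison, a sketch of such a zgrep chain using find might look like this (untested here; it simply swaps zgrep into the find approach from the question and assumes the files are gzipped):

find . -type f -name '*.gz' \
  -exec zgrep -q one {} \; \
  -exec zgrep -q two {} \; \
  -exec zgrep -q three {} \; -print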

If the script is limited to examining only the first 100 lines of each file, it runs through all of them (in my 74MB sample of 269 files) in 0.6 seconds. If this is useful in some cases, it could be made into a command line option (e.g. -l 100) but it has the risk of not finding all matching files.


BTW, according to the man page for IO::Uncompress::AnyUncompress, the compression formats supported include zlib (RFC 1950), deflate (RFC 1951), gzip (RFC 1952), zip, bzip2, lzop, lzf, lzma, and xz.

One last (I hope) optimisation. By using the PerlIO::gzip module (packaged in debian as libperlio-gzip-perl) instead of IO::Uncompress::AnyUncompress I got the time down to about 3.1 seconds for processing my 74MB of log files. There were also some small improvements by using a simple hash rather than Set::Scalar (which also saved a few seconds with the IO::Uncompress::AnyUncompress version).

PerlIO::gzip was recommended as the fastest perl gunzip in https://stackoverflow.com/a/1539271/137158 (found with a google search for perl fast gzip decompress)

Using xargs -P with this didn't improve it at all. In fact it even seemed to slow it down by anywhere from 0.1 to 0.7 seconds. (I tried four runs and my system does other stuff in the background which will alter the timing)

The price is that this version of the script can only handle gzipped and uncompressed files. Speed vs flexibility: 3.1 seconds for this version vs 23 seconds for the IO::Uncompress::AnyUncompress version with an xargs -P wrapper (or 1m13s without xargs -P).

#! /usr/bin/perl

use strict;
use warnings;
use PerlIO::gzip;

my %patterns=();
my @filenames=();
my $fileargs=0;

# all args before '--' are search patterns, all args after '--' are
# filenames
foreach (@ARGV) {
  if ($_ eq '--') { $fileargs++ ; next };

  if ($fileargs) {
    push @filenames, $_;
  } else {
    $patterns{$_}=1;
  };
};

my $pattern=join('|',keys %patterns);
$pattern=qr($pattern);
my $p_string=join('',sort keys %patterns);

foreach my $f (@filenames) {
  open(F, "<:gzip(autopop)", $f) or die "couldn't open $f: $!\n";
  #my $lc=0;
  my %s = ();
  while (<F>) {
    #last if ($lc++ > 100);
    my @matches=(m/($pattern)/ogi);
    next unless (@matches);

    map { $s{$_}=1 } @matches;
    my $m_string=join('',sort keys %s);

    if ($m_string eq $p_string) {
      print "$f\n" ;
      close(F);
      last;
    }
  }
}
cas
  • for f in *; do zcat $f | awk -v F=$f '/one/ {a++}; /two/ {b++}; /three/ {c++}; a&&b&&c { print F; nextfile }'; done works fine, but indeed, takes 3 times as long as my grep solution, and is actually more complicated. – arekolek Jun 30 '16 at 12:16
  • 1
    OTOH, for plain text files it would be faster. and the same algorithm implemented in a language with support for reading compressed files (like perl or python) as i suggested would be faster than multiple greps. "complication" is partially subjective - personally, i think a single awk or perl or python script is less complicated than multiple greps with or without find....@terdon's answer is good, and does it without needing the module I mentioned (but at the cost of forking zcat for every compressed file) – cas Jun 30 '16 at 13:34
  • I had to apt-get install libset-scalar-perl to use the script. But it doesn't seem to terminate in any reasonable time. – arekolek Jul 01 '16 at 11:20
  • how many and what size (compressed and uncompressed) are the files you're searching? dozens or hundreds of small-medium size files or thousands of large ones? – cas Jul 01 '16 at 13:42
  • Here's a histogram of the sizes of compressed files (20 to 100 files, up to 50MB but mostly below 5MB). Uncompressed look the same, but with sizes multiplied by 10. – arekolek Jul 01 '16 at 14:34
  • not good. my initial testing was on the 12 tiny files I mentioned...results were instant. when i tested on 74MB of compressed log files, it took 4:58. I assumed that when the module docs said it was an interface to zlib, that it would actually use zlib for near-native performance. instead it seems to have re-implemented the algorithms in perl. If i get a chance today, i'll hunt for a decompression lib for perl that does use one or more of the shared libs. – cas Jul 02 '16 at 01:45
  • with one simple optimisation, i got the 74MB search down from 4:58 (298 secs) to 1:25 (85 secs)....check if each line matches ANY of the patterns and only try to find which patterns it matches if it does. so much of the performance problem is in the search loop. i'll update my answer. BTW, I also tried your "examine only the first 100 lines" optimisation: it took only 0.076 seconds but didn't find one matching file, checking the first 500 lines took 0.118 seconds but still didn't find a match. checking 10000 lines took 43s but found only 10 out of the 18 files that the full 4m58s run found. – cas Jul 02 '16 at 02:17
  • I expanded on the optimisation I used when I remembered that perl can return a list on capture groups, so didn't need to loop over $patterns set to find which matches were on each line. Got the time down to 1m27s (87 secs) for the full 74MB search, finding all 18 matching files. I'll update my answer with the revised script. – cas Jul 02 '16 at 02:36
12

Set the record separator to . so that awk will treat the whole file as one line:

awk -v RS='.' '/one/&&/two/&&/three/{print FILENAME}' *

Similarly with perl:

perl -ln00e '/one/&&/two/&&/three/ && print $ARGV' *
jimmij
  • 3
    Neat. Note that this will load the whole file into memory though and that might be a problem for large files. – terdon Jun 30 '16 at 10:26
  • I initially upvoted this, because it looked promising. But I can't get it to work with gzipped files. for f in *; do zcat $f | awk -v RS='.' -v F=$f '/one/ && /two/ && /three/ { print F }'; done outputs nothing. – arekolek Jun 30 '16 at 12:09
  • @arekolek That loop works for me. Are your files properly gzipped? – jimmij Jun 30 '16 at 12:25
  • @arekolek you need zcat -f "$f" if some of the files are not compressed. – terdon Jun 30 '16 at 12:29
  • I've tested it also on uncompressed files and awk -v RS='.' '/bfs/&&/none/&&/rgg/{print FILENAME}' greptest/*.txt still returns no results, while grep -l rgg $(grep -l none $(grep -l bfs greptest/*.txt)) returns expected results. – arekolek Jul 01 '16 at 12:13
3

For compressed files, you could loop over each file and decompress it first. Then, with a slightly modified version of the other answers, you can do:

for f in *; do 
    zcat -f "$f" | perl -ln00e '/one/&&/two/&&/three/ && exit(0); }{ exit(1)' && 
        printf '%s\n' "$f"
done

The Perl script will exit with 0 status (success) if all three strings were found. The }{ is Perl shorthand for END{}. Anything following it will be executed after all input has been processed. So the script will exit with a non-0 exit status if not all the strings were found. Therefore, the && printf '%s\n' "$f" will print the file name only if all three were found.

Or, to avoid loading the file into memory:

for f in *; do 
    zcat -f "$f" 2>/dev/null | 
        perl -lne '$k++ if /one/; $l++ if /two/; $m++ if /three/;  
                   exit(0) if $k && $l && $m; }{ exit(1)' && 
    printf '%s\n' "$f"
done

Finally, if you really want to do the whole thing in a script, you could do:

#!/usr/bin/env perl

use strict;
use warnings;

## Get the target strings and file names. The first three
## arguments are assumed to be the strings, the rest are
## taken as target files.
my ($str1, $str2, $str3, @files) = @ARGV;

FILE:foreach my $file (@files) {
    my $fh;
    my ($k,$l,$m)=(0,0,0);
    ## only process regular files
    next unless -f $file ;
    ## Open the file in the right mode
    $file =~ /\.gz$/ ? open($fh, "-|", "zcat $file") : open($fh, '<', $file);
    ## Read through each line
    while (<$fh>) {
        $k++ if /$str1/;
        $l++ if /$str2/;
        $m++ if /$str3/;
        ## If all 3 have been found
        if ($k && $l && $m){
            ## Print the file name
            print "$file\n";
            ## Move to the next file
            next FILE;
        }
    }
    close($fh);
}

Save the script above as foo.pl somewhere in your $PATH, make it executable and run it like this:

foo.pl one two three *
terdon
3

Of all the solutions proposed so far, my original solution using grep is the fastest, finishing in 25 seconds. Its drawback is that it's tedious to add and remove keywords. So I came up with a script (dubbed multi) that simulates the behavior but changes the syntax:

#!/bin/bash

# Usage: multi [z]grep PATTERNS -- FILES

command=$1

# first two arguments constitute the first command
command_head="$1 -le '$2'"
shift 2

# arguments before double-dash are keywords to be piped with xargs
while (("$#")) && [ "$1" != -- ] ; do
  command_tail+="| xargs $command -le '$1' "
  shift
done
shift

# remaining arguments are files
eval "$command_head $@ $command_tail"

So now, writing multi grep one two three -- * is equivalent to my original proposal and runs in the same time. I can also easily use it on compressed files by passing zgrep as the first argument instead.
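
For example, the compressed-file variant would presumably be invoked like this (assuming the compressed files end in .gz):

multi zgrep one two three -- *.gz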

Other solutions

I also experimented with a Python script using two strategies: searching for all keywords line by line, and searching the whole file keyword by keyword. The second strategy was faster in my case, but still slower than just using grep, finishing in 33 seconds. Line-by-line keyword matching finished in 60 seconds.

#!/usr/bin/python3

import gzip, sys

i = sys.argv.index('--')
patterns = sys.argv[1:i]
files = sys.argv[i+1:]

for f in files:
  with (gzip.open if f.endswith('.gz') else open)(f, 'rt') as s:
    txt = s.read()
    if all(p in txt for p in patterns):
      print(f)

The script given by terdon finished in 54 seconds. It actually took 39 seconds of wall time, because my processor is dual core. That's interesting, since my Python script took 49 seconds of wall time (and grep was 29 seconds).

The script by cas failed to terminate in reasonable time, even on a smaller number of files that were processed with grep under 4 seconds, so I had to kill it.

But his original awk proposal, even though it's slower than grep as is, has a potential advantage. In some cases, at least in my experience, all the keywords can be expected to appear somewhere in the head of the file if they appear in the file at all. This gives that solution a dramatic boost in performance:

for f in *; do
  zcat "$f" | awk -v F="$f" \
    'NR>100 {exit} /one/ {a++} /two/ {b++} /three/ {c++} a&&b&&c {print F; exit}'
done

Finishes in a quarter of a second, as opposed to 25 seconds.

Of course, we may not have the advantage of searching for keywords that are known to occur near the beginning of the files. In that case, the solution without NR>100 {exit} takes 63 seconds (50s of wall time).

Uncompressed files

There's no significant difference in running time between my grep solution and cas' awk proposal, both take a fraction of a second to execute.

Note that the variable initialization FNR == 1 { f1=f2=f3=0; } is mandatory in that case, to reset the counters for every subsequent file processed. As such, this solution requires editing the command in three places if you want to change a keyword or add new ones. On the other hand, with grep you can just append | xargs grep -l four or edit the keyword you want.
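
For example (a sketch that just extends the same pipeline), adding a fourth keyword is one more xargs stage:

grep -l one * | xargs grep -l two | xargs grep -l three | xargs grep -l four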

A disadvantage of the grep solution that uses command substitution is that it will hang if, anywhere in the chain before the last step, there are no matching files: the outer grep is then left with no file arguments and sits waiting for input on standard input. This doesn't affect the xargs variant, because the pipe will be aborted as soon as grep returns a non-zero status. I've updated my script to use xargs, so I don't have to handle this myself, which makes the script simpler.

arekolek
  • Your Python solution may benefit from pushing the loop down to C layer with not all(p in text for p in patterns) – iruvar Jul 01 '16 at 13:02
  • @iruvar Thanks for the suggestion. I've tried it (sans not) and it finished in 32 seconds, so not that much of improvement, but it's certainly more readable. – arekolek Jul 01 '16 at 13:19
  • you could use an associative array rather than f1,f2,f3 in awk, with key=search-pattern, val=count – cas Jul 02 '16 at 01:37
  • @arekolek see my latest version using PerlIO::gzip rather than IO::Uncompress::AnyUncompress. now takes only 3.1 seconds instead of 1m13s to process my 74MB of log files. – cas Jul 02 '16 at 05:25
  • BTW, if you have previously run eval $(lesspipe) (e.g. in your .profile, etc), you can use less instead of zcat -f and your for loop wrapper around awk will be able to process any kind of file that less can (gzip, bzip2, xz, and more)....less can detect if stdout is a pipe and will just output a stream to stdout if it is. – cas Jul 02 '16 at 09:34
0

Another option: feed the words one at a time to xargs and have it run grep against the file. xargs itself can be made to exit as soon as an invocation of grep returns failure, by returning 255 to it (check the xargs documentation). Of course, the spawning of shells and forking involved in this solution will likely slow it down significantly.

printf '%s\n' one two three | xargs -n 1 sh -c 'grep -q "$2" "$1" || exit 255' _ file

and to loop it up

for f in *; do
    if printf '%s\n' one two three | xargs -n 1 sh -c 'grep -q "$2" "$1" || exit 255' _ "$f"
    then
         printf '%s\n' "$f"
    fi
done
iruvar
  • This looks nice, but I'm not sure how to use this. What is _ and file? Will this search in multiple files passed as argument and return files that contain all the keywords? – arekolek Jul 01 '16 at 19:08
  • @arekolek, added a loop version. And as for _, it's being passed as the $0 to the spawned shell - this would show up as the command name in the output of ps - I would defer to the master here – iruvar Jul 01 '16 at 19:55