Find files containing a set of words

Question

I'm using grep to find files in a directory that contain a set of words. But grep matches lines containing any of these words; what I want is for grep to show me the file or files containing all of those words, even if they appear on different lines.

grep -lw "ből\|dének\|jeként\|jében\|jéből\|jéhez\|jének\|jéről\|jét\|jével\|jéül" *model.txt

But a match isn't valid if the file contains only one or two of the words. It must contain the entire set of words.

How can I achieve this with bash?

I am using the code suggested by Tagwint:

find . -name '*model.txt' | while IFS= read -r f; do [[ "$(grep -o -w -f patterns "$f" | sort -u | wc -l)" -eq "$(wc -l < patterns)" ]] && echo "$f"; done

How could it be modified to also show the number of occurrences found in each file? For example:

685 01_táska.model.txt
687 02_dinnye.model.txt
685 03_kapu.model.txt
685 04a_nő.model.txt
685 04b_büdzsé.model.txt

Potential duplicate of grepping foo and bar or How to search files where two different words exist? — Stéphane Chazelas, Mar 21 '16 at 14:02
@StéphaneChazelas How to search files where two or more different words exist? — Firefly, Mar 21 '16 at 14:11
@StéphaneChazelas I'm not closing this as a duplicate because typical solutions for two words don't scale to many words. — Gilles 'SO- stop being evil', Mar 21 '16 at 22:43

Tagwint · Accepted Answer · 2016-03-22T08:06:48.947

2

I guess by 'shorter solution' you mean a shorter command line; you cannot shorten the very long word list itself, right?

I'd suggest putting all the words into one file and using grep's -f option. The solution below also uses the -o option, which prints only the matching parts of each line. That produces a list of every matched word in a file. After sorting and de-duplicating that list (sort -u), the file contains all the words if the list has exactly as many lines as the pattern list. wc -l counts the lines.

find . -name '*model.txt' | while IFS= read -r f; do [[ "$(grep -o -w -f patterns "$f" | sort -u | wc -l)" -eq "$(wc -l < patterns)" ]] && echo "$f"; done

patterns is the name of the file that contains your search words:

#cat patterns
ből
dének
jeként
jé
....

Note also grep's -w option, which makes sure only whole words are matched. Otherwise the count could go wrong when one word is a substring of another, like joy and joyful.

Of course, you can lay out the one-liner more readably if that matters to you.
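A small illustration of why -w matters, using the joy/joyful example. The sample text is made up for the demonstration.

```shell
# Made-up sample: the text contains "joyful" but never the standalone word "joy".
sample='a joyful day'

# Without -w, the pattern "joy" also matches inside "joyful".
without_w=$(printf '%s\n' "$sample" | grep -o 'joy' | wc -l)

# With -w, only whole-word matches are reported, so nothing is found.
with_w=$(printf '%s\n' "$sample" | grep -o -w 'joy' | wc -l)

echo "without -w: $without_w, with -w: $with_w"
```

Without -w, a file containing only joyful would wrongly be credited with the word joy.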
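For readability, the same logic can be spread over several lines. This is only a sketch: the function name files_with_all_words and its two arguments are hypothetical, and it assumes the patterns file has one word per line with no empty lines and no duplicates.

```shell
# files_with_all_words PATTERNS [DIR]
# Prints every *model.txt file under DIR (default: current directory)
# that contains all of the words listed in PATTERNS.
files_with_all_words() {
    local patterns="$1" dir="${2:-.}" wanted found f
    # Number of required words = number of lines in the patterns file.
    wanted=$(wc -l < "$patterns")
    find "$dir" -name '*model.txt' | while IFS= read -r f; do
        # Number of distinct required words that occur in this file.
        found=$(grep -o -w -f "$patterns" "$f" | sort -u | wc -l)
        if [ "$found" -eq "$wanted" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

It could be called as files_with_all_words patterns . from the directory that holds the files.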

Update: Make sure the patterns file does not have empty lines.

Update 2: Make sure your patterns file has no duplicates; those will spoil the party.
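Both caveats can be handled up front by normalizing the word list before using it. A minimal sketch; the function name clean_patterns and the output file name in the usage note are hypothetical.

```shell
# clean_patterns IN OUT
# Writes the lines of IN to OUT without empty lines and without duplicates.
clean_patterns() {
    grep -v '^$' "$1" | sort -u > "$2"
}
```

For example, clean_patterns patterns patterns.clean, then pass patterns.clean to grep -f. Since sort ends every output line with a newline, wc -l also counts the last word.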

Update 3

To have a counter of occurrences in front of the file name:

find . -name '*model.txt' | while IFS= read -r f; do t=$(mktemp); [[ "$(grep -o -w -f patterns "$f" | tee "$t" | sort -u | wc -l)" -eq "$(wc -l < patterns)" ]] && echo "$(wc -l < "$t") $f"; rm -f "$t"; done

The idea is to save all the matches on the fly in a temp file and count them before sorting and de-duplicating. The temp file is removed afterwards, just to keep good manners.
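The same counting idea also works without a temp file, by keeping the matches in a shell variable. This is a sketch under assumptions: the function name count_all_words and its arguments are hypothetical, and the patterns file is again expected to have no empty lines or duplicates.

```shell
# count_all_words PATTERNS [DIR]
# For every *model.txt file under DIR that contains all words in PATTERNS,
# prints "<total number of matches> <file name>".
count_all_words() {
    local patterns="$1" dir="${2:-.}" wanted matches distinct total f
    wanted=$(wc -l < "$patterns")
    find "$dir" -name '*model.txt' | while IFS= read -r f; do
        # One matched word per line; grep exits non-zero when nothing matches.
        matches=$(grep -o -w -f "$patterns" "$f" || true)
        [ -n "$matches" ] || continue
        distinct=$(printf '%s\n' "$matches" | sort -u | wc -l)
        total=$(printf '%s\n' "$matches" | wc -l)
        if [ "$distinct" -eq "$wanted" ]; then
            printf '%d %s\n' "$total" "$f"
        fi
    done
}
```

The total is counted before de-duplication, so repeated words add to it.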

edited Mar 22 '16 at 08:06

answered Mar 21 '16 at 16:30

Tagwint


This is just what I needed, but something is wrong with your one-liner: it returns nothing. If I run, for example, grep -o -w -f patterns $f | sort -u | wc -l *model.txt, the script returns the number of occurrences found in each file, along with the file name. But it doesn't finish. – Firefly Mar 21 '16 at 17:15
I created the patterns file as in your sample and I'm running your one-liner from Cygwin. – Firefly Mar 21 '16 at 17:25
grep -o -w -f patterns $f | sort -u | wc -l *model.txt is semantically wrong: you only see the result of the last wc command, with no matching or grepping whatsoever.
I just re-checked it; the one-liner works fine with some sample data. Make sure there are no empty lines in the patterns file, nor an EOF on the last line; that produces no result. I believe that's the reason.
– Tagwint Mar 21 '16 at 17:31
Perfect, you were right. What could I do so that it also prints the number of occurrences in each file? Like: 685 01_táska.model.txt 687 02_dinnye.model.txt – Firefly Mar 22 '16 at 07:35
Well, you could make the one-liner a bit longer :) Please see Update 3. – Tagwint Mar 22 '16 at 08:03
Where is Update 3? I don't see it. – Firefly Mar 22 '16 at 08:07
Perfect! You're a boss! – Firefly Mar 22 '16 at 08:12

score 1 · Answer 2 · answered Mar 22 '16 at 01:06

Here's an awk script that remembers which words it has seen and prints the names of the files that contain all the required words.

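awk -v required_words='ből dének jeként jében jéből jéhez jének jéről jét jével jéül' '
    # Print the name of the file just finished if every required word was seen in it.
    function check() {
        for (w in seen) if (!seen[w]) return;
        print last_file;
    }
    BEGIN {
        split(required_words, a);
        for (i in a) seen[a[i]] = 0;
    }
    # At the start of each file: report on the previous one, reset the flags,
    # and remember the name of the file now being read.
    FNR==1 {
        if (NR!=1) check();
        for (w in seen) seen[w] = 0;
        last_file = FILENAME;
    }
    # Split each line on runs of non-letters and mark the required words found.
    { split($0, a, /[^[:alpha:]]+/);
      for (i in a) if (a[i] in seen) seen[a[i]]=1; }
    END { check() }
' *model.txt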
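The awk approach can also be adapted to read the words from a patterns file and to print the number of occurrences in front of each file name, as asked in the question. The following is only a sketch: the function name awk_files_with_all_words is hypothetical, the patterns file is assumed to be non-empty, and how [:alpha:] treats accented letters depends on the locale and the awk implementation.

```shell
# awk_files_with_all_words PATTERNS FILE...
# Prints "<total number of matches> <file name>" for each FILE that
# contains every word listed in PATTERNS (one word per line).
awk_files_with_all_words() {
    local patterns="$1"
    shift
    awk '
        # Report on the file just finished, if all required words were seen.
        function report(   w) {
            if (file == "") return
            for (w in need) if (!(w in seen)) return
            print total, file
        }
        # First input (the patterns file): remember every required word.
        NR == FNR { if ($0 != "") need[$0] = 1; next }
        # First line of each data file: report the previous file, then reset.
        FNR == 1 { report(); split("", seen); total = 0; file = FILENAME }
        {
            # Split the line on runs of non-letters and count required words.
            n = split($0, a, /[^[:alpha:]]+/)
            for (i = 1; i <= n; i++)
                if (a[i] in need) { seen[a[i]] = 1; total++ }
        }
        END { report() }
    ' "$patterns" "$@"
}
```

It could be called as awk_files_with_all_words patterns *model.txt.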
