
I want to print the filename(s) together with the matching pattern, but only once, even if the pattern occurs multiple times in the file.

E.g. I have a list of patterns, list_of_patterns.txt, and the directory containing the files to search is /path/to/files/*.

list_of_patterns.txt:

A
B
C
D
E

/path/to/files/

/file1
/file2
/file3

Let's say /file1 contains the pattern A multiple times, like this:

/file1:

A
4234234
A
435435435
353535
A

(The same goes for the other files that have multiple pattern matches.)

I have this grep command running, but it prints the filename every time a pattern matches:

grep -Hof list_of_patterns.txt /path/to/files/*

output:

/file1:A
/file1:A
/file1:A
/file2:B
/file2:B
/file3:C
/file3:B
... and so on.
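The repetition is inherent to -o, which prints every match; a minimal reproduction (with made-up file and pattern names) shows the same behaviour:

```shell
# Minimal reproduction: -o reports every occurrence of the pattern,
# and -H prefixes each one with the filename, so a pattern that
# matches three times yields three identical output lines.
cd "$(mktemp -d)"
printf 'A\n' > patterns.txt
printf 'A\n4234234\nA\nA\n' > file1
grep -Hof patterns.txt file1
# prints:
# file1:A
# file1:A
# file1:A
```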

I know sort can do this when you pipe it after the grep command, grep -Hof list_of_patterns.txt /path/to/files/* | sort -u, but sort only produces output once grep has finished. In the real world, my list_of_patterns.txt contains hundreds of patterns, and the task sometimes takes an hour to finish.

Is there a better way to speed up the process?

UPDATE: some files have more than a hundred occurrences of a matching pattern. E.g. /file4 contains pattern A 900 times. That's why grep is taking an hour to finish: it prints every occurrence of a matching pattern together with the filename.

E.g. output:

/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
/file4:A
... and so on until it reaches 900 occurrences.

I only want it to be printed once.

E.g. Desired output:

/file4:A
/file1:A
/file2:B
/file3:A
/file4:B
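For reference, on a small made-up dataset the sort -u pipeline collapses the duplicates exactly as desired; the problem is only that sort cannot emit anything until grep has closed the pipe:

```shell
# sort -u collapses repeated filename:pattern pairs, but it must read
# the entire grep output before printing a single line.
cd "$(mktemp -d)"
printf 'A\nB\n' > list_of_patterns.txt
mkdir files
printf 'A\n123\nA\nA\n' > files/file1
printf 'B\nB\n' > files/file2
grep -Hof list_of_patterns.txt files/* | sort -u
# prints:
# files/file1:A
# files/file2:B
```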
  • Hundreds of patterns would not make grep take an hour to process a few files. Are your files also very big or do you have many thousands of files to search in? – Kusalananda Feb 14 '18 at 06:43
  • I think the option you are looking for is -m1 – Sundeep Feb 14 '18 at 06:45
  • @Kusalananda, Yeah I think the files are causing this issue. I just found a file that has 1 pattern match only but with 950+ occurrences. That's why it takes an hour to finish. – WashichawbachaW Feb 14 '18 at 06:47
  • @Sundeep Would that not discard the matches for some patterns? Only the first matching pattern in the pattern file would be reported. – Kusalananda Feb 14 '18 at 06:49
  • @Sundeep, I'm using that now but still so slow. The problem is some files sometimes have 100+ occurrence of pattern match. – WashichawbachaW Feb 14 '18 at 06:49
  • 1
    @Kusalananda -m1 will cause exactly one output line per file, along with whatever pattern matched... not sure if OP wants one line for each matching pattern – Sundeep Feb 14 '18 at 06:51
  • @WashichawbachaW when you use -m1, grep will quit immediately after finding a matching line... – Sundeep Feb 14 '18 at 06:52
  • @Sundeep It would also not give matches for any but the first pattern that matches in the pattern file, so possible matches of later patterns would be missed for a particular file. – Kusalananda Feb 14 '18 at 06:53
  • I have updated my question for clarification. – WashichawbachaW Feb 14 '18 at 06:57
  • @WashichawbachaW, so you want to search each file against ALL patterns but display ALL matches in distinct (non-repeated) manner, right? – RomanPerekhrest Feb 14 '18 at 07:08
  • @RomanPerekhrest, yeah. Just exactly like sort -u does. Like I said in my question but it waits for grep to finish. Is there a way grep could perform what sort can do? Or there are other command that can perform the task better and faster? – WashichawbachaW Feb 14 '18 at 07:14

1 Answer


Is there a better way to speedup the process?

Yes, it's called GNU parallel:

parallel -j0 -k "grep -Hof list_of_patterns.txt {} | sort -u" ::: /path/to/files/*
  • -j N - number of job slots; run up to N jobs in parallel. 0 means as many as possible.
  • -k (--keep-order) - keep the output in the same order as the input arguments.
  • ::: arguments - use arguments from the command line as the input source instead of stdin (standard input).
  • The -j N number should possibly be limited to a number not too much higher than the available number of cores on the machine, especially if each individual grep against a file is slow. – Kusalananda Feb 14 '18 at 07:27
  • 1
    What is the correct N for -j N? It depends: https://oletange.wordpress.com/2015/07/04/parallel-disk-io-is-it-faster/ – Ole Tange Feb 14 '18 at 07:29
  • If mixing results is acceptable, remove -k + use --line-buffer and instead of sort -u: perl -ne '$s{$_}++ or print'. This will give results before the full job is finished. – Ole Tange Feb 14 '18 at 07:30
  • Can I install it without sudo permission? – WashichawbachaW Feb 14 '18 at 08:04
  • @WashichawbachaW, if you are ready for some manual "experiments" - you may try https://unix.stackexchange.com/questions/42567/how-to-install-program-locally-without-sudo-privileges – RomanPerekhrest Feb 14 '18 at 08:31
  • @RomanPerekhrest, I just installed it. I'm now running the command you provided. It's not done yet. So far it's not printing any duplicates like my previous output. – WashichawbachaW Feb 14 '18 at 08:46
  • @WashichawbachaW, you are contradicting your conditions: 1) your phrase yeah. Just exactly like sort -u does ; 2) now you're saying it's not printing any duplicates. That's contradiction – RomanPerekhrest Feb 14 '18 at 08:52
  • @RomanPerekhrest, what I mean about duplicates is the duplicate combination of filename:pattern. Just like sort -u does when it removes the duplicate filename:pattern combination. E.g. When in my old output I have 1 or more printed /file1:A now it only shows 1. – WashichawbachaW Feb 14 '18 at 09:29
  • It's done. It took 49mins compared to 1hr 25mins from grep. Thanks @RomanPerekhrest – WashichawbachaW Feb 14 '18 at 09:34
  • @WashichawbachaW, it only shows 1 - of course, that's what sort -u does (as you required earlier) – RomanPerekhrest Feb 14 '18 at 09:34