2

I have 60 files, each containing about 10,000 lines. Each line contains a single string.

I want to find out only the strings that are common to all files.

Must be exact matches, so we are comparing the entire line.

Chris
  • 21
  • 1
  • My first thought was to read all the files, counting each string in a hash array. awk '{ ++Cnt[$0]; }' Use an END clause to list only those strings whose Cnt equals the number of files (which you also count). But that gives a false result if a string occurs multiple times in any file. So: are the strings unique in any one file? We can cope with this, but it would be slightly more complex. – Paul_Pedant Jun 23 '20 at 10:28
  • Thank you for your comments. I'd say it is unlikely that the same string would appear more than once but I can't be 100% sure. Duplicates are not required if they do exist, so we could check and clean the files beforehand. Something like sort FILE | uniq -cd – Chris Jun 23 '20 at 11:04
  • Are the files sorted? – Stéphane Chazelas Jun 23 '20 at 11:32

4 Answers

1

Try this,

awk '
    BEGINFILE{fnum++; delete f;}
    !f[$0]++{s[$0]++;}
    END {for (l in s){if (s[l] == fnum) print l}}
' files*

Explanation:

  • BEGINFILE { ... } Run at beginning of each file

  • !f[$0]++ { ... } Run only for the first occurrence of a line in a file (when f[$0] is 0 (false))

    • s[$0]++ Increment line-counter.
  • END { ... } Run once at the end

    • for (l in s){if (s[l] == fnum) print l} Loop over the lines and print each one whose number of occurrences equals the number of files.

600,000 lines should be fine in memory. Otherwise, you could possibly remove everything from s which is less than fnum in the BEGINFILE{...} block.
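A quick way to see how this counts is to run it on a few tiny stand-in files. This sketch uses the portable FNR==1 / split("", f) variant discussed in the comments below (BEGINFILE is a GNU awk extension); the file names f1–f3 and their contents are made up for the demo. Note FNR==1 never fires on an empty file, so all inputs must be non-empty.

```shell
# Hypothetical demo files standing in for the 60 real ones.
printf 'apple\nbanana\ncherry\n'       > f1
printf 'banana\ncherry\ndate\n'        > f2
printf 'cherry\nbanana\nbanana\nfig\n' > f3   # duplicate "banana" is counted only once

awk '
    FNR == 1 { fnum++; split("", f) }   # new file: bump file counter, reset per-file seen-set
    !f[$0]++ { s[$0]++ }                # count a line only on its first occurrence per file
    END      { for (l in s) if (s[l] == fnum) print l }
' f1 f2 f3 | sort
# prints:
# banana
# cherry
```

The trailing sort is only there because awk's for (l in s) visits keys in an unspecified order.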

pLumo
  • 22,565
  • (note that it assumes files are not empty as FNR==1 would not trigger on empty files). – Stéphane Chazelas Jun 23 '20 at 12:19
  • Ok thank you. Sorry for being slow, but I'm assuming I place all the files within a directory and run this? – Chris Jun 23 '20 at 12:40
  • Whatever is easier for you. You can also refer the files individually like awk '...' /path/to/file1 /path/to/file2 /path/to/fileN, at the end, the files need to be the arguments to that awk command. – pLumo Jun 23 '20 at 12:42
  • Sorry :( can you confirm that files* at the end of the command in your example refers to all files in the current directory? – Chris Jun 23 '20 at 12:47
  • 1
    won't modern awk have a BEGINFILE pattern triggered at each new file? – Archemar Jun 23 '20 at 12:52
  • @Chris all files (excluding hidden) in current directory are just * – pLumo Jun 23 '20 at 12:56
  • I never used BEGINFILE, but seems to be better option FNR==1, edited. – pLumo Jun 23 '20 at 12:58
  • Ok so does the command count all the files in current directory? Or do I need to specify the files? Can you update your answer to a working example please? Or how do I make it run on all files in a directory? I'm sorry I'm not an advanced user. – Chris Jun 23 '20 at 13:00
  • files* will include all files whose names start with "files". If you want all files in the directory, use *. This is very basic shell knowledge, maybe you should look it up; it's called "globbing". Without knowing your directory / file structure I cannot give a "working example". – pLumo Jun 23 '20 at 13:03
  • I understand wildcards and globbing but I don't know scripting - so as far as I knew "files" might be part of the command. You included the word "files*" - how am I to determine whether this is part of the command without clarification? So.. just to be perfectly clear, I can simply paste the following into the command line from the directory containing the files?

    awk ' BEGINFILE{fnum++; delete f;} !f[$0]++{s[$0]++;} END {for (l in s){if (s[l] == fnum) print l}} ' *

    – Chris Jun 23 '20 at 13:11
  • exactly ....... – pLumo Jun 23 '20 at 13:22
  • Great stuff - thanks :-) I will test! – Chris Jun 23 '20 at 13:43
1

Parallelized version in bash. It should work for files bigger than memory.

export LC_ALL=C
comm -12 \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 1) <(sort 2);) <(comm -12  <(sort 3) <(sort 4););) \
        <(comm -12  <(comm -12  <(sort 5) <(sort 6);) <(comm -12  <(sort 7) <(sort 8);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 9) <(sort 10);) <(comm -12  <(sort 11) <(sort 12););) \
        <(comm -12  <(comm -12  <(sort 13) <(sort 14);) <(comm -12  <(sort 15) <(sort 16););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 17) <(sort 18);) <(comm -12  <(sort 19) <(sort 20););) \
        <(comm -12  <(comm -12  <(sort 21) <(sort 22);) <(comm -12  <(sort 23) <(sort 24);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 25) <(sort 26);) <(comm -12  <(sort 27) <(sort 28););) \
        <(comm -12  <(comm -12  <(sort 29) <(sort 30);) <(comm -12  <(sort 31) <(sort 32);););););) \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 33) <(sort 34);) <(comm -12  <(sort 35) <(sort 36););) \
        <(comm -12  <(comm -12  <(sort 37) <(sort 38);) <(comm -12  <(sort 39) <(sort 40);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 41) <(sort 42);) <(comm -12  <(sort 43) <(sort 44););) \
        <(comm -12  <(comm -12  <(sort 45) <(sort 46);) <(comm -12  <(sort 47) <(sort 48););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 49) <(sort 50);) <(comm -12  <(sort 51) <(sort 52););) \
        <(comm -12  <(comm -12  <(sort 53) <(sort 54);) <(comm -12  <(sort 55) <(sort 56);););) \
      <(cat  <(comm -12  <(comm -12  <(sort 57) <(sort 58);) <(comm -12  <(sort 59) <(sort 60););) ;);););

Replace sort with cat if the files are already sorted.
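The hand-written tree above can also be expressed as a generic pairwise reduction loop, which is easier to adapt to a different file count. This is a sketch, not part of the original answer; the input files 1–4 and the output name common.txt are hypothetical placeholders for the real 60 files.

```shell
export LC_ALL=C
# Hypothetical demo inputs; replace "1 2 3 4" with your actual file names.
printf 'apple\nbanana\ncherry\n' > 1
printf 'banana\ncherry\ndate\n'  > 2
printf 'banana\ncherry\nfig\n'   > 3
printf 'banana\ncherry\ngrape\n' > 4

tmp=$(mktemp -d)
i=0
for f in 1 2 3 4; do
  sort "$f" > "$tmp/cur.$i"      # comm needs sorted input; use cat if already sorted
  i=$((i+1))
done
# Pairwise reduction: each round halves the number of lists,
# exactly like the hand-written comm tree.
while [ "$i" -gt 1 ]; do
  j=0; k=0
  while [ "$j" -lt "$i" ]; do
    if [ "$((j+1))" -lt "$i" ]; then
      comm -12 "$tmp/cur.$j" "$tmp/cur.$((j+1))" > "$tmp/next.$k"
    else
      mv "$tmp/cur.$j" "$tmp/next.$k"   # odd list out passes through unchanged
    fi
    j=$((j+2)); k=$((k+1))
  done
  n=0
  while [ "$n" -lt "$k" ]; do
    mv "$tmp/next.$n" "$tmp/cur.$n"
    n=$((n+1))
  done
  i=$k
done
cp "$tmp/cur.0" common.txt
rm -rf "$tmp"
cat common.txt
# prints:
# banana
# cherry
```

Each comm -12 in a round is independent of the others, so the rounds could be parallelized with & and wait, as the nested process substitutions above do implicitly.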

Ole Tange
  • 35,514
0

With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and f parameter expansion flag to split on line feed characters):

#! /bin/zsh -
typeset -U all list
all=(${(f)"$(<${1?})"}); shift
for file do
  list=(${(f)"$(<$file)"})
  all=(${all:*list})
done
print -rC1 -- $all

(that script takes the list of files as arguments; empty lines are ignored).

0

With join:

cp a jnd
for f in a b c; do join jnd $f >j__; cp j__ jnd; done

I have just numbers (1-6, 3-8, 5-9) in three files a, b and c. These are the two lines (numbers, standing in for strings) the three have in common:

]# cat jnd
5
6

It is not elegant or efficient like that, especially with that cp in between. But it can be made to work in parallel quite easily. Select a subgroup of files (for f in a*), give the intermediate files unique names, and then you can run many subgroups at once. You still have to join these results: with 64 files you would have 8 threads joining 8 files each, and then the remaining 8 joined files could again be split up, into 4 threads.
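A runnable version of the example above, with two small caveats folded in: join needs sorted input (plain seq output happens to sort correctly here because the numbers are single digits), and there is no need to re-join the result with a itself:

```shell
# Recreate the three example files from the answer: 1-6, 3-8, 5-9.
seq 1 6 > a
seq 3 8 > b
seq 5 9 > c

cp a jnd
for f in b c; do               # start from a, so only join with b and c
  join jnd "$f" > j__ && mv j__ jnd
done
cat jnd
# prints:
# 5
# 6
```

For real string data, sort each file first (with LC_ALL=C for consistency), since join silently misbehaves on unsorted input.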