2

I have 60 files, each containing about 10,000 lines. Each line contains a single string.

I want to find out only the strings that are common to all files.

Must be exact matches, so we are comparing the entire line.

Chris
  • 21
  • 1
  • My first thought was to read all the files, counting each string in a hash array. awk '{ ++Cnt[$0]; }' Use an END clause to list only those strings whose Cnt equals the number of files (which you also count). But that gives a false result if a string occurs multiple times in any file. So: are the strings unique in any one file? We can cope with this, but it would be slightly more complex. – Paul_Pedant Jun 23 '20 at 10:28
  • Thank you for your comments. I'd say it is unlikely that the same string would appear more than once but I can't be 100% sure. Duplicates are not required if they do exist, so we could check and clean the files beforehand. Something like sort FILE | uniq -cd – Chris Jun 23 '20 at 11:04
  • Are the files sorted? – Stéphane Chazelas Jun 23 '20 at 11:32

4 Answers

1

Try this,

awk '
    BEGINFILE{fnum++; delete f;}
    !f[$0]++{s[$0]++;}
    END {for (l in s){if (s[l] == fnum) print l}}
' files*

Explanation:

  • BEGINFILE { ... } Run at beginning of each file

  • !f[$0]++ { ... } Run only for the first occurrence of a line in a file (when f[$0] is 0 (false))

    • s[$0]++ Increment line-counter.
  • END { ... } Run once at the end

    • for (l in s){if (s[l] == fnum) print l} Loop over the lines and print each one whose number of occurrences equals the number of files.

600,000 lines should be fine in memory. Otherwise, you could possibly remove everything from s which is less than fnum in the BEGINFILE{...} block.
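A quick way to see how this counts is to run it on a few tiny stand-in files. This sketch uses the portable FNR==1 / split("", f) variant discussed in the comments below (BEGINFILE is a GNU awk extension); the file names f1–f3 and their contents are made up for the demo. Note FNR==1 never fires on an empty file, so all inputs must be non-empty.

```shell
# Hypothetical demo files standing in for the 60 real ones.
printf 'apple\nbanana\ncherry\n'       > f1
printf 'banana\ncherry\ndate\n'        > f2
printf 'cherry\nbanana\nbanana\nfig\n' > f3   # duplicate "banana" is counted only once

awk '
    FNR == 1 { fnum++; split("", f) }   # new file: bump file counter, reset per-file seen-set
    !f[$0]++ { s[$0]++ }                # count a line only on its first occurrence per file
    END      { for (l in s) if (s[l] == fnum) print l }
' f1 f2 f3 | sort
# prints:
# banana
# cherry
```

The trailing sort is only there because awk's for (l in s) visits keys in an unspecified order.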

pLumo
  • 22,565
  • (note that it assumes files are not empty as FNR==1 would not trigger on empty files). – Stéphane Chazelas Jun 23 '20 at 12:19
  • Ok thank you. Sorry for being slow, but I'm assuming I place all the files within a directory and run this? – Chris Jun 23 '20 at 12:40
  • Whatever is easier for you. You can also refer the files individually like awk '...' /path/to/file1 /path/to/file2 /path/to/fileN, at the end, the files need to be the arguments to that awk command. – pLumo Jun 23 '20 at 12:42
  • Sorry :( can you confirm that files* at the end of the command in your example refers to all files in the current directory? – Chris Jun 23 '20 at 12:47
  • 1
    won't modern awk have a BEGINFILE pattern triggered at each new file? – Archemar Jun 23 '20 at 12:52
  • @Chris all files (excluding hidden) in current directory are just * – pLumo Jun 23 '20 at 12:56
  • I never used BEGINFILE, but seems to be better option FNR==1, edited. – pLumo Jun 23 '20 at 12:58
  • Ok so does the command count all the files in current directory? Or do I need to specify the files? Can you update your answer to a working example please? Or how do I make it run on all files in a directory? I'm sorry I'm not an advanced user. – Chris Jun 23 '20 at 13:00
  • files* will include all files whose names start with "files". If you want all files in the directory, use *. This is very basic shell knowledge, maybe you should look it up; it's called "globbing". Without knowing your directory / file structure I cannot give a "working example". – pLumo Jun 23 '20 at 13:03
  • I understand wildcards and globbing but I don't know scripting - so as far as I knew "files" might be part of the command. You included the word "files*" - how am I to determine whether this is part of the command without clarification? So.. just to be perfectly clear, I can simply paste the following into the command line from the directory containing the files?

    awk ' BEGINFILE{fnum++; delete f;} !f[$0]++{s[$0]++;} END {for (l in s){if (s[l] == fnum) print l}} ' *

    – Chris Jun 23 '20 at 13:11
  • exactly ....... – pLumo Jun 23 '20 at 13:22
  • Great stuff - thanks :-) I will test! – Chris Jun 23 '20 at 13:43
1

Parallelized version in bash. It should work for files bigger than memory.

export LC_ALL=C
comm -12 \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 1) <(sort 2);) <(comm -12  <(sort 3) <(sort 4););) \
        <(comm -12  <(comm -12  <(sort 5) <(sort 6);) <(comm -12  <(sort 7) <(sort 8);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 9) <(sort 10);) <(comm -12  <(sort 11) <(sort 12););) \
        <(comm -12  <(comm -12  <(sort 13) <(sort 14);) <(comm -12  <(sort 15) <(sort 16););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 17) <(sort 18);) <(comm -12  <(sort 19) <(sort 20););) \
        <(comm -12  <(comm -12  <(sort 21) <(sort 22);) <(comm -12  <(sort 23) <(sort 24);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 25) <(sort 26);) <(comm -12  <(sort 27) <(sort 28););) \
        <(comm -12  <(comm -12  <(sort 29) <(sort 30);) <(comm -12  <(sort 31) <(sort 32);););););) \
  <(comm -12 \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 33) <(sort 34);) <(comm -12  <(sort 35) <(sort 36););) \
        <(comm -12  <(comm -12  <(sort 37) <(sort 38);) <(comm -12  <(sort 39) <(sort 40);););) \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 41) <(sort 42);) <(comm -12  <(sort 43) <(sort 44););) \
        <(comm -12  <(comm -12  <(sort 45) <(sort 46);) <(comm -12  <(sort 47) <(sort 48););););) \
    <(comm -12 \
      <(comm -12 \
        <(comm -12  <(comm -12  <(sort 49) <(sort 50);) <(comm -12  <(sort 51) <(sort 52););) \
        <(comm -12  <(comm -12  <(sort 53) <(sort 54);) <(comm -12  <(sort 55) <(sort 56);););) \
      <(cat  <(comm -12  <(comm -12  <(sort 57) <(sort 58);) <(comm -12  <(sort 59) <(sort 60););) ;);););

Replace sort with cat if the files are already sorted.
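The hand-written tree above can also be expressed as a generic pairwise reduction loop, which is easier to adapt to a different file count. This is a sketch, not part of the original answer; the input files 1–4 and the output name common.txt are hypothetical placeholders for the real 60 files.

```shell
export LC_ALL=C
# Hypothetical demo inputs; replace "1 2 3 4" with your actual file names.
printf 'apple\nbanana\ncherry\n' > 1
printf 'banana\ncherry\ndate\n'  > 2
printf 'banana\ncherry\nfig\n'   > 3
printf 'banana\ncherry\ngrape\n' > 4

tmp=$(mktemp -d)
i=0
for f in 1 2 3 4; do
  sort "$f" > "$tmp/cur.$i"      # comm needs sorted input; use cat if already sorted
  i=$((i+1))
done
# Pairwise reduction: each round halves the number of lists,
# exactly like the hand-written comm tree.
while [ "$i" -gt 1 ]; do
  j=0; k=0
  while [ "$j" -lt "$i" ]; do
    if [ "$((j+1))" -lt "$i" ]; then
      comm -12 "$tmp/cur.$j" "$tmp/cur.$((j+1))" > "$tmp/next.$k"
    else
      mv "$tmp/cur.$j" "$tmp/next.$k"   # odd list out passes through unchanged
    fi
    j=$((j+2)); k=$((k+1))
  done
  n=0
  while [ "$n" -lt "$k" ]; do
    mv "$tmp/next.$n" "$tmp/cur.$n"
    n=$((n+1))
  done
  i=$k
done
cp "$tmp/cur.0" common.txt
rm -rf "$tmp"
cat common.txt
# prints:
# banana
# cherry
```

Each comm -12 in a round is independent of the others, so the rounds could be parallelized with & and wait, as the nested process substitutions above do implicitly.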

Ole Tange
  • 35,514
0

With zsh, using its ${a:*b} array intersection operator on arrays marked with the unique flag (also using the $(<file) ksh operator and f parameter expansion flag to split on line feed characters):

#! /bin/zsh -
typeset -U all list
all=(${(f)"$(<${1?})"}); shift
for file do
  list=(${(f)"$(<$file)"})
  all=(${all:*list})
done
print -rC1 -- $all

(that script takes the list of files as arguments; empty lines are ignored).

0

With join:

cp a jnd
for f in a b c; do join jnd $f >j__; cp j__ jnd; done

I have just numbers (1-6, 3-8, 5-9) in three files a, b and c. These are the two lines (numbers, standing in for strings) the three have in common:

]# cat jnd
5
6

It is not elegant or efficient like that, especially with that cp in between. But it can be made to work in parallel quite easily. Select a subgroup of files (for f in a*), give the intermediate files unique names, and then you can run many subgroups at once. You still have to join these results: with 64 files you would have 8 threads joining 8 files each, and then the remaining 8 joined files could again be split up, into 4 threads.
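A runnable version of the example above, with two small caveats folded in: join needs sorted input (plain seq output happens to sort correctly here because the numbers are single digits), and there is no need to re-join the result with a itself:

```shell
# Recreate the three example files from the answer: 1-6, 3-8, 5-9.
seq 1 6 > a
seq 3 8 > b
seq 5 9 > c

cp a jnd
for f in b c; do               # start from a, so only join with b and c
  join jnd "$f" > j__ && mv j__ jnd
done
cat jnd
# prints:
# 5
# 6
```

For real string data, sort each file first (with LC_ALL=C for consistency), since join silently misbehaves on unsorted input.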