I have 60 files, each containing about 10,000 lines, with a single string per line.
I want to find only the strings that are common to all files.
Matches must be exact, so entire lines are compared.
Try this,
awk '
BEGINFILE{fnum++; delete f;}
!f[$0]++{s[$0]++;}
END {for (l in s){if (s[l] == fnum) print l}}
' files*
Explanation:

BEGINFILE { ... }
Run at the beginning of each file.

fnum++
Increment the file counter.

delete f
Delete the array used to filter duplicate lines per file (see link for a POSIX-compliant solution).

!f[$0]++ { ... }
Run only for the first occurrence of a line in a file (when f[$0] is 0, i.e. false).

s[$0]++
Increment the line counter.

END { ... }
Run once at the end.

for (l in s){if (s[l] == fnum) print l}
Loop over the lines and print each one whose number of occurrences equals the number of files.

600,000 lines should be fine in memory. Otherwise, you could remove from s everything that is less than fnum in the BEGINFILE{...} block.
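BEGINFILE is a gawk extension; a portable sketch of the same idea uses FNR == 1 (which is true on the first record of each input file) to detect file boundaries instead. The filenames here are placeholders:

```shell
# Portable variant (a sketch): FNR resets to 1 at each new file, so it can
# replace gawk's BEGINFILE. The per-file seen-set f is cleared with a loop,
# which is plain POSIX awk (whole-array "delete f" is an extension).
awk '
FNR == 1 { fnum++; for (k in f) delete f[k] }  # new file: bump counter, clear seen-set
!f[$0]++ { s[$0]++ }                           # count each distinct line once per file
END { for (l in s) if (s[l] == fnum) print l }
' file1 file2 file3
```

One caveat: FNR == 1 never fires for a completely empty file, so an empty input would be silently skipped rather than making the intersection empty.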
awk '...' /path/to/file1 /path/to/file2 /path/to/fileN: at the end, the files need to be the arguments to that awk command. – pLumo Jun 23 '20 at 12:42
Is BEGINFILE a pattern triggered at each new file? – Archemar Jun 23 '20 at 12:52
files* will include all files whose names start with "files". If you want all files in the directory, use * instead. This is very basic shell knowledge; it's called "globbing" and worth looking up. Without knowing your directory/file structure I cannot give a "working example". – pLumo Jun 23 '20 at 13:03
awk 'BEGINFILE{fnum++; delete f;} !f[$0]++{s[$0]++;} END {for (l in s){if (s[l] == fnum) print l}}' * – Chris Jun 23 '20 at 13:11

Parallelized version in bash. It should work for files bigger than memory.
export LC_ALL=C
comm -12 \
<(comm -12 \
<(comm -12 \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 1) <(sort 2);) <(comm -12 <(sort 3) <(sort 4););) \
<(comm -12 <(comm -12 <(sort 5) <(sort 6);) <(comm -12 <(sort 7) <(sort 8);););) \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 9) <(sort 10);) <(comm -12 <(sort 11) <(sort 12););) \
<(comm -12 <(comm -12 <(sort 13) <(sort 14);) <(comm -12 <(sort 15) <(sort 16););););) \
<(comm -12 \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 17) <(sort 18);) <(comm -12 <(sort 19) <(sort 20););) \
<(comm -12 <(comm -12 <(sort 21) <(sort 22);) <(comm -12 <(sort 23) <(sort 24);););) \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 25) <(sort 26);) <(comm -12 <(sort 27) <(sort 28););) \
<(comm -12 <(comm -12 <(sort 29) <(sort 30);) <(comm -12 <(sort 31) <(sort 32);););););) \
<(comm -12 \
<(comm -12 \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 33) <(sort 34);) <(comm -12 <(sort 35) <(sort 36););) \
<(comm -12 <(comm -12 <(sort 37) <(sort 38);) <(comm -12 <(sort 39) <(sort 40);););) \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 41) <(sort 42);) <(comm -12 <(sort 43) <(sort 44););) \
<(comm -12 <(comm -12 <(sort 45) <(sort 46);) <(comm -12 <(sort 47) <(sort 48););););) \
<(comm -12 \
<(comm -12 \
<(comm -12 <(comm -12 <(sort 49) <(sort 50);) <(comm -12 <(sort 51) <(sort 52););) \
<(comm -12 <(comm -12 <(sort 53) <(sort 54);) <(comm -12 <(sort 55) <(sort 56);););) \
<(cat <(comm -12 <(comm -12 <(sort 57) <(sort 58);) <(comm -12 <(sort 59) <(sort 60););) ;);););
Replace sort with cat if the files are already sorted.

comm -12 instead of join here. You'd probably want export LC_ALL=C, and use sort -u. Note that the ;s are not necessary. – Stéphane Chazelas Jul 23 '20 at 07:09
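For an arbitrary number of files, the same comm-based idea can also be written as a plain sequential loop that streams from disk in constant memory. This is a sketch, not the answer's exact method; common.tmp and next.tmp are scratch filenames I made up:

```shell
# Intersect all files pairwise with comm, carrying the running intersection
# forward in a temporary file.
export LC_ALL=C                  # make sort and comm agree on collation
set -- file1 file2 file3         # the input files (placeholders)
sort -u -- "$1" > common.tmp     # seed with the first file, deduplicated
shift
for f in "$@"; do
    # comm -12 keeps only lines present in both sorted inputs
    sort -u -- "$f" | comm -12 common.tmp - > next.tmp
    mv next.tmp common.tmp
done
cat common.tmp                   # the strings common to all files
```

Unlike the balanced tree above it runs the sorts one at a time, but the intersection file only ever shrinks, so later passes get cheaper.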
Also, <(sort file) will likely be faster with file passed directly than with <(cat file). – Stéphane Chazelas Jul 23 '20 at 07:19
With zsh
, using its ${a:*b}
array intersection operator on arrays marked with the unique flag (also using the $(<file)
ksh operator and f
parameter expansion flag to split on line feed characters):
#! /bin/zsh -
typeset -U all list
all=(${(f)"$(<${1?})"}); shift
for file do
list=(${(f)"$(<$file)"})
all=(${all:*list})
done
print -rC1 -- $all
(that script takes the list of files as arguments; empty lines are ignored).
With join
:
cp a jnd
for f in a b c; do join jnd $f >j__; cp j__ jnd; done
I have just numbers (1-6, 3-8, 5-9) in three files a, b and c. These are the two lines (numbers/strings) the three have in common:
]# cat jnd
5
6
It is not elegant or efficient like this, especially with that cp in between. But it can be made to work in parallel quite easily: select a subgroup of files (for f in a*), give the intermediate files unique names, and run many subgroups at once. You then still have to join these results: with 64 files you would have 8 threads joining 8 files each, and the remaining 8 joined files could again be split up, into 4 threads.
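The subgroup idea above can be sketched with background jobs. This is a hypothetical illustration with two groups of two files; it assumes the inputs are sorted and duplicate-free, which join requires:

```shell
# Join each group of files in parallel, then join the partial results.
set -- a b c d            # the input files, here split into two groups of two
( join "$1" "$2" > part1 ) &   # group 1 in the background
( join "$3" "$4" > part2 ) &   # group 2 in the background
wait                           # let both groups finish
join part1 part2               # intersection of all four files
```

With 60 files you would launch one background job per subgroup, each running the sequential join loop over its files, then reduce the part files the same way.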
awk '{ ++Cnt[$0]; }'
Use an END clause to list only those strings whose Cnt equals the number of files (which you also count). But that gives a false result if a string occurs multiple times in any file. So: are the strings unique in any one file? We can cope with this, but it would be slightly more complex. – Paul_Pedant Jun 23 '20 at 10:28
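Paul_Pedant's counting approach above can be written out as a complete pipeline; deduplicating each file first with sort -u sidesteps the false-positive problem he mentions. A sketch, assuming the inputs match a file* glob and strings have no leading whitespace:

```shell
set -- file*                         # the input files
n=$#                                 # how many there are
for f in "$@"; do
    sort -u -- "$f"                  # emit each string at most once per file
done |
sort | uniq -c |                     # count occurrences across all files
awk -v n="$n" '
    $1 == n {                        # seen in every file
        sub(/^[ \t]*[0-9]+[ \t]/, "")  # strip the uniq -c count prefix
        print
    }'
```

This does one big external sort of up to 600,000 lines instead of holding a hash in memory, so it degrades gracefully for larger inputs.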