29

I have 4 files which look like this:

    file a
    >TCONS_00000867
    >TCONS_00001442
    >TCONS_00001447
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921

    file b
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924

    file c
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

    file d
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

All files contain more than 2000 lines and are sorted by the first column.

I want to find the lines common to all the files. I tried awk, grep and comm but couldn't get them to work.

jubilatious1

4 Answers

38

Since the files are already sorted:

comm -12 a b |
  comm -12 - c |
  comm -12 - d

comm compares two sorted files line by line. By default comm prints 3 TAB-separated columns:

  1. The lines unique to the first file,
  2. The lines unique to the second file,
  3. The lines common to both files.

With the -1, -2, -3 options, we suppress the corresponding column. So comm -12 a b reports the lines common to a and b. - can be used in place of a file name to mean stdin.
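
If the number of files can vary, the same pairwise idea can be wrapped in a small loop. A minimal sketch, assuming the files are sorted (as comm requires) and using the question's file names:

# A sketch: intersect any number of sorted files with comm,
# using the question's file names (a b c d); adjust as needed.
tmp=$(mktemp)
tmp2=$(mktemp)
cp a "$tmp"                        # start from the first file
for f in b c d; do                 # fold in each remaining file
    comm -12 "$tmp" "$f" >"$tmp2"  # keep only the lines present in both
    mv "$tmp2" "$tmp"
done
cat "$tmp"                         # the lines common to every file
rm -f "$tmp"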

7
cat a b c d |sort |uniq -c |sed -n -e 's/^ *4 \(.*\)/\1/p'
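
This counts how often each line occurs in the concatenation of the four files, and the sed keeps only the lines seen exactly 4 times. It implicitly assumes that no line is repeated within a single file. If that can happen, here is a sketch that de-duplicates each file first (assuming the files stay sorted, as stated in the question, and a shell with process substitution such as bash):

# uniq collapses repeats within each already-sorted file, sort -m merges
# the sorted streams, and awk keeps the IDs counted in all 4 files
# (the IDs contain no whitespace, so $2 is the whole line).
sort -m <(uniq a) <(uniq b) <(uniq c) <(uniq d) |
  uniq -c |
  awk '$1 == 4 { print $2 }'
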
Stephen Kitt
Piotr
  • Actually, save the sed, this is quite good for finding duplicate lines across many files: cat to sort to uniq -c. Somehow I didn't quite think of this, good answer! – smaslennikov May 21 '19 at 21:35
  • You can also use the uniq command to print only duplicated lines: uniq -cd – mems Sep 30 '19 at 14:44
0

comm would be cool if it took 3+ files as input. Since it doesn't, I don't see the value in it and would recommend just using grep:

  • Doesn't require pre-sorted input
  • Don't have to suppress output columns
  • Command is just as short and easy as comm
grep -f a.txt b.txt | 
  grep -f - c.txt |
  grep -f - d.txt

-f <file> tells grep to read its patterns from <file>, so any line of b.txt that matches a line from a.txt is output. This continues down the pipeline: the lines found in both a.txt AND b.txt become the patterns matched against c.txt. At the end of the pipeline, the only lines output are those which appear in every file.
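
One caveat: without further options, -f treats each line of the pattern file as a regular expression and matches it anywhere within a line, so a shorter ID could match inside a longer one. Adding -F (fixed strings) and -x (whole-line matches) avoids that; a sketch of the same pipeline:

# -F: treat patterns as literal strings; -x: match whole lines only
grep -Fxf a.txt b.txt |
  grep -Fxf - c.txt |
  grep -Fxf - d.txt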

  • This requires comparing *every* line in a.txt to *every* line in b.txt. If each file has 1000 lines, that's (theoretically) a million comparisons. That's the value of comm: if line 6 of a sorts between lines 8 and 9 of b, you don't have to compare it to lines 10, 11, 12, ... – G-Man Says 'Reinstate Monica' Feb 06 '24 at 23:46
  • Good point. If comm actually moves on to the next line after finding the first match, then it is more efficient than grep. However, a test indicates that comm is only slightly faster:
    $ time comm -12  <(seq 1 1000) <(seq 1 1000) | comm -12 - <(seq 1 1000) | comm -12 - <(seq 1 1000) | comm -12 - <(seq 1 1000) | tail -1
    1000
    
    real    0m0.037s
    user    0m0.006s
    sys     0m0.010s
    $ time grep -f <(seq 1 1000) <(seq 1 1000) | grep -f - <(seq 1 1000) |  grep -f - <(seq 1 1000) |  grep -f - <(seq 1 1000) | tail -1
    1000
    
    real    0m0.053s
    user    0m0.024s
    sys     0m0.012s
    
    – Petrucci Feb 07 '24 at 14:22
0

Using Raku (formerly known as Perl 6)

~$ raku -e 'my %h;  for dir(test => / file \w /) {   \
               %h{$_}++ for .lines.unique };         \
            .key.put if .value == 4 for %h;'

#OR

~$ raku -e 'my %h; my @a = dir(test => / file \w /);
for @a { %h{$_}++ for .lines.unique };
for %h { .key.put if .value == @a.elems };'

Above are answers written in Raku, a member of the Perl family of programming languages. Briefly, Raku's dir function is used to inspect the local directory and select filenames matching a regex. Here we assume the files are named file followed by a word character (\w), but it's equally easy to (let's say) match \.txt files with the appropriate regex (not a globbing pattern).
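
For instance, a variant of the second answer above that picks up .txt files instead, changing only the regex passed to test (a sketch; the .txt naming is hypothetical here):

~$ raku -e 'my %h; my @a = dir(test => / \.txt $ /);
for @a { %h{$_}++ for .lines.unique };
for %h { .key.put if .value == @a.elems };'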

For both answers above a %h hash is declared. Once the filenames are obtained (actually IO::Path objects), each file is iterated through line-wise using for, incrementing the hash value for a line every time that line is seen.

In the first answer (last statement) the lines are returned if their value equals 4, i.e. they were seen in all four input files. In the second answer (last statement) the lines are returned if their value equals @a.elems, i.e. the number of input files (in other words, this value is set dynamically).

Sample Input (Note: files are named fileA, fileB, etc. Also, fileA has an extra line at the end, compared to OP's first file):

    fileA
    >TCONS_00000867
    >TCONS_00001442
    >TCONS_00001447
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922

    fileB
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924

    fileC
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

    fileD
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

Sample Output (showing all key/value counts):

~$ raku -e 'my %h; for dir(test => / file \w /) { %h{$_}++ for .lines.unique }; .say for %h.sort: -*.value;'
>TCONS_00001922 => 4
>TCONS_00001668 => 3
>TCONS_00001921 => 3
>TCONS_00001924 => 3
>TCONS_00001529 => 3
>TCONS_00002048 => 2
>TCONS_00001528 => 2
>TCONS_00001956 => 2
>TCONS_00000867 => 1
>TCONS_00001442 => 1
>TCONS_00001447 => 1

Sample Output (showing only lines where .value == 4 or .value == @a.elems):

~$ raku -e 'my %h; my @a = dir(test => / file \w /); for @a { %h{$_}++ for .lines.unique }; for %h { .key.put if .value == @a.elems};'
>TCONS_00001922

Finally, for those who prefer shell globbing, Raku can do this as well. The key is to iterate over the $*ARGFILES dynamic variable via its .handles method, so that each input file is read (and de-duplicated) separately:

~$ raku -e 'my ($n,%h); for $*ARGFILES.handles -> $fh { $n++; %h{$_}++ for $fh.lines.unique }; for %h { .key.put if .value == $n };' file?
>TCONS_00001922

NOTE: The OP's test input seems like a trick question, because no line is common to all four files! Thus the first file (fileA) has been given an extra line, >TCONS_00001922, to provide a positive control.

https://stackoverflow.com/a/68774047/7270649
https://docs.raku.org/routine/dir
https://docs.raku.org
https://raku.org

jubilatious1