29

I have 4 files which look like this:

    file a
    >TCONS_00000867
    >TCONS_00001442
    >TCONS_00001447
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921

    file b
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924

    file c
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

    file d
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

All files contain more than 2000 lines and are sorted by the first column.

I want to find the lines common to all the files. I tried awk, grep and comm but couldn't get them to work.

jubilatious1

4 Answers

38

Since the files are already sorted:

comm -12 a b |
  comm -12 - c |
  comm -12 - d

comm compares two sorted files line by line. By default comm prints 3 TAB-separated columns:

  1. The lines unique to the first file,
  2. The lines unique to the second file,
  3. The lines common to both files.

With the -1, -2, -3 options, we suppress the corresponding column. So comm -12 a b reports the lines common to a and b. - can be used in place of a file name to mean stdin.
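
If the number of files can vary, the same pairwise idea can be wrapped in a small loop. A minimal sketch, assuming the files are sorted (as comm requires) and using the question's file names:

# A sketch: intersect any number of sorted files with comm,
# using the question's file names (a b c d); adjust as needed.
tmp=$(mktemp)
tmp2=$(mktemp)
cp a "$tmp"                        # start from the first file
for f in b c d; do                 # fold in each remaining file
    comm -12 "$tmp" "$f" >"$tmp2"  # keep only the lines present in both
    mv "$tmp2" "$tmp"
done
cat "$tmp"                         # the lines common to every file
rm -f "$tmp"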

7
cat a b c d |sort |uniq -c |sed -n -e 's/^ *4 \(.*\)/\1/p'
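
This counts how often each line occurs in the concatenation of the four files, and the sed keeps only the lines seen exactly 4 times. It implicitly assumes that no line is repeated within a single file. If that can happen, here is a sketch that de-duplicates each file first (assuming the files stay sorted, as stated in the question, and a shell with process substitution such as bash):

# uniq collapses repeats within each already-sorted file, sort -m merges
# the sorted streams, and awk keeps the IDs counted in all 4 files
# (the IDs contain no whitespace, so $2 is the whole line).
sort -m <(uniq a) <(uniq b) <(uniq c) <(uniq d) |
  uniq -c |
  awk '$1 == 4 { print $2 }'
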
Stephen Kitt
Piotr
  • Actually, save the sed, this is quite good for finding duplicate lines across many files: cat to sort to uniq -c. Somehow I didn't quite think of this, good answer! – smaslennikov May 21 '19 at 21:35
  • You can also use the uniq command to print only duplicated lines: uniq -cd – mems Sep 30 '19 at 14:44
0

comm would be cool if it took 3+ files as input. Since it doesn't, I don't see the value in it and would recommend just using grep:

  • Doesn't require pre-sorted input
  • Don't have to suppress output columns
  • Command is just as short and easy as comm
grep -f a.txt b.txt | 
  grep -f - c.txt |
  grep -f - d.txt

-f <file> tells grep to read its patterns from <file>, so any line of b.txt that matches a line from a.txt is output. This continues down the pipeline: the lines found in both a.txt AND b.txt become the patterns matched against c.txt. At the end of the pipeline, the only lines output are those which appear in every file.
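
One caveat: without further options, -f treats each line of the pattern file as a regular expression and matches it anywhere within a line, so a shorter ID could match inside a longer one. Adding -F (fixed strings) and -x (whole-line matches) avoids that; a sketch of the same pipeline:

# -F: treat patterns as literal strings; -x: match whole lines only
grep -Fxf a.txt b.txt |
  grep -Fxf - c.txt |
  grep -Fxf - d.txt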

  • This requires comparing *every* line in a.txt to *every* line in b.txt. If each file has 1000 lines, that's (theoretically) a million comparisons. That's the value of comm: if line 6 of a sorts between lines 8 and 9 of b, you don't have to compare it to lines 10, 11, 12, ... – G-Man Says 'Reinstate Monica' Feb 06 '24 at 23:46
  • Good point. If comm actually moves on to the next line after finding the first match, then it is more efficient than grep. However, a test indicates that comm is only slightly faster:
    $ time comm -12  <(seq 1 1000) <(seq 1 1000) | comm -12 - <(seq 1 1000) | comm -12 - <(seq 1 1000) | comm -12 - <(seq 1 1000) | tail -1
    1000
    
    real    0m0.037s
    user    0m0.006s
    sys     0m0.010s
    $ time grep -f <(seq 1 1000) <(seq 1 1000) | grep -f - <(seq 1 1000) |  grep -f - <(seq 1 1000) |  grep -f - <(seq 1 1000) | tail -1
    1000
    
    real    0m0.053s
    user    0m0.024s
    sys     0m0.012s
    
    – Petrucci Feb 07 '24 at 14:22
0

Using Raku (formerly known as Perl 6)

~$ raku -e 'my %h;  for dir(test => / file \w /) {   \
               %h{$_}++ for .lines.unique };         \
            .key.put if .value == 4 for %h;'

#OR

~$ raku -e 'my %h; my @a = dir(test => / file \w /);
for @a { %h{$_}++ for .lines.unique };
for %h { .key.put if .value == @a.elems };'

Above are answers written in Raku, a member of the Perl family of programming languages. Briefly, Raku's dir function is used to inspect the local directory and select filenames matching a regex. Here we assume the files are named file followed by a word character (\w), but it's equally easy to (let's say) match \.txt files with the appropriate regex (not a globbing pattern).
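
For instance, a variant of the second answer above that picks up .txt files instead, changing only the regex passed to test (a sketch; the .txt naming is hypothetical here):

~$ raku -e 'my %h; my @a = dir(test => / \.txt $ /);
for @a { %h{$_}++ for .lines.unique };
for %h { .key.put if .value == @a.elems };'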

For both answers above a %h hash is declared. Once the filenames are obtained (actually IO::Path objects), each file is iterated through line-wise using for, incrementing the hash value for a line every time that line is seen.

In the first answer (last statement) the lines are returned if their value equals 4, i.e. they were seen in all four input files. In the second answer (last statement) the lines are returned if their value equals @a.elems, i.e. the number of input files (in other words, this value is set dynamically).

Sample Input (Note: files are named fileA, fileB, etc. Also, fileA has an extra line at the end, compared to OP's first file):

    fileA
    >TCONS_00000867
    >TCONS_00001442
    >TCONS_00001447
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922

    fileB
    >TCONS_00001528
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924

    fileC
    >TCONS_00001529
    >TCONS_00001668
    >TCONS_00001921
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

    fileD
    >TCONS_00001922
    >TCONS_00001924
    >TCONS_00001956
    >TCONS_00002048

Sample Output (showing all key/value counts):

~$ raku -e 'my %h; for dir(test => / file \w /) { %h{$_}++ for .lines.unique }; .say for %h.sort: -*.value;'
>TCONS_00001922 => 4
>TCONS_00001668 => 3
>TCONS_00001921 => 3
>TCONS_00001924 => 3
>TCONS_00001529 => 3
>TCONS_00002048 => 2
>TCONS_00001528 => 2
>TCONS_00001956 => 2
>TCONS_00000867 => 1
>TCONS_00001442 => 1
>TCONS_00001447 => 1

Sample Output (showing only lines where .value == 4 or .value == @a.elems):

~$ raku -e 'my %h; my @a = dir(test => / file \w /); for @a { %h{$_}++ for .lines.unique }; for %h { .key.put if .value == @a.elems};'
>TCONS_00001922

Finally, for those who prefer shell globbing, Raku can do this as well. The key is to iterate over the $*ARGFILES dynamic variable via its .handles method, so that each input file is read (and de-duplicated) separately:

~$ raku -e 'my ($n,%h); for $*ARGFILES.handles -> $fh { $n++; %h{$_}++ for $fh.lines.unique }; for %h { .key.put if .value == $n };' file?
>TCONS_00001922

NOTE: The OP's test input seems like a trick question, because no line is common to all four files! Thus the first file (fileA) has been given an extra line, >TCONS_00001922, to provide a positive control.

https://stackoverflow.com/a/68774047/7270649
https://docs.raku.org/routine/dir
https://docs.raku.org
https://raku.org

jubilatious1