I want to perform what some data analysis software calls an anti-join: removing from one list the lines that match lines in another list. Here is some toy data and the expected output:

$ echo -e "a\nb\nc\nd" > list1
$ echo -e "c\nd\ne\nf" > list2
$ antijoincommand list1 list2
a
b
terdon
Josh

3 Answers

I wouldn't use join for this because join requires input to be sorted, which is an unnecessary complication for such a simple job. You could instead use grep:

$ grep -vxFf list2 list1
a
b
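Each option does a specific job here: -f reads the patterns from a file, -F treats them as fixed strings rather than regular expressions, -x only accepts matches that cover the whole line, and -v inverts the match. As a rough sketch of why -x and -F matter, here is a made-up example (the file names data and exclude and their contents are invented for illustration, not taken from the question):

```shell
# Work in a scratch directory so the demo files do not clobber anything.
cd "$(mktemp -d)" || exit 1

printf '%s\n' 'a.c' 'abc' 'xa.cx' > data     # lines to filter
printf '%s\n' 'a.c' > exclude                # lines to remove

# Whole-line, fixed-string match: only the exact line "a.c" is removed.
strict=$(grep -vxFf exclude data)

# Without -x and -F, "a.c" is a regular expression matched anywhere in
# the line, so "abc" and "xa.cx" are removed too and nothing is left.
# (grep exits non-zero when it prints nothing, hence the "|| true".)
loose=$(grep -vf exclude data || true)

printf 'strict:\n%s\n' "$strict"
printf 'loose:\n%s\n' "$loose"
```

With these inputs the strict form keeps abc and xa.cx, while the loose form keeps nothing.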

Or awk:

$ awk 'NR==FNR{++a[$0]} !a[$0]' list2 list1
a
b

If the files are already sorted, an alternative to join -v 1 would be comm -23, which suppresses the lines found only in the second file and the lines common to both:

$ comm -23 list1 list2 
a
b
terdon
  • Avoiding sort with grep is great for the toy data I provided. Thanks! In the real world, my file1 often has multiple columns of data, one of which is used for the join. A modified version of your awk code would address this use case. – Josh May 24 '20 at 13:46
  • @Josh yes, just replace $0 with $N, where N is the number of the field you are joining on. – terdon May 24 '20 at 13:47
  • This works even if the column numbers in file1 and file2 are different, e.g. awk 'NR==FNR{++a[$2]; next} !a[$5]' list2 list1. It is quite usual for the tag file to be in a different format from the main data. – Paul_Pedant May 24 '20 at 14:14
  • upvoted for the comm -23 command – user2297550 Jan 30 '22 at 08:17
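
The field-based variant discussed in these comments could look roughly like the sketch below. The file names records.txt and exclude.txt, their contents, and the array name seen are invented for illustration. Note the `next`: once the two field numbers differ, the second pattern would otherwise also run against the lines of the first file and print some of them. The sketch also tests membership with `in` instead of a counter, which avoids adding an array entry for every looked-up key.

```shell
# Work in a scratch directory so the demo files do not clobber anything.
cd "$(mktemp -d)" || exit 1

# Hypothetical data: records.txt has the key in column 2,
# exclude.txt has the key in column 1.
printf '%s\n' 'x1 a 10' 'x2 b 20' 'x3 c 30' 'x4 d 40' > records.txt
printf '%s\n' 'c foo' 'd bar' > exclude.txt

# First file (NR==FNR): remember each key from column 1, then skip to
# the next line. Second file: print a record only if its column-2 key
# was never seen in the first file.
kept=$(awk 'NR==FNR { seen[$1]; next } !($2 in seen)' exclude.txt records.txt)

printf '%s\n' "$kept"
```

With this data, the records keyed a and b are printed and those keyed c and d are dropped.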

One way to do this with the join utility is:

$ join -v 1 list1 list2
a
b

From the manpage:

-a FILENUM
    also print unpairable lines from file FILENUM, where FILENUM is 1 or 2, corresponding to FILE1 or FILE2

-v FILENUM
    like -a FILENUM, but suppress joined output lines
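
join expects both inputs to be sorted on the join field, which the toy data happens to satisfy. If your files are not sorted, one option is to sort them into temporary copies first. A minimal sketch, assuming deliberately shuffled versions of the same toy data (the .sorted file names are made up for this example):

```shell
# Work in a scratch directory so the demo files do not clobber anything.
cd "$(mktemp -d)" || exit 1

# Same toy data as in the question, but in shuffled order.
printf '%s\n' d b a c > list1
printf '%s\n' f c e d > list2

# Sort each file into a temporary copy, since join needs sorted input.
sort list1 > list1.sorted
sort list2 > list2.sorted

# -v 1 prints only the lines of the first file that have no partner
# in the second file.
unpaired=$(join -v 1 list1.sorted list2.sorted)

printf '%s\n' "$unpaired"
```

The result is again the lines a and b, now in sorted order rather than in the original order of list1.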

Geremia
Josh

Using Raku (formerly known as Perl_6)

Raku has Set object types, and you can read individual files to create Sets from lines:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say "list1 = ", $a;
            say "list2 = ", $b;'
list1 = Set(a b c d)
list2 = Set(c d e f)

You can perform asymmetric Set differences with either the ASCII infix (-) or the Unicode infix ∖:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $a (-) $b;'
Set(a b)
~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $b (-) $a;'
Set(e f)

On the other hand, sometimes you need a symmetric Set difference, and Raku has you covered. Use either the ASCII infix (^) or the Unicode infix ⊖:

~$ raku -e 'my $a = Set.new: "list1".IO.lines; 
            my $b = Set.new: "list2".IO.lines; 
            say $a (^) $b;'
Set(a b e f)

Finally, you can get line-by-line output by changing the final statement to .keys.put for … (Sets are unordered, so the order of the lines may vary).
The final symmetric Set difference example below uses the Unicode infix operator:

~$ raku -e 'my $a = Set.new: "list1".IO.lines;
            my $b = Set.new: "list2".IO.lines;
            .keys.put for $a ⊖ $b;'
f
e
a
b

https://docs.raku.org/type/Set
https://docs.raku.org/language/setbagmix#Operators_with_set_semantics
https://raku.org

jubilatious1