
I have two files fileA & fileB

fileA has a lot of IPs and fileB has fewer. How can I do

fileA - fileB = fileC (File without common IPs)

fileA

1.1.1.1
2.2.2.2
3.3.3.3
4.4.4.4
5.5.5.5

fileB

4.4.4.4
1.1.1.1

fileC

2.2.2.2
3.3.3.3
5.5.5.5

I found a lot of options on Google but couldn't get anything relevant.

ph3ro
  • You should have included in your example an IP in fileB that doesn't exist in fileA so we could see if that should appear in the output or not. Everyone's making different assumptions... – Ed Morton Sep 02 '22 at 13:40
  • Do you want the intersection or actually the complement? Please read the descriptions (definitions) in https://catonmat.net/set-operations-in-unix-shell – QuartzCristal Sep 02 '22 at 23:55

4 Answers

8

The comm tool may be useful here, particularly if you don't mind that the results come out sorted in alphanumeric order:

comm -23 <( sort -u fileA ) <( sort -u fileB ) >fileC

See man sort and man comm for reference details of their use.
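
As a rough illustration of why -23 gives the wanted result, here is a sketch using the sample files from the question (comm's columns are tab-separated; spacing shown here is approximate):

# Without options, comm prints three columns:
# lines only in the first file, lines only in the second, lines in both.
$ comm <( sort -u fileA ) <( sort -u fileB )
                1.1.1.1
2.2.2.2
3.3.3.3
                4.4.4.4
5.5.5.5

# -2 and -3 suppress the second and third columns,
# leaving only the lines unique to fileA:
$ comm -23 <( sort -u fileA ) <( sort -u fileB ) >fileC
$ cat fileC
2.2.2.2
3.3.3.3
5.5.5.5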

Chris Davies
  • That is assuming that you want the complement of fileA (subtract fileB from fileA, fileA - fileB). If what is wanted is the symmetric difference (elements that are in set1 or in set2 but not both) you could use sort <(sort -u fileA) <(sort -u fileB) | uniq -u (or comm -3). – QuartzCristal Sep 03 '22 at 00:53
  • @QuartzCristal yes. I read the question as wanting to use FileB to eliminate IP addresses from FileA. – Chris Davies Sep 03 '22 at 07:24
  • Yes! I agree, that is my perception as well. But then I read a sentence like (File without common IPs) and I was left wondering if what is meant is to eliminate repeated IPs anywhere they appear. Well, never mind, there is nothing that you should do with your answer, not until the OP clarifies what he means to say. – QuartzCristal Sep 03 '22 at 08:04
3

To do fileA - fileB, you can use awk (this will not output IPs that appear only in fileB):

awk 'NR==FNR{a[$0];next}!($0 in a)' fileB fileA

NR refers to the total record number across all input files, while FNR refers to the record number (typically the line number) within the current file, so NR==FNR is only true while reading the first file given (fileB). Its lines are stored as keys of the array a, and any line of the second file (fileA) that is already in a is not printed.


If you also need to get rid of duplicate lines in fileA, use:

awk 'NR==FNR{a[$0]++;next}!a[$0]++' fileB fileA
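
With the sample files from the question, either command should print the remaining fileA lines in their original order (a quick sketch, assuming nothing beyond the sample data):

$ awk 'NR==FNR{a[$0];next}!($0 in a)' fileB fileA
2.2.2.2
3.3.3.3
5.5.5.5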
Marius_Couet
3

The question can be interpreted in a couple of different ways.

I will assume that each line is unique in the file it occurs in.

  • Assuming you want to remove the entries from fileA that are also found in fileB.

    This removes the IP addresses found in fileB from the ones in fileA:

    grep -v -Fx -f fileB fileA >fileC
    

    The options used with grep here ensure that the patterns (lines read from fileB using -f) are treated as strings rather than as regular expressions (-F), and that we are matching whole lines rather than substrings (-x). We also invert the sense of the match with -v to output all lines from fileA that do not match any of the lines in fileB. (A sample run of both approaches follows this list.)

  • Assuming you want to get all entries that are unique to fileA or that are unique to fileB:

    The following outputs the lines that are not duplicated across the files. It uses the -u option of uniq, which outputs only the lines that are not repeated on adjacent lines of its input (hence the need to sort first).

    sort fileA fileB | uniq -u >fileC
    
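Both approaches happen to give the same result for the question's sample data (they differ only if fileB contains an address that is not in fileA); a quick sketch:

$ grep -v -Fx -f fileB fileA
2.2.2.2
3.3.3.3
5.5.5.5

$ sort fileA fileB | uniq -u
2.2.2.2
3.3.3.3
5.5.5.5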
Kusalananda
  • sort fileA fileB fileB | uniq -u: you need to duplicate the common IPs. Please edit the answer. – K-attila- Sep 02 '22 at 10:42
  • @K-att- That is correct, but only if fileB contains values that don't exist in fileA. That is not what the OP supplied as examples, but yes, sort fileA fileB fileB | uniq -u is the correct solution. – QuartzCristal Sep 02 '22 at 11:59
  • @K-att- My intention was to list the entries that were not repeated between the files. If I use fileB twice, then I would get no entry from that file. If fileB contains an entry that is not in fileA, then I was assuming that the user wanted to see it. This is what I meant by there being different interpretations of the problem. – Kusalananda Sep 02 '22 at 12:04
  • fileA - fileB = fileC (File without common IPs) – K-attila- Sep 02 '22 at 12:12
  • @K-att- Exactly. If you use fileB twice in the call to sort, then you will get no elements from fileB, even if the file contains elements that are not common to both files. – Kusalananda Sep 02 '22 at 12:14
  • @Kusalananda You are interpreting the question as getting the intersection of both files and printing what is not common to both of them. But the procedure you decided to use (with sort) requires that there are no repeated values in fileA. If you add a value 6.6.6.6 twice in fileA it will not be printed in the output. – QuartzCristal Sep 02 '22 at 23:41
  • @Kusalananda An alternative interpretation of fileA-fileB is that of subtraction (removal of values in fileB from fileA). That is what you did with the grep example and the sort command is not an equivalent. – QuartzCristal Sep 02 '22 at 23:41
  • @Kusalananda The description in https://catonmat.net/set-operations-in-unix-shell might be of help. And the solutions already exist in https://unix.stackexchange.com/questions/11343/linux-tools-to-treat-files-as-sets-and-perform-set-operations-on-them – QuartzCristal Sep 02 '22 at 23:57
  • @QuartzCristal Yes, I assume that each file contains lines that are unique within that file. This is an assumption that is already part of my answer. I also say that there are different ways of interpreting the question, and then I give two different answers, together with two different interpretations. This is also part of the text already. – Kusalananda Sep 03 '22 at 00:13
  • Yes, that is true. But plain rejection is not better than trying to include some/any of the helpful points: (1) you can comment (note, write, say, include) that your assumption that each file is a set (no repeated lines) could be enforced with <(sort -u fileA). (2) That the equivalent to your grep solution is sort fileA fileB fileB | uniq -u with the sort command. (3) That a seemingly cleaner equivalent to sort fileA fileB | uniq -u which enforces the assumptions is comm -13 <(sort -u fileB) <(sort -u fileA). (4) And that there is another existing Unix answer that might be of help. – QuartzCristal Sep 03 '22 at 00:42
1

If preservation of order is not important, you can first remove duplicates in the first file, concatenate the output with the second file twice (so that anything unique to the second file is removed), and print only the non-duplicated lines with uniq -u.

sort -u fileA | cat - fileB fileB | sort | uniq -u
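
A quick sanity check of why fileB is concatenated twice: suppose fileB also contained a hypothetical 9.9.9.9 that is not in fileA (an address added here purely for illustration). It would then occur twice in the combined input, so uniq -u still drops it along with the common addresses:

# assuming fileB contains 4.4.4.4, 1.1.1.1 and (hypothetically) 9.9.9.9
$ sort -u fileA | cat - fileB fileB | sort | uniq -u
2.2.2.2
3.3.3.3
5.5.5.5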