Pattern Matching and Delete the whole line

Question

I want to delete every line of File 1 whose Column 1 exactly matches a value in Column 1 of File 2.

File 1:

r001:21:10    21    AAAAAATTTGC    *     =    XM:21
r002:21:10    21    YAAAATTTGC     *     =    nM:21
r001:21:10    21    TTAAAATTTGC    *     =    XM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21
r001:21:10    21    AAAAAATTTGC    *     =    GM:21

File2:

r001:21:10
r001:21:20
r002:41:36
r002:41:99
r002:41:87
r0012:21:1

Expected Output:

r002:21:10    21    YAAAATTTGC     *     =    nM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21

score 6 · Accepted Answer · edited Feb 28 '14 at 12:59

You can use this awk:

$ awk 'FNR==NR {a[$1]; next}; !($1 in a)' f2 f1
r002:21:10    21    YAAAATTTGC     *     =    nM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21

Explanation

FNR==NR {a[$1]; next} is true only while awk reads the first file argument (f2). It stores the first field of each line as a key of the array a, then skips to the next line.
!($1 in a) applies while reading the second file (f1). It checks whether the first field is a key of the array a and, if it is not, prints the line.
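To try the one-liner end to end, here is a minimal sketch that recreates the question's sample data in a scratch directory and runs the same logic with comments. The directory, the file names f1 and f2, and the array name seen are only illustrative choices.

```shell
# Scratch directory and file names are hypothetical; the data is the question's sample.
tmp=$(mktemp -d)
cat > "$tmp/f1" <<'EOF'
r001:21:10    21    AAAAAATTTGC    *     =    XM:21
r002:21:10    21    YAAAATTTGC     *     =    nM:21
r001:21:10    21    TTAAAATTTGC    *     =    XM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21
r001:21:10    21    AAAAAATTTGC    *     =    GM:21
EOF
cat > "$tmp/f2" <<'EOF'
r001:21:10
r001:21:20
r002:41:36
r002:41:99
r002:41:87
r0012:21:1
EOF

result=$(awk '
    # While reading the first file argument (f2), FNR equals NR:
    # record its first field as an array key and move on.
    FNR == NR { seen[$1]; next }
    # While reading the second file argument (f1): print the line
    # only if its first field was never recorded.
    !($1 in seen)
' "$tmp/f2" "$tmp/f1")

printf '%s\n' "$result"
rm -rf "$tmp"
```

One thing to keep in mind: if f2 is empty, FNR==NR stays true while f1 is read, so nothing is printed at all.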

edited Feb 28 '14 at 12:59 by Stéphane Chazelas · answered Feb 28 '14 at 12:27 by fedorqui

@Stephane Chazelas, I don't see the need for that semicolon. Why should it be there? – fedorqui Feb 28 '14 at 13:01
It's required by POSIX, so you could expect some awk implementations to fail without it. – Stéphane Chazelas Feb 28 '14 at 13:05
Having said that, I don't know of any awk implementation that does. I once raised it with the Austin Group. – Stéphane Chazelas Feb 28 '14 at 13:09
@StéphaneChazelas I was revisiting the link you gave and apparently your proposal has been approved. – fedorqui Sep 08 '15 at 14:05
@fedorqui, some of it has been approved, but not the part about needing ; or a newline between condition+action pairs. The standard does not require support for awk '/x/{print} {print}' and this is intentional (actually, the gawk maintainer objected to it). – Stéphane Chazelas Sep 08 '15 at 14:38

score 2 · Answer 2 · answered Feb 28 '14 at 17:07

You could also do

$ grep -wvFf file2 file1
r002:21:10    21    YAAAATTTGC     *     =    nM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21

From man grep:

   -F, --fixed-strings
          Interpret PATTERN as a  list  of  fixed  strings,  separated  by
          newlines,  any  of  which is to be matched. 
   -f FILE, --file=FILE
          Obtain  patterns  from  FILE,  one  per  line.  
   -v, --invert-match
          Invert the sense of matching, to select non-matching lines. 
   -w, --word-regexp
          Select  only  those  lines  containing  matches  that form whole
          words.  The test is that the matching substring must  either  be
          at  the  beginning  of  the  line,  or  preceded  by  a non-word
          constituent character.

NOTE: This, however, searches the entirety of each line of file1, not just the first column, so a key from file2 that appears in any other column also causes the line to be dropped.
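The difference is easy to see on a small made-up input. In the sketch below the two-line file1 is hypothetical (it is not the question's data): the second line has a first column that is absent from file2, but its last column happens to equal a key from file2. grep drops that line, while the first-column awk test keeps it.

```shell
tmp=$(mktemp -d)
# Hypothetical data: second line's LAST column equals a key from file2.
printf '%s\n' \
    'r002:21:10 21 YAAAATTTGC * = nM:21' \
    'r003:21:10 21 TTAAAATTTGC * = r001:21:10' > "$tmp/file1"
printf '%s\n' 'r001:21:10' > "$tmp/file2"

# Whole-line search: the key matches as a word in the last column,
# so the second line is filtered out.
grep_out=$(grep -wvFf "$tmp/file2" "$tmp/file1")

# First-column comparison: neither first field is in file2,
# so both lines are kept.
awk_out=$(awk 'FNR==NR {a[$1]; next}; !($1 in a)' "$tmp/file2" "$tmp/file1")

printf 'grep keeps:\n%s\n\nawk keeps:\n%s\n' "$grep_out" "$awk_out"
rm -rf "$tmp"
```

For data like the question's, where the keys only ever appear in the first column, both commands give the same result.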

+1 from me. Works fast and efficiently, also for large flat-file databases. – Feb 28 '14 at 20:19

score 2 · Answer 3 · answered Mar 01 '14 at 19:58

If the output order is not important and your shell supports process substitution (bash does), you could use join on the sorted files:

join -v 1 <(sort -k1,1 file1) <(sort -k1,1 file2) | column -t
r0012:21:10  21  LLAAAATTTGC  *  +  XM:21
r002:21:10   21  YAAAATTTGC   *  =  nM:21

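Explanation: join matches the two files on their first column, and -v 1 makes it output only the lines of the first file that have no match in the second. join needs sorted input, so both files are first sorted on their first column with -k1,1. Finally, column -t aligns the output into a table.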
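If your shell has no process substitution, the same idea can be sketched with temporary files. The scratch directory and the .sorted file names below are illustrative, and setting LC_ALL=C is my own assumption to make sure sort and join agree on the collation order. column -t is left out here, so fields come out separated by single spaces.

```shell
# Assumption: force one collation order for both sort and join.
export LC_ALL=C
tmp=$(mktemp -d)
cat > "$tmp/file1" <<'EOF'
r001:21:10    21    AAAAAATTTGC    *     =    XM:21
r002:21:10    21    YAAAATTTGC     *     =    nM:21
r001:21:10    21    TTAAAATTTGC    *     =    XM:21
r0012:21:10   21    LLAAAATTTGC    *     +    XM:21
r001:21:10    21    AAAAAATTTGC    *     =    GM:21
EOF
cat > "$tmp/file2" <<'EOF'
r001:21:10
r001:21:20
r002:41:36
r002:41:99
r002:41:87
r0012:21:1
EOF

# join requires both inputs sorted on the join field (column 1).
sort -k1,1 "$tmp/file1" > "$tmp/file1.sorted"
sort -k1,1 "$tmp/file2" > "$tmp/file2.sorted"

# -v 1: print only the lines of the first file that have no partner in the second.
join_out=$(join -v 1 "$tmp/file1.sorted" "$tmp/file2.sorted")

printf '%s\n' "$join_out"
rm -rf "$tmp"
```

As with the original command, the lines come out in sorted order rather than in the order of file1.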

score -1 · Answer 4 · edited Apr 13 '17 at 12:36

I was able to find an answer using the method from another thread:

Comparing two files using Unix and Awk

FNR == NR {
  f1[$1,$2,$3] = $0
  f1_c14[$1,$2,$3] = 1
  f1_c5[$1,$2,$3] = $4
  next
}

f1_c14[$1,$2,$3] {
  if ($4 != f1_c5[$1,$2,$3]) print f1[$1,$2,$3] ;
}

f1[$1,$2,$3] {
  if ($4 != f1_c5[$1,$2,$3]) print $0;
}

edited Apr 13 '17 at 12:36 by Community · answered May 06 '14 at 20:10 by nvt007

When copying, do so carefully please - you missed a closing brace. It also doesn't produce the required output (copying blindly doesn't really help that much). – peterph May 06 '14 at 21:48
I didn't copy it blindly. I spent time to understand how to use the arrays. It's awesome ! worked for me now. – nvt007 May 13 '14 at 20:35
If you present a piece of code as a solution, it should work on the data provided in the question you are answering - it doesn't seem to do so in this case. – peterph May 15 '14 at 18:14
Sorry ! I had been out for a while. I tested today using the test files, file1 and file2, just exactly as I posted here, and it returned the result as expected. Don't know why it doesn't work for you ? can you provide details of your steps ? – nvt007 May 28 '14 at 17:13
Use the data provided in this question not in the one from which you copied the script. Note, that the requirement here is different than in the other question (different fields to compare), although the person asking is the same - that should also make it clear, that this question is different. Then put first data set into file1, second data set into file2 and script into F. Run awk -f F -- file1 file2. Understand what the script actually tries to do and you'll find out, that it, among other things, compares 1st, 2nd and 4th fields not just the 1st one. – peterph May 29 '14 at 08:09
Yes, I was using the two files that I posted here, not from the link that I used as a reference nor the one from this owner's thread. Just run awk -f test.awk file1 file2 (test.awk is the file that has my script). And it returned exactly each pair of lines that matched the first 3 fields as well as the fourth fields that didn't match. – nvt007 May 29 '14 at 16:18
Good. Now please compare this with the question you are trying to answer: I want to delete all the Lines of File 1, if Column1 of File1 matches exactly with Column 1 File2. – peterph Jun 01 '14 at 11:23
Not sure what you are trying to prove here rather than focusing on the solution. I did say "..delete or won't print out that line..". Hope that we can end it here. – nvt007 Jun 02 '14 at 16:14
