I've got a file input.txt in the following tab-delimited format:
aaaa bbbb
aaaa bbbb c
aaaa bbbb c dd
aaaa bbbb cc
aaaa bbbb x
aaaa bbbb xx
dddd eeee
dddd eeee f
dddd eeee f g
dddd eeee fe
h ii j
For each row, check whether another row starts with the same columns and has at least one additional column. If so, remove the row; otherwise, keep it. Let's take a look at the example.
- The first row is removed because the second row starts with the same columns and has one additional column. The second row stays for now.
- The second row is removed because the third row starts with the same columns and has one additional column. The third row stays.
- The third row is not removed, because no other row starts with all of its columns and adds another one. Keep the third row.
And so on. The output file should be:
aaaa bbbb c dd
aaaa bbbb cc
aaaa bbbb x
aaaa bbbb xx
dddd eeee f g
dddd eeee fe
h ii j
Ideally, the solution should run smoothly on files with millions of lines.
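
To make the rule above concrete, here is a minimal Python sketch of one possible approach, not a definitive solution. It collects every proper column prefix of every row in a set, then keeps only the rows that are not in that set. The names filter_prefix_rows and filter_file are made up for illustration, and the sketch assumes that columns are separated by exactly one tab, that the whole file fits in memory, and that exact duplicate rows should all be kept (the question does not say).

```python
def filter_prefix_rows(lines):
    """Return the rows that are not a proper column prefix of any other row.

    Assumption: columns are separated by a single tab character.
    """
    rows = [line.rstrip("\n") for line in lines]

    # Collect every proper prefix of every row, cutting at each tab.
    prefixes = set()
    for row in rows:
        pos = row.find("\t")
        while pos != -1:
            prefixes.add(row[:pos])
            pos = row.find("\t", pos + 1)

    # A row is dropped if some other row extends it by at least one column.
    return [row for row in rows if row not in prefixes]


def filter_file(src_path, dst_path):
    """Hypothetical helper: read src_path, write the kept rows to dst_path."""
    with open(src_path, encoding="utf-8") as src:
        kept = filter_prefix_rows(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(row + "\n" for row in kept)
```

Calling filter_file("input.txt", "output.txt") would write the result to a second file (the output name is only an example). Each row is scanned once per tab it contains and each set lookup takes constant time on average, so the run time should grow roughly linearly with the file size for rows with few columns. The cost is the memory needed to hold the rows and their prefixes.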