
I've got a file input.txt in the following tab delimited format:

aaaa    bbbb
aaaa    bbbb    c
aaaa    bbbb    c   dd
aaaa    bbbb    cc
aaaa    bbbb    x
aaaa    bbbb    xx
dddd    eeee
dddd    eeee    f
dddd    eeee    f   g
dddd    eeee    fe
h   ii  j

For each row, check whether there is another row whose leading columns are identical to all of this row's columns — that is, whether the row is a field-wise prefix of a longer row. If so, remove the row; otherwise keep it. Let's take a look at the example.

  • The first row is removed because there is another row with an additional column, whose first columns are the same: the second row. In this case, remove the first row and keep the second row.
  • The second row is removed because there is another row with an additional column, whose first columns are the same: the third row. In this case, remove the second row and keep the third row.
  • The third row is not removed, because there is no other row whose first columns are the same. In this case, keep the third row.

And so on and so forth. The output file should be:

aaaa    bbbb    c   dd
aaaa    bbbb    cc
aaaa    bbbb    x
aaaa    bbbb    xx
dddd    eeee    f   g
dddd    eeee    fe
h   ii  j

Maybe we can find a solution that runs smoothly even for millions of lines.

GioWay
  • 27

3 Answers


This simply sorts the input in reverse (byte-wise, via LC_ALL=C, so that every line sorts directly after a longer line it is a prefix of; locale-aware collation can reorder the lines and break the comparison) and then doesn't print the current line (foo) if, with a field separator appended to both, it's a prefix of the previous one (foo bar). Appending FS anchors the match to whole fields, so foo does not count as a prefix of foobar.

$ LC_ALL=C sort -r file | awk 'index(prev FS,$0 FS) != 1; {prev=$0}'
h   ii  j
dddd    eeee    fe
dddd    eeee    f   g
aaaa    bbbb    xx
aaaa    bbbb    x
aaaa    bbbb    cc
aaaa    bbbb    c   dd

If the output order matters to you there's ways to address that, e.g.:

$ cat -n file | LC_ALL=C sort -k2r |
    awk '{orig=$0; $1=""} index(prev FS,$0 FS) != 1{print orig} {prev=$0}' |
    sort -n | cut -f2-
aaaa    bbbb    c   dd
aaaa    bbbb    cc
aaaa    bbbb    x
aaaa    bbbb    xx
dddd    eeee    f   g
dddd    eeee    fe
h   ii  j
Ed Morton
  • 31,617
  • This does not address the column nature of the data. If you remove the line ending with dd from the data, your solution will also filter out the line with the last column of c because that line is a character prefix of the line ending with cc. You need to match whole columns, not partial columns. – camh Oct 13 '19 at 20:24
  • @camh thank you. I've added some lines in the example files regarding this issue. – GioWay Oct 14 '19 at 06:14
  • @camh thanks for the info that the match must be on whole columns, not partial. I've updated my answer to account for that. – Ed Morton Oct 14 '19 at 15:50
  • Thank you Ed. I'm not sure if I understood the awk part in your first answer correctly. If the current line (\tLINE\t) is not in prev (previous line), print it. Else set prev to current line. Is that correct? At least when the awk script is running the first time, prev is not initialized, so we do not have to initialize variables in awk generally? – GioWay Oct 14 '19 at 18:41
  • Not quite - "If the current line is not AT THE START OF prev (courtesy of the test for != 1), print it". The rest is correct. – Ed Morton Oct 14 '19 at 18:50
  • Hey Ed, I've edited the first post. It is not working for the lines containing hello world. Do you know what to do? Maybe you could also explain why this doesn't work for learning purposes? – GioWay Dec 18 '19 at 05:58
  • This question is 2 months old, this was the accepted answer during that time as it did what you asked for in your question, and now you've unaccepted my answer and modified the original question so you can ask a followup in a comment. Don't you think leaving this one as it was and asking a new followup question with your new sample input/output would be more appropriate? I certainly do. – Ed Morton Dec 18 '19 at 14:15
  • If this is actually the usual procedure on this site, then I can do this. – GioWay Dec 18 '19 at 16:20
  • Going back and changing your question months after you've accepted an answer will not get you any new answers so yes, that is the usual procedure. – Ed Morton Dec 18 '19 at 16:23

You want to drop any line that is a prefix of another line, based on columns (fields) rather than characters. This is doable with awk(1). You first want to reverse sort the data byte-wise (LC_ALL=C), so that the longer lines come first and any line that is a field-wise prefix appears right after a line that extends it; locale-aware collation can reorder the lines and break this. Then you can use awk to scan the fields to see if they match the last line you saved and if so, drop it:

LC_ALL=C sort -r input.txt | awk '
    { for (i=1; i<=NF; i++) if (save[i] != $i) {keep=1; break} }
    keep == 0 { next }
    { delete save; for (i=1; i<=NF; i++) save[i]=$i; keep=0; print }
'

The first awk action compares the current line's fields against the saved fields of the last kept line. If any field differs, the line is marked as a keeper. If they all match, it isn't, so the second action skips it. The third action saves the current fields, prints the line, and clears the keep flag ready for the next line.

I don't have a data set millions of lines long, so I'm not sure whether this will perform well for you. Try it and see.
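One way to manufacture a large data set and get a rough timing; the line count, field widths, and seed below are all arbitrary:

```shell
# Make 100,000 random tab-separated lines of 2-4 one-letter fields
# (sizes and seed are arbitrary), then time the sort+awk pipeline.
awk 'BEGIN {
    srand(42)
    for (n = 1; n <= 100000; n++) {
        nf = 2 + int(rand() * 3)
        line = ""
        for (i = 1; i <= nf; i++)
            line = line (i > 1 ? "\t" : "") sprintf("%c", 97 + int(rand() * 26))
        print line
    }
}' > big.txt

start=$(date +%s)
LC_ALL=C sort -r big.txt | awk '
    { for (i=1; i<=NF; i++) if (save[i] != $i) {keep=1; break} }
    keep == 0 { next }
    { delete save; for (i=1; i<=NF; i++) save[i]=$i; keep=0; print }
' > /dev/null
echo "filtered 100000 lines in $(( $(date +%s) - start ))s"
```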

camh
  • 39,069
  • nice but this outputs both aaaa bbbb c AND aaaa bbbb c dd. it should only output the latter. an insufficient caffeine error is preventing me from seeing more than a vague outline but i suspect that this approach needs to be able to save more than just one line (some kind of sliding window of "similar" lines to minimise memory usage), AND it also needs to save (and compare) the number of fields for each saved line. – cas Oct 14 '19 at 01:36
  • BTW, this Q is very strongly reminiscent of bio-informatics "find the longest sequence" searches - I suspect that bioperl or biopython communities would be a good place to start looking for an efficient algorithm. – cas Oct 14 '19 at 01:39
  • @cas when I ran it I did not see that issue. Do you have CR's in your file as well as LFs? – camh Oct 14 '19 at 07:20
  • nope, no CRs. I just copy-pasted the sample into vi, ran :%s/ */\t/g to make it tab-separated, and saved it. – cas Oct 14 '19 at 12:33
  • delete save; for (i=1; i<=NF; i++) save[i]=$i = split($0,save). – Ed Morton Oct 14 '19 at 19:22

You are just looking for lines that do not occur, anchored to the start of the line, as a prefix of any other line. So, assuming your file is called tst, and that the columns really are tab-separated as stated (the tab alternation in the pattern anchors the match to whole columns, so c is not treated as a prefix of cc) ....

tab=$(printf '\t')
while IFS= read -r l; do
    if [ "$(grep -c -E "^$l($tab|\$)" tst)" -eq 1 ]; then printf '%s\n' "$l"; fi
done < tst

(IFS= read -r and printf keep the tabs in each line intact; this still assumes the data contains no regex metacharacters.)

However this will fail if the longest pattern itself is repeated, since each copy matches the other and the count exceeds 1, so you need to de-duplicate the input first ...

tab=$(printf '\t')
sort -u tst | while IFS= read -r l; do
    if [ "$(sort -u tst | grep -c -E "^$l($tab|\$)")" -eq 1 ]; then printf '%s\n' "$l"; fi
done

(sort -u replaces sort | uniq; note the output comes out sorted rather than in the original order.)
bu5hman
  • 4,756