Filter out inconsistent data blocks

Question

I would like to filter out rows that have inconsistent geno values within a line-part block. If a block has duplicate but consistent values, I want to keep only its first row. For example, 'gf345 part1' has more than one geno value, so the whole block should be deleted; 'gf345 part3' is repeated with the single geno value AT, so only its first row should be kept.

line part serial geno
ax211 part1 1234 AA
gf345  part1 1345 TT
gf345  part1  3456 AA
gf345  part1 1346 TT
ax211 part2 1834 AA
gf345  part2 1395 TT
gf345  part2  3656 AA
gf345  part2 13746 TT
ax211 part3 1634 AA
gf345  part3 13345 AT
gf345  part3  34256 AT
gf345  part3 13446 AT

Expected output

line part serial geno
ax211 part1 1234 AA
ax211 part2 1834 AA
ax211 part3 1634 AA
gf345  part3 13345 AT

This is what I have tried:

awk     'FNR==NR        {a[$1$2]+=$4;  b[$1$2]=$4;next}
                    $1$2 in b  {if (a[$1$2] ==1 ) print $0 }
        ' file file

Not going to write a whole awk script here, but the obvious error in your attempt is simple: you're using += operator on the string $4. That forces it to be evaluated numerically, which of course makes it into a 0. For counting unique values, check out http://unix.stackexchange.com/q/243530/135943 — Wildcard, Dec 22 '15 at 02:47
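One way to act on that comment, offered as a minimal sketch rather than a definitive fix: read the file twice, count the distinct geno values per line/part block on the first pass, and on the second pass print only the first row of each block that has exactly one. The function name filter_blocks and the array names seen, distinct, and printed are illustrative, and the sketch assumes the first line is a header, as in the sample.

```shell
# Sketch: two passes over the same file. Assumes whitespace-separated
# columns (line, part, serial, geno) and a header on the first line.
filter_blocks() {
    awk '
        FNR == 1 { if (NR != FNR) print; next }   # header: print once, on the second pass
        { key = $1 SUBSEP $2 }                    # block identifier: line + part
        NR == FNR {                               # pass 1: count distinct geno values per block
            if (!((key, $4) in seen)) { seen[key, $4] = 1; distinct[key]++ }
            next
        }
        distinct[key] == 1 && !printed[key]++     # pass 2: first row of each consistent block
    ' "$1" "$1"
}
```

Called as `filter_blocks file`, this keeps the surviving rows in their original order. Because the input is read twice, it needs a regular file rather than a pipe, and the arrays hold one entry per distinct block and geno value rather than one per row.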

score 1 · Answer 1 · answered Jan 12 '15 at 17:38

I think the easiest approach is to sort the input first. That avoids holding the data in an awk array, whose memory use would limit the size of the input file you can handle.

If sorting is not a problem, then this should work:

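sort file | awk '{
        # After sorting, rows of the same line/part block are adjacent.
        if (($1 FS $2) != key) {
                # New block: print the previous one if it stayed consistent.
                if (valid == 1)
                        print firstline;
                firstline=$0;
                key=$1 FS $2;
                value=$4;
                valid=1
        }
        else {
                # Same block: a differing geno value invalidates it.
                if ($4 != value)
                        valid = 0
        }
} END {
        # Flush the last block.
        if (valid == 1)
                print firstline
}'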
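Two details of the sample data matter with the sort-based approach: the header line gets sorted along with the data, and a plain sort orders the rows inside a block by the rest of the line, so the row that is kept may not be the first one in the file. A hedged variant that handles both is sketched below. It assumes the first line is a header and that sort supports the -s (stable) option, as GNU and BSD sort do; filter_sorted is a hypothetical name.

```shell
# Sketch: same sort-then-scan idea, with the header kept out of the sort.
filter_sorted() {
    {
        head -n 1 "$1"                              # pass the header through first
        tail -n +2 "$1" | sort -s -b -k1,1 -k2,2    # -s: rows within a block keep file order
    } | awk '
        NR == 1 { print; next }                     # header line
        { key = $1 SUBSEP $2 }
        key != prev {                               # a new block starts
            if (ok) print first                     # previous block was consistent
            prev = key; first = $0; geno = $4; ok = 1
            next
        }
        $4 != geno { ok = 0 }                       # conflicting geno value in this block
        END { if (ok) print first }                 # flush the last block
    '
}
```

Called as `filter_sorted file`, this prints the header followed by the surviving rows in sorted block order. Only the current block's first row is held in awk; the sort step itself still has to handle the whole file, which sort does with temporary files when the input is large.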
