I would like to filter out rows that have inconsistent geno values within each line-part block. When a block has duplicate but consistent values, I would keep only the first row. For example, 'gf345 part1' has more than one geno value, so the whole block should be deleted; 'gf345 part3' is repeated with a single geno value, AT, so only the first row should be kept.
line part serial geno
ax211 part1 1234 AA
gf345 part1 1345 TT
gf345 part1 3456 AA
gf345 part1 1346 TT
ax211 part2 1834 AA
gf345 part2 1395 TT
gf345 part2 3656 AA
gf345 part2 13746 TT
ax211 part3 1634 AA
gf345 part3 13345 AT
gf345 part3 34256 AT
gf345 part3 13446 AT
Expected output:
line part serial geno
ax211 part1 1234 AA
ax211 part2 1834 AA
ax211 part3 1634 AA
gf345 part3 13345 AT
This is what I have tried:
awk 'FNR==NR {a[$1$2]+=$4; b[$1$2]=$4;next}
$1$2 in b {if (a[$1$2] ==1 ) print $0 }
' file file
awk script here, but the obvious error in your attempt is simple: you're using the += operator on the string $4. That forces it to be evaluated numerically, which of course makes it into a 0. For counting unique values, check out http://unix.stackexchange.com/q/243530/135943 – Wildcard Dec 22 '15 at 02:47
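Following the hint about counting unique values, here is a minimal sketch of one possible two-pass approach. It is not taken from the linked question. It assumes whitespace-separated fields and a header on the first line, and the function name filter_consistent is hypothetical, chosen only for illustration. The first pass counts distinct geno values per line-part block; the second pass prints the header, then the first row of every block that has exactly one distinct value.

```shell
# Hypothetical helper (name is an assumption, not from the original post).
# Usage: filter_consistent file
filter_consistent() {
    awk '
        # Pass 1: count distinct geno values per (line, part) block,
        # skipping the header row.
        FNR == NR {
            if (FNR > 1 && !seen[$1, $2, $4]++) distinct[$1, $2]++
            next
        }
        # Pass 2: always keep the header row.
        FNR == 1 { print; next }
        # Keep only the first row of each block with a single geno value.
        distinct[$1, $2] == 1 && !printed[$1, $2]++
    ' "$1" "$1"
}
```

Calling filter_consistent file reads the file twice, just like the ' file file ' invocation in the attempt above, and keeps rows in their original order. Using $1, $2 as a comma-separated subscript rather than $1$2 avoids accidental key collisions when two different line/part pairs concatenate to the same string.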