How to remove duplicate values (two or more occurrences) within a selected field of a file, keeping only one copy?
Example
Input file:
A 1,2,3,45,1,8,2,3
B 5,6,6,6,6,6,2,3,7
Expected output:
A 1,2,3,45,8
B 5,6,2,3,7
A sed one:
sed '
s/[^[:blank:]]\{1,\}/,&,/g;:1
s/\(\(,[^,[:blank:]]*\)\(,[^,[:blank:]]*\)*\)\2,/\1,/;t1
s/,\([^[:blank:]]*\),/\1/g'
(It processes every field that contains a , character and preserves spacing.)
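A quick way to check that behaviour is to feed the script a line with irregular spacing and two comma-separated fields. This is only a sketch: the wrapper name dedup_comma_fields and the sample line are made up for illustration, and the sed script inside is the one above, unchanged.

```shell
# Hypothetical wrapper around the sed script above (the name is illustrative).
dedup_comma_fields() {
  sed '
s/[^[:blank:]]\{1,\}/,&,/g;:1
s/\(\(,[^,[:blank:]]*\)\(,[^,[:blank:]]*\)*\)\2,/\1,/;t1
s/,\([^[:blank:]]*\),/\1/g'
}

# Two comma-separated fields, separated by runs of spaces.
printf '%s\n' 'A  1,2,1   x,y,x' | dedup_comma_fields
```

Both comma-separated fields are deduplicated independently, and the runs of spaces between fields are left as they were.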
With perl:
perl -MList::MoreUtils=uniq -pe 's{\S*,\S*}{join ",", uniq split ",", $&}ge'
(It processes every field that contains a , character and preserves spacing.)
Another perl solution:
perl -anle '
print "$F[0] ", join ",", grep {!$seen{$_}++} split ",",$F[1];
%seen=();
' file
A 1,2,3,45,8
B 5,6,2,3,7
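I guess this is like Stephane's, though it is a little different. Anyway, I took the trouble to write it. I based it on something I did before here (where I also explain it a lot better). Unlike the expected output above, it keeps the last occurrence of each value rather than the first...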
sed ':t
s/\([^,]*\),\(.*\1\)/ \2/;tt
s/  */,/g;s/,/ /;s/,$//' <<\DATA
A 1,2,3,45,1,8,2,3,
B 5,6,6,6,6,6,2,3,7
DATA
A 45,1,8,2,3
B 5,6,2,3,7
awk '{n=split($2, a, ","); $2=a[1];
for(i=2; i<=n; i++)
{$2 = ($2 ~ "(^|,)" a[i] "($|,)") ? $2 : ($2 "," a[i])}}1' OFS='\t' file
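The one-liner above builds a regular expression from each value, so a value containing regex metacharacters (for example a.b) may match more than itself. A minimal sketch that compares values as plain strings and takes the field number as a parameter could look like this; the function name dedup_field and the col variable are assumptions made for the example, not part of the answer above.

```shell
# Hypothetical helper: dedup_field COL [FILE...]
# Removes repeated values from comma-separated field COL, keeping the
# first occurrence of each value.
dedup_field() {
  col=$1; shift
  awk -v col="$col" '
    {
      n = split($col, parts, ",")
      split("", seen)                # portable way to empty the array per line
      out = ""; sep = ""
      for (i = 1; i <= n; i++) {
        if (!(parts[i] in seen)) {   # exact string lookup, not a regex match
          seen[parts[i]] = 1
          out = out sep parts[i]
          sep = ","
        }
      }
      $col = out                     # assigning a field rebuilds $0 using OFS
      print
    }' "$@"
}

printf '%s\n' 'A 1,2,3,45,1,8,2,3' 'B 5,6,6,6,6,6,2,3,7' | dedup_field 2
```

With the default OFS the fields are joined by a single space, so this prints A 1,2,3,45,8 and B 5,6,2,3,7. Unlike the sed and first perl answers, original spacing between fields is not preserved.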
For completeness, a solution in awk:
BEGIN {
    FS = "[ \t,]+";
    OFS = ",";
}
{
    delete seen;
    for(i = 2; i <= NF; i++) {
        if($i in seen) {
            $i = "";
        }
        seen[$i] = 1;
    }
    $1 = $1; #force $0 to be rebuilt with OFS even if no field was blanked
    sub(",","\t"); #separate first field with a tab
    gsub(",,+",","); #squeeze empty fields
    sub(",$",""); #remove trailing comma, if any
    print;
}