Remove duplicate values within a field

Question

How can I remove duplicate values within a selected field of a file, keeping only one copy of each value?

Example

Input file:

A    1,2,3,45,1,8,2,3
B    5,6,6,6,6,6,2,3,7

Expected output:

A    1,2,3,45,8
B    5,6,2,3,7

score 5 · Answer 1 · answered Aug 05 '14 at 10:59

A sed one:

sed '
  s/[^[:blank:]]\{1,\}/,&,/g;:1
  s/\(\(,[^,[:blank:]]*\)\(,[^,[:blank:]]*\)*\)\2,/\1,/;t1
  s/,\([^[:blank:]]*\),/\1/g'

(This processes every field that contains a , character and preserves the original spacing.)

answered Aug 05 '14 at 10:59

Stéphane Chazelas

544,893

why didn't I notice this 17 hours ago? – mikeserv Aug 06 '14 at 03:32

Stéphane Chazelas · Answer 2 · 2014-08-05T11:03:43.510

3

With perl:

perl -MList::MoreUtils=uniq -pe 's{\S*,\S*}{join ",", uniq split ",", $&}ge'

(This processes every field that contains a , character and preserves the original spacing.)

edited Aug 05 '14 at 11:03

answered Aug 05 '14 at 10:30

Stéphane Chazelas

544,893

score 2 · Accepted Answer · answered Aug 05 '14 at 10:35

Another perl solution:

perl -anle '
    print "$F[0] ", join ",", grep {!$seen{$_}++} split ",",$F[1];
    %seen=();
' file
A 1,2,3,45,8
B 5,6,2,3,7
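The grep {!$seen{$_}++} idiom keeps a value only the first time it is seen. For anyone without perl at hand, here is a minimal awk sketch of the same idea. It assumes the two-column layout from the question, and the function name dedup_field2 is invented for this example:

```shell
# dedup_field2: hypothetical helper name, not from the answers above.
# Reads lines of the form "KEY v1,v2,..." on stdin and prints each line
# with repeated values in the second field removed (first occurrence wins).
dedup_field2() {
  awk '{
    n = split($2, vals, ",")
    split("", seen)              # portable way to empty the array for each line
    out = ""; sep = ""
    for (i = 1; i <= n; i++)
      if (!seen[vals[i]]++) {    # true only the first time a value appears
        out = out sep vals[i]
        sep = ","
      }
    print $1, out
  }'
}

printf '%s\n' 'A    1,2,3,45,1,8,2,3' 'B    5,6,6,6,6,6,2,3,7' | dedup_field2
```

Like the perl version, this joins the key and the values with a single space, so the original spacing is not preserved.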

score 2 · Answer 4 · edited Apr 13 '17 at 12:36

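I guess this is like Stéphane's, though it is a little different. Anyway, I took the trouble to write it. I based it on this thing I did before here (where I also explain it a lot better). Note that it keeps the last occurrence of each value rather than the first, so the order can differ from the expected output in the question.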

sed ':t
s/\([^,]*\),\(.*\1\)/ \2/;tt
s/  */,/g;s/,/ /;s/,$//' <<\DATA
A 1,2,3,45,1,8,2,3,
B 5,6,6,6,6,6,2,3,7
DATA

OUTPUT

A 45,1,8,2,3
B 5,6,2,3,7

edited Apr 13 '17 at 12:36

Community

1

answered Aug 06 '14 at 03:31

mikeserv

58,310

score 2 · Answer 5 · answered Aug 06 '14 at 05:23

awk '{n=split($2, a, ","); $2=a[1];
  for(i=2; i<=n; i++)
    {$2 = ($2 ~ "(^|,)" a[i] "($|,)") ? $2 : ($2 "," a[i])}}1' OFS='\t' file

answered Aug 06 '14 at 05:23

Srini

161

score 1 · Answer 6 · answered Aug 05 '14 at 14:49

For completeness, a solution in awk:

BEGIN {
    FS = "[ \t,]+";
    OFS = ",";
}

{
    $1 = $1;         #force $0 to be rebuilt with OFS even when no duplicate is found
    delete seen;
    for(i = 2; i <= NF; i++) {
        if($i in seen) {
            $i = "";
        }
        seen[$i] = 1;
    }
    sub(",","\t");   #separate first field with a tab
    gsub(",,+",","); #squeeze empty fields
    sub(",$","");    #remove trailing comma, if any
    print;
}
