
I have a comma-delimited file that has 28 fields/headers/attributes. Field #1 makes the record unique; however, the rest of the fields may be identical. I need to identify the dups and keep only one. I am OK with keeping the first occurrence rather than the second, if that is easier. Example:

INPUT FILE:

1,ed23,jon,doe,director,usa
2,ed23,jon,doe,director,usa
3,er67,jake,Kogan,director,usa
4,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india

DESIRED OUTPUT:

2,ed23,jon,doe,director,usa
4,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india
  • Please add your desired output (no description) for that sample input to your question (no comment). – Cyrus Aug 17 '19 at 17:09
  • GNU uniq? Take a look at option -f. – Cyrus Aug 17 '19 at 17:13
  • There aren't any commas in your comma-delimited file...? – Jeff Schaller Aug 17 '19 at 17:23
  • Sorry, I omitted the ','; my fault. The file is big: it contains 700 thousand records. The output I want is one record per group; arbitrarily leave one. I also need to see the count of the dups. Does that make sense? – daniel caceres Aug 17 '19 at 17:43
  • Per @Cyrus, uniq and its -f option can help with removing duplicates based on a specific key. Further, uniq and its -c option can help with counting duplicates. – steve Aug 17 '19 at 18:22
  • I tried but it did not work: because the ID is unique, it treats the entire record as unique. sort people.csv | uniq -d. Do you have an example with awk? It really needs to check columns 2-28; that's where the duplication happens. – daniel caceres Aug 17 '19 at 18:27
  • Start by providing a correct data sample. If it's CSV then please make it so. – Chris Davies Aug 17 '19 at 19:04

4 Answers


Your sample input is messed up - the 1st line (column headers) doesn't even have field-separator commas, and most lines don't have a comma between the lastname and grade fields.

To provide somewhat-sane input, I've edited it to look like this:

$ cat input.txt 
ID, uid  ,firstname ,lastname,   grade    , country n28
1 , ed23 , jon     ,   doe   ,  director  ,  usa
2 , ed23 ,  jon     ,  doe   ,  director     , usa
3 , er67 ,  jake     , Kogan ,  director     , usa
4 , er67 ,  jake     , Kogan ,  director     , usa
5 , dc10 ,  Charls     ,Morg ,  manager      , usa
6 , kc56 ,  patel     ,Kumar ,  associate    , india

A simple implementation, just eliminating the dupes, would be something like this:

$ awk -F' *, *' -v OFS=, \
    'NR==1 {$1=$1;$0=$0; print; next};
     {id=$1; $1=""; $0=$0; if (!seen[$0]++) {print id $0}}' input.txt 
ID,uid,firstname,lastname,grade,country n28
1,ed23,jon,doe,director,usa
3,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india

This sets the input field separator (FS) to zero-or-more spaces followed by a comma followed by zero-or-more spaces, and the output field separator (OFS) to just a comma; i.e. it effectively strips the leading and trailing whitespace from all fields.

For the first input line (NR==1), it uses an awk trick to reformat the line: assigning to any field (even setting it to its original value) forces awk to rebuild $0 using the new OFS, and $0=$0 then re-splits the fields. It then prints the line and skips to the next one.

For the remaining input, it stores $1 in a variable called id, sets $1 to the empty string and then uses the $0=$0 trick again (effectively deleting $1 from the line) before printing the id and the remainder of the line.
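As a standalone illustration of both points (the echo line below is made up, not taken from the question's data), you can watch the whitespace-stripping FS and the $1=$1 / $0=$0 trick in action:

$ echo 'a , b , c' | awk -F' *, *' -v OFS=, '
    { $1 = $1              # assigning to a field rebuilds $0 with OFS
      print                # -> a,b,c
      $1 = ""; $0 = $0     # blank the first field, then re-split the rebuilt $0
      print }'             # -> ,b,c
a,b,c
,b,c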

Unlike your sample output, this prints the first of any duplicate lines, not the last - it's very easy to detect the first time you've seen something but harder to detect the last time you're going to see it (you won't know until you've read all the input). Also, this does not count the number of times a duplicate has been seen.

To do both of those things requires reading the entire input file before producing any output, and a second array (ids) to keep track of the last-seen id numbers of the duplicates - using twice as much memory, which may be significant with 700K input lines.

$ awk -F' *, *' -v OFS=, \
   'NR==1 {$1=$1;$0=$0",count";print;next};
   {id=$1; $1=""; $0=$0; seen[$0]++; ids[$0]=id};
   END { for (id in ids) {print ids[id] id, seen[id]} }' input.txt  | \
 sort -n
ID,uid,firstname,lastname,grade,country n28,count
2,ed23,jon,doe,director,usa,2
4,er67,jake,Kogan,director,usa,2
5,dc10,Charls,Morg,manager,usa,1
6,kc56,patel,Kumar,associate,india,1

sort -n is used here because associative arrays in awk are unordered, so the lines come out in semi-random order. GNU awk has an asort() function which can sort arrays by value and which could be used on the ids array here, but a) it's not portable and b) it's easy to just pipe the output to sort -n.
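For what it's worth, if you do want to stay inside GNU awk, a non-portable sketch of just the END block (my own variation, using gawk's PROCINFO["sorted_in"] from gawk 4.0+ rather than asort()) could control the iteration order directly:

   END { PROCINFO["sorted_in"] = "@val_num_asc";    # gawk-only: iterate ids[] by value, numerically ascending
         for (id in ids) {print ids[id] id, seen[id]} }

The header from the NR==1 rule is printed before END runs, so it still comes out first, and the trailing | sort -n can then be dropped.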

cas
  • Great explanation. I did mess up the copy-paste; the file is clean: it is comma-separated and does not contain the headers. I included them just for information purposes, but I guess I complicated things. There are no white spaces between fields either. I apologize, I did not make it easy. – daniel caceres Aug 18 '19 at 12:05
  • if there is no header line in the input, just delete the NR==1 line in the scripts and they will work whether there's extra white-space in the fields or not. – cas Aug 18 '19 at 12:08
  • Should I also remove the ' , ' since there are no white spaces? Let me know, and thanks again for the solution. – daniel caceres Aug 18 '19 at 12:38
  • if you want, replace it with just a comma. but it won't make any difference, that regexp will work whether there is any leading/trailing spaces or not. but why ask when you can try it yourself and see what the result will be? my answer is meant to be experimented with and tweaked and modified. it's meant to teach a general technique, not just to provide one magic answer. – cas Aug 18 '19 at 12:39

For a clean, comma-separated input, an awk script like the following might work for you:

awk -F, '{X=""; for (i=2;i<29;i++) X=X " " $i;} \
     seen[X]!=1 {print;} \
     {seen[X]=1;}' < input

The first awk rule builds up a "key" by cherry-picking fields 2 to 28 of the input (where, by the -F, argument, anything separated by commas is a field). The next rule prints the line unless the "key" has already been registered, and the third rule then registers the key for the line.
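The second and third rules together amount to the common !seen[key]++ awk idiom, so a condensed but equivalent sketch of the same approach (same assumptions: clean comma-separated input, no header line) would be:

awk -F, '{key=""; for (i=2;i<=28;i++) key=key FS $i} !seen[key]++' < input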

  • Great explanation. I did mess up the copy-paste; the file is clean: it is comma-separated and does not contain the headers. I included them just for information purposes, but I guess I complicated things. There are no white spaces between fields either. I apologize, I did not make it easy. – daniel caceres Aug 18 '19 at 12:06
  • Ok; I revised the command line to suit "the correct" input format. – Ralph Rönnquist Aug 18 '19 at 22:24

I will assume that the file is in a "simple CSV" format, meaning there are no embedded commas or embedded newlines in the data.

$ tac file | awk -F , '{ key = $0; sub("[^,]*,", "", key) } !seen[key]++' | tac
2,ed23,jon,doe,director,usa
4,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india

The awk code in the middle of the pipeline above will create a string used as a key in a hash from all fields of each line except for the first. It will print the first occurrence of a line with a specific key and ignore all duplicates.
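As a throwaway illustration (the echo line is mine, not part of the pipeline above), the sub() call simply deletes everything up to and including the first comma, so the key for the first sample line is:

$ echo '1,ed23,jon,doe,director,usa' | awk -F, '{ key = $0; sub("[^,]*,", "", key); print key }'
ed23,jon,doe,director,usa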

Since you appear to want to get the last duplicate, I reverse the order of the lines in the input with tac (from GNU coreutils) before feeding them into the awk program. I then reverse the output of the awk program.

This approach has the downside that the keys calculated will use as much memory as the combined size of all unique lines, minus the first field.

The following is a more memory-efficient approach, but it assumes that the input is sorted so that duplicate lines always occur together:

$ tac file | awk -F , '{ key = $0; sub("[^,]*,", "", key) } key != prev; { prev = key }' | tac
2,ed23,jon,doe,director,usa
4,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india
Kusalananda
  • this works too: tac input.txt | sort -t, -k2 -u | sort -n. or, to include the count of duplicates: sort -t, -k2 input.txt | tac | sed -e 's/,/\t/' | uniq -c -f1 | sed -e 's/\t/,/' | sort -n -k2. it's a shame uniq doesn't have a -t option like sort – cas Aug 19 '19 at 08:49
  • @cas Yes, that would be a slightly simpler solution assuming that the lines are allowed to be reordered. – Kusalananda Aug 19 '19 at 08:52
  • Fantastic. thanks a lot. it makes sense. clever. – daniel caceres Aug 19 '19 at 17:58
  • @danielcaceres You may want to start accepting the most helpful answers to your questions. I'm noticing that you haven't done that on any of the questions that you've asked so far. This would also make people more inclined to help you in the future. See https://unix.stackexchange.com/help/someone-answers – Kusalananda Aug 19 '19 at 19:29

Detailing the uniq-approach from the comments above:

$ tr ',' '\t' < temp/testfile | uniq -f 1 | tr '\t' ','
1,ed23,jon,doe,director,usa
3,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india

This uses \t as the delimiter to avoid choking on spaces in your data.

uniq will keep the first of the unique lines found. If you absolutely need to keep the 'last' entries, you need to work from the end to the start of your file. You can do that using tac:

$ tac temp/testfile | tr ',' '\t' | uniq -f 1 | tr '\t' ',' | tac
2,ed23,jon,doe,director,usa
4,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india
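If you also need the duplicate counts mentioned in the question's comments, a possible variation along the same lines (a sketch, not part of the original answer) adds uniq's -c option, which prefixes each kept line with the number of lines in its group, so the result is no longer plain CSV:

$ tac temp/testfile | tr ',' '\t' | uniq -c -f 1 | tr '\t' ',' | tac

The count column would then need to be stripped or moved if you want to keep the comma-separated layout.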

markgraf