Your sample input is messed up - the 1st line (column headers) doesn't even have field-separator commas, and most lines don't have a comma between the lastname and grade fields.
To provide somewhat-sane input, I've edited it to look like this:
$ cat input.txt
ID, uid ,firstname ,lastname, grade , country n28
1 , ed23 , jon , doe , director , usa
2 , ed23 , jon , doe , director , usa
3 , er67 , jake , Kogan , director , usa
4 , er67 , jake , Kogan , director , usa
5 , dc10 , Charls ,Morg , manager , usa
6 , kc56 , patel ,Kumar , associate , india
A simple implementation, just eliminating the dupes, would be something like this:
$ awk -F' *, *' -v OFS=, \
'NR==1 {$1=$1;$0=$0; print; next};
{id=$1; $1=""; $0=$0; if (!seen[$0]++) {print id $0}}' input.txt
ID,uid,firstname,lastname,grade,country n28
1,ed23,jon,doe,director,usa
3,er67,jake,Kogan,director,usa
5,dc10,Charls,Morg,manager,usa
6,kc56,patel,Kumar,associate,india
This sets the input field separator (FS) to zero-or-more spaces followed by a comma then zero-or-more spaces, and the output field separator (OFS) to just a comma; i.e. it effectively strips the leading and trailing white space from all fields.
For the first input line (NR==1), it uses an awk trick to reformat the line: assigning to any field (even just setting it back to its original value, as $1=$1 does here) forces awk to rebuild the record using the new OFS, and $0=$0 then re-splits it into fields. Then it prints the line and skips to the next one.
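To see the trick in isolation (a throwaway one-liner, not part of the solution), run it on a single hand-typed line - assigning $1=$1 is enough to make awk rebuild the record with the new OFS:
$ echo 'a , b , c' | awk -F' *, *' -v OFS=, '{$1=$1; print}'
a,b,c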
For the remaining input, it stores $1 in a variable called id, sets $1 to the empty string, and then uses the $0=$0 trick again (effectively deleting $1 from the line) before printing the id and the remainder of the line.
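If you want to see exactly what ends up being used as the seen[] key (again, just a throwaway debugging one-liner), print the stripped record alongside the saved id for, say, the second input line:
$ awk -F' *, *' -v OFS=, 'NR==2 {id=$1; $1=""; $0=$0; print "id=" id " key=" $0}' input.txt
id=1 key=,ed23,jon,doe,director,usa
The leading comma is the now-empty $1; since it's identical for every record it doesn't affect the duplicate detection.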
Unlike your sample output, this prints the first of any duplicate lines, not the last - it's very easy to detect the first time you've seen something but harder to detect the last time you're going to see it (you won't know until you've read all the input). Also, this does not count the number of times a duplicate has been seen.
To do both of those things requires reading the entire input file before producing any output, and a second array (ids) to keep track of the last-seen id numbers of the duplicates - using twice as much memory, which may be significant with 700K input lines.
$ awk -F' *, *' -v OFS=, \
'NR==1 {$1=$1;$0=$0",count";print;next};
{id=$1; $1=""; $0=$0; seen[$0]++; ids[$0]=id};
END { for (id in ids) {print ids[id] id, seen[id]} }' input.txt | \
sort -n
ID,uid,firstname,lastname,grade,country n28,count
2,ed23,jon,doe,director,usa,2
4,er67,jake,Kogan,director,usa,2
5,dc10,Charls,Morg,manager,usa,1
6,kc56,patel,Kumar,associate,india,1
sort -n is used here because associative arrays in awk are unordered, so they come out in semi-random order. GNU awk has an asort() function which can sort arrays by value, which could be used on the ids array here, but a) it's not portable and b) it's easy to just pipe the output to sort -n.
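For completeness, if you do want to keep everything inside GNU awk and skip the external sort, a gawk-only sketch (using PROCINFO["sorted_in"] to control the for-in traversal order, rather than asort()) would be:
$ gawk -F' *, *' -v OFS=, \
'NR==1 {$1=$1;$0=$0",count";print;next};
{id=$1; $1=""; $0=$0; seen[$0]++; ids[$0]=id};
END { PROCINFO["sorted_in"] = "@val_num_asc";   # iterate ids in numeric order of its values
      for (line in ids) {print ids[line] line, seen[line]} }' input.txt
As a bonus the header line no longer has to survive a trip through sort -n, but this only works with GNU awk.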
uniq and its -f option can help with removing duplicates based on a specific key. Further, uniq and its -c option can help with counting duplicates. – steve Aug 17 '19 at 18:22
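For what it's worth, a rough sketch of the uniq approach from the comments (my own guess at what was meant; it assumes none of the fields contain embedded blanks, since uniq -f skips whitespace-separated fields) would be something like:
$ tail -n +2 input.txt | sed 's/ *, */ /g' | sort -k2 | uniq -c -f 1
      1 5 dc10 Charls Morg manager usa
      2 1 ed23 jon doe director usa
      2 3 er67 jake Kogan director usa
      1 6 kc56 patel Kumar associate india
It keeps the first line of each duplicate group and prefixes it with a count, but it drops the header and loses both the comma formatting and the ID ordering, so it isn't a drop-in replacement for the awk version.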