I am working with a CSV data set which looks like this:
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
It contains a number of duplicates that I need to remove, keeping one instance of each record. Based on "Remove duplicate entries from a CSV file", I have used sort -u file.csv --o deduped-file.csv, which works well for exact duplicates like
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
but does not capture examples like
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
where the data is incomplete but represents the same record.
Is it possible to remove duplicates based on specified fields, e.g. year, manufacturer, brand, series, variation?
"John Amoth, Jr", at least not up to and including the fields that you wish to use as the deduplication key) then you can likely use something as simple as awk -F, '!seen[$1, $2, $3, $4, $5]++' file.csv. See How does awk '!a[$0]++' work? – steeldriver Dec 11 '21 at 14:07
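To illustrate the awk suggestion in the comment above, here is a minimal sketch run against the sample rows from the question. It assumes the file is simple CSV with no quoted fields containing commas in the first five columns. The NR == 1 test that passes the header line through is my own addition, not part of the original comment, and the file names file.csv and deduped-file.csv are simply the ones used in the question.

```shell
#!/bin/sh
# Recreate the sample data from the question (header plus seven records).
cat > file.csv <<'EOF'
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
EOF

# -F, splits each line on commas.
# NR == 1 always prints the header line (assumption: the file has one).
# seen[$1, $2, $3, $4, $5]++ counts how often each five-field key has occurred;
# the negation is true only the first time, so only the first record per key is printed.
awk -F, 'NR == 1 || !seen[$1, $2, $3, $4, $5]++' file.csv > deduped-file.csv

cat deduped-file.csv
```

Note that this keeps whichever record appears first for each key, so if an incomplete row (for example one with an empty sport field) precedes the complete one, the incomplete row is the one that survives. Unlike sort -u, it also preserves the original line order.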