Removing Duplicates from a CSV based on specified columns

Question

I am working with a CSV data set which looks like the below:

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK- DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

It contains a number of duplicates that I need to remove (keeping one instance of the record). Based on Remove duplicate entries from a CSV file I have used sort -u file.csv --o deduped-file.csv which works well for examples like

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,

but does not capture examples like

2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,,

Where the data is incomplete, but is a representation of the same thing.

Is it possible to remove duplicates based on specified fields e.g year, manufacturer, brand, series, variation?

Does your CSV file actually have those line breaks in it? The ones for "Darren Smith"'s entries? If not, then don't post it like that - it's a red herring distracting potential answerers. If it does, then it's broken - CSV requires multi-line fields to be quoted. — cas, Dec 11 '21 at 13:29
BTW, your requirements are moving towards the edge of what's possible with simple line-oriented tools. You probably need to use a tool like GNU datamash, Miller aka mlr, or csvkit. Or use a language with CSV libraries like perl (Text::CSV module) or python (csv library). — cas, Dec 11 '21 at 13:36
@cas thanks for the comment. I've corrected the CSV structure in the question, the line breaks are not present in the data. Thanks for the links - I'll take a look — Stuart Brown, Dec 11 '21 at 13:39
If your data is "simple" CSV (i.e. no quoted delimiters within fields, like "John Amoth, Jr", at least not up to and including the fields that you wish to use as the deduplication key) then you can likely use something as simple as awk -F, '!seen[$1, $2, $3, $4, $5]++' file.csv. See How does awk '!a[$0]++' work? — steeldriver, Dec 11 '21 at 14:07

glenn jackman · Accepted Answer · 2021-12-11T14:19:02.393

6

I would create a "key" of the first 5 fields, and then only print a line if that key is being seen for the first time:

awk -F, '
  {key = $1 FS $2 FS $3 FS $4 FS $5}
  !seen[key]++ 
' file

year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,

edited Dec 11 '21 at 14:19

answered Dec 11 '21 at 14:10

glenn jackman

85,964

That's great, works perfectly. For my own knowledge, if I wanted to match on the third and fifth fields would I just modify e.g. {key = FS $3 FS $5}? – Stuart Brown Dec 11 '21 at 14:44
Yes. You could omit the leading FS – glenn jackman Dec 11 '21 at 17:34

score 3 · Answer 2 · answered Dec 11 '21 at 14:04

Decorating first with awk to retain your header line at the top during the sort then using any awk+sort+cut:

$ awk -v OFS=',' '{print (NR>1), $0}' file | sort -u -t, -k1,6 | cut -d, -f2-
year,manufacturer,brand,series,variation,card_number,card_title,sport,team
2015,Leaf,Army All-American Bowl,5-Star Future Autographs,,FSF-RG1,Rasheem Green,Soccer,
2015,Leaf,Metal Draft,Touchdown Kings,Die-Cut Autographs Blue Prismatic,TDK-DF1,Darren Smith,Football,
2015,Leaf,Trinity,Patch Autograph,Bronze,PA-DJ2,Duke Johnson,Football,
2015,Leaf,Trinity,Printing Plates,Magenta,TS-JH2,John Amoth,Soccer,

score 1 · Answer 3 · answered Dec 12 '21 at 11:33

Non-awk solution: one may simple use sort, define fields 1 to 5 as comparision -keys, use , as field separator (-t) and select -unique entries only:

 sort -t, -k1,5 -u infile

However this will make the header line last. Use either -r for reversed sorting or separate the header line via e.g.

{ sed 1q infile ; sed 1d infile | sort -k1,5 -t, -u ; }

Removing Duplicates from a CSV based on specified columns

3 Answers3

Linked