How to remove duplicate items within the same line from a csv file?

Question

I have a csv file with ~4000 lines, each one containing between 2 and 30 names separated by commas. The names include titles (for example mr. X Adams or ms. Y Sanders). Some names appear multiple times within the same line, and I would like the repeats within a line removed. The data is in a file "input.csv", and the result should be written to another file, "output.csv".

Example, I have:

mr. 1,mr. 2,mr. 3,mr. 1,mr. 4
prof. x,prof. y,prof. x
mr. 1,prof y

which should become

mr. 1,mr. 2,mr. 3,mr. 4   (mr. 1 was already mentioned so it should be removed)
prof. x,prof. y           (prof. x was already mentioned so it should be removed)
mr. 1,prof y              (even though both were already mentioned in the same file, they were not mentioned within this line so they may remain)

@αғsнιη It's not a dupe of that question. That is much more liberal with matching, e.g. case-insensitive, Persian/Arabic. — Sparhawk, Oct 08 '18 at 11:32
@αғsнιη But it's clearly different in some cases. That question would treat Mr X and mR x as duplicates. This one would not. Also, the code is necessarily much more convoluted. — Sparhawk, Oct 08 '18 at 11:39
Possible duplicate of remove duplicated pattern/entries within each field in CSV file — Romeo Ninov, Oct 13 '18 at 05:01
A duplicated pattern/entries within each **field** is clearly not the same as duplicated field within each **row**. — , Oct 14 '18 at 14:29

score 0 · Answer 1 · answered Oct 08 '18 at 11:30

You can try:

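#!/bin/bash

# For each line: split on commas, keep the first occurrence of each name
# (awk, unlike sort -u, preserves the original order), then rejoin with commas.
while IFS= read -r line; do
    printf '%s\n' "$line" | tr , '\n' | awk '!seen[$0]++' | paste -sd, -
done < input.csv > output.csv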
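
The same per-line deduplication can probably also be done in a single awk process rather than one pipeline per line. This is only a minimal sketch: it assumes that no name contains a comma or CSV quoting, that matching is exact (case-sensitive), and the function name dedupe_fields is made up for illustration.

```shell
#!/bin/bash

# dedupe_fields (hypothetical name): reads lines on stdin and, within each line,
# keeps only the first occurrence of every comma-separated field.
# Assumption: no field contains a comma or quoting; comparison is exact.
dedupe_fields() {
    awk -F, '{
        split("", seen)                 # reset the set of seen names for each line
        out = ""
        n = 0
        for (i = 1; i <= NF; i++) {
            if (!($i in seen)) {        # first time this name appears on the line
                seen[$i] = 1
                out = out (n++ ? "," : "") $i
            }
        }
        print out
    }'
}

# Usage: dedupe_fields < input.csv > output.csv
```

Because the seen set is cleared at the start of every line, a name that appeared on an earlier line is still kept, which matches the third line of the example in the question.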

This will fail if any field contains a comma, like one,"some, text in one field",three, which many CSV files may contain. In short: do not parse CSV with text tools. – Oct 08 '18 at 11:43
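
To make that failure mode concrete, here is a small sketch that counts how many pieces a comma-splitting tool sees in the example line from the comment above. The helper name count_naive_fields is invented for illustration.

```shell
#!/bin/bash

# count_naive_fields (hypothetical helper): number of pieces a line yields
# when it is split on every comma, ignoring CSV quoting.
count_naive_fields() {
    printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# The line has three CSV fields, but the quoted one contains a comma.
count_naive_fields 'one,"some, text in one field",three'   # prints 4, not 3
```

For names without embedded commas, as in the question, this does not matter. For general CSV input, a CSV-aware parser is the safer choice.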
