I have a CSV file with ~4000 lines, each containing between 2 and 30 names separated by commas. The names include titles (for example mr. X Adams or ms. Y Sanders). Some names appear multiple times within the same line, and I would like the duplicates within each line removed. The input is in a file "input.csv", and the result should be written to another file, "output.csv".

For example, I have:

mr. 1,mr. 2,mr. 3,mr. 1,mr. 4
prof. x,prof. y,prof. x
mr. 1,prof y

which should become:

mr. 1,mr. 2,mr. 3,mr. 4   (mr. 1 was already mentioned so it should be removed)
prof. x,prof. y           (prof. x was already mentioned so it should be removed)
mr. 1,prof y              (even though both were already mentioned in the same file, they were not mentioned within this line so they may remain)
Jeff Schaller

1 Answer

You can try:

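#!/bin/bash

# Read input.csv line by line; write each de-duplicated line to output.csv.
while IFS= read -r line; do
    # Split on commas, keep only the first occurrence of each name
    # (awk preserves the original order, unlike sort -u), then rejoin.
    printf '%s\n' "$line" | tr ',' '\n' | awk '!seen[$0]++' | paste -sd, -
done < input.csv > output.csv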
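The loop above starts several processes for every input line, which adds up over ~4000 lines. As a rough alternative sketch, the same per-line de-duplication can probably be done in a single awk process. The function name dedupe_line_fields is made up for illustration, and the sketch assumes that no name contains a comma and that names must match exactly (including spaces) to count as duplicates.

```shell
#!/bin/sh
# dedupe_line_fields: hypothetical helper name, not part of the original answer.
# Reads comma-separated lines on stdin and prints each line with repeated
# fields removed, keeping the first occurrence and the original order.
# Assumption: fields never contain a comma themselves (no quoted CSV fields).
dedupe_line_fields() {
    awk -F',' '{
        split("", seen)                  # clear the lookup table for this line
        n = 0
        out = ""
        for (i = 1; i <= NF; i++) {
            if (!($i in seen)) {         # first time this field appears on the line
                seen[$i] = 1
                out = (n++ ? out "," : "") $i
            }
        }
        print out
    }'
}
```

With the file names from the question, it could be called as: dedupe_line_fields < input.csv > output.csv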
  • This will fail if any field contains a comma, as in one,"some, text in one field",three, which many CSV files do. In short: do not parse CSV with text tools. – Oct 08 '18 at 11:43
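To make the comment's point concrete, here is a small sketch showing how comma-based splitting miscounts a quoted field. The helper name count_naive_fields is invented for this illustration. The sample line is the one from the comment, which holds three CSV fields.

```shell
#!/bin/sh
# count_naive_fields: hypothetical helper, only for illustration.
# Counts how many pieces a line is cut into when every comma is treated
# as a separator, which is what tr , '\n' does.
count_naive_fields() {
    printf '%s\n' "$1" | tr ',' '\n' | wc -l | tr -d ' '
}

# The quoted field is cut in two at its inner comma.
count_naive_fields 'one,"some, text in one field",three'   # prints 4, not 3
```

The names in the question do not appear to contain commas, so the simple approach may be enough for that input. If quoted fields can occur, a CSV-aware parser is the safer choice.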