
I've got a CSV file with duplicate data, i.e. the same lines printed twice. I've tried sort myfile.csv | uniq -u, however there is no change in myfile.csv; I've also tried sudo sort myfile.csv | uniq -u, but no difference.

So currently my CSV file looks like this:

a
a
a
b
b
c
c
c
c
c

I would like it to look like this:

a
b
c
3kstc

4 Answers


The reason myfile.csv is not changing is that the -u option for uniq only prints unique lines. In this file, every line is duplicated, so none of them are printed.

However, more importantly, the output is not saved to myfile.csv anyway, because uniq just prints it to stdout (by default, your terminal).

You would need to do something like this:

$ sort -u myfile.csv -o myfile.csv

The options mean:

  • -u - keep only unique lines
  • -o - output to this file instead of stdout

See man sort for more information.
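To see the command end to end, here is a minimal sketch; the /tmp path and sample contents are made up for illustration:

```shell
# Build a scratch copy of the question's sample data (illustrative path).
printf 'a\na\na\nb\nb\nc\nc\nc\nc\nc\n' > /tmp/myfile.csv

# -u drops duplicate lines; -o lets sort write back to the input file safely,
# because sort reads all input before it starts writing.
sort -u /tmp/myfile.csv -o /tmp/myfile.csv

cat /tmp/myfile.csv
# a
# b
# c
```

Note that plain shell redirection (sort -u /tmp/myfile.csv > /tmp/myfile.csv) would not work here: the shell truncates the file before sort reads it, which is why -o is the safer choice for in-place output.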


If you want to maintain the order of the file (not sorted) but still remove duplicates, you can also do this:

awk '!v[$1]++' /tmp/file

For example

d
d
a
a
b
b
c
c
c
c
c

It will output

d
a
b
c
Rui F Ribeiro
  • Could you please expand on the syntax? – Sopalajo de Arrierez Apr 27 '18 at 23:55
  • 2
    Place the string in a hash. If the string does NOT exist in the hash then print. – NinjaGaiden Apr 17 '19 at 16:52
  • As far as I see, this is the best answer. How it works:
    1. Awk starts and creates an empty array v.
    2. Awk reads each line from the file /tmp/file.
    3. For each line Awk extracts the first field ($1) (or $0 to full line).
    4. Awk checks if this field has already been seen by looking in the array v.
    5. New value: if the field hasn't been seen before (!v[$1] is true), Awk increments the counter for that field (v[$1]++) and prints the line.
    6. Duplicate value: if the field has already been seen (!v[$1] is false), Awk still increments the counter but doesn't print the line.
    – Charles Santos Jan 30 '24 at 18:27
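The steps in the comment above can be written out long-hand. This is a sketch equivalent to the !v[$1]++ one-liner; the file path and sample data are illustrative:

```shell
# Build sample input (illustrative path).
printf 'd\nd\na\na\nb\nc\n' > /tmp/file

awk '{
    if (v[$1] == 0) {   # field not seen yet
        print           # print the whole line
    }
    v[$1]++             # count every occurrence, seen or not
}' /tmp/file
# d
# a
# b
# c
```

The one-liner compresses this: !v[$1]++ tests the counter before incrementing it, and awk's default action for a true pattern is to print the line.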

As Belmin showed, sort is great. His answer is best for unsorted data, and it's easy to remember and use.

However, it is also destructive of ordering, as it changes the order of the input. If you absolutely need the data to stay in its original order, with later duplicates removed, awk may be better.

$ cat myfile.csv
c
a
c
b
b
a
c


$ awk '{if (!($0 in x)) {print $0; x[$0]=1} }' myfile.csv
c
a
b

Weird edge case, but it does come up from time to time.

Also, if your data is already sorted when you are poking at it, you can just run uniq.

$ cat myfile.csv 
a
a
a
b
b
c
c
c
c
c


$ uniq myfile.csv 
a
b
c

The drawback to both of my suggestions is that you need to write to a temporary file and copy that back in.
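That temporary-file step might look like this; the paths are illustrative, and the shorter awk '!seen[$0]++' idiom stands in for the longer awk program above:

```shell
# Sample unsorted input (illustrative path).
printf 'c\na\nc\nb\nb\na\nc\n' > /tmp/myfile.csv

# Write deduplicated output to a temp file, then move it over the original.
# Redirecting straight back into myfile.csv would truncate it before awk reads it.
awk '!seen[$0]++' /tmp/myfile.csv > /tmp/myfile.csv.tmp \
    && mv /tmp/myfile.csv.tmp /tmp/myfile.csv

cat /tmp/myfile.csv
# c
# a
# b
```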

slm

uniq -u only prints unique lines. Your input has no unique lines, so uniq -u prints nothing. You only need sort:

sort -u myfile.csv
cuonglm