227

A huge (up to 2 GiB) text file of mine contains about 100 exact duplicates of every line in it (the duplicates are useless in my case, as the file is a CSV-like data table).

What I need is to remove all the repetitions while (preferably, but this can be sacrificed for a significant performance boost) maintaining the original sequence order. In the result, each line should be unique: if a line originally occurred 100 times (usually the duplicates are spread across the file and won't be neighbours), only one copy of it should remain.

I have written a program in Scala (consider it Java if you don't know Scala) to implement this. But maybe there are faster native tools written in C that can do this?

UPDATE: the awk '!seen[$0]++' filename solution seemed to work just fine for me as long as the files were around 2 GiB or smaller, but now that I need to clean up an 8 GiB file it doesn't work any more. It seems to take forever on a Mac with 4 GiB RAM, and a 64-bit Windows 7 PC with 4 GiB RAM and 6 GiB swap simply runs out of memory. And I don't feel enthusiastic about trying it on Linux with 4 GiB RAM given this experience.

Ivan
  • 17,708
  • this will destroy your ordering but, have you tried sort -u? I have no idea how or if it can run on such a massive file – squareborg Jan 27 '12 at 15:57
  • 7
    C is often not significantly faster than Java, and if you're running it (in-order) now, there's a fair chance it'll finish before you get an answer here, implement it, and it finishes running; out of order, sort -u will probably be faster. – Kevin Jan 27 '12 at 15:59

12 Answers

345

An awk solution seen on #bash (Freenode):

awk '!seen[$0]++' filename

If you want to edit the file in-place, you can use the following command (provided that you use a GNU awk version that implements this extension):

awk -i inplace '!seen[$0]++' filename
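
For a quick illustration of how the expression behaves, with a made-up sample:

printf 'a\nb\na\nc\nb\n' | awk '!seen[$0]++'

prints a, b, c: seen[$0] is zero the first time a line is read, so !seen[$0]++ is true and the line is printed; the post-increment then marks the line as seen, so later copies are skipped.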
AdminBee
  • 22,803
enzotib
  • 51,661
  • 2
    Just tried this on a 2G file and it took three minutes on my notebook. Not bad. I also tried uniq filename | awk '!seen[$0]++', but it wasn't any faster. – mgjk Jan 27 '12 at 19:27
  • This is surprisingly faster than a more verbose awk version using 2 array lookups (shown as an expanded explanation in Gilles' answer): 0m36.132s vs 0m49.958s for 50 million lines. I thought the bottleneck would be the I/O, but the extra array lookup is... 1 million elements in the array seems to make a rather significant dent... – Peter.O Jan 28 '12 at 05:53
  • But how does that compare to sort -u .... ? – HashWizard May 13 '17 at 21:38
  • 1
    @HashWizard: this command does not sort, but eliminates every next occurrence of the same line – enzotib May 14 '17 at 15:51
  • 2
    Wondering how this command works? -- See here: https://unix.stackexchange.com/questions/159695/how-does-awk-a0-work – supergra Oct 24 '17 at 19:13
  • How do you replace the file with this text? – David G Oct 24 '17 at 22:59
  • Will this work if the duplicates are randomly distributed, rather than next to each other? – Max Williams Jan 04 '18 at 11:15
  • @0x499602D2 To replace the file you would redirect the output back into the file: awk '!seen[$0]++' filename > filename. This is probably a bad idea though; you probably want to do awk '!seen[$0]++' filename > newfilename to make sure it works correctly before deleting the old one. – setholopolus Jan 19 '18 at 19:57
  • 2
    @MaxWilliams yes, it works if they are randomly distributed. – setholopolus Jan 19 '18 at 19:58
  • In case you wanted to retain the last occurrence of the repeated line: tac filename | awk '!seen[$0]++' | tac > newfilename - but maybe awk could do it without tac? – Lucas Walter Sep 11 '19 at 15:10
  • 1
    To keep blank or whitespace-only lines even when they repeat: awk '/^\s*?$/||!seen[$0]++' – James O'Brien Mar 13 '20 at 00:46
  • What is this charm !?!?! – Nam G VU Mar 31 '20 at 10:15
  • Can we also do: awk '!seen[$0]++' filename > filename? – alper Feb 13 '22 at 23:47
  • On macOS: "awk: can't open file !seen[$0]++ source line number 1" – JeremyF Mar 14 '22 at 15:29
  • FYI for others, the in-place option does NOT work on Ubuntu. – Raleigh L. Apr 20 '23 at 06:45
  • @Raleigh, I think the default version of awk in Ubuntu is mawk, not GNU awk, and the answer expressly talks about GNU awk when reporting the --in-place option. – enzotib Apr 20 '23 at 10:34
56
sort -u big-csv-file.csv > duplicates-removed.csv

Note that the output file will be sorted.
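
To see the difference in ordering compared with the awk approach, here is a tiny made-up sample:

printf 'b\na\nb\n' | sort -u               # prints a, b  (sorted, duplicates removed)
printf 'b\na\nb\n' | awk '!seen[$0]++'     # prints b, a  (original order kept)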

αғsнιη
  • 41,407
  • 2
    Not as fast as the awk command in other answers, but conceptually simple! – Johann Mar 31 '15 at 23:11
  • @Johann I am doing this pretty often on files with hundreds of thousands (even millions) of short newline-terminated strings. I get the results pretty quickly for the experiments I am doing. It matters more if used in scripts which are run again and again; the savings in time can be considerable. – Vladislavs Dovgalecs Mar 31 '15 at 23:13
  • 2
    Use sort -u to remove duplicates during the sort, rather than after (and save memory bandwidth piping it to another program). This is only better than the awk version if you want your output sorted, too. (The OP on this question wants his original ordering preserved, so this is a good answer for a slightly different use-case.) – Peter Cordes Sep 14 '15 at 15:39
  • Took about a minute, for me, for a 5.5 million line file (1.8 GB in total). Brilliant. – Max Williams Jan 04 '18 at 11:23
56

There's a simple (which is not to say obvious) method using standard utilities which doesn't require a large memory except to run sort, which in most implementations has specific optimizations for huge files (a good external sort algorithm). An advantage of this method is that it only loops over all the lines inside special-purpose utilities, never inside interpreted languages.

<input nl -b a -s : |           # number the lines
sort -t : -k 2 -u |             # sort and uniquify ignoring the line numbers
sort -t : -k 1n |               # sort according to the line numbers
cut -d : -f 2- >output          # remove the line numbers
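
For illustration, with a made-up three-line input (b, a, b), nl first produces something like

     1:b
     2:a
     3:b

The first sort then keeps one line per distinct text (its key starts at field 2, so the prepended numbers are ignored), the second sort puts the survivors back into their original positions by line number, and cut strips the numbers again, leaving a single copy of b and of a in their original relative order.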

If all lines begin with a non-whitespace character, you can dispense with some of the options:

<input nl | sort -k 2 -u | sort -k 1n | cut -f 2- >output

For a large amount of duplication, a method that only requires storing a single copy of each line in memory will perform better. With some interpretation overhead, there's a very concise awk script for that (already posted by enzotib):

<input awk '!seen[$0]++'

Less concisely: !seen[$0] {print} {seen[$0] += 1}, i.e. print the current line if it hasn't been seen yet, then increment the seen counter for this line (uninitialized variables or array elements have the numerical value 0).
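
Spelled out as a complete command in the same style (equivalent to the one-liner above):

<input awk '!seen[$0] {print} {seen[$0] += 1}'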

For long lines, you can save memory by keeping only a non-spoofable checksum (e.g. a cryptographic digest) of each line. For example, using SHA-1, you only need 20 bytes plus a constant overhead per line. But computing digests is rather slow; this method will only win if you have a fast CPU (especially one with a hardware accelerator to compute the digests) and not a lot of memory relative to the size of the file and sufficiently long lines. No basic utility lets you compute a checksum for each line; you'd have to bear the interpretation overhead of Perl/Python/Ruby/… or write a dedicated compiled program.

<input perl -MDigest::MD5 -ne '$seen{Digest::MD5::md5($_)}++ or print' >output
23

Assuming you can afford to keep the de-duplicated file in memory (if your data is indeed duplicated by a factor of 100, that should be about 20 MiB plus overhead), you can do this very easily with Perl.

$ perl -ne 'print unless $dup{$_}++;' input_file > output_file

This preserves the order too.

You could extract the number of occurrences of each line from the %dup hash if you so wished, as an added free bonus.

If you prefer awk, this should do it too (same logic as the perl version, same ordering, same data gathered in the dup variable):

$ awk '{if (++dup[$0] == 1) print $0;}' input_file > output_file
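
As for the occurrence counts mentioned above, a rough sketch of how you could dump them with the same technique (awk shown here; the output order of the END loop is arbitrary):

$ awk '{dup[$0]++} END {for (line in dup) print dup[line], line}' input_file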
Mat
  • 52,586
14

As no other answer provided in-place editing, here is one that does:

gawk -i inplace '!a[$0]++' file
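
If your awk does not have the inplace extension (it needs GNU awk 4.1 or later), a more portable rough equivalent is to write to a temporary file and then rename it over the original:

awk '!a[$0]++' file > file.tmp && mv file.tmp file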
rindeal
  • 778
  • Does this preserve the order? By the way, this did not work for me. My version is: GNU Awk 4.0.2 – Leonid Feb 16 '17 at 10:31
  • 1
    @Leonid yes, it does. It prints the first occurrence of any unique line. The inplace support was first introduced in version 4.1, which was released in 2013. – rindeal Feb 16 '17 at 12:49
  • This should be the answer. It actually deletes the duplicated lines in the existing file, whereas the top answer and most of the answers here only print out the unique lines; you have to create another output file to store the result. – MaXi32 Jun 06 '20 at 08:33
  • How does gawk differ from awk? – alper Feb 13 '22 at 23:51
3

You can use uniq: http://www.computerhope.com/unix/uuniq.htm

uniq reports or filters out repeated lines in a file.
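
Note that uniq on its own only removes adjacent repeats, so for duplicates scattered throughout the file (as in the question) the input has to be sorted first. A made-up example:

printf 'a\nb\na\n' | uniq          # prints a, b, a  (the non-adjacent duplicate survives)
printf 'a\nb\na\n' | sort | uniq   # prints a, b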

3

SOLUTION WITHOUT MAINTAINING THE ORIGINAL SEQUENCE ORDER

I did it with the following code piece.

sort duplicates.txt | uniq > noDuplicates.txt

The sort command sorts the lines alphabetically, and the uniq command removes the duplicates.

NOTE: We sorted the lines first because uniq does not detect duplicate lines unless they are adjacent.

  • 2
    The question asks for a method (preferably) which maintains the input order; could you [edit] your answer to address that? Note that there are existing answers using sort which maintain input order, and one answer using sort without maintaining input order but in a more efficient manner than piping to uniq. – Stephen Kitt Sep 07 '20 at 14:01
  • @StephenKitt Edited. I inspected other answers, but couldn't find anything only with basic commands. Thanks for your feedback. – Caglayan DOKME Sep 07 '20 at 14:18
  • I gave you a link to an answer with only basic commands, in fact only one command, sort -u (which is part of POSIX) ;-). – Stephen Kitt Sep 07 '20 at 14:25
  • @StephenKitt I saw that answer. Mine is also a way to handle the problem. What do you want me to do more? Should I delete the answer? – Caglayan DOKME Sep 07 '20 at 15:48
  • No, don’t delete your answer; I just wanted to make sure you were aware of the other answer, given that you said you “couldn't find anything only with basic commands”. – Stephen Kitt Sep 07 '20 at 15:59
  • @StephenKitt Actually, you are right. It is more basic than mine. Thanks. Have a good day. – Caglayan DOKME Sep 07 '20 at 16:07
2

Python one-liner:

python -c "import sys; lines = sys.stdin.readlines(); print ''.join(sorted(set(lines)))" < InputFile
Rahul Patil
  • 24,711
  • this causes the entire file to be slurped into memory and may not be a good fit for the OP's problem. Also not guaranteed to retain order – iruvar Sep 15 '13 at 14:50
  • Thanks for the suggestion, I've been just learning python.. just tried this for learning purpose.. :) – Rahul Patil Sep 15 '13 at 19:52
  • Here's a Python 2.7 version that is not a one-liner but (succinctly) returns unique lines preserving order without either loading the entire file into memory or creating a single gigantic string to feed to print – iruvar Sep 16 '13 at 16:37
  • Thanks @1_CR I have something learn today :) OrderedDict – Rahul Patil Sep 16 '13 at 16:39
0

None of the answers here worked for me on my Mac, so I wrote a simple Python script that works for me. I am ignoring leading/trailing whitespace and also don't care about memory consumption.

import sys

inputfile = sys.argv[1]
outputfile = sys.argv[2]

with open(inputfile) as f:
    content = f.readlines()

content = [x.strip() for x in content]

my_list = list(set(content))

with open(outputfile, 'w') as output:
    for item in my_list:
        output.write("%s\n" % item)

Save the above to unique.py and run like this:

python unique.py inputfile.txt outputfile.txt
Freddie
  • 101
0

Using Raku (formerly known as Perl_6)

~$ raku -ne 'BEGIN my %dup; .put unless %dup{$_}++;'  input_file > output_file

OR:

~$ raku -e '.put for lines.unique;'  input_file > output_file

[ Note: compare the first answer here with the excellent Perl answer by @Mat ].

https://docs.raku.org
https://raku.org

jubilatious1
  • 3,195
0

Note that sort can write its output to any one of the files that it was given as input:

LC_ALL=C sort -u -o input input

That's fine, as sort needs to have read all its input before it can start outputting anything (before it can tell which line sorts first, which could very well be the last line of the input).

sort will (intelligently) use temporary files so as to avoid loading the whole input in memory. You'll need enough space in $TMPDIR (or /tmp if that variable is not set). Some sort implementations can compress the temp files (like with --compress-program=lzop with GNU sort) which can help if you're short on disk space or have slow disks.

With LC_ALL=C, the sort order is by byte value, which should speed things up and also guarantees a total and deterministic order (which you don't always get otherwise, especially on GNU systems).
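
If /tmp is too small for sort's temporary files, you can point it at a bigger scratch directory and, with GNU sort, compress the temporary files (the directory below is just a placeholder, and lzop has to be installed):

LC_ALL=C sort -u -T /path/to/big/scratch --compress-program=lzop -o input input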

-1

With bash 4, a pure-bash solution that takes advantage of associative arrays can be used. Here is an example

unset llist; declare -A llist;
while IFS= read -r line; do   # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal
  if [[ ${llist[$line]} ]]; then
    continue
  else
    printf '%s\n' "$line"
    llist[$line]="x"
  fi
done < file.txt
iruvar
  • 16,725
  • 2
    Don't use read loops to process big text files. bash has to read one-byte-at-a-time to avoid overshooting a newline. Bash is also not very fast at text processing in general compared to awk. If you do use this, read -ra will avoid eating backslashes in your input. Also, don't forget to unset llist after the loop, if you put this in a shell function or use it interactively. – Peter Cordes Sep 14 '15 at 15:44
  • 2
    @PeterCordes, or you could have just referenced this :-) – iruvar Sep 14 '15 at 20:41