
The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
         main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
         sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
         $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname> perl -v
    This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname:> $ file /usr/bin/perl
/usr/bin/perl:  ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk:   ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed:   ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have seen How to remove duplicate lines in a large multi-GB textfile? and

http://en.wikipedia.org/wiki/External_sorting#External_merge_sort

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

throws

Out of Memory!

The resultant file of 960 MB is merged from the files whose sizes in bytes are listed below, the average being about 50 MB:

22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881

Question: How do I perform an external sort merge and deduplicate this data? Or, failing that, how else can I deduplicate it?

  • The normal pattern for dedup'ing is sort ... | uniq. If sort is failing due to lack of memory then you could try breaking apart the file into many pieces (for example using split), dedup'ing each part individually, cat'ing them back together (which hopefully results in a smaller file than the original), then dedup'ing that. – Celada Mar 19 '15 at 07:53
  • Doesn't a simple sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. – wurtel Mar 19 '15 at 07:59
  • Do you need to preserve the order of the first occurrences of each line? If not sort -u is the clear solution. – Gilles 'SO- stop being evil' Mar 19 '15 at 22:57
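
For what it is worth, here is a minimal sketch of the split-then-dedup route suggested in the first comment above, using only split, sort and cat; the piece. prefix, the 500000-line chunk size and the names file.merge / file.unique are illustrative:

# split by whole lines so that no record is cut in half
split -l 500000 file.merge piece.
for f in piece.*
do
    sort -u -o "$f" "$f"                # dedup each piece in place
done
cat piece.* | sort -u > file.unique     # dedup the (now smaller) concatenation
rm piece.*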

3 Answers


It seems to me that the process you're following at the moment is this, which fails with your out of memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)

I think you should be able to perform the following process instead (a shell sketch follows the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
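
A minimal sketch of steps 2 and 3, assuming the parts are plain text files named part1, part2, ... (adjust the names to match yours):

# step 2: sort each part on its own, discarding duplicates within it
for f in part1 part2 part3              # list all of the part files here
do
    sort -u -o "$f.sorted" "$f"
done

# step 3: merge the already-sorted parts, discarding duplicates across them
sort -m -u -o merged.unique ./*.sorted

sort -m merges inputs that are already sorted, holding only a line or so from each file in memory at a time, which is why this avoids the out-of-memory failure; -u discards the duplicates at each stage.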
Chris Davies
  • How to merge? To be able to merge in reasonable time, we need some lookup logic, e.g. hash table, but then we again face the same problem -> not enough memory to store huge hash table. – Boy Dec 21 '17 at 19:15
  • @Borna why would you want a hash table when merging multiple pre-sorted files? These external merge-sort algorithms have been around since the days of magnetic tape - at least 50 years ago. – Chris Davies Dec 21 '17 at 19:35
  • That is exactly what I was looking for, thank you sir! One question: I was wondering how efficient it would be to create n files in a single directory (under Linux), where each file name is a row from the 'non-unique-lines' file (let's say there are no illegal characters for file names), thus eliminating duplicate rows. – Boy Dec 22 '17 at 20:42
  • @Borna that sounds an interesting question in its own right. When you've asked it I'd appreciate a ping back here with the reference and I'll take a look – Chris Davies Dec 22 '17 at 23:17

Of course there are no GNU/Linux tools: the what command is part of the Source Code Control System (SCCS), which I do not believe exists at all in Linux.

So, presumably you are on Unix. There the sort algorithm is capable of dealing with these problems: the Algorithmic details of UNIX Sort command page states that an input of size M, with memory of size N, is subdivided into M/N chunks that fit into memory and are worked on serially.

It should fit the bill.
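
A sketch of what that looks like in practice (not verified on HP-UX): -T points sort's intermediate files at a directory with enough free space for them; if your sort does not accept -T, setting TMPDIR, or simply running plain sort -u, may be enough, since sort spills to temporary files on its own when the input does not fit in memory.

# let sort do its own external merge sort; -u discards duplicate lines
sort -u -T /var/tmp file.merge > file.unique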

MariusMatutiae
% perl -ne 'if ( $seen{$_}++ ) {        # line seen before: a duplicate
    $count++ ;
    if ($count > 1000000) {             # after a million duplicates, reset the
        %seen = () ;                    # hash to cap memory; some later
        $count = 0 ;                    # duplicates may then be printed again
    }
} else {
    print ;                             # first occurrence: print it
}' <<eof
a
a
a
b
c
a
a
a
b
c
eof   
a
b
c
%
dan