
The size of the file is 962,120,335 bytes.

HP-UX ******B.11.31 U ia64 ****** unlimited-user license

hostname> what /usr/bin/awk
/usr/bin/awk:
         main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
         $Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
         sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
         $Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname> perl -v
    This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname:> $ file /usr/bin/perl
/usr/bin/perl:  ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk:   ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed:   ELF-32 executable object file - IA64

There are no GNU tools here.
What are my options?

I have seen How to remove duplicate lines in a large multi-GB textfile? and

http://en.wikipedia.org/wiki/External_sorting#External_merge_sort

perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique

throws

Out of Memory!

The resultant file of 960 MB is merged from the files whose sizes in bytes are listed below, the average being about 50 MB:

22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881

Question: How do I perform an external sort merge and deduplicate this data? Or, failing that, how else can I deduplicate it?

  • The normal pattern for dedup'ing is sort ... | uniq. If sort is failing due to lack of memory then you could try breaking apart the file into many pieces (for example using split), dedup'ing each part individually, cat'ing them back together (which hopefully results in a smaller file than the original), then dedup'ing that. – Celada Mar 19 '15 at 07:53
  • Doesn't a simple sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. – wurtel Mar 19 '15 at 07:59
  • Do you need to preserve the order of the first occurrences of each line? If not sort -u is the clear solution. – Gilles 'SO- stop being evil' Mar 19 '15 at 22:57
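
For what it is worth, here is a minimal sketch of the split-then-dedup route suggested in the first comment above, using only split, sort and cat; the piece. prefix, the 500000-line chunk size and the names file.merge / file.unique are illustrative:

# split by whole lines so that no record is cut in half
split -l 500000 file.merge piece.
for f in piece.*
do
    sort -u -o "$f" "$f"                # dedup each piece in place
done
cat piece.* | sort -u > file.unique     # dedup the (now smaller) concatenation
rm piece.*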

3 Answers


It seems to me that the process you're following at the moment is this, which fails with your out of memory error:

  1. Create several data files
  2. Concatenate them together
  3. Sort the result, discarding duplicate records (rows)

I think you should be able to perform the following process instead (a shell sketch follows the list):

  1. Create several data files
  2. Sort each one independently, discarding its duplicates (sort -u)
  3. Merge the resulting set of sorted data files, discarding duplicates (sort -m -u)
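
A minimal sketch of steps 2 and 3, assuming the parts are plain text files named part1, part2, ... (adjust the names to match yours):

# step 2: sort each part on its own, discarding duplicates within it
for f in part1 part2 part3              # list all of the part files here
do
    sort -u -o "$f.sorted" "$f"
done

# step 3: merge the already-sorted parts, discarding duplicates across them
sort -m -u -o merged.unique ./*.sorted

sort -m merges inputs that are already sorted, holding only a line or so from each file in memory at a time, which is why this avoids the out-of-memory failure; -u discards the duplicates at each stage.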
Chris Davies
  • How to merge? To be able to merge in reasonable time, we need some lookup logic, e.g. hash table, but then we again face the same problem -> not enough memory to store huge hash table. – Boy Dec 21 '17 at 19:15
  • @Borna why would you want a hash table when merging multiple pre-sorted files? These external merge-sort algorithms have been around since the days of magnetic tape - at least 50 years ago. – Chris Davies Dec 21 '17 at 19:35
  • That is exactly what I was looking for, thank you sir! One question: I was wondering how efficient it would be to create n files in a single directory (under Linux), where each file name is a row from the 'non-unique-lines' file (let's say there are no illegal characters for file names), thus eliminating duplicate rows. – Boy Dec 22 '17 at 20:42
  • @Borna that sounds an interesting question in its own right. When you've asked it I'd appreciate a ping back here with the reference and I'll take a look – Chris Davies Dec 22 '17 at 23:17

Of course there are no GNU/Linux tools: the what command is part of the Source Code Control System (SCCS), which I do not believe exists at all in Linux.

So, presumably you are on Unix. There the sort algorithm is capable of dealing with these problems: the Algorithmic details of UNIX Sort command page states that an input of size M, with memory of size N, is subdivided into M/N chunks that fit into memory and are worked on serially.

It should fit the bill.
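
A sketch of what that looks like in practice (not verified on HP-UX): -T points sort's intermediate files at a directory with enough free space for them; if your sort does not accept -T, setting TMPDIR, or simply running plain sort -u, may be enough, since sort spills to temporary files on its own when the input does not fit in memory.

# let sort do its own external merge sort; -u discards duplicate lines
sort -u -T /var/tmp file.merge > file.unique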

MariusMatutiae
% perl -ne 'if ( $seen{$_}++ ) {        # line seen before: a duplicate
    $count++ ;
    if ($count > 1000000) {             # after a million duplicates, reset the
        %seen = () ;                    # hash to cap memory; some later
        $count = 0 ;                    # duplicates may then be printed again
    }
} else {
    print ;                             # first occurrence: print it
}' <<eof
a
a
a
b
c
a
a
a
b
c
eof   
a
b
c
%
dan