The size of the file is 962,120,335 bytes.
HP-UX ******B.11.31 U ia64 ****** unlimited-user license
hostname> what /usr/bin/awk
/usr/bin/awk:
main.c $Date: 2009/02/17 15:25:17 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
run.c $Date: 2009/02/17 15:25:20 $Revision: r11.31/1 PATCH_11.31 (PHCO_36132)
$Revision: @(#) awk R11.31_BL2010_0503_1 PATCH_11.31 PHCO_40052
hostname> what /usr/bin/sed
/usr/bin/sed:
sed0.c $Date: 2008/04/23 11:11:11 $Revision: r11.31/1 PATCH_11.31 (PHCO_38263)
$Revision: @(#) sed R11.31_BL2008_1022_2 PATCH_11.31 PHCO_38263
hostname>perl -v
This is perl, v5.8.8 built for IA64.ARCHREV_0-thread-multi
hostname:> $ file /usr/bin/perl
/usr/bin/perl: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/awk
/usr/bin/awk: ELF-32 executable object file - IA64
hostname:> $ file /usr/bin/sed
/usr/bin/sed: ELF-32 executable object file - IA64
There are no GNU tools here.
What are my options?
Related: "How to remove duplicate lines in a large multi-GB textfile?" and http://en.wikipedia.org/wiki/External_sorting#External_merge_sort
perl -ne 'print unless $seen{$_}++;' < file.merge > file.unique
throws
Out of Memory!
The resulting 960 MB file was merged from 19 files, averaging about 50 MB each. Their sizes in bytes are:
22900038, 24313871, 25609082, 18059622, 23678631, 32136363, 49294631, 61348150, 85237944, 70492586, 79842339, 72655093, 73474145, 82539534, 65101428, 57240031, 79481673, 539293, 38175881
Question: How can I perform an external merge sort and deduplicate this data? Or, more simply, how can I deduplicate this data?
sort .... | uniq. If sort is failing due to lack of memory, then you could try breaking the file apart into many pieces (for example using split), dedup'ing each part individually, cat'ing them back together (which hopefully results in a smaller file than the original), then dedup'ing that. – Celada Mar 19 '15 at 07:53
sort -u work? I know that the old sort from System V R4 used temporary files in /var/tmp if sorting in memory wasn't possible, so large files shouldn't be a problem. – wurtel Mar 19 '15 at 07:59
sort -u is the clear solution. – Gilles 'SO- stop being evil' Mar 19 '15 at 22:57
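The first thing to try is the single command from the comments, sort -u -o file.unique file.merge. If that fails, the split, sort, and merge idea from the comments and the linked article can be sketched with POSIX tools only. This is a minimal sketch, not a tested HP-UX recipe: the function name dedup_external, the chunk size, and the working directory under /var/tmp are all assumptions made for illustration. Unlike the perl one-liner, it does not preserve the original line order, and it assumes a non-empty input file.

```shell
#!/bin/sh
# Hypothetical helper (the name and defaults are assumptions):
#   dedup_external INPUT OUTPUT [LINES_PER_CHUNK]
# The body runs in a subshell so its variables do not leak to the caller.
dedup_external() (
    in=$1
    out=$2
    lines=${3:-1000000}                  # assumed chunk size, tune to taste
    work=${TMPDIR:-/var/tmp}/dedup.$$    # assumed scratch location
    mkdir "$work" || exit 1

    # 1. Break the big file into pieces small enough to sort comfortably.
    split -l "$lines" "$in" "$work/chunk."

    # 2. Sort each piece and drop duplicates within it.
    #    LC_ALL=C compares raw bytes, so only identical lines are merged.
    for f in "$work"/chunk.*; do
        LC_ALL=C sort -u -o "$f" "$f"
    done

    # 3. Merge the already-sorted pieces (-m), dropping duplicates
    #    that occur across pieces (-u).
    LC_ALL=C sort -m -u -o "$out" "$work"/chunk.*

    rm -rf "$work"
)
```

It would be called as dedup_external file.merge file.unique. Since the merged file was built from 19 smaller files, the same two steps (sort -u on each source file, then sort -m -u over all of them) could probably be applied to those files directly, skipping the split.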