I am sorting big files (>100 GB), and to reduce the time spent on disk writes, I am trying to use GNU sort's --compress-program option. (Related: How to sort big files?)
However, in certain cases only the first temporary file appears to be compressed. I would like to know why, and what I can do to get all temporaries compressed.
I am using:
sort (GNU coreutils) 8.25
lzop 1.03
LZO library 2.09
Steps to reproduce the issue:
You will need about 15 GB of free disk space, about 10 GB of RAM, and some time.
First, create a 10 GB file with the following C code:
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    unsigned long n;
    unsigned char i;

    srand(42); /* fixed seed, so the output is reproducible */
    /* 1e9 lines of 9 digits plus a newline: about 10 GB in total */
    for (n = 0; n < 1000000000; n++) {
        for (i = 0; i < 3; i++) {
            printf("%03d", rand() % 1000);
        }
        printf("\n");
    }
    fflush(stdout);
    return 0;
}
And running it:
$ gcc -Wall -O3 -o generate generate.c
$ ./generate > test.input # takes a few minutes
$ head -n 4 test.input
166740881
241012758
021940535
743874143
Then, start the sort process:
$ LC_ALL=C sort -T . -S 9G --compress-program=lzop test.input -o test.output
After some time, suspend the process and list the temporaries created in the same folder (due to -T .):
$ ls -s sort*
890308 sortc1JZsR
890308 sorte7O878
378136 sortrK37RZ
$ file sort*
sortc1JZsR: ASCII text
sorte7O878: ASCII text
sortrK37RZ: lzop compressed data - version 1.030, LZO1X-1, os: Unix
It seems that only sortrK37RZ (the first temporary created) has been compressed.
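As a cross-check that does not depend on the heuristics of file, the temporaries can be tested directly for the lzop magic number. The following is only a sketch: it assumes the temporaries live in the current directory under the sort* names shown above, and it looks only at the first four bytes (0x89 followed by "LZO"), which is the start of the lzop signature. The function name is_lzop is made up for this example.

```shell
#!/bin/bash
# Succeed if the file starts with the first four bytes of the lzop
# signature: 0x89 'L' 'Z' 'O'. The bytes are compared as hex text so that
# binary data never has to be stored in a shell variable.
is_lzop() {
    [ "$(head -c 4 -- "$1" | od -An -tx1 | tr -d ' \n')" = "894c5a4f" ]
}

# Report every sort temporary in the current directory (as created by -T .).
for f in sort*; do
    [ -f "$f" ] || continue   # skips the literal pattern when nothing matches
    if is_lzop "$f"; then
        echo "$f: lzop"
    else
        echo "$f: not lzop"
    fi
done
```

If this agrees with file, the uncompressed temporaries really are plain text rather than a misdetected format.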
[Edit] Running the same sort command with -S set to 7G is fine (i.e. all temporaries are compressed), while with 8G the issue is present.
[Edit] lzop is not called for the other temporaries
I tried using the following script as a wrapper for lzop:
#!/bin/bash
set -e
echo "$$: start at $(date)" >> log
lzop "$@"
echo "$$: end at $(date)" >> log
Here is the content of the log file when several temporaries are written to disk:
11109: start at Sun Apr 10 22:56:51 CEST 2016
11109: end at Sun Apr 10 22:57:17 CEST 2016
So my guess is that the compress program is not called at all.
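To make that log more conclusive, the wrapper could also record its arguments and the exit status of the compressor. GNU sort runs the compress program with no arguments to compress and with -d to decompress, so the arguments tell the two kinds of call apart. The sketch below writes such a wrapper to a file; the file name ./lzop-logged and the environment variables COMPRESSOR and LOG are assumptions made for this example, not anything sort itself knows about.

```shell
#!/bin/bash
# Create a hypothetical wrapper script named ./lzop-logged.
cat > ./lzop-logged <<'EOF'
#!/bin/bash
# COMPRESSOR and LOG are illustrative environment variables with defaults.
compressor=${COMPRESSOR:-lzop}
log=${LOG:-log}
echo "$$: start [$*] at $(date)" >> "$log"
"$compressor" "$@"   # data passes through stdin/stdout untouched
status=$?
# No 'set -e' here, so the end line is logged even when the compressor fails.
echo "$$: end [$*] status=$status at $(date)" >> "$log"
exit "$status"
EOF
chmod +x ./lzop-logged
```

It would then be passed as --compress-program=./lzop-logged in the sort command above. One start/end pair per temporary would be expected if the compressor were invoked every time.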
file cannot detect the type correctly. – meuh Apr 10 '16 at 17:47
file command is right. Which also makes sense when you look at the size of the temporaries (the ASCII ones being way bigger) – r0g Apr 10 '16 at 18:28