What I want to know is: what is tar doing at the start, before it starts passing data on to gzip? Can I make it skip that step?
I'm writing a script to run on my Synology NAS box (running DSM 6.2.1-23824 Update 1, with tar version 1.28) to compress copies of virtual machine hdd images. The source files are stored as sparse files on a btrfs filesystem. I'm looking for a little bit of compression, preferably keeping the sparseness, and as much speed as possible.
While I am working with only 1 file at a time, the reason for using tar in the first place is to use its --sparse flag, as gzip cannot unzip a file as a sparse file. The central command I'm trying to run is:
GZIP=-1 nice -n 19 tar --keep-old-files --sparse -czf $destDir/$vmFolder/$file.tar.gz $file 2>>$log
However, with HDD images of this size (ranging from 2GB to 120GB), there are many minutes after tar starts during which it is furiously reading the source as fast as it can, but gzip is not being given anything to work with. The length of time this goes on for scales with the size of the source file.
Things I've tried to work around the issue:
- If I just use gzip the output starts straight away, but I lose the sparse info.
- If I use pipes, as below, it does the same thing:
nice -n 19 tar --keep-old-files --sparse -cf - $file | nice -n 19 gzip --fast > $destDir/$vmFolder/$file.tar.gz 2>>$log
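For reference, the piped variant can be wrapped in a small function. This is only a sketch: `compress_sparse` and its three arguments are hypothetical names, and it assumes GNU tar and gzip. It does not remove the initial delay; it only quotes the variables and groups the pipeline so that stderr from both tar and gzip reaches the log (in the one-liner above, only gzip's stderr is redirected).

```shell
#!/bin/sh
# Hypothetical helper: archive one sparse file with tar and compress it with gzip.
# Usage: compress_sparse SOURCE_FILE DEST_ARCHIVE LOG_FILE
compress_sparse() {
    src=$1
    dest=$2
    log=$3
    # The braces group the whole pipeline, so stderr from tar AND gzip
    # is appended to the log file.
    {
        nice -n 19 tar --sparse -cf - -- "$src" |
            nice -n 19 gzip --fast > "$dest"
    } 2>> "$log"
}
```

It could be called as `compress_sparse "$file" "$destDir/$vmFolder/$file.tar.gz" "$log"`. The function's exit status is that of gzip, the last command in the pipeline.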
Admittedly the NAS box only has an Intel Atom D2700, but the tar operation shouldn't be CPU intensive. I can appreciate that gzip is CPU intensive and this will be a limiting factor, particularly with an old Atom CPU. I was hoping to use lz4 or lzop, but the Synology OS doesn't seem to have them, just gzip, 7z, and xz.
Note that the script can run as many of these commands in parallel as I like (using this semaphore script as a template), so it can utilise all cores of the CPU even with single-threaded gzip.
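The semaphore script itself is not shown here, so the following is only a rough stand-in for the idea of bounded parallelism, not the script actually used. It assumes an `xargs` that supports `-0` and `-P` (GNU findutils does; other builds may not), and `compress_parallel` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: compress every file given as an argument,
# running at most MAX_JOBS tar|gzip pipelines at the same time.
# Assumes plain file names in the current directory (no slashes).
# Usage: compress_parallel MAX_JOBS DEST_DIR FILE...
compress_parallel() {
    max_jobs=$1
    dest_dir=$2
    shift 2
    # Nothing to do if no files were given.
    [ "$#" -gt 0 ] || return 0
    # One NUL-terminated record per file, so spaces in names are safe.
    # Inside the child shell: $1 is the destination directory, $2 the file.
    printf '%s\0' "$@" |
        xargs -0 -n 1 -P "$max_jobs" sh -c '
            nice -n 19 tar --sparse -cf - -- "$2" |
                nice -n 19 gzip --fast > "$1/$2.tar.gz"
        ' sh "$dest_dir"
}
```

For example, `compress_parallel 4 "$destDir/$vmFolder" *.img` would keep up to four pipelines running, one per file.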
Edit: Testing my script without the --sparse option, but still using tar, does not have this problem, and the data immediately flows through to gzip.
tar has to determine whether a file is sparse before archiving it. This means reading the entire file to find the non-sparse bits. This could potentially take time depending on the size of the source files. I'm not writing this as an answer since it's just handwaving and I have nothing to test on. – Kusalananda Dec 14 '18 at 07:54
"--sparse When given this option, tar attempts to determine if the file is sparse prior to archiving it". – Kamil Maciorowski Dec 14 '18 at 08:11
gtar already arrived in the presence. star uses the SEEK_HOLE feature since summer 2005. – schily Dec 14 '18 at 15:21
tar will use SEEK_HOLE if the lseek() implementation on the host supports it, according to the gtar source code. – Kusalananda Dec 14 '18 at 15:24
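As a side note to the comments above, whether a file is sparse at all can be estimated from its metadata, without reading any data. The sketch below is a heuristic and not what tar does internally: it compares the allocated size with the apparent size. It assumes GNU `stat` with the `%s`, `%b` and `%B` format sequences, and `is_sparse` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if FILE occupies less space on disk than its
# apparent size (the usual sign of a sparse file), 1 if not, 2 on stat failure.
# Assumes GNU stat: %s = size in bytes, %b = allocated blocks, %B = bytes per block.
is_sparse() {
    info=$(stat -c '%s %b %B' -- "$1") || return 2
    # Split the three numbers into $1 (size), $2 (blocks), $3 (block size).
    set -- $info
    apparent=$1
    allocated=$(( $2 * $3 ))
    [ "$allocated" -lt "$apparent" ]
}
```

A call such as `is_sparse "$file" && echo "sparse"` only reports that holes probably exist; it says nothing about where they are, which is what tar still has to work out before it writes the archive.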