tar takes a long time before passing data to gzip

Question

What is tar doing at the start, before it begins passing data on to gzip? Can I make it skip that step?

I'm writing a script to run on my Synology NAS (running DSM 6.2.1-23824 Update 1, with tar version 1.28) to compress copies of virtual machine HDD images. The source files are stored as sparse files on a btrfs filesystem. I'm looking for a little compression, preferably keeping the sparseness, with as much speed as possible.

Although I am working with only one file at a time, the reason for using tar in the first place is its --sparse flag, as gzip cannot decompress a file as a sparse file. The central command I'm trying to run is:

GZIP=-1 nice -n 19 tar --keep-old-files --sparse -czf "$destDir/$vmFolder/$file.tar.gz" "$file" 2>>"$log"

However, with HDD images ranging from 2GB to 120GB, there are many minutes after tar starts during which it furiously reads the source as fast as it can while gzip is given nothing to work with. The length of this delay scales with the size of the source file.

Things I've tried to work around the issue:

If I just use gzip the output starts straight away, but I lose the sparse info.

If I use pipes, as below, it does the same thing.

nice -n 19 tar --keep-old-files --sparse -cf - "$file" | nice -n 19 gzip --fast > "$destDir/$vmFolder/$file.tar.gz" 2>>"$log"

Admittedly the NAS only has an Intel Atom D2700, but the tar operation shouldn't be CPU intensive. I appreciate that gzip is CPU intensive and will be a limiting factor, particularly on an old Atom CPU. I was hoping to use lz4 or lzop, but the Synology OS doesn't seem to have them, just gzip, 7z, and xz.

Note that the script can run as many of these commands in parallel as I like, using this semaphore script as a template, so it can utilise all cores of the CPU even with single-threaded gzip.

Edit: Testing my script without the --sparse option, but still using tar, does not have this problem, and the data immediately flows through to gzip.

tar has to determine whether a file is sparse before archiving it. This means reading the entire file to find the non-sparse bits. This could potentially take time depending on the size of the source files. I'm not writing this as an answer since it's just handwaving and I have nothing to test on. — Kusalananda, Dec 14 '18 at 07:54
@Kusalananda You may be on the right track. From the manual on Debian 9: "--sparse When given this option, tar attempts to determine if the file is sparse prior to archiving it". — Kamil Maciorowski, Dec 14 '18 at 08:11
@Kusalananda If you are on a historic platform, you are right. Modern platforms have offered the lseek() SEEK_HOLE feature since summer 2005, and star has used it since then. I am, however, not sure whether gtar has caught up yet. — schily, Dec 14 '18 at 15:21
@schily GNU tar will use SEEK_HOLE if the lseek() implementation on the host supports it, according to the gtar source code. — Kusalananda, Dec 14 '18 at 15:24
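
To make the comments above concrete, here is a rough sketch of how to check whether a file actually contains holes by comparing its apparent size with the space allocated for it. It assumes GNU coreutils (truncate, stat -c), and demo.img is a hypothetical stand-in for a VM disk image, not a file from the question. As far as I understand the GNU tar sparse formats, the map of data regions is stored ahead of the file's contents in the archive, which would explain why tar has to locate every hole before it can emit anything. Without SEEK_HOLE support, that means reading the whole file.

```shell
#!/bin/sh
# Sketch: compare a file's apparent size with the space actually allocated.
# demo.img is a hypothetical stand-in for a sparse VM disk image.
workdir=$(mktemp -d)
img="$workdir/demo.img"

# Create a 64 MiB file that is one big hole, then write a few bytes at the start.
truncate -s 64M "$img"
printf 'data' | dd of="$img" conv=notrunc 2>/dev/null

apparent=$(stat -c %s "$img")     # logical size in bytes
blocks=$(stat -c %b "$img")       # number of allocated blocks
blocksize=$(stat -c %B "$img")    # size in bytes of each block reported by %b
allocated=$((blocks * blocksize))

echo "apparent=$apparent allocated=$allocated"
if [ "$allocated" -lt "$apparent" ]; then
    echo "file contains holes"
fi

rm -rf "$workdir"
```

Running the same stat comparison against one of the real images shows how much of the file tar would have to scan versus how much data it would actually archive.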
