I've got a few full old backups of things like binary database dumps. Obviously, they don't differ much, so doing full backups is not the smartest idea here. For now, I'm looking for a compression program capable of taking advantage of the fact that most of the files have similar content.
2 Answers
If you first tar the files (using tar cvf my_backup.tar <file list...>), then any compression tool will do a good job, as it will see the data as one big file.
So just tar the files and then compress the tarball with zip, 7-zip, bzip2, etc. From the tar file, you can try different compression algorithms and see which one performs best.
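For example, a quick comparison could look like this (a sketch, assuming GNU tar and that gzip, bzip2, and xz are installed; backup1.dump and the other file names are hypothetical):

    # Pack the similar dumps into one archive so the compressor sees one big stream
    tar cf my_backup.tar backup1.dump backup2.dump backup3.dump

    # Compress the same tarball with several tools and compare the resulting sizes
    gzip  -9 -c my_backup.tar > my_backup.tar.gz
    bzip2 -9 -c my_backup.tar > my_backup.tar.bz2
    xz    -9 -c my_backup.tar > my_backup.tar.xz

    ls -l my_backup.tar*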

- Right, but be sure to use a format that archives and then compresses, instead of one (like .zip) which compresses each file individually and then archives them – lk- Aug 16 '12 at 18:00
- tar + lzma seems to work really well for binary blobs. LZMA is built into newer versions of tar too. – Chris S Aug 16 '12 at 18:17
- @Huygens: AFAIK you're wrong. All compression programs I know work with some window (typically 64 kB) and see nothing beyond it, as stated e.g. here on page 7. – maaartinus Aug 16 '12 at 21:46
- @maaartinus this is a shortcoming of gzip then, not tar. Try another compression algorithm, and try to use its options. To experiment, you can tar the same file twice, compress the resulting output, and compare the size to compressing only one copy (a sketch of this experiment follows these comments). – Huygens Aug 20 '12 at 15:06
- @Huygens You can't expect a compression algorithm to go checking whether (some prefix of) a 100 MiB chunk appears again in the input. I'd just suggest burning the whole lot to DVDs or such, and hoping they won't be needed... – vonbrand Jan 16 '13 at 10:52
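The experiment described above can be tried along these lines (a sketch only; big.bin is a hypothetical file, and the effect shows up only if it is larger than gzip's 32 KiB window yet fits within xz's dictionary, 64 MiB at -9):

    # Archive the same large file twice, and once, then compress both archives
    tar cf twice.tar big.bin big.bin
    tar cf once.tar  big.bin

    # gzip's 32 KiB window cannot reach back to the first copy,
    # so twice.tar.gz comes out roughly twice the size of once.tar.gz
    gzip -9 -c twice.tar > twice.tar.gz
    gzip -9 -c once.tar  > once.tar.gz

    # xz's dictionary (64 MiB at -9) can cover the first copy,
    # so twice.tar.xz should be only marginally larger than once.tar.xz
    xz -9 -c twice.tar > twice.tar.xz
    xz -9 -c once.tar  > once.tar.xz

    ls -l once.tar.gz twice.tar.gz once.tar.xz twice.tar.xz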
I've had very good luck with 7-Zip. If you have the horsepower, it is capable of operating with a very large window. Make sure your original files are as uncompressed as possible so it can find similarities. (For Excel files in a heterogeneous environment, for example, this means unzipping their contents first, since xlsx files are lightly compressed when they're stored. I was once able to compress 600 MiB+ of almost-redundant versions of Excel files down to a few hundred KiB.)
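A rough way to apply this might look like the following (a sketch, assuming the p7zip command-line tool and unzip; the file names, directory name, and 256 MiB dictionary are placeholders to adjust to the data and available RAM):

    # Unpack lightly-compressed containers (.xlsx files are zip archives) first,
    # so 7-Zip sees the raw, highly similar data
    mkdir unpacked
    for f in *.xlsx; do unzip -q "$f" -d "unpacked/${f%.xlsx}"; done

    # Solid LZMA2 archive with a large dictionary (the "window") at maximum level
    7z a -t7z -m0=lzma2 -mx=9 -md=256m -ms=on backups.7z unpacked/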

- Does rdiff-backup sound useful? – jw013 Aug 16 '12 at 21:23
- I tried lrzip, but it was extremely slow (many hours for some 100 GB). However, there might have been a problem with the NTFS partition the data were on (a sketch of lrzip usage follows these comments). – maaartinus Aug 16 '12 at 21:38
- I don't think rdiff-backup would do, as it's efficient with slowly evolving files, while all I have now is a set of old snapshots. Things like obnam or bup would help (but the former requires a newer system while the latter still has problems with removal from the backup). – maaartinus Aug 16 '12 at 21:43
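For completeness, lrzip mentioned above is aimed at exactly this kind of long-range redundancy; a minimal invocation could look like this (a sketch; snapshot1/ and the other directories are hypothetical, and -U, the unlimited-window switch, is optional and memory-hungry):

    # The rzip pre-pass finds matches that are far apart in the stream,
    # then the result is compressed with LZMA by default
    tar cf snapshots.tar snapshot1/ snapshot2/ snapshot3/
    lrzip -U snapshots.tar      # writes snapshots.tar.lrz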