21

I have about 200 GB of log data generated daily, distributed among about 150 different log files.

I have a script that moves the files to a temporary location and does a tar-bz2 on the temporary directory.

I get good results as 200 GB logs are compressed to about 12-15 GB.

The problem is that it takes forever to compress the files. The cron job runs at 2:30 AM daily and continues to run till 5:00-6:00 PM.

Is there a way to improve the speed of the compression and complete the job faster? Any ideas?

Don't worry about affecting other processes: the location where the compression happens is on a NAS, and I can mount the NAS on a dedicated VM and run the compression script from there.

Here is the output of top for reference:

top - 15:53:50 up 1093 days,  6:36,  1 user,  load average: 1.00, 1.05, 1.07
Tasks: 101 total,   3 running,  98 sleeping,   0 stopped,   0 zombie
Cpu(s): 25.1%us,  0.7%sy,  0.0%ni, 74.1%id,  0.0%wa,  0.0%hi,  0.1%si,  0.1%st
Mem:   8388608k total,  8334844k used,    53764k free,     9800k buffers
Swap: 12550136k total,      488k used, 12549648k free,  4936168k cached
 PID  USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7086 appmon    18   0 13256 7880  440 R 96.7  0.1 791:16.83 bzip2
7085  appmon    18   0 19452 1148  856 S  0.0  0.0   1:45.41 tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/ASPEN-GC-32459:nkp-aspn-1014.log /nwk_stor
30756 appmon    15   0 85952 1944 1000 S  0.0  0.0   0:00.00 sshd: appmon@pts/0
30757 appmon    15   0 64884 1816 1032 S  0.0  0.0   0:00.01 -tcsh
anu
  • 2
    If you have multiple CPUs and you have or can split it into multiple tar files, you could run multiple compressions. – Jeff Schaller May 04 '16 at 23:03
  • @JeffSchaller would it be possible to get multiple bzip2 processes to compress different files but write to the same tar.bz2 file? – anu May 04 '16 at 23:06
  • 2
    Are the log files generated on local disk before moving to the NAS? If so, compress then move; that way you're only sending 15 GB of data over the network rather than 100 (move) then 115 (100 read + 15 write) when compressing.

    Alternatively it looks like you might be CPU bound on that one bzip2 process, so running multiple in parallel (one per CPU) might help (until you hit the I/O limit).

    Or use a simpler compression (eg "gzip -1"). It won't save as much disk space but it'll run faster.

    – Stephen Harris May 04 '16 at 23:12
  • @Sukminder I will definitely try this and see the difference in size. Thanks. – anu May 04 '16 at 23:54
  • Your top output shows that your single-threaded bzip2 process is maxing out one core, but that you're running it on a quad-core system (one process using 100% CPU -> 25.1% user-space CPU time, 74% idle). So with minor changes, you can go 4x as fast, unless something else becomes the bottleneck. Read Gilles' answer carefully. Consider using the CPU in the same box as the disks holding the data to do the compression. (You might even compress some of your files on one box, others on the other, and archive after, so both CPUs are utilized.) – Peter Cordes May 05 '16 at 18:44

5 Answers

26

The first step is to figure out what the bottleneck is: is it disk I/O, network I/O, or CPU?
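
For instance, a rough way to check while the job is running (iostat comes from the sysstat package and may need installing; the process names match the top output above):

# one core pegged by bzip2 while the disks sit mostly idle points at a CPU bottleneck
top -b -n 1 | grep -E 'bzip2|tar'

# sustained high %util or long await times point at a disk (or NFS) bottleneck instead
iostat -x 5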

If the bottleneck is the disk I/O, there isn't much you can do. Make sure that the disks don't serve many parallel requests as that can only decrease performance.

If the bottleneck is the network I/O, run the compression process on the machine where the files are stored: running it on a machine with a beefier CPU only helps if the CPU is the bottleneck.
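
If the NAS is an actual server you can log in to rather than a closed appliance, that can be as simple as running the whole job there over ssh; the host name below is hypothetical and the paths are taken from the question:

# compress on the machine that holds the data, so only the command crosses the network
ssh nas-host 'cd /nwk_storelogs && tar -cjf compressed_logs/logs.tar.bz2 temp/'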

If the bottleneck is the CPU, then the first thing to consider is using a faster compression algorithm. Bzip2 isn't necessarily a bad choice (its main weakness is decompression speed), but you could use gzip and sacrifice some size for compression speed, or try out other formats such as lzop or lzma. You might also tune the compression level: bzip2 defaults to -9 (maximum block size, so maximum compression, but also the longest compression time); set the environment variable BZIP2 to a value like -3 to try compression level 3 instead.

This thread and this thread discuss common compression algorithms; in particular the blog post cited by derobert gives some benchmarks which suggest that gzip -9 or bzip2 at a low level might be a good compromise compared to bzip2 -9. Another benchmark, which also includes lzma (the algorithm of 7zip, so you might use 7z instead of tar --lzma), suggests that lzma at a low level can reach the bzip2 compression ratio faster. Just about any choice other than bzip2 will improve decompression time. Keep in mind that the compression ratio depends on the data, and that the compression speed depends on the version of the compression program, on how it was compiled, and on the CPU it is executed on.
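
As a sketch, assuming GNU tar and the paths from the question (the output file names are only illustrative), lowering the level or switching compressors looks like this:

# bzip2 at level 3 instead of the default 9; bzip2 picks up options from the BZIP2 variable
BZIP2=-3 tar -cjf /nwk_storelogs/compressed_logs/logs.tar.bz2 /nwk_storelogs/temp/

# or trade some compression ratio for speed by piping through a faster compressor
tar -cf - /nwk_storelogs/temp/ | gzip -1 > /nwk_storelogs/compressed_logs/logs.tar.gz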

Another option if the bottleneck is the CPU and you have multiple cores is to parallelize the compression. There are two ways to do that. One that works with any compression algorithm is to compress the files separately (either individually or in a few groups) and use parallel to run the archiving/compression commands in parallel. This may reduce the compression ratio but increases the speed of retrieval of an individual file and works with any tool. The other approach is to use a parallel implementation of the compression tool; this thread lists several.
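
A minimal sketch of the first approach, assuming GNU parallel is available and that the files end in .log as the top output suggests; note that this produces a plain tar of individually compressed files rather than a single .tar.bz2:

# compress each log file separately, one bzip2 job per CPU core by default
cd /nwk_storelogs/temp
parallel bzip2 ::: *.log

# then bundle the already-compressed files into an uncompressed archive
tar -cf /nwk_storelogs/compressed_logs/logs.tar ./*.log.bz2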

  • 4
    "If the bottleneck is the disk I/O, there isn't much you can do." That's probably true here, since the compression ratio is already good, but in general when I/O is the bottleneck it can be worth looking into using up more CPU to get a better compression ratio (using different compression settings or a different algorithm)... you can't really reduce the "I" (because you need to read in all of the data) but you can sometimes significantly reduce the "O" :-) – psmears May 05 '16 at 16:03
  • 1
    If you tell 7z not to make a "solid" archive, or limit the size of "solid" blocks, it will run multiple LZMA threads in parallel, IIRC. Log file data is a special case for compression, because it tends to be highly redundant (lots of similarity between lines). It's definitely worth testing gzip, bzip2, and xz on the OP's specific log files, rather than just looking at generic compression benchmarks, before ruling out any options. Even fast compressors are worth considering (lzop, lz4, snappy). – Peter Cordes May 05 '16 at 18:11
  • The preferred LZMA compressor these days is xz. Use tar -J or --xz, not --lzma. .lzma is considered a "legacy" file format. The multiple iterations of file formats for LZMA compression is a bit of an embarrassment, and something they should have got right the first time. But AFAIK it's basically good now, and .xz isn't about to be replaced by yet another file format for the same compression stream. – Peter Cordes May 05 '16 at 18:18
  • 7z does have excellent compression & multi-threading, but because of the archive format (needs an index, or maybe bugs?) I don't think it can be used in the middle of a pipeline - it won't use stdin and stdout at the same time – Xen2050 May 05 '16 at 19:49
  • This was really helpful and insightful. My team figured that the operation over NFS was a big bottleneck. – anu May 09 '16 at 22:50
19

You can install pigz (parallel gzip) and use it with tar for multi-threaded compression, like this:

tar -I pigz -cf file.tar.gz *

Where the -I option is:

-I, --use-compress-program PROG
  filter through PROG

Of course, if your NAS doesn't have multiple cores / powerful CPU, you are limited anyway by the CPU power.

The speed of the hard-disk/array on which the VM and the compression is running can be a bottleneck also.
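
If you need to cap the number of threads or change the compression level, pigz takes the usual options; with a reasonably recent GNU tar you can quote the whole command (the thread count here is just an example):

# 8 compression threads at gzip level 6
tar -I 'pigz -p 8 -6' -cf file.tar.gz *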

magor
  • 1
    And if you want to use bzip2, you can use pbzip2 or lbzip2. – Radovan Garabík May 05 '16 at 07:13
  • 2
    This is your best answer. But first, make sure that your first move is to a location that's on the same filesystem as the original files. Otherwise, your "move" is really a byte-copy-then-delete. On the same filesystem, a move is a rearrangement of filesystem links. That's orders of magnitude faster. For my logfiles that are hundreds of Gigabytes large, pigz made all the difference. You can tell it how many parallel threads to run. As long as your cpu has multiple cores, I would not spend a lot of time investigating. You'll likely want pigz in any event; you can get your speedup immediately. – Mike S May 05 '16 at 14:16
  • Once you're pigz'ing, look at your htop and iostat outputs and observe your system performance, if you would like to investigate your system further. But again, I will no longer try to compress large files without pigz. On a modern multicore system, it's just silly not to use it. It's such an immediate win; you'll see. – Mike S May 05 '16 at 14:18
11

By far the fastest and most effective way of compressing data is to generate less of it.

What kinds of logs are you generating? 200 GB daily sounds like quite a lot (unless you're Google or some ISP...). Consider that 1 MB of text is about 500 pages, so you're generating the equivalent of 100 million pages of text per day; you'll fill the Library of Congress in a week.

Look over your log data to see if you can reduce it somehow and still get what you need from the logs, for example by turning down the log level or using a terser log format. Or, if you are using the logs for statistics, process the statistics on the fly, dump a file with the summary, and then filter the logs prior to compression for storage.
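
As a rough sketch only (the log name, level strings, and output path are assumptions, not taken from the question), filtering before compressing could look like:

# keep warnings and errors, drop the chatty lines, then compress the reduced stream
grep -E 'WARN|ERROR' /nwk_storelogs/temp/app.log | gzip > /nwk_storelogs/compressed_logs/app.log.gz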

Emily L.
  • 5
    This is an interesting philosophical solution. The solution to most of life's problems is to avoid having the problem altogether, isn't it? That is, until one closely examines the suggestion and realizes that there are 100s of people and 1000s of approvals that one has to go through to achieve this. – anu May 08 '16 at 16:17
  • 1
    @anu No context to the question was given so I assumed none. And could you please tell me where you got the number 1000s of approvals from? To me it seems like you just made that up. – Emily L. May 08 '16 at 18:18
  • 1
    I'll upvote this. This is the often-overlooked, but once noticed, standout solution to many of life's problems. – jrw32982 May 11 '16 at 13:19
  • 3
    Well .. now that I no longer work there, I can at least disclose that this was a problem at Apple. More specifically, on the service stack that serves the online app store ... so yeah, 1000s of approvals is pretty much a reality, because they have 1000s of microservices, each of them produces logs that need to be compressed, and each would have to sign off on changing their logging levels etc. Anyway, we figured out a solution for this in-house, which is pretty much equivalent to parallel gzip offloaded to another microservice. – anu May 27 '19 at 01:45
  • 1
  • 200 GB/day is not much at all and you don't need to be Google; I've worked on projects where tens of TB of data were generated per day. – magor Apr 22 '22 at 08:10
  • This answer was written in 2016, over 6 years ago. Please check the date before posting. – Emily L. Jun 15 '22 at 09:51
4

If the only requirement is that the compression is fast, I'd recommend lz4 very highly.

It's used in lots of places where the speed of compression is more important than the compression ratio (e.g. filesystems with transparent compression, like ZFS).
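
For example, assuming the lz4 command-line tool is installed (file names are illustrative, the input path is from the question):

# lz4 is far faster than bzip2, at the cost of a lower compression ratio
tar -cf - /nwk_storelogs/temp/ | lz4 > logs.tar.lz4

# decompress to stdout later and unpack
lz4 -dc logs.tar.lz4 | tar -xf -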

pdo
  • Never heard of it before, is there a program that's likely already installed practically everywhere that uses it, like xz? – Xen2050 May 05 '16 at 19:04
3

You can reduce the amount of compression (in terms of space saved) to make it faster. To start with, bzip2 is MUCH slower than gzip, though it compresses smaller. You can also change the compression level of bzip2, gzip, or most compression programs to trade size for speed.

If you are not willing to trade size for speed, you can probably still get the same size or smaller while also getting a speed improvement by using an LZMA-based compressor (xz, for example).

You'll find benchmarks if you search, but your best bet is to do some tests with your own files on your target hardware.
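
For example, a quick and dirty bash comparison on one representative log file (the sample name is just a placeholder; xz -T0 needs xz 5.2 or newer):

# time each candidate on the same input and compare the resulting sizes
for c in 'gzip -1' 'gzip -9' 'bzip2 -9' 'xz -1 -T0'; do
    echo "== $c"
    time $c -c sample.log > sample.log.test
    ls -l sample.log.test
done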

EricS