I have about 200 GB of log data generated daily, distributed among about 150 different log files.
I have a script that moves the files to a temporary location and does a tar-bz2 on the temporary directory.
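In essence it does something like this (a simplified sketch; the source directory here is just illustrative, the rest matches the paths visible in the top output below):

    # simplified sketch: stage the logs, then archive the staging directory
    mv /var/log/myapp/*.log /nwk_storelogs/temp/        # source path illustrative
    tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_$(date +%Y_%d_%m).tar.bz2 /nwk_storelogs/temp/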
I get good results: the 200 GB of logs compress down to about 12-15 GB.
The problem is that it takes forever to compress the files. The cron job starts at 2:30 AM daily and keeps running until 5:00-6:00 PM.
Is there a way to improve the speed of the compression and complete the job faster? Any ideas?
Don't worry about other processes: the location where the compression happens is on a NAS, and I can mount the NAS on a dedicated VM and run the compression script from there.
Here is the output of top for reference:
top - 15:53:50 up 1093 days, 6:36, 1 user, load average: 1.00, 1.05, 1.07
Tasks: 101 total, 3 running, 98 sleeping, 0 stopped, 0 zombie
Cpu(s): 25.1%us, 0.7%sy, 0.0%ni, 74.1%id, 0.0%wa, 0.0%hi, 0.1%si, 0.1%st
Mem: 8388608k total, 8334844k used, 53764k free, 9800k buffers
Swap: 12550136k total, 488k used, 12549648k free, 4936168k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
7086 appmon 18 0 13256 7880 440 R 96.7 0.1 791:16.83 bzip2
7085 appmon 18 0 19452 1148 856 S 0.0 0.0 1:45.41 tar cjvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/ASPEN-GC-32459:nkp-aspn-1014.log /nwk_stor
30756 appmon 15 0 85952 1944 1000 S 0.0 0.0 0:00.00 sshd: appmon@pts/0
30757 appmon 15 0 64884 1816 1032 S 0.0 0.0 0:00.01 -tcsh
…tar.bz2 file? – anu May 04 '16 at 23:06

Alternatively, it looks like you might be CPU-bound on that one bzip2 process, so running multiple in parallel (one per CPU) might help (until you hit the I/O limit).
Or use a simpler compression (e.g. gzip -1). It won't save as much disk space, but it'll run faster. – Stephen Harris May 04 '16 at 23:12
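A minimal sketch of both suggestions, assuming GNU xargs and the 4 cores visible in the top output; note that the first variant produces one .bz2 per file rather than a single tarball:

    # one bzip2 per core, compressing each staged log file in parallel
    find /nwk_storelogs/temp -name '*.log' -print0 | xargs -0 -P4 -n1 bzip2 -9
    # or trade compression ratio for speed with a single fast gzip stream
    tar cf - /nwk_storelogs/temp | gzip -1 > /nwk_storelogs/compressed_logs/logs.tar.gz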
Your top output shows that your single-threaded bzip2 process is maxing out one core, but that you're running it on a quad-core system (one process using 100% CPU -> 25.1% user-space CPU time, 74% idle). So with minor changes, you can go 4x as fast, unless something else becomes the bottleneck. Read Gilles's answer carefully. Consider using the CPU in the same box as the disks holding the data to do the compression. (You might even compress some of your files on one box, others on the other, and archive afterwards, so both CPUs are utilized.) – Peter Cordes May 05 '16 at 18:44
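A minimal sketch of the multi-core variant, assuming GNU tar and that pbzip2 (a drop-in parallel bzip2; an assumption here, not something from the thread) is installed; the output stays a normal tar.bz2:

    # -I hands compression off to pbzip2, which spreads the work across all cores
    tar -I pbzip2 -cvf /nwk_storelogs/compressed_logs/compressed_logs_2016_30_04.tar.bz2 /nwk_storelogs/temp/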