
I'm working on an embedded system and, due to persistent-memory restrictions, I need to compress a log file "on the fly".

My objective is to have a one-liner inside a script, which would look like this:

(myLaunchScript.sh 2>&1 | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' | busybox gzip -c > /mnt/persistenMem/log_app.log.gz 2>&1 )&

  • awk is a helper that prepends the datetime to each trace line; this reduces overhead in the application's printf calls.

The problem is that this implementation loses output if the system is powered off unexpectedly, because gzip has an internal buffer. The size of that internal buffer may differ depending on the options busybox was compiled with.

Is there any compression tool that works unbuffered?

Or line-by-line compression? Is that possible?

  • Standard compression algorithms tend to produce data words with bit lengths different to 8, so they need to do some buffering to combine those bits into bytes. What sort of data are you compressing? Is it just plain ASCII text? – PM 2Ring Feb 10 '16 at 13:56
  • @PM2Ring, indeed, it's just plain text, a normal log. – kikeenrique Feb 10 '16 at 14:06

1 Answer


Maybe it's the stdout buffering that's delaying the output. The pipe buffer can be quite large (typically several kilobytes, 64 KiB on Linux), which easily holds many lines of log output. You may want to check whether you can disable or shrink that buffering.
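If stdio buffering turns out to be the issue, one thing worth trying is stdbuf from GNU coreutils. This is only a sketch: stdbuf works by preloading a library that alters glibc's stdio buffering, so it may have no effect at all on busybox applets (which are often statically linked) or on programs that manage their own buffers:

    # Best-effort attempt to make gzip write its output unbuffered.
    # Note: gzip's internal deflate buffering is untouched by this.
    (myLaunchScript.sh 2>&1 \
        | awk '{ print strftime("%Y-%m-%d %H:%M:%S"), $0; fflush(); }' \
        | stdbuf -o0 busybox gzip -c > /mnt/persistenMem/log_app.log.gz) &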

Line-by-line compression won't help; it might even make the file larger. gzip (based on DEFLATE) and most other compression algorithms work on a sliding window that is quite large, and they also have to keep a lot of past data around (unrelated to the output buffer) to compress anything at all: most of the compression comes from referring to earlier data (usually indirectly; you find which sequences are most common and give them the shortest codes in the output). There is a trade-off between compression ratio and memory consumption, and to compress properly you need an in-memory data structure that holds the current dictionary. The algorithm also works in chunks, not byte by byte, so nothing moves forward until enough input has been collected.
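To see why naive line-by-line compression hurts, here is a minimal sketch (example.log and the output names are just placeholders): compressing each line as its own gzip stream adds a full gzip header and trailer per line, so the result is typically larger than the uncompressed text, even though zcat can still read the concatenated members:

    # Compress every line as a separate gzip member (don't do this for real).
    while IFS= read -r line; do
        printf '%s\n' "$line" | gzip -c
    done < example.log > per-line.log.gz

    # Compare against compressing the whole file at once.
    gzip -c example.log > whole.log.gz
    wc -c example.log whole.log.gz per-line.log.gz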

Check whether the numbered options that set the compression level change anything (gzip -3 for level 3 instead of the default, which is 6 on my system). Some implementations and algorithms have other settings as well (bzip2 -s reduces memory usage; try it out if you have it on your system).
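A rough way to compare is to run a sample log through the different levels and look at the sizes; sample.log is just a stand-in for one of your own logs, and flag support differs between GNU gzip, busybox gzip and particular bzip2 builds:

    # Lower gzip levels mainly trade ratio for speed; bzip2 -s trades ratio for memory.
    gzip -3 -c sample.log | wc -c
    gzip -9 -c sample.log | wc -c
    bzip2 -s -c sample.log | wc -c   # only if bzip2 is available on the box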

All in all, I'm not sure gzip even does any additional buffering. I didn't check the source code of the busybox implementation, but I imagine it shouldn't buffer much more than what the algorithm itself needs. I really suspect the stream buffers are the main bottleneck.

orion
  • I've already ruled out the pipe: I ran tests without gzip, and the file where the output is written can be read as expected. I had already disabled buffered output for libc/printf. – kikeenrique Feb 10 '16 at 14:10
  • Does using sync to force disk buffer flushing help? – PM 2Ring Feb 11 '16 at 05:52
  • Using sync on a single file would do that, yes. But I'm not sure that's a wise thing to do. It puts strain on the storage hardware and slows down the operation, if you keep syncing. – orion Feb 11 '16 at 09:16
  • Sure, sync does slow things down and puts undue stress on the hardware (and will probably make a massive impact on the life of a flash drive). But if the OP is losing data due to disk buffering then sync will prevent that. – PM 2Ring Feb 12 '16 at 13:11