I want to count the number of differences (differing bytes) between two large streams (devices/files), e.g. two hard disks, or one hard disk and /dev/zero.
The program(s) involved must be fast, as fast as the streams come in (say 1GB/s, although 0.2GB/s is probably OK), and they may use at most a few GB of RAM and of tmp files. In particular, there is no filesystem available that is big enough to store the differences to be counted. The streams are several TB in size.
The count need not (and in fact must not) treat whitespace or line feeds any differently from other characters. The streams are binary-only and not interpretable as text.
I said streams, although the devices are actually seekable. But for speed reasons, everything should preferably be done in one single streaming pass without seeks, if possible.
What I tried so far:
cmp -l /dev/sda /dev/sdb | wc
But this is way too slow: wc alone uses more than 50% of one CPU core, and the output is only about 2MB/s, which is too slow by a factor of 100.
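Since the bottleneck above is the text that cmp -l generates and wc then has to parse, one direction I am considering is counting in-process, so that no per-difference output is produced at all. Below is a minimal sketch in Python, not a measured solution: the function names and the 1 MiB chunk size are my own assumptions, it stops at the end of the shorter stream, and I have not benchmarked whether it reaches the required throughput. It XORs each pair of chunks as big integers and counts the zero bytes of the result, so memory use stays at a few chunks.

```python
import sys

CHUNK = 1 << 20  # 1 MiB per read; an arbitrary assumption, tune as needed


def count_diff_chunk(a: bytes, b: bytes) -> int:
    """Count positions where two equal-length byte strings differ."""
    n = len(a)
    # XOR the chunks as big integers: a result byte is zero exactly
    # where the two inputs hold the same byte.
    x = int.from_bytes(a, "little") ^ int.from_bytes(b, "little")
    return n - x.to_bytes(n, "little").count(0)


def count_diff_streams(fa, fb, chunk: int = CHUNK) -> int:
    """Single sequential pass over two binary file objects, no seeks."""
    total = 0
    while True:
        a = fa.read(chunk)
        b = fb.read(chunk)
        n = min(len(a), len(b))
        if n == 0:
            break  # at least one stream is exhausted
        total += count_diff_chunk(a[:n], b[:n])
        if len(a) != len(b):
            break  # one stream ended inside this chunk; ignore the tail
    return total


if __name__ == "__main__" and len(sys.argv) == 3:
    with open(sys.argv[1], "rb") as fa, open(sys.argv[2], "rb") as fb:
        print(count_diff_streams(fa, fb))
```

Saved under a hypothetical name such as count_diff.py, it would be run as: python3 count_diff.py /dev/sda /dev/sdb. Only the final count is printed, so nothing proportional to the number of differences is ever written.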