
I want to count the number of differences (differing bytes) between two large streams (devices/files), e.g. two hard disks, or one hard disk and /dev/zero.

The program(s) involved must be fast, as fast as the streams come in (say 1GB/s, although 0.2GB/s is probably OK), and they may use at most a few GB of RAM and of tmp files. In particular, there is no filesystem available that is big enough to store the differences to be counted. The streams are several TB in size.

The count need not (and in fact must not) treat whitespace or line feeds any differently from other characters. The streams are binary and not interpretable as text.

I said streams, although the devices are actually seekable. But for speed reasons, everything should preferably be done in a single streaming pass, without seeks, if possible.

What I tried so far:

cmp -l /dev/sda /dev/sdb | wc

But this is way too slow: wc alone uses more than 50% of one CPU core, and the throughput is only about 2MB/s, which is too slow by a factor of 100.

  • What do you hope to accomplish with this? – jordanm Jul 09 '13 at 01:03
  • How do you define 'number of differences'? Do we need to cope with insertions or deletions - if abcdefg changed to abXeZfg does that count as one, two, or three differences? – tripleee Jul 09 '13 at 11:22
  • simple position-for-position differences. No insertion/deletion. In your example, 3 differences. – psrandall Jul 10 '13 at 00:35

2 Answers

2

With seekable files, you can parallelize this. For example, you could do something like this (untested):

# Number of jobs to run at once
jobs=4
# Block size
bs=512
# Total number of blocks to work with
size=1000000
# Or, to calculate the size automatically for one of the disks:
#size=$(( $(blockdev --getsize64 /dev/sda) / $bs ))

# Number of blocks per job
count=$(( $size / $jobs )) 
for i in $(seq 1 $jobs) ; do
    seek=$(( ($i - 1) * $count ))
    cmp -l <(dd bs=$bs skip=$seek count=$count if=/dev/sda) <(dd bs=$bs skip=$seek count=$count if=/dev/sdb) | wc &
done

dd will only seek once, when it first starts.
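
Note that each background job prints its own count, so the per-job numbers still have to be added up at the end. A minimal sketch of one way to collect them (untested, assuming the variables from the script above and the devices from the question, and using wc -l so each job emits a single number):

tmpdir=$(mktemp -d)
for i in $(seq 1 $jobs) ; do
    seek=$(( ($i - 1) * $count ))
    # Each job writes its difference count to its own file
    cmp -l <(dd bs=$bs skip=$seek count=$count if=/dev/sda) <(dd bs=$bs skip=$seek count=$count if=/dev/sdb) | wc -l > "$tmpdir/$i" &
done
wait
# Add up the per-job counts
awk '{ sum += $1 } END { print sum }' "$tmpdir"/*
rm -r "$tmpdir"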

  • Parallelization is not enough to solve this problem. First of all, the current solution is too slow by a factor of 100 and I do not have the 100 CPUs which would be required to bring things up to the desired speed. Secondly, the real problem is that the output of cmp -l is too bloated and that wc looks for newlines and whitespace, making both ill-suited for this task. – psrandall Jul 09 '13 at 15:47
  • When its argument list is -l, GNU wc only counts newlines and bytes. That aside, I missed the words "slow by a factor of 100" in your question. I don't know of any existing utilities that can provide for your needs. You'll probably have to develop your own. –  Jul 09 '13 at 16:30
  • Thanks for the info that wc -l does not count words, hence is faster. – psrandall Jul 10 '13 at 00:27
  • The "slow by a factor of 100" is implicit in the numbers I mentioned (0.2GB/s required, 2MB/s actually delivered) – psrandall Jul 10 '13 at 00:33
  • I missed that too. To be honest, I rushed reading your question a bit. Sorry about that. –  Jul 10 '13 at 01:04
  • No problem, maybe I should have been more explicit. I like your parallelization solution btw; it will be useful for other problems. Not so much for this one though b/c of the number of CPUs required and b/c hard disks penalize seeking. – psrandall Jul 10 '13 at 18:31
2

You aren't going to get 1GB/s unless your hardware supports it. The fastest 7200rpm hard drives can reach about 200MB/s on part of their surface, and more like 100MB/s to 150MB/s over the whole surface. So your "probably OK" figure of 0.2GB/s is actually above the ideal figure unless you have unusually fast drives (10krpm, RAID 0).

cmp -l | wc may be limited by the CPU because you're asking wc to count words, which is a bit complicated. cmp -l | wc -l will reduce the load on the CPU.
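
For example, applied to the devices from the question, the cheaper pipeline would be:

cmp -l /dev/sda /dev/sdb | wc -l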

Unless you have very fast disks, cmp -l | wc -l is probably IO-bound already. If you're using the whole disk bandwidth, parallelization won't help you.

If cmp -l | wc -l is still CPU-bound, or if you want to free your CPU to do other things, an ad hoc program that does the count will perform better. Warning, untested.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BUFFER_SIZE 1048576

unsigned char buf1[BUFFER_SIZE];
unsigned char buf2[BUFFER_SIZE];

int main (int argc, char *argv[]) {
    FILE *f1, *f2;
    unsigned long long total = 0, diffs = 0;
    unsigned n1, n2, n, i;
    if (argc < 3) {fprintf(stderr, "usage: %s file1 file2\n", argv[0]); return 2;}
    f1 = fopen(argv[1], "r");
    if (f1 == NULL) {perror(argv[1]); return 2;}
    f2 = fopen(argv[2], "r");
    if (f2 == NULL) {perror(argv[2]); return 2;}
    do {
        /* Read one chunk from each stream, then compare as many bytes
           as both reads delivered. */
        n1 = fread(buf1, 1, BUFFER_SIZE, f1);
        n2 = fread(buf2, 1, BUFFER_SIZE, f2);
        if (n1 > n2) n = n2; else n = n1;
        for (i = 0; i < n; i++) if (buf1[i] != buf2[i]) ++diffs;
        total += n;
    } while (n1 == BUFFER_SIZE && n2 == BUFFER_SIZE); /* stop at the first short read (EOF or error) */
    printf("%llu\n", diffs);
    if (ferror(f1)) {perror(argv[1]);}
    if (ferror(f2)) {perror(argv[2]);}
    if (n1 != n2) {
        /* One stream ended before the other: report how far we got. */
        fprintf(stderr, "Stopped after %llu bytes.\n", total);
        return 1;
    }
    return 0;
}
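
To try it out, save the program, compile it, and give it the two devices as arguments; reading raw devices will usually require root. A usage sketch (the file name bytediff.c is only a placeholder):

# 'bytediff.c' is just a placeholder name for the program above
gcc -O2 -o bytediff bytediff.c
./bytediff /dev/sda /dev/sdb
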
  • Lovely, thanks a lot! I like it. How long did it take you to write? I was thinking more about using out-of-the-box shell commands but this will work just as well. – psrandall Jul 10 '13 at 18:32
  • And yes, you are right that 0.2GB/s is the maximum speed of the incoming stream, which is why I said software running at this speed is OK. The 1GB/s figure was just to ensure that any temporary backlog is dealt with quickly, ensuring maximum streaming speed. – psrandall Jul 10 '13 at 18:37