1

I am searching for a way to get statistics on a very large file (multiple times larger than the available RAM) that tell me which byte values are present in the file and how often:

A0 01 00 FF 77 01 77 01 A0

I need to know how many A0 bytes there are in this file, how many 01, and so on. The result could be:

A0: 2
01: 3
00: 1
FF: 1
77: 2

Therefore this question is very close to the question How to count the number of bytes in a file, grouping the same bytes? but none of the existing answers works for larger files. From my understanding, all answers require RAM at least equal to the size of the file being tested (sometimes multiple times that).

Hence the answers don't work on systems with little RAM, e.g. a Raspberry Pi processing a multi-GB file.

Is there a simple solution that works for any file size, even if, for example, only 512 MB of RAM are available?

Robert
  • 163

4 Answers

1

Whip up a tiny C (or Perl, Python, whatever) program that reads one byte at a time and keeps totals. Any language that isn't totally braindead, on a reasonable operating system, will handle buffering and other chores transparently in a reasonably efficient way.
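As a sketch of that idea in Python (one of the "whatever" languages; all names are illustrative): the byte-at-a-time read leans on the buffered file object, so the OS still sees large block reads and memory use stays constant at a 256-entry table.

```python
import sys

def count_bytes(path):
    """Tally how often each of the 256 possible byte values occurs in a file.

    Reads one byte at a time and keeps running totals; the buffered file
    object turns the single-byte reads into large block reads under the
    hood, so RAM use is constant regardless of file size.
    """
    counts = [0] * 256
    with open(path, "rb") as f:
        while True:
            b = f.read(1)  # one byte; buffering does the heavy lifting
            if not b:
                break
            counts[b[0]] += 1
    return counts

if __name__ == "__main__" and len(sys.argv) > 1:
    for value, n in enumerate(count_bytes(sys.argv[1])):
        if n:
            print("%02X: %d" % (value, n))
```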

vonbrand
  • 18,253
  • I tried it with a Python 3 program that reads the data block-wise (1 MB) and processes it. On a Raspberry Pi 2 this script requires more than one second per megabyte! I don't think Perl will be faster, and C is a language that is no fun. So I was hoping for a well-designed, fast, existing tool. Anyway, we are not on stackoverflow.com here... – Robert Mar 30 '20 at 13:29
0

Not sure if that is the solution you are looking for, but I would just split the file up into multiple smaller files (e.g. via split -b 100MB yourfile), apply the methods described in the thread you linked, and then add up the bytes counted in the separate files using the spreadsheet software of your choice.
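The adding-up step can also be done without a spreadsheet. A sketch in Python of the same split-and-merge idea (the chunk size mirrors split -b 100MB; all names are illustrative): each chunk is counted on its own, and the 256 partial totals are then summed.

```python
import sys

CHUNK = 100 * 1024 * 1024  # chunk size, mirroring split -b 100MB

def count_chunk(data):
    """Count byte values in one chunk, as if it were one of the split files."""
    counts = [0] * 256
    for b in data:
        counts[b] += 1
    return counts

def merge_counts(partials):
    """Add up the per-chunk tallies -- the 'spreadsheet' step."""
    total = [0] * 256
    for part in partials:
        for value, n in enumerate(part):
            total[value] += n
    return total

def count_file(path, chunk_size=CHUNK):
    """Read the file chunk by chunk; only one chunk is in RAM at a time."""
    partials = []  # 256 small integers per chunk -- negligible memory
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_size)
            if not data:
                break
            partials.append(count_chunk(data))
    return merge_counts(partials)

if __name__ == "__main__" and len(sys.argv) > 1:
    for value, n in enumerate(count_file(sys.argv[1])):
        if n:
            print("%02X: %d" % (value, n))
```

Compared to actually splitting the file, this avoids writing the multi-GB data back to disk.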

luca
  • 142
  • 2
    Don't you think that this approach is a bit inefficient? You would have to read every byte of the file twice and write it once, for an operation that can be done with a single read pass and a constant amount of RAM if implemented correctly... – Robert Mar 30 '20 at 12:54
  • 1
    @Robert This is true, it is indeed inefficient, but depending on how often you have to do this and how much time you have, a quick and dirty solution that is simple to implement might do the job. But there will probably be someone who can give you a better solution; that was just my first thought – luca Mar 30 '20 at 12:57
0

As there does not seem to be an existing tool that does what I want, I tried two self-implemented "scripts" in the languages I know best: Python and Java:

1st try: Python

The following Python 3 script works on any file size and counts how often each byte value occurs. Unfortunately it works very, very slowly: using Python 3.5 on a Raspberry Pi 2 it requires more than one second to process one megabyte!

#!/usr/bin/python3
import sys
file_name = sys.argv[1]
count = 0
block_size = 1048576
byte_count = [0] * 256
with open(file_name, "rb") as f:
    data = f.read(block_size)
    while data:
        for b in data:
            byte_count[b] += 1
        count = count + len(data)
        print("%d MiB"%(count / 1048576))
        data = f.read(block_size)

print("read bytes: {}".format(count))
for i in range(256):
    b_c = byte_count[i]
    print("{} : {} ({:f} %)".format('0x%02x'%i, b_c,  b_c / count * 100))
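Most of the time goes into the for b in data loop, which executes one interpreter step per byte. A variant that delegates the scanning of each block to bytes.count (a sketch; I have not benchmarked it on the Raspberry Pi, so treat the speedup as an assumption) keeps the constant memory use but avoids the per-byte Python loop:

```python
import sys

def count_block(data, byte_count):
    """Tally one block: bytes.count scans the whole block in C once per
    possible value, so there is no per-byte loop at the Python level."""
    for value in range(256):
        byte_count[value] += data.count(value)

def count_file(file_name, block_size=1048576):
    """Blockwise byte-frequency count with a constant 256-entry table."""
    byte_count = [0] * 256
    with open(file_name, "rb") as f:
        while True:
            data = f.read(block_size)
            if not data:
                break
            count_block(data, byte_count)
    return byte_count

if __name__ == "__main__" and len(sys.argv) > 1:
    byte_count = count_file(sys.argv[1])
    count = sum(byte_count)
    print("read bytes: {}".format(count))
    for i in range(256):
        print("{} : {} ({:f} %)".format('0x%02x' % i, byte_count[i],
                                        byte_count[i] / count * 100))
```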

2nd try: Java

For my second try I used Java, and it seems that a statically typed language with a JIT that reuses buffers works much more efficiently. The Java version running on Java 9 was 40x faster than the Python version, even though both versions work the same way.

  • Compile: javac CountByteValues.java
  • Run: java -cp . CountByteValues <filename>


// CountByteValues.java
import java.io.FileInputStream;
import java.io.IOException;

public class CountByteValues {

    public static void main(String[] args) {
        try (FileInputStream in = new FileInputStream(args[0])) {
            long[] byteCount = new long[256];
            byte[] buffer = new byte[1048576];
            int read;
            long count = 0;
            while ((read = in.read(buffer)) >= 0) {
                for (int i = 0; i < read; i++) {
                    byteCount[0xFF & buffer[i]]++;
                }
                count += read;
                System.out.println((count / 1048576) + " MiB");
            }

            System.out.println("Bytes read: " + count);
            for (int i = 0; i < byteCount.length; i++) {
                System.out.println(String.format("0x%x %d (%.2f%%)", i, byteCount[i], byteCount[i] * 100f / count));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Robert
  • 163
0

As is usual, a C program will be the fastest.
On a computer where a Perl solution takes 5 seconds,
the following C code takes only 0.069 s:

#include <stdio.h>

#define BUFFERLEN 4096

int main(){
    // This program reads standard input, calculates the frequencies of the
    // different byte values and prints the frequency of each byte value upon exit.
    //
    // Example:
    //
    //     $ echo "Hello world" | ./a.out
    //
    // Copyright (c) 2015 Björn Dahlgren
    // Open source: MIT License

    long long tot = 0; // long long guaranteed to be 64 bits i.e. 16 exabyte
    long long n[256]; // One byte == 8 bits => 256 unique bytes

    const int bufferlen = BUFFERLEN;
    char buffer[BUFFERLEN];
    int i;
    size_t nread;

    for (i=0; i<256; ++i)
        n[i] = 0;

    do {
        nread = fread(buffer, 1, bufferlen, stdin);
        for (i = 0; i < nread; ++i)
            ++n[(unsigned char)buffer[i]];
        tot += nread;
    } while (nread == bufferlen);
    // here you may want to inspect ferror or feof

    for (i=0; i<256; ++i){
        printf("%d ", i);
        printf("%f\n", n[i]/(float)tot);
    }
    return 0;
}

Copied from https://unix.stackexchange.com/a/209786/232326