As there does not seem to be an existing tool that does what I want, I tried two self-written "scripts" in the languages I know best, Python and Java:
1st try: Python
The following Python 3 script works on files of any size and counts how often each byte value occurs. Unfortunately, it is very slow: using Python 3.5 on a Raspberry Pi 2, it needs more than one second to process one megabyte!
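#!/usr/bin/python3
import sys

file_name = sys.argv[1]
count = 0
block_size = 1048576  # read 1 MiB at a time
byte_count = [0] * 256  # one counter per possible byte value

with open(file_name, "rb") as f:
    data = f.read(block_size)
    while data:
        # iterating over a bytes object yields ints in the range 0..255
        for b in data:
            byte_count[b] += 1
        count = count + len(data)
        print("%d MiB" % (count // 1048576))  # progress output
        data = f.read(block_size)

print("read bytes: {}".format(count))
# report all 256 byte values, 0x00 through 0xff
for i in range(256):
    b_c = byte_count[i]
    # guard against dividing by zero on an empty file
    percent = b_c / count * 100 if count else 0.0
    print("{} : {} ({:f} %)".format('0x%02x' % i, b_c, percent))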
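Most of the time in the script above is probably spent in the per-byte `for` loop, which runs in the interpreter. A possible variant hands each block to `collections.Counter`, which on CPython tallies the elements of an iterable in C. This is only a sketch that I have not benchmarked on the Raspberry Pi, so any speed-up is an expectation, not a measurement. The names `count_bytes` and `print_report` are made up for this example.

```python
#!/usr/bin/python3
# Sketch: same counting logic, but the per-byte tally is delegated to
# collections.Counter. count_bytes/print_report are hypothetical helper names.
from collections import Counter

BLOCK_SIZE = 1048576  # read 1 MiB at a time, as in the script above


def count_bytes(file_name, block_size=BLOCK_SIZE):
    """Return (byte_count, total): a 256-entry list of counts and the bytes read."""
    counter = Counter()
    total = 0
    with open(file_name, "rb") as f:
        data = f.read(block_size)
        while data:
            # Iterating a bytes object yields ints 0..255; update() tallies them.
            counter.update(data)
            total += len(data)
            data = f.read(block_size)
    # A Counter returns 0 for missing keys, so unseen byte values get a count of 0.
    return [counter[i] for i in range(256)], total


def print_report(byte_count, total):
    """Print one line per byte value in the same format as the script above."""
    print("read bytes: {}".format(total))
    for i, b_c in enumerate(byte_count):
        percent = b_c / total * 100 if total else 0.0
        print("0x{:02x} : {} ({:f} %)".format(i, b_c, percent))
```

To use it like the original script, call `print_report(*count_bytes(sys.argv[1]))` after an `import sys`.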
2nd try: Java
For my second try I used Java, and it seems that a statically typed language with a JIT compiler that reuses buffers works far more efficiently. The Java version running on Java 9 was 40x faster than the Python version, even though both versions work the same way.
- Compile:
javac CountByteValues.java
- Run:
java -cp . CountByteValues <filename>
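// CountByteValues.java
import java.io.FileInputStream;
import java.io.IOException;

public class CountByteValues {

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java CountByteValues <filename>");
            return;
        }
        try (FileInputStream in = new FileInputStream(args[0])) {
            long[] byteCount = new long[256]; // one counter per possible byte value
            byte[] buffer = new byte[1048576]; // 1 MiB buffer, reused for every read
            int read;
            long count = 0;
            // read() returns -1 at end of file
            while ((read = in.read(buffer)) >= 0) {
                for (int i = 0; i < read; i++) {
                    // Java bytes are signed; mask to get an index in 0..255
                    byteCount[0xFF & buffer[i]]++;
                }
                count += read;
                System.out.println((count / 1048576) + " MiB");
            }
            System.out.println("Bytes read: " + count);
            for (int i = 0; i < byteCount.length; i++) {
                // use double arithmetic: float loses precision for large counts
                double percent = count == 0 ? 0.0 : byteCount[i] * 100.0 / count;
                System.out.println(String.format("0x%x %d (%.2f%%)", i, byteCount[i], percent));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}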