13

I'd like to know the equivalent of

cat inputfile | sed 's/\(.\)/\1\n/g' | sort | uniq -c

presented in https://stackoverflow.com/questions/4174113/how-to-gather-characters-usage-statistics-in-text-file-using-unix-commands for producing character usage statistics in text files, but for binary files, counting raw bytes instead of characters. That is, the output should be in the form of

18383 57
12543 44
11555 127
 8393 0

It doesn't matter if the command takes as long as the referenced one for characters.

If I apply the command for characters to binary files, the output contains statistics for arbitrarily long sequences of unprintable characters (I'm not looking for an explanation of that).

5 Answers

11

With GNU od:

od -vtu1 -An -w1 my.file | sort -n | uniq -c

Or more efficiently with perl (also outputs a count (0) for bytes that don't occur):

perl -ne 'BEGIN{$/ = \4096};
          $c[$_]++ for unpack("C*");
          END{for ($i=0;$i<256;$i++) {
              printf "%3d: %d\n", $i, $c[$i]}}' my.file
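
To get the od pipeline's output into the exact "count value" form the question asks for, ordered by frequency, it can be wrapped in a small shell function. This is only a sketch: the name byte_hist_od is made up for illustration, and the trailing awk and sort stages are additions that are not part of the answer above. It assumes GNU od, since -w1 is a GNU extension.

```shell
# Sketch only: byte_hist_od is a hypothetical helper name, not from the answer.
# Requires GNU od, because -w1 (one byte per output line) is a GNU extension.
byte_hist_od() {
    od -v -A n -t u1 -w1 < "$1" |  # one unsigned decimal byte value per line
        sort -n |                  # group identical values together
        uniq -c |                  # prefix each value with its count
        awk '{ print $1, $2 }' |   # normalize to "count value"
        sort -k1,1nr -k2,2n        # most frequent first, ties by byte value
}

# Usage: byte_hist_od my.file
```

The first sort still sees one line per input byte, so this stays as slow on large files as the plain pipeline; only the presentation changes.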
5

For large files using sort will be slow. I wrote a short C program to solve the equivalent problem (see this gist for Makefile with tests):

#include <stdio.h>

#define BUFFERLEN 4096

/*
 * Reads standard input, counts how often each byte value occurs, and
 * prints the relative frequency of every byte value on exit.
 *
 * Example:
 *
 *   $ echo "Hello world" | ./a.out
 *
 * Copyright (c) 2015 Björn Dahlgren
 * Open source: MIT License
 */
int main(void) {
    long long tot = 0;       // long long is at least 64 bits wide
    long long n[256] = {0};  // one byte == 8 bits => 256 possible values
    unsigned char buffer[BUFFERLEN];
    size_t nread, j;         // size_t matches the return type of fread
    int i;

    do {
        nread = fread(buffer, 1, BUFFERLEN, stdin);
        for (j = 0; j < nread; ++j)
            ++n[buffer[j]];
        tot += nread;
    } while (nread == BUFFERLEN);
    // here you may want to inspect ferror or feof

    for (i = 0; i < 256; ++i) {
        printf("%d ", i);
        printf("%f\n", n[i] / (float)tot);
    }
    return 0;
}

usage:

gcc main.c
cat my.file | ./a.out
  • Do you have a test? There are no comments in the code. In general it's not a good idea to use or publish untested or uncommented code, no matter whether it's common practice. The possibility to review revisions is also limited on this platform; consider an explicit code hosting platform. – Kalle Richter Jun 16 '15 at 09:22
  • @KarlRichter tests were a good idea to add. I found the old version choked on '\0' characters. This version should work (passes a few basic tests at least). – Bjoern Dahlgren Jun 16 '15 at 12:14
  • fgets gets a line, not a buffer-full. You're scanning the 4096-byte full buffer for each line read from stdin. You need fread here, not fgets. – Stéphane Chazelas Jun 16 '15 at 13:43
  • @StéphaneChazelas great - didn't know of fread (seldom do I/O from C). updated example to use fread instead. – Bjoern Dahlgren Jun 17 '15 at 11:00
  • I've added an if block around the printf statements, that makes the output more readable if some bytes don't occur in the input file: https://gist.github.com/martinvonwittich/2f0a9e5dcf63cb316765260694b73172 – Martin von Wittich Jun 04 '16 at 11:08
3

As mean, sigma and CV are often important when judging statistical data about the content of binary files, I've created a command-line program that graphs all this data as an ASCII circle of byte deviations from sigma.
http://wp.me/p2FmmK-96
It can be used with grep, xargs and other tools to extract statistics.

circulosmeos
2

The recode program can do this quickly even for large files, producing frequency statistics either for bytes or for the characters of various character sets. E.g. to count byte frequencies:

$ echo hello there > /tmp/q
$ recode latin1/..count-characters < /tmp/q
1  000A LF   1  0020 SP   3  0065 e    2  0068 h    2  006C l    1  006F o
1  0072 r    1  0074 t

Caution - specify your file to recode as standard input, otherwise it will silently replace it with the character frequencies!

Use recode utf-8/..count-characters < file to treat the input file as utf-8. Many other character sets are available. Note that recode will fail if the file contains any illegal characters.

nealmcb
1

This is similar to Stéphane's od answer, but it shows the hex value and the ASCII character of each byte. It is also sorted by frequency / number of occurrences.

xxd -c1 my.file|cut -c10-|sort|uniq -c|sort -nr

I don't think this is efficient since many processes are started but it's good for single files, particularly small files.
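
If the cost on larger files is a concern, one option is to let awk do the counting so that no stage has to sort one line per input byte. The following is a rough sketch of that different technique, not part of any answer here: the function name byte_hist_awk is invented for illustration, and it prints decimal byte values rather than the hex and ASCII columns that xxd gives. It uses only the POSIX od options -v, -A n and -t u1.

```shell
# Sketch only: byte_hist_awk is a hypothetical helper name.
# awk keeps at most 256 counters, so the final sort sees at most 256 lines.
byte_hist_awk() {
    od -v -A n -t u1 < "$1" |
        awk '{ for (i = 1; i <= NF; i++) count[$i]++ }
             END { for (b in count) print count[b], b }' |
        sort -k1,1nr -k2,2n     # most frequent first, ties by byte value
}

# Usage: byte_hist_awk my.file
```

For a file containing "hello there" followed by a newline, the first line of output should be "3 101" (three occurrences of e), which agrees with the recode example in the other answer.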

brendan