How can I convert two-valued text data to binary (bit-representation)

Question

I have a text file containing only two (2) possible characters (and maybe newlines, \n). Example:

ABBBAAAABBBBBABBABBBABBB

(Size 24 bytes)

How can I convert this to a binary file, meaning a bit representation, with each one of the two possible values being assigned to 0 or 1?

Resulting binary file (0=A, 1=B):

011100001111101101110111     # 24 bits - not 24 ASCII characters

Resulting file in Hex:

70FB77                       # 3 bytes - not 6 ASCII characters

I would be mostly interested in a command-line solution (maybe dd, xxd, od, tr, printf, bc). Also, regarding the inverse: how do I get back the original?

not really.. see the comments (24 bits - not 24 ASCII characters). — henfiber, Jun 25 '15 at 16:02
You probably could write a script in python, perl, or ocaml, (or maybe gnu awk) ... to do that. Does it count as a command line solution? — Basile Starynkevitch, Jun 25 '15 at 16:11
A script in python/perl/awk solution would be acceptable if it was written in a way I can use in a pipe (e.g. cat file | 2bits.py ) — henfiber, Jun 25 '15 at 16:15

lcd047 · Accepted Answer · 2015-06-26T13:10:19.197

5

Another perl:

perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_'

Proof:

$ echo ABBBAAAABBBBBABBABBBABBB | \
    perl -pe 'BEGIN { binmode \*STDOUT } chomp; tr/AB/\0\1/; $_ = pack "B*", $_' | \
    od -tx1
0000000 70 fb 77
0000003

The above reads input one line at a time. It's up to you to make sure the lines are exactly what they are supposed to be.
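
For readers without Perl at hand, the same packing idea can be sketched with nothing but the shell and coreutils. This is only an illustration under stated assumptions: ab2bin is a made-up name, anything that is not A or B is discarded (so line boundaries are ignored rather than respected), and a short final group is right-padded with zero bits.

```shell
# ab2bin: hypothetical helper; reads A/B text on stdin, writes packed bytes
ab2bin() {
    # keep only A and B, map them to 0 and 1, cut into 8-bit groups
    tr -dc 'AB' | tr 'AB' '01' | fold -w8 |
    while IFS= read -r bits || [ -n "$bits" ]; do
        # right-pad a short final group with zero bits
        while [ "${#bits}" -lt 8 ]; do bits="${bits}0"; done
        # accumulate the byte value, most significant bit first
        value=0
        while [ -n "$bits" ]; do
            case $bits in
                1*) value=$((value * 2 + 1)) ;;
                *)  value=$((value * 2)) ;;
            esac
            bits=${bits#?}
        done
        # emit the byte through an octal escape in the printf format
        printf "\\$(printf '%03o' "$value")"
    done
}

printf 'ABBBAAAABBBBBABBABBBABBB\n' | ab2bin | od -An -tx1
```

The last line prints the three bytes 70 fb 77. A loop like this will be far slower than the Perl one-liner on large inputs; it is meant to show the mechanics, not to replace it.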

Edit: The reverse operation:

#!/usr/bin/env perl

binmode \*STDIN;

while ( defined ( $_ = getc ) ) {
    $_ = unpack "B*";
    tr/01/AB/;
    print;
    print "\n" if ( not ++$cnt % 3 );
}
print "\n" if ( $cnt % 3 );

This reads a byte of input at a time.

Edit 2: Simpler reverse operation:

perl -pe 'BEGIN { $/ = \3; $\ = "\n"; binmode \*STDIN } $_ = unpack "B*"; tr/01/AB/'

The above reads 3 bytes at a time from STDIN (but receiving EOF in the middle of a sequence is not a fatal problem).
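
The unpacking direction can also be sketched in plain shell, which may help in seeing what unpack "B*" does to each byte. The function name bin2ab is an assumption made for this example; it leans on od to turn every input byte into a decimal number and then peels the bits off arithmetically.

```shell
# bin2ab: hypothetical helper; reads bytes on stdin, prints 24 letters per line
bin2ab() {
    # od prints every input byte as an unsigned decimal number
    od -An -v -tu1 | tr -s ' ' '\n' |
    while read -r byte; do
        [ -n "$byte" ] || continue
        bits=''
        i=0
        while [ "$i" -lt 8 ]; do
            # take the lowest bit and prepend its letter
            if [ $((byte % 2)) -eq 1 ]; then bits="B$bits"; else bits="A$bits"; fi
            byte=$((byte / 2))
            i=$((i + 1))
        done
        printf '%s' "$bits"
    done | fold -w24
    # fold does not terminate the last line, so finish it here
    echo
}

printf '\160\373\167' | bin2ab
```

The final command feeds the bytes 0x70 0xfb 0x77 (written as octal escapes) and prints ABBBAAAABBBBBABBABBBABBB.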

edited Jun 26 '15 at 13:10

answered Jun 25 '15 at 17:16


Nice. I've always hated pack formats, B* is fantastic. – glenn jackman Jun 25 '15 at 17:22
Is there an equally beautiful perl one-liner to get back the original? (I suppose something in the lines of unpack and y/01/AB) – henfiber Jun 26 '15 at 07:47
@henfiber Not really. You could naively do it like this: perl -pe 'BEGIN { undef $/; binmode \*STDIN } $_ = unpack "B*"; tr/01/AB/' | fold -b24. But this reads the entire input in memory. Reading input 3 bytes at a time with error checking and all can be done of course, but it won't be a nice one-liner. – lcd047 Jun 26 '15 at 08:06
@henfiber I added a script that does the reverse operation. Like I said, it isn't an one-liner. – lcd047 Jun 26 '15 at 08:33
Does this also apply to the convert-to-binary one-liner? Does it need to load the entire input into memory? – henfiber Jun 26 '15 at 09:11
@henfiber No, the direct script reads input one line at a time. The reversing script I posted above reads input a byte at a time. – lcd047 Jun 26 '15 at 09:41
I see, in the bit representation there are no new lines (\n), therefore the perl line-mode operation would read the entire input into memory. But couldn't I do something like this: fold -b3 | perl -pe 'BEGIN { undef $/; binmode \*STDIN } $_ = unpack "B*"; tr/01/AB/' | tr -d '\n' to read 3 bytes at a time? Am I missing something? – henfiber Jun 26 '15 at 11:31
@henfiber Actually, Perl is eternal, human memory is finite. ;) It's perfectly possible to read (at most) 3 bytes at a time with an one-liner, albeit not as nicely as in line mode. Remembering something useful every day. ;) – lcd047 Jun 26 '15 at 13:06
Thanks, that worked perfectly. The only thing I don't understand is binmode \*STDIN and binmode \*STDOUT. Would you mind explaining what they do and/or why they are necessary? Including or excluding them did not make any difference in my tests. – henfiber Jun 26 '15 at 14:47
@henfiber They make a difference on systems that need line-termination translations (i.e. Windows). This is explained in the perlfunc manual. – lcd047 Jun 26 '15 at 17:07

mikeserv · Answer 2 · 2015-06-28T10:57:55.677

{   printf '2i[q]sq[?z0=qPl?x]s?l?x'
    tr -dc AB | tr AB 01 | fold -b24
}   <infile   | dc

In making the following statement, @lcd047 has pretty well nailed my earlier state of confusion:

You seem to be confused by the output of od. Use od -tx1 to look at bytes. od -x reads words, and on little endian machines that swaps bytes. I didn't follow closely the exchange above, but I think your initial version was correct, and you don't need to mess with byte order at all. Just use od -tx1, not od -x.

Now this makes me feel a lot better - the earlier need for dd conv=swab was bugging me all day. I couldn't pin it, but I knew there was something wrong w/ it. Being able to explain it away in my own stupidity is very comforting - especially since I learned something.
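
The difference between the two od invocations is easy to reproduce. A small sketch, using the question's three bytes written as octal escapes:

```shell
# the three bytes 0x70 0xfb 0x77
printf '\160\373\167' | od -An -tx1   # one byte per column, in file order
printf '\160\373\167' | od -An -x     # 16-bit words; digit order follows the host's byte order
```

The first command prints 70 fb 77 on any machine. The second groups the bytes into two-byte words, so on a little-endian host the bytes within each word appear swapped even though the file itself is unchanged.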

Anyway, that will delete every byte which isn't [AB], then translate those to [01] accordingly, before folding the resulting stream at 24 bytes per line. dc ? reads a line at a time, checks if input contained anything, and, if so, Prints the byte value of that number to stdout.

From man dc:

P
- Pops off the value on top of the stack. If it is a string, it is simply printed without a trailing newline. Otherwise it is a number, and the integer portion of its absolute value is printed out as a "base (UCHAR_MAX+1)" byte stream.
i
- Pops the value off the top of the stack and uses it to set the input radix.
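
To make the "base (UCHAR_MAX+1)" wording concrete: with 8-bit chars that is base 256, so a number is written out as its base-256 digits, most significant first. The sketch below imitates that with shell arithmetic. num_to_bytes is a made-up name and not part of dc, and it only handles non-negative integers that fit in shell arithmetic.

```shell
# num_to_bytes: hypothetical helper imitating what P does with a number
num_to_bytes() {
    n=$1
    out=''
    # collect base-256 digits as octal escapes, least significant first,
    # prepending each one so the most significant ends up in front
    while :; do
        out="\\$(printf '%03o' "$((n % 256))")$out"
        n=$((n / 256))
        [ "$n" -gt 0 ] || break
    done
    printf "$out"
}

num_to_bytes $((0x70fb77)) | od -An -tx1
```

This prints 70 fb 77. Because a number carries no leading zeros, a 24-bit line that begins with eight zero bits would presumably come out as fewer than three bytes under this scheme, which is worth checking if the data can start a line with a long run of A.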

some shell automation

Here is a shell function I wrote based on the above which can go both ways:

ABdc()( HOME=/dev/null  A='[fc[fc]]sp[100000000o]p2o[fc]' B=2i
        case    $1      in
        (-B) {  echo "$B"; tr AB 01      | paste -dP - ~      ; }| dc;;
        (-A) {  echo "$A"; od -vAn -tu1  | paste -dlpx - ~ ~ ~; }| dc|
         dc  |  paste - - - ~            | expand -t10,20,30     |
                cut -c2-9,12-19,22-29    | tr ' 01' AAB         ;;
        (*)     set '' "$1";: ${1:?Invalid opt: "'$2'"}         ;;
        esac
)

That will translate the ABABABA stuff to bytes with -B, so you can just do:

ABdc -B <infile

But it will translate arbitrary input to 24 ABABABA bit-per-byte encoded strings - in the same form as that presented for example in the question - w/ -A.

seq 5 | ABdc -A | tee /dev/fd/2 | ABdc -B

AABBAAABAAAABABAAABBAABA
AAAABABAAABBAABBAAAABABA
AABBABAAAAAABABAAABBABAB
AAAABABAAAAAAAAAAAAAAAAA
1
2
3
4
5

For -A output I rolled in cut, expand, and od here, which I'll get into in a minute, but I also added another dc. I dropped the line-for-line ? read dc script for another method which works an array at time with f - which is a command that prints the full dc command-stack to stdout. Of course, because dc is a stack-oriented last-in,first-out type of application, that means that the full-stack comes out in the reverse order it went in.

This might be a problem, but I use another dc anyway with an output radix set to 100000000 to handle all of the 0-padding as simply as possible. And when it reads the other's last-in,first-out stream, it applies that logic to it all over again, and it all comes out in the wash. The two dcs work in concert like this:

{   echo '[fc[fc]]sp[100000000o]p2o[fc]'
    echo some data | 
    od -An -tu1        ###arbitrary input to unsigned decimal ints
    echo lpx           ###load macro stored in p and execute
} | tee /dev/fd/2  |   ###just using tee to show stream stages
dc| tee /dev/fd/2  |dc

...the stream per the first tee...

[fc[fc]]sp[100000000o]pc2o[fc]            ###dc's init cmd from 1st echo
 115 111 109 101  32 100  97 116  97  10  ###od's output
lpx                                       ###load p; execute

...per the second tee, as written from dc to dc...

100000000o                             ###first set output radix
1010                                   ###bin/rev vs of od's out
1100001                                ###dc #2 reads it in, revs and pads it 
1110100                                
1100001
1100100
100000
1100101
1101101
1101111                                ###this whole process is repeated
1110011                                ###once per od output line, so
fc                                     ###each worked array is 16 bytes.

...and the output which the second dc writes is...

From there the function pastes it on <tabs>...

 01110011    01101111    01101101
 01100101    00100000    01100100
 01100001    01110100    01100001
 00001010

...expands <tabs> to spaces at 10 column intervals...

 01110011  01101111  01101101
 01100101  00100000  01100100
 01100001  01110100  01100001
 00001010

...cuts away all but bytes 2-9,12-19,22-29...

011100110110111101101101
011001010010000001100100
011000010111010001100001
00001010

...and translates <spaces> and zeroes to A and ones to B...

ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA

You can see on the last line there my primary motivation for including expand - it's such a lightweight filter, and it very easily ensures that every sequence written - even the last - is padded out to 24 encoded-bits. When that process is reversed, and the strings are decoded to -Byte-value, there are two appended NULs:

ABdc -B <<\IN | od -tc
ABBBAABBABBABBBBABBABBAB
ABBAABABAABAAAAAABBAABAA
ABBAAAABABBBABAAABBAAAAB
AAAABABAAAAAAAAAAAAAAAAA
IN

...as you can see...

0000000   s   o   m   e       d   a   t   a  \n  \0  \0
0000014
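
The padding stage can be tried in isolation. The sketch below assumes that a final line holding a single byte reaches expand as one field followed by three tabs, which is what paste would produce for a row with three empty columns.

```shell
# a final line holding one byte, followed by the tabs paste would add
printf ' 00001010\t\t\t\n' |
expand -t10,20,30 |          # tabs become spaces up to columns 10, 20, 30
cut -c2-9,12-19,22-29 |      # keep the three 8-character fields
tr ' 01' AAB                 # spaces and zeros become A, ones become B
```

The result is AAAABABAAAAAAAAAAAAAAAAA: the two missing bytes arrive at tr as runs of spaces and are therefore encoded as zero bits.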

real world data

I played with it, and tried it with some simple, realistic streams. I constructed this elaborate pipeline for staged reports...

{                            ###dunno why, but I often use man man
    (                        ###as a test input source
        {   man man       |  ###streamed to tee
            tee /dev/fd/3 |  ###branched to stdout
            wc -c >&2        ###and to count source bytes
        }   3>&1          |  ###the branch to stdout is here
        ABdc -A           |  ###converted to ABABABA
        tee /dev/fd/3     |  ###branched again
        ABdc -B              ###converted back to bytes
        times >&2            ###the process is timed
    ) | wc -c >&2            ###ABdc -B's output is counted
} 3>&1| wc -c                ###and so is the output of ABdc -A

I don't have any good basis for performance comparison, here, though. I can only say that I was driven to this test when I was (perhaps naively) impressed enough to do so by...

man man | ABdc -A | ABdc -B

...which painted my terminal screen w/ man's output at the same discernible speed as the unfiltered command might do. The output of the test was...

37595                       ###source byte count
0m0.000000s 0m0.000000s     ###shell processor time nil
0m0.720000s 0m0.250000s     ###shell children's total user, system time
37596                       ###ABdc -B output byte count
313300                      ###ABdc -A output byte count
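
The three byte counts are consistent with each other, which can be checked with a little arithmetic. A sketch, assuming each encoded line carries 3 source bytes as 24 letters plus a newline:

```shell
src=37595                        # source byte count reported by wc -c
lines=$(( (src + 2) / 3 ))       # 3 bytes per encoded line, rounded up
echo "lines:   $lines"
echo "decoded: $(( lines * 3 ))"     # bytes after the round trip, padding included
echo "encoded: $(( lines * 25 ))"    # 24 letters plus a newline per line
```

That gives 12532 lines, 37596 decoded bytes and 313300 encoded bytes, so the one extra byte in the round trip is the NUL padding on the last line.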

initial tests

The rest is just a more simple proof of concept that it works at all...

printf %s ABBBAAAABBBBBABBABBBABBB|
tee - - - - - - - -|
tee - - - - - - - - - - - - - - - |
{   printf '2i[q]sq[?z0=qPl?x]s?l?x'
    tr -dc AB | tr AB 01 | fold -b24
} | dc        | od -tx1

0000000 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000020 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000040 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000060 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000100 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000120 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000140 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000160 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000200 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000220 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000240 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000260 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000300 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000320 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000340 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000360 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000400 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000420 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000440 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000460 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000500 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000520 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000540 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000560 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000600 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70
0000620 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb
0000640 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77 70 fb 77
0000660

Shouldn't that be 70FB77 as in the OP's question? Just asking, I know nothing about this sort of thing. — terdon, Jun 25 '15 at 16:23
@mikeserv Thanks, dc seems nice, I haven't used it at all yet.. However, I suppose this does not address my question. I know how to get a binary-like ASCII representation or the HEX-like ASCII representation with something like: tr 'AB' '01' <file | tr -d '\n' | { echo -n 'obase=16 ; ibase=2 ;'; cat; echo; } | bc. I am interested to save these 3 bytes (24 bits) to a file. — henfiber, Jun 25 '15 at 16:28
@henfiber - So why not just do it w/ bc then? Oh wait - those is printing the hex value for its input bytes - it's not ascii-like -they're bytes. I used od to show them. Do echo 0P|dc | od, it's not printing numbers - they're bytes. Am i following maybe? — mikeserv, Jun 25 '15 at 16:29
"I am interested to save those 3 bytes (24 bits) to a file". If I redirect the ASCII representation of the binary characters to a file, it will have a size of 24 bytes. If I save the hex one, it will be 6 bytes. All I want is to save the 3 bytes which consist of the 24 01110000... bits — henfiber, Jun 25 '15 at 16:35
@henfiber - if you do it w/ dc - it is 3 bytes. od does the hex representation. I'll show you wc -c. — mikeserv, Jun 25 '15 at 16:36
printf 2i%s ABBBAAAABBBBBABBABBBABBB | tr AB 01 | dc | wc -c prints 0 for me. My dc version: dc --version : dc (GNU bc 1.06.95) 1.3.95 — henfiber, Jun 25 '15 at 16:43
@henfiber - i have to pull that actually - how are they divided in the file? You say some newlines - but are they definitely divided by # in the file? i'm doing something wrong w/ fold or whatever there. Downvote away - if you feel it should be done - i'm just having fun with the problem. — mikeserv, Jun 25 '15 at 16:52
You may ignore new lines for now. Let's say that only A and B characters exist in the file, like in my example — henfiber, Jun 25 '15 at 16:59
@henfiber - well then that's really easy - cause that's just a dd job. I fixed the fold - it just wasn't doing a P for the very last line, so i made it. But i did it the wrong way, on second thought... it will work for the one... but now i'm thinking about it. I'll make a file with like 20 of those strings and nothing else, and see if i can make it come out right side up. The dd thing there works just fine, actually. — mikeserv, Jun 25 '15 at 17:05
Thanks @mikeserv I found out that: echo ABBBAAAABBBBBABBABBBABBB | tr AB 01 | dd ibs=24 cbs=24 conv=unblock,sync 2>/dev/null | paste -sdP | dc -e2i -f- 2>/dev/null prints the 3 correct bytes, so I discarded the 1st and 3rd dd. Could you explain how the last 3 commands work? (dd ibs=24 cbs=24 conv=unblock,sync, paste -sdP and dc -e2i -f- 2>/dev/null). Not here, please edit if you want. — henfiber, Jun 25 '15 at 17:58
@henfiber - maybe it's just my system's endianness or something - though that's a bit more than i fully understand. Yeah, i'll touch on it, but i was thinking about doing a little dc script for reading the file, too. — mikeserv, Jun 25 '15 at 18:02
You seem to be confused by the output of od. Use od -tx1 to look at bytes. od -x reads words, and on little endian machines that swaps bytes. I didn't follow closely the exchange above, but I think your initial version was correct, and you don't need to mess with byte order at all. Just use od -tx1, not od -x. — lcd047, Jun 26 '15 at 05:28
@lcd047 - thanks very much! Where the hell were you earlier? — mikeserv, Jun 26 '15 at 05:58
Online - n. Property of a human to be available to a computer. ;) — lcd047, Jun 26 '15 at 06:11
@lcd047 - well, it is fixed now - thanks to you - and that was fun f***ing answer all around. Thanks for making it worthwhile, too. — mikeserv, Jun 26 '15 at 06:14
It's curious how our upbringing is shaping us. I started with assembly languages for mainframes some 30 years ago, and messing with bits, bytes, endianness, and related stuff still feels like home to me, although I haven't really written assembly code for the last maybe 20 years. These days I see people starting with Haskell or Scheme, and they never seem to really get to grasp pointers. Then again, I suppose I'll never be any good with Haskell, either. Well, we are what we are. — lcd047, Jun 26 '15 at 06:35
@lcd047 - i started with wolfenstein 3d and tradewars 2002. Those things meant config.sys, himem.sys, xmodem. I was no slouch at 9 years old - at 32 i might need an adjustment now and then. You should do the getting to know you thing on meta. And ping me when you're done - if you would be so kind - I wanna read it. that thing you said earlier about AIX and ordered lists got me wondering. And anyway - screw haskell. — mikeserv, Jun 26 '15 at 06:40
Thanks @lcd047 and @mikeserv. That's a lot simpler now. Actually, I found out that just running tr AB 01 | dc -e2i[q]sq[?z0=qPl?x]s?l?x does the trick. But how does this cryptic dc command work? — henfiber, Jun 26 '15 at 07:16
@henfiber - I only did the first tr -dc because you said there might be newlines - this way it doesn't matter - they're always ab all the time. But if you know the file's ok, just use it, of course. About dc, this might be a start.. dc is old by the way - older than C. — mikeserv, Jun 26 '15 at 07:19
dc is older than C, but it comes from the same birthplace as Unix: "dc is the oldest surviving Unix language. When its home Bell Labs received a PDP-11, dc—written in B—was the first language to run on the new computer, even before an assembler." https://en.wikipedia.org/wiki/Dc_%28computer_program%29 — Kaz, Jun 30 '15 at 14:54
@Kaz - yup. If you followed the link you'd see a comment from me at the bottom of it to that effect as well. — mikeserv, Jun 30 '15 at 16:26

glenn jackman · Answer 3 · 2015-06-25T17:27:41.047

Perl:

my $len = 24;
my $str = "ABBBAAAABBBBBABBABBBABBB\n";
$str =~ s/\s//g;
(my $bin = $str) =~ y/AB/01/;
my $val = oct("0b".$bin);
printf "%s -> %s -> %X\n", $str, $bin, $val;

my ($filename, $fh) = ("temp.out");

# write the file
open $fh, '>', $filename;
print $fh pack("N", $val);      # this actually writes 4 bytes
close $fh;

# now read it, and convert back to a string:
open $fh, '<', $filename;
read $fh, my $data, 4;
close $fh;

my $new_val = unpack "N", $data;
my $new_bin = substr unpack("B32", $data), -$len;
(my $new_str = $new_bin) =~ y/01/AB/;

printf "%X -> %s -> %s\n", $new_val, $new_bin, $new_str;

ABBBAAAABBBBBABBABBBABBB -> 011100001111101101110111 -> 70FB77
70FB77 -> 011100001111101101110111 -> ABBBAAAABBBBBABBABBBABBB

Thanks to lcd047's perfect answer, mine becomes:

my $str = "ABBBAAAABBBBBABBABBBABBB\n";
$str =~ s/\s//g;
(my $bin = $str) =~ y/AB/01/;
printf "%s -> %s\n", $str, $bin;

my ($filename, $fh) = ("temp.out");

# write the file
open $fh, '>', $filename;
print $fh pack("B*", $bin);
close $fh;

my $size = -s $filename;
print $size, "\n";

# now read it, and convert back to a string:
open $fh, '<', $filename;
read $fh, my $data, 1024;
close $fh;

my $new_bin = unpack("B*", $data);
(my $new_str = $new_bin) =~ y/01/AB/;

printf "%s -> %s\n", $new_bin, $new_str;

ABBBAAAABBBBBABBABBBABBB -> 011100001111101101110111
3
011100001111101101110111 -> ABBBAAAABBBBBABBABBBABBB

Kaz · Answer 4 · 2015-06-29T21:30:19.747

The following solution is based on nothing but xxd (one of the tools mentioned in the question), Bash and GNU sed.

It assumes that the input consists of complete bytes (groups of eight letters), arbitrarily separated by newlines.

The approach is:

1. Strip all newlines.
2. Group letters into four-letter groups terminated by spaces.
3. Filter these quadgraphs into hex digits, not separated from each other.
4. Group pairs of hex digits together, each pair on a separate line.
5. Read these byte values with a shell loop and reconstruct an xxd-compatible hex dump.
6. Pipe into xxd -r to convert the dump into binary data.

Code:

#!/bin/bash

addr=0
tr -d '\n' | \
sed -e 's/..../& /g' |
sed -e 's/AAAA /0/g' \
    -e 's/AAAB /1/g' \
    -e 's/AABA /2/g' \
    -e 's/AABB /3/g' \
    -e 's/ABAA /4/g' \
    -e 's/ABAB /5/g' \
    -e 's/ABBA /6/g' \
    -e 's/ABBB /7/g' \
    -e 's/BAAA /8/g' \
    -e 's/BAAB /9/g' \
    -e 's/BABA /A/g' \
    -e 's/BABB /B/g' \
    -e 's/BBAA /C/g' \
    -e 's/BBAB /D/g' \
    -e 's/BBBA /E/g' \
    -e 's/BBBB /F/g' |
sed -e 's/../&\n/g' |
while read hexbyte ; do
  if [ $((addr % 16)) == 0 ] ; then
     printf "%08x: " $addr
  fi
  printf "%s" $hexbyte
  if [ $((addr % 16)) == 15 ] ; then
    printf "\n"
  else
    printf " "
  fi
  : $(( addr++ ))
done |
xxd -r - -

Modest sample run:

$ cat data
AAAABBB
B
ABABABAB
ABABAABBBBAABBAB

$ ./ab.sh < data > data.bin

$ xxd data.bin
0000000: 0f55 53cd                                .US.

Here is a modification of the code to handle a trailing group of seven or fewer bits, by padding it with zeros to make a complete byte, so that for instance a file containing nothing but B maps to a 0x80 byte. The hex digits a-f are written in lowercase here so that the tail rules, which match the input letters A and B at the end of the line, cannot mistake an already converted digit for input:

#!/bin/bash

addr=0
tr -d '\n' | \
sed -e 's/..../& /g' |
sed -e 's/AAAA /0/g' \
    -e 's/AAAB /1/g' \
    -e 's/AABA /2/g' \
    -e 's/AABB /3/g' \
    -e 's/ABAA /4/g' \
    -e 's/ABAB /5/g' \
    -e 's/ABBA /6/g' \
    -e 's/ABBB /7/g' \
    -e 's/BAAA /8/g' \
    -e 's/BAAB /9/g' \
    -e 's/BABA /a/g' \
    -e 's/BABB /b/g' \
    -e 's/BBAA /c/g' \
    -e 's/BBAB /d/g' \
    -e 's/BBBA /e/g' \
    -e 's/BBBB /f/g' \
    -e 's/AAA$/0/g' \
    -e 's/AAB$/2/g' \
    -e 's/ABA$/4/g' \
    -e 's/ABB$/6/g' \
    -e 's/BAA$/8/g' \
    -e 's/BAB$/a/g' \
    -e 's/BBA$/c/g' \
    -e 's/BBB$/e/g' \
    -e 's/AA$/0/g' \
    -e 's/AB$/4/g' \
    -e 's/BA$/8/g' \
    -e 's/BB$/c/g' \
    -e 's/A$/0/g' \
    -e 's/B$/8/g' |
sed -e 's/../&\n/g' |
sed -e 's/^.$/&0\n/' |
while read hexbyte ; do
  if [ $((addr % 16)) == 0 ] ; then
     printf "%08x: " $addr
  fi
  printf "%s" $hexbyte
  if [ $((addr % 16)) == 15 ] ; then
    printf "\n"
  else
    printf " "
  fi
  : $(( addr++ ))
done |
xxd -r - -

score 1 · Answer 5 · answered Jan 29 '21 at 05:55

Replace the characters in the text file:

sed -i 's/A/0/g' file.in
sed -i 's/B/1/g' file.in

If you're representing newline characters with \n, then replace them with newlines:

sed 's/\\n/\'$'\n''/g' file.in

(ABBBAAAABBBBBABBABBBABBB becomes 011100001111101101110111)

Treat the ascii (string) in file.in as binary data to write (as is) to binary file:

data=$(cat file.in)
# replace file
echo $(printf '%x\n' "$((2#$data))") | xxd -r -p > file.out
# or write to existing file without truncating
echo $(printf '%x\n' "$((2#$data))") | xxd -r -p - file.out

which gives the following hex codes when reading the (3-byte) binary file:

hd file.out
70fb77

To decode (reverse), read the binary file with hd or xxd, convert hex codes to binary, then swap 0 & 1 for A & B.
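
Those decoding steps could look roughly like the following. It is a sketch rather than a tested counterpart of the commands above: bin2ab_hex is a made-up name, od stands in for hd/xxd to produce the hex digits, and the hex-to-binary and 0/1-to-A/B steps are folded into one lookup per hex digit.

```shell
# bin2ab_hex: hypothetical helper; reads bytes on stdin, prints one line of A/B letters
bin2ab_hex() {
    # od -tx1 prints each byte as two lowercase hex digits
    od -An -v -tx1 | tr -d ' \n' | fold -w1 |
    while read -r digit || [ -n "$digit" ]; do
        # each hex digit stands for four bits, written here as A (0) and B (1)
        case $digit in
            0) printf AAAA ;; 1) printf AAAB ;; 2) printf AABA ;; 3) printf AABB ;;
            4) printf ABAA ;; 5) printf ABAB ;; 6) printf ABBA ;; 7) printf ABBB ;;
            8) printf BAAA ;; 9) printf BAAB ;; a) printf BABA ;; b) printf BABB ;;
            c) printf BBAA ;; d) printf BBAB ;; e) printf BBBA ;; f) printf BBBB ;;
        esac
    done
    echo
}

printf '\160\373\167' | bin2ab_hex
```

Fed the three bytes 0x70 0xfb 0x77, this prints ABBBAAAABBBBBABBABBBABBB. Working digit by digit also avoids the size limit of shell arithmetic, which matters once the input is longer than a few bytes.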

Tested on Ubuntu 16.04.7
