19

I want to read a character and then a fixed length of string (the string is not null terminated in the file, and its length is given by the preceding character).

How can I do this in a bash script? How to define the string variable so that I can do some post-processing on it?

Jeff Schaller
Amanda

5 Answers

20

If you want to stick with shell utilities, you can use head to extract a number of bytes, and od to convert a byte into a number.

export LC_ALL=C    # make sure we aren't in a multibyte locale
n=$(head -c 1 | od -An -t u1)
string=$(head -c $n)

However, this does not work for binary data. There are two problems:

  • Command substitution $(…) strips final newlines in the command output. There's a fairly easy workaround: make sure the output ends in a character other than a newline, then strip that one character.

    string=$(head -c $n; echo .); string=${string%.}
    
  • Bash, like most shells, is bad at dealing with null bytes. As of bash 4.1, null bytes are simply dropped from the result of the command substitution. Dash 0.5.5 and pdksh 5.2 have the same behavior, and AT&T ksh stops reading at the first null byte. In general, shells and their utilities aren't geared towards dealing with binary files. (Zsh is the exception: it's designed to support null bytes.)

If you have binary data, you'll want to switch to a language like Perl or Python.

<input_file perl -e '
  read STDIN, $c, 1 or die $!;    # read length byte
  $n = read STDIN, $s, ord($c);   # read data
  die $! if !defined $n;
  die "Input file too short" if ($n != ord($c));
  # Process $s here
'

<input_file python3 -c '
import sys
stdin = sys.stdin.buffer          # binary stream: bytes, no text decoding
n = ord(stdin.read(1))            # read length byte
s = stdin.read(n)                 # read data
if len(s) < n: raise ValueError("input file too short")
# Process s here
'
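
For a file that holds several such records back to back, the same read-length-then-read-data step can be repeated until end of input. Here is a minimal Python 3 sketch, assuming the stream is a buffered binary stream (so that read(n) only comes up short at end of input). The function name read_records and the sample bytes are made up for illustration.

```python
import io


def read_records(stream):
    """Yield each length-prefixed record from a buffered binary stream."""
    while True:
        prefix = stream.read(1)      # one length byte
        if not prefix:               # clean end of input
            return
        n = prefix[0]                # indexing bytes gives an int in 0..255
        data = stream.read(n)        # the payload, null bytes and newlines included
        if len(data) < n:
            raise ValueError("input too short: wanted %d bytes, got %d" % (n, len(data)))
        yield data


# Sample input: "abc", then an empty record, then a record holding NUL and newline.
sample = io.BytesIO(b"\x03abc\x00\x02\x00\n")
records = list(read_records(sample))
print(records)  # [b'abc', b'', b'\x00\n']
```

With real input, pass sys.stdin.buffer or a file opened in "rb" mode instead of the BytesIO sample.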
7

If you want to be able to deal with binary files in the shell, the best (only?) option is to work with the hexdump tool.

hexdump -v -e '/1 "%u\n"' binary.file | while read c; do
  echo $c
done

Read only X bytes:

head -cX binary.file | hexdump -v -e '/1 "%u\n"' | while read c; do
  echo $c
done

Read the length byte (handling a length of 0), then print the "string" as decimal byte values:

len=$(head -c1 binary.file | hexdump -v -e '/1 "%u\n"')
if [ "${len:-0}" -gt 0 ]; then   # len is empty if the file is empty
  tail -c+2 binary.file | head -c"$len" | hexdump -v -e '/1 "%u\n"' | while read -r c; do
    echo "$c"
  done
fi
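
For comparison, the same length-then-values extraction is short in a language with native byte strings. Here is a rough Python 3 sketch (the helper name string_byte_values is hypothetical) that returns the decimal values instead of printing them one per line:

```python
def string_byte_values(blob):
    """Return the decimal values of the bytes that follow the length byte."""
    if not blob:                     # empty input: nothing to read
        return []
    length = blob[0]                 # first byte is the length (0..255)
    payload = blob[1:1 + length]     # slicing stops early if the input is short
    return list(payload)             # iterating over bytes yields ints


print(string_byte_values(b"\x03A\x00\nignored"))  # [65, 0, 10]
```

Unlike the shell pipeline, nothing here is lost when the payload contains a null byte or a newline.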
3
exec 3<binary.file     # open the file for reading on file descriptor 3
LC_ALL=C               # count bytes, not multibyte characters
IFS=                   # do not trim or split on whitespace
read -r -N1 -u3 char   # read 1 character into variable "char" (-r keeps backslashes literal)

# to obtain the ordinal value of the char "char"
num=$(printf %s "$char" | od -An -vtu1 | sed 's/^[[:space:]]*//')

read -r -N"$num" -u3 str   # read "num" chars
exec 3<&-              # close fd 3
glenn jackman
3

UPDATE (with hindsight): This question and my answer make me think of the dog that keeps chasing the car. One day he finally catches it, but he really can't do much with it. This answer "catches" the strings, but you can't do much with them if they have embedded null bytes (so a big +1 to Gilles' answer; another language may be in order here).

dd reads any and all data. It certainly won't baulk at zero as a "length", but if you have \x00 anywhere in your data, you will need to be creative about how you handle it: dd has no problems with it, but your shell script will (depending on what you want to do with the data). The script further down basically outputs each "data string" to a file, with a line divider between each string.

btw: You say "character", and I assume you mean "byte", but the word "character" has become ambiguous in these days of Unicode, where only the 7-bit ASCII character set uses a single byte per character. Even within Unicode, byte counts vary depending on the encoding, e.g. UTF-8, UTF-16, etc.

Here is a simple script to highlight the difference between a Text "character" and bytes.

STRING="௵"
echo "CHAR count is: ${#STRING}"
echo "BYTE count is: $(printf %s "$STRING" | wc -c)"
# CHAR count is: 1
# BYTE count is: 3  # UTF-8 encoded (on my system)

If your length character is 1-byte long and indicates a byte-length, then this script should do the trick, even if the data contains Unicode characters... dd only sees bytes regardless of any locale setting...
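
The character/byte distinction is easy to check outside the shell too. Here is a small Python 3 sketch using the same ௵ character; the variable names are only illustrative.

```python
text = "௵"                                      # one Unicode character (U+0BF5)

char_count = len(text)                          # counts characters (code points)
utf8_bytes = len(text.encode("utf-8"))          # counts bytes in the UTF-8 encoding
utf16_bytes = len(text.encode("utf-16-le"))     # same character, different encoding

print(char_count, utf8_bytes, utf16_bytes)      # 1 3 2
```

A length byte in the file counts bytes of whatever encoding was written, so it matches the second kind of number, never the first.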

This script uses dd to read the binary file and outputs the strings separated by a "====" divider. See the next script for test data.

#   
div="================================="; echo "$div"
skip=0  # read bytes at this offset
while true ; do
  # Get the "length" byte
  ((count=1)) # count of bytes to read
  dd if=binfile bs=1 skip=$skip count=$count of=datalen 2>/dev/null
  (( $(<datalen wc -c) != count )) && { echo "INFO: End-Of-File" ; break ; }
  strlen=$((0x$(<datalen xxd -ps)))  # xxd is shipped as part of the 'vim-common' package
  #
  # Get the string
  ((count=strlen)) # count of bytes to read
  ((skip+=1))      # read bytes from and including this offset
  dd if=binfile bs=1 skip=$skip count=$count of=dataline 2>/dev/null
  ddgetct=$(<dataline wc -c)
  (( ddgetct != count )) && { echo "ERROR: Line data length ($ddgetct) is not as expected ($count) at offset ($skip)." ; break ; }
  echo -e "\n$div" >>dataline # add a newline for TEST PURPOSES ONLY...
  cat dataline
  #
  ((skip=skip+count))  # read bytes from and including this offset
done
#   
echo

exit

This script builds test data which includes a 3-byte prefix per line...
The prefix is a single UTF-8 encoded Unicode character...

# build test data
# ===============
  prefix="௵"   # prefix all non-zero length strings with this obvious 3-byte marker.
  prelen=$(printf %s "$prefix" | wc -c)
  printf \\0 > binfile  # force 1st string to be zero-length (to check zero-length logic)
  ( lmax=3 # line max ... the last one is set to 255-length (to check max-length logic)
    for ((i=1;i<=lmax;i++)) ; do    # add prefixed random length lines
      suflen=$(numrandom /0..$((255-prelen))/)  # random suffix length (whole string is at least 3 bytes)
      ((i==lmax)) && ((suflen=255-prelen))      # make last line full length (255)
      strlen=$((prelen+suflen))
      printf \\$((($strlen/64)*100+$strlen%64/8*10+$strlen%8))"$prefix"   # length byte as an octal escape
      for ((j=0;j<suflen;j++)) ; do
        byteval=$(numrandom /9,10,32..126/)  # output only printable ASCII characters (plus tab and newline)
        printf \\$((($byteval/64)*100+$byteval%64/8*10+$byteval%8))
      done
        # 'numrandom' is from package 'num-utils'
    done
  ) >>binfile
#
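
As an alternative to random data, test input can be built deterministically so that the awkward cases are always present: a zero-length string, a maximum-length (255-byte) string, and null bytes and newlines inside the data. The Python 3 sketch below does that with hypothetical names (encode_record, build_test_data); it is not a port of the script above.

```python
def encode_record(payload):
    """Prefix a payload with a single byte holding its length."""
    if len(payload) > 255:
        raise ValueError("payload does not fit a one-byte length")
    return bytes([len(payload)]) + payload


def build_test_data():
    """Return (file contents, list of payloads) covering the edge cases."""
    prefix = "௵".encode("utf-8")                 # the same 3-byte marker
    payloads = [
        b"",                                     # zero-length string
        prefix + b"plain text",
        prefix + b"nul\x00inside",               # embedded null byte
        prefix + b"ends with newline\n",         # trailing newline
        prefix + b"x" * (255 - len(prefix)),     # maximum length: 255 bytes
    ]
    data = b"".join(encode_record(p) for p in payloads)
    return data, payloads


data, payloads = build_test_data()
print(len(data), [len(p) for p in payloads])
```

Writing the result to a file such as binfile only needs the file opened in "wb" mode.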
Peter.O
  • Your code looks more complicated than it should be, especially the random test data generator. You can get random bytes from /dev/urandom on most unices. And random test data isn't the best test data; you should make sure to address difficult cases such as, here, null characters and newlines in boundary places. – Gilles 'SO- stop being evil' Apr 09 '11 at 16:01
  • Yes, thanks. I thought of using /dev/random but figured the test data gen was of no great import, and I wanted to test drive 'numrandom' (which you mentioned elsewhere; 'num-utils' has some nice features). I've just taken a closer look at your answer, and realized that you are doing pretty much the same thing, except that it is more succinct :) I hadn't noticed that you had stated the key points in 3 lines! I had focused on your other-language references. Getting it to work was a good experience, and I now understand better your references to other languages! \x00 can be a shell-stopper – Peter.O Apr 09 '11 at 17:13
1

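This one just copies a binary file, one byte at a time (an empty read means a null byte, since NUL is used as the delimiter):

 export LC_ALL=C   # read bytes, not multibyte characters
 while IFS= read -r -d '' -n 1 byte ; do printf %s "$byte" ; [ -n "$byte" ] || printf '\0' ; done < "$input" > "$output"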
rzr