I'm working through Unix For Poets and trying to make a file containing all the words/tokens in the Bible. However, when I use tr as suggested, the output includes the empty string. See the example below:

> tr -sc 'A-Za-z' '[\12*]' < bible.txt > bible.words
> sed 5q bible.words

The
Project
Gutenberg
EBook

I have read through the man page for tr without any luck. Any help with understanding why they're included would be much appreciated.

EDIT:

First example:

Line from bible.txt:

1:1 Paul, a servant of Jesus Christ, called to be an apostle,

Command which reproduces the unexpected result:

> echo '1:1 Paul, a servant of Jesus Christ, called to be an apostle,' | tr -sc 'A-Za-z' '[\12*]'

Paul
a
servant
of
Jesus
Christ
called
to
be
an
apostle

Expected output:

Paul
a
servant
of
Jesus
Christ
called
to
be
an
apostle

Second example:

Line from bible.txt:

The Project Gutenberg Ebook of The King James Bible

Command with the same unexpected result:

> echo 'The Project Gutenberg EBook of The King James Bible  ' | tr -sc 'A-Za-z' '[\12*]'

The
Project
Gutenberg
EBook
of
The
King
James
Bible

Expected output:

The
Project
Gutenberg
EBook
of
The
King
James
Bible

Note that it's the leading empty line I don't understand.

Ola R
  • Reproduce the problem for 2 lines; show bible.txt for those lines, and show your expected and current output. – Utsav Jul 02 '17 at 12:19
  • Updated the question with expected and actual output. – Ola R Jul 02 '17 at 12:53
  • I cannot reproduce your second example. echo 'The Project Gutenberg EBook of The King James Bible ' | tr -sc 'A-Za-z' '[\12*]' is not giving me an empty first line. However, in your first example it is expected, as there is a space after 1:1 before the next word. – Utsav Jul 02 '17 at 12:59
  • can reproduce with simple 1:1the project as input string... as initial 1:1 will be replaced with newline... won't be a problem if there are alphabets at start of string... 'is1:1the project'... can use grep -oi '[a-z]*' as alternate if your grep implementation supports this – Sundeep Jul 02 '17 at 13:13
  • @Sundeep So '1:1' becomes truncated into '' which is split by a newline due to the space. Is this only the case if the complement characters prefix first word (with a space)? Fx. echo '1:1 Paul 1:1 a a' | tr -sc 'A-Za-z' '[\12]' does not produce an empty line between 'Paul' and 'a'. The first example is reproduced when doing: sed 1q < bible.txt | tr -sc 'A-Za-z' '[\12]', or copying directly from the file. However, it doesn't reproduce when I copy from browser. – Ola R Jul 02 '17 at 13:25
  • @OlaR for '1:1 Paul 1:1 a a' case, all the newlines between Paul and a are squeezed to single newline... that is what -s option is for... remove it and see it in action for yourself – Sundeep Jul 02 '17 at 13:36

1 Answer

You need to understand the tr options at work here to know what's going on.

  1. -c => complement the first character set, i.e. select every character not found in it. In your case, with 'A-Za-z', that means any non-alphabetic character: a space, a digit, a newline, a control character, and so on.
  2. -s => after translation, squeeze each run of consecutive repeated characters from the second set into a single one.
  3. The second set holds the characters to map to. \12 is the octal ASCII code for a newline, and the [\12*] form repeats it as often as needed to match the length of the first set.
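To see what each option contributes, it can help to run the same input with and without -s. The snippet below is only a small sketch: the function names split_raw and split_squeezed and the shortened sample line are made up for illustration, and the tr invocation is the one from the question.

```shell
# Hypothetical helpers for illustration; they are not part of the original answer.
# split_raw:      -c only, every non-letter becomes its own newline.
# split_squeezed: -c and -s, each run of those newlines collapses to one.
split_raw()      { tr -c  'A-Za-z' '[\12*]'; }
split_squeezed() { tr -sc 'A-Za-z' '[\12*]'; }

# Shortened sample in the style of the line from the question.
line='1:1 Paul, a'

# "1:1 " is four non-letters, so four newlines precede "Paul".
printf '%s\n' "$line" | split_raw

# The same four newlines are squeezed to one, which still leaves
# a single empty line before "Paul".
printf '%s\n' "$line" | split_squeezed
```

Note that -s shortens a run of newlines but never removes it entirely, so a run at the very start of the input always leaves one empty line.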

That means all letters (both upper and lower case) are left untouched, while each run of non-alphabetic characters is turned into a single newline:

     ----     --        --------     -     -       -----      ----
$#%! This     is        StarWars     R2    D2      robot     @work.
|---|    |---|  |------|        |---| |---| |-----|     |----|    ||
 \n        \n      \n             \n    \n     \n         \n      \n 

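Every letter is left untouched, while each run of non-letters becomes a single newline. That includes a run at the very start of the input ('$#%! ' here, or '1:1 ' in your Bible line), which is why the output begins with an empty line.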


output:

This
is
StarWars
R
D
robot
work
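
If the empty line is unwanted, one possible workaround (not something the original answer proposes) is to filter empty lines out after tr. This is a minimal sketch; the function name words is hypothetical, and it assumes a standard sed.

```shell
# Hypothetical helper: split stdin into one word per line and
# drop any empty lines that tr leaves behind.
words() {
    tr -sc 'A-Za-z' '[\12*]' | sed '/^$/d'
}

# The leading "1:1 " no longer produces an empty first line.
printf '%s\n' '1:1 Paul, a servant of Jesus Christ' | words
```

For the question's file this would be used as words < bible.txt > bible.words. The grep -oi '[a-z]*' alternative from the comments may also work where grep supports -o.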
  • Was about to write u didn't address the question, but edit (with prefix chars) fixes it. – Ola R Jul 02 '17 at 18:28