I'm working through Unix For Poets, and trying to make a file containing all words/tokens in the Bible. However, when using tr as suggested, the output includes the empty string. See the example below:
> tr -sc 'A-Za-z' '[\12*]' < bible.txt > bible.words
> sed 5q bible.words

The
Project
Gutenberg
EBook
I have read through the man page for tr, without any luck. Any help with understanding why they're included would be much appreciated.
EDIT:
First example:
Line from bible.txt:
1:1 Paul, a servant of Jesus Christ, called to be an apostle,
Command which reproduces the unexpected result:
> echo '1:1 Paul, a servant of Jesus Christ, called to be an apostle,' | tr -sc 'A-Za-z' '[\12*]'

Paul
a
servant
of
Jesus
Christ
called
to
be
an
apostle
Expected output:
Paul
a
servant
of
Jesus
Christ
called
to
be
an
apostle
Second example:
Line from bible.txt:
The Project Gutenberg Ebook of The King James Bible
Command with the same unexpected result:
> echo 'The Project Gutenberg EBook of The King James Bible ' | tr -sc 'A-Za-z' '[\12*]'
The
Project
Gutenberg
EBook
of
The
King
James
Bible
Expected output:
The
Project
Gutenberg
EBook
of
The
King
James
Bible
Note: it's the leading empty line I don't understand.
echo 'The Project Gutenberg EBook of The King James Bible ' | tr -sc 'A-Za-z' '[\12*]'
is not giving me an empty first line. However, in your first example it is expected, as there is a space after 1:1 before the next word. – Utsav Jul 02 '17 at 12:59
1:1the project as input string... the initial 1:1 will be replaced with a newline... won't be a problem if there are alphabets at the start of the string... 'is1:1the project' ... can use grep -oi '[a-z]*' as an alternate if your grep implementation supports this – Sundeep Jul 02 '17 at 13:13
'1:1 Paul 1:1 a a' case, all the newlines between Paul and a are squeezed to a single newline... that is what the -s option is for... remove it and see it in action for yourself – Sundeep Jul 02 '17 at 13:36
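To make the comments above concrete, here is a minimal sketch of the behaviour they describe. It assumes a POSIX-like shell and a grep that supports -o (GNU and BSD grep do, but POSIX does not require it). The function names words_tr, words_tr_clean and words_grep are hypothetical helpers introduced only for this illustration. The grep pattern is written as '[a-z][a-z]*' rather than '[a-z]*' so that it can never match the empty string.

```shell
#!/bin/sh
# Sample line from the question. The leading "1:1 " contains no letters.
line='1:1 Paul, a servant of Jesus Christ, called to be an apostle,'

# Original command: every non-letter becomes a newline, and -s squeezes each
# run of newlines into one. The leading "1:1 " therefore becomes a single
# newline, which shows up as an empty first line.
words_tr() {
    tr -sc 'A-Za-z' '[\12*]'
}

# Same command, followed by a filter that deletes empty lines.
words_tr_clean() {
    tr -sc 'A-Za-z' '[\12*]' | sed '/^$/d'
}

# Alternative (assumes grep -o is available): print only the runs of letters,
# so no separator is ever emitted for leading non-letters.
words_grep() {
    grep -oi '[a-z][a-z]*'
}

echo '--- tr only (first line is empty) ---'
printf '%s\n' "$line" | words_tr | sed 3q

echo '--- tr plus sed filter ---'
printf '%s\n' "$line" | words_tr_clean | sed 3q

echo '--- grep -o ---'
printf '%s\n' "$line" | words_grep | sed 3q
```

Either of the last two variants could be redirected to bible.words in place of the original command. If the range A-Za-z behaves unexpectedly in your locale, prefixing the commands with LC_ALL=C may help.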