4

I want to convert all apostrophes in this file to X:

Bob's book
Bob’s book
Bob′s book  # (Might look the same but actually different)

The first apostrophe is replaced as expected:

$ cat file | tr "'" "X"
BobXs book
Bob’s book
Bob′s book

But with the other two kinds of apostrophe, strange things happen:

$ cat file | tr "’" "X"
Bob's book
BobXXXs book
BobXX�s book

$ cat file | tr "′" "X"
Bob's book
BobXX�s book
BobXXXs book

How to make it work?

2 Answers

8

tr works in units of bytes, which means it doesn't work properly for multi-byte encodings like UTF-8. The only solutions I know of are to find a version of tr that supports Unicode, or switch to sed or some other tool that can do string replacement.
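As a rough illustration of the sed route, here is a minimal sketch. The function name replace_apostrophes is made up for this example. It replaces each character as a whole string rather than as a set of bytes, so it should not depend on whether the tool is multi-byte aware, assuming the script and the input use the same encoding (UTF-8 here).

```shell
# Hypothetical helper: read stdin and replace each of the three
# apostrophe-like characters with X. Each -e expression matches the
# character's full byte sequence, not its individual bytes.
replace_apostrophes() {
    sed -e "s/'/X/g" -e "s/’/X/g" -e "s/′/X/g"
}

# Sample input taken from the question.
printf '%s\n' "Bob's book" "Bob’s book" "Bob′s book" | replace_apostrophes
```

With the sample input from the question, all three lines should come out as `BobXs book`.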

jw013
0

For me, tr works fine for both ASCII and UTF-8 files, as long as the OS is configured to use a UTF-8 locale.

Here is my sample #1 (Solaris 11):

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_ALL=

As you can see, the OS is configured for UTF-8. I created both files in UTF-8 encoding:

$ cat file
Bob’s Bob′s Bob's

$ cat apos
’′'

Then I got the expected result, replacing all three apostrophes like this:

$ cat file | tr "$(cat apos)" "xxx"
Bobxs Bobxs Bobxs

Here is my sample #2 (Solaris 10):

$ locale
LANG=
LC_CTYPE="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_COLLATE="C"
LC_MONETARY="C"
LC_MESSAGES="C"
LC_ALL=

Here you can see that this OS is configured for plain ASCII (the C locale), not UTF-8, so you can expect trouble when processing UTF-8 files with multi-byte characters using tr. But there is a workaround. Since tr accepts octal escapes for characters, you can replace each byte of a given character by using its octal representation.

In your case you have:

char  hex        octal
’     E2 80 99   \342\200\231
′     E2 80 B2   \342\200\262
'     27         \47

The first and second apostrophes are each represented by three bytes. The third one is standard ASCII (one byte).
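If you want to look up these octal values yourself, one way is to dump the bytes with od. This is only a sketch; the helper name octal_bytes is made up for illustration, and it assumes the shell script itself is saved as UTF-8.

```shell
# Hypothetical helper: print the bytes of its argument in octal,
# one byte per field, without the offset column.
octal_bytes() {
    printf '%s' "$1" | od -An -to1
}

octal_bytes "’"   # prints 342 200 231 (spacing may vary)
octal_bytes "′"   # prints 342 200 262
octal_bytes "'"   # prints 047
```

Using `od -An -tx1` instead gives the hex column of the table.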

So to replace the first apostrophe you can use:

$ cat file | tr "\342\200\231" "\0\0x"
Bobxs Bob▒s Bob's

Second:

$ cat file | tr "\342\200\262" "\0\0x"
Bob▒s Bobxs Bob's

Third:

$ cat file | tr "\47" "x"
Bob’s Bob′s Bobxs

To replace all in one shot you may use:

$ cat file | tr "\342\200\231\262\47" "\0\0xxx"
Bobxs Bobxs Bobxs

Of course this is not perfect, because it replaces every occurrence of the bytes \342, \200, \231 and \262 in the file, so other multi-byte characters that contain these bytes will be broken. But if your file does not contain any other multi-byte characters, it will work.
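One more detail: a replacement set such as "\0\0x" maps the bytes \342 and \200 to NUL bytes, which the terminal does not display but which remain in the output. A possible variant, sketched below under the assumption that tr is run byte-wise (forced here with LC_ALL=C), deletes the two lead bytes first and then translates the remaining ones. The function name replace_bytes is made up for this example, and the same caveat about other multi-byte characters applies.

```shell
# Hypothetical helper: byte-wise replacement without leaving NUL bytes.
# Step 1 deletes the shared bytes \342 and \200.
# Step 2 maps the remaining bytes \231, \262 and \47 to x.
replace_bytes() {
    LC_ALL=C tr -d '\342\200' | LC_ALL=C tr '\231\262\47' 'xxx'
}

printf '%s\n' "Bob’s Bob′s Bob's" | replace_bytes
# Bobxs Bobxs Bobxs

# The caveat in practice: the euro sign is \342\202\254, so it loses
# its first byte and is no longer valid UTF-8.
printf '%s' "€" | replace_bytes | od -An -to1
# 202 254
```

For files that may contain other multi-byte characters, a whole-string replacement tool such as sed is the safer choice.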

MST
  • 1
    Note that ' (U+0027, APOSTROPHE) is not the same character as ’ (U+2019, RIGHT SINGLE QUOTATION MARK) and ′ (U+2032, PRIME) that the OP was having problems with. You should [edit] to expand your answer to discuss those cases, as OP never had trouble with a straight ASCII apostrophe U+0027. – user May 09 '17 at 19:15
  • You're right. I corrected my answer. – MST May 10 '17 at 21:09
  • "this will replace all occurences of byte \342, \200, \231, \262 in file, so other multi-byte characters which contain these bytes will be broken" The nice part about UTF-8 here is that if you use a full UTF-8 encoding of a code point, then that cannot match any other validly UTF-8 encoded code point. The first byte of the UTF-8 encoding specifies the length, and the remaining bytes are clearly distinguishable from a first byte. The only way this can break is if your file is not valid UTF-8. (This property of UTF-8 also means that it's one of the few encodings that can be reliably detected.) – user May 11 '17 at 07:29