2

I am working on AIX and trying to remove non-printable characters from a file. When I view the file in Notepad++ with UTF-8 encoding, the data looks like Caucasian male lives in Arizona w/ fiancÃÂÃÂÃÂÃÂÃÂ. When I view the file on Unix, I get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ instead of the special characters.

I want to replace all those special characters with space.

I tried sed 's/[^[:print:]]/ /g' file, but it does not remove those characters. My available locales, as listed by locale -a, are:

C
POSIX
en_US.8859-15
en_US.ISO8859-1
en_US

I even tried sed -e 's/[^ -~]/ /g' file and it did not remove the characters.

I see that other Stack Overflow answers used a UTF-8 locale with GNU sed and this worked, but I do not have that locale.

Also I am using ksh.

Auguster
  • 123
  • à and look pretty printable to me. A UTF-8 à is encoded as 0xc3 0x83. 0xc3 in iso8859-1 or 15 is also à as it happens which is printable, 0x83 would be a control character in both though – Stéphane Chazelas Sep 25 '18 at 19:53
  • Possible duplicate: https://unix.stackexchange.com/questions/201751/replace-non-printable-characters-in-perl-and-sed –  Sep 25 '18 at 20:05
  • 1
    @Goro Yes, at this point it is possibly a duplicate, now that I understand to use the C locale – Auguster Sep 25 '18 at 20:09
  • To actually show what the characters are, it is useful to show their hex values. Something like: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | od -tx1, or maybe, if your sed supports it: echo "fiancÃÂÃÂÃÂÃÂÃÂ" | sed -n l. –  Sep 25 '18 at 21:08
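
As a rough illustration of the hex-dump suggestion above, the sketch below builds a sample line from explicit octal escapes and dumps it with od. The sample bytes are an assumption (a stand-in for the real file, using the 0xc3 0x83 and 0xc3 0x82 sequences discussed in the comments), and the function name dump_hex is hypothetical.

```shell
# Hypothetical helper: print every byte of standard input as two hex digits.
# LC_ALL=C keeps the tool from interpreting the bytes in any locale charset.
dump_hex() {
    LC_ALL=C od -An -tx1
}

# Sample data (an assumption, not the asker's real file): "fianc" followed by
# the UTF-8 encodings of Ã (0xc3 0x83) and Â (0xc3 0x82), then a newline.
printf 'fianc\303\203\303\202\n' | dump_hex
```

Any byte shown with a value of 80 or higher lies outside ASCII, which is what the C-locale approaches in the answers rely on.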

2 Answers

1

You can use the command tr as follows:

tr -cd '[:print:]\t\r\n'

Explanation:

-cd -- delete (-d) every character that is not in (-c) the set that follows
`[:print:]'
Any printable character: every character in the `[:graph:]' class, plus the space character
\t -- horizontal tab
\r -- carriage return
\n -- newline

Examples on CentOS 7, where tr is GNU tr and the locale uses UTF-8:

$ echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n'
fianc

$ echo "get ^▒▒^▒▒^▒▒^▒▒^▒▒^▒▒ " | tr -cd '[:print:]\t\r\n'
get ^^^^^^

$ echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒"  | tr -cd '[:print:]\t\r\n'
 Caucasian male lives in Arizona w/ fianc^^^^^^^^^^^^
  • That did not work for me. I tried echo " Caucasian male lives in Arizona w/ fianc▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒^▒▒^▒▒^▒▒^▒▒^▒▒^▒" | tr -d '[:print:]' and got some unreadable text as output – Auguster Sep 25 '18 at 19:36
  • 1
    LC_ALL=C tr ... – Jeff Schaller Sep 25 '18 at 19:38
  • @JeffSchaller I tried LC_ALL=C and still get some unreadable text as output – Auguster Sep 25 '18 at 19:43
  • 1
    LC_ALL=C tr -cd '[:print:]' < input works here – Jeff Schaller Sep 25 '18 at 19:43
  • @JeffSchaller when I run LC_ALL=C tr -cd '[:print:]' < file I get Usage: tr [ [-c|-C] | -[c|C]ds | -[c|C]s | -ds | -s ] [-A] String1 String2 tr { -[c|C]d | -[c|C]s | -d | -s } [-A] String1 – Auguster Sep 25 '18 at 19:45
  • @JeffSchaller Nvm I tried LC_ALL=C tr -cd '[:print:]' < input > out and it worked thanks :) – Auguster Sep 25 '18 at 19:52
  • 1
    echo "fiancÃÂÃÂÃÂÃÂÃÂ" | tr -cd '[:print:]\t\r\n' should return fiancÃÂÃÂÃÂÃÂÃÂ, as Ã is a printable character. GNU tr doesn't in UTF-8, as it doesn't support multi-byte characters yet, but it does in iso8859-1. In the C locale on systems where the C locale charset is ASCII, that does remove Ã (or whatever bytes those are made of), as ASCII has no such character in the first place. – Stéphane Chazelas Sep 25 '18 at 22:46
  • 1
    Because CentOS tr is GNU tr and you probably tried it in a UTF-8 locale, where Ã is made of 2 bytes, and GNU tr doesn't support multibyte characters. If you use LC_ALL=C as suggested by Jeff Schaller, it will work (at removing those Ã, however they're encoded) regardless of whether tr supports multibyte characters or not. In the C locale, all characters are single bytes, and on most systems including AIX, the C locale charset is ASCII, which has no character with the 8th bit set (which each byte of the UTF-8 encoding of Ã has, as does its single-byte iso8859-1 encoding) – Stéphane Chazelas Sep 25 '18 at 22:52
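
Pulling the comments above together, here is a minimal sketch of the C-locale approach with tr. The function names and the sample bytes are assumptions made for illustration. The second function replaces instead of deleting, which is what the question asked for; the `[ *]` repetition syntax comes from the POSIX description of tr and has not been verified on AIX here.

```shell
# Hypothetical helpers; both read standard input and write standard output.

# Delete every byte that is not printable ASCII, tab, CR or newline.
strip_nonprint() {
    LC_ALL=C tr -cd '[:print:]\t\r\n'
}

# Replace each such byte with a space instead of deleting it.
# '[ *]' repeats the space as often as needed to match the first set.
blank_nonprint() {
    LC_ALL=C tr -c '[:print:]\t\r\n' '[ *]'
}

# Sample data (an assumption): "fianc" + 0xc3 0x83 0xc3 0x82 + newline.
printf 'fianc\303\203\303\202\n' | strip_nonprint
printf 'fianc\303\203\303\202\n' | blank_nonprint
```

On a real file, the same functions would be used with redirections, for example strip_nonprint < input > out, as in the command that worked for the asker.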
1

If the current locale already uses UTF-8 as the charset (and the file is written using that charset):

<file LC_ALL=C sed 's/[^ -~]//g'

Or, to include control characters in AIX sed:

<file LC_ALL=C sed "$(printf "s/[^[:print:]\t\r]//g")"
  • @Stéphane what does printf do here? If I am deleting all characters and saving to another file do I need to use printf? – Auguster Sep 26 '18 at 13:50
  • @Auguster, printf is there to expand the \t into a TAB character and \r into a CR character. If using ksh93 on AIX, you can also use $'s/[^[:print:]\t\r]//g' – Stéphane Chazelas Sep 26 '18 at 15:16
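
For completeness, a sketch of the same sed command adapted to replace each unwanted byte with a space, as the question originally asked, rather than delete it. The function name and the sample input are hypothetical, and this assumes a sed that treats every byte as one character in the C locale.

```shell
# Hypothetical helper: replace each byte outside printable ASCII, tab and CR
# with a single space. printf turns \t and \r into literal TAB and CR bytes
# before sed sees the script; sed itself never removes the line's newline.
blank_nonprint_sed() {
    LC_ALL=C sed "$(printf 's/[^[:print:]\t\r]/ /g')"
}

# Sample data (an assumption): four non-ASCII bytes in the middle of a line.
printf 'fianc\303\203\303\202 lives\n' | blank_nonprint_sed
```

Each of the four sample bytes becomes one space, so the column positions of the remaining text are preserved.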