
I need to replace some non-printable characters with spaces in a file.

Specifically, all characters from 0x00 up to 0x1F, except 0x09 (TAB), 0x0A (new line) and 0x0D (CR).

Until now, I only needed to replace the 0x00 character. My previous OS was AIX (without GNU commands), so I couldn't rely on sed (I could use it, but it had some limitations). I found the following perl command, which worked as expected:

perl -p -e 's/\x0/ /g' $FILE_IN > $FILE_OUT 

Now I'm working on Linux, so I expected to be able to use the sed command.

My questions:

  • Is this command appropriate to replace those characters? I tried, and it seems to work, but I want to make sure:

    perl -p -e 's/[\x00-\x08\x0B\x0C\x0E-\x1F]/ /g' $FILE_IN > $FILE_OUT  
    
  • I thought perl -p works like sed. So why does the previous command work (at least, it doesn't fail), while the next one doesn't?

    sed -e 's/[\x00-\x08\x0B\x0C\x0E-\x1F]/ /g' $FILE_IN > $FILE_OUT   
    

    It tells me:

    sed: -e expression #1, char 34: Invalid collation character

Albert
  • perl -p prints each line of input after applying the operations you specify; in this case that is just the replacement. sed's regex syntax may differ from perl's. –  Jul 14 '17 at 17:14

2 Answers


That's a typical job for tr:

LC_ALL=C tr '\0-\10\13\14\16-\37' '[ *]' < in > out

In your case, it doesn't work with sed because you're in a locale where those ranges don't make sense. If you want to work with byte values rather than characters, with the order based on the numerical value of those bytes, your best bet is to use the C locale. Your code would have worked with LC_ALL=C and GNU sed, but using sed (let alone perl) is a bit overkill here. Also, those \xXX escapes are not portable across sed implementations, while this tr approach is POSIX.
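As a quick sanity check of the tr command above, here is a minimal sketch that runs it on a few sample bytes. The sample input and the variable name cleaned are made up for illustration; only the tr invocation comes from this answer.

```shell
# Hypothetical sample input: 'a', NUL, 'b', TAB, 'c', ESC (0x1B), 'd', newline.
# NUL and ESC fall inside the replaced ranges; TAB (0x09) and newline (0x0A) do not.
cleaned=$(printf 'a\0b\tc\033d\n' | LC_ALL=C tr '\0-\10\13\14\16-\37' '[ *]')

# Dump the result byte by byte to see which bytes became spaces.
printf '%s\n' "$cleaned" | od -c
```

The od -c dump should show spaces where NUL and ESC were, with the TAB left untouched.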

You can also trust your locale's idea of what printable characters are with:

tr -c '[:print:]\t\r\n' '[ *]'

But with GNU tr (as typically found on Linux-based systems), that only works in locales where characters are single-byte (so typically, not UTF-8).

In the C locale, that would also replace DEL (0x7f) and every byte value above it (those outside ASCII).
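To see the C-locale behaviour of the [:print:] variant, a small sketch along these lines can help. The sample bytes and the variable name kept are assumptions for illustration, and LC_ALL=C is forced here so the result does not depend on your own locale.

```shell
# Hypothetical sample: 0x01 (control), 0x7F (DEL) and 0xE9 (outside ASCII)
# sit between printable letters; TAB and newline are in the kept set.
kept=$(printf 'a\001b\177c\351d\te\n' | LC_ALL=C tr -c '[:print:]\t\r\n' '[ *]')

# Dump the result byte by byte to see which bytes became spaces.
printf '%s\n' "$kept" | od -c
```

In a single-byte locale such as ISO-8859-1, dropping the LC_ALL=C prefix would instead let 0xE9 through as a printable character, assuming that locale is installed.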

In UTF-8 locales, you could use GNU sed which doesn't have the problem GNU tr has:

sed 's/[^[:print:]\r\t]/ /g' < in > out

(Note that \r and \t are not standard here: GNU sed won't recognize them if POSIXLY_CORRECT is in the environment, and will instead treat the backslash, r and t as literal members of the set, as POSIX requires.)

Note, though, that it would not convert any bytes that don't form valid characters.

  • I understand what the tr command does. I understand (more or less) what LC_ALL=C is, but not the two together. Nevertheless, tr -d removes those characters, but I want to replace them with spaces. Sorry, the title was wrong. I only realized when @don_crissti edited it. – Albert May 06 '15 at 11:33
  • @Albert, sorry. See the edit and the link I added. – Stéphane Chazelas May 06 '15 at 11:36
  • I'm not sure about the encoding. That file comes from a HOST environment, which uses EBCDIC encoding, and it is transferred to Linux using XCOM. For instance, non-ASCII characters like É are encoded (according to od -xa) as 0xC9, so I guess it would be ISO-8859-1. – Albert May 06 '15 at 12:05
  • @Albert, probably. You can use locale -a to see if there are locales with iso8859-1 as the charset on your system and use LC_CTYPE=<that-locale> tr ...[:print:]... to convert non-printables in that locale. Or you can use iconv to convert those files to your locale's charset. – Stéphane Chazelas May 06 '15 at 12:20
  • I think it's not needed, because my locale's charset is set to LC_ALL=en_US.iso88591. So, your command (tr -c '[:print:]\t\r\n' '[ *]') works perfectly without modifying locale or converting file. Thanks a lot. – Albert May 06 '15 at 12:29
  • What does "r and t being part of the set as POSIX requires" mean? – robertspierre Jan 11 '24 at 09:37
  • @robertspierre, POSIX requires that [\r\t] match on either backslash, r or t. Matching on either the CR or TAB control character as the GNU implementation of sed does is forbidden by the POSIX specification. That's why GNU sed doesn't do it when in POSIX mode. – Stéphane Chazelas Jan 11 '24 at 09:48

I was trying to send a notification via libnotify, with content that may contain unprintable characters. The existing solutions did not quite work for me (a whitelist of characters with tr works, but strips any multi-byte characters).

Here is what worked, while passing the test:

message=$(iconv --from-code=UTF-8 -c <<< "$message")
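For illustration, here is a minimal sketch of what the -c option does on a sample string. The sample value is made up, and the sketch uses the short -f/-t options with an explicit UTF-8 target and a printf pipe instead of <<<, which are assumptions not taken from the command above. With -c, iconv omits input that cannot be converted, such as bytes that are not valid UTF-8.

```shell
# Hypothetical message containing 0xFF, which can never occur in valid UTF-8.
message=$(printf 'ok\377!')

# -c drops what cannot be converted. Some iconv implementations still exit
# non-zero after dropping input, hence the "|| true".
message=$(printf '%s' "$message" | iconv -f UTF-8 -t UTF-8 -c) || true

printf '%s\n' "$message"
```

Note that valid control characters such as 0x01 are still valid UTF-8, so this approach on its own leaves them in place.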