
I want to filter a file by character (to remove invalid XML characters whose generation I cannot control), but I cannot seem even to copy individual characters from one file to another. I have used printf to copy literal sections, including carriage returns, before, but now it does not copy a carriage return as one, but as some zero-length string. My code:

infile=$1
outfile=$2
touch "$outfile"
while IFS= read -r -n1 char
do
        # copy one character at a time
        printf "%s" "$char" >> "$outfile"
done < "$infile"
diff "$infile" "$outfile"

I don't mind using sed or awk, but I would have to encode the allowed characters:

Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

  • Have you looked into iconv? – DopeGhoti Feb 01 '18 at 20:32
  • Without looking deeper into your code, are you aware that there are much better and faster tools than bash available to you for "text stream filtering/conversion", like perl/sed/awk/...? – Alex Stragies Feb 01 '18 at 20:42
  • I use sed and awk within bash, so sure, those work, but I don't know how to encode the characters into an argument for them. – Timothy Swan Feb 01 '18 at 21:03
  • Doing this in bash or sh is guaranteed to fail on many real-world Unicode input files (including all UTF-16 input), because the NUL byte (0x00) is a valid byte in many UTF-16 characters (and is also a valid character in standard UTF-8 - modified UTF-8 uses 0xC0 0x80 in place of any actual NUL bytes, allowing NUL to be used as a terminator) - and sh is completely incapable of storing NUL bytes in any variable. Use awk or perl, or some other Unicode-aware tool like iconv, instead. – cas Feb 02 '18 at 01:53
  • Plain removal of characters changes the meaning of the file. Read the details here. It shows a way to create a malicious string: an attacker could make a malformed string that, upon removal of the malformed character, becomes something else. From this, it is reasonable to use something to detect that a file has incorrect characters, but it is not advisable to delete them. –  Feb 03 '18 at 20:38
  • The characters FFFE and FFFF are defined as non-characters. If you choose to delete them, then you should also delete all other 64 non-characters. –  Feb 03 '18 at 20:41

1 Answer


Carriage return shouldn't be a problem; read should read it just fine. The newline (linefeed) is, since it's the default delimiter for read. You could use the read -d '' trick (setting the delimiter to a NUL byte) to make it work.

echo $'\r' | { IFS= read -r -n1 x; echo "$x"|xxd; }          # CR
echo $'\n' | { IFS= read -r -n1 x; echo "$x"|xxd; }          # LF fails
echo $'\n' | { IFS= read -d '' -r -n1 x; echo "$x"|xxd; }    # LF ok
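
Applied to the loop in the question, that would look something like the sketch below (untested here, still slow, and as the comments note, a shell variable cannot hold a NUL byte anyway):

while IFS= read -d '' -r -n1 char
do
        # -d '' makes NUL the delimiter, so newlines pass through as data
        printf "%s" "$char" >> "$outfile"
done < "$infile"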

But, like they say, you probably don't want to do stuff like this in the shell. tr would be just what you need for deleting a fixed set of characters, but at least GNU tr works on bytes, not characters, so it's not much use for Unicode.
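If the data were plain ASCII, a sketch like this would be enough, deleting the bytes below 0x20 that the XML grammar forbids (everything except tab, linefeed and carriage return):

tr -d '\000-\010\013\014\016-\037' < in > out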

I think this Perl should work, for UTF-8 data, if your locales are correctly set to UTF-8:

perl -C -pe 'tr/\x09\x0a\x0d\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}//cd' < in > out

But better test it; I'm not that used to Unicode quirks.

tr/abc//cd deletes the characters that are not listed in abc (tr/// is actually meant to transform characters into others; see perlop). It takes lists of characters as well as ranges, where \xHH means the character with hex value HH, and \x{HHHH} the one with value HHHH. So the above accepts 0x09, 0x0a, 0x0d, everything from 0x20 to 0xd7ff, and so on.

The list above is taken directly from the list presented in the question. I'll leave it to the end user to evaluate if it should be changed.
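
For a quick test, here is a made-up input where bash's printf produces the invalid characters 0x01 and 0x0b; only 61 62 63 0a (abc plus the newline) should survive:

printf 'a\x01b\x0bc\n' | perl -C -pe 'tr/\x09\x0a\x0d\x20-\x{d7ff}\x{e000}-\x{fffd}\x{10000}-\x{10ffff}//cd' | xxd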

– ilkkachu
  • Correct list for perl tr should be \x{9}-\x{A}\x{D}-\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FDCF}\x{FE00}-\x{FFFD}\x{10000}-\x{1FFFD}\x{20000}-\x{2FFFD}\x{30000}-\x{3FFFD}\x{40000}-\x{4FFFD}\x{50000}-\x{5FFFD}\x{60000}-\x{6FFFD}\x{70000}-\x{7FFFD}\x{80000}-\x{8FFFD}\x{90000}-\x{9FFFD}\x{A0000}-\x{AFFFD}\x{B0000}-\x{BFFFD}\x{C0000}-\x{CFFFD}\x{D0000}-\x{DFFFD}\x{E0000}-\x{EFFFD}\x{F0000}-\x{FFFFD}\x{100000}-\x{10FFFD} –  Feb 02 '18 at 00:39
  • I used the perl line which this answer recommended and it worked for the instance that I was dealing with. I may replace @isaac's line if that's the consensus. Can someone comment about the difference or recommend me an official resource to understand that syntax for myself? – Timothy Swan Feb 02 '18 at 17:43
  • That line is removing the non-characters U+FDD0-U+FDEF and FFFE-FFFF in each of the Unicode planes (FFFE-FFFF, 1FFFE-1FFFF, 2FFFE-2FFFF, … up to 10FFFE-10FFFF). Non-characters should not be used in interchange with other users. –  Feb 02 '18 at 17:58
  • @TimothySwan, yeah, I just copied the list from your question. The beginning in isaac's list looks the same (though I think the range \x{D}-\x{D} is just redundant; it's the same as plain \x0d), but it does exclude #fdd0 to #fdff and all the higher ones where the low byte is ff or fe – ilkkachu Feb 03 '18 at 00:31
  • @isaac, though do note that the link you refer to also has this question and answer: "Q: Can noncharacters simply be deleted from input text? A: No. Doing so can lead to security problems." so it would probably be better to do something else about them. Replace with some known placeholder? Just reject the whole input? – ilkkachu Feb 03 '18 at 12:21
  • @ilkkachu Well, non-characters are used to mark a known placeholder, but they should only be used for local exchange (within the realm of a single application). As an XML file's purpose is to exchange data between several parties, non-characters should be regarded as invalid and removed from the original file. If that "security issue" (mostly related to a way of authenticating file content) is a concern, just diff the original against the result; if they differ, reject the whole file. The perl command still allows that action. –  Feb 03 '18 at 19:56
  • If non-characters should be allowed, the actual string given to Perl should be 'tr/\x09\x0a\x0d\x20-\x{d7ff}\x{e000}-\x{10ffff}//cd'; this answer lists a range that removes FFFE and FFFF. Either remove all or remove none. –  Feb 03 '18 at 20:11
  • On digging deeper into this issue, the "security issue" of "Unicode Technical Report #36" details that the removal of a code point may change the meaning and use of the file. If that is the concern then, really, any code point removed will change the meaning of the file; as such, the perl command may be used to check the input file but not to "clean" it. –  Feb 03 '18 at 20:33