I want to filter a file character by character (to remove invalid XML characters from input whose generation I cannot control), but I cannot even copy individual characters from one file to another correctly. I have used printf before to copy literal sections, including carriage returns, but now it copies a carriage return not as itself but as a zero-length string. My code:
infile=$1
outfile=$2
touch "$outfile"
while IFS= read -r -n1 char
do
    # copy one character at a time
    printf "%s" "$char" >> "$outfile"
done < "$infile"
diff "$infile" "$outfile"
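If it matters, I suspect the empty string comes from read itself: read strips its line delimiter, so a successful read that leaves char empty means a newline was just consumed. A sketch of the loop with that case handled (assuming bash; this still cannot carry NUL bytes, since the shell cannot store them in a variable):

: > "$outfile"    # truncate instead of touch, so reruns start clean
while IFS= read -r -n1 char
do
    if [ -z "$char" ]; then
        # read swallowed its delimiter: put the newline back
        printf '\n' >> "$outfile"
    else
        printf '%s' "$char" >> "$outfile"
    fi
done < "$infile"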
I don't mind using sed or awk, but I would have to encode the allowed characters.
Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */
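Spelling that production out by hand would, I assume, look something like this perl one-liner (perl rather than sed, since its regexes take \x{...} Unicode escapes; a sketch only, not tested):

# Keep only characters matched by the XML Char production above.
# -CSD makes perl treat stdin/stdout and file data as UTF-8.
perl -CSD -pe 's/[^\x{9}\x{A}\x{D}\x{20}-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]//g' \
    < "$infile" > "$outfile"

The upper bound \x{FFFD} already leaves out FFFE and FFFF.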
iconv? – DopeGhoti Feb 01 '18 at 20:32

0x00 is a valid byte in many UTF-16 characters (and is also a valid character in standard UTF-8 - modified UTF-8 uses 0xC0 0x80 in place of any actual NUL bytes, allowing NUL to be used as terminator) - and sh is completely incapable of storing NUL bytes in any variable. Use awk or perl instead. Or some other unicode-aware tool like iconv instead. – cas Feb 02 '18 at 01:53

FFFE and FFFF are defined as non-characters. If you choose to delete them, then you should also delete all other 64 non-characters. – Feb 03 '18 at 20:41
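Following up on the iconv suggestion: as far as I can tell, iconv -c only drops byte sequences that are malformed for the encoding, so it would catch broken UTF-8 but not characters (for example 0x0B) that are valid UTF-8 yet outside the Char production. A sketch, assuming UTF-8 input:

# Strip malformed UTF-8 byte sequences only; valid-but-disallowed
# XML characters pass through unchanged.
iconv -f UTF-8 -t UTF-8 -c < "$infile" > "$outfile"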