For some testing purposes I need a string with invalid unicode characters. How to create such string in Zsh?
1 Answer
I assume you mean UTF-8 encoded Unicode characters.
That depends on what you mean by invalid.
invalid_byte_sequence=$'\x80\x81'
That's a sequence of bytes that, by itself, isn't valid in the UTF-8 encoding (the first byte of a multi-byte UTF-8 character always has its two highest bits set, whereas 0x80 has only the highest one set, which marks a continuation byte). Such a sequence could be seen in the middle of a character though, so it could end up forming a valid sequence once concatenated to another invalid sequence like $'\xe1'. $'\xe1' or $'\xe1\x80' themselves would also be invalid and could be seen as a truncated character.
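To make the concatenation point concrete, here is a minimal sketch of the arithmetic a decoder performs on a 3-byte sequence. The function name utf8_decode3 is made up for this illustration, and it does no validation at all: it only extracts the payload bits and prints the resulting code point.

```shell
# Hypothetical helper: decode three bytes (given as numbers such as 0xe1)
# as if they formed one 3-byte UTF-8 character. No validity checks are done.
utf8_decode3() {
  local b1=$(( $1 )) b2=$(( $2 )) b3=$(( $3 ))
  # Lead byte 1110xxxx carries 4 bits, each continuation byte 10xxxxxx carries 6.
  printf 'U+%04X\n' $(( ((b1 & 0x0f) << 12) | ((b2 & 0x3f) << 6) | (b3 & 0x3f) ))
}

utf8_decode3 0xe1 0x80 0x81   # $'\xe1' followed by $'\x80\x81'
```

So $'\xe1' followed by $'\x80\x81' is the encoding of U+1001, even though neither half is valid on its own.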
other_invalid_byte_sequence=$'\xc2\xc2'
The 0xc2 byte would start a 2-byte character, and 0xc2 cannot be in the middle of a UTF-8 character. So that sequence can never be found in valid UTF-8 text. Same for $'\xc0' or $'\xc1', which are bytes that never appear in the UTF-8 encoding.
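The byte roles used in the reasoning above can be summarised in a small classifier. This is only a sketch: the function name utf8_byte_class and its labels are invented for this example, and the ranges assume UTF-8 as currently restricted to code points up to 0x10FFFF.

```shell
# Hypothetical helper: print the role a byte value (0-255, e.g. 0xc2)
# can play in UTF-8 restricted to U+0000..U+10FFFF.
utf8_byte_class() {
  local b=$(( $1 ))
  if   [ "$b" -le 127 ]; then echo ascii          # 0x00-0x7F, single-byte character
  elif [ "$b" -le 191 ]; then echo continuation   # 0x80-0xBF, only after a lead byte
  elif [ "$b" -le 193 ]; then echo never-valid    # 0xC0, 0xC1
  elif [ "$b" -le 223 ]; then echo lead-2         # 0xC2-0xDF, starts a 2-byte character
  elif [ "$b" -le 239 ]; then echo lead-3         # 0xE0-0xEF, starts a 3-byte character
  elif [ "$b" -le 244 ]; then echo lead-4         # 0xF0-0xF4, starts a 4-byte character
  else                        echo never-valid    # 0xF5-0xFF
  fi
}

for b in 0x80 0xc0 0xc2 0xe1; do
  printf '%s: %s\n' "$b" "$(utf8_byte_class "$b")"
done
```

A lead byte followed by another lead byte, as in $'\xc2\xc2', can therefore never occur in valid text.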
For the \uXXXX and \UXXXXXXXX sequences, I assume the current locale's encoding is UTF-8.
non_character=$'\ufffe'
That's one of the 66 currently specified non-characters.
not_valid_anymore=$'\U110000'
Unicode is now restricted to code points up to 0x10FFFF. And the UTF-8 encoding, which was originally designed to cover up to 0x7FFFFFFF (perl also supports a variant that goes to 0xFFFFFFFFFFFFFFFF), is now conventionally restricted to that as well.
utf16_surrogate=$'\ud800'
Code points 0xD800 to 0xDFFF are reserved for the UTF-16 encoding (they are its surrogates). So the UTF-8 encoding of those code points is invalid.
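If you would rather not depend on the locale, the same two strings can be spelled as explicit bytes. The byte values below are the ones uconv reports for these strings in a UTF-8 locale; the variable names and the hexdump_bytes helper are made up for this sketch.

```shell
# Same strings as explicit bytes, so the result does not depend on the locale.
utf16_surrogate_bytes=$'\xed\xa0\x80'         # what \ud800 expands to in UTF-8
not_valid_anymore_bytes=$'\xf4\x90\x80\x80'   # what \U110000 expands to in UTF-8

# Hypothetical helper: show the bytes of a string as one run of hex digits.
hexdump_bytes() {
  printf %s "$1" | od -An -tx1 | tr -d ' \n'
  echo
}

hexdump_bytes "$utf16_surrogate_bytes"
hexdump_bytes "$not_valid_anymore_bytes"
```

Dumping the bytes this way is also a quick check that a \u or \U escape expanded to what you expected.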
Now most of the remaining code points are still not assigned in the latest version of Unicode.
unassigned=$'\u378'
Newer versions of Unicode come with new characters specified. For instance Unicode 8.0 (released in June 2015) has 🤗 (U+1F917, HUGGING FACE), which was not assigned in earlier versions.
unicode_8_and_above_only=$'\U1f917'
Some testing with uconv:
$ printf %s $invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: 80 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: 81 Error: Illegal character found
$ printf %s $other_invalid_byte_sequence| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: c2 Error: Illegal character found
Conversion to Unicode from codepage failed at input byte position 1. Bytes: c2 Error: Truncated character found
$ printf %s $non_character| uconv -x any-name
\N{<noncharacter-FFFE>}
$ printf %s $not_valid_anymore| uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: f4 90 80 80 Error: Illegal character found
$ printf %s $utf16_surrogate | uconv -x any-name
Conversion to Unicode from codepage failed at input byte position 0. Bytes: ed a0 80 Error: Illegal character found
$ printf %s $unassigned | uconv -x any-name
\N{<unassigned-0378>}
$ printf %s $unicode_8_and_above_only | uconv -x any-name
\N{<unassigned-1F917>}
$
With GNU grep, you can use grep . to see if it can find a character in the input:
l=(invalid_byte_sequence other_invalid_byte_sequence non_character
not_valid_anymore utf16_surrogate unassigned unicode_8_and_above_only)
for c ($l) print -r ${(P)c} | grep -q . && print $c
Which for me gives:
non_character
not_valid_anymore
utf16_surrogate
unassigned
unicode_8_and_above_only
That is, my grep still considers some of those invalid, non-characters or not-assigned-yet characters as being (or containing) characters. YMMV for other implementations of grep or other utilities.
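Another utility you could try, as a sketch only, is iconv: converting from UTF-8 to UTF-8 fails when the decoder rejects the input. The function name is_valid_utf8 is made up here, and I'm assuming an iconv that exits non-zero on undecodable input, as the glibc one does. How it treats surrogates, non-characters or out-of-range code points may differ between implementations, so only the plain invalid byte sequences are shown.

```shell
# Hypothetical helper: succeed if iconv accepts the string as UTF-8.
is_valid_utf8() {
  printf %s "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

for s in 'plain ascii' $'\x80\x81' $'\xc2\xc2'; do
  if is_valid_utf8 "$s"; then
    echo accepted
  else
    echo rejected
  fi
done
```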

for c ($l) print -r ${(P)c} |
– iruvar Dec 07 '15 at 11:52