3

I understand that e.g. catfish and gnome-search-utils can both search inside file contents that are UTF-8 encoded. To search for words or numbers within UTF-16 text files, one would first have to convert them to UTF-8 via iconv.
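
For a one-off search, that conversion can be done on the fly; a minimal sketch (notes.txt is a hypothetical file name, and iconv's plain UTF-16 converter expects a BOM, otherwise use UTF-16LE or UTF-16BE explicitly):

iconv -f UTF-16 -t UTF-8 notes.txt | grep 'pattern'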

If the file is known, text editors like gedit or mousepad have no trouble with UTF-16.

Why is there no search tool (GUI or command-line) in any Linux distribution that can handle UTF-16 encoded text files?

I'm on Xubuntu.

  • 6
    ripgrep 0.5.0 supports UTF-16, but (rant) it is a terrible encoding that should never be used, as 1) a UTF-16 string cannot be a C string if it contains any ASCII characters, 2) it is just as much a variable-width encoding as UTF-8, 3) many tools choke on the BOM, but it is necessary to disambiguate endianness – Fox May 09 '17 at 15:52
  • 2
    See also http://utf8everywhere.com/ – tripleee May 09 '17 at 18:40
  • @Fox -- you would no more encode a user string in UTF-16 in C than you would encode it in UTF-8. C only handles ASCII, and you need library functions to convert strings to (or from) UTF-8 or UTF-16. However, I tend to agree UTF-16 is icky -- especially since it's often UCS-2 in disguise (no BOM, only supports up to Unicode 2.0) -- especially when talking about Windows OS files (log files and reg files may not have BOMs, for example). – Astara Aug 25 '17 at 02:20
  • 1
    @Astara My statement about C-strings was a quick summary of: if a character is in the subset of Unicode that overlaps with ASCII, its encoding in UTF-16 (or UCS-2) contains a null-byte. The only character containing a null-byte in UTF-8 is NUL itself. This means that you can use functions from the standard C library to read, write, copy, etc. UTF-8 strings, but not UTF-16. You won't get proper change-case support, of course, but the basics are free. In any case, this appears to be a digression from a digression – Fox Aug 25 '17 at 02:38
  • @Fox - updating my comment: C has supported wide characters since C90 (see https://en.wikipedia.org/wiki/C_string_handling), which introduced the wchar_t type, capable of holding 32-bit characters -- more than enough for UCS-2. The main problem with UTF-16 is similar to UTF-8's: characters use a variable number of bytes (UTF-8: 1-4 bytes, UTF-16: 2 or 4 bytes). UCS-2 was limited to 16 bits and only supported up to, I believe, Unicode 2.0. – Astara Sep 03 '17 at 20:33
  • @Fox it's because you use the wrong element type for the string. An array of wchar_t must be used for a C string encoded in UTF-16. It's just that C doesn't have good Unicode support because it's too old. Things are better in C++ because of templates, and any type of string can be used. Besides, string length can be computed in O(n) – phuclv Jan 18 '19 at 15:20

2 Answers

6

UTF-16 (or UCS-2) is highly unfriendly to the null-terminated strings used by the C standard library and the POSIX ABI. For example, command-line arguments are terminated by NULs (bytes with value zero), and the UTF-16 encoding of any character with code point below 256 contains a zero byte, so ordinary strings of English letters, encoded in UTF-16, would be impossible to pass as command-line arguments.
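
This is easy to see from the shell (a quick sketch; od just dumps the raw bytes):

printf 'AB' | iconv -f UTF-8 -t UTF-16LE | od -An -tx1
 41 00 42 00

Every second byte is zero, and any one of those zeros would terminate a C string.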

That in turn means that either the utilities would need to take input in some other format (say UTF-8) and convert to UTF-16; or they would need to take their input in some other way. The first option would require all such utilities to contain (or link to) code for the conversion, and the second would make interfacing those programs to other utilities somewhat difficult.
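
The first option can be emulated at the shell level today by doing the conversion at the boundary; a minimal wrapper sketch (utf16grep is a hypothetical name, --label is GNU grep specific, and plain UTF-16 assumes the files carry a BOM):

utf16grep() {
    # search UTF-16 files by converting each one to UTF-8 on the fly
    pattern=$1; shift
    for f in "$@"; do
        iconv -f UTF-16 -t UTF-8 -- "$f" | grep -H --label="$f" -- "$pattern"
    done
}

Invoked as, say, utf16grep 'hello' *.txt, it reports matches under the original file names.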

Given those difficulties, and the fact that UTF-8 has better backwards-compatibility properties, I'd guess that few people care enough about UTF-16 to be motivated to create tools for it.

ilkkachu
  • 138,973
  • The null terminator in UTF-16 is two null bytes in a row -- that is what encodes the null character in UTF-16. If your command line handles UTF-16, then the ASCII (or Unicode) letter 'A' would be internally represented as 0x41 0x00 (on Windows x86 the lower byte always comes first, often called LSB-first, vs. MSB-first). The thing in C is that UTF-16 is an encoding BELOW what the language uses. C uses user strings which are automatically converted to the platform's native encoding. So a C program printing "hello world\n" works on all C-supporting platforms. – Astara Aug 25 '17 at 02:15
  • @Astara, well, in practice, the tools that exist assume a character of 8 bits, so the first 8-bit byte with value 0 terminates the string. POSIX also defines a string as "A contiguous sequence of bytes terminated by and including the first null byte.", and that a byte is exactly the same as an octet, i.e. 8 bits. So yeah, you'd need to have a tool that explicitly supports UTF-16. – ilkkachu Aug 25 '17 at 15:53
  • We aren't talking about '8-bit' interfaces between tools -- we are talking about character interfaces between tools. Whether those characters are 8 or 32 bits internally isn't something passed out to external tools. The original question asked for a find tool to search for text in files that was UTF-16 encoded. The find.exe included in /windows/system32 does that. – Astara Aug 26 '17 at 00:32
  • @Astara, well, the read() and write() system calls deal in bytes, so the interpretation of a character must be done in the tool. – ilkkachu Aug 26 '17 at 17:55
  • There are no read/write "system" calls on NT. On Windows, there are read/write library calls that present I/O as 8-bit chars, but on NT those library calls convert from 8-bit to 16-bit when talking to the system. – Astara Aug 27 '17 at 15:44
  • @Astara, so, are you talking about Windows or some Unix-like system? The question mentions Xubuntu (Linux), and this particular site is aimed at "users of Linux, FreeBSD and other Un*x-like operating systems". I'm pretty sure what I said about the application having to interpret bytes it reads from a file is accurate on a POSIX system, and that's what those Unix systems are. – ilkkachu Aug 29 '17 at 12:16
  • @ilkkachu - You mention the read/write system calls. Those calls don't care about the encoding or format of the data -- they work fine because they only handle bytes (not characters) and work with byte lengths. They can read/write UTF-16 data as easily as ASCII. It's the upper-level application libs (like glibc) that support characters (like UTF-8 and UTF-16). Those libs supported 'w' versions of the calls BEFORE they supported UTF-8 (which is harder). glibc supports UTF-16 -- but no one has written apps for it. (https://www.gnu.org/software/libunistring/manual/libunistring.html, sec 1.7) – Astara Aug 31 '17 at 17:59
  • @Astara, I'm quite sure I already stated that you'd need to have a tool that explicitly supports UTF-16. Using a library that provides that support counts as explicitly having it, because the programmer has to use that library. – ilkkachu Aug 31 '17 at 20:36
  • @ilkkachu -- All of the C-using programs use some form of "libc" -- on Linux systems, that's usually "glibc". If you define libraries as "tools", then of course you need to have a "tool" (a library) that supports UTF-16. On Linux, glibc has supported wide characters (32-bit) since the C90 standard, from about 25 years ago. Saying that you need a library that supports it is a non-requirement, as glibc on Linux has supported it for over 20 years. As for talking to a kernel via read/write calls, they take byte lengths that are never confused by NULs indicating EOL where they shouldn't. – Astara Sep 03 '17 at 20:20
  • @ilkkachu Also, in regards to C having problems with UTF-16, see https://en.wikipedia.org/wiki/C_string_handling. Specifically, note that C uses a null character (not a nul byte) as the end of string. With a wchar type, a null character is usually 32 bits long. Manipulating 16-bit UTF-16 hasn't been a problem since, at least, the C90 standard. The simple reason why there isn't a UTF-16 'grep' on Linux is that there hasn't been a demand for one. The GNU utils (part of all Linux distros that I know of) are still "in process" on supporting UTF-8. UTF-16 isn't a priority. – Astara Sep 03 '17 at 20:28
2

Install the ripgrep utility, which supports UTF-16.

For example:

rg pattern filename

ripgrep supports searching files in text encodings other than UTF-8, such as UTF-16, latin-1, GBK, EUC-JP, Shift_JIS and more. (Some support for automatically detecting UTF-16 is provided. Other text encodings must be specifically specified with the -E/--encoding flag.)
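
The automatic UTF-16 detection relies on the byte-order mark, so a file without a BOM needs its encoding named explicitly; a sketch (utf-16le/utf-16be are among the labels ripgrep accepts):

# a file without a BOM needs its encoding named explicitly:
rg -E utf-16le pattern filename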

kenorb
  • 20,988