Is the `utf8` in `en_US.utf8` a canonical character set?

Question

The output of `locale -a` seems to distinguish between upper and lowercase:

% locale -a 
C
en_AU.utf8
en_US.utf8
POSIX

More commonly, I've seen the hyphenated and uppercase UTF-8.

What is the canonical name for utf8 / UTF-8?

Hm, interesting. The result of locale -a on my Debian 10 is: C C.UTF-8 el_GR.utf8 POSIX. Mixed upper and lower case. — Krackout, Aug 22 '20 at 15:32
Compare, also, https://unix.stackexchange.com/q/597962/5132 and https://unix.stackexchange.com/q/573886/5132 . — JdeBP, Aug 22 '20 at 16:31
@Krackout curiously inconsistent! My output is from Manjaro. — Tom Hale, Aug 22 '20 at 16:41
All of the components here technically do not distinguish case. That does not mean that a particular operating system permits case folding, and technically utf8 is not a valid character set name according to IANA (it's UTF-8). — bk2204, Aug 23 '20 at 22:52
ISO 639-1, ISO 3166-1, and RFC 2978 do not create colliding values that are distinguished only by case. Because POSIX specifies that they are implementation-defined, an implementation may choose to permit case folding or not, provided that it documents the behavior it supports. That's what “implementation defined” means. Notably, POSIX does not require the use of IANA character sets here. — bk2204, Aug 24 '20 at 18:47

score 4 · Answer 1 · edited Oct 07 '21 at 07:34

TL;DR: Nope.

utf8 doesn't refer to an IANA character set, since it drops the hyphen (-).
IANA character set names are case-insensitive.
Therefore, the following all refer to the character set defined in RFC 3629, "UTF-8, a transformation format of ISO 10646":
- UTF-8
- utf-8
- uTf-8
Note that all of them have a hyphen.
There is a case-sensitive alias of the above name: csUTF8

The details

POSIX.1-2017, section 8.2 Internationalization Variables

If the locale value has the form:
language[_territory][.codeset]
it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
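
To make the quoted form concrete, here is a minimal C sketch that splits a locale name of the shape language[_territory][.codeset] into its parts. The names `struct locale_parts` and `split_locale_name` are hypothetical, introduced only for illustration. The sketch handles only the form quoted above and says nothing about which values an implementation accepts.

```c
#include <string.h>

/* Hypothetical container for the three parts of a locale name. */
struct locale_parts {
    char language[32];
    char territory[32];
    char codeset[32];
};

/* Split "language[_territory][.codeset]" into its parts.
   Missing optional parts are left as empty strings.
   Returns 0 on success, -1 if a part does not fit its buffer. */
int split_locale_name(const char *name, struct locale_parts *out)
{
    const char *dot = strchr(name, '.');
    size_t head_len = dot ? (size_t)(dot - name) : strlen(name);
    /* Look for '_' only before the dot: a codeset may contain one too. */
    const char *us = memchr(name, '_', head_len);
    size_t lang_len = us ? (size_t)(us - name) : head_len;
    size_t terr_len = us ? head_len - lang_len - 1 : 0;
    size_t code_len = dot ? strlen(dot + 1) : 0;

    if (lang_len >= sizeof out->language ||
        terr_len >= sizeof out->territory ||
        code_len >= sizeof out->codeset)
        return -1;

    memcpy(out->language, name, lang_len);
    out->language[lang_len] = '\0';
    memcpy(out->territory, us ? us + 1 : "", terr_len);
    out->territory[terr_len] = '\0';
    memcpy(out->codeset, dot ? dot + 1 : "", code_len);
    out->codeset[code_len] = '\0';
    return 0;
}
```

Called on en_US.utf8, this yields language en, territory US, and codeset utf8. Whether that codeset string means anything is still up to the implementation.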

But while POSIX.1 leaves the details implementation-defined, IANA has something to say about character set names.

RFC2978 IANA Charset Registration Procedures

Section 2.3, Naming Requirements, defines a character set primary name:

 mime-charset = 1*mime-charset-chars
 mime-charset-chars = ALPHA / DIGIT /
            "!" / "#" / "$" / "%" / "&" /
            "'" / "+" / "-" / "^" / "_" /
            "`" / "{" / "}" / "~"
 ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
 DIGIT        = "0".."9"    ; Numeric digit

Note the comment on ALPHA: "Case insensitive ASCII Letter".

Interestingly, this means that ^-^ is a happy but valid character set name.
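
As a rough check of the grammar quoted above, this C sketch tests whether a string matches the mime-charset production. The function name `is_mime_charset` is hypothetical. The sketch checks syntax only: it does not consult the IANA registry and does not enforce any length limit.

```c
#include <stdbool.h>
#include <string.h>

/* Return true if `name` matches the mime-charset grammar quoted from
   RFC 2978 section 2.3: one or more of ALPHA / DIGIT / listed symbols. */
bool is_mime_charset(const char *name)
{
    static const char symbols[] = "!#$%&'+-^_`{}~";

    if (*name == '\0')
        return false;            /* 1*... requires at least one character */

    for (const char *p = name; *p != '\0'; ++p) {
        char c = *p;
        /* Explicit ASCII ranges, so the result does not depend on the
           current locale the way isalpha() would. */
        bool alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        bool digit = (c >= '0' && c <= '9');
        if (!alpha && !digit && strchr(symbols, c) == NULL)
            return false;
    }
    return true;
}
```

Under this check, both UTF-8 and ^-^ pass, and so does utf8. Matching the grammar only makes a name well-formed; whether it is a registered name is a separate question.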

IANA Character Sets

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters. [emphasis mine]

IANA lists the character set as UTF-8.

While utf-8 (or uTf-8) is an official IANA character set name, utf8 (sans hyphen) is not.

Note that there is also a case-sensitive alias for the name UTF-8, namely: csUTF8.

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

If it's not IANA, where does `utf8` likely come from?

glibc's _nl_normalize_codeset() does the following:

- Passes only letters and digits (goodbye, hyphen)
- Converts letters to lowercase

for (cnt = 0; cnt < name_len; ++cnt)
  if (__isalpha_l ((unsigned char) codeset[cnt], locale))
    *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
  else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
    *wp++ = codeset[cnt];

The code comment incorrectly says:

There is no standard for the codeset names.

This comment doesn't seem cognisant of RFC2978 IANA Charset Registration Procedures, 2.3. Naming Requirements.
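
To see the effect of the quoted loop in isolation, here is a simplified, standalone C sketch of the same idea. It is not glibc's code: the name `normalize_codeset_sketch` is hypothetical, it uses the standard <ctype.h> functions instead of glibc's internal `__isalpha_l` and `__tolower_l`, and it mirrors only the lines quoted above, so the real function may do more.

```c
#include <ctype.h>
#include <stddef.h>

/* Copy only letters (lowercased) and digits from `codeset` into `out`,
   dropping every other character. Classification follows the current
   C locale. Returns the number of characters written, or (size_t)-1
   if `out` is too small to hold the result and its terminator. */
size_t normalize_codeset_sketch(const char *codeset, char *out, size_t out_size)
{
    size_t n = 0;

    for (const char *p = codeset; *p != '\0'; ++p) {
        unsigned char c = (unsigned char)*p;
        if (!isalpha(c) && !isdigit(c))
            continue;                 /* drops '-', '_', and so on */
        if (n + 1 >= out_size)
            return (size_t)-1;        /* no room for char plus '\0' */
        out[n++] = isalpha(c) ? (char)tolower(c) : (char)c;
    }
    if (out_size == 0)
        return (size_t)-1;
    out[n] = '\0';
    return n;
}
```

Feeding it UTF-8 produces utf8, which is consistent with the lowercase, hyphen-free names in the `locale -a` output from the question.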

On some operating systems, the implementation-defined stuff makes the Unicode Common Locale Data Repository, RFC 2978, and the IANA Character Set Registry the next stops. — JdeBP, Aug 22 '20 at 17:40
Thank the FreeBSD and DragonFlyBSD people. https://wiki.freebsd.org/LocaleNewApproach https://gitweb.dragonflybsd.org/dragonfly.git/commit/252345ebec4e8957a908d352a149589ec3dfee09 https://svnweb.freebsd.org/base?view=revision&revision=286434 https://svnweb.freebsd.org/base?view=revision&revision=290494 https://svnweb.freebsd.org/base/head/tools/tools/locale/etc/charmaps/charmaps.txt?view=markup#l1 — JdeBP, Aug 25 '20 at 06:44
