
The output of locale -a seems to distinguish between uppercase and lowercase:

% locale -a 
C
en_AU.utf8
en_US.utf8
POSIX

More commonly, I've seen the hyphenated and uppercase UTF-8.

What is the canonical name for utf8 / UTF-8?

Tom Hale
  • Hm, interesting. The result of locale -a on my Debian 10 is: C C.UTF-8 el_GR.utf8 POSIX. Mixed upper and lower case. – Krackout Aug 22 '20 at 15:32
  • Compare, also, https://unix.stackexchange.com/q/597962/5132 and https://unix.stackexchange.com/q/573886/5132 . – JdeBP Aug 22 '20 at 16:31
  • @Krackout curiously inconsistent! My output is from Manjaro. – Tom Hale Aug 22 '20 at 16:41
  • All of the components here technically do not distinguish case. That does not mean that a particular operating system permits case folding, and technically utf8 is not a valid character set name according to IANA (it's UTF-8). – bk2204 Aug 23 '20 at 22:52
  • ISO 639-1, ISO 3166-1, and RFC 2978 do not create colliding values that are distinguished only by case. Because POSIX specifies that they are implementation-defined, an implementation may choose to permit case folding or not, provided that it documents the behavior it supports. That's what “implementation defined” means. Notably, POSIX does not require the use of IANA character sets here. – bk2204 Aug 24 '20 at 18:47

1 Answer


TL;DR: Nope.

  • utf8 doesn't refer to an IANA character set since it drops the - character.
  • IANA character set names are case-insensitive.
  • Therefore, the following all refer to RFC 3629, "UTF-8, a transformation format of ISO 10646" (note that all have a hyphen):
    • UTF-8
    • utf-8
    • uTf-8
  • There is a case-sensitive alias of the above name: csUTF8

The details

POSIX.1-2017, section 8.2 Internationalization Variables

If the locale value has the form:

language[_territory][.codeset]

it refers to an implementation-provided locale, where settings of language, territory, and codeset are implementation-defined.
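To make the quoted form concrete, here is a minimal C sketch that splits a locale name into its three parts. The struct and function names (locale_parts, split_locale_name, copy_span) are hypothetical, the 32-byte field size is an arbitrary assumption, and the sketch handles only the language[_territory][.codeset] form quoted above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical container: a field is an empty string when the part is absent.
 * The 32-byte size is an arbitrary assumption for this sketch. */
struct locale_parts {
    char language[32];
    char territory[32];
    char codeset[32];
};

/* Copy len bytes from src into dst, truncating to fit, and NUL-terminate. */
static void copy_span(char *dst, size_t dst_size, const char *src, size_t len)
{
    if (len >= dst_size)
        len = dst_size - 1;
    memcpy(dst, src, len);
    dst[len] = '\0';
}

/* Split "language[_territory][.codeset]" into its parts. */
void split_locale_name(const char *name, struct locale_parts *out)
{
    const char *dot = strchr(name, '.');
    /* End of the language[_territory] prefix. */
    const char *end = dot ? dot : name + strlen(name);
    /* Only look for '_' before the codeset separator. */
    const char *us = memchr(name, '_', (size_t)(end - name));
    const char *lang_end = us ? us : end;

    copy_span(out->language, sizeof out->language, name,
              (size_t)(lang_end - name));

    if (us)
        copy_span(out->territory, sizeof out->territory, us + 1,
                  (size_t)(end - (us + 1)));
    else
        out->territory[0] = '\0';

    if (dot)
        copy_span(out->codeset, sizeof out->codeset, dot + 1, strlen(dot + 1));
    else
        out->codeset[0] = '\0';
}
```

With this sketch, en_AU.utf8 splits into en, AU and utf8, while C.UTF-8 has an empty territory. The codeset is copied verbatim: POSIX leaves its spelling to the implementation.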

But while POSIX.1 leaves the details implementation defined, IANA has something to say about it.

RFC2978 IANA Charset Registration Procedures

2.3. Naming Requirements defines a character set primary name:

 mime-charset = 1*mime-charset-chars
 mime-charset-chars = ALPHA / DIGIT /
            "!" / "#" / "$" / "%" / "&" /
            "'" / "+" / "-" / "^" / "_" /
            "`" / "{" / "}" / "~"
 ALPHA        = "A".."Z"    ; Case insensitive ASCII Letter
 DIGIT        = "0".."9"    ; Numeric digit

Note the Case insensitive ASCII Letter.

Interestingly, this means that ^-^ is a happy but valid character set name.
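As a rough illustration, the grammar above can be checked with a few lines of C. This is only a sketch: the function name is_mime_charset is hypothetical, and it tests syntax only, not whether a name is actually registered with IANA.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if name matches the RFC 2978 section 2.3 rule
 * mime-charset = 1*mime-charset-chars. Checks syntax only. */
bool is_mime_charset(const char *name)
{
    /* The non-alphanumeric characters the grammar allows. */
    static const char specials[] = "!#$%&'+-^_`{}~";

    if (name == NULL || *name == '\0')
        return false;   /* "1*" requires at least one character */

    for (const char *p = name; *p != '\0'; ++p) {
        unsigned char c = (unsigned char)*p;
        bool alpha = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
        bool digit = (c >= '0' && c <= '9');

        if (!alpha && !digit && strchr(specials, c) == NULL)
            return false;
    }
    return true;
}
```

Under this sketch, UTF-8, utf8 and ^-^ all pass as syntactically valid names, while a name containing a space or a dot does not. Whether a syntactically valid name is registered is a separate question.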

IANA Character Sets

These are the official names for character sets that may be used in the Internet and may be referred to in Internet documentation.

The character set names may be up to 40 characters taken from the printable characters of US-ASCII. However, no distinction is made between use of upper and lower case letters. [emphasis mine]

IANA lists the character set as UTF-8.

While utf-8 (or uTf-8) matches the official IANA character set name, utf8 (sans hyphen) is not an IANA character set name.
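The difference between "differs only in case" and "differs by a hyphen" can be sketched in C as an ASCII case-insensitive comparison. The helper names ascii_lower and charset_names_equal are hypothetical. The fold is done by hand over ASCII so that the result does not depend on the current locale.

```c
#include <stdbool.h>

/* Fold one ASCII uppercase letter to lowercase; leave other bytes alone. */
static unsigned char ascii_lower(unsigned char c)
{
    return (c >= 'A' && c <= 'Z') ? (unsigned char)(c - 'A' + 'a') : c;
}

/* Hypothetical helper: true if two names are equal ignoring ASCII case. */
bool charset_names_equal(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (ascii_lower((unsigned char)*a) != ascii_lower((unsigned char)*b))
            return false;
        ++a;
        ++b;
    }
    return *a == *b;   /* both strings must end at the same point */
}
```

Compared against the registered name UTF-8, both utf-8 and uTf-8 are equal under this comparison, but utf8 is not, because the missing hyphen is not a case difference.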

Note that there is also a case-sensitive alias for the name UTF-8, namely csUTF8.

The "cs" stands for character set and is provided for applications that need a lower case first letter but want to use mixed case thereafter that cannot contain any special characters, such as underbar ("_") and dash ("-").

If it's not IANA, where does utf8 likely come from?

glibc's _nl_normalize_codeset() does the following:

  • Passes only letters and digits (goodbye, hyphen)

  • Converts letters to lowercase

    for (cnt = 0; cnt < name_len; ++cnt)
      if (__isalpha_l ((unsigned char) codeset[cnt], locale))
        *wp++ = __tolower_l ((unsigned char) codeset[cnt], locale);
      else if (__isdigit_l ((unsigned char) codeset[cnt], locale))
        *wp++ = codeset[cnt];
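
To try the effect of that loop in isolation, here is a standalone C sketch. It is not glibc's API: the function name normalize_codeset_sketch is hypothetical, plain ASCII range checks stand in for the __isalpha_l / __isdigit_l / __tolower_l calls, and it covers only the loop quoted above, omitting anything else the real function does.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical standalone helper, not glibc's function: keeps ASCII letters
 * (lowercased) and digits, and drops every other byte.
 * Returns a malloc'd string the caller must free, or NULL on failure. */
char *normalize_codeset_sketch(const char *codeset, size_t name_len)
{
    char *out = malloc(name_len + 1);   /* output is never longer than input */
    if (out == NULL)
        return NULL;

    char *wp = out;
    for (size_t cnt = 0; cnt < name_len; ++cnt) {
        unsigned char c = (unsigned char)codeset[cnt];

        if (c >= 'A' && c <= 'Z')
            *wp++ = (char)(c - 'A' + 'a');          /* lowercase the letter */
        else if ((c >= 'a' && c <= 'z') || (c >= '0' && c <= '9'))
            *wp++ = (char)c;                        /* keep as-is */
        /* anything else, such as '-', is dropped */
    }
    *wp = '\0';
    return out;
}
```

Passing "UTF-8" with a length of 5 yields "utf8" under this sketch, which matches the spelling shown in the locale -a output in the question.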
    

The code comment incorrectly says:

There is no standard for the codeset names.

This comment doesn't seem cognisant of RFC2978 IANA Charset Registration Procedures, 2.3. Naming Requirements.

Tom Hale
  • On some operating systems, the implementation-defined stuff makes the Unicode Common Locale Data Repository, RFC 2978, and the IANA Character Set Registry the next stops. – JdeBP Aug 22 '20 at 17:40
  • Thanks @JdeBP, updated based on your hints. – Tom Hale Aug 23 '20 at 07:24
  • Thank the FreeBSD and DragonFlyBSD people. https://wiki.freebsd.org/LocaleNewApproach https://gitweb.dragonflybsd.org/dragonfly.git/commit/252345ebec4e8957a908d352a149589ec3dfee09 https://svnweb.freebsd.org/base?view=revision&revision=286434 https://svnweb.freebsd.org/base?view=revision&revision=290494 https://svnweb.freebsd.org/base/head/tools/tools/locale/etc/charmaps/charmaps.txt?view=markup#l1 – JdeBP Aug 25 '20 at 06:44