1

I am trying to learn the basics of Unicode and UTF-8, and it is going very smoothly so far. I know that it is possible to choose the encoding of a file when opening it.

When I type text using the keyboard in a text editor (Gedit or Vim) or a command prompt, what is the encoding being sent by the keyboard to the application? Is it different on Windows machines? Can it be configured?

Stephen Kitt
ob_dev

2 Answers

3

The keyboard does not send characters; it sends scan codes. For example, when you press the key labelled "e" on a typical American keyboard, it sends a scan code which essentially says "3rd key from the left on the 2nd row of alphanumeric keys in the main group". This scan code is converted into a character (or, more generally, into a key symbol, since keys such as "Print Screen" do not produce a character) by the kernel (or some other component of the operating system) and, specifically in Linux, possibly by the graphical subsystem.

Generally the operating system or the graphical subsystem provides one or more utilities which control the conversion tables; for example, both in Windows and in Linux you can install as many keyboard layouts as you please and switch among them with ease.
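To make the idea of a conversion table concrete, here is a toy sketch in Python. It is not how any real operating system stores its layouts; the `LAYOUTS` table and the `translate` function are made up for illustration, and real tables also handle modifiers, dead keys and non-character keys. The scan codes are the PC "set 1" codes for four keys.

```python
# Toy model of a keyboard layout table: the same scan code produces a
# different character depending on the active layout.
# Keys are PC "set 1" scan codes; the names LAYOUTS and translate are
# hypothetical and exist only for this illustration.
LAYOUTS = {
    # scan code: character produced with no modifier held
    "us": {0x10: "q", 0x11: "w", 0x12: "e", 0x1E: "a"},
    "fr": {0x10: "a", 0x11: "z", 0x12: "e", 0x1E: "q"},
}


def translate(scan_code, layout):
    """Look up the character for a scan code under the given layout."""
    return LAYOUTS[layout][scan_code]


if __name__ == "__main__":
    for layout in ("us", "fr"):
        print(layout, [translate(code, layout) for code in (0x10, 0x11, 0x12, 0x1E)])
```

The keyboard sends 0x10 in both cases; only the table consulted by the operating system differs.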

What the application gets depends on the conventions of the operating system. On Windows, console applications get a character encoded according to the current console code page set by the command chcp; graphical applications get a key symbol which usually gets translated into a UTF-16 encoded character. On Linux, applications usually get a UTF-8 encoded character. For example, if I press the key labelled ă (LATIN SMALL LETTER A WITH BREVE, U+0103) with the keyboard layout set correctly,

  • A console application on Windows with chcp 1250 will get one byte '\xE3' (227 decimal).
  • A console application on Windows with chcp 852 will get one byte '\xC7' (199 decimal).
  • A graphical application on Windows will get a suitable key symbol, which usually will get stored/processed as two bytes '\x03' '\x01' (or as the short integer 0x103).
  • A terminal application on Linux will get two bytes '\xC4' '\x83' (<U+0103> in UTF-8 encoding).
  • A graphical application on Linux will get a suitable key symbol, which usually will get stored/processed as two bytes '\xC4' '\x83' (<U+0103> in UTF-8 encoding).
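The byte values in this list can be checked with a short Python sketch. It only encodes the character with Python's built-in codecs; it does not read anything from the keyboard. The helper name `encoded_forms` is made up for this example, and `utf-16-le` is assumed as the byte order for the Windows graphical case.

```python
# Encode U+0103 (LATIN SMALL LETTER A WITH BREVE) with the encodings
# mentioned above and show the resulting bytes.
CHAR = "\u0103"

# Codec names as known to Python's standard library.
CODECS = ("cp1250", "cp852", "utf-16-le", "utf-8")


def encoded_forms(ch):
    """Return a dict mapping codec name to the encoded bytes of ch."""
    return {codec: ch.encode(codec) for codec in CODECS}


if __name__ == "__main__":
    for codec, data in encoded_forms(CHAR).items():
        print(f"{codec:10} {data.hex(' ')}")
```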

(Note that by Windows I mean Windows NT and its successors such as Windows XP, Windows Vista, 7 or 10. Windows 95 etc. are an entirely different line of operating systems, thankfully no longer in use.)

In Vim you get two new layers of translation:

  • You can install a keyboard translation map with set keymap; see :help 'keymap' and :help mbyte-keymap. This helps with entering text in the desired language on systems where you cannot install a keyboard layout at the operating system level.

  • You can define a mapping with the :map command. See :help :map.

AlexP
3

See How do keyboard input and text output work? for an overview of the topic. It depends whether the application is running in a terminal or is talking directly to the GUI environment.

In a terminal, the terminal software (generally a terminal emulator this century) determines the encoding of characters. It conveys the character encoding (the same for input and output) by setting the locale environment variable LC_CTYPE. If this variable is unset or set to C, the terminal isn't providing any information so the application can't know what the encoding is. In a terminal, characters are sent to the application as characters; non-character input (function keys, cursor keys, keys with modifiers such as Alt, etc.) is sent as escape sequences (a few of them as control characters instead).
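As a rough sketch of how an application might work out which encoding the environment declares, the Python below inspects the locale variables. It assumes the usual POSIX precedence (LC_ALL overrides LC_CTYPE, which overrides LANG, and an empty value counts as unset); the function names `effective_ctype` and `declared_codeset` are made up for this example. In practice a program would normally call `setlocale` and ask the C library instead of parsing the name itself.

```python
import os


def effective_ctype(env):
    """Return the locale name that governs LC_CTYPE.

    Assumes POSIX precedence: LC_ALL, then LC_CTYPE, then LANG.
    An unset or empty variable is skipped; the fallback is "C".
    """
    for name in ("LC_ALL", "LC_CTYPE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


def declared_codeset(env):
    """Return the codeset named in the locale (e.g. "UTF-8"), or None.

    None means the name itself does not state an encoding (as with "C"
    or "fr_FR"); the C library may still apply a default in that case.
    """
    locale_name = effective_ctype(env)
    if "." not in locale_name:
        return None
    codeset = locale_name.split(".", 1)[1]
    # Drop an optional modifier such as "@latin" or "@euro".
    return codeset.split("@", 1)[0]


if __name__ == "__main__":
    print(declared_codeset(os.environ))
```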

X11 applications receive input in the form of KeyPress events. A KeyPress event contains low-level information: a keycode, which roughly corresponds to the physical location of the key, and a state, which encodes the active modifiers. The application can call a function such as XLookupString (traditional function, limited to Latin-1), XmbLookupString (returns a multibyte string in the locale's encoding), XwcLookupString (returns a wide-character string) or Xutf8LookupString (modern UTF-8 function) to convert this raw information to a character string.

The mapping from keys to characters can be changed at various levels; How do keyboard input and text output work? has an overview.