58

Does it depend on what file system I use? For example, ext2/ext3/ext4 but also what happens when I insert one of those "joliet" CD-ROMs with ISO 9660? I've heard that POSIX contains some sort of spec for the charset encoding of filenames?

Essentially, what I wonder is: if I have a UTF-8 encoded filename, what processing/conversion do I need to do before I pass it to a file I/O API in Linux?

martin
  • 581
  • 1
    The answers below say that the OS and filesystem don't care about encodings. Some filesystems, such as HFS+, do care a great deal. HFS+, I believe, requires UTF-8, which it converts internally to a restricted dialect of UTF-16. NTFS also has a similar issue but I'm not clear on the details. – zmccord Apr 22 '15 at 15:21
  • HFS+ also requires that names be decomposed, which doesn't play nicely with Linux's tendency to use precomposed forms. https://web.archive.org/web/20080518105836/http://developer.apple.com/qa/qa2001/qa1235.html – user12439 Aug 27 '17 at 10:27

3 Answers

58

As noted by others, there isn't really an answer to this: filenames and paths do not have an encoding; the OS only deals with sequences of bytes. Individual applications may choose to interpret them as being encoded in some way, but this varies.
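To make the "sequences of bytes" point concrete, here is a minimal Python sketch (my own illustration, not taken from any of the applications mentioned). It assumes a typical Linux filesystem such as ext4 or tmpfs, which accepts arbitrary bytes in names; `roundtrip_names` is a hypothetical helper. It creates two files whose names are the same character in two different encodings and shows that the kernel hands back exactly the bytes it was given.

```python
import os
import tempfile

def roundtrip_names(names):
    """Create empty files named by raw bytes in a temp dir and
    return the names exactly as the kernel reports them back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp_b = os.fsencode(tmp)  # work with a bytes path throughout
        for name in names:
            fd = os.open(os.path.join(tmp_b, name),
                         os.O_CREAT | os.O_WRONLY, 0o600)
            os.close(fd)
        # Passing bytes to listdir() makes it return undecoded bytes.
        return sorted(os.listdir(tmp_b))

utf8_name = "\u00e9.txt".encode("utf-8")      # b'\xc3\xa9.txt'
latin1_name = "\u00e9.txt".encode("latin-1")  # b'\xe9.txt'

listed = roundtrip_names([utf8_name, latin1_name])
for n in listed:
    print(n.hex(" "), n)  # hex dump of each name, then its bytes repr
```

Both files coexist because, as far as the kernel is concerned, they are simply two different byte strings. Which one "looks right" in a file manager depends entirely on how that application decodes them.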

Specifically, GLib (used by GTK+ apps) assumes that all file names are UTF-8 encoded, regardless of the user's locale. This may be overridden with the environment variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES.

On the other hand, Qt defaults to assuming that all file names are encoded in the current user's locale. An individual application may choose to override this assumption, though I do not know of any that do, and there is no external override switch.

Modern Linux distributions are typically set up so that all users use UTF-8 locales and paths on foreign filesystem mounts are translated to UTF-8, so this difference in strategies generally has no effect. However, if you really want to be safe, you cannot assume any structure in filenames beyond "NUL-terminated, '/'-delimited sequence of bytes", where NUL is always the byte 0x00 and '/' is always the byte 0x2F, regardless of what those bytes look like in any particular encoding.

(Also note: locale may vary by process. Two different processes run by the same user may be in different locales simply by having different environment variables set.)
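As a rough illustration of the byte-level rule, here is a small Python sketch; `is_valid_component` is a hypothetical helper, and it only checks the two reserved bytes, not length limits or anything filesystem-specific. UTF-8 is safe to pass through unchanged because its multibyte sequences only use bytes of 0x80 and above, while an encoding like UTF-16 can produce 0x00 and 0x2F bytes in the middle of a character.

```python
def is_valid_component(name: bytes) -> bool:
    """True if `name` could be a single path component: non-empty,
    with no NUL (0x00) byte and no slash (0x2F) byte."""
    return len(name) > 0 and b"\x00" not in name and b"\x2f" not in name

utf8 = "caf\u00e9".encode("utf-8")    # non-ASCII chars use only bytes >= 0x80
utf16 = "\u2f00".encode("utf-16-le")  # U+2F00 becomes the two bytes 00 2F

print(is_valid_component(utf8))   # True
print(is_valid_component(utf16))  # False: contains both reserved bytes

# Splitting a path is likewise a pure byte operation on 0x2F.
print(b"usr/local/bin".split(b"\x2f"))  # [b'usr', b'local', b'bin']
```

So if you already have a UTF-8 encoded name, the only processing needed at this level is making sure each component is free of those two bytes.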

Diego
  • 143
  • 6
ephemient
  • 15,880
  • 3
    "NUL-terminated, '/'-delimited sequence of bytes" But without an encoding, how do you know what byte represents '/'? – Jack Jun 05 '19 at 04:13
  • 1
    @Jack Always '\x2F' regardless of what looks like /. Notably different in SJIS. – ephemient Jun 05 '19 at 05:26
  • 1
    Ah, okay. Would you consider updating the answer with that info? Maybe it's just because I recently worked on a charset conversion library, but the phrase "'/'-delimited sequence of bytes" makes no sense to me. – Jack Jun 05 '19 at 07:47
  • So how can I see the bytes of a filename in hex in an SSH session? – Dims Jun 25 '19 at 07:13
  • Please refer to the encoding in filesystems as noted by @adam byrtek below. The filesystem determines the meaning of the bytes. Everything else tries to help you to guess the bytes (you have to pass to open() ) if you know the meaning, but not the mount options. – Oliver Meyer Nov 21 '21 at 07:52
  • The reference for slash and null termination is probably https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_271 – Oliver Meyer Nov 21 '21 at 09:26
  • I'm also confused about "NUL-terminated, '/'-delimited sequence of bytes". How does the FS driver know the byte representation of those characters without knowing the encoding? I can't find anything about '\x2F' in the ext docs. – Alex Jul 16 '23 at 02:45
11

The Unix/POSIX layer of Linux doesn't care which encoding you use. It stores the byte sequence it is given as-is, in whatever encoding you are currently using.

I think those mount options are there to help you convert filenames on filesystems that define their own charset to your system charset. (CD-ROMs, NTFS and the FAT variants use some Unicode variants.)

I wish Unix defined a system-global encoding, but it is actually a per-user setting. So if you define a different encoding than your colleague, your filenames will show up differently.
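For what it's worth, here is how one runtime handles this. The sketch below uses Python's standard `os.fsencode` / `os.fsdecode` and `sys.getfilesystemencoding()`, and assumes a POSIX system; other languages and toolkits make their own choices, so treat it as one example rather than a general rule.

```python
import os
import sys

# The encoding this particular process uses for filenames. It is derived
# from the process's locale/environment, so it can differ between processes.
print(sys.getfilesystemencoding())

# A name that is not valid UTF-8 (Latin-1 encoded "café.txt").
raw = b"caf\xe9.txt"

# On POSIX, undecodable bytes are preserved through decoding, so the
# conversion to text and back is lossless even when the guess is wrong.
shown = os.fsdecode(raw)
print(repr(shown))
print(os.fsencode(shown) == raw)
```

The point is that the displayed form is only a guess made by the process; the bytes on disk stay whatever was originally written.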

  • Ok so then I should probably check what locale the user is currently using and convert to that for new files so that he will see the filename correctly in Nautilus etc. How can I tell what the current filename charset is for the current user? – martin Sep 15 '10 at 18:28
  • 1
    @martin It's not even that simple... Different processes can use different encodings, depending on env variables and the language it was written in. – Basic Mar 05 '16 at 20:42
6

It depends on how you mount the file system; take a look at the mount options for different file systems in man mount. For example, iso9660, vfat and fat have iocharset and utf8 options.
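To check which of these options are in effect on currently mounted filesystems, something like the following Python sketch could be used. `charset_options` is a hypothetical helper, the sample text is made up for illustration, and the parsing is deliberately naive (it ignores the octal escapes `/proc/mounts` uses for spaces in mount points).

```python
def charset_options(mounts_text: str) -> dict:
    """Map mount point -> iocharset/utf8 options found in
    /proc/mounts-style text (device, mount point, type, options, ...)."""
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        mount_point, options = fields[1], fields[3].split(",")
        relevant = [o for o in options
                    if o == "utf8" or o.startswith("iocharset=")]
        if relevant:
            found[mount_point] = relevant
    return found

# Made-up sample in /proc/mounts format, for illustration only.
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sdb1 /mnt/usb vfat rw,relatime,iocharset=iso8859-1 0 0
/dev/sr0 /mnt/cdrom iso9660 ro,relatime,utf8 0 0
"""

print(charset_options(SAMPLE))
# On a real Linux system:
#   with open("/proc/mounts") as f:
#       print(charset_options(f.read()))
```

Note that the ext4 entry has no such option at all, which matches the point made in the other answers: native Linux filesystems just store the bytes.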

Adam Byrtek
  • 1,339
  • So if I mount it using utf8 should I also pass utf8 to the open() syscall? – martin Sep 15 '10 at 17:11
  • Also I found this ( http://library.gnome.org/devel/glib/unstable/glib-Character-Set-Conversion.html#g-filename-to-utf8 ) which seems to indicate that the charset encoding of filenames is dependent on what locale is set? – martin Sep 15 '10 at 17:11