What charset encoding is used for filenames and paths on Linux?

Question

Does it depend on what file system I use? For example, ext2/ext3/ext4 but also what happens when I insert one of those "joliet" CD-ROMs with ISO 9660? I've heard that POSIX contains some sort of spec for the charset encoding of filenames?

Essentially, what I want to know is: if I have a UTF-8 encoded filename, what processing or conversion do I need to do before I pass it to a file I/O API on Linux?

The answers below say that the OS and filesystem don't care about encodings. Some filesystems, such as HFS+, do care a great deal. HFS+, I believe, requires UTF-8, which it converts internally to a restricted dialect of UTF-16. NTFS has a similar issue, but I'm not clear on the details. – zmccord, Apr 22 '15 at 15:21
HFS+ also requires that names be decomposed, which doesn't play nicely with Linux's tendency to use precomposed forms. https://web.archive.org/web/20080518105836/http://developer.apple.com/qa/qa2001/qa1235.html – user12439, Aug 27 '17 at 10:27

score 58 · Answer 1 · edited Jul 03 '15 at 10:22

As noted by others, there isn't really an answer to this: filenames and paths do not have an encoding; the OS only deals with sequences of bytes. Individual applications may choose to interpret them as being encoded in some way, but this varies.

Specifically, GLib (used by GTK+ apps) assumes that all file names are UTF-8 encoded, regardless of the user's locale. This may be overridden with the environment variables G_FILENAME_ENCODING and G_BROKEN_FILENAMES.

On the other hand, Qt defaults to assuming that all file names are encoded in the current user's locale. An individual application may choose to override this assumption, though I do not know of any that do, and there is no external override switch.

Modern Linux distributions are set up so that all users have UTF-8 locales and paths on foreign filesystem mounts are translated to UTF-8, so this difference in strategies generally has no effect. However, if you really want to be safe, you cannot assume any structure about filenames beyond "NUL-terminated, '/'-delimited sequence of bytes". Here NUL means the byte 0x00 and '/' always means the byte 0x2F, regardless of which character that byte is displayed as.

(Also note: locale may vary by process. Two different processes run by the same user may be in different locales simply by having different environment variables set.)
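
The "NUL-terminated, '/'-delimited" rule can be sketched as a check on raw bytes. This is only an illustration: the helper name `is_valid_component` is hypothetical, and it ignores filesystem-specific limits such as maximum name length. It shows why UTF-8 names pass through safely (their multi-byte sequences never contain 0x00 or 0x2F) while UTF-16 names do not.

```python
def is_valid_component(name: bytes) -> bool:
    """Return True if `name` could be a single path component on Linux.

    Hypothetical helper: only the two reserved byte values are rejected.
    No character encoding is consulted, and filesystem-specific limits
    (such as a maximum length) are deliberately not checked.
    """
    return len(name) > 0 and b"\x00" not in name and b"\x2f" not in name


utf8_name = "naïve-日本語.txt".encode("utf-8")
utf16_name = "a.txt".encode("utf-16-le")  # every second byte is 0x00

print(is_valid_component(utf8_name))
print(is_valid_component(utf16_name))
print(is_valid_component(b"a/b"))
```

Note that the check never decodes the name; whether the bytes mean anything as text is left to the application.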

edited Jul 03 '15 at 10:22 by Diego

answered Sep 15 '10 at 21:30 by ephemient

"NUL-terminated, '/'-delimited sequence of bytes" But without an encoding, how do you know what byte represents '/'? – Jack Jun 05 '19 at 04:13
@Jack Always '\x2F' regardless of what looks like /. Notably different in SJIS. – ephemient Jun 05 '19 at 05:26
Ah, okay. Would you consider updating the answer with that info? Maybe it's just because I recently worked on a charset conversion library, but the phrase "'/'-delimited sequence of bytes" makes no sense to me. – Jack Jun 05 '19 at 07:47
So how to see bytes of filename in SSH session in HEX? – Dims Jun 25 '19 at 07:13
Please refer to the encoding in filesystems as noted by @adam byrtek below. The filesystem determines the meaning of the bytes. Everything else tries to help you to guess the bytes (you have to pass to open() ) if you know the meaning, but not the mount options. – Oliver Meyer Nov 21 '21 at 07:52
The reference for slash and null termination is probably https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_271 – Oliver Meyer Nov 21 '21 at 09:26
I'm also confused about "NUL-terminated, '/'-delimited sequence of bytes". How does the FS driver know the byte representation of them without knowing the encoding type? I can't find anything about '\x2F' in the ext docs. – Alex Jul 16 '23 at 02:45

score 11 · Answer 2 · answered Sep 15 '10 at 17:14

The Unix/POSIX layer of Linux doesn't care which encoding you use. It stores the byte sequence of your current encoding as-is.
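
A small Python sketch of this, assuming a typical Linux filesystem (such as ext4 or tmpfs) that accepts arbitrary bytes in names: it creates two files in a temporary directory, one whose name is valid UTF-8 and one whose name is not, then lists the directory as raw bytes and prints each name in hex. The file names are made up for illustration.

```python
import os
import tempfile

# One name that is valid UTF-8 and one that is not
# (0xE9 is "é" in Latin-1, but an incomplete sequence in UTF-8).
names = ["caf\u00e9.txt".encode("utf-8"), b"caf\xe9.txt"]

with tempfile.TemporaryDirectory() as tmp:
    tmp_bytes = os.fsencode(tmp)
    for name in names:
        # A bytes path is handed to the kernel without any conversion.
        with open(os.path.join(tmp_bytes, name), "wb"):
            pass
    # Listing with a bytes argument returns the stored names as bytes.
    listed = sorted(os.listdir(tmp_bytes))

for name in listed:
    # bytes.hex(" ") needs Python 3.8 or newer.
    print(name.hex(" "), name)
```

Both names come back byte-for-byte as written. Filesystems that enforce an encoding may reject or rewrite the second name instead.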

I think the mount options mentioned in another answer are there to help you convert filesystems that define their own charset to your system charset. (CD-ROMs, NTFS and the FAT variants use some Unicode variants.)

I wish Unix defined a system-wide encoding, but it is actually a per-user setting. So if you define a different encoding than your colleague, your filenames will show up differently.
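
As one illustration of how an application layer copes with this, here is a sketch of what Python does on a POSIX system. It is specific to Python and says nothing about GLib or Qt. `sys.getfilesystemencoding()` reports the encoding the current process uses for filenames, and `os.fsdecode` / `os.fsencode` convert between bytes and text so that bytes which cannot be decoded still round-trip unchanged.

```python
import os
import sys

# Encoding this process uses to convert filenames between str and bytes.
# It is derived from the environment, so it can differ between processes.
print(sys.getfilesystemencoding())

raw = b"caf\xe9.txt"        # Latin-1 bytes, not valid UTF-8
text = os.fsdecode(raw)     # str form, as returned by os.listdir("...")
back = os.fsencode(text)    # bytes form, as passed to the kernel

# On POSIX the conversion is lossless even when decoding "fails":
# under UTF-8 the stray byte is kept as a lone surrogate code point.
print(back == raw)
```

The str form is only safe for passing back to file APIs; printing or saving it may raise an encoding error when it contains such surrogates.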

answered Sep 15 '10 at 17:14 by Bert Huijben

Ok so then I should probably check what locale the user is currently using and convert to that for new files so that he will see the filename correctly in Nautilus etc. How can I tell what the current filename charset is for the current user? – martin Sep 15 '10 at 18:28
@martin It's not even that simple... Different processes can use different encodings, depending on env variables and the language it was written in. – Basic Mar 05 '16 at 20:42

score 6 · Answer 3 · answered Sep 15 '10 at 17:01

It depends on how you mount the file system; take a look at the mount options for the different file systems in man mount. For example, iso9660, vfat and fat have iocharset and utf8 options.

answered Sep 15 '10 at 17:01 by Adam Byrtek

So if I mount it using utf8 should I also pass utf8 to the open() syscall? – martin Sep 15 '10 at 17:11
Also I found this ( http://library.gnome.org/devel/glib/unstable/glib-Character-Set-Conversion.html#g-filename-to-utf8 ) which seems to indicate that the charset encoding of filenames is dependent on what locale is set? – martin Sep 15 '10 at 17:11
