36

I asked about Linux's 255-byte file name limitation yesterday, and the answer was that it is a limitation that cannot/will not be easily changed. But I remembered that most Linux distributions support NTFS, whose maximum file name length is 255 UTF-16 characters.

So I created an NTFS partition and tried to give a file a 160-character Japanese name, which is 480 bytes in UTF-8. I expected that it would not work, but it did. How can it work when the file name is 480 bytes? Is the 255-byte limitation only for certain file systems, and can Linux itself handle file names longer than 255 bytes?

----PS-----

The string is the beginning part of a famous old Japanese essay titled "方丈記". Here is the string.

ゆく河の流れは絶えずして、しかももとの水にあらず。よどみに浮かぶうたかたは、かつ消えかつ結びて、久しくとどまりたるためしなし。世の中にある人とすみかと、またかくのごとし。たましきの都のうちに、棟を並べ、甍を争へる、高き、卑しき、人の住まひは、世々を経て尽きせぬものなれど、これをまことかと尋ぬれば、昔ありし家はまれなり。

I used this web application to count the UTF-8 bytes.

jpa
  • 1,269
  • 9
    Interesting! Can you post that Japanese string as text please? I just wanna make sure it's really 480 bytes. Also, please don't post screenshots of terminals - copy text from them and paste it here. – Artem S. Tashkinov Nov 14 '20 at 17:00
  • 1
    @ArtemS.Tashkinov I added the text and the details. I had used a screenshot as proof that it actually works. If I had just pasted the text, it would not have been proof. – Damn Vegetables Nov 15 '20 at 06:13
  • 4
    Feel free to link to the screenshot. But the text really is more useful for the post. – Deduplicator Nov 15 '20 at 22:09
  • 8
    An image could in theory have been photoshopped; if we don't trust you to copy/paste your terminal contents into a code block without editing, posting an image isn't absolute proof either. Yes, the amount of work to fake a screenshot would be higher, but most stackexchange readers are used to assuming that copy-pasted terminal output and source code haven't been edited to troll us. A terminal copy/paste would be sufficient proof. – Peter Cordes Nov 16 '20 at 15:39

5 Answers

32

The answer, as often, is “it depends”.

Looking at the NTFS implementation in particular, it reports a maximum file name length of 255 to statvfs callers, so callers which interpret that as a 255-byte limit might pre-emptively avoid file names which would be valid on NTFS. However, most programs don’t check this (or even NAME_MAX) ahead of time, and rely on ENAMETOOLONG errors to detect names that are too long. In most cases, the important limit is PATH_MAX, not NAME_MAX; that’s what’s typically used to allocate buffers when manipulating file names (for programs that don’t allocate path buffers dynamically, as expected by OSes like the Hurd, which doesn’t have arbitrary limits).
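
To see what a mounted filesystem advertises, here is a minimal sketch in Python, assuming a POSIX system. `advertised_name_max` is a hypothetical helper name; `os.statvfs` and `os.pathconf` are the standard-library wrappers around statvfs(3) and pathconf(3). Note that the number comes back without a unit, which is exactly the ambiguity described above.

```python
import os


def advertised_name_max(path: str) -> dict:
    """Return the file name limit that the filesystem holding `path` advertises.

    The interface does not say whether the number means bytes or
    characters; callers usually assume bytes.
    """
    return {
        "statvfs.f_namemax": os.statvfs(path).f_namemax,
        "pathconf.PC_NAME_MAX": os.pathconf(path, "PC_NAME_MAX"),
    }


if __name__ == "__main__":
    # "/" always exists; add an NTFS mount point here to compare.
    for mount_point in ("/",):
        print(mount_point, advertised_name_max(mount_point))
```

Running it against an NTFS mount point next to an ext4 one should show the same 255 for both, even though the two filesystems count in different units.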

The NTFS implementation itself doesn’t check file name lengths in bytes, but always as 2-byte characters; file names which can’t be represented in an array of 255 2-byte elements will cause an ENAMETOOLONG error.

Note that NTFS is generally handled by a FUSE driver on Linux. The kernel driver currently only supports UCS-2 characters, but the FUSE driver supports UTF-16 surrogate pairs (with the corresponding reduction in character length).
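
The effect of surrogate pairs on the usable length can be sketched as follows. This is only the arithmetic, not the driver's code: `NTFS_MAX_UNITS`, `utf16_units` and `fits_ntfs` are names made up for this illustration, and a real driver may reject a name for other reasons.

```python
NTFS_MAX_UNITS = 255  # limit in 2-byte UTF-16 code units, as described above


def utf16_units(name: str) -> int:
    """Count UTF-16 code units; characters above U+FFFF need two each."""
    return len(name.encode("utf-16-le")) // 2


def fits_ntfs(name: str) -> bool:
    """True if the name fits in an array of 255 2-byte elements."""
    return utf16_units(name) <= NTFS_MAX_UNITS


if __name__ == "__main__":
    samples = {
        "160 BMP characters": "\u3042" * 160,
        "127 astral characters": "\U0010FFFF" * 127,
        "128 astral characters": "\U0010FFFF" * 128,
    }
    for label, name in samples.items():
        print(label, utf16_units(name), len(name.encode("utf-8")), fits_ntfs(name))
```

With characters outside the Basic Multilingual Plane, the 255-unit budget is used up after 127 characters, which is the "corresponding reduction in character length" mentioned above.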

Peter Cordes
  • 6,466
Stephen Kitt
  • 434,908
  • 1
    I think modern NTFS should use UTF-16 rather than UCS-2. – Ruslan Nov 15 '20 at 07:35
  • 1
    @Ruslan, not sure about Windows, but on Linux touch ${(pl[128][\U10FFFF])} (zsh syntax) fails (ok with 127) so we can't create a 128 character filename if those characters are expressed on two 16bit words in UTF-16. So the max filename on NTFS is in between 127 and 255 characters, or between 255 and 510 bytes there it appears. – Stéphane Chazelas Nov 15 '20 at 08:19
  • Sorry, 510 bytes always, or 508 bytes if you consider that you can't add another 4 byte character after that. – Stéphane Chazelas Nov 15 '20 at 09:01
  • (that was with the ntfs-3g fuse driver. With the in-kernel one, it seems only characters up to U+FFFF are supported as it assumes UCS-2 encoding) – Stéphane Chazelas Nov 15 '20 at 09:25
  • @Ruslan yes, modern NTFS should use UTF-16, but both the kernel driver and the FUSE driver only support wchar_t, i.e. 16-bit values (the kernel documents this as “Plane-0 Unicode”). – Stephen Kitt Nov 15 '20 at 14:32
  • In your link to the FUSE driver, see this else clause. As you can see, it explicitly encodes UTF-16 surrogate pairs. So it doesn't assume UCS-2. – Ruslan Nov 15 '20 at 14:48
  • @Ruslan ah, so it does, thanks, I stand corrected (and I’ll update the answer). – Stephen Kitt Nov 15 '20 at 15:15
  • Thanks for that, very interesting. Looking at both this and yesterday's discussion, the interplay between VFS and any particular filesystem is obviously highly significant... things like coreutils, glibc, Midnight Commander and reference to "classic" UNIX variants rather less so. – Mark Morgan Lloyd Nov 15 '20 at 19:51
  • So..., the VFS layer doesn't actually check the length of a single path component? Or need it in any sense where the length limit would come up? – ilkkachu Nov 15 '20 at 21:08
17

The limit for the length of a filename is indeed coded inside the filesystem, e.g. ext4, from https://en.wikipedia.org/wiki/Ext4 :

Max. filename length 255 bytes

From https://en.wikipedia.org/wiki/XFS :

Max. filename length 255 bytes

From https://en.wikipedia.org/wiki/Btrfs :

Max. filename length 255 ASCII characters (fewer for multibyte character encodings such as Unicode)

From https://en.wikipedia.org/wiki/NTFS :

Max. filename length 255 UTF-16 code units

An overview of these limits for a number of file systems can be found at https://en.wikipedia.org/wiki/Comparison_of_file_systems#Limits . There you can also see that ReiserFS has a higher limit (almost 4K), but the kernel itself (inside VFS, the kernel virtual filesystem) has a limit of 255 bytes.

Your text is 160 UTF-16 code units long, which is the unit NTFS counts in:

echo ゆく河の流れは絶えずして、しかももとの水にあらず。よどみに浮かぶうたかたは、かつ消えかつ結びて、久しくとどまりたるためしなし。世の中にある人とすみかと、またかくのごとし。たましきの都のうちに、棟を並べ、甍を争へる、高き、卑しき、人の住まひは、世々を経て尽きせぬものなれど、これをまことかと尋ぬれば、昔ありし家はまれなり。 > jp.txt
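# convert the UTF-8 text to UTF-16
iconv -f utf-8 -t utf-16 jp.txt > jp16.txt
# compare the sizes of the two files
ls -ld jp*.txt
# dump the UTF-16 bytes (no need for cat here)
hexdump -C jp16.txt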

This shows 0x140 = 320 bytes (plus a 2-byte byte order mark (BOM) prepended, if one is used). In other words, the name is 160 UTF-16 code units long, which is below the NTFS limit of 255 UTF-16 code units, even though it is more than 255 bytes.

(ignoring the newline character here)
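
The same counts can be cross-checked without temporary files. This is a small sketch using only Python's built-in codecs; `TEXT` is the string from the question and `encoded_sizes` is a hypothetical helper name. The `utf-16` codec prepends a BOM, while `utf-16-le` does not.

```python
TEXT = (
    "ゆく河の流れは絶えずして、しかももとの水にあらず。"
    "よどみに浮かぶうたかたは、かつ消えかつ結びて、久しくとどまりたるためしなし。"
    "世の中にある人とすみかと、またかくのごとし。"
    "たましきの都のうちに、棟を並べ、甍を争へる、高き、卑しき、人の住まひは、"
    "世々を経て尽きせぬものなれど、これをまことかと尋ぬれば、昔ありし家はまれなり。"
)


def encoded_sizes(text: str) -> dict:
    """Return the size of `text` in code points and in several encodings."""
    return {
        "code points": len(text),
        "utf-8 bytes": len(text.encode("utf-8")),
        "utf-16 bytes, no BOM": len(text.encode("utf-16-le")),
        "utf-16 bytes, with BOM": len(text.encode("utf-16")),
    }


if __name__ == "__main__":
    for label, size in encoded_sizes(TEXT).items():
        print(f"{label}: {size}")
```

Every character in this string is in the Basic Multilingual Plane, so each one takes 3 bytes in UTF-8 but only one 2-byte unit in UTF-16.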

Ned64
  • 8,726
12

So, here's what I've found out.

Coreutils don't particularly care about filename length and simply work with user input regardless of its length, i.e. there are zero checks.

For example, this works (the file name is 462 bytes long!):

name="和総坂裁精座回資国定裁出観産大掲記労。基利婚岡第員連聞余枚転屋内分。妹販得野取戦名力共重懲好海。要中心和権瓦教雪外間代円題気変知。貴金長情質思毎標豊装欺期権自馬。訓発宮汚祈子報議広組歴職囲世階沙飲。賞携映麻署来掲給見囲優治落取池塚賀残除捜。三売師定短部北自景訴層海全子相表。著漫寺対表前始稿殺法際込五新店広。"
cd /mnt/ntfs
touch "$name"

Even this works:

echo 123 > "$name"
cat "$name"
123

However, once you try to copy this file to any of your classic Linux filesystems, the operation fails:

cp "$name" /tmp
cp: cannot stat '/tmp/和総坂裁精座回資国定裁出観産大掲記労。基利婚岡第員連聞余枚転屋内分。妹販得野取戦名力共重懲好海。要中心和権瓦教雪外間代円題気変知。貴金長情質思毎標豊装欺期権自馬。訓発宮汚祈子報議広組歴職囲世階沙飲。賞携映麻署来掲給見囲優治落取池塚賀残除捜。三売師定短部北自景訴層海全子相表。著漫寺対表前始稿殺法際込五新店広。': File name too long

I.e. cp has actually attempted to create this file in /tmp but /tmp doesn't allow filenames longer than 255 bytes.
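
The same "just try it and look at the error" behaviour can be sketched in Python. `try_create` is a hypothetical helper; the only assumption is that the filesystem reports an over-long name with ENAMETOOLONG, as cp's message above suggests.

```python
import errno
import os
import tempfile


def try_create(directory: str, name: str) -> bool:
    """Create an empty file; return False if the name is rejected as too long."""
    path = os.path.join(directory, name)
    try:
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o600)
    except OSError as exc:
        if exc.errno == errno.ENAMETOOLONG:
            return False
        raise  # any other failure is not about the name length
    os.close(fd)
    return True


if __name__ == "__main__":
    long_name = "\u3042" * 160  # 160 characters, 480 bytes in UTF-8
    with tempfile.TemporaryDirectory() as scratch:
        print("short name created:", try_create(scratch, "short.txt"))
        print("480-byte name created:", try_create(scratch, long_name))
```

No length check happens before the call; the result depends entirely on the filesystem behind the directory, so pointing `scratch` at an NTFS mount instead of the default temporary directory may give a different answer for the long name.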

I've also managed to open this file in mousepad (a GTK application), edit it and save it. It all worked, which means the 255-byte restriction applies only to certain Linux filesystems.

This doesn't mean everything will work. For instance, my favorite console file manager, Midnight Commander (a clone of Norton Commander), cannot list this file (it shows the file size as 0), open it, or do anything else with it:

Error
No such file or directory (2)
5

TL;DR:

There was, and still is, a limit in some APIs: for example, readdir_r() can't read file names longer than 255 bytes. However, Linux is aware of that, and modern APIs can read long file names without problems.


There's this line in the ReiserFS wiki:

Max. filename length: 4032 bytes, limited to 255 by Linux VFS

so there may be some real limits in VFS, although I don't know enough about Linux VFS to tell. The VFS functions all work on struct dentry, which stores the name in its struct qstr d_name member:

extern int vfs_create(struct inode *, struct dentry *, umode_t, bool);
extern int vfs_mkdir(struct inode *, struct dentry *, umode_t);
extern int vfs_mknod(struct inode *, struct dentry *, umode_t, dev_t);
extern int vfs_symlink(struct inode *, struct dentry *, const char *);
extern int vfs_link(struct dentry *, struct inode *, struct dentry *, struct inode **);
extern int vfs_rmdir(struct inode *, struct dentry *);
extern int vfs_unlink(struct inode *, struct dentry *, struct inode **);
extern int vfs_rename(struct inode *, struct dentry *, struct inode *, struct dentry *, struct inode **, unsigned int);
extern int vfs_whiteout(struct inode *, struct dentry *);

The struct qstr stores the hash, the length and a pointer to the name, so I don't think there are any physical limits unless the VFS functions explicitly truncate the name on creating/opening. I didn't check the implementation, but I think long names should work fine.

Update:

The length check is done in linux/fs/libfs.c, and ENAMETOOLONG is returned if the name is too long:

/*
 * Lookup the data. This is trivial - if the dentry didn't already
 * exist, we know it is negative.  Set d_op to delete negative dentries.
 */
struct dentry *simple_lookup(struct inode *dir, struct dentry *dentry, unsigned int flags)
{
    if (dentry->d_name.len > NAME_MAX)
        return ERR_PTR(-ENAMETOOLONG);
    if (!dentry->d_sb->s_d_op)
        d_set_d_op(dentry, &simple_dentry_operations);
    d_add(dentry, NULL);
    return NULL;
}

The limit is defined in linux/limits.h:

#define NAME_MAX         255    /* # chars in a file name */
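
The check quoted above compares the byte length of a single path component against NAME_MAX. A rough userspace analogue, purely for illustration (`oversized_components` is a made-up helper, not a kernel or libc function), could look like this:

```python
import os

NAME_MAX = 255  # bytes per path component, from linux/limits.h as quoted above


def oversized_components(path: str, name_max: int = NAME_MAX) -> list:
    """Return the components of `path` whose encoded length exceeds `name_max` bytes."""
    raw = os.fsencode(path)  # the kernel sees bytes, not characters
    return [part for part in raw.split(b"/") if len(part) > name_max]


if __name__ == "__main__":
    ok_path = "/tmp/" + "a" * 255
    bad_path = "/tmp/" + "a" * 256
    print(len(oversized_components(ok_path)), len(oversized_components(bad_path)))
```

This only mirrors the arithmetic of that one check; it does not predict what a particular filesystem will accept, since (as the question shows) NTFS takes names that this function would flag.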

But I have no idea how long file names can be opened without hitting that error.


However, there are a few system calls that do have limits. struct dirent has the following members:

struct dirent {
   ino_t          d_ino;       /* Inode number */
   off_t          d_off;       /* Not an offset; see below */
   unsigned short d_reclen;    /* Length of this record */
   unsigned char  d_type;      /* Type of file; not supported
                                  by all filesystem types */
   char           d_name[256]; /* Null-terminated filename */
};

Since d_name is a fixed-size array, many functions like readdir_r() won't ever be able to return names longer than 255 bytes. For example:

DIR *dir = opendir("/");             /* DIR and opendir() come from <dirent.h> */
struct dirent entry;                 /* caller-supplied buffer with the fixed d_name[256] */
struct dirent *result;
int return_code = readdir_r(dir, &entry, &result);

That's why readdir_r() was deprecated:

On some systems, readdir_r() can't read directory entries with very long names. When the glibc implementation encounters such a name, readdir_r() fails with the error ENAMETOOLONG after the final directory entry has been read. On some other systems, readdir_r() may return a success status, but the returned d_name field may not be null terminated or may be truncated.

readdir_r(3) — Linux manual page

readdir(), on the other hand, allocates the memory for struct dirent itself, so the name can actually be longer than 255 bytes, and you must not use sizeof(d_name) or sizeof(struct dirent) to get the name and struct lengths:

Note that while the call

fpathconf(fd, _PC_NAME_MAX)

returns the value 255 for most filesystems, on some filesystems (e.g., CIFS, Windows SMB servers), the null-terminated filename that is (correctly) returned in d_name can actually exceed this size. In such cases, the d_reclen field will contain a value that exceeds the size of the glibc dirent structure shown above.

readdir(3) — Linux manual page
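
For a quick look at what a directory really contains, here is a small Python sketch. `list_name_lengths` is a hypothetical helper; `os.scandir` is a standard-library call which, as far as I know, CPython implements on top of readdir() on POSIX systems, so treat that detail as an assumption.

```python
import os
import tempfile


def list_name_lengths(directory: str) -> list:
    """Return (name, length in bytes) for every entry, longest name first."""
    entries = []
    with os.scandir(directory) as it:
        for entry in it:
            entries.append((entry.name, len(os.fsencode(entry.name))))
    return sorted(entries, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as scratch:
        for name in ("a", "bbb"):
            open(os.path.join(scratch, name), "w").close()
        for name, size in list_name_lengths(scratch):
            flag = " (over 255 bytes)" if size > 255 else ""
            print(size, name + flag)
```

Pointing it at a directory on an NTFS or CIFS mount should list names longer than 255 bytes without truncation, if any exist there.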

Some other functions like getdents() use struct linux_dirent and struct linux_dirent64, which don't suffer from the fixed-length issue:

struct linux_dirent {
   unsigned long  d_ino;     /* Inode number */
   unsigned long  d_off;     /* Offset to next linux_dirent */
   unsigned short d_reclen;  /* Length of this linux_dirent */
   char           d_name[];  /* Filename (null-terminated) */
                     /* length is actually (d_reclen - 2 -
                        offsetof(struct linux_dirent, d_name)) */
   /*
   char           pad;       // Zero padding byte
   char           d_type;    // File type (only since Linux
                             // 2.6.4); offset is (d_reclen - 1)
   */
}

struct linux_dirent64 {
   ino64_t        d_ino;    /* 64-bit inode number */
   off64_t        d_off;    /* 64-bit offset to next structure */
   unsigned short d_reclen; /* Size of this dirent */
   unsigned char  d_type;   /* File type */
   char           d_name[]; /* Filename (null-terminated) */
};

strace ls shows that ls uses getdents() to list files so it can handle file names with arbitrary length

phuclv
  • 2,086
  • 2
    readdir(3) implements that portable POSIX function on top of the Linux getdents system call. You'll always see getdents system calls even if the application is using readdir_r which makes truncated copies of the getdents results. – Peter Cordes Nov 16 '20 at 15:46
  • It should be noted that introducing readdir_r was a bad idea in the first place anyway. The readdir function is, and always has been, reentrant just fine (can be called concurrently on different handles) and the DIR * handle itself isn't thread-safe so there was never a need for a more thread-safe variant of readdir. – Jan Hudec Nov 18 '20 at 19:59
-4

But I remembered that most Linux supports NTFS, whose maximum file name length is 255 UTF-16 characters.

Are we talking filename length or pathname length?

The maximum length for NTFS pathnames has always been 64K bytes (=32K UTF-16 code units).

The Win32 API imposed stricter limits because (editorial comment) idiot programmers liked to declare char filename[MAX_PATH], but there were syntactic kludges around that.

  • 1
    maximum length for file name != maximum length for path name. NTFS supports maximum 255 UTF-16 code units for file names – phuclv Nov 15 '20 at 03:00
  • Does "maximum length for NTFS pathnames" mean a limitation inherent to the filesystem structure, or a limitation of the Windows implementation of NTFS, or just a Windows limitation in general? If you mean on Linux, then can you be a bit more specific? There you can have paths that are partly NTFS and partly not... And as far as I've understood, the usual limit for a path name length as accepted by the system calls is just 4096 on Linux. – ilkkachu Nov 15 '20 at 21:03
  • @ilkkachu: the maximum path length (path+file name) is 32767 UTF-16 characters for NTFS. NTFS also allows special characters like ? or * in file names. IMHO that's because NTFS was designed to be POSIX compatible. However, Windows can't handle such file names well. – Thomas Weller Nov 16 '20 at 09:52
  • 1
    @ThomasWeller, yes.... but what does "maximum path length for NTFS" mean in the context of a unix-like system where filesystems can be mounted on top of and inside each other? (Actually, I think you can also do that on Windows at least in some ways?) Is there some internal filesystem structure in NTFS that keeps track of the length of the whole path? And what path is that, then? The part within that one filesystem instance, or the whole path? If you have a path like /a/b/c/e/f where /, a and e are mount points then... which parts are affected by the limitation? – ilkkachu Nov 16 '20 at 12:00