How does Linux assign inode numbers on filesystems not based on inodes?

Question

This is the output of the ls -li command on the VFAT filesystem.

% ls -li
total 736
1207 drwxr-xr-x 3 root root  16384 Mar 10 10:42 efi
1208 -rwxr-xr-x 1 root root 721720 Mar 22 14:15 kernel.bin

1207 and 1208 are the inode numbers of the directory and file. However, the VFAT filesystem does not have the concept of inodes.

How does Linux assign inode numbers for files on a filesystem which does not have the inode notion?

Chris Down · Accepted Answer · 2021-03-27T14:14:04.860

tl;dr: For virtual, volatile, or inode-agnostic filesystems, inode numbers are usually generated from a monotonically incrementing, 32-bit counter when the inode is created. The rest of the inode (eg. permissions) is built from the equivalent data in the underlying filesystem, or is replaced with values set at mount time (eg. {uid,gid}=) if no such concept exists.

To answer the question in the title (ie. abstractly, how Linux allocates inode numbers for a filesystem that has no inode concept), it depends on the filesystem. For some virtual or inodeless filesystems, the inode number is drawn at instantiation time from the get_next_ino pool. This has a number of problems, though:

get_next_ino() uses 32-bit inode numbers even on a 64-bit kernel, due to legacy handling for 32-bit userland without _FILE_OFFSET_BITS=64;
get_next_ino() is just a globally incrementing counter used by multiple filesystems, so the risk of overflow is increased even further.
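
To make the overflow concern concrete, here is a minimal userspace sketch of a single 32-bit counter shared by every caller. It is not the kernel's implementation: next_ino_sketch and set_counter_sketch are hypothetical names, and skipping 0 on wraparound is a choice made for this sketch.

```c
#include <stdint.h>

/* Hypothetical userspace model: one counter shared by all "filesystems". */
static uint32_t shared_counter;

/* Pretend the system has already handed out this many numbers. */
void set_counter_sketch(uint32_t value)
{
        shared_counter = value;
}

/*
 * Hand out the next number. Unsigned arithmetic wraps at UINT32_MAX,
 * so after enough allocations old numbers are handed out again.
 * The sketch skips 0 so that 0 can keep meaning "no inode".
 */
uint32_t next_ino_sketch(void)
{
        uint32_t res = ++shared_counter;

        if (res == 0)   /* wrapped past UINT32_MAX */
                res = ++shared_counter;
        return res;
}
```

Because nothing in this model checks whether a number is still in use, the first allocation after the wrap can collide with a long-lived object that received the same number earlier.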

Problems like this are one of the reasons why I moved tmpfs away from get_next_ino-backed inodes last year.

For this reason, tmpfs in particular is an exception among volatile or "inodeless" filesystems. Sockets, pipes, ramfs, and the like still use the get_next_ino pool as of 5.11.

As for your specific question about FAT filesystems: fs/fat/inode.c is where inode numbers are allocated for FAT vilesystems. If we look in there, we see fat_build_inode (source):

struct inode *fat_build_inode(struct super_block *sb,
                              struct msdos_dir_entry *de, loff_t i_pos)
{
        struct inode *inode;
        int err;

        fat_lock_build_inode(MSDOS_SB(sb));
        inode = fat_iget(sb, i_pos);
        if (inode)
                goto out;
        inode = new_inode(sb);
        if (!inode) {
                inode = ERR_PTR(-ENOMEM);
                goto out;
        }
        inode->i_ino = iunique(sb, MSDOS_ROOT_INO);
        inode_set_iversion(inode, 1);
        err = fat_fill_inode(inode, de);
        if (err) {
                iput(inode);
                inode = ERR_PTR(err);
                goto out;
        }
        fat_attach(inode, i_pos);
        insert_inode_hash(inode);
out:
        fat_unlock_build_inode(MSDOS_SB(sb));
        return inode;
}

What this basically says is this:

Take the FAT inode creation lock for this superblock.
Check if the inode already exists at this position in the superblock. If so, unlock and return that inode.
Otherwise, create a new inode.
Get the inode number from iunique(sb, MSDOS_ROOT_INO) (more about that in a second).
Fill the rest of the inode from the equivalent FAT datastructures.
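
The "look up by position, otherwise allocate" pattern in those steps can be sketched in userspace as follows. This is only an illustration under stated assumptions: sb_sketch, cached_inode and build_inode_sketch are hypothetical names, a small array stands in for the driver's lookup by i_pos, and a plain per-struct counter stands in for iunique.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_SLOTS 16  /* hypothetical fixed-size cache for the sketch */

struct cached_inode {
        long long i_pos;        /* on-disk position of the directory entry */
        unsigned int i_ino;     /* number handed out when first seen */
};

struct sb_sketch {
        struct cached_inode cache[CACHE_SLOTS];
        size_t used;
        unsigned int next_ino;  /* stand-in for iunique() */
};

/*
 * Return the inode number for i_pos. An entry that is already cached keeps
 * its number; a new entry gets the next counter value. Returns 0 if the
 * sketch's cache is full.
 */
unsigned int build_inode_sketch(struct sb_sketch *sb, long long i_pos)
{
        assert(sb != NULL);

        for (size_t i = 0; i < sb->used; i++)
                if (sb->cache[i].i_pos == i_pos)
                        return sb->cache[i].i_ino;

        if (sb->used == CACHE_SLOTS)
                return 0;

        sb->cache[sb->used].i_pos = i_pos;
        sb->cache[sb->used].i_ino = sb->next_ino++;
        return sb->cache[sb->used++].i_ino;
}
```

In this model a fresh sb_sketch plays the role of a remount: the same i_pos can receive a different number, because the number depends on the order in which entries are first looked up rather than on anything stored on disk.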

inode->i_ino = iunique(sb, MSDOS_ROOT_INO); is where the inode number is set here. iunique (source) is a fs-agnostic function that provides unique inode numbers for a given superblock. It does this by using a superblock + inode-based hash table, with a monotonically increasing counter:

ino_t iunique(struct super_block *sb, ino_t max_reserved)
{
        static DEFINE_SPINLOCK(iunique_lock);
        static unsigned int counter;
        ino_t res;

        rcu_read_lock();
        spin_lock(&iunique_lock);
        do {
                if (counter <= max_reserved)
                        counter = max_reserved + 1;
                res = counter++;
        } while (!test_inode_iunique(sb, res)); /* nb: this checks the hash table */
        spin_unlock(&iunique_lock);
        rcu_read_unlock();

        return res;
}

In that respect, it's pretty similar to the previously mentioned get_next_ino. The difference is that the result is checked for uniqueness within the given superblock (rather than simply being handed out globally, as for pipes, sockets, or the like), which gives some rudimentary hash-table-based protection against collisions. It even inherits get_next_ino's behaviour of using 32-bit inode numbers as a way to try to avoid EOVERFLOW in legacy applications, so there are likely to be more filesystems which need 64-bit inode fixes (like my aforementioned inode64 implementation for tmpfs) in the future.
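
The loop in iunique can be modelled in userspace to see how the in-use check interacts with the counter. This is a sketch, not the kernel code: fs_sketch, mark_live and iunique_sketch are hypothetical names, a small linear array stands in for the inode hash table, and locking is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_LIVE 8      /* hypothetical tiny table standing in for the inode hash */

struct fs_sketch {
        unsigned int live[MAX_LIVE];    /* inode numbers currently in use */
        size_t nlive;
};

static unsigned int counter;    /* shared by all callers, like the static in iunique */

/* Record ino as in use on this filesystem; returns false if the table is full. */
bool mark_live(struct fs_sketch *fs, unsigned int ino)
{
        assert(fs != NULL);
        if (fs->nlive == MAX_LIVE)
                return false;
        fs->live[fs->nlive++] = ino;
        return true;
}

static bool in_use(const struct fs_sketch *fs, unsigned int ino)
{
        for (size_t i = 0; i < fs->nlive; i++)
                if (fs->live[i] == ino)
                        return true;
        return false;
}

/* Same shape as the kernel loop: bump the counter until the number is free. */
unsigned int iunique_sketch(struct fs_sketch *fs, unsigned int max_reserved)
{
        unsigned int res;

        assert(fs != NULL);
        do {
                if (counter <= max_reserved)
                        counter = max_reserved + 1;
                res = counter++;
        } while (in_use(fs, res));
        return res;
}
```

The reserved floor keeps low numbers (such as the root inode's) from being handed out, and the in-use check is what lets the counter wrap without immediately reusing a number that is still live on the same filesystem.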

So to summarise:

Most virtual or inodeless filesystems use a monotonically incrementing counter for the inode number.
That counter isn't stable even for on-disk inodeless filesystems*. It may change without other changes to the filesystem on remount.
Most filesystems in this state (except for tmpfs with inode64) are still using 32-bit counters, so with heavy use it's entirely possible the counter may overflow and you may end up with duplicate inodes.

* ...although, to be fair, by contract this is true even for filesystems which do have an inode concept when i_generation changes -- it just is less likely to happen in practice since often the inode number is related to its physical position, or similar.
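
If you want to observe this from userspace, a minimal sketch along these lines reads the inode number the kernel reports for a path using stat(2). get_ino is a hypothetical helper name introduced for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>

/*
 * Store the inode number reported for path in *ino_out.
 * Returns 0 on success and -1 if stat() fails (errno is left set by stat).
 */
int get_ino(const char *path, unsigned long long *ino_out)
{
        struct stat st;

        assert(path != NULL && ino_out != NULL);
        if (stat(path, &st) != 0)
                return -1;
        *ino_out = (unsigned long long)st.st_ino;
        return 0;
}
```

Calling it on the same file before and after remounting a FAT volume is one way to check whether the number changed. Whether it does will depend on the order in which inodes happened to be instantiated.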

"FAT vilesystems" – Greatest and most technically accurate typo of all time? — Jörg W Mittag, Mar 27 '21 at 10:21
speaking of physical position, is there some reason the VFAT driver couldn't use the physical position of a directory entry to generate the inode number? — ilkkachu, Mar 27 '21 at 11:43
@ilkkachu Absolutely -- it's certainly technically possible to construct inode numbers based on other physical properties from the filesystem, but a.) it tends to be more expensive and/or require additional reads to calculate, so we tend to avoid it if possible and b.) even in filesystems where inodes do have some kind of semantic relation to physical position on disk, it's not a contract with userspace, and is still bound by mount lifetime and i_generation (if ioctl(FS_IOC_GETVERSION) is supported for the filesystem). That's why we've generally just used incrementing counters. :-) — Chris Down, Mar 27 '21 at 14:07
@ChrisDown, yep. honestly, I'm not sure what the exact promises are regarding inode numbers. I just recall that vfat can show different inode numbers for the same file, even without unmounting in the meantime, and that seems like something some program could trip on. (Also, it doesn't seem to support FS_IOC_GETVERSION, but I'm on an old kernel.) — ilkkachu, Mar 27 '21 at 19:57
Are inodes for all existing files on the inode-less filesystem built immediately after it is mounted, or lazily? — toku-sa-n, Mar 28 '21 at 15:13
@ilkkachu: Yeah I've got a bunch of code that would trip on that. Hmmm; the offset of the directory entry on disk is something the FS already has in memory at the time. Perhaps it should have used it. — Joshua, Mar 28 '21 at 16:53
@ilkkachu "I just recall that vfat can show different inode numbers for the same file, even without unmounting in the meantime, and that seems like something some program could trip on." -- this would be a bug. If you still see this, please feel free to report it upstream to linux-fsdevel and cc me :-) — Chris Down, Mar 28 '21 at 17:56
So to clarify, the generated inodes do not get saved on the FAT vilesystem, but generated each time the nodes are "needed" (e.g. when ls is called). Or does new_inode or fat_fill_inode save the inode? — goulashsoup, Jun 13 '22 at 14:25
