
I'm working with TensorFlow's TFRecord format, which serializes a bunch of data points into a single big file. Typical values here would be 10 KB per data point, 10,000 data points per big file, and a big file of around 100 MB. TFRecords are typically written just once - they are not appended to. I think this means they will not be very fragmented.

I believe TFRecords were based on Google's internal RecordIO format.

People usually run TensorFlow and TFRecords on Ubuntu 18.04 or 20.04, which I believe typically uses an ext4 filesystem.

And deep learning engineers usually run on SSD/NVMe disks. The cost delta over magnetic spinning-platter disks is immaterial compared to the massive cost of the GPU itself.

Question 1:

In an ext4 filesystem, if I know that a specific data point is, say, 9,000,000 bytes into a file, can I seek to that location and start reading the data point in constant time? By constant time, I mean constant solely as a function of the depth of the seek; I'm not worried about the effect of the total file size.

If this is true, it would imply there's some kind of lookup table / index for each file in an ext4 file system that maps seek locations to disk sectors.

I haven't looked at filesystems in decades, but I seem to recall that FAT filesystems are linked lists - you have to read a disk sector to know what the next disk sector is. This implies that to seek to 9,000,000 bytes into a file, I need to read all the disk sectors covering the first 8,999,999 bytes. I.e. seek time is linear in the "depth" of the seek. I'm hoping ext4 is constant time, not linear.
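
To make the access pattern concrete, here's roughly what I mean (the file name and sizes are just placeholders):

    # Illustrative only: seek to a known byte offset and read one ~10 KB record.
    with open("records.tfrecord", "rb") as f:  # placeholder file name
        f.seek(9_000_000)         # jump to the record's known byte offset
        record = f.read(10_000)   # read roughly one data point's worth of bytes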

Question 2:

My ultimate goal is to perform random access into a TFRecord. For reasons I assume are related to optimizing read speed on magnetic spinning platter disks, TFRecords are designed for serial reading, not random access.

Regardless of whether the seek is constant time or not (as a function of the depth of the seek), would random access into a big file on an ext4 filesystem be "fast enough"? I honestly don't know exactly what fast enough would be, but for simplicity, let's say a very fast deep learning model might be able to pull 10,000 data points per second, where each data point is around 10 KB and randomly pulled from a big file.
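
For scale, that target works out to roughly 10,000 × 10 KB ≈ 100 MB/s of random reads. I don't know yet whether this is a fair test, but something like the following is how I'd sanity-check raw random-read throughput (the path and sizes are placeholders, and these offsets ignore real record boundaries):

    import os, random, time

    PATH = "big_file.tfrecord"   # placeholder: a ~100 MB file of ~10 KB records
    NUM_READS = 10_000
    RECORD_SIZE = 10_000

    size = os.path.getsize(PATH)
    offsets = [random.randrange(0, size - RECORD_SIZE) for _ in range(NUM_READS)]

    with open(PATH, "rb") as f:
        start = time.perf_counter()
        for off in offsets:
            f.seek(off)            # random seek into the big file
            f.read(RECORD_SIZE)    # read ~one data point's worth of bytes
        elapsed = time.perf_counter() - start

    print(f"{NUM_READS / elapsed:.0f} random reads/sec")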

Yaoshiang
  • "Ubuntu 18.04 or 20.04, which I think is usually an ext4 filesystem." Maybe not. My Ubuntu 20.04 laptop is using XFS. I don't remember if that was my choice when I installed Ubuntu or if it was the default. Either way, I'd think you could select just about any filesystem. – Andrew Henle Jan 05 '22 at 21:17

1 Answer


Maybe, if the file is not fragmented on the disk. But it probably doesn't matter if it's strictly constant time.

ext2 and ext3 stored the locations of the data blocks in trees that had 1 to 4 levels, so lookups couldn't be constant-time. Also, the blocks of the tree could in principle be anywhere along the filesystem, so some disk seeks may be required.

ext4 stores a tree of extents, which then describe multiple contiguous data blocks. So if it's known a file has only one extent, the lookup would be constant time. But if it's fragmented (or larger than 128 MiB, requiring more than one extent), it would not.

(source: https://www.kernel.org/doc/html/latest/filesystems/ext4/dynamic.html#the-contents-of-inode-i-block)
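
If you want to see how many extents a given file actually ended up with, filefrag from e2fsprogs will list them. A minimal sketch (assuming the tool is installed; the file name is a placeholder):

    import subprocess

    # Print the extent map of a file; the summary line reports "N extents found".
    out = subprocess.run(["filefrag", "-v", "records.tfrecord"],
                         capture_output=True, text=True, check=True)
    print(out.stdout)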

Though I might be more worried about whether the lookups are fast enough than whether they're constant time. That's a far easier goal, since the trees aren't going to be very deep anyway, and if you're accessing the same file repeatedly, they'll soon all be loaded into memory, eliminating any disk seeks (which aren't that big of a problem on SSDs, but anyway). There's also the system call overhead on each access, doubled if you seek before each read/write. Though I think there are some more advanced system calls that can alleviate that.
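
One such call is pread(), which combines the seek and the read into a single system call; Python exposes it as os.pread(). A minimal sketch with a placeholder file name and sizes:

    import os

    fd = os.open("records.tfrecord", os.O_RDONLY)  # placeholder file name
    data = os.pread(fd, 10_000, 9_000_000)         # read ~10 kB at offset 9,000,000
    os.close(fd)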

that FAT filesystems are linked lists - you have to read a disk sector to know what the next disk sector is. This implies that to seek to 9,000,000 bytes into a file, I need to read all the disk sectors covering the first 8,999,999 bytes. I.e. seek time is linear in the "depth" of the seek.

FAT filesystems have a table of block pointers (the FAT table), and those pointers form the linked list. Not the data blocks themselves. So, with e.g. a 4 kB block size, you only need to read 9000000 / 4096 ~= 2000 pointers, a few kB worth. It's still a linked list though, and iterating it requires a number of steps proportional to the seek location (unless the fs driver has some smarts to reduce that). But the FAT table is contiguous and also likely to be in memory so there are no disk seeks involved.

Typical values here would be 10KB per data point, 10,000 data points per big file, and a big file around 100MB.

let's say a very fast deep learning model might be able to pull 10000 data points per second, where each data point is around 10KB and randomly pulled from a big file.

A 100 MB file should easily fit fully in memory (multiple times over), and keeping it there would get rid of the system call overhead of the seeks too. If you're only reading, that's that.
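
A minimal sketch of that, with a placeholder file name and assuming you already know the byte offset of the record you want:

    with open("records.tfrecord", "rb") as f:  # placeholder file name
        blob = f.read()                        # pull the whole ~100 MB file into memory

    # Random access is now just slicing, with no per-record system calls.
    record = blob[9_000_000:9_000_000 + 10_000]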

If you were to write, too, note that not all writes would immediately hit the actual disk/flash anyway without some special care that would probably slow the whole thing down. (At the least, you'd need to call fsync() on every occasion, and trust the drive not to cheat you.) With the file in memory, you could either manually write it back every once in a while, or map it to memory with mmap() and call msync() to ask for it to be written back from time to time.
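
A minimal sketch of the mmap()/msync() route (Python's mmap.flush() issues the msync(); the file name and offsets are placeholders):

    import mmap

    with open("records.tfrecord", "r+b") as f:      # placeholder file name, opened read/write
        with mmap.mmap(f.fileno(), 0) as mm:        # map the whole file
            mm[9_000_000:9_000_010] = b"\x00" * 10  # modify some bytes in place
            mm.flush()                              # msync(): ask for write-back to disk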

ilkkachu
  • ups, thanks @Stephen – ilkkachu Jan 05 '22 at 20:00
  • Thanks so much @ilkkachu. I've updated my question to also ask about "fast enough" for a specific data-loading example. And I've clarified that constant is only as a function of the depth of the seek, not the file size or fragmentation. I also updated that TFRecords are usually written once. They are not appended to. I suspect this means they are low on fragmentation? – Yaoshiang Jan 05 '22 at 20:22
  • FWIW, I seem to remember that the FAT filesystem is actually among the fastest filesystems for reading - that contiguous FAT table that's already in memory makes for some real fast reads. – Andrew Henle Jan 05 '22 at 21:15
  • @Yaoshiang "Fast enough" depends on your situation: not only the state of the system when you read the file, but how often you're reading it, and even the length of the YouTube video you watch to pass the time while your program is running. :-P Bottom line, would you get to use your time more effectively by switching filesystems? That's a question only you can answer. – David Z Jan 06 '22 at 06:05
  • @Yaoshiang: Any chance your TFRecords are written using ext4's preallocation to minimize fragmentation? – Eric Towers Jan 06 '22 at 07:02
  • I think the question is asking if, for a given file, the complexity of seek is independent of the seek position. And as I understand it, that's close to the case for EXT filesystems - there can be a slight increase when extra blocks are required for block pointers, but they are not (as Yaoshiang seemed to fear) linear with the seek position. – Toby Speight Jan 06 '22 at 09:27
  • @TobySpeight, yeah, that's what I meant with the trees likely being "not too deep". Strictly, it's not constant time unless all leaves of the tree are on the same level... (Then again, with all the caches and stuff going on, I'm not sure if anything on a modern computer is actually constant-time, but that's way too deep in the rabbit hole.) – ilkkachu Jan 06 '22 at 12:45
  • Oh yes, where constant time really matters (e.g. in crypto) we end up having to jump through many hoops to avoid data-dependent timing. It's usually best to offload to function-specific hardware when possible. (Now totally off-topic...) – Toby Speight Jan 06 '22 at 13:52
  • @EricTowers, I suspect there is no ext4 optimization, just standard python writes. One dynamic is that the writing all happens in one session - there is no appending. However, typically, you don't know how big the file will be until you are done writing it, so my guess is that pre-allocation would be hard to pull off. – Yaoshiang Jan 06 '22 at 18:43
  • @TobySpeight that's correct. I made it confusing by saying "constant" time... what I meant was "constant time with respect to seek position". If EXT4 has a tree index that grows O(lg n) with respect to file size, that's outside the concern I had. – Yaoshiang Jan 06 '22 at 18:47
  • @Yaoshiang, you could guess some size and truncate the file after. If you use whatever it is that fallocate uses, you don't even need to write to the file in advance. ext4 can allocate the data blocks and just mark them as unwritten; they'll read as zeroes until written to, and the process is quick – ilkkachu Jan 06 '22 at 19:33