I'm working with TensorFlow's TFRecords format, which serializes many data points into a single big file. Typical values here would be 10KB per data point, 10,000 data points per big file, and a big file of around 100MB. TFRecords are typically written just once; they are not appended to. I think this means the files will not be very fragmented on disk.
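For concreteness, this is roughly how such a file gets written, as I understand it (a minimal sketch assuming TensorFlow 2.x's `tf.io.TFRecordWriter`; the path and payloads are made up for illustration):

```python
import os
import tensorflow as tf

# Hypothetical path; 10,000 records of ~10KB each gives a file of roughly 100MB.
path = "/tmp/example.tfrecord"

with tf.io.TFRecordWriter(path) as writer:
    for _ in range(10_000):
        # Stand-in payload; in practice this would be a serialized tf.train.Example.
        writer.write(os.urandom(10 * 1024))

print(os.path.getsize(path))  # ~100MB plus a small per-record framing overhead
```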
I believe TFRecords were based on Google's internal RecordIO format.
Usually people run Tensorflow and TFRecords on Ubuntu 18.04 or 20.04, which I think is usually an ext4 filesystem.
And usually, deep learning engineers run on SSD/NVMe disks; the cost delta over magnetic spinning platter disks is immaterial compared to the massive cost of the GPU itself.
Question 1:
In an ext4 filesystem, if I know that a specific data point starts, say, 9,000,000 bytes into a file, can I seek to that location and start reading the data point in constant time? By constant time, I mean constant with respect to the depth of the seek; I'm not worried about any effect the total size of the file might have.
If this is true, it would imply there's some kind of lookup table / index for each file in an ext4 file system that maps seek locations to disk sectors.
I haven't looked at filesystems in decades, but I seem to recall that FAT filesystems store each file's block list as a linked list: you have to follow the chain, entry by entry, to find out where the next chunk of the file lives. That implies that to seek 9,000,000 bytes into a file, I would have to walk the chain covering the first 8,999,999 bytes, i.e. seek time is linear in the depth of the seek. I'm hoping ext4 is constant time, not linear.
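To be concrete, this is the operation I'm asking about, sketched in plain Python with the illustrative numbers from above (the path is hypothetical):

```python
import os

# A ~10KB data point known to start 9,000,000 bytes into the big file.
PATH = "/tmp/example.tfrecord"   # hypothetical path
OFFSET = 9_000_000
LENGTH = 10 * 1024

fd = os.open(PATH, os.O_RDONLY)
try:
    # A single positioned read: nothing before OFFSET is read in user space.
    # The question is whether the filesystem can map OFFSET to a disk block
    # without doing work proportional to OFFSET.
    data = os.pread(fd, LENGTH, OFFSET)
finally:
    os.close(fd)
```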
Question 2:
My ultimate goal is to perform random access into a TFRecord. For reasons I assume are related to optimizing read speed on magnetic spinning platter disks, TFRecords are designed for serial reading, not random access.
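That said, the record framing seems simple enough to index in one sequential pass. As I understand it (this is my assumption about the format, not something I've verified against the TensorFlow source), each record is stored as an 8-byte little-endian length, a 4-byte masked CRC of the length, the payload, and a 4-byte masked CRC of the payload. If that's right, a sketch like this could build a byte-offset index up front, which I would then use for the random seeks:

```python
import struct

def build_tfrecord_index(path):
    """Sequentially scan a TFRecord file and return the byte offset at which
    each record's framing starts.

    Assumed framing per record (my understanding of the format):
      uint64 length (little-endian) | uint32 length CRC | payload | uint32 payload CRC
    """
    offsets = []
    with open(path, "rb") as f:
        while True:
            offset = f.tell()
            header = f.read(8)
            if len(header) < 8:
                break  # end of file
            (length,) = struct.unpack("<Q", header)
            offsets.append(offset)
            f.seek(4 + length + 4, 1)  # skip length CRC, payload, payload CRC
    return offsets
```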
Regardless of whether seeking is constant time or not (as a function of the depth of the seek), would random access into a big file on an ext4 filesystem be "fast enough"? I honestly don't know exactly what "fast enough" would be, but for simplicity, let's say a very fast deep learning model might be able to consume 10,000 data points per second, where each data point is around 10KB and is pulled at random from a big file. That works out to roughly 100MB/s of random 10KB reads.
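To make "fast enough" concrete, this is the kind of micro-benchmark I would run (a sketch; the path is made up, `offsets` would come from the indexing pass above, and each read is approximated as a fixed 10KB):

```python
import os
import random
import time

def benchmark_random_reads(path, offsets, n_reads=10_000, read_size=10 * 1024):
    """Issue n_reads positioned reads at randomly chosen record offsets and
    report throughput. read_size approximates the ~10KB data point; a real
    reader would use the exact per-record length from the index."""
    fd = os.open(path, os.O_RDONLY)
    try:
        picks = random.choices(offsets, k=n_reads)
        start = time.perf_counter()
        total = 0
        for off in picks:
            total += len(os.pread(fd, read_size, off))
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    print(f"{n_reads} reads, {total / 1e6:.1f} MB in {elapsed:.3f} s "
          f"({n_reads / elapsed:.0f} reads/s)")
```

One caveat I'm aware of: after a first pass, a 100MB file will likely sit entirely in the page cache, so a fair test would need a dataset much larger than RAM (or dropped caches) to actually exercise the disk.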