I have a very large disk drive (2TB), but not very much RAM (8GB). I'd like to be able to run some big data experiments on a large file (~200GB) that exists on my disk's filesystem. I understand that will be very expensive in terms of disk bandwidth, but I don't mind the high I/O usage.

How could I load this huge file into a C++ array, so that I could perform reads and writes to the file at locations of my choosing? Does mmap work for this purpose? What parameter options should I be using to do this? I don't want to trigger the OOM killer at any point while running my program.

I know that mmap supports file-backed and anonymous mappings, but I'm not entirely sure which to use. And should I use a private or a shared mapping?

1 Answer

It only makes sense to use a file-backed mapping to mmap a file, not an anonymous mapping. If you want to write to the mapped memory and have the changes get written back to the file, then you need to use a shared mapping. With a file-backed, shared mapping, you don't need to worry about the OOM killer, so, as long as your process is 64-bit, there's no problem with just mapping the entire file into memory. (And even if you weren't 64-bit, the problem would be lack of address space, not lack of RAM, so the OOM killer still wouldn't affect you; your mmap would just fail.)
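
As a minimal sketch of what that looks like in practice (assuming Linux, and using a placeholder file name), you open the file with O_RDWR, map its full size with PROT_READ | PROT_WRITE and MAP_SHARED, and then treat the returned pointer as a big array:

```cpp
// Sketch: map a large existing file as a file-backed, shared mapping.
// "data.bin" is a placeholder path; error handling is kept minimal.
#include <cstdint>
#include <cstdio>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char* path = "data.bin";   // placeholder for your ~200GB file

    int fd = open(path, O_RDWR);     // O_RDWR so writes can reach the file
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }
    size_t length = static_cast<size_t>(st.st_size);

    // File-backed, shared mapping: stores to this memory are written back
    // to the file by the kernel, and clean pages can be evicted and re-read
    // on demand, so the mapping can be far larger than physical RAM.
    void* addr = mmap(nullptr, length, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); return 1; }

    auto* bytes = static_cast<uint8_t*>(addr);

    // Read and write at arbitrary offsets, as if it were one huge array.
    uint8_t old_value = bytes[123456789];
    bytes[123456789] = old_value + 1;

    // Optionally flush dirty pages to disk before unmapping.
    msync(addr, length, MS_SYNC);

    munmap(addr, length);
    close(fd);
    return 0;
}
```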