
Is it beneficial to artificially prime the buffer cache when dealing with larger files?

Here's a scenario: a large file needs to be processed, line by line. Conceptually, it is easy to parallelize the task to saturate a multi-core machine. However, since the lines need to be read first (before being distributed round-robin to workers), the overall process becomes IO-bound and therefore slower.

Is it reasonable to read all or portions of the file into the buffer cache in advance, so that reads are faster when the actual processing occurs?


Update: I wrote a small front-end to the readahead system call. I'll try to add some benchmarks later...
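The front-end itself isn't reproduced here, but its core is necessarily a thin wrapper around the syscall. A minimal sketch (the function name is mine, error handling trimmed):

```c
/* prime_readahead: load a byte range of a file into the page cache via
 * the Linux-specific readahead(2) syscall.
 * Returns 0 on success, -1 on error. */
#define _GNU_SOURCE
#include <fcntl.h>      /* open, readahead */
#include <unistd.h>     /* close */

int prime_readahead(const char *path, off_t offset, size_t nbytes)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* readahead(2) blocks until the requested range has been read
     * into the page cache. */
    int rc = readahead(fd, offset, nbytes);

    close(fd);
    return rc;
}
```

Note that readahead(2) is Linux-only and blocks until the data is cached; a command-line front-end would just parse file, offset and byte count from its arguments and pass them through.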

miku
  • benchmark, otherwise it is just opinion based ;) but I'd just use http://man7.org/linux/man-pages/man2/readahead.2.html and read it like you normally would – Ulrich Dangel Jul 17 '14 at 07:54
  • @UlrichDangel, thanks for the readahead link, didn't know about this syscall. – miku Jul 17 '14 at 07:58
  • just a heads-up, if you use mmap, you can use madvise to mark it as sequential read so the kernel will do the readahead as well and discard old pages. http://man7.org/linux/man-pages/man2/madvise.2.html – Ulrich Dangel Jul 17 '14 at 10:58
  • @UlrichDangel, thanks again. I found a simple POSIX-compliant answer, namely dd. – miku Jul 17 '14 at 12:29
  • I wouldn't do it that way, as you depend on an external program (figuring out bs/count etc.). For a while there was an optimisation in dd which just returned if of was /dev/null. How about using posix_fadvise instead? It should be POSIX-compliant and still achieve the same result without relying on dd – Ulrich Dangel Jul 17 '14 at 12:49
  • Great advice, I'll see if I can implement these various approaches... – miku Jul 17 '14 at 13:12
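Following the posix_fadvise suggestion from the comments, priming a range without an external program might look like this (a sketch; the function name is mine):

```c
/* prime_fadvise: ask the kernel to read a byte range into the page cache
 * using posix_fadvise(2) with POSIX_FADV_WILLNEED. Unlike readahead(2),
 * this is POSIX and initiates the read-ahead without blocking.
 * Returns 0 on success, non-zero on failure. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>      /* open, posix_fadvise, POSIX_FADV_WILLNEED */
#include <unistd.h>     /* close */

int prime_fadvise(const char *path, off_t offset, off_t nbytes)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    /* nbytes == 0 means "up to the end of the file" (posix_fadvise(2)). */
    int rc = posix_fadvise(fd, offset, nbytes, POSIX_FADV_WILLNEED);

    close(fd);
    return rc;
}
```

Since WILLNEED only queues the read-ahead, the call returns quickly; the pages trickle into the cache in the background.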
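And for the mmap/madvise variant mentioned above, a sketch (assumes Linux; the function name is mine):

```c
/* map_sequential: map a file read-only and mark it for sequential access,
 * so the kernel performs aggressive readahead and can drop pages already
 * consumed (see madvise(2)). Returns the mapping (and its length via
 * *len_out) or NULL on error. */
#include <sys/mman.h>   /* mmap, madvise, MADV_SEQUENTIAL */
#include <sys/stat.h>   /* fstat */
#include <fcntl.h>      /* open */
#include <unistd.h>     /* close */

void *map_sequential(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) == -1 || st.st_size == 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return NULL;

    /* Advise the kernel that we will read the pages in order. */
    madvise(p, st.st_size, MADV_SEQUENTIAL);

    *len_out = (size_t)st.st_size;
    return p;
}
```

The caller would then iterate over the mapped bytes line by line and munmap when done.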

1 Answer


To prime the cache with a whole file:

cat big.file >/dev/null

To prime the cache with a portion of a file, as per this comment:

time dd if=big.file of=/dev/null bs=1024k count=XXX skip=YYY

Example with a 2.5G file:

$ time rarara big.file 0 2459650481
real    0m13.803s

$ sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

$ time dd if=big.file of=/dev/null bs=4096 count=600501 skip=0
real    0m14.394s
miku
  • JFTR you should clear the buffers between running these commands. I think this testcase is also rather artificial, I'd evaluate it within your environment to see how many operations/second you can achieve via which method. (Do not forget to clear the buffer/cache between runs) – Ulrich Dangel Jul 17 '14 at 12:55
  • @UlrichDangel, I did drop_caches, just forgot to mention it. – miku Jul 17 '14 at 13:01
  • It isn't clear what your example is supposed to demonstrate. That dd with a small block size is slower than rarara? A small block size makes reading artificially slower. How does this compare with not priming the cache at all? Linux already has read-ahead as a kernel feature, and it's often counterproductive to try to second-guess it. – Gilles 'SO- stop being evil' Jul 17 '14 at 21:31
  • @Gilles: I guess the example does not convey too much, maybe just this: that filling the buffer cache with a dedicated syscall is a bit faster than dd (difference gets larger with bigger files). As for "Linux already has read-ahead as a kernel feature" - when does it read-ahead? When I start to read a large file sequentially - or when certain access patterns occur? I am thankful for any hints, pointers or resources on this. – miku Jul 17 '14 at 21:40