2

In my C++ program, which does intensive disk and network I/O as well as heavy CPU computation, I use a memory-mapped region as an array.

With very small data, it works fine. However, when I run the program with very large data, my application crashes. (I understand that the size of the mmap region should not be a concern, because the OS handles all the I/O and buffering.)

I don't want to blame Linux for this, but I'd like to know whether there is any case in which mmap becomes unstable and can make the OS crash.

When the OS crashes, I can see a kernel panic message on the screen that mentions 'write_back' ... (I will add the message here as soon as I reproduce the problem.)

// The program performs MPI network operations over the memory-mapped region (Intel MPI with InfiniBand RDMA enabled), where RDMA possibly bypasses the OS kernel and writes data directly into memory.

I investigated the call stack and found the relevant kernel code: (http://lxr.free-electrons.com/source/fs/ext4/inode.c#L2313)

I guess the error comes from the BUG_ON trap at #L2386: BUG_ON(PageWriteback(page)); The kernel version is 3.19.0 (https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.19.tar.xz)


syko
    How huge (in bytes) is the data? What does the program or shell print when it crashes? – Mark Plotnick Feb 16 '16 at 20:33
  • 5
    mmap is not unstable, your C++ program is. Perhaps is it compiled in 32 bit mode? – jlliagre Feb 16 '16 at 20:39
  • 2
    How does your application "crash"? – Andrew Henle Feb 16 '16 at 20:50
  • Are you assuming that mmap() always returns the same address? Does your program use a long call stack (i.e. it's recursive or makes heavy use of function calls), or is it threaded? –  Feb 16 '16 at 22:25
  • 1
    @MarkPlotnick By 'huge' I mean the case where the mmap region's size exceeds the physical memory size. (It's a 32GB machine. When the mmap region is about 12GB (+ ~15GB for other data structures), it's fine. But when I increase the mmap region's size to 24GB, the OS, not my program, crashes.) – syko Feb 16 '16 at 23:46
  • @BruceEdiger I store the returned address from mmap and keep using it. I do not use any recursive call but the application is very intensively threaded. Is it possible the load for aio of mmap is killing the OS? – syko Feb 16 '16 at 23:46
  • @jlliagre I know that, that's why I am here to ask some opinions :D – syko Feb 16 '16 at 23:48
  • 3
    @AndrewHenle I should have said that the program kills the OS. The OS spits out a kernel panic message (blah blah 'writeback'... I will add the message as soon as I reproduce it) – syko Feb 16 '16 at 23:49
  • Please give us an mcve. I played with a couple of mmap scenarios but my kernel lives on. And I wanted it to die so bad. :( – Petr Skocik Feb 17 '16 at 00:11
  • 1
    Most importantly, post the kernel stack trace. It should be in /var/log/messages. This should help: http://unix.stackexchange.com/questions/60574/determining-cause-of-linux-kernel-panic – Andrew Henle Feb 17 '16 at 01:24

2 Answers

1

By the description in the comments, this is a kernel crash (panic). It definitely should never happen.

What distribution is this? What kernel version? Architecture?

First update everything. If your distribution is End of Life, upgrade. Then try again.

If it persists, you should be able to make this reproducible with a small C program that does the same humongous mmap() as your C++ program and repeats on that memory a dance similar to the C++ one (perhaps just trying to access "far into it" is enough). Gather everything and report it via the bug reporting channel of your distribution.
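A minimal sketch of such a reproducer might look like the following. It is not the asker's code: the helper name touch_mapping, the one-byte-per-stride write pattern, and the use of MAP_SHARED on a regular file are all assumptions chosen for illustration, and a 64-bit build is assumed so that off_t and size_t can hold sizes above 4GB. It maps a file, dirties one byte every stride bytes, and forces writeback with msync().

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not from the original program): map `length` bytes
 * of `path` as shared, writable memory, dirty one byte every `stride`
 * bytes, then flush the dirty pages with msync().
 * Returns the number of bytes written, or -1 on error. */
long touch_mapping(const char *path, size_t length, size_t stride)
{
    if (length == 0 || stride == 0)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    /* Size the backing file so every mapped page is backed by the file. */
    if (ftruncate(fd, (off_t)length) != 0) {
        perror("ftruncate");
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, length, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return -1;
    }

    /* Walk "far into" the region, dirtying one byte per stride. */
    long touched = 0;
    for (size_t off = 0; off < length; off += stride) {
        p[off] = (unsigned char)(off / stride);
        touched++;
    }

    /* Ask the kernel to write the dirty pages back before unmapping. */
    int rc = msync(p, length, MS_SYNC);
    if (rc != 0)
        perror("msync");

    munmap(p, length);
    close(fd);
    return rc == 0 ? touched : -1;
}
```

Calling it from a small main() with a path on the affected filesystem, a length around the 24GB mentioned in the comments, and a stride equal to the page size would approximate the failing case. Passing a path on a different filesystem lets the same run be compared elsewhere. Whether this simple access pattern is enough to trigger the panic is unknown, since the real program also involves threads and RDMA.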

vonbrand
1

You cannot cause a kernel panic via improper use of any system call, mmap included. The syscall interface doesn't provide the caller with the wherewithal to corrupt the kernel's data structures.

I would look for a hardware problem, and pay careful attention to any clues in the system logs, e.g. /var/log/kernel.log. As an experiment, I'd try mapping the same size file on a different filesystem, because the disk is the most likely component to fail.

It's possible you've tickled a kernel bug. A quick search for writeback and panic turns up this old bug. If you're running a very old kernel then, sure, it's probably time to upgrade.