1

I have a process that reads data from a hardware device using DMA transfers at ~4 × 50 MB/s. At the same time the data is processed, compressed and written to a 4 TB memory-mapped file.

Each DMA transfer should (and on average does) take less than 20 ms. However, a few times every 5 minutes a DMA transfer can take up to 300 ms, which is a huge issue.

We believe this might be related to the kernel flushing dirty memory-mapped pages to disk, since the DMA transfer durations are fine if we stop writing to the mapped memory. However, we are confused as to how/why this could affect the DMA transfers, and whether there is a way to avoid it.

The hardware device has some memory to buffer data, but when the DMA transfers are this slow we are losing data.
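For reference, the write path is essentially an mmap() of a preallocated file that we copy into sequentially, roughly like the following simplified sketch (compression and the actual device read are omitted; the file name and sizes are illustrative):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Illustrative sizes: the real file is 4 TB, the real chunks are DMA-sized. */
#define FILE_SIZE (1ULL << 30)      /* mapped file, 1 GB here for brevity */
#define CHUNK     (1UL << 20)       /* one compressed chunk, 1 MB */

int main(void)
{
    int fd = open("/data/ring.bin", O_RDWR);    /* preallocated output file */
    if (fd == -1)
        return 1;

    unsigned char *map = mmap(NULL, FILE_SIZE, PROT_READ | PROT_WRITE,
                              MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    static unsigned char chunk[CHUNK];          /* filled by DMA + compression */
    unsigned long long off = 0;

    for (int i = 0; i < 1000; i++) {            /* stand-in for the main loop */
        memcpy(map + off, chunk, CHUNK);        /* dirties page-cache pages */
        off = (off + CHUNK) % FILE_SIZE;        /* ring buffer: overwrite oldest data */
    }

    munmap(map, FILE_SIZE);
    close(fd);
    return 0;
}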

Currently we’re testing on Arch Linux with a 4.1.10 LTS kernel, but we’ve also tried Ubuntu 14.04, with mostly worse results. The hardware is an HP Z820 workstation with 32 GB RAM and dual Xeon E5-2637 @ 3.50 GHz (http://www8.hp.com/h20195/v2/GetPDF.aspx/c04111177.pdf).

We have also tried a Windows version of our software, which does not suffer from this specific issue but has lots of other issues.

ronag
  • Can you clarify which Linux platform you are using, please? – Kevdog777 Oct 26 '15 at 09:15
  • On what hardware do you run? – Dmitry Grigoryev Oct 26 '15 at 09:17
  • @DmitryGrigoryev. Updated post with hardware information. – ronag Oct 26 '15 at 09:50
  • First, what is your disk hardware and filesystem? Maybe faster disks and a faster filesystem will fix the problem, perhaps with more frequent flushing to disk. And how are you writing data to your output file? Are you seeking and writing randomly? Are you reading back from the file? If the page cache isn't helping your performance, you can replace the mmap()'d file with open() and write() calls using the O_DIRECT flag on the open(). This won't use the page cache and the kernel won't have to flush dirty pages. This does assume your underlying file system supports direct IO. – Andrew Henle Oct 26 '15 at 09:57
  • @AndrewHenle: It's 2x RAID0 3 TB spinning drives running XFS. The amount of data written is 25MB/s while the disks have minimum speed > 100MB/s. The data is written entirely sequentially. We will read back from the file in the future but in this specific test it is only written to. We need to go through the file cache for future reading. – ronag Oct 26 '15 at 10:04
  • The best solution for us would be if the DMA transfers were always prioritised over disk IO, since the disks should have no trouble keeping up. – ronag Oct 26 '15 at 10:05
  • I suppose your disks use DMA transfers as well. You should check the manual to see if you can configure the memory controller and what the options are. – Dmitry Grigoryev Oct 26 '15 at 10:10
  • Hm, yes, the Intel RAID controller might be using DMA. We'll try removing that and using regular software RAID for the disks. – ronag Oct 26 '15 at 10:14
  • @ronag "We need to go through the file cache for future reading" Do you? How many bytes are you writing total? "for future reading" implies the data will likely be long gone from the page cache by the time you read the file. And 3TB drives implies SATA, running at most 7200 RPM. They'll do OK for streaming, but they won't handle random IO or concurrent access well at all, and they'll only have no trouble keeping up to your required data rate under ideal conditions, as you're already seeing. – Andrew Henle Oct 26 '15 at 10:21
  • @AndrewHenle: It's writing continuously and sequentially, overwriting the oldest data. The data will be read about 10 seconds after it's been written, so we mostly need about 10 seconds of file cache to avoid read IO, which would totally bust disk performance as you hint. There will be very little random IO; both reads and writes will be sequential, except for some very unusual binary seeking. Note that for this test we are not reading from the file at all, since we haven't even gotten just the writing working yet. – ronag Oct 26 '15 at 10:24
  • @ronag - "The data will be read about 10 seconds after it's been written" and " we haven't even gotten just writing working yet"? Get some SSDs, see if they solve the problem, and if so move on. – Andrew Henle Oct 26 '15 at 10:30
  • @AndrewHenle: Affordable SSDs don't have reliable enough performance and are not suitable for continuous write workloads; we used SSDs initially but suffered major data loss due to sudden drops in write performance. The exception would be enterprise SSDs, which cost as much as a castle. I don't think the performance of the spinning drives should be a problem, since we are only using ~16% of their minimum write performance. – ronag Oct 26 '15 at 10:34
  • Good laptop SSDs are ubiquitous with no data loss. So you're spending how many man-hours at how much money an hour to try to come up with only half a solution that's going to require non-trivial tuning of software and hardware to make it work - and then you'll have to deal with reading the data 10 seconds later? Using 16% of the average aggregate write performance is irrelevant if you can't meet peak requirements - and that's not even addressing your read requirement. If your disk solution didn't fall apart every 5 minutes you wouldn't be asking this question. Buy the enterprise SSDs. – Andrew Henle Oct 26 '15 at 10:51
  • @AndrewHenle: Actually the price difference for us is so large that it's even worth putting weeks of dev time into this... What you are saying makes sense and we have looked into it, but it simply doesn't work from an economical perspective for our case. – ronag Oct 26 '15 at 11:00
  • @DmitryGrigoryev: Replacing the Intel RAID with Linux software RAID seems to improve things; we now have an intermittent max of 100 ms, though that's still not a full solution. Is there any way in Linux to get a bit more information about what the DMA controller is doing? – ronag Oct 26 '15 at 11:10
  • Have a look on http://www.makelinux.net/ldd3/chp-15-sect-4 for some DMA programming tips. – Dmitry Grigoryev Oct 26 '15 at 11:19
  • Not sure where else we can find improvements. Does regular disk io also go through DMA? How can the DMA be blocked for 100ms? If the kernel uses DMA to flush dirty pages to disk is it possible to configure the size or priority of the transfers so there is more interleaving? – ronag Oct 26 '15 at 12:20
  • You can try chrt -f and ionice -c 1 to give realtime scheduling to your process, which may help with IO queueing. Also, stop cron and other background jobs you are not interested in (e.g. temporarily with kill -stop then kill -cont). – meuh Oct 26 '15 at 14:30
  • We've noticed 2 scenarios where the problem doesn't occur: if we read from the file (cat) while it's being written, or if we use O_DIRECT. The second one I can understand, but the first one makes no sense to me. – ronag Oct 26 '15 at 15:29
  • @meuh: Your suggestion to use chrt -f solved our problem. If you create an answer with your suggestion I can accept it. – ronag Oct 28 '15 at 11:43

3 Answers

2

Linux has some realtime options, though it is not a realtime kernel as such. These allow a process to demand that it be scheduled before non-realtime processes as soon as it is ready, and to hold on to the CPU for as long as necessary.

By default, processes are given the scheduling policy SCHED_OTHER. You can set this to the realtime SCHED_FIFO for a given running pid with chrt -f -p prio pid, or prefix the command with chrt -f prio when you start it. The prio priority is independent of normal process priorities, and is only used when realtime processes compete for resources. ps shows these priorities as negative values (e.g. -21 for realtime prio 20).
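If it is easier to set the policy from inside the process than to wrap it in chrt, the equivalent call is sched_setscheduler(). A minimal sketch, assuming the process runs as root or with CAP_SYS_NICE:

#include <sched.h>
#include <stdio.h>

int main(void)
{
    /* SCHED_FIFO priorities run 1..99; 20 here mirrors "chrt -f 20". */
    struct sched_param sp = { .sched_priority = 20 };

    /* pid 0 means "the calling process". */
    if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1) {
        perror("sched_setscheduler");
        return 1;
    }

    /* ... the DMA/processing loop runs here with realtime scheduling ... */
    return 0;
}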

ionice --class 1 -p pid can also help, by scheduling your process's IO with preferential realtime queueing.
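The programmatic counterpart of ionice is the ioprio_set system call; glibc has no wrapper, so it is normally invoked through syscall(). A sketch, with the IOPRIO_* constants copied from the kernel's linux/ioprio.h (the realtime class again needs root):

#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Values from linux/ioprio.h, which is not always exported to userspace. */
#define IOPRIO_WHO_PROCESS  1
#define IOPRIO_CLASS_RT     1
#define IOPRIO_CLASS_SHIFT  13
#define IOPRIO_PRIO_VALUE(cls, data)  (((cls) << IOPRIO_CLASS_SHIFT) | (data))

int main(void)
{
    /* Realtime IO class, highest level 0, for this process (pid 0).
       Equivalent to "ionice -c 1 -n 0 -p <pid>". */
    if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0,
                IOPRIO_PRIO_VALUE(IOPRIO_CLASS_RT, 0)) == -1) {
        perror("ioprio_set");
        return 1;
    }
    return 0;
}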

meuh
1

Does your device operate continuously for minutes, or are there regular pauses in the transfers?

If there are pauses, you can force the kernel to flush dirty buffers and cache during those pauses, so that this activity doesn't interfere with the DMA transfers. Alternatively, you could configure the writeback interval (BDFLUSHR on older systems; vm.dirty_writeback_centisecs on current Linux kernels) to 1 second, so that the kernel has less data to write each time it decides to flush buffers.
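For instance, the writing process itself could start writeback of the freshly written range during a pause with sync_file_range() (a plain sync() also works, but flushes every dirty page in the system). A minimal sketch, where the file name and range are stand-ins for the real memory-mapped output file and the region written since the last pause:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* Descriptor backing the memory-mapped output file (name illustrative). */
    int fd = open("/data/ring.bin", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    off64_t written_off = 0;
    off64_t written_len = 256LL << 20;   /* e.g. the last 256 MB written */

    /* Kick off writeback of that range now, without waiting for it to
       complete, so less dirty data piles up for the background flusher. */
    if (sync_file_range(fd, written_off, written_len,
                        SYNC_FILE_RANGE_WRITE) == -1)
        perror("sync_file_range");

    close(fd);
    return 0;
}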

If you need to ensure continuous operation, you will need RAM with more channels, so that the CPU and your device can access memory simultaneously (as it turns out, you already have a 4-channel memory controller). Make sure you configure your RAM in unganged mode, if that option is available, and that you have installed similar DRAM modules in the 4 slots corresponding to the memory channels, so that the memory controller can in fact operate in 4-channel mode.

0

I'd guess that you haven't modified the kernel's dirty page settings. For your use case, I would try something like this:

/proc/sys/vm/dirty_background_bytes:50000000
/proc/sys/vm/dirty_bytes:4000000000
/proc/sys/vm/dirty_expire_centisecs:100
/proc/sys/vm/dirty_writeback_centisecs:20

(See https://www.kernel.org/doc/Documentation/sysctl/vm.txt for details.)

The problem is basically that the default kernel limits are problematic if your system has lots of RAM and a slow enough storage device, and you're after low worst-case latency. In practice, the IO subsystem's writeback buffer fills up and the kernel has to force writing processes to sleep until enough data has been written to the block devices ("flushing dirty pages").
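If you would rather apply these at program startup than via sysctl.conf, writing the values into the /proc files directly does the same thing (sketch; needs root, values as above):

#include <stdio.h>

/* Write a single value into a /proc/sys file; returns 0 on success. */
static int write_sysctl(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    /* Same values as suggested above; requires root. */
    write_sysctl("/proc/sys/vm/dirty_background_bytes", "50000000");
    write_sysctl("/proc/sys/vm/dirty_bytes", "4000000000");
    write_sysctl("/proc/sys/vm/dirty_expire_centisecs", "100");
    write_sysctl("/proc/sys/vm/dirty_writeback_centisecs", "20");
    return 0;
}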