Recently saw a question that sparked this thought. Couldn't really find an answer here or via the Google machine. Basically, I'm interested in knowing how the kernel I/O architecture is layered. For example, does kjournald dispatch to pdflush or the other way around? My assumption is that pdflush (being more generic to mass storage I/O) would sit at a lower level and trigger the SCSI/ATA/whatever commands necessary to actually perform the writes, and kjournald handles higher level filesystem data structures before writing. I could see it the other way around as well, though, with kjournald directly interfacing with the filesystem data structures and pdflush waking up every now and then to write dirty pagecache pages to the device through kjournald. It's also possible that the two don't interact at all for some other reason.

Basically: I need some way to visualize (graph or just an explanation) the basic architecture used for dispatching I/O to mass storage within the Linux kernel.

Bratchley
  • Is this what you're kind of looking for? http://oss.org.cn/ossdocs/linux/kernel/a1/index.html – slm Jun 06 '13 at 12:12
  • Also there is this presentation: 7th slide in: http://www.slideshare.net/LukCzerner/local-file-systems-update – slm Jun 06 '13 at 12:15
  • There's this diagram I found too: http://www.thomas-krenn.com/en/oss/linux-io-stack-diagram/linux-io-stack-diagram_v0.1.pdf – slm Jun 06 '13 at 12:35
  • Unfortunately, it doesn't look like pdflush or kjournald is mentioned in the chart or the seminar video. If I had to guess (going off the chart) since it looks like all filesystem logic is handled before it can make it to the page cache (I do know pdflush is on the other side of the page cache) it looks as though kjournald would hand off to page cache which then hands off to pdflush which sends it to the actual block device layer. That's just speculation, though, until I can find something that says that's correct. – Bratchley Jun 06 '13 at 12:47
  • yeah the stuff you provided is incredibly close to what I'm interested in getting, they just don't mention specific kernel threads. – Bratchley Jun 06 '13 at 13:11
  • I found this interactive kernel map which helps to show how the various components of the kernel go together: http://makelinux.net/kernel_map/ – slm Jun 08 '13 at 16:18
  • I think the answer is in what I've posted, I need to work on how to pull this info out so that it answers your question w/o having to read all of the reference material. LMK what you think. – slm Jun 08 '13 at 16:26
  • One more resource, pages 19-24: Linux Performance and Tuning Guidelines. This one looks like exactly what you're looking for. – slm Jun 08 '13 at 16:44
  • The bulleted breakdown of the disk I/O sequence on page 20 is pretty close to what I would consider an answer for someone who finds this question later (and is at least 50-75% of what a complete answer would be). It mentions pdflush specifically, but it just says that it transmits it to the block device. We still need to establish whether kjournald sits on the other side or if it transmits changes to the page cache (or skips page cache altogether and just does it synchronously which I think I've seen elsewhere...not sure though). – Bratchley Jun 08 '13 at 17:34
  • I'm thinking the answer may be some combination of the redbook and the makelinux links you submitted. – Bratchley Jun 08 '13 at 17:35
  • I found this which is talking about some journal updates being made synchronous via fsync which implies (but doesn't prove) to me that the usual case is asynchronous, which itself implies it goes to the page cache. I still need to find something that more explicitly corroborates this... This may be us getting closer to this one. – Bratchley Jun 08 '13 at 17:42
  • Found the kjournald part of the answer here on the right hand half of section 3.2.1. Journal writes are handled exclusively via kjournald; pdflush sits beside it and handles the actual file data. kjournald is synchronous but the file writes needn't be. Can you put your answer together with all the stuff that's been found so I can accept it? – Bratchley Jun 08 '13 at 17:51
  • Basically, I'm interested in that redbook section, the reference to the makelinux site, the stuff from my immediately previous post, and anything else you feel would be important to someone working the google machine several months down the road. Also, that diagram from your first few posts would probably help people visualize how kjournald can sit beside pdflush and not on top of or below it in the stack. – Bratchley Jun 08 '13 at 17:53
  • We should probably throw kswapd in there too, since I included that in the title when I first posted. That one's pretty straightforward, though. – Bratchley Jun 08 '13 at 18:10
  • I'll try and pull this all together into an answer later tonight after the kids go to bed 8-). Thanks for getting back to me. – slm Jun 08 '13 at 20:38

1 Answer

Before we discuss the specifics regarding pdflush, kjournald, and kswapd, let's first get a little background on the context of what exactly we're talking about in terms of the Linux Kernel.

The GNU/Linux architecture

The architecture of GNU/Linux can be thought of as 2 spaces:

  • User
  • Kernel

Between the User Space and Kernel Space sits the GNU C Library (glibc). This provides the system call interface that connects the kernel to the user-space applications.

The Kernel Space can be further subdivided into 3 levels:

  • System Call Interface
  • Architectural Independent Kernel Code
  • Architectural Dependent Code

The System Call Interface, as its name implies, provides an interface between glibc and the kernel. The Architectural Independent Kernel Code comprises logical units such as the VFS (Virtual File System) and the VMM (Virtual Memory Management). The Architectural Dependent Code consists of the processor- and platform-specific code for a given hardware architecture.

Diagram of GNU/Linux Architecture

[Image: screenshot of the GNU/Linux architecture diagram]

For the rest of this article, we'll be focusing our attention on the VFS and VMM logical units within the Kernel Space.

Subsystems of the GNU/Linux Kernel

[Image: screenshot of the GNU/Linux kernel subsystems diagram]

VFS Subsystem

With a high-level concept of how the GNU/Linux kernel is structured, we can delve a little deeper into the VFS subsystem. This component is responsible for providing access to the various block storage devices, which ultimately map down to a filesystem (ext3/ext4/etc.) on a physical device (HDD/etc.).

Diagram of VFS

[Image: screenshot of the VFS diagram]

This diagram shows how a write() from a user's process traverses the VFS and ultimately works its way down to the device driver, where it's written to the physical storage medium. This is the first place where we encounter pdflush, a daemon which is responsible for flushing dirty data and metadata buffer blocks to the storage medium in the background. The diagram doesn't show this, but there is another daemon, kjournald, which sits alongside pdflush, performing a similar task: writing dirty journal blocks to disk. NOTE: Journal blocks are how filesystems like ext4 & JFS keep track of changes to a file on disk, prior to those changes taking place.
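To make the "record the change before it takes place" idea concrete, here is a toy sketch in Python. It is only an illustration of the general journaling idea, not how kjournald or any real filesystem is implemented; the class `ToyJournal` and its methods `log` and `checkpoint` are hypothetical names invented for this example.

```python
class ToyJournal:
    """Toy model of journaling: changes are logged first, applied later.

    Hypothetical illustration only; real journals are on-disk structures
    with transactions, commit records, and crash recovery.
    """

    def __init__(self):
        self.journal = []  # pending (block, data) records, stands in for journal blocks
        self.disk = {}     # block number -> data, stands in for the filesystem proper

    def log(self, block, data):
        # Record the intended change; the "disk" is not touched yet.
        self.journal.append((block, data))

    def checkpoint(self):
        # Apply every logged change to the "disk", then empty the journal.
        applied = 0
        for block, data in self.journal:
            self.disk[block] = data
            applied += 1
        self.journal.clear()
        return applied
```

The point of the ordering is that a change exists in the journal before it exists in its final location, so an interrupted update can be replayed from the journal.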

The above details are discussed further in this paper.

Overview of write() steps

To provide a simple overview of the I/O subsystem operations, we'll use an example where the function write() is called by a User Space application.

  1. A process requests to write a file through the write() system call.
  2. The kernel updates the page cache mapped to the file.
  3. A pdflush kernel thread takes care of flushing the page cache to disk.
  4. The file system layer puts each block buffer together into a bio struct (refer to 1.4.3, “Block layer” on page 23) and submits a write request to the block device layer.
  5. The block device layer gets requests from upper layers and performs an I/O elevator operation and puts the requests into the I/O request queue.
  6. A device driver such as SCSI or another device-specific driver will take care of the write operation.
  7. The disk device firmware performs hardware operations like head seek, rotation, and data transfer to the sector on the platter.
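Seen from user space, only step 1 is visible: the write() call returns once the data is in the page cache, and the remaining steps happen later in the background. A minimal Python sketch of this, assuming a POSIX-like system; `write_then_sync` is a hypothetical helper name, and the fsync() call is an addition that asks the kernel to push the data to the device immediately instead of waiting for the background flush.

```python
import os
import tempfile


def write_then_sync(path, payload):
    """Write payload (bytes) to path, then ask the kernel to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        # os.write() may write fewer bytes than requested, so loop until done.
        while written < len(payload):
            written += os.write(fd, payload[written:])
        # At this point the data normally sits in the page cache (step 2).
        # fsync() blocks until the kernel reports it has been written out.
        os.fsync(fd)
    finally:
        os.close(fd)
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = os.path.join(tmp, "demo.txt")
        print(write_then_sync(target, b"hello page cache\n"))
```

Without the fsync() call, the function would return as soon as the page cache was updated, and the timing of the actual device write would be left to the kernel.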

VMM Subsystem

Continuing our deeper dive, we can now look into the VMM subsystem. This component is responsible for maintaining consistency between main memory (RAM), swap, and the physical storage medium. The primary mechanism for maintaining consistency is bdflush. As pages of memory are deemed dirty they need to be synchronized with the data that's on the storage medium. bdflush will coordinate with pdflush daemons to synchronize this data with the storage medium.
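One way to watch dirty pages accumulate and drain on a running Linux system is to read the Dirty and Writeback fields of /proc/meminfo. The sketch below is a minimal example under that assumption; `parse_meminfo` and `dirty_and_writeback_kb` are hypothetical helper names, and the sample text in the usage section is made up for illustration.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def dirty_and_writeback_kb(path="/proc/meminfo"):
    """Return (Dirty, Writeback) in kB, or None for a field that is missing."""
    with open(path) as f:
        info = parse_meminfo(f.read())
    return info.get("Dirty"), info.get("Writeback")


if __name__ == "__main__":
    # Made-up sample values, only to show the parsing.
    sample = "MemTotal:       16384 kB\nDirty:            128 kB\nWriteback:          0 kB\n"
    print(parse_meminfo(sample))
```

Calling `dirty_and_writeback_kb()` repeatedly while a large file is being written should show the Dirty value rise and then fall as the data is flushed.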

Diagram of VMM

[Image: screenshot of the VMM diagram]

Swap

When system memory becomes scarce or the kernel swap timer expires, the kswapd daemon will attempt to free up pages. So long as the number of free pages remains above free_pages_high, kswapd will do nothing. However, if the number of free pages drops below that threshold, kswapd will start the page reclaiming process. After kswapd has marked pages for relocation, bdflush will take care of synchronizing any outstanding changes to the storage medium, through the pdflush daemons.
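The threshold behaviour described above can be sketched as a toy model. This is not kernel code: the `Page` class and the `reclaim` function are hypothetical, candidate pages are simply taken from the front of a list, and treating "exactly at the threshold" as "do nothing" is an assumption of this sketch.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    dirty: bool


def reclaim(pages, free_pages, free_pages_high):
    """Return (to_write_back, reclaimed) lists of page numbers.

    Toy model: do nothing while free_pages is at or above free_pages_high;
    otherwise pick enough candidate pages to get back to the threshold.
    Dirty candidates must be written back before they can be reused.
    """
    if free_pages >= free_pages_high:
        return [], []
    needed = free_pages_high - free_pages
    candidates = pages[:needed]
    to_write_back = [p.number for p in candidates if p.dirty]
    reclaimed = [p.number for p in candidates]
    return to_write_back, reclaimed
```

In this model the `to_write_back` list plays the role of the work handed to the flushing daemons, while clean pages can be reclaimed without any I/O.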

slm
  • I'm going to wait a day before I accept this as an answer and award the bounty so that it stays on the "bounty" page. That way anyone who's seen it before has a chance to notice it has an answer now. – Bratchley Jun 09 '13 at 11:43
  • Thanks again, BTW. You really went all-out on researching this. – Bratchley Jun 09 '13 at 11:45