Supporting block I/O contexts

By Jonathan Corbet
June 18, 2012

Memory storage devices, including flash, are essentially just random-access devices with some peculiar restrictions. Given direct access to the device, Linux kernel developers could certainly come up with drivers that would provide optimal performance and device lifetime. In the real world, though, these devices are hidden behind their own proprietary operating systems and software stacks; much of the real (commercial) value seems to be in the software bundled inside. As a result, the kernel must try to coax the device's firmware into doing an optimal job. Over time, the storage industry has added various mechanisms by which an operating system can pass hints down to the device; the "trim" or "discard" mechanism is one of those. Newer eMMC and universal flash storage (UFS) devices add a new hint in the form of "contexts"; patches exist to support this feature, but they seem to have raised more questions than they have answered.

The standards documents describing contexts do not appear to be widely available, or at least findable. From what your editor has been able to divine, a "context" is a small number attached to an I/O request that is intended to help the device optimize the execution of that request. Contexts are meant to distinguish different types of I/O, keeping large, sequential operations separate from small, random requests. I/O can be placed into a "large unit" context, where the operating system promises to send large requests and, possibly, not attempt to read the data back until the context has been closed.

Saugata Das recently posted a small patch set adding context support to the ext4 filesystem and the MMC block driver. At the lower level, context numbers are associated with block I/O requests by storing the number in the newly-added bi_context (in struct bio) and context (in struct request) fields. The virtual filesystem layer takes responsibility for setting those fields, but, in the end, it defers to the actual filesystems to come up with the proper context numbers. There is a new address space operation (called get_context()) by which the VFS can call into the filesystem code to obtain a context number for a specific request. The block layer has been modified to avoid merging block I/O requests if those requests have been assigned to different contexts.
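The merge restriction can be pictured with a minimal user-space sketch. The sketch_request structure and sketch_can_merge() function below are hypothetical stand-ins invented for illustration, not code from the patch set; the use of zero for "no context" is likewise an assumption, and the kernel's real merge checks look at far more than is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the kernel's struct request; only the
 * fields needed for this illustration are modeled. */
struct sketch_request {
	unsigned long long start_sector; /* first sector covered */
	unsigned int nr_sectors;         /* length in sectors */
	unsigned int context;            /* 0 means "no context" (assumed) */
};

/* Return true if b could be appended to a: the two requests must
 * carry the same context number and be contiguous on the device. */
bool sketch_can_merge(const struct sketch_request *a,
		      const struct sketch_request *b)
{
	assert(a != NULL && b != NULL);

	if (a->context != b->context)
		return false; /* requests in different contexts stay separate */

	return a->start_sector + a->nr_sectors == b->start_sector;
}
```

Two adjacent requests thus merge only when their context numbers match; a pair that is contiguous on the device but tagged with different contexts is sent down as two separate requests.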

There was little discussion of the lower-level changes, which apparently make sense to the developers who have examined them. The filesystem-level changes have seen rather more discussion, though. Saugata's patch set only touches the ext4 filesystem; those changes cause ext4 to use the inode number of the file under I/O as the context number. Thus, all I/O requests to a single file will be assigned to the same context, while requests to different files would go into different contexts (within limits: eMMC hardware, for example, only supports 15 contexts, so many inode numbers will be mapped onto a single context number at the lower levels). The question that came up was: is using the inode number the right policy? Coming up with an answer involves addressing two independent questions: (1) what does the "context" mechanism actually do, and (2) how can Linux filesystems provide the best possible context information to the storage devices?
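How inode numbers get folded onto the small set of hardware contexts is not spelled out above. One plausible approach, shown purely as an assumption, is a simple modulo; the function name, the modulo fold, and the choice to keep zero free for "no context" are all invented for this sketch.

```c
#include <assert.h>

#define SKETCH_NR_CONTEXTS 15u /* eMMC limit mentioned in the article */

/* Fold an inode number into the range 1..SKETCH_NR_CONTEXTS.
 * The modulo fold, and reserving 0 for "no context", are assumptions
 * made for this sketch rather than the behavior of the patch. */
unsigned int sketch_inode_to_context(unsigned long ino)
{
	unsigned int ctx = (unsigned int)(ino % SKETCH_NR_CONTEXTS) + 1u;

	assert(ctx >= 1u && ctx <= SKETCH_NR_CONTEXTS);
	return ctx;
}
```

With a fold like this, inodes 1 and 16 land in the same context, which is exactly the kind of collision that makes "one context per file" an approximation at best.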

Arnd Bergmann (who has spent a lot of time understanding the details of how flash storage works) has noted that the standard is deliberately vague on what the context mechanism does; the authors wanted to create something that would outlive any specific technology. He went on to say:

That said, I think it is rather clear what the authors of the spec had in mind, and there is only one reasonable implementation given current flash technology: You get something like a log structured file system with 15 contexts, where each context writes to exactly one erase block at a given time.

The effect of such an implementation would be to concentrate data written under any one context into the same erase block(s). Given that, there are at least a couple of ways to use contexts to optimize I/O performance.
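The behavior Arnd describes can be modeled with a toy simulation. Everything here is hypothetical: the structure, the function names, the four-page erase block, and the zero-based context index are chosen only to keep the example small, and no real firmware is claimed to work this way.

```c
#include <assert.h>

#define SKETCH_NR_CONTEXTS     15 /* contexts, as in Arnd's description */
#define SKETCH_PAGES_PER_BLOCK 4  /* tiny erase block, for illustration only */

struct sketch_device {
	int open_block[SKETCH_NR_CONTEXTS]; /* erase block open per context, -1 if none */
	int used_pages[SKETCH_NR_CONTEXTS]; /* pages already written to that block */
	int next_free_block;                /* next never-used erase block */
};

void sketch_device_init(struct sketch_device *dev)
{
	for (int i = 0; i < SKETCH_NR_CONTEXTS; i++) {
		dev->open_block[i] = -1;
		dev->used_pages[i] = 0;
	}
	dev->next_free_block = 0;
}

/* Append one page under the given context (0-based index here) and
 * return the number of the erase block the page landed in. */
int sketch_write_page(struct sketch_device *dev, int context)
{
	assert(context >= 0 && context < SKETCH_NR_CONTEXTS);

	/* Open a fresh erase block if this context has none, or if its
	 * current block is full. */
	if (dev->open_block[context] < 0 ||
	    dev->used_pages[context] == SKETCH_PAGES_PER_BLOCK) {
		dev->open_block[context] = dev->next_free_block++;
		dev->used_pages[context] = 0;
	}
	dev->used_pages[context]++;
	return dev->open_block[context];
}
```

Interleaved writes from two contexts end up in two different erase blocks in this model, rather than being mixed page by page as they would be with a single write stream.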

For example, one could try to concentrate data with the same expected lifetime, so that, when part of an erase block is deleted, all of the data in that erase block will be deleted. Using the inode number as the context number could have that effect; deleting the file associated with that inode will delete all of its blocks at the same time. So, as long as the file is not subject to random writes (as, say, a database file might be), using contexts in this manner should reduce the amount of garbage collection and read-modify-write cycles needed when a file is deleted.

Another helpful approach might be to use contexts to separate large, long-lived files from those that are shorter and more ephemeral. The larger files would be well-placed on the medium, and the more volatile data would be concentrated into a smaller number of erase blocks. In this case, using the inode number to identify contexts may or may not work well. Large files would be nicely separated, but the smaller files could be separated from each other as well, which may not be desirable: if several small files would fit into a single erase block, performance might be improved if all of those files were written in the same context. In this case, some other policy might be more advisable.

But what should that policy be? Arnd suggested that using the inode number of the directory containing the file might work better. Various commenters thought that using the ID of the process writing to the file could work, though there are some potential difficulties when multiple processes write the same file. Ted Ts'o suggested that grouping files written by the same process in a short period of time could give good results. Also useful, he thought, might be to look at the size of the file relative to the device's erase block size; files much smaller than an erase block would be placed into the same context, while larger files would get a context of their own.
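Ted's size-based idea might look something like the following sketch. The function name, the reserved shared context, and especially the "less than a quarter of an erase block" threshold are assumptions; the discussion did not settle on any particular numbers.

```c
#include <assert.h>

#define SKETCH_NR_CONTEXTS    15u /* eMMC limit mentioned in the article */
#define SKETCH_SHARED_CONTEXT 1u  /* context shared by all small files (assumed) */

/* Pick a context from the file size relative to the erase block size. */
unsigned int sketch_size_policy(unsigned long ino,
				unsigned long long file_size,
				unsigned long long erase_block_size)
{
	assert(erase_block_size > 0);

	/* "Much smaller" is taken to mean under a quarter of an erase
	 * block; that threshold is arbitrary. */
	if (file_size < erase_block_size / 4)
		return SKETCH_SHARED_CONTEXT;

	/* Larger files are spread over contexts 2..15 by inode number. */
	return (unsigned int)(ino % (SKETCH_NR_CONTEXTS - 1u)) + 2u;
}
```

With only 15 contexts available, large files cannot each have a context entirely to themselves; the best this sketch can do is keep them away from the shared small-file context.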

A related idea, also from Ted, was to look at the expected I/O patterns. If an existing file is opened for write access, chances are good that a random I/O pattern will result. Files opened with O_CREAT, instead, are more likely to be sequential; separating those two types of files into different contexts would likely yield better results. Some flags used with posix_fadvise() could also be used in this way. There are undoubtedly other possibilities as well. Choosing a policy will have to be done with care; poor use of contexts could just as easily reduce performance and longevity instead of increasing them.
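The open-flags heuristic can be sketched in user space using the standard O_* constants from <fcntl.h>. The enum, both function names, and the context numbers chosen are hypothetical. Note, too, that O_CREAT does not guarantee that a file was actually created, so a real implementation would need to know whether the file already existed.

```c
#include <assert.h>
#include <fcntl.h>

enum sketch_pattern {
	SKETCH_PATTERN_NONE,       /* read-only open: no write context needed */
	SKETCH_PATTERN_SEQUENTIAL, /* opened with O_CREAT: probably written once, in order */
	SKETCH_PATTERN_RANDOM,     /* existing file opened for writing */
};

/* Guess the likely write pattern from the flags passed to open(). */
enum sketch_pattern sketch_guess_pattern(int open_flags)
{
	int mode = open_flags & O_ACCMODE;

	if (mode != O_WRONLY && mode != O_RDWR)
		return SKETCH_PATTERN_NONE;
	if (open_flags & O_CREAT)
		return SKETCH_PATTERN_SEQUENTIAL;
	return SKETCH_PATTERN_RANDOM;
}

/* Map the guessed pattern to a context number; the numbers are arbitrary. */
unsigned int sketch_pattern_to_context(enum sketch_pattern pattern)
{
	switch (pattern) {
	case SKETCH_PATTERN_SEQUENTIAL:
		return 1u;
	case SKETCH_PATTERN_RANDOM:
		return 2u;
	case SKETCH_PATTERN_NONE:
		return 0u; /* 0 stands for "no context" in this sketch */
	}
	assert(!"unknown pattern");
	return 0u;
}
```

A policy along these lines would put a file opened with O_WRONLY|O_CREAT into the "sequential" context, while an open with plain O_RDWR would fall into the "random" one.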

Figuring all of this out will certainly take some time, especially since devices with actual support for this feature are still relatively rare. Interestingly, according to Arnd, there may be an opportunity in getting ext4 to supply context information early:

Having code in ext4 that uses the contexts will at least make it more likely that the firmware optimizations are based on ext4 measurements rather than some other file system or operating system. From talking with the emmc device vendors, I can tell you that ext4 is very high on the list of file systems to optimize for, because they all target Android products.

Ext4 is, of course, the filesystem of choice for current Android systems. So, conceivably, an ext4 implementation could drive hardware behavior in the same way that much desktop hardware is currently designed around what Windows does.

Given that the patches are relatively small and that policies can be changed in the future without user-space compatibility issues, chances are good that something will be merged into the mainline as soon as the 3.6 development cycle. Then it will just be a matter of seeing what the hardware manufacturers actually do and adjusting accordingly. With luck, the eventual result will be longer-lasting, better-performing memory storage devices.

Index entries for this article
Kernel	Block layer
Kernel	Filesystems/ext4
Kernel	Solid-state storage devices

Supporting block I/O contexts

Posted Jun 21, 2012 10:38 UTC (Thu) by etienne (guest, #25256)

At a minimum, the context should differ between data, metadata, and journal writes. That would be nice for FAT at least and, if possible, should be the default for all filesystems.

Supporting block I/O contexts

Posted Jun 22, 2012 9:11 UTC (Fri) by arnd (subscriber, #8866)

We already have the "data tag" support, which is used to annotate metadata. In eMMC 4.5, any request can be tagged as metadata, assigned to one of 5 to 15 contexts, or left unspecified. Any request that a file system flags as a REQ_META write uses the data tag.

Since FAT is both very simple and very common, a lot of the flash devices actually have logic in them to detect the access patterns even without those annotations. Usually the FAT area is known to be at the beginning of the device (partition-less SD cards and older USB sticks) or an erase block that receives lots of random I/O is expected to be the FAT. Further, devices often expect a FAT cluster size of 32KB, so all requests of that size are taken to be data while all smaller ones are treated as directory updates.

Supporting block I/O contexts

Posted Jun 21, 2012 11:01 UTC (Thu) by gioele (subscriber, #61675)

Slightly off topic, but why do flash vendors not focus on flash-specific file systems like the Nokia-sponsored UBIFS instead of hacking features over traditional file systems like ext4?

Supporting block I/O contexts

Posted Jun 21, 2012 14:13 UTC (Thu) by drag (guest, #31333)

It appears that MTD file systems are designed for small systems. I don't know about UBIFS in particular, but typically they work well only on small devices.

Writing your own file system for MTD that would scale up the way XFS or ext4 can would require years of development, and for Windows this is a virtual impossibility unless you can get Microsoft on board. Meanwhile, people have been writing firmware that translates block storage onto memory devices for decades.

Another thing is that compatibility with existing interfaces is important. They want to have the ability for people to purchase and use the devices with the minimal amount of effort.

Supporting block I/O contexts

Posted Jun 22, 2012 9:12 UTC (Fri) by arnd (subscriber, #8866)

UBIFS sits on top of the UBI, which sits on top of raw NAND flash, while the hardware industry is now moving towards block based storage such as eMMC. While it would be possible to do UBI on top of an eMMC to get some of the performance back, in the discussion we had between Linaro and some of the eMMC vendors, we ended up discarding that idea. Instead, focusing on improving performance on ext4 and btrfs on flash based block devices is something we are planning to spend more time on together.

Context documentation

Posted Jun 21, 2012 16:17 UTC (Thu) by roman (guest, #24157)

"The standards documents describing contexts do not appear to be widely available...."

Contexts are documented in the eMMC 4.5 specification. Last I knew, anyone could register on the JEDEC web site and download the PDF.

Supporting block I/O contexts

Posted Jun 22, 2012 17:32 UTC (Fri) by giraffedata (guest, #1954)

This sounds like a case of overgeneralization. There's no way an abstract, undefined concept of "context" of a write (or is it of the data written?) can be useful. They should just have made it (and called it) expiration group.

Separately, an expected lifetime value would help with the placement goals that apparently underlie this feature.