Supporting block I/O contexts
The standards documents describing contexts do not appear to be widely available, or at least findable. From what your editor has been able to divine, a "context" is a small number attached to an I/O request to help the device optimize the execution of that request. Contexts are meant to distinguish different types of I/O, keeping large, sequential operations separate from small, random requests. I/O can also be placed into a "large unit" context, where the operating system promises to send large requests and, possibly, not to attempt to read the data back until the context has been closed.
Saugata Das recently posted a small patch set adding context support to the ext4 filesystem and the MMC block driver. At the lower level, context numbers are associated with block I/O requests by storing the number in the newly-added bi_context (in struct bio) and context (in struct request) fields. The virtual filesystem layer takes responsibility for setting those fields, but, in the end, it defers to the actual filesystems to come up with the proper context numbers. There is a new address space operation (called get_context()) by which the VFS can call into the filesystem code to obtain a context number for a specific request. The block layer has been modified to avoid merging block I/O requests if those requests have been assigned to different contexts.
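The merging rule described above can be illustrated with a small model. What follows is only a sketch in Python, not the kernel code from the patch set: the Request class, can_merge(), and merge_queue() are hypothetical stand-ins, and the use of zero for "no context" is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Toy stand-in for a block I/O request; not the kernel's struct request."""
    start_sector: int
    nr_sectors: int
    context: int  # assumption for this sketch: 0 means "no context assigned"

    @property
    def end_sector(self) -> int:
        return self.start_sector + self.nr_sectors


def can_merge(a: Request, b: Request) -> bool:
    """Allow b to be appended to a only if both share a context and are contiguous."""
    if a.context != b.context:
        return False
    return a.end_sector == b.start_sector


def merge_queue(requests):
    """Return a sector-sorted list in which mergeable neighbors are combined."""
    merged = []
    for req in sorted(requests, key=lambda r: r.start_sector):
        if merged and can_merge(merged[-1], req):
            merged[-1].nr_sectors += req.nr_sectors
        else:
            # Copy the request so the caller's objects are left untouched.
            merged.append(Request(req.start_sector, req.nr_sectors, req.context))
    return merged


queue = [Request(0, 8, 1), Request(8, 8, 1), Request(16, 8, 2)]
result = merge_queue(queue)
```

Here the first two requests are contiguous and share context 1, so they are combined; the third is also contiguous but belongs to context 2 and is left alone.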
There was little discussion of the lower-level changes, which apparently make sense to the developers who have examined them. The filesystem-level changes have seen rather more discussion, though. Saugata's patch set only touches the ext4 filesystem; those changes cause ext4 to use the inode number of the file under I/O as the context number. Thus, all I/O requests to a single file will be assigned to the same context, while requests to different files would go into different contexts (within limits—eMMC hardware, for example, only supports 15 contexts, so many inode numbers will be mapped onto a single context number at the lower levels). The question that came up was: is using the inode number the right policy? Coming up with an answer involves addressing two independent questions: (1) what does the "context" mechanism actually do?, and (2) how can Linux filesystems provide the best possible context information to the storage devices?
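How inode numbers get folded into the small set of hardware contexts is not spelled out here. One simple possibility, shown purely as an assumption, is a modulo operation; the function name and the choice to reserve zero for "no context" are hypothetical.

```python
EMMC_MAX_CONTEXTS = 15  # the limit mentioned above for eMMC hardware


def inode_to_context(inode_number: int, max_contexts: int = EMMC_MAX_CONTEXTS) -> int:
    """Fold an inode number into the range 1..max_contexts.

    Assumption for this sketch: context 0 is reserved to mean "no context",
    and a plain modulo is used; a real driver may well do something else.
    """
    return (inode_number % max_contexts) + 1


# Inodes 15 apart collide on the same context in this scheme.
contexts = [inode_to_context(ino) for ino in (1, 16, 31)]
```

Whatever the mapping, the consequence is the same: with many files open for writing, unrelated files will inevitably share a context.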
Arnd Bergmann (who has spent a lot of time understanding the details of how flash storage works) has noted that the standard is deliberately vague on what the context mechanism does; the authors wanted to create something that would outlive any specific technology. He went on to say:
The effect of such an implementation would be to concentrate data written under any one context into the same erase block(s). Given that, there are at least a couple of ways to use contexts to optimize I/O performance.
For example, one could try to concentrate data with the same expected lifetime, so that, when part of an erase block is deleted, all of the data in that erase block will be deleted. Using the inode number as the context number could have that effect; deleting the file associated with that inode will delete all of its blocks at the same time. So, as long as the file is not subject to random writes (as, say, a database file might be), using contexts in this manner should reduce the amount of garbage collection and read-modify-write cycles needed when a file is deleted.
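The lifetime argument can be made concrete with a toy model of a flash device. This is a sketch under strong assumptions: the four-page erase block, the place() function, and the rule that each context fills its own erase block are all hypothetical simplifications, not a description of any real controller.

```python
PAGES_PER_ERASE_BLOCK = 4  # hypothetical and deliberately tiny


def place(writes, context_of):
    """Append each one-page write to the open erase block of its context.

    `writes` is a sequence of file names, one entry per page written.
    Returns a list of erase blocks, each a list of the file names it holds.
    """
    blocks = []
    open_block = {}  # context -> index of that context's open erase block
    for file_name in writes:
        ctx = context_of(file_name)
        idx = open_block.get(ctx)
        if idx is None or len(blocks[idx]) == PAGES_PER_ERASE_BLOCK:
            blocks.append([])
            idx = len(blocks) - 1
            open_block[ctx] = idx
        blocks[idx].append(file_name)
    return blocks


def pages_to_copy_on_delete(blocks, deleted):
    """Count live pages that share an erase block with the deleted file."""
    return sum(
        sum(1 for f in block if f != deleted)
        for block in blocks
        if deleted in block
    )


writes = ["a", "b"] * 4                  # two files written in interleaved fashion
shared = place(writes, lambda f: 0)      # everything in a single context
per_file = place(writes, lambda f: f)    # one context per file
```

In the single-context case, deleting file "a" leaves four live pages of "b" stranded in partially invalid erase blocks, all of which would have to be copied before those blocks could be erased. With one context per file, the count is zero.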
Another helpful approach might be to use contexts to separate large, long-lived files from those that are shorter and more ephemeral. The larger files would be well-placed on the medium, and the more volatile data would be concentrated into a smaller number of erase blocks. In this case, using the inode number to identify contexts may or may not work well. Large files would be nicely separated, but the smaller files could be separated from each other as well, which may not be desirable: if several small files would fit into a single erase block, performance might be improved if all of those files were written in the same context. In this case, some other policy might be more advisable.
But what should that policy be? Arnd suggested that using the inode number of the directory containing the file might work better. Various commenters thought that using the ID of the process writing to the file could work, though there are some potential difficulties when multiple processes write the same file. Ted Ts'o suggested that grouping files written by the same process in a short period of time could give good results. Also useful, he thought, might be to look at the size of the file relative to the device's erase block size; files much smaller than an erase block would be placed into the same context, while larger files would get a context of their own.
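Ted's size-based idea might look something like the following sketch. The constants, the function name, and especially the "one quarter of an erase block" threshold for "much smaller" are assumptions chosen for illustration; nothing of the sort appears in the posted patches.

```python
SHARED_SMALL_FILE_CONTEXT = 1   # hypothetical: all small files share this context
FIRST_LARGE_FILE_CONTEXT = 2
MAX_CONTEXT = 15                # eMMC limit
SMALL_FILE_DIVISOR = 4          # assumption: "much smaller" means under 1/4 erase block


def context_by_size(inode_number: int, file_size: int, erase_block_size: int) -> int:
    """Group small files into one context; spread large files over the rest."""
    if file_size < erase_block_size // SMALL_FILE_DIVISOR:
        return SHARED_SMALL_FILE_CONTEXT
    span = MAX_CONTEXT - FIRST_LARGE_FILE_CONTEXT + 1  # 14 contexts for large files
    return FIRST_LARGE_FILE_CONTEXT + inode_number % span


ERASE_BLOCK = 4 * 1024 * 1024  # hypothetical 4MB erase block
small = context_by_size(inode_number=100, file_size=4096, erase_block_size=ERASE_BLOCK)
large = context_by_size(inode_number=100, file_size=64 * 1024 * 1024,
                        erase_block_size=ERASE_BLOCK)
```

One practical difficulty with any such policy is that the filesystem often does not know how large a file will become at the time the first write arrives.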
A related idea, also from Ted, was to look at the expected I/O patterns. If an existing file is opened for write access, chances are good that a random I/O pattern will result. Files opened with O_CREAT, instead, are more likely to be sequential; separating those two types of files into different contexts would likely yield better results. Some flags used with posix_fadvise() could also be used in this way. There are undoubtedly other possibilities as well. Choosing a policy will have to be done with care; poor use of contexts could just as easily reduce performance and longevity instead of increasing them.
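A policy keyed on open flags could be sketched as below. The context numbers and the function name are hypothetical, and the flags are treated as a hint only: O_CREAT on a file that already exists creates nothing, so a real implementation would need to be smarter than this.

```python
import os

NO_CONTEXT = 0          # assumption: 0 means the request carries no context
SEQUENTIAL_CONTEXT = 1  # hypothetical context for newly created files
RANDOM_CONTEXT = 2      # hypothetical context for existing files opened for write


def context_from_open_flags(flags: int) -> int:
    """Guess an I/O pattern, and thus a context, from open() flags."""
    writable = flags & (os.O_WRONLY | os.O_RDWR)
    if not writable:
        return NO_CONTEXT
    if flags & os.O_CREAT:
        # New files are more likely to be written sequentially.
        return SEQUENTIAL_CONTEXT
    # An existing file opened for writing suggests random updates.
    return RANDOM_CONTEXT


new_file = context_from_open_flags(os.O_WRONLY | os.O_CREAT)
existing = context_from_open_flags(os.O_RDWR)
read_only = context_from_open_flags(os.O_RDONLY)
```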
Figuring all of this out will certainly take some time, especially since devices with actual support for this feature are still relatively rare. Interestingly, according to Arnd, there may be an opportunity in getting ext4 to supply context information early:
Ext4 is, of course, the filesystem of choice for current Android systems. So, conceivably, an ext4 implementation could drive hardware behavior in the same way that much desktop hardware is currently designed around what Windows does.
Given that the patches are relatively small and that policies can be changed in the future without user-space compatibility issues, chances are good that something will be merged into the mainline as soon as the 3.6 development cycle. Then it will just be a matter of seeing what the hardware manufacturers actually do and adjusting accordingly. With luck, the eventual result will be longer-lasting, better-performing memory storage devices.
Supporting block I/O contexts
Posted Jun 21, 2012 10:38 UTC (Thu) by etienne (guest, #25256)
Supporting block I/O contexts
Posted Jun 22, 2012 9:11 UTC (Fri) by arnd (subscriber, #8866)
Since FAT is both very simple and very common, a lot of the flash devices actually have logic in them to detect the access patterns even without those annotations. Usually the FAT area is known to be at the beginning of the device (partition-less SD cards and older USB sticks) or an erase block that receives lots of random I/O is expected to be the FAT. Further, devices often expect a FAT cluster size of 32KB, so all requests of that size are taken to be data while all smaller ones are treated as directory updates.
Supporting block I/O contexts
Posted Jun 21, 2012 11:01 UTC (Thu) by gioele (subscriber, #61675)
Supporting block I/O contexts
Posted Jun 21, 2012 14:13 UTC (Thu) by drag (guest, #31333)
Writing your own filesystem for MTD that scales the way XFS or ext4 does would require years of development, and for Windows it is a virtual impossibility unless you can get Microsoft on board. Meanwhile, people have been writing firmware that translates block storage onto memory devices for decades.
Another thing is that compatibility with existing interfaces is important. Vendors want people to be able to purchase and use the devices with a minimal amount of effort.
Supporting block I/O contexts
Posted Jun 22, 2012 9:12 UTC (Fri) by arnd (subscriber, #8866)
Context documentation
Posted Jun 21, 2012 16:17 UTC (Thu) by roman (guest, #24157)
Contexts are documented in the eMMC 4.5 specification. Last I knew, anyone could register on the JEDEC web site and download the PDF.
Supporting block I/O contexts
Posted Jun 22, 2012 17:32 UTC (Fri) by giraffedata (guest, #1954)
This sounds like a case of overgeneralization. There's no way an abstract, undefined concept of "context" of a write (or is it of the data written?) can be useful. They should just have made it (and called it) expiration group. Separately, an expected lifetime value would help with the placement goals that apparently underlie this feature.