Modular, switchable I/O schedulers
Given that set of demands, it is not surprising that there are multiple I/O schedulers in the Linux kernel. The deadline scheduler works by enforcing a maximum latency for all requests. The anticipatory scheduler briefly stalls I/O after a read request completes with the idea that another, nearby read is likely to come in quickly. The completely fair queueing scheduler (recently updated by Jens Axboe) applies a bandwidth allocation policy. And there is a simple "noop" scheduler for devices, such as RAM disks, which do not benefit from fancy scheduling schemes (though such devices usually short out the request queue entirely).
The kernel has a nice, modular scheme for defining and using I/O schedulers. What it lacks, however, is any flexible way of letting a system administrator choose a scheduler. I/O schedulers are built into the kernel code, and exactly one of them can be selected - for all disks in the system - at boot time with the elevator= parameter. There is no way to use different schedulers for different drives, or to change schedulers once the system boots. The chosen scheduler is used, and any others configured into the system simply sit there and consume memory.
Jens Axboe has recently posted a patch which improves on this situation. With this patch in place, I/O schedulers can be built as loadable modules (though, as Jens cautions, at least one scheduler must be linked directly into the kernel or the system will have a hard time booting). A new scheduler attribute in each drive's sysfs tree (/sys/block/drive/queue/scheduler) lists the available schedulers, noting which one is active at any given time. Changing schedulers is simply a matter of writing the name of the new scheduler into that attribute.
The patch is long, but the amount of work required to support switchable I/O schedulers wasn't all that great. The internal structures describing elevators have been split apart to reflect the more dynamic nature of things; struct elevator_ops contains the scheduler methods, while struct elevator_type holds the metadata which describes an I/O scheduler to the kernel. The new elevator_queue structure glues an instance of an I/O scheduler to a specific request queue. Updating the mainline schedulers to work with the new structures required a fair number of relatively straightforward code changes. Each scheduler now also has module initialization and cleanup functions which have been separated from the code needed to set up or destroy an elevator for a specific queue.
One interesting question is: what should be done with the currently queued block requests when an I/O scheduler change is requested? One could imagine requeueing all of those requests with the new scheduler in order to let it have its say immediately. The simpler approach, which was chosen for this patch, is to block the creation of new requests and wait for the queue to empty out. Once all outstanding I/O has been finished up, the old scheduler can be shut down and moved out of the way.
There have been no (public) objections to the patch; chances are it will find its way into the mainline sometime after 2.6.9 comes out.
Index entries for this article
Kernel | Block layer/I/O scheduling
Kernel | Elevator
Kernel | I/O scheduler
It's not only RAM disks
Posted Sep 23, 2004 18:31 UTC (Thu) by smurf (subscriber, #17840) [Link]
Consider USB sticks, commonly accessed via SCSI emulation.
They don't need a scheduler policy, yet from the PoV of Linux they look like normal disks.
FLASH drives do need a scheduler policy
Posted Sep 29, 2004 18:12 UTC (Wed) by BrucePerens (guest, #2510) [Link]
FLASH drives like USB sticks have a limited write lifetime. They really should have a scheduler policy, and that policy should be to keep frequently-written structures (superblock, FS metadata, directories) in core, consolidate their writes, and write them out infrequently. This is an area where we can come into conflict with the filesystem, which may have its own idea of what should be written atomically. BSD-style IO serialization (defining the order in which some critical data must be written, something like tagged queueing but for filesystems) might work best with this scheme, as it would communicate to the scheduler more information about what data it can freely re-arrange and what needs to have its order respected.
Thanks
Bruce
FLASH drives do need a scheduler policy
Posted Sep 29, 2004 18:17 UTC (Wed) by corbet (editor, #1) [Link]
Ah, but decisions like "don't sync the superblock and metadata too often" are not block-level issues, and thus have nothing to do with the I/O scheduler. All filesystems must make their own decisions on where to put data and when to force it to disk - that's higher level stuff. The I/O schedulers discussed in this article, I believe, don't have a role to play there.
FLASH drives do need a scheduler policy
Posted Sep 29, 2004 18:40 UTC (Wed) by smurf (subscriber, #17840) [Link]
The scheduler can of course notice that a block has been written three times in the last second, and hold further writes for some time.
The downside is that this is excessively unsafe WRT data integrity.
If that's a problem, use a better filesystem than FAT (MP3 sticks, digital cameras, ad nauseam).
FLASH drives do need a scheduler policy
Posted Sep 30, 2004 7:45 UTC (Thu) by axboe (subscriber, #904) [Link]
No, that's absolutely the wrong layer in which to attempt to solve that problem; Jon is right. You should use a suitable file system for such media that has awareness of its limitations.
FLASH drives do need a scheduler policy
Posted Oct 1, 2004 0:45 UTC (Fri) by dmaxwell (guest, #14010) [Link]
Technically that is true. However, the "killer app" for memory sticks is physically moving data from one system to another. Oftentimes, the systems are running different platforms. My stick can go from a Mac running OS 9 to Windows 2000 to Linux all in one day. My stick would be useless if I used a filesystem "aware of flash memory limitations". The BSDs and Linux offer an embarrassment of riches in filesystems. We can tailor filesystems to the job at hand; imagine that! If you want to move media between platforms, then only one filesystem suffices, regardless of its (many) flaws.
I have to use FAT32 if I want to employ my stick as a universal device.
FLASH drives do need a scheduler policy
Posted Oct 1, 2004 6:29 UTC (Fri) by axboe (subscriber, #904) [Link]
It still doesn't change the fact that you cannot solve this problem at the block layer, as you don't have enough information to do so - all you get is a start offset and length of where to write the data. If you get writes for the same blocks every few seconds, you must look elsewhere to fix it.
And FAT32 is fine to use on a flash stick.
FLASH drives do need a scheduler policy
Posted Oct 1, 2004 16:51 UTC (Fri) by BrucePerens (guest, #2510) [Link]
Jens,
Yes, that's true. However, it can be fixed, if the filesystem communicates either a write barrier (write this block before any blocks I send down after it) or ordering information (write block X before block Y, order of block Z doesn't matter). Once it is so easy to work on I/O scheduling, the need for this will become so evident that there is no question that it will be done.
Bruce
FLASH drives do need a scheduler policy
Posted Oct 3, 2004 9:48 UTC (Sun) by axboe (subscriber, #904) [Link]
Bruce,
We already have write barriers, and it doesn't solve the entire problem. It will make any given fs more flash friendly indeed, but it's still quite a bit away from a fs specifically designed to minimize drive wear.
Your last sentence doesn't make sense.
Jens
FLASH drives do need a scheduler policy
Posted Oct 3, 2004 12:06 UTC (Sun) by BrucePerens (guest, #2510) [Link]
Jens,
OK. I think one of the BSDs goes a bit further than write barriers in communicating time information. But I see your point that this cannot go all of the way in reducing FLASH wear.
Thanks
Bruce
FLASH drives do need a scheduler policy
Posted Oct 1, 2004 1:50 UTC (Fri) by BrucePerens (guest, #2510) [Link]
> Ah, but decisions like "don't sync the superblock and metadata too often" are not block-level issues, and thus have nothing to do with the I/O scheduler.
Jon,
I understand that to the I/O scheduler, a block is just a block. But I feel that the filesystem should be a higher layer than whatever understands the time constraints of the media. What is necessary is for the filesystem to communicate to the lower levels when order is important. If the I/O scheduler knows that there are 100 blocks that should not go out to the stick without changing the superblock and the directory at the same time, it can handle I/O buffering for USB sticks sensibly. This is why, now that we are getting versatile I/O scheduling, some sort of tagged-queueing-like scheme will become important in the filesystem layer.
Thanks
Bruce
Modular, switchable I/O schedulers
Posted Sep 28, 2004 20:32 UTC (Tue) by manpreet (guest, #12039) [Link]
Waiting for the queue to empty out could be a long wait for the threads that want disk attention. Another possible way is to keep accepting requests, but queue them in the new scheduler's queue and switch queues as soon as the older one empties out.
Modular, switchable I/O schedulers
Posted Sep 30, 2004 7:30 UTC (Thu) by axboe (subscriber, #904) [Link]
That's not easily doable, since each queue can have only one I/O scheduler assigned and active, and the I/O scheduler requires exclusive access to the given queue. What you propose would require the old scheduler to remain active for I/O completions while the new one handles submission and queueing.
Modular, switchable I/O schedulers
Posted Oct 6, 2004 10:10 UTC (Wed) by danielos (guest, #6053) [Link]
It is the same! It does not matter what is in the new queue if the new scheduler is not active; it is as if the processes were waiting. Another way would be to have more than one scheduler per I/O queue, like the process scheduler, but that is useless overhead. I don't think it's a matter of changing the I/O scheduler second by second but, at worst, hour by hour, and the threads that want disk attention have to respect the other requests and the scheduler's queue policy (or does there exist a super-policy that all schedulers respect? That would be a viable way to define a complex but efficient policy and sub-policies).
dan.
(sorry for exclamation)