
Filesystems for zoned block devices

By Jake Edge
May 21, 2019
LSFMM

Damien Le Moal and Naohiro Aota led a combined storage and filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) on filesystem work that has been done for zoned block devices. These devices have multiple zones with different characteristics; usually there are zones that can only be written in sequential order as well as conventional zones that can be written in random order. The genesis of zoned block devices is shingled magnetic recording (SMR) devices, which were created to increase the capacity of hard disks, but at the cost of some flexibility.

Le Moal began by noting that the session would not be about zoned block devices, as the "Zoned block devices and file systems" title might imply; it would instead focus on filesystems for zoned block devices. At this point, the only Linux filesystem with support for zoned devices is F2FS; that work is still ongoing as there are some bugs to fix and some features lacking. Work has also been done to add support for Btrfs; he turned the mic over to Aota to talk about that.

Btrfs

Getting Btrfs working on zoned block devices required aligning its internal "device extent" structure with the zones. If the extent size is smaller than any given zone, some space will be wasted; larger extents can cover multiple zones. Extents are allocated sequentially. Internal buffering has been added to sort write requests to maintain the sequential ordering required by the zone.
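
The alignment itself is just arithmetic on the zone size. Below is a minimal user-space sketch of the idea; the function name and the sizes are invented for illustration, and this is not the Btrfs allocator code:

    /* Illustrative sketch only: how a device extent gets aligned to zone
     * boundaries, and where the wasted space comes from.  Not Btrfs code. */
    #include <stdint.h>
    #include <stdio.h>

    /* Place an extent of `len` bytes at or after `hint` on a device with
     * zones of `zone_size` bytes.  Returns the zone-aligned start; the
     * zone-rounded length actually consumed is stored in *alloc_len. */
    static uint64_t zone_align_extent(uint64_t hint, uint64_t len,
                                      uint64_t zone_size, uint64_t *alloc_len)
    {
        uint64_t start = (hint + zone_size - 1) / zone_size * zone_size;
        uint64_t zones = (len + zone_size - 1) / zone_size;

        *alloc_len = zones * zone_size;       /* waste = *alloc_len - len */
        return start;
    }

    int main(void)
    {
        uint64_t zone_size = 256ULL << 20;    /* 256 MiB zones */
        uint64_t len = 300ULL << 20;          /* 300 MiB extent request */
        uint64_t alloc;
        uint64_t start = zone_align_extent(1ULL << 20, len, zone_size, &alloc);

        printf("extent starts at %llu, consumes %llu bytes, wastes %llu\n",
               (unsigned long long)start, (unsigned long long)alloc,
               (unsigned long long)(alloc - len));
        return 0;
    }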

[Damien Le Moal & Naohiro Aota]

Multiple modes are supported for Btrfs, including single, DUP, RAID0, RAID1, and RAID10. There is no support for RAID5 or RAID6, Aota said, because larger SMR disks are not well suited for those RAID modes due to the long rebuild time required when drives fail. Le Moal added that those modes could be handled, but for 15TB drives, for example, the rebuild time will be extremely long.

Aota said there are two missing features that are being worked on. "Device replace" is mostly done, but there are still some race conditions to iron out. Supporting fallocate() has not been done yet; there are difficulties preallocating space in a sequential zone. Some kind of in-memory preallocation is what he is planning. Chris Mason did not think fallocate() support was important for the initial versions of this code; it is not really a high-priority item for copy-on-write (CoW) filesystems. He did not think the code should be held up for that.

Going forward, NVMe Zone Namespace (ZNS) support is planned, Aota said. In devices supporting ZNS, there will be no conventional zones supporting random writes at all. That means the superblock will need to be copy on write, so two zones will be reserved for the superblock and the filesystem will switch back and forth between them.
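
A rough model of such a two-zone superblock log is sketched below; the structure and names are invented to illustrate the back-and-forth switching described in the session, and are not taken from the Btrfs patches:

    /* Minimal model of a two-zone, copy-on-write superblock scheme: the
     * superblock is always appended to the "current" zone, and when that
     * zone fills up the other zone is reset and becomes current.  All
     * names are invented; this is an illustration, not Btrfs code. */
    #include <stdint.h>
    #include <stdio.h>

    #define SB_SIZE        4096u
    #define ZONE_CAPACITY  (256u * 1024 * 1024)
    #define SLOTS_PER_ZONE (ZONE_CAPACITY / SB_SIZE)

    struct sb_log {
        unsigned current;        /* which of the two zones we append to */
        uint32_t used[2];        /* superblock copies written in each zone */
        uint64_t generation;     /* monotonically increasing version */
    };

    /* Write a new superblock generation; switch zones when one is full. */
    static void sb_log_update(struct sb_log *log)
    {
        if (log->used[log->current] == SLOTS_PER_ZONE) {
            unsigned other = 1 - log->current;

            /* Reset the other zone (rewinding its write pointer), then
             * start appending there. */
            log->used[other] = 0;
            log->current = other;
        }
        log->used[log->current]++;
        log->generation++;
    }

    /* On mount, the valid superblock is the last copy appended to the
     * zone holding the highest generation, i.e. the most recently
     * written zone. */

    int main(void)
    {
        struct sb_log log = { 0 };

        for (unsigned i = 0; i < 3 * SLOTS_PER_ZONE / 2; i++)
            sb_log_update(&log);

        printf("generation %llu lives in zone %u, slot %u\n",
               (unsigned long long)log.generation, log.current,
               (unsigned)(log.used[log.current] - 1));
        return 0;
    }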

Ric Wheeler asked about how long RAID rebuilds would take for RAID5/6. Le Moal said it could take a day or more. Wheeler did not think that was excessive, at least for RAID6, and said that there may be interest in having that RAID mode. The RAID6 rebuild I/O could be done at a lower priority, Wheeler said. But Mason thought that RAID5/6 support could wait until later; once again, he does not want to see these patches get hung up on that. Le Moal said they would send their patches soon.

ZoneFS

ZoneFS is a new filesystem that exposes zoned block devices to users in the simplest possible way, Le Moal said. It exports each zone as a file under the mountpoint in two directories: /conventional for random-access zones and /sequential for sequential-only zones. Under those directories, the zones will appear as files named with their zone number.

ZoneFS presents a fixed set of files that cannot be changed, removed, or renamed, and new files cannot be added. The only truncation operations (i.e. truncate() and ftruncate()) supported for the sequential zones are those that specify a zero length; they will simply set the zone's write pointer back to the start of the zone. There will be no on-disk metadata for the filesystem; the write pointer location indicates the size of a sequential file.
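
That model maps directly onto ordinary file I/O. Here is a minimal sketch using plain POSIX calls; the mountpoint, directory names, and open flags are assumptions based on the talk rather than a final ABI, and the alignment and direct-I/O constraints of real zoned devices are ignored:

    /* Sketch of the ZoneFS model using plain POSIX calls.  Paths and
     * flags follow the talk and may differ in the final implementation. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
        const char buf[] = "log record\n";
        struct stat st;

        /* Zone 3 of the sequential-only zones, exposed as a file. */
        int fd = open("/mnt/zonefs/sequential/3", O_WRONLY);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* The file size is the zone's write pointer, so an append is a
         * write at the current size. */
        if (fstat(fd, &st) == 0)
            if (pwrite(fd, buf, strlen(buf), st.st_size) < 0)
                perror("pwrite");

        /* Truncation to length zero is the only truncation allowed: it
         * resets the zone's write pointer to the start of the zone. */
        if (ftruncate(fd, 0) < 0)
            perror("ftruncate");

        close(fd);
        return 0;
    }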

ZoneFS may not look "super useful", he said, so why was it created? Applications could get the same effect by opening the block device file directly, but application developers are not comfortable with that; he gets lots of email asking for something like ZoneFS. It works well for certain applications (e.g. RocksDB and LevelDB) that already use sequential data structures. It is also easy to integrate the ZoneFS files with these applications.

Beyond that, ZoneFS can be used to support ZNS as well. However, as Ted Ts'o pointed out, unlike the disk vendors, the NVMe people are saying that there may be a performance cost to relying on implicit zone open and close operations. That is going to make things interesting for filesystems like Btrfs, which are trying to operate on both types of media but, based on what the SMR hard-disk vendors have said along the way, have not added explicit opens and closes.

Hearing no strong opposition to the idea, Le Moal said he would be sending ZoneFS patches around soon.


Index entries for this article
Kernel: Block layer/Zoned devices
Kernel: Filesystems
Conference: Storage Filesystem & Memory Management/2019



Filesystems for zoned block devices

Posted May 21, 2019 6:27 UTC (Tue) by epa (subscriber, #39769) [Link]

I wonder whether a git repository could be stored in one of these append-only files? If you assume that objects never get garbage collected.

Filesystems for zoned block devices

Posted May 21, 2019 6:54 UTC (Tue) by smurf (subscriber, #17840) [Link]

The problem is that this is not what you want. You want to store the current state of your repository as plain files (or rather, lightly compressed ones) and older versions as deltas against it, because the current state is what you most often need to retrieve.

In contrast, an append-only storage would keep the oldest versions as plaintext and then store all the incremental changes on top, requiring either frequent snapshots or a lot of work to easily access the current state.

Filesystems for zoned block devices

Posted May 21, 2019 9:24 UTC (Tue) by epa (subscriber, #39769) [Link]

You're right, I had forgotten that git compresses and repacks the objects. You could store absolutely everything un-delta'd and uncompressed, but that would use tons of space, negating any advantage from using an SMR drive.

However, it's not true that "what you most often need to retrieve" is the current state. Suppose the SMR disk is on the server. The initial clone sucks down all objects. After that, subsequent pulls by clients will get new objects -- which can be sent as deltas. But the client would have lots of work to do applying all the deltas to get the current state of the files.

Filesystems for zoned block devices

Posted Jun 2, 2019 22:00 UTC (Sun) by Wol (subscriber, #4433) [Link]

But if all the recent state is on the server uncompressed, it's not much work for the server to calculate the delta and send it.

If all the user is doing is going from version N-1 to version N, then storing the updates as deltas is fine. But how many users do that? In an actively updated project, users are going to fall behind, and there are going to be a lot of users updating from N-X, where X could be 10, 20, or more versions behind. That's a LOT of downloading of deltas!

I think git assumes that the network is the constrained resource, so if it has access to N-X as well as N, it can calculate the delta quite possibly faster than it can send it ...

Cheers,
Wol

Filesystems for zoned block devices

Posted May 21, 2019 14:16 UTC (Tue) by grawity (subscriber, #80596) [Link]

You could certainly do this, because Plan 9 already had it in the form of Venti and Fossil.

Filesystems for zoned block devices

Posted May 21, 2019 13:22 UTC (Tue) by nix (subscriber, #2304) [Link]

FYI it takes about a day and a half to read the entire surface of the 8TiB non-zoned disks here (at 200MiB/s). I'd expect a 15TiB drive to either be twice as fast or take twice as long, or a subset of those. Those drives at least are perfectly practical for RAID-6: we're not yet near the point at which a rebuild takes so long that multiple failures of other drives can be expected while the rebuild is underway. (Though I have them set up as multiple RAID-6 and RAID-0 arrays, not one big one.)

Filesystems for zoned block devices

Posted May 21, 2019 17:47 UTC (Tue) by brouhaha (subscriber, #1698) [Link]

I had a 3TB drive fail in a triple-parity RAID-6 that contained 15 3TB drives total. While rebuilding it, two more drives failed. The drives were all right around three years old at the time, some just under, some just over.

3TB drives happened to be especially awful, for reasons that are unclear, but multiple failures during a RAID-6 rebuild that takes 36 hours are entirely plausible. Keep in mind that drive failures in a system are not nearly as uncorrelated as one might expect.

Filesystems for zoned block devices

Posted May 21, 2019 18:48 UTC (Tue) by KaiRo (subscriber, #1987) [Link]

One reason why they're not that uncorrelated is - from what I heard - that rebuilding taxes the remaining drives quite a lot, and that can push them over the edge if they are getting close to failure.

Filesystems for zoned block devices

Posted May 22, 2019 17:10 UTC (Wed) by nix (subscriber, #2304) [Link]

Yeah, it has to read the entire drive surface of all the other drives, and if they're similar ages and especially from the same batch...

(This is why RAID-5 for large arrays is almost certainly a bad idea. One extra failure of a degraded array and you're lost.)

Filesystems for zoned block devices

Posted Jun 2, 2019 22:13 UTC (Sun) by Wol (subscriber, #4433) [Link]

I've spec'd a raid-61 which I would like to implement if I get the chance. In discussion with Neil, this is designed such that any single failed drive can be rebuilt without stressing any other single drive because all the blocks are smeared over all the available drives. The current raid-10 implementation pretty much guarantees that all mirror blocks are on just one or two other drives.

Cheers,
Wol

Filesystems for zoned block devices

Posted Jun 24, 2019 14:39 UTC (Mon) by nix (subscriber, #2304) [Link]

> In discussion with Neil, this is designed such that any single failed drive can be rebuilt without stressing any other single drive because all the blocks are smeared over all the available drives.

That's how RAID-6 works anyway, isn't it? The P and Q parities are intermingled with all the other blocks, in one-stripe-long runs. (Of course, this still means you have to read from any given disk (N-1)/N of the time during a rebuild, which is a lot of I/O and will stress the mechanical drive components, but I don't see how to avoid that.)

Filesystems for zoned block devices

Posted May 22, 2019 10:14 UTC (Wed) by Jonno (subscriber, #49613) [Link]

> 3TB drives happened to be especially awful, for reasons that are unclear

3TB drives were the largest commonly available drives from late 2011 to late 2013, when HDD quality control went down the drain after the Thailand drive crisis...

Filesystems for zoned block devices

Posted May 22, 2019 17:18 UTC (Wed) by nix (subscriber, #2304) [Link]

I would say that by the time you have multiple drive failures in a week or so in a single chassis you're well past any hope of the uninterrupted service availability RAID gives you, and it's time to tear the system down, replace all the *other* drives as well (since they're clearly cursed, or faulty, or all subjected to horrible power spikes or something, and are all suspect, even those that are still outwardly working), and restore from backup. Because of course you have a backup because RAID is not a backup and we don't regularly see people on the linux-raid list who have no backups at all oh wait we do all the time :(

This is basically true even if you have a pile of hot spares: if they were connected to the system in question with any kind of electrical path, even if they weren't spinning, they are quite likely damaged :(

(I have two onsite backups and one hot-swapped offsite backup, as well as distributed storage of what really matters. None of them use RAID or any new flashy filesystem features because the more new stuff you use, the more you are risking bugs in *that*...)

Filesystems for zoned block devices

Posted May 23, 2019 21:54 UTC (Thu) by k3ninho (subscriber, #50375) [Link]

I want a 'rebuild throttle' that can keep drives in the temperature range they're accustomed to, possibly capping at 50-80% of maximum data throughput to avoid perturbing aging disks.

That said, another correspondent here said 'power supplies and spikes' so that's another variable to control -- is there spare capacity (and overhead) to run your RAID set at full capacity for the duration of the rebuild?

K3n.

Filesystems for zoned block devices

Posted Jun 2, 2019 22:07 UTC (Sun) by Wol (subscriber, #4433) [Link]

What you want there is a non-pass-through UPS (AC -> DC -> AC, i.e. any spikes etc. get trapped in the UPS). What would be really nice is a motor connected to a generator, so the computer is electrically insulated from the outside world. Not sure you can get those, though.

Cheers,
Wol

Filesystems for zoned block devices

Posted Jun 3, 2019 18:10 UTC (Mon) by gioele (subscriber, #61675) [Link]

> What would be really nice is a motor connected to a generator so the computer is electrically insulated from the outside world. Not sure you can get those, though.

Such things are quite common in electronics: https://en.wikipedia.org/wiki/Isolation_transformer

Filesystems for zoned block devices

Posted Jun 3, 2019 19:59 UTC (Mon) by farnz (subscriber, #17727) [Link]

Or, for the old school, rotary UPSes existed, which are literally an electric motor, a flywheel, and an electric generator connected together, using the flywheel's mass to maintain steady power. I believe the Cray-2 was fed by a pair of these devices…

Filesystems for zoned block devices

Posted Jun 3, 2019 20:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

These are called DRUPS and they are generally used for datacenter power backups.


Copyright © 2019, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds