Filesystems for zoned block devices
Damien Le Moal and Naohiro Aota led a combined storage and filesystem session at the 2019 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) on filesystem work that has been done for zoned block devices. These devices have multiple zones with different characteristics; usually there are zones that can only be written in sequential order as well as conventional zones that can be written in random order. The genesis of zoned block devices is shingled magnetic recording (SMR) devices, which were created to increase the capacity of hard disks, but at the cost of some flexibility.
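To make the zone model concrete, here is a minimal sketch (not from the session) of how user space can enumerate a device's zones with the kernel's BLKREPORTZONE ioctl from `<linux/blkzoned.h>`; the device path and the number of zones requested are placeholder values.

```c
/* Minimal sketch: list the first few zones of a zoned block device.
 * The device path and zone count below are examples only. */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/blkzoned.h>

int main(void)
{
    int fd = open("/dev/sdb", O_RDONLY);    /* example zoned device */
    if (fd < 0)
        return 1;

    unsigned int nr = 16;                   /* report the first 16 zones */
    struct blk_zone_report *rep =
        calloc(1, sizeof(*rep) + nr * sizeof(struct blk_zone));
    rep->sector = 0;                        /* start reporting from sector 0 */
    rep->nr_zones = nr;

    if (ioctl(fd, BLKREPORTZONE, rep) == 0) {
        for (unsigned int i = 0; i < rep->nr_zones; i++) {
            struct blk_zone *z = &rep->zones[i];
            /* start, len, and wp are expressed in 512-byte sectors */
            printf("zone %u: start %llu len %llu wp %llu [%s]\n", i,
                   (unsigned long long)z->start,
                   (unsigned long long)z->len,
                   (unsigned long long)z->wp,
                   z->type == BLK_ZONE_TYPE_CONVENTIONAL ?
                   "conventional" : "sequential");
        }
    }
    free(rep);
    close(fd);
    return 0;
}
```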
Le Moal began by noting that the session would not be about zoned block devices, as the "Zoned block devices and file systems" title might imply; it would instead focus on filesystems for zoned block devices. At this point, the only Linux filesystem with support for zoned devices is F2FS; that work is still ongoing as there are some bugs to fix and some features lacking. Work has also been done to add support for Btrfs; he turned the mic over to Aota to talk about that.
Btrfs
Getting Btrfs working on zoned block devices required aligning its internal "device extent" structure with the zones. If the extent size is smaller than any given zone, some space will be wasted; larger extents can cover multiple zones. Extents are allocated sequentially. Internal buffering has been added to sort write requests to maintain the sequential ordering required by the zone.
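As a rough illustration of what that sequential-write constraint implies (this is a sketch of the general idea, not the actual Btrfs code), buffered writes for a zone have to be sorted by offset and can only be issued when they line up with the zone's write pointer:

```c
#include <stdint.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical bookkeeping for one sequential-only zone. */
struct seq_zone {
    uint64_t start;   /* first byte of the zone */
    uint64_t len;     /* zone size in bytes */
    uint64_t wp;      /* write pointer: next byte that may be written */
};

struct pending_write {
    uint64_t offset;
    uint64_t len;
};

/* Sort buffered writes by offset so they can be issued sequentially. */
static int cmp_offset(const void *a, const void *b)
{
    const struct pending_write *wa = a, *wb = b;
    return (wa->offset > wb->offset) - (wa->offset < wb->offset);
}

/* A write may be issued only if it starts exactly at the write pointer
 * and fits inside the zone; the pointer then advances past it. */
static bool issue_write(struct seq_zone *z, const struct pending_write *w)
{
    if (w->offset != z->wp || w->offset + w->len > z->start + z->len)
        return false;
    z->wp += w->len;
    return true;
}

static void flush_zone(struct seq_zone *z, struct pending_write *q, size_t n)
{
    qsort(q, n, sizeof(*q), cmp_offset);
    for (size_t i = 0; i < n; i++)
        (void)issue_write(z, &q[i]);  /* real code would submit I/O here */
}
```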
![Damien Le Moal & Naohiro Aota](https://web.archive.org/web/20240110075849im_/https://static.lwn.net/images/2019/lsf-lemoal-aota-sm.jpg)
Multiple modes are supported for Btrfs, including single, DUP, RAID0, RAID1, and RAID10. There is no support for RAID5 or RAID6, Aota said, because larger SMR disks are not well suited for those RAID modes due to the long rebuild time required when drives fail. Le Moal added that those modes could be handled, but for 15TB drives, for example, the rebuild time will be extremely long.
Aota said there are two missing features that are being worked on. "Device replace" is mostly done, but there are still some race conditions to iron out. Supporting fallocate() has not been done yet; there are difficulties preallocating space in a sequential zone. Some kind of in-memory preallocation is what he is planning. Chris Mason did not think fallocate() support was important for the initial versions of this code; it is not really a high-priority item for copy-on-write (CoW) filesystems. He did not think the code should be held up for that.
Going forward, NVMe Zoned Namespaces (ZNS) support is planned, Aota said. In devices supporting ZNS, there will be no conventional zones that allow random writes at all. That means the superblock will need to be copy-on-write as well, so two zones will be reserved for the superblock and the filesystem will switch back and forth between them.
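A sketch of how such a ping-ponged superblock might work (an illustration of the idea described above, not the actual Btrfs patches): new superblock copies are appended to the active zone, the latest copy sits just below that zone's write pointer, and when the zone fills up the filesystem resets the other zone and switches to it.

```c
#include <stdint.h>

#define SB_SIZE 4096u   /* hypothetical superblock copy size */

/* State of one of the two zones reserved for the superblock. */
struct sb_zone {
    uint64_t start;   /* zone start offset in bytes */
    uint64_t len;     /* zone length in bytes */
    uint64_t wp;      /* current write pointer */
};

/* Offset of the newest superblock copy in a zone; zero means "empty". */
static uint64_t latest_sb(const struct sb_zone *z)
{
    return z->wp > z->start ? z->wp - SB_SIZE : 0;
}

/* Choose where the next copy-on-write superblock update goes: keep
 * appending to the zone in use, and switch to the other zone (after
 * resetting it) once the active one is full. */
static struct sb_zone *next_sb_zone(struct sb_zone *active,
                                    struct sb_zone *other)
{
    if (active->wp + SB_SIZE > active->start + active->len)
        return other;   /* caller resets 'other' and appends there */
    return active;
}
```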
Ric Wheeler asked about how long RAID rebuilds would take for RAID5/6. Le Moal said it could take a day or more. Wheeler did not think that was excessive, at least for RAID6, and said that there may be interest in having that RAID mode. The RAID6 rebuild I/O could be done at a lower priority, Wheeler said. But Mason thought that RAID5/6 support could wait until later; once again, he does not want to see these patches get hung up on that. Le Moal said they would send their patches soon.
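As a rough sanity check on those rebuild times (assuming a sustained rate of about 200 MB/s, which is an assumption here rather than a figure from the session), simply reading or rewriting a full 15TB drive takes the better part of a day, before any competing foreground I/O or parity computation is taken into account:

$$ t \approx \frac{15\times10^{12}\ \mathrm{bytes}}{200\times10^{6}\ \mathrm{bytes/s}} = 75{,}000\ \mathrm{s} \approx 21\ \mathrm{hours} $$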
ZoneFS
ZoneFS is a new filesystem that exposes zoned block devices to users in the simplest possible way, Le Moal said. It exports each zone as a file under the mount point in two directories: /conventional for random-access zones and /sequential for sequential-only zones. Under those directories, each zone appears as a file named with its zone number.
ZoneFS presents a fixed set of files that cannot be changed, removed, or renamed, and new files cannot be added. The only truncation operations (i.e. truncate() and ftruncate()) supported for the sequential zones are those that specify a zero length; they will simply set the zone's write pointer back to the start of the zone. There will be no on-disk metadata for the filesystem; the write pointer location indicates the size of a sequential file.
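Based on the layout described above, user-space access to a sequential zone file would look something like the sketch below; the mount point and file names are illustrative, and the eventual implementation may add constraints (for example, requiring direct I/O) that are not shown here.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    const char buf[] = "log record\n";

    /* Sequential zone files can only grow by appending at the end,
     * which is also the zone's write pointer. */
    int fd = open("/mnt/zonefs/sequential/0", O_WRONLY | O_APPEND);
    if (fd < 0)
        return 1;
    if (write(fd, buf, strlen(buf)) < 0)
        return 1;

    /* The only permitted truncation is to length zero, which resets
     * the write pointer back to the start of the zone. */
    if (ftruncate(fd, 0) < 0)
        return 1;

    close(fd);
    return 0;
}
```

Since the write pointer doubles as the file size, an application can always resume appending after a crash simply by checking the file's length.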
ZoneFS may not look "super useful", he said, so why was it created? Applications could get the same effect by opening the block device file directly, but application developers are not comfortable with that; he gets lots of email asking for something like ZoneFS. It works well for certain applications (e.g. RocksDB and LevelDB) that already use sequential data structures. It is also easy to integrate the ZoneFS files with these applications.
Beyond that, ZoneFS can be used to support ZNS as well. Ted Ts'o pointed out that, unlike the SMR disk vendors, the NVMe people are saying there may be a performance cost to relying on implicit zone open and close operations. That is going to make things interesting for filesystems like Btrfs that are trying to operate on both types of media; based on what the SMR hard-disk vendors have said along the way, they have not added explicit opens and closes.
Hearing no strong opposition to the idea, Le Moal said he would be sending ZoneFS patches around soon.
Index entries for this article:

- Kernel: Block layer/Zoned devices
- Kernel: Filesystems
- Conference: Storage Filesystem & Memory Management/2019
Filesystems for zoned block devices
Posted May 21, 2019 6:27 UTC (Tue) by epa (subscriber, #39769) [Link]
Filesystems for zoned block devices
Posted May 21, 2019 6:54 UTC (Tue) by smurf (subscriber, #17840) [Link]
In contrast, an append-only store would keep the oldest versions as plaintext and then store all the incremental changes on top, requiring either frequent snapshots or a lot of work to access the current state.
Filesystems for zoned block devices
Posted May 21, 2019 9:24 UTC (Tue) by epa (subscriber, #39769) [Link]
However, it's not true that "what you most often need to retrieve" is the current state. Suppose the SMR disk is on the server. The initial clone sucks down all objects. After that, subsequent pulls by clients will get new objects, which can be sent as deltas. But the client would have lots of work to do applying all the deltas to get the current state of files.
Filesystems for zoned block devices
Posted Jun 2, 2019 22:00 UTC (Sun) by Wol (subscriber, #4433) [Link]
If all the user is doing is going from version N-1 to version N, then storing the updates as deltas is fine. But how many users do that? In an actively updated project, users are going to fall behind, and there are going to be a lot of users updating from N-X, where X could be 10, 20, or more versions behind. That's a LOT of downloading of deltas!
I think git assumes that the network is the constrained resource, so if it has access to N-X as well as N, it can calculate the delta quite possibly faster than it can send it ...
Cheers,
Wol
Filesystems for zoned block devices
Posted May 21, 2019 13:22 UTC (Tue) by nix (subscriber, #2304) [Link]
Filesystems for zoned block devices
Posted May 21, 2019 17:47 UTC (Tue) by brouhaha (subscriber, #1698) [Link]
3TB drives happened to be especially awful, for reasons that are unclear, but multiple failures during a RAID-6 rebuild that takes 36 hours are entirely plausible. Keep in mind that drive failures in a system are not nearly as uncorrelated as one might expect.
Filesystems for zoned block devices
Posted May 21, 2019 18:48 UTC (Tue) by KaiRo (subscriber, #1987) [Link]
Filesystems for zoned block devices
Posted May 22, 2019 17:10 UTC (Wed) by nix (subscriber, #2304) [Link]
(This is why RAID-5 for large arrays is almost certainly a bad idea. One extra failure of a degraded array and you're lost.)
Filesystems for zoned block devices
Posted Jun 2, 2019 22:13 UTC (Sun) by Wol (subscriber, #4433) [Link]
Cheers,
Wol
Filesystems for zoned block devices
Posted Jun 24, 2019 14:39 UTC (Mon) by nix (subscriber, #2304) [Link]
> In discussion with Neil, this is designed such that any single failed drive can be rebuilt without stressing any other single drive because all the blocks are smeared over all the available drives.

That's how RAID-6 works anyway, isn't it? The P and Q parities are intermingled with all the other blocks, in one-stripe-long runs. (Of course, this still means you have to read from any given disk (N-1)/N of the time during a rebuild, which is a lot of I/O and will stress the mechanical drive components, but I don't see how to avoid that.)
Filesystems for zoned block devices
Posted May 22, 2019 10:14 UTC (Wed) by Jonno (subscriber, #49613) [Link]
3TB drives were the largest commonly available drives from late 2011 to late 2013, when HDD quality control went down the drain after the Thailand drive crisis...
Filesystems for zoned block devices
Posted May 22, 2019 17:18 UTC (Wed) by nix (subscriber, #2304) [Link]
This is basically true even if you have a pile of hot spares: if they were connected to the system in question with any kind of electrical path, even if they weren't spinning, they are quite likely damaged :(
(I have two onsite backups and one hot-swapped offsite backup, as well as distributed storage of what really matters. None of them use RAID or any new flashy filesystem features because the more new stuff you use, the more you are risking bugs in *that*...)
Filesystems for zoned block devices
Posted May 23, 2019 21:54 UTC (Thu) by k3ninho (subscriber, #50375) [Link]
That said, another correspondent here said 'power supplies and spikes' so that's another variable to control -- is there spare capacity (and overhead) to run your RAID set at full capacity for the duration of the rebuild?
K3n.
Filesystems for zoned block devices
Posted Jun 2, 2019 22:07 UTC (Sun) by Wol (subscriber, #4433) [Link]
Cheers,
Wol
Filesystems for zoned block devices
Posted Jun 3, 2019 18:10 UTC (Mon) by gioele (subscriber, #61675) [Link]
Such things are quite common in electronics: https://en.wikipedia.org/wiki/Isolation_transformer
Filesystems for zoned block devices
Posted Jun 3, 2019 19:59 UTC (Mon) by farnz (subscriber, #17727) [Link]
Or, for the old school, rotary UPSes existed, which are literally an electric motor, a flywheel, and an electric generator connected together, using the flywheel's mass to maintain steady power. I believe the Cray-2 was fed by a pair of these devices…
Filesystems for zoned block devices
Posted Jun 3, 2019 20:21 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]