In-band deduplication for Btrfs
"In-band deduplication" is the process of detecting and unifying duplicate data blocks as files are being written, rather than at some later time. Btrfs support for this feature has been under development since at least early 2013. Quite recently it reached the point where developer Qu Wenruo thought it was sufficiently ready to send Btrfs maintainer Chris Mason a pull request (as yet unanswered) hoping that it might be added to the kernel during the 4.6 merge window. While this is far from a magic bullet that will suddenly remove all the waste in your filesystem that is caused by duplicate data, there are use cases where it could bring real benefits.
Offline and in-band
It has been possible to unify duplicated blocks in Btrfs since Linux 3.12 when the BTRFS_IOC_FILE_EXTENT_SAME ioctl() command was added (it has since been renamed FIDEDUPERANGE when the basic ioctl() handling was moved to the VFS, so it could someday be used by other filesystems). This ioctl() is given a range of bytes in the file (which must be aligned to the filesystem block size) and a list of offsets in other files (or possibly the same file). This identifies two or more ranges of the same size that are claimed to be identical. Btrfs reads all the blocks, checks that they are in fact identical, and then changes its internal bookkeeping so that all files point to one set of blocks on the storage medium. The space used by the other copies of the data will typically become free space available for reallocation.
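For the curious, a call to this ioctl() from user space looks roughly like the following minimal sketch (on kernels that provide the VFS-level FIDEDUPERANGE name); the file names, offsets, and length are purely illustrative, and error handling is abbreviated:

```c
/*
 * Minimal sketch of a FIDEDUPERANGE call; the file names, offsets, and
 * length are purely illustrative, and error handling is abbreviated.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(void)
{
	int src = open("/mnt/btrfs/image-a.img", O_RDONLY);
	int dst = open("/mnt/btrfs/image-b.img", O_RDWR);
	struct file_dedupe_range *range;

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* One destination range; the info[] array is allocated along with
	 * the header structure. */
	range = calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
	range->src_offset = 0;			/* must be block-aligned */
	range->src_length = 1024 * 1024;	/* 1MB claimed to be identical */
	range->dest_count = 1;
	range->info[0].dest_fd = dst;
	range->info[0].dest_offset = 0;		/* must be block-aligned */

	if (ioctl(src, FIDEDUPERANGE, range) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	/* The kernel checks that the ranges really are identical before
	 * sharing extents; the outcome is reported per destination. */
	if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduplicated %llu bytes\n",
		       (unsigned long long)range->info[0].bytes_deduped);
	else if (range->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
		printf("ranges differ; nothing was shared\n");

	return 0;
}
```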
To understand how this works it is necessary to understand how Btrfs makes use of "extents". An extent is a contiguous set of blocks on the storage device and also a contiguous set of blocks in a file (at least initially). It has a "logical address" that is mapped to a physical location on one, or perhaps more, of the underlying devices. An extent can become part of several different files, either through snapshots, reflinks, or deduplication. To allow this storage space to ultimately be reused, Btrfs maintains a reference count for each extent and will not re-use any of the space until that count becomes zero.
It is not generally possible to split an extent. If a single block is written into the middle of a large extent, the "copy on write" design of Btrfs requires that either the whole extent be copied or that the file's indexing information be changed to point to the first half of the extent, then the new block, then the remainder of the extent. Btrfs takes the latter approach.
When FIDEDUPERANGE is used, the file indexes will be updated to point to the relevant extents or partial extents of the source file, and the reference counts on the various extents will be increased or decreased as appropriate. This may not release quite as much space as expected since there may have been extents that were only partially identical between two files. The identical part in one extent may have no files including it any more, but the space will not be freed until the whole extent is no longer referenced.
duperemove is a tool that can be used to examine a set of files, look for duplicated regions, and call the ioctl() to remove that duplication from the underlying storage. This can be useful anywhere that files are likely to be completely or largely the same, but where they need to be kept separate for administrative or other practical reasons. A common use case is filesystem images used for virtualization or the multiple similar-but-not-identical file sets used by containers. Running duperemove from time to time could save a lot of disk space.
This functionality was referred to in the initial commit as "offline deduplication", which is a little confusing since "offline" in the context of filesystems usually implies that the filesystem isn't mounted, and that is not the case here. It is more like "on-demand" deduplication and is sometimes referred to as "batch" deduplication. This contrasts with the new work that could be called "time-of-write" or transparent deduplication, but is called "in-band" deduplication in this patch set.
When duperemove runs, it computes a hash value for every block in every file and then compares these, looking for ranges of matching blocks. When it finds suitable ranges, it requests the deduplication. In-band deduplication uses the same idea of hashing blocks of data, but in many other respects it is quite different.
In-band deduplication works in units of extents of a specific size: a deduplication block size of up to 8MB can be configured; the default is 128KB. When data is written out to storage, blocks are gathered into extents of this size whenever possible. If there are only sufficient blocks being written for a smaller extent, that extent will be written without deduplication. For every extent of the target size, the hash (currently SHA-256, though possibly configurable in the future) is calculated. If another extent can be found with the same hash, then a new reference is added to that extent and the new data is not written. If no match is found, the data is written as normal and the hash is stored together with the logical address of the extent so that it is available for future matching.
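The decision made for each full-sized extent can be illustrated with a small user-space simulation. This is only a sketch of the bookkeeping, assuming SHA-256 over fixed 128KB extents; the naive linked list merely stands in for the kernel's hash index and none of this is the patch set's code. It links against OpenSSL's libcrypto:

```c
/*
 * User-space simulation of the in-band dedup decision, assuming SHA-256 over
 * fixed 128KB extents.  The naive linked list stands in for the kernel's
 * hash index; this illustrates the bookkeeping, not the patch set's code.
 * Build with: gcc dedup-sim.c -lcrypto
 */
#include <openssl/sha.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EXTENT_SIZE (128 * 1024)	/* the default dedup block size */

struct entry {
	unsigned char digest[SHA256_DIGEST_LENGTH];
	long long offset;		/* where this data was first written */
	struct entry *next;
};

static struct entry *table;

/* Return the offset of an already-written extent with this digest, or -1. */
static long long lookup(const unsigned char *digest)
{
	for (struct entry *e = table; e; e = e->next)
		if (!memcmp(e->digest, digest, SHA256_DIGEST_LENGTH))
			return e->offset;
	return -1;
}

static void insert(const unsigned char *digest, long long offset)
{
	struct entry *e = malloc(sizeof(*e));

	memcpy(e->digest, digest, SHA256_DIGEST_LENGTH);
	e->offset = offset;
	e->next = table;
	table = e;
}

int main(int argc, char **argv)
{
	static unsigned char buf[EXTENT_SIZE];
	unsigned char digest[SHA256_DIGEST_LENGTH];
	long long offset = 0, shared = 0;
	FILE *f = fopen(argc > 1 ? argv[1] : "testfile", "rb");
	size_t n;

	if (!f) {
		perror("fopen");
		return 1;
	}
	while ((n = fread(buf, 1, EXTENT_SIZE, f)) == EXTENT_SIZE) {
		SHA256(buf, n, digest);
		if (lookup(digest) >= 0)
			shared++;		/* would add a reference, not write */
		else
			insert(digest, offset);	/* would write and record the hash */
		offset += n;
	}
	/* A short tail extent would simply be written without deduplication. */
	printf("%lld of the full-sized extents could share existing data\n", shared);
	fclose(f);
	return 0;
}
```

Running this over a file gives a rough idea of how many extents in-band deduplication could avoid writing, under the optimistic assumption that extent boundaries line up with these fixed offsets.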
This process can be enabled for a whole Btrfs filesystem using the "btrfs dedup enable" command and then disabled for individual files or directories by using the "btrfs property set" command to set the "dedup" property to "disable". Setting this on a directory causes all files or directories created within that directory to inherit the setting. This means that enabling deduplication on only a subset of a filesystem is possible, but a little bit clumsy.
Two back-ends
There are two separate back-ends for storing the mapping between hashes and extents; the fact that the implementation of these back-ends is kept cleanly separate from their use is rightfully highlighted as a strength of this patch set.
One of the two back-ends is an in-memory mapping. Two red-black trees are created when the filesystem is mounted; whenever new data is written to a file on which deduplication is enabled, new entries are added to the trees. One tree maps from hash to logical address and is used to see if a suitable extent exists that already stores the required data. The second provides a reverse mapping that allows hash entries to be deleted when an extent is freed after its reference count reaches zero. The size of these trees is limited by a configurable entry count that defaults to 32768. It would not be surprising to see that replaced or augmented with a "shrinker" callback from the memory management subsystem, so that it could shrink only when memory is tight.
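An entry linked into both trees might look roughly like the following guess at the layout; the structure and field names are invented for illustration and are not taken from the patch set:

```c
/* Illustrative guess at the shape of an in-memory dedup entry; the names
 * here are invented and do not come from the patch set. */
#include <linux/rbtree.h>
#include <linux/types.h>

struct inmem_dedup_entry {
	struct rb_node hash_node;	/* in the hash-indexed tree, used for
					 * lookups when new data is written */
	struct rb_node bytenr_node;	/* in the address-indexed tree, used to
					 * drop the entry when the extent's
					 * reference count reaches zero */
	u64 bytenr;			/* logical address of the extent */
	u32 num_bytes;			/* length of the extent */
	u8 hash[32];			/* SHA-256 of the extent's data */
};
```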
The second mechanism stores these mappings in the filesystem. Btrfs has flexible data structures for storing all sorts of metadata and data on disk using a number of distinct B-trees, one for each subvolume, one for the extent reference counts, one for storing checksums, etc. The in-band deduplication patchset adds another B-tree that has two types of keys: one for lookup by hash and another for lookup by extent address.
Looking up the hash in the B-tree is not quite so straightforward as one might hope, since Btrfs has a fixed format for the lookup key: a 64-bit object-ID, an eight-bit type, and a 64-bit offset. It is not possible to store the full 256-bit hash in the key and even storing 128 bits in the two 64-bit fields would be problematic. Btrfs requires that all keys be unique, so that approach would effectively limit the hash size to 128 bits.
The approach chosen is to store 64 bits of the hash in the object-ID, and the logical address of the extent in the offset field. Each key is accompanied by a variable-length data field and this is used to store the full hash. If the hashes of two extents collide in those 64 bits, the offsets will still be different, so the keys will be unique.
To handle these collisions a lookup first performs a regular B-tree search for a key with the appropriate hash bits in the object-ID and U64_MAX in the offset field. This will provide the last possible location where the target hash could be stored. A linear search is then performed searching backward and comparing the full hash until a match is found or there are no more keys with the required hash fragment.
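The following user-space sketch illustrates that lookup strategy, with a sorted array standing in for the dedup B-tree; the structure names, key type value, and helper functions are all invented for illustration:

```c
/*
 * User-space sketch of the on-disk hash lookup described above.  A sorted
 * array stands in for the dedup B-tree; structure names, the key type
 * value, and helpers are invented for illustration only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_HASH_ITEM 1		/* hypothetical key type */

struct dedup_key {
	uint64_t objectid;		/* first 64 bits of the SHA-256 hash */
	uint8_t  type;
	uint64_t offset;		/* logical address of the extent */
};

struct dedup_item {
	struct dedup_key key;
	uint8_t full_hash[32];		/* full hash stored in the item data */
};

/* Order keys the way the B-tree does: objectid, then type, then offset. */
static int key_cmp(const struct dedup_key *a, const struct dedup_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Search for the last possible slot for this hash fragment (offset set to
 * UINT64_MAX), then walk backward comparing full hashes.  Returns the
 * logical address of a matching extent, or 0 if there is none.
 */
static uint64_t dedup_lookup(const struct dedup_item *items, int nr,
			     const uint8_t full_hash[32])
{
	struct dedup_key search = { .type = DEDUP_HASH_ITEM, .offset = UINT64_MAX };
	int i = 0;

	memcpy(&search.objectid, full_hash, sizeof(search.objectid));

	/* A linear scan stands in for the real B-tree search. */
	while (i < nr && key_cmp(&items[i].key, &search) <= 0)
		i++;

	for (i--; i >= 0 && items[i].key.objectid == search.objectid &&
		  items[i].key.type == search.type; i--)
		if (!memcmp(items[i].full_hash, full_hash, 32))
			return items[i].key.offset;

	return 0;
}

int main(void)
{
	uint8_t hash[32] = { 0xab, 0xcd };	/* pretend SHA-256 of an extent */
	struct dedup_item items[1];

	memset(items, 0, sizeof(items));
	memcpy(&items[0].key.objectid, hash, sizeof(items[0].key.objectid));
	items[0].key.type = DEDUP_HASH_ITEM;
	items[0].key.offset = 0x100000;		/* the extent's logical address */
	memcpy(items[0].full_hash, hash, sizeof(items[0].full_hash));

	printf("matching extent at 0x%llx\n",
	       (unsigned long long)dedup_lookup(items, 1, hash));
	return 0;
}
```

The real code performs a B-tree search rather than a linear scan, of course, but the backward walk over keys sharing the 64-bit hash fragment is the part that handles collisions in the key space.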
Unlike with the in-memory back-end, there is no mechanism to limit the number of hash entries stored, beyond the normal filesystem-full checks that might prevent a new extent from being written.
One difference between this in-band deduplication and the on-demand approach that is worth highlighting is the dependence placed on the hash. duperemove uses a hash only to guide the search for duplicate blocks; the hash is not authoritative, and Btrfs will not allow the deduplication to happen if the identified regions are not byte-for-byte identical. In-band deduplication as currently implemented does not perform that comparison. If the hash matches, then the extents are assumed to match. This means the correctness of the deduplication is completely dependent on the uniqueness of the hash, so the extra effort to make use of all 256 bits is easy to justify.
Whether that complete dependence is itself justified is not an easy question to answer. It is certainly extremely unlikely for two distinct blocks to have the same hash, but it is also certainly quite possible. It would only need to cause corruption once to be extremely embarrassing. Adding a byte-for-byte comparison is on the planned feature list but, according to Wenruo, "not anytime soon".
In-band deduplication brings the benefit of being automatic, but at the cost that it will probably miss duplication that duperemove could find. Different alignment of extents between files would completely defeat the duplicate detection, as would creating files before deduplication was enabled. The former could be improved to some degree with a smaller extent size, though that brings costs of its own. As is so often the case, finding the most effective solution, which could include a mix of offline and in-band deduplication, will be highly dependent on the particular use case.
Test, test, and test
Wenruo assures us that this patch set has seen quite a lot of testing and that it has been some time since the last of the bugs found by that testing was fixed, which is encouraging. Some new tests have been submitted for the xfstests test suite specifically to exercise the deduplication and to check for some of the bugs that have been found and fixed.
One aspect of testing that seemed strangely absent from the pull request was any hint of how in-band deduplication affects performance. It is to be expected that this sort of functionality would slow writes down, particularly when the table of hashes is not kept in memory. It could also speed up some writes if lots of duplicate extents are found. Some indication of the sort of performance change experienced would certainly help to complete the picture.
But maybe the hope is to crowdsource that testing. There are so many different possible hardware configurations and usage scenarios to test that it is hard for one developer to even begin to give meaningful results. A large community, on the other hand, can try lots of things in parallel. As Wenruo noted in his pull request, there is still work to be done, but it should be quite ready for people to test.
Trying it out requires some patches to btrfs-progs along with the Git tree from the pull request, but that should be no challenge for those who enjoy compiling their own kernels. I'm sure additional results would be most welcome.
In-band deduplication for Btrfs
Posted Mar 10, 2016 14:00 UTC (Thu) by martin.langhoff (guest, #61417) [Link]
There's a known pattern of calculating a fast cheap hash in-band during writes, which is later used by an offline process to assess dedupe candidates.
The fast hash will have some false positives; that's OK. It's merely a low-cost guide for the batch or idle process to pick candidates. The offline process performs a full comparison, and might have additional rules, such as skipping files with high rates of change.
Perhaps that's a better approach?
In-band deduplication for Btrfs
Posted Mar 10, 2016 14:57 UTC (Thu) by jtaylor (subscriber, #91739) [Link]
Your suggested approach is basically batch deduplication. Btrfs already stores cheap hashes with its data; a userspace program could use these to relatively quickly find candidates for the dedupe ioctl to deduplicate safely.
duperemove basically does this, except it cannot yet do incremental deduplication.
That could probably be done using the btrfs subvolume find-new command, which can list what has changed since a certain filesystem generation.
In-band deduplication for Btrfs
Posted Mar 10, 2016 18:41 UTC (Thu) by roblucid (guest, #48964) [Link]
I would guess avoiding the bit-for-bit comparison is done for performance reasons, on the pragmatic basis that collisions, even if there are only really 64 bits of entropy, are going to be very, very rare :)
In-band deduplication for Btrfs
Posted Mar 10, 2016 19:01 UTC (Thu) by martin.langhoff (guest, #61417) [Link]
In-band deduplication for Btrfs
Posted Mar 10, 2016 23:23 UTC (Thu) by nybble41 (subscriber, #55106) [Link]
Even filesystems don't "roll the dice" often enough to make 256-bit hash collisions a serious consideration.
In-band deduplication for Btrfs
Posted Mar 11, 2016 14:05 UTC (Fri) by bgoglin (subscriber, #7800) [Link]
In-band deduplication for Btrfs
Posted Mar 11, 2016 15:24 UTC (Fri) by nybble41 (subscriber, #55106) [Link]
In-band deduplication for Btrfs
Posted Mar 11, 2016 15:30 UTC (Fri) by micka (subscriber, #38720) [Link]
OK, that's not by random chance anymore, but...
In-band deduplication for Btrfs
Posted Mar 11, 2016 16:25 UTC (Fri) by nybble41 (subscriber, #55106) [Link]
In-band deduplication for Btrfs
Posted Mar 15, 2016 16:10 UTC (Tue) by intgr (subscriber, #39733) [Link]
I disagree, that actually is a very scary failure mode for a file system. If an attacker is allowed to influence what gets stored in a file system, then a preimage attack or possibly a clever application of a collision attack would allow poisoning the file system.
For instance, an attacker knows that some user wants to store document A on the system. The attacker can prepare a colliding document B and upload it before the user gets the chance to upload A. When document A is written later, the file system will throw away A and keep the tampered document B instead.
Consider that document A can be, for example, a system package update that the system administrator installs. Lulz ensues.
In-band deduplication for Btrfs
Posted Mar 15, 2016 17:26 UTC (Tue) by nybble41 (subscriber, #55106) [Link]
> I disagree, that actually is a very scary failure mode for a file system.
I wasn't trying to downplay the problems it would create for a filesystem, and I agree with everything else you said. However, SHA-256 hashes are used for more than just identifying blocks within filesystems for deduplication. The ability to create SHA-256 hash collisions would undermine the entire digital signature system, for example. Implementing a workaround is also easier in the context of deduplication—in the short term you can just turn it off while you reindex the drive with a different hash function. Unless the content of your filesystem is *really* important, the odds of anyone wasting a SHA-256 collision 0-day attack on it are vanishingly small, and even a major issue with the algorithm which cut the effective bit length in half would not represent an immediate practical threat.
In-band deduplication for Btrfs
Posted Mar 24, 2016 16:18 UTC (Thu) by nye (guest, #51576) [Link]
Somewhat more to the point, if you have a function which checks for duplicate data - whether by comparing the hashes, or comparing the data itself - the chance of a hash collision is less than the chance that random memory errors will happen to cause that function to return the wrong result. That is to say, the reliability of the byte-for-byte function is no greater than the compare-the-hashes function, at least when it comes to order of magnitude.
In practical terms, if you have a pipeline which relies on every step operating correctly, there's not much point in paying an ongoing cost to improve the reliability of one given component if there are others that are orders of magnitude more likely to fail. Sure, technically it makes a difference, but only in the sense that you can make the ocean bigger by tipping a bucket of water into it.
In-band deduplication for Btrfs
Posted Mar 11, 2016 17:56 UTC (Fri) by ikm (subscriber, #493) [Link]
You always play probability games, whether you like it or not. Any digital circuit is a system with SNRs sufficiently high to quantize levels reliably, but this can't ever be 100% reliable. Data redundancy can be put on top of it to make it more reliable - however, there's still always a probability of state and data corruption in such a system. Once you realize that, the fears of having a hash collision reduce to the question of being able to calculate the respective probabilities and compare them with the other probabilities involved.
In-band deduplication for Btrfs
Posted Mar 11, 2016 21:56 UTC (Fri) by oever (guest, #987) [Link]
In-band deduplication for Btrfs
Posted Mar 13, 2016 19:21 UTC (Sun) by nix (subscriber, #2304) [Link]
56 matching prefix bits
2.02 bits per doubling
104 bits (51.38 doublings) remaining
2.93716e+15 times larger is possible
Everyone on earth could have 438382 data sets like yours, all in one
repository, and we would expect 1 object collision.
I am not scared of a collision.
In-band deduplication for Btrfs
Posted Mar 14, 2016 15:29 UTC (Mon) by mathstuf (subscriber, #69389) [Link]
In-band deduplication for Btrfs
Posted Mar 15, 2016 17:08 UTC (Tue) by nix (subscriber, #2304) [Link]
In-band deduplication for Btrfs
Posted Aug 4, 2016 15:24 UTC (Thu) by JoeyUnknown (guest, #110181) [Link]
It should be a performance option. I can think of plenty of cases where, for me, a hash is fine. In some cases, however, I don't really want to play dice with my data. Secondary to that, while right now the possibility of a collision is low, in the future things can happen that might change that.
In some cases, depending on the scenario, I would rather have a system that performs worse than one with the possibility of a bizarre hidden integrity failure, which can make a heck of a mess. If there ever was a hash collision, chances are it wouldn't be detected. The data would just have to be rebuilt and repaired or something. It's just one less vector to worry about when it comes to big data where integrity is sensitive.
In-band deduplication for Btrfs
Posted Mar 15, 2016 21:11 UTC (Tue) by robbe (guest, #16131) [Link]
But, the willingness to dedicate CPU cycles to pure science is waning – maybe we should promise the person who finds the first collision some riches. Let’s call it btrcoin.
In-band deduplication for Btrfs
Posted Mar 28, 2016 1:56 UTC (Mon) by quwenruo (guest, #107900) [Link]
I found that most comments are concerned with the possibility of SHA256 collisions.
That's normal; if a collision is possible, it will happen one day.
Personally I'm quite confident about the current "strong" hash algorithms, but since users have the concern, we need to address it and, just like ZFS, provide a byte-by-byte comparison option for dedupe.
Currently what we can do is add SHA512 to reduce the chance of collision, although SHA512 was originally planned to improve hash speed on modern 64-bit machines.
But for now, we're focusing on polishing the ioctl interface and on-disk format with the maintainers (Chris and David), hoping the in-memory backend can be merged in 4.7 first.
Byte-by-byte comparison may be scheduled after all the current back-ends are merged.
But since so many users worry about SHA256 collisions, we will keep it in mind and investigate from now on.
Thanks,
Qu
In-band deduplication for Btrfs
Posted Aug 4, 2016 15:08 UTC (Thu) by JoeyUnknown (guest, #110181) [Link]
I think this should be made to work safely first. It should not start as an incomplete, optimised solution.
A basic system doesn't need to use a cryptographically secure hash, but it does need to have no risk of collision, no matter how unlikely.
Your hash can really be anything from a CRC to the most significant bits of the block. Something that avoids excessive bucketing, however, would be ideal (i.e. all bits contribute more or less equally). That might be similar to what you get with cryptographic functions, but it doesn't mean you need a cryptographic function. It doesn't really matter for just making it work.
Then when you have a match or matches, all you can do is a full content compare to confirm and use the match. You will need to be able to bucket because one key could match multiple blocks.
Once you have something that works and is safe, you can start to optimise it as much as possible while still keeping it entirely reliable. Once you've reached the point where you can go no further with that, you can then add options for performance at the cost of risk, and options for greater security.
Options might include skipping the read check, using a more secure or larger hash, or writing blocks but deferring deduplication to a background task.
Partial opportunistic deduplication is another option as well. For example, when a block is written that is the same as another, there's a very good chance that adjacent blocks match too: if A[n] == B[n], the probability of A[n +- 1] == B[n +- 1] is very high. In this case, you would deduplicate where it is cheap and easy but let a scheduled task clean up the rest later. Another option might be hinting, that is, turning on dedupe in some manner during a big operation. You could also prioritise the most common data; a zeroed-out block is one of the obvious cases.
You can also do some direct content indexing with various tree structures (i.e. keyed on each byte or word), but that does not work well at all for random data. Those kinds of trees more or less represent something akin to compression trees (and are also similar to deduplication). Random data does not compress well at all, so that would be an attack vector, or a risk of relying on a less lossy, key-compression-based system.
This isn't really a simple thing but I can say this, I would not feel comfortable using a solution vulnerable to hash collision no matter how unlikely it is, especially one that can't detect it. I might be sensitive but I would prefer a flawless system.
There may be other similar systems out there you can look at that basically do the same but outside of the realm of file systems. Hash tables have collisions and need to detect them so that might be a place to start looking at.