In-band deduplication for Btrfs
"In-band deduplication" is the process of detecting and unifying duplicate data blocks as files are being written, rather than at some later time. Btrfs support for this feature has been under development since at least early 2013. Quite recently it reached the point where developer Qu Wenruo thought it was sufficiently ready to send Btrfs maintainer Chris Mason a pull request (as yet unanswered) hoping that it might be added to the kernel during the 4.6 merge window. While this is far from a magic bullet that will suddenly remove all the waste in your filesystem that is caused by duplicate data, there are use cases where it could bring real benefits.
Offline and in-band
It has been possible to unify duplicated blocks in Btrfs since Linux 3.12 when the BTRFS_IOC_FILE_EXTENT_SAME ioctl() command was added (it has since been renamed FIDEDUPERANGE when the basic ioctl() handling was moved to the VFS, so it could someday be used by other filesystems). This ioctl() is given a range of bytes in the file (which must be aligned to the filesystem block size) and a list of offsets in other files (or possibly the same file). This identifies two or more ranges of the same size that are claimed to be identical. Btrfs reads all the blocks, checks that they are in fact identical, and then changes its internal bookkeeping so that all files point to one set of blocks on the storage medium. The space used by the other copies of the data will typically become free space available for reallocation.
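For the curious, a call to this ioctl() from user space looks roughly like the following minimal sketch (on kernels that provide the VFS-level FIDEDUPERANGE name); the file names, offsets, and length are purely illustrative, and error handling is abbreviated:

```c
/*
 * Minimal sketch of a FIDEDUPERANGE call; the file names, offsets, and
 * length are purely illustrative, and error handling is abbreviated.
 */
#include <fcntl.h>
#include <linux/fs.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>

int main(void)
{
	int src = open("/mnt/btrfs/image-a.img", O_RDONLY);
	int dst = open("/mnt/btrfs/image-b.img", O_RDWR);
	struct file_dedupe_range *range;

	if (src < 0 || dst < 0) {
		perror("open");
		return 1;
	}

	/* One destination range; the info[] array is allocated along with
	 * the header structure. */
	range = calloc(1, sizeof(*range) + sizeof(struct file_dedupe_range_info));
	range->src_offset = 0;			/* must be block-aligned */
	range->src_length = 1024 * 1024;	/* 1MB claimed to be identical */
	range->dest_count = 1;
	range->info[0].dest_fd = dst;
	range->info[0].dest_offset = 0;		/* must be block-aligned */

	if (ioctl(src, FIDEDUPERANGE, range) < 0) {
		perror("FIDEDUPERANGE");
		return 1;
	}

	/* The kernel checks that the ranges really are identical before
	 * sharing extents; the outcome is reported per destination. */
	if (range->info[0].status == FILE_DEDUPE_RANGE_SAME)
		printf("deduplicated %llu bytes\n",
		       (unsigned long long)range->info[0].bytes_deduped);
	else if (range->info[0].status == FILE_DEDUPE_RANGE_DIFFERS)
		printf("ranges differ; nothing was shared\n");

	return 0;
}
```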
To understand how this works it is necessary to understand how Btrfs makes use of "extents". An extent is a contiguous set of blocks on the storage device and also a contiguous set of blocks in a file (at least initially). It has a "logical address" that is mapped to a physical location on one, or perhaps more, of the underlying devices. An extent can become part of several different files, either through snapshots, reflinks, or deduplication. To allow this storage space to ultimately be reused, Btrfs maintains a reference count for each extent and will not re-use any of the space until that count becomes zero.
It is not generally possible to split an extent. If a single block is written into the middle of a large extent, the "copy on write" design of Btrfs requires that either the whole extent be copied or that the file's indexing information be changed to point to the first half of the extent, then the new block, then the remainder of the extent. Btrfs takes the latter approach.
When FIDEDUPERANGE is used, the file indexes will be updated to point to the relevant extents or partial extents of the source file, and the reference counts on the various extents will be increased or decreased as appropriate. This may not release quite as much space as expected since there may have been extents that were only partially identical between two files. The identical part in one extent may have no files including it any more, but the space will not be freed until the whole extent is no longer referenced.
duperemove is a tool that can be used to examine a set of files, look for duplicated regions, and call the ioctl() to remove that duplication from the underlying storage. This can be useful anywhere that files are likely to be completely or largely the same, but where they need to be kept separate for administrative or other practical reasons. A common use case is filesystem images used for virtualization or the multiple similar-but-not-identical file sets used by containers. Running duperemove from time to time could save a lot of disk space.
This functionality was referred to in the initial commit as "offline deduplication", which is a little confusing since "offline" in the context of filesystems usually implies that the filesystem isn't mounted, and that is not the case here. It is more like "on-demand" deduplication and is sometimes referred to as "batch" deduplication. This contrasts with the new work that could be called "time-of-write" or transparent deduplication, but is called "in-band" deduplication in this patch set.
When duperemove runs, it computes a hash value for every block in every file and then compares these, looking for ranges of matching blocks. When it finds suitable ranges, it requests the deduplication. In-band deduplication uses the same idea of hashing blocks of data, but in many other respects it is quite different.
In-band deduplication works in units of extents of a specific size: a deduplication block size of up to 8MB can be configured; the default is 128KB. When data is written out to storage, blocks are gathered into extents of this size whenever possible. If there are only sufficient blocks being written for a smaller extent, that extent will be written without deduplication. For every extent of the target size, the hash (currently SHA-256, though possibly configurable in the future) is calculated. If another extent can be found with the same hash, then a new reference is added to that extent and the new data is not written. If no match is found, the data is written as normal and the hash is stored together with the logical address of the extent so that it is available for future matching.
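The decision made for each full-sized extent can be illustrated with a small user-space simulation. This is only a sketch of the bookkeeping, assuming SHA-256 over fixed 128KB extents; the naive linked list merely stands in for the kernel's hash index and none of this is the patch set's code. It links against OpenSSL's libcrypto:

```c
/*
 * User-space simulation of the in-band dedup decision, assuming SHA-256 over
 * fixed 128KB extents.  The naive linked list stands in for the kernel's
 * hash index; this illustrates the bookkeeping, not the patch set's code.
 * Build with: gcc dedup-sim.c -lcrypto
 */
#include <openssl/sha.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define EXTENT_SIZE (128 * 1024)	/* the default dedup block size */

struct entry {
	unsigned char digest[SHA256_DIGEST_LENGTH];
	long long offset;		/* where this data was first written */
	struct entry *next;
};

static struct entry *table;

/* Return the offset of an already-written extent with this digest, or -1. */
static long long lookup(const unsigned char *digest)
{
	for (struct entry *e = table; e; e = e->next)
		if (!memcmp(e->digest, digest, SHA256_DIGEST_LENGTH))
			return e->offset;
	return -1;
}

static void insert(const unsigned char *digest, long long offset)
{
	struct entry *e = malloc(sizeof(*e));

	memcpy(e->digest, digest, SHA256_DIGEST_LENGTH);
	e->offset = offset;
	e->next = table;
	table = e;
}

int main(int argc, char **argv)
{
	static unsigned char buf[EXTENT_SIZE];
	unsigned char digest[SHA256_DIGEST_LENGTH];
	long long offset = 0, shared = 0;
	FILE *f = fopen(argc > 1 ? argv[1] : "testfile", "rb");
	size_t n;

	if (!f) {
		perror("fopen");
		return 1;
	}
	while ((n = fread(buf, 1, EXTENT_SIZE, f)) == EXTENT_SIZE) {
		SHA256(buf, n, digest);
		if (lookup(digest) >= 0)
			shared++;		/* would add a reference, not write */
		else
			insert(digest, offset);	/* would write and record the hash */
		offset += n;
	}
	/* A short tail extent would simply be written without deduplication. */
	printf("%lld of the full-sized extents could share existing data\n", shared);
	fclose(f);
	return 0;
}
```

Running this over a file gives a rough idea of how many extents in-band deduplication could avoid writing, under the optimistic assumption that extent boundaries line up with these fixed offsets.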
This process can be enabled for a whole Btrfs filesystem using the "btrfs dedup enable" command and then disabled for individual files or directories by using the "btrfs property set" command to set the "dedup" property to "disable". Setting this on a directory causes all files or directories created within that directory to inherit the setting. This means that enabling deduplication on only a subset of a filesystem is possible, but a little bit clumsy.
Two back-ends
There are two separate back-ends for storing the mapping between hashes and extents; the fact that the implementation of these back-ends is kept cleanly separate from their use is rightfully highlighted as a strength of this patch set.
One of the two back-ends is an in-memory mapping. Two red-black trees are created when the filesystem is mounted; whenever new data is written to a file on which deduplication is enabled, new entries are added to the trees. One tree maps from hash to logical address and is used to see if a suitable extent exists that already stores the required data. The second provides a reverse mapping that allows hash entries to be deleted when an extent is freed after its reference count reaches zero. The size of these trees is limited by a configurable entry count that defaults to 32768. It would not be surprising to see that replaced or augmented with a "shrinker" callback from the memory management subsystem, so that it could shrink only when memory is tight.
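An entry linked into both trees might look roughly like the following guess at the layout; the structure and field names are invented for illustration and are not taken from the patch set:

```c
/* Illustrative guess at the shape of an in-memory dedup entry; the names
 * here are invented and do not come from the patch set. */
#include <linux/rbtree.h>
#include <linux/types.h>

struct inmem_dedup_entry {
	struct rb_node hash_node;	/* in the hash-indexed tree, used for
					 * lookups when new data is written */
	struct rb_node bytenr_node;	/* in the address-indexed tree, used to
					 * drop the entry when the extent's
					 * reference count reaches zero */
	u64 bytenr;			/* logical address of the extent */
	u32 num_bytes;			/* length of the extent */
	u8 hash[32];			/* SHA-256 of the extent's data */
};
```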
The second mechanism stores these mappings in the filesystem. Btrfs has flexible data structures for storing all sorts of metadata and data on disk using a number of distinct B-trees, one for each subvolume, one for the extent reference counts, one for storing checksums, etc. The in-band deduplication patchset adds another B-tree that has two types of keys: one for lookup by hash and another for lookup by extent address.
Looking up the hash in the B-tree is not quite so straightforward as one might hope, since Btrfs has a fixed format for the lookup key: a 64-bit object-ID, an eight-bit type, and a 64-bit offset. It is not possible to store the full 256-bit hash in the key and even storing 128 bits in the two 64-bit fields would be problematic. Btrfs requires that all keys be unique, so that approach would effectively limit the hash size to 128 bits.
The approach chosen is to store 64 bits of the hash in the object-ID, and the logical address of the extent in the offset field. Each key is accompanied by a variable-length data field and this is used to store the full hash. If the hashes of two extents collide in those 64 bits, the offsets will still be different, so the keys will be unique.
To handle these collisions a lookup first performs a regular B-tree search for a key with the appropriate hash bits in the object-ID and U64_MAX in the offset field. This will provide the last possible location where the target hash could be stored. A linear search is then performed searching backward and comparing the full hash until a match is found or there are no more keys with the required hash fragment.
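The following user-space sketch illustrates that lookup strategy, with a sorted array standing in for the dedup B-tree; the structure names, key type value, and helper functions are all invented for illustration:

```c
/*
 * User-space sketch of the on-disk hash lookup described above.  A sorted
 * array stands in for the dedup B-tree; structure names, the key type
 * value, and helpers are invented for illustration only.
 */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define DEDUP_HASH_ITEM 1		/* hypothetical key type */

struct dedup_key {
	uint64_t objectid;		/* first 64 bits of the SHA-256 hash */
	uint8_t  type;
	uint64_t offset;		/* logical address of the extent */
};

struct dedup_item {
	struct dedup_key key;
	uint8_t full_hash[32];		/* full hash stored in the item data */
};

/* Order keys the way the B-tree does: objectid, then type, then offset. */
static int key_cmp(const struct dedup_key *a, const struct dedup_key *b)
{
	if (a->objectid != b->objectid)
		return a->objectid < b->objectid ? -1 : 1;
	if (a->type != b->type)
		return a->type < b->type ? -1 : 1;
	if (a->offset != b->offset)
		return a->offset < b->offset ? -1 : 1;
	return 0;
}

/*
 * Search for the last possible slot for this hash fragment (offset set to
 * UINT64_MAX), then walk backward comparing full hashes.  Returns the
 * logical address of a matching extent, or 0 if there is none.
 */
static uint64_t dedup_lookup(const struct dedup_item *items, int nr,
			     const uint8_t full_hash[32])
{
	struct dedup_key search = { .type = DEDUP_HASH_ITEM, .offset = UINT64_MAX };
	int i = 0;

	memcpy(&search.objectid, full_hash, sizeof(search.objectid));

	/* A linear scan stands in for the real B-tree search. */
	while (i < nr && key_cmp(&items[i].key, &search) <= 0)
		i++;

	for (i--; i >= 0 && items[i].key.objectid == search.objectid &&
		  items[i].key.type == search.type; i--)
		if (!memcmp(items[i].full_hash, full_hash, 32))
			return items[i].key.offset;

	return 0;
}

int main(void)
{
	uint8_t hash[32] = { 0xab, 0xcd };	/* pretend SHA-256 of an extent */
	struct dedup_item items[1];

	memset(items, 0, sizeof(items));
	memcpy(&items[0].key.objectid, hash, sizeof(items[0].key.objectid));
	items[0].key.type = DEDUP_HASH_ITEM;
	items[0].key.offset = 0x100000;		/* the extent's logical address */
	memcpy(items[0].full_hash, hash, sizeof(items[0].full_hash));

	printf("matching extent at 0x%llx\n",
	       (unsigned long long)dedup_lookup(items, 1, hash));
	return 0;
}
```

The real code performs a B-tree search rather than a linear scan, of course, but the backward walk over keys sharing the 64-bit hash fragment is the part that handles collisions in the key space.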
Unlike with the in-memory back-end, there is no mechanism to limit the number of hash entries stored, beyond the normal filesystem-full checks that might prevent a new extent from being written.
One difference between this in-band deduplication and the on-demand approach that is worth highlighting is the dependence placed on the hash. duperemove uses a hash only to guide the search for duplicate blocks; the hash is not authoritative, and Btrfs will not allow the deduplication to happen if the identified regions are not byte-for-byte identical. In-band deduplication as currently implemented does not perform that comparison. If the hash matches, then the extents are assumed to match. This means the correctness of the deduplication is completely dependent on the uniqueness of the hash, so the extra effort to make use of all 256 bits is easy to justify.
Whether that complete dependence is itself justified is not an easy question to answer. It is certainly extremely unlikely for two distinct blocks to have the same hash, but it is also certainly quite possible. It would only need to cause corruption once to be extremely embarrassing. Adding a byte-for-byte comparison is on the planned feature list but, according to Wenruo, "not anytime soon".
In-band deduplication brings the benefit of being automatic, but at the cost that it will probably miss duplication that duperemove could find. Different alignment of extents between files would completely defeat the duplicate detection, as would creating files before deduplication was enabled. The former could be improved to some degree with a smaller extent size, though that brings costs of its own. As is so often the case, finding the most effective solution, which could include a mix of offline and in-band deduplication, will be highly dependent on the particular use case.
Test, test, and test
Wenruo assures us that this patch set has seen quite a lot of testing and that it has been some time since the last of the bugs found by that testing was fixed, which is encouraging. Some new tests have been submitted for the xfstests test suite specifically to exercise the deduplication and to check for some of the bugs that have been found and fixed.
One aspect of testing that seemed strangely absent from the pull request was any hint of how in-band deduplication affects performance. It is to be expected that this sort of functionality would slow writes down, particularly when the table of hashes is not kept in memory. It could also speed up some writes if lots of duplicate extents are found. Some indication of the sort of performance change experienced would certainly help to complete the picture.
But maybe the hope is to crowdsource that testing. There are so many different possible hardware configurations and usage scenarios to test that it is hard for one developer to even begin to give meaningful results. A large community, on the other hand, can try lots of things in parallel. As Wenruo noted in his pull request, there is still work to be done, but it should be quite ready for people to test.
Trying it out requires some patches to btrfs-progs along with the Git tree from the pull request, but that should be no challenge for those who enjoy compiling their own kernels. I'm sure additional results would be most welcome.
In-band deduplication for Btrfs
Posted Mar 10, 2016 14:00 UTC (Thu) by martin.langhoff (guest, #61417) [Link]
There's a known pattern of calculating a fast cheap hash in-band during writes, which is later used by an offline process to assess dedupe candidates.
The fast hash will have some false positives; that's OK. It's merely a low-cost guide for the batch or idle process to pick candidates. The offline process performs a full comparison, and might have additional rules, such as skipping files with high rates of change.
Perhaps that's a better approach?
In-band deduplication for Btrfs
Posted Mar 10, 2016 14:57 UTC (Thu) by jtaylor (subscriber, #91739) [Link]
Your suggested approach is basically batch deduplication. Btrfs already stores cheap hashes with its data; a userspace program could use these to relatively quickly find candidates for the dedupe ioctl to deduplicate safely.
duperemove basically does this, except it cannot yet do incremental deduplication.
That could probably be done using the btrfs subvolume find-new command, which can list what has changed since a certain filesystem generation.
In-band deduplication for Btrfs
Posted Mar 10, 2016 18:41 UTC (Thu) by roblucid (guest, #48964) [Link]
I would guess avoiding the bit-for-bit comparison is done for performance reasons, on the pragmatic basis that collisions, even if there are only really 64 bits of entropy, are going to be very, very rare :)
In-band deduplication for Btrfs
Posted Mar 10, 2016 19:01 UTC (Thu) by martin.langhoff (guest, #61417) [Link]
In-band deduplication for Btrfs
Posted Mar 10, 2016 23:23 UTC (Thu) by nybble41 (subscriber, #55106) [Link]
Even filesystems don't "roll the dice" often enough to make 256-bit hash collisions a serious consideration.
In-band deduplication for Btrfs
Posted Mar 11, 2016 14:05 UTC (Fri) by bgoglin (subscriber, #7800) [Link]
In-band deduplication for Btrfs
Posted Mar 11, 2016 15:24 UTC (Fri) by nybble41 (subscriber, #55106) [Link]
In-band deduplication for Btrfs
Posted Mar 11, 2016 15:30 UTC (Fri) by micka (subscriber, #38720) [Link]
OK, that's not by random chance anymore, but...
In-band deduplication for Btrfs
Posted Mar 11, 2016 16:25 UTC (Fri) by nybble41 (subscriber, #55106) [Link]
In-band deduplication for Btrfs
Posted Mar 15, 2016 16:10 UTC (Tue) by intgr (subscriber, #39733) [Link]
I disagree, that actually is a very scary failure mode for a file system. If an attacker is allowed to influence what gets stored in a file system, then a preimage attack or possibly a clever application of a collision attack would allow poisoning the file system.
For instance, an attacker knows that some user wants to store document A on the system. The attacker can prepare a colliding document B and upload it before the user gets the chance to upload A. When document A is written later, the file system will throw away A and keep the tampered document B instead.
Consider that document A can be, for example, a system package update that the system administrator installs. Lulz ensues.
In-band deduplication for Btrfs
Posted Mar 15, 2016 17:26 UTC (Tue) by nybble41 (subscriber, #55106) [Link]
> I disagree, that actually is a very scary failure mode for a file system.
I wasn't trying to downplay the problems it would create for a filesystem, and I agree with everything else you said. However, SHA-256 hashes are used for more than just identifying blocks within filesystems for deduplication. The ability to create SHA-256 hash collisions would undermine the entire digital signature system, for example. Implementing a workaround is also easier in the context of deduplication—in the short term you can just turn it off while you reindex the drive with a different hash function. Unless the content of your filesystem is *really* important, the odds of anyone wasting a SHA-256 collision 0-day attack on it are vanishingly small, and even a major issue with the algorithm which cut the effective bit length in half would not represent an immediate practical threat.
In-band deduplication for Btrfs
Posted Mar 24, 2016 16:18 UTC (Thu) by nye (guest, #51576) [Link]
Somewhat more to the point, if you have a function which checks for duplicate data - whether by comparing the hashes, or comparing the data itself - the chance of a hash collision is less than the chance that random memory errors will happen to cause that function to return the wrong result. That is to say, the reliability of the byte-for-byte function is no greater than the compare-the-hashes function, at least when it comes to order of magnitude.
In practical terms, if you have a pipeline which relies on every step operating correctly, there's not much point in paying an ongoing cost to improve the reliability of one given component if there are others that are orders of magnitude more likely to fail. Sure, technically it makes a difference, but only in the sense that you can make the ocean bigger by tipping a bucket of water into it.
In-band deduplication for Btrfs
Posted Mar 11, 2016 17:56 UTC (Fri) by ikm (subscriber, #493) [Link]
You always play probability games, whether you like it or not. Any digital circuit is a system with SNRs sufficiently high to quantize levels reliably, but this can't ever be 100% reliable. Data redundancy can be put on top of it to make it more reliable - however, there's still always a probability of state and data corruption in such a system. Once you realize that, the fears of having a hash collision reduce to the question of being able to calculate the respective probabilities and compare them with the other probabilities involved.
In-band deduplication for Btrfs
Posted Mar 11, 2016 21:56 UTC (Fri) by oever (guest, #987) [Link]
In-band deduplication for Btrfs
Posted Mar 13, 2016 19:21 UTC (Sun) by nix (subscriber, #2304) [Link]
56 matching prefix bits
2.02 bits per doubling
104 bits (51.38 doublings) remaining
2.93716e+15 times larger is possible
Everyone on earth could have 438382 data sets like yours, all in one
repository, and we would expect 1 object collision.
I am not scared of a collision.
In-band deduplication for Btrfs
Posted Mar 14, 2016 15:29 UTC (Mon) by mathstuf (subscriber, #69389) [Link]
In-band deduplication for Btrfs
Posted Mar 15, 2016 17:08 UTC (Tue) by nix (subscriber, #2304) [Link]
In-band deduplication for Btrfs
Posted Aug 4, 2016 15:24 UTC (Thu) by JoeyUnknown (guest, #110181) [Link]
It should be a performance option. I can think of plenty of cases where, for me, a hash is fine. In some cases, however, I don't really want to play dice with my data. Secondary to that, while right now the possibility of a collision is low, in the future things can happen that might change that.
In some cases, depending on the scenario, I would rather have a system that performs worse than one with the possibility of a bizarre hidden integrity failure, which can make a heck of a mess. If there ever was a hash collision, chances are it wouldn't be detected. The data would just have to be rebuilt and repaired or something. It's just one less vector to worry about when it comes to big data where integrity is sensitive.
In-band deduplication for Btrfs
Posted Mar 15, 2016 21:11 UTC (Tue) by robbe (guest, #16131) [Link]
But, the willingness to dedicate CPU cycles to pure science is waning – maybe we should promise the person who finds the first collision some riches. Let’s call it btrcoin.
In-band deduplication for Btrfs
Posted Mar 28, 2016 1:56 UTC (Mon) by quwenruo (guest, #107900) [Link]
I found that most comments are concerned with the possibility of SHA256 collisions.
That's normal; if a collision is possible, it will happen one day.
Personally I'm quite confident about the current "strong" hash algorithms, but since users have the concern, we need to address it and, just like ZFS, provide a byte-by-byte comparison option for dedupe.
Currently what we can do is add SHA512 to reduce the chance of collision, although SHA512 was originally planned to improve hash speed on modern 64-bit machines.
But for now, we're focusing on polishing the ioctl interface and on-disk format with the maintainers (Chris and David), hoping the in-memory backend can be merged in 4.7 first.
Byte-by-byte comparison may be scheduled after all the current back-ends are merged.
But since so many users worry about SHA256 collisions, we will keep it in mind and investigate from now on.
Thanks,
Qu
In-band deduplication for Btrfs
Posted Aug 4, 2016 15:08 UTC (Thu) by JoeyUnknown (guest, #110181) [Link]
I think this should be made to work safely first. It should not start as an incomplete, optimised solution.
A basic system doesn't need to use a cryptographically secure hash, but it does need to have no risk of collision, no matter how unlikely.
Your hash can really be anything from a CRC to the most significant bits of the block. Something that avoids excessive bucketing, however, would be ideal (i.e. all bits contribute more or less equally). That might be similar to what you get with cryptographic functions, but it doesn't mean you need a cryptographic function. It doesn't really matter for just making it work.
Then when you have a match or matches, all you can do is a full content compare to confirm and use the match. You will need to be able to bucket because one key could match multiple blocks.
Once you have something that works and is safe, you can start to optimise it as much as possible while still keeping it entirely reliable. Once you've reached the point where you can go no further with that, you can then add options for performance at the cost of risk, and options for greater security.
Options might include skipping the read check, using a more secure or larger hash, or writing blocks but deferring deduplication to a background task.
Partial opportunistic deduplication is another option as well. For example, when a block is written that is the same as another, there's a very good chance that adjacent blocks match too: if A[n] == B[n], the probability of A[n +- 1] == B[n +- 1] is very high. In this case, you would deduplicate where it is cheap and easy but let a scheduled task clean up the rest later. Another option might be hinting, that is, turning on dedupe in some manner during a big operation. You could also prioritise the most common data; a zeroed-out block is one of the obvious cases.
You can also do some direct content indexing with various tree structures (i.e. keyed on each byte or word), but that does not work well at all for random data. Those kinds of trees more or less represent something akin to compression trees (and are also similar to deduplication). Random data does not compress well at all, so that would be an attack vector, or a risk of relying on a less lossy, key-compression-based system.
This isn't really a simple thing but I can say this, I would not feel comfortable using a solution vulnerable to hash collision no matter how unlikely it is, especially one that can't detect it. I might be sensitive but I would prefer a flawless system.
There may be other similar systems out there you can look at that basically do the same but outside of the realm of file systems. Hash tables have collisions and need to detect them so that might be a place to start looking at.