Overlayfs issues and experiences
David Howells and Mike Snitzer led a discussion at the 2015 Linux Storage, Filesystem, and Memory Management (LSFMM) Summit about the overlay filesystem (overlayfs), which is the union filesystem implementation that was adopted into the kernel in 3.18. There are a number of problems that need to be addressed for this new filesystem.
Howells was first up. He noted that overlayfs does not play nicely with security technologies that use object labels (e.g. SELinux). There are a couple of problems that he reported back in November. Overlay filesystems can have three different inodes for any given file: one in the overlayfs itself, one in the read-only lower layer, and another in the writable upper layer if the file has been written (and, thus, copied up to the upper layer). The problem for SELinux and others is which of the three possible versions of the inode (i.e. lower, upper, or overlay) is visible to them, since that determines which security labels will be seen on the file. But those problems have largely been solved at this point.
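The copy-up behavior behind those three inodes can be illustrated with a toy, in-memory model. This is only a sketch for intuition, not how the kernel implements overlayfs; the `ToyOverlay` class and its method names are hypothetical and were not part of the discussion.

```python
class ToyOverlay:
    """Toy model: a read-only lower layer plus a writable upper layer."""

    def __init__(self, lower):
        self.lower = dict(lower)  # treated as read-only after construction
        self.upper = {}           # receives copied-up and newly created files

    def backing_layer(self, name):
        """Report which layer currently supplies the file's data."""
        if name in self.upper:
            return "upper"
        if name in self.lower:
            return "lower"
        raise FileNotFoundError(name)

    def read(self, name):
        """Reads prefer the upper layer and fall back to the lower one."""
        layer = self.upper if name in self.upper else self.lower
        if name not in layer:
            raise FileNotFoundError(name)
        return layer[name]

    def append(self, name, data):
        """The first write copies the file up; the lower copy is never touched."""
        if name not in self.upper and name in self.lower:
            self.upper[name] = self.lower[name]  # copy-up
        self.upper[name] = self.upper.get(name, b"") + data


if __name__ == "__main__":
    ov = ToyOverlay({"passwd": b"root\n"})
    print(ov.backing_layer("passwd"))  # served from the lower layer
    ov.append("passwd", b"guest\n")
    print(ov.backing_layer("passwd"))  # now served from the upper layer
```

In this model, anything attached to the lower object before the first write (a label, a lock, a notification request) stays behind unless it is explicitly carried along during copy-up, which is the shape of the problems raised in the session.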
![David Howells](https://web.archive.org/web/20240110050128im_/https://static.lwn.net/images/2015/lsf-howells-sm.jpg)
There are two more problems, with file locking and fanotify, that still need to be addressed. The first is a Jeff Layton problem, while the other is an Eric Paris problem, Howells said with a chuckle. Layton was present, so the discussion turned to locking. What happens when an overlayfs file that has not been written to is locked (so the lock must be placed on the lower layer) and is then written to, so that it must be copied up from the lower layer into the upper? Should the lock be copied up too? And if two overlays refer to the same underlying file, each with its own copied-up version, where should the lock go?
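The lock-then-write sequence can be probed from user space. The following is a minimal sketch, assuming `flock()`-style locks as the example (the discussion did not single out a lock type); the function names are hypothetical. It locks a file through a read-only descriptor, writes through a second descriptor, and then checks whether the second descriptor is refused the lock.

```python
import fcntl
import os
import tempfile


def second_open_sees_lock(path):
    """Lock `path` via a read-only descriptor, then write via a second one.

    Returns True if the second descriptor is refused a conflicting lock,
    i.e. the original lock is still visible after the write.
    """
    rfd = os.open(path, os.O_RDONLY)
    try:
        fcntl.flock(rfd, fcntl.LOCK_EX)   # lock taken before any write
        wfd = os.open(path, os.O_WRONLY)
        try:
            os.write(wfd, b"x")           # on overlayfs, a first write forces copy-up
            try:
                fcntl.flock(wfd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            except BlockingIOError:
                return True               # conflict detected: lock still in force
            return False                  # lock was granted: the first lock was missed
        finally:
            os.close(wfd)
    finally:
        os.close(rfd)


def demo():
    """Run the probe on a fresh temporary file."""
    with tempfile.NamedTemporaryFile() as tmp:
        return second_open_sees_lock(tmp.name)


if __name__ == "__main__":
    print("lock still visible after write:", demo())
```

On a conventional filesystem both descriptors refer to the same inode, so the second lock attempt is refused. The open question in the session was what should happen when the two descriptors end up on different layers.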
As it turns out, the fanotify problems are similar. If an application requests notifications on an overlayfs file that has not been written to, the notification must get placed on the lower layer inode. If the notifications are not copied up when the file gets written, then applications won't get notified even if changes are being made to the file.
James Bottomley suggested that the semantics for file locking and fanotify need to be worked out before a mechanism to satisfy them can be proposed. Ted Ts'o was uncomfortable having different behavior based on whether the file was part of an overlayfs. Howells noted that things can get worse than he had described when you add in network filesystems (e.g. SMB or NFS) as the overlayfs layers. He noted that he had posted a message in January with all of the problems he could think of, but "there are probably more".
Layton suggested returning ENOLCK when trying to lock files in an overlayfs until the semantics could be worked out and implemented. Al Viro noted that with overlayfs, a file opened for reading may have a different inode number than one opened for writing. That could be a problem for a number of different applications. The classic example is a mail user agent, Viro said, but some editors also care.
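An application can check for the inode mismatch Viro described by comparing `fstat()` results from two opens of the same path. This is a minimal sketch with hypothetical helper names; whether the two identities differ depends on the filesystem and kernel in use.

```python
import os
import tempfile


def inode_identity(fd):
    """Return the (device, inode) pair identifying the file behind fd."""
    st = os.fstat(fd)
    return (st.st_dev, st.st_ino)


def compare_read_and_write_opens(path):
    """Open `path` read-only, then for writing, and compare the identities."""
    rfd = os.open(path, os.O_RDONLY)
    try:
        before = inode_identity(rfd)
        wfd = os.open(path, os.O_WRONLY)  # on overlayfs this open may trigger copy-up
        try:
            after = inode_identity(wfd)
        finally:
            os.close(wfd)
    finally:
        os.close(rfd)
    return before, after, before == after


def demo():
    """Run the comparison on a fresh temporary file."""
    with tempfile.NamedTemporaryFile() as tmp:
        return compare_read_and_write_opens(tmp.name)


if __name__ == "__main__":
    before, after, same = demo()
    print("read-only open: ", before)
    print("write-mode open:", after)
    print("same file identity:", same)
```

Programs such as mail user agents that use the device and inode pair to decide whether two descriptors refer to the same file would be misled whenever the final comparison is false.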
Bottomley said that there is a need to avoid surprise semantics. To do that, the developers need to know what actually matters and what users care about. POSIX semantics were broken for overlayfs, but does that really harm real users? "There is a limit to how far we need to dig to find problems that people are not complaining about", he said.
![Mike Snitzer](https://web.archive.org/web/20240110050128im_/https://static.lwn.net/images/2015/lsf-snitzer-sm.jpg)
One of the users of overlayfs is Docker, so Snitzer wanted to look at that use case. Docker tried Btrfs, but didn't like it, he said. The project can't use block-based solutions, such as those based on device mapper and thin provisioning (thinp) that most Linux distributions use. The reason behind that is "lame" in Snitzer's view. Essentially, the project wants its Go programs to be built once (on Ubuntu), then to be able to be run on any other distribution forever, which requires statically built binaries. But there is no static library available for udev, which means that the devicemapper graph driver cannot be used. That is a political, not a technical, issue, Snitzer said.
The big reason that Docker has switched to overlayfs is to gain the memory efficiency that comes from pages in the page cache being shared between the containers. That doesn't happen with thinp currently, but Snitzer said that Dave Chinner has some ideas for using XFS on top of thinp to achieve it.
Chinner spoke up to describe the problem, which is that there might be a hundred containers running on a system all based on a snapshot of a single root filesystem. That means there will be a hundred copies of glibc in the page cache because they come from different namespaces with different inodes, so there is no sharing of the data. Basically, he said, there needs to be a kind of page cache deduplication to fix the problem.
Bottomley noted that it was a similar problem to the one that KSM tries to solve. KSM basically uses hashes of the contents of various pages of memory to share memory better between virtual machines. For containers, the main need is to deduplicate the page cache specifically. Bottomley said that the company he works for, Parallels, has a solution to the deduplication problem that does not require hashing each page, but that it is, currently at least, proprietary. Sharing of memory between containers is something that many are looking for, though, so there was some discussion of how to do it without the overhead that KSM incurs. That is where things wound down.
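The scale of the duplication Chinner described can be estimated with a toy model that hashes page-sized chunks, in the spirit of the hash-based description of KSM above. This is only an illustration under stated assumptions: the 4096-byte page size, the use of SHA-256, and the function names are all choices made for the sketch, and it does not reflect how KSM or the Parallels approach is implemented.

```python
import hashlib

PAGE_SIZE = 4096  # assumed page size for the illustration


def split_into_pages(data, page_size=PAGE_SIZE):
    """Cut a byte string into page-sized chunks (the last one may be short)."""
    return [data[i:i + page_size] for i in range(0, len(data), page_size)]


def page_cache_usage(containers):
    """Compare page counts with and without content-based sharing.

    `containers` is an iterable of byte strings, each standing in for the
    file data one container has pulled into the page cache.
    Returns (pages_without_sharing, pages_with_deduplication).
    """
    total = 0
    unique = set()
    for data in containers:
        for page in split_into_pages(data):
            total += 1
            unique.add(hashlib.sha256(page).digest())
    return total, len(unique)


if __name__ == "__main__":
    # Stand-in for a shared library: three pages with distinct contents.
    library = b"".join(bytes([i]) * PAGE_SIZE for i in range(3))
    print(page_cache_usage([library] * 100))
```

Hashing every page is exactly the overhead the attendees wanted to avoid, which is why approaches that recognize sharing without per-page hashing were of interest.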
[I would like to thank the Linux Foundation for travel support to Boston for the summit.]
Index entries for this article | |
---|---|
Kernel | Filesystems/Union |
Kernel | Overlayfs |
Conference | Storage Filesystem & Memory Management/2015 |
Posted Mar 19, 2015 16:37 UTC (Thu) by ibukanov (subscriber, #3942)
Also, I do not see the issue [2] regarding device mapper in Docker as a political one. For technical reasons, the Docker guys prefer a static binary. It is just that this is not compatible with the udev implementation.
[1] https://github.com/docker/docker/issues/10180
[2] https://github.com/docker/docker/issues/10705
Posted Mar 31, 2015 0:38 UTC (Tue) by bronson (subscriber, #4806)
Posted Mar 22, 2015 13:55 UTC (Sun) by jlayton (subscriber, #31672)
File locking really exists to ensure that changes to files are coordinated properly.
Union mounts and overlayfs both share the same basic design -- you have an r/o layer (or more than one) and an r/w layer. The assumption is that the r/o layer won't change out from under you, even if it's (for instance) a mounted NFS filesystem. The r/w layer is also assumed to not be shared between multiple hosts. Each host must have its own.
So, locking at the r/o layer is just not very interesting and may possibly be problematic. Consider two hosts setting a write lock on the same file on an NFS-hosted r/o layer. Pushing those lock requests to the server would unnecessarily serialize their access. The file is not going to change either way, so that's just unnecessary.
So, I think that we only want to worry about locking on the r/w layer. Furthermore, since we have an assumption that the r/w layers are not shared, we only need to worry about locking on a single host. In the case of NFS or another remote fs, we don't really even need to send those lock requests outside of the client.
Unfortunately, there are some other problems that get in the way of fixing this properly with overlayfs currently. David is looking at ways to address those -- once that's fixed we should be able to make file locking work on overlayfs too.
Posted Mar 22, 2015 18:12 UTC (Sun) by mathstuf (subscriber, #69389)
Isn't this only true for filesystems globally mounted as r/o? I use r/o nullfs mounts on FreeBSD to expose git repos to git-daemon and cgit. The gitolite jail certainly can change it out from underneath them, but maybe this isn't an overlayfs use case either?