The case of the stalled CPU controller

By Jonathan Corbet
August 17, 2016

Long-running technical disagreements are certainly well-known in the kernel community. Usually they are eventually resolved and the developers involved move on to new problems. Occasionally, though, stalled consensus can lead to a break between the kernel and its user community. The ongoing dispute over the CPU controller in the new control-group hierarchy is beginning to look like one of those unpleasant cases.

Control groups (cgroups) have been supported by the kernel for nearly ten years; they provide a mechanism by which processes can be grouped in a hierarchical manner and made subject to various resource controllers. It did not take long after the introduction of control groups before users and developers began to realize that there were some fundamental problems with their design; the discussion about fixing those problems got started in earnest in early 2012. Those discussions led to the description of a second version of the cgroup API and the beginning of work to move to that API. At this point, the version-2 API transition is mostly complete, with one glaring exception: the CPU controller.

The CPU controller, as one might expect, controls access to the CPU; it allows different groups of processes to be allocated specific amounts of CPU time and keeps those groups from interfering with each other. The low-level CPU-controller code is able to support the new API without trouble, but the scheduler developers have resisted the merging of that API itself; at this point, the CPU controller is the most significant controller without version-2 support. In an attempt to push things forward (and to say what will happen if things do not move forward), cgroup maintainer Tejun Heo posted a detailed summary of the situation as he sees it. That document is well worth reading for those who are interested in the topic.

Objections to the CPU controller

In short, the scheduler developers object to the new API for two reasons, both stemming from a perceived mismatch between the API and how they feel CPU control should be done. Both relate to fundamental design decisions in the version-2 cgroup API.

In the original cgroup implementation, each thread (of which a process may contain many) can be placed in a separate control group. The version-2 API, instead, requires that all of a process's threads be in the same group. For some controllers, such as the memory-usage controller, putting different threads into different groups makes little sense; all those threads are sharing the same memory, after all, so it is hard to say what it would mean to try to apply different policies to different threads. There are reasonable use cases for applying different CPU-usage policies to different threads, but the unified hierarchy, which is a fundamental design aspect of the version-2 API, requires all controllers to see the same cgroup arrangement. So all threads must be in the same cgroup from the CPU controller's point of view.

This requirement apparently seems fundamentally wrong to the scheduler developers; nothing in the scheduler itself recognizes the abstraction of a "process" at all. At that level, everything is a thread; applying a coarser policy at the cgroup level takes away an important degree of flexibility for (from their point of view) no gain.
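The constraint at issue can be sketched in a few lines of Python. This is only a toy model, assuming membership is a plain mapping from thread to group path; the names `Thread` and `satisfies_v2_process_rule` and the group paths are hypothetical illustrations, not kernel or cgroup interfaces.

```python
from collections import defaultdict
from typing import Dict, NamedTuple, Set


class Thread(NamedTuple):
    pid: int  # process ID shared by all threads of one process
    tid: int  # per-thread ID


def satisfies_v2_process_rule(membership: Dict[Thread, str]) -> bool:
    """Return True if every process has all of its threads in one group."""
    groups_by_pid: Dict[int, Set[str]] = defaultdict(set)
    for thread, group in membership.items():
        groups_by_pid[thread.pid].add(group)
    return all(len(groups) == 1 for groups in groups_by_pid.values())


# A version-1 style layout: two threads of process 100 sit in different groups.
per_thread_layout = {
    Thread(pid=100, tid=100): "/pool/latency",
    Thread(pid=100, tid=101): "/pool/batch",
    Thread(pid=200, tid=200): "/pool/batch",
}

# The same workload with each process kept whole, as version 2 requires.
per_process_layout = {
    Thread(pid=100, tid=100): "/pool/latency",
    Thread(pid=100, tid=101): "/pool/latency",
    Thread(pid=200, tid=200): "/pool/batch",
}
```

In this model, the first layout fails the check and the second passes; the thread-pool use case is exactly the kind of arrangement that looks like the first one.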

There are users who want to be able to apply different policies to different threads; managing a thread pool is one commonly cited use case. But Heo stands by the design decisions; he also feels that the same interface should not be used at both the administrator level and within an individual process. He has proposed a mechanism called resource groups for the intra-process case, but that proposal has not made a lot of headway thus far.

There is another version-2 design decision that does not sit well with the scheduler developers. In the new API, a control group may contain other control groups, or it may contain processes, but not both; processes can only appear as leaves in the control-group hierarchy. Again, this decision was made to facilitate support for controllers other than the CPU controller. If subgroups and processes appear in the same cgroup, then the two types of object must compete for the same resource. In the CPU case, that competition is easily managed; when a cgroup is "scheduled," the scheduler recurses into the group and chooses one of the entities found therein to run. For many other controllers, though, it is not possible to treat processes and subgroups in the same manner.
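As a rough illustration of the leaf-only rule as described here, the sketch below walks a toy hierarchy and reports any group that holds both processes and child groups. The `Cgroup` class and `violations` function are hypothetical, and the model ignores any details or exceptions in the real interface.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Cgroup:
    name: str
    processes: List[int] = field(default_factory=list)  # PIDs placed directly here
    children: List["Cgroup"] = field(default_factory=list)


def violations(group: Cgroup, path: str = "") -> List[str]:
    """Return paths of groups that hold both processes and child groups."""
    here = f"{path}/{group.name}"
    found = [here] if group.processes and group.children else []
    for child in group.children:
        found.extend(violations(child, here))
    return found


# Processes and a subgroup side by side: not allowed by the rule above.
mixed = Cgroup("system", processes=[1],
               children=[Cgroup("web", processes=[10, 11])])

# The same processes pushed down into leaves: allowed.
leaf_only = Cgroup("system", children=[
    Cgroup("init", processes=[1]),
    Cgroup("web", processes=[10, 11]),
])
```

Here `violations(mixed)` reports the `/system` group, while `violations(leaf_only)` reports nothing.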

The primary objection here seems to be that this restriction stomps on some of the elegance in the CPU-scheduler design; scheduling decisions are applied to "scheduling entities" that can be either processes or groups, and the scheduler itself need not care which. The version-2 API makes some control policies difficult or impossible to achieve but, Heo asserts, that may not matter much:

However, it isn't clear what the practical usefulness of a layout with direct competition between tasks and cgroups would be, considering that number and behavior of tasks are controlled by each application, and cgroups primarily deal with system level resource distribution; changes in the number of active threads would directly impact resource distribution. Real world use cases of such layouts could not be established during the discussions.
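The effect described in that passage can be illustrated with a little arithmetic. The sketch below assumes that CPU time is divided among sibling scheduling entities in proportion to their weights and that tasks and the child group all carry the same weight; the function name and the weight value of 1024 are illustrative assumptions only.

```python
from fractions import Fraction


def child_group_share(sibling_tasks: int, group_weight: int = 1024,
                      task_weight: int = 1024) -> Fraction:
    """Share of the parent's CPU time received by one child cgroup that
    competes directly with `sibling_tasks` runnable tasks in the same parent."""
    total = group_weight + sibling_tasks * task_weight
    return Fraction(group_weight, total)


# The child group's share shrinks as the application next to it adds threads.
for n in (1, 3, 7):
    print(f"{n} sibling task(s): child cgroup gets {child_group_share(n)}")
```

With one sibling task the child group gets 1/2 of the parent's time; with three it gets 1/4, and with seven it gets 1/8, even though nobody changed the group's configuration.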

In summary, Heo says, there are solid reasons for the decisions that were made in the version-2 API. It handles most use cases as-is, and the addition of features like resource groups can fill in the gaps that remain. If there is anybody who still cannot work with the version-2 API, version-1 will continue to be maintained for as long as it has users. The transition is nearly done: the low-level support is there; all that is left to be merged is the API-level code to allow the CPU controller to operate in the unified version-2 hierarchy. But that code has been blocked with, seemingly, no way forward.

What happens now

Heo clearly hopes that, by reopening the discussion, he can maybe bring it to a conclusion and clear the way for the remaining patches to be merged. There is little evidence of that happening from the discussion so far, though. In the absence of a solution there, he is planning to do a couple of other things to make this functionality available to users.

One of those is to maintain the necessary patches going forward so that anybody who wants the CPU controller with the version-2 API can easily add it to their kernel. While it is unstated, it seems fairly clear that he is hoping that distributors will apply these patches to make the functionality available to their users. That approach has been used to resolve such logjams in the past; if a patch is widely applied by distributors and widely used, there comes a point where it clearly makes no sense to keep it out of the mainline. That was part of the reasoning that brought the Android patches into the mainline, among others.

The other half of that picture is to ensure that the most widely distributed user of control groups — systemd — is able to use the version-2 API. To that end, he has posted a pull request to add this functionality to systemd, saying: "This commit implements systemd CPU controller support on the unified hierarchy so that users who choose to deploy CPU controller cgroup v2 support can easily take advantage of it." That code was merged into the systemd mainline on August 14.

That action has led to a bit of disagreement in the systemd community, given that systemd normally wants to see features merged upstream before adding code to make use of them — though, it must be said, the bulk of that disagreement seems to come from a single vociferous developer. Lennart Poettering defended the action, saying that the systemd developers want to get the capability into users' hands, and that he hopes to get the kernel patches added to Fedora's kernel as well. Greg Kroah-Hartman added that this is not the first time that support for unmerged features has been added, and that it is often for good reasons:

Sometimes you have to add code to projects in order to be able to properly test the kernel code. And to make it easier for people to upgrade their kernels in the future and have things work properly on their existing, older, system tools. This happens all the time, I don't know why you are suddenly surprised about this

That is where things stand as of this writing. Predictions can be dangerous, especially when they involve the future, but, in this case, it seems likely that the kernel patches will indeed find their way into a number of distributor kernels. They make the version-2 API more widely useful, and, since most distributors are using systemd at this point, they have an important consumer lined up and ready to use it. Pressure from the user community is a blunt tool to use when patches are stalled but, in this case, it might just work.

Index entries for this article
Kernel	Control groups/Thread-level control


The case of the stalled CPU controller

Posted Aug 18, 2016 8:21 UTC (Thu) by matlads (guest, #84088) [Link]

Thanks for the chuckle: "Predictions can be dangerous, especially when they involve the future"

Predictions

Posted Aug 18, 2016 13:21 UTC (Thu) by corbet (editor, #1) [Link]

I wish I could claim credit for that, but Yogi Berra got there first.

The case of the stalled CPU controller

Posted Aug 18, 2016 8:27 UTC (Thu) by zdzichu (subscriber, #17118) [Link]

Nb. Fedora kernel developers are not willing to add out-of-tree patches for cgroup-v2 cpu controller: https://lists.fedoraproject.org/archives/list/devel@lists...

It never worked in the past (with utrace, Secure Boot, kdbus).

The case of the stalled CPU controller

Posted Aug 18, 2016 9:55 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]

How about squashfs (out-of-tree module for about 5 years)?

The case of the stalled CPU controller

Posted Aug 18, 2016 11:56 UTC (Thu) by johannbg (guest, #65743) [Link]

Hum what about it?

If I recall correctly, each distribution shipped its own downstream patch with a different naming scheme (with the fallout from users trying to use patches between distributions). One patch was needed for each kernel version (2.4, 2.5, 2.6, etc.), and in Fedora they broke backwards compatibility with <4.0 SquashFS at least once without even mentioning that they intended to do so, which came as a "surprise" to everyone using earlier versions.

I interpret the above historic scenario as 5 years of failure, not a 5-year success at pressuring the upstream kernel community to finally merge something because it was so widely deployed, used, and tested by downstream distributions.

And did it finally get merged due to the above reason or was that because of something entirely different to that?

Does anyone here possess the historic knowledge to confirm that was the root cause for it being merged in the first place?

The case of the stalled CPU controller

Posted Aug 18, 2016 8:41 UTC (Thu) by johannbg (guest, #65743) [Link]

Mr. "vociferous" here. You know, the individual has a problem with merging code into a master branch that relies on unaccepted code in another upstream, which will *always* lead to higher maintenance in said upstream or in the downstream consumers of that upstream (and, in this case, to applying double standards in the systemd community as well).

This merge all seems to be an attempt to convince the upstream kernel community to move forward with the unified hierarchy support, using a strategy that was tried in Fedora and has proven to work *so well*, as can be seen with utrace, secure boot, and kdbus.

But hey, the Fedora kernel community fell for this not once, not twice, but three times, so let's see if they don't for the fourth, because they love to carry out-of-tree patches to the kernel.

Seriously, you don't sign downstream (kernel) communities up to carry an out-of-tree patch indefinitely, flip a switch to expose that to a downstream distribution's entire userbase, cross your fingers and hope somebody reports something, and call that testing.

If the intent was truly to test this, that requires measurable results gathered over a period of time, with instructions on how to test, what to test, and how and what to report. It's something you would do in collaboration with the downstream QA community, and it would be conducted through "test days" or similar downstream processes.
All of this can be managed in a ppa/corp repo dedicated to such an effort, so that reporters would not have to jump through hoops to set the test environment up, participate, and deliver the feedback wanted, and so that the same downstream would not be signed up for an added maintenance burden indefinitely.

This whole scenario of using upstream and downstream to solve a childish stalemate in the kernel community is just silly, and the approach/strategy used in trying to break that stalemate is even sillier <sigh>

The case of the stalled CPU controller

Posted Aug 23, 2016 20:31 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

Not that I'm defending the practice, but how many features have been merged into the kernel because Android uses them? Haven't there been some Google patches merged due to (Google-specific?) deployment prevalence too?

The case of the stalled CPU controller

Posted Aug 23, 2016 22:38 UTC (Tue) by johannbg (guest, #65743) [Link]

If you are referring to binder and the time when Greg moved that code from staging, merged it into mainline, and gave everyone with the expertise to review it the middle finger in the process, I would call it a similar but not identical political strategy (I think Greg signed himself up to maintain that code in that process). But I have a hard time imagining that such a stunt would work twice, since there are people in that community who are more stubborn than I am (and I'm pretty fracking stubborn) and who take great pride in their reviews and are thorough in them (which means things can take time). So I don't think you can pressure them into "acceptance" by distributing the code downstream like a newfound drug, then forming lynch mobs out of its consumers, handing them lit torches, and having them march upstream shouting merge! merge! merge!.

cgroupv2 would probably be merged upstream if Ted had called it something else, like "weight groups", and skipped any backwards compatibility with cgroups, basically introducing this as a new thing with a new approach to solving the same underlying problem.

The case of the stalled CPU controller

Posted Aug 26, 2016 8:02 UTC (Fri) by geuder (subscriber, #62854) [Link]

I find it astonishing that the CPU controller developers want more freedom to do complicated things. I found that it does not even work reasonably when you use cpu.shares on different levels of the hierarchy. (Well, it was a 3.11 or 3.14 kernel that I need to use at work. I would need to check my notes to remember the details. And it could work perfectly well in today's kernel...)