Is the whole system idle?
The kernel has had the ability to turn off the periodic clock interrupt on idle processors for many years. Each processor, when it goes idle, will simply stop its timer tick; when all processors are idle, the system will naturally have the timer tick disabled systemwide. Fully dynamic tick — where the timer tick can be disabled on non-idle CPUs — adds an interesting complication, though. While most processors can (when the conditions are right) run without the clock tick, one processor must continue to keep the tick enabled so that it can perform a number of necessary system timekeeping operations. Clearly, this "timekeeping CPU" should be able to disable its tick and go idle if nothing else is running in the system, but, in current kernels, there is no way for that CPU to detect this situation.
A naive solution to this problem will come easily to mind: maintain a global counter tracking the number of idle CPUs. Whenever a processor goes idle, it increments the counter; when the processor becomes busy again, it decrements the counter. When the number of idle CPUs matches the number of CPUs in the system, the kernel will know that no work is being done and the timekeeping CPU can take a break.
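As a rough sketch (with hypothetical names, not code from any actual patch), that counter-based approach might look like this in kernel-style C:

    #include <linux/atomic.h>
    #include <linux/cpumask.h>

    /* Naive approach: one global counter of idle CPUs. */
    static atomic_t nr_idle_cpus = ATOMIC_INIT(0);

    static void note_cpu_idle(void)
    {
            atomic_inc(&nr_idle_cpus);      /* every idle entry writes the same cache line */
    }

    static void note_cpu_busy(void)
    {
            atomic_dec(&nr_idle_cpus);
    }

    static bool system_fully_idle(void)
    {
            return atomic_read(&nr_idle_cpus) == num_online_cpus();
    }

The query is trivially cheap; the catch, as described next, is that every idle transition on every CPU touches that one shared counter.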
The problem, of course, is that cache contention for that global counter would kill performance on larger systems. Transitions to and from idle are common under most workloads, so the cache line containing the counter would bounce frequently across the system. That would defeat some of the point of the dynamic tick feature; it seems likely that many users would prefer the current power-inefficient mode to a "solution" that carried such a heavy cost.
So something smarter needs to be done. That's the cue for an entry by Paul McKenney, whose seven-part full-system idle patch set may well be the solution to this problem.
As one might expect, the solution involves the maintenance of a per-CPU array of idle states. Each CPU can update its status in the array without contending with the other CPUs. But, once again, the naive solution is inadequate. With a per-CPU array, determining whether the system is fully idle requires iterating through the entire array to examine the state of each CPU. So, while maintaining the state becomes cheap, answering the "is the system idle?" question becomes expensive if the number of CPUs is large. Given that the timekeeping code is likely to want to ask that question frequently (at each timer tick, at least), an expensive implementation is not indicated; something else must be done.
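A sketch of that per-CPU variant (again with made-up names) shows where the cost moves to:

    #include <linux/percpu.h>
    #include <linux/cpumask.h>

    /* Second naive approach: per-CPU idle flags, scanned on every query. */
    static DEFINE_PER_CPU(bool, cpu_is_idle);

    static void note_cpu_idle(void)
    {
            this_cpu_write(cpu_is_idle, true);      /* purely local, no cache-line bouncing */
    }

    static void note_cpu_busy(void)
    {
            this_cpu_write(cpu_is_idle, false);
    }

    static bool system_fully_idle(void)
    {
            int cpu;

            /* O(nr_cpus) work on every query; too costly on big machines. */
            for_each_online_cpu(cpu)
                    if (!per_cpu(cpu_is_idle, cpu))
                            return false;
            return true;
    }

Updates are now cheap and local, but answering the question requires walking every CPU's flag.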
Paul's approach is to combine the better parts of both naive solutions. A single global variable is created to represent the system's idle state and make that state easy to query quickly. That variable is updated from a scan over the individual CPU idle states, but only under specific conditions that minimize cross-CPU contention. The result should be the best of both worlds, at the cost of delayed detection of the full-system idle state and the addition of some tricky code.
The actual scan of the per-CPU idle flags is not done in the scheduler or timekeeping code, as one might expect. Instead (as others might expect), Paul put it into the read-copy-update (RCU) subsystem. That may seem like a strange place, but it makes a certain sense: RCU is already tracking the state of the system's CPUs, looking for "grace periods" during which unused RCU-protected data structures can be reclaimed. Tracking whether each CPU is fully idle is a relatively small change to the RCU code. As an added benefit, it is easy for RCU to avoid scanning over the CPUs when things are busy, so the overhead of maintaining the global full-idle state vanishes when the system has other things to do.
The actual idleness of the system is tracked in a global variable called full_sysidle_state. Updating this variable too often would bring back the cache-line contention problem, though, so the code takes a more roundabout path. Whenever the system is perceived to be idle, the code keeps track of when the last processor went idle. Only after a delay will the global idle state be changed. That delay drops to zero for "small" machines (those with no more than eight processors); it increases linearly as the number of processors goes up. So, on a very large system, all processors must be idle for quite some time before full_sysidle_state will change to reflect that state of affairs.
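The scaling of that delay can be illustrated with a small sketch; the constant, the scale factor, and the helper name here are illustrative only and are not taken from Paul's patches:

    #include <linux/cpumask.h>

    /* Illustrative only: delay before declaring the whole system idle. */
    #define SYSIDLE_SMALL   8       /* no delay on machines this small */

    static unsigned long sysidle_delay_jiffies(void)
    {
            unsigned int cpus = num_online_cpus();

            if (cpus <= SYSIDLE_SMALL)
                    return 0;
            /* Grows linearly with the CPU count. */
            return cpus / 16;
    }

With this particular scale factor, a 4096-CPU machine running at HZ=1000 would wait 256 jiffies (about a quarter of a second) after the last CPU went idle before full_sysidle_state could advance, which is roughly the figure mentioned in the comments below.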
The result is that detection of full-system idle will be delayed on larger systems, possibly by a significant fraction of a second. So the timer tick will run a little longer than it strictly needs to. That is a cost associated with Paul's approach, as is the fact that his patch set adds some 500 lines of core kernel code for what is, in the end, the maintenance of a single integer value. But that, it seems, is the price that must be paid for scalability in a world where systems have large numbers of CPUs.
What about delays to shut off the tick on huge systems?
Posted Jul 13, 2013 4:51 UTC (Sat) by PaulMcKenney (subscriber, #9624) [Link]
For whatever it is worth, here is an analysis of the full-system-idle delays on larger systems.
One reason for believing that these delays are OK is that the fraction of the system affected by the timekeeping ticks decreases as the system size increases. To see this, suppose each socket has eight cores, each with two hardware threads. On a 16-CPU system, the timekeeping ticks are causing the system's one and only socket to power up frequently. On a 32-CPU system, only half of the sockets are powering up frequently. On a 64-CPU system, only 25% of the sockets are powering up frequently. Skipping a few factors of two, on a 4096-CPU system, only about 0.4% of the sockets are powering up frequently. If this 4096-CPU system is running at HZ=1000, the full-system-idle delay will be about 256 milliseconds, which should be OK.
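Filling in the skipped factors of two (a restatement of the arithmetic above, not new data):

    sockets        = 4096 CPUs / 16 CPUs per socket = 256
    busy fraction  = 1 socket / 256 sockets ≈ 0.4%

So only the timekeeping socket, about 0.4% of the machine, keeps powering up frequently.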
Another reason for believing that these delays are OK focuses on the probability of the system momentarily going fully idle while running a workload. Suppose that it is worthwhile to take advantage of the full-system-idle state if it is in effect at least 1% of the time. If the workload has random per-CPU busy and idle times, this would require that the CPU utilization be below a certain level.
That certain level is 0.1% CPU utilization, or, in bc-speak:
1-e(l(0.01)/4096).
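Spelled out (bc's l() is the natural log and e() is the exponential): if each of the 4096 CPUs is independently busy a fraction u of the time, then

    P(all idle) = (1 - u)^4096 >= 0.01
    =>  u <= 1 - 0.01^(1/4096) = 1 - e(l(0.01)/4096) ≈ 0.0011

which is the roughly 0.1% utilization figure above.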
However, if your workload is keeping your 4096-CPU system at 0.1% CPU utilization, your workload would very likely run just as well (and more economically) on a much smaller system, perhaps even as small as four CPUs.
This suggests that the only time the full-system-idle state is important for huge systems is when the system is in fact idle, in other words, when the workload has fully stopped. In that case, a quarter-second wind-down time seems eminently reasonable.
Or at least that is what I am hoping. ;-)
Is the whole system idle?
Posted Jul 13, 2013 20:28 UTC (Sat) by vomlehn (guest, #45588) [Link]
Quote:
The actual scan of the per-CPU idle flags is not done in the scheduler or timekeeping code...Paul put it into the read-copy-update (RCU) subsystem.
Got a quibble, from the programming-philosophy standpoint, with the approach as described (I haven't read the patch, and the quibble is about the approach, not whether it's actually done this way). What has happened in this case is the discovery that what RCU was doing has broader applicability. So, instead of putting the change in the RCU code, it should be "hoisted" to a higher level and combined with similar functionality. This may result in a new subsystem or an addition to an existing subsystem. In this case, I would expect the code to be in an "idleness tracking" subsystem. The RCU code would then reference this subsystem.
I don't want to pick on this patch; I've committed this particular sin more than once, but once I've done my penance by doing the full restructuring, I feel so much *cleaner*.
Is the whole system idle?
Posted Jul 13, 2013 23:28 UTC (Sat) by vomlehn (guest, #45588) [Link]
RCU (read-copy-update) is indeed a concurrency-related facility, one focused on updating data in an atomic fashion. RCU has a lot to do with idleness because its implementation tries to use idle periods to do some of its work. It could, however, do its work in other ways that don't depend on idleness. It could, if hardware supported it, be implemented differently (rather like the x86 processor TSX feature used for lock elision, covered in the LWN article Merging lock elision into glibc and previous related articles). Without changing the interfaces, it would suddenly have nothing to do with idleness.
So this was not a feature of RCU; it was rather an implementation detail of RCU. So long as it remains hidden, i.e. not used outside of RCU, this is a perfectly good place for it. When it becomes visible, it's time to evaluate whether it should be refactored by moving it out of RCU. My current view is that, yes, it should, especially since it seems to belong with other existing idleness and power-management features.
The point I'm trying to make, which I wouldn't want to get lost, is that when we pull formerly hidden code out of the shadows, we need to evaluate whether it still belongs where it used to be or whether our desire to keep code well organized suggests we should move it elsewhere.
[An aside: I expect the kernel x86 community has already given consideration to an RCU implementation using TSX (Transactional Synchronization Extension) which would require refactoring RCU to separate out the idleness-based code.]
Is the whole system idle?
Posted Jul 19, 2013 9:20 UTC (Fri) by micka (subscriber, #38720) [Link]
Well, not really, but I don't know when it will be, and I certainly won't say it will not happen. Not so long ago, when my desktop computer was single-core and I was sad because kernel devs seemed interested only in performance on server hardware (i.e., multi-CPU), I wouldn't have thought we'd have 4 CPUs in phones.
I'm no expert, but if you have, for example, 4 big.LITTLE pairs in a device, isn't that already 8 cores?
Is the whole system idle?
Posted Jul 19, 2013 17:49 UTC (Fri) by corbet (editor, #1) [Link]
The big.LITTLE switcher code works that way; big.LITTLE MP, instead, treats all of the processors independently. Neither approach is upstream, currently.
Is the whole system idle?
Posted Jul 19, 2013 8:56 UTC (Fri) by etienne (guest, #25256) [Link]
Isn't it the case that IRQs can be directed to any of the processor cores in hardware, and that a core going to sleep should divert its IRQs to other cores before sleeping (because the wake-up time adds to latency)?
If you are the core to which every IRQ is assigned and you are going to sleep, you are the last one...
IRQ assignment is obviously not a core-local resource, but it is done anyway; why not reuse those hardware registers?