Short subjects: kerneloops, read-mostly, and port 80
Of course, counting reports is not the easiest thing to do, especially if they are not all sent to the same place. In an attempt to make this process easier, Arjan van de Ven has announced a new site at kerneloops.org. Arjan has put together some software which scans certain sites and mailing lists for posted kernel oops output; whenever a crash is found, it is stuffed into a database. Then an attempt is made to associate reports with each other based on kernel version and the call trace; from that, a list of the most popular ways to crash can be created. As of this writing, the current fashion for kernel oopses would appear to be in ieee80211_tx() in current development kernels. Some other information is stored with the trace; in particular, it is possible to see what the oldest kernel version associated with the problem is.
This is clearly a useful resource, but there are a couple of problems which make it harder to do the job properly. One is that there is no distinctive marker which indicates the end of an oops listing, so the scripts have a hard time knowing where to stop grabbing information. The other is that multiple reports of the same oops can artificially raise the count for a particular crash. The solution to both problems is to place a marker at the end of the oops output which includes a random UUID generated at system boot time. Patches to this effect are circulating, though getting the random number into the output turns out to be a little harder than one might have expected. So, for 2.6.24, the "random" number may be all zeroes, with the real problem to be solved in 2.6.25.
Read-mostly. Anybody who digs through kernel source for any period of time will notice a number of variables declared in a form like this:
static int __read_mostly ignore_loglevel;
The __read_mostly attribute says that accesses to this variable are usually (but not always) read operations. There were some questions recently about why this annotation is done; the answer is that it is an important optimization, though it may not always have the effect that developers are hoping for.
As is well described in What every programmer should know about memory, proper use of processor memory caches is crucial for optimal performance. The idea behind __read_mostly is to group together variables which are rarely changed so they can all share cache lines which need not be bounced between processors on multiprocessor systems. As long as nobody changes a __read_mostly variable, it can reside in a shared cache line with other such variables and be present in cache (if needed) on all processors in the system.
The read-mostly attribute generally works well and yields a measurable performance improvement. There are concerns, though, that this feature could be over-used. Andrew Morton expressed it this way:
Combining frequently-written variables into shared cache lines is a good way to maximize the bouncing of those cache lines between processors - which would be bad for performance. So over-aggressive segregation of read-mostly variables to minimize cache line bouncing could have the opposite of the desired effect: it could make the kernel's cache behavior worse.
The better way, says Andrew, would have been to create a "read often" attribute for variables which are frequently used in a read-only mode. That would leave behind the numerous read-rarely variables to serve as padding, keeping the write-often variables nicely separated from each other. Thus far, patches to make this change have not been forthcoming.
I/O port delays. The functions provided by the kernel for access to I/O ports have long included versions which insert delays. A driver would normally read a byte from a port with inb(), but inb_p() could be used if an (unspecified) short delay was needed after the operation. A look through the driver tree shows that quite a few drivers use the delayed versions of the I/O port accessors, even though, in many cases, there is no real need for that delay.
This delay is implemented (on x86 architectures) with a write to I/O port 0x80. There is generally no hardware listening for an I/O operation on that port, so this write has the sole effect of delaying the processor while the bus goes through an abortive attempt to execute the operation. It is an operation with reasonably well-defined semantics, and it has worked for Linux for many years.
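The idiom can be sketched as kernel-style C; this is an illustration of the pattern rather than the kernel's exact source, and since raw port I/O requires I/O privileges it cannot run in an ordinary process:

```c
/* The "_p" accessors follow each port access with a write to port
   0x80; the value written is irrelevant, only the bus cycle's
   duration matters. */
static inline void slow_down_io(void)
{
    asm volatile("outb %%al, $0x80" : : );
}

static inline unsigned char inb_p(unsigned short port)
{
    unsigned char v;

    asm volatile("inb %1, %0" : "=a"(v) : "Nd"(port));
    slow_down_io();
    return v;
}
```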
Except that now, it seems, this technique no longer works on a small subset of x86_64 systems. Instead, the write to port 0x80 will, on occasion, freeze the system hard; this, in turn, generates a rather longer delay than was intended. One could imagine the creation of an elaborate mechanism for restarting I/O operations after the user resets the system, but the kernel developers, instead, chose to look for alternative ways of implementing I/O delays.
In almost every case, the alternative form of the delay is a call to udelay(). The biggest problem here is that udelay() works by sitting in a tight loop; it cannot know how many times to go through the loop until the speed of the processor has been calibrated. That calibration happens reasonably early in the boot process, but there are still tasks to be performed - including I/O port operations - first. This problem is being worked around by removing some delayed operations from the early setup code, but some developers worry that it will never be possible to get them all. It has been suggested that the kernel could just assume it's running on the fastest-available processor until the calibration happens, but, beyond being somewhat inelegant, that could significantly slow the bootstrap process on slower machines - all of which work just fine with the current code.
The real solution is to simply get rid of almost all of the delayed I/O
port operations. Very few of them are likely to be needed with any
hardware which still works. In some cases, what may really be going on is
that the delays are being used to paper over driver bugs - such as failing
to force a needed PCI write out by doing a read operation. Just removing
the delays outright would probably cause instability in unpredictable
places - not a result most developers are striving for. So the task of
cleaning up those calls will have to be done carefully over time.
Meanwhile, the use of port 0x80 will probably remain unchanged for
2.6.24.
| Index entries for this article | |
| --- | --- |
| Kernel | Debugging |
| Kernel | Device drivers |
| Kernel | __read_mostly |
Short subjects: kerneloops, read-mostly, and port 80
Posted Dec 20, 2007 11:29 UTC (Thu) by willy (subscriber, #9762)
> the delays are being used to paper over driver bugs - such as failing to force a needed PCI write out by doing a read operation

It's good that the message is getting out that PCI writes can be posted -- but that can't be the case here; IO port writes can't be posted, only MMIO writes.
Short subjects: kerneloops, read-mostly, and port 80
Posted Dec 23, 2007 21:48 UTC (Sun) by jcm (subscriber, #18262)
Jon was just giving an example of posted writes, I don't think he was suggesting this was a problem in this case.
Short subjects: kerneloops, read-mostly, and port 80
Posted Dec 20, 2007 12:43 UTC (Thu) by jengelh (subscriber, #33263)
Port 0x80.
Short subjects: kerneloops, read-mostly, and port 80
Posted Dec 20, 2007 20:01 UTC (Thu) by jzbiciak (guest, #5246)
What happens if you mark "write often" variables as having an alignment equal to the L2 line size? That'd force them into separate cache lines, and other stuff with lesser alignment constraints could still fill in the holes. (This works for scalars at least, if not arrays.)
Short subjects: kerneloops, read-mostly, and port 80
Posted Dec 20, 2007 20:08 UTC (Thu) by jzbiciak (guest, #5246)
Also, it occurs to me that even this strategy might be a bad one, since you still have the problem of reads to unrelated variables causing some amount of cache line traffic for writeable lines. Furthermore, the "best" strategy may vary by processor. How many "write often" variables are there that are not per-CPU? If this set is moderately small, perhaps forcing them all into separate cache lines with *nothing* in the holes is an acceptable increase in footprint in modern CPUs. And, on AMD CPUs, you may be able to skip this altogether: The MOESI cache protocol they use allows cache lines to be in a "shared, writeable" state (the "O" state), which directly addresses this bounce issue.
Short subjects: kerneloops, read-mostly, and port 80
Posted Jan 2, 2008 18:25 UTC (Wed) by wilck (guest, #29844)
If IO delay through port 0x80 goes away, it will eventually be possible to use BIOS post codes from Phoenix-based BIOSes for debugging purposes. I have been waiting years for that to happen. Let's hope that the current discussion won't lead to a dead end as this one from 2002 did.