KS2009: The future of perf events

By Jonathan Corbet
October 19, 2009

LWN's 2009 Kernel Summit coverage

At last year's kernel summit, the perf events subsystem did not yet exist, even in out-of-tree form. Now, this subsystem (initially called "perf counters") is in the mainline and rapidly gaining new capabilities. Paul Mackerras led a discussion on recent and future developments for this code.

The perf counters code brought with it the perf_counter_open() system call. Unlike previous work in this area, perf counters deals with single counters at a time, rather than presenting a view of the entire performance measurement unit (PMU) in the processor. Since being merged, this code has gained the ability to treat tracepoints as software events, collect user stack chains, perform filtering with C-like expressions, and work with kprobes. It also has been rebranded "perf events" to highlight its expanding scope.

What does the future hold? One is better scheduling of counters. A typical hardware PMU can only measure a subset of all the possible events at any given time. If an application asks for more events than the PMU can handle, the kernel will simulate a larger PMU by having it count different events at different times. This scheduling of the PMU works, but it would be nice to have better control. User space should be able to attach priorities to counters, indicating which ones it really cares about. It would also be nice to be able to modify the scheduling interval used.

Internally, there needs to be a well-defined API for users of the perf events subsystem. Support for more architectures is in the works; currently recent x86, server PowerPC, and SPARC64 processors are supported. Also desired is the ability to combine multiple counters to create abstract events.

Currently the kernel has a number of ways to report events to user space. It seems that at least some of the developers working in this area would like perf events to subsume the others and become the one true event counting and reporting infrastructure. If this happens, ftrace would lose its ring buffer implementation and report its data through the (different) ring buffer used by perf events. There was no real opposition to this idea; nobody was willing to take the position that having duplicated ring buffer implementations in the kernel was a good idea. The fact that the author of the ftrace ring buffer was not present may also have contributed to the silence on this issue.

There was some talk of working with user-space tracepoints. How should data from these tracepoints be integrated with kernel tracing data? One idea was to make some sort of inject_tracepoint() system call, but that was dismissed as being too slow. If user-space tracing data needs to go through the kernel, some sort of inbound ring buffer needs to be implemented.

An alternative would be to record user-space trace data separately, then integrate it during postprocessing. The problem here is that it is surprisingly difficult to timestamp this data in a way that allows it to be reliably merged with kernel trace data. In the end, it may be necessary to just record trace information with CPU numbers and time stamp counter contents.

There are also issues with the collection of backtrace information from user space. That is apparently hard to do if user space has not been built with frame pointers. But, on some architectures, using frame pointers has a serious performance cost, so most distributors are unwilling to build their systems that way. There was some debate over whether it's possible to reliably generate backtraces without frame pointer information, but no clear conclusion.

The following session covered the related issue of tracepoints and user-space ABI issues. Tracepoint documentation was covered; all of those tracepoints are not as useful as they could be if they are not properly documented. There is an effort underway to document tracepoints in the kerneldoc system, but that was criticized as being the wrong approach. What is really needed, people said, was self-documenting tracepoints. The documentation could be put into a special kernel section, in the same way as module author and parameter information is stored now. The documentation could be extracted by tools when needed. That would require keeping an uncompressed kernel image around, but that is often necessary anyway.

What about tracepoint stability? As developers discover tracepoints, they will create tools that rely on them; changing the tracepoints will then break the tools. Arjan van de Ven made the point that, without some sort of tracepoint stability, these tools would quickly become useless. So, at some level, tracepoints need to be seen as part of the kernel ABI.

There is little appetite for casting every kernel tracepoint in cement, though; many (or most) of them are really just debugging aids for developers. So what is needed is a way to mark some tracepoints as being part of the stable ABI. It was asserted that this marking must be done in the code itself; that way developers will see the ABI status of a given tracepoint and avoid changing it. Presumably something like Arjan's TRACE_EVENT_ABI patch would be used.

These sessions also included a quick demonstration of the ftrace, perf, and timechart tools. The demonstrations were interesting but not really amenable to a writeup here. The best thing for interested developers to do is to read the documentation and LWN articles about these tools and play with them for a while.

Next: LKML volume and related issues

Index entries for this article
Kernel	Performance monitoring

(Log in to post comments)

KS2009: The future of perf events

Posted Oct 19, 2009 18:29 UTC (Mon) by nevets (subscriber, #11875) [Link]

Note, as the author of the ftrace ring buffer, there's no reason that perf can not use the ftrace ring buffer as well. I hate calling it the "ftrace ring buffer" since it was made specifically as a separate entity.

I guess it is time to write up some patches and make perf use it ;-)