Type checking for BPF tracing
BPF is heavily used in tracing applications to gain access to useful kernel information and to perform data aggregation in kernel space. There are two variants of these programs. If a tracepoint has been placed in a useful location in the kernel, a BPF program can be attached there; otherwise, a kprobe can be placed at (almost) any kernel location and used to trigger a BPF program. Either way, the BPF verifier currently has little visibility into the data that will be passed to those programs.
Consider, for example, the trace_kfree_skb tracepoint placed in net_tx_action(). When this tracepoint triggers, any handlers (including attached BPF programs) will be passed two pointers, one to the sk_buff structure representing the network packet of interest, and one to the function that is freeing that packet. The type information associated with those pointers is lost, however; the program itself just sees a pair of 64-bit unsigned integers. Accessing the kernel data of interest requires casting those integers into pointers of the correct type, then using helpers like bpf_probe_read() to read the data behind those pointers. A series of bpf_probe_read() calls may be needed to walk through a data structure and get to the data the tracing program is actually looking for.
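As a rough illustration of what such a program looks like today, consider the following libbpf-style sketch; the section name, the forward declaration, and especially the hardcoded offset are made up for the example rather than taken from any real kernel:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct sk_buff;	/* opaque: the verifier knows nothing about its layout */

SEC("raw_tracepoint/kfree_skb")
int handle_kfree_skb(struct bpf_raw_tracepoint_args *ctx)
{
	/* args[0] is "really" a struct sk_buff * and args[1] the freeing
	 * function, but to the verifier they are just 64-bit scalars. */
	struct sk_buff *skb = (struct sk_buff *)ctx->args[0];
	unsigned int len = 0;

	/* Walk into the structure one bpf_probe_read() at a time; a wrong
	 * cast or a wrong offset silently reads the wrong memory.  The
	 * offset below is purely hypothetical. */
	bpf_probe_read(&len, sizeof(len), (void *)skb + 104);

	bpf_printk("freeing skb %p, len %u\n", skb, len);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```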
The problem is that a BPF program can cast one of these values into any type it likes; the result need not correspond to the actual type of that data. A mistake could cause a BPF program to go off into the weeds; in one worst-case scenario, the program could wander into a memory-mapped I/O area and cause some real damage. This isn't generally a security issue, since tracing is a privileged operation to begin with, but it is a safety issue — exactly the sort of issue that the BPF verifier is meant to prevent.
This problem has existed since the kernel first gained the ability to attach BPF programs to tracepoints and kprobes. Meanwhile, BPF developers have been working on an entirely different problem: the lack of binary portability for BPF programs. These programs go digging around in kernel data structures, but the layout and content of those structures varies depending on the kernel configuration, the underlying architecture, and more. The data of interest to any given program may be located 12 bytes into a structure on one kernel, but only 8 bytes into that structure on a different kernel. Without the ability to "relocate" these references, BPF users must rebuild their programs on every target system.
The "compile once run everywhere" effort has, over the last couple of years or so, worked to address this problem through the creation of a compact, machine-readable description of the kernel's data structures. This "BPF type format" (BTF) data is provided by the kernel itself, but it can be used by user-space support libraries to adjust a binary BPF program for a local kernel before loading it, mostly solving the binary portability issue. But it turns out that BTF information has other uses as well.
In particular, it is possible to annotate tracepoints with information about the types of the data values passed to handler programs. That allows the verifier to ensure that those programs are working with the correct data types. It also makes it possible for the C handler programs to follow pointers directly; when those programs are compiled to BPF and loaded into the kernel, the verifier can implicitly substitute the bpf_probe_read() calls where they are needed — after performing the necessary type checking, of course.
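A sketch of what such a BTF-typed handler might look like follows; the "tp_btf" section name and the context-structure layout loosely follow the conventions used by libbpf and the kernel selftests, and the details should be read as illustrative rather than definitive:

```c
#include "vmlinux.h"		/* kernel type definitions generated from BTF */
#include <bpf/bpf_helpers.h>

/* The tracepoint's arguments, laid out as the program receives them. */
struct kfree_skb_args {
	struct sk_buff *skb;
	void *location;
};

SEC("tp_btf/kfree_skb")
int trace_kfree_skb(struct kfree_skb_args *ctx)
{
	struct sk_buff *skb = ctx->skb;

	/* Direct dereference: no explicit bpf_probe_read() calls, and the
	 * verifier rejects the program if the types do not match. */
	struct net_device *dev = skb->dev;
	int ifindex = dev ? dev->ifindex : 0;

	bpf_printk("kfree_skb on ifindex %d\n", ifindex);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```

A mistaken cast or field access in a program like this fails at load time rather than silently reading the wrong memory at run time.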
The end result of all this will be BPF tracepoint handler programs that are safer and far less error-prone to write. Whether tracing is "revolutionized" remains to be seen, but it is clearly improved in a significant way.
What is decidedly not revolutionized is data access within kprobe handlers. A kprobe can be set anywhere in the running kernel, and it is given access to the contents of the processor registers when the probe is hit. It is not, at this point, possible for the verifier to know what will be in those registers at that time, so this kind of checking cannot be done. That means that, especially in the parts of the kernel that are not amenable to the addition of proper tracepoints, the use of BPF programs without this sort of type checking will have to continue.
That said, progress is progress, and this work will increase the safety of much of the tracing code that is currently in use. It has been queued in the bpf-next tree so, barring some sort of last-minute hitch, it can be expected to show up in the 5.5 kernel.
Please remember that the kernel is for running userspace!
Posted Oct 30, 2019 4:11 UTC (Wed) by ringerc (subscriber, #3071) [Link]
I'm excited to see where BPF / eBPF tracing is going.

Remember userspace
But I'm also seeing echoes of a repeating pattern here. Userspace, as always, seems to be relegated to an uninteresting afterthought. Perhaps less so than was the case with perf, but still to a concerning degree.

Generations of tools with production userspace as an afterthought
We've been through a lot of tracing tools and frameworks, both for generating/annotating traceable data and for collecting/processing it: uprobes, ftrace, dtrace-for-linux, SystemTap, linux-perf and perf_events, LTTng, and more. They're inter-related and not a simple generational series of changes.
But one pattern seems to have persisted pretty strongly: nothing works very well with "real world" production userspace deployments initially, yet earlier generations of tools tend to be abandoned. If you go to use one tool, you're told it's obsolete and that you should use the new one instead; but the new one doesn't actually work on any of the systems widely deployed in production and won't for a long time yet. When it does, it probably doesn't support external debuginfo (like every system has had for 10+ years) for another year or two after that, because really, who debugs packaged binaries on production systems?
So this time around I'm going to plead for some attention to userspace, including the boring parts of userspace above the C library, allocator and syscall wrappers. Because tracing is incredibly useful when working with complex real world systems, in production, supporting live workloads. But that seems to be the last priority when tools are being worked on.
(If I can get involved in relevant testing, UX, bug fixing etc please reach out; craig.ringer@2ndquadrant.com is the best address and make sure to mention tracing in the subject.)
SystemTap is amazing with userspace (now)
I recently got around to learning SystemTap after, again, being unable to make perf do what I want for my tracing needs. It's *amazing*. It has a few frustrating deficiencies and I understand that its "kernel modules compiled on demand" design doesn't have a future. But the tool itself is like nothing else I've ever used, it's incredible, warts and all. In particular:

- It can traverse complex data structures, with logic where needed, not just simple expressions
- It can access global variables (via DWARF lookups, with debuginfo) which is vital in nontrivial systems
- It can integrate events from multiple tracked processes to observe complex behaviour of multi-process systems
- It can place all sorts of probe events on function calls, returns, static tracepoint markers, statement level probes, process start/end, and much more
- ... and it makes all that pretty easy so you can focus on the actual details of the program you are tracing without rewriting half the tools you need each time.
Example I used it for in PostgreSQL
With SystemTap I was able to rapidly gain insight into the deep guts of PostgreSQL's behaviour in a way I've never been able to do before, even when patching PostgreSQL itself and post-processing trace logs. I can follow the lifecycle of nontrivial events in the system and collect timing and stats on them. My stap script observes a lot of activity:

* observe all backend fork()s to determine backend process type (postmaster, logical or physical walsender, client backend, etc) and decide what to observe in that backend dynamically
* For user backends:
  * frontend connects
  * backend attaches
  * backend starts transaction
  * backend assigns non-virtual transaction-id
  * backend ends txn?
    * rollback? Stop tracking
    * commit? See if we need to follow it through logical decoding or not by inspecting the state at commit time (current db, active replication slots, etc)
    * prepare 2pc txn? Add it to a "prepared xacts" track-list
  * backend runs a 2PC commit prepared or rollback prepared?
    * record time from prepare to commit/rollback
    * then observe logical decoding as if it's a normal commit
* logical replication walsender backends:
  * notice when we start buffering data for a new committed xid, start timing
  * wait for logical decoding to finish buffering the commit
  * when commit is decoded, record total buffering time
  * observe output plugin calls when buffered commit is being sent to client. Take timing info and stats on the types of operations sent and processing duration. Record info on skipped-over rows, tables, etc where we buffered them but discarded the buffered data.
  * record time final commit sent to client
  * wait for the receiver to confirm that the txn was received in full
  * record time client confirmed commit
* logical replication receiver / apply worker backends:
  * observe when a commit arrives from the logical walsender
  * associate it with the timing data collected for it on the origin side
  * record how long we take to apply it, whether it waits for locks, which ones and for how long
  * report final apply time
* Generate summary statistics periodically and on exit
* Optionally generate detailed trace logs of targeted activity, optionally filtered to focus only on one client application / client connection

It's amazing to have this level of insight. And now that I've written the basic tapset to inspect the relevant parts of PostgreSQL I can massage it to extract all sorts of timing, performance, and behaviour data. (It'd be a lot more amazing if it weren't for some bugs in systemtap that make it a LOT harder to make the foundations I wrote reusable, though, per reports to the seemingly dead systemtap mailing list). I even extended it so it can associate two sides of a network connection as logical replication peers (if they're on the same host only, so far) and observe message-passing timings and processing delays across the wire. So I can see whether the sending side is waiting for more data to send, waiting for the client to consume its buffer, etc. I can see if the sender is delayed by the client waiting on a lock, waiting on I/O, etc.
Out with the old, in with the new? Again?
But I've since learned that SystemTap is considered pretty much dead and obsolete. Yet I cannot use eBPF on any production systems I'm likely to encounter in the next year or two. And even then, will I be able to get this level of access to userspace? Meanwhile perf never really gained complete enough tracing features to be of much use without a large amount of postprocessing using "perf script". It cannot even time syscalls from the enter to the exit sdt probe!

So please consider userspace
I'd love to help out if anyone working on these tools wants constructive input, testing, bug checking etc from someone who works on the PostgreSQL database backend in real world production deployments. Reach out to me, craig.ringer@2ndquadrant.com.
Please remember that the kernel is for running userspace!
Posted Oct 30, 2019 4:24 UTC (Wed) by ringerc (subscriber, #3071) [Link]
* https://sourceware.org/ml/systemtap/2019-q4/msg00012.html
* https://sourceware.org/ml/systemtap/2019-q4/msg00013.html
* https://sourceware.org/ml/systemtap/2019-q4/msg00014.html
* https://stackoverflow.com/q/58443096/398670
* https://stackoverflow.com/q/58443706/398670
If I had to list priorities for what counts as "useful for userspace" these would be some of the most important:
* DWARF based probes on function entry and exit
* Support for external DWARF debuginfo
* Support for reading SDTs from .probe ELF sections like perf's "sdt:" or stap's "mark"
* process-entry and process-exit probes
* A fast, efficient output channel, preferably structured objects not just text so it can be ingested for postprocessing/integration
* Access to function parameters on function entry, return value on exit
* A few ways to target processes:
  * pid or list of pids
  * parent pid
  * process group id
  * cgroup id
  * fully qualified executable(s) path(s) including libraries
* probing in ld.so link time libraries
* probing in dlopen()ed libraries
* Simple way to record the values of string arguments not just their addresses
* Expressions for traversal of probed argument structs using DWARF debuginfo
  * preferably with null-safe traversal like Groovy's wonderful ?. operator
* A means of accessing global variables, akin to stap's @var with optional compilation unit and module/executable name to qualify resolution
* A means of casting pointers to datatypes extracted from DWARF data like stap's @cast operator so they can then be traversed
* Some level of control/feedback back into the probes so recording/capture/reporting can be turned on and off contextually, so you can focus only on data of interest
So, pretty much "what systemtap does".
I expect I could live without the global associative arrays etc if probes ran in their own per-process (or per-thread) spaces and had some way to easily and efficiently turn particular probes on/off per-process/per-thread, define conditional expressions for which they'd fire, etc.
Similarly, stap's @entry feature to "remember" function args at entry time for probes on function return is great. But it can largely be worked around externally, at the cost of sometimes reporting much more information than needed to the postprocessing tools.
If anyone's been playing with this stuff around eBPF please poke me. Also, if anyone's been exploring ideas for a new generation of dtrace-alike SDT probes that embed argument name and datatype information, let me know, I'm really keen to get involved.
Please remember that the kernel is for running userspace!
Posted Oct 30, 2019 17:36 UTC (Wed) by mjw (subscriber, #16740) [Link]
Given that you seem to be a huge systemtap fan I assume you already know about stapbpf?
Which tries to bring everything stap does through generating BPF.
It is not fully on par yet, but getting there:
https://sourceware.org/systemtap/wiki/stapstapbpfComparison
If you haven't looked at it yet, then these are nice articles about it:
https://developers.redhat.com/blog/2017/12/13/introducing...
https://developers.redhat.com/blog/2017/12/15/bpf-maps-us...
https://developers.redhat.com/blog/2018/04/23/systemtaps-...
stapbpf
Posted Oct 31, 2019 15:06 UTC (Thu) by ringerc (subscriber, #3071) [Link]
I've had a quick look at stapbpf, yes, and I think the direction looks interesting. I tried it with stap 4.1 and couldn't capture and report a string variable's value though, so I pretty much wrote it off as not complete enough to be useful yet. I just retried with git HEAD with the same outcome.
semantic error: function may not be used with bpf runtime: identifier 'user_string_n' at /usr/share/systemtap/tapset/uconversions.stp:114:10
        source: function user_string_n:string (addr:long, n:long)
                         ^
Pass 2: analysis failed. [man error::pass2]
Number of similar error messages suppressed: 2.
Rerun with -v to see them.
I've learned a bit more about stap since then, though, and noticed the bpf:: tapsets. But they seem to be mostly a WIP skeleton.
stapbpf
Posted Oct 31, 2019 18:17 UTC (Thu) by smakarov (subscriber, #135270) [Link]
https://sourceware.org/bugzilla/buglist.cgi?component=bpf...
At the moment, if you want to understand what BPF can and can't do, bcc/bpftrace versus SystemTap LKM is probably a more fair comparison.
user_string_n() would have been implementable rather quickly, if all we cared about is x86_64:
function user_string_n:string (addr:long, n:long) {
    // In BPF the same bpf_probe_read() functions that read data from a kernel
    // address are also used by bcc tools to read from user space addresses.
    // It Happened To Work On x86_64, So Why Not^tm?
    return kernel_string_n (addr, n) // <- calls bpf_probe_read_str()
}
I missed the availability of this misfeature my first time going through the BPF helper documentation, which says "this function reads data from an address" without specifying whether it's a kernel or user address. Looking at the implementation I noted it calls a function that takes kernel addresses. "Seems like BPF doesn't support reading from userspace yet," thought I, and left it at that.
The issue caught the attention of Torvalds in February:
https://lore.kernel.org/lkml/CAHk-=wjJqVXydQf_dprBmr=zXsy...
Now in October it's being fixed:
https://lore.kernel.org/bpf/637eba87807516061f1fee9353605...
Then existing BPF tools relying on the misfeature will be fixed, e.g.:
https://github.com/iovisor/bpftrace/issues/614
When the bpf_probe_read_{user,kernel} patch makes it into a release, stapbpf can add userspace tapset functions that won't return the wrong data when there's a discrepancy between user and kernel addressing.
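To illustrate the difference the split makes (a sketch only; the probed function, the argument position, and the buffer size are assumptions for the example), a handler can then state explicitly which address space each pointer lives in:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

SEC("kprobe/do_sys_open")
int probe_do_sys_open(struct pt_regs *ctx)
{
	/* The second argument of do_sys_open() is a user-space pathname. */
	const char *upath = (const char *)PT_REGS_PARM2(ctx);
	char path[64] = "";
	int tgid = 0;

	/* Say explicitly that this pointer is a user-space address ... */
	bpf_probe_read_user_str(path, sizeof(path), upath);

	/* ... and that this one is a kernel address. */
	struct task_struct *task = (struct task_struct *)bpf_get_current_task();
	bpf_probe_read_kernel(&tgid, sizeof(tgid), &task->tgid);

	bpf_printk("pid %d opening %s\n", tgid, path);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";
```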
Please remember that the kernel is for running userspace!
Posted Oct 30, 2019 19:12 UTC (Wed) by mhumpula (guest, #108642) [Link]
https://github.com/iovisor/bcc
https://github.com/iovisor/bpftrace/
Please remember that the kernel is for running userspace!
Posted Nov 1, 2019 4:35 UTC (Fri) by ringerc (subscriber, #3071) [Link]
I'm hoping to find time to look at eBPF-tools, bcc and bpftrace soon though.
Please remember that the kernel is for running userspace!
Posted Oct 30, 2019 10:31 UTC (Wed) by fuhchee (subscriber, #40059) [Link]
That depends entirely on the beholder. Thank you for your reports / suggestions to the systemtap@ mailing list, we'll contemplate them and get back to you.
Please remember that the kernel is for running userspace!
Posted Nov 1, 2019 4:37 UTC (Fri) by ringerc (subscriber, #3071) [Link]
So I'm hoping it'll be possible to transition it over to BPF in a useful way rather than the community doing the "out with the old, in with the new" thing and jumping on bpftrace/bcc/whatever.
Type checking for BPF tracing
Posted Oct 30, 2019 5:30 UTC (Wed) by marcH (subscriber, #57642) [Link]
So the kernel really is developing a new, "domain-specific" programming language.
NIH?
Type checking for BPF tracing
Posted Oct 30, 2019 6:01 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]