Speeding up D-Bus
The D-Bus interprocess communication (IPC) mechanism is used extensively by Linux desktop environments and applications, but it suffers from less-than-optimal performance. While that problem may not be so noticeable on desktop-class systems, it can be a real issue for smaller and embedded devices. Over the years there have been a number of attempts to add functionality to the kernel to increase D-Bus performance, but none have passed muster. A recent proposal to add multicast functionality to Unix sockets is another attempt at solving that problem.
D-Bus currently relies on a daemon process to authenticate processes and deliver messages that it receives over Unix sockets. Part of the performance problem is caused by the user-space daemon itself: messages need two trips through the kernel on their way to the destination (once on the way to the daemon and again on the way to the receiver), along with a wakeup of the daemon and an "extra" transition to and from kernel mode. Putting D-Bus message handling directly into the kernel would remove the daemon from the delivery path entirely, saving one of those transitions and one of the copies, which would improve performance.
If all of the D-Bus messages were simply between pairs of processes, Unix sockets could potentially be used directly between them. But there is more to D-Bus than that. Processes can register for certain kinds of events they wish to receive (things like USB devices being attached, a new song playing, or battery status changes for example), so a single message may need to be multicast to multiple receivers. That is part of what the daemon mediates.
Earlier efforts to add an AF_DBUS socket type (and an associated kdbus module) to handle D-Bus messages in the kernel weren't successful because kernel hackers were not willing to add the complexity of D-Bus routing. The most recent proposal was posted by Javier Martinez Canillas based on work from Alban Créquy, who also proposed the AF_DBUS feature. Instead of teaching the kernel about D-Bus, it adds multicast support to Unix (i.e. AF_UNIX) sockets, using packet filters so that receivers only get the messages they are interested in. That way, the routing is handled entirely by multicast delivery plus existing kernel infrastructure.
As described in Rodrigo Moya's blog posting, there are a number of reasons that a D-Bus optimization can't use the existing mechanisms in the kernel. Netlink sockets would seem to be one plausible alternative, and there is support for multicasting, but D-Bus requires fully reliable delivery even if the receiver's queue is full. In that case, netlink sockets just drop packets, while D-Bus needs the sender to block until the receiver processes some of its messages. In addition, netlink sockets do not guarantee the ordering of multicast messages that D-Bus requires.
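For illustration, here is a minimal sketch (written in C against current kernel interfaces, with error handling trimmed) of a listener on the kernel's uevent netlink multicast group. When such a listener falls behind and its receive buffer overflows, the kernel simply discards messages and flags the loss with ENOBUFS on the next read, rather than blocking the sender as D-Bus requires:

    /* Minimal sketch: join a netlink multicast group (the kernel's
       uevent group) and receive broadcasts.  If this process falls
       behind and its receive buffer overflows, the kernel drops
       messages and reports ENOBUFS on the next read -- the lossy
       behavior that makes netlink unsuitable as a D-Bus transport. */
    #include <errno.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <linux/netlink.h>

    int main(void)
    {
        int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_KOBJECT_UEVENT);
        struct sockaddr_nl addr = {
            .nl_family = AF_NETLINK,
            .nl_groups = 1,             /* uevent multicast group */
        };
        char buf[4096];

        if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
            return 1;

        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n < 0 && errno == ENOBUFS) {
                /* Messages were silently discarded; a D-Bus client
                   could not recover from this. */
                fprintf(stderr, "overrun: messages lost\n");
                continue;
            }
            if (n > 0)
                printf("%.*s\n", (int)n, buf);
        }
    }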
Another option would be to use UDP multicast, but Moya (and Canillas) seem skeptical that it will perform as well as Unix socket multicast. There is also a problem for devices that do not have a network card, because the lo loopback network device does not support multicast. Moya also notes that a UDP-based solution suffers from the same packet loss and ordering guarantee problems that come with netlink sockets.
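The loopback limitation is easy to verify from user space; this minimal sketch (assuming a standard Linux system) reads the interface flags for lo and reports whether the kernel advertises multicast capability on it:

    /* Minimal sketch: check whether the loopback interface
       advertises multicast capability.  On Linux, "lo" does not
       set IFF_MULTICAST, which is one obstacle to using UDP
       multicast as a D-Bus transport on network-less devices. */
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <sys/socket.h>
    #include <net/if.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct ifreq ifr;

        memset(&ifr, 0, sizeof(ifr));
        strncpy(ifr.ifr_name, "lo", IFNAMSIZ - 1);
        if (fd < 0 || ioctl(fd, SIOCGIFFLAGS, &ifr) < 0)
            return 1;
        printf("lo %s IFF_MULTICAST\n",
               (ifr.ifr_flags & IFF_MULTICAST) ? "sets" : "lacks");
        close(fd);
        return 0;
    }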
So, that left Créquy and others at Collabora (including Moya, Canillas, and others) to try a different approach. Créquy outlines the multicast approach on his blog. Essentially, both SOCK_DGRAM and SOCK_SEQPACKET socket types can create and join multicast groups which will then forward all traffic to each member of the group. Datagram multicast allows any process that knows the group address to join, while seqpacket multicast (which is connection-oriented like a SOCK_STREAM but enforces message boundaries) allows the group creator to decide whether to allow a particular group member at accept() time.
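In user-space terms, the proposal works through new socket options. The sketch below follows the interface described in the posted patches (a SOL_UNIX option level, UNIX_CREATE_GROUP and UNIX_JOIN_GROUP operations, and a struct unix_mreq naming the group); since the patches were never merged, none of those names exist in mainline headers, and the constants here are placeholders so that the sketch compiles:

    /* Sketch of the proposed AF_UNIX multicast interface, following
       the posted (never merged) patches: SOL_UNIX, UNIX_CREATE_GROUP,
       UNIX_JOIN_GROUP, and struct unix_mreq come from the patch set
       and do not exist in mainline kernels. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/un.h>

    #define SOL_UNIX          280   /* placeholder value */
    #define UNIX_CREATE_GROUP 1     /* placeholder value */
    #define UNIX_JOIN_GROUP   2     /* placeholder value */

    struct unix_mreq {
        struct sockaddr_un address;     /* the group's address */
        unsigned int       flags;
    };

    static void fill_mreq(struct unix_mreq *mreq, const char *path)
    {
        memset(mreq, 0, sizeof(*mreq));
        mreq->address.sun_family = AF_UNIX;
        strncpy(mreq->address.sun_path, path,
                sizeof(mreq->address.sun_path) - 1);
    }

    /* Daemon side: turn a listening SOCK_SEQPACKET socket into a
       multicast group; the peers it accept()s become members. */
    static int create_group(int fd, const char *path)
    {
        struct unix_mreq mreq;

        fill_mreq(&mreq, path);
        return setsockopt(fd, SOL_UNIX, UNIX_CREATE_GROUP,
                          &mreq, sizeof(mreq));
    }

    /* Datagram side: any process that knows the address can join. */
    static int join_group(int fd, const char *path)
    {
        struct unix_mreq mreq;

        fill_mreq(&mreq, path);
        return setsockopt(fd, SOL_UNIX, UNIX_JOIN_GROUP,
                          &mreq, sizeof(mreq));
    }

The seqpacket variant keeps admission control with the group's creator: membership is only granted when the daemon accept()s a connection, while a datagram group is open to anyone who knows its address.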
As Moya described, a client would still connect to the D-Bus daemon for authentication, and would then be added to the seqpacket multicast group for the bus. The daemon would also attach a packet filter that would eliminate any of the messages that the client should not receive. One of the patches in the set implements the ability for the daemon to attach a filter to the remote endpoint, so that it would be in control of which messages a client can see.
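The filtering half builds on existing infrastructure: a classic BPF socket filter can already be attached to a socket with the long-standing SO_ATTACH_FILTER option, as in the minimal sketch below (a trivial accept-everything filter; a real D-Bus filter would match on message headers instead). What the patch adds, and what is not shown here because it exists only in the patch set, is the ability for the daemon to attach such a filter to the remote end of a client's connection:

    /* Minimal sketch: attach a classic-BPF filter to a socket using
       the standard SO_ATTACH_FILTER socket option.  This filter
       accepts every packet; a D-Bus filter would instead inspect
       message headers and return 0 to drop uninteresting messages. */
    #include <linux/filter.h>
    #include <sys/socket.h>

    static int attach_accept_all(int fd)
    {
        struct sock_filter code[] = {
            /* return 0xffffffff: deliver the whole packet */
            BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
        };
        struct sock_fprog prog = {
            .len    = sizeof(code) / sizeof(code[0]),
            .filter = code,
        };

        return setsockopt(fd, SOL_SOCKET, SO_ATTACH_FILTER,
                          &prog, sizeof(prog));
    }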
The idea is interesting, but so far comments on the netdev mailing list have been light. Kernel network maintainer David Miller is skeptical that the proposal is better than having the daemon just use UDP:
I can't see how this is better than doing multicast over ipv4 using UDP or something like that, code which we have already and has been tested for decades.
Canillas responded by listing some of the reasons that UDP multicast would not serve their purposes, but admitted that no performance numbers had yet been gathered. Miller said that he will await those numbers before reviewing the proposal further, noting: "In many cases TCP/UDP over loopback is actually faster than AF_UNIX."
Even if UDP has the needed performance, some solution would need to be found for the packet loss and ordering issues. Blocking senders due to inattentive receivers may be a hard sell, however, as it seems like it could lead to denial-of-service problems, no matter which socket type is used. But it is clear that there is a lot of interest in better D-Bus performance. In fact, it goes well beyond just D-Bus, as "fast" IPC mechanisms are regularly proposed for the kernel. It's unclear whether multicasting for Unix sockets is suitable for that, but finding a way to speed up D-Bus (and perhaps other IPC-using programs) is definitely on some folks' radar.
Index entries for this article:
    Kernel: D-Bus
    Kernel: Message passing
    Kernel: Networking/D-Bus
Point optimization?
Posted Mar 3, 2012 1:39 UTC (Sat) by pflugstad (subscriber, #224) [Link]
Why does it feel like they're going at this problem the wrong way?
Is the problem really the amount of time it takes to (essentially) ping-pong a message between some processes? Seems to me that's already down at the microsecond level (10,000 ping-pong messages over 3.8 seconds in one of the linked articles works out to roughly 380µs per round trip) - far beyond the point where humans would notice even an order-of-magnitude speed-up. And Linux context switching is also ridiculously fast and optimized.
Maybe they need to really look at the interactions that seem to take a long time? Or maybe they need to look at the content of the messages - instead of sending five messages to update five semi-related things, maybe it's better to send one message that updates them all.
Maybe it's something else that's the problem - like maybe the way the applications listen for and respond to messages is problematic? Are applications polling the sockets instead of blocking on them in some way?
Seriously - what kind of interactions are going on over D-Bus that speeding them up by even 1.8x (from the same linked article) is really going to matter?
One thing I do have experience with: with almost all optimization problems, changing your fundamental algorithm (going from O(n^3) to O(n)) will do way more than any point optimization.
And given how fast IPC and context switching in Linux is already, this whole discussion feels like a point optimization and they're not really getting at the root of the problems.
Point optimization?
Posted Mar 4, 2012 4:01 UTC (Sun) by i3839 (guest, #31386) [Link]
All in all, D-Bus is a piece of bloatware: dbus-daemon is 300KB and libdbus-1.so is 240KB. More than half a megabyte to send some messages around; that's just sad. I really don't understand why embedded systems use D-Bus in its current form.
On my system I sadly can't avoid installing dbus anymore, but I'm certainly not running dbus-daemon. (A chmod -x /usr/bin/dbus-* helps a lot.)
Point optimization?
Posted Mar 4, 2012 8:46 UTC (Sun) by khim (subscriber, #9252) [Link]
More than half a megabyte to send some messages around; that's just sad.
Why? It's about the size of “Hello, World!”:
$ echo $'#include <stdio.h>\n int main() { printf("Hello, World!\\n"); }' | gcc -xc -static - ; size a.out
text data bss dec hex filename
657927 3488 12568 673983 a48bf a.out
Point optimization?
Posted Mar 5, 2012 0:37 UTC (Mon) by i3839 (guest, #31386) [Link]
Glibc's bloat is bad enough, but don't blame it on "Hello, World!"
At least Glibc gets used and shared by most programs; dbus isn't.
And Glibc's bloat is never an excuse for others to be bloated too.
Point optimization?
Posted Mar 5, 2012 4:31 UTC (Mon) by pflugstad (subscriber, #224) [Link]
IPC and context switching are blazing fast in Linux already - any improvement in IPC is down at the sub-millisecond level, and is not noticeable by humans. And given how D-Bus seems to be used, it seems like it's mostly on the human scale of things.
So I'm really at a loss to figure out what they are trying to fix here. If they're seeing some really slow response times from some events, then I find it extremely hard to believe that improving the IPC by a factor of 1.8x (or honestly, even 10x) is going to make any _noticeable_ difference at all.
So: are they really looking at the right thing? Is IPC truly the bottleneck and the thing causing whatever perceived slowness they are trying to fix? Or, are they parsing XML at runtime, which is notoriously slow? Or something like that?
Point optimization?
Posted Mar 6, 2012 6:25 UTC (Tue) by i3839 (guest, #31386) [Link]
A huge library size for simple functionality is a clear sign of badly written or designed code, with all the downsides that come with that: inefficient, unnecessarily complex code which is hard to debug and hard to optimise properly.
Context switching may be very fast, but if you trash all your caches on every switch, it all goes down the drain.
So the main reason improving the IPC works is probably that less of dbus's crappy code gets run. If dbus does the multicasting, it sends messages one by one. The processes receiving them were most likely idling, while dbus-daemon, the bloated pig it is, probably did enough jumping around its own code to eat up its timeslice. So the process receiving the message gets scheduled, does its thing, and only then does dbus-daemon get a new time slice and send the message to the next process waiting for it (SMP should help a lot, though). This cycle repeats itself until all processes have received the message. If dbus-daemon were mean and lean, this extra ping-ponging wouldn't be very noticeable and wouldn't happen as much. By pushing the multicasting into the kernel, this particular problem is avoided.
Sending a short message to multiple processes should be very fast, and we agree that isn't what makes dbus so slow. What makes it slow is all the other things it does for no good reason, but what exactly all that is, I don't know. It could be a bug too.
Point optimization?
Posted Mar 6, 2012 7:41 UTC (Tue) by khim (subscriber, #9252) [Link]
A huge library size for simple functionality is a clear sign of badly written or designed code, with all the downsides that come with that: inefficient, unnecessarily complex code which is hard to debug and hard to optimise properly.
Or perhaps it's just the code needed for the task at hand. How can you distinguish these two cases?
Sending a short message to multiple processes should be very fast, and we agree that isn't what makes dbus so slow. What makes it slow is all the other things it does for no good reason, but what exactly all that is, I don't know.
Let's summarize the discussion:
1. You have no idea about the dbus design.
2. You have no idea about the task dbus is trying to solve.
3. Yet “you know for sure” that it's a bloated pig which trashes caches, and that this is why it's slow.
Real feat of solid engineering thought! Not.
Compare libpthread from GLibC and bionic, for example. GLibC's is about three or four times larger, yet in a lot of cases it's 10-100 times faster (I'm not joking).
Sometimes you need a lot of code because the task you are trying to solve requires a lot of code. Sometimes it's just legacy. To say that bloat indeed affects performance you need benchmarks, not handwaving.
If dbus-daemon were mean and lean, this extra ping-ponging wouldn't be very noticeable and wouldn't happen as much.
And this is what I'm talking about: why are you so sure dbus-daemon trashes everything in the CPU if its size is much smaller than the CPU cache? Do you have any evidence that your outrageous theories have any relation to what happens in practice? Or are you just writing rubbish because you can?
Point optimization?
Posted Mar 5, 2012 7:11 UTC (Mon) by khim (subscriber, #9252) [Link]
At least Glibc gets used and shared by most programs; dbus isn't.
Actually dbus is already used by a lot of programs and I suspect in the future it'll be more fundamental, not less:
$ ldd /usr/bin/* | grep libc.so.6 | wc
1181 4724 59026
$ ldd /usr/bin/* | grep libdbus-1.so.3 | wc
196 784 11760
From this POV its size is not excessive, and before we can call it “just sad” it's always a good idea to compare the “bloated pig” with a “lean and mean” alternative. In the case of GLibC, its size is justified not by the fact that it's the most popular Linux libc, but by the fact that it's the only one with efficient threading, adequate i18n, etc. Oh, and a significant piece of GLibC is backward-compatibility functions: today's GLibC is still compatible with all the programs compiled a decade ago (and more). This may be an insignificant advantage for you, but for me it's important. What's your alternative to dbus, and how does it compare to dbus feature-wise?
Point optimization?
Posted Mar 5, 2012 12:22 UTC (Mon) by mpr22 (subscriber, #60784) [Link]
I really don't understand why embedded systems use D-Bus in its current form.
Wild guess: because accommodating the extra RAM and nonvolatile storage consumption is easier than taking an existing thing (or things) they want to use which happens to use dbus and rewriting it to not use dbus.
Point optimization?
Posted Mar 20, 2012 7:19 UTC (Tue) by alison (subscriber, #63752) [Link]
I really don't understand why embedded systems use D-Bus in its current form.
mpr22 responded:
Wild guess: because accommodating the extra RAM and nonvolatile storage consumption is easier than taking an existing thing (or things) they want to use which happens to use dbus and rewriting it to not use dbus.
A lot of embedded systems that rely on BlueZ keep D-Bus around to service it. It's my understanding from an article (on LWN?) that the Android team is creating an alternative Bluetooth stack at least in part to get rid of D-Bus.