A vDSO implementation of getrandom()

By Jonathan Corbet
January 6, 2023

Most developers probably do not see the generation of random numbers as being a performance bottleneck for their programs, but there are seemingly exceptions. Over the last few years, Jason Donenfeld has brought a new level of energy to the development of the kernel's random-number generator; he is now directing his efforts toward improving performance for user space with this patch series that provides an implementation of the getrandom() system call in the kernel's "virtual dynamic shared object" (vDSO) area. The result is, indeed, better performance, but not all developers see this benefit as being worth the additional complexity required to achieve it.

Traditionally, user-space processes on Linux systems have obtained random data by opening /dev/urandom (or /dev/random) and reading data from it. More recently, the addition of getrandom() simplified access to random data; a call to getrandom() will fill a user-space buffer with random data from the kernel without the need to open any files. This random data is provided with all of the guarantees that the kernel can make, including doing its best to ensure that the data is actually random and preventing repeated data sequences when, for example, a virtual machine forks.

It's worth noting that, in the BSD world, it is more common to call the arc4random() library function. The 2.36 release of the GNU C Library included an implementation of arc4random() that, in its pre-release form, included a fair amount of its own logic for the generation and management of random data. In July 2022, Donenfeld questioned the need for this function, noting that "getrandom() and /dev/urandom are extremely fast". Supporting arc4random() makes code more portable, though, so that function stayed in the library. The version that was eventually released was significantly simplified by Donenfeld, to the point that it essentially a wrapper around getrandom() when that system call is available. As a result, the performance of getrandom() also determines how fast arc4random() will be.

The vDSO API

As "extremely fast" as getrandom() may be, some users have apparently complained that it still is not fast enough. In response, Donenfeld is now working to create a vDSO implementation. The vDSO is a special range of kernel memory that is mapped into each user-space process; it contains code that can be called to perform system-call-like functions that can be carried out without actually entering the kernel, thus avoiding the cost of a context switch. System calls implemented in this way can include getcpu() and clock_gettime() — both of which come down to simply reading some data out of the kernel.

Moving getrandom() into the vDSO has the potential to improve performance for applications that need a lot of random data, but it will clearly need to involve more than just reading some kernel data if the guarantees made by that system call are to be upheld. As a result, the patch series implementing this functionality touches 59 files and adds some 1,200 lines of code. It also complicates the interface somewhat, in that there are now two (virtual) system calls involved. The first must be called at least once prior to requesting any random data:

    void *vgetrandom_alloc(unsigned int *num, unsigned int *size_per_each,
                           unsigned long addr, unsigned int flags);

The caller (likely to be the C library) should specify, in *num, the number of "opaque states" needed; that number is likely to correspond to the number of threads that are expected to run. This call will allocate a range of memory, returning its address. The number of states actually allocated will be stored in *num, and the size of each state in *size_per_each. The other two arguments are currently unused and must be zero.

The library is expected to assign one of the returned states to each thread in a process; the pointer to any given state can be had by offsetting the base address by a multiple of the value returned in *size_per_each. The kernel uses this space to track the state of the random-number generator for the thread; user space should not access its contents. Should more state-storage space be needed, more calls can be made to vgetrandom_alloc().

With the state space in hand, random data can be obtained with:

    ssize_t vgetrandom(void *buffer, size_t len, unsigned int flags,
    		       void *opaque_state);

The first three arguments are the same as for getrandom(), while opaque_state points to one of the states returned by vgetrandom_alloc(). The intent is for this function to behave just as getrandom() would, only faster. Applications would not normally call it directly; instead, it would be invoked from the C library's getrandom() wrapper. Donenfeld said (in the cover letter) that "performance is pretty stellar (around 15x for uint32_t generation), and it seems to be working".

`VM_DROPPABLE`

While some developers are clearly unconvinced about the need to optimize getrandom(), most of the patch series is relatively uncontroversial at this point. There is one significant exception, though: the management of the "opaque state" data. This data is used by the kernel to ensure that each caller to vgetrandom() gets a unique stream of random data, that each thread's random-number generator is reseeded as needed, and so on; it can be thought of as a sort of per-thread entropy storage that can be consulted without entering the kernel. For the curious, each state entry is defined (in the kernel) as:

    struct vgetrandom_state {
	union {
	    struct {
		u8	batch[CHACHA_BLOCK_SIZE * 3 / 2];
		u32	key[CHACHA_KEY_SIZE / sizeof(u32)];
	    };
	    u8		batch_key[CHACHA_BLOCK_SIZE * 2];
        };
    	vdso_kernel_ulong	generation;
    	u8			pos;
    	bool 			in_use;
    };

This structure occupies 256 bytes, which is not huge, but a system running a lot of threads could have quite a few of them. This is kernel-allocated memory that, in effect, must be locked into RAM; writing it out to secondary storage could expose the state of the random-number generator, potentially creating security problems. The state memory must thus be treated specially.

Earlier versions of the patch set used mlock() to ensure that this memory would stay in place. There were some problems with that approach, though, starting with the fact that locked memory is subject to resource limits. If the kernel and C library start using some of a process's allowed locked memory for random-number generation, once-working programs could start failing. Locked memory will also become unlocked in the child process after a fork. So using mlock() is not an ideal solution.

The memory used for states has another interesting characteristic, though: it can simply vanish at any time with minimal consequences. Since it is, at its core, a cache of random data, it can be replaced with more random data. This can already happen if, for example, the thread forks. In the worst case, should the state memory not be available at all, the vDSO function can just make a proper call to getrandom() to get its job done (albeit more slowly). So, while the state memory should be locked in the sense of not being written to swap, it does not need to be absolutely nailed down in RAM; it can be thrown away if need be.

Donenfeld decided to take advantage of the disposable nature of this memory. Rather than using mlock(), the current patch set adds a new (kernel-internal) memory flag called VM_DROPPABLE. Memory marked with this flag will never be written to swap or a core dump, and the memory-management subsystem can, if memory is tight, just reclaim it for other uses. VM_DROPPABLE memory is demand-paged, in that it is not actually allocated until an attempt is made to access it; normally the failure to allocate a page in this circumstance will be fatal for the process involved. With VM_DROPPABLE memory, instead, failures are simply ignored and any attempted writes are simply dropped — an outcome that is effected with some low-level, architecture-specific magic that simply skips the instruction attempting the write.

Dropping `VM_DROPPABLE`

Ingo Molnar objected strongly to this patch, saying that it adds complexity to a subsystem that needs a high level of trust. It would be better, he said, to just make mlock() work as needed or simply let the vDSO allocate a few pages outside of the existing resource limits. Donenfeld reacted poorly, questioning Molnar's motives. Molnar then vetoed the patch, leading to more complaints from Donenfeld, who did point out, though, that a process can make an arbitrary number of calls to vgetrandom_alloc(), so care needs to be taken to prevent it from being used to circumvent the limits on locked memory. Simply tweaking mlock() would not be a sufficient solution.

Andy Lutomirski also disliked VM_DROPPABLE, saying that Donenfeld was "trying to shoehorn all kinds of bizarre behavior into the core mm". He suggested that Donenfeld could, instead, create a special type of mapping, with its own local implementation, to obtain the needed semantics; the kernel provides the infrastructure needed to do this. There would then be no need to modify the core memory-management subsystem. Donenfeld had a relatively positive reaction to this suggestion and seemed ready to proceed in that direction. Linus Torvalds, though, argued that none of this effort was worth it. Speeding up random-number generation, beyond a point, is not the kernel's job, he said; instead, that should be left to the C library. After some discussion, Torvalds made his position clear:

I'm NAK'ing making invasive changes to the VM for something this specialized. I really believe that the people who have this issue are *so* few and far between that they can deal with the VM forking and reseeding issues quite well on their own.

Donenfeld has also been clear in his disagreement with Torvalds's assessment of the need for this feature, though; he claimed that it just isn't possible to create a safe random-number generator in user space (the patch cover letter also covers that ground in depth). He later went on to say:

I think the moral of yesterday's poo-poo'ing all over this cool new idea is that the Linux innercircle doesn't really care for "security things" as a motivator and just takes the shortest and easiest route toward swatting it away like a gadfly, assuming that the concerns are unreal or niche or paranoid or whatever.

It will probably come as little surprise that Torvalds disagreed with this view.

In the end, though, there do appear to be valid arguments, from both performance and security points of view, for a vDSO getrandom() implementation. So work on the vDSO patches is likely to continue, but the VM_DROPPABLE flag is clearly not going to clear the bar. Donenfeld will almost certainly return with an implementation based on Lutomirski's suggestions; that should address the main concerns that have been raised about this work. At that point, its chances of getting upstream are probably reasonably good, even if some developers remain unconvinced about how necessary it really is.

Index entries for this article
Kernel	Random numbers
Kernel	Security/Random number generation

(Log in to post comments)

VM_DROPPABLE

Posted Jan 6, 2023 17:25 UTC (Fri) by geofft (subscriber, #59789) [Link]

VM_DROPPABLE seems like a necessary (though perhaps insufficient) primitive for doing this in userspace, anyway. All the arguments Jason set out for why it's needed - e.g., the memory should not be swapped out, but it's okay if under memory pressure it just disappears because you can repopulate it from normal getrandom(), so it shouldn't count as mlocked - seem like they'd apply equally well to a userspace RNG as to a kernelspace one.

I'm actually kind of surprised to see so much objection to it; is it because it's not being exposed to userspace and so it feels like complexity? A MAP_DROPPABLE mmap() flag seems like it would genuinely be useful to userspace, all the more so if this patchset as a whole is nacked.

VM_DROPPABLE

Posted Jan 6, 2023 20:59 UTC (Fri) by mwsealey (subscriber, #71282) [Link]

It would on the face of it be a ridiculously good thing to use for any and all kinds of application cache where you're really trying to keep data in memory but to be honest wouldn't care if it wasn't - a quick glance at any web browser where the original resources may as well be on disk or over the network but would work best if they were available in memory in a form a browser needs it. I'm not considering that you could just open a file and hope the kernel fills the page cache with bits of files, but say.. decoded or decompressed images. What about compiled GPU shaders or transient texture data for open world games or mapping software which shuffle this data in and out all the time?

One thing it's probably not all that good for is random number storage, if that's the only use case anyone managed to come up with. But, per comments, the Linux folk do seem to swat away things if they only have one use case, and they can't use their gigas brains to come up with one other potential user for it. Things have to be in the kernel for userspace to take advantage of it, you won't see 50 userspace applications using an MM feature if it doesn't exist yet.

VM_DROPPABLE

Posted Jan 6, 2023 21:18 UTC (Fri) by abatters (✭ supporter ✭, #6932) [Link]

I remember reading the LWN articles on volatile ranges from years ago. Not sure if that work was ever merged, but madvise(MADV_FREE) does exist.

VM_DROPPABLE

Posted Jan 7, 2023 15:40 UTC (Sat) by Paf (subscriber, #91811) [Link]

I mean, this is essentially how the page cache (ie, the cache for data read from files) works today. Its contents can be evicted arbitrarily and will be refilled as needed from the backing file. That covers the main need for userspace I think.

VM_DROPPABLE

Posted Jan 8, 2023 16:54 UTC (Sun) by geofft (subscriber, #59789) [Link]

VM_DROPPABLE as defined here does something different, if I'm understanding the patchset comments correctly: if it's evicted, it reads in as zeroes (it doesn't automatically refill) and more interestingly if there isn't enough room to allocate pages for it, writes are just silently dropped, meaning you never segfault or OOM-kill from either reads or writes.

I think the use case for userspace is caches of calculations where it doesn't make sense to allocate a file on disk - not only is the disk space wasteful, you'd prefer to just be told "the data is lost" and recalculate it instead of doing disk I/O, because the calculation takes less time than paging from disk.

VM_DROPPABLE

Posted Jan 6, 2023 22:14 UTC (Fri) by saladin (subscriber, #161355) [Link]

Later comments in the thread were talking about a VM_DROPPABLE flag that would send a signal to userspace when the dropped page was accessed. The dropping could be due to memory pressure, process/VM forks, or what have you.

This use case would be harder for glibc to plumb than Jason's current patch, but would be much more generally applicable for uses like in-memory caching, storing secrets, or something more esoteric (emulators often use custom SIGSEGV handlers already). I think that that's a better approach, if there's some big program that could use the flag.

VM_DROPPABLE

Posted Jan 7, 2023 2:40 UTC (Sat) by Cyberax (✭ supporter ✭, #52523) [Link]

One other thing, it might be good to integrate this with hypervisors, so that the random pools are dropped on hypervisor-based fork/restore.

A vDSO implementation of getrandom()

Posted Jan 6, 2023 17:27 UTC (Fri) by Manifault (subscriber, #155796) [Link]

>I think the moral of yesterday's poo-poo'ing all over this cool new idea is that the Linux innercircle doesn't really care for "security things" as a
>motivator and just takes the shortest and easiest route toward swatting it away like a gadfly, assuming that the concerns are unreal or niche or
>paranoid or whatever.

Yikes, IMO that is just not the right attitude at all. We're talking about mapping random numbers to VDSO? Come on, requiring that kind of performance for random numbers is a niche use case (at least at this point) whether you want to admit it or not. That doesn't mean it isn't valuable, but _everything_ is a complexity tradeoff, and pretending like this is just a matter of people not caring about security really feels like a poor-faith reading of the room more than anything.

A vDSO implementation of getrandom()

Posted Jan 6, 2023 21:33 UTC (Fri) by wittenberg (subscriber, #4473) [Link]

Fast random number generation will be required by many users in the next few years.

When quantum computers get big enough (estimates are the mid 2030s), they will break our asymmetric cryptography. That includes signatures, as well as PKI, and key exchange. In short, almost everything. In July 2022, has announced replacement algorithms (but not the parameters and protocols needed to use them). The replacement algorithms require many more random bits than current algorithms.

There are already requirements for government systems to start using the new algorithms (as soon as NIST announces the parameters and protocols required, which they say will be 2024.) The deadlines for completing the transition are all 2035 or earlier. I expect that commercial entities that need cryptography (certainly banks, block-chain systems, and other financial institutions) will also want to start their transition soon.

There will be a mad rush in 2024 when the protocols are published, so we're trying to do as much as we can now.

--David

A vDSO implementation of getrandom()

Posted Jan 7, 2023 4:04 UTC (Sat) by roc (subscriber, #30627) [Link]

A VDSO implementation of getrandom() only matters when you need to call getrandom() very frequently. If you just need more bits per call, it doesn't help at all. Quantum-resistant algorithms that need more key bits will be the latter case, not the former.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 14:58 UTC (Sun) by atnot (subscriber, #124910) [Link]

The reason getrandom() being faster is important, which I'm a bit disappointed was never brought up in the discussion (perhaps Donenfeld took it for granted) is not because anything needs secure random numbers that fast, but because it removes reasons *not* to use cryptographically random numbers.

We've seen this pattern happen again and again:
- Someone discovers that using the kernel RNG is kind of slow and decides to make their own in userspace
- That RNG is weak, either deliberately (e.g. mersenne twister), accidentally (debian keypocalypse) or due to technical limitations (VM fork/checkpoint oblivious)
- Someone inevitably accidentally ends up using those numbers in a security critical context, carnage ensues

The solution here is the same as what most cryptographic improvements in the last decade have come from: Make one random number generator, make it as secure as possible, make it faster than you could write it yourself, and make it the default in everything. That's the only way to fix this, would fix it everywhere, far beyond just the kernel, and it's why Cryptography people get upset when people say it isn't necessary.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 20:11 UTC (Sun) by iabervon (subscriber, #722) [Link]

That suggests that it might be best to separate the state management from the number production. There are applications where you want to get the same sequence in each clone (think duplicate bridge or reproducible tests), and it would be good if these applications didn't result in RNGs being around that were weakened in ways other than an explicit "generate from a captured seed" method.

Maybe the right thing is to have the kernel support exclusively around making sure that nobody can ever see your per-thread primary RNG states and that they get created with data nobody could have predicted, and have the number production code in the standard library (with an occasional system call to ask for more entropy if needed and if the RNG instance is not supposed to be reproducible).

A vDSO implementation of getrandom()

Posted Jan 8, 2023 20:45 UTC (Sun) by roc (subscriber, #30627) [Link]

It is not reasonable to assume only kernel code can be trusted to be correct.

"Concentrate effort on a small number of high quality RNGs", sure. But that number doesn't have to be 1; we have to evaluate cost-benefit tradeoffs.

"Kernel support needed to protect RNG state", sure. Please do solve that problem.

A vDSO implementation of getrandom()

Posted Jan 9, 2023 12:31 UTC (Mon) by judas_iscariote (guest, #47386) [Link]

> It is not reasonable to assume only kernel code can be trusted to be correct.

Nobody expects it to be perfect . One does expect to fix ONLY this one, if it has a problem.. all components benefiting from this fix.. move along. this argument of "only trusted and correct" can only be made by someone that has never looked at the alternatives or the sheer amount of work that is reviewing, fixing and maintaining all the userspace stuff. it is really scary.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 22:59 UTC (Sun) by Wol (subscriber, #4433) [Link]

> The solution here is the same as what most cryptographic improvements in the last decade have come from: Make one random number generator, make it as secure as possible, make it faster than you could write it yourself, and make it the default in everything. That's the only way to fix this, would fix it everywhere, far beyond just the kernel, and it's why Cryptography people get upset when people say it isn't necessary.

The other problem, for people who aren't experts in the field, is that documentation of random numbers is hard to find. Personally, I've been looking for a PRNG with certain properties. I remember seeing it demonstrated in my maths classes (as an example of a bad random number generator) but I can't find it - after so many cycles, each number had appeared the same number of times. I'm trying to smear mirrored raid data across disks, to reduce rebuild stress (okay, it increases the risk of a double failure causing data loss, but it reduces the risk of recovery from a single failure triggering further failures). So far, pretty much everything I've tried has not managed to smear the data, rebuilding one failed disk means reading from two other disks max ... :-(

If I could find this PRNG, maybe it would help - or maybe not ... but at least I'll have tried.

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 8, 2023 23:38 UTC (Sun) by kschendel (subscriber, #20465) [Link]

Perhaps you're thinking of RANDU()? Xn+1 = 65539 Xn mod 2^31

A vDSO implementation of getrandom()

Posted Jan 9, 2023 3:38 UTC (Mon) by ianmcc (subscriber, #88379) [Link]

Yes, linear congruential generators of the form x_{n+1} = (a*x_n) % b have a period of at most b, and don't repeat any numbers before it reaches the period. If a,b are chosen appropriately then it has period equal to b, then the output is simply a permutation of 0,1,2,...,(n-1). If you want to use this for striping data over N devices then you might be able to find a constant a such that it has period N. But why would that be better than writing them in some other specified sequence?

A vDSO implementation of getrandom()

Posted Jan 9, 2023 9:09 UTC (Mon) by Wol (subscriber, #4433) [Link]

The problem I have at the moment, is that for every block written to device a, EVERY mirror block ends up on either device b, or device c, whether I have 3 drives or 30 drives.

I was looking for a generator that would break that property (while still guaranteeing that the mirror of a would not end up on a). I shall have to play, but if I can find arguments to the congruential generator that satisfy both properties, I will be well pleased :-)

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 9, 2023 13:04 UTC (Mon) by ianmcc (subscriber, #88379) [Link]

The low order bits of LCG's are notoriously non-random, eg the lowest bit alternating between 0 and 1.

A vDSO implementation of getrandom()

Posted Jan 9, 2023 14:33 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Why not do Tetris-style "bag of pieces" randomization?

N: number of disks
M: mirroring level

- Take a list 1…N and permute it randomly
- When writing to disk X, find X in the list at index K and mirror it to the disks at indices (K+n mod N, n from 1 to M) in the list
- Permute the list whenever you want (each block write, reboot, 10 minutes, whatever).

A vDSO implementation of getrandom()

Posted Jan 9, 2023 14:44 UTC (Mon) by Wol (subscriber, #4433) [Link]

> - Permute the list whenever you want (each block write, reboot, 10 minutes, whatever).

Because that would turn my Data Storage Device into a Data Shredding Device?

I could generate a random lookup table, fix it up, and store it in the array metadata, but that's more stuff to fill up the disk space, any damage to it will destroy the array, and it will impact array creation. This needs to have the same lifetime as the array layout, which is why I'd rather a PRNG.

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 9, 2023 18:58 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

This seems like a purely ephemeral data structure to me; why would it need to be stored with the array? Granted, I'm not super familiar with the use case and don't see how it turns into a shredding device.

A vDSO implementation of getrandom()

Posted Jan 9, 2023 20:12 UTC (Mon) by Wol (subscriber, #4433) [Link]

Because it defines the layout of the blocks on the disk. Think of a drive, mapping the LBA numbers to the disk sectors. By default, LBA = platter,cylinder,head, sector. Scramble that mapping, like when you relocate bad blocks, and if you lose that mapping your data is gone.

Let's look at your standard Raid 1+0 ...



sda sdb sdc sdd

 1   2   1   2

 3   4   3   4

 5   6   5   6

So we have sda and b striped, c and d striped, and each pair mirrored. If one drive fails, we need a straight copy from the mirror to recover. If that drive fails under load, we're stuffed.

Now let's look at a linux Raid 10.



sda sdb sdc sdd sde

 1   1   2   2   3

 3   4   4   5   5

 6   6   7   7   8

Note that whichever drive fails, it's only the two drives, one each side, that gets hammered in the recovery. It's reduced the load a little, but not much, still inviting another disk failure.

I'm thinking more along the lines of a raid-51 or raid-61, but the crucial thing I want to avoid is only a couple of drives being hammered in a rebuild. But you see, the whole point of my PRNG, is to tell the raid which drive each block goes on. Lose that, and the data is scrambled ...

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 9, 2023 22:52 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Oh, it's a one-time thing. Hmm. How about something like using a well-known number (e, pi, gamma, etc.) in base (N-1) to tell you how which other drive to select for the next block. Probably needs tweaking in case on drive fills up (maybe count empty slots not on the current drive?) or for redundancy >2. So, say you have pi in base 4 for 5 drives and redundancy of 2: 3.0210033312222020201122030020310222

sda sdb sdc sdd sde
 1   2   2   3   1
 4   5   3   4   5
 6   6   7   8   9
 A   7   A   9   8

A vDSO implementation of getrandom()

Posted Jan 9, 2023 23:46 UTC (Mon) by Wol (subscriber, #4433) [Link]

This looks well interesting ...

I'm not going to try making something like raid-51 work across different size drives so I'm not going to address a drive filling up.

And I'm not going to run the algorithm that far - the computational cost will be far too much. But if I can make it scramble 4 or 5 stripes, I can then just repeat that pattern.

The only problem I have is I can't work out what you've done :-)

Oh - and you've missed something which may or may not be important, in that we have the number of physical drives, and the number of logical drives. I don't think it will make a difference actually, it just affects the number of stripes I need to scramble. (I might have a 4-drive raid-5, mirrored and spread across 6 drives. But I think that's just making sure the number of stripes is a lowest common multiple, here it would be 12 ...)

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 10, 2023 0:52 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> The only problem I have is I can't work out what you've done :-)

3.0210033312222020201122030020310222

Step 1: Find empty slot (sda/0)
Step 2: Take next digit (3) and use the next slot on disk not_sda[3] (sde/0)
Step 3: Repeat

sda/0 - 3 -> sde/0
sdb/0 - 0 -> sdc/0
sdd/0 - 2 -> sdb/1 (well, seems I messed up here, but this should be enough to make something more robust)

A vDSO implementation of getrandom()

Posted Jan 10, 2023 0:58 UTC (Tue) by mathstuf (subscriber, #69389) [Link]

> Probably needs tweaking … for redundancy >2.

For R == 3 use digits from base N-2 to select the sector from the drives not already selected for that stripe. Keep reducing the base to get a "new" set of digits that are (essentially) random, but deterministic for the next drive to hold a new copy.

This still runs into the "drive fills up" problem (where you just end up getting "unlucky" and hammer a specific drive all the time. In that case, maybe use an offset into the number's digits and try again?

A vDSO implementation of getrandom()

Posted Jan 10, 2023 7:08 UTC (Tue) by ianmcc (subscriber, #88379) [Link]

I still don't get why you want this to be random. Am I understanding right that you want to mirror blocks of data onto N disks, such that each block gets written onto a pair of disks, and you want them distributed evenly, so that you don't end up with (for an example with N = 4) half the blocks mirrored on disks 1,2 and the other blocks mirrored on disks 3,4?

In this N=4 example, there are 6 possible pairs ( = N(N-1)/2): 1:2, 1:3, 1:4, 2:3, 2:4, 3:4. Each disk appears exactly 3 times in that list, so if you just allocated blocks round-robin to pairs of disks in that sequence you will end up with evenly distributed data, and if one disk fails then the other mirror pair of each block will be split evenly among the other drives.

A vDSO implementation of getrandom()

Posted Jan 10, 2023 8:45 UTC (Tue) by Wol (subscriber, #4433) [Link]

Oh the joys of a fresh pair of eyes ... :-) That's an excellent PRNG that looks like it does exactly what's required ...

Thank you very much ...

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 10, 2023 13:38 UTC (Tue) by ianmcc (subscriber, #88379) [Link]

You could also order the sequence to minimize (or potentially eliminate, depending on the number of disks and number of mirrors) successive writes to the same disk(s).

A vDSO implementation of getrandom()

Posted Jan 10, 2023 7:10 UTC (Tue) by ianmcc (subscriber, #88379) [Link]

Sorry, my reply above should have been to the parent of the post.

A vDSO implementation of getrandom()

Posted Jan 10, 2023 17:23 UTC (Tue) by joib (subscriber, #8541) [Link]

> I'm thinking more along the lines of a raid-51 or raid-61, but the crucial thing I want to avoid is only a couple of drives being hammered in a rebuild.

Might parity declustering be the thing you're looking for?

https://www.pdl.cmu.edu/PDL-FTP/Declustering/ASPLOS.pdf

Unfortunately Linux mdadm/DM doesn't support it, AFAIU.

A vDSO implementation of getrandom()

Posted Jan 10, 2023 21:33 UTC (Tue) by Wol (subscriber, #4433) [Link]

Hmm... very interesting reading...

Actually, linux does support something like this, it's called raid-10. The number of logical disks in the array does not match the number of physical disks, and provided those two numbers are relatively prime, if I understand the paper correctly you will get exactly what they are describing with no effort. But linux doesn't support parity raid with differing logical and physical disk counts. No reason why not ...

But I'm looking at raid-51 or raid-61, which means I don't care about parity - I should have a mirror somewhere. So I've massively reduced both i/o and computation relative to this paper. Assuming four logical drives, according to the paper I have to read 3 drives' worth of data for raid-5. I only need to read one drive's worth as I'm mirrored, but everything I've tried so far says it comes from just two physical drives. To reduce stressing the disks as per the paper, that means I need at least NINE physical drives - that's the first number that (a) means I need to read less than half a drive, and (b) is uniquely prime to my four logical drives to decluster it.

So I'm unfortunately forced to the conclusion that - provided I'm not desperate to squeeze every usable byte of capacity out of my drives - even my flawed raid-51 algorithm is better than this declustered parity ... don't get me wrong, this is a very good tactic for a straight raid-5 or 6, but not for a mirrored one.

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 9, 2023 3:24 UTC (Mon) by ianmcc (subscriber, #88379) [Link]

There are a few different concepts here that you might be conflating. There are many different ways of generating random numbers, and they have very different properties. So how you generate them depends a lot on what you need them for.

If you want simple random numbers for a game, then you don't really care, you could use pretty much anything. A simple linear congruential generator (LCG) such as suggested by kschendel below will probably do fine.

If you are doing a simulation of some physical process that uses random dynamics (Monte Carlo simulation) then you want random numbers with very good statistical properties. Simple random number generators, including older rand(3) implementations, won't cut it as statistical biases in the output will skew the results. LCG's are notoriously bad for this. These days the Mersenne Twister is the most common algorithm for these purposes.

Then there are cryptographic uses for random numbers. This comes with a completely different set of requirements, for example the numbers should be truly unpredictable; given a (possibly large) sequence of numbers it is impossible to predict the next number. For the LCG, this is trivial (the 'internal state' of the RNG is simply the previous output, and even if you don't know the constants a,b in x_{n+1} = (x_n*a) % b it is pretty easy to work them out). For the Mersenne Twister this is also fairly easy, eg https://github.com/fx5/not_random . Something like AES in counter mode (CTR) passes this test; the n'th random number x_n is given by AES(n). Given any set of past output, you cannot calculate the internal state without breaking AES. But this fails other desirable properties for cryptographic uses, in particular if you are able to get hold of the internal state of the RNG (the counter and the encryption key, in the case of AES-CTR) then you can determine all of the previous output. You can make it 'forward secure' by mixing some of the output of x_n into the internal state. Ideally you also want some entropy injection, so that even if an attacker gets hold of the internal state of the RNG at some time, they cannot predict the output indefinitely into the future.

If you want to distribute data across a raid cluster, then I don't think you want any of the above algorithms. You probably don't even want random numbers, or at least, not statistically independent random numbers. The reason being that if you distribute data truly randomly, then there will be some chance that you get a distribution of data that is very bad. Eg, if you want to distribute data evenly across disks, then you wouldn't want to do it completely randomly, because if you roll a dice 6N times, the probabiliy of getting each number appearing exactly N times goes to zero for large N. There are other algorithms that you can use that give pseudo-random output but are correlated in such a way as to produce the desired distribution. That is what you want to determine - what statistical properties do you want in your numbers. In general it is a difficult problem to produce random numbers that have specified correlations, so you'll need to look for domain specific algorithms.

A vDSO implementation of getrandom()

Posted Jan 9, 2023 10:06 UTC (Mon) by Wol (subscriber, #4433) [Link]

> If you want to distribute data across a raid cluster, then I don't think you want any of the above algorithms. You probably don't even want random numbers, or at least, not statistically independent random numbers. The reason being that if you distribute data truly randomly, then there will be some chance that you get a distribution of data that is very bad. Eg, if you want to distribute data evenly across disks, then you wouldn't want to do it completely randomly, because if you roll a dice 6N times, the probabiliy of getting each number appearing exactly N times goes to zero for large N. There are other algorithms that you can use that give pseudo-random output but are correlated in such a way as to produce the desired distribution. That is what you want to determine - what statistical properties do you want in your numbers. In general it is a difficult problem to produce random numbers that have specified correlations, so you'll need to look for domain specific algorithms.

Yup. And all the algorithms currently in use have the bad property that the data on one disk is mirrored on at most two others.

My main aim is to reduce raid stress, but it also might increase rebuild speed (the more drives the data is spread across, the more the rebuild speed tends towards the write speed of the replacement drive). And while disk damage is far less common than disk loss, it increases resilience in the face of that.

Plus, of course, it's the mental challenge :-)

But this is where I think the cryptographers (as mentioned by atnot) are making the exact same mistake they are accusing others of. "Secure cryptographic random" in my use case is an anti-feature. Let's have one callable function, where you're forced to specify which features are important TO YOU on setup, and then the actual function you get is determined by that. Even here, I doubt I could specify my requirements :-) which is why I *need* - not necessarily a "roll your own" - but a custom function for my purposes.

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 10, 2023 8:52 UTC (Tue) by NYKevin (subscriber, #129325) [Link]

The attitude of the cryptographers is probably that you should figure out which distribution you want, and then use inverse transform sampling to sample from it, starting from a "good" random number generator.

But that wouldn't even solve your issue, because your issue is more fundamental. What you are asking for is not something that can reasonably be done with randomness, regardless of distribution. Any distribution will, for large N, eventually exhibit whatever pattern you don't want it to in at least one case (unless you pick a trivial distribution). This is a simple corollary of the law of large numbers. You can probably get around it by wrapping the randomness up in some kind of stateful algorithm or data structure, but at that point, the random generator is not the problem, and you can very well use a "proper" one.

In short: You don't want randomness, you want a deterministic permutation or cycle that, in the past, someone may have inaccurately described as "random." Cryptographers are, perhaps understandably, not interested in dealing with such historical curiosities.

A vDSO implementation of getrandom()

Posted Jan 10, 2023 14:30 UTC (Tue) by Wol (subscriber, #4433) [Link]

> The attitude of the cryptographers is probably that you should figure out which distribution you want, and then use inverse transform sampling to sample from it, starting from a "good" random number generator.

Which doesn't address the problem of "You want three? Pick two. Pick any two. But you can't have three". If someone wants speed over security (and there are good reasons for that), telling them they should accept the speed hit of a secure random function will simply get you laughed at. If cryptographers really do have that attitude, then they're living in a fool's paradise. Life is *ALWAYS* a compromise, and insisting that *YOUR* compromise is the only viable one is simply causing pain for everyone.

> In short: You don't want randomness, you want a deterministic permutation or cycle that, in the past, someone may have inaccurately described as "random." Cryptographers are, perhaps understandably, not interested in dealing with such historical curiosities.

So why are *cryptographers* trying to hijack *random numbers*? I've been completely open I want a *P*RNG.

This is the pain point. Random numbers, and cryptography, are two separate domains with considerable interaction. And until cryptographers stop pretending they are one and the same, we'll continue having fruitless discussions about "why aren't random numbers secure?". Because, just maybe, like in my case, security is one feature I *DON'T* want - in fact, it would be completely disastrous to me.

(At which point, I have to say thanks to all the cryptographically minded people who've contributed and provided me with some great ideas :-)

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 6, 2023 18:23 UTC (Fri) by saladin (subscriber, #161355) [Link]

Having been watching this argument in the mailing lists over the last few days, I've been waiting for lwn (or phoronix) to cover it. Well done in beating Michael to it with a very high quality article. This is why I subscribe.

A vDSO implementation of getrandom()

Posted Jan 6, 2023 19:30 UTC (Fri) by anadav (subscriber, #99427) [Link]

My hat off to Andy, who (almost) always comes with practical and pragmatic solutions.

A vDSO implementation of getrandom()

Posted Jan 6, 2023 19:35 UTC (Fri) by Wol (subscriber, #4433) [Link]

> Donenfeld has also been clear in his disagreement with Torvalds's assessment of the need for this feature, though; he claimed that it just isn't possible to create a safe random-number generator in user space (the patch cover letter also covers that ground in depth). He later went on to say:

And why do you *need* a secure random number generator? Some people do, but surely they should just scrap all the user-exposed generators, and replace them with ONE new generator that has *mandatory* arguments to specify security, rng or prng, speed, whatever. That way people are forced to pay attention to what they get ... :-)

Some hope :-)

Cheers,
Wol

A vDSO implementation of getrandom()

Posted Jan 6, 2023 21:19 UTC (Fri) by dvrabel (subscriber, #9500) [Link]

I don't see how the implementation of the random number generator helps with problems caused by virtual machine snapshots/forks. If a process fills a buffer with randomness, and then the VM is forked, all the VMs are going to see the same random data regardless of how "safe" the random number generation was.

A mechanism to notify userspace of VM forks is needed, so userspace can discard or regenerate secrets/keys/whatever. If this mechanism exists then userspace random number generators can be made safe and need only be (periodically) seeded from the kernel's entropy pool (as per the original glib's arc4random() implementation).

A vDSO implementation of getrandom()

Posted Jan 7, 2023 4:17 UTC (Sat) by roc (subscriber, #30627) [Link]

Yes, exactly. It seems like vDSO getrandom() doesn't really solve anything. You might as well call getrandom() less frequently, but with larger buffers. That increases your vulnerability to VM cloning but it was already nonzero with vDSO getrandom().

OTOH the issues of protecting userspace memory regions that must not get persisted anywhere are real.

A vDSO implementation of getrandom()

Posted Jan 7, 2023 11:51 UTC (Sat) by roc (subscriber, #30627) [Link]

Hmm, maybe I'm wrong about the VM-cloning threat.

Considering key generation: imagine a VM generates a key using random bits and no other inputs. Clearly there is no way to ensure the VM and its clone always produce different keys --- the clone can happen after the key was generated.

On the other hand, image a VM generates a key based on random bits and some non-random input. We might want the property that if the non-random inputs are different, the VM and its clone will use different random bits. We *can* guarantee this by calling getrandom() after we have acquired the non-random inputs, and ensuring cloning the VM causes future getrandom()s to return different data than in the original VM. This would be broken by pregenerating random bits as I suggested.

This is subtle. If you're worried about this threat you have to order operations carefully even if you have access to a suitable getrandom(). An easy-to-implement option would be to expose a fixed number of "VM UID" bits via shared memory in the vDSO that could be mixed into a userspace PRNG at the point where you want VMs to diverge.

A vDSO implementation of getrandom()

Posted Jan 6, 2023 23:23 UTC (Fri) by Karellen (subscriber, #67644) [Link]

I'm left wonding how an io_uring implementation of getrandom() might compare to both the traditional syscall, and this proposed vDSO variant, in terms of both speed and complexity.

A vDSO implementation of getrandom()

Posted Jan 7, 2023 0:55 UTC (Sat) by judas_iscariote (guest, #47386) [Link]

It does not help, this is one of the suggestions that show some people refuse to understand the needs of userspace components. It has to be something that
- Fast, really fast without any significant syscall overhead that can be used anywhere for every random number number need without having to come up with ad-hoc algorithms. something that allow us to drop most if not all userspace prngs with dubious caracteristics.
- It has to be async signal safe, thread safe, $whatever safe .
- It must account for all the cases where the state needs to dropped and recreated, you cannot do that in userspace.
- It must not block.
- It must be usable from the dynamic linker up to high level components.

A vDSO implementation of getrandom()

Posted Jan 7, 2023 15:47 UTC (Sat) by Paf (subscriber, #91811) [Link]

io_uring is mostly about asynchronicity and potentially stacking up of multiple calls. It is likely of no help for a necessarily synchronous call where you need that specific result immediately. (Also getrandom() is, I believe, faster than the old fd based interface, which io_uring would use.)

A vDSO implementation of getrandom()

Posted Jan 8, 2023 18:04 UTC (Sun) by sbaugh (subscriber, #103291) [Link]

Many users of randomness are heavily threaded anyway (e.g. encrypted network communications). An io_uring-based getrandom() would probably improve performance for them, by allowing many calls from many threads to be performed at once. Or at least, because it much improves locality: http://catern.com/flexsc.html

A vDSO implementation of getrandom()

Posted Jan 8, 2023 18:19 UTC (Sun) by atnot (subscriber, #124910) [Link]

The bigger problem is that it's unsuitable as a default implementation of random numbers, because it'll involve a whole bunch of program-specific effort to use

A vDSO implementation of getrandom()

Posted Jan 7, 2023 16:23 UTC (Sat) by compudj (subscriber, #43335) [Link]

Why not do this as an RSEQ extension instead?

A vDSO implementation of getrandom()

Posted Jan 7, 2023 16:38 UTC (Sat) by compudj (subscriber, #43335) [Link]

More precisely: let the kernel populate a per task struct random state, and copy it to extended RSEQ per thread structure fields on return to userspace. This would remove most of the complexity from this vDSO proposal.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 20:00 UTC (Sun) by caliloo (subscriber, #50055) [Link]

I have to admit the 2 steps dance to initiate the thing and all the complexity around thread and fork safety make me wonder if that is a sane trade off. I wonder if the time spent into this wouldn’t have been better spent writing a good user space prng. And or perhaps a mechanism to alert user space about “plane of existence duplication” weather that is process forking, vm sibbling spin off, or whatever else.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 20:04 UTC (Sun) by caliloo (subscriber, #50055) [Link]

And of course I have to wonder what is the use case for a high bandwidth random number consumer that forks or duplicate itself. Intuitively I’d think those high bandwidth thing tend to be highly isolated and compartmentalised from the rest of your infra thus not too sensitive to vm copy or fork situations.

A vDSO implementation of getrandom()

Posted Jan 8, 2023 23:28 UTC (Sun) by ejr (subscriber, #51652) [Link]

High-performance computing: simulations needing "randomness" require that the streams of numbers be independent.

Also, sometimes utterly reproducible. Particularly if public policy / international relations are involved.

These points are, erm, not well-addressed in general.

A vDSO implementation of getrandom()

Posted Jan 12, 2023 4:25 UTC (Thu) by kmeyer (subscriber, #50720) [Link]

Glibc will add arc4random for portability/compatibility, but still refuses to add strlcpy? Jeez.

A vDSO implementation of getrandom()

Posted Jan 14, 2023 19:23 UTC (Sat) by judas_iscariote (guest, #47386) [Link]

real life software includes its own full copy of the bsd arc4random* thing if not found in the C library. You *really* do not want that. it is ery different from a simple string function.

A vDSO implementation of getrandom()

Posted Jan 16, 2023 22:43 UTC (Mon) by intgr (subscriber, #39733) [Link]

> includes its own full copy of the bsd arc4random* thing if not found in the C library

Huh, why would anyone do that? Surely it's easier and makes more sense to detect arc4random and fall back to getrandom if former is not available, than to ship a full copy of arc4random.

A vDSO implementation of getrandom()

The vDSO API

VM_DROPPABLE

Dropping VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

VM_DROPPABLE

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

A vDSO implementation of getrandom()

`VM_DROPPABLE`

Dropping `VM_DROPPABLE`