For many kinds of Monte Carlo algorithms, CSPRNGs are stupidly slow. The author compares two handpicked examples, a fast CSPRNG and a very slow PRNG, and arrives at a factor of 4. In practice, e.g. compared to very simple stuff like multiply-add RNGs, it is more like a factor of 4000. He then goes on to claim: "But that would only be true if generating random bits was the hot spot, the bottleneck of your program. It never is in practice."
E.g. for the usual example of Monte Carlo integration, you pick a point randomly, usually by running your RNG once for each coordinate. Then you evaluate your characteristic function with that point as input. Very often, the function will have a runtime in the range of a few hundred FPU instructions or less -- comparable to the cost of evaluating your CSPRNG. So in the end, you usually get a 40 to 50% faster runtime just by using a slow PRNG instead of a CSPRNG, not to mention the few extra percent you will gain with a fast PRNG.
And it doesn't stop there. CSPRNG libraries are often optimized to be side-channel free and cryptographically safe. Even if you were to use a CSPRNG, you are leaving performance on the table by using code with security properties you will never ever need, which could easily be optimized out for a 30% gain in the CSPRNG parts.
And yes, for simulations a few percentage points are relevant. Those "few" percent may mean days of runtime, thousands of currency units in power and hardware costs, and tons of CO2 in pollution.
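The inner loop described above can be sketched in a few lines. This is a minimal illustration, not anyone's actual benchmark: the multiply-add constants are Knuth's MMIX LCG parameters (chosen only as an example of a cheap multiply-add generator), and the "characteristic function" is the unit-disk indicator, giving an estimate of pi/4.

```python
# Minimal sketch of the Monte Carlo shape described above: two RNG draws
# per sample point, then one evaluation of the characteristic function.
# The multiply-add constants are Knuth's MMIX LCG, used purely as an
# illustration of a cheap generator, not as a recommendation.

M = (1 << 64) - 1  # keep the state in 64 bits

def lcg(state):
    """One multiply-add step: returns (new_state, float in [0, 1))."""
    state = (state * 6364136223846793005 + 1442695040888963407) & M
    return state, (state >> 11) / float(1 << 53)  # top 53 bits -> [0, 1)

def mc_quarter_circle(samples, seed=42):
    """Estimate pi/4 by sampling the unit square against the disk indicator."""
    s = seed
    hits = 0
    for _ in range(samples):
        s, x = lcg(s)              # one RNG call per coordinate
        s, y = lcg(s)
        if x * x + y * y < 1.0:    # the characteristic function
            hits += 1
    return hits / samples
```

Here the integrand costs a handful of FPU operations, so in this regime the two RNG calls per point really do dominate the loop.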
It's not true that I cherry-picked a slow PRNG for my 7 GB/s number. In fact I selected the second-fastest PRNG on that page (because it's popular)! The fastest one is 8 GB/s. PCG32 is 3 GB/s.
Your 4000x factor speed up for a linear-congruential generator is just a completely false number.
Yes I did pick ChaCha20 for its speed -- it's designed for speed!
"A few hundred FPU instructions" in your Monte Carlo is not comparable to generating a number with ChaCha20. If you need a few hundred FPU instructions per number, you will be running a lot slower than 2 GB/s. ChaCha only requires a few cycles per byte.
I agree that you can optimize out the side-channel free part for non-crypto-purposes. That's a good thing! I recommend doing that.
The state of the art in super-fast PRNGs is about 0.3 cycles per byte at the moment. I believe this is done with SIMD versions of the xoroshiro algorithms right now. 0.3 cycles per byte compared to "a few" cycles per byte is an order of magnitude difference in throughput.
Still, most crypto libraries are designed with extreme performance in mind, and very few PRNG libraries are. You can do a lot better than 0.3 cycles/byte and still pass the NIST test suite if you try for speed.
Upon a closer look, do not trust the numbers on https://rust-random.github.io/book/guide-rngs.html in any way; they are clearly bogus and implausible. Their figure for "StepRng", which is just a counter, is 51 GB/s. Their Xorshift at 5 GB/s, which is just a XOR and a shift, is slower than Xoshiro at 7 GB/s, which is a XOR, a shift and a rotate -- one op more. Both Xorshift and Xoshiro should actually be comparable in performance to a counter of the same width, because on modern CPUs all those trivial bit operations (shift, rotate, XOR) are sub-cycle micro-ops. And all three should either be memory-bound, and therefore equally fast, or quite a bit faster than memory-bound if they only measure in-register performance.
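To make the operation-count argument concrete, here is a minimal sketch of the two simplest generators in that comparison: a bare counter (one add) and Marsaglia's xorshift64 (three shifts and three XORs, using the classic 13/7/17 shift constants from his paper). The point is only how few register operations separate them, not throughput measurement:

```python
# A counter step is 1 add; an xorshift64 step is 3 shifts + 3 XORs.
# On a modern CPU these are all single (or sub-) cycle register ops,
# so neither should be 10x slower than the other.

MASK = (1 << 64) - 1

def counter_step(s):
    return (s + 1) & MASK            # 1 add

def xorshift64_step(s):              # Marsaglia xorshift64, constants 13/7/17
    s ^= (s << 13) & MASK
    s ^= s >> 7
    s ^= (s << 17) & MASK
    return s
```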
> Your 4000x factor speed up for a linear-congruential generator is just a completely false number.
> Yes I did pick ChaCha20 for its speed -- it's designed for speed!
1 ChaCha20 block takes 20 rounds, each of which consists of 4 QR (quarter-round) operations. A QR is 4 additions, 4 XORs and 4 ROTLs, so 12 instructions on 32-bit values. Multiply that together and you arrive at 960 operations per block (actually a handful more for the counter, the round loop and such, but not many), and each block gives you 16 uint32 values. So 60 instructions per uint32, or 15 instructions per byte.
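A literal quarter round makes that count easy to check: 4 additions, 4 XORs and 4 rotates per QR, and 20 rounds x 4 QRs = 80 QRs per block. A minimal Python sketch (the rotation distances 16/12/8/7 are the standard ChaCha constants):

```python
# One ChaCha quarter round: 4 adds, 4 XORs, 4 rotates = 12 ops
# on 32-bit words. A full block runs 80 of these.

M32 = 0xFFFFFFFF

def rotl32(x, n):
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) | (x >> (32 - n))) & M32

def quarter_round(a, b, c, d):
    a = (a + b) & M32; d = rotl32(d ^ a, 16)
    c = (c + d) & M32; b = rotl32(b ^ c, 12)
    a = (a + b) & M32; d = rotl32(d ^ a, 8)
    c = (c + d) & M32; b = rotl32(b ^ c, 7)
    return a, b, c, d
```

This matches the quarter-round test vector in RFC 8439, which is a quick way to convince yourself the 12-op structure above is the real thing.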
A multiply-add generator takes only 2 instructions per uint32 (you could use a fused multiply-add where available, but I'll leave that out, just as I left out sub-micro-op rotates before, to not overcomplicate things), or half an instruction per byte. Yes, that is not yet a hyperbolic factor of 4000.
But then you'll have to use your random values. Since your ChaCha20 random number stream only comes in blocks, on many CPU architectures you will have all your registers full with the resulting block. For the subsequent calculation, you have to store those random numbers somewhere or throw them away, do your other calculation, then load the randomness again, and so on. So you will always pay a penalty in cache and memory accesses, and you will always have unnecessary register pressure. Even an L1 cache access costs about 4 cycles of latency; the other cache levels are far worse. Which means the real factor probably still won't be 4000, but it will be a lot more than the bare instruction counts suggest.
Now we arrive at "yes, but somebody said ChaCha20 is roughly 1 cycle per byte!". Which isn't wrong, but you have to read carefully: 1 _cycle_, not 1 _instruction_. That benchmark relies on calculating multiple ChaCha20 blocks in parallel, because a modern CPU has multiple execution units and can thus execute multiple independent instructions within one cycle. There is also SIMD, where one instruction operates on multiple pieces of data. But to be fair, we have to grant the same to our multiply-add RNG. And where I can have 16 32-bit registers calculating one ChaCha20 block, I can also have 16 of the same 32-bit registers calculating 16 multiply-add random numbers in parallel.
Thus giving us 60 cycles per ChaCha20 uint32 vs. 0.125 cycles (2/16) per multiply-add uint32. That is a factor of 480, not taking possible memory or cache penalties into account, because that really depends on the computation between the randomness steps. Still not 4k, I admit, that was hyperbole.
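The "16 independent multiply-add streams" idea can be sketched as follows. Each stream is a self-contained LCG, so there is no dependency chain between streams, and a superscalar or SIMD core can advance them all concurrently; the constants and the seeding scheme here are illustrative only, not a vetted parameterization:

```python
# 16 independent multiply-add (LCG) streams: no cross-stream dependency,
# so all 16 steps can execute in parallel on a wide core. Constants are
# the MMIX LCG parameters, used purely as an illustration.

M64 = (1 << 64) - 1
A, C = 6364136223846793005, 1442695040888963407

def step_streams(states):
    """Advance 16 independent LCG states; each step is one mul + one add."""
    return [(s * A + C) & M64 for s in states]

# Distinct seeds keep the streams apart for this sketch; real code would
# use properly spaced or differently parameterized streams.
states = [0x9E3779B97F4A7C15 * (i + 1) & M64 for i in range(16)]
states = step_streams(states)
```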
You can get close to 10x faster [1] than ChaCha8 with SIMD PRNG implementations that are way higher quality than the usual LCG and don't fail any statistical test in the test suites (TestU01 & PractRand). They are obviously not cryptographically secure, but it's very hard to run into problems with bias in simulations when test suites designed to find such bias can't find it.

[1] https://github.com/espadrine/shishua
I was curious. The Rust rand library has implementations of a lot of RNGs, both PRNGs and CSPRNGs, all high-quality implementations as far as I can tell.
They agree with your numbers: their fast, high-quality PRNG implementations are about 8 GB/s and ChaCha is about 2 GB/s: https://rust-random.github.io/book/guide-rngs.html
For Monte Carlo, if you need speed, consider quasi-Monte Carlo. For the right kind of problems the speedup in convergence is huge (i.e., for the same accuracy it will be faster).
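For illustration, one common quasi-Monte Carlo building block is the radical-inverse (van der Corput) sequence; pairing bases 2 and 3 gives a 2-D Halton sequence of low-discrepancy points. A minimal sketch, not a full QMC framework:

```python
# Radical-inverse (van der Corput) sequence: reflect the base-b digits
# of the index about the radix point. Using coprime bases per dimension
# (here 2 and 3) gives a low-discrepancy Halton point set.

def radical_inverse(n, base):
    """Reflect the base-`base` digits of n about the radix point."""
    inv, denom = 0.0, 1.0
    while n:
        n, digit = divmod(n, base)
        denom *= base
        inv += digit / denom
    return inv

def halton_2d(n):
    """n-th point of the 2-D Halton sequence, bases 2 and 3."""
    return radical_inverse(n, 2), radical_inverse(n, 3)
```

Unlike pseudorandom points, these fill the unit square evenly by construction, which is where the improved convergence rate comes from.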
The author is a bit too categorical; there are certainly use cases where both speed and code size matter a great deal. Others have mentioned Monte Carlo algorithms, but there are more: not very long ago I was doing an effect in a shader that needed a PRNG, and it would have been lunacy to use a CSPRNG. High-quality statistical properties are not important at all, but speed and instruction count are VERY important. My hunch is that this is often true in game dev: Minecraft probably should not use a CSPRNG to generate its random world.
However, in general, I basically agree: for MOST tasks where you need a random number generator, a CSPRNG is probably the right choice and it's a reasonable "default" choice if you don't have good reasons why it shouldn't be. It's a continual annoyance to me that PRNG libraries usually don't include any solid CSPRNGs (looking at you, C++ header <random>) when it's often the right choice.
The reason not to use a CSPRNG is performance and efficiency. There are many applications where you only need randomness and don't care about the resiliency provided by the CSPRNG. I mean, I'm not going to run a cryptographically secure generator on a GPU just to scatter a few rays for my raytracer.
IMO, suggesting that a CSPRNG should be the default choice is bad engineering advice. It leads to software burning processor cycles for no reason at all. I agree with the author that the default PRNGs provided by many environments are of very poor quality, but that's hardly a reason to choose something you don't need.
The author does address the perf & efficiency concerns, though I have no opinion on the strength of their conclusions.
But I think this is like everything else: don't prematurely optimize. If you start with a CSPRNG, you're more likely to know that your algorithm is correct, and that you're getting good, random results. If you later profile your code and see that the CSPRNG is a bottleneck, you can swap it out with something faster. And then you'll also have a baseline for comparison, so you'll know if using a weaker PRNG actually has a negative impact on whatever it is you're doing.
(And even if you don't do that profiling, you can swap out the CSPRNG just to see what happens, and get a good idea if the quality of the randomness produced by the PRNG is good enough, based on your experience with the CSPRNG.)
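That swap only stays cheap if the RNG sits behind a seam in the code. A minimal sketch of the idea, using Python's `secrets` and `random` modules as stand-ins for "the CSPRNG default" and "the fast seeded PRNG"; the names and the toy workload are illustrative:

```python
# Dependency-injected RNG: the workload only asks its `rng` for floats,
# so swapping the secure default for a fast seeded PRNG is a one-line
# change at the call site. `secrets` / `random` stand in for the two.

import random
import secrets

def secure_rng():
    """CSPRNG-backed floats in [0, 1)."""
    while True:
        yield secrets.randbits(53) / float(1 << 53)

def fast_rng(seed):
    """Seeded Mersenne Twister floats: fast and reproducible, not secure."""
    r = random.Random(seed)
    while True:
        yield r.random()

def simulate(rng, n):
    """Toy workload: mean of n draws, independent of which RNG backs it."""
    return sum(next(rng) for _ in range(n)) / n

# Start with the secure default; profile; then swap in the fast one
# and compare results against the baseline.
baseline = simulate(secure_rng(), 1000)
fast = simulate(fast_rng(12345), 1000)
```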
Perhaps. But one could equally well argue that it should be the default simply because for the small number of applications where performance is really constrained by RNG performance one can still override with a less secure algorithm, while having the default the other way opens many applications to security flaws because of plain old oversights. Many programming language or STL features are somewhat less efficient than they could be by default out of similar concerns.
It’s not just security flaws you should worry about, because the arguments against CSPRNGs often involve situations with no security implications (e.g. rendering in games).
It is absolutely the case that a crappy RNG can compromise a renderer by creating visible patterns where none should exist. You can get really visible artifacting with e.g. an LCG-driven raytracer. Granted, the artifacts should go away with a higher-quality RNG, but the fact remains that if you choose a CSPRNG from the start, there is essentially no chance of such artifacts (barring errors in the implementation).
I'm surprised I'm not seeing mention of the reduced-round ChaCha8 variant of ChaCha; that's what I'd recommend for Monte Carlo use. It's available in Rust at https://docs.rs/rand_chacha/0.3.1/rand_chacha/struct.ChaCha8... . I recommend seeding it from a high-quality random source, but the seed could even be fixed in your source code, generated with `openssl rand -base64 32` or similar.
I'm going to keep using quick insecure PRNGs for my NES games rather than write a 6502 implementation of a CSPRNG... I guess this isn't something that's safe to assume for all game development by any means, but in my case if someone really can predict the future actions based off of visual observation they deserve the slight advantage they get.
Even simpler than implementing a cryptographic PRNG yourself: just use /dev/urandom, which provides an infinite, non-blocking stream of cryptographically strong pseudorandom bytes. All major platforms currently have known-good implementations of /dev/urandom -- Linux uses ChaCha, and macOS and the BSDs use Ferguson and Schneier's Fortuna.
/dev/urandom is really slow compared to a userland CSPRNG, though. And if you are doing simulations or fuzz testing, you need to be able to seed your PRNG to get reproducible results.
You can always pre-fetch random bytes in larger blocks. Read a few kB of random bytes at a time, store them in a buffer, and refresh the buffer when you run out.
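A minimal sketch of that buffering scheme (the chunk size is arbitrary, and this assumes a Unix-like system where /dev/urandom exists):

```python
# Amortize the syscall cost of /dev/urandom by reading it in large
# chunks and handing out slices from an in-process buffer.

class BufferedUrandom:
    def __init__(self, chunk_size=64 * 1024):
        self.chunk_size = chunk_size
        self.fh = open("/dev/urandom", "rb")
        self.buf = b""
        self.pos = 0

    def read(self, n):
        """Return n random bytes, refilling only when the buffer runs out."""
        out = bytearray()
        while n:
            if self.pos == len(self.buf):
                self.buf = self.fh.read(self.chunk_size)  # one big syscall
                self.pos = 0
            take = min(n, len(self.buf) - self.pos)
            out += self.buf[self.pos:self.pos + take]
            self.pos += take
            n -= take
        return bytes(out)
```

Most per-byte cost of reading /dev/urandom is syscall overhead, so a few-kB buffer recovers most of the gap to a userland CSPRNG for bulk consumption.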
Agreed on the need for using a predictable seed for testing/simulations, though. Technically you can seed /dev/urandom, but it's per-system, not per-process, so you can't guarantee some other process isn't "interrupting" your random stream.
They weakened it recently in Linux again. For over a decade there was a badly bugged PRNG in the Linux kernel, it was discovered and replaced with a more costly one which worked great. Then, only a short time ago, they replaced that with one of... shady provenance. You're better off writing your own PRNG on that platform IMHO.
Jason Donenfeld (author of Wireguard) replaced Linux’s SHA-1 based PRNG (remember, SHA-1 is cryptographically broken) with BLAKE2. What is shady about it?
You can’t get cryptographically secure random numbers without platform support, so it’s really bad to tell people to avoid the kernel CSPRNG.
There's one big scenario where PRNGs are good: reproducibility. I have a lot of tests that are deterministic given a seed, and if the test fails then I can grab the seed and replay the test.
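That replay workflow can be sketched in a few lines: draw a fresh seed per run, log it, and re-running with the same seed reproduces the exact failure. The names and the toy "system under test" here are illustrative:

```python
# Seed-and-replay testing: every random input in the test derives from
# one logged seed, so a failing run can be replayed exactly.

import random

def run_test(seed):
    """A property-style test driven entirely by the seeded RNG."""
    rng = random.Random(seed)
    data = [rng.randint(0, 1000) for _ in range(100)]
    return sorted(data)  # the 'system under test' for this sketch

seed = random.SystemRandom().randint(0, 2**32 - 1)  # fresh seed; log this
first = run_test(seed)
replay = run_test(seed)  # same seed => identical inputs, identical result
```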
CSPRNGs are designed to be cryptographically secure, and thus I somewhat agree that they should be the default choice. But there are many cases where reproducibility is a good thing, and that's where you need seeded RNGs.
EDIT: I was under the (wrong) impression that CSPRNGs were non-deterministic.
I'm confused; if you can predict the output, then isn't it by definition not secure? Or is the seed so hard to brute-force that it's equivalent to breaking the underlying cryptography?
I am pretty sympathetic to a CSPRNG being the default in standard libraries and so on. The relative performance has vastly changed since the first stdlib random()s; modern hardware- or SIMD-accelerated functions run at GBs/second per core, and CSPRNGs are flawless PRNGs--no fussing about whether PCG's subtle patterns matter--but the reverse is definitely not true.
If you consume many GBs/s/core of randomness, or you're writing a homebrew NES game, sure, you have a special case, code up a non-CSPRNG. But that isn't a reason for the default for everyone to remain a function that needs a warning label.
Here's the comparison from the maintainers of Julia: https://prng.di.unimi.it/#shootout
> The de-facto standard Rust library for random numbers, rand, uses ChaCha12, a reduced-round variant of ChaCha20, as its default PRNG.
Making the default secure would be a pretty significant mitigation against something that commonly makes "common security bugs" lists like https://developer.android.com/privacy-and-security/risks/wea...