If your computer was built after 1999, it probably supports the SSE instruction set. That includes the rsqrtps instruction (exposed as the _mm_rsqrt_ps intrinsic), which is faster and gives you four approximate reciprocal square roots at once: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
That being said, the techniques discussed here are not totally irrelevant (yet). There still exists some hardware with fast instructions for float/int conversion, but lacking rsqrt, sqrt, pow, log instructions, which can all be approximated with this nice trick.
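For reference, a minimal sketch of the intrinsic in use (SSE1, from <xmmintrin.h>; rsqrt4 is just an illustrative name):

    #include <xmmintrin.h>

    /* Approximate 1/sqrt for four floats at once. The hardware guarantees
       a relative error of at most 1.5 * 2^-12, so callers often add one
       Newton step when they need more precision. */
    void rsqrt4(const float in[4], float out[4]) {
        __m128 x = _mm_loadu_ps(in);
        _mm_storeu_ps(out, _mm_rsqrt_ps(x));
    }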
The SSE float reciprocal instructions produce slightly different results on Intel and AMD, which can be a source of headaches for those expecting deterministic results across PCs. (see https://robert.ocallahan.org/2021/09/rr-trace-portability-di...)
Amusingly (to me at least), there's also an SSE instruction for non-reciprocal square roots, but it's so much slower than the reciprocal square root that calculating sqrt(x) as x * 1/sqrt(x) is faster, assuming you can tolerate the somewhat reduced precision.
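A sketch of that identity with the same intrinsics (fast_sqrt4 is an illustrative name). One caveat: an input of 0 becomes 0 * inf = NaN, which the real sqrtps instruction handles correctly:

    #include <xmmintrin.h>

    /* Approximate sqrt(x) as x * (1/sqrt(x)), trading precision for speed
       versus _mm_sqrt_ps. Caveat: an input of 0 produces NaN. */
    __m128 fast_sqrt4(__m128 x) {
        return _mm_mul_ps(x, _mm_rsqrt_ps(x));
    }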
the vast, vast majority of computers do not support the sse instruction set, or indeed any of the i386 and amd64 instruction sets at all, and the fraction of computers that do support them, other than through emulation, is continually shrinking
gpu instruction sets, arm, risc-v, avr, pic, 8051, fpga... of course, in many cases, these do have a built-in approximate reciprocal square root operation, but probably implemented with this algorithm
> It's important to note that this algorithm is very much of its time. Back when Quake 3 was released in 1999, computing an inverse square root was a slow, expensive process. The game had to compute hundreds or thousands of them per second in order to solve lighting equations, and other 3D vector calculations that rely on normalization. These days, on modern hardware, not only would a calculation like this not take place on the CPU, even if it did, it would be fast due to much more advanced dedicated floating point hardware.
Calculations like this definitely take place on the CPU all the time. It's a common misconception that games and other FLOP-heavy apps want to offload all floating-point operations to the GPU. In fact, it really only makes sense to offload large uniform workloads to the GPU. If you're doing one-off vector normalization--say, as part of the rotation matrix construction needed to make one object face another--then you're going to want to stay on the CPU, because the CPU is faster at that. In fact, the CPU would remain faster at single floating point operations even if you didn't take the GPU transfer time into account--the GPU typically runs at a slower clock rate and relies on parallelism to achieve its high FLOP count.
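To make that concrete: a one-off normalization like the scalar C sketch below (illustrative names, not from any particular engine) costs a handful of FLOPs, which is nothing next to the round trip of shipping data to the GPU and back.

    #include <math.h>

    typedef struct { float x, y, z; } vec3;

    /* Normalize a single vector on the CPU. */
    vec3 vec3_normalize(vec3 v) {
        float inv = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        vec3 r = { v.x * inv, v.y * inv, v.z * inv };
        return r;
    }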
% Constants
FISRCON GREG #5FE6EB50C7B537A9
THREHAF GREG #3FF8000000000000
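% (FISRCON is the 64-bit magic constant, the double-precision analogue
% of 0x5f3759df; THREHAF is 1.5 encoded as an IEEE 754 double)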
% Save half of the original number
OR $2,$0,0
INCH $2,#FFF0
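% (INCH adds #FFF0 << 48, which decrements the 11-bit exponent field
% by one and thereby halves $2, provided the input is a normal double)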
% Bit level hacking
SRU $1,$0,1
SUBU $0,FISRCON,$1
% First iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
% Second iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
This implementation assumes the original number is greater than 2^-1021, so that halving it still produces a normalized double.
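For comparison, here is a C sketch of the same double-precision variant the MMIX code implements (same constant and two Newton iterations; fast_rsqrt is an illustrative name):

    #include <stdint.h>
    #include <string.h>

    double fast_rsqrt(double x) {
        double xhalf = 0.5 * x;                /* the INCH #FFF0 step, in C */
        uint64_t i;
        memcpy(&i, &x, sizeof i);              /* bit-level reinterpretation */
        i = 0x5FE6EB50C7B537A9ULL - (i >> 1);
        double y;
        memcpy(&y, &i, sizeof y);
        y = y * (1.5 - xhalf * y * y);         /* first iteration */
        y = y * (1.5 - xhalf * y * y);         /* second iteration */
        return y;
    }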
I just wrote this years ago (go.mod said 1.12) for the fun of it, thought I had it in a Gist/GitHub, and uploaded it yesterday in response to this post.
One thing I remember trying was coding this up in ASM… which makes it worse, because it prevents inlining. But I learned the Go ASM syntax that way.
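For anyone who wants the shape of it without reading Go, here is a minimal C analogue of the classic float version (memcpy standing in for math.Float32bits / math.Float32frombits; fast_inv_sqrt is an illustrative name):

    #include <stdint.h>
    #include <string.h>

    float fast_inv_sqrt(float x) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);              /* like math.Float32bits(x) */
        i = 0x5f3759df - (i >> 1);             /* the famous magic constant */
        float y;
        memcpy(&y, &i, sizeof y);              /* like math.Float32frombits(i) */
        return y * (1.5f - 0.5f * x * y * y);  /* one Newton iteration */
    }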
The formulas for the floats have typos. They should read (-1)^S, not -1^S (which always equals -1).
Interpreting the raw bit patterns isn't a piecewise linear approximation of the logarithm. The lines between the data points on the blue graph don't actually exist; it's not possible for a bit to be half set to 1. It's rather a discrete version of the logarithm: the only data points that exist - where the red and blue lines meet - are literally equal to the (scaled, shifted) logarithm.

Other than that, nice post!
I'm not sure I understand. Imagine a very small 6-bit float, with 1 bit sign, 2 bits exponent, and 3 bits mantissa. The interval [010000, 010111] contains the numbers 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75.
But! These numbers' base two logarithms imply mantissae of
- .0000000 (equal to the float 000)
- .0010101 (not equal to the float 001)
- .0101001 (not equal to the float 010)
- .0111010 (not equal to the float 011)
- .1001010 (not equal to the float 100)
- .1011001 (not equal to the float 101)
- .1100111 (not equal to the float 110)
- .1110100 (not equal to the float 111)
because the floats in the [2,4) interval are linearly spaced, whereas the corresponding logarithms are not. In other words, the floats are a piecewise linear approximation of the logarithm – just as the article says.
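The same point is easy to check numerically for real 32-bit floats; a throwaway C sketch:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* Read each float's bits as an integer: scaled and shifted, that integer
       equals log2(x) at powers of two and interpolates linearly between
       them - exactly the piecewise linear approximation under discussion. */
    int main(void) {
        for (float x = 2.0f; x < 4.0f; x += 0.25f) {
            unsigned int i;
            memcpy(&i, &x, sizeof i);
            double bits_as_log2 = i / (double)(1 << 23) - 127.0;
            printf("x=%.2f  bits=%.6f  log2=%.6f\n", x, bits_as_log2, log2(x));
        }
        return 0;
    }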
You're right about the first part, of course. But don't you just need to truncate?
In any case, my point was more that it's not "piecewise linear". Piecewise linear means the map is defined on some interval; this one isn't: it's defined on a discrete set of values. To take your example, the map isn't defined at, e.g., 2.3. You can decide to interpolate linearly between the value at 2.25 and the value at 2.5, but that's a decision you make which isn't reflected in the code.
Or said differently: do you consider that any map defined on a discrete subset of R is really a piecewise linear map?
It's not a continuous piecewise linear approximation of the logarithm; it's a discrete piecewise linear approximation of the logarithm. You are correct that the blue lines aren't continuous, but incorrect about their interpretation - there are 256 individual points evenly spaced along that x axis that make up the blue graph, not just a handful at the intersections. (A full graph would have 2^32 options in a piecewise linear pattern, but that isn't what the OP has graphed.)
Given that the article is talking about operations on 32 bit integers and IEEE-754 32-bit floats, I personally think it's fine to leave out "discrete" in the description.
This is a good post explaining a lot of interesting concepts, with a section that has surprisingly bad algebra.
> The exact steps to go from the first form to this one are numerous to say the least, but I've included it in full for completeness.
The algebra included after that line has many unnecessary steps, as well as multiple cancelling sign errors. Most notably, the step from the second line to the third does not distribute the negative sign correctly. Picking up after the second line, I would simply have:

y_n+1 = y_n + ((1 - x * y_n^2) / y_n^2) * (y_n^3 / 2)

y_n+1 = y_n + (1 - x * y_n^2) * (y_n / 2)

y_n+1 = y_n + y_n / 2 - x * y_n^3 / 2

y_n+1 = 3/2 * y_n - x * y_n^3 / 2

y_n+1 = y_n * (3/2 - x * y_n^2 / 2)

y_n+1 = y_n * (1.5 - 0.5 * x * y_n * y_n)

This is fewer than 1/3 of the included steps (between the second line and the end), and has the advantage of being correct during the intermediate steps. I don't think any of my steps are anything other than self-evident to someone who understands algebra, but I'd be happy to entertain competing perspectives.
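For reference, that last line is exactly the update in the famous C snippet, with x2 = 0.5f * x precomputed:

    y = y * (threehalfs - (x2 * y * y));   /* one Newton step toward 1/sqrt(x) */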
Here’s something new that isn’t mentioned and might be worth adding to the repo. (Edit: I’m wrong, it is there, I just didn’t recognize it.) The magic number in the famous code snippet is not the optimal constant. You can do maybe 0.5% better relative error using a different constant. Maybe at the time it was infeasible to search for the absolutely optimal number, but now it’s relatively easy. I also went down this rabbit hole at some point, so I have a Jupyter notebook for finding the optimal magic number for this (1/x^2) and also for (1/x). Anyone wanna know what the optimal magic number is?
Good point, I hadn’t read Lomont’s paper and should have. I read the section in the Wikipedia article talking about it, and did try the constant that it suggests, however it depends on doing extra Newton iterations and I looked at the relative error of the initial guess without Newton. I can see in the paper he found something within 1 bit of what I found. I’m not certain mine’s better, but my python script claims it is.
For no Newton iterations, I thought I found 0x5f37641f had the lowest relative error, and I measure it at 3.4211. Of course I’m not super certain, Lomont’s effort is way more complete than mine. Lomont’s paper mentions 0x5f37642f with a relative error of 3.42128 and Wikipedia and the paper both talk about 0x5f375a86 being best when using Newton iterations.
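For anyone who wants to reproduce that kind of search, a brute-force C sketch (the candidate range, the sample spacing, and max relative error over [1, 4) as the metric are all my assumptions; the raw guess's error pattern repeats every factor of 4, since the shift halves the exponent):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Initial guess only, no Newton iterations. */
    static float guess(float x, uint32_t magic) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = magic - (i >> 1);
        float y;
        memcpy(&y, &i, sizeof y);
        return y;
    }

    static double max_rel_err(uint32_t magic) {
        double worst = 0.0;
        for (float x = 1.0f; x < 4.0f; x += 1e-4f) {
            double exact = 1.0 / sqrt((double)x);
            double err = fabs(guess(x, magic) - exact) / exact;
            if (err > worst) worst = err;
        }
        return worst;
    }

    /* Slow but simple exhaustive scan over a window of candidates. */
    int main(void) {
        uint32_t best = 0;
        double best_err = 1.0;
        for (uint32_t m = 0x5f370000; m < 0x5f380000; m++) {
            double e = max_rel_err(m);
            if (e < best_err) { best_err = e; best = m; }
        }
        printf("best: 0x%08x  max rel err: %.5f%%\n", best, 100.0 * best_err);
        return 0;
    }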
Same as Carmack's: we did a single step of Newton's method and it was definitely good enough.
In the old days, the FPU did its computation asynchronously. The FPU is now considered an integrated part of the CPU.
https://github.com/ncruces/fastmath/blob/main/fast.go
Also see this StackOverflow question:
https://stackoverflow.com/questions/32042673/optimized-low-a...
Sure.