If your computer was built after 1999, it probably supports the SSE instruction set. That includes the rsqrtps instruction (exposed as the _mm_rsqrt_ps intrinsic), which is faster and gives you four approximate reciprocal square roots at once: https://www.intel.com/content/www/us/en/docs/intrinsics-guid...
That being said, the techniques discussed here are not totally irrelevant (yet). There still exists some hardware with fast instructions for float/int conversion, but lacking rsqrt, sqrt, pow, log instructions, which can all be approximated with this nice trick.
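For reference, a minimal sketch of the intrinsic in use (SSE1, from <xmmintrin.h>; rsqrt4 is just an illustrative name):

    #include <xmmintrin.h>

    /* Approximate 1/sqrt for four floats at once. The hardware guarantees
       a relative error of at most 1.5 * 2^-12, so callers often add one
       Newton step when they need more precision. */
    void rsqrt4(const float in[4], float out[4]) {
        __m128 x = _mm_loadu_ps(in);
        _mm_storeu_ps(out, _mm_rsqrt_ps(x));
    }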
The SSE float reciprocal instructions produce slightly different results on Intel and AMD, which can be a source of headaches for those expecting deterministic results across PCs. (see https://robert.ocallahan.org/2021/09/rr-trace-portability-di...)
Amusingly (to me at least), there's also an SSE instruction for non-reciprocal square roots, but it's so much slower than the reciprocal square root that calculating sqrt(x) as x * 1/sqrt(x) is faster, assuming you can tolerate the somewhat reduced precision.
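A sketch of that identity with the same intrinsics (fast_sqrt4 is an illustrative name). One caveat: an input of 0 becomes 0 * inf = NaN, which the real sqrtps instruction handles correctly:

    #include <xmmintrin.h>

    /* Approximate sqrt(x) as x * (1/sqrt(x)), trading precision for speed
       versus _mm_sqrt_ps. Caveat: an input of 0 produces NaN. */
    __m128 fast_sqrt4(__m128 x) {
        return _mm_mul_ps(x, _mm_rsqrt_ps(x));
    }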
the vast, vast majority of computers do not support the sse instruction set, or indeed any of the i386 and amd64 instruction sets at all, and the fraction of computers that do support them, other than through emulation, is continually shrinking
gpu instruction sets, arm, risc-v, avr, pic, 8051, fpga... of course, in many cases, these do have a built-in approximate reciprocal square root operation, but probably implemented with this algorithm
> It's important to note that this algorithm is very much of its time. Back when Quake 3 was released in 1999, computing an inverse square root was a slow, expensive process. The game had to compute hundreds or thousands of them per second in order to solve lighting equations, and other 3D vector calculations that rely on normalization. These days, on modern hardware, not only would a calculation like this not take place on the CPU, even if it did, it would be fast due to much more advanced dedicated floating point hardware.
Calculations like this definitely take place on the CPU all the time. It's a common misconception that games and other FLOP-heavy apps want to offload all floating-point operations to the GPU. In fact, it really only makes sense to offload large uniform workloads to the GPU. If you're doing one-off vector normalization--say, as part of the rotation matrix construction needed to make one object face another--then you're going to want to stay on the CPU, because the CPU is faster at that. In fact, the CPU would remain faster at single floating point operations even if you didn't take the GPU transfer time into account--the GPU typically runs at a slower clock rate and relies on parallelism to achieve its high FLOP count.
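To make that concrete: a one-off normalization like the scalar C sketch below (illustrative names, not from any particular engine) costs a handful of FLOPs, which is nothing next to the round trip of shipping data to the GPU and back.

    #include <math.h>

    typedef struct { float x, y, z; } vec3;

    /* Normalize a single vector on the CPU. */
    vec3 vec3_normalize(vec3 v) {
        float inv = 1.0f / sqrtf(v.x * v.x + v.y * v.y + v.z * v.z);
        vec3 r = { v.x * inv, v.y * inv, v.z * inv };
        return r;
    }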
% Constants
FISRCON GREG #5FE6EB50C7B537A9
THREHAF GREG #3FF8000000000000
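% (FISRCON is the 64-bit magic constant, the double-precision analogue
% of 0x5f3759df; THREHAF is 1.5 encoded as an IEEE 754 double)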
% Save half of the original number
OR $2,$0,0
INCH $2,#FFF0
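% (INCH adds #FFF0 << 48, which decrements the 11-bit exponent field
% by one and thereby halves $2, provided the input is a normal double)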
% Bit level hacking
SRU $1,$0,1
SUBU $0,FISRCON,$1
% First iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
% Second iteration
FMUL $1,$2,$0
FMUL $1,$1,$0
FSUB $1,THREHAF,$1
FMUL $0,$0,$1
This implementation assumes the original number is greater than 2^-1021, so that halving it still produces a normalized double.
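For comparison, here is a C sketch of the same double-precision variant the MMIX code implements (same constant and two Newton iterations; fast_rsqrt is an illustrative name):

    #include <stdint.h>
    #include <string.h>

    double fast_rsqrt(double x) {
        double xhalf = 0.5 * x;                /* the INCH #FFF0 step, in C */
        uint64_t i;
        memcpy(&i, &x, sizeof i);              /* bit-level reinterpretation */
        i = 0x5FE6EB50C7B537A9ULL - (i >> 1);
        double y;
        memcpy(&y, &i, sizeof y);
        y = y * (1.5 - xhalf * y * y);         /* first iteration */
        y = y * (1.5 - xhalf * y * y);         /* second iteration */
        return y;
    }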
I just wrote this years ago (go.mod said 1.12) for the fun of it, thought I had it in a Gist/GitHub, and uploaded it yesterday in response to this post.
One thing I remember trying was coding this up in ASM… which makes it worse, because it prevents inlining. But I learned the Go ASM syntax that way.
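For anyone who wants the shape of it without reading Go, here is a minimal C analogue of the classic float version (memcpy standing in for math.Float32bits / math.Float32frombits; fast_inv_sqrt is an illustrative name):

    #include <stdint.h>
    #include <string.h>

    float fast_inv_sqrt(float x) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);              /* like math.Float32bits(x) */
        i = 0x5f3759df - (i >> 1);             /* the famous magic constant */
        float y;
        memcpy(&y, &i, sizeof y);              /* like math.Float32frombits(i) */
        return y * (1.5f - 0.5f * x * y * y);  /* one Newton iteration */
    }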
The formulas for the floats have typos. They should read (-1)^S, not -1^S (which always equals -1).
Interpreting the raw bit patterns isn't a piecewise linear approximation of the logarithm. The lines between the data points on the blue graph don't actually exist; it's not possible for a bit to be half set to 1. It's rather a discrete version of the logarithm: the only data points that exist - where the red and blue lines meet - are literally equal to the (scaled, shifted) logarithm.

Other than that, nice post!
I'm not sure I understand. Imagine a very small 6-bit float, with 1 bit sign, 2 bits exponent, and 3 bits mantissa. The interval [010000, 010111] contains the numbers 2, 2.25, 2.5, 2.75, 3, 3.25, 3.5, 3.75.
But! These numbers' base two logarithms imply mantissae of
- .0000000 (equal to the float 000)
- .0010101 (not equal to the float 001)
- .0101001 (not equal to the float 010)
- .0111010 (not equal to the float 011)
- .1001010 (not equal to the float 100)
- .1011001 (not equal to the float 101)
- .1100111 (not equal to the float 110)
- .1110100 (not equal to the float 111)
because the floats in the [2,4) interval are linearly spaced, whereas the corresponding logarithms are not. In other words, the floats are a piecewise linear approximation of the logarithm – just as the article says.
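The same point is easy to check numerically for real 32-bit floats; a throwaway C sketch:

    #include <math.h>
    #include <stdio.h>
    #include <string.h>

    /* Read each float's bits as an integer: scaled and shifted, that integer
       equals log2(x) at powers of two and interpolates linearly between
       them - exactly the piecewise linear approximation under discussion. */
    int main(void) {
        for (float x = 2.0f; x < 4.0f; x += 0.25f) {
            unsigned int i;
            memcpy(&i, &x, sizeof i);
            double bits_as_log2 = i / (double)(1 << 23) - 127.0;
            printf("x=%.2f  bits=%.6f  log2=%.6f\n", x, bits_as_log2, log2(x));
        }
        return 0;
    }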
You're right about the first part, of course. But don't you just need to truncate?
In any case, my point was more that it's not "piecewise linear". Piecewise linear means the map is defined on some interval; this one isn't: it's defined on a discrete set of values. To take your example, the map isn't defined at, e.g., 2.3. You can decide to interpolate linearly between the value at 2.25 and the value at 2.5, but that's a decision you make which isn't reflected in the code.
Or said differently: do you consider that any map defined on a discrete subset of R is really a piecewise linear map?
It's not a continuous piecewise linear approximation of the logarithm; it's a discrete piecewise linear approximation of the logarithm. You are correct that the blue lines aren't continuous, but incorrect about their interpretation - there are 256 individual points evenly spaced along that x axis that make up the blue graph, not just a handful at the intersections. (A full graph would have 2^32 options in a piecewise linear pattern, but that isn't what the OP has graphed.)
Given that the article is talking about operations on 32 bit integers and IEEE-754 32-bit floats, I personally think it's fine to leave out "discrete" in the description.
This is a good post explaining a lot of interesting concepts, with a section that has surprisingly bad algebra.
> The exact steps to go from the first form to this one are numerous to say the least, but I've included it in full for completeness.
The algebra included after that line has many unnecessary steps, as well as multiple cancelling sign errors. Most notably, the step from the second line to the third does not distribute the negative sign correctly. Picking up after the second line, I would simply have:

y_n+1 = y_n + ((1 - x * y_n^2) / y_n^2) * (y_n^3 / 2)

y_n+1 = y_n + (1 - x * y_n^2) * (y_n / 2)

y_n+1 = y_n + y_n / 2 - x * y_n^3 / 2

y_n+1 = 3/2 * y_n - x * y_n^3 / 2

y_n+1 = y_n * (3/2 - x * y_n^2 / 2)

y_n+1 = y_n * (1.5 - 0.5 * x * y_n * y_n)

This is fewer than 1/3 of the included steps (between the second line and the end), and has the advantage of being correct during the intermediate steps. I don't think any of my steps are anything other than self-evident to someone who understands algebra, but I'd be happy to entertain competing perspectives.
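For reference, that last line is exactly the update in the famous C snippet, with x2 = 0.5f * x precomputed:

    y = y * (threehalfs - (x2 * y * y));   /* one Newton step toward 1/sqrt(x) */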
Here’s something new that isn’t mentioned and might be worth adding to the repo. (Edit: I’m wrong, it is there, I just didn’t recognize it.) The magic number in the famous code snippet is not the optimal constant. You can do maybe 0.5% better relative error using a different constant. Maybe at the time it was infeasible to search for the absolutely optimal number, but now it’s relatively easy. I also went down this rabbit hole at some point, so I have a Jupyter notebook for finding the optimal magic number for this (1/x^2) and also for (1/x). Anyone wanna know what the optimal magic number is?
Good point, I hadn’t read Lomont’s paper and should have. I read the section in the Wikipedia article talking about it, and did try the constant that it suggests, however it depends on doing extra Newton iterations and I looked at the relative error of the initial guess without Newton. I can see in the paper he found something within 1 bit of what I found. I’m not certain mine’s better, but my python script claims it is.
For no Newton iterations, I thought I found 0x5f37641f had the lowest relative error, and I measure it at 3.4211. Of course I’m not super certain, Lomont’s effort is way more complete than mine. Lomont’s paper mentions 0x5f37642f with a relative error of 3.42128 and Wikipedia and the paper both talk about 0x5f375a86 being best when using Newton iterations.
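For anyone who wants to reproduce that kind of search, a brute-force C sketch (the candidate range, the sample spacing, and max relative error over [1, 4) as the metric are all my assumptions; the raw guess's error pattern repeats every factor of 4, since the shift halves the exponent):

    #include <math.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* Initial guess only, no Newton iterations. */
    static float guess(float x, uint32_t magic) {
        uint32_t i;
        memcpy(&i, &x, sizeof i);
        i = magic - (i >> 1);
        float y;
        memcpy(&y, &i, sizeof y);
        return y;
    }

    static double max_rel_err(uint32_t magic) {
        double worst = 0.0;
        for (float x = 1.0f; x < 4.0f; x += 1e-4f) {
            double exact = 1.0 / sqrt((double)x);
            double err = fabs(guess(x, magic) - exact) / exact;
            if (err > worst) worst = err;
        }
        return worst;
    }

    /* Slow but simple exhaustive scan over a window of candidates. */
    int main(void) {
        uint32_t best = 0;
        double best_err = 1.0;
        for (uint32_t m = 0x5f370000; m < 0x5f380000; m++) {
            double e = max_rel_err(m);
            if (e < best_err) { best_err = e; best = m; }
        }
        printf("best: 0x%08x  max rel err: %.5f%%\n", best, 100.0 * best_err);
        return 0;
    }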
Same as Carmack's: we did a single step of Newton's method and it was definitely good enough.
In the old days, the FPU did its computation asynchronously. The FPU is now considered an integrated part of the CPU.
https://github.com/ncruces/fastmath/blob/main/fast.go
Also see this StackOverflow question:
https://stackoverflow.com/questions/32042673/optimized-low-a...
Sure.