Posted by u/mbitsnbites a year ago
Reciprocal Approximation with 1 Subtraction
Today's find: You can get a floating-point approximation of 1/x that's accurate to 3 bits with a single integer subtraction instruction.

  float fast_reciprocal(float x)
  {
    unsigned i = *(unsigned *) &x;  /* reinterpret the float's bits as an integer */
    i = 0x7effffffU - i;            /* the entire approximation: one subtraction */
    return *(float *) &i;           /* reinterpret the result back as a float */
  }
The magic number 0x7effffff accomplishes two things:

1) The exponent is calculated as 253-e (where e is the stored, biased exponent), which effectively negates the true exponent and subtracts 1.

2) The mantissa is approximated as a 1st order polynomial in the interval [1, 2).
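
A couple of concrete examples (exact bit arithmetic):

  0x7effffff - 0x3f800000 (1.0f) = 0x3f7fffff  (~0.99999994, vs. 1.0)
  0x7effffff - 0x40000000 (2.0f) = 0x3effffff  (~0.49999997, vs. 0.5)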

Interesting, but perhaps not very useful (as most CPUs have more accurate reciprocal approximations these days).

jcmeyrignac · a year ago
A better value could be 0x7EEEEBB3, as in: https://github.com/parallella/pal/blob/bd9389f14fe16db4d9630...
dahart · a year ago
EDIT: after double-checking my work, I realized I have a better bound on maximum error, but not a better average error. So, the magic number depends on the goal or metric, but mean relative error seems reasonable. Leaving my original comment here, but note the big caveat that I’m half wrong.

One can do better still - 0x7EF311BC is a near optimal value at least for inputs in the range of [0.001 .. 1000].

The simple explanation here is:

The post’s number 0x7EFFFFFF results in an approximation that is always equal to or greater than 1/x. The value 0x7EEEEBB3 is better, but it’s less than 1/x around 2/3rds of the time. My number 0x7EF311BC appears to be as well balanced as you can get, half the time greater and half the time less than 1/x.

To find this number, I have a Jupyter notebook that plots the maximum absolute value of relative error over a range of inputs, for a range of magic constants. Once it's set up, it's pretty easy to manually binary search and find the minimum. The plot of max error looks like a big "V". (Edit: while the plot of mean error looks like a big "U" near the minimum.)

The optimal number does depend on the input range, and using a different range or allowing all finite floats will change where the optimal magic value is. The optimal magic number will also change if you add one or more Newton iterations, like in that github snippet (and also seen in the ‘quake trick’ code).

PPS maybe 0x7EF0F7D0 is a pretty good candidate for minimizing the average relative error…?

mbitsnbites · a year ago
Your suggestion got me intrigued. I have a program that does an exhaustive check for maximum and average error, so I'll give your numbers a spin.
dahart · a year ago
This code is technically UB in C++, right? [1] Has anyone run into a case where it actually didn’t work? Just curious. I’ve often assumed that if C++ compilers didn’t compile this code, all hell would break loose.

It might be nice to start sharing modern/safe versions of this snippet & the Quake thing. Is using memcpy the only option that is safe in both C and C++? That always felt really awkward to me.

[1] https://tttapa.github.io/Pages/Programming/Cpp/Practices/typ...

LegionMammal978 · a year ago
Yes, most casts via pointers are formally UB in both C and C++, and you can induce weird compiler behavior if you pass cast pointers across a function boundary (see [0] for a standard example: notice how the write to *pf vanishes into thin air). Since people like doing it in practice, GCC and Clang have an -fno-strict-aliasing flag to disable this optimization, and the MSVC compiler doesn't use it in the first place (except for restrict pointers in C). They don't go too far with it regardless, since lots of code using the POSIX socket API casts around struct pointers as a matter of course.

Apart from memcpy(), the 'allowed' methods include unions in C (writing to one member and reading from another), and bit_cast<T>() and std::start_lifetime_as<T>() in C++.
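
For instance, a memcpy() version of the post's snippet (a sketch; well-defined in both languages, and compilers typically optimize the copies away):

  #include <string.h>

  float fast_reciprocal(float x)
  {
    unsigned i;
    memcpy(&i, &x, sizeof i);  /* defined behavior in both C and C++ */
    i = 0x7effffffU - i;
    memcpy(&x, &i, sizeof x);
    return x;
  }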

[0] https://godbolt.org/z/dxMMfazoq

ryao · a year ago
There are two additional ways of making this work.

The first is to allocate the memory using char a[sizeof(float)]. In C, char pointers may alias anything, so then you can do pointer conversions that would normally be undefined behavior and it should work. The other option is to use the non-standard __attribute__((__may_alias__)) on the pointer.
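
A sketch of the second option (a GCC/Clang extension, not standard C or C++):

  typedef unsigned uint_alias __attribute__((__may_alias__));

  float fast_reciprocal(float x)
  {
    uint_alias *i = (uint_alias *) &x;  /* may_alias: permitted to alias the float */
    *i = 0x7effffffU - *i;
    return x;
  }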

By the way, using union types for this is technically undefined behavior in the C++ standard (C has allowed it since C99), but GCC and Clang decided to make it defined as an implementation choice. Other compilers might not.

rjeli · a year ago
For C++, `bit_cast<uint32_t>(0.f)` should be Well Defined, right? I'm curious, in C, is union-casting float->uint32_t also Perfectly Legal And Well Defined?

(I am not a C or C++ expert.)

ckjellqv · a year ago
It should be well-defined with reinterpret_cast<uint32_t&> though.
Asooka · a year ago
It would be better if compilers were fixed. Current treatment of UB is insane. The default should be no UB, with a pragma that lets you turn it on for specific parts. That used to be the case, and some of the world's most performant software was written for compilers that assumed pointers could alias and integers could overflow (case in point: Quake).

The big problem, apart from "memcpy is a bit tedious to write", is that it is impossible to guarantee absence of UB in a large program. You will get inlined functions from random headers and suddenly someone somewhere is "dereferencing" a pointer by storing it as a reference, and then your null pointer checks are deleted. Or an integer overflows and your "i>=0" array bounds assert is deleted. I have seen that happen several times and each time the various bits of code all look innocent enough until you sit down and thoroughly read each function and the functions it calls and the functions they call. So we compile with the worst UB turned off (null is a valid address, integers can overflow, pointers can alias, enums can store any integer) and honestly, we cannot see a measurable difference in performance.

UB as it is currently implemented is simply an enormous footgun. It would be a lot more useful if there were a profiler that could find the parts of the code that would benefit from UB optimisations. Then we could focus on making those parts UB-safe, add asserts for the debug/testing builds, and turn on more optimisations for them. I am quite certain nobody can write a fully UB-free program. For example, did you know that multiplying two "unsigned short" variables can be UB? Can you guarantee that across all the template instantiations for custom vector types you've written you never multiply unsigned shorts? I'll leave it as an exercise to the reader as to why that is.

dzaima · a year ago
Strict aliasing in particular is very dependent on coding style; I too just turn it off with essentially no harm, but certain styles of code that don't involve carefully manually CSEing everything might benefit a good bit (though with it being purely type-based imo it's not that useful).

Outside of freestanding/kernel scenarios, null treatment shouldn't affect anything, and again can similarly benefit probably a decent amount of code by removing what is unquestionably just dead code. There's the question of pointer addition resulting in/having an argument of null being UB, which is rather bad for offsetof shenanigans or misusing pointers as arbitrary data, but that's a question of provenance, not anything about null.

Global integer overflow UB is probably the most questionable thing in that list, and I'd quite prefer having separate wrap-on-overflow and UB-on-overflow types (allowing having unsigned UB-on-overflow too). That said, even having had multiple cases of code that could hit integer overflow and thus UB, I don't recall any of them actually resulting in the compiler breaking things (granted, I know not to write a+1<a and similar for signed ints), whereas I have had a case where the compiler "fixed" the code, turning `a-b < 0` into the more correct (at least in the scenario in question) `a<b`!

I do think that it would make sense to have an easy uniform way to choose some specific behavior for most kinds of UB.

imtringued · a year ago
There is no point in talking with the people that worship UB for the sake of UB.

They don't understand that one of the biggest barriers to developers writing and adopting more C in their projects is the random jankiness that you get from the compilers. Instead they make C this elite thing for the few people who have read every single line of C code they themselves compiled and ran on their Gentoo installation. Stuff like having no bounds checks is almost entirely irrelevant outside of compute kernels. It doesn't get you much performance, because the branches are perfectly predictable. It merely reduces code size.

There is also the problem that the C development culture is utterly backwards compared even to the semiconductor industry. If you want to have these ultra-optimized release builds, then your development builds must scream when they encounter UB, and it also means that no C program or library without an extensive test suite should be allowed onto any Linux distribution's package repository. Suddenly the cost of C programming appears to be unaffordably high.

tekknolagi · a year ago
We used memcpy everywhere in our runtime and after the 10th or so time doing it, it becomes less awkward.
dahart · a year ago
And it’s always reliably optimized out in release builds, I assume?
Conscat · a year ago
A few years ago with GCC 10, I had some strange UD2 instructions (I think) inserted around some inline assembly; that was fixed by replacing my `reinterpret_cast`s with `bit_cast`.
mbitsnbites · a year ago
The proper solution is to use std::bit_cast in modern C++ or otherwise use memcpy, and of course know what you're doing.

Some things that could mess with you:

* Floating-point endianness is not the same as integer endianness.

* Floating-point alignment requirements differ from integer alignment requirements.

* The compiler is configured to use something other than 32-bit binary32 IEEE 754 for the type "float".

* The computer does not use two's complement arithmetic for integers.

In practice, these are not real problems.
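
For reference, a std::bit_cast version (C++20) can be as simple as this sketch:

  #include <bit>
  #include <cstdint>

  float fast_reciprocal(float x)
  {
    std::uint32_t i = std::bit_cast<std::uint32_t>(x);
    i = 0x7effffffU - i;
    return std::bit_cast<float>(i);
  }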

Earw0rm · a year ago
Yes, if implemented as shown.

You could use intrinsics for the bit casting & so on and it would be well-defined to the extent that those intrinsics are.

(I understand some people consider SIMD intrinsics in general to be UB at language level, in part because they let you switch between floats & ints in a hardware-specific way.)

pkhuong · a year ago
You can do a little better if you tweak the low-order bits (https://pvk.ca/Blog/LowLevel/software-reciprocal.html for the double-precision version)
Y_Y · a year ago
Quick comparison with exact and one Newton-Raphson:

  Value |  True  |  Fast  | Fast+Newton
  --------------------------------------
    0.1 | 10.000 | 11.200 | 9.8560
    0.5 | 2.0000 | 2.0000 | 2.0000
    1.0 | 1.0000 | 1.0000 | 1.0000
    2.0 | 0.5000 | 0.5000 | 0.5000
    5.0 | 0.2000 | 0.2187 | 0.1982
   10.0 | 0.1000 | 0.1094 | 0.0991
(where the extra correction was done with:

  y *= (2.0f - x*y);
)
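
For reference, a small C program along these lines reproduces the table (a sketch, not necessarily the exact code used above):

  #include <stdio.h>
  #include <string.h>

  static float fast_reciprocal(float x)
  {
    unsigned i;
    memcpy(&i, &x, sizeof i);
    i = 0x7effffffU - i;
    memcpy(&x, &i, sizeof x);
    return x;
  }

  int main(void)
  {
    const float xs[] = {0.1f, 0.5f, 1.0f, 2.0f, 5.0f, 10.0f};
    for (int k = 0; k < 6; ++k) {
      float x = xs[k];
      float y = fast_reciprocal(x);
      float yn = y * (2.0f - x * y);  /* one Newton-Raphson step */
      printf("%6.1f | %6.4f | %6.4f | %6.4f\n", x, 1.0f / x, y, yn);
    }
    return 0;
  }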

AlotOfReading · a year ago
A better value is 0x7eb504f3. You can follow that up with Newton-Raphson to refine the approximation, or approximate the Newton-Raphson step too by multiplying by 1.385356f*x.

I did this a few days ago in the approximate division thread: https://news.ycombinator.com/item?id=42481612#42489596

In some cases, it can be faster than hardware division.

dahart · a year ago
Not sure I understand your number 0x7eb504f3 - does it require using the 1.385 factor? Is it possible there is a typo in the number? That value doesn’t make sense to me. I’m measuring error here as the absolute value of relative error, e.g., | (v - v_approx) / v |. With that constant alone plugged into the poster’s code, I get a much less accurate approximation, with a minimum error of at least ~0.27 (meaning it’s always far away from the target) and worst case error of ~0.293 over the input range [1/1000…1000]. For comparison, the original 0x7effffff error is min:0 max:0.125. With 0x7eeeebb3, the error I get is min:0 max:0.0667. With 0x7ef311bc, error is min:0 max:0.0505. It’s important that the min error over a large interval is zero, because it means the approximation actually touches the target at least once.
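
For concreteness, a rough standalone C sketch of this measurement (not the actual notebook code):

  #include <math.h>
  #include <stdio.h>
  #include <string.h>

  static float approx(float x, unsigned magic)
  {
    unsigned i;
    memcpy(&i, &x, sizeof i);
    i = magic - i;
    memcpy(&x, &i, sizeof x);
    return x;
  }

  int main(void)
  {
    const unsigned magic = 0x7ef311bcU;
    double min_err = INFINITY, max_err = 0.0;
    /* exhaustive over [0.001, 1000]: a few hundred million floats */
    for (float x = 0.001f; x <= 1000.0f; x = nextafterf(x, INFINITY)) {
      double v = 1.0 / (double) x;
      double err = fabs((v - approx(x, magic)) / v);
      if (err < min_err) min_err = err;
      if (err > max_err) max_err = err;
    }
    printf("min %.4g  max %.4g\n", min_err, max_err);
    return 0;
  }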
magicalhippo · a year ago
Similar trick to that used in the fast inverse square root routine[1] popularized by Quake 3.

[1]: https://en.wikipedia.org/wiki/Fast_inverse_square_root

fph · a year ago
For some more explanation: the main idea behind both tricks is that the IEEE floating-point formats are designed so that a number's most significant bits encode its exponent, that is, floor(log2(x)). Hence reinterpret-casting a float x to an unsigned integer uint(x) approximates a scaled and shifted log2(x). So these kinds of approximations work like logarithm arithmetic: float(C + a*uint(x)) approximates x^a, for a suitable constant C. Quake's invsqrt is a=-1/2, this post is a=-1.

More detail on https://en.wikipedia.org/wiki/Fast_inverse_square_root#Alias... .

IEEE 754 floats are a very well-designed binary format, and the fact that these approximations are possible is part of this design; indeed, the first known instance of this trick is for a=1/2, by W. Kahan, the main designer of IEEE 754.
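
To make the pattern concrete, here is the a=-1/2 case with Quake's well-known constant (a sketch; memcpy avoids the aliasing issues discussed elsewhere in the thread):

  #include <string.h>

  /* float(C + a*uint(x)) with a = -1/2: the fast inverse square root */
  float fast_rsqrt(float x)
  {
    unsigned i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfU - (i >> 1);  /* C = 0x5f3759df, a*uint(x) = -(i >> 1) */
    memcpy(&x, &i, sizeof x);
    return x;
  }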

mbitsnbites · a year ago
Yep. Along the same lines. This one is even simpler, though, as it requires only a single integer CPU instruction (and the simplest of all instructions too).

If you want full precision, you need to do three Newton-Raphson iterations after the initial approximation. One iteration is:

    y = y * (2.0F - x * y);
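
Put together, something like this (a sketch, reusing fast_reciprocal from the post):

    float precise_reciprocal(float x)
    {
        float y = fast_reciprocal(x);
        y = y * (2.0F - x * y);  /* each iteration roughly doubles */
        y = y * (2.0F - x * y);  /* the number of correct bits:    */
        y = y * (2.0F - x * y);  /* 3 -> 6 -> 12 -> 24             */
        return y;
    }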

magicalhippo · a year ago
It's a neat trick, and could be very useful on microcontrollers that don't have hardware division but do have hardware multiplication.
quanto · a year ago
A quick intuition: the magic number 7e ffffff negates both the mantissa and the exponent by two's complement.

1. 7e: the first sig bit has to be zero to preserve the sign of the overall number.

2. ffffff: by the Taylor series 1/(1 + X) = 1 - X + ..., negating the mantissa approximates the multiplicative inverse.

Although an IEEE float has a sign bit (thus 2's complement does not work on the float itself), the mantissa and the exponent individually work with 2's complement, for different reasons. The exponent has a biased range due to being chopped in half, whereas the mantissa has a biased range due to its definition and constant offset. The exponent's 1 offset (7e instead of 7f) is a bit more difficult to see at first -- the devil is in the details, but 2's complement is the basic intuition.

ncruces · a year ago
I collected this, and a few others here: https://github.com/ncruces/fastmath/blob/main/fast.go

It's in Go, but ports easily to other languages.

See also: https://stackoverflow.com/questions/32042673/optimized-low-a...

mbitsnbites · a year ago
Excellent! Will have a look.