I haven't tested Zen 4 or 5, but I haven't heard anything that indicates they should be a lot better.
This is a significant problem on AMD; Intel and Apple seem to be better.
When did this change? In my testing years ago (while I was writing Rosetta 2, so Ice Lake-era Intel), Intel only allowed a load to forward from a single store, and partial forwarding (i.e. mixed cache/register) carried a huge penalty, whereas AMD at least allowed partial forwarding (or had a considerably lower penalty than Intel).
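For concreteness, here is a minimal C sketch (the function and names are mine, not from the thread) of the access pattern in question: two narrow stores followed immediately by a wide load that spans both. Whether that load is serviced out of the store buffer (forwarding from more than one in-flight store) or has to wait for the stores to reach the cache is exactly the microarchitectural difference described above. In a real benchmark you would also have to keep the compiler from optimizing the copies away.

#include <stdint.h>
#include <string.h>

// Two adjacent 4-byte stores, then an 8-byte load that overlaps both.
// On cores that can only forward from a single in-flight store, the
// wide load may stall; on cores with partial forwarding it may not.
uint64_t combine_halves(uint32_t lo, uint32_t hi) {
    uint8_t buf[8];
    memcpy(buf, &lo, sizeof lo);      // store #1: lower 4 bytes
    memcpy(buf + 4, &hi, sizeof hi);  // store #2: upper 4 bytes
    uint64_t out;
    memcpy(&out, buf, sizeof out);    // wide load spanning both pending stores
    return out;
}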
#include <immintrin.h> // provides _MM_SET_FLUSH_ZERO_MODE / _MM_SET_DENORMALS_ZERO_MODE

#if defined(__AVX512F__) || defined(__AVX2__)
void configure_x86_denormals(void) {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);         // Flush results to zero
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON); // Treat denormal inputs as zero
}
#endif
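Worth noting: these intrinsics set bits in MXCSR, which is per-thread state, so in a multi-threaded program each worker thread needs to call configure_x86_denormals() before doing its floating-point work.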
It had a 50x performance impact on Intel. As benchmarked on `c7i.metal-48xl` instances:

- `f64` throughput grew from 0.2 to 8.2 TFLOPS.
- `f32` throughput grew from 0.6 to 15.1 TFLOPS.
Here is that section in the repo with more notes on AVX-512, AMX, and other instructions: <https://github.com/ashvardanian/less_slow.cpp/blob/8f32d65cc...>.

Saturation breaks the successor relation S(x) != x. Sometimes you want that, but it's extremely situational, and you rarely want saturation precisely at the type max. Saturation is better served by functions in C.
Trapping is fine conceptually, but it means all your arithmetic operations can now error. That's a severe ergonomic issue, it isn't particularly well defined on many systems, and it introduces a bunch of thorny issues with optimizations. Again, better as functions in C.
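As a rough sketch of what such functions might look like in C (the helper names are illustrative, not an existing API), using the GCC/Clang overflow builtins:

#include <stdbool.h>
#include <stdint.h>

// Saturating add: clamp to INT32_MAX / INT32_MIN instead of wrapping.
static int32_t sat_add_i32(int32_t a, int32_t b) {
    int32_t sum;
    if (!__builtin_add_overflow(a, b, &sum))
        return sum;
    // Signed addition can only overflow when both operands share a sign.
    return (a > 0) ? INT32_MAX : INT32_MIN;
}

// Checked ("trapping-style") add: the caller decides what to do on overflow.
static bool checked_add_i32(int32_t a, int32_t b, int32_t *out) {
    return !__builtin_add_overflow(a, b, out);
}

C23 standardizes the checked form as ckd_add in <stdckdint.h>.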
On the other hand, wrapping is the mathematical basis for CRCs, error-correcting codes, cryptography, bitwise math, and more. There are no wasted bits, it's the natural implementation in hardware, it's familiar to students from a young age as "clock arithmetic", compilers can easily insert debug-mode checks for it (the way Rust does when you forget to use Wrapping<T>), etc.
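For illustration, two small C snippets (mine, not from the thread) where the mod-2^N behavior is the point rather than a bug:

#include <stdint.h>

// Sequence-number comparison (the trick TCP uses): wrapping subtraction
// makes "a is after b" work correctly even across the 2^32 boundary.
static int seq_after(uint32_t a, uint32_t b) {
    return (int32_t)(a - b) > 0;
}

// A multiplicative hash step: the multiply is *meant* to discard the high
// bits, which is exactly what Rust's Wrapping<u64> spells out explicitly.
static uint64_t mix64(uint64_t h, uint64_t x) {
    return (h ^ x) * 1099511628211ULL; // wraps modulo 2^64 by design
}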
It's obviously not perfect either: it has the same problem as all fixed-size representations of diverging from the infinite-precision math people are actually trying to do, but I don't think the alternatives would be better.
The natural implementation in hardware is that addition of two N-bit numbers produces an N+1-bit number. Most architectures even expose this extra bit as a carry bit.
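In C, that extra bit is directly observable through the overflow builtins (a small sketch assuming GCC or Clang; the helper name is mine):

#include <stdbool.h>
#include <stdint.h>

// 64-bit + 64-bit really produces a 65-bit result; the builtin reports
// the top bit, i.e. the carry flag the hardware already computes.
static uint64_t add_with_carry(uint64_t a, uint64_t b, bool *carry_out) {
    uint64_t sum;
    *carry_out = __builtin_add_overflow(a, b, &sum);
    return sum;
}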
I might be wrong here, so please feel free to correct me if so, but I don't think borrowing was a concept, per se, of the language itself.
As you mention, the concept the Rust designers took from Cyclone was explicit lifetimes.
Borrow checking provides two features (but in my opinion in a very un-ergonomic way): (1) temporal memory safety, i.e. prevention of use-after-free; and (2) freedom from data races (though not from race conditions in general).
I'm still wobbly on PLT legs though; I'm sure there's a pro or ten who could step in and elaborate.
https://homes.cs.washington.edu/~djg/papers/cyclone_memory.p...
or the more detailed discussion throughout this journal paper:
https://homes.cs.washington.edu/~djg/papers/cyclone_scp.pdf
As their citations indicate, the idea of borrowing appeared immediately in the application of substructural logics to programming languages, going back to Wadler's "Linear types can change the world!". It's just too painful without it.
[0] https://en.wikipedia.org/wiki/Diaconescu%27s_theorem
https://github.com/leanprover/lean4/blob/ad1a017949674a947f0...