brandmeyer (u/brandmeyer)

brandmeyer commented on AMD claims Arm ISA doesn't offer efficiency advantage over x86 techpowerup.com/340779/am... · Posted by u/ksec

adgjlsfhk1 · 5 months ago

> Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes.

IMO this is way better than the alternative in x86 and ARM. The reason no one deals with rounding modes is because changing the mode is really slow and you always need to change it back or else everything breaks. Being able to do it in the instruction allows you to do operations with non-standard modes much more simply. For example, round-to-nearest-ties-to-odd can be incredibly useful to prevent double rounding.
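A toy sketch of the double-rounding problem, on plain integers rather than real floating point. The helper names are made up for illustration, and the intermediate step uses the usual "round to odd" rule (truncate, then force the low bit to 1 if anything was discarded), which may not be exactly the mode the comment has in mind.

```python
def round_nearest_even(x, drop):
    """Drop the low `drop` bits of a non-negative int: nearest, ties to even."""
    q, r = divmod(x, 1 << drop)
    half = 1 << (drop - 1)
    if r > half or (r == half and q & 1):
        q += 1
    return q


def round_to_odd(x, drop):
    """Truncate, then set the low bit if any discarded bit was nonzero."""
    q, r = divmod(x, 1 << drop)
    return q | 1 if r else q


x = 0b10111  # 23, i.e. 1.4375 in units of 16
direct = round_nearest_even(x, 4)                         # a single rounding
twice = round_nearest_even(round_nearest_even(x, 2), 2)   # nearest, then nearest
via_odd = round_nearest_even(round_to_odd(x, 2), 2)       # odd, then nearest
```

Here `twice` disagrees with `direct` because the first rounding manufactures a tie that the second one then breaks the wrong way, while `via_odd` agrees with `direct`.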

brandmeyer · 5 months ago

You can't even express static rounding in C. You can't even express it in the LLVM language-independent IR. Any attempt to use the static rounding modes will necessarily involve intrinsics and/or assembly.

brandmeyer commented on AMD claims Arm ISA doesn't offer efficiency advantage over x86 techpowerup.com/340779/am... · Posted by u/ksec

newpavlov · 5 months ago

- Handling of misaligned loads/stores: RISC-V got itself into a weird middle ground, where ops on misaligned pointers may work fine, may work "extremely slow", or may cause fatal exceptions (yes, I know about Zicclsm; it's extremely new and only helps with the latter, also see https://github.com/llvm/llvm-project/issues/110454). Other platforms either guarantee "reasonable" performance for such operations, or forbid misaligned access with "aligned" loads/stores and provide separate misaligned instructions. Arguably, RISC-V should've done the latter (with misaligned instructions defined in a separate higher-end extension), since passing an unaligned pointer into an aligned instruction signals correctness problems in software.

- The hardcoded page size. 4 KiB is a good default for RV32, but arguably a huge missed opportunity for RV64.

- The weird restriction in the forward progress guarantees for LR/SC sequences, which forces compilers to compile `compare_exchange` and `compare_exchange_weak` in exactly the same way. See this issue for more information: https://github.com/riscv/riscv-isa-manual/issues/2047

- The `seed` CSR: it does not provide good-quality entropy (i.e. after you have accumulated 256 bits of output, it may contain only 128 bits of randomness). You have to use a CSPRNG on top of it for any sensitive applications. Doing so may be inefficient and will bloat binary size (remember, the relaxed requirement was introduced for "low-powered" devices). Also, software developers may make mistakes in this area (not everyone is a security expert). Similar alternatives like RDRAND (x86) and RNDR (ARM) guarantee proper randomness and we can use their output directly for cryptographic keys with very small code footprint.

- Extensions do not form hierarchies: it looks like the AVX-512 situation once again, but worse. Profiles help, but a profile is a "packet", not a hierarchy. Also, there are annoyances like Zbkb not being a proper subset of Zbb.

- Detection of available extensions: we usually have to rely on the OS to query available extensions since the `misa` register is accessible only in machine mode. This makes detection quite annoying for "universal" libraries which intend to support various OSes and embedded targets. The CPUID instruction (x86) is ideal in this regard. I totally disagree with the virtualization argument against it: nothing prevents a VM from intercepting the read, and no one expects huge performance from such reads.
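On the `seed` point in the list above, a minimal sketch of what "put something on top of it" could look like, using the 256-bits-out, 128-bits-of-entropy figure from the comment. `fake_read_seed` is a hypothetical stand-in for the CSR read, and the word layout (status in bits 31:30, 16 entropy bits in 15:0, status `0b10` meaning valid) is an assumption about the entropy source interface, so check it against the spec before relying on it.

```python
import hashlib
import os

ES16 = 0b10  # assumed status value meaning "the low 16 bits hold fresh entropy"


def fake_read_seed():
    """Hypothetical stand-in for reading the seed CSR (not a real API)."""
    sample = int.from_bytes(os.urandom(2), "little")
    return (ES16 << 30) | sample


def derive_key(read_seed=fake_read_seed, raw_bits=512):
    """Collect raw_bits of seed output and condition it down to a 256-bit key.

    512 raw bits are gathered because only half of them are assumed to be
    real entropy. A real driver would also bound the retries and treat a
    failure status as fatal instead of polling forever.
    """
    pool = bytearray()
    while len(pool) * 8 < raw_bits:
        word = read_seed()
        if (word >> 30) & 0b11 != ES16:
            continue  # no valid sample this time, poll again
        pool += (word & 0xFFFF).to_bytes(2, "little")
    return hashlib.sha256(bytes(pool)).digest()
```

Even this small conditioner pulls in a hash implementation, which is the code-size cost the comment is complaining about.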

And this list is compiled after a pretty surface-level dive into the RISC-V spec. I heard about other issues (e.g. being unable to port tricky SIMD code to the V extension or underspecification around memory coherence important for writing drivers), but I can not confidently talk about those, so it's not part of my list.

P.S.: I would be interested to hear about other people's gripes with RISC-V.

brandmeyer · 5 months ago

Nothing major, just some oddball decisions here and there.

Fused compare-and-branch only extends to the base integer instructions. Anything else needs to generate a value that feeds into a compare-and-branch. Since all branches are compare-and-branch, they all need two register operands, which impairs their reach to a mere +/- 4 kB.
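The reach figure follows from the immediate width. A small sketch of the arithmetic, assuming 12 encoded immediate bits for conditional branches and 20 for JAL, both counted in 2-byte units (the function name is just for illustration):

```python
def signed_reach(imm_bits, scale):
    """Byte range covered by a signed immediate of imm_bits bits, in scale-byte units."""
    lo = -(1 << (imm_bits - 1)) * scale
    hi = ((1 << (imm_bits - 1)) - 1) * scale
    return lo, hi


branch = signed_reach(12, 2)  # conditional branch: about +/- 4 KiB
jal = signed_reach(20, 2)     # JAL, with one register operand: about +/- 1 MiB
```

The 8 extra immediate bits that JAL gets, mostly by not naming two source registers, are the difference between the two ranges.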

The reach for position-independent code instructions (AUIPC + any load or store) is not quite +/- 2 GB. There is a hole on either end of the reach that is a consequence of using a sign-extended 12-bit offset for loads and stores, and a sign-extended high 20-bit offset for AUIPC. ARM's adrp (address of page) + unsigned offsets is more uniform.
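A sketch of where the odd window comes from, adding the extremes of the two sign-extended immediates. This is my own arithmetic for RV64 (on RV32 the addresses simply wrap), not something quoted from the spec:

```python
def auipc_pair_reach():
    """PC-relative byte range of AUIPC plus a 12-bit load/store offset (RV64)."""
    hi_min = -(1 << 19) << 12        # sign-extended 20-bit immediate, shifted left 12
    hi_max = ((1 << 19) - 1) << 12
    lo_min = -(1 << 11)              # sign-extended 12-bit offset
    lo_max = (1 << 11) - 1
    return hi_min + lo_min, hi_max + lo_max


lo, hi = auipc_pair_reach()
short_of_plus_2gib = (1 << 31) - 1 - hi   # bytes missing at the top
beyond_minus_2gib = -(1 << 31) - lo       # bytes gained at the bottom
```

By this arithmetic the window is shifted down by 2 KiB relative to a symmetric +/- 2 GiB: the top 2 KiB below +2 GiB cannot be reached, while the bottom end extends 2 KiB past -2 GiB.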

RV32 isn't a proper subset of RV64, which isn't a proper subset of RV128. If they were proper subsets, then RV64 programs could run unmodified on RV128 hardware. Not that it's ever going to happen, but if it did, the processor would have to mode-switch, not unlike the x86-64 transition of yore.

Floating point arithmetic spends three bits in the instruction encoding to support static rounding modes. I can count on zero hands the number of times I've needed that.

The integer ISA design goes to great lengths to avoid any instructions with three source operands, in order to simplify the datapaths on tiny machines. But... the floating point extension correctly includes fused multiply-add. So big chunks of any high-end processor will need three-operand datapaths anyway.

The base ISA is entirely too basic, and a classic failure of 90% design. Just because most code doesn't need all those other instructions doesn't mean that most systems don't. RISC-V is gathering extensions like a Katamari to fill in all those holes (B, Zfa, etc).

None of those things make it bad, I just don't think it's nearly as shiny as the hype. ARM64+SVE and x86-64+AVX512 are just better.

brandmeyer commented on 4k NASA employees opt to leave agency through deferred resignation program kcrw.com/news/shows/npr/n... · Posted by u/ProAm

chadcmulligan · 6 months ago

I read this recently, and it doesn't sound good

https://idlewords.com/2024/05/the_lunacy_of_artemis.htm

brandmeyer · 6 months ago

Any argument that is filled with this much ragebait should be dismissed out of hand.

brandmeyer commented on FFmpeg devs boast of another 100x leap thanks to handwritten assembly code tomshardware.com/software... · Posted by u/harambae

torginus · 7 months ago

Sorry for the derail, but it sounds like you have a ton of experience with SIMD.

Have you used ISPC, and what are your thoughts on it?

I feel it's a bit ridiculous that in this day and age you have to write SIMD code by hand, as regular compilers suck at auto-vectorizing, especially as this has never been the case with GPU kernels.

brandmeyer · 7 months ago

ISPC suffers from poor scatter and gather support in hardware. The direct result is that it is hard to make programs that scale in complexity without resorting to shenanigans.

An ideal gather-load or scatter-store instruction should take time proportional to the number of cache lines that it touches. If all of the lane accesses are sequential and cache line aligned it should take the same amount of time as an aligned vector load or store. If the accesses have high cache locality such that only two cache lines are touched, it should cost exactly the same as loading those two cache lines and shuffling the results into place. That isn't what we have on x86-AVX512. They are microcoded with inefficient lane-at-a-time implementations. If you know that there is good locality of reference in the access, then it can be faster to hand-code your own cache line-at-a-time load/shuffle/masked-merge loop than to rely on the hardware. This makes me sad.
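The cost model described here can be sketched as a count of distinct cache lines per access. The 4-byte elements and 64-byte lines are assumptions for illustration, and the function is a model of the ideal, not a measurement of any real hardware:

```python
def cache_lines_touched(base, indices, elem_size=4, line_size=64):
    """Count the distinct cache lines covered by one gather or scatter."""
    lines = set()
    for i in indices:
        first = (base + i * elem_size) // line_size
        last = (base + i * elem_size + elem_size - 1) // line_size
        lines.update(range(first, last + 1))  # an element may straddle two lines
    return len(lines)


aligned_sequential = cache_lines_touched(0, range(16))      # one aligned vector
offset_sequential = cache_lines_touched(32, range(16))      # straddles a boundary
fully_strided = cache_lines_touched(0, [i * 16 for i in range(16)])
```

Under the ideal model the first access costs one line, the second two, and only the last one deserves the full sixteen, whereas a lane-at-a-time implementation pays roughly sixteen for all three.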

ISPC's varying variables have no way to declare that they are sequential among all lanes. Therefore, without extensive inlining to expose the caller's access pattern, it issues scatters and gathers at the drop of a hat. You might like to write your program with a naive x[y] (x a uniform pointer, y a varying index) in a subroutine, but ISPC's language cannot infer that y is sequential along lanes. So, you have to carefully re-code it to say that y is actually a uniform offset into the array, and write x[y + programIndex]. Error-prone, yet utterly essential for decent performance. I resorted to munging my naming conventions for such indexes, not unlike the Hungarian notation of yesteryear.

Rewriting critical data structures in SoA format instead of AoS format is non-trivial, and a prerequisite to get decent performance from ISPC. You cannot "just" replace some subroutines with ISPC routines; you need to make major refactorings that touch the rest of the program as well. This is neutral in an ISPC-versus-intrinsics (or even ISPC-versus-GPU) shootout, and is worth mentioning only to point out that ISPC is not a silver bullet in this regard either.
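For readers who have not met the terms, a minimal sketch of the AoS-to-SoA transformation on a hypothetical particle record. The conversion itself is trivial; the refactoring cost comes from every other piece of code that expected the old layout.

```python
def aos_to_soa(records):
    """Turn a list of same-keyed dicts (AoS) into a dict of lists (SoA)."""
    if not records:
        return {}
    return {field: [rec[field] for rec in records] for field in records[0]}


# Array of structures: one record per particle.
particles_aos = [{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}]

# Structure of arrays: one contiguous sequence per field, so consecutive
# lanes can read consecutive elements of the same field.
particles_soa = aos_to_soa(particles_aos)
```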

Non-minor nit: The ISPC math library gives up far too much precision by default in the name of speed. Fortunately, Sleef is not terribly difficult to integrate and use for the 1-ulp max rounding error that I've come to expect from a competent libm.

Another: The ISPC calling convention adheres rather strictly to the C calling convention... which doesn't provide any callee-saved vector registers, not even for the execution mask. So if you like to decompose your program across multiple compilation units, you will also notice much more register save and restore traffic than you would like or expect.

I want to like it, I can get some work done in it, and I did get significant performance improvements over scalar code when using it. But the resulting source code and object code are not great. They are merely acceptable.

brandmeyer commented on Europe's first geostationary sounder satellite is launched eumetsat.int/europes-firs... · Posted by u/diggan

themisto · 7 months ago

Only tangentially related: I have nothing but respect for EUMETSAT and their public data store. For past work projects I've had to interface with a pretty broad sample of the world's space and/or meteorological agencies' public data stores and APIs and EUMDAC (EUMETSAT's API client) was top tier. Well documented, modern, fast, and generally headache free.

In fact, I have nothing but respect for any agency that makes free and public access to earth observation data a priority, regardless of how janky their API is.

brandmeyer · 7 months ago

Shout-out to the NOAA GFS team, who publish the GFS analysis directly to AWS S3.

https://registry.opendata.aws/noaa-gfs-bdp-pds/

brandmeyer commented on Fun with C++26 reflection: Keyword Arguments pydong.org/posts/KwArgs/... · Posted by u/HeliumHydride

anitil · a year ago

I've seen cases where people do things like

  my_func(/*arg1=*/val1,/*arg2=*/val2)

And I suppose you could write a validator to make sure that this worked. Or using an anonymous structure in C99, or a named structure in C89. And of course a pointer if you care about register/stack usage etc.

I'm not sure what the other options are.

brandmeyer · a year ago

> And I suppose you could write a validator to make sure that this worked.

Like this one!

https://clang.llvm.org/extra/clang-tidy/checks/bugprone/argu...
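For a feel of what such a validator checks, here is a deliberately naive toy. It is not how clang-tidy works (that operates on the real AST); the regex, the function name, and the comma splitting are all illustrative assumptions and will break on nested calls or commas inside arguments.

```python
import re

ARG_COMMENT = re.compile(r"/\*\s*(\w+)\s*=\s*\*/")


def mismatched_arg_comments(call, param_names):
    """Return (position, commented_name, declared_name) for each mismatch."""
    inner = call[call.index("(") + 1 : call.rindex(")")]
    mismatches = []
    for pos, arg in enumerate(inner.split(",")):  # naive split, see lead-in
        match = ARG_COMMENT.search(arg)
        if match and pos < len(param_names) and match.group(1) != param_names[pos]:
            mismatches.append((pos, match.group(1), param_names[pos]))
    return mismatches


ok = mismatched_arg_comments("my_func(/*arg1=*/val1,/*arg2=*/val2)", ["arg1", "arg2"])
swapped = mismatched_arg_comments("my_func(/*arg2=*/val1,/*arg1=*/val2)", ["arg1", "arg2"])
```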

brandmeyer commented on The origins of 60-Hz as a power frequency (1997) ieeexplore.ieee.org/docum... · Posted by u/theamk

hunter2_ · a year ago

What's the cause? Alternators run at very high frequencies with good rectifiers, so I'm guessing the flicker is introduced by PWM dimming, but why would that be a low enough frequency to bother people?

I'm sensitive to flicker myself, but only on the more extreme half of the spectrum. For example, half rectified LED drivers on 60 Hz AC drive me nuts, but full rectified (120 Hz) I very rarely notice. I don't notice any problem with car tail lights, except in the case of a video camera where the flicker and the frame rate are beating. The beating tends to be on the order of 10 Hz (just shooting from the hip here) so if frame rates are 30/60/120 then I guess the PWM frequency is something like 110 or 130 Hz?
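The guess above can be checked with a little aliasing arithmetic: a camera sampling at a fixed frame rate sees a periodic light at its distance from the nearest multiple of that frame rate. This is a simplified model (ideal sampling, no shutter or rolling-shutter effects), and the function name is made up for illustration.

```python
import math


def apparent_flicker_hz(pwm_hz, frame_rate_hz):
    """Alias frequency when a periodic light is sampled at frame_rate_hz."""
    nearest_multiple = math.floor(pwm_hz / frame_rate_hz + 0.5) * frame_rate_hz
    return abs(pwm_hz - nearest_multiple)


beats = {
    (pwm, fps): apparent_flicker_hz(pwm, fps)
    for pwm in (110, 130)
    for fps in (30, 60, 120)
}
```

Both 110 Hz and 130 Hz alias to 10 Hz at all three frame rates in this model, which is consistent with the estimate in the comment.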

brandmeyer · a year ago

> introduced by PWM dimming, but why would that be a low enough frequency to bother people?

The human fovea has a much lower effective refresh rate than your peripheral vision. So you might notice the flickering of tail lights (and daytime running lights) seen out of the corner of your eye even though you can't notice when looking directly at them.

brandmeyer commented on Software development topics I've changed my mind on chriskiehl.com/article/th... · Posted by u/belter

bjourne · a year ago

While Python has some great linters, I don't know of any in C that can correctly and automatically enforce some coding style. Most of them can only indent correctly, but they can't break up long lines over multiple lines, format array literals, or strings. Few, if any, know how to deal with names or preprocessor macros.

brandmeyer · a year ago

clang-format and clang-tidy are both excellent for C and C++ (and protobuf, if your group uses it). Since they are based on the clang front-end, they naturally have full support for both languages and all of their complexity.

brandmeyer commented on A better approach to gravity: how we made EGM2008 faster elodin.systems/post/a-bet... · Posted by u/sphw

dhsysusbsjsi · a year ago

Excuse my ignorance but how big of a lookup table would you need to achieve the same outcome ?

brandmeyer · a year ago

LUTs are commonly used in geodesy applications on or near the Earth's surface. The full multipole model is used for orbital applications to account for the way that local lumpiness in Earth's mass distribution is smoothed out with increasing distance from the surface. It might be reasonable to build a 3D LUT for use at Starlink scale or higher, but certainly not for individual satellites.

brandmeyer commented on A better approach to gravity: how we made EGM2008 faster elodin.systems/post/a-bet... · Posted by u/sphw

brandmeyer · a year ago

Exactly what order and degree were you using to evaluate the model? Variations in drag and solar pressure are more significant than the uncertainty in the gravity field for objects in LEO somewhere much less than 127th order (40 microseconds on my machine, your smileage may vary), so you can safely truncate the model for simulations. GRACE worked by making many passes such that they could average out those perturbations to make their measurement. But for practical applications, those tiny terms are irrelevant.

IERS Technical Note 36 section 6.1 gives recommendations for model truncation if you are looking for justification. https://iers-conventions.obspm.fr/content/tn36.pdf
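To give a sense of how much truncation buys, a rough sketch that counts the (degree, order) pairs in a spherical harmonic expansion. Comparing against degree 2159, the degree and order to which EGM2008 is complete, is my choice for illustration; the real cost also depends on how the Legendre recursion is implemented.

```python
def harmonic_terms(max_degree):
    """Number of (degree, order) pairs with 0 <= order <= degree <= max_degree."""
    return (max_degree + 1) * (max_degree + 2) // 2


truncated = harmonic_terms(127)
full = harmonic_terms(2159)
ratio = full / truncated  # rough factor by which the work shrinks
```

The term count grows quadratically with the maximum degree, so truncating to degree 127 cuts the number of terms by more than two orders of magnitude.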