Interestingly, the chip is rated to run at DDR4-3200 or DDR5, so it's strange C&C got half that.
The power issues are likely down to clocking behavior that's prehistoric by modern standards (a single P-state, to my understanding)!
I remoted into the system for testing (Cheese/George had it), and he said it took 3-4 cold reboots for it to come up, and suspected memory wasn't training correctly. So I did all the testing without ever rebooting the system, because it might not come back up if I tried.
But yes, platforms usually apply compression in terrible ways, and it's especially noticeable on text and straight-line content like graphs.
If I had more time I'd roll my own site with basic HTML/CSS. It's not even hard, just time consuming.
Just a nit: one step up (RX 7600 XT) comes with 16GB of memory, albeit in a clamshell configuration. With the B580 falling in between the 7600 and 7600 XT in pricing, it seems a bit unfair to only compare it with the former.
- RX 7600 (8GB) ~€300
- RTX 4060 (8GB) ~€310
- Intel B580 (12GB) ~€330
- RX 7600 XT (16GB) ~€350
- RTX 4060 Ti (8GB) ~€420
- RTX 4060 Ti (16GB) ~€580*
*Apparently this card is really rare and a bad value proposition, so it is hard to find
Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or if it's not broken down then it's the critical path and forces a lower clock speed just as, for example, the single-cycle multiply on Cortex-M0 does).
Plus, an isolated indexing into an array is rare and never speed critical. The important ones are in loops, where the compiler uses "strength reduction" and "code motion out of loops" so that you're not doing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.
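To make that concrete, here's a toy copy loop written both ways; this is my own illustration of the transformation described above, not code from the article:

```c
/* Toy example: the same copy written with per-iteration indexing and in the
   strength-reduced, unrolled-by-4 form a compiler typically produces. */
void copy_indexed(long *dst, const long *src, long n) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i];              /* address = base + i*sizeof(long) each time */
}

void copy_unrolled(long *dst, const long *src, long n) {
    long i = 0;
    for (; i + 4 <= n; i += 4) {      /* pointers are just bumped, no scaled index */
        dst[0] = src[0];
        dst[1] = src[1];
        dst[2] = src[2];
        dst[3] = src[3];
        dst += 4;
        src += 4;
    }
    for (; i < n; i++)                /* remainder iterations */
        *dst++ = *src++;
}
```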
"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.
A lot of the criticisms were of a reflexive "bigger is better" nature, without any examination of HOW MUCH better, or of what you give up elsewhere to get it. For example, both conditional branch range and JAL/JALR range are criticised as being limited, and they are limited because the instructions encode one or more 5-bit register specifiers: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.
RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need more range then you can always branch on the opposite condition over an unconditional ±1 MB jump. If your loop is that large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.
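For illustration, here's that inverted-branch-over-a-jump pattern on a trivial function; the RV64 instruction sequences in the comments are my own sketch of what a compiler or assembler emits, not taken from the article:

```c
/* Sketch: working around the ±4 KB conditional branch range.
   In-range target, one instruction:
       beq  a0, a1, target
   Out-of-range target, invert the condition and hop over a ±1 MB jump:
       bne  a0, a1, 1f
       jal  zero, target
   1:
   The same control flow expressed with C gotos: */
int equal_path(long a, long b) {
    if (a != b)
        goto skip;      /* inverted condition branches over ... */
    goto target;        /* ... the longer-range unconditional jump */
skip:
    return 0;
target:
    return 1;
}
```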
What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.
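As a concrete (and again, my own toy) example of such a tight loop, the back edge of a simple sum loop is a single fused compare-and-branch on RISC-V, versus a compare plus a condition-code branch on arm64:

```c
/* Tight loop whose back edge is the interesting part.
   RV64 back edge, one instruction:
       bne  a0, a1, loop        # compare-and-branch
   arm64 equivalent, two instructions via condition codes:
       cmp  x0, x1
       b.ne loop                                           */
long sum(const long *p, const long *end) {
    long total = 0;
    while (p != end)
        total += *p++;
    return total;
}
```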
This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's cores from the Athlon through the Family 10h designs have 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.
For the x86-64/aarch64 cores that do take additional latency, it's often just one cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need a shift and a dependent add. That's two extra cycles of latency.
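Here's the decomposition in question on a toy load; the instruction sequences in the comments are my own sketch (aarch64 with a scaled index vs. RV64 without an indexed addressing mode):

```c
/* One scaled-index load vs. the "simple instruction" equivalent. */
long load_elem(const long *base, long i) {
    return base[i];
    /* aarch64, one instruction with a scaled index:
           ldr  x0, [x0, x1, lsl #3]
       RV64: shift, dependent add, then the load:
           slli a1, a1, 3
           add  a0, a0, a1
           ld   a0, 0(a0)                               */
}
```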
So it does have a 2nd level BTB, it's just that it's labeled as IJTP and is potentially only used by indirect branches.
That seems unwise. It might work well enough for direct branches, but it's going to perform very badly on indirect branches. I would love to see some tests for indirect branch performance (static and dynamic) in your suite.
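Something along these lines is what I have in mind; it's only a rough sketch (the function names, sizes, and timing scaffold are mine, and a real test would pin the core and sweep the pattern length):

```c
/* Rough indirect-branch sketch: call through a function pointer chosen from
   a precomputed pattern. An all-zeros pattern approximates the "static"
   case; a pseudo-random pattern approximates the "dynamic" case. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NTARGETS 4
#define PATTERN  (1 << 16)
#define ITERS    (1 << 24)

static long f0(long x) { return x + 1; }
static long f1(long x) { return x + 2; }
static long f2(long x) { return x + 3; }
static long f3(long x) { return x + 4; }

static double run(const unsigned char *pattern) {
    long (*targets[NTARGETS])(long) = { f0, f1, f2, f3 };
    struct timespec t0, t1;
    long acc = 0;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < ITERS; i++)
        acc = targets[pattern[i & (PATTERN - 1)]](acc);   /* the indirect branch */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    return ns / ITERS + 0 * acc;   /* keep acc live so the calls aren't removed */
}

int main(void) {
    unsigned char *fixed = calloc(PATTERN, 1);      /* always targets f0 */
    unsigned char *mixed = malloc(PATTERN);
    for (int i = 0; i < PATTERN; i++)
        mixed[i] = rand() % NTARGETS;               /* unpredictable target */

    printf("static : %.2f ns/call\n", run(fixed));
    printf("dynamic: %.2f ns/call\n", run(mixed));
    return 0;
}
```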
> When return stack capacity is exceeded, P550 sees a sharp spike in latency. That contrasts with A75’s more gentle increase in latency.
That might be a direct consequence of the P550's limited BTB. Even when the return stack overflows, the A75 can probably still predict the return as if it was an indirect branch, utilising its massive 3072 entry L1 BTB.
Actually, are you sure the P550 even has a return stack? 16 correctly predicted call/ret pairs just so happens to be what you would get from a 32 entry BTB predicting 16 calls then 16 returns.
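A quick way to probe that would be a call-depth sweep; here's a rough sketch (my own scaffold, not the article's suite), where the time per call/return pair should jump once the depth exceeds whatever structure is predicting returns:

```c
/* Call/return depth sweep: recurse to a given depth many times and report
   nanoseconds per call/return pair. The empty asm defeats tail-recursion
   elimination so real call/return pairs are kept. */
#include <stdio.h>
#include <time.h>

__attribute__((noinline)) static long chain(int depth) {
    if (depth == 0)
        return 1;
    long r = chain(depth - 1) + 1;
    __asm__ volatile("" : "+r"(r));    /* keep the recursive call/return */
    return r;
}

int main(void) {
    const long iters = 1 << 20;
    for (int depth = 2; depth <= 64; depth *= 2) {
        struct timespec t0, t1;
        long acc = 0;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < iters; i++)
            acc += chain(depth);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
        printf("depth %2d: %.2f ns per call/ret pair (acc=%ld)\n",
               depth, ns / ((double)iters * depth), acc);
    }
    return 0;
}
```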
"a branch prediction unit that is composed of a 32-entry Branch Target Buffer (BTB), a 9.1 KiB-entry Branch History Table (BHT), a 16-entry Return Address Stack (RAS), 512-entry Indirect Jump Target Predictor (IJTP), and a 16-entry Return Instruction Predictor"
> The Performance P550 scales up to four-core complex configurations while delivering 30% higher performance in less than half the area of a comparable Arm® Cortex®-A75.
Dylan Patel wasn't impressed by these comparisons with A75 [2]
> @SiFive is claiming half the area and higher perf/GHz, but they are using 7nm and 100ns memory latency. Choosing to compare to the 10nm A75 on S845, notorious for its high latency at over 200ns. Purposely ignoring iso-node or other A75 comparisons.
And this analysis seems to be borne out in this Chips and Cheese post.
> As a step along that journey, P550 feels more comparable to one of Arm’s early out-of-order designs like Cortex A57. By the time A75 came out, Arm already accumulated substantial experience in designing out-of-order CPUs. Therefore, A75 is a well polished and well rounded core, aside from obvious sacrifices required for its low power and thermal budgets. P550 by comparison is rough around the edges.
So what should we make of SiFive's claims? It seems like quite an important claim / comparison.