clamchowder commented on Condor's Cuzco RISC-V Core at Hot Chips 2025   chipsandcheese.com/p/cond... · Posted by u/rbanffy
bee_rider · 4 months ago
The static schedule part seems really interesting. They note that it only works for some instructions, but I wonder if it would be possible to have a compiler report “this section of the code can be statically scheduled.” In that case, could this have a benefit for real-time operation? Or maybe some specialized partially real-time application—mark a segment of the program as desiring static scheduling, and don’t allow memory loads, etc, inside there.
clamchowder · 4 months ago
(author here) they try for all instructions, just that it's a prediction w/replay because inevitably some instructions like memory loads are variable latency. It's not like Nvidia where fixed latency instructions are statically scheduled, then memory loads/other variable latency stuff is handled dynamically via scoreboarding.
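
A toy model may help make the "prediction with replay" idea concrete. This is a hypothetical C sketch (instruction names and latencies made up, not Condor's actual scheduler): dependents are woken as if their producer completes in its predicted latency, and if a load actually takes longer, the dependents are replayed at the corrected time.

    #include <stdio.h>

    typedef struct {
        const char *name;
        int dep;           /* index of producing instruction, or -1 for none */
        int predicted_lat; /* latency assumed when waking dependents */
        int actual_lat;    /* latency that actually occurred */
    } Insn;

    int main(void) {
        Insn prog[] = {
            {"load x1, 0(x2)", -1, 3, 12}, /* predicted cache hit, actually missed */
            {"add  x3, x1, 1",  0, 1,  1},
            {"add  x4, x3, 1",  1, 1,  1},
        };
        int n = (int)(sizeof prog / sizeof prog[0]);
        int predicted_done[8], actual_done[8];

        for (int i = 0; i < n; i++) {
            /* Wake the instruction when its producer is *predicted* to be done. */
            int pred_ready = (prog[i].dep < 0) ? 0 : predicted_done[prog[i].dep];
            int real_ready = (prog[i].dep < 0) ? 0 : actual_done[prog[i].dep];

            int issue = pred_ready;
            int replayed = real_ready > pred_ready;
            if (replayed)
                issue = real_ready; /* data wasn't there yet: cancel and replay */

            /* Heavily simplified: keep the optimistic timeline for waking
               dependents, and track when results actually show up. */
            predicted_done[i] = pred_ready + prog[i].predicted_lat;
            actual_done[i] = issue + prog[i].actual_lat;

            printf("%-16s issue@%2d done@%2d%s\n", prog[i].name, issue,
                   actual_done[i], replayed ? " (replayed)" : "");
        }
        return 0;
    }

The replay cascades down the dependency chain: once the load misses its predicted latency, both dependent adds get replayed at the corrected times.
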
clamchowder commented on Zhaoxin's KX-7000   chipsandcheese.com/p/zhao... · Posted by u/ryandotsmith
luyu_wu · 8 months ago
Absolutely a long way to go.

Interestingly, the chip is rated to run at DDR4-3200 or DDR5, so it's strange C&C got half that.

The power issues are likely from clocking behavior that's prehistoric by modern standards (a single P-state, to my understanding)!

clamchowder · 8 months ago
It does clock ramp from 800 MHz idle to 3.2 GHz under load, with 900, 1000, 1100, 1300, 1500, 1800, 2200, and 2700 MHz steps in between until it hits 3.2 GHz after 71.6 ms. The article was getting long enough, so I just left it at: it reaches 3.2 GHz and stays there, even though the spec sheet says it should go higher.
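
Here's a minimal sketch of how a clock ramp like that can be observed from software (my own generic approach, not the exact harness used for the article). It assumes the dependent rotate chain retires at roughly one iteration per cycle, so treat the absolute MHz as approximate; the shape of the ramp is the interesting part.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS_PER_SAMPLE 5000000ULL

    static double now_ms(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);   /* POSIX monotonic wall clock */
        return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
    }

    int main(void) {
        /* Seed with something the compiler can't constant-fold away. */
        uint64_t x = (uint64_t)time(NULL) | 1;
        double start = now_ms();

        /* Sample the estimated frequency repeatedly as the core ramps out of idle. */
        for (int sample = 0; sample < 64; sample++) {
            double t0 = now_ms();
            for (uint64_t i = 0; i < ITERS_PER_SAMPLE; i++)
                x = (x << 1) | (x >> 63);      /* dependent rotate, ~1 cycle/iter */
            double t1 = now_ms();

            /* iterations per microsecond ~= MHz, assuming 1 iteration per cycle */
            double mhz = ITERS_PER_SAMPLE / ((t1 - t0) * 1e3);
            printf("%8.1f ms after start: ~%.0f MHz\n", t1 - start, mhz);
        }
        printf("final x = %llu\n", (unsigned long long)x); /* keep the chain alive */
        return 0;
    }
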

I remoted into the system for testing (Cheese/George had it), and he said it took 3-4 cold reboots for it to come up, and suspected memory wasn't training correctly. So I did all the testing without ever rebooting the system, because it might not come back up if I tried.

clamchowder commented on Intel's Battlemage Architecture   chipsandcheese.com/p/inte... · Posted by u/ksec
dark__paladin · 10 months ago
TAA is temporal anti-aliasing, correct? There is no time dimension here, isn't it just compression + bilinear filtering?
clamchowder · 10 months ago
It was a joke about blurriness. To extend the joke, be glad it doesn't flicker and shimmer.

But yes, platforms usually apply compression in terrible ways, and it's especially noticeable coming from text and straight line stuff like graphs

clamchowder commented on Intel's Battlemage Architecture   chipsandcheese.com/p/inte... · Posted by u/ksec
SG- · 10 months ago
it's a nice technical article but the charts are just terrible and seem blurry even when zoomed in.
clamchowder · 10 months ago
Yea, Wordpress was a terrible platform and Substack is also a terrible platform. I don't know why every platform wants to take a simple uploaded PNG and apply TAA to it. And don't get me started on how Substack has no native table support, when HTML has had it since prehistoric times.

If I had more time I'd roll my own site with basic HTML/CSS. It's not even hard, just time consuming.

clamchowder commented on Intel's Battlemage Architecture   chipsandcheese.com/p/inte... · Posted by u/ksec
jorvi · 10 months ago
> Unfortunately, today’s midrange cards like the RTX 4060 and RX 7600 only come with 8 GB of VRAM

Just a nit: one step up (RX 7600 XT) comes with 16 GB of memory, although in a clamshell configuration. With the B580 falling in between the 7600 and 7600 XT in terms of pricing, it seems a bit unfair to only compare it with the former.

- RX 7600 (8GB) ~€300

- RTX 4060 (8GB) ~€310

- Intel B580 (12GB) ~€330

- RX 7600 XT (16GB) ~€350

- RTX 4060 Ti (8GB) ~€420

- RTX 4060 Ti (16GB) ~€580*

*Apparently this card is really rare plus a bad value proposition, so it is hard to find

clamchowder · 10 months ago
(author here) When I checked, the 7600 XT was much more expensive. Right now it's still $360 on eBay, vs the B580's $250 MSRP, though yeah, I guess it's hard to find the B580 in stock.
clamchowder commented on Alibaba/T-HEAD's Xuantie C910: An open source RISC-V core   chipsandcheese.com/p/alib... · Posted by u/mfiguiere
IanCutress · a year ago
We need to get Chester on the podcast more :)
clamchowder · a year ago
Oh that should be fun. Would have to fit it around work though
clamchowder commented on Alibaba/T-HEAD's Xuantie C910: An open source RISC-V core   chipsandcheese.com/p/alib... · Posted by u/mfiguiere
brucehoult · a year ago
The criticisms there are at the same time 1) true and 2) irrelevant.

Just to take one example. Yes, on ARM and x86 you can often do array indexing in one instruction. And then it is broken down into several µops that don't run any faster than a sequence of simpler instructions (or, if it's not broken down, then it's on the critical path and forces a lower clock speed, just as the single-cycle multiply on Cortex-M0 does, for example).

Plus, an isolated indexing into an array is rare and never speed critical. The important ones are in loops, where the compiler uses "strength reduction" and "code motion out of loops" so that you're not doing "base + array_offset + index*elt_size" every time, but just "p++". And if the loop is important and tight then it is unrolled, so you get ".. = p[0]; .. = p[1]; .. = p[2]; .. = p[3]; p += 4", which RISC-V handles perfectly well.
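
As a generic illustration of the strength reduction and unrolling being described (hypothetical C, not taken from any benchmark in this thread): the indexed form conceptually recomputes base + index*elt_size every iteration, while the transformed loop just bumps a single pointer and uses fixed offsets.

    #include <stddef.h>

    /* Indexed form: each access is conceptually *(base + i * sizeof(long)). */
    long sum_indexed(const long *a, size_t n) {
        long sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += a[i];
        return sum;
    }

    /* What the compiler commonly produces after strength reduction and
       unrolling by 4: fixed-offset accesses plus one pointer bump per group. */
    long sum_strength_reduced(const long *a, size_t n) {
        long sum = 0;
        const long *p = a, *end = a + (n & ~(size_t)3);
        for (; p != end; p += 4) {
            sum += p[0];
            sum += p[1];
            sum += p[2];
            sum += p[3];
        }
        for (size_t i = 0; i < (n & 3); i++)
            sum += p[i];          /* leftover elements */
        return sum;
    }
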

"But code size!" you say. That one is extremely easily answered, and not with opinion and hand-waving. Download amd64, arm64, and riscv64 versions of your favourite Linux distro .. Ubuntu 24.04, say, but it doesn't matter which one. Run "size" on your choice of programs. The RISC-V will always be significantly smaller than the other two -- despite supposedly being missing important instructions.

A lot of the criticisms were of a reflexive "bigger is better" nature, but without any examination of HOW MUCH better, or of the cost in something else you can't do instead because of that. For example, both conditional branch range and JAL/JALR range are criticised as being limited by the one or more 5-bit register specifiers included in the instruction: "compare and branch" is a single instruction (instead of using condition codes), and JAL/JALR explicitly specify where to store the return address instead of always using the same register.

RISC-V conditional branches have a range of ±4 KB while arm64 conditional branches have a range of ±1 MB. Is it better to have 1 MB? In the abstract, sure. But how often do you actually use it? 4 KB is already a very large function -- let alone loop -- in modern code. If you really need it then you can always do the opposite condition branch over an unconditional ±1 MB jump. If your loop is so very large then the overhead of one more instruction is going to be far down in the noise .. 0.1% maybe. I look at a LOT of compiled code and I can't recall the last time I saw such a thing in practice.

What you DO see a lot of is very tight loops, where on a low end processor doing compare-and-branch in a single instruction makes the loop 10% or 20% faster.

clamchowder · a year ago
"don't run any faster than a sequence of simpler instructions"

This is false. You can find examples of both x86-64 and aarch64 CPUs that handle indexed addressing with no extra latency penalty. For example, AMD's Athlon through Family 10h designs have 3-cycle load-to-use latency even with indexed addressing. I can't remember off the top of my head which aarch64 cores do it, but I've definitely come across some.

For the x86-64/aarch64 cores that do take additional latency, it's often just one extra cycle for indexed loads. To do indexed addressing with "simple" instructions, you'd need at least a shift and a dependent add. That's two extra cycles of latency.
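
For anyone who wants to see the effect, here's a rough pointer-chasing sketch in the spirit of these load-to-use latency measurements (my own generic version, not the exact Chips and Cheese harness). Both loops are serially dependent L1-resident loads, one through a plain [reg] address and one through [base + index*8], so any extra address-generation latency shows up as the difference in ns per load.

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N      512           /* 4 KB per array, stays resident in L1D */
    #define CHASES 50000000UL

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        /* Plain addressing: each element holds the address of the next one. */
        void **ptrs = malloc(N * sizeof(void *));
        /* Indexed addressing: each element holds the index of the next one. */
        uint64_t *idxs = malloc(N * sizeof(uint64_t));
        for (int i = 0; i < N; i++) {
            ptrs[i] = &ptrs[(i + 1) % N];
            idxs[i] = (uint64_t)((i + 1) % N);
        }

        void **p = (void **)ptrs[0];
        double t0 = now_ns();
        for (unsigned long i = 0; i < CHASES; i++)
            p = (void **)*p;                   /* load [p] */
        double t1 = now_ns();

        uint64_t idx = 0;
        double t2 = now_ns();
        for (unsigned long i = 0; i < CHASES; i++)
            idx = idxs[idx];                   /* load [idxs + idx*8] */
        double t3 = now_ns();

        printf("plain   addressing: %.2f ns per load (p=%p)\n",
               (t1 - t0) / CHASES, (void *)p);
        printf("indexed addressing: %.2f ns per load (idx=%llu)\n",
               (t3 - t2) / CHASES, (unsigned long long)idx);
        return 0;
    }
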

clamchowder commented on SiFive's P550 Microarchitecture   chipsandcheese.com/p/insi... · Posted by u/rbanffy
phire · a year ago
Ah, that makes so much more sense.

So it does have a 2nd level BTB, it's just that it's labeled as IJTP and is potentially only used by indirect branches.

clamchowder · a year ago
No, it's not a second-level BTB, in the sense that regular direct branches don't seem to use it. It's only for predicting indirect branches.
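
A rough sketch of the kind of indirect branch test that exercises a structure like the IJTP (generic illustration, not part of the article's test suite): one indirect call site rotating through a varying number of targets. A predictor that only remembers the last target starts missing as soon as there's more than one, while a pattern-based indirect predictor can learn the rotation.

    #include <stdio.h>
    #include <time.h>

    typedef int (*fn)(int);
    static int f0(int x){return x+0;} static int f1(int x){return x+1;}
    static int f2(int x){return x+2;} static int f3(int x){return x+3;}
    static int f4(int x){return x+4;} static int f5(int x){return x+5;}
    static int f6(int x){return x+6;} static int f7(int x){return x+7;}
    static fn funcs[8] = {f0, f1, f2, f3, f4, f5, f6, f7};

    static double now_ns(void) {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1e9 + ts.tv_nsec;
    }

    int main(void) {
        volatile int sink = 0;
        const long iters = 50000000;
        /* One call site, varying number of distinct targets it cycles through. */
        for (int targets = 1; targets <= 8; targets *= 2) {
            double t0 = now_ns();
            for (long i = 0; i < iters; i++)
                sink += funcs[i % targets](1); /* indirect call, rotating target */
            double t1 = now_ns();
            printf("%d targets: %.2f ns per indirect call\n",
                   targets, (t1 - t0) / iters);
        }
        return 0;
    }
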
clamchowder commented on SiFive's P550 Microarchitecture   chipsandcheese.com/p/insi... · Posted by u/rbanffy
phire · a year ago
> Likely, P550 doesn’t have another BTB level. If a branch misses the 32 entry BTB, the core simply calculates the branch’s destination address when it arrives at the frontend

That seems unwise. It might work well enough for direct branches, but it's going to perform very badly on indirect branches. I would love to see some tests for indirect branch performance (static and dynamic) in your suite.

> When return stack capacity is exceeded, P550 sees a sharp spike in latency. That contrasts with A75’s more gentle increase in latency.

That might be a direct consequence of the P550's limited BTB. Even when the return stack overflows, the A75 can probably still predict the return as if it was an indirect branch, utilising its massive 3072 entry L1 BTB.

Actually, are you sure the P550 even has a return stack? 16 correctly predicted call/ret pairs just so happens to be what you would get from a 32 entry BTB predicting 16 calls then 16 returns.

clamchowder · a year ago
(author here) Just a 32 entry BTB is technically a possibility from microbenchmark results, but the EIC7700X datasheet straight up says:

"a branch prediction unit that is composed of a 32-entry Branch Target Buffer (BTB), a 9.1 KiB-entry Branch History Table (BHT), a 16-entry Return Address Stack (RAS), 512-entry Indirect Jump Target Predictor (IJTP), and a 16-entry Return Instruction Predictor"

clamchowder commented on SiFive's P550 Microarchitecture   chipsandcheese.com/p/insi... · Posted by u/rbanffy
klelatti · a year ago
From the SiFive website [1]

> The Performance P550 scales up to four-core complex configurations while delivering 30% higher performance in less than half the area of a comparable Arm® Cortex®-A75.

Dylan Patel wasn't impressed by these comparisons with A75 [2]

> @SiFive is claiming half the area and higher perf/GHz, but they are using 7nm and 100ns memory latency. Choosing to compare to the 10nm A75 on S845, notorious for its high latency at over 200ns. Purposely ignoring iso-node or other A75 comparisons.

And this analysis seems to be borne out in this Chips and Cheese post.

> As a step along that journey, P550 feels more comparable to one of Arm’s early out-of-order designs like Cortex A57. By the time A75 came out, Arm already accumulated substantial experience in designing out-of-order CPUs. Therefore, A75 is a well polished and well rounded core, aside from obvious sacrifices required for its low power and thermal budgets. P550 by comparison is rough around the edges.

So what to make of SiFive's claims? It seems quite an important claim / comparison.

[1] https://www.sifive.com/cores/performance-p550

[2] https://x.com/dylan522p/status/1415395415000817664

clamchowder · a year ago
(author here) I compared it to the A75 on the Snapdragon 670, not the 845. I chose that comparison because I have a Pixel 3a (my previous daily driver cell phone), and that's the only A75 core I had access to.
