pbsd · 5 months ago
Vector ALU instruction latencies are understandably listed as 2 and higher, but this is not strictly the case. From AMD's Zen 5 optimization manual [1], we have

    The floating point schedulers have a slow region, in the oldest entries of a scheduler and only when the scheduler is full. If an operation is in the slow region and it is dependent on a 1-cycle latency operation, it will see a 1 cycle latency penalty.
    There is no penalty for operations in the slow region that depend on longer latency operations or loads.
    There is no penalty for any operations in the fast region.
    To write a latency test that does not see this penalty, the test needs to keep the FP schedulers from filling up.
    The latency test could interleave NOPs to prevent the scheduler from filling up.
Basically, short vector code sequences that don't fill up the scheduler will have better latency.

[1] https://www.amd.com/content/dam/amd/en/documents/processor-t...

Dylan16807 · 5 months ago
So if you fill up the scheduler with a long line of dependent instructions, you experience a significant slowdown? I wonder why they decided to make it do that instead of limiting the size/fill by a bit, and what all the tradeoffs were.
monster_truck · 5 months ago
This matches my experience with Zen in basically any generation. Once you've used all of the tricks and exhausted all of the memory and storage bandwidth, you'll still have compute left.

It's often faster to use one core fewer than the number where you hit constraints, so that the processor can juggle threads between cores to balance the thermal load, as opposed to trying to keep it completely saturated.

Sesse__ · 5 months ago
I had real code that ran with IPC > 6 on Zen 3; I think that's the first time I've seen a modern CPU _really_ be ALU-bound. :-) But it was very unusual, and when I vectorized it, it behaved completely differently.
menaerus · 5 months ago
Zen 3 decode is 4-wide, plus an 8-wide uop cache, and the dispatch backend is 6 uops wide. Theoretically it shouldn't be possible to have IPC larger than 6.
tw1984 · 5 months ago
this is very interesting. any chance you have more concrete stats or results?

thanks

ksec · 5 months ago
Given how Apple's M4 core can access all of the L2 cache (it is shared) and has an SLC (System Level Cache), one could argue it is better to compare it to AMD's X3D variant on cache size. However, on Geekbench 6 it is still off by 30-40% per clock. Even if we assume zero performance improvement from M5, it would be a large jump for Zen 6 to catch up.

And that is also the case with Qualcomm's Oryon and ARM's own Cortex X93x series.

Still really looking forward to Zen 6 on server though. I can't wait to see 256 Zen 6c cores.

alberth · 5 months ago
Isn’t Zen fabbed on the nodes Apple used 2-3 years ago (since Apple pays TSMC for exclusive rights to the latest & greatest nodes)?
ksec · 5 months ago
Yes. N4, or 5nm class, compared to Apple's N3E, or 2nd-gen 3nm. But the gap in IPC remains the same regardless of node. On a newer node AMD could clock higher or use less energy, but it still wouldn't change the per-clock performance.

Not only is Zen 5 slower, it also uses more energy to achieve its results. Thinking about it that way, the gap is staggering.

kvemkon · 5 months ago
> AMD chips don't have an equivalent to Intel PT. We'd love to add support as soon as they make one. (2022) [1]

> since 2013, Intel offers a feature called "intel processor tracing" [2]

> [not answered]

> When will AMD cpus introduce Intel-PT tech or the Intel branch trace store feature? (2024) [3]

> [not answered]

Is Intel-PT over-engineered and not really needed in practice?

[1] https://github.com/janestreet/magic-trace/wiki/How-could-mag...

[2] https://community.amd.com/t5/pc-processors/amd-ipt-intelpt-i...

[3] https://community.amd.com/t5/pc-processors/will-amd-cpus-hav...

Sesse__ · 5 months ago
I've used Intel PT several times; it's completely unbeatable for some things.

In general, Intel is _way_ ahead of AMD in the performance monitoring game. For instance, IBS is a really poor replacement for PEBS (it still hits the wrong instructions, it just re-weights them, and this rarely goes well), which makes profiling anything branchy or memory-bound really hard. This is the only real reason why I still prefer to buy Intel CPUs myself (although I understand this is a niche use case!).

eigenform · 5 months ago
This reminds me: has anyone ever figured out why Zen 3 was missing memory renaming, but it came back in Zen 4 and Zen 5?
Tuna-Fish · 5 months ago
AMD had two leapfrogging CPU design teams. Memory renaming was added by the team that did Zen 2; presumably the Zen 3 team couldn't import it in time for some reason.
JackYoustra · 5 months ago
Any writeups on why they chose this system, whether it's still used today, etc.? I'm completely unfamiliar with this style of management.
qwertox · 5 months ago
At the bottom of the post is a link to a PDF of "The microarchitecture of Intel, AMD, and VIA CPUs - An optimization guide for assembly programmers and compiler makers" [0]

You might want to download it and just take a look at it so you know that this content exists.

[0] https://www.agner.org/optimize/microarchitecture.pdf

kklisura · 5 months ago
Are there any good resources on how one obtains all of this information?
rft · 5 months ago
The linked PDF in the post contains a section on how the values are measured and a link to the test suite. Search in [1] for "How the values were measured". For another project that measures the same/very similar values you can check out [2]. They have a paper about the tool they are using [3].

There is also AMD's "Software Optimization Guide", which might contain some background information. [4] has many of the guides as direct attachments, since AMD tends to break direct links. Intel should have similar docs, but I am currently more focused on AMD, so I only have those links at hand.

[1] https://www.agner.org/optimize/instruction_tables.pdf

[2] https://www.uops.info/background.html

[3] https://arxiv.org/abs/1911.03282

[4] https://bugzilla.kernel.org/show_bug.cgi?id=206537