The weirdest one of the bunch is the AMD EPYC 9175F: 16 cores with 512MB of L3 cache! Presumably this is for customers trying to minimize software costs that are based on "per-core" licensing. It really doesn't make much sense otherwise to have so few cores at such expense. Does Oracle still use this style of licensing? If so, they need to knock it off.
The only other thing I can think of is some purpose like HFT may need to fit a whole algorithm in L3 for absolute minimum latency, and maybe they want only the best core in each chiplet? It's probably about software licenses, though.
Another good example is any kind of discrete event simulation. Things like spiking neural networks are inherently single-threaded if you are simulating them accurately (i.e., serialized through the pending spike queue). Being able to keep all the state in local cache and picking the fastest core to do the job is the best possible arrangement. The ability to run 16 in parallel simply reduces the search space by the same factor. Worrying about inter-CCD latency isn't a thing for these kinds of problems. The amount of bandwidth between cores is minimal, even if we were doing something like a genetic algorithm with periodic crossover between physical cores.
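A minimal sketch of why the pending spike queue serializes everything: every event, whatever neuron it targets, funnels through one ordered pop. The topology, delay, and integrate-and-fire rule here are made up for illustration, not taken from any real simulator:

```python
import heapq

def simulate(spikes, connections, delay=1.0, threshold=1.0, t_end=10.0):
    """Event-driven spiking-network sketch: (time, neuron) events are
    popped strictly in time order, so the queue serializes the run."""
    queue = list(spikes)              # initial (time, neuron) events
    heapq.heapify(queue)
    potential = {}                    # membrane potential per neuron
    fired = []
    while queue:
        t, n = heapq.heappop(queue)   # the serializing step
        if t > t_end:
            break
        potential[n] = potential.get(n, 0.0) + 1.0
        if potential[n] >= threshold:
            potential[n] = 0.0
            fired.append((t, n))
            # schedule downstream spikes after a fixed axonal delay
            for target in connections.get(n, []):
                heapq.heappush(queue, (t + delay, target))
    return fired
```

Since every pop depends on the heap state left by the previous one, the hot loop is one core chasing pointers through cached state, which is exactly the workload a huge L3 and one fast core favor.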
Plenty of applications are single threaded and it's cheaper to spend thousands on a super fast CPU to run it as fast as possible than spend tens of thousands on a programmer to rewrite the code to be more parallel.
And like you say, plenty of times it is infeasible to rewrite the code because its third party code for which you don't have the source or the rights.
A couple years ago I noticed that some Xeons I was using had as much cache as the RAM in the systems I had growing up (millennial, so we’re not talking about ancient Commodores or whatever; real usable computers that could play Quake and everything).
But 512MB? That’s roomy. Could Puppy Linux just be held entirely in L3 cache?
CCDs can't access each other's L3 cache as their own (fabric penalty is too high to do that directly). Assuming it's anything like the 9174F that means it's really 8 groups of 2 cores that each have 64 MB of L3 cache. Still enormous, and you can still access data over the infinity fabric with penalties, but not quite a block of 512 MB of cache on a single 16 core block that it might sound like at first.
Zen 4 also had 96 MB per CCD variants like the 9184X, so 768 MB per socket, and they are dual socket, so you can end up with 1.5 GB of total L3 cache in a single machine! The downside being that beyond CCD<->CCD latencies you now also have socket<->socket latencies.
Many algorithms are limited by memory bandwidth. On my 16-core workstation I’ve run several workloads that have peak performance with less than 16 threads.
It’s common practice to test algorithms with different numbers of threads and then use the optimal number of threads. For memory-intensive algorithms the peak performance frequently comes in at a relatively small number of cores.
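That sweep is easy to automate. A minimal sketch with hypothetical names (`run_with_threads`, `find_best_thread_count`); note that in pure Python the GIL hides memory-bandwidth effects, so this shows the shape of the measurement rather than a realistic benchmark:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_threads(work, chunks, n_threads):
    """Time one pass over all chunks using n_threads worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, chunks))   # drain the iterator to completion
    return time.perf_counter() - start

def find_best_thread_count(work, chunks, max_threads):
    """Sweep 1..max_threads and return (best_n, {n: seconds})."""
    timings = {n: run_with_threads(work, chunks, n)
               for n in range(1, max_threads + 1)}
    return min(timings, key=timings.get), timings
```

In practice you would run each count several times and keep the median, since a single timing is noisy.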
Is this because of NUMA or is it L2 cache or something entirely different?
I worked on high perf around 10 years ago and at that point I would pin the OS and interrupt handling to a specific core, so I’d always lose one core. Testing led me to disable hyperthreading in our particular use case, so that was “cores” (really threads) halved.
A colleague had a nifty trick built on top of solarflare zero copy but at that time it required fairly intrusive kernel changes, which never totally sat well with me, but again I’d lose a 2nd core to some bookkeeping code that orchestrated that.
I’d then taskset the app to the other cores.
NUMA was a thing by then so it really wasn’t straightforward to eke out maximum performance. It became somewhat of a competition to see who could get the highest throughput, but usually those configurations were unusable due to unacceptable p99 latencies.
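On Linux the same pinning can also be done from inside the process instead of via an external tool; a minimal sketch (the "reserve the first CPU for the OS" layout is an assumption for illustration, not a recommendation):

```python
import os

# Linux-only: restrict the current process to every CPU except the
# first one, leaving that core free for the OS and interrupt handling.
all_cpus = sorted(os.sched_getaffinity(0))
app_cpus = set(all_cpus[1:]) or {all_cpus[0]}  # fall back on 1-CPU boxes
os.sched_setaffinity(0, app_cpus)
```

Interrupt affinity itself is steered separately, e.g. by writing CPU masks under /proc/irq/*/smp_affinity.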
Windows Server and MSSQL are licensed per core now. A lot of enterprise software is. They switched to per-core because before that it was based on CPU sockets. Not just Oracle.
Phoronix recently reviewed the 192-core Turin Dense against the 192-core AmpereOne.
* Ampere MSRP $5.5K vs $15K for the EPYC.
* Turin Dense had 1.6x better performance
* Ampere had 1.2x better energy consumption
In terms of actual $/perf, the 192-core Ampere is 1.7x better than the 192-core Turin Dense based on Phoronix's review.
So for $5.5k, you can either buy a 192-core AmpereOne CPU (274 W) or a 48-core Turin Dense CPU (300 W).
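The 1.7x figure checks out arithmetically; a quick sketch using only the MSRPs and relative-performance numbers quoted in this thread (not independently verified):

```python
# Prices and relative performance as quoted above.
ampere_price, epyc_price = 5_500, 15_000
epyc_relative_perf = 1.6   # Turin Dense vs AmpereOne, per Phoronix

# Normalize Ampere's performance to 1.0 and compare perf per dollar.
ampere_perf_per_dollar = 1.0 / ampere_price
epyc_perf_per_dollar = epyc_relative_perf / epyc_price
ratio = ampere_perf_per_dollar / epyc_perf_per_dollar
print(round(ratio, 2))  # → 1.7
```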
Ampere has a 256-core, 3 nm, 12-memory-channel part shipping next year that is likely to better challenge Turin Dense and Sierra Forest in raw performance. For now, their value proposition is $/perf.
Anyway, I'm very interested in how Qualcomm's Nuvia-based server chips will perform. Also, if ARM's client core improvements are any indication, I will be very interested in how in-house chips like AWS Graviton, Google Axion, Microsoft Cobalt, Nvidia Grace, Alibaba Yitian will compete with better Neoverse cores. Nuvia vs ARM vs AmpereOne.
This is probably the golden age of server CPUs. 7 years ago, it was only Intel's Xeon. Now you have numerous options.
AMD also wins on perf/watt, which is pretty notable for anyone who still believed that x86 could never challenge ARM/RISC designs in efficiency. These days, a lot of data centers are also more limited by available watts (and the associated cooling), which bodes well for Turin.
> In terms of actual $/perf, the 192-core Ampere is 1.7x better than the 192-core Turin Dense based on Phoronix's review.
You're comparing it to the highest MSRP Turin, which doesn't have the highest performance/$. People buy that one if they want to maximize density or performance/watt, where it bests Ampere. If you only care about performance/$ you would look at the lower core count Zen5 (rather than Zen5c) models which have twice the performance/$ of the 192-core 9965.
Doing the same for Ampere doesn't work because their 192-core 3.2GHz model is very nearly already their peak performance/$.
700+ threads over 2 cores, able to saturate two 400 GbE NICs, at 500 watts per chip (less than 2 watts per thread)... all of that in a 2U package. 20 years ago that would have been racks of gear.
I think those roughly 2 watts per thread matter a lot more than we home users usually think. Having to deliver less power, and to dissipate less heat, in a data centre is really good news for operating costs, which are usually a lot bigger than the purchase cost of the servers.
On the other hand, at the time we would have expected twenty years of progress to make the cores a thousand times faster. Instead that number is more like 5x.
On a different hand, the way things were scaling 20 years ago (1 GHz took 35 watts) we'd have 5,000 W processors; instead we have 192 cores for a few hundred watts. If these are anything like Threadripper, I wonder if they can unlock to 1000 W with liquid cooling. On the flip side, we are now at about 1 to 2 watts per core, which is wild. Also, can't some of these do 512-bit (AVX-512) vector instructions instead of just 32-bit?
I often think about huge, fancy cloud setups literally costing silly money to run, being replaced by a single beast of a machine powered by a modern, high core count CPU (say 48+), lots of RAM and lots of high performance enterprise-grade SSD storage.
The difference in throughput for local versus distributed orchestration would mainly come from serdes, networking, and switching. Serdes overhead can be substantial; networking and switching have been aggressively offloaded from the CPU through better hardware support.
Individual tasks would definitely have better latency, but I'd suspect the impact on throughput/CPU usage might be muted. Of course at the extremes (very small jobs, very large/complex objects being passed) you'd see big gains.
Nowadays most services can fit on a single server and serve millions of users a day. I wonder how that will affect overly expensive cloud services, when you can rent a beefy dedicated server for under a grand and make tens of thousands in savings (enough to hire a full-time administrator with plenty of money left over for other things).
Indeed: the first dual-core server chips only launched in 2005 afaik, with the 90nm Denmark/Italy/Egypt Opterons and Paxville Xeons, but on the Intel side it wasn't until 2007 that they were in full swing.
The first dual-core server chips showed up generally available in 2001 with IBM's POWER4, then HP's PA-RISC parts in 2004, and then the Opterons, which were followed by Intel's "emergency" design of essentially two "sockets" on one die for the NetBurst dual-core systems.
1-3rd gen Epycs can be had super cheap, but the motherboards are expensive.
Also not worth getting anything less than 3rd gen unless you're primarily buying them for the PCIe lanes and RAM capacity - a regular current-gen consumer CPU with half to a quarter of the core count will outperform them in compute while consuming significantly less power.
The reason for this is that CPU upgrades on the same board were/are very viable on SP3.
Doing that on Intel platforms basically just wasn't done, ever; it was never worth it. But upgrading from Naples or Rome to Milan is very appealing.
So used SP3 CPUs are much more common than the boards, simply because more of them were made. This is probably very bad for hobbyists; the boards are not going to get cheap until the entire platform is obsolete.
Lots of great second hand hardware to be had on ebay. Even last gen used CPUs, as well as RAM, at much less than retail.
However, when you end up building a server, quite often the motherboard + case are the cheap parts, the CPUs are second in cost, and the biggest expense can be the RAM.
IMO, 1st gen Epyc is not worth it given that 2nd gen exists, is more popular, and is cheap enough (I actually have an Epyc 7302 and an MZ31-AR0 motherboard as a homelab). Performance per core is too low, there are NUMA headaches, and it's on a worse node (2nd gen compute is 7nm TSMC).
ChipsAndCheese is one of the few new tech publications that really knows what they are talking about, especially with these deep dive benchmarks.
With the loss of AnandTech, TechReport, HardOCP and other old technical sites, I'm glad to see a new publisher who can keep up with the older style of stuff.
Chips and Cheese most reminds me of the long-gone LostCircuits. Most tech sites focus on a slate of application benchmarks, but C&C writes, and LC wrote, long-form articles about architecture, combined with subsystem micro-benchmarks.
https://www.mathworks.com/products/matlab-parallel-server/li....
Twenty years ago we had just 1-2 cores per CPU, so we were lucky to have 4 cores in a dual socket server.
A single server can now have almost 400 cores. Yes, we can have even more ARM cores but they don't perform as well as these do, at least for now.
800Gbps from a single server was achieved by Netflix on much lesser systems two years ago:
https://nabstreamingsummit.com/wp-content/uploads/2022/05/20...
If I were to guess, this hardware can do double that, helped by the fact that we now have actual 800 Gbps Ethernet hardware.
Indeed 20 years ago this would have been racks of gear at a very high cost and a huge power bill.
I assume you mean 2 sockets.
https://www.servethehome.com/amd-psb-vendor-locks-epyc-cpus-...
I snagged a Ryzen 9 5950X for £242.
At least for now.