The weirdest one of the bunch is the AMD EPYC 9175F: 16 cores with 512MB of L3 cache! Presumably this is for customers trying to minimize software costs that are based on "per-core" licensing. It really doesn't make much sense otherwise to have so few cores at such expense. Does Oracle still use this style of licensing? If so, they need to knock it off.
The only other thing I can think of is some purpose like HFT may need to fit a whole algorithm in L3 for absolute minimum latency, and maybe they want only the best core in each chiplet? It's probably about software licenses, though.
Another good example is any kind of discrete event simulation. Things like spiking neural networks are inherently single-threaded if you are simulating them accurately (i.e., serialized through the pending spike queue). Being able to keep all the state in local cache and picking the fastest core to do the job is the best possible arrangement. The ability to run 16 in parallel simply reduces the search space by the same factor. Worrying about inter-CCD latency isn't a thing for these kinds of problems. The amount of bandwidth between cores is minimal, even if we were doing something like a genetic algorithm with periodic crossover between physical cores.
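A minimal sketch of why the pending spike queue serializes everything: every event, whatever neuron it targets, funnels through one ordered pop. The topology, delay, and integrate-and-fire rule here are made up for illustration, not taken from any real simulator:

```python
import heapq

def simulate(spikes, connections, delay=1.0, threshold=1.0, t_end=10.0):
    """Event-driven spiking-network sketch: (time, neuron) events are
    popped strictly in time order, so the queue serializes the run."""
    queue = list(spikes)              # initial (time, neuron) events
    heapq.heapify(queue)
    potential = {}                    # membrane potential per neuron
    fired = []
    while queue:
        t, n = heapq.heappop(queue)   # the serializing step
        if t > t_end:
            break
        potential[n] = potential.get(n, 0.0) + 1.0
        if potential[n] >= threshold:
            potential[n] = 0.0
            fired.append((t, n))
            # schedule downstream spikes after a fixed axonal delay
            for target in connections.get(n, []):
                heapq.heappush(queue, (t + delay, target))
    return fired
```

Since every pop depends on the heap state left by the previous one, the hot loop is one core chasing pointers through cached state, which is exactly the workload a huge L3 and one fast core favor.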
Plenty of applications are single threaded and it's cheaper to spend thousands on a super fast CPU to run it as fast as possible than spend tens of thousands on a programmer to rewrite the code to be more parallel.
And like you say, plenty of times it is infeasible to rewrite the code because its third party code for which you don't have the source or the rights.
A couple years ago I noticed that some Xeons I was using had as much cache as the RAM in the systems I had growing up (millennial, so we’re not talking about ancient Commodores or whatever; real usable computers that could play Quake and everything).
But 512MB? That’s roomy. Could Puppy Linux just be held entirely in L3 cache?
CCDs can't access each other's L3 cache as their own (fabric penalty is too high to do that directly). Assuming it's anything like the 9174F that means it's really 8 groups of 2 cores that each have 64 MB of L3 cache. Still enormous, and you can still access data over the infinity fabric with penalties, but not quite a block of 512 MB of cache on a single 16 core block that it might sound like at first.
Zen 4 also had 96 MB per CCD variants like the 9184X, so 768 MB per socket, and they are dual socket, so you can end up with 1.5 GB of total L3 cache in a single machine! The downside being that beyond CCD<->CCD latencies you now also have socket<->socket latencies.
Many algorithms are limited by memory bandwidth. On my 16-core workstation I’ve run several workloads that have peak performance with less than 16 threads.
It’s common practice to test algorithms with different numbers of threads and then use the optimal number of threads. For memory-intensive algorithms the peak performance frequently comes in at a relatively small number of cores.
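That sweep is easy to automate. A minimal sketch with hypothetical names (`run_with_threads`, `find_best_thread_count`); note that in pure Python the GIL hides memory-bandwidth effects, so this shows the shape of the measurement rather than a realistic benchmark:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_with_threads(work, chunks, n_threads):
    """Time one pass over all chunks using n_threads worker threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(work, chunks))   # drain the iterator to completion
    return time.perf_counter() - start

def find_best_thread_count(work, chunks, max_threads):
    """Sweep 1..max_threads and return (best_n, {n: seconds})."""
    timings = {n: run_with_threads(work, chunks, n)
               for n in range(1, max_threads + 1)}
    return min(timings, key=timings.get), timings
```

In practice you would run each count several times and keep the median, since a single timing is noisy.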
Is this because of NUMA or is it L2 cache or something entirely different?
I worked on high perf around 10 years ago and at that point I would pin the OS and interrupt handling to a specific core, so I’d always lose one core. Testing led me to disable hyperthreading in our particular use case, so that was “cores” (really threads) halved.
A colleague had a nifty trick built on top of solarflare zero copy but at that time it required fairly intrusive kernel changes, which never totally sat well with me, but again I’d lose a 2nd core to some bookkeeping code that orchestrated that.
I’d then taskset the app to the other cores.
NUMA was a thing by then so it really wasn’t straightforward to eke out maximum performance. It became somewhat of a competition to see who could get the highest throughput, but usually those configurations were unusable due to unacceptable p99 latencies.
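On Linux the same pinning can also be done from inside the process instead of via an external tool; a minimal sketch (the "reserve the first CPU for the OS" layout is an assumption for illustration, not a recommendation):

```python
import os

# Linux-only: restrict the current process to every CPU except the
# first one, leaving that core free for the OS and interrupt handling.
all_cpus = sorted(os.sched_getaffinity(0))
app_cpus = set(all_cpus[1:]) or {all_cpus[0]}  # fall back on 1-CPU boxes
os.sched_setaffinity(0, app_cpus)
```

Interrupt affinity itself is steered separately, e.g. by writing CPU masks under /proc/irq/*/smp_affinity.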
Windows Server and MSSQL are licensed per core now. A lot of enterprise software is. They switched to per-core because before that it was based on CPU sockets. Not just Oracle.
Phoronix recently reviewed the 192-core Turin Dense against the 192-core AmpereOne.
* Ampere MSRP $5.5K vs $15K for the EPYC.
* Turin Dense had 1.6x better performance
* Ampere had 1.2x better energy consumption
In terms of actual $/perf, the 192-core Ampere is 1.7x better than the 192-core Turin Dense based on Phoronix's review.
So for $5.5k, you can either buy a 192-core AmpereOne CPU (274 W) or a 48-core Turin Dense CPU (300 W).
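The 1.7x figure checks out arithmetically; a quick sketch using only the MSRPs and relative-performance numbers quoted in this thread (not independently verified):

```python
# Prices and relative performance as quoted above.
ampere_price, epyc_price = 5_500, 15_000
epyc_relative_perf = 1.6   # Turin Dense vs AmpereOne, per Phoronix

# Normalize Ampere's performance to 1.0 and compare perf per dollar.
ampere_perf_per_dollar = 1.0 / ampere_price
epyc_perf_per_dollar = epyc_relative_perf / epyc_price
ratio = ampere_perf_per_dollar / epyc_perf_per_dollar
print(round(ratio, 2))  # → 1.7
```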
Ampere has a 256-core, 3 nm, 12-memory-channel part shipping next year that is likely to better challenge Turin Dense and Sierra Forest in raw performance. For now, their value proposition is $/perf.
Anyway, I'm very interested in how Qualcomm's Nuvia-based server chips will perform. Also, if ARM's client core improvements are any indication, I will be very interested in how in-house chips like AWS Graviton, Google Axion, Microsoft Cobalt, Nvidia Grace, Alibaba Yitian will compete with better Neoverse cores. Nuvia vs ARM vs AmpereOne.
This is probably the golden age of server CPUs. 7 years ago, it was only Intel's Xeon. Now you have numerous options.
AMD also wins on perf/watt, which is pretty notable for anyone who still believed that x86 could never challenge ARM/RISC designs in efficiency. These days, a lot of data centers are also more limited by available watts (and the associated cooling), which bodes well for Turin.
> In terms of actual $/perf, the 192-core Ampere is 1.7x better than the 192-core Turin Dense based on Phoronix's review.
You're comparing it to the highest MSRP Turin, which doesn't have the highest performance/$. People buy that one if they want to maximize density or performance/watt, where it bests Ampere. If you only care about performance/$ you would look at the lower core count Zen5 (rather than Zen5c) models which have twice the performance/$ of the 192-core 9965.
Doing the same for Ampere doesn't work because their 192-core 3.2GHz model is very nearly already their peak performance/$.
700+ threads over 2 cores, able to saturate two 400 GbE NICs, at 500 watts per chip (less than 2 watts per thread)... all of that in a 2U package. 20 years ago that would have been racks of gear.
I think those roughly 2 watts per thread matter a lot more than we home users usually think. Having to deliver less power, and to dissipate less heat, in a data centre is really good news for operating costs, which are usually a lot bigger than the purchase cost of the servers.
On the other hand, at the time we would have expected twenty years of progress to make the cores a thousand times faster. Instead that number is more like 5x.
On a different hand, the way things were scaling 20 years ago (1 GHz took 35 watts) we'd have 5,000 W processors; instead we have 192 cores for a few hundred watts. If these are anything like Threadripper, I wonder if they can unlock to 1000 W with liquid cooling. On the flip side, we are now at about 1 to 2 watts per core, which is wild. Also, can't some of these do 512-bit (AVX-512) vector instructions instead of just 32-bit?
I often think about huge, fancy cloud setups literally costing silly money to run, being replaced by a single beast of a machine powered by a modern, high core count CPU (say 48+), lots of RAM and lots of high performance enterprise-grade SSD storage.
The difference in throughput for local versus distributed orchestration would mainly come from serdes, networking, and switching. Serdes overhead can be substantial; networking and switching have been aggressively offloaded from the CPU through better hardware support.
Individual tasks would definitely have better latency, but I'd suspect the impact on throughput/CPU usage might be muted. Of course at the extremes (very small jobs, very large/complex objects being passed) you'd see big gains.
Nowadays most services can fit on a single server and serve millions of users a day. I wonder how that will affect overly expensive cloud services, when you can rent a beefy dedicated server for under a grand and make tens of thousands in savings (enough to hire a full-time administrator with plenty of money left over for other things).
Indeed: the first dual-core server chips only launched in 2005 afaik, with the 90nm Denmark/Italy/Egypt Opterons and Paxville Xeons, but on the Intel side it wasn't until 2007 that they were in full swing.
The first dual-core server chips showed up generally available in 2001 with IBM's POWER4, then HP's PA-RISC parts in 2004, and then the Opterons, which were followed by Intel's "emergency" design of essentially two "sockets" on one die for the NetBurst dual-core systems.
1-3rd gen Epycs can be had super cheap, but the motherboards are expensive.
Also not worth getting anything less than 3rd gen unless you're primarily buying them for the PCIe lanes and RAM capacity - a regular current-gen consumer CPU with half to a quarter of the core count will outperform them in compute while consuming significantly less power.
The reason for this is that CPU upgrades on the same board were/are very viable on SP3.
Doing that on Intel platforms basically just wasn't done, ever; it was never worth it. But upgrading from Naples or Rome to Milan is very appealing.
So used SP3 CPUs are much more common than the boards, simply because more of them were made. This is probably very bad for hobbyists; the boards are not going to get cheap until the entire platform is obsolete.
Lots of great second hand hardware to be had on ebay. Even last gen used CPUs, as well as RAM, at much less than retail.
However, when you end up building a server, quite often the motherboard + case are the cheap parts, the CPUs are second in cost, and the biggest expense can be the RAM.
IMO, 1st gen Epyc is not worth it given that 2nd gen exists, is more popular, and is cheap enough (I actually have an Epyc 7302 and an MZ31-AR0 motherboard as a homelab). Performance per core is too low, there are NUMA headaches, and it's on a worse node (2nd gen compute is 7nm TSMC).
ChipsAndCheese is one of the few new tech publications that really knows what they are talking about, especially with these deep dive benchmarks.
With the loss of AnandTech, TechReport, HardOCP and other old technical sites, I'm glad to see a new publisher who can keep up with the older style of stuff.
Chips and Cheese most reminds me of the long-gone LostCircuits. Most tech sites focus on a slate of application benchmarks, but C&C writes, and LC wrote, long-form articles about architecture, combined with subsystem micro-benchmarks.
https://www.mathworks.com/products/matlab-parallel-server/li....
Twenty years ago we had just 1-2 cores per CPU, so we were lucky to have 4 cores in a dual socket server.
A single server can now have almost 400 cores. Yes, we can have even more ARM cores but they don't perform as well as these do, at least for now.
800Gbps from a single server was achieved by Netflix on much lesser systems two years ago:
https://nabstreamingsummit.com/wp-content/uploads/2022/05/20...
If I were to guess, this hardware can do double that, helped by the fact that we now have actual 800 Gbps Ethernet hardware.
Indeed 20 years ago this would have been racks of gear at a very high cost and a huge power bill.
I assume you mean 2 sockets.
https://www.servethehome.com/amd-psb-vendor-locks-epyc-cpus-...
I snagged a Ryzen 9 5950X for £242.
At least for now.