haunter · 5 months ago
>768 GB of DDR5-5200. The 12 memory controllers on the IO die provide a 768-bit memory bus, so the setup provides just under 500 GB/s of theoretical bandwidth

I know it's a server but I'd be so ready to use all of that as RAM disk. Crazy amount at a crazy high speed. Even 1% would be enough just to play around with something.
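The "just under 500 GB/s" figure follows directly from the bus width and transfer rate quoted above; a quick sanity check in Python:

```python
# Theoretical DDR5 bandwidth: channel count * channel width (bytes) * transfers/s
channels = 12
bits_per_channel = 64                 # 12 x 64-bit channels = the 768-bit bus
mt_per_s = 5200e6                     # DDR5-5200: 5200 MT/s
bus_bytes = channels * bits_per_channel // 8   # 96-byte-wide bus
bandwidth = bus_bytes * mt_per_s
print(f"{bandwidth / 1e9:.1f} GB/s")  # 499.2 GB/s, i.e. "just under 500"
```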

ksec · 5 months ago
I have been waiting for Netflix, which uses FreeBSD, to serve video at 1600 Gb/s. They announced their 800 Gbps record in 2021, when they were limited by CPU and memory bandwidth. With 500 GB/s, that's pretty much no longer a bottleneck.
NaomiLehman · 5 months ago
damn, that's a lot of gigabytes for a movie
mtoner23 · 5 months ago
For our build servers for devs we use roughly this setup as a RAM disk. It's amazing. Build times are lightning fast (compared to HDD/SSD).
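For anyone wanting to try RAM-backed build scratch space on an ordinary Linux box, a minimal sketch (no dedicated RAM disk needed; `/dev/shm` is a tmpfs mount on most distros, with a temp-dir fallback so it still runs elsewhere):

```python
import os
import pathlib
import tempfile

# /dev/shm is a RAM-backed tmpfs on most Linux systems; fall back to the
# regular temp dir on systems without it so the sketch still runs.
ramdir = "/dev/shm" if os.path.isdir("/dev/shm") else tempfile.gettempdir()
scratch = pathlib.Path(ramdir) / "build-scratch.bin"
scratch.write_bytes(os.urandom(1 << 20))  # 1 MiB of scratch data in RAM
print(scratch, scratch.stat().st_size)
scratch.unlink()                          # clean up; tmpfs space is RAM
```

A real build-server setup would instead mount a sized tmpfs over the build directory, but the file I/O pattern is the same.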
privatelypublic · 5 months ago
I'm interested in... why? What are you building where loading data from disk is so lopsided vs. CPU load from compiling, or network load/latency? (One 200 ms "is this the current git repo?" round trip is a heck of a lot of NVMe latency... and it's going to be closer to 2 s than 200 ms.)
skhameneh · 5 months ago
12 memory channels per CPU, and DDR5-6400 may be supported (for reference, I found incorrect specs when looking at Epyc CPU retail listings some weeks ago); see https://www.amd.com/en/products/processors/server/epyc/9005-...
tehlike · 5 months ago
I have 1TB ram on my home server. It's 2666 though...
WarOnPrivacy · 5 months ago
> I have 1TB ram on my home server. It's 2666 though...

this kit? https://www.newegg.com/nemix-ram-1tb/p/1X5-003Z-01930

saltcured · 5 months ago
Man, here I am in 2025 and my home server is a surplus Thinkpad P70 with just 64 GB RAM...
summarity · 5 months ago
> Crazy amount at a crazy high speed

That's 300 GB/s slower than my old Mac Studio (M1 Ultra). Memory speeds in 2025 remain thoroughly unimpressive outside of high-end GPUs and fully integrated systems.

AnthonyMouse · 5 months ago
The server systems have that much memory bandwidth per socket. Also, that generation supports DDR5-6400 but they were using DDR5-5200. Using the faster stuff gets you 614GB/s per socket, i.e. a dual socket system with DDR5-6400 is >1200GB/s. And in those systems that's just for the CPU; a GPU/accelerator gets its own.

The M1 Ultra doesn't have 800GB/s because it's "integrated", it simply has 16 channels of DDR5-6400, which it could have whether it was soldered or not. And none of the more recent Apple chips have any more than that.

It's the GPUs that use integrated memory, i.e. GDDR or HBM. That actually gets you somewhere -- the RTX 5090 has 1.8TB/s with GDDR7, the MI300X has 5.3TB/s with HBM3. But that stuff is also more expensive which limits how much of it you get, e.g. the MI300X has 192GB of HBM3, whereas normal servers support 6TB per socket.

And it's the same problem with Apple even though there's no great reason for it to be. The 2019 Intel Xeon Mac Pro supported 1.5TB of RAM -- still in slots -- but the newer ones barely reach a third of that at the top end.
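The per-socket numbers above can be checked the same way, using the channel counts and speeds stated in this comment:

```python
def ddr_bw_gbs(channels, mt_per_s, bits_per_channel=64):
    """Theoretical bandwidth: channel width in bytes times transfer rate."""
    return channels * (bits_per_channel // 8) * mt_per_s / 1e9

epyc = ddr_bw_gbs(12, 6400e6)      # one socket, 12 channels of DDR5-6400
m1_ultra = ddr_bw_gbs(16, 6400e6)  # 16 channels, as described above
print(epyc, 2 * epyc, m1_ultra)    # 614.4, 1228.8 (dual socket), 819.2
```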

matja · 5 months ago
Do you have a benchmark that shows the M1 Ultra CPU to memory throughput?
elorant · 5 months ago
Even better, you could use it for inference; with that much RAM you could load any model.
bigiain · 5 months ago
Indeed. I wonder what a system like that would cost (at consumer available prices)?
magicalhippo · 5 months ago
From what I can find here in Norway the CPU would be $3800, mobo around $2000, and one stick of 64 GB 6400 MHz registered ECC runs about $530, so about $6400 for the full 768 GB. Couldn't find any kits for those.

So just those components would be just over $12k.

That's just from regular consumer shops, and includes 25% VAT. Without the VAT it's about $9800.

The problem for consumers is that just about all the shops that sell such components, and that you might get a deal from, are geared towards companies and not interested in dealing with consumers due to consumer protection laws.
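Tallying the quoted prices shows where the "just over $12k" and "about $9800" figures come from:

```python
# Norwegian retail prices quoted above, in USD, including 25% VAT
cpu, mobo, stick = 3800, 2000, 530
sticks = 768 // 64                     # twelve 64 GB RDIMMs for 768 GB
total = cpu + mobo + sticks * stick
print(total)                           # 12160: "just over $12k"
print(round(total / 1.25))             # 9728: "about $9800" excluding VAT
```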

ashvardanian · 5 months ago
Those are extremely uniform latencies. It seems that on these CPUs most of the benefit from NUMA-aware thread pools will come from reduced contention (synchronizing small subsets of cores) rather than from actual memory affinity.
PunchyHamster · 5 months ago
Well, all of the memory hangs off the IO die. I remember AMD docs outright recommending that the processor be configured to hide NUMA nodes from the workload, since trying to optimize for them might not do anything for a lot of workloads.
phire · 5 months ago
That AMD slide (in the conclusion) claims their switching fabric has some kind of bypass mode to improve latency when utilisation is low.

So they have been really optimising that IO die for latency.

NUMA is already workload-sensitive: you need to benchmark your exact workload to know whether it's worth enabling, and this change is probably going to make it even less worthwhile. It sounds like you'd need a workload that really pushes total memory bandwidth for NUMA to pay off.

afr0ck · 5 months ago
NUMA is only useful if you have multiple sockets, because then you have several I/O dies and you want your workload 1) to be closer to the I/O device and 2) to avoid crossing the socket interconnect. Within the same socket, all CPUs share the same I/O die, hence the uniform latency.
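On Linux you can inspect this topology directly; a quick sysfs sketch (Linux-only paths, and the node count depends on how NPS partitioning is configured in firmware):

```python
import glob
import os

# Enumerate NUMA nodes and the CPUs attached to each via sysfs (Linux).
# A single-socket part with node partitioning disabled shows just node0.
for node in sorted(glob.glob("/sys/devices/system/node/node[0-9]*")):
    cpulist = open(os.path.join(node, "cpulist")).read().strip()
    print(os.path.basename(node), "cpus:", cpulist)
```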
flumpcakes · 5 months ago
The first picture has a typo on its left-hand side.

It says 16 cores per die with up to 16 Zen 5 dies per chip. For Zen 5 it's 8 cores per die and 16 dies per chip, giving a total of 128 cores.

For Zen 5c it's 16 cores per die and 12 dies per chip, giving a total of 192 cores.

Weirdly, it's correct on the right side of the image.

iberator · 5 months ago
Is it true that EPYC doesn't use a program counter, in the sense that the next instruction's address is in the second operand for some operations?
nine_k · 5 months ago
EPYC runs x64 code. In it, jump instructions work exactly as you describe.
themafia · 5 months ago
Well, there are relative jumps, so a global program counter has to exist on some level.
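That "some level" can be made concrete with a toy interpreter: a relative jump only makes sense as a displacement applied to a program counter that already points past the current instruction, which is how x86-64 rel8/rel32 branches are defined. A sketch (illustrative only, not real x86 encodings):

```python
def run(program, steps=20):
    """Tiny VM with a program counter. 'jrel' adds a signed displacement
    to the address of the *next* instruction, like x86 relative jumps."""
    pc, acc = 0, 0
    for _ in range(steps):
        if pc >= len(program):
            break
        op, arg = program[pc]
        pc += 1                  # PC now points past the current instruction
        if op == "add":
            acc += arg
        elif op == "jrel":
            pc += arg            # relative jump: displacement from next insn
    return acc

# A two-instruction loop: the backward displacement -2 re-runs the add
prog = [("add", 1), ("jrel", -2)]
print(run(prog, steps=6))        # add/jrel alternate 3 times -> prints 3
```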