I hope they design, build and sell a true 256-1024+ multicore CPU with local memories that appears as an ordinary desktop computer with a unified memory space for under $1000.
I've written about it at length and I'm sure that anyone who's seen my comments is sick of me sounding like a broken record. But there's truly a vast realm of uncharted territory there. I believe that transputers and reprogrammable logic chips like FPGAs failed because we didn't have languages like Erlang/Go and GNU Octave/MATLAB to orchestrate a large number of processes or handle SIMD/MIMD simultaneously. Modern techniques like passing by value via copy-on-write (used by UNIX forking, PHP arrays and Clojure state) were suppressed when mainstream imperative languages using pointers and references captured the market. And it's really hard to beat Amdahl's law when we're worried about side effects. I think that anxiety is what inspired Rust, but there are so many easier ways of avoiding those problems in the first place.
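To make the copy-on-write point concrete, here's a minimal sketch using UNIX fork from Python (assumes a POSIX system; CPython's refcounting dirties some pages on read, so the sharing isn't perfect):

    import os

    big = list(range(10_000_000))   # large structure built once in the parent

    pid = os.fork()
    if pid == 0:
        # Child: pages are shared copy-on-write, so "receiving" the whole list
        # costs nothing up front; this write only copies the pages it touches,
        # and the parent never sees the change.
        big[0] = -1
        print("child sees:", big[0])
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        print("parent still sees:", big[0])   # still 0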
High bandwidth memory on-package with 352 AMD Zen 4 cores!
With 7 TB/s memory bandwidth, it’s basically an x86 GPU.
This is the future of high performance computing. It used to be available only for supercomputers but it’s trickling down to cloud VMs you can rent for reasonable money. Eventually it’ll be standard for workstations under your desk.
it's kind of concerning that it's only available as a hosted product. Not good news for anyone that needs to run on-prem for confidentiality or availability reasons.
I just want to leave this breadcrumb showing possible markets and applications for high-performance computing (HPC), specifically regarding SpiNNaker, which simulates neural nets (NNs) as processes communicating via spike trains rather than as matrices performing gradient descent:
https://news.ycombinator.com/item?id=44201812 (Sandia turns on brain-like storage-free supercomputer)
https://blocksandfiles.com/2025/06/06/sandia-turns-on-brain-... (working implementation of 175,000 cores)
https://www.theregister.com/2017/10/19/steve_furber_arm_brai... (towards 1 million+ cores)
https://www.youtube.com/watch?v=z1_gE_ugEgE (518,400 cores as of 2016)
https://arxiv.org/pdf/1911.02385 (towards 10 million+ cores)
https://docs.hpc.gwdg.de/services/neuromorphic-computing/spi... (HPC programming model)
I'd use a similar approach but probably add custom memory controllers that calculate hashes for a unified content-addressable memory, so that arbitrary network topologies can be used. That way the computer could be expanded as necessary and run over the internet without modification. I'd also write something like a microkernel to expose the cores and memory as a unified desktop computing environment, then write the Python HPC programming model over that and make it optional. Then users could orchestrate the bare metal however they wish with containers, forked processes, etc.
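As a toy sketch of the content-addressable idea (the names here are made up, nothing to do with SpiNNaker's actual controllers): blocks are stored and fetched by the hash of their contents, so it doesn't matter which node, or which machine across the internet, physically holds them.

    import hashlib

    class ContentStore:
        def __init__(self):
            # digest -> bytes; a real system would shard this across nodes
            self._blocks = {}

        def put(self, data: bytes) -> str:
            digest = hashlib.sha256(data).hexdigest()
            self._blocks[digest] = data
            return digest   # the "address" is derived from the content itself

        def get(self, digest: str) -> bytes:
            return self._blocks[digest]

    store = ContentStore()
    addr = store.put(b"weights for neuron 42")
    assert store.get(addr) == b"weights for neuron 42"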
-
A possible threat to the HPC market would be to emulate MIMD under SIMD by breaking ordinary imperative machine code up into parallelizable immutable (functional) sections bordered by IO, handled by some kind of monadic or one-shot logic that prepares inputs and obtains outputs between the functional portions. That way individual neurons, agents for genetic algorithms, etc. could be written in C-style or Lisp-style code that's transpiled to run on SIMD GPUs. This is an open problem that I'm having trouble finding published papers for:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4611137 (has PDF preview and download)
Without code examples, I'd estimate MIMD->SIMD performance to be between 1-2 orders of magnitude faster than a single-threaded CPU and 1-2 orders of magnitude slower than a GPU. Similar to scripting languages vs native code. My spidey sense is picking up so many code smells around this approach though that I suspect it may never be viable.
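A very rough sketch of the execution model I mean, with NumPy standing in for the GPU (this is not a transpiler, just the shape of the output): each agent's branchy step becomes one pure, data-parallel function where the branch is emulated with a mask (predication), and IO/scheduling lives between the pure steps.

    import numpy as np

    def pure_step(state: np.ndarray) -> np.ndarray:
        # per-agent logic: if the value is even, halve it; otherwise 3n + 1
        even = state % 2 == 0
        return np.where(even, state // 2, 3 * state + 1)

    state = np.arange(1, 1025)        # 1024 "agents" with different inputs
    for _ in range(50):               # IO / orchestration would live out here,
        state = pure_step(state)      # between the purely functional steps
    print(int(state.min()), int(state.max()))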
-
I'd compare the current complexities around LLMs running on SIMD GPUs to trying to implement business logic as a spaghetti of state machines instead of coroutines running conditional logic and higher-order methods via message passing. Loosely that means that LLMs will have trouble evolving and programming their own learning models. Whereas HPC doesn't have those limitations, because potentially every neuron can learn and evolve on its own like in the real world.
So a possible bridge between MIMD and SIMD would be to transpile CPU machine code coroutines to GPU shader state machines:
https://news.ycombinator.com/item?id=18704547
https://eli.thegreenplace.net/2009/08/29/co-routines-as-an-a...
In the end, they're equivalent. But a multi-page LLM specification could be reduced down to a bunch of one-liners because we can reason about coroutines at a higher level of abstraction than state machines.
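Here's what I mean by the equivalence, as a toy: the same accumulator written once as a coroutine (a Python generator) and once as the explicit state machine you'd get after "transpiling" it. They behave identically; the coroutine is just easier to reason about.

    def accumulator_coroutine():
        total = 0
        while True:
            item = yield total        # suspend, wait for the next input
            if item is None:          # end-of-stream sentinel
                return
            total += item

    class AccumulatorStateMachine:
        def __init__(self):
            self.state = "RUNNING"
            self.total = 0
        def send(self, item):
            if self.state != "RUNNING" or item is None:
                self.state = "DONE"
                raise StopIteration
            self.total += item
            return self.total

    co = accumulator_coroutine()
    next(co)                          # prime the coroutine
    sm = AccumulatorStateMachine()
    for x in (1, 2, 3):
        assert co.send(x) == sm.send(x)   # same observable behaviour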
If you have 256-1024+ multicore CPUs they will probably have a fake unified memory space that's really a lot more like NUMA underneath. Not too different from how GPU compute works under the hood. And it would let you write seamless parallel code by just using Rust.
The challenges that arise when you have a massively parallel system are well understood by now. It is hard to keep all processing units doing something useful rather than waiting for memory or other processing units.
Once you follow the logical steps to increase utilization/efficiency you end up with something like a GPU, and that comes with the programming challenges that we have today.
In other words, it's not like CPU architects didn't think of that. Instead, there are good reasons for the status quo.
One of the biggest problems with CPUs is legacy. Tie yourself to any legacy, and now you're spending millions of transistors to make sure some way that made sense ages ago still works.
Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight-core Ryzen 9700X has around 12 billion, roughly 1.5 billion per core. The difference in clock speed is roughly 80 times, and the difference in transistors per core is about 1,250 times.
These are wild generalizations, but let's ask ourselves: If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times (even taking hyperthreading into account) faster than an i80486 at the same clock? 500 times? 100 times?
It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.
So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
Modern CPUs don't actually execute the legacy instructions, they execute core-native instructions and have a piece of silicon dedicated to translating the legacy instructions into them. That piece of silicon isn't that big. Modern CPUs use more transistors because transistors are a lot cheaper now, e.g. the i486 had 8KiB of cache, the Ryzen 9700X has >40MiB. The extra transistors don't make it linearly faster but they make it faster enough to be worth it when transistors are cheap.
Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.
> If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times (even taking hyperthreading into account) faster than an i80486 at the same clock? 500 times? 100 times?
Would be interesting to see a benchmark on this.
If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU will perform out-of-order execution, running some instructions in parallel even in single-core, single-threaded execution, not to mention superior branch prediction and more cache.
If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.
> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
I doubt you'd get significantly more performance, though you'd likely gain power efficiency.
Half of what you described in your hypothetical instruction set is already implemented in ARM.
Clock speed is about 50x and IPC, let's say, 5-20x. So it's roughly 500x faster.
In terms of FLOPS, Ryzen is ~1,000,000 times faster than a 486.
For serial branchy code, it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code: architecture and transistor counts can only improve serial execution sublinearly; the linear gains came from Dennard scaling.
It is worth noting, though, that purely via Dennard scaling a Ryzen is already >100x faster! And via architecture (those transistors) it is several multiples beyond that.
In general compute, if you could clock it down to 33 or 66 MHz, a Ryzen would still be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in a single serial program that a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.
The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.
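A trivial illustration of that last point: the same reduction written with a serial dependency chain, and then restructured into independent chunks that could run on separate cores (chunk size and pool setup here are arbitrary choices, not recommendations).

    from multiprocessing import Pool

    def chunk_sum(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))

        # Serial: every addition depends on the previous one.
        acc = 0
        for x in data:
            acc += x

        # Restructured: the chunk sums are independent work, so they can run
        # on separate cores; only the tiny final combine is serial.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
        with Pool() as pool:
            assert sum(pool.map(chunk_sum, chunks)) == acc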
that's basically x86 without 16 and 32 bit support, no real mode etc.
CPU starts initialized in 64bit without all that legacy crap.
that's IMO a great idea. I think every few decades we need to stop, think again about what works best, and either take a fresh start or drop some legacy unused features.
RISC-V has only a mandatory basic set of instructions, as little as possible to be Turing complete, and everything else is an extension that can (theoretically) be removed in the future.
this also could be used to remove legacy parts without disrupting the architecture
Would be interesting to compare transistor count without L3 (and perhaps L2) cache.
16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
It is weird that the best consumer GPU can do 4 TFLOPS FP64. Some years ago GPUs were an order of magnitude or more faster than CPUs. Today GPUs are likely to be artificially limited.
[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575
> 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even if memory bandwidth is not an issue you'll have to worry about thermals. The theoretical CPU compute ceiling is almost never the real bottleneck. GPUs have a very different architecture, with higher memory bandwidth, and they run their chips a lot slower and cooler (lower clock frequency), so they can reach much higher numbers in practical scenarios.
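To put rough numbers on the bandwidth point (the memory bandwidth below is an assumed ballpark for a desktop part, not a spec-sheet figure): a streaming FP64 kernel that does one flop per 8-byte operand hits the bandwidth wall far below the quoted compute ceiling.

    peak_flops = 2.0e12        # ~2 TFLOPS FP64 peak, as quoted above
    mem_bw = 1.0e11            # assume ~100 GB/s of DRAM bandwidth
    flops_per_byte = 1 / 8     # e.g. summing an array: one add per 8-byte load

    sustained = min(peak_flops, mem_bw * flops_per_byte)
    print(f"{sustained / 1e9:.1f} GFLOP/s, "
          f"{100 * sustained / peak_flops:.2f}% of peak")
    # -> about 12.5 GFLOP/s, i.e. well under 1% of the 2 TFLOPS ceiling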
But history showed exactly the opposite: if you don't have an already existing software ecosystem, you are dead. The transistors for implementing x86 peculiarities are very much worth it if people in the market want x86.
GPUs scaled wide, with a per-core transistor count similar to a 486 but just lots more cores: thousands to tens of thousands of them, averaging out to maybe 5 million transistors per core.
CPUs scaled tall, with specialized instructions to make the single thread go faster. No, the amount done per transistor does not scale anywhere near linearly; very many of the transistors are dark on any given cycle, compared to a much simpler core that would have much higher utilization.
> Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.
For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.
If you have a very large CPU count, then I think you could dedicate a CPU to only process a given designated privacy/security-focused execution thread, perhaps via a specially designed syscall.
That kind of takes the Spectre/Meltdown thing out of the way to some degree, I would think, although privilege elevation can happen in the darndest places.
But maybe I'm being too optimistic.
The real issue with complex insn decoding is that it's hard to make the decode stage wider and at some point this will limit the usefulness of a bigger chip. For instance, AArch64 chips tend to have wider decode than their close x86_64 equivalents.
And from each individual core:
- 25% per core L1/L2 cache
- 25% vector stuff (SSE, AVX, ...)
- from the remaining 50% only about 20% is doing instruction decoding
https://www.techpowerup.com/img/AFnVIoGFWSCE6YXO.jpg
Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...
So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)
CPUs can’t do that, but legacy is irrelevant. They just don’t have enough parallelism to leverage all these extra transistors. Let’s compare the 486 with a modern GPU.
Intel 80486 with 1.2M transistors delivered 0.128 flops / cycle.
nVidia 4070 Ti Super with 45.9B transistors delivers 16896 flops / cycle.
As you see, each transistor became 3.45 times more efficient at delivering these FLOPs per cycle.
I should've written per core.
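For anyone who wants to check the 3.45x figure from the quoted numbers:

    flops_486, transistors_486 = 0.128, 1.2e6
    flops_4070, transistors_4070 = 16896, 45.9e9

    flops_ratio = flops_4070 / flops_486                    # ~132,000x more flops/cycle
    transistor_ratio = transistors_4070 / transistors_486   # ~38,250x more transistors
    print(round(flops_ratio / transistor_ratio, 2))         # -> 3.45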
correct me if I am wrong but isn't that what was tried with the Intel Itanium processor line, only the smarter compilers and assemblers never quite got there.
and spending millions on patent lawsuits ...
what makes it more likely to work this time?
Optimizing compiler technology was still in the stone age (arguably still is) when Itanium was released. LLVM had just been born and GCC didn't start using SSA until 2005. E-graphs were unheard of in the context of compiler optimization.
That said, yesterday I saw gcc generate 5 KB of mov instructions because it couldn't gracefully handle a particular vector size so I wouldn't get my hopes up...
I tire of “Employees from Y company leave to start their own” and even “Ex-Y employees launch new W”.
How many times do we have to see these stories play out to realize it doesn’t matter where they came from. These big companies employ a lot of people of varying skill; having it on your resume means almost nothing IMHO.
Just look at the Humane pin full of “ex-Apple employees”, how’d that work out? And that’s only one small example.
I hope IO (OpenAI/Jony Ive) fails spectacularly enough that we have an even better example to point to, so we can dispel the idea that doing something impressive early in your career, or working for an impressive company, means you will continue to do so.
I immediately red-flag anyone who advertises themselves as "ex-company". It shows a lack of character, judgment, and, probably, actual results/contributions. Likewise, it shows that they're probably not a particularly independent thinker; they're just following the herd of people who describe themselves like that (whose Venn diagram surely overlaps considerably with people who describe themselves as "creatives", as if a car mechanic working on a rusty bolt, or a kindergarten teacher, or anyone else, is not creative).
Moreover, if the ex company was so wonderful and they were so integral to it, why aren't they still there? If they did something truly important, why not just advertise that (and I'm putting aside here qualms about overt advertising rather than something more subtle, authentic, organic).
Yeah... "Combined 100 years of experience", and in the previous article [0] it was "combined 80+ years" for the same people... What happened there? Accelerated aging?
[0] https://news.ycombinator.com/item?id=41353155
There are other forces at play. The same happens with video games. People can't see how many "variables" must be tuned in order to get noticed and become successful.
And that's not sarcasm, I'm serious.
I wish them success, plus I hope they do not do what Intel did with its add-ons.
Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.
https://en.wikipedia.org/wiki/Intel_Management_Engine
https://en.wikipedia.org/wiki/Intel_Active_Management_Techno...
> Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.
The architecture is independent of additional silicon with separate functions. The "only" thing which makes RISC-V open is that the specifications are freely available and freely usable.
Intel ME is, by design, separate from the actual CPU. Whether the CPU uses x86 or RISC-V is essentially irrelevant.