I hope they design, build and sell a true 256-1024+ multicore CPU with local memories that appears as an ordinary desktop computer with a unified memory space for under $1000.
I've written about it at length and I'm sure that anyone who's seen my comments is sick of me sounding like a broken record. But there's truly a vast realm of uncharted territory there. I believe that transputers and reprogrammable logic chips like FPGAs failed because we didn't have languages like Erlang/Go and GNU Octave/MATLAB to orchestrate a large number of processes or handle SIMD/MIMD simultaneously. Modern techniques like passing by value via copy-on-write (used by UNIX forking, PHP arrays and Clojure state) were suppressed when mainstream imperative languages using pointers and references captured the market. And it's really hard to beat Amdahl's law when we're worried about side effects. I think that anxiety is what inspired Rust, but there are so many easier ways of avoiding those problems in the first place.
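To make the copy-on-write point concrete, here's a minimal sketch using UNIX fork from Python (assumes a POSIX system; CPython's refcounting dirties some pages on read, so the sharing isn't perfect):

    import os

    big = list(range(10_000_000))   # large structure built once in the parent

    pid = os.fork()
    if pid == 0:
        # Child: pages are shared copy-on-write, so "receiving" the whole list
        # costs nothing up front; this write only copies the pages it touches,
        # and the parent never sees the change.
        big[0] = -1
        print("child sees:", big[0])
        os._exit(0)
    else:
        os.waitpid(pid, 0)
        print("parent still sees:", big[0])   # still 0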
High bandwidth memory on-package with 352 AMD Zen 4 cores!
With 7 TB/s memory bandwidth, it’s basically an x86 GPU.
This is the future of high performance computing. It used to be available only for supercomputers but it’s trickling down to cloud VMs you can rent for reasonable money. Eventually it’ll be standard for workstations under your desk.
it's kind of concerning that it's only available as a hosted product. Not good news for anyone that needs to run on-prem for confidentiality or availability reasons.
I just want to leave this breadcrumb showing possible markets and applications for high-performance computing (HPC), specifically regarding SpiNNaker, which simulates neural nets (NNs) as processes communicating via spike trains rather than as matrices performing gradient descent:
https://news.ycombinator.com/item?id=44201812 (Sandia turns on brain-like storage-free supercomputer)
https://blocksandfiles.com/2025/06/06/sandia-turns-on-brain-... (working implementation of 175,000 cores)
https://www.theregister.com/2017/10/19/steve_furber_arm_brai... (towards 1 million+ cores)
https://www.youtube.com/watch?v=z1_gE_ugEgE (518,400 cores as of 2016)
https://arxiv.org/pdf/1911.02385 (towards 10 million+ cores)
https://docs.hpc.gwdg.de/services/neuromorphic-computing/spi... (HPC programming model)
I'd use a similar approach but probably add custom memory controllers that calculate hashes for a unified content-addressable memory, so that arbitrary network topologies can be used. That way the computer could be expanded as necessary and run over the internet without modification. I'd also write something like a microkernel to expose the cores and memory as a unified desktop computing environment, then write the Python HPC programming model over that and make it optional. Then users could orchestrate the bare metal however they wish with containers, forked processes, etc.
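As a toy sketch of the content-addressable idea (the names here are made up, nothing to do with SpiNNaker's actual controllers): blocks are stored and fetched by the hash of their contents, so it doesn't matter which node, or which machine across the internet, physically holds them.

    import hashlib

    class ContentStore:
        def __init__(self):
            # digest -> bytes; a real system would shard this across nodes
            self._blocks = {}

        def put(self, data: bytes) -> str:
            digest = hashlib.sha256(data).hexdigest()
            self._blocks[digest] = data
            return digest   # the "address" is derived from the content itself

        def get(self, digest: str) -> bytes:
            return self._blocks[digest]

    store = ContentStore()
    addr = store.put(b"weights for neuron 42")
    assert store.get(addr) == b"weights for neuron 42"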
-
A possible threat to the HPC market would be to emulate MIMD under SIMD by breaking ordinary imperative machine code up into parallelizable immutable (functional) sections bordered by IO, handled by some kind of monadic or one-shot logic that prepares inputs and obtains outputs between the functional portions. That way individual neurons, agents for genetic algorithms, etc. could be written in C-style or Lisp-style code that's transpiled to run on SIMD GPUs. This is an open problem that I'm having trouble finding published papers for:
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4611137 (has PDF preview and download)
Without code examples, I'd estimate MIMD->SIMD performance to be between 1-2 orders of magnitude faster than a single-threaded CPU and 1-2 orders of magnitude slower than a GPU. Similar to scripting languages vs native code. My spidey sense is picking up so many code smells around this approach though that I suspect it may never be viable.
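A very rough sketch of the execution model I mean, with NumPy standing in for the GPU (this is not a transpiler, just the shape of the output): each agent's branchy step becomes one pure, data-parallel function where the branch is emulated with a mask (predication), and IO/scheduling lives between the pure steps.

    import numpy as np

    def pure_step(state: np.ndarray) -> np.ndarray:
        # per-agent logic: if the value is even, halve it; otherwise 3n + 1
        even = state % 2 == 0
        return np.where(even, state // 2, 3 * state + 1)

    state = np.arange(1, 1025)        # 1024 "agents" with different inputs
    for _ in range(50):               # IO / orchestration would live out here,
        state = pure_step(state)      # between the purely functional steps
    print(int(state.min()), int(state.max()))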
-
I'd compare the current complexities around LLMs running on SIMD GPUs to trying to implement business logic as a spaghetti of state machines instead of coroutines running conditional logic and higher-order methods via message passing. Loosely that means that LLMs will have trouble evolving and programming their own learning models. Whereas HPC doesn't have those limitations, because potentially every neuron can learn and evolve on its own like in the real world.
So a possible bridge between MIMD and SIMD would be to transpile CPU machine code coroutines to GPU shader state machines:
https://news.ycombinator.com/item?id=18704547
https://eli.thegreenplace.net/2009/08/29/co-routines-as-an-a...
In the end, they're equivalent. But a multi-page LLM specification could be reduced down to a bunch of one-liners because we can reason about coroutines at a higher level of abstraction than state machines.
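Here's what I mean by the equivalence, as a toy: the same accumulator written once as a coroutine (a Python generator) and once as the explicit state machine you'd get after "transpiling" it. They behave identically; the coroutine is just easier to reason about.

    def accumulator_coroutine():
        total = 0
        while True:
            item = yield total        # suspend, wait for the next input
            if item is None:          # end-of-stream sentinel
                return
            total += item

    class AccumulatorStateMachine:
        def __init__(self):
            self.state = "RUNNING"
            self.total = 0
        def send(self, item):
            if self.state != "RUNNING" or item is None:
                self.state = "DONE"
                raise StopIteration
            self.total += item
            return self.total

    co = accumulator_coroutine()
    next(co)                          # prime the coroutine
    sm = AccumulatorStateMachine()
    for x in (1, 2, 3):
        assert co.send(x) == sm.send(x)   # same observable behaviour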
If you have 256-1024+ multicore CPUs they will probably have a fake unified memory space that's really a lot more like NUMA underneath. Not too different from how GPU compute works under the hood. And it would let you write seamless parallel code by just using Rust.
The challenges that arise when you have a massively parallel system are well understood by now. It is hard to keep all processing units doing something useful rather than waiting for memory or other processing units.
Once you follow the logical steps to increase utilization/efficiency you end up with something like a GPU, and that comes with the programming challenges that we have today.
In other words, it's not like CPU architects didn't think of that. Instead, there are good reasons for the status quo.
One of the biggest problems with CPUs is legacy. Tie yourself to any legacy, and now you're spending millions of transistors to make sure some way that made sense ages ago still works.
Just as a thought experiment, consider the fact that the i80486 has 1.2 million transistors. An eight-core Ryzen 9700X has around 12 billion, roughly 1.5 billion per core. The difference in clock speed is roughly 80 times, and the difference in transistors per core is about 1,250 times.
These are wild generalizations, but let's ask ourselves: If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times (even taking hyperthreading into account) faster than an i80486 at the same clock? 500 times? 100 times?
It doesn't, because massive amounts of those transistors go to keeping things in sync, dealing with changes in execution, folding instructions, decoding a horrible instruction set, et cetera.
So what might we be able to do if we didn't need to worry about figuring out how long our instructions are? Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
Modern CPUs don't actually execute the legacy instructions, they execute core-native instructions and have a piece of silicon dedicated to translating the legacy instructions into them. That piece of silicon isn't that big. Modern CPUs use more transistors because transistors are a lot cheaper now, e.g. the i486 had 8KiB of cache, the Ryzen 9700X has >40MiB. The extra transistors don't make it linearly faster but they make it faster enough to be worth it when transistors are cheap.
Modern CPUs also have a lot of things integrated into the "CPU" that used to be separate chips. The i486 didn't have on-die memory or PCI controllers etc., and those things were themselves less complicated then (e.g. a single memory channel and a shared peripheral bus for all devices). The i486SX didn't even have a floating point unit. The Ryzen 9000 series die contains an entire GPU.
> If a Ryzen takes 1,250 times the transistors for one core, does one core run 1,250 times (even taking hyperthreading into account) faster than an i80486 at the same clock? 500 times? 100 times?
Would be interesting to see a benchmark on this.
If we restricted it to 486 instructions only, I'd expect the Ryzen to be 10-15x faster. The modern CPU will perform out-of-order execution, running some instructions in parallel even in single-core, single-threaded execution, not to mention superior branch prediction and more cache.
If you allowed modern instructions like AVX-512, then the speedup could easily be 30x or more.
> Would we get to 1,250 times faster for 1,250 times the number of transistors? No. Would we get a lot more performance than we get out of a contemporary x86 CPU? Absolutely.
I doubt you'd get significantly more performance, though you'd likely gain power efficiency.
Half of what you described in your hypothetical instruction set is already implemented in ARM.
Clock speed is about 50x and IPC, let's say, 5-20x. So it's roughly 500x faster.
In terms of FLOPS, Ryzen is ~1,000,000 times faster than a 486.
For serial branchy code, it isn't a million times faster, but that has almost nothing to do with legacy and everything to do with the nature of serial code: architecture and transistor counts can only improve serial execution sublinearly; the linear gains came from Dennard scaling.
It is worth noting, though, that purely via Dennard scaling a Ryzen is already >100x faster! And via architecture (those transistors) it is several multiples beyond that.
In general compute, if you could clock it down to 33 or 66 MHz, a Ryzen would still be much faster than a 486, due to using those transistors for ILP (instruction-level parallelism) and TLP (thread-level parallelism). But you won't see any TLP in a single serial program that a 486 would have been running, and you won't get any of the SIMD benefits either, so you won't get anywhere near that in practice on 486 code.
The key to contemporary high performance computing is having more independent work to do, and organizing the data/work to expose the independence to the software/hardware.
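A trivial illustration of that last point: the same reduction written with a serial dependency chain, and then restructured into independent chunks that could run on separate cores (chunk size and pool setup here are arbitrary choices, not recommendations).

    from multiprocessing import Pool

    def chunk_sum(chunk):
        return sum(chunk)

    if __name__ == "__main__":
        data = list(range(1_000_000))

        # Serial: every addition depends on the previous one.
        acc = 0
        for x in data:
            acc += x

        # Restructured: the chunk sums are independent work, so they can run
        # on separate cores; only the tiny final combine is serial.
        chunks = [data[i:i + 100_000] for i in range(0, len(data), 100_000)]
        with Pool() as pool:
            assert sum(pool.map(chunk_sum, chunks)) == acc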
that's basically x86 without 16 and 32 bit support, no real mode etc.
CPU starts initialized in 64bit without all that legacy crap.
that's IMO a great idea. I think every few decades we need to stop, think again about what works best, and either take a fresh start or drop some legacy unused features.
RISC-V has only a mandatory basic set of instructions, as little as possible to be Turing complete, and everything else is an extension that can (theoretically) be removed in the future.
this also could be used to remove legacy parts without disrupting the architecture
Would be interesting to compare transistor count without L3 (and perhaps L2) cache.
16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
It is weird that the best consumer GPU can do 4 TFLOPS FP64. Some years ago GPUs were an order of magnitude or more faster than CPUs. Today GPUs are likely to be artificially limited.
[1] https://www.techpowerup.com/gpu-specs/radeon-pro-vii.c3575
> 16-core Zen 5 CPU achieves more than 2 TFLOPS FP64. So number crunching performance scaled very well.
These aren't realistic numbers in most cases because you're almost always limited by memory bandwidth, and even if memory bandwidth is not an issue you'll have to worry about thermals. The theoretical CPU compute ceiling is almost never the real bottleneck. GPUs have a very different architecture, with higher memory bandwidth, and they run their chips a lot slower and cooler (lower clock frequency), so they can reach much higher numbers in practical scenarios.
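To put rough numbers on the bandwidth point (the memory bandwidth below is an assumed ballpark for a desktop part, not a spec-sheet figure): a streaming FP64 kernel that does one flop per 8-byte operand hits the bandwidth wall far below the quoted compute ceiling.

    peak_flops = 2.0e12        # ~2 TFLOPS FP64 peak, as quoted above
    mem_bw = 1.0e11            # assume ~100 GB/s of DRAM bandwidth
    flops_per_byte = 1 / 8     # e.g. summing an array: one add per 8-byte load

    sustained = min(peak_flops, mem_bw * flops_per_byte)
    print(f"{sustained / 1e9:.1f} GFLOP/s, "
          f"{100 * sustained / peak_flops:.2f}% of peak")
    # -> about 12.5 GFLOP/s, i.e. well under 1% of the 2 TFLOPS ceiling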
But history showed exactly the opposite: if you don't have an already existing software ecosystem, you are dead. The transistors for implementing x86 peculiarities are very much worth it if people in the market want x86.
GPUs scaled wide, with a per-core transistor count similar to a 486 but just lots more cores: thousands to tens of thousands of them, averaging out to maybe 5 million transistors per core.
CPUs scaled tall, with specialized instructions to make the single thread go faster. No, the amount done per transistor does not scale anywhere near linearly; very many of the transistors are dark on any given cycle, compared to a much simpler core that would have much higher utilization.
> Didn't need to deal with Spectre and Meltdown issues? If we made out-of-order work in ways where much more could be in flight and the compilers / assemblers would know how to avoid stalls based on dependencies, or how to schedule dependencies? What if we took expensive operations, like semaphores / locks, and built solutions into the chip?
I'm pretty sure that these goals will conflict with one another at some point. For example, the way one solves Spectre/Meltdown issues in a principled way is by changing the hardware and system architecture to have some notion of "privacy-sensitive" data that shouldn't be speculated on. But this will unavoidably limit the scope of OOO and the amount of instructions that can be "in-flight" at any given time.
For that matter, with modern chips, semaphores/locks are already implemented with hardware builtin operations, so you can't do that much better. Transactional memory is an interesting possibility but requires changes on the software side to work properly.
If you have a very large CPU count, then I think you could dedicate a CPU to only process a given designated privacy/security-focused execution thread, perhaps via a specially designed syscall.
That kind of takes the Spectre/Meltdown thing out of the way to some degree, I would think, although privilege elevation can happen in the darndest places.
But maybe I'm being too optimistic.
The real issue with complex insn decoding is that it's hard to make the decode stage wider and at some point this will limit the usefulness of a bigger chip. For instance, AArch64 chips tend to have wider decode than their close x86_64 equivalents.
And from each individual core:
- 25% per core L1/L2 cache
- 25% vector stuff (SSE, AVX, ...)
- from the remaining 50% only about 20% is doing instruction decoding
https://www.techpowerup.com/img/AFnVIoGFWSCE6YXO.jpg
Zen 3 example: https://www.reddit.com/r/Amd/comments/jqjg8e/quick_zen3_die_...
So, more like 85%, or around 6 orders of magnitude difference from your guess. ;)
CPUs can’t do that, but legacy is irrelevant. They just don’t have enough parallelism to leverage all these extra transistors. Let’s compare the 486 with a modern GPU.
Intel 80486 with 1.2M transistors delivered 0.128 flops / cycle.
nVidia 4070 Ti Super with 45.9B transistors delivers 16896 flops / cycle.
As you see, each transistor became 3.45 times more efficient at delivering these FLOPs per cycle.
I should've written per core.
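For anyone who wants to check the 3.45x figure from the quoted numbers:

    flops_486, transistors_486 = 0.128, 1.2e6
    flops_4070, transistors_4070 = 16896, 45.9e9

    flops_ratio = flops_4070 / flops_486                    # ~132,000x more flops/cycle
    transistor_ratio = transistors_4070 / transistors_486   # ~38,250x more transistors
    print(round(flops_ratio / transistor_ratio, 2))         # -> 3.45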
correct me if I am wrong but isn't that what was tried with the Intel Itanium processor line, only the smarter compilers and assemblers never quite got there.
and spending millions on patent lawsuits ...
what makes it more likely to work this time?
Optimizing compiler technology was still in the stone age (arguably still is) when Itanium was released. LLVM had just been born and GCC didn't start using SSA until 2005. E-graphs were unheard of in the context of compiler optimization.
That said, yesterday I saw gcc generate 5 KB of mov instructions because it couldn't gracefully handle a particular vector size so I wouldn't get my hopes up...
I tire of “Employees from Y company leave to start their own” and even “Ex-Y employees launch new W”.
How many times do we have to see these stories play out to realize it doesn’t matter where they came from. These big companies employ a lot of people of varying skill; having it on your resume means almost nothing IMHO.
Just look at the Humane pin full of “ex-Apple employees”, how’d that work out? And that’s only one small example.
I hope IO (OpenAI/Jony Ive) fails spectacularly enough that we have an even better example to point to, so we can dispel the idea that doing something impressive early in your career, or working for an impressive company, means you will continue to do so.
I immediately red-flag anyone who advertises themselves as "ex-company". It shows a lack of character, judgment, and, probably, actual results/contributions. Likewise, it shows that they're probably not a particularly independent thinker; they're just following the herd of people who describe themselves like that (whose Venn diagram surely overlaps considerably with people who describe themselves as "creatives", as if a car mechanic working on a rusty bolt, or a kindergarten teacher, or anyone else, is not creative).
Moreover, if the ex company was so wonderful and they were so integral to it, why aren't they still there? If they did something truly important, why not just advertise that (and I'm putting aside here qualms about overt advertising rather than something more subtle, authentic, organic).
Yeah... "Combined 100 years of experience", and in the previous article [0] it was "combined 80+ years" for the same people... What happened there? Accelerated aging?
[0] https://news.ycombinator.com/item?id=41353155
There are other forces at play. The same happens with video games. People can't see how many "variables" must be tuned in order to get noticed and become successful.
And that's not sarcasm, I'm serious.
I wish them success, plus I hope they do not do what Intel did with its add-ons.
Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.
https://en.wikipedia.org/wiki/Intel_Management_Engine
https://en.wikipedia.org/wiki/Intel_Active_Management_Techno...
> Hoping for an open system (which I think RISC-V is) and nothing even close to Intel ME or AMT.
The architecture is independent of additional silicon with separate functions. The "only" thing which makes RISC-V open is that the specifications are freely available and freely usable.
Intel ME is, by design, separate from the actual CPU. Whether the CPU uses x86 or RISC-V is essentially irrelevant.