"And we ask: if your matrix multiply is smaller than 16x16, are you sure what you’re doing is AI?
From a philosophical point of view, we think a frame shift is in order. A “register” certainly shouldn’t be a 32-bit word like on the CPUs of old. And a 1024-bit wide vector register, as CUDA uses, is certainly a step in the right direction. But to us a “register” is a 16x16 tile of data. We think AI wants this."
The hardware needs of AI are starting to come into focus. GPUs, after all, were designed for an entirely different job. They're used for AI because they have good matrix multiply hardware. "AI GPUs" get to leave out some of the stuff in a real GPU (does an H100 even have texture fill units?). Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit? That will settle out at some point. This paper indicates that hardware that likes 16x16 tiles makes a lot of sense. It's certainly possible to build such hardware. Someone reading this is probably writing it in VHDL right now, or will be soon.
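For a sense of what "a register is a 16x16 tile" already looks like in practice, here is a minimal CUDA sketch using the existing warp-level WMMA intrinsics (16x16x16 fragments). The kernel name and launch shape are illustrative only; the article's own kernels are of course far more elaborate.

```cuda
#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

// One warp, one 16x16 tile per operand: the "register" really is the tile.
__global__ void tile_mma_16x16(const half* A, const half* B, float* C) {
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> c;

    wmma::fill_fragment(c, 0.0f);
    wmma::load_matrix_sync(a, A, 16);      // 16 = leading dimension of the 16x16 tile
    wmma::load_matrix_sync(b, B, 16);
    wmma::mma_sync(c, a, b, c);            // the whole warp cooperates on one tile multiply
    wmma::store_matrix_sync(C, c, 16, wmma::mem_row_major);
}
// Launched with a single warp, e.g. tile_mma_16x16<<<1, 32>>>(dA, dB, dC);
```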
Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
GPUs have already evolved to be AI machines with as little baggage as possible. People have been arguing that GPUs were old technology and therefore unsuited for AI since at least 2014 (when Nervana was founded), but what they perhaps didn't expect is how quickly the GPU would evolve into an AI machine.
Bill Dally from Nvidia argues that there is "no gain in building a specialized accelerator", in part because the current overhead on top of the arithmetic is in the ballpark of 20% (16% for IMMA and 22% for HMMA units):
https://www.youtube.com/watch?v=gofI47kfD28
> Then we'll see somewhat simpler, less general, and cheaper devices that do "AI" operations with as little excess hardware baggage as possible. Nice.
Apple has already been doing this for a few years now. The NPU is totally different from the GPU or CPU on the die itself[1]. Nvidia is likely working on this as well, but I think a device that's a gaming/entertainment/crypto/AI bundle (i.e. sticking with the video card) is probably a better business move.
[1] https://github.com/hollance/neural-engine/blob/master/docs/a...
The NPUs on a lot of different systems occupy an awkward spot. For extremely small models, they're the way to go for low-power inference. But once you reach LLM or vision transformer size, it makes a lot more sense to switch to GPU shaders for that extra bit of large-model performance. For stuff like Llama and Stable Diffusion, those Neural Engines are practically wasted silicon. The biggest saving grace is projects like ONNX attempting to sew them into a unified non-15-competing-standards API, but even that won't change how underpowered they are.
Nvidia escapes this by designing their GPU architecture to incorporate NPU concepts at a fundamental level. It's less redundant silicon and enables you to scale a single architecture instead of flip-flopping to whichever one is most convenient.
> On kernels such as flash attention, TMA and the L2 cache are both fast enough to hide these problems reasonably well. But to make full use of the hardware, memory requests must be coalesced and bank conflicts avoided.
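For anyone unfamiliar with those two constraints, the classic shared-memory transpose is the textbook illustration (this is the standard CUDA-samples pattern, not the article's flash-attention kernel):

```cuda
// Coalescing: consecutive threads read consecutive global addresses.
// Bank conflicts: the extra padding column (33 instead of 32) keeps column accesses conflict-free.
__global__ void transpose_32x32(const float* __restrict__ in, float* __restrict__ out, int n) {
    __shared__ float tile[32][33];
    int x = blockIdx.x * 32 + threadIdx.x;
    int y = blockIdx.y * 32 + threadIdx.y;
    if (x < n && y < n)
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced global load
    __syncthreads();
    x = blockIdx.y * 32 + threadIdx.x;                    // swap block offsets for the write
    y = blockIdx.x * 32 + threadIdx.y;
    if (x < n && y < n)
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced store, conflict-free smem read
}
// Launch: dim3 block(32, 32); dim3 grid((n + 31) / 32, (n + 31) / 32);
```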
The depth of the competition is also starting to become apparent. There’s no way the documentation error was totally an accident. Diagrams are the easiest to steal / copy and there must have been some utility for nvidia to have left this in place. Remember when Naveen Rao’s Nervana was writing NVidia Maxwell drivers that out-performed NVidia’s own? Not every documentation mishap in a high-growth product is a competition counter-measure, but given that the researchers spent so long reverse-engineering wgmma and given the China-US political situation of the H100 in particular, it seems NVidia is up to its old tricks to protect its moat.
So don’t over-study the H100 peculiarities, as “what hardware does AI want?” really encompasses the commercial situation as well.
I don't understand. If they document their stuff with errors, it will hurt users, whether they're Chinese or American. Or is the expectation that US users will call Nvidia to ask for the correct documentation?
Wait, but Nvidia tensor cores are exactly the hardware that likes 16x16 tiles, no? I thought that was the whole point? The hardware is already here, and I'm sceptical that there's another order of magnitude in performance to be gained from even more specialized designs.
It's going to be awkward in consumer hardware either way.
If you segregate AI units from the GPU, the thing is that both AI and graphics will continue to need massive amounts of matrix multiplication and as little memory latency as possible.
The move to have more of it wrapped in the GPU makes sense, but at least in the short and medium term most devices won't be able to justify the gargantuan silicon wafer space / die growth that this would entail - also, Nvidia's tech is currently ahead and they don't make state-of-the-art x86 or ARM CPUs.
For the time being I think the current paradigm makes the most sense, with small compute devices making inroads in the consumer market as non-generalist computers - note that more AI-oriented pseudo-GPUs have existed and been successful since the earlier Nvidia Tesla lineup and then the so-called "Nvidia Data Center GPUs".
> Then there's a trend towards much shorter numbers. 16 bit floating point? 8 bit? 2 bit? 1 bit?
There was that recent paper titled "The Era of 1-bit LLMs" [0], which was actually suggesting a 1.58-bit LLM (2 bits in practice).
> Someone reading this is probably writing it in VHDL right now, or will be soon.
Yeah, I think I'm in the "will be soon" camp - FPGA board has been ordered. Especially with the 2-bit data types outlined in that paper [0] and the further details in [1], there's really a need for custom hardware to do that 2-bit math efficiently. Customizing one of the simpler open-source RISC-V integer implementations seems like something to try here, adding tiled matrix registers and custom instructions for dealing with them (with the 2-bit data types).
[0] https://arxiv.org/abs/2402.17764 [1] https://github.com/microsoft/unilm/blob/master/bitnet/The-Er...
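To make the 2-bit angle concrete, here is one possible packing (a toy encoding of my own, not the storage format from [0] or [1]): ternary weights in {-1, 0, +1} stored two bits each, with the dot product reduced to adds and subtracts - exactly the kind of operation a custom instruction could apply to a whole packed word at once.

```cuda
#include <cstdint>

// Assumed encoding: 0b00 -> 0, 0b01 -> +1, 0b10 -> -1.
// Sixteen ternary weights fit in one 32-bit word; no multiplier is needed.
__host__ __device__ int ternary_dot(const uint32_t* w_packed, const int8_t* x, int n) {
    int acc = 0;
    for (int i = 0; i < n; ++i) {
        uint32_t code = (w_packed[i >> 4] >> ((i & 15) * 2)) & 0x3u;
        if (code == 1)      acc += x[i];   // weight +1
        else if (code == 2) acc -= x[i];   // weight -1
        // code 0: weight is zero, contributes nothing
    }
    return acc;
}
```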
> NVIDIA’s lies. This is an extraordinarily misleading representation of the actual 128b swizzled wgmma layout. This diagram cost us three weeks of life that we will not get back, hence the public shaming.
Wondering if anyone would be surprised that a huge amount of progress in AI is on the engineering side (optimizing matmuls), and that a huge portion of the engineering is about reverse engineering NVIDIA chips
Architecture doesn't make a difference. Big enough models trained with big enough data tend to give the same results regardless of architecture. So yes, most advances in AI are mostly due to the fact we can now multiply matrices very fast.
That's not completely true. The architecture must behave well for scaling, which is not trivial. Basic multi-layer perceptrons do not scale well, for example: the gradient will vanish or explode deeper in the network.
Idk, they do give the same results, but given the memory bottleneck it feels like we're at a point where architecture innovations matter again. For example, check out the DeepSeek-V2 tech report: they modified the model architecture specifically for lower-cost inference (by making the KV cache smaller).
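Rough numbers on why the KV cache is the thing to shrink. The formula below is the standard one for plain multi-head / grouped-query attention; the model dimensions in the example are made up for illustration, not DeepSeek-V2's.

```cuda
#include <cstddef>

// Host-side back-of-the-envelope helper: bytes of KV cache for a vanilla attention stack.
// 2 (K and V) x layers x kv_heads x head_dim x sequence length x batch x bytes per element.
// MLA-style architectures compress the K/V representation, shrinking exactly this term.
size_t kv_cache_bytes(size_t layers, size_t kv_heads, size_t head_dim,
                      size_t seq_len, size_t batch, size_t bytes_per_elem) {
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem;
}
// Hypothetical example: 32 layers, 8 KV heads, head_dim 128, 8192-token context, batch 1, fp16:
// 2 * 32 * 8 * 128 * 8192 * 1 * 2 bytes ~= 1.07 GB of cache per sequence.
```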
There was some awareness reading the article, yet "we're warping through the quadrant in our tensor accelerator" is pretty Trek.
Have had that thought occasionally with some of the other articles. What it must read like to somebody who follows a link to an article over here - like wandering into some Trek nerd convention discussing warp cores.
I believe that reducing the power consumption and increasing the speed of AI inference will be best served by switching to analog, approximate circuits. We don't need perfect floating-point multiplication and addition, we just need something that takes two input voltages and produces an output voltage that is close enough to what multiplying the inputs would yield.
I know someone working in this direction; they've described the big challenges as:
* Finding ways to use extant chip fab technology to produce something that can do analog logic. I've heard CMOS flash presented as a plausible option.
* Designing something that isn't an antenna.
* You would likely have to finetune your model for each physical chip you're running it on (the manufacturing tolerances aren't going to give exact results)
The big advantage is that instead of using 16 wires to represent a float16, you use the voltage on 1 wire to represent that number (which plausibly has far more precision than a float32). Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU, so the die space & power savings are potentially many, many orders of magnitude.
> which plausibly has far more precision than a float32
If that was true, then a DRAM cell could represent 32 bits instead of one bit. But the analog world is noisy and lossy, so you couldn't get anywhere near 32 bits of precision/accuracy.
Yes, very carefully designed analog circuits can get over 20 bits of precision, say A/D converters, but they are huge (relative to digital circuits), consume a lot of power, have low bandwidth as compared to GHz digital circuits, and require lots of shielding and power supply filtering.
This is spit-balling, but the precision you can get from the kinds of circuits you could build for a neural-network-type chip is certainly under 8 bits, maybe 6 bits. But it gets worse. Unlike digital circuits, where a signal can be copied losslessly, a chain of analog circuits compounds the noise and accuracy losses stage by stage. To make it work you'd need frequent requantization to prevent getting nothing but mud out.
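A toy numerical model of that compounding (host-side, my own illustration; it captures only the additive-noise assumption, not real circuit behavior):

```cuda
#include <cmath>
#include <random>

// Push a value through `stages` noisy operations. With requant_bits == 0 the error just
// accumulates (its std-dev grows roughly as sigma * sqrt(stages)); with requantization,
// any per-stage error below half an LSB is wiped out before it can compound.
double noisy_chain(double x, int stages, double sigma, int requant_bits, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, sigma);
    const double levels = std::pow(2.0, requant_bits);
    for (int s = 0; s < stages; ++s) {
        x += noise(rng);                          // each stage adds its own error
        if (requant_bits > 0)
            x = std::round(x * levels) / levels;  // snap back to the nearest representable level
    }
    return x;
}
```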
> which plausibly has far more precision than a float32
+/- 1e-45 to 3.4e38. Granted, roughly half of the representable values are between -1 and 1.
When we worked with low power silicon, much of the optimization was running with minimal headroom - no point railing the bits 0/1 when .4/.6 will do just fine.
> Additionally, you can e.g. wire two values directly together rather than loading numbers into an ALU
You may want an adder. Wiring two circuit outputs directly together makes them fight, which is usually bad for signals.
An analog value in such a chip has far, far less resolution than a float32. Maybe you get 16 bits of resolution, more likely 8, and your multiplications are going to be quite imprecise. The whole thing hinges on the models being tolerant of that.
I think we're far away from analog circuits being practically useful, but one place where we might embrace the tolerance for imprecision is in noisy digital circuits. Accepting that one in a million, say, bits in an output will be flipped to achieve a better performance/power ratio. Probably not when working with float32s, where a single infinity[1] could totally mess things up, but for int8s the occasional 128 when you wanted a 0 seems like something that should be tolerable.
[1] Are H100s' matrix floating point units actually IEEE 754 compliant? I don't actually know.
I'd go a step further, something which resembles how "wet brains" (biological) actually work, but which could be produced easily.
Biological neural networks are nowhere near as connected as ANNs, which are typically fully connected. With biological neurons, the ingress / egress factors are < 10. So they are highly local
It is also an entirely different model, as there is no such thing as backpropagation in biology (that we know of).
What they do have in lieu of backpropagation is feedback (cycles).
And maybe there are support cells/processes which are critical to the function of the CNS that we don't know of yet.
There could also be a fair amount of "hard coded" connectedness, even at the higher levels. We already know of some. For instance, it is known that auditory neurons in the ears are connected and something similar to a "convolution" is done in order to localize the sound source. It isn't an emergent phenomenon - you don't have to be "trained" to do it.
This is not surprising given that life has had billions of years and a comparable number of generations in order to figure it out.
I guess in theory this could all be done in software. However, given the tens of billions of neurons (and trillions of synapses) in primate/human brains, this would be incredibly challenging even on the thousand-core machines we have nowadays. And before you scream "cloud", it would not have the necessary interconnectedness/latency.
It would be cool if you could successfully model, say, a worm/insect with this approach.
> What they do have in lieu of backpropagation is feedback (cycles)
I wonder where the partial data / feedback is stored. Don't want to sound like a creationist, but it seems very improbable that "how good my sound localization is" is inferred exclusively from the # of children I have.
What do you mean by impossible? You are aware that what radio equipment does is often the equivalent of analog operations like multiplication, addition, etc., just at high frequencies?
Sure accuracy is an issue, but this is not as impossible as you may think it would be. The main question will be if the benefits by going analog outweigh the issues arising from it.
Realistically, you'd train your model the same way it's done today and then custom-order analog ones with the weights programmed in. The advantage here would be faster inference (assuming analog circuits actually work out), but custom manufacturing circuits would only really work at scale.
I don't think reprogrammable analog circuits would really be feasible, at least with today's tech. You'd need to modify the resistors etc. to make it work.
Maybe because that is a VERY different problem than the one discussed here.
Building a single analog chip with 1 billion neurons would cost billions of dollars in a best-case scenario. A Nvidia card with 1 billion digital neurons is in the hundreds-of-dollars range.
Those costs could come down eventually, but at that point CUDA may be long gone.
Have you done much AI work against AMD products? I'm not going to plunk down $2500+ for an RTX 4090, but I have been considering an RX 7900 XTX for playing around with, or at least getting started. Just curious how well it will or won't work in practice, whether saving a bit more and getting a 7900 XT instead of the XTX might be a better option, and how much the lower VRAM might impact usefulness in practice.
My only work with consumer AMD GPUs was mining Ethereum - I had 150,000 of them.
If you want to use enterprise AMD gpus, I'm renting them. That said, I haven't even had a chance to run/play with them myself yet, they have been rented since I got them last month.
Caveat emptor and your mileage may vary; but unlike nVidia where you could just assume that everything is compatible with everything, for AMD I'd strongly recommend that you try before you buy - consider renting a cloud machine with that GPU to check if the software works for your needs before committing to a large purchase.
Good writing is clear and unambiguous. With speech there is an opportunity to interrupt and ask for clarification. Writing has one chance to get the message across. A reader shouldn't have to consult knowyourmeme.com to figure out what the heck the authors are trying to say. I don't even know what the title means here. That's how far they've missed the mark.
Wow, that really sucks for you. I just read it in 5 minutes and feel much more informed about the subject of nvidia memory twizzlization. It's kind of funny to me that presumably young college guys are writing in a style that's very readable for my old ass.
Even if you're not familiar with the "go brrr" meme (which is the only use of meme-idiom in the article and is used exactly twice), its meaning is easily inferred via context clues from the opening paragraphs.
I also enjoyed the article's style. I utterly despise "academic paper speak". It is, imho, not the most effective style to communicate complex ideas. I find it so much easier to learn from a more casual "blog post" or in-person presentation over stiff, rigid academic speak.
Also, write your own CUDA kernel to do vector-matrix multiplication (if you use PyCUDA, you can focus on the kernel and write everything else in Python). Just tell ChatGPT that you want to write your own implementation that multiplies a 4000-element vector by a 4000x12000 matrix, and to guide you through the whole process.
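For reference, one possible shape of that kernel - just the device code the comment suggests focusing on; the PyCUDA host setup is omitted, and the names and launch configuration are illustrative:

```cuda
// y[j] = sum_i x[i] * M[i][j], with M stored row-major as (n_in x n_out).
// Consecutive threads handle consecutive j, so the reads of M are coalesced.
__global__ void matvec(const float* __restrict__ M, const float* __restrict__ x,
                       float* __restrict__ y, int n_in, int n_out) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j >= n_out) return;
    float acc = 0.0f;
    for (int i = 0; i < n_in; ++i)
        acc += x[i] * M[i * n_out + j];
    y[j] = acc;
}
// For the 4000 x 12000 case: matvec<<<(12000 + 255) / 256, 256>>>(dM, dx, dy, 4000, 12000);
```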
For renting GPUs, RunPod is great - right now they have everything from lower-tier GPUs to H100s. You can start with a lesser GPU at the beginning.
https://www.amd.com/en/products/accelerators/alveo/v80.html
XDNA Architecture
https://www.amd.com/en/technologies/xdna.html
Should be "as much memory bandwidth as possible". GPUs are designed to be (relatively) more insensitive to memory latency than CPU.
> a 1.58-bit LLM (2 bits in practice)
1 trit, not 2 bits. 3 trits are 27 states, which can be represented with 5 bits.
The line between GPU lingo and Star Trek technobabble fades away further and further.
https://en.wikipedia.org/wiki/Metric_tensor_(general_relativ...
would love to revisit the material, especially in this new era of specialized processing units and UMA.
Yes, we are getting more.
Good writing is also entertaining and engaging.
https://twitter.com/TheRoaringKitty/status/17900418133798504...
I spent 2 months implementing a matmult kernel in Spiral and optimizing it.