I'd also like to point out that 90% of the time spent to "compute" the Mandelbrot set with the JIT-compiled code goes into compiling the function, not into the computation itself.
If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise. Here are two tutorials:
https://cnugteren.github.io/tutorial/pages/page1.html
https://siboehm.com/articles/22/CUDA-MMM
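The naive triple-loop version is worth writing first as a correctness baseline before porting it to a CUDA kernel. A sketch in plain Python (the tiled-kernel optimizations in those tutorials all start from this shape):

```python
def matmul(A, B):
    """Naive O(n^3) matrix multiply: C[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):  # i-k-j loop order walks B row-by-row (better locality)
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

A CUDA version assigns one thread per output element C[i][j]; the tutorials then add shared-memory tiling on top of that.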
>If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise.
There is SAXPY (vector math a·X + Y), purportedly ([1]) the hello world of parallel math code.
>SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i].
[1]: https://developer.nvidia.com/blog/six-ways-saxpy/
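In plain Python the whole operation is one line per element, which is exactly why it makes a good GPU hello world: every index is independent, so a CUDA version just assigns one thread per i. A sketch (not the BLAS implementation):

```python
def saxpy(a, x, y):
    """SAXPY: result[i] = a * x[i] + y[i].
    Each i is independent, so a GPU can compute all of them at once."""
    assert len(x) == len(y)
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```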
This article claims to be something every developer must know, but it's a discussion of how GPUs are used in AI. Most developers are not AI developers, nor do they interact with AI or use GPUs directly. Not to mention that the article barely mentions 3D graphics at all, the reason GPUs exist.
One can benefit from knowing the fundamentals of an adjacent field, especially something as broadly applicable as machine learning.
- You might want to use some ML in the project you are assigned next month
- It can help when collaborating with someone who tackles that aspect of a project
- Fundamental knowledge helps you understand the "AI" stuff being marketed to your manager
The "I don't need this adjacent field" mentality feels familiar from the schools I attended: first I studied system administration, where my classmates didn't care about programming because they felt they didn't understand it anyway and would never need it (scripting, anyone?); then I switched to a software development school where, guess what, the kids couldn't care less about networking because they'd never need it anyway. I don't understand it: to me both are interesting. More practically, fast-forward five years and the term devops became popular in job ads.
The article is roughly 1500 words. Average reading speed is 250 wpm, but for studying something, let's assume half of that: 1500/125 = 12 minutes of your time. Perhaps you toy around with it a little, run the code samples, and spend two hours learning. That's not a huge time investment, assuming this is a good starting guide in the first place.
The objection isn't to the notion that "One can benefit from knowing fundamentals of an adjacent field". It's that this is "The bare minimum every developer must know". That's a much, much stronger claim.
I've come to see this sort of clickbait headline as playing on the prevalence of imposter-syndrome insecurity among devs, and try to ignore them on general principle.
I remember joining a startup after working for a traditional embedded shop, and a colleague made (friendly) fun of me for not knowing how to use curl to post a JSON request. I have learned a lot since then about backend, frontend and infrastructure despite still being an embedded developer. It seems likely that people all around the industry will be in a similar position when it comes to AI in the coming years.
Most AI work will just be APIs provided by your cloud provider in less than 2 years. Understanding what's going on under the hood isn't going to be that common; maybe the AI equivalent of "use explain analyze, optimize indexes" will be what passes for an (engineering, not scientist) AI expert by then.
Not to mention their passing example of Mandelbrot set rendering only gets a 10x speedup, despite being the absolute poster child of FLOPs-limited computation.
You would expect at least 1000x, and that's probably where it would be if they didn't include JIT compile time in their time. Mandelbrot sets are a perfect example of a calculation a GPU is good at.
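For reference, the per-pixel escape-time loop is the whole computation, and every pixel is independent, so a GPU version simply assigns one thread per pixel. A minimal pure-Python sketch (grid bounds and iteration count are illustrative choices):

```python
def escape_time(c, max_iter=100):
    """Iterations until |z| > 2 under z -> z*z + c (max_iter if it never escapes)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def mandelbrot_grid(w, h, x0=-2.0, x1=1.0, y0=-1.5, y1=1.5, max_iter=100):
    """Every pixel is independent: this double loop is what a GPU parallelizes."""
    return [[escape_time(complex(x0 + (x1 - x0) * i / w,
                                 y0 + (y1 - y0) * j / h), max_iter)
             for i in range(w)] for j in range(h)]

print(escape_time(1 + 1j))  # 2: escapes quickly
print(escape_time(0j))      # 100: 0 is in the set, never escapes
```

The inner loop is pure floating-point arithmetic with no memory traffic to speak of, which is why it should scale almost perfectly to thousands of GPU threads once the kernel is compiled.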
Yeah, a lot of inaccurate assumptions were made.
I agree that most developers are not AI developers... OP seems to be a bit out of touch with the general population and otherwise is assuming the world around them based on their own perception.
I've noticed that every time I see an article claiming that its subject is something "every developer must know", that claim is false. Maybe there are articles which contain information that everyone must know, but all I encounter is clickbait.
Understanding how hardware is used is very beneficial for programmers.
Lots of programmers started with an understanding of what happens physically on the hardware when code runs, and at times it is an unfair advantage when debugging.
> Understanding how hardware is used is very beneficial for programmers
I agree, but saying that all developers must know how AI benefits from GPUs is a different claim, and a false one. I would say most developers don't even understand how the CPU works, let alone modern CPU features like data/instruction caching, SIMD instructions, and branch prediction.
Most developers I encounter learned JavaScript and make websites.
And honestly, for most "AI developers" if you are training your own model these days (versus using an already trained one) - you are probably doing it wrong.
Don't worry, you'll either be an AI developer or unemployed within 5 years. This is indeed important for you, regardless if you recognize this yet or not.
I think Python is dominant in AI because the Python-C relationship mirrors the CPU-GPU relationship.
GPUs are extremely performant and also very hard to code for, so people just use highly abstracted libraries like PyTorch to command the GPU.
C is very performant and hard to code in, so people just use Python as an abstraction layer over C.
It's not clear that people need to understand GPUs that much (unless you are deep in AI training/ops land). Since Moore's law has ended and multithreading has become the dominant mode of speed increases, in time there will probably be brand-new languages dedicated to this paradigm of parallel programming. Mojo is a start:
https://www.modular.com/mojo
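The Python-over-C point is easy to demonstrate: the built-in sum runs its loop in C, while an explicit Python loop pays interpreter overhead on every element. A rough sketch (the exact ratio depends on the machine):

```python
import timeit

data = list(range(1_000_000))

def py_sum(xs):
    """Summation with the loop executed by the bytecode interpreter."""
    total = 0
    for x in xs:
        total += x
    return total

t_py = timeit.timeit(lambda: py_sum(data), number=5)
t_c = timeit.timeit(lambda: sum(data), number=5)  # the loop runs in C

assert py_sum(data) == sum(data)
print(f"interpreted loop: {t_py:.3f}s, C builtin: {t_c:.3f}s")
```

The same delegation pattern, one level up, is what PyTorch does when it hands a tensor operation to a GPU kernel.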
I've wondered for a while: is there a space for a (new?) language which invisibly maximises performance, whatever hardware it is run on?
As in, every instruction, from a simple loop of calculations onward, is designed behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything possible out to the GPU? Has this been done? Is it possible?
There's definitely a space for it. It may even be possible. But if you consider the long history of Lisp discussions (flamewars?) about "a sufficiently smart compiler" and comparisons to C, or maybe Java vs C++, it seems unlikely, or at least very, very difficult.
There are little bits of research on algorithm replacement: have the compiler detect that you're trying to sort, and generate the code for quicksort or Timsort. It works, kinda. But there are a lot of ways to hide a sort in code, and the compiler can't readily find them all.
Not for mixed CPU/GPU, but there is the concept of a superoptimizer, which basically brute-forces the most optimal correct code. It is not practical except for very, very short program snippets (and superoptimizers are usually CPU-only, though there is nothing fundamental preventing one from targeting a GPU as well).
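To make the idea concrete, here is a toy brute-force superoptimizer in Python. The instruction set, the encoding, and the function names are invented for this illustration; real superoptimizers (STOKE, for instance) search vastly larger spaces with smarter pruning.

```python
from itertools import product

# Tiny made-up instruction set: each instruction (op, i, j) appends
# OPS[op](vals[i], vals[j]) to the value list, which starts as [x, 1].
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def run(prog, x):
    """Execute a straight-line program; the last value is the result."""
    vals = [x, 1]
    for op, i, j in prog:
        vals.append(OPS[op](vals[i], vals[j]))
    return vals[-1]

def superoptimize(target, tests, max_len=3):
    """Return the shortest program matching target on all test inputs."""
    for length in range(1, max_len + 1):
        slots = []
        for k in range(length):
            idx = range(2 + k)  # an instruction may use any earlier value
            slots.append([(op, i, j) for op in OPS for i in idx for j in idx])
        for prog in product(*slots):
            if all(run(prog, x) == target(x) for x in tests):
                return prog
    return None

# Find a program computing x*x + x:
prog = superoptimize(lambda x: x * x + x, tests=[0, 1, 2, 5, -3])
print(prog)  # e.g. (('add', 0, 1), ('mul', 0, 2)), i.e. x * (x + 1)
```

Even here the search space grows exponentially with program length, which is exactly why the technique stays confined to short snippets.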
I'm not sure that's even possible in principle; consider the various anti-performance algorithms of proof-of-waste systems, where every step is data-dependent on the previous one and the table of intermediate results required may be made arbitrarily big.
It's a bit like "design a zip algorithm which can compress any file".
I’m not sure if that’s a good idea at the moment, but we should start with making development with vector instructions more approachable. The code should look more or less the same as working with u64s.
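One low-tech version of "vector code that looks like u64 code" already exists: SIMD-within-a-register (SWAR), where several small lanes are packed into one 64-bit integer and masked so carries never cross lane boundaries. A sketch (the lane layout and helper names are my own):

```python
MASK = 0x7FFF7FFF7FFF7FFF  # low 15 bits of each 16-bit lane
HIGH = 0x8000800080008000  # top bit of each lane

def pack(lanes):
    """Pack four 16-bit values into one 64-bit word."""
    assert len(lanes) == 4 and all(0 <= v < 1 << 16 for v in lanes)
    w = 0
    for i, v in enumerate(lanes):
        w |= v << (16 * i)
    return w

def unpack(w):
    return [(w >> (16 * i)) & 0xFFFF for i in range(4)]

def swar_add(a, b):
    """Lanewise wrapping 16-bit add with no carry between lanes:
    add the low 15 bits, then XOR the lanes' top bits back in."""
    return ((a & MASK) + (b & MASK)) ^ ((a ^ b) & HIGH)

x = pack([1, 2, 3, 65535])
y = pack([10, 20, 30, 1])
print(unpack(swar_add(x, y)))  # [11, 22, 33, 0]  (last lane wraps mod 2^16)
```

Real vector ISAs give you this per-lane behaviour in hardware; the point of the sketch is that the calling code really does just look like arithmetic on u64s.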
> Moore’s law is far from over and multithreading is not the answer.
Wut? We hit the power wall back in 2004. There was a little bit of optimization around the memory wall and the ILP wall afterwards, but really, cores haven't gotten faster since.
It's been all about cramming more cores in since then, which implies at least multithreading, but multiprocessing is basically required to get the most out of a CPU these days.
GPUs aren't that difficult to program. CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.
Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?
You can get C performance (I'd argue that C's lack of abstraction makes slow-but-simple code more appealing) with pythonic expressiveness pretty easily in a more modern language.
I code in both (c for hobbies and python professionally) and “significant white space” is a non-issue if you spend any amount of time getting used to it.
Complaining about significant white-space is like complaining that lisp has too many parentheses. It’s an aesthetic preference that just doesn’t matter in practice.
I actually find python and C very similar in spirit.
Syntax is mostly an irrelevance, they have surprisingly similar patterns in my opinion.
In a modern language I want a type system that both reduces risk and reduces typing — safety and metaprogramming. C obviously doesn't, python doesn't really either.
Python's approach to dynamic-ness is very similar to how I'd expect C to be as a dynamic language (if it had proper arrays/lists).
> When faced with multiple tasks, a CPU allocates its resources to address each task one after the other
Ha! I wish CPUs were still that simple.
Granted, it is legitimate for the article to focus on the programming model. But "CPUs execute instructions sequentially" is basically wrong if you talk about performance. (There are pipelines executing instructions in parallel, there is SIMD, and multiple cores can work on the same problem.)
I think this post focused on the wrong things here. CPUs with AVX-512 also have massive data parallelism, and CPUs can execute many instructions at the same time. The big difference is that CPUs spend a lot of their silicon and power handling control flow to execute one thread efficiently, while GPUs spend that silicon on more compute units and hide control flow and memory latency by executing a lot of threads.
"CPUs are good at serial code and GPUs are good at parallel code" is kind of true, but something of an approximation. Assume an equivalent power budget, roughly in the hundreds-of-watts range. Then:
A CPU has ~100 "cores", each running one (and a hyperthread) independent thing, and it hides memory latency with branch prediction and pipelining.
A GPU has ~100 "compute units", each running ~80 independent things interleaved, and it hides memory latency by executing the next instruction from one of the other 80 things.
Terminology is a bit of a mess, and the CPU probably has a 256-bit wide vector unit while the GPU probably has a 2048-bit wide vector unit, but from a short distance the two architectures look rather similar.
GPU has 10x the memory bandwidth of the CPU though, which becomes relevant for the LLMs where you essentially have to read the whole memory (if you're batching optimally, that is using all the memory either for weights or for KV cache) to produce one token of output.
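A back-of-envelope sketch of that bandwidth argument (all numbers here are illustrative assumptions, not measurements): if every generated token has to stream all the weights from memory once, tokens per second is capped at bandwidth divided by weight bytes.

```python
def max_tokens_per_sec(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound when decoding is memory-bandwidth-bound:
    every token reads all weights from memory once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# Assumed figures: a 7B model in fp16 (2 bytes/param),
# ~1000 GB/s GPU memory vs ~100 GB/s CPU DDR.
print(round(max_tokens_per_sec(1000, 7, 2)))  # 71 tokens/s ceiling on the GPU
print(round(max_tokens_per_sec(100, 7, 2)))   # 7 tokens/s ceiling on the CPU
```

Batching raises throughput precisely because one pass over the weights can then serve many sequences at once.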
I’m always surprised there isn’t a movement toward pairing a few low-latency cores with a large number of high-throughput cores. Surround a single Intel P core with a bunch of E cores. Then, hanging off the E cores, stick a bunch of iGPU cores and/or AVX-512 units.
If you use high-bandwidth, high-latency GDDR memory, CPU cores will underperform due to the high latency, as here: https://www.tomshardware.com/reviews/amd-4700s-desktop-kit-r...
If you use low-latency memory, GPU cores will underperform due to low bandwidth; see modern AMD APUs with many RDNA3 cores connected to DDR5 memory. On paper, Radeon 780M delivers up to 9 FP32 TFLOPS, a figure close to the desktop Radeon RX 6700, which is substantially faster in gaming.
I think they may have a hurdle of getting folks to buy into the concept though.
I imagine it would be analogous to how Arria FPGAs were included with certain Xeon CPUs. Which further backs up your point that this could happen in the near future!
Edit: Oh, thanks for the downvote, with no discussion of the question. I'll just sit here quietly with my commercial OpenCL software that happily exploits these vector units attached to the normal CPU cores.
Given that most programming languages are designed for sequential processing (like CPUs), but Erlang/Elixir is designed for parallelism (like GPUs) … I really wonder if Nx / Axon (Elixir) will take off.
https://github.com/elixir-nx/
I am really wondering how well Elixir with Nx would perform for computation heavy workloads on a HPC cluster. Architecturally, it isn't that dissimilar to MPI, which is often used in that field. It should be a lot more accessible though, like numpy and the entire scientific python stack.
I've been investigating this and I wonder if the combination of Elixir and Nx/Axon might be a good fit for architectures like NVIDIA Grace Hopper where there is a mix of CPU and GPU.
Would that run on a GPU? I think the future is having both. Sequential programming is still the best abstraction for most tasks that don’t require immense parallel execution
I need a buyers guide: what's the minimum to spend, and best at a few budget tiers? Unfortunately that info changes occasionally and I'm not sure if there's any resource that keeps on top of things.
Google Colab, Kaggle Notebooks and Paperspace Notebooks all offer free GPU usage (within limits), so you do not need to spend anything to learn GPU programming:
https://colab.google/
https://www.kaggle.com/docs/notebooks
https://www.paperspace.com/gradient/free-gpu
Although I think they'll be replaced by ChatGPT, a good article in that style is actually quite valuable.
I like attacking complexity head-on, and I have good knowledge of both the quantitative methods and the qualitative details of (say) computer hardware, so an article that can tell me the nitty-gritty details of a field is appreciated.
Take "What every programmer should know about memory": should every programmer know it? Perhaps not, but every good programmer should at least have an appreciation of how a computer actually works. This pays dividends everywhere: code with good locality (the main idea you should take away from that article) is fast, easy to follow, and usually the result of code that fits the problem well.
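A small experiment along those lines (in pure Python the interpreter overhead hides most of the cache effect, so treat the timings as illustrative; in C the gap between the two traversal orders is dramatic):

```python
import timeit

# Summing a 2D table row-by-row touches memory in the order it is laid
# out; column-by-column jumps between rows on every element.
N = 1000
table = [[i * N + j for j in range(N)] for i in range(N)]

def row_major():
    return sum(table[i][j] for i in range(N) for j in range(N))

def col_major():
    return sum(table[i][j] for j in range(N) for i in range(N))

assert row_major() == col_major()
print("row-major:", timeit.timeit(row_major, number=3))
print("col-major:", timeit.timeit(col_major, number=3))
```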
Terrible article IMO.
(I'm not touching Nvidia since they don't provide open source drivers.)
There is also https://futhark-lang.org/, though I haven’t tried it, just heard about it.
C is a way of life. Those of us who code exclusively, or nearly so, in C cannot stomach python's notion of "significant white-space."
Why belly ache about it? Whitespace is significant to one’s fellow humans.
Well-known for supporting any formatting style you like ;)
And memory access order matters much more than on a CPU. Truly random access has very bad performance.
Call it Xeon Chi.