I'd also like to point out that 90% of the time spent to "compute" the Mandelbrot set with the JIT-compiled code goes into compiling the function, not into the computation itself.
If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise. Here are two tutorials:
https://cnugteren.github.io/tutorial/pages/page1.html
https://siboehm.com/articles/22/CUDA-MMM
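The naive triple-loop version is worth writing first as a correctness baseline before porting it to a CUDA kernel. A sketch in plain Python (the tiled-kernel optimizations in those tutorials all start from this shape):

```python
def matmul(A, B):
    """Naive O(n^3) matrix multiply: C[i][j] = sum_k A[i][k] * B[k][j]."""
    n, m, p = len(A), len(B), len(B[0])
    assert all(len(row) == m for row in A), "inner dimensions must match"
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):  # i-k-j loop order walks B row-by-row (better locality)
            aik = A[i][k]
            for j in range(p):
                C[i][j] += aik * B[k][j]
    return C

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19.0, 22.0], [43.0, 50.0]]
```

A CUDA version assigns one thread per output element C[i][j]; the tutorials then add shared-memory tiling on top of that.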
>If you actually want to learn something about CUDA, implementing matrix multiplication is a great exercise.
There is SAXPY (vector math a·X + Y), purportedly ([1]) the hello world of parallel math code.
>SAXPY stands for “Single-Precision A·X Plus Y”. It is a function in the standard Basic Linear Algebra Subroutines (BLAS) library. SAXPY is a combination of scalar multiplication and vector addition, and it’s very simple: it takes as input two vectors of 32-bit floats X and Y with N elements each, and a scalar value A. It multiplies each element X[i] by A and adds the result to Y[i].
[1]: https://developer.nvidia.com/blog/six-ways-saxpy/
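In plain Python the whole operation is one line per element, which is exactly why it makes a good GPU hello world: every index is independent, so a CUDA version just assigns one thread per i. A sketch (not the BLAS implementation):

```python
def saxpy(a, x, y):
    """SAXPY: result[i] = a * x[i] + y[i].
    Each i is independent, so a GPU can compute all of them at once."""
    assert len(x) == len(y)
    return [a * xi + yi for xi, yi in zip(x, y)]

print(saxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```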
This article claims to be something every developer must know, but it's a discussion of how GPUs are used in AI. Most developers are not AI developers, nor do they interact with AI or use GPUs directly. Not to mention that the article barely mentions 3D graphics at all, the reason GPUs exist.
One can benefit from knowing the fundamentals of an adjacent field, especially something as broadly applicable as machine learning.
- You might want to use some ML in the project you are assigned next month
- It can help when collaborating with someone who tackles that aspect of a project
- Fundamental knowledge helps you understand the "AI" stuff being marketed to your manager
The "I don't need this adjacent field" mentality feels familiar from the schools I attended: first I studied system administration, where my classmates didn't care about programming because they felt they didn't understand it anyway and would never need it (scripting, anyone?); then I switched to a software development school where, guess what, the kids couldn't care less about networking because they'd never need it anyway. I don't understand it: to me both are interesting. More practically, fast-forward five years and the term devops became popular in job ads.
The article is roughly 1500 words. Average reading speed is 250 wpm, but for studying something, let's assume half of that: 1500/125 = 12 minutes of your time. Perhaps you toy around with it a little, run the code samples, and spend two hours learning. That's not a huge time investment, assuming this is a good starting guide in the first place.
The objection isn't to the notion that "One can benefit from knowing fundamentals of an adjacent field". It's that this is "The bare minimum every developer must know". That's a much, much stronger claim.
I've come to see this sort of clickbait headline as playing on the prevalence of imposter-syndrome insecurity among devs, and try to ignore them on general principle.
I remember joining a startup after working for a traditional embedded shop, and a colleague made (friendly) fun of me for not knowing how to use curl to post a JSON request. I have learned a lot since then about backend, frontend and infrastructure despite still being an embedded developer. It seems likely that people all around the industry will be in a similar position when it comes to AI in the coming years.
Most AI work will just be APIs provided by your cloud provider in less than 2 years. Understanding what's going on under the hood isn't going to be that common; maybe the AI equivalent of "use explain analyze, optimize indexes" will be what passes for an (engineering, not scientist) AI expert by then.
Not to mention their passing example of Mandelbrot set rendering only gets a 10x speedup, despite being the absolute poster child of FLOPs-limited computation.
You would expect at least 1000x, and that's probably where it would be if they didn't include JIT compile time in their time. Mandelbrot sets are a perfect example of a calculation a GPU is good at.
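For reference, the per-pixel escape-time loop is the whole computation, and every pixel is independent, so a GPU version simply assigns one thread per pixel. A minimal pure-Python sketch (grid bounds and iteration count are illustrative choices):

```python
def escape_time(c, max_iter=100):
    """Iterations until |z| > 2 under z -> z*z + c (max_iter if it never escapes)."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter

def mandelbrot_grid(w, h, x0=-2.0, x1=1.0, y0=-1.5, y1=1.5, max_iter=100):
    """Every pixel is independent: this double loop is what a GPU parallelizes."""
    return [[escape_time(complex(x0 + (x1 - x0) * i / w,
                                 y0 + (y1 - y0) * j / h), max_iter)
             for i in range(w)] for j in range(h)]

print(escape_time(1 + 1j))  # 2: escapes quickly
print(escape_time(0j))      # 100: 0 is in the set, never escapes
```

The inner loop is pure floating-point arithmetic with no memory traffic to speak of, which is why it should scale almost perfectly to thousands of GPU threads once the kernel is compiled.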
Yeah, a lot of inaccurate assumptions were made.
I agree that most developers are not AI developers... OP seems to be a bit out of touch with the general population and otherwise is assuming the world around them based on their own perception.
I've noticed that every time I see an article claiming that its subject is something "every developer must know", that claim is false. Maybe there are articles which contain information that everyone must know, but all I encounter is clickbait.
Understanding how hardware is used is very beneficial for programmers.
Lots of programmers started with an understanding of what happens physically on the hardware when code runs, and at times it is an unfair advantage when debugging.
> Understanding how hardware is used is very beneficial for programmers
I agree, but saying that all developers must know how AI benefits from GPUs is a different claim, and a false one. I would say most developers don't even understand how the CPU works, let alone modern CPU features like data/instruction caching, SIMD instructions, and branch prediction.
Most developers I encounter learned JavaScript and make websites.
And honestly, for most "AI developers" if you are training your own model these days (versus using an already trained one) - you are probably doing it wrong.
Don't worry, you'll either be an AI developer or unemployed within 5 years. This is indeed important for you, regardless if you recognize this yet or not.
I think Python is dominant in AI because the Python-C relationship mirrors the CPU-GPU relationship.
GPUs are extremely performant and also very hard to code for, so people just use highly abstracted libraries like PyTorch to command the GPU.
C is very performant and hard to code in, so people just use Python as an abstraction layer over C.
It's not clear that people need to understand GPUs that much (unless you are deep in AI training/ops land). Since Moore's law has ended and multithreading has become the dominant mode of speed increases, in time there will probably be brand-new languages dedicated to this paradigm of parallel programming. Mojo is a start:
https://www.modular.com/mojo
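The Python-over-C point is easy to demonstrate: the built-in sum runs its loop in C, while an explicit Python loop pays interpreter overhead on every element. A rough sketch (the exact ratio depends on the machine):

```python
import timeit

data = list(range(1_000_000))

def py_sum(xs):
    """Summation with the loop executed by the bytecode interpreter."""
    total = 0
    for x in xs:
        total += x
    return total

t_py = timeit.timeit(lambda: py_sum(data), number=5)
t_c = timeit.timeit(lambda: sum(data), number=5)  # the loop runs in C

assert py_sum(data) == sum(data)
print(f"interpreted loop: {t_py:.3f}s, C builtin: {t_c:.3f}s")
```

The same delegation pattern, one level up, is what PyTorch does when it hands a tensor operation to a GPU kernel.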
I've wondered for a while: is there a space for a (new?) language which invisibly maximises performance, whatever hardware it is run on?
As in, every instruction, from a simple loop of calculations onward, is designed behind the scenes so that it intelligently maximises usage of every available CPU core in parallel, and also farms everything possible out to the GPU? Has this been done? Is it possible?
There's definitely a space for it. It may even be possible. But if you consider the long history of Lisp discussions (flamewars?) about "a sufficiently smart compiler" and comparisons to C, or maybe Java vs C++, it seems unlikely, or at least very, very difficult.
There are little bits of research on algorithm replacement: have the compiler detect that you're trying to sort, and generate the code for quicksort or Timsort. It works, kinda. But there are a lot of ways to hide a sort in code, and the compiler can't readily find them all.
Not for mixed CPU/GPU, but there is the concept of a superoptimizer, which basically brute-forces the most optimal correct code. It is not practical except for very, very short program snippets (and superoptimizers are usually CPU-only, though there is nothing fundamental preventing one from targeting a GPU as well).
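To make the idea concrete, here is a toy brute-force superoptimizer in Python. The instruction set, the encoding, and the function names are invented for this illustration; real superoptimizers (STOKE, for instance) search vastly larger spaces with smarter pruning.

```python
from itertools import product

# Tiny made-up instruction set: each instruction (op, i, j) appends
# OPS[op](vals[i], vals[j]) to the value list, which starts as [x, 1].
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
}

def run(prog, x):
    """Execute a straight-line program; the last value is the result."""
    vals = [x, 1]
    for op, i, j in prog:
        vals.append(OPS[op](vals[i], vals[j]))
    return vals[-1]

def superoptimize(target, tests, max_len=3):
    """Return the shortest program matching target on all test inputs."""
    for length in range(1, max_len + 1):
        slots = []
        for k in range(length):
            idx = range(2 + k)  # an instruction may use any earlier value
            slots.append([(op, i, j) for op in OPS for i in idx for j in idx])
        for prog in product(*slots):
            if all(run(prog, x) == target(x) for x in tests):
                return prog
    return None

# Find a program computing x*x + x:
prog = superoptimize(lambda x: x * x + x, tests=[0, 1, 2, 5, -3])
print(prog)  # e.g. (('add', 0, 1), ('mul', 0, 2)), i.e. x * (x + 1)
```

Even here the search space grows exponentially with program length, which is exactly why the technique stays confined to short snippets.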
I'm not sure that's even possible in principle; consider the various anti-performance algorithms of proof-of-waste systems, where every step is data-dependent on the previous one and the table of intermediate results required may be made arbitrarily big.
It's a bit like "design a zip algorithm which can compress any file".
I’m not sure if that’s a good idea at the moment, but we should start with making development with vector instructions more approachable. The code should look more or less the same as working with u64s.
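One low-tech version of "vector code that looks like u64 code" already exists: SIMD-within-a-register (SWAR), where several small lanes are packed into one 64-bit integer and masked so carries never cross lane boundaries. A sketch (the lane layout and helper names are my own):

```python
MASK = 0x7FFF7FFF7FFF7FFF  # low 15 bits of each 16-bit lane
HIGH = 0x8000800080008000  # top bit of each lane

def pack(lanes):
    """Pack four 16-bit values into one 64-bit word."""
    assert len(lanes) == 4 and all(0 <= v < 1 << 16 for v in lanes)
    w = 0
    for i, v in enumerate(lanes):
        w |= v << (16 * i)
    return w

def unpack(w):
    return [(w >> (16 * i)) & 0xFFFF for i in range(4)]

def swar_add(a, b):
    """Lanewise wrapping 16-bit add with no carry between lanes:
    add the low 15 bits, then XOR the lanes' top bits back in."""
    return ((a & MASK) + (b & MASK)) ^ ((a ^ b) & HIGH)

x = pack([1, 2, 3, 65535])
y = pack([10, 20, 30, 1])
print(unpack(swar_add(x, y)))  # [11, 22, 33, 0]  (last lane wraps mod 2^16)
```

Real vector ISAs give you this per-lane behaviour in hardware; the point of the sketch is that the calling code really does just look like arithmetic on u64s.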
> Moore’s law is far from over and multithreading is not the answer.
Wut? We hit the power wall back in 2004. There was a little bit of optimization around the memory wall and the ILP wall afterwards, but really, cores haven't gotten faster since.
It's been all about cramming more cores in since then, which implies at least multithreading, but multiprocessing is basically required to get the most out of a CPU these days.
GPUs aren't that difficult to program. CUDA is fairly straightforward for many tasks and in many cases there is an easy 100x improvement in processing speed just sitting there to be had with <100 lines of code.
Sure, but if you've never used CUDA or any other GPU framework, how many lines of documentation do you need to read, and how many lines of code are you likely to write and rewrite and delete before you end up with those <100 lines of working code?
You can get C performance (I'd argue that C's lack of abstraction makes slow-but-simple code more appealing) with pythonic expressiveness pretty easily in a more modern language.
I code in both (c for hobbies and python professionally) and “significant white space” is a non-issue if you spend any amount of time getting used to it.
Complaining about significant white-space is like complaining that lisp has too many parentheses. It’s an aesthetic preference that just doesn’t matter in practice.
I actually find python and C very similar in spirit.
Syntax is mostly an irrelevance, they have surprisingly similar patterns in my opinion.
In a modern language I want a type system that both reduces risk and reduces typing — safety and metaprogramming. C obviously doesn't, python doesn't really either.
Python's approach to dynamic-ness is very similar to how I'd expect C to be as a dynamic language (if it had proper arrays/lists).
> When faced with multiple tasks, a CPU allocates its resources to address each task one after the other
Ha! I wish CPUs were still that simple.
Granted, it is legitimate for the article to focus on the programming model. But "CPUs execute instructions sequentially" is basically wrong if you talk about performance. (There are pipelines executing instructions in parallel, there is SIMD, and multiple cores can work on the same problem.)
I think this post focused on the wrong things here. CPUs with AVX-512 also have massive data parallelism, and CPUs can execute many instructions at the same time. The big difference is that CPUs spend a lot of their silicon and power handling control flow to execute one thread efficiently, while GPUs spend that silicon on more compute units and hide control flow and memory latency by executing a lot of threads.
"CPUs are good at serial code and GPUs are good at parallel code" is kind of true, but something of an approximation. Assume an equivalent power budget, roughly in the hundreds-of-watts range. Then:
A CPU has ~100 "cores", each running one (and a hyperthread) independent thing, and it hides memory latency with branch prediction and pipelining.
A GPU has ~100 "compute units", each running ~80 independent things interleaved, and it hides memory latency by executing the next instruction from one of the other 80 things.
Terminology is a bit of a mess, and the CPU probably has a 256-bit wide vector unit while the GPU probably has a 2048-bit wide vector unit, but from a short distance the two architectures look rather similar.
GPU has 10x the memory bandwidth of the CPU though, which becomes relevant for the LLMs where you essentially have to read the whole memory (if you're batching optimally, that is using all the memory either for weights or for KV cache) to produce one token of output.
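A back-of-envelope sketch of that bandwidth argument (all numbers here are illustrative assumptions, not measurements): if every generated token has to stream all the weights from memory once, tokens per second is capped at bandwidth divided by weight bytes.

```python
def max_tokens_per_sec(bandwidth_gb_s, params_billion, bytes_per_param):
    """Upper bound when decoding is memory-bandwidth-bound:
    every token reads all weights from memory once."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / weight_bytes

# Assumed figures: a 7B model in fp16 (2 bytes/param),
# ~1000 GB/s GPU memory vs ~100 GB/s CPU DDR.
print(round(max_tokens_per_sec(1000, 7, 2)))  # 71 tokens/s ceiling on the GPU
print(round(max_tokens_per_sec(100, 7, 2)))   # 7 tokens/s ceiling on the CPU
```

Batching raises throughput precisely because one pass over the weights can then serve many sequences at once.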
I’m always surprised there isn’t a movement toward pairing a few low-latency cores with a large number of high-throughput cores. Surround a single Intel P core with a bunch of E cores. Then, hanging off the E cores, stick a bunch of iGPU cores and/or AVX-512 units.
If you use high-bandwidth, high-latency GDDR memory, CPU cores will underperform due to the high latency, as here: https://www.tomshardware.com/reviews/amd-4700s-desktop-kit-r...
If you use low-latency memory, GPU cores will underperform due to low bandwidth; see modern AMD APUs with many RDNA3 cores connected to DDR5 memory. On paper, Radeon 780M delivers up to 9 FP32 TFLOPS, a figure close to the desktop Radeon RX 6700, which is substantially faster in gaming.
I think they may have a hurdle of getting folks to buy into the concept though.
I imagine it would be analogous to how Arria FPGAs were included with certain Xeon CPUs. Which further backs up your point that this could happen in the near future!
Edit: Oh, thanks for the downvote, with no discussion of the question. I'll just sit here quietly with my commercial OpenCL software that happily exploits these vector units attached to the normal CPU cores.
Given that most programming languages are designed for sequential processing (like CPUs), but Erlang/Elixir is designed for parallelism (like GPUs) … I really wonder if Nx / Axon (Elixir) will take off.
https://github.com/elixir-nx/
I am really wondering how well Elixir with Nx would perform for computation heavy workloads on a HPC cluster. Architecturally, it isn't that dissimilar to MPI, which is often used in that field. It should be a lot more accessible though, like numpy and the entire scientific python stack.
I've been investigating this and I wonder if the combination of Elixir and Nx/Axon might be a good fit for architectures like NVIDIA Grace Hopper where there is a mix of CPU and GPU.
Would that run on a GPU? I think the future is having both. Sequential programming is still the best abstraction for most tasks that don’t require immense parallel execution
I need a buyers guide: what's the minimum to spend, and best at a few budget tiers? Unfortunately that info changes occasionally and I'm not sure if there's any resource that keeps on top of things.
Google Colab, Kaggle Notebooks and Paperspace Notebooks all offer free GPU usage (within limits), so you do not need to spend anything to learn GPU programming:
https://colab.google/
https://www.kaggle.com/docs/notebooks
https://www.paperspace.com/gradient/free-gpu
Although I think they'll be replaced by ChatGPT, a good article in that style is actually quite valuable.
I like attacking complexity head-on, and I have good knowledge of both the quantitative methods and the qualitative details of (say) computer hardware, so an article that can tell me the nitty-gritty details of a field is appreciated.
Take "What every programmer should know about memory": should every programmer know it? Perhaps not, but every good programmer should at least have an appreciation of how a computer actually works. This pays dividends everywhere: code with good locality (the main idea you should take away from that article) is fast, easy to follow, and usually the result of code that fits the problem well.
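A small experiment along those lines (in pure Python the interpreter overhead hides most of the cache effect, so treat the timings as illustrative; in C the gap between the two traversal orders is dramatic):

```python
import timeit

# Summing a 2D table row-by-row touches memory in the order it is laid
# out; column-by-column jumps between rows on every element.
N = 1000
table = [[i * N + j for j in range(N)] for i in range(N)]

def row_major():
    return sum(table[i][j] for i in range(N) for j in range(N))

def col_major():
    return sum(table[i][j] for j in range(N) for i in range(N))

assert row_major() == col_major()
print("row-major:", timeit.timeit(row_major, number=3))
print("col-major:", timeit.timeit(col_major, number=3))
```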
Terrible article IMO.
(I'm not touching Nvidia since they don't provide open source drivers.)
There is also https://futhark-lang.org/, though I haven’t tried it, just heard about it.
C is a way of life. Those of us who code exclusively, or nearly so, in C cannot stomach python's notion of "significant white-space."
Why belly ache about it? Whitespace is significant to one’s fellow humans.
Well-known for supporting any formatting style you like ;)
And memory access order matters much more than on a CPU. Truly random access has very bad performance.
Call it Xeon Chi.