treesciencebot · a year ago
Pretty neat implementation. In general, for these sorts of exercises (and even if the intention is to go to prod with custom kernels) I lean towards Triton for writing the kernels themselves. It is much easier to integrate into the toolchain, and allows a level of abstraction that doesn't affect performance even a little bit while providing useful constructs.
whimsicalism · a year ago
yeah even the official flashattention is moving many implementations from cutlass to triton except for the main mha backward/forward pass
jart · a year ago
It was written with cutlass? No wonder Peter Kim found it valuable and worthwhile to de-obfuscate. Adopting a new programming language invented by OpenAI doesn't sound like a much better alternative. I'd be shocked if either of them were able to build code for AMD GPUs, where it's easy to adapt CUDA code, but not if it's buried in tens of thousands of lines of frameworks. I like open source code to have clarity so I can optimize it for my own production environment myself. When people distribute code they've productionized for themselves, it squeezes out all the alpha and informational value. Just because something's open source doesn't mean it's open source. I think people mostly do it to lick the cookie without giving much away.
queuebert · a year ago
As a person who finds CUDA extremely easy to write and integrate, what does Triton have to offer?
whimsicalism · a year ago
block level rather than thread level programming, automatic optimization across hyperparameters, makes it much easier to write fast kernels
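To make the block-level vs. thread-level distinction concrete, here is a rough pure-Python analogy (not real Triton code; names and the tile size are illustrative): each "program" instance operates on a whole tile of the output, the way a Triton kernel works over BLOCK-sized ranges, rather than one thread computing one element.

```python
# Illustrative only: emulates Triton's block-level model in pure Python.
# A real Triton kernel would use tl.program_id / tl.load / tl.store on
# BLOCK-sized tiles; here each "program" handles one tile of the output.

BLOCK = 4  # tile size each program instance is responsible for

def add_kernel_block(x, y, out, pid):
    """One 'program' instance: operates on a whole tile, not one element."""
    start = pid * BLOCK
    # The bounds check plays the role of a mask for the ragged final tile.
    idxs = [i for i in range(start, start + BLOCK) if i < len(x)]
    for i in idxs:                       # in Triton this loop is vectorized
        out[i] = x[i] + y[i]

def vector_add(x, y):
    out = [0.0] * len(x)
    n_programs = -(-len(x) // BLOCK)     # ceil division: size of the "grid"
    for pid in range(n_programs):        # the grid "launch"
        add_kernel_block(x, y, out, pid)
    return out
```

The point is that you reason about tiles and masks; scheduling, vectorization, and shared-memory staging are left to the compiler.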
ixaxaar · a year ago
You mean triton the inference server or triton the DSL for cuda?
whimsicalism · a year ago
they mean the dsl (not just necessarily for cuda)
treesciencebot · a year ago
triton the DSL.
fpgamlirfanboy · a year ago
> allows a level of abstraction that doesn't affect performance even a little bit

The second part of this sentence is true because the first part is false.

treesciencebot · a year ago
zero cost abstractions exist. doesn't mean all abstractions are zero-cost. or being zero-cost somehow invalidates their abstractness/genericness. but maybe we differ on the definition of abstractions.
araes · a year ago
For those who have no idea what's being discussed, quick background.

Discussing: Transformer [1] memory issues and approximate attention [2] in machine learning training.

Specifically: FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. [3]

As a side comment, this entire industry is sorely in need of at least introductory material. The entire space has moved so fast in the last year that I need an entirely new dictionary and thesaurus for all the terms they've created. Notably, because of this, I found out Google has a glossary of machine learning terms. Actually somewhat handy.

[1] Google Machine Learning Glossary (Transformer): https://developers.google.com/machine-learning/glossary/#tra...

[2] Same (Attention): https://developers.google.com/machine-learning/glossary/#att...

[3] arXiv: https://arxiv.org/abs/2205.14135
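For readers new to [3]: the core idea is computing exact attention tile by tile with a running ("online") softmax, so the full score matrix never has to be materialized in slow memory. A minimal single-query sketch in plain Python (illustrative names; no claims about the real kernel's layout):

```python
import math

def attention_online(q, keys, values, tile=2):
    """Exact attention for one query vector, processed in tiles of keys/values.

    Maintains a running max (m) and running normalizer (l) so that earlier
    partial sums can be rescaled whenever a larger score appears -- the
    rescaling trick at the heart of FlashAttention.
    """
    d = len(q)
    m = float("-inf")              # running max of scores seen so far
    l = 0.0                        # running softmax denominator
    acc = [0.0] * len(values[0])   # unnormalized weighted sum of values
    for start in range(0, len(keys), tile):
        for k, v in zip(keys[start:start + tile], values[start:start + tile]):
            s = sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            m_new = max(m, s)
            # Rescale previous accumulator/denominator to the new max.
            scale = math.exp(m - m_new) if m != float("-inf") else 0.0
            l = l * scale + math.exp(s - m_new)
            acc = [a * scale + math.exp(s - m_new) * vi
                   for a, vi in zip(acc, v)]
            m = m_new
    return [a / l for a in acc]
```

Despite only ever seeing one tile at a time, this returns exactly the same result as materializing all scores and taking a full softmax.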

robrenaud · a year ago
Regarding your comment about how fast the research and industry is moving, would HN readers be interested in relevant one- or two-paragraph summaries that are basically "explain it like I'm a machine learning engineer from 2020", but one who also knows the power of these models from the perspective of using ChatGPT or MS Copilot? That is, assume a fair amount of technical knowledge about the fundamentals, but don't assume that the reader has been paying enough attention to have whitebox knowledge of the current state of the art.
jprete · a year ago
I personally have been looking for "explain it like I'm a CS PhD with lots of experience and the ability to look stuff up". But I suspect your summary would be pretty handy as well.
araes · a year ago
That sounds at least somewhat helpful. Honestly, a gradient for some of this stuff would be nice. Explain it to me like I'm: "five", "a high schooler", "a college grad (not CS/ML/Eng)", "a CS/Eng grad (not ML)".

Although in a couple of years, kids in restaurants will probably be telling me how they're leveling up attention on their neuro-pet. The singularity is steep.

andoando · a year ago
I would love an explanation for software engineers / CS majors who aren't familiar with ML.

Last I studied ML was 2016 and that was stuff like decision trees, k nearest neighbors...

whimsicalism · a year ago
frankly i don’t really feel like all that much has changed since 2020 except for scale
godelski · a year ago
Zero shot is wrong, but that definition is commonly used.

Zero shot is testing out of distribution, not just on "a task" the model wasn't trained on. The latter is ill-defined.

The original definition comes from a few papers. But the classic example is a classifier recognizing zebras while having never been trained on zebras (though it may have been trained on horses). Those are out of distribution. But importantly, out of the implicit distribution, not the target distribution.

The common improper usage confuses these two. A simple example might be training on 256x256 images and testing on 1024x1024. That's still in the implicit distribution (as long as the classes are identical). A very common example is training on a large dataset like LAION and then testing on COCO or ImageNet-1k. This is not zero shot because the classes in ImageNet are in LAION (and in COCO). Under that reading, any validation or test set would count as zero shot, since those samples were never seen in the training data and are thus outside the training set, which makes the definition useless. But remember that datasets are proxies for larger distributions.

Where it can sometimes get tricky is tasks (emergence has entered the chat). For example, you may not intend to train a generative model to do classification, but you probably did (it's very clear -- in the math -- if you're training density models (KLD, score, etc.)). This can get hairy because it's very easy to train a model to do things you don't realize you are, and only find out later. Some people get upset about this, but it's the nature of frameworks with low interpretability. There's still a lot of mathematics we need to learn, and it tends not to be an explicit focus in ML, but there are plenty in the community focused on this.

saiojd · a year ago
What does __syncthreads() do here exactly? I'm new to CUDA, could get the overall idea of the FlashAttention paper but not the details.
cavisne · a year ago
Causes every thread in the block to wait until all of them have reached this point. Worth reading a CUDA primer for more details on blocks/warps.

Since the threads rely on each other to fill the SRAM with all the needed data, if you didn't wait, some values would be missing.
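A rough CPU analogy in plain Python (not CUDA, just an illustration): threads cooperatively fill a shared buffer, and a barrier guarantees that no thread reads another thread's slot before every slot has been written. Conceptually this is what __syncthreads() does for a block staging a tile into SRAM.

```python
import threading

N_THREADS = 4
shared = [None] * N_THREADS             # stands in for the block's SRAM tile
barrier = threading.Barrier(N_THREADS)  # analogue of __syncthreads()
results = [None] * N_THREADS

def worker(tid):
    shared[tid] = tid * 10    # each thread writes only its own slot
    barrier.wait()            # block here until *all* slots are filled
    # Only after the barrier is it safe to read the other threads' slots.
    results[tid] = sum(shared)

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Without the `barrier.wait()`, a fast thread could call `sum(shared)` while some slots are still `None`, which is the "missing values" failure described above.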

xrd · a year ago
Any CUDA primer you recommend in particular? I had this same question.
einpoklum · a year ago
My GPU work is not in ML (deep or otherwise); but ...

1. "100 lines of CUDA" + PyTorch; maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

2. Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.

chillee · a year ago
> maybe this is useful and maybe it isn't, but counting lines of code on top of a huge codebase is not very meaningful.

In this case it's pretty reasonable imo, since the kernel itself is fairly independent - the usage of torch is just for some bindings for the data structures.

> Launching separate kernels, synchronously, on the default stream, for various operations, is typically not the right way to utilize a GPU.

This is actually the standard way to do things in ML. Assuming you're from an HPC background (where this may seem quite strange), the biggest change is that more or less everything in ML runs on the GPU, so there is very rarely any device-to-host synchronization. In addition, each individual kernel typically runs on fairly large chunks of data (a million elements would be on the smaller side), so maximizing occupancy with streams is not as necessary as in HPC.

dcanelhas · a year ago
If CPU/GPU execution speed is the goal while simultaneously code golfing the source size, https://halide-lang.org/ might have come in handy.
danielhanchen · a year ago
Fantastic work! Extremely neat and clear implementation! Interesting note on the backward pass - what do you think are the main blockers for a backward pass?
tspeterkim · a year ago
Thanks Daniel. The main blocker is me not being able to fully grasp the backward pass (I'm trying to understand Appendix B.2 in the original paper).

I need to get more comfortable with matrix derivatives before I can confidently reimplement it in the same minimal way as I did with the forward pass.
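One generic way to build that confidence (not from the paper, just a standard technique): derive the gradient by hand, then check it against finite differences. For instance, the softmax vector-Jacobian product, which shows up throughout attention backward passes (names here are illustrative):

```python
import math

def softmax(x):
    m = max(x)                           # subtract max for stability
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def softmax_backward(p, dout):
    """Analytic VJP of softmax: dL/dx_i = p_i * (dout_i - sum_j dout_j * p_j)."""
    dot = sum(d * pi for d, pi in zip(dout, p))
    return [pi * (d - dot) for pi, d in zip(p, dout)]

def numeric_grad(x, dout, eps=1e-6):
    """Central finite differences for L = sum(dout * softmax(x))."""
    grads = []
    for i in range(len(x)):
        xp = list(x); xp[i] += eps
        xm = list(x); xm[i] -= eps
        lp = sum(d * p for d, p in zip(dout, softmax(xp)))
        lm = sum(d * p for d, p in zip(dout, softmax(xm)))
        grads.append((lp - lm) / (2 * eps))
    return grads
```

If the analytic and numeric gradients agree to several decimal places, the hand derivation is almost certainly right; the same trick scales up to checking a full attention backward against autograd.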

danielhanchen · a year ago
Oh ok! Ye the backwards passes are always much more difficult due to the derivatives!
