nserrino commented on Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels   gimletlabs.ai/blog/ai-gen... · Posted by u/nserrino
formalsystem · 5 months ago
I work on PyTorch and there are many things that make me suspicious about these results. My TL;DR: unless we get a zip file of all the kernels along with how they're benchmarked, results like this are almost impossible to verify.

1. I don't have an M4, but I have an M1 Pro, and when I tried running the claimed 18x-speedup VisionAttention example I got close to identical runtimes. This example has more issues: the main optimization the LLM is doing is a fusion, so not comparing to torch.compile is a bit sus. The numerics are off as well, and I suspect the atols were way too big. Finally, MultiheadAttention is a deprecated API, so using neither SDPA nor torch.compile is a weird choice (see the SDPA sketch after this list).

2. In general, 18x (and even the ~100x speedups claimed near the end) is just a smell that some kernel is incorrect. The typical way you get speedups like this is that you don't warm up or you forget to synchronize. PyTorch has a lot of benchmarking footguns, which is why sharing the exact eval scripts is helpful (see the timing sketch after this list).

3. Speaking of footguns, the shapes I saw in the examples were tiny; in that regime you're more often measuring noise, since the primary bottleneck is not compute or memory but overhead.

4. Generating many random shapes is also not so safe, because some input distributions make certain kernels trivial. For example, torch.randn() by default samples from a normal distribution with mean 0 and variance 1, so if you take the mean of a large vector you're almost guaranteed to get ~0, especially if your tolerance is too high (see the example after this list).

5. KernelBench levels measure vastly different things. If you want to compare against PyTorch operators, you want to focus on Level 1; Level 2 is fusions, so the right baseline is torch.compile, which is more reliable on nightlies. The Mamba 2 example (which I didn't run) also acknowledges that the primary thing it does is fusions, which, assuming everything is correct, would still be strange to baseline against eager.
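
To be concrete about point 1: the non-deprecated path I'd expect as the comparison is SDPA. A minimal sketch (shapes made up, assumes an Apple-silicon machine):

    import torch
    import torch.nn.functional as F

    # Made-up shapes: (batch, heads, seq_len, head_dim)
    q = torch.randn(1, 8, 128, 64, device="mps")
    k = torch.randn(1, 8, 128, 64, device="mps")
    v = torch.randn(1, 8, 128, 64, device="mps")
    out = F.scaled_dot_product_attention(q, k, v)  # fused attention path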
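
For point 2, here's roughly what careful timing on MPS looks like. A minimal sketch (iteration counts arbitrary), not the blog's actual harness:

    import time
    import torch

    def bench(fn, warmup=10, iters=100):
        for _ in range(warmup):      # warm up so one-time costs don't count
            fn()
        torch.mps.synchronize()      # drain queued GPU work before timing
        start = time.perf_counter()
        for _ in range(iters):
            fn()
        torch.mps.synchronize()      # wait for all kernels to finish
        return (time.perf_counter() - start) / iters

Skip the warmup or either synchronize and you can "measure" arbitrarily large speedups.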
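
And to see point 4 concretely:

    import torch

    x = torch.randn(1_000_000)  # samples ~ N(0, 1)
    print(x.mean().item())      # ~0.001 in magnitude: a buggy kernel that
                                # just returns 0 passes a check with atol=1e-2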

So please, for everyone's sanity: if you find a kernel that's 10-100x faster, share the exact code and benchmarking methodology with your smartest performance friends. You should be extremely skeptical of such results; often you can discard some numbers based on a simple speed-of-light analysis. We all desperately want faster kernels, but to get them we have to be really fanatical about correctness.
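
(By speed-of-light analysis I mean a quick roofline-style bound: a memory-bound kernel can't be faster than bytes moved divided by memory bandwidth. A toy example, using a rough M1 Pro bandwidth figure:

    bytes_moved = 3 * 4096 * 4096 * 4  # e.g. read two fp32 matrices, write one
    bandwidth = 200e9                  # ~200 GB/s, rough M1 Pro figure
    print(bytes_moved / bandwidth)     # ~1 ms: no correct kernel beats this

If a claimed time is below that floor, the kernel isn't doing the work.)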

nserrino · 5 months ago
Hey, thanks for the thoughtful comments. A lot of big claims have been made in this area, so skepticism is the right default reaction. tl;dr: agreed that we should provide the kernels and benchmark suite so others can evaluate this; we'll follow up with that.

A few clarifications:

1. Baselines - We didn't compare to torch.compile because, as of PyTorch 2.7, torch.compile doesn't support the MPS backend, and we ran into some issues on many of the problems when using it. GitHub issue: https://github.com/pytorch/pytorch/issues/150121. Once it's supported, it will be the obvious baseline (see the sketch below this list).

2. Methodology - We followed KernelBench’s protocol to establish a baseline on Metal, adding more correctness checks. Warmup and synchronization were done. We recognize the limitations here and are expanding the validation suite.

3. Optimizations - Right now most of the optimizations are fusions, but there is some use of Metal-specific primitives/optimizations. We expect that as we make the supervisor more sophisticated, the novelty of the optimized kernels will also increase.
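
On the baseline point: once MPS support lands, the comparison is straightforward. A sketch of what it would look like (stand-in model; assumes an Apple-silicon machine and a PyTorch build where torch.compile supports MPS):

    import torch

    model = torch.nn.Linear(256, 256).to("mps")  # stand-in model
    compiled = torch.compile(model)              # the fusion-aware baseline
    out = compiled(torch.randn(32, 256, device="mps"))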

Overall the goal here is to get some % of the benefit of a human expert in kernel engineering, without developer effort. Compiler-based optimizations are great, but hand-tuned implementations are still common for performance-critical models. The hope is that we can automate some of that process.

nserrino commented on Speeding up PyTorch inference on Apple devices with AI-generated Metal kernels   gimletlabs.ai/blog/ai-gen... · Posted by u/nserrino
turbo_wombat · 5 months ago
They are comparing unoptimized PyTorch inference, something you would never deploy on a device, to a model with custom kernels.

Yes, of course the model with custom kernels is faster, whether it's written by a human or an AI.

Generally, PyTorch inference is meant to be used during the training process and when running metrics, not when deploying. For deployment, you should export to ONNX and then compile the ONNX to the native format of the device, as sketched below.
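
For example, the export step of that pipeline (placeholder model and shape):

    import torch

    model = torch.nn.Linear(256, 10).eval()          # placeholder model
    example = torch.randn(1, 256)
    torch.onnx.export(model, example, "model.onnx")  # then compile model.onnx
                                                     # with the device toolchain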

If you aren't familiar with the pipeline for ML deployment, this is the equivalent of comparing interpreted code to compiled code.

nserrino · 5 months ago
PyTorch is the baseline because that's what people prototype in, and it's the most common reference point. The aim here is to show that you can start from prototype code and automatically produce lower-level kernels (in this case Metal) that are more usable in real deployments, without additional work from the developer. Frontier models are capable of generating efficient Metal kernels automatically and immediately, and they will only get better. We expect significant improvements as we refine the approach, but this is enough to show that it seems to be a tractable problem for AI.
nserrino commented on They Might Be Giants Flood EPK Promo (1990) [video]   youtube.com/watch?v=C-tQS... · Posted by u/CaliforniaKarl
nserrino · a year ago
They are just incredible live ... Flood was the soundtrack of my childhood. It's great to see how many fans they have on HN. If you have a chance to check them out in concert, do it!
nserrino commented on We were wrong about GPUs   fly.io/blog/wrong-about-g... · Posted by u/mxstbr
cyberax · a year ago
Hah. We're doing AI, but we're doing vision-based stuff and not LLMs. For us, the problem has been deploying models.

Google and AWS helpfully offered their managed LLM AI services, but they don't really have anything terribly more useful than just machines with GPUs. Which are expensive.

I'm going to check fly.io...

nserrino · a year ago
What kind of models are you deploying and what type of problems are you having with deploying them?

nserrino commented on Alphabet to invest another $5B into Waymo   techcrunch.com/2024/07/23... · Posted by u/chrixf
nserrino · 2 years ago
I've mostly switched over to Waymo, but had to take Ubers twice in the last month:

* The first one, the driver made multiple racist remarks about different groups he observed as we drove.

* The second one, the driver talked at length about UFOs and how they are real, for the entire 50 minute drive.

Most drivers are totally normal and don't do things like that, but the tail end of negative experiences can be quite bad. Dirty cars, loud radios, body odor, and unsafe driving are all relatively common with human drivers. A Lyft driver I was riding with a few years back almost ran over a man in a wheelchair who had the right of way.

Wait times are also more reliable so far with Waymo. It's not uncommon for an Uber/Lyft driver to accept a ride but then not drive toward you for 5+ minutes. Waymo has the advantage of predictability - both in terms of arrival time and overall travel time (whereas there is variance among human drivers).

Sometimes I've had Waymos get stuck, but it usually resolves within 10-15 seconds.

Given how smooth, predictable, and safe Waymos are, I don't see a strong reason to risk a negative experience with a human driver (beyond ideological reasons). However, I hope another strong provider comes on the market soon to give them some competition.

nserrino commented on Programming Burnout   tsk.bearblog.dev/programm... · Posted by u/memorable
nserrino · 4 years ago
The stories we often see about burnout in tech are interesting. There's a lot about burning out on programming itself, on the long hours, and so on. Of course, long hours will eventually burn you out in almost any profession. Less commonly discussed: most people I know actually burned out from bureaucracy. It's far more exhausting and demotivating to make no forward progress and waste time in meetings all day. Of course, it should go without saying that some meetings are necessary.

u/nserrino

Karma: 391 · Cake day: December 25, 2015