Posted by u/danielhanchen 2 years ago
Show HN: 80% faster, 50% less memory, 0% loss of accuracy Llama finetuning (github.com/unslothai/unsl...)
Hi HN! I'm just sharing a project I've been working on during the LLM Efficiency Challenge - you can now finetune Llama with QLoRA 5x faster than Huggingface's original implementation on your own local GPU. Some highlights:

1. Manual autograd engine - hand derived backprop steps.

2. QLoRA / LoRA 80% faster, 50% less memory.

3. All kernels written in OpenAI's Triton language.

4. 0% loss in accuracy - no approximation methods - all exact.

5. No change of hardware necessary. Supports NVIDIA GPUs from 2018 onwards (CUDA Compute Capability 7.5+).

6. Flash Attention support via Xformers.

7. Supports 4bit and 16bit LoRA finetuning.

8. Train Slim Orca fully locally in 260 hours, down from 1301 hours (5x faster).

9. The open source version trains 5x faster, or you can check out the Unsloth Pro and Max codepaths for 30x faster training!

https://www.reddit.com/r/LocalLLaMA/comments/188197j/80_fast... has more info about Unsloth!

Hopefully you can try it out! I wrote a blog post at https://unsloth.ai/introducing if you want to learn more about our hand-derived backprop, Triton kernels and other tricks! Thanks once again!
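If you want a feel for the workflow, here is a rough sketch of what a QLoRA finetune looks like, pieced together from memory of the README and example notebooks - the model name, argument names and trl usage here are assumptions and may differ from the current version:

    from unsloth import FastLanguageModel
    from datasets import Dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # Load a 4-bit quantized base model plus its tokenizer (repo name is illustrative).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-2-7b-bnb-4bit",
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters - only these small matrices get trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    # Tiny in-memory dataset so the sketch is self-contained; use Slim Orca / Alpaca in practice.
    dataset = Dataset.from_dict({"text": ["### Instruction:\nSay hi.\n### Response:\nHi!"]})

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(per_device_train_batch_size=1, max_steps=10, output_dir="outputs"),
    )
    trainer.train()

The idea is that this is a drop-in replacement for the usual Huggingface + PEFT path, so the speedups come without changing the training recipe.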

apsec112 · 2 years ago
I haven't run the code, but... how is this even possible? I've done PyTorch profiling on QLoRA Llama-2-70B finetunes, and the runtime is dominated by the large matrix multiplies in the MLP layers, plus a bit for attention. Under the hood, this repo calls the same torch.matmul() for the MLP and flash_attn_func() for attention as HuggingFace does. So how can it be that much faster? They have a few Triton kernels, but Triton doesn't appear to be used for the MLP or attention, and that's most of the bottleneck.
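For anyone who wants to reproduce that kind of breakdown, here is a minimal torch.profiler sketch over a toy MLP block as a stand-in for the real model - illustrative only:

    import torch
    import torch.nn as nn
    from torch.profiler import profile, ProfilerActivity

    # Toy stand-in for a transformer MLP block; a real profile would wrap a full
    # QLoRA training step on the actual model instead.
    mlp = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096)).cuda()
    x = torch.randn(4, 512, 4096, device="cuda", requires_grad=True)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        y = mlp(x)
        y.sum().backward()

    # Sort by GPU time: the big matmuls (aten::mm / aten::addmm) should dominate.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
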
WhitneyLand · 2 years ago
They say it's due to a custom, optimized version of autograd, which is sort of the key calculus component of training. They also mention simpler things like function inlining and memory optimizations. It seems plausible these things could be optimized.
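For concreteness, here is a generic sketch of what a hand-derived backward pass can look like for a LoRA-style linear layer Y = XW + s(XA)B - illustrative only, not Unsloth's actual code. The idea is to write the gradient formulas yourself in a torch.autograd.Function, so you control exactly what gets saved and reused instead of letting autograd trace every intermediate:

    import torch

    class LoRALinear(torch.autograd.Function):
        """Y = X @ W + s * (X @ A) @ B with a hand-derived backward (W is frozen)."""

        @staticmethod
        def forward(ctx, X, W, A, B, s):
            XA = X @ A                      # small rank-r intermediate, cheap to keep
            ctx.save_for_backward(X, W, A, B, XA)
            ctx.s = s
            return X @ W + s * (XA @ B)

        @staticmethod
        def backward(ctx, dY):
            X, W, A, B, XA = ctx.saved_tensors
            s = ctx.s
            dY_Bt = dY @ B.t()              # computed once, reused for dX and dA
            dX = dY @ W.t() + s * (dY_Bt @ A.t())
            dA = s * (X.t() @ dY_Bt)
            dB = s * (XA.t() @ dY)
            return dX, None, dA, dB, None   # no gradient for the frozen W or the scalar s

    # Quick shape check
    X = torch.randn(8, 64, requires_grad=True)
    W = torch.randn(64, 64)
    A = torch.randn(64, 8, requires_grad=True)
    B = torch.randn(8, 64, requires_grad=True)
    LoRALinear.apply(X, W, A, B, 2.0).sum().backward()

Writing the backward out by hand also lets you skip work autograd would otherwise do, such as tracking gradients for the frozen base weights.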

Whatever advantage they have, I don't see how they would be able to keep it for long as part of their closed source “pro” version.

If it's low-hanging fruit, the open source equivalents are bound to snipe them before long.

danielhanchen · 2 years ago
There's more as well!! I agree that hypothetically, in a few years, people will catch up - we're well aware of that! But we made this into a startup, and so this was just one product we were releasing! We're going to be making many AI products in the coming weeks!
TheGeminon · 2 years ago
There is a more detailed explanation at https://unsloth.ai/introducing
apsec112 · 2 years ago
That... doesn't really explain how they can get such a high number? Standard FLOP efficiency on fine-tuning big models is like 30-40%. How can you get 750%?
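Back-of-the-envelope with illustrative numbers: if the baseline already sits at 30-40% FLOP utilization, a pure utilization win cannot explain a 5x speedup - the rest has to come from doing fewer FLOPs or moving less data.

    # Illustrative arithmetic, not measured numbers.
    baseline_mfu = 0.35      # typical model-FLOPs utilization for a finetune
    speedup = 5.0            # the open source claim in the post
    implied_mfu = baseline_mfu * speedup
    print(f"Implied utilization if the FLOP count stayed the same: {implied_mfu:.0%}")  # 175% > 100%
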
bradfox2 · 2 years ago
These are significant claims locked behind a paid 'pro' version. Red flags.
danielhanchen · 2 years ago
Sorry about that - I'm super new to pricing and stuff so it might seem off since I'm literally making the plans with my bro as we go along.

If you don't believe the timings, I was the author of Hyperlearn https://github.com/danielhanchen/hyperlearn which makes ML faster - I also listed the papers which cite the algos.

I also used to work at NVIDIA, making t-SNE 2000x faster on GPUs, along with other algos like randomized SVD and sparse matrix multiplies.

If you have any suggestions on a more appropriate pricing strategy - I'm all ears!!

I really don't know much about pricing and the open core model, so I'm literally making it up as I go.

peytoncasper · 2 years ago
Can I suggest that you ignore all the criticism about the pricing on this thread, immediately find a sales rep or SE who has worked at an early-stage DB company, and begin cold calling high-end customers with thousands of GPUs.

B2B deals at 200-300k+ are your best bet at selling this IMO.

danielhanchen · 2 years ago
Oh thanks for the cool tips! Super good points - I'll have to ask around; sadly I'm just the engineering guy lol :(
peytoncasper · 2 years ago
Would be happy to answer questions as you have them.
danielhanchen · 2 years ago
For those interested - just released a new blog post on all our optimizations. There are also 59 fully reproducible benchmarks! https://unsloth.ai/blog/mistral-benchmark
neilmovva · 2 years ago
promising results, excited to try it out!

question on the perf benchmarks: why do all the results with 2 GPUs & DDP take longer than the single GPU case? Both benchmarks do the same amount of work, one training epoch, so this negative scaling is surprising.

danielhanchen · 2 years ago
So there are 2 main reasons:

1. DDP itself has an overhead since it has to synchronize gradients at each training step - GPU0 and GPU1 both have to send their gradients back to GPU0 (there's a minimal sketch of this below).

2. Huggingface seems to not be optimized well for DDP, mainly due to inefficient data movement - we fixed that, and interestingly it's faster even on 1 GPU.
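To make point 1 concrete, here is a generic minimal DDP loop (illustrative only, not the benchmark code) - the gradient all-reduce triggered inside loss.backward() is the per-step synchronization a single GPU never pays:

    # Minimal DDP sketch - launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = DDP(nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        for _ in range(10):
            x = torch.randn(32, 4096, device="cuda")
            loss = model(x).square().mean()
            loss.backward()                 # <- gradients are all-reduced across GPUs here
            opt.step()
            opt.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()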

neilmovva · 2 years ago
I agree that synchronization causes overhead, so 2x GPUs won't achieve the ideal 0.5x total runtime. But here, taking your Alpaca benchmark as an example, we are seeing 2x GPUs get 3.6x runtime with Huggingface, or 1.15x with Unsloth Max.

In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.

Could you share your benchmark code?

kristopolous · 2 years ago
It'd be great to have a chronicle of all these efforts. I lost track of the variations quite a long time ago.

It'd be quite a lift unless we're just willing to accept the self-reported metrics as golden. And even then, they're always qualified by hardware and usage scope. Making it good enough to be useful is the hard part: a CI/CD pipeline with a bunch of machine configurations and benchmarks, along with a reasonable way to communicate them...

If anyone's up for it you'd legitimately become indispensable.

danielhanchen · 2 years ago
Hey there! Yes that's exactly what I thought! I was writing up a blog post at https://colab.research.google.com/drive/1AOuhMVILE06mD-Go7-R... which shows, step by step, every change I made plus the timings / memory savings.

I'll post it once it's completed if you're interested!

kristopolous · 2 years ago
Thanks. There have been a few substantial and laudable efforts which are much appreciated, but what I'm suggesting is actual continuous infrastructure - like how those benchmarking sites have software people run on their machines that phones home, so that people who make new benchmarks or new variations can submit them and refine the results.

For instance, are any of your prompting tests in, say, Korean? What about Winograd schema challenges in languages other than English? Japanese, for instance, comes with its own unique set of context ambiguities that do not appear in English. I'm sure dozens of languages are similar. It'd be nice to have user-contributable tests to cover the breadth of use cases here.

A great optimization that moves a score, let's say, from 95% -> 5% on "winograd-persian" may be fine or may be a showstopper, depending on what you care about.

That's why it's gotta be normalized, future-proof, and crowdsourced.

naveen99 · 2 years ago
How does this compare to PyTorch Labs' optimizations for SAM and Llama 2?

https://github.com/pytorch-labs/segment-anything-fast

https://github.com/pytorch-labs/gpt-fast

danielhanchen · 2 years ago
Hey! Those are for inference - our code is for training :) We do plan faster inference in the future!! Saw gpt-fast by Chillee - it is blazingly fast!
graphe · 2 years ago
Somewhat related: is it worth it to use a P100 or P40? I was gonna get one, but it seems like more and more projects are dropping support for Pascal.
danielhanchen · 2 years ago
P100 - oh my, so Xformers for Flash Attention I think does support it, but Triton requires Compute Capability 7.0+, whilst a P100 is only 6.0 :(

So technically the code can run, but I'll have to edit it to remove the Triton changes.
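If anyone wants to check their own card, compute capability is easy to query from PyTorch:

    import torch

    # Triton needs Compute Capability 7.0+; a P100 reports (6, 0) and a P40 reports (6, 1).
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("Triton-compatible:", (major, minor) >= (7, 0))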

graphe · 2 years ago
Thank you! I generally don't see as much support for them, and you need to use old versions even where it is supported. This trend is happening more and more; it's unfortunate, but it seems like we need to use a newer GPU now.
