Posted by u/danielhanchen 2 years ago
Show HN: 80% faster, 50% less memory, 0% loss of accuracy Llama finetuning (github.com/unslothai/unsl...)
Hi HN! I'm just sharing a project I've been working on during the LLM Efficiency Challenge - you can now finetune Llama with QLoRA 5x faster than Huggingface's original implementation on your own local GPU. Some highlights:

1. Manual autograd engine - hand derived backprop steps.

2. QLoRA / LoRA 80% faster, 50% less memory.

3. All kernels written in OpenAI's Triton language.

4. 0% loss in accuracy - no approximation methods - all exact.

5. No change of hardware necessary. Supports NVIDIA GPUs from 2018 onwards (CUDA Compute Capability 7.5+).

6. Flash Attention support via Xformers.

7. Supports 4bit and 16bit LoRA finetuning.

8. Train Slim Orca fully locally in 260 hours, down from 1301 hours (5x faster).

9. The open source version trains 5x faster, or you can check out the Unsloth Pro and Max codepaths for 30x faster training!

https://www.reddit.com/r/LocalLLaMA/comments/188197j/80_fast... has more info about Unsloth!

Hopefully you can try it out! I wrote a blog post at https://unsloth.ai/introducing if you want to learn more about our hand-derived backprop, Triton kernels and other tricks! Thanks once again!
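If you want a feel for the workflow, here is a rough sketch of what a QLoRA finetune looks like, pieced together from memory of the README and example notebooks - the model name, argument names and trl usage here are assumptions and may differ from the current version:

    from unsloth import FastLanguageModel
    from datasets import Dataset
    from trl import SFTTrainer
    from transformers import TrainingArguments

    # Load a 4-bit quantized base model plus its tokenizer (repo name is illustrative).
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-2-7b-bnb-4bit",
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters - only these small matrices get trained.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
    )

    # Tiny in-memory dataset so the sketch is self-contained; use Slim Orca / Alpaca in practice.
    dataset = Dataset.from_dict({"text": ["### Instruction:\nSay hi.\n### Response:\nHi!"]})

    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        max_seq_length=2048,
        args=TrainingArguments(per_device_train_batch_size=1, max_steps=10, output_dir="outputs"),
    )
    trainer.train()

The idea is that this is a drop-in replacement for the usual Huggingface + PEFT path, so the speedups come without changing the training recipe.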

apsec112 · 2 years ago
I haven't run the code, but... how is this even possible? I've done PyTorch profiling on QLoRA Llama-2-70B finetunes, and the runtime is dominated by the large matrix multiplies in the MLP layers, plus a bit for attention. Under the hood, this repo calls the same torch.matmul() for the MLP and flash_attn_func() for attention as HuggingFace does. So how can it be that much faster? They have a few Triton kernels, but Triton doesn't appear to be used for the MLP or attention, and that's most of the bottleneck.
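For anyone who wants to reproduce that kind of breakdown, here is a minimal torch.profiler sketch over a toy MLP block as a stand-in for the real model - illustrative only:

    import torch
    import torch.nn as nn
    from torch.profiler import profile, ProfilerActivity

    # Toy stand-in for a transformer MLP block; a real profile would wrap a full
    # QLoRA training step on the actual model instead.
    mlp = nn.Sequential(nn.Linear(4096, 11008), nn.GELU(), nn.Linear(11008, 4096)).cuda()
    x = torch.randn(4, 512, 4096, device="cuda", requires_grad=True)

    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        y = mlp(x)
        y.sum().backward()

    # Sort by GPU time: the big matmuls (aten::mm / aten::addmm) should dominate.
    print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
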
WhitneyLand · 2 years ago
They say it's due to a custom, optimized version of autograd, which is sort of the key calculus component of training. They also mention simpler things like function inlining and memory optimizations. It seems plausible these things could be optimized.
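For concreteness, here is a generic sketch of what a hand-derived backward pass can look like for a LoRA-style linear layer Y = XW + s(XA)B - illustrative only, not Unsloth's actual code. The idea is to write the gradient formulas yourself in a torch.autograd.Function, so you control exactly what gets saved and reused instead of letting autograd trace every intermediate:

    import torch

    class LoRALinear(torch.autograd.Function):
        """Y = X @ W + s * (X @ A) @ B with a hand-derived backward (W is frozen)."""

        @staticmethod
        def forward(ctx, X, W, A, B, s):
            XA = X @ A                      # small rank-r intermediate, cheap to keep
            ctx.save_for_backward(X, W, A, B, XA)
            ctx.s = s
            return X @ W + s * (XA @ B)

        @staticmethod
        def backward(ctx, dY):
            X, W, A, B, XA = ctx.saved_tensors
            s = ctx.s
            dY_Bt = dY @ B.t()              # computed once, reused for dX and dA
            dX = dY @ W.t() + s * (dY_Bt @ A.t())
            dA = s * (X.t() @ dY_Bt)
            dB = s * (XA.t() @ dY)
            return dX, None, dA, dB, None   # no gradient for the frozen W or the scalar s

    # Quick shape check
    X = torch.randn(8, 64, requires_grad=True)
    W = torch.randn(64, 64)
    A = torch.randn(64, 8, requires_grad=True)
    B = torch.randn(8, 64, requires_grad=True)
    LoRALinear.apply(X, W, A, B, 2.0).sum().backward()

Writing the backward out by hand also lets you skip work autograd would otherwise do, such as tracking gradients for the frozen base weights.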

Whatever advantage they have, I don't see how they would be able to keep it for long as part of their closed source “pro” version.

If it's low-hanging fruit, the open source equivalents are bound to snipe them before long.

danielhanchen · 2 years ago
There's more as well!! I agree that hypothetically, in a few years, people will catch up - we're well aware of that! But we made this into a startup, and so this was just one product we were releasing! We're going to be making many AI products in the coming weeks!
TheGeminon · 2 years ago
There is a more detailed explanation at https://unsloth.ai/introducing
apsec112 · 2 years ago
That... doesn't really explain how they can get such a high number? Standard FLOP efficiency on fine-tuning big models is like 30-40%. How can you get 750%?
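Back-of-the-envelope with illustrative numbers: if the baseline already sits at 30-40% FLOP utilization, a pure utilization win cannot explain a 5x speedup - the rest has to come from doing fewer FLOPs or moving less data.

    # Illustrative arithmetic, not measured numbers.
    baseline_mfu = 0.35      # typical model-FLOPs utilization for a finetune
    speedup = 5.0            # the open source claim in the post
    implied_mfu = baseline_mfu * speedup
    print(f"Implied utilization if the FLOP count stayed the same: {implied_mfu:.0%}")  # 175% > 100%
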
bradfox2 · 2 years ago
These are significant claims locked behind a paid 'pro' version. Red flags.
danielhanchen · 2 years ago
Sorry about that - I'm super new to pricing and stuff so it might seem off since I'm literally making the plans with my bro as we go along.

If you don't believe the timings, I was the author of Hyperlearn https://github.com/danielhanchen/hyperlearn which makes ML faster - I also listed the papers which cite the algos.

I also used to work at NVIDIA, making t-SNE 2000x faster on GPUs, along with other algos like randomized SVD and sparse matrix multiplies.

If you have any suggestions on a more appropriate pricing strategy - I'm all ears!!

I really don't know much about pricing and the open core model, so I'm literally making it up as I go.

peytoncasper · 2 years ago
Can I suggest that you ignore all the criticism about the pricing on this thread, immediately find a sales rep or SE who has worked at an early-stage DB company, and begin cold calling high-end customers with thousands of GPUs.

B2B deals at 200-300k+ are your best bet at selling this IMO.

danielhanchen · 2 years ago
Oh thanks for the cool tips! Super good points - I'll have to ask around; sadly I'm just the engineering guy lol :(
peytoncasper · 2 years ago
Would be happy to answer questions as you have them.
danielhanchen · 2 years ago
For those interested - just released a new blog post on all our optimizations. There are also 59 fully reproducible benchmarks! https://unsloth.ai/blog/mistral-benchmark
neilmovva · 2 years ago
promising results, excited to try it out!

question on the perf benchmarks: why do all the results with 2 GPUs & DDP take longer than the single GPU case? Both benchmarks do the same amount of work, one training epoch, so this negative scaling is surprising.

danielhanchen · 2 years ago
So there are 2 main reasons:

1. DDP itself has an overhead since it has to synchronize gradients at each training step - GPU0 and GPU1 both have to send their gradients back to GPU0 (there's a minimal sketch of this below).

2. Huggingface seems to not be optimized well for DDP, mainly due to inefficient data movement - we fixed that, and interestingly it's faster even on 1 GPU.
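To make point 1 concrete, here is a generic minimal DDP loop (illustrative only, not the benchmark code) - the gradient all-reduce triggered inside loss.backward() is the per-step synchronization a single GPU never pays:

    # Minimal DDP sketch - launch with: torchrun --nproc_per_node=2 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group("nccl")
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)

        model = DDP(nn.Linear(4096, 4096).cuda(), device_ids=[local_rank])
        opt = torch.optim.SGD(model.parameters(), lr=1e-3)

        for _ in range(10):
            x = torch.randn(32, 4096, device="cuda")
            loss = model(x).square().mean()
            loss.backward()                 # <- gradients are all-reduced across GPUs here
            opt.step()
            opt.zero_grad()

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()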

neilmovva · 2 years ago
I agree that synchronization causes overhead, so 2x GPUs won't achieve the ideal 0.5x total runtime. But here, taking your Alpaca benchmark as an example, we are seeing 2x GPUs get 3.6x runtime with Huggingface, or 1.15x with Unsloth Max.

In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.

Could you share your benchmark code?

kristopolous · 2 years ago
It'd be great to have a chronicle of all these efforts. I lost track of the variations quite a long time ago.

It'd be quite a lift unless we're just willing to accept the self-reported metrics as golden. And even then, they're always qualified by hardware and usage scope. Making it good enough to be useful is the hard part: a CI/CD pipeline with a bunch of machine configurations and benchmarks, along with a reasonable way to communicate them...

If anyone's up for it you'd legitimately become indispensable.

danielhanchen · 2 years ago
Hey there! Yes that's exactly what I thought! I was writing up a blog post at https://colab.research.google.com/drive/1AOuhMVILE06mD-Go7-R... which shows, step by step, every change I made plus the timings / memory savings.

I'll post it once it's completed if you're interested!

kristopolous · 2 years ago
Thanks. There have been a few substantial and laudable efforts which are much appreciated, but what I'm suggesting is actual continuous infrastructure - like how those benchmarking sites have software people run on their machines that phones home, so that people who make new benchmarks or new variations can submit them and refine the results.

For instance, are any of your prompting tests in, say, Korean? What about Winograd schema challenges in languages other than English? Japanese, for instance, comes with its own unique set of context ambiguities that do not appear in English. I'm sure dozens of languages are similar. It'd be nice to have user-contributable tests to cover the breadth of use cases here.

A great optimization that moves a score, let's say, from 95% -> 5% on "winograd-persian" may be fine or may be a showstopper, depending on what you care about.

That's why it's gotta be normalized, future-proof, and crowdsourced.

naveen99 · 2 years ago
How does this compare to PyTorch Labs' optimizations for SAM and Llama 2?

https://github.com/pytorch-labs/segment-anything-fast

https://github.com/pytorch-labs/gpt-fast

danielhanchen · 2 years ago
Hey! Those are for inference - our code is for training :) We do plan faster inference in the future!! Saw gpt-fast by Chillee - it is blazingly fast!
graphe · 2 years ago
Somewhat related: is it worth it to use a P100 or P40? I was gonna get one, but it seems like more and more projects are dropping support for Pascal.
danielhanchen · 2 years ago
P100 - oh my, so Xformers for Flash Attention I think does support it, but Triton requires Compute Capability 7.0+, whilst a P100 is only 6.0 :(

So technically the code can run, but I'll have to edit it to remove the Triton changes.
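If anyone wants to check their own card, compute capability is easy to query from PyTorch:

    import torch

    # Triton needs Compute Capability 7.0+; a P100 reports (6, 0) and a P40 reports (6, 1).
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("Triton-compatible:", (major, minor) >= (7, 0))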

graphe · 2 years ago
Thank you! I generally don't see as much support for them, and you need to use old versions even where it is supported. This trend is happening more and more; it's unfortunate, but it seems like we need to use a newer GPU now.
