1. Manual autograd engine - hand derived backprop steps.
2. QLoRA / LoRA finetuning is 80% faster and uses 50% less memory.
3. All kernels written in OpenAI's Triton language.
4. 0% loss in accuracy - no approximation methods - all exact.
5. No change of hardware necessary. Supports NVIDIA GPUs from 2018 onwards (CUDA Capability 7.5+).
6. Flash Attention support via Xformers.
7. Supports 4bit and 16bit LoRA finetuning (rough usage sketch after this list).
8. Train Slim Orca fully locally in 260 hours from 1301 hours (5x faster).
9. Open source version trains 5x faster or you can check out Unsloth Pro and Max codepaths for 30x faster training!
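For reference, here's roughly what items 2 and 7 look like in practice - a minimal sketch of a 4bit LoRA finetune (exact function and argument names are from memory so double check the README, and the model id is just an example):

    # Minimal sketch of a 4bit LoRA finetune - treat names as approximate.
    from unsloth import FastLanguageModel

    # Load the base model quantized to 4bit (QLoRA style) - no hardware changes needed.
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="unsloth/llama-2-7b",   # example model id
        max_seq_length=2048,
        load_in_4bit=True,
    )

    # Attach LoRA adapters - only these small low-rank matrices get trained,
    # which is where the memory savings come from.
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,                              # LoRA rank
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )

    # `model` then drops into a normal Huggingface / TRL training loop.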
https://www.reddit.com/r/LocalLLaMA/comments/188197j/80_fast... has more info about Unsloth!
Hopefully you can try it out! I wrote a blog post at https://unsloth.ai/introducing if you want to learn more about the hand-derived backprop steps, the Triton kernels and more! Thanks once again!
Whatever advantage they have, I don't see how they would be able to keep it for long as part of their closed source "pro" version.
If it’s low hanging fruit the open source equivalents are bound to snipe them before long.
If you don't believe the timings, I was the author of Hyperlearn https://github.com/danielhanchen/hyperlearn which makes ML faster - I also listed the papers which cite the algos.
I also used to work at NVIDIA making TSNE 2000x faster on GPUs and some other algos like Randomized SVD, sparse matrix multiplies etc.
If you have any suggestions on a more appropriate pricing strategy - I'm all ears!!
I really don't know much about pricing and the open core model, so I'm literally making stuff up.
B2B deals at 200-300k+ are your best bet at selling this IMO.
question on the perf benchmarks: why do all the results with 2 GPUs & DDP take longer than the single GPU case? Both benchmarks do the same amount of work, one training epoch, so this negative scaling is surprising.
1. DDP itself has an overhead since it has to synchronize gradients at every training step: GPU0 and GPU1 both have to share their gradients (an all-reduce) before each optimizer step - see the minimal sketch after point 2.
2. Huggingface doesn't seem to be well optimized for DDP, mainly due to inefficient data movement - we fixed that, and interestingly it's faster even on 1 GPU.
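To make point 1 concrete, here's a minimal DDP sketch (placeholder model and data, launched with torchrun, one process per GPU) - the per-step sync cost is the gradient all-reduce that fires inside backward():

    # Minimal DDP loop with a placeholder model - illustrates where the sync cost sits.
    import torch
    import torch.distributed as dist
    from torch.nn.parallel import DistributedDataParallel as DDP

    dist.init_process_group("nccl")            # one process per GPU via torchrun
    rank = dist.get_rank()
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(4096, 4096).cuda(rank)
    model = DDP(model, device_ids=[rank])      # installs gradient sync hooks
    opt = torch.optim.AdamW(model.parameters())

    for step in range(10):
        x = torch.randn(8, 4096, device=f"cuda:{rank}")
        loss = model(x).sum()
        loss.backward()                        # gradients are all-reduced across GPUs here
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()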
In other words, every benchmark, in either HF or Unsloth, is slower in absolute terms when going from 1 to 2 GPUs. That makes me think something is wrong with the test.
Could you share your benchmark code?
It'd be quite a lift unless we're willing to just accept the self-reported metrics as golden. And even then, they're always qualified by hardware and usage scope. Making it good enough to be useful is the hard part: a CI/CD pipeline with a bunch of machine configurations and benchmarks, along with a reasonable way to communicate the results...
If anyone's up for it you'd legitimately become indispensable.
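To sketch what I mean (every field name here is made up, just an illustration of one normalized record per submitted run):

    # Hypothetical schema for one crowdsourced benchmark result - not an existing format.
    from dataclasses import dataclass, asdict
    import json

    @dataclass
    class BenchmarkResult:
        library: str            # e.g. "huggingface" or "unsloth-oss"
        model: str              # base model being finetuned
        dataset: str            # finetuning dataset
        gpu: str                # hardware the run was measured on
        num_gpus: int
        wall_clock_seconds: float
        peak_vram_gb: float
        eval_accuracy: float    # task score, so speedups can be checked for regressions

    # Placeholder values only - a real run would fill these in from the CI job.
    result = BenchmarkResult("unsloth-oss", "llama-2-7b", "slim-orca",
                             "RTX 3090", 1, 3600.0, 14.2, 0.63)
    print(json.dumps(asdict(result), indent=2))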
I'll post it once it's completed, if you're interested!
For instance, are any of your prompting tests in say, Korean? What about winograd schema challenges in languages other than English? Japanese for instance, comes with its own unique set of context ambiguities that do not appear in English. I'm sure dozens of languages are similar. It'd be nice to have user contributable tests to cover the breadth of use cases here.
A great optimization that moves a score from, let's say, 95% -> 5% on "winograd-persian" may be fine or may be a showstopper, depending on what you care about.
That's why it's gotta be normalized, future-proof, and crowdsourced.
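Something like this as a contributed test case, say - the schema, suite name and scoring are purely illustrative:

    # Illustrative format for a crowdsourced, language-tagged eval case - hypothetical schema.
    winograd_case = {
        "suite": "winograd-persian",           # hypothetical suite name
        "language": "fa",
        "prompt": "...",                       # the ambiguous-pronoun sentence goes here
        "choices": ["antecedent A", "antecedent B"],
        "answer": 0,                           # index of the correct antecedent
    }

    def score(model_choice: int, case: dict) -> float:
        """1.0 if the model resolved the pronoun to the labelled antecedent, else 0.0."""
        return 1.0 if model_choice == case["answer"] else 0.0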
https://github.com/pytorch-labs/segment-anything-fast
https://github.com/pytorch-labs/gpt-fast
So technically the code can run, but I'll have to edit it to remove the Triton changes.