junrushao1994 commented on Punica: Serving multiple LoRA finetuned LLM as one   github.com/punica-ai/puni... · Posted by u/abcdabcd987
junrushao1994 · 2 years ago
This is great! Have you guys considered integrating with one of the existing systems?
junrushao1994 commented on Scaling LLama2-70B with Multiple Nvidia/AMD GPU   blog.mlc.ai/2023/10/19/Sc... · Posted by u/junrushao1994
brucethemoose2 · 2 years ago
For those suffering from deceptive graph fatigue, this is impressive.

exLlama is blazing fast. Even if they just benched exllamav1, exllamav2 is only a bit faster, at least on my single 3090 in a similar environment.

vLLM is focused more on batching performance, but even then MLC/TVM looks like it's putting up a fight without batching.

I am a bit fatigued with llama backends myself, and it looks like this won't help me run 70B on a single 3090, but I need to dig into mlc again.

junrushao1994 · 2 years ago
Yeah, thanks for sharing! These are definitely super valuable data points and insights :)

Regarding exllama-V2, MLC/TVM does benchmark against it:

- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...

- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...

> vLLM focuses more on batching performance

Exactly. vLLM doesn’t optimize for latency-first scenarios as it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.
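
To make the latency-vs-throughput distinction concrete, here's a rough sketch of the two metrics (a hypothetical timing harness, not vLLM's or MLC's actual code; `decode_step` stands in for whatever backend is being measured):

```python
import time

def measure(decode_step, batch_size: int, num_tokens: int):
    """Time `num_tokens` decode steps for `batch_size` concurrent requests."""
    start = time.perf_counter()
    for _ in range(num_tokens):
        decode_step(batch_size)  # one decoding step across the whole batch
    elapsed = time.perf_counter() - start
    per_request_tok_s = num_tokens / elapsed              # what a single user feels (latency)
    aggregate_tok_s = batch_size * num_tokens / elapsed   # what the server delivers (throughput)
    return per_request_tok_s, aggregate_tok_s

# Dummy stand-in for a real backend's decode step:
lat, thr = measure(lambda bs: time.sleep(0.01), batch_size=8, num_tokens=32)
```

At batch size 1 the two numbers coincide, which is the regime this blog post benchmarks; throughput-oriented gains show up as the aggregate number grows with batch size.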

Regarding batching, it is coming pretty soon, and we will have another blog post on this.

junrushao1994 commented on Scaling LLama2-70B with Multiple Nvidia/AMD GPU   blog.mlc.ai/2023/10/19/Sc... · Posted by u/junrushao1994
junrushao1994 · 2 years ago
Machine Learning Compilation (MLC) now supports compiling LLMs to multiple GPUs.

It runs 4-bit quantized Llama2-70B at:

- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k

- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k

- It also scales well to 8 A10G/A100 GPUs in our experiments.

Details:

- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...

- Project: https://github.com/mlc-ai/mlc-llm
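
For a rough sense of why two 24 GB cards are enough, here's a back-of-the-envelope check (the 4.5 bits/weight figure is my own assumption to account for quantization scales, not a number from the blog post):

```python
params = 70e9               # Llama2-70B parameter count
bits_per_weight = 4.5       # assumed: 4-bit weights plus group-quantization metadata
num_gpus = 2                # 2-way tensor parallelism

weight_bytes = params * bits_per_weight / 8
per_gpu_gib = weight_bytes / num_gpus / 2**30
print(f"~{per_gpu_gib:.1f} GiB of weights per GPU")
# ≈ 18.3 GiB, leaving headroom for the KV cache on a 24 GiB 4090 / 7900 XTX
```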

junrushao1994 commented on GPU-Accelerated LLM on an Orange Pi   blog.mlc.ai/2023/08/09/GP... · Posted by u/tosh
packetlost · 2 years ago
I had to make a minor modification to the code to make the Rust compiler happy: just add a `.as_slice()` when the compilation fails. I'll submit a PR if it's not fixed already.
junrushao1994 · 2 years ago
Ah, please help us by submitting a PR! I noticed the Rust build failed last night but didn’t get a chance to look into it.
junrushao1994 commented on Deep Learning Systems   dlsyscourse.org/lectures/... · Posted by u/__rito__
junrushao1994 · 2 years ago
This is a particularly unique course offering an introduction to ML compilation and deployment :)
junrushao1994 commented on Making AMD GPUs competitive for LLM inference   blog.mlc.ai/2023/08/09/Ma... · Posted by u/djoldman
quickthrower2 · 2 years ago
Off topic but feels like a good place to ask? Can WebGPU give you decent performance on non-Cuda and help accomplish these kinds of aims? (Geohot I think is aiming to avoid a single chipmaker monopoly on AI which he sees as a bad thing /paraphrase)
junrushao1994 · 2 years ago
As of today, performance in WebGPU isn't as competitive yet, but there is quite a lot of low-hanging fruit for WebGPU to pick up.
junrushao1994 commented on Making AMD GPUs competitive for LLM inference   blog.mlc.ai/2023/08/09/Ma... · Posted by u/djoldman
azeirah · 2 years ago
I know plenty of open-source projects that list and thank every individual contributor. The website could do that too!
junrushao1994 · 2 years ago
That's a great idea! We should dig around and see if there's any plugin to use
junrushao1994 commented on Making AMD GPUs competitive for LLM inference   blog.mlc.ai/2023/08/09/Ma... · Posted by u/djoldman
gsuuon · 2 years ago
Congrats Junru! I'm not on AMD but love seeing progress in this project. Excited for batched inference -- I didn't think it'd be useful for me but I've realized batched inference is also useful for a single user / edge device workload.

Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)

junrushao1994 · 2 years ago
This is amazing to hear, Steven! (Sorry, I locked myself out of Discord a couple of days ago...) I'm sure there are a bunch of features missing, like the biased sampling you mentioned, and we're more than happy to merge PRs if you'd love to contribute :)
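
For readers wondering what biased sampling refers to, here's a minimal sketch (illustrative only, not ad-llama's or MLC's actual API): an additive per-token bias applied to the logits before sampling.

```python
import numpy as np

def sample_with_bias(logits: np.ndarray, bias: dict[int, float], temperature: float = 1.0) -> int:
    """Sample a token id after adding per-token logit biases."""
    logits = logits.astype(np.float64).copy()
    for token_id, delta in bias.items():
        logits[token_id] += delta   # e.g. +5.0 to favor a token, -100.0 to effectively ban it
    logits = (logits - logits.max()) / temperature
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))
```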
junrushao1994 commented on Making AMD GPUs competitive for LLM inference   blog.mlc.ai/2023/08/09/Ma... · Posted by u/djoldman
brucethemoose2 · 2 years ago
I can confirm this, mlc is shockingly fast on my RTX 2060.

The catch is:

- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet)

- There is no CPU offloading (or splitting onto an IGP) like Llama.cpp yet (unless it's new and I missed it).

junrushao1994 · 2 years ago
True, and there are some other issues to be addressed. Those two particular issues are on our roadmap.

Regarding quantization, we want to develop a code path that absorbs any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to the quantization format, but we just haven't exposed such abstractions yet.
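
As a rough illustration (purely hypothetical, not MLC's actual abstraction, which isn't exposed yet), each format would just register its packing layout plus a dequantization rule:

```python
from dataclasses import dataclass
from typing import Callable, Dict
import numpy as np

@dataclass
class QuantFormat:
    name: str
    # Maps the packed tensors of one weight back to float for the compiler to consume/fuse.
    dequantize: Callable[[Dict[str, np.ndarray]], np.ndarray]

def dequant_q4_groupwise(t: Dict[str, np.ndarray]) -> np.ndarray:
    """Toy 4-bit group-wise rule: w = scale * (q - zero_point)."""
    return t["scale"] * (t["qweight"].astype(np.float32) - t["zero"])

REGISTRY = {
    "gptq-like": QuantFormat("gptq-like", dequant_q4_groupwise),
    # "ggml-like": QuantFormat("ggml-like", ...), and so on per format
}
```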

On CPU offloading, imagine you are writing PyTorch: it should be as simple as a one-liner, `some_tensor.cpu()` to bring something down to host memory and `some_tensor.cuda()` to get it back to CUDA. It seems like low-hanging fruit, but it's not implemented yet in MLC LLM :( Lots of stuff to do, and we should make this happen soon.
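
Concretely, the PyTorch-style pattern would look like this (assuming a CUDA device is available; again, not something MLC LLM exposes today):

```python
import torch

weight = torch.randn(4096, 4096, device="cuda")  # weight resident in GPU memory
weight = weight.cpu()    # offload to host RAM while the layer isn't needed
# ... later, right before the layer runs:
weight = weight.cuda()   # bring it back to the GPU for compute
```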

u/junrushao1994

Karma: 336 · Cake day: July 12, 2018

About: Junru Shao, junrushao1994 at gmail.com. Opinions are my own.