ExLlama is blazing fast. Even if they only benchmarked exllamav1, exllamav2 is only a bit faster, at least on my single 3090 in a similar environment.
vLLM is focused more on batching performance, but even then MLC/TVM looks like it's putting up a fight without batching.
I'm a bit fatigued with llama backends myself, and it looks like this won't help me run 70B on a single 3090, but I need to dig into MLC again.
Regarding exllama-V2, MLC/TVM does benchmark against it:
- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...
- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...
> vLLM focuses more on batching performance
Exactly. vLLM doesn't optimize for latency-first scenarios because it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.
Regarding batching, it is coming pretty soon, and we will have another blog post on this.
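To make the throughput/latency distinction concrete, here is a toy back-of-the-envelope comparison (all numbers are made up for illustration, not benchmarks from either project):

```python
# Toy illustration of throughput vs. latency (made-up numbers, not benchmarks).
# Batching amortizes each weight load across many requests, so aggregate tokens/sec
# goes up, but any single request can decode more slowly than it would alone.

single_stream_tok_per_s = 30.0   # hypothetical latency-optimized decode speed (1 user)
batch_size = 16
per_request_tok_per_s = 12.0     # hypothetical per-request speed when batched

aggregate_throughput = batch_size * per_request_tok_per_s  # 192 tok/s served in total
print(f"latency-first: {single_stream_tok_per_s} tok/s for one user")
print(f"throughput-first: {aggregate_throughput} tok/s overall, "
      f"but only {per_request_tok_per_s} tok/s per user")
```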
For Llama2-70B, MLC LLM runs the 4-bit quantized model at:
- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k
- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k
- It also scales well to 8 A10G/A100 GPUs in our experiments.
Details:
- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
- Project: https://github.com/mlc-ai/mlc-llm
Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)
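For anyone curious, "biased sampling" here means nudging or masking token logits before sampling so the output conforms to a template or grammar. This is not ad-llama's actual implementation (ad-llama is TypeScript), just a minimal generic sketch of the idea in Python:

```python
import numpy as np

def sample_with_bias(logits, bias=None, banned=None, temperature=1.0):
    """Sample a token id after applying additive logit biases and hard bans.

    logits:  1-D array of raw model logits over the vocabulary
    bias:    {token_id: additive_bias} to encourage or discourage specific tokens
    banned:  token ids that must never be sampled (probability forced to zero)
    """
    logits = np.asarray(logits, dtype=np.float64).copy()
    for tok, b in (bias or {}).items():
        logits[tok] += b
    for tok in (banned or []):
        logits[tok] = -np.inf
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# e.g. strongly favor token 42 and forbid token 7 on this sampling step
token = sample_with_bias(np.random.randn(32000), bias={42: 8.0}, banned=[7])
```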
The catches are:
- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet).
- There is no CPU offloading (or splitting onto an IGP) like llama.cpp has yet (unless it's new and I missed it).
Regarding quantization, we wanted to develop a code path that can absorb any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to quantization formats; we just haven't exposed such abstractions yet.
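As a rough illustration of what a format-agnostic abstraction could look like (a purely hypothetical interface, not MLC's actual code): the compiler only needs each format to describe how its packed weights dequantize back to floats, and everything downstream stays the same.

```python
from typing import Protocol
import numpy as np

class QuantFormat(Protocol):
    """Hypothetical interface a compiler could target, independent of the packing scheme."""
    def dequantize(self, packed: np.ndarray, scales: np.ndarray) -> np.ndarray: ...

class Q4Symmetric:
    """Toy 4-bit symmetric scheme: two weights per byte, one scale per group of 32."""
    group_size = 32

    def dequantize(self, packed: np.ndarray, scales: np.ndarray) -> np.ndarray:
        lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble  -> [-8, 7]
        hi = (packed >> 4).astype(np.int8) - 8     # high nibble -> [-8, 7]
        ints = np.stack([lo, hi], axis=-1).reshape(-1)
        return (ints.reshape(-1, self.group_size) * scales[:, None]).reshape(-1)
```

Real formats (GGML's k-quants, GPTQ's group-wise quantization with zero points) pack things differently, but the point is that the compiler could treat `dequantize` as just another operator to fuse and optimize.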
On CPU offloading: imagine you are writing PyTorch; it should be as simple as a one-liner, `some_tensor.cpu()` to bring something down to host memory and `some_tensor.cuda()` to move it back to CUDA. It seems like low-hanging fruit, but it's not implemented in MLC LLM yet :( Lots of stuff to do, and we should make this happen soon.
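For reference, the PyTorch pattern being described looks like this (plain PyTorch, not MLC LLM, which as noted above doesn't expose it yet):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Keep a large weight on the GPU only while it is needed; park it in host RAM otherwise.
weight = torch.randn(4096, 4096, device=device)

weight = weight.cpu()        # move to host memory, releasing its VRAM to the allocator
# ... run other layers that need the freed VRAM ...
weight = weight.to(device)   # bring it back for the next time this layer runs
```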