ExLlama is blazing fast. Even if they only benchmarked exllamav1, exllamav2 is only a bit faster, at least on my single 3090 in a similar environment.
vLLM is focused more on batching performance, but even then MLC/TVM looks like it's putting up a fight without batching.
I'm a bit fatigued with llama backends myself, and it looks like this won't help me run 70B on a single 3090, but I need to dig into MLC again.
Regarding exllama-V2, MLC/TVM does benchmark against it:
- Single GPU: https://github.com/mlc-ai/llm-perf-bench#int4-quantized-sing...
- Multi GPU: Figure 2 in the blog: http://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infere...
> vLLM focuses more on batching performance
Exactly. vLLM doesn't optimize for latency-first scenarios because it focuses on throughput, i.e. batching. This particular blog post instead focuses on latency, i.e. the fastest you could possibly get with that many GPUs.
Regarding batching, it is coming pretty soon, and we will have another blog post on this.
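To make the throughput/latency distinction concrete, here is a toy back-of-the-envelope comparison (all numbers are made up for illustration, not benchmarks from either project):

```python
# Toy illustration of throughput vs. latency (made-up numbers, not benchmarks).
# Batching amortizes each weight load across many requests, so aggregate tokens/sec
# goes up, but any single request can decode more slowly than it would alone.

single_stream_tok_per_s = 30.0   # hypothetical latency-optimized decode speed (1 user)
batch_size = 16
per_request_tok_per_s = 12.0     # hypothetical per-request speed when batched

aggregate_throughput = batch_size * per_request_tok_per_s  # 192 tok/s served in total
print(f"latency-first: {single_stream_tok_per_s} tok/s for one user")
print(f"throughput-first: {aggregate_throughput} tok/s overall, "
      f"but only {per_request_tok_per_s} tok/s per user")
```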
For Llama2-70B, MLC LLM runs the 4-bit quantized model at:
- 34.5 tok/sec on two NVIDIA RTX 4090 at $3k
- 29.9 tok/sec on two AMD Radeon 7900XTX at $2k
- It also scales well to 8 A10G/A100 GPUs in our experiments.
Details:
- Blog post: https://blog.mlc.ai/2023/10/19/Scalable-Language-Model-Infer...
- Project: https://github.com/mlc-ai/mlc-llm
Btw - I got biased sampling working in ad-llama! Catching up to guidance slowly but surely :)
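For anyone curious, "biased sampling" here means nudging or masking token logits before sampling so the output conforms to a template or grammar. This is not ad-llama's actual implementation (ad-llama is TypeScript), just a minimal generic sketch of the idea in Python:

```python
import numpy as np

def sample_with_bias(logits, bias=None, banned=None, temperature=1.0):
    """Sample a token id after applying additive logit biases and hard bans.

    logits:  1-D array of raw model logits over the vocabulary
    bias:    {token_id: additive_bias} to encourage or discourage specific tokens
    banned:  token ids that must never be sampled (probability forced to zero)
    """
    logits = np.asarray(logits, dtype=np.float64).copy()
    for tok, b in (bias or {}).items():
        logits[tok] += b
    for tok in (banned or []):
        logits[tok] = -np.inf
    probs = np.exp((logits - logits.max()) / temperature)
    probs /= probs.sum()
    return int(np.random.choice(len(probs), p=probs))

# e.g. strongly favor token 42 and forbid token 7 on this sampling step
token = sample_with_bias(np.random.randn(32000), bias={42: 8.0}, banned=[7])
```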
The catches are:
- MLC's quantization is somewhat different (though I haven't run any perplexity tests yet).
- There is no CPU offloading (or splitting onto an IGP) like llama.cpp has yet (unless it's new and I missed it).
Regarding quantization, we wanted to develop a code path that can absorb any quantization format, for example those from GGML or GPTQ, so that they can all be used. ML compilation (MLC) is agnostic to quantization formats; we just haven't exposed such abstractions yet.
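As a rough illustration of what a format-agnostic abstraction could look like (a purely hypothetical interface, not MLC's actual code): the compiler only needs each format to describe how its packed weights dequantize back to floats, and everything downstream stays the same.

```python
from typing import Protocol
import numpy as np

class QuantFormat(Protocol):
    """Hypothetical interface a compiler could target, independent of the packing scheme."""
    def dequantize(self, packed: np.ndarray, scales: np.ndarray) -> np.ndarray: ...

class Q4Symmetric:
    """Toy 4-bit symmetric scheme: two weights per byte, one scale per group of 32."""
    group_size = 32

    def dequantize(self, packed: np.ndarray, scales: np.ndarray) -> np.ndarray:
        lo = (packed & 0x0F).astype(np.int8) - 8   # low nibble  -> [-8, 7]
        hi = (packed >> 4).astype(np.int8) - 8     # high nibble -> [-8, 7]
        ints = np.stack([lo, hi], axis=-1).reshape(-1)
        return (ints.reshape(-1, self.group_size) * scales[:, None]).reshape(-1)
```

Real formats (GGML's k-quants, GPTQ's group-wise quantization with zero points) pack things differently, but the point is that the compiler could treat `dequantize` as just another operator to fuse and optimize.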
On CPU offloading: imagine you are writing PyTorch; it should be as simple as a one-liner, `some_tensor.cpu()` to bring something down to host memory and `some_tensor.cuda()` to move it back to CUDA. It seems like low-hanging fruit, but it's not implemented in MLC LLM yet :( Lots of stuff to do, and we should make this happen soon.
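For reference, the PyTorch pattern being described looks like this (plain PyTorch, not MLC LLM, which as noted above doesn't expose it yet):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Keep a large weight on the GPU only while it is needed; park it in host RAM otherwise.
weight = torch.randn(4096, 4096, device=device)

weight = weight.cpu()        # move to host memory, releasing its VRAM to the allocator
# ... run other layers that need the freed VRAM ...
weight = weight.to(device)   # bring it back for the next time this layer runs
```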