refibrillator · 10 months ago
vLLM supports MLA for Deepseek models as of 3 weeks ago. 3x higher generation throughput and 10x token memory capacity.

https://github.com/vllm-project/vllm/releases/tag/v0.7.1
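
A minimal offline-inference sketch, assuming the release picks MLA automatically for DeepSeek-architecture checkpoints (the model name and settings here are illustrative, not taken from the release notes):

    # Hypothetical usage of vLLM's offline API with a DeepSeek model.
    from vllm import LLM, SamplingParams

    llm = LLM(
        model="deepseek-ai/DeepSeek-V2-Lite",  # illustrative checkpoint choice
        trust_remote_code=True,
        tensor_parallel_size=1,
    )
    params = SamplingParams(temperature=0.7, max_tokens=256)
    outputs = llm.generate(["Explain multi-head latent attention briefly."], params)
    print(outputs[0].outputs[0].text)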

MHA is apparently still faster in the low-QPS regime.

https://neuralmagic.com/blog/enhancing-deepseek-models-with-...

Also published this month was a theoretical proof showing that, for the same KV cache overhead, MLA consistently offers greater expressive power than GQA. Furthermore, widely used GQA-based pre-trained models (e.g. LLaMA, Qwen, Mixtral) can be converted into MLA-based models.

https://arxiv.org/pdf/2502.07864

shihab · 10 months ago
For future readers, note that those 3x and 10x figures are compared to vLLM's own previous release, and NOT compared to Deepseek's implementation.

I am very curious to see how well-optimized Deepseek's code is compared to leading LLM serving software like vLLM or SGLang.

lhl · 10 months ago
It's great to see vLLM getting faster/better for DeepSeek. I tested vLLM vs SGLang a couple weeks ago and SGLang's DeepSeek support was much better/faster (on 2 x p5 H100 nodes). It's great that no one's standing still, I saw this recent AMD article that reported SGLang perf on MI300X has increased by 4X over the past couple weeks: https://rocm.blogs.amd.com/artificial-intelligence/DeepSeekR...

(w/ the extra memory V3/R1 fits on a single MI300X or H200 node)

It'll be interesting to see if either project can take advantage/get any benefits from this FlashMLA implementation.

menaerus · 10 months ago
Pretty significant improvements. However, my back-of-the-napkin math suggests that MLA, FlashAttention and similar optimizations provide benefits only when memory access time dominates the compute in the attention implementation? That would be the prefill phase (or TTFT) and training (when batch_size >> 1), but not the decode phase (inference)?
FL33TW00D · 10 months ago
You have it backwards.

Training and prefill are compute bound. Decode is memory bound. FlashAttention massively increases the arithmetic intensity of naive MHA, such that you can remain compute bound at lower batch sizes during decode.

rfoo · 10 months ago
You've got it backwards. After FlashAttention, it's the decoding part that is bound mainly by memory access. With FA, as long as you have enough batch size, you can push training/prefill to be compute-bound.
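
A back-of-the-envelope roofline sketch of why (the hardware peaks below are assumed, roughly H800 SXM numbers, and the shapes are illustrative):

    # An op is compute-bound when its arithmetic intensity (FLOPs per byte
    # moved) exceeds the hardware ridge point.
    PEAK_FLOPS = 990e12   # ~990 dense BF16 TFLOPS (assumed)
    PEAK_BW = 3.35e12     # ~3.35 TB/s HBM3 (assumed)
    RIDGE = PEAK_FLOPS / PEAK_BW  # ~295 FLOP/byte

    def regime(flops, bytes_moved):
        ai = flops / bytes_moved
        return f"AI = {ai:.1f} FLOP/B -> {'compute' if ai > RIDGE else 'memory'}-bound"

    L, d = 8192, 128  # illustrative context length and head dim, bf16 (2 bytes/elem)

    # Decode: one query token attends to an L-token KV cache (per head);
    # K and V are each read once.
    print("decode :", regime(4 * L * d, 2 * 2 * L * d))     # AI ~1  -> memory-bound

    # Prefill/training: L queries share the same KV read, so FLOPs scale by L.
    print("prefill:", regime(4 * L * L * d, 2 * 2 * L * d))  # AI ~L -> compute-bound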
albertzeyer · 10 months ago
I also just read that paper. But I wonder: even though MLA is strictly more powerful, do you really gain from that in experiments? This paper doesn't do many experimental comparisons. GQA, on the other hand, should still be faster (no need for an extra linear transformation).
helloericsf · 10 months ago
X: https://x.com/deepseek_ai/status/1893836827574030466

- BF16 support
- Paged KV cache (block size 64)
- 3000 GB/s memory-bound & 580 TFLOPS compute-bound on H800
WithinReason · 10 months ago
That's 90% bandwidth efficiency and 60% compute efficiency

https://www.nvidia.com/en-us/data-center/h100/
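
A quick sanity check of those percentages (the peak figures are assumed H800 SXM specs, which roughly match the linked H100 page):

    achieved_bw, peak_bw = 3000, 3350          # GB/s (assumed ~3.35 TB/s HBM3 peak)
    achieved_tflops, peak_tflops = 580, 990    # BF16 TFLOPS (assumed ~990 dense peak)
    print(f"bandwidth efficiency: {achieved_bw / peak_bw:.0%}")          # ~90%
    print(f"compute efficiency:   {achieved_tflops / peak_tflops:.0%}")  # ~59%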

helloericsf · 10 months ago
They don't have H100s. Wink, wink.
FL33TW00D · 10 months ago
It seems to me that MLA will become the standard from here on out.

If DeepSeek R1 had used standard MHA, it would need 1,749 KB per token for KV cache storage. This means that once a conversation reaches ~46,000 tokens, the KV cache has exceeded the entire memory capacity of a single H100 (80 GB).

Using MLA, each token consumes only 125 KB, meaning you can hit ~640,000 tokens (2x Ulysses) before overflowing.
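
Reproducing that arithmetic in a quick sketch (80 GB of HBM assumed for the H100; the per-token figures are the ones quoted above):

    # Tokens of KV cache that fit in one GPU's HBM before overflow.
    HBM_BYTES = 80e9            # assumed H100 capacity
    MHA_KV_PER_TOKEN = 1749e3   # bytes/token with standard MHA (figure above)
    MLA_KV_PER_TOKEN = 125e3    # bytes/token with MLA (figure above)

    print(f"MHA: ~{HBM_BYTES / MHA_KV_PER_TOKEN:,.0f} tokens")  # ~46,000
    print(f"MLA: ~{HBM_BYTES / MLA_KV_PER_TOKEN:,.0f} tokens")  # 640,000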

ur-whale · 10 months ago
For those who wonder: it's somewhat likely that MLA means Multi-head latent attention.

https://verticalserve.medium.com/group-query-attention-58283...

https://paperswithcode.com/method/multi-head-attention

eigenvalue · 10 months ago
Nice, this probably saved a bunch of FAANG devs many hours of work trying to knock this off.
nicce · 10 months ago
There were likely some startups that tried to sell the same thing…
anon389r58r58 · 10 months ago
You mean like Modular?
imranq · 10 months ago
Dang, only forward passes. The real secret was in the backward pass! I was also curious to learn how they implemented the DualPipe scheduler.
rfoo · 10 months ago
Do they even have an optimized backward pass? It looks like optimizations like this aren't needed during training; their V2 paper also suggests so.
mohsen1 · 10 months ago
I'm confused. Weren't there sanctions against Chinese companies over Hopper GPUs? Are they just admitting that they had access to H100s in violation of the US sanctions?!
thot_experiment · 10 months ago
Just the H100; the H800 is a region-specific version of the card for China with shitty NVLink bandwidth, which makes it rougher for building big clusters. DeepSeek was able to mitigate the impact of that by being clever (rumored to have made significant use of PTX assembly instead of just CUDA; we'll probably find out in the releases this week).

ahofmann · 10 months ago
It isn't illegal for Chinese companies to buy H100 cards; it is illegal for US companies to sell them to China. So the "admit" part wouldn't be on China's side.
jofzar · 10 months ago
It's also totally legal to sell H100 cards to a country that is very close to China.

Unrelated, but it's always impressed me how Singapore buys 15% of the world's H100s. Really is the AI development capital of the world.

amelius · 10 months ago
Also, breaking the law to growth-hack happens all the time; see Uber.
Tiberium · 10 months ago
H800 is the export variant that they had access to. They directly reference it in the repo:

>Achieving up to 3000 GB/s in memory-bound configuration and 580 TFLOPS in computation-bound configuration on H800 SXM5, using CUDA 12.6.

WiSaGaN · 10 months ago
H20 is a Hopper GPU, and they are allowed to be sold in China.
jonplackett · 10 months ago
Can everyone stop downvoting people just for asking questions - this isn’t Stack Overflow!

feverzsj · 10 months ago
The secret ingredient is smuggling.
tasuki · 10 months ago
I'd be very careful when using that word in this situation. If China wants X, and another country has X, who are you to say they shouldn't trade with each other?
7952 · 10 months ago
Do you think that would be morally wrong? Honest question.
rob_c · 10 months ago
Great work! Any plans to integrate with PyTorch or TF, I wonder?

(Showing my lack of breadth of knowledge in the ecosystem(s).)