jychang (u/jychang) - Readit News

Loading parent story...

Loading comment...

Loading parent story...

Loading comment...

Loading parent story...

Loading comment...

Loading parent story...

Loading comment...

Loading parent story...

Loading comment...

jychang commented on Zebra-Llama – Towards efficient hybrid models arxiv.org/abs/2505.17272... · Posted by u/mirrir

credit_guy · 11 days ago

Like this?

https://huggingface.co/amd/Zebra-Llama-8B-8MLA-24Mamba-SFT

jychang · 11 days ago

Or like this: https://api-docs.deepseek.com/news/news251201

I don't know what's so special about this paper.

- They claim to use MLA to reduce KV cache by 90%. Yeah, Deepseek invented that for Deepseek V2 (and also V3 and Deepseek R1 etc)

- They claim to use a hybrid linear attention architecture. So does Deepseek V3.2 and that was weeks ago. Or Granite 4, if you want to go even further back. Or Kimi Linear. Or Qwen3-Next.

- They claimed to save a lot of money not doing a full pre-train run for millions of dollars. Well, so did Deepseek V3.2... Deepseek hasn't done a full $5.6mil full pretraining run since Deepseek V3 in 2024. Deepseek R1 is just a $294k post train on top of the expensive V3 pretrain run. Deepseek V3.2 is just a hybrid linear attention post-train run - i don't know the exact price, but it's probably just a few hundred thousand dollars as well.

Hell, GPT-5, o3, o4-mini, and gpt-4o are all post-trains on top of the same expensive pre-train run for gpt-4o in 2024. That's why they all have the same information cutoff date.

I don't really see anything new or interesting in this paper that isn't already something Deepseek V3.2 has already sort of done (just on a bigger scale). Not exactly the same, but is there anything amazingly new that's not in Deepseek V3.2?

jychang commented on Zebra-Llama – Towards efficient hybrid models arxiv.org/abs/2505.17272... · Posted by u/mirrir

mason_mpls · 11 days ago

> Zebra-Llama achieves Transformer-level accuracy with near-SSM efficiency using only 7–11B training tokens (compared to trillions of tokens required for pre-training) and an 8B teacher. Moreover, Zebra-Llama dramatically reduces KV cache size—down to 3.9%, 2%, and 2.73% of the original for the 1B, 3B, and 8B variants, respectively—while preserving 100%, 100%, and 97% of average zero-shot performance on LM Harness tasks.

This is an extraordinary claim, is there a catch I’m missing? Am I misreading?

jychang · 11 days ago

The catch that you're missing is that Deepseek did this ages ago.

They're just using MLA, which is well known to reduce KV size by 90%. You know, the MLA that's used in... Deepseek V2, Deepseek V3, Deepseek R1, Deepseek V3.1, Deepseek V3.2.

Oh, and they also added some hybrid linear attention stuff to make it faster at long context. You know who else uses hybrid linear attention? Deepseek V3.2.

jychang commented on Zebra-Llama – Towards efficient hybrid models arxiv.org/abs/2505.17272... · Posted by u/mirrir

a_wild_dandan · 11 days ago

If the claims in the abstract are true, then this is legitimately revolutionary. I don’t believe it. There are probably some major constraints/caveats that keep these results from generalizing. I’ll read through the paper carefully this time instead of a skim and come back with thoughts after I’ve digested it.

jychang · 11 days ago

What's not to believe? Qwerky-32b has already done something similar as a finetune of QwQ-32b but not using traditional attention architecture.

And hybrid models aren't new, MLA based hybrid models is basically just Deepseek V3.2 in a nutshell. Note that Deepseek V3.2 (and V3.1, R1, and V3... and V2 actually) all use MLA. Deepseek V3.2 is what adds the linear attention stuff.

Actually, since Deepseek V3.1 and Deepseek V3.2 are just post-training on top of the original Deepseek V3 pretrain run, I'd say this paper is basically doing exactly what Deepseek V3.2 did in terms of efficiency.

jychang commented on Europe is scaling back GDPR and relaxing AI laws theverge.com/news/823750/... · Posted by u/ksec

energy123 · a month ago

I've stopped thinking of regulations as a single dial, where more regulations is bad or less regulations is bad. It entirely depends on what is being regulated and how. Some areas need more regulations, some areas need less. Some areas need altered regulation. Some areas have just the right regulations. Most regulations can be improved, some more than others.

jychang · a month ago

I strongly agree with this position. This is basically the foundation of Control Theory!

https://en.wikipedia.org/wiki/Control_theory

This is like arguing if "heater on" or "AC on" is better, which is a pointless argument. That entirely depends on what the temperature is!

Loading parent story...

Loading comment...