jychang · 3 months ago
Coolest part of Qwen3-Next, in my opinion (after the linear attention parts), is that they do MTP without adding another un-embedding matrix.

Deepseek R1 also has an MTP layer (layer 61): https://huggingface.co/deepseek-ai/DeepSeek-R1/blob/main/mod...

But Deepseek R1 adds embed_tokens and shared_head.head tensors, which are [129280, 7168] or about 2GB in size at FP8.

Qwen3-Next doesn't have that: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct/blob...

So it saves a few GB in active parameters for MTP, which is a Big Deal. This is one of the changes that helps significantly speed up inference.

puilp0502 · 3 months ago
What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?
jychang · 3 months ago
Speculative decoding! It makes inference a LOT faster.

Instead of generating tokens one at a time, you generate the second one as well, and then use speculative decoding on that second token (instead of having it be produced by a separate draft model like Qwen 0.6B). If that token is verified and turns out to be correct, then the 2nd token gets generated MUCH faster.

If it's wrong, you have to generate it again the normal way (a lot slower than just checking it). Usually, it's correct, so inference is a lot faster.
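Roughly, in greedy-decoding pseudo-code it looks like the sketch below. `model.step` is a made-up helper (not vLLM's or Qwen's actual API), just to show where the speedup comes from: an accepted draft means one forward pass yields two tokens.

    def generate(model, tokens, n_new):
        produced, draft = 0, None
        while produced < n_new:
            if draft is None:
                # Plain step: one forward pass -> next token, plus the MTP
                # head's cheap guess ("draft") for the token after it.
                nxt, draft = model.step(tokens)
                tokens.append(nxt)
                produced += 1
            else:
                # Speculative step: append the draft and score both new
                # positions in a single forward pass.
                check, after, next_draft = model.step(tokens, speculative=draft)
                if check == draft:
                    tokens += [draft, after]   # 2 tokens from 1 pass
                    produced += 2
                    draft = next_draft
                else:
                    tokens.append(check)       # draft rejected: keep only
                    produced += 1              # the corrected token
                    draft = None
        return tokens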

cubefox · 3 months ago
> What kind of benefit does Multi-Token Prediction bring to the inference side? Is it only relevant in pretraining efficiency?

It is only useful for inference and doesn't help with pretraining. Which actually points to speculative decoding not being sufficiently general, as the same underlying property (some sequences of tokens are easy to predict) could be exploited for training as well. See here: https://goombalab.github.io/blog/2025/hnet-future/#d-footnot...

rfoo · 3 months ago
It could be a better draft model than separately trained EAGLE etc for speculative decoding.
Razengan · 3 months ago
Could someone kindly point to a convenient all-on-one ELI5 of all these words? :')
porridgeraisin · 3 months ago
Background:

LLMs take your input, upscale it into a very high dimensional space, and then downscale it back to 1D at the end. This 1D list is interpreted as a list of probabilities -- one for each word in your vocabulary. i.e f(x) = downscale(upscale(x)). Each of downscale() and upscale() are parameterized (billions of params). I see you have a gamedev background, so as an example: bezier curves are parameterized functions where bezier handles are the parameters. During training, these parameters are continuously adjusted so that the output of the overall function gets closer to the expected result. Neural networks are just really flexible functions for which you can choose parameters to get any expected result, provided you have enough of them (similar to bezier curves in this regard).
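As a toy numerical sketch of that f(x) = downscale(upscale(x)) view (sizes are made up, and all the transformer layers that really sit inside upscale() are ignored):

    import numpy as np

    VOCAB, HIDDEN = 50_000, 4096                               # made-up sizes
    upscale_params = np.random.randn(VOCAB, HIDDEN) * 0.02     # "upscale" parameters
    downscale_params = np.random.randn(HIDDEN, VOCAB) * 0.02   # "downscale" parameters

    def f(token_id):
        h = upscale_params[token_id]        # upscale: token -> high-dimensional vector
        logits = h @ downscale_params       # downscale: back to one score per vocab word
        e = np.exp(logits - logits.max())   # softmax turns the scores into
        return e / e.sum()                  # probabilities over the vocabulary

    next_token = int(np.argmax(f(1234)))    # pick the most likely next word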

---

When training, you make an LLM learn that

I use arch = downscale(upscale(I use))

If you want to predict the next word after that, you then do, in sequence:

I use arch btw = downscale(upscale(I use arch))

Now, multi-token prediction means having two downscale functions, one for each of the next two words, and learning it that way: basically, you have a second downscale2() that learns how to predict the next-to-next word.

i.e in parallel:

I use arch = downscale1(upscale(I use))

I use ____ btw = downscale2(upscale(I use))

However, this way you'll need twice the number of parameters that downscale alone needs. And if you want to predict more tokens ahead, you'll need even more parameters.

What Qwen has done, is instead of downscale1 and downscale2 being completely separately parameterized functions, they set downscale1(.) = lightweight1(downscale_common(.)) and downscale2(.) = lightweight2(downscale_common(.)). This is essentially betting that a lot of the logic is common and the difference between predicting the next and next-to-next token can be captured in one lightweight function each. Lightweight here, means less parameters. The bet paid off.

So overall, you save params.

Concretely,

Before: downscale1.params + downscale2.params

After: downscale_common.params + lightweight1.params + lightweight2.params

Edit: it's actually downscale_common(lightweight()) and not the other way around as I have written above. Doesn't change the crux of the answer, but I'm including this for clarity.
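A minimal PyTorch-style sketch of that parameter sharing, using the corrected order from the edit (names and sizes are made up; this is not Qwen's actual module layout):

    import torch.nn as nn

    HIDDEN, VOCAB = 2048, 151_936   # illustrative sizes only

    class MTPHead(nn.Module):
        def __init__(self):
            super().__init__()
            # The one big "downscale" (un-embedding) matrix, shared by both steps.
            self.downscale_common = nn.Linear(HIDDEN, VOCAB, bias=False)
            # Small per-step adapters; the only extra parameters MTP adds here.
            self.lightweight1 = nn.Linear(HIDDEN, HIDDEN)   # next token
            self.lightweight2 = nn.Linear(HIDDEN, HIDDEN)   # next-to-next token

        def forward(self, h):
            # h: [batch, HIDDEN], the output of the shared "upscale" trunk
            logits_next = self.downscale_common(self.lightweight1(h))
            logits_next2 = self.downscale_common(self.lightweight2(h))
            return logits_next, logits_next2

Two fully separate heads would cost 2 x HIDDEN x VOCAB parameters; here the second step only adds an extra HIDDEN x HIDDEN.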

lcnPylGDnU4H9OF · 3 months ago
The best primer I've seen is Andrej Karpathy's first video in his "zero to hero" series. It's worth following along with your own practice.

https://karpathy.ai/zero-to-hero.html

vessenes · 3 months ago
Unfortunately, no. The industry is moving super quickly, and spinning up new ideas on the backs of old ones at a fast rate. If you want to understand what's going on, I think the best thing to do is some intro courses, train and design some smaller models directly, get a list of core papers and concepts from Claude/Chat/Gemini, and then as you read something like this, if you don't know the acronym (In this case: MTP = Multi Token Prediction), search it up, and see if you have the basis for understanding what it's about. If not read up on the precursors.

Unlike many disciplines, AI is an arena that doesn't have a lot of intuitive simplified models that are accurate -- most of the simplified models available don't describe what's going on accurately enough to let you reason about and understand it. So, you just have to start reading!

wickedsight · 3 months ago
For me, ChatGPT or any of the other current thinking models are very useful for this type of stuff. I just ask to explain it on my level and then I can ask questions for clarification.
pmarreck · 3 months ago
The following was generated by chatG5:

    Qwen3-Next — A family of large language models from Qwen (Alibaba).  
    DeepSeek R1 — Another large open-source language model from DeepSeek AI.  
    Linear attention — A type of transformer attention that scales linearly with sequence length, making long-context processing cheaper.  
    MTP (Multi-Token Prediction) — Training/inference trick where the model predicts multiple future tokens at once, speeding things up.  
    Embedding — Converts words/tokens into vectors (numbers) the model can work with.  
    Un-embedding — The reverse step: mapping the model’s internal vector back into tokens.  
    embed_tokens — The big lookup table of embeddings (token → vector).  
    shared_head.head tensors — Extra weight matrices used for prediction; they can be huge.  
    [129280, 7168] — The shape of such a tensor: ~129k rows (tokens in the vocab) × 7k columns (hidden dimension).  
    FP8 — Floating-point format using 8 bits (compact, faster, less precise).  
    Active parameters — The weights that actually need to be loaded in GPU memory to run the model.  
    Inference — Running the model to generate text (as opposed to training it).  
    GB savings — If you avoid duplicating giant matrices, you save GPU memory and speed things up.
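To make the "linear attention" entry concrete, here's a toy (non-causal) sketch of the general kernelized-attention trick, not Qwen3-Next's specific gated DeltaNet variant: rearranging the matmuls avoids ever building the n x n score matrix, so cost grows linearly with sequence length.

    import numpy as np

    def softmax_attention(Q, K, V):
        # Standard attention: materializes an [n, n] score matrix -> O(n^2) in n.
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        w = np.exp(s - s.max(axis=-1, keepdims=True))
        return (w / w.sum(axis=-1, keepdims=True)) @ V

    def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
        # Kernel trick: phi(Q) @ (phi(K).T @ V) never forms the [n, n] matrix,
        # so the cost is O(n * d * d_v), linear in sequence length n.
        KV = phi(K).T @ V                      # [d, d_v]
        Z = phi(K).sum(axis=0)                 # [d] normalizer
        return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]

    n, d = 8, 4
    Q, K, V = (np.random.randn(n, d) for _ in range(3))
    print(softmax_attention(Q, K, V).shape, linear_attention(Q, K, V).shape)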

humblyCrazy · 3 months ago
How is MTP different from Medusa heads? Also, does this mean this model comes "natively" with speculative decoding, meaning if I use this model in vLLM, its throughput should be higher because it is already doing MTP, so it should be able to take advantage of speculative decoding?
Alifatisk · 3 months ago
Alibaba keeps releasing gold content

I just tried Qwen3-Next-80B-A3B on Qwen Chat, and it's fast! The quality seems to match Qwen3-235B-A22B. Quite impressive how they achieved this. Can't wait for the benchmarks at Artificial Analysis.

According to Qwen Chat, Qwen3-Next has the following limits:

Maximum context length: 262,144 tokens

Max summary generation length: 32,768 tokens

This is 2x higher on context length and 4x higher on summary generation compared to Qwen3-235B-A22B, damn

> Qwen3-Next [...] excels in ultra-long-context understanding and complex tasks

Even though their new hybrid architecture is fascinating, I think I'll continue to stick with Qwen2.5-Turbo because it's one of the few models that supports 1M tokens of context length. My use case is uploading large PDFs and asking questions across chapters.

gizmodo59 · 3 months ago
My take on long context for many frontier models: it's not about whether it's supported, but that accuracy drops drastically as you increase the context. Even if a model claims to support 10M context, in reality it doesn't perform well when you saturate it. Curious to hear others' perspectives on this.
kridsdale3 · 3 months ago
This is my experience with Gemini. Yes, I really can put an entire codebase and all the docs and pre-dev discussions and all the inter-engineer chat logs in there.

I still see the model becoming more intoxicated as turn count gets high.

vessenes · 3 months ago
Agreed. That said, in general a 1M context model has a larger usable window than a 260k context model.
pilotneko · 3 months ago
If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

> Qwen3-Next natively supports context lengths of up to 262,144 tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques to handle long texts effectively. We have validated the model's performance on context lengths of up to 1 million tokens using the YaRN method.

Source: https://huggingface.co/Qwen/Qwen3-Next-80B-A3B-Instruct#proc...
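For what it's worth, on HF-format Qwen checkpoints this is typically enabled via a rope_scaling block in config.json. Shown here as a Python dict for brevity; the values are illustrative, so take the exact keys and numbers from the model card rather than from this comment:

    # Illustrative only; copy the real values from the Qwen3-Next model card.
    rope_scaling = {
        "rope_type": "yarn",
        "factor": 4.0,                               # ~262,144 * 4 -> ~1M tokens
        "original_max_position_embeddings": 262144,
    }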

Alifatisk · 3 months ago
> If you read the model card, Qwen3-Next can be extended to 1M context length with YaRN.

I read the article, but as I said Qwen chat only provides up to 262k tokens in context length, so I'll stick with Qwen2.5 Turbo which supports 1M tokens.

I am not in a position where I can self-host yet

davidweatherall · 3 months ago
Their proprietary models are very good too and go under the radar; they never seem to appear on any benchmarks. Qwen3-Coder-Plus is significantly better than their open-source Qwen3, and Qwen3-Max also rivals the SOTA models.
ehsanu1 · 3 months ago
Are these actually different models vs just different names from the open weights releases?
cpursley · 3 months ago
How are you prepping the PDF data before shoving it into Qwen?
Alifatisk · 3 months ago
I just compress the file size as much as possible without losing quality; I didn't even know there were more ways to prep it.

I do sometimes chop up the PDF into smaller pdfs with their own individual chapters

navbaker · 3 months ago
Not OP, but we use the docling library to extract text and put it in markdown before storing for use with an LLM.
irthomasthomas · 3 months ago
llm -m qwen3-next-80b-a3b-thinking "An ASCII of spongebob"

Here's a classic ASCII art representation of SpongeBob SquarePants:

     .------.
    /  o o  \
   |         |
   |  \___/  |
    \_______/


  llm -m chutes/Qwen/Qwen3-Coder-480B-A35B-Instruct-FP8 \
  "An ASCII of spongebob"

  Here's an ASCII art of SpongeBob SquarePants:

      .--..--..--..--..--..--.
    .' \  (`._   (_)     _   \
  .'    |  '._)         (_)  |
  \ _.')\      .----..--.'  /
  |(_.'  |    /    .-\-.  \---.
  \     0|    |   ( O| O) |    |
   |  _  |  .--.____.'._.-.  |
   \ (_) | o         -` .-`  |
    |    \   |`-._ _ _ _ _\ /
    \    |   |  `. |_||_|   |
     | o |    \_       \    |     -.   .-.
     |.-.  \    `--..-'   O |     `.`-' .'
   _.'  .' |     `-.-'      /-.__   ' .-'
  .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
  `-._  `.  |________/\_____|    `-.'
     .'   ).| '=' '='\/ '=' |
     `._.`  '---------------'
             //___\   //___\
               ||       ||
               ||_.-.   ||_.-.
              (_.--__) (_.--__)


Meta: I generated a few dozen spongebobs last night on the same model and NONE were as good as this. Most started well but collapsed into decoherence at the end, leaving the legs off. Then this morning the very same prompt to the same model API produced a perfect Bob on the first attempt. Can utilization affect response quality, if all else remains constant? Or was it just random luck?

Edit: Ok, the very next attempt, a few minutes later, failed, so I guess it is just random, and you have about a 1 in 10 chance of getting a perfect spongebob from qwen3-coder, and ~0 chance with qwen3-next.

dev_hugepages · 3 months ago
irthomasthomas · 3 months ago
Naturally. That's how LLMs work. During training you measure the loss (the difference between the model output and the ground truth) and try to minimize it. We prize models for their ability to learn. Here we can see that the large model does a great job of learning to draw Bob, while the small model performs poorly.
ricardobeat · 3 months ago
For the model to have memorized the entire sequence of characters precisely, this must appear hundreds of times in the training data?
ginko · 3 months ago
Conveniently removed the artist's signature though.
matchcc · 3 months ago
I think there is some distillation relationship between Kimi K2 and Qwen Coder (or other related models), or they share training data. I tried most LLMs, and only Kimi K2 gave the exact same ASCII. Kimi K2: Here's a classic ASCII art of SpongeBob SquarePants for you:

           .--..--..--..--..--..--.
        .' \  (`._   (_)     _   \
      .'    |  '._)         (_)  |
      \ _.')\      .----..---.   /
      |(_.'  |    /    .-\-.  \  |
      \     0|    |   ( O| O) | o|
       |  _  |  .--.____.'._.-.  |
       \ (_) | o         -` .-`  |
        |    \   |`-._ _ _ _ _\ /
        \    |   |  `. |_||_|   |
        | o  |    \_      \     |     -.   .-.
        |.-.  \     `--..-'   O |     `.`-' .'
      _.'  .' |     `-.-'      /-.__   ' .-'
    .' `-.` '.|='=.='=.='=.='=|._/_ `-'.'
    `-._  `.  |________/\_____|    `-.'
       .'   ).| '=' '='\/ '=' |
       `._.`  '---------------'
               //___\   //___\
                 ||       ||
                 ||_.-.   ||_.-.
                (_.--__) (_.--__)
Enjoy your SpongeBob ASCII!

nakamoto_damacy · 3 months ago
For ASCII art to look right and not get messed up, the generator has to know the width of the div in ASCII characters, e.g. 80, 240, etc., so it can make sure the lines don't wrap. So how does an LLM know anything about the UI it's serving? Is it just luck? What if you ask it to draw something that's like 16:9 in aspect ratio... would it know to scale it down so lines won't wrap? How about loss of detail if it does? Also, is it as good with Unicode art? So many questions.

Deleted Comment

irthomasthomas · 3 months ago
I realize my SpongeBob post came off flippant, and that wasn't the intent. The SpongeBob ASCII test (picked up from Qwen's own Twitter) is explicitly a rote-memorization probe; bigger dense models usually ace it because sheer parameter count can store the sequence.

With Qwen3's sparse MoE, though, the path to that memory is noisier, with two extra stochastic draws: (a) which expert(s) fire, and (b) which token gets sampled from them. Add the new gated attention and multi-token heads and you've got a pipeline where a single routing flake or a dud expert can break vertical alignment halfway down the picture.

Anyway, I think qwen3-coder was uniquely trained on this - so it's not a fair comparison. Here are some other qwen3 models:

Model: chutes/Qwen/Qwen3-235B-A22B

   /~\       
  (  *  *  )  
  (  o o o  )  
   \  -  /     
    \  /\  /   
     \  /      
      \/       
     /|||\      
    /|||||\     
   /||||||||\   
  ( o   o   o ) 
   \   W   /   
    \___/
    
Model: chutes/Qwen/Qwen3-235B-A22B-Instruct-2507

    /\_/\  
   ( o.o ) 
    > ^ <  
   /     \ 
  |       |
  |       |
   \     /
    '-'-'
Model: chutes/Qwen/Qwen3-235B-A22B-Thinking-2507

  .-----------.
  | []     [] |
  |           |
  |   __  __  |
  |  |  ||  | |
  |  |__||__| |
  |           |
  '-----------'

Model: chutes/Qwen/Qwen3-Next-80B-A3B-Instruct

        __
       /  \
      /    \
     /      \
    /        \
   /__________\
  |  o      o  |
  |     __     |
  |  \____/    |
  |            |
  |  ________  |
  |  \      /  |
  |   \____/   |
  |            |
  |____________|
  
     ___________
    /           \
   /             \
  |   _______     |
  |  |       |    |
  |  |  ___  |    |
  |  | |   | |    |
  |  | |___| |    |
  |  |_______|    |
  |               |
  |_______________|

Model: chutes/Qwen/Qwen3-Next-80B-A3B-Thinking

  .-.
   /   \
  |  o o|
  |  >  |
  |  ---|
   \___/
  
Model: chutes/Qwen/Qwen3-30B-A3B-Instruct-2507

    _________________________
   /                         \
  |   _     _     _     _   |
  |  / \   / \   / \   / \  |
  | |   | |   | |   | |   | |
  |  \_/   \_/   \_/   \_/  |
  |                         |
  |   _     _     _     _   |
  |  / \   / \   / \   / \  |
  | |   | |   | |   | |   | |
  |  \_/   \_/   \_/   \_/  |
  |                         |
  |    SpongeBob SquarePants   |
  |_________________________|

Deleted Comment

syntaxing · 3 months ago
The craziest part is how far MoE has come thanks to Qwen. This beats all those 72B dense models we've had before and runs faster than a 14B model, depending on how you offload between VRAM and CPU. That's insane.
moffkalast · 3 months ago
In retrospect it's actually funny that last year Meta spent so many resources training a dense 405B model that both underperforms compared to models a tenth its size and is impossible to run at a reasonable speed on any hardware in existence.
jychang · 3 months ago
Strong disagree.

Llama 4's release in 2025 is (deservedly) panned, but Llama 3.1 405b does not deserve that slander.

https://artificialanalysis.ai/#frontier-language-model-intel...

Do not compare 2024 models to the current cutting edge. At the time, Llama 3.1 405b was the very first open source (open weights) model to come close to the closed source cutting edge. It was very very close in performance to GPT-4o and Claude 3.5 Sonnet.

In essence, it was Deepseek R1 before Deepseek R1.

NitpickLawyer · 3 months ago
It's not that clear. Yes, it underperforms on recent benchmarks and use cases (i.e. agentic stuff), but it is still one of the strongest open models in terms of "knowledge". Dense does have that advantage over MoE, even if it's extremely expensive to run inference on.

Check out this great exercise - https://open.substack.com/pub/outsidetext/p/how-does-a-blind...

halJordan · 3 months ago
Qwen isn't directing the forward progress of LLMs. SOTA LLMs have been MoE since GPT-4. The OG 4.

Out of context, but I honestly hate how HN let itself get so far behind the times that this is the sort of inane commentary we get on AI.

refulgentis · 3 months ago
I would venture to suggest that reading it as "Qwen made MoEs in toto || first || better than anyone else" is reductive; rather, the number of experts and the ratios here are quite novel (80B total... inferencing only 3B!?!). I sometimes kick around the same take, but thought I'd stand up for this. And I know what I'm talking about: I maintain a client that wraps llama.cpp x ~20 models on inference APIs.
mmmllm · 3 months ago
The same week, Oracle is forecasting huge data center demand and the stock is rallying. If these 10x gains in efficiency hold true, this could lead to a lot less demand for Nvidia, Oracle, CoreWeave, etc.
amelius · 3 months ago
mmmllm · 3 months ago
Sure but where is the demand going to come from? LLMs are already in every google search, in Whatsapp/Messenger, throughout Google workspace, Notion, Slack, etc. ChatGPT already has a billion users.

Plus penetration is already very high in the areas where they are objectively useful: programming, customer care etc. I just don't see where the 100-1000x demand comes from to offset this. Would be happy to hear other views.

ls65536 · 3 months ago
I'm not going to speculate about what might be ahead in regards to Oracle's forecasting of data center demand, but regarding the idea of efficiency gains leading to lower demand, don't you think something like Jevons paradox might apply here?
sailingparrot · 3 months ago
People said the same thing for deepseek-r1, and nothing changed.

If you come up with a way to make the current generation of models 10x more efficient, then everyone just moves on to train a 10x bigger model. There isn't a model size the players will be satisfied with and stop at. Not as long as scaling still pays off (and it does today).

stingraycharles · 3 months ago
Absolutely not; the trends have proven that people will just pay for the best quality they can get, and keep paying roughly the same money.

Every time a new model is released, people abandon the old, lower quality model (even when it’s priced less), and instead prefer to pay the same for a better model.

The same will happen with this.

mmmllm · 3 months ago
Sure, but the money people are paying right now isn't that much in the grand scheme of things. OpenAI is expecting $13bn in revenue this year; AWS made over $100bn last year. So unless they pay a lot more, or find customers outside of programmers, designers, etc. who are willing to pay for the best quality, I don't see how it grows as fast as it needs to (I'm not saying it won't increase, just not at the rate expected by the data center providers).
clvx · 3 months ago
For early adopters, yes, but many systems have been running as "good enough" without any kind of updates for a long time. For many use cases it just needs to reach a point where accuracy is good enough, and then it will be set and forget. I disagree with the approach, but that's what you find in the wild.
Zambyte · 3 months ago
The best quality you can get is at odds with the best speed you can get. There are lots of people (especially with specific use cases) who will pay for the best speed they can get that is high enough quality.
thinkingemote · 3 months ago
If someone had to bet on an AI crash, which I imagine would lead to unused datacentres and cheap GPUs, how would they invest their winnings to exploit these resources?
CuriouslyC · 3 months ago
If the price of inference drops through the floor all the AI wrapper companies become instantly more valuable. Cursor is living on borrowed time because their agents suck and they're coasting on first mover advantage with weak products in general, but their position would get much better with cheap inference.
sunir · 3 months ago
Buy the application layer near winners. When computing costs shrink, usage expands.
kridsdale3 · 3 months ago
Assuming your question isn't rhetorical, massive Oracle Crypto Farm.
ACCount37 · 3 months ago
No. The gains in inference and training efficiency are going to be absorbed by frontier LLM labs being more willing to push more demanding and capable models to the end users, increase reasoning token budgets, etc.
jstummbillig · 3 months ago
For the last 2 years, despite all the efficiency gains, I have literally been watching characters appear on my screen, as if this was a hacker movie. Lately, I am also waiting at least 60s for anything to appear at all.

If that happened at 10x the speed, it would still be slow in computer terms, and that increasingly matters, because I will not be the one reading the stuff; it will be other computers. I think that looking back a few years from now, every single piece of silicon being planned right now will look like a laudable but laughable drop in the ocean.

mdp2021 · 3 months ago
The quality that real demand needs is not there yet, so more processing is very probably needed, and efficiency gains may allow that extra processing.

(A strong example, read today, of what real quality demand needs: the administration of Albania wants some sort of automated Cabinet Minister. Not just an impartial and incorruptible algorithm (what we normally try to do with deterministic computation): a "minister". Good luck with that.)

mynti · 3 months ago
For anyone curious about what the Gated Delta Network is: https://arxiv.org/pdf/2412.06464
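Very roughly, it's a linear-attention-style fast-weight memory with both a forget gate and a delta-rule correction. A toy per-token sketch of how I read the recurrence (not the paper's exact, chunked and much more optimized parameterization):

    import numpy as np

    def gated_delta_step(S, k, v, q, alpha, beta):
        # S: [d_v, d_k] fast-weight state (a "memory" mapping keys -> values)
        # k, q: [d_k]; v: [d_v]; alpha, beta: scalars in (0, 1]
        S = alpha * S                          # forget gate: decay old memory
        S = S + beta * np.outer(v - S @ k, k)  # delta rule: correct what the
                                               # memory stores for key k
        return S, S @ q                        # new state, output for this token

    d_k, d_v = 4, 4
    S = np.zeros((d_v, d_k))
    for _ in range(5):                         # toy sequential scan over tokens
        k, q, v = np.random.randn(d_k), np.random.randn(d_k), np.random.randn(d_v)
        S, out = gated_delta_step(S, k, v, q, alpha=0.95, beta=0.5)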
yorwba · 3 months ago
Also, Gated Attention: https://arxiv.org/abs/2505.06708
jbellis · 3 months ago
Added Qwen3 Next to the Brokk Power Ranking Open Round (coding benchmark). It's roughly GPT-OSS-20b strength.

Full set of open weight model results: https://brokk.ai/power-ranking?version=openround&models=ds-r...

noahbp · 3 months ago
Is that the updated Kimi K2, or the old Kimi k2?
jbellis · 3 months ago
It's the original. I'll update the label to clarify.
SparkyMcUnicorn · 3 months ago
This would be a valuable benchmark if it included languages other than Java, and let me see which models are best at the languages I work with.

My real-world usage does not line up with these results, but I'm not working with Java.

Jgoauh · 3 months ago
Seems impressive. I believe better architectures are really the path forward; I don't think you need more than 100B params, given this model and what GPT-OSS-120B can achieve.
CuriouslyC · 3 months ago
We definitely need more parameters; low-param models are hallucination machines, though a low active-parameter count is probably fine assuming the routing is good.
NitpickLawyer · 3 months ago
New arch seems cool, and it's amazing that we have these published in the open.

That being said, Qwen models are extremely overfit. They can do some things well, but they are very limited in generalisation compared to closed models. I don't know if it's simply scale, or training recipes, or regimes. But if you test them out of distribution (OOD), the models utterly fail to deliver, where the closed models still provide value.

vintermann · 3 months ago
Could you give some practical examples? I don't know what Qwen's 36T-token training set is like, so I don't know what it's overfitting to...