The Illustrated DeepSeek-R1

Do we know which changes made DeepSeek V3 so much faster and better at training than other models? DeepSeek R1's performances seem to be highly related to V3 being a very good model to start with.

I went through the paper and I understood they made these improvements compared to "regular" MoE models:

1. Latent Multi-head Attention. If I understand correctly, they were able to do some caching on the attention computation. This one is still a little bit confusing to me;

2. New MoE architecture with one shared expert and a large number of small routed experts (256 total, but 8 in use in the same token inference). This was already used in DeepSeek v2;

3. Better load balancing of the training of experts. During training, they add bias or "bonus" value to experts that are less used, to make them more likely to be selected in the future training steps;

4. They added a few smaller transformer layers to predict not only the first next token, but a few additional tokens. Their training error/loss function then uses all these predicted tokens as input, not only the first one. This is supposed to improve the transformer capabilities in predicting sequences of tokens;

5. They are using FP8 instead of FP16 when it does not impact accuracy.

It's not clear to me which changes are the most important, but my guess would be that 4) is a critical improvement.

1), 2), 3) and 5) could explain why their model trains faster by some small factor (+/- 2x), but neither the 10x advertised boost nor how is performs greatly better than models with way more activated parameters (e.g. llama 3).

whoisburbansky · 7 months ago

The key idea of Latent MHA is that "regular" multi-headed attention needs you to keep a bunch of giant key-value (KV) matrices around in memory to do inference. The "Latent" part just means that DeepSeek takes the `n` KV matrices in a given n-headed attention block and replaces them with a lower-rank approximation (think of this as compressing the matrices), so that they take up less VRAM in a GPU at the cost of a little extra compute and a little lost accuracy. So not caching, strictly speaking, but weight compression to trade compute off for better memory usage, which is good because the KV matrices are one of the more expensive part of this transformer architecture. MoE addresses the other expensive part (the fully-connected layers) by making it so only a subset of the fully-connected layers are active at any given forward pass.

whoisburbansky · 7 months ago

https://planetbanatt.net/articles/mla.html this is a great overview of how MLA works.

alecco · 7 months ago

They also did bandwidth scaling to handle work around the nerfed H800 interconnects.

> efficient cross-node all-to-all communication kernels to fully utilize IB and NVLink bandwidths

> The key idea of DualPipe is to overlap the computation and communication within a pair of individual forward and backward chunks. To be specific, we divide each chunk into four components: attention, all-to-all dispatch, MLP, and all-to-all combine. Specially, for a backward chunk, both attention and MLP are further split into two parts, backward for input and backward for weights, like in ZeroBubble (Qi et al., 2023b). In addition, we have a PP communication component.

(I know some of those words)

https://arxiv.org/html/2412.19437v1

sinuhe69 · 7 months ago

I think the fact that they used synthetic/distilled high-quality data from GPT4-o output to train in the style of Phi models are of significance as well.

Am I the only one not that impressed with Deepseek R1? Its "thinking" seems full of the usual LLM blindsides, and ultimately generating more of it then summarizing doesn't seem to overcome any real limits.

It's like making mortgage backed securities out of bad mortgages, you never really overcome the badness of the underlying loans, no matter how many layers you pile on top

I haven't used or studied DeepSeek R1 (or o1) in exhaustive depth, but I guess I'm just not understanding the level of breathless hype right now.

TrackerFF · 7 months ago

What's there not to understand?

If it matches the latest GPT O-N model in performance - or is just close, even, at a fraction of the compute (50x less?) and it is free, then that's huge news.

They just upended the current LLM/AI/ML dominance, or at least the perceived dominance. Billions and billions have been pumped into the race, where investors are betting on the winner - and here comes a Chinese hedge fund side-project on shoestring budget, matching those billion dollar behemoths. And they'll continue to release their work.

They just made the OpenAI et. al. secret sauce a lot less valuable.

thechao · 7 months ago

New theory: this is a short-long play by the fund. They shorted NV, now they're hoovering up stock. In the process of making their billions from a small 50$mm investment!

infecto · 7 months ago

In my tests it does not come close to O1-Pro. Still huge news but it does not quite make it.

itissid · 7 months ago

It is leaps and bounds better than LLMs. For one you are doing RL which is classic AI like tuning that optimizes a reward function with nice qualities- it's the same stuff used to train chess games and Go by showing them the actual moves.

LLMs pre o1 and deepseek R1 were RHLF tuned which is like if you trained a LM how to play chess by showing people two boards and doing a vibe check on which "looks" better.

Think of it this way say you were dropped in a maze that you had to solve but you could do only one of two things:

1. Look at two random moves from your start position and selected which one looked better to get out.

2. Made a series of moves and then backtracked, then use a quantitative function to exploit the best path.

The latter is what R1 does and it chooses optimal and more certain path to success.

Apply this to math and coding tokens and you have a competitive LLM.

itissid · 7 months ago

I am using the 32Gb distilled model on my local 3090 with Continue in VSCode. It beats everything out of the water.

lm28469 · 7 months ago

I've been using deepseek for a while, I never paid for chat gpt or any other services.

The fact that r1 is now free and unlimited VS chat gpt 200$ a month subscription is impressive enough for me. If the development cost is anywhere close to what they advertise publicly it's even more impressive

It's as good or better than chat gpt free, gemini free, &c. and that's all I care about

redcobra762 · 7 months ago

Why is cost your only concern? You don’t care at all what data the model was trained on? The motivations of the people who trained it? I mean maybe the importance of those things isn’t super high to you, but “don’t care”?

How odd. Most Westerners I know would care…

dkjaudyeqooe · 7 months ago

As with most early tech of a particular category, it's not the current capabilities that are the point but the direction of travel.

DeepSeek has upended the conventional wisdom about model performance with respect to training, and it's a shock to the system. It demonstrates something that has become obvious: you don't need massive scale or funding to innovate and have impact, you just need good ideas.

danielscrubs · 7 months ago

I guess this is why Googles CEO said that they have no LLM moat, as cleverness is not a moat.

HarHarVeryFunny · 7 months ago

It's the cost comparison with O1, both to train and run (per their pricing), that is causing most of the shock, perhaps as well as the fact that it's a GPU poor Chinese company that has caught up with O1, not a US one (Anthropic, Google, Meta, X.ai, Microsoft). The fact that it's open weights and training is fairly detailed in the paper they released is also significant.

The best comparison for R1 is O1, but given different training data, hard to compare outside of benchmarks. At the moment these "reasoning models" are not necessarily the best thing to use for non-reasoning tasks, but Anthropic have recently indicated that they expect to release models that are more universal.

throwup238 · 7 months ago

You’re not the only one. It’s not as impressive at coding compared to O1 as people make it out to be and it’s explicitly spelled out in DeepSeek’s R1 paper that they had trouble with improving over DeepSeek-V3:

> Software Engineering Tasks: Due to the long evaluation times, which impact the efficiency of the RL process, large-scale RL has not been applied extensively in software engineering tasks. As a result, DeepSeek-R1 has not demonstrated a huge improvement over DeepSeek-V3 on software engineering benchmarks. Future versions will address this by implementing rejection sampling on software engineering data or incorporating asynchronous evaluations during the RL process to improve efficiency. (last bullet point of page 16, which is the last page of the paper before the bibliography - hmm…) [1]

It does even worse on my real world coding problems than the benchmarks would suggest. Some of my tests: write a Qt QAbstractListModel in C++ that parses markdown into block using a C/C++ MD parsing library, write Rust cxx-qt bindings for QTextDocument (all docs included in context), write a window switching script for Wayland with an alternative to wmctrl, etc. I also asked it some geochemistry questions that I had trialed O1 with previously, and the reasoning had a lot hallucinations. The answers were suboptimal to say the least.

Having access to the thought process in <think></think> tags is cool but it just reveals how underdeveloped the actual reasoning is compared to whatever o1 is doing. I’ve had it get stuck on silly things like whether a C++ library for markdown exists because I underspecified that I’m okay with header only C libs. O1 has faired much better on my qualitative test suite - even more so when using it via API with a “high” reasoning_effort.

With all of the hype over the last few days, it feels like I’m taking crazy pills (loving all the action though).

[1] https://arxiv.org/abs/2501.12948

creer · 7 months ago

The brute force approach is the expensive one ("probably, so far anyway" add these 4 words everywhere) - and impossible to make "always correct" aka "usual blindsides". They seem to be trying a bunch of specialized training ideas here and there in the system - just like a Mixture of Experts does, in different places than was obvious so far, and with an eye toward reasonning. In particular trying to build a reasonning-oriented training base from minimal seed.

It's still not going to give an "always correct" result but we are nowhere near the point where that's needed. We are only at the point where a new idea can get you a percentage point further in benchmarks. Some fundamental limits were baked into the previous assumptions - easy to get past by leaving these assumptions behind.

vineyardmike · 7 months ago

I get the hype, even if I don’t necessarily agree with it. TLDR: it’s technically impressive in training infrastructure and a geopolitical surprise.

As others commenters said, it compares favorably against American Flagship models in benchmarks. This is geopolitically interesting since the model is Chinese, and subject to trade restrictions and America thinks of itself as the world’s best model builders.

What makes it interesting technically, is the trade restrictions required them to focus on efficiency and reusing old hardware. This made it wildly cheaper to run the training. The They make a ton of very low-level optimizations to make training efficient. This is impressive engineering, and shows how a small amount of effort can free up a nvidia-lock-in. They had to totally bypass a lot of CUDA libraries and drop down even lower to control the GPUs. Nvidia has been capturing a huge fraction of industry wide AI spend, and surely no one wants that. This is, IMO, the actual part to watch. I suspect it’ll drive a new wave of efficiency-focused LLM tools which will also unlock competitors GPUs.

They also had some novel-for-LLMS training techniques but it’s suspected that the big AI companies elsewhere are doing it now too, but not disclosed. (Mostly reinforcement learning).

What I think is hype, meanwhile, is the actual benchmarks. Most models are trained on their competitors output. This is just a reality of the industry. Especially non-flagships being trained against flagships data. DeepSeek was almost certainly trained against OpenAI models, so it makes sense that it would approach the output quality. That’s very different from being capable of outperforming or “taking the lead” or whatever. We’ll need to wait longer and see how the future goes to make that determination. “China” has long had a great history of machine learning tech, so there is no reason to think that it’s structurally impossible for a Chinese organization to be on the leading edge, but it has to happen before we can say it happened.

What is also hype is calling this a “side project” of a financial firm. The firm spun it out as a dedicated company. China cracked down on hedge funds, so the company looked for ways to repurpose the talent and compute infrastructure. This isn’t some side project during lunch breaks for a few bored quants.

PS, thinking models are very different in use-case than normal models. It’s great for tasks with a “right answer” like math problems. Consider a simple example, in calculus, your teacher made you “show your work”, and doing so certainly reduced the likelihood of errors. That’s the purpose of a thinking model, and it excels at similar tasks.

assimpleaspossi · 7 months ago

fwiw, National Public Radio (NPR) news in the USA said that AI experts stated it was "almost as good" as other current offerings like chatGPT and Gemini. That its real advantage is the low cost of providing the information. However, this is only a claim made by the company without any proof.

Dead Comment

jasonjmcghee · 7 months ago

For the uninitiated, this is the same author as the many other "The Illustrated..." blog posts.

A particularly popular one: https://jalammar.github.io/illustrated-transformer/

Always very high quality.

punkspider · 7 months ago

Thanks so much for mentioning this. His name carries a lot of weight for me as well.

jamestimmins · 7 months ago

Have you read his book Hands-On Large Language Models?

Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.

jampekka · 7 months ago

> Looks interesting, but I'm skeptical that a book can feasibly stay up to date with the speed of development.

The basic structure of the base models has not really changed since the first GPT launched in 2018. You still have to understand gradient descent, tokenization, embeddings, self-attention, MLPs, supervised fine tuning, RLHF etc for the foreseeable future.

Adding RL based CoT training would be a relatively straightforward addendum to a new edition, and it's an application of long established methods like PPO.

All "generations" of models are presented as revolutionary -- and results-wise they maybe are -- but technically they are usually quite incremental "tweaks" to the previous architecture.

Even more "radical" departures like state space models are closely related to same basic techniques and architectures.

I have not, but Jay has created a ton of value and knowledge for free and don't fault him for throwing an ad for his book / trying to benefit a bit financially.

raphaelj · 7 months ago

QuadrupleA · 7 months ago

8n4vidtmkvmk · 7 months ago

> This is a large number of long chain-of-thought reasoning examples (600,000 of them). These are very hard to come by and very expensive to label with humans at this scale. Which is why the process to create them is the second special thing to highlight

I didn't know the reasonings were part of the training data. I thought we basically just told the LLM to "explain its thinking" or something as an intermediate step, but the fact that the 'thinking' is part of the training step makes more sense and I can see how this improves things in a non-trivial way.

Still not sure if using word tokens as the intermediate "thinking" is the correct or optimal way of doing things, but I don't know. Maybe after everything is compressed into latent space it's essentially the same stuff.

blackeyeblitzar · 7 months ago

The thing I still don’t understand is how DeepSeek built the base model cheaply, and why their models seem to think they are GPT4 when asked. This article says the base model is from their previous paper, but that paper also doesn’t make clear what they trained on. The earlier paper is mostly a description of optimization techniques they applied. It does mention pretraining on 14.8T tokens with 2.7M H800 GPU hours to produce the base DeepSeek-V3. But what were those tokens? The paper describes the corpus only in vague ways.

llm_nerd · 7 months ago

Various other models also think they're ChatGPT or built by OpenAI, or at least those are the highest probability tokens when talking about an AI model or an AI company because of the massive prevalence in training data (the internet). It isn't the big reveal that it is often being held to be.

Add that training off of ChatGPT wouldn't reduce their training costs at all, but would actually increase their training costs. Literally all of the same training difficulty, but then add paying OpenAI for an enormous number of API calls. Not really seeing the win.

>The paper describes the corpus only in vague ways.

Anyone who runs a public website has logs absolutely filled by a seemingly infinite number of information aggregators. Just like everyone else they scraped the entire internet, pulled in all of Wikipedia, etc. Probably lots of pirate books, movie transcripts, etc.

The fact that training could be done more effectively is something that intuitively makes absolute sense to everyone in the field, but we just didn't make that leap. Similar to how a human isn't trained to recognize digits by training on 60,000 training digits then suddenly failing if a real world digit is slightly rotated or morphed in some way, we are making these improvements to content ingestion.

moritonal · 7 months ago

I imagine it's a mix of either using ChatGPT as an the oracle to get training data. Or, it's the radiocarbon issue where the Internet has so much info on ChatGPT other models now get confused.

moralestapia · 7 months ago

A friend just sent me a screenshot where he asks DeepSeek if it has an app for Mac and it replies that they have a ChatGPT app from OpenAI, lol.

I 100% believe they distilled GPT-4, hence the low "training" cost.

Philpax · 7 months ago

Er, how would that reduce the cost? You still need to train the model, which is the expensive bit.

Also, the base model for V3 and the only-RL-tuned R1-Zero are available, and they behave like base models, which seems unlikely if they used data from OpenAI as their primary data source.

It's much more likely that they've consumed the background radiation of the web, where OpenAI contamination is dominant.

nullc · 7 months ago

You can't distill from GPT-4 because Open AI conceals the probabilities (and has for a couple years now-- since before gpt4), presumably to prevent that. You can fine tune against output though. I might guess that they used something like openorca or some other public data set that includes gpt4 output as part of their initial fine tuning.

How does such a distillation work in theory? They don’t have weights from OpenAI’s models, and can only call their APIs, right? So how can they actually build off of it?

kenjackson · 7 months ago

They fixed that. Now it replies: "Hi! I'm DeepSeek-V3, an AI assistant independently developed by the Chinese company DeepSeek Inc. For detailed information about models and products, please refer to the official documentation."

How is this very high signal vs noise post out of the front page in 2hs?

Are people so upset with the stock market crash that they are flagging it?

Maybe too much of the same topic? "How R1 was trained" also seemed to quickly fall off. But the big arxiv paper with 1000+ upvotes stuck around a while.

htk · 7 months ago

Spot on. I've read the very accessible paper and it's better than any of the how-to's written elsewhere. Nothing against good content being written, but the source material is already pretty good.

khazhoux · 7 months ago

dang has provided an answer for how the algorithm works whenever I've asked similar question.

But I still don't get it. 6 hours + 170 points and it's on third page. Meanwhile second page has "Null Byte on Steroids" at 12 hours + 20 points. ??

whoistraitor · 7 months ago

It’s remarkable we’ve hit a threshold where so much can be done with synthetic data. The reasoning race seems an utterly solvable problem now (thanks mostly to the verifiability of results). I guess the challenge then becomes non-reasoning domains, where qualitative and truly creative results are desired.

It seems like we need an evaluation model for creativity. I'm curious, is there research on this -- for example, can one score a random painting and output how creative/good a given population is likely to find it?

virgildotcodes · 7 months ago

How do you account for the impact of culture/lived experience of the specific population viewing the painting? Intuitively it seems like that would be the biggest factor, rather than the objective attributes of the painting, no?

baq · 7 months ago

There are two kinds of creativity at play here. One is mashing together combinations of learned things - it’s kinda like shuffling a deck of cards where basically every shuffle gets you a deck that has never been seen and won’t be seen again, but it’s still the same 52 cards every time. The other kind is going outside of the box and inventing truly new, unseen/untrained concepts. This one is hard, but I don’t think it’s impossible - the <think> slop stirring the learned concepts with a bit of randomness should make progress here.

esafak · 7 months ago

You can train a supervised model, taking into account the properties of the rater as well as the artwork, and tease out the factors that make it rated so.

> can one score a random painting

You can get very mechanical in scoring an image. Ask any art student. If you want to or if your instructor or audience wants to. For example "fits rule of thirds?" yes is a point to common attraction, no is a point to unexpected at the risk of outsider-ness. You can do that in color, composition, recognizing objects and fitting that to memes or associations or non-associations. Too many points in "unexpected" is meta points in "unpleasant chaos" and so a strong downgrade in common attraction. You can match all this to images in the library (see how copyright or song recognition operates in the music category) and get out of that some kind of familiarity vs edge score (where too much edge goes against common attraction.)

I would expect you could get better than most humans at recognizing shapes in an image and drawing associations from that. Such associations are a plus in unexpected / surprise if they are rare in the culture or a plus in common attraction is they are common.

After that, to be cynic about it, you can randomize and second guess yourself so your audience doesn't catch on the 1st level mimicry.

Creativity is not normally used as an absolute with a unique measure. It's not "length". And you only need to please part of the audience to be successful - sometimes a very small part, some of which loves surprise and some hates it, etc. Someone elsewhere objected on the grounds that creativity or attractiveness is culture based - yeah so? if you were to please much of just one whole culture, you would have an insane hit on your hands.

Sounds feasible to me.

It's still reasonning based on pattern matching, which should go only so far. But "only so far" could be plenty for lots of applications.

researchers · 7 months ago

Tuning for qualitative outcomes is pretty much solved via RLHF/DPO (what this post calls "preference tuning"). Right?

ForOldHack · 7 months ago

"DeepSeek-R1 is the latest resounding beat in the steady drumroll of AI progress. " IBM's Intellect, from 1983 cost $47,000 dollars a month. Let me know when DeepSleep-Rx exceeds Windows (tm) version numbers or makes a jump like AutoCADs version numbers.