Magistral — the first reasoning model by Mistral AI

I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral

ozgune · 9 months ago

Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be more fair.

For example, R1 scores 79.8 on AIME 2024, R1-0528 performs 91.4.

R1 scores 70 on AIME 2025, R1-0528 scores 87.5. R1-0528 does similarly better for GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

derefr · 9 months ago

I presume that "outdated upon release" benchmarks like these happen because the benchmark and the models in it were chosen first, before the model was created; and the model's development progress was measured using the benchmark. It then doesn't occur to anyone that the benchmark the engineers had been relying upon isn't also a good/useful benchmark for marketing upon release. From the inside view, it's just a benchmark, already there, already achieving impressive results, a whole-company internal target to hit for months — so why not publish it?

semi-extrinsic · 9 months ago

Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.

Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.

danielhanchen · 9 months ago

Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They edited GRPO via:

1. Removed KL Divergence

2. Normalize by total length (Dr. GRPO style)

3. Minibatch normalization for advantages

4. Relaxing trust region

gyrovagueGeist · 9 months ago

Does anyone know why they added minibatch advantage normalization (or when it can be useful)?

The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?

Onavo · 9 months ago

> Removed KL Divergence

Wait, how are they computing the loss?

monkmartinez · 9 months ago

At the risk of dating myself; Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use if not "jinja" for their templates?

danielhanchen · 9 months ago

Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja

gavi · 9 months ago

too much thinking

https://gist.github.com/gavi/b9985f730f5deefe49b6a28e5569d46...

fzzzy · 9 months ago

My impression from running the first R1 release locally was that it also does too much thinking.

lxe · 9 months ago

Thanks for all you do!

danielhanchen · 9 months ago

Thanks!

trebligdivad · 9 months ago

Nice! I'm running on CPU only, so it's interesting to compare - the Magistral-Small-2506_Q8_0.gguf runs at under 2 tokens/s on my 16 core, but your UD-IQ2_XXS gets about 5.5 tokens/s which is fast enough to be useful - but it does hallucinate a bit more and loop a little; but still actually pretty good for something so small.

danielhanchen · 9 months ago

Oh nice! I normally suggest maybe Q4_K_XL to be on the safe side :)

cpldcpu · 9 months ago

But this is just the SFT - "distilled" model, not the one optimized with RL, right?

danielhanchen · 9 months ago

Oh I think it's SFT + RL as mentioned in the paper - they said combining both is actually more performant than just RL

Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is struggling to keep up with the state-of-the-art.

hmottestad · 9 months ago

With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...

reissbaker · 9 months ago

It's not better than full R1; Mistral is using misleading benchmarks. The latest version of R1, R1-0528, is much better: 91.4% on AIME2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.

That being said, it's still very impressive for a 24B.

I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.

Of course, OpenAI, Google, and Anthropic will have released new models by then too...

adventured · 9 months ago

It's because DeepSeek was a fast copy. That was the easy part and it's why they didn't have to use so much compute to get near the top. Going well beyond o3 or 2.5 Pro is drastically more expensive than fast copy. China's cultural approach to building substantial things produces this sort of outcome regularly, you see the same approach in automobiles, planes, Internet services, industrial machinery, military, et al. Innovation is very expensive and time consuming, fast copy is more often very inexpensive and rapid. 85% good enough is often good enough, that additional 10-15% is comically expensive and difficult as you climb.

melicerte · 9 months ago

If you look at Mistral investors[0], you will quickly understand that Mistral is far from being European. My understanding is it is mainly owned by US companies with a few other companies from EU and other places in the world.

[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)

pdabbadabba · 9 months ago

For the purposes of GP's comment, I think the nationalities of the people actually running the company and doing the work are more relevant than who has invested.

kergonath · 9 months ago

It’s a French company, subject to French laws and European regulations. That’s what matters, from a user point of view.

epolanski · 9 months ago

Jm2c but I feel conflicted about this arms race.

You can be 6/12 months later, and have not burned tens of billions compared to the best in class, I see it an engineering win.

I absolutely understand those that say "yeah, but customers will only use the best", I see it, but is market share of forever money losing businesses that valuable?

louiskottmann · 9 months ago

Indeed, and with the technology plateau-ing, being 6-12 months late with less debt is just long term thinking.

Also, Europe being in the race is a big deal for consumers.

adventured · 9 months ago

A similar sentiment existed for a long time about Uber and now they're very profitable and own their market. It was worth the burn to capture the market. Who says OpenAI can't roll over to profitable at a stable scale? Conquer the market, hike the price to $29.95 (family account, no ads; $19.95 individual account with ads; etc etc). To say nothing of how they can branch out in terms of being the interaction point that replaces the search box. The advertising value of owning the land that OpenAI is taking is well over $100 billion in annual revenue. Amazon's retail business is terrible, their ad business is fantastic. As OpenAI bolts on an ad product their margin potential will skyrocket and the cost side will be modest in comparison.

Over the coming years it won't be possible to stay a mere 6-12 months behind as the costs to build and maintain the AI super-infrastructure keeps climbing. It'll become a guaranteed implosion scenario. Winning will provide the ongoing immense resources needed to keep pushing up the hill forever. Everybody else - except a few - will fall away. The same outcome took place in search. Anybody spot Lycos, Excite, Hotbot, AltaVista around? It costs an enormous amount of money to try to keep up with Google (Bing, Baidu, Yandex) in search and scale it. This will be an even more brutal example of that, as the costs are even higher to scale.

The only way Mistral survives is if they're heavily subsidized directly by European states.

jasonthorsness · 9 months ago

Even if it isn't as capable, having a model with control over training is probably strategically important for every major region of the world. But it could only fall so far behind before it effectively doesn't work in the eyes of the users.

tootie · 9 months ago

As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.

esafak · 9 months ago

If so we need to fix the benchmarks.

littlestymaar · 9 months ago

> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.

That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).

funnym0nk3y · 9 months ago

Thought so too. I don't know how it could be different though. They are competing against behemoths like OpenAI or Google, but have only 200 people. Even Anthropic has over 1000 people. DeepSeek has less than 200 people so the comparison seems fair.

rsanek · 9 months ago

any claim from the deepseek folks should be considered with wide margins of error.

wafngar · 9 months ago

But they have built a fully “independent” pipeline. Deepseek and others probably trained in gpt4, o1 or whatever data.

segmondy · 9 months ago

are you really going to compare a 24B model to a 700B+ model?

a2128 · 9 months ago

24B is the size of the Small opensourced model. The Medium model is bigger (they don't seem to disclose its size) and still gets beaten by Deepseek R1

moffkalast · 9 months ago

The most important company is to is to QwQ at 30B sjnce it's still the best local reasoning model for that size. A comparison that Mistral did not run for some reason, not even with Qwen3.

fiatjaf · 9 months ago

This reads like an AI-generated comment. What do you mean by "benchmarks suggest"? The benchmarks are very clear and presented right there in the page.

Deleted Comment

mrtksn · 9 months ago

Europe isn't going to catch up in tech as long as its market is open to US tech giants. Tech doesn't have marginal costs, so you want to have one of it in one place and sell it everywhere and when the infra and talent is already in US, EU tech is destined to do niche products.

UK has a bit of it, France has some and that's it. The only viable alternatives are countries who have issues with US and that is China and Russia. China have come up with strong competitors and it is on cutting edge.

Also, it doesn't have anything to do with regulations. 50 US States have the American regulations, its all happening in 1 and some other states happen to host some infrastructure but that's true for rest of of the world too.

If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.

Most influential AI researchers are from Europe(inc. UK), Israel and Canada anyway. Ilya Sutskever just the other day gave speech at his alma matter @Canada for example. Andrej Karpathy is Slovakian. Lot's of Brits, French, Polish, Chinese, German etc. are among the pioneers. Significant portion of the talent is non-American already, they just need a reason to be somewhere else than US to have it outside the US. Chinese got their reason and with the state of the affairs in the world I wouldn't be surprised if Europeans gets theirs in less than 3 and a half years.

vikramkr · 9 months ago

If you close off the market to US tech giants, maybe they'll have some amount of market dominance at home, but I would doubt that would mean they've "caught up" tech wise. There would be no incentive to compete. American EV manufacturing is pretty far behind Chinese EV manufacturing, protectionism didn't help make a competitive car, it just protected the home market while slowly ceding international market after international market

ascorbic · 9 months ago

It's mostly about money. DeepMind was founded in the UK, and is still based in London, but there was no way it could get the funding it needed without selling to Google or some other US company. China is one of the few other countries that can afford to fund that kind of thing.

Iulioh · 9 months ago

The problem is, CONSUMER level tech

The EU is doing a lot of enterprise level shit and it's great

The biggest company in Europe sells B2B software (SAP)

simianwords · 9 months ago

How can you explain Israel?

iwontberude · 9 months ago

Which Trump/Musk level? There have been so many.

Deleted Comment

Dead Comment

atemerev · 9 months ago

"EU is leading in regulation", they say.

I don't know what they are thinking.

cpldcpu · 9 months ago

Sorry, this is just getting old...

Its a trite talking point and not the reason why there are so few consumer-AI companies in Europe.

dmos62 · 9 months ago

It is fairly common to struggle to understand why different cultures think the way they do.

Mistletoe · 9 months ago

This is why I want to move to the EU. I don’t care if companies aren’t coddled there. I want to live where people are the first priority.

micromacrofoot · 9 months ago

probably some silly thing like "people should have more rights and protections"

0xDEAFBEAD · 9 months ago

Honestly the US approach to AI is incredibly irresponsible. As an American, I'm glad that someone somewhere is thinking about regulation. Not sure it will be enough though: https://xcancel.com/ESYudkowsky/status/1922710969785917691#m