danielhanchen · 9 months ago
I made some GGUFs for those interested in running them at https://huggingface.co/unsloth/Magistral-Small-2506-GGUF

ollama run hf.co/unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL

or

./llama.cpp/llama-cli -hf unsloth/Magistral-Small-2506-GGUF:UD-Q4_K_XL --jinja --temp 0.7 --top-k -1 --top-p 0.95 -ngl 99

Please use --jinja for llama.cpp and use temperature = 0.7, top-p 0.95!

Also best to increase Ollama's context length to say 8K at least: OLLAMA_CONTEXT_LENGTH=8192 ollama serve &. Some other details in https://docs.unsloth.ai/basics/magistral

ozgune · 9 months ago
Their benchmarks are interesting. They are comparing to DeepSeek-V3's (non-reasoning) December and DeepSeek-R1's January releases. I feel that comparing to DeepSeek-R1-0528 would be more fair.

For example, R1 scores 79.8 on AIME 2024, while R1-0528 scores 91.4.

R1 scores 70 on AIME 2025; R1-0528 scores 87.5. R1-0528 is similarly better on GPQA Diamond, LiveCodeBench, and Aider (about 10-15 points higher).

https://huggingface.co/deepseek-ai/DeepSeek-R1-0528

derefr · 9 months ago
I presume that "outdated upon release" benchmarks like these happen because the benchmark and the models in it were chosen first, before the model was created; and the model's development progress was measured using the benchmark. It then doesn't occur to anyone that the benchmark the engineers had been relying upon isn't also a good/useful benchmark for marketing upon release. From the inside view, it's just a benchmark, already there, already achieving impressive results, a whole-company internal target to hit for months — so why not publish it?
semi-extrinsic · 9 months ago
Would also be interesting to compare with R1-0528-Qwen3-8B (chain-of-thought distilled from Deepseek-R1-0528 and post-trained into Qwen3-8B). It scores 86 and 76 on AIME 2024 and 2025 respectively.

Currently running the 6-bit XL quant on a single old RTX 2080 Ti and I'm quite impressed TBH. Simply wild for a sub-8GB download.

danielhanchen · 9 months ago
Their paper https://mistral.ai/static/research/magistral.pdf is also cool! They edited GRPO via:

1. Removed KL Divergence

2. Normalize by total length (Dr. GRPO style)

3. Minibatch normalization for advantages

4. Relaxing trust region
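
A rough sketch of how edits 2 and 3 would look in code (my reading of the paper, not Mistral's actual implementation; all names here are made up):

```python
import numpy as np

def magistral_grpo_advantages(rewards_per_group, eps=1e-4):
    """Hypothetical sketch of GRPO advantages with edits 2 and 3.

    rewards_per_group: one array of scalar rewards per group of
    completions sampled from the same prompt.
    """
    # Baseline GRPO: advantage = reward minus the group mean.
    # (The per-group std division is what Dr. GRPO drops.)
    advs = [r - r.mean() for r in rewards_per_group]

    # Edit 3: normalize advantages over the whole minibatch
    # instead of within each group.
    flat = np.concatenate(advs)
    mu, sigma = flat.mean(), flat.std()
    return [(a - mu) / (sigma + eps) for a in advs]
```

Edit 1 just drops the `beta * KL(pi, pi_ref)` penalty from the loss, and edit 2 divides the summed token loss by the total generated length across the batch rather than per sequence.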

gyrovagueGeist · 9 months ago
Does anyone know why they added minibatch advantage normalization (or when it can be useful)?

The paper they cite "What matters in on-policy RL" claims it does not lead to much difference on their suite of test problems, and (mean-of-minibatch)-normalization doesn't seem theoretically motivated for convergence to the optimal policy?

Onavo · 9 months ago
> Removed KL Divergence

Wait, how are they computing the loss?

monkmartinez · 9 months ago
At the risk of dating myself: Unsloth is the Bomb-dot-com!!! I use your models all the time and they just work. Thank you!!! What does llama.cpp normally use, if not "jinja", for their templates?
danielhanchen · 9 months ago
Oh thanks! Yes I was gonna bring it up to them! Imo if there is a chat template, by default it should be --jinja
fzzzy · 9 months ago
My impression from running the first R1 release locally was that it also does too much thinking.
lxe · 9 months ago
Thanks for all you do!
danielhanchen · 9 months ago
Thanks!
trebligdivad · 9 months ago
Nice! I'm running on CPU only, so it's interesting to compare: Magistral-Small-2506_Q8_0.gguf runs at under 2 tokens/s on my 16 cores, but your UD-IQ2_XXS gets about 5.5 tokens/s, which is fast enough to be useful. It does hallucinate a bit more and loop a little, but it's still pretty good for something so small.
danielhanchen · 9 months ago
Oh nice! I normally suggest maybe Q4_K_XL to be on the safe side :)
cpldcpu · 9 months ago
But this is just the SFT - "distilled" model, not the one optimized with RL, right?
danielhanchen · 9 months ago
Oh I think it's SFT + RL as mentioned in the paper - they said combining both is actually more performant than just RL
pu_pe · 9 months ago
Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison. Considering they were likely not even pitting it against the newer R1 version (no mention of that in the article) and at more than double the cost, this looks like the best AI company in the EU is struggling to keep up with the state-of-the-art.
hmottestad · 9 months ago
With how amazing the first R1 model was and how little compute they needed to create it, I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Magistral Small is only 24B and scores 70.7% on AIME2024 while the 32B distill of R1 scores 72.6%. And with majority voting @64 the Magistral Small manages 83.3%, which is better than the full R1. Since I can run a 24B model on a regular gaming GPU it's a lot more accessible than the full blown R1.

https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-...
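
Worth noting what maj@64 means mechanically: sample 64 answers and take the most frequent one. A toy sketch (hypothetical helper, not from either paper):

```python
from collections import Counter

def majority_vote(final_answers):
    """maj@k: return the most common final answer among k samples."""
    return Counter(final_answers).most_common(1)[0][0]

# e.g. 64 sampled final answers to one AIME problem
samples = ["204"] * 34 + ["104"] * 26 + ["42"] * 4
```

`majority_vote(samples)` returns `"204"` here even though barely half the samples agree, which is why maj@64 can sit well above pass@1.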

reissbaker · 9 months ago
It's not better than full R1; Mistral is using misleading benchmarks. The latest version of R1, R1-0528, is much better: 91.4% on AIME2024 pass@1. Mistral uses the original R1 release from January in their comparisons, presumably because it makes their numbers look more competitive.

That being said, it's still very impressive for a 24B.

I'm really wondering how the new R1 model isn't beating o3 and 2.5 Pro on every single benchmark.

Sidenote, but I'm pretty sure DeepSeek is focused on V4, and after that will train an R2 on top. The V3-0324 and R1-0528 releases weren't retrained from scratch, they just continued training from the previous V3/R1 checkpoints. They're nice bumps, but V4/R2 will be more significant.

Of course, OpenAI, Google, and Anthropic will have released new models by then too...

adventured · 9 months ago
It's because DeepSeek was a fast copy. That was the easy part and it's why they didn't have to use so much compute to get near the top. Going well beyond o3 or 2.5 Pro is drastically more expensive than fast copy. China's cultural approach to building substantial things produces this sort of outcome regularly, you see the same approach in automobiles, planes, Internet services, industrial machinery, military, et al. Innovation is very expensive and time consuming, fast copy is more often very inexpensive and rapid. 85% good enough is often good enough, that additional 10-15% is comically expensive and difficult as you climb.
melicerte · 9 months ago
If you look at Mistral investors[0], you will quickly understand that Mistral is far from being European. My understanding is it is mainly owned by US companies with a few other companies from EU and other places in the world.

[0] https://tracxn.com/d/companies/mistral-ai/__SLZq7rzxLYqqA97j... (edited for typo)

pdabbadabba · 9 months ago
For the purposes of GP's comment, I think the nationalities of the people actually running the company and doing the work are more relevant than who has invested.
kergonath · 9 months ago
It’s a French company, subject to French laws and European regulations. That’s what matters, from a user point of view.
epolanski · 9 months ago
Jm2c but I feel conflicted about this arms race.

You can be 6-12 months behind and not have burned tens of billions compared to the best in class; I see that as an engineering win.

I absolutely understand those who say "yeah, but customers will only use the best", but is market share for forever money-losing businesses that valuable?

louiskottmann · 9 months ago
Indeed, and with the technology plateauing, being 6-12 months late with less debt is just long-term thinking.

Also, Europe being in the race is a big deal for consumers.

adventured · 9 months ago
A similar sentiment existed for a long time about Uber and now they're very profitable and own their market. It was worth the burn to capture the market. Who says OpenAI can't roll over to profitable at a stable scale? Conquer the market, hike the price to $29.95 (family account, no ads; $19.95 individual account with ads; etc etc). To say nothing of how they can branch out in terms of being the interaction point that replaces the search box. The advertising value of owning the land that OpenAI is taking is well over $100 billion in annual revenue. Amazon's retail business is terrible, their ad business is fantastic. As OpenAI bolts on an ad product their margin potential will skyrocket and the cost side will be modest in comparison.

Over the coming years it won't be possible to stay a mere 6-12 months behind as the costs to build and maintain the AI super-infrastructure keeps climbing. It'll become a guaranteed implosion scenario. Winning will provide the ongoing immense resources needed to keep pushing up the hill forever. Everybody else - except a few - will fall away. The same outcome took place in search. Anybody spot Lycos, Excite, Hotbot, AltaVista around? It costs an enormous amount of money to try to keep up with Google (Bing, Baidu, Yandex) in search and scale it. This will be an even more brutal example of that, as the costs are even higher to scale.

The only way Mistral survives is if they're heavily subsidized directly by European states.

jasonthorsness · 9 months ago
Even if it isn't as capable, having a model with control over training is probably strategically important for every major region of the world. But it could only fall so far behind before it effectively doesn't work in the eyes of the users.
tootie · 9 months ago
As an occasional user of Mistral, I find their model to give generally excellent results and pretty quickly. I think a lot of teams are now overly focused on winning the benchmarks while producing worse real results.
esafak · 9 months ago
If so we need to fix the benchmarks.
littlestymaar · 9 months ago
> Benchmarks suggest this model loses to Deepseek-R1 in every one-shot comparison.

That's not particularly surprising though as the Medium variant is likely close to ten times smaller than DeepSeek-R1 (granted it's a dense model and not an MoE, but still).

funnym0nk3y · 9 months ago
Thought so too. I don't know how it could be different though. They are competing against behemoths like OpenAI or Google, but have only 200 people. Even Anthropic has over 1000 people. DeepSeek has less than 200 people so the comparison seems fair.
rsanek · 9 months ago
any claim from the deepseek folks should be considered with wide margins of error.
wafngar · 9 months ago
But they have built a fully "independent" pipeline. DeepSeek and others probably trained on GPT-4, o1, or similar data.
segmondy · 9 months ago
are you really going to compare a 24B model to a 700B+ model?
a2128 · 9 months ago
24B is the size of the open-sourced Small model. The Medium model is bigger (they don't seem to disclose its size) and still gets beaten by DeepSeek R1.
moffkalast · 9 months ago
The most important comparison is to QwQ at 30B, since it's still the best local reasoning model for that size. A comparison that Mistral did not run for some reason, not even with Qwen3.
fiatjaf · 9 months ago
This reads like an AI-generated comment. What do you mean by "benchmarks suggest"? The benchmarks are very clear and presented right there in the page.


mrtksn · 9 months ago
Europe isn't going to catch up in tech as long as its market is open to US tech giants. Tech doesn't have marginal costs, so you want to build it in one place and sell it everywhere, and when the infra and talent are already in the US, EU tech is destined to do niche products.

The UK has a bit of it, France has some, and that's it. The only viable alternatives are countries that have issues with the US, and that means China and Russia. China has come up with strong competitors and is on the cutting edge.

Also, it doesn't have anything to do with regulations. All 50 US states have the same American regulations, yet it's all happening in one of them; some other states happen to host some infrastructure, but that's true for the rest of the world too.

If the EU/US relationship gets to Trump/Musk level, then EU can have the cutting edge stuff.

Most influential AI researchers are from Europe (incl. the UK), Israel, and Canada anyway. Ilya Sutskever just the other day gave a speech at his alma mater in Canada, for example. Andrej Karpathy is Slovakian. Lots of Brits, French, Poles, Chinese, Germans, etc. are among the pioneers. A significant portion of the talent is non-American already; they just need a reason to be somewhere other than the US to have it outside the US. The Chinese got their reason, and with the state of affairs in the world, I wouldn't be surprised if Europeans get theirs in less than three and a half years.

vikramkr · 9 months ago
If you close off the market to US tech giants, maybe they'll have some amount of market dominance at home, but I would doubt that would mean they've "caught up" tech wise. There would be no incentive to compete. American EV manufacturing is pretty far behind Chinese EV manufacturing, protectionism didn't help make a competitive car, it just protected the home market while slowly ceding international market after international market
ascorbic · 9 months ago
It's mostly about money. DeepMind was founded in the UK, and is still based in London, but there was no way it could get the funding it needed without selling to Google or some other US company. China is one of the few other countries that can afford to fund that kind of thing.
Iulioh · 9 months ago
The problem is, CONSUMER level tech

The EU is doing a lot of enterprise level shit and it's great

The biggest company in Europe sells B2B software (SAP)

simianwords · 9 months ago
How can you explain Israel?
iwontberude · 9 months ago
Which Trump/Musk level? There have been so many.



atemerev · 9 months ago
"EU is leading in regulation", they say.

I don't know what they are thinking.

cpldcpu · 9 months ago
Sorry, this is just getting old...

It's a trite talking point and not the reason why there are so few consumer-AI companies in Europe.

dmos62 · 9 months ago
It is fairly common to struggle to understand why different cultures think the way they do.
Mistletoe · 9 months ago
This is why I want to move to the EU. I don’t care if companies aren’t coddled there. I want to live where people are the first priority.
micromacrofoot · 9 months ago
probably some silly thing like "people should have more rights and protections"
0xDEAFBEAD · 9 months ago
Honestly the US approach to AI is incredibly irresponsible. As an American, I'm glad that someone somewhere is thinking about regulation. Not sure it will be enough though: https://xcancel.com/ESYudkowsky/status/1922710969785917691#m
dwedge · 9 months ago
Their OCR model was really well hyped and coincidentally came out at the time I had a batch of 600-page PDFs to OCR. They were all monospace text, yet for some reason the OCR kept missing it.

I tried it: 80% of the "text" was recognised as images and output as whitespace, so most of it was empty. It was much, much worse than Tesseract.

A month later I got the bill for that crap and deleted my account.

Maybe this one is better, but I'm over hype marketing from Mistral.

notnullorvoid · 9 months ago
I wouldn't trust any of these LLM teams to produce a good OCR model. OCR from 10 years ago is better than the crap they put out.
megalomanu · 9 months ago
We just tested magistral-medium as a replacement for o4-mini in a user-facing feature that relies on JSON generation, where speed is critical. Depending on the complexity of the JSON, o4-mini runs ranged from 50 to 70 seconds. In our initial tests, Mistral returned results in 34-37 seconds. The output quality was slightly lower but still remained acceptable for us. We'll continue testing, but the early results are promising. I'm glad to see Mistral prioritizing speed over raw power; there's definitely a need for that.
nbardy · 9 months ago
I bet you can close the gap with a finetune.

Should be quite easy if you have some o4-mini results sitting around.

kamranjon · 9 months ago
I am curious why you would choose a reasoning model for JSON generation?

I was recently working on a user facing feature using self-hosted Gemma 27b with VLLM and was getting fully formed JSON results in ~7 seconds (even that I would like to optimize further) - obviously the size of the JSON is important but I’d never use a reasoning model for this because they’re constantly circling and just wasting compute.

I haven’t really found a super convincing use-case for reasoning models yet, other than a chat style interface or an assistant to bounce ideas off of.

megalomanu · 9 months ago
It is for generating a big nested JSON, quite complex from a business standpoint (lots of different business concepts). We didn't have good results with simple models.
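
For this kind of user-facing JSON generation, one common belt-and-braces pattern (not necessarily what the team above does; `call_model` here is a stand-in for whatever API client you use) is to parse-validate and retry:

```python
import json

def generate_json(call_model, prompt, retries=3):
    """Ask a model for JSON, parse it, and retry on invalid output.

    call_model is a placeholder for your actual LLM API call.
    """
    last_err = None
    for _ in range(retries):
        raw = call_model(prompt).strip()
        # Models often wrap JSON in markdown fences; strip them first.
        raw = raw.removeprefix("```json").removesuffix("```").strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError as err:
            last_err = err
    raise ValueError(f"no valid JSON after {retries} attempts") from last_err
```

Structured-output / JSON-schema modes, which several APIs now offer, reduce the need for retries, but a parse check is still cheap insurance.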
simonw · 9 months ago
Here are my notes on trying this out locally via Ollama and via their API (and the llm-mistral plugin) too: https://simonwillison.net/2025/Jun/10/magistral/
atxtechbro · 9 months ago
Hi Simon,

What's behind the huge difference between the two pelicans riding bicycles? Was the bad one running the Small version locally, and the pretty good one the bigger model through the API?

Thanks, Morgan

diggan · 9 months ago
Ollama doesn't like proper naming for some reason, so `ollama pull magistral:latest` lands you with the q4_K_M version (currently, subject to change).

Mistral's API defaults to `magistral-medium-2506` right now, which is running with full precision, no quantization.

simonw · 9 months ago
Yes, the bad one was Mistral Small running locally, the better one was Mistral Medium via their API.
internet_points · 9 months ago
> I guess this means the reasoning traces are fully visible and not redacted in any way - interesting to see Mistral trying to turn that into a feature that's attractive to the business clients they are most interested in appealing to.

but then someone found that, at least for distilled models,

> correct traces do not necessarily imply that the model outputs the correct final solution. Similarly, we find a low correlation between correct final solutions and intermediate trace correctness

https://arxiv.org/pdf/2505.13792

i.e. the conclusion doesn't necessarily follow from the reasoning. So is there still value in seeing the reasoning? There may be useful information in it, but I'm not sure it can be interpreted by humans as a typical human chain of reasoning; maybe it should be read more as a loud multi-party discussion on the relevant subject, which may have informed the conclusion but not necessarily led to it.

OTOH, considering the effects of automation fatigue vs human oversight, I guess it's unlikely anyone will ever look at the reasoning in practice, except to summarily verify that it's there and tick the boxes on some form.

christianqchung · 9 months ago
I don't understand why the benchmark selections are so scattered and limited. It only compares Magistral Medium with DeepSeek V3, R1, and the other closed-weight Mistral Medium 3. Why did they leave out Magistral Small entirely, along with comparisons to Alibaba's Qwen or the mini versions of o3 and o4?
elAhmo · 9 months ago
When they include comparisons, it is always a deliberate decision what to show and, more importantly, what not to show. If they had data that would show better performance compared to those models, there is no reason for them to not emphasize that.
CobrastanJorji · 9 months ago
Etymological fun: both "mistral" and "magistral" mean "masterly."

Mistral comes from Occitan for masterly, although today, as far as I know, it's only used in English when talking about Mediterranean winds.

Magistral is just the adjective form of "magister," so "like a master."

If you want to make a few bucks, maybe look up some more obscure synonyms for masterly and pick up the domain names.

snakeboy · 9 months ago
> as far as I know it's only used in English when talking about mediterranean winds.

It's a French company, and "mistral" has this usage in French as well. Also, "magistral" is just the French translation of "masterful".

arnaudsm · 9 months ago
I wished the charts included Qwen3, the current SOTA in reasoning.

Qwen3-4B almost beats the 24B Magistral Small on the 4 available benchmarks, and Qwen3-30B-A3B is miles ahead.

SparkyMcUnicorn · 9 months ago
30-A3B is a really impressive model.

I throw tasks at it running locally to save on API costs, and it's possibly better than anything we had a year or so ago from closed source providers. For programming tasks, I'd rank it higher than gpt-4o

freehorse · 9 months ago
It is a great model, and blazing fast, which is actually very useful esp for "reasoning" models, as they produce a lot of tokens.

I wish Mistral were back to making MoE models. I loved their 8x7 Mixtral; it was one of the greatest models I could run at the time it came out, but it's outdated now. I wish somebody were making a similar-size MoE model, which could comfortably sit in a 64GB RAM MacBook and be fast. Currently Qwen's 30-A3B is the only one I know of, but it would be nice to have something slightly bigger/better (incl. a non-reasoning base one). All the other MoE models are just too big to run locally on more standard hardware.


poorman · 9 months ago
Is there a popular benchmark site people use? Because I had to test all these by hand, and `Qwen3-30B-A3B` still seems like the best model I can run in that relative parameter space (/memory requirements).
arnaudsm · 9 months ago
- https://livebench.ai/#/ + AIME + LiveCodeBench for reasoning

- MMLU-Pro for knowledge

- https://lmarena.ai/leaderboard for user preference

We only got Magistral's GPQA, AIME & livecodebench so far.

resource_waste · 9 months ago
No surprise on my end. Mistral has been basically useless due to other models always being better.

But it's European, so it's a point of pride.

Relevance or not, we will keep hearing the name as a result.

devmor · 9 months ago
I would agree, Qwen3 is definitely the most impressive "reasoning" model I've evaluated so far.