Mistral 3 family of models released

I use large language models in http://phrasing.app to format data I can retrieve in a consistent skimmable manner. I switched to mistral-3-medium-0525 a few months back after struggling to get gpt-5 to stop producing gibberish. It's been insanely fast, cheap, reliable, and follows formatting instructions to the letter. I was (and still am) super super impressed. Even if it does not hold up in benchmarks, it still outperformed in practice.

I'm not sure how these new models compare to the biggest and baddest models, but if price, speed, and reliability are a concern for your use cases I cannot recommend Mistral enough.

Very excited to try out these new models! To be fair, mistral-3-medium-0525 still occasionally produces gibberish ~0.1% of my use cases (vs gpt-5's 15% failure rate). Will report back if that goes up or down with these new models

mrtksn · 2 months ago

Some time ago I canceled all my paid subscriptions to chatbots because they are interchangeable so I just rotate between Grok, ChatGPT, Gemini, Deepseek and Mistral.

On the API side of things my experience is that the model behaving as expected is the greatest feature.

There I also switched to Openrouter instead of paying directly so I can use whatever model fits best.

The recent buzz about ad-based chatbot services is probably because the companies no longer have an edge despite what the benchmarks say, users are noticing it and cancel paid plans. Just today OpenAI offered me 1 month free trial as if I wasn’t using it two months ago. I guess they hope I forget to cancel.

barrell · 2 months ago

Yep I spent 3 days optimizing my prompt trying to get gpt-5 to work. Tried a bunch of different models (some Azure some OpenRouter) and got a better success rate with several others without any tailoring of the prompt.

Was really plug and play. There are still small nuances to each one, but compared to a year ago prompts are much more portable

barbazoo · 2 months ago

> I guess they hope I forget to cancel.

Business model of most subscription based services.

acuozzo · 2 months ago

> because they are interchangeable

What is your use-case?

Mine is: I use "Pro"/"Max"/"DeepThink" models to iterate on novel cross-domain applications of existing mathematics.

My interaction is: I craft a detailed prompt in my editor, hand it off, come back 20-30 minutes later, review the reply, and then repeat if necessary.

My experience is that they're all very, very different from one another.

giancarlostoro · 2 months ago

Maybe give Perplexity a shot? It has Grok, ChatGPT, Gemini, Kimi K2, I dont think it has Mistral unfortunately.

Deleted Comment

druskacik · 2 months ago

This is my experience as well. Mistral models may not be the best according to benchmarks and I don't use them for personal chats or coding, but for simple tasks with pre-defined scope (such as categorization, summarization, etc.) they are the option I choose. I use mistral-small with batch API and it's probably the best cost-efficient option out there.

leobg · 2 months ago

Did you compare it to gemini-2.0-flash-lite?

mbowcut2 · 2 months ago

It makes me wonder about the gaps in evaluating LLMs by benchmarks. There almost certainly is overfitting happening which could degrade other use cases. "In practice" evaluation is what inspired the Chatbot Arena right? But then people realized that Chatbot arena over-prioritizes formatting, and maybe sycophancy(?). Makes you wonder what the best evaluation would be. We probably need lots more task-specific models. That's seemed to be fruitful for improved coding.

pants2 · 2 months ago

The best benchmark is one that you build for your use-case. I finally did that for a project and I was not expecting the results. Frontier models are generally "good enough" for most use-cases but if you have something specific you're optimizing for there's probably a more obscure model that just does a better job.

Deleted Comment

Legend2440 · 2 months ago

I don’t think benchmark overfitting is as common as people think. Benchmark scores are highly correlated with the subjective “intelligence” of the model. So is pretraining loss.

The only exception I can think of is models trained on synthetic data like Phi.

pembrook · 2 months ago

If the models from the big US labs are being overfit to benchmarks, than we also need to account for HN commenters overfitting positive evaluations to Chinese or European models based on their political biases (US big tech = default bad, anything European = default good).

Also, we should be aware of people cynically playing into that bias to try to advertise their app, like OP who has managed to spam a link in the first line of a top comment on this popular front page article by telling the audience exactly what they want to hear ;)

mentalgear · 2 months ago

Thanks for sharing your use case of the mistral models, which are indeed top-notch ! I had a look at phrasing.app, and while a nice website, I found the copy of "Hand-crafted. Phrasing was designed & developed by humans, for humans." somewhat of a false virtue given your statements here of advanced lllm usage.

barrell · 2 months ago

I don't see the contention. I do not use llms in the design, development, copywriting, marketing, blogging, or any other aspect of the crafting of the application.

I labor over every word, every button, every line of code, every blog post. I would say it is as hand-crafted as something digital can be.

metadat · 2 months ago

Are you saying gpt-5 produces gibberish 15% of the time? Or are you comparing Mistral gibberish production rate to gpt-5.1's complex task failure rate?

Does Mistral even have a Tool Use model? That would be awesome to have a new coder entrant beyond OpenAI, Anthropic, Grok, and Qwen.

barrell · 2 months ago

Yes. I spent about 3 days trying to optimize the prompt to get gpt-5 to not produce gibberish, to no avail. Completions took several minutes, had an above 50% timeout rate (with a 6 minute timeout mind you), and after retrying they still would return gibberish about 15% of the time (12% on one task, 20% on another task).

I then tried multiple models, and they all failed in spectacular ways. Only Grok and Mistral had an acceptable success rate, although Grok did not follow the formatting instructions as well as Mistral.

Phrasing is a language learning application, so the formatting is very complicated, with multiple languages and multiple scripts intertwined with markdown formatting. I do include dozens of examples in the prompts, but it's something many models struggle with.

This was a few months ago, so to be fair, it's possible gpt-5.1 or gemini-3 or the new deepseek model may have caught up. I have not had the time or need to compare, as Mistral has been sufficient for my use cases.

I mean, I'd love to get that 0.1% error rate down, but there have always more pressing issues XD

acuozzo · 2 months ago

I have a need to remove loose "signature" lines from the last 10% of a tremendous e-mail dataset. Based on your experience, how do you think mistral-3-medium-0525 would do?

barrell · 2 months ago

What's your acceptable error rate? Honestly ministral would probably be sufficient if you can tolerate a small failure rate. I feel like medium would be overkill.

But I'm no expert. I can't say I've used mistral much outside of my own domain.

mackross · 2 months ago

Cool app. I couldn’t see a way to report an error in one of the default expressions.

Extremely cool! I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release, so it's easier to know how it fares in the grand scheme of things.

Youden · 2 months ago

They mentioned LMArena, you can get the results for that here: https://lmarena.ai/leaderboard/text

Mistral Large 3 is ranked 28, behind all the other major SOTA models. The delta between Mistral and the leader is only 1418 vs. 1491 though. I *think* that means the difference is relatively small.

jampekka · 2 months ago

1491 vs 1418 ELO means the stronger model wins about 60% of the time.

qznc · 2 months ago

I guess that could be considered comparative advertising then and companies generally try to avoid that scrutiny.

constantcrying · 2 months ago

The lack of the comparison (which absolutely was done), tells you exactly what you need to know.

bildung · 2 months ago

I think people from the US often aren't aware how many companies from the EU simply won't risk losing their data to the providers you have in mind, OpenAI, Anthropic and Google. They simply are no option at all.

The company I work for for example, a mid-sized tech business, currently investigates their local hosting options for LLMs. So Mistral certainly will be an option, among the Qwen familiy and Deepseek.

Mistral is positioning themselves for that market, not the one you have in mind. Comparing their models with Claude etc. would mean associating themselves with the data leeches, which they probably try to avoid.

popinman322 · 2 months ago

They're comparing against open weights models that are roughly a month away from the frontier. Likely there's an implicit open-weights political stance here.

There are also plenty of reasons not to use proprietary US models for comparison: The major US models haven't been living up to their benchmarks; their releases rarely include training & architectural details; they're not terribly cost effective; they often fail to compare with non-US models; and the performance delta between model releases has plateaued.

A decent number of users in r/LocalLlama have reported that they've switched back from Opus 4.5 to Sonnet 4.5 because Opus' real world performance was worse. From my vantage point it seems like trust in OpenAI, Anthropic, and Google is waning and this lack of comparison is another symptom.

crimsoneer · 2 months ago

If someone is using these models, they probably can't or won't use the existing SOTA models, so not sure how useful those comparisons actually are. "Here is a benchmark that makes us look bad from a model you can't use on a task you won't be undertaking" isn't actually helpful (and definitely not in a press release).

tarruda · 2 months ago

Here's what I understood from the blog post:

- Mistral Large 3 is comparable with the previous Deepseek release.

- Ministral 3 LLMs are comparable with older open LLMs of similar sizes.

rvz · 2 months ago

> I just wish they would also include comparisons to SOTA models from OpenAI, Google, and Anthropic in the press release,

Why would they? They know they can't compete against the heavily closed-source models.

They are not even comparing against GPT-OSS.

That is absolutely and shockingly bearish.

msp26 · 2 months ago

The new large model uses DeepseekV2 architecture. 0 mention on the page lol.

It's a good thing that open source models use the best arch available. K2 does the same but at least mentions "Kimi K2 was designed to further scale up Moonlight, which employs an architecture similar to DeepSeek-V3".

---

vllm/model_executor/models/mistral_large_3.py

```

from vllm.model_executor.models.deepseek_v2 import DeepseekV3ForCausalLM

class MistralLarge3ForCausalLM(DeepseekV3ForCausalLM):

"Science has always thrived on openness and shared discovery." btw

Okay I'll stop being snarky now and try the 14B model at home. Vision is good additional functionality on Large.

Jackson__ · 2 months ago

So they spent all of their R&D to copy deepseek, leaving none for the singular novel added feature: vision.

To quote the hf page:

>Behind vision-first models in multimodal tasks: Mistral Large 3 can lag behind models optimized for vision tasks and use cases.

Ey7NFZ3P0nzAe · 2 months ago

Well, behind "models" not "langual models".

Of course models purely made for image stuff will completely wipe it out. The vision language models are useful for their generalist capabilities

make3 · 2 months ago

Architecture difference wrt vanilla transformers and between modern transformers are a tiny part of what makes a model nowadays

halJordan · 2 months ago

I don't think it's fair to demand everything be open and then get mad when they open-ness is used. It's an obsessive and harmful double standard.

simonw · 2 months ago

The 3B vision model runs in the browser (after a 3GB model download). There's a very cool demo of that here: https://huggingface.co/spaces/mistralai/Ministral_3B_WebGPU

Pelicans are OK but not earth-shattering: https://simonwillison.net/2025/Dec/2/introducing-mistral-3/

troyvit · 2 months ago

I'm reading this post and wondering what kind of crazy accessibility tools one could make. I think it's a little off the rails but imagine a tool that describes a web video for a blind user as it happens, not just the speech, but the actual action.

GaggiX · 2 months ago

This is not local but Gemini models can process very long videos and provide description with timestamps if asked for.

https://ai.google.dev/gemini-api/docs/video-understanding#tr...

user_of_the_wek · 2 months ago

> The image depicts and older man...

Ouch

mythz · 2 months ago

Europe's bright star has been quiet for a while, great to see them back and good to see them come back to Open Source light with Apache 2.0 licenses - they're too far from the SOTA pack that exclusive/proprietary models would work in their favor.

Mistral had the best small models on consumer GPUs for a while, hopefully Ministral 14B lives up to their benchmarks.

All thanks to the US VCs that acutally have money to fund Mistral's entire business.

Had they gone to the EU, Mistral would have gotten a miniscule grant from the EU to train their AI models.

amarcheschi · 2 months ago

Mistral biggest investor is asml, although it became so later than other vcs

I mean, one is a government, the other are VCs (also, I would be shocked if there isn't some French gov funding somewhere in the massive mistral pile).

whiplash451 · 2 months ago

1. so what 2. asml

timpera · 2 months ago

yvoschaap · 2 months ago

Upvoting for Europe's best efforts.

sebzim4500 · 2 months ago

That's unfair to Europe. A bunch of AI work is done in London (Deepmind is based here for a start)

p2detar · 2 months ago

That's ok. How could they know that there are companies like Aleph Alpha, Helsing or the famous DeepL. European companies are not that vocal, but that doesn't mean they aren't making progress in the field.

edit: typos

Glemkloksdjf · 2 months ago

Thats not the point.

Deepmind is not an UK company, its google aka US.

Mistral is a real EU based company.

London is not part of Europe anymore since Brexit /s

colesantiago · 2 months ago

Deepmind doesn't exist anymore.

Google DeepMind does exist.

LunaSea · 2 months ago

Upvoting Windows 11 as the US's best effort at Operating Systems development.

DarmokJalad1701 · 2 months ago

Wouldn't that be macOS? Or BSD? Or Unix? CentOS?

mrinterweb · 2 months ago

I don't like being this guy, but I think Deepseek 3.2 stole all the thunder yesterday. Notice that these comparisons are to Deepseek 3.1. Deepseek 3.2 is a big step up over 3.1, if benchmarks are to be believed. Just unfortunate timing of release. https://api-docs.deepseek.com/news/news251201

hiddencost · 2 months ago

Idk. They look like they're ahead on the saturated benchmarks and behind on the unsaturated ones. Looks more like that over fit to the benchmarks.

simgt · 2 months ago

I still don't understand what the incentive is for releasing genuinely good model weights. What makes sense however is OpenAI releasing a somewhat generic model like gpt-oss that games the benchmarks just for PR. Or some Chinese companies doing the same to cut the ground from under the feet of American big tech. Are we really hopeful we'll still get decent open weights models in the future?

mirekrusin · 2 months ago

Because there is no money in making them closed.

Open weight means secondary sales channels like their fine tuning service for enterprises [0].

They can't compete with large proprietary providers but they can erode and potentially collapse them.

Open weights and research builds on itself advancing its participants creating environment that has a shot at proprietary services.

Transparency, control, privacy, cost etc. do matter to people and corporations.

[0] https://mistral.ai/solutions/custom-model-training

talliman · 2 months ago

Until there is a sustainable, profitable and moat-building business model for generative AI, the competition is not to have the best proprietary model, but rather to raise the most VC money to be well positioned when that business model does arise.

Releasing a near stat-of-the-art open model instanly catapults companies to a valuation of several billion dollars, making it possible raise money to acquire GPUs and train more SOTA models.

Now, what happens if such a business model does not emerge? I hope we won't find out!

Explained well in this documentary [0].

[0] https://www.youtube.com/watch?v=BzAdXyPYKQo

memming · 2 months ago

It’s funny how future money drive the world. Fortunately it’s fueling progress this time around.

NitpickLawyer · 2 months ago

> gpt-oss that games the benchmarks just for PR.

gpt-oss is killing the ongoing AIME3 competition on kaggle. They're using a hidden, new set of problems, IMO level, handcrafted to be "AI hardened". And gpt-oss submissions are at ~33/50 right now, two weeks into the competition. The benchmarks (at least for math) were not gamed at all. They are really good at math.

lostmsu · 2 months ago

Are they ahead of all other recent open models? Is there a leaderboard?

prodigycorp · 2 months ago

gpt-oss are really solid models. by far the best at tool calling, and performant.

nullbio · 2 months ago

Google games benchmarks more than anyone, hence Gemini's strong bench lead. In reality though, it's still garbage for general usage.