We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
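One cheap way to run that unannounced variant is to randomize the subject/vehicle pair at eval time, so the exact prompt is unlikely to sit in anyone's training set. A throwaway sketch (the word lists are just made-up placeholders):

```python
import random

# Hypothetical sketch: build a surprise "X riding a Y" prompt so the exact
# string is unlikely to appear verbatim in training data.
ANIMALS = ["pelican", "bumblebee", "otter", "flamingo", "hedgehog"]
VEHICLES = ["bicycle", "kayak", "unicycle", "skateboard", "hot air balloon"]

def surprise_prompt(rng: random.Random | None = None) -> str:
    rng = rng or random.Random()
    return f"Generate an SVG of a {rng.choice(ANIMALS)} riding a {rng.choice(VEHICLES)}"

print(surprise_prompt())
```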
> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.
I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?
The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
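For what it's worth, "verifiable" can start as low-tech as checking that the output parses as SVG at all; a rough sketch using only the stdlib XML parser (a real harness would also render it and compare against a reference):

```python
import xml.etree.ElementTree as ET

def looks_like_svg(text: str) -> bool:
    """Rough check: does the model output parse as XML with an <svg> root?"""
    try:
        root = ET.fromstring(text)
    except ET.ParseError:
        return False
    # ElementTree keeps the namespace in the tag, e.g. "{http://www.w3.org/2000/svg}svg"
    return root.tag.split("}")[-1] == "svg"

print(looks_like_svg('<svg xmlns="http://www.w3.org/2000/svg"><circle r="5"/></svg>'))  # True
print(looks_like_svg("not markup at all"))                                              # False
```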
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)
It's open source; the price is up to the provider, and I do not see any on OpenRouter yet. ~~Given that devstral is much smaller, I can not imagine it will be more expensive, let alone 5x. If anything DeepSeek will be 5x the cost.~~
edit: Mea culpa. I missed the active vs dense difference.
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.
It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.
It introduced one new bug, but then fixed it on the first try when I pointed it out.
The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.
It's too early to form a conclusion, but so far, it's looking quite competent.
Also tried it on a small project; it did OK finding issues but completely failed at rather basic edits - it lost closing brackets or used wrong syntax and couldn't recover. The CLI was easy to set up and use, though.
Did you try it via OpenRouter? If so, what provider? I've noticed some providers seem to not exactly be upfront about what quantization they're using; you can see that providers who supposedly run the exact same model and weights give vastly different responses.
Back when Devstral 1 was released, this became very noticeable to me: the providers using the smaller quantizations were unable to properly format the code, just as you noticed. That's why this sounded so similar to what I've seen before.
In my experience, the messed up closing brackets are a surprisingly common issue for LLMs. Both Sonnet 4.5 and Gemini 3 also do this regularly. Seems like something that should be relatively easy to fix, though.
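Detecting it is at least cheap, even if fixing the underlying model behaviour isn't. A naive balance check like the sketch below (it ignores brackets inside strings and comments, so it's a smoke test, not a parser) would flag most of those broken edits before they get applied:

```python
# Naive bracket-balance smoke test for a proposed edit.
# Deliberately ignores string literals and comments, so treat it as a heuristic.
PAIRS = {")": "(", "]": "[", "}": "{"}

def brackets_balanced(code: str) -> bool:
    stack = []
    for ch in code:
        if ch in "([{":
            stack.append(ch)
        elif ch in PAIRS and (not stack or stack.pop() != PAIRS[ch]):
            return False
    return not stack

print(brackets_balanced("def f(x): return [x, {1: (2, 3)}]"))  # True
print(brackets_balanced("def f(x): return [x, {1: (2, 3)}"))   # False: missing ]
```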
So I tested the bigger model with my typical standard test queries, which are not so tough, not so easy. They are also ones you wouldn't find extensive training data for. Finally, I have already used them to get answers from gpt-5.1, sonnet 4.5 and gemini 3...
Here is what I think about the bigger model: it sits between sonnet 4 and sonnet 4.5. Something like "sonnet 4.3". The response speed was pretty good.
Overall, I can see myself shifting to this for regular day-to-day coding if they can offer it at competitive pricing.
I'll still use sonnet 4.5 or gemini 3 for complex queries, but for everything else code-related, this seems to be pretty good.
Congrats Mistral. You have very nearly caught up to the big guys. Not there yet exactly, but not far now.
Looks interesting, eager to play around with it! Devstral was a neat model when it was released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so it's gonna be interesting to see if Devstral 2 can replace it.
I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise for seeing where models go wrong, but for professional work where you need tight control over quality, you obviously cannot vibe your way to excellence; hard reviews are required. So not "vibe coding", which is all about unreviewed code and just going with whatever the LLM outputs.
But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe-coding frenzy. But where are the professional tools, meant for people who don't want to do vibe-coding but want to be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on offloading work to vibe-coding agents, while what I want is something even more tightly integrated with my tools so I can continue delivering high-quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...
Their new CLI agent tool [1] is written in Python, unlike similar agents from Anthropic/Google (TypeScript/Bun) and OpenAI (Rust). It also appears to have first-class ACP support, where ACP is the new protocol from Zed [2].
[1] https://github.com/mistralai/mistral-vibe
[2] https://zed.dev/acp
This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.
A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and are then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL-visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.
But there is nothing more permanent than a quickly hacked-together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.
> But where are the professional tools, meant for people who don't want to do vibe-coding but want to be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?
Claude Code has absolutely zero features that help me review code or do anything other than vibe-code and accept changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.
Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
RTX Pro 6000; it ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. Guess you could do it with two 5090s with slightly less context, or different software aimed at memory-usage efficiency.
Says the person who will find themselves unable to change the software even in the slightest way without having to do large refactors across everything at the same time.
High quality code matters more than ever, would be my argument. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem is the second you invite it to keep doing that forever.
"high quality specifications" have _always_ been a thing that matters.
In my mind, it's somewhat orthogonal to code quality.
Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways it probably drives lower-quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".
If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)
There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
It sure doesn't feel like it, given how closely I have to babysit Claude Code lest I not recognize the code after it's been left to its own devices for a minute.
No, that's not the definition of "vibe coding". Vibe coding is letting the model do whatever without reviewing it and not understanding the architecture. This was the original definition and still is.
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
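For a back-of-envelope sanity check on the 32 GB Mac / RTX 4090 claim: it's plausible if Devstral Small 2 is in the same roughly-24B-parameter class as the first Devstral Small (that size is my assumption, not a spec), and the numbers below are estimates, not measurements:

```python
# Rough VRAM estimate: parameter count * bits per weight, plus headroom for
# KV cache and runtime overhead. The ~24B figure is an assumption.
PARAMS = 24e9
OVERHEAD_GB = 3.0  # KV cache + runtime, hand-waved

def est_gb(bits_per_weight: float) -> float:
    return PARAMS * bits_per_weight / 8 / 1e9 + OVERHEAD_GB

for label, bits in [("fp16", 16), ("8-bit", 8), ("~4-bit", 4.5)]:
    print(f"{label:>7}: ~{est_gb(bits):.1f} GB")
# fp16  : ~51.0 GB -> multi-GPU or big unified-memory territory
# 8-bit : ~27.0 GB -> over a 24 GB 4090, tight on a 32 GB Mac
# ~4-bit: ~16.5 GB -> fits a 24 GB 4090 or a 32 GB Mac with room for context
```

Tokens per second is a separate question (decode is mostly memory-bandwidth bound), which is where the Spark / Strix Halo / Mac trade-offs in the replies come in.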
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.
I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I was doing more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.
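Concretely, that tunnel is just a local port-forward, so whatever server runs on the rented box shows up as localhost; a minimal sketch (the host name and port are placeholders):

```python
import subprocess

# Forward local port 8000 to port 8000 on the rented GPU box, so e.g. an
# OpenAI-compatible server running there is reachable at http://localhost:8000.
# "user@rented-gpu-host" is a placeholder. -N: run no remote command, -L: local forward.
subprocess.run(["ssh", "-N", "-L", "8000:localhost:8000", "user@rented-gpu-host"])
```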
For grins:
Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in for $5,000 today and given ram prices, maybe not actually possible tomorrow.
Max CUDA compatibility, slower t/s? DGX Spark.
Ok with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128GB unified memory; order a Framework Desktop.
Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth; Mac users seem to be quite happy running locally for just messing around.
Dual 3090s (24GB each) on 8x+8x PCIe has been a really reliable setup for me (with NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)
48GB of VRAM and lots of CUDA cores; hard to beat this value atm.
If you want to go even further, you can get an 8x V100 32GB server complete with 512GB ram and nvlink switching for $7000 USD from unixsurplus (ebay.com/itm/146589457908) which can run even bigger models and with healthy throughput. You would need 240V power to run that in a home lab environment though.
Or a Strix Halo Ryzen AI Max. Lots of "unified" memory that can be dedicated to the GPU portion, for not that much money. Read through benchmarks to know if the performance will be enough for your needs though.
I'm not excited that it's done in Python. I've had experience with Aider struggling to display text as fast as the LLM is spitting it out, though that was probably 6 months ago now.
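That said, the slowness is often re-rendering the whole markdown document on every token rather than Python itself, and capping the refresh rate mostly hides it. A rough sketch with Rich (assuming Rich is installed; the fake stream is just a stand-in for model output):

```python
import time
from rich.live import Live
from rich.markdown import Markdown

def fake_stream():
    # Stand-in for tokens arriving from the model.
    for word in "## Demo\n\nStreaming *markdown* one chunk at a time.".split(" "):
        yield word + " "
        time.sleep(0.05)

buffer = ""
# Cap refresh at ~8 fps instead of re-rendering on every single chunk.
with Live(refresh_per_second=8) as live:
    for chunk in fake_stream():
        buffer += chunk
        live.update(Markdown(buffer))
```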
Pretty good for a 123B model!
(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)
So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.
So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?
https://news.ycombinator.com/item?id=46183294
Yes, SVG is code, but not in the sense of being executable with verifiable inputs and outputs.
(Surely they won't release it like that, right..?)
That looks like the next flagship rather than the fast distillation, but thanks for sharing.
"There is nothing more permanent than a temporary demo"
Claude Code not good enough for ya?
This is what we're building at Brokk: https://brokk.ai/
Quick intro: https://blog.brokk.ai/introducing-lutz-mode/
While True:
0. Context injected automatically. (My repos are small.)
1. I describe a change.
2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)
3. I accept/reject the edit.
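A minimal sketch of that loop; `propose_edit` and `apply_edit` below are hypothetical placeholders for the actual LLM call and patch application, not Brokk's API:

```python
# Hypothetical sketch of the describe -> propose -> accept/reject loop above.

def propose_edit(instruction: str, context: str) -> str:
    raise NotImplementedError("one LLM call that returns a (possibly multi-file) diff")

def apply_edit(diff: str) -> None:
    raise NotImplementedError("apply the accepted diff to the working tree")

def session(context: str) -> None:
    while True:
        instruction = input("describe a change (empty to quit)> ").strip()
        if not instruction:
            break
        diff = propose_edit(instruction, context)
        print(diff)
        if input("accept? [y/N]> ").strip().lower() == "y":
            apply_edit(diff)
```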
What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?
There are many platforms out there that can run it decently.
AMD Strix Halo and Mac platforms; two (or three, without needing extra system RAM) of the new AMD AI Pro R9700 (32GB VRAM, $1200); multi-consumer-GPU setups; etc.
What matters is high quality specifications including test cases
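Agreed, with the caveat that the test cases are the one part of the spec an LLM can't quietly reinterpret. A trivial example of spec-as-test (the `slugify` function and its reference implementation here are made up for illustration; in practice the tests come first and the model writes the implementation):

```python
import re
import unicodedata
import pytest

def slugify(raw: str) -> str:
    # Minimal reference implementation so the spec below runs; in practice this
    # is the part you'd hand to the model, with the tests acting as the spec.
    text = unicodedata.normalize("NFKD", raw).encode("ascii", "ignore").decode()
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

@pytest.mark.parametrize("raw,expected", [
    ("Hello, World!", "hello-world"),
    ("  spaced  out  ", "spaced-out"),
    ("ümlauts ök", "umlauts-ok"),
])
def test_slugify_spec(raw, expected):
    assert slugify(raw) == expected
```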
And while it may not resonate with the HN crowd, one of the main selling points of AI coding is the ease and playfulness.
Even the Gemini 3 announcement page had some bit like "best model for vibe coding".
You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.
That's a good idea!
Curious about this, if you don't mind sharing:
- what's the stack? (Do you run something like llama.cpp on that rented machine?)
- what model(s) do you run there?
- what's your rough monthly cost? (Does it come out much cheaper than calling the equivalent paid APIs?)
Fuck nvidia
How is it? I'd guess a bunch of the MoE models actually run well?
https://willmcgugan.github.io/streaming-markdown/