simonw · 5 days ago

  llm install llm-mistral
  llm mistral refresh
  llm -m mistral/devstral-2512 "Generate an SVG of a pelican riding a bicycle"
https://tools.simonwillison.net/svg-render#%3Csvg%20xmlns%3D...

Pretty good for a 123B model!

(That said I'm not 100% certain I guessed the correct model ID, I asked Mistral here: https://x.com/simonw/status/1998435424847675429)

Jimmc414 · 5 days ago
We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data. It would be a great way to ensure an initial thumbs up from a prominent reviewer. It's a good benchmark, but it seems like it would be a good idea to include an additional random or unannounced similar test to catch any benchmaxxing.
thatwasunusual · 5 days ago
> We are getting to the point that it's not unreasonable to think that "Generate an SVG of a pelican riding a bicycle" could be included in some training data.

I may be stupid, but _why_ is this prompt used as a benchmark? I mean, pelicans _can't_ ride a bicycle, so why is it important for "AI" to show that they can (at least visually)?

The "wine glass problem"[0] - and probably others - seems to me to be a lot more relevant...?

[0] https://medium.com/@joe.richardson.iii/the-curious-case-of-t...

Workaccount2 · 4 days ago
It would be easy to out models that train on the bike pelican, because they would probably suck at the kayaking bumblebee.

So far though, the models good at bike pelican are also good at kayak bumblebee, or whatever other strange combo you can come up with.

So if they are trying to benchmaxx by making SVG generation stronger, that's not really a miss, is it?

th0ma5 · 5 days ago
If this had any substance then it could be criticized, which is what they're trying to avoid.
0cf8612b2e1e · 4 days ago
I assume all of the models also have variations on, “how many ‘r’s in strawberry”.

baq · 5 days ago
But can it recreate the Space Jam 1996 website? https://www.spacejam.com/1996/jam.html
aschobel · 5 days ago
In case folks are missing the context:

https://news.ycombinator.com/item?id=46183294

lagniappe · 5 days ago
That is not a meaningful metric given that we don't live in 1996 and neither do our web standards.
cpursley · 5 days ago
Skipped the bicycle entirely and upgraded to a sweet motorcycle :)
aorth · 5 days ago
Looks like a Cybertruck actually!
lubujackson · 5 days ago
The Batman motorcycle!
willahmad · 5 days ago
I think this benchmark could be slightly misleading for assessing a coding model. But it's still a very good result.

Yes, SVG is code, but not in the sense of something executable with verifiable inputs and outputs.

jstummbillig · 5 days ago
I love that we are earnestly contemplating the merits of the pelican benchmark. What a timeline.
hdjrudni · 4 days ago
But it does have a verifiable output, no more or less than HTML+CSS. Not sure what you mean by "input" -- it's not a function that takes in parameters if that's what you're getting at, but not every app does.
iberator · 5 days ago
Where did you get the llm tool from?!
lacoolj · 4 days ago
How did you run a 123B model locally? Or did you do this on a GPU host somewhere? If so, what spec was it?
simonw · 4 days ago
I haven't run the 123B one locally yet. I used Mistral's own API models for this.
felixg3 · 5 days ago
Is it really an SVG if it's just embedded base64 of a JPG?
joombaga · 5 days ago
You were seeing the base64 image tag output at the bottom. The SVG input is at the top.
samgutentag · 4 days ago
"Generate an SVG of a pelican riding a bicycle" is the new "but can it run Crysis"

esafak · 5 days ago
Less than a year behind the SOTA, faster, and cheaper. I think Mistral is mounting a good recovery. I would not use it yet since it is not the best along any dimension that matters to me (I'm not EU-bound) but it is catching up. I think its closed source competitors are Haiku 4.5 and Gemini 3 Pro Fast (TBA) and whatever ridiculously-named light model OpenAI offers today (GPT 5.1 Codex Max Extra High Fast?)
kevin061 · 5 days ago
The OpenAI thing is named Garlic.

(Surely they won't release it like that, right..?)

esafak · 5 days ago
TIL: https://garlicmodel.com/

That looks like the next flagship rather than the fast distillation, but thanks for sharing.

YetAnotherNick · 5 days ago
No, this is comparable to DeepSeek-V3.2 even on their highlight task, with significantly worse general ability. And it's priced at 5x that.
esafak · 5 days ago
It's open source; the price is up to the provider, and I do not see any on openrouter yet. ̶G̶i̶v̶e̶n̶ ̶t̶h̶a̶t̶ ̶d̶e̶v̶s̶t̶r̶a̶l̶ ̶i̶s̶ ̶m̶u̶c̶h̶ ̶s̶m̶a̶l̶l̶e̶r̶,̶ ̶I̶ ̶c̶a̶n̶ ̶n̶o̶t̶ ̶i̶m̶a̶g̶i̶n̶e̶ ̶i̶t̶ ̶w̶i̶l̶l̶ ̶b̶e̶ ̶m̶o̶r̶e̶ ̶e̶x̶p̶e̶n̶s̶i̶v̶e̶,̶ ̶l̶e̶t̶ ̶a̶l̶o̶n̶e̶ ̶5̶x̶.̶ ̶I̶f̶ ̶a̶n̶y̶t̶h̶i̶n̶g̶ ̶D̶e̶e̶p̶S̶e̶e̶k̶ ̶w̶i̶l̶l̶ ̶b̶e̶ ̶5̶x̶ ̶t̶h̶e̶ ̶c̶o̶s̶t̶.̶

edit: Mea culpa. I missed the active vs dense difference.

InsideOutSanta · 5 days ago
I gave Devstral 2 in their CLI a shot and let it run over one of my smaller private projects, about 500 KB of code. I asked it to review the codebase, understand the application's functionality, identify issues, and fix them.

It spent about half an hour, correctly identified what the program did, found two small bugs, fixed them, made some minor improvements, and added two new, small but nice features.

It introduced one new bug, but then fixed it on the first try when I pointed it out.

The changes it made to the code were minimal and localized; unlike some more "creative" models, it didn't randomly rewrite stuff it didn't have to.

It's too early to form a conclusion, but so far, it's looking quite competent.

Staross · 4 days ago
Also tried it on a small project. It did OK at finding issues, but completely failed at rather basic edits: it lost closing brackets or used the wrong syntax and couldn't recover. The CLI was easy to set up and use, though.
embedding-shape · 4 days ago
Did you try it via OpenRouter? If so, what provider? I've noticed some providers seem to not be entirely upfront about what quantization they're using; you can see that providers who supposedly run the exact same model and weights give vastly different responses.

Back when Devstral 1 was released, this was very noticeable to me, because the providers using the smaller quantizations were unable to properly format code, just as you noticed. That's why this sounded so similar to what I've seen before.

InsideOutSanta · 3 days ago
In my experience, the messed up closing brackets are a surprisingly common issue for LLMs. Both Sonnet 4.5 and Gemini 3 also do this regularly. Seems like something that should be relatively easy to fix, though.
MLgulabio · 5 days ago
On what hardware did you run it?
syntaxing · 5 days ago
FWIW, it’s free through Mistral right now
freakynit · 4 days ago
So I tested the bigger model with my typical standard test queries, which are neither too tough nor too easy. They're also ones you wouldn't find extensive training data for. Finally, I've already used them to get answers from GPT-5.1, Sonnet 4.5 and Gemini 3...

Here is what I think about the bigger model: it sits between Sonnet 4 and Sonnet 4.5. Something like "Sonnet 4.3". The response speed was pretty good.

Overall, I can see myself shifting to this for regular day-to-day coding if they can offer it at competitive pricing.

I'll still use Sonnet 4.5 or Gemini 3 for complex queries, but for everything else code-related, this seems to be pretty good.

Congrats, Mistral. You've most probably caught up to the big guys. Not quite there yet, but not far now.

embedding-shape · 5 days ago
Looks interesting, eager to play around with it! Devstral was a neat model when it was released and one of the better ones to run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this, so it's going to be interesting to see if Devstral 2 can replace it.

I'm a bit saddened by the name of the CLI tool, which to me implies the intended usage. "Vibe-coding" is a fun exercise for seeing where models go wrong, but for professional work where you need tight control over quality, you obviously can't vibe your way to excellence; hard reviews are required. So it's not "vibe coding", which is all about unreviewed code and just going with whatever the LLM outputs.

But regardless of that, it seems like everyone and their mother is aiming to fuel the vibe-coding frenzy. But where are the professional tools, meant to be used by people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it? All the agents seem to focus on offloading work to vibe-coding agents, while what I want is something integrated even more tightly with my tools so I can continue delivering high-quality code I know and control. Where are those tools? None of the existing coding agents apparently aim for this...

williamstein · 5 days ago
Their new CLI agent tool [1] is written in Python, unlike similar agents from Anthropic/Google (TypeScript/Bun) and OpenAI (Rust). It also appears to have first-class ACP support; ACP is the new agent protocol from Zed [2].

[1] https://github.com/mistralai/mistral-vibe

[2] https://zed.dev/acp

esafak · 5 days ago
I did not know A2A had a competitor :(
embedding-shape · 5 days ago
> Their new CLI agent tool [1] is written in

This is exactly the CLI I'm referring to, whose name implies it's for playing around with "vibe-coding", instead of helping professional developers produce high quality code. It's the opposite of what I and many others are looking for.

hadlock · 5 days ago
>vibe-coding

A surprising amount of programming is building cardboard services or apps that only need to last six months to a year and are then thrown away when temporary business needs change. Execs are constantly clamoring for semi-persistent dashboards and ETL visualized data that lasts just long enough to rein in the problem and move on to the next fire. Agentic coding is good enough for cardboard services that collapse when they get wet. I wouldn't build an industrial data lake service with it, but you can certainly build cardboard consumers of the data lake.

bigiain · 4 days ago
You are right.

But there is nothing more permanent than a quickly hacked-together prototype or personal productivity hack that works. There are so many Python (or Perl or Visual Basic) scripts or Excel spreadsheets - created by people who have never been "developers" - which solve in-the-trenches pain points and become indispensable in exactly the way _that_ xkcd shows.

3vidence · 4 days ago
There is a phrase I've heard a number of times in my career that I find relevant here.

"There is nothing more permanent than a temporary demo"

pdntspa · 5 days ago
> But where are the professional tools, meant to be used by people who don't want to do vibe-coding, but be heavily assisted by LLMs? Something that is meant to augment the human intellect, not replace it?

Claude Code not good enough for ya?

embedding-shape · 5 days ago
Claude Code has absolutely zero features that help me review code or do anything other than vibe-coding and accepting changes as they come in. We need diff comparisons between different executions, a TUI tailored for that kind of work, and more. Claude Code is basically an MVP of that.

Still, I do use Claude Code and Codex daily as there is nothing better out there currently. But they still feel tailored towards vibe-coding instead of professional development.

jbellis · 5 days ago
> where are the professional tools, meant to be used by people who don't want to do vibe-coding, but be heavily assisted by LLMs?

This is what we're building at Brokk: https://brokk.ai/

Quick intro: https://blog.brokk.ai/introducing-lutz-mode/

johanvts · 5 days ago
Did you try Aider?
embedding-shape · 5 days ago
I did, although a long time ago, so maybe I need to try it again. But it still seems to be stuck in a chat-like interface instead of something tailored to software development. Think IDE but better.
andai · 5 days ago
I created a very unprofessional tool, which apparently does what you want!

While True:

0. Context injected automatically. (My repos are small.)

1. I describe a change.

2. LLM proposes a code edit. (Can edit multiple files simultaneously. Only one LLM call required :)

3. I accept/reject the edit.
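
A minimal sketch of what that loop might look like (hypothetical code, not the actual tool; the model name, prompt format, and file handling are all assumptions), using the OpenAI Python SDK:

  # Sketch of the describe -> propose -> accept/reject loop described above.
  from pathlib import Path
  from openai import OpenAI

  client = OpenAI()

  def gather_context(root: str = ".") -> str:
      """Step 0: inject the whole (small) repo as context."""
      return "\n\n".join(
          f"### {p}\n{p.read_text()}" for p in Path(root).rglob("*.py")
      )

  SYSTEM = (
      "You are a code editor. Given the repo and a requested change, reply "
      "with the full new contents of every file you modify, each formatted "
      "as '### <path>' followed by the file body."
  )

  while True:
      change = input("Describe a change (empty to quit): ").strip()  # step 1
      if not change:
          break
      # Step 2: a single LLM call proposes edits across any number of files.
      reply = client.chat.completions.create(
          model="gpt-4o-mini",  # placeholder model name
          messages=[
              {"role": "system", "content": SYSTEM},
              {"role": "user", "content": gather_context() + "\n\nChange: " + change},
          ],
      ).choices[0].message.content
      print(reply)
      # Step 3: nothing touches disk until the human accepts the proposal.
      if input("Accept and write files? [y/N] ").strip().lower() == "y":
          for block in reply.split("### ")[1:]:
              path, _, body = block.partition("\n")
              Path(path.strip()).write_text(body)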

chrsw · 5 days ago
> run locally for agentic coding. Nowadays I mostly use GPT-OSS-120b for this

What kind of hardware do you have to be able to run a performant GPT-OSS-120b locally?

embedding-shape · 5 days ago
An RTX Pro 6000; it ends up taking ~66GB when running the MXFP4 native quant with llama-server/llama.cpp and max context, as an example. I guess you could do it with two 5090s and slightly less context, or with different software aimed at memory efficiency.
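For reference, the invocation is something along these lines (a sketch; the GGUF filename and context length are placeholders, not my exact values):

  llama-server -m gpt-oss-120b-mxfp4.gguf -c 131072 -ngl 99 --port 8080
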
fgonzag · 5 days ago
The model is 64GB (int4 native), add 20GB or so for context.

There are many platforms out there that can run it decently.

AMD Strix Halo, Mac platforms, two (or three, without extra RAM) of the new AMD AI Pro R9700 (32GB of RAM, $1200), multi-consumer-GPU setups, etc.

FuckButtons · 5 days ago
MBP with 128GB.
true2octave · 5 days ago
High quality code is a thing from the past

What matters is high quality specifications including test cases

embedding-shape · 5 days ago
> High quality code is a thing from the past

Says the person who will find themselves unable to change the software in even the slightest way without having to do large refactors across everything at the same time.

High-quality code matters more than ever, I would argue. The second you let the LLM sneak in some quick hack/patch instead of correctly solving the problem is the second you invite it to keep doing that forever.

bigiain · 4 days ago
"high quality specifications" have _always_ been a thing that matters.

In my mind, it's somewhat orthogonal to code quality.

Waterfall has always been about "high quality specifications" written by people who never see any code, much less write it. Agile makes specs and code quality somewhat related, but in at least some ways it probably drives lower-quality code in the pursuit of meeting sprint deadlines and producing testable artefacts at the expense of thoroughness/correctness/quality.

htrp · 4 days ago
What's wrong with the current IDE tools?
pluralmonad · 5 days ago
I'm sure I'm not the only one that thinks "Vibe CLI" sounds like an unserious tool. I use Claude Code a lot and little of it is what I would consider Vibe Coding.
tormeh · 5 days ago
They're looking for free publicity. "This French company launched a tool that lets you 'vibe' an application into being. Programmers outraged!"
klysm · 5 days ago
Using LLMs to write code is inherently best for unserious work.
dwaltrip · 5 days ago
These are the cutting insights I come to HN for.
freakynit · 4 days ago
"Not reviewing generated code" is the problem. Not the LLM generated code.

jimmydoe · 5 days ago
Maybe they are just trying to be funny.
Eupolemos · 4 days ago
Their chat was called "Le Chat" - it's just their style.

And while it may miss the HN crowd, one of the main selling points of AI coding is the ease and playfulness.

kilpikaarna · 4 days ago
Agree, but that's just the term for any LLM-assisted development now.

Even the Gemini 3 announcement page had some bit like "best model for vibe coding".

isodev · 5 days ago
If you’re letting Claude write code you’re vibe coding
andai · 5 days ago
So people have different definitions of the word, but originally Vibe Coding meant "don't even look at the code".

If you're actually making sure it's legit, it's not vibe coding anymore. It's just... Backseat Coding? ;)

There's a level below that I call Power Coding (like power armor) where you're using a very fast model interactively to make many very small edits. So you're still doing the conceptual work of programming, but outsourcing the plumbing (LLM handles details of syntax and stdlib).

NitpickLawyer · 5 days ago
The original definition was very different. The main thing with vibe coding is that you don't care about the code. You don't even look at the code. You prompt, test that you got what you wanted, and move on. You can absolutely use cc to vibe code. But you can also use it to ... code based on prompts. Or specs. Or docs. Or whatever else. The difference is if you want / care to look at the code or not.
tomashubelbauer · 5 days ago
It sure doesn't feel like it, given how closely I have to babysit Claude Code, lest I not recognize the code after it's been left to its own devices for a minute.
sunaookami · 4 days ago
No, that's not the definition of "vibe coding". Vibe coding is letting the model do whatever without reviewing it and not understanding the architecture. This was the original definition and still is.

princehonest · 5 days ago
Let's say you had a hardware budget of $5,000. What machine would you buy or build to run Devstral Small 2? The HuggingFace page claims it can run on a Mac with 32 GB of memory or an RTX 4090. What kind of tokens per second would you get on each? What about DGX Spark? What about RTX 5090 or Pro series? What about external GPUs on Oculink with a mini PC?
clusterhacks · 5 days ago
All those choices seem to have very different trade-offs? I hate $5,000 as a budget - not enough to launch you into higher-VRAM RTX Pro cards, too much (for me personally) to just spend on a "learning/experimental" system.

I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system. I mean, if I were doing more HPC/numerical programming (say, similarity search on GPUs :-) ), I could see just taking the hit and spending $15,000 on a workstation with an RTX Pro 6000.

For grins:

Max t/s for this and smaller models? RTX 5090 system. Barely squeezing in at $5,000 today, and given RAM prices, maybe not actually possible tomorrow.

Max CUDA compatibility, slower t/s? DGX Spark.

OK with slower t/s, don't care so much about CUDA, and want to run larger models? Strix Halo system with 128GB unified memory; order a Framework Desktop.

Prefer Macs, might run larger models? M3 Ultra with memory maxed out. Better memory bandwidth; Mac users seem to be quite happy running locally for just messing around.

You'll probably find better answers heading off to https://www.reddit.com/r/LocalLLaMA/ for actual benchmarks.

kpw94 · 5 days ago
> I've personally decided to just rent systems with GPUs from a cloud provider and set up SSH tunnels to my local system.

That's a good idea!

Curious about this, if you don't mind sharing:

- what's the stack? (Do you run, like, llama.cpp on that rented machine?)

- what model(s) do you run there?

- what's your rough monthly cost? (Does it come out much cheaper than if you called the equivalent paid APIs?)

tgtweak · 5 days ago
Dual 3090s (24GB each) on 8x+8x PCIe have been a really reliable setup for me (with an NVLink bridge... even though it's relatively low bandwidth compared to Tesla NVLink, it's better than going over PCIe!)

48GB of VRAM and lots of CUDA cores; hard to beat this value atm.

If you want to go even further, you can get an 8x V100 32GB server, complete with 512GB of RAM and NVLink switching, for $7,000 USD from UnixSurplus (ebay.com/itm/146589457908), which can run even bigger models with healthy throughput. You would need 240V power to run that in a home lab environment, though.

lostmsu · 5 days ago
The V100 is outdated (no bf16, dropped in CUDA 13) and power-hungry (8 cards over 3 years of continuous use cost about $12k in electricity).
monster_truck · 5 days ago
I'd throw a 7900 XTX in an AM4 rig with 128GB of DDR4 (which is what I've been using for the past two years).

Fuck nvidia

sofixa · 4 days ago
Or a Strix Halo Ryzen AI Max. Lots of "unified" memory that can be dedicated to the GPU portion, and it's not that expensive. Read through benchmarks to know whether the performance will be enough for your needs, though.
clusterhacks · 5 days ago
You know, I haven't even been thinking about those AMD GPUs for local LLMs, and it is clearly a blind spot for me.

How is it? I'd guess a bunch of the MoE models actually run well?

androiddrew · 5 days ago
Get a Radeon AI Pro R9700! 32GB of RAM.
eavan0 · 5 days ago
I'm glad it's not another LLM CLI that uses React. Vibe-cli seems to be built with https://github.com/textualize/textual/
kristianp · 5 days ago
I'm not excited that it's done in Python. I've had experience with Aider struggling to display text as fast as the LLM is spitting it out, though that was probably 6 months ago now.
willm · 5 days ago
Python is more than capable of doing that. It’s not an issue of raw execution speed.

https://willmcgugan.github.io/streaming-markdown/

NSPG911 · 4 days ago
That's an issue with Aider. Using a proper framework in the alternate terminal buffer would have greatly benefited them.