Mercury 2: Fast reasoning LLM powered by diffusion

It could be interesting to do the metric of intelligence per second.

ie intelligence per token, and then tokens per second

My current feel is that if Sonnet 4.6 was 5x faster than Opus 4.6, I'd be primarily using Sonnet 4.6. But that wasn't true for me with prior model generations, in those generations the Sonnet class models didn't feel good enough compared to the Opus class models. And it might shift again when I'm doing things that feel more intelligence bottlenecked.

But fast responses have an advantage of their own, they give you faster iteration. Kind of like how I used to like OpenAI Deep Research, but then switched to o3-thinking with web search enabled after that came out because it was 80% of the thoroughness with 20% of the time, which tended to be better overall.

estsauver · 16 days ago

I think there's clearly a "Speed is a quality of it's own" axis. When you use Cereberas (or Groq) to develop an API, the turn around speed of iterating on jobs is so much faster (and cheaper!) then using frontier high intelligence labs, it's almost a different product.

Also, I put together a little research paper recently--I think there's probably an underexplored option of "Use frontier AR model for a little bit of planning then switch to diffusion for generating the rest." You can get really good improvements with diffusion models! https://estsauver.com/think-first-diffuse-fast.pdf

refulgentis · 16 days ago

I'm very worried for both.

Cerebras requires a $3K/year membership to use APIs.

Groq's been dead for about 6 months, even pre-acquisition.

I hope Inception is going well, it's the only real democratic target at this. Gemini 2.5 Flash Lite was promising but it never really went anywhere, even by the standards of a Google preview

volodia · 16 days ago

We agree! In fact, there is an emerging class of models aimed at fast agentic iteration (think of Composer, the Flash versions of proprietary and open models). We position Mercury 2 as a strong model in this category.

estsauver · 16 days ago

Do you guys all think you'll be able to convert open source models to diffusion models relatively cheaply ala the d1 // LLaDA series of papers? If so, that seems like an extremely powerful story where you get to retool the much, much larger capex of open models into high performance diffusion models.

(I can also see a world where it just doesn't make sense to share most of the layers/infra and you diverge, but curious how you all see the approach.)

bigbuppo · 16 days ago

Maybe make that intelligence per token per relative unit of hardware per watt. If you're burning 30 tons of coal to be 0.0000000001% better than the 5 tons of coal option because you're throwing more hardware at it, well, it's not much of a real improvement.

estsauver · 16 days ago

I think the fast inference options have historically been only marginally more expensive then their slow cousins. There's a whole set of research about optimal efficiency, speed, and intelligence pareto curves. If you can deliver even an outdated low intelligence/old model at high efficiency, everyone will be interested. If you can deliver a model very fast, everyone will be interested. (If you can deliver a very smart model, everyone is obviously the most interested, but that's the free space.)

But to be clear, 1000 tokens/second is WAY better. Anthropic's Haiku serves at ~50 tokens per second.

jakubtomanik · 16 days ago

Intelligence per second is a great metric. I never could fully articulate why I like Gemini 3 Flash but this is exactly why. It’s smart enough and unbelievably fast. Thanks for sharing this

josephg · 16 days ago

Yeah I agree with this. We might be able to benchmark it soon (if we can’t already) but asking different agentic code models to produce some relatively simple pieces of software. Fast models can iterate faster. Big models will write better code on the first attempt, and need less loop debugging. Who will win?

At the moment I’m loving opus 4.6 but I have no idea if its extra intelligence makes it worth using over sonnet. Some data would be great!

estsauver · 16 days ago

For what it's worth, most people already are doing this! Some of the subagents in Claude Code (Explore, I think even compaction) default to Haiku and then you have to manually overwrite it with an env variable if you want to change it.

Imagine the quality of life upgrade of getting compaction down to a few second blip, or the "Explore" going 20 times faster! As these models get better, it will be super exciting!

nubg · 16 days ago

Interesting perspective. Perhaps also the user would adopt his queries knowing he can only to small (but very fast) steps. I wonder who would win!

jdthedisciple · 16 days ago

Interesting suggestion.

Maybe we could use some sort of entropy-based metric as a proxy for that?

dmichulke · 16 days ago

Useful for evaluating people as well

irishcoffee · 16 days ago

I really thought this was sarcasm. Intelligence per token? Intelligence at all, in a token? We don’t even agree on how to measure _human_ intelligence! I just can’t. Artificially intelligent indeed. Probably the perfect term for it, you know in lieu of authentic intelligence.

picard_facepalm.jpg

Co-founder / Chief Scientist at Inception here. If helpful, I’m happy to answer technical questions about Mercury 2 or diffusion LMs more broadly.

nowittyusername · 16 days ago

How does the whole kv cache situation work for diffusion models? Like are there latency and computation/monetary savings for caching? is the curve similar to auto regressive caching options? or maybe such things dont apply at all and you can just mess with system prompt and dynamically change it every turn because there's no savings to be had? or maybe you can make dynamic changes to the head but also get cache savings because of diffusion based architecture?... so many ideas...

volodia · 16 days ago

There are many ways to do it, but the simplest approach is block diffusion: https://m-arriola.com/bd3lms/

There are also more advanced approaches, for example FlexMDM, which essentially predicts length of the "canvas" as it "paints tokens" on it.

CamperBob2 · 16 days ago

Seems to work pretty well, and it's especially interesting to see answers pop up so quickly! It is easily fooled by the usual trick questions about car washes and such, but seems on par with the better open models when I ask it math/engineering questions, and is obviously much faster.

volodia · 16 days ago

Thanks for trying it and for the thoughtful feedback, really appreciate it. And we’re actively working on improving quality further as we scale the models.

bcherry · 16 days ago

you mention voice ai in the announcement but I wonder how this works in practice. most voice AI systems are bound not by full response latency but just by time-to-first-non-reasoning-token (because once it heads to TTS, the output speed is capped at the speed of speech and even the slowest models are generating tokens faster than that once they start going).

what do ttft numbers look like for mercury 2? I can see how at least compared to other reasoning models it could improve things quite a bit but i'm wondering if it really makes reasoning viable in voice given it seems total latency is still in single digit seconds, not hundreds of milliseconds

PranayKumarJain · 16 days ago

Spot on about the TTFT bottleneck. In the voice world, the "thinking" silence is what kills the illusion.

At eboo.ai, we see this constantly—even with faster models, the orchestrator needs to be incredibly tight to keep the total loop under 500-800ms. If Mercury 2 can consistently hit low enough TTFT to keep the turn-taking natural, that would be a game changer for "smart" voice agents.

Right now, most "reasoning" in voice happens asynchronously or with very awkward filler audio. Lowering that floor is the real challenge.

mynti · 16 days ago

I always wondered how these models would reason correctly. I suppose they are diffusing fixed blocks of text for every step and after the first block comes the next and so on (that is how it looks in the chat interface anyways). But what happens if at the end of the first block it would need information about reasoning at the beginning of the first block? Autoregressive Models can use these tokens to refine the reasoning but I guess that Diffusion Models can only adjust their path after every block? Is there a way maybe to have dynamic block length?

nl · 16 days ago

I had a very odd interaction somewhat similar to how weak transformer models get into a loop:

https://gist.github.com/nlothian/cf9725e6ebc99219f480e0b72b3...

What causes this?

volodia · 16 days ago

This looks like an inference glitch that we are working on fixing, thank you for flagging.

techbro92 · 16 days ago

Do you think you will be moving towards drifting models in the future for even more speed?

volodia · 16 days ago

Not imminently, but hard to predict where the field will go

kristianp · 16 days ago

How big is Mercury 2? How many tokens is it trained on?

Is it's agentic accuracy good enough to operate, say, coding agents without needing a larger model to do more difficult tasks?

volodia · 16 days ago

You can think of Mercury 2 as roughly in the same intelligence tier as other speed-optimized models (e.g., Haiku 4.5, Grok Fast, GPT-Mini–class systems). The main differentiator is latency — it’s ~5× faster at comparable quality.

We’re not positioning it as competing with the largest models (Opus 4.5, etc.) on hardest-case reasoning. It’s more of a “fast agent” model (like Composer in Cursor, or Haiku 4.5 in some IDEs): strong on common coding and tool-use tasks, and providing very quick iteration loops.

smusamashah · 16 days ago

Will it be possible to put this on Talaas chip and go even higher speeds?

Topfi · 16 days ago

Have been following your models and semi-regularly ran them through evals since early summer. With the existing Coder and Mercury models, I always found that the trade-offs were not worth it, especially as providers with custom inference hardware could push model tp/s and latency increasingly higher.

I can see some very specific use cases for an existing PKM project, specially using the edit model for tagging and potentially retrieval, both of which I am using Gemini 2.5 Flash-Lite still.

The pricing makes this very enticing and I'll really try to get Mercury 2 going, if tool calling and structured output are truly consistently possible with this model to a similar degree as Haiku 4.5 (which I still rate very highly) that may make a few use cases far more possible for me (as long as Task adherence, task inference and task evaluation aren't significantly worse than Haiku 4.5). Gemini 3 Flash was less ideal for me, partly because while it is significantly better than 3 Pro, there are still issues regarding CLI usage that make it unreliable for me.

Regardless of that, I'd like to provide some constructive feedback:

1.) Unless I am mistaken, I couldn't find a public status page. Doing some very simple testing via the chat website, I got an error a few times and wanted to confirm whether it was server load/known or not, but couldn't

2.) Your homepage looks very nice, but parts of it struggle, both on Firefox and Chromium, with poor performance to the point were it affects usability. The highlighting of the three recommended queries on the homepage lags heavily, same for the header bar and the switcher between Private and Commercial on the Early Access page switches at a very sluggish pace. The band showcasing your partners also lags below. I did remove the very nice looking diffusion animation you have in the background and found that memory and CPU usage returned to normal levels and all described issues were resolved, so perhaps this could be optimized further. It makes the experience of navigating the website rather frustrating and first impressions are important, especially considering the models are also supposed to be used in coding.

3.) I can understand if that is not possible, but it would be great if the reasoning traces were visible on the chat homepage. Will check later whether they are available on the API.

4.) Unless I am mistaken, I can't see the maximum output tokens anywhere on the website or documentation. Would be helpful if that were front and center. Is it still at roughly 15k?

5.) Consider changing the way web search works on the chat website. Currently, it is enabled by default but only seems to be used by the model when explicitly prompted to do so (and even then the model doesn't search in every case). I can understand why web search is used sparingly as the swift experience is what you want to put front and center and every web search adds latency, but may I suggest disabling web search by default and then setting the model up so, when web search is enabled, that resource is more consistently relied upon?

6.) "Try suggested prompt" returns an empty field if a user goes from an existing chat back to the main chat page. After a reload, the suggested prompt area contains said prompts again.

One thing that I very much like and that has gotten my mind racing for PKM tasks are the follow up questions which are provided essentially instantly. I can see some great value, even combining that with another models output to assist a user in exploring concepts they may not be familiar with, but will have to test, especially on the context/haystack front.

volodia · 14 days ago

Thank you for the detailed feedback! I shared this already with the team.

gok · 16 days ago

Do you use fully bidirectional attention or is it at all causal?

bananapub · 16 days ago

would diffusion models benefit from things like Cerebras hardware?