TiDAR: Think in Diffusion, Talk in Autoregression

Workaccount2 · 25 days ago

An update to Gemini diffusion is one of my most eagerly anticipated AI releases. It released to mild fanfare (mostly because you needed to request access to use it), and there has been silence ever since.

Hopefully it's not more Google abandonware, because it was wicked fast and a delight to use.

ACCount37 · 25 days ago

It's not a very promising direction because autoregressive LLMs still deliver better output quality per model weight, as a rule.

Now, is it possible that a model can combine advantages of both? Combine fast generation and multidirectional causality of diffusion with precision, capabilities and generalization of autoregression?

Maybe. This paper is research in that direction. So far, it's not a clear upgrade over autoregressive LLMs.

euleriancon · 25 days ago

Diffusion LMs do seem to be able to get more out of the same data. In a world where we are already training transformer-based LLMs on all available text, diffusion LMs' ability to keep learning from a fixed set of data may let them outperform autoregressive transformers.

https://arxiv.org/abs/2511.03276

vintermann · 24 days ago

As a rule, but the devil is in the details. The thing, the one big thing I want to use multimodal LLMs for, is accessing the data in historical mostly handwritten texts.

None of the big LLMs do an acceptable job. This is a task a trained human can do, but it's a lot of work. You have to learn not just the script style of the period (which can vary far more than people think), but even the idiosyncrasies of a given writer. All the time, you run into an unreadable word, and you need to look around for context which might give a clue, or for other places where the same word (or a similar-looking word) is used in cleaner contexts. It's very much not a beginning-to-end task. Trying to read a document from start to end would be like solving a crossword puzzle in strict left-to-right, top-to-bottom order.

Maybe autoregressive models can eventually become powerful enough that they can just do that! But so far, they haven't. And I have a lot more faith that the diffusion approach is closer to how you have to do it.

fragmede · 25 days ago

> still deliver better output quality per model weight, as a rule.

Is it possible to quantify that and just have a linked slider for quality and speed? If I can get an answer that's 80% right in 1/10th the time, and then iterate on that, who comes out ahead?
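A rough way to put numbers on that question, as a sketch only. It assumes "80% right" means each attempt is independently acceptable with probability 0.8 (the comment doesn't define it), that you retry until one attempt is acceptable, and that checking an answer is free. The slow model's 95% figure is made up for illustration.

```python
def expected_time_to_success(time_per_attempt: float, p_success: float) -> float:
    """Expected total time when retrying independent attempts until one succeeds.

    The number of attempts is geometric with mean 1 / p_success.
    """
    if not 0.0 < p_success <= 1.0:
        raise ValueError("p_success must be in (0, 1]")
    return time_per_attempt / p_success


# Hypothetical numbers: the fast model takes 1/10th the time and is right 80%
# of the time; the slow model is assumed to be right 95% of the time.
fast = expected_time_to_success(time_per_attempt=0.1, p_success=0.80)
slow = expected_time_to_success(time_per_attempt=1.0, p_success=0.95)
print(f"fast: {fast:.3f}  slow: {slow:.3f}")
```

Under those assumptions the fast model wins easily. The picture changes if verifying each answer costs real time, or if the errors are correlated so that retrying doesn't help.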

ricochet11 · 24 days ago

Perhaps the issue is that text often has directionality.

https://arxiv.org/abs/2401.17505

ilaksh · 25 days ago

4-5 times faster with minimal change in quality seems like a clear upgrade in efficiency.

Bolwin · 25 days ago

That's bizarre because I would expect the opposite. For reasoning you go step by step, and when you're done, you quickly diffuse the answer.

naasking · 25 days ago

Unification in logic programming isn't a forwards-only process, so there's no reason to expect deduction in an AI to proceed in a sort of procedural step by step fashion either. What ultimately matters is that all of the various deductions unify coherently in the end.
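To make the "not forwards-only" point concrete, here is a minimal sketch of first-order unification. It is a toy, not how any particular Prolog implements it: variables are assumed to be capitalized strings, compound terms are tuples, and the occurs check is omitted for brevity.

```python
def is_var(term):
    # Convention for this sketch: variables are strings starting with an uppercase letter.
    return isinstance(term, str) and term[:1].isupper()


def walk(term, subst):
    # Follow variable bindings until reaching a non-variable or an unbound variable.
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def unify(a, b, subst=None):
    """Return a substitution that makes a and b equal, or None if none exists."""
    subst = {} if subst is None else subst
    a, b = walk(a, subst), walk(b, subst)
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None


# Information flows in both directions: X is bound from the right-hand term,
# Y from the left-hand term.
both_ways = unify(("p", "X", "b"), ("p", "a", "Y"))

# The same two constraints, applied in either order, resolve X and Y identically.
order_1 = unify("Y", "a", unify("X", "Y"))
order_2 = unify("X", "Y", unify("Y", "a"))
```

Here `both_ways` binds `X` to `"a"` and `Y` to `"b"`, and both orderings end with `X` and `Y` resolving to `"a"`. Only the final consistency of the bindings matters, not the order in which the constraints arrived.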

octoberfranklin · 24 days ago

Exactly.

If you add a "cheat" rule that lets you deduce anything from something else, then replacing these cheat rule applications with real subgoal proofs is denoising for Natural Deduction.

wongarsu · 24 days ago

However, after step 4 you might notice that you made a mistake in step 2 and revise it. You might think in steps, but the state you are building is formed in a somewhat diffusion-like way.

gdiamos · 25 days ago

Diffusion is favored by current GPUs.

Over time we seem to have a tendency to build models that are well matched to our machines.

HPsquared · 25 days ago

Are TPUs different?

vlovich123 · 25 days ago

Not really. The problem is that transformer LLMs are autoregressive, are O(n^2) for self-attention, and also require insane amounts of bandwidth to “page in” the weights into the relevant compute parts. TPUs do this faster than a CPU, like any accelerator, but fundamentally this is still a challenge. There are attempts to build hardware where the weights are burned into the silicon, but that carries other meaningful downsides.

But OP is referring to the fact that diffusion is friendlier on both counts: it needs less bandwidth, and it doesn't need large n^2 compute blocks in the critical path.
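A back-of-envelope sketch of the bandwidth half of that argument, under loudly simplifying assumptions: batch size 1, decoding is purely memory-bound, every forward pass streams all the weights once, and KV-cache traffic and attention compute are ignored. Every number below is hypothetical, not a measurement of any real chip or model.

```python
def max_tokens_per_second(weight_bytes: float,
                          bandwidth_bytes_per_s: float,
                          tokens_per_pass: float = 1.0) -> float:
    """Upper bound on decode throughput when each forward pass streams all weights."""
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass


weight_bytes = 8e9 * 2   # hypothetical 8B-parameter model stored in 16-bit precision
bandwidth = 1e12         # hypothetical 1 TB/s of memory bandwidth

# Autoregressive decoding: one new token per forward pass.
autoregressive = max_tokens_per_second(weight_bytes, bandwidth, tokens_per_pass=1)

# Parallel (diffusion-style) decoding: assume 4 tokens are committed per pass.
parallel = max_tokens_per_second(weight_bytes, bandwidth, tokens_per_pass=4)

print(f"autoregressive: {autoregressive:.1f} tok/s, parallel: {parallel:.1f} tok/s")
```

In this toy model the weights are the fixed cost per pass, so committing more tokens per pass amortizes the same memory traffic. Whether a real diffusion model keeps that advantage depends on how many denoising passes it needs, which this sketch leaves out.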

Alifatisk · a month ago

I've tried dLLMs like Mercury and they look promising.