Chain of Recursive Thoughts: Make AI think harder by making it argue with itself

I see a lot of threads pitting models against each other (or whole swarms of them) in the hope that "wisdom of crowds" will magically appear. After a stack of experiments of my own—and after watching the recent ASU/Microsoft-Research work [1].. I've landed on a simpler takeaway:

An LLM is a terrible verifier of another LLM. Subbarao Kambhampati's "(How) Do LLMs Reason/Plan?" talk shows GPT-4 confidently producing provably wrong graph-coloring proofs until a symbolic SAT solver is introduced as the referee [1]. Stechly et al. quantify the problem: letting GPT-4 critique its own answers *reduces* accuracy, whereas adding an external, sound verifier boosts it by ~30 pp across planning and puzzle tasks [2]. In other words, verification is *harder* than generation for today's autoregressive models, so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Because of that asymmetry, stacking multiple LLMs rarely helps. The "LLM-Modulo" position paper argues that auto-regressive models simply can't do self-verification or long-horizon planning on their own and should instead be treated as high-recall idea generators wrapped by a single, sound verifier [3]. In my tests, replacing a five-model "debate" with one strong model + verifier gives equal or better answers with far less latency and orchestration overhead.

[1] https://www.youtube.com/watch?v=0u2hdSpNS2o - (How) Do LLMs Reason/Plan? (talk at Microsoft Research, 11 Apr 2025)

[2] https://arxiv.org/abs/2402.08115

[3] https://arxiv.org/abs/2402.01817 (related to the talk in #1)

zurfer · 4 months ago

Your references show me that it is absolutely task depended. In many domains it's true that "criticizing is easier than creating".

The best example might be books and movies, where it's trivial to say the characters were shallow, but it's surprisingly hard to create deeply interesting characters.

In Software Engineering, there are similar dynamics. An LLM with a security vuln finding prompt will be able to point out places, where the generated code might be insecure.

But if you want another LLM to find a reasoning mistake in a mathematical proof it basically has to do all the reasoning work as well. In which case I doubt there will be any significant performance gains.

aoeusnth1 · 4 months ago

In principle, Math proofs are another relatively easy to verify problem. In the extreme case, you can express any math proof as a computer-verifiable formalism — no intelligence necessary. Step back one step, and you could have a relatively weak model translate a proof into verifiable formalism and then use a tool call to run the verification. Coming up with the proof is an expensive search process, while verifying it is more mechanical. Even if it is not completely trivial to make the proof computer-verifiable, it might still be a vastly easier task compared to finding the proof in the first place.

simulator5g · 4 months ago

An LLM cannot reason through a mathematical proof, it would be something other than an LLM if it could.

meander_water · 4 months ago

For better or worse this has become the defacto standard in LLM Evaluation research papers since the LLM as a Judge paper [0] came out. Its also heavily embedded into frameworks like LangChain and LlamaIndex to evaluate RAG pipelines.

[0] https://arxiv.org/abs/2306.05685

[1] https://arxiv.org/abs/2411.15594

swyx · 4 months ago

its for the better, and i'm actually serious about this. it's just that Subbarao is ALSO right and it is not perfect nor human level. but it -DOES- improve results measurably and consistently.

so what i'm saying is don't throw the baby out with the bathwater. LLM as judge doesnt replace human judgement but its a pretty darn good first pass for how cheap it is. and you can imagine that it will get better over time.

hu3 · 4 months ago

> ...so you need a checker that actually reasons about the world (compiler, linter, SAT solver, ground-truth dataset, etc.).

Agree. What do you think about telling the LLM to also generate unit tests for the code it spits and then run all tests (including previous application unit tests).

I think this is a way to ensure some level of grounded verification:

- Does code compile?

- Do unit test pass?

AI can then consume test results to help fix their own mistakes.

nojs · 4 months ago

This works well but only if you eyeball the tests and edit them a bit in my experience. Otherwise it gets lazy and makes them trivial to pass. Also, you’ve often gotta explicitly tell it not to hardcode test cases in the solution to make them pass.

dwaltrip · 4 months ago

Definitely, test runners are a way to ground the model and give it a feedback loop. Not a silver bullet but can be very helpful.

keepamovin · 4 months ago

I believe, what the smart AI company is trying to do, right now, in secret, is to use US, the humans, and our replies to the AIs, as training for the next generation of self-verifying-models. :)

Training on corpus data gets you to 1 order of magnitude. But training on interactive data where you can observe and adapt to the OODA-loop? So much more powerful.

At least, that's what I'd be doing if I were doing AI :)

But I just do BrowserBox

captainbland · 4 months ago

I think you'd need to screen for quality of response quite stringently as loads of people will produce "corrections" which are just plain wrong.

mcswell · 4 months ago

I assume everyone knows this, but the idea of generating answers and testing them, dates back decades, and has been widely used for problems where generating _the_ correct answer(s) is difficult, but where generating a bunch of potential answers--(at least) one of which is likely correct--is easier. Generate-and-test of course relies on having a test algorithm that is reliable, (relatively) fast, and memory efficient, and is most useful when an exact generate algorithm (one that generated only the correct answer(s)) is either slow or inefficient of memory use (or both).

In the case described, the generator is an LLM, and the tester (called a "verifier") is "the compiler, linter, SAT solver, ground truth dataset, etc."

And of course generate-and-test is related to trial-and-error, which has probably existed since the Paleolithic.

foobiekr · 4 months ago

"letting GPT-4 critique its own answers reduces accuracy"

This is because the output, being the input, steers directly into the tree as soon as the tree is in the context window.

ashu1461 · 4 months ago

Would a LLM under human guidance turn out to be a good verifier ? i.e. if LLM knows the rules to verify or has enough data points (internet access, actual responses)

eru · 4 months ago

Of course, that only works for problems where you have a verifier.

autokad · 4 months ago

actually, I found that you can definitely yield better results. I ran an experiment with 1 prompt at temperature 0 and 9 with temperature 1.

I found the most anomalous response was as good (15/20) or better (5/20) than the temperature 0 response in 20 samples.

I kind of want to try something like this at a larger scale in an always-on mode where I have a 'senate' of debate. Rather than responding to prompts on a case by case basis, provide a list of tasks (potentially with deadlines) and let the senate work on them, break off into groups to manage subtasks, challenge results , make suggestions. Even potentially a tree of analysts where suggestions only gets passed up the tree when the parent node thinks a lower analysis is particularly insightful.

I definitely think that directing models to approach a problem from a specific perspective can generate better or worse results. Creating a diverse set of perspectives along with critical analysis of their results should be able to produce some impressive results.

Things like this would generate a massive number of tokens, but the cost per token is definitely heading in the right direction to allow for this. There is also the possibility of setting up an AI only IRC server where anybody can connect their own models for a shared debating chamber.

mikepurvis · 4 months ago

In doing some DevOps-y type tasks recently (ansible, packer, docker, baking images with guestfish), I've found it very frustrating how much ChatGPT will confidently tell me to use flags on tools that don't exist, or hallicinate completely non-existent functions or behaviours. And then when I spend time trying what it suggests only to hit a wall and come back like wtf mate it breezily goes "oh yes so you're right, good job figuring that out! You're so close now! Your next step is to do X and Y," and then serves up the same detailed tutorial as before but with the flag or whatever it was that it had wrong subtly changed.

It definitely makes me feel like I'm dealing with an overenthusiastic intern who is throwing stuff over the wall without checking their work, and like maybe having a second bot sitting in front of the first one being like ARE YOUR SURE ABOUT THAT could really improve things.

MoonGhost · 4 months ago

You can't get more info from LLMs than it actually holds. Like Anthropic pointed if LLMs knows the name but has no other info it starts hallucinating. The same probably happens here. LLM knows there must be a flag but can't remember all of them. Likely short reminder in prompt will help. (or search web for GPT) Just my $0.02.

0x20cowboy · 4 months ago

I did a stint in Devops and I found every models to be like this for all of the infra-as-code languages. Anything yaml based was especially bad.

Even Amazon’s own offering completely made things up about Amazon’s own formats.

I’d be curious as to why that is. It seems like there would be enough training data, and for Amazon in particular it seems like they could make a validation tool the model could use.

meander_water · 4 months ago

Cursor has a neat feature where you can upload custom docs, and then reference them with @Docs. I find this prevents hallucinations, and also using a reasoning model

organsnyder · 4 months ago

I've enjoyed watching Claude try running commands with incorrect flags, trying them, and then adapting.

corvus-cornix · 4 months ago

I've also found LLMs to perform poorly at DevOps tasks. Perhaps there's a lack of training data. On the bright side this hints at better job security for platform engineers.

vunderba · 4 months ago

100%. This has happened enough to me that I wished I could just inject the man page docs into it to at least act as a sanity check.

nonelog · 4 months ago

Spot on.

vunderba · 4 months ago

A year or so ago I experimented with splitting a user prompt down to a set of "different AI personas" that would each try to approach the user's problem in a different way and then bubble back up with a master arbiter for consensus.

I modeled it after the concept of advisors from Civilization II. It worked reasonably well though I think it was at least somewhat limited by being constrained to a single LLM (Mistral). It also lit my computer on fire.

bee_rider · 4 months ago

What sort of personalities did you try? A group where some members have grudges against each other and will irrationally poke holes in each other’s plans could be a fun experiment.

nonethewiser · 4 months ago

In theory couldnt this just be baked into a single adversarial model?

RevEng · 4 months ago

Not entirely. Since generation is auto regressive, the next token depends on the previous tokens. Whatever analysis and decisions it has spit out will influence what it will do next. This tends to cause it to be self reinforcing.

But it's also chaotic. Small changes in input or token choices can give wildly different outcomes, particularly if the sampling distributions are fairly flat (no one right answer). So restarting the generation with a slightly different input, such as a different random seed (or in OP's case, a different temperature) can give wildly different outcomes.

If you try this, you'll see some examples of it vehemently arguing it is right and others equally arguing it is wrong. This is why LLM as judge is so poor by itself, bit also why multiple generations like used in self-consistency can be quite useful at evaluating variance and therefore uncertainty.

tonmoy · 4 months ago

Yes, but I guess the model is optimized for relatively quick response, whereas these techniques are allowing the model to spend more time to generate a higher quality response

Lerc · 4 months ago

To an extent, but different models are better at different things.

That is something I'm also curious about. Given models (that use the same tokenisation) that are better at different things, would their be interesting things to find by analysing the logprobs for tokens generated from identical inputs (including cross feeding the generated token from one to another)

Surely there must be something notable at particular points when a model goes off on the wrong path.

crowcroft · 4 months ago

Like, just endlessly grinding tokens, then processing the output and pulling out good ideas when the endless debate generates them?

Would be interesting what it comes up with with enough time and tokens.

danielmarkbruce · 4 months ago

This is being done, and you could apply it to a lot of domains. Go for it for whatever use case you have.

kmacdough · 4 months ago

These ensembles have been tested throughout AI progress. Well scaffolded larger models have historically come out ahead in both quality and speed/cost.

Perhaps this is a parricularly effective ensemble, but I would need to see real data.

nativeit · 4 months ago

Yeah, but we'll finally get definitive proof that the government's been hiding super-intelligent axolotls from us all.

taneq · 4 months ago

A society of mind, if you will. :)

This sounds like a fun thing to set up with a quick-enough local model.

Deleted Comment

dudeinhawaii · 4 months ago

odo1242 · 4 months ago

Something I do sometimes is:

- Have an AI chat model come up with an answer to a problem.

- Have it write a report discussing the details of the problem and why it's answer is correct, directed at a person or AI model who has no knowledge of the initial problem or technical field.

- Have a second AI model with no knowledge of the problem grade the report, and write it's own report either (a) asking for clarification / more information about the problem that the original model didn't provide or (b) pointing out an inconsistency in the argument posed by the original model. Give this report back to the original model and ask it to write it's own report back with either the necessary information or changes.

- Repeat until either the second AI model is convinced by the first AI model's explanation or the first AI model has implemented all the changes requested by the second AI model.

It's super clunky but has given pretty good results in the cases where I tried it lol

ASalazarMX · 4 months ago

Ah, now we know why Spain was out of electricity yesterday.

Cthulhu_ · 4 months ago

Here I was thinking cryptocurrency pre-heated the grids (and GPU manufacturing) for us already.

danparsonson · 4 months ago

Oh that was a good one XD

StopDisinfo910 · 4 months ago

For anything semi-adversarial, I have had good results asking the AI to come up with a plan, then take the side of the opponent coming with counter play/way to defeat the plan, finally asking for a revision of the initial plan given the potential reaction from the opponent.

The final plan you obtain is generally a lot more well rounded and thought out.

I find that amusing because the technique also works when I apply it to me. Picking flaws in your plan before revisiting it actually works.

To be honest, this is what I assumed this repo was doing from the title. It talks about arguing with itself, but it looks like it's just generating multiple alternative responses in parallel and selecting the best one.

Do you find your method handles "sycophancy" well?

zoogeny · 4 months ago

I do the same, and I have one other technique.

I will often have a few chats going for a project, but with different contexts. For example, one might be tech focused, another marketing focused, another with some context on my personal goals, etc.

So I will take the same question and feed it into the chats with differing context. It is almost like having different perspectives on the same problem. And the conclusions can often differ based on the differing contexts.

odie88 · 4 months ago

This is how I’ve been using Gemini and it’s the first time I’m really seeing consistent value.

I’ll get a context into a solid place with as much information as I can about a project. Usually getting up to 100k tokens.

Then I ask it to give me a summary I can use in a fresh chat, that will maintain the current context. This lets me reclaim space, bring responsiveness back to sane levels, have a baseline chat I use to spin up branches for marketing, design (it’s pretty helpful at trouble shooting Substance Designer graphs), etc.

I’ve found myself going into sub branches from there… like a marketing context that pushes branches into different marketing channels.

jsight · 4 months ago

This reminds me a lot of the YT video that went over using Monte Carlo Tree Search with LLMs to maximize result quality. Link: https://www.youtube.com/watch?v=mfAV_bigdRA&ab_channel=Treli...

It seemed like a pretty good idea, though I'd guess that it would greatly increase token usage. I'd also be concerned that the LLM as a judge might struggle to grade things accurately if it wasn't also able to generate good enough answers to begin with.

looofooo0 · 4 months ago

If you think about marginal cost, such experiments can be run almost at only the cost of extra electricity used for that computation, which in Europe is often zero, at least by the ones who own the compute.

JumpCrisscross · 4 months ago

Kagi’s Assistant feature makes this super easy. Just switch assistants and ask them to check the other’s work.

BOOSTERHIDROGEN · 4 months ago

How?

subscribed · 4 months ago

I do it all the time in Sillytavern in a group chat - three characters kind of resembling what you just described, and me, participating in the "conversation", them going back and forth until they're satisfied.

With a good model role playing them, works awesome.

hsuduebc2 · 4 months ago

We're there any situation that first conclusion from AI was completely changed? Can you give generally examples of situations where it changed or significantly improved overall result? It sounds cool.

nomel · 4 months ago

I would be interested to know how ofter "oscillations" occur, where they flip flop from being too "agreeable" to challenges (which probably is just a sparse latent space). This happens to me pretty frequently, where you can repeatedly say "no that's wrong" and the LLM will do a 180, explaining why it was "in fact" wrong and you are "right", repeat.

itissid · 4 months ago

Isn't this kind of another way of how Inference Time Scaling works? It will basically produce several chain of thoughts and then pursue one that has maximum reward based on an internal function?

pessimizer · 4 months ago

I've wondered if it might be helpful to randomly "shard" training data between two LLMs; just feed half the training data to one, and the rest to the other, with no overlap.

So instead of using two models, you'd be making two halves of one model do a similar (deliberative) process to yours. I wonder if that would result in a benefit over a single model with the full training set, and if you could continue to do the same thing by sharding the shards.

ijk · 4 months ago

There's some precedent for that: you can do some useful things with the cross entropy of the two models. And k-fold cross validation might also be relevant.

aprilthird2021 · 4 months ago

This takes such a long time to do though, no? What problems does this save you time on?

dustingetz · 4 months ago

i dont understand, is it doing your schoolwork?

cube2222 · 4 months ago

This is really cool!

One strategy I often use (which is much simpler and more limited than this), is to finish my message with: “Please do a round of thinking in <thinking></thinking> tags, then a round of self-critique in <critique></critique> tags, and then a final round of <thinking>, before responding.”

It works very well. Similarly just asking it to “find the 5 biggest issues with its proposal” works pretty good (the 5 forcing it to find something, even if it’s mostly irrelevant).

This is one of the reasons I like the massive context window in Gemini. You can do this as part of the message chain. I don't try to one shot it, just use the same idea across 3 messages.

1. Figure out a plan (it responds with the plan)

2. Point out flaws in the plan (it responds with the flaws)

3. Update the plan to address the flaws (it responds with an up to date plan)

The other things I tend to ask are "what might we be missing?", "what are the [performance|security|legal|cost] considerations?". I can often iterate on the "anything else?" kind of nudging prompts, especially guiding it on topics to consider, for a few messages. After each: update the plan to take those into consideration.

danielbln · 4 months ago

I always do "now again but put on your critical hat"

CSSer · 4 months ago

Makes me wonder how it would do if you tell it "put on your robe and wizard hat"

bentt · 4 months ago

Oh I really like that. It makes me want to have it score its ideas with metrics and then keep iterating until it meets some score.

electroly · 4 months ago

This seems to be different than I expected from the title. I thought it would be explicitly adversarial.

1. You are the assistant. Please answer the question directly.

2. You are the cross-examiner. The assistant is wrong. Explain why.

3. You are the assistant. The cross-examiner is wrong. Defend your claim.

4. You are a judge. Did either party make their case, or is another round of argumentation required?

I haven't tried this. No idea if it works. But I find it's helpful to ask ChatGPT, in separate prompts, "XYZ is true, explain why" and "XYZ is false, explain why" and see which one seems more convincing.

3np · 4 months ago

Also a little clickbaity with "my AI" and then it's all Mistral...

ChadMoran · 4 months ago

Check out Fast Agent! (I have no affiliation with it, just use it).

https://github.com/evalstate/fast-agent

mountainriver · 4 months ago

Techniques like this have been around since GPT-3.5. There are boatloads of papers on the topic.

I have no idea why anyone thinks this is novel. I guess that speaks to the state of HN

moribunda · 4 months ago

Exactly... I thought that implementing STORM was just a basic step in this topic... Looks like we're running in circles.

Chatgpt shares context between chats. I wonder how that impacts it?

It seems like a good approach though. What you dont want to do is ever suggest that its wrong yourself. Usually it will just assume it is wrong.

Actually what I find impressive is when I do this and it actually pushes back to defend itself.

the_af · 4 months ago

Does it share context even if no "memory updated" message appears indicating it has stored a fact about you?

I asked ChatGPT and it says no, but then again it's not reliable at introspection or at revealing data about how it works.

hnuser123456 · 4 months ago

I'm having a lot of fun experimenting with stuff like this. I'm trying to put together an unrealengine blueprints style graph editor to allow people to design workflows like this where you start with the user prompt input, which goes to one agent, which makes an initial attempt, and then that conversation history gets passed to another "agent" with a different system prompt telling it to be a harsh critic, but to also give a pass/fail signal, and loop back until the critic judges pass, then send that back to the user as output. Ideally as a little website that can call your own LLM endpoints and save/load/share workflow graphs.

Mistral small 3.1 and gemma 3 feel like the first semi-competent models that can be run locally, but that competence is just a seed, and they still need to be guided with a framework that keeps them on track.

Try giving it python execution in a loop and tell it to explore the world. It'll start trying to download and read news and stuff.

andai · 4 months ago

I am thinking the same thing! Multiple "personalities", in parallel, or in series. For example, I have approximated, in GPT, some of Gemini's ability to call out nonsense, sloppy thinking, by telling GPT to be mean! (The politeness seems to filter out much that is of great value!)

However, the result is not pleasant to read. Gemini solved this in their training, by doing it in two phases... and making the first phase private! ("Thinking.")

So I thought, what I need is a two-phase approach, where that "mean" output gets humanized a little bit. (It gets harsh to work in that way for more than short intervals.)

As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities. I don't know if such a thing exists, but I haven't seen it yet, although the message object format seems to have been designed with it in mind (e.g. every message has a name, to allow for multiple users and multiple AIs).

Even better if it supports multiple providers, since they have different strengths. (It's like getting a second opinion.)

jbm · 4 months ago

I disagree.

If anything, telling GPT to be blunt seems to downgrade its IQ; it hallucinates more and makes statements without considering priors or context. I jokingly call it Reddit mode.

NitpickLawyer · 4 months ago

> As a side note, I think there would be great value in a UI that allows a "group chat" of different LLM personalities.

This is the basic idea behind autogen. They also have a web UI now in autogen studio, it's gotten a bit better. You can create "teams" of agents (with different prompts, themes, tools, etc.) and have them discuss / cooperate. I think they even added memory recently. Have a look at it, might be what you need.

theturtletalks · 4 months ago

MoE, but an abstraction deeper?

irthomasthomas · 4 months ago

I think you can do most of this already with llm-consortium (maybe needs the llm-openrouter plugin with my pr merging)

A consortium sends the same prompt to multiple models in parallel and the responses are all sent to one arbiter model which judges the model responses. The arbiter decides if more iterations are required. It can also be forced to iterate more until confidence-threshold or min-iterations.

Now, using the pr i made to llm-openrouter, you can save an alias to a model that includes lots of model options. For examples, you can do llm openrouter save -m qwen3 -o online -o temperature 0, system "research prompt" --name qwen-researcher

And now, you can build a consortium where one member is an online research specialist. You could make another uses JSON mode for entity extraction, and a third which writes a blind draft. The arbiter would then make use of all that and synthesize a good answer.

kridsdale1 · 4 months ago

Any links or names of example implementations of this?

globalise83 · 4 months ago

Have you tried n8n? It allows you to build flows like that - you can run the community version in a Docker container within a few minutes and share the configurations for the flows you have built very easily.

mecsred · 4 months ago

_#_ has to be one of the worst word shortening schemes I've ever seen get widespread. It only works with a very small number of long-lived technologies, in which case they basically just get a nickname, "k8s" "i18n". It does not at all work for larger contexts. You're basically making someone solve a crossword (2 across, 10 letters with two filled in) just to parse your sentence.

I had not, but that looks awesome. Microsoft put out something called "agent flows" that also fits this category.[1] I'm working on more of an "at home" version - no "talk to sales" button.

https://www.microsoft.com/en-us/microsoft-copilot/blog/copil...

jedberg · 4 months ago

We're really going to need to figure out how to power all these GPUs with green power real quick, or we're going to melt the planet having AIs debate with themselves on the optimal solution to tik-tac-toe...

Ive felt this way when using chatgpt for a simple search. Stuff that google could handle but would just be slower, mostly from me having to manually filter.

Sometimes its the easiest way to complete a very small task but the cost difference on the backend has to be pretty damn large. The user inevitably ends up not caring whatsoever. Its just not real to them.

ivape · 4 months ago

I caught infra people saying that's pretty much the only bottleneck in the data center right now, power and cooling. We know the AI needs to run against itself continuously, and that's just a fact.

Maybe we should assign them a practical task, like making paperclips.

Xcelerate · 4 months ago

I think this is how we get ML models to come up with novel ideas. Diagonalize against all the ideas they’ve already tried and dismissed via self-argument but keep certain consistency constraints. (Obviously much easier said than done.)

jwally · 4 months ago

Scaled up and spread out - this probably gets you pretty close to consciousness(?)

Conway's game of life, but instead of colored squares with rules, they're LLM's with some kind of weighting - all chattering back and forth with one another - bubbling up somehow to cause speach/action

lubujackson · 4 months ago

Decades ago I read The Society of Mind by Marvin Minsky. He pushed this sort of idea, that consciousness is composed of individual, competing processes. Worth a revisit!

What you just said is what I tried and failed to say ten minutes ago!

https://news.ycombinator.com/item?id=43835798

Nevermark · 4 months ago

It’s working! Oh, wait …

These models have limitations obviously, but many critiques apply equally or more to people.

If people were tasked with one shot, 10 second answers, to be written out in near errorless grammar, the LLM’s viewing our responses to prompts would be spending a lot of time discussing our limitations and how to game us into better responses. Humor, not at all humor.