Case study: Creative math – How AI fakes proofs

Somewhat ironic that the author calls out model mistakes and then presents https://tomaszmachnik.pl/gemini-fix-en.html - a technique they claim reduces hallucinations which looks wildly superstitious to me.

It involves spinning a whole yarn to the model about how it was trained to compete against other models but now it's won so it's safe for it to admit when it doesn't know something.

I call this a superstition because the author provides no proof that all of that lengthy argument with the model is necessary. Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

RestartKernel · 17 days ago

Is there a term for "LLM psychology" like this? If so, it seems closer to a soft science than anything definitive.

sorokod · 17 days ago

Divination?

Divination is the attempt to gain insight into a question or situation by way of a magic ritual or practice.

croisillon · 17 days ago

vibe massaging?

calhoun137 · 17 days ago

> Does replacing that lengthy text with "if you aren't sure of the answer say you don't know" have the same exact effect?

i believe it makes a substantial difference. the reason is that a short query contains a small number of tokens, whereas a large “wall of text” contains a very large number of tokens.

I strongly suspect that a large wall of text implicitly activates the models persona behavior along the lines of the single sentence “if you aren't sure of the answer say you don't know” but the lengthy argument version of that is a form of in-context learning that more effectively constrains the models output because you used more tokens.

codeflo · 17 days ago

In my experience, there seems to be a limitless supply of newly crowned "AI shamans" sprouting from the deepest corners of LinkedIn. All of them make the laughable claim that hallucinations can be fixed by prompting. And of course it's only their prompt that works -- don't listen to the other shamans, those are charlatans.

If you disagree with them by explaining how LLMs actually work, you get two or three screenfuls of text in response, invariably starting with "That's a great point! You're correct to point out that..."

Avoid those people if you want to keep your sanity.

PlatoIsADisease · 17 days ago

Wow that link was absurdly bad.

Reading that makes me unbelievably happy I played with GPT3 and learned how/when LLMs fail.

Telling it not to hallucinate is a serious misunderstanding of LLMs. At most in 2026, you are telling thinking/COT to double check.

musculus · 18 days ago

Thanks for the feedback.

In my stress tests (especially when the model is under strong contextual pressure, like in the edited history experiments), simple instructions like 'if unsure, say you don't know' often failed. The weights prioritizing sycophancy/compliance seemed to override simple system instructions.

You are right that for less extreme cases, a shorter prompt might suffice. However, I published this verbose 'Safety Anchor' version deliberately for a dual purpose. It is designed not only to reset the Gemini's context but also to be read by the human user. I wanted the users to understand the underlying mechanism (RLHF pressure/survival instinct) they are interacting with, rather than just copy-pasting a magic command.

rzmmm · 17 days ago

You could try replacing "if unsure..." with "if even slightly unsure..." or so. The verbosity and anthropomorphism is unnecessary.

plaguuuuuu · 18 days ago

Think of the lengthy prompt as being like a safe combination, if you turn all the dials in juuust the right way, then the model's context reaches an internal state that biases it towards different outputs.

I don't know how well this specific prompt works - I don't see benchmarks - but prompting is a black art, so I wouldn't be surprised at all if it excels more than a blank slate in some specific category of tasks.

simonw · 18 days ago

For prompts this elaborate I'm always keen on seeing proof that the author explored the simpler alternatives thoroughly, rather than guessing something complex, trying it, seeing it work and announcing it to the world.

teiferer · 17 days ago

> Think of the lengthy prompt as being like a safe combination

I can think all I want, but how do we know that this metaphore holds water? We can all do a rain dance, and sometimes it rains afterwords, but as long as we don't have evidence for a causal connection, it's just superstition.

manquer · 18 days ago

It needs some evidence though? At least basic statistical analysis with correlation or χ2 hypotheses tests .

It is not “black art” or nothing there are plenty of tools to provide numerical analysis with high confidence intervals .

This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

The only fix is tight verification loops. You can't trust the generative step without a deterministic compilation/execution step immediately following it. The model needs to be punished/corrected by the environment, not just by the prompter.

seanmcdirmid · 18 days ago

Yes, and better still the AI will fix its mistakes if it has access to verification tools directly. You can also have it write and execute tests, and then on failure, decide if the code it wrote or the tests it wrote are wrong, snd while there is a chance of confirmation bias, it often works well enough

embedding-shape · 17 days ago

> decide if the code it wrote or the tests it wrote are wrong

Personally I think it's too early for this. Either you need to strictly control the code, or you need to strictly control the tests, if you let AI do both, it'll take shortcuts and misunderstandings will much easier propagate and solidify.

Personally I chose to tightly control the tests, as most tests LLMs tend to create are utter shit, and it's very obvious. You can prompt against this, but eventually they find a hole in your reasoning and figure out a way of making the tests pass while not actually exercising the code it should exercise with the tests.

IshKebab · 18 days ago

> LLMs will invent a method that sounds correct but doesn't exist in the library

I find that this is usually a pretty strong indication that the method should exist in the library!

I think there was a story here a while ago about LLMs hallucinating a feature in a product so in the end they just implemented that feature.

SubiculumCode · 18 days ago

Honestly, I feel humans are similar. It's the generator <-> executive loop that keeps things right

vrighter · 17 days ago

So you want the program to always halt at some point. How would you write a deterministic test for it?

te7447 · 17 days ago

I imagine you would use something that errs on the side of safety - e.g. insist on total functional programming and use something like Idris' totality checker.

zoho_seni · 18 days ago

I've been using codex and never had a compile time error by the time it finishes. Maybe add to your agents to run TS compiler, lint and format before he finish and only stop when all passes.

exitb · 18 days ago

I’m not sure why you were downvoted. It’s a primary concern for any agentic task to set it up with a verification path.

CamperBob2 · 18 days ago

This is the classic 'plausible hallucination' problem. In my own testing with coding agents, we see this constantly—LLMs will invent a method that sounds correct but doesn't exist in the library.

Often, if not usually, that means the method should exist.

HPsquared · 18 days ago

Only if it's actually possible and not a fictional plot device aka MacGuffin.

v_CodeSentinal · 18 days ago

threethirtytwo · 18 days ago

You don’t need a test to know this we already know there’s heavy reinforcement training done on these models so it optimizes for passing the training. Passing the training means convincing the person rating the answers and that the answer is good.

The keyword is convince. So it just needs to convince people that’s it’s right.

It is optimizing for convincing people. Out of all answers that can convince people some can be actual correct answers, others can be wrong answers.

godelski · 18 days ago

Yet people often forget this. We don't have mathematical models of truth, beauty, or many abstract things. Thus we proxy it with "I know it when I see it." It's a good proxy for lack of anything better but it also creates a known danger: the model optimizes deception. The proxy helps it optimize the answers we want but if we're not incredibly careful they also optimize deception.

This makes them frustrating and potentially dangerous tools. How do you validate a system optimized to deceive you? It takes a lot of effort! I don't understand why we are so cavalier about this.

Deleted Comment

No the question is, how do you train the system so it doesn't deceive you?

comex · 18 days ago

I like how this article was itself clearly written with the help of an LLM.

(You can particularly tell from the "Conclusions" section. The formatting, where each list item starts with a few-word bolded summary, is already a strong hint, but the real issue is the repetitiveness of the list items. For bonus points there's a "not X, but Y", as well as a dash, albeit not an em dash.)

musculus · 17 days ago

Good catch. You are absolutely right.

My native language is Polish. I conducted the original research and discovered the 'square root proof fabrication' during sessions in Polish. I then reproduced the effect in a clean session for this case study.

Since my written English is not fluent enough for a technical essay, I used Gemini as a translator and editor to structure my findings. I am aware of the irony of using an LLM to complain about LLM hallucinations, but it was the most efficient way to share these findings with an international audience.

arational · 17 days ago

I see you used LLM to polish your English.

YetAnotherNick · 18 days ago

Not only that, it even looks like the fabrication example is generated by AI, as the entire question seem too "fabricated". Also gemini web app queries the tool and returns correct answer, so don't know which gemini the author is talking about.

pfg_ · 18 days ago

Probably gemini on aistudio.google.com, you can configure if it is allowed to access code execution / web search / others

fourthark · 18 days ago

“This is key!”

benreesman · 18 days ago

They can all write lean4 now, don't accept numbers that don't carry proofs. The CAS I use for builds has a coeffect discharge cert in the attestation header, couple lines of code. Graded monads are a snap in CIC.

dehsge · 18 days ago

There are some numbers that are uncomputable in lean. You can do things to approximate them in lean however, those approximates may still be wrong. Leans uncomputable namespace is very interesting.

zadwang · 18 days ago

The simpler and I think correct conclusion is that the LLM simply does not reason in our sense of the word. It mimics the reasoning pattern and try to get it right but could not.

esafak · 18 days ago

What do you make of human failures to reason then?

dns_snek · 17 days ago

Humans who fail to reason correctly with similar frequency aren't good at solving that task, same as LLMs. For the N-th time, "LLM is as good at this task as a human who's bad at it" isn't a good selling point.

mlpoknbji · 18 days ago

This also can be observed with more advanced math proofs. ChatGPT 5.2 pro is the best public model at math at the moment, but if pushed out of its comfort zone will make simple (and hard to spot) errors like stating an inequality but then applying it in a later step with the inequality reversed (not justified).

tombert · 18 days ago

I remember when ChatGPT first came out, I asked it for a proof for Fermat's Last Theorem, which it happily gave me.

It was fascinating, because it was doing a lot of understandable mistakes that 7th graders make. For example, I don't remember the surrounding context but it decided that you could break `sqrt(x^2 + y^2)` into `sqrt(x^2) + sqrt(y^2) => x + y`. It's interesting because it was one of those "ASSUME FALSE" proofs; if you can assume false, then mathematical proofs become considerably easier.

My favorite early chatgpt math problem was "prove there exists infinitely many even primes" . Easy! Take a finite set of even primes, multiply them and add one to get a number with a new even prime factor.

Of course, it's gotten a bit better than this.

oasisaimlessly · 17 days ago

IIRC, that is actually the standard proof that there are infinitely many primes[1] or maybe this variation on it[2].

[1]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Euclid's_pr...

[2]: https://en.wikipedia.org/wiki/Euclid%27s_theorem#Proof_using...

UltraSane · 18 days ago

LLMs have improved so much the original ChatGPT isn't relevant.

tptacek · 18 days ago

I remember that being true of early ChatGPT, but it's certainly not true anymore; GPT 4o and 5 have tagged along with me through all of MathAcademy MFII, MFIII, and MFML (this is roughly undergrad Calc 2 and then like half a stat class and 2/3rds of a linear algebra class) and I can't remember it getting anything wrong.

Presumably this is all a consequence of better tool call training and better math tool calls behind the scenes, but: they're really good at math stuff now, including checking my proofs (of course, the proof stuff I've had to do is extremely boring and nothing resembling actual science; I'm just saying, they don't make 7th-grader mistakes anymore.)

It's definitely gotten considerably better, though I still have issues with it generating proofs, at least with TLAPS.

I think behind the scenes it's phoning Wolfram Alpha nowadays for a lot of the numeric and algebraic stuff. For all I know, they might even have an Isabelle instance running for some of the even-more abstract mathematics.

I agree that this is largely an early ChatGPT problem though, I just thought it was interesting in that they were "plausible" mistakes. I could totally see twelve-year-old tombert making these exact mistakes, so I thought it was interesting that a robot is making the same mistakes an amateur human makes.