birdfood · 2 months ago
Perhaps related: after watching a talk by Gerald Sussman, I loaded an image of the Kanizsa triangle into Claude and asked it a pretty vague question to see if it could “see” the inferred triangle. It recognised the image and went straight into giving me a summary about it. So I rotated the image 90 degrees and tried again in a new conversation; this time it didn't recognise the image and got the number of elements wrong:

This image shows a minimalist, abstract geometric composition with several elements:

* Four black shapes that appear to be partial circles or "Pac-Man"-like forms, each with a wedge cut out, positioned in the four corners/quadrants of the image
* Two thin black triangular or arrow-like shapes: one pointing upward in the upper-left area, and one pointing to the right in the center-right area
* All elements are arranged on a light gray or off-white background

latentsea · 2 months ago
I guess they will now just rotate all the images in the training data 90 degrees too to fill this kind of gap.
recursivecaveat · 2 months ago
Everything old is new again: in the AlexNet paper that kicked off the deep-learning wave in 2012, they describe horizontally flipping every image as a cheap form of data augmentation. Though now that we expect models to actually read text, that seems potentially counter-productive. Rotations are similar, in that you'd hope the model would learn heuristics such as that the sky is almost always at the top.
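For a concrete sense of how cheap that augmentation is, here's a minimal Python sketch, treating an image as a list of pixel rows (purely illustrative, not the AlexNet pipeline):

```python
def hflip(image):
    # Horizontal flip, the AlexNet-style augmentation: reverse each pixel row.
    return [row[::-1] for row in image]

img = [[1, 2, 3],
       [4, 5, 6]]
print(hflip(img))  # [[3, 2, 1], [6, 5, 4]]
```

In a real training loop you'd apply this (or a rotation) randomly per sample, doubling the effective dataset at near-zero cost.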
mirekrusin · 2 months ago
That's how you train a neural network with synthetic data so that it extracts actual meaning.

That's also how humans learn, e.g. adding numbers: first there is naive memorization, followed by more examples until you get it.

LLM training seems to be falling into a memorization trap because models are extremely good at it, orders of magnitude better than humans.

IMHO what is missing in the training process is feedback explaining the wrong answer. What we currently do in training leaves that understanding as an "exercise for the reader": we feed in correct answers to specific, individual examples, which promotes memorization.

What we should be doing in post-training is ditch direct backpropagation on the next token: instead, let the model finish its wrong answer, append an explanation of why it's wrong, and continue backpropagation on the final answer, now with the explanation in context to guide it to the right place in understanding.

What all of this means is that current models are largely underutilized and unnecessarily bloated; they contain far too much memorized information. Making a model larger is easy, a quick illusion of improvement. Models need to be squeezed harder, and more focus needs to go toward the training flow itself.

littlestymaar · 2 months ago
And it will work.

I just wish the people who believe LLMs can actually reason and generalize would see that they don't.

Workaccount2 · 2 months ago
Show any LLM a picture of a dog with 5 legs and watch it be totally unable to count.
pfdietz · 2 months ago
Or watch them channel Abraham Lincoln.
JohnKemeny · 2 months ago
We really don't know how to compute.

Oct 2011, 30 comments.

https://news.ycombinator.com/item?id=3163473

Strange loop video:

July 2011, 36 comments.

https://news.ycombinator.com/item?id=2820118

iknownothow · 2 months ago
As far as I can tell, the paper covers text documents only. Therefore your example doesn't quite apply.

It is well known that LLMs have a ways to go when it comes to processing images like they process text or audio.

I don't think there's any well-performing multimodal model that accepts image pixels directly. Most vision capabilities are hacks or engineered in. An image undergoes several processing steps, and each processor's outputs are fed to the transformer as tokens. This may happen in one network, but there are non-transformer networks involved. Examples of preprocessing:

* OCR
* CNNs (2D pattern recognizers) with different zooms, angles, slices, etc.
* Others, maybe, too?

akomtu · 2 months ago
To generalise this idea: if we look at a thousand points that more or less fill a triangle, we'll instantly recognize the shape. IMO, this simple example reveals what intelligence is really about. We spot the triangle because so much complexity - a thousand points - fits into a simple, low-entropy geometric shape. What we call IQ is the ceiling of complexity of patterns that we can notice. For example, the thousand dots may in fact represent corners of a 10-dimensional cube, rotated slightly - an easy pattern to see for a 10-d mind.
saithound · 2 months ago
Cool. Since ChatGPT 4o is actually really good at this particular shape-identification task, what, if anything, do you conclude about its intelligence?
cs702 · 2 months ago
Interesting. Even the most recent models perform relatively poorly when asked to identify which information in a context has been removed, given access to both the original and edited contexts.

The authors posit that poor performance is due to the fact that the attention mechanism of Transformers cannot attend to the removed tokens, because there are no keys for them!

Thank you for sharing on HN.

yorwba · 2 months ago
There are keys to attend to, they're just in the original text instead of the modified one. Since the model receives both as input, it could theoretically attend to those keys.

For the attention mechanism, there isn't much difference between

  Original: {shared prefix} {removed part} {shared suffix}
  Modified: {shared prefix} {shared suffix}
And

  Original: {shared prefix} {shared suffix}
  Modified: {shared prefix} {added part} {shared suffix}
I think you could implement an algorithm for this in RASP (a language for manually programming transformers) roughly like this:

1. The first layer uses attention to the "Original:" and "Modified:" tokens to determine whether the current token is in the original or modified parts.

2. The second layer has one head attend equally to all original tokens, which averages their values, and another head attends equally to all modified tokens, averaging them as well. The averages are combined by computing their difference.

3. The third layer attends to tokens that are similar to this difference, which would be the ones in the {removed part}/{added part}.

The only ordering-dependent part is whether you compute the difference as original_average - modified_average or the other way around.
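The difference-of-averages idea in steps 2 and 3 can be sketched outside a transformer. Here is a toy Python version using one-hot "embeddings" (an illustrative assumption; this is not RASP and not a claim about what real models do):

```python
# Toy illustration of the difference-of-averages idea: with one-hot
# "embeddings", the mean over tokens is just a frequency vector, and
# original_average - modified_average peaks at the removed token.
def find_removed_token(original, modified):
    vocab = sorted(set(original) | set(modified))
    index = {tok: i for i, tok in enumerate(vocab)}

    def average(tokens):
        vec = [0.0] * len(vocab)
        for tok in tokens:
            vec[index[tok]] += 1.0 / len(tokens)
        return vec

    diff = [o - m for o, m in zip(average(original), average(modified))]
    # The removed token has the largest positive gap.
    return vocab[diff.index(max(diff))]

original = "the quick brown fox jumps over the lazy dog".split()
modified = "the quick brown jumps over the lazy dog".split()
print(find_removed_token(original, modified))  # fox
```

Swapping the subtraction order (modified minus original) would pick out an added token instead, which is the only ordering-dependent choice.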

If a model can detect additions but not removal, that would show that it is capable of learning this or a similar algorithm in principle, but wasn't trained on enough removal-style data to develop the necessary circuitry.

ironmanszombie · 2 months ago
Thanks for the breakdown. I am far from knowledgeable on AI, but I was wondering why a simple comparison can't work. It can definitely be coded, as you have beautifully demonstrated.
cyanydeez · 2 months ago
For vision models, I wonder if they can train on things like photo negatives, rotated images, etc. Or madlib-like sentences where a Q/A is like "the _____ took first place in the horse show."
bearseascape · 2 months ago
The madlib-like sentences approach is actually how masked token prediction works! It was one of the pretraining tasks for BERT, but nowadays I think all (?) LLMs are trained with next-token prediction instead.
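As a sketch, a BERT-style cloze (masked-token) training example can be generated like this (function name and mask token are illustrative, not from any particular library):

```python
def make_cloze(tokens, mask_index, mask="[MASK]"):
    # Hide one token; the model's training target is to predict it
    # from the surrounding context.
    masked = list(tokens)
    target = masked[mask_index]
    masked[mask_index] = mask
    return masked, target

sentence = "the palomino took first place in the horse show".split()
masked, target = make_cloze(sentence, 1)
print(" ".join(masked))  # the [MASK] took first place in the horse show
print(target)            # palomino
```

Real BERT pretraining masks ~15% of tokens at random rather than one fixed position, but the supervision signal is the same.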
latency-guy2 · 2 months ago
For photo negatives, it usually doesn't matter. I am not up to date with what the vision folks are doing at these companies, but images are usually single-channel, and for regular images more likely than not grayscale. Otherwise they're in the complex domain for the radar folks, and those are not RGB-based images at all, but rather defined by scatterers.

Additional channels usually didn't matter for the experiments and models I dealt with before 2022, and when they did, it certainly wasn't because of color. Then again, the work I was doing was on known classes (plus some confusers) for object detection and classification, where color pretty much didn't matter in the first place.

usaar333 · 2 months ago
They don't seem to use any recent top models. No Opus, no o3, no Gemini 2.5 Pro.
cs702 · 2 months ago
It seems they used the most recent models available as of March 2025.
jug · 2 months ago
And yet, there are some notable differences between them, so now that there’s a benchmark and attention given to this issue, I wonder how much better they can get. Because obviously something can be done.
yousif_123123 · 2 months ago
This is very interesting.

1. The authors mention the attention mechanism being perhaps unable to attend to the location of gaps, since the gaps aren't tokens. But I would've expected a good LLM transformer to at least get close to the gap location. I don't understand why mathematically the architecture is less suitable for that; it could attend to a region that may contain gaps. I wonder if fine-tuning on a task like this could help?

2. Shorter inputs with fewer omissions were harder to solve. That is not completely surprising: as a human doing this task, if 1 word was missing it would be harder to notice, and similarly 1 missing line would be harder than 10. But it's still interesting for an LLM to have this problem.

3. Reasoning models do better, as they can write out the documents and potentially solve this easily. It's still very surprising that this doesn't lead to 100% accuracy; this should be a trivial task. Like the paper says, a trivial program can be written to solve it. Perhaps ChatGPT (or a similar agent) could read this paper while training, and know to write and run Python when solving an issue like this.

The most interesting thing though, is what other aspects of intelligence we may not have identified explicitly, and whether LLMs and current AI is very bad at them. This paper suggests that there likely are many of those, and it seems in general a pretty fun time for people working building benchmarks.
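For reference, the "trivial program" really is trivial. A sketch using Python's standard difflib (not the paper's actual harness; the poem lines are just sample data):

```python
import difflib

original = ["Two roads diverged in a yellow wood,",
            "And sorry I could not travel both",
            "And be one traveler, long I stood"]
edited = ["Two roads diverged in a yellow wood,",
          "And be one traveler, long I stood"]

# Lines prefixed "- " by ndiff are present in the original but
# missing from the edited version.
removed = [line[2:] for line in difflib.ndiff(original, edited)
           if line.startswith("- ")]
print(removed)  # ['And sorry I could not travel both']
```

An agent that writes and runs a snippet like this would score perfectly on the removal task, which is part of what makes the models' raw performance so striking.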

iknownothow · 2 months ago
To be fair, I'd put finding literal string diffs in the category of asking LLMs to do rote arithmetic.

The attention mechanism does far too much complex thinking for such a dumb task. This is precisely where you need to dumb down, focus, and be disciplined rather than do high-level next-token prediction.

You'd benefit from actually asking the LLM to list the full document and compare, a kind of reasoning, similar to how LLMs perform better when they break arithmetic or algebra tasks into smaller steps.

Also, my guess would be that the models that perform well are MoE models, where there may be an Expert or two that does well on tasks that need focus rather than intuition. So without knowing anything about Gemini Flash, my guess would be that it's an MoE model.

XenophileJKO · 2 months ago
I haven't read the paper yet, but from a structural 'attention' perspective, being unable to detect unclassified omissions is completely expected. (Though I think it can be solved with structured thought.)

For needle in a haystack you have to pay attention to the thing that you are trying to find. Attention can do this pretty well.

When looking for an omission, that omission can be anything; you can only reason about it by comparing one whole context to another whole context. The attention layers can't really do that.

This is similar to the "rank a long set of things" problem. Absent some meta cognition process, they just can't do that.

teruakohatu · 2 months ago
> When looking for an omission, that omission can be anything,

In this benchmark they give the LLM the necessary information to determine what is missing. For example: “Here is a poem, and here is a version of that same poem that may or may not be missing lines. Are any lines missing?”

It’s more a tuning issue IMHO than an inherent weakness in LLMs.

If I were asked to find an omission in an ML paper, my brain would compare it with other ML papers; it does not need to compare it to Star Wars, Top Gear, Greek history, pottery, and the thousands of other contexts I may know about.

XenophileJKO · 2 months ago
Sorry I meant the omission can be anything in the context, not anything in the world.. lol.

That is still hard. You only have so many attention heads looking for things.. you can't pay attention to EVERYTHING.. which is what is required to find the omission.

thaumasiotes · 2 months ago
We should note that "where is there a line missing from this poem: ____?" contains sufficient information to answer correctly without needing a copy of the original to compare to.

Here are two verses of a poem (song) in Mandarin Chinese:

yi quan ting ni de

er gei ni hao de

shu dao san yong yuan ai ni yi ge

si bu hui fan cuo

wu bu hui luo suo

shuo ni xiang shuo de

zuo ni xiang zuo de

bie pa shi bai yin wei ni you wo

pei ni kan ri luo

pei ni yi qi chang wan wo men ai de ge

I removed two lines. Where did that happen?

Would your answer be different if I told you that I might or might not have removed some lines?

OsrsNeedsf2P · 2 months ago
The criticisms of how AbsenceBench does this are valid, but I'm very excited that we are benchmarking this at all. It's definitely a push in the right direction.
kadonoishi · 2 months ago
To detect a presence, a real brain takes in sensory input and compares it to expectations, and stays calm or registers surprise, and from time to time issues predictions to guide the organism.

To detect an absence, the brain cannot rely on sensory input, by definition. Being surprised that sensory evidence is _not_ there requires a model of the world strong enough to register surprise when the expected evidence is absent, without a sensory prompt.

It seems to me detecting an absence is a strictly higher-order neurological task than processing sensory input.

If LLMs can't do this strictly higher-order neurological task, is that not a capability currently unique to living things?

gtsop · 2 months ago
Thinking is still currently unique to living things, so you don't need to resort to what you describe to find the human brain's uniqueness.

On to what you describe: it has to do with memory. Memory is storing and playing back sensory input in the absence of that sensory input. So your brain plays back some past sensory input and checks it against the current sensory input.

E.g. you left the pen on the table. When you come back, the pen isn't there. Your brain compares the stored memory of seeing the pen on the table with what you see now.

viralsink · 2 months ago
LLMs might not be very consistent overall in their learned architecture. Some paths may lead to memorized info, some paths may lead to advanced pattern matching.
tclancy · 2 months ago
> from time to time

I know less-than-zero about the subject, but I'd imagine the temporal aspect alone is a problem. Aren't these agents reasoning from a fixed/frozen version of "reality" rather than adjusting in real time?

pkoird · 2 months ago
So LLMs are poor at string diff, it seems. Tangentially, is there any source (a GitHub repo or otherwise) that documents findings like these, i.e. what LLMs are good at and what they aren't good at?