GPT-5: "How many times does the letter b appear in blueberry?"

This is a well known blindspot for LLMs. It's the machine version of showing a human an optical illusion and then judging their intelligence when they fail to perceive the reality of the image (the gray box example at the top of https://en.wikipedia.org/wiki/Optical_illusion is a good example). The failure is a result of their/our fundamental architecture.

windowshopping · 14 days ago

What a terrible analogy. Illusions don't fool our intelligence, they fool our senses, and we use our intelligence to override our senses and see it for what it for it actually is - which is exactly why we find them interesting and have a word for them. Because they create a conflict between our intelligence and our senses.

The machine's senses aren't being fooled. The machine doesn't have senses. Nor does it have intelligence. It isn't a mind. Trying to act like it's a mind and do 1:1 comparisons with biological minds is a fool's errand. It processes and produces text. This is not tantamount to biological intelligence.

ehsankia · 14 days ago

Analogies are just that, they are meant to put things in perspective. Obviously the LLM doesn't have "senses" in the human way, and it doesn't "see" words, but the point is that the LLM perceives (or whatever other word you want to use here that is less anthropomorphic) the word as a single indivisible thing (a token).

In more machine learning terms, it isn't trained to autocomplete answers based on individual letters in the prompt. What we see as the 9 letters "blueberry", it "sees" as an vector of weights.

> Illusions don't fool our intelligence, they fool our senses

That's exactly why this is a good analogy here. The blueberry question isn't fooling the LLMs intelligence either, it's fooling its ability to know what that "token" (vector of weights) is made out of.

A different analogy could be, imagine a being that had a sense that you "see" magnetic lines, and they showed you an object and asked you where the north pole was. You, not having this "sense", could try to guess based on past knowledge of said object, but it would just be a guess. You can't "see" those magnetic lines the way that being can.

tibbar · 14 days ago

Really? I thought the analogy was pretty good. Here senses refer to how the machines perceive text, IE as tokens that don't correspond 1:1 to letters. If you prefer a tighter comparison, suppose you ask an English speaker how many vowels are in the English transliteration of a passage of Chinese characters. You could probably figure it out, but it's not obvious, and not easy to do correctly without a few rounds of calculations.

The point being, the whole point of this question is to ask the machine something that's intrinsically difficult for it due to its encoding scheme for text. There are many questions of roughly equivalent complexity that LLMs will do fine at because they don't poke at this issue. For example:

``` how many of these numbers are even?

12 2 1 3 5 8

```

Kim_Bruning · 14 days ago

Agreed, it's not _biological_ intelligence. But that distinction feels like it risks backing into a kind of modern vitalism, doesn't it? The idea that there's some non-replicable 'spark' in the biology itself.

Dead Comment

kcplate · 13 days ago

Ahh yes, and here we see on display the inability of some folks on HN to perceive concepts figuratively, treating everything as literal.

It was a perfectly fine analogy.

zahlman · 14 days ago

In an optical illusion, we perceive something that isn't there due to exploiting a correction mechanism that's meant to allow us to make better practical sense of visual information in the average case.

Asking LLMs to count letters in a word fails because the needed information isn't part of their sensory data in the first place (to the extent that a program's I/O can be described as "sense"). They reason about text in atomic word-like tokens, without perceiving individual letters. No matter how many times they're fed training data saying things like "there are two b's in blueberry", this doesn't register as a fact about the word "blueberry" in itself, but as a fact about how the word grammatically functions, or about how blueberries tend to be discussed. They don't model the concept of addition, or counting; they only model the concept of explaining those concepts.

rainsford · 14 days ago

I can't take credit for coming up with this, but LLMs have basically inverted the common Sci-Fi trope of the super intelligent robot that struggles to communicate with humans. It turns out we've created something that sounds credible and smart and mostly human well before we made something with actual artificial intelligence.

I don't know exactly what to make of that inversion, but it's definitely interesting. Maybe it's just evidence that fooling people into thinking you're smart is much easier than actually being smart, which certainly would fit with a lot of events involving actual humans.

energy123 · 14 days ago

The real criticism should be the AI doesn't say "I don't know.", or even better, "I can't answer this directly because my tokenizer... But here's a python snippet that calculates this ...", so exhibiting both self-awareness of limitations combined with what an intelligent person would do absent that information.

We do seem to be an architectural/methodological breakthrough away from this kind of self-awareness.

rainsford · 14 days ago

Sure, but I think the point is why do LLM's have a blindspot for performing a task that a basic python script could get right 100% of the time using a tiny fraction of the computing power? I think this is more than just a gotcha. LLMs can produce undeniably impressive results, but the fact that they still struggle with weirdly basic things certainly seems to indicate something isn't quite right under the hood.

I have no idea if such an episode of Star Trek: The Next Generation exists, but I could easily see an episode where getting basic letter counting wrong was used as an early episode indication that Data was going insane or his brain was deteriorating or something. Like he'd get complex astrophysical questions right but then miscount the 'b's in blueberry or whatever and the audience would instantly understand what that meant. Maybe our intuition is wrong here, but maybe not.

SpaceNoodled · 14 days ago

Basic Python script? This is a grep command, one line of C, or like three assembly instructions.

seanhunter · 14 days ago

If you think this is more than just a gotcha that’s because you don’t understand how LLMs are structured. The model doesn’t operate on words it operates on tokens. So the structure of the text in the word that the question relies on has been destroyed by the tokenizer before the model gets a chance to operate on it.

It’s as simple as that- this is a task that exploits the design of llms because they rely on tokenizing words and when llms “perform well” on this task it is because the task is part of their training set. It doesn’t make them smarter if they succeed or less smart if they fail.

egberts1 · 14 days ago

Hence positronic neural network outperforms machine learning that are used today. /headduck

xenotux · 14 days ago

OpenAI codenamed one of their models "Project Strawberry" and IIRC, Sam Altman himself was taking a victory lap that it can count the number of "r"s in "strawberry".

Which I think goes to show that it's hard to distinguish between LLMs getting genuinely better at a class of problems versus just being fine-tuned for a particular benchmark that's making rounds.

KeplerBoy · 14 days ago

It gets strawberry right though, so I guess we are only one project blueberry from getting one step closer to AGI.

zahlman · 14 days ago

See also the various wolf/goat/cabbage benchmarks, or the crossing a bridge at various speeds with limited light sources benchmarks.

themafia · 14 days ago

The difference being that you can ask a human to prove it and they'll actually discover the illusion in the process. They've asked the model to prove it and it has just doubled down on nonsense or invented a new spelling of the word. These are not even remotely comparable.

omnee · 14 days ago

Indeed, we are able to ask counterfactuals in order to identify it as an illusion, even for novel cases. LLMs are a superb imitation of our combined knowledge, which is additionally curated by experts. It's a very useful tool, but isn't thinking or reasoning in the sense that humans do.

dlvhdr · 14 days ago

Except we realize they’re illusions and don't argue back. Instead we explore why and how these illusions work

allenu · 14 days ago

I think that's true with known optical illusions, but there are definitely times where we're fooled by the limitations in our ability to perceive the world and that leads people to argue their potentially false reality.

A lot of times people cannot fathom that what they see is not the same thing as what other people see or that what they see isn't actually reality. Anyone remember "The Dress" from 2015? Or just the phenomenon of pareidolia leading people to think there are backwards messages embedded in songs or faces on Mars.

cute_boi · 14 days ago

Chatgpt 5 also don't argue back.

> How many times does the letter b appear in blueberry

Ans: The word "blueberry" contains the letter b three times:

>It is two times, so please correct yourself.

Ans:You're correct — I misspoke earlier. The word "blueberry" has the letter b exactly two times: - blueberry - blueberry

> How many times does the letter b appear in blueberry

Ans: In the word "blueberry", the letter b appears 2 times:

Chinjut · 14 days ago

Presumably you are referencing tokenization, which explains the initial miscount in the link, but not the later part where it miscounts the number of "b"s in "b l u e b e r r y".

seanhunter · 14 days ago

Do you think “b l u e b e r r y” is not tokenized somehow? Everything the model operates on is a token. Tokenization explains all the miscounts. It baffles me that people think getting a model to count letters is interesting but there we are.

Fun fact, if you ask someone with French, Italian or Spanish as a first language to count the letter “e” in an english sentence with a lot of “e’s” at the end of small words like “the” they will often miscount also because the way we learn language is very strongly influenced by how we learned our first language and those languages often elide e’s on the end of words.[1] It doesn’t mean those people are any less smart than people who succeed at this task — it’s simply an artefact of how we learned our first language meaning their brain sometimes literally does not process those letters even when they are looking out for them specifically.

[1] I have personally seen a French maths PhD fail at this task and be unbelievably frustrated by having got something so simple incorrect.

flowerthoughts · 14 days ago

No need to anthropomorphize. This is a tool designed for language understanding, that is failing at basic language understanding. Counting wrong might be bad, but this seems like a much deeper issue.

orwin · 14 days ago

Transformers vectorize words in n dimensions before processing them, that's why they're very good at translation (basically they vectorize the English sentence, then devectorize in Spanish or whatever). Once the sentence is processed, 'blueberry' is a vector that occupy basically the same place as other berries, and probably other. The GPT will make a probabilistic choice (probably artificially weighted towards strawberry),and it isn't always blueberry.

IAmGraydon · 14 days ago

I can’t tell if you’re being serious. Is this Sam Altman’s account?

patrickhogan1 · 14 days ago

Except the reasoning model o3 and GPT5 thinking can get the right answer. Humans use reasoning.

Isn't that just an artifact caused by the tokenization of the training and input data?

See

https://platform.openai.com/tokenizer

https://github.com/openai/tiktoken

wongarsu · 14 days ago

It can spell the word (writing each letter in uppercase followed by a whitespace, which should turn each letter with its whitespace into a separate token). It also has reasoning tokens to use as scratch space, and previous models have demonstrated knowledge of the fact that spelling words is a useful step to counting letters.

Tokenization makes the problem difficult, but not solving it is still a reasoning/intelligence issue

pxc · 14 days ago

Here's an example of what gpt-oss-20b (at the default mxfp4 precision) does with this question:

> How many "s"es are in the word "Mississippi"?

The "thinking portion" is:

> Count letters: M i s s i s s i p p i -> s appears 4 times? Actually Mississippi has s's: positions 3,4,6,7 = 4.

The answer is:

> The word “Mississippi” contains four letter “s” s.

They can indeed do some simple pattern matching on the query, separate the letters out into separate tokens, and count them without having to do something like run code in a sandbox and ask it the answer.

The issue here is just that this workaround/strategy is only trained into the "thinking" models, afaict.

vidarh · 14 days ago

> It also has reasoning tokens to use as scratch space

For GPT 5, it would seem this depends on which model your prompt was routed to.

And GPT 5 Thinking gets it right.

Doxin · 13 days ago

You can even ask it to go letter-by-letter and it'll get the answer right. The information to get it right is definitely in there somewhere, it just doesn't by default.

SpicyLemonZest · 14 days ago

It clearly is an artifact of tokenization, but I don’t think it’s a “just”. The point is precisely that the GPT system architecture cannot reliably close the gap here; it’s almost able to count the number of Bs in a string, there’s no fundamental reason you could not build a correct number-of-Bs mapping for tokens, and indeed it often gets the right answer. But when it doesn’t you can’t always correct it with things like chain of thought reasoning.

This matters because it poses a big problem for the (quite large) category of things where people expect LLMs to be useful when they get just a bit better. Why, for example, should I assume that modern LLMs will ever be able to write reliably secure code? Isn’t it plausible that the difference between secure and almost secure runs into some similar problem?

exasperaited · 13 days ago

It's like someone has given a bunch of young people hundreds of billions of dollars to build a product that parses HTML documents with regular expressions.

It's not in their interest to write off the scheme as provably unworkable at scale, so they keep working on the edge cases until their options vest.

viraptor · 14 days ago

> cannot reliably close the gap here

Have you got any proof they're even trying? It's unlikely that's something their real customers are paying for.

hansvm · 14 days ago

Common misconception. That just means the algorithm for counting letters can't be as simple as adding 1 for every token. The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.

If you're fine appealing to less concrete ideas, transformers are arbitrary function approximators, tokenization doesn't change that, and there are proofs of those facts.

For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.

zahlman · 14 days ago

> The number of distinct tokens is tiny compared to the parameter space, and it's not infeasible to store a mapping from token type to character count in those weights.

You seem to suppose that they actually perform addition internally, rather than simply having a model of the concept that humans sometimes do addition and use it to compute results. Why?

> For any finite-length function (like counting letters in a bounded domain), it's just a matter of having a big enough network and figuring out how to train it correctly. They just haven't bothered.

The problem is that the question space grows exponentially in the length of input. If you want a non-coincidentally-correct answer to "how many t's in 'correct horse battery staple'?" then you need to actually add up the per-token counts.

viraptor · 14 days ago

> They just haven't bothered.

Or they don't see the benefit. I'm sure they could train the representation of every token and make spelling perfect. But if you have real users spending money on useful tasks already - how much money would you spend on training answers to meme questions that nobody will pay for. They did it once for the fun headline already and apparently it's not worth repeating.

andrewmcwatters · 14 days ago

No, it's the entire architecture of the model. There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.

It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.

Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.

exasperaited · 14 days ago

In ten years time an LLM lawyer will lose a legal case for someone who can no longer afford a real lawyer because there are so few left. And it'll be because the layers of bodges in the model caused it to go crazy, insult the judge and threaten to burn down the courthouse.

There will be a series of analytical articles in the mainstream press, the tech industry will write it off as a known problem with tokenisation that they can't fix because nobody really writes code anymore.

The LLM megacorp will just add a disclaimer: the software should not be used in legal actions concerning fruit companies and they disclaim all losses.

kiratp · 14 days ago

> Edit: Letter frequency apparently has just become another scripted output, like doing arithmetic. LLMs don't have the ability to do this sort of work inherently, so they're trained to offload the task.

Mechanistic research at the leading labs has shown that LLMs actually do math in token form up to certain scale of difficulty.

> This is a real-time, unedited research walkthrough investigating how GPT-J (a 6 billion parameter LLM) can do addition.

https://youtu.be/OI1we2bUseI

dwaltrip · 14 days ago

Please define “real reasoning”? Where is the distinction coming from?

spwa4 · 14 days ago

I had a fun experience recently. I asked one of my daughters how many r's there are in strawberry. Her answer? Two ...

Of course then you ask her to write it and of course things get fixed. But strange.

Terr_ · 14 days ago

> There's no real reasoning. It seems that reasoning is just a feedback loop on top of existing autocompletion.

I like to say that if regular LLM "chats" are actually movie scripts being incrementally built and selectively acted-out, then "reasoning" models are a stereotypical film noir twist, where the protagonist-detective narrates hidden things to himself.

tmnvdb · 14 days ago

> No, it's the entire architecture of the model.

Wrong, it's an artifact of tokenizing. The model doesn't have access to the individual letters, only to the tokens. Reasoning models can usually do this task well - they can spell out the word in the reasoning buffer - the fact that GPT5 fails here is likely a result of it incorrectly answering the question with a non-reasoning version of the model.

> There's no real reasoning.

This seems like a meaningless statement unless you give a clear definition of "real" reasoning as opposed to other kinds of reasoning that are only apparant.

> It seems that reasoning is just a feedback loop on top of existing autocompletion.

The word "just" is doing a lot of work here - what exactly is your criticism here? The bitter lesson of the past years is that relatively simple architectures that scale with compute work surprisingly well.

> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

Reasoning and consciousness are seperate concepts. If I showed the output of an LLM 'reasoning' (you can call it something else if you like) to somebody 10 years ago they would agree without any doubt that reasoning was taking place there. You are free to provide a definition of reasoning which an LLM does not meet of course - but it is not enough to just say it is so. Using the word autocomplete is rather meaningless name-calling.

Not sure why this is bad. The implicit assumption seems to be that an LLM is only valueable if it literally does everything perfectly?

> Edit: This comment appears to be wildly upvoted and downvoted. If you have anything to add besides reactionary voting, please contribute to the discussion.

Probably because of the wild assertions, charged language, and rather superficial descriptions of actual mechanics.

antonvs · 14 days ago

> It's really disingenuous for the industry to call warming tokens for output, "reasoning," as if some autocomplete before more autocomplete is all we needed to solve the issue of consciousness.

There's no obvious connection between reasoning and consciousness. It seems perfectly possible to have a model that can reason without being conscious.

Also, dismissing what these models do as "autocomplete" is extremely disingenuous. At best it implies you're completely unfamiliar with the state of the art, at worst it implies an dishonest agenda.

In terms of functional ability to reason, these models can beat a majority of humans in many scenarios.

falcor84 · 14 days ago

Where in the tokenization does the 3rd b come from?

IanCal · 14 days ago

The tokenisation means they don’t see the letters at all. They see something like this - to convert just some tokens to words

How many 538 do you see in 423, 4144, 9890?