The Internet has been poking fun at LLMs failing to count letters for many months. Are the AI companies really living in an echo chamber? They should have implemented a thing to count letters (and to search for palindromes, and…), and just have the LLM pass the request to the count_letters function, as they do for many other tasks already…
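Something along these lines would be enough; a minimal sketch in Python (the count_letters name and the tool-spec shape are just illustrative, not any vendor's actual API):

    # Minimal sketch: a deterministic helper the model could hand this off to.
    # The count_letters name and the tool-spec shape below are illustrative,
    # not any particular vendor's actual API.
    def count_letters(word: str, letter: str) -> int:
        """Count case-insensitive occurrences of a single letter in a word."""
        return word.lower().count(letter.lower())

    # A tool description in the JSON-schema style most chat APIs accept:
    COUNT_LETTERS_TOOL = {
        "name": "count_letters",
        "description": "Count how many times a letter appears in a word.",
        "parameters": {
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string"},
            },
            "required": ["word", "letter"],
        },
    }

    print(count_letters("blueberry", "b"))  # 2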
Why on earth would they do this? This is not a fundamentally useful task; it serves as a measure of the LLM's ability to generalize to tasks that are outside its training data and that strain the limits of what it can express.
Because optics matter. They are all ultimately fundraising and competing, and this is terrible PR.
Ask Jeeves from 1997 could answer this question, so tell me why we need to devote a nation-state amount of compute power to feed an “AI” that confidently gets kindergarten level questions dead ass wrong?
I have the same kind of question when I watch the AI summary on Google output tokens one-by-one to give me less useful information that is right there on the first search result from Wikipedia (fully sourced, too)
If you’re advertising that your new LLM is like a PhD in your pocket, and it fails on a task that a first grader can do, it makes it hard to take your other claims seriously.
> Are the AI companies really living in an echo chamber?
The author tested 12 models, and only one was consistently wrong. More than half were correct 100% of the time.
A better conclusion would be that there's something in particular wrong with GPT-5 Chat; all the other GPT-5 variants are OK. I wonder what's different?
This “one model” was supposed to be the state-of-the-art model, from the company that had a model named “strawberry” in honor of the same problem. And other models struggle with this extremely basic question too.
OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur. It's clearly a smaller model optimized for cost and speed. Compare it to GPT-4.5, which didn't have these errors but was "too expensive for them".
This is why Anthropic's naming system of Haiku, Sonnet, and Opus to represent size is really nice. It prevents this confusion.
I think it's a valid complaint that the naming scheme for the various GPT-4 models was very confusing. GPT-5 just launched, and doesn't (yet?) have a GPT-5 Turbo or GPT-o5 mini to muddy the waters.
In contrast to GPT-5, GPT-5 mini and GPT-5 nano?
If anything it's a lot less confusing than the awful naming convention from OpenAI up until 5.
Yeah, one of the main reasons I switched my tooling over to default to Anthropic models, despite starting out with OpenAI for months prior, was that I often switch between different model sizes depending on the complexity of the prompt versus how quickly I want the result.
I would frequently spend time going back to OpenAI's site to remind myself of their different models. There's no consistency there whatsoever. But with Anthropic it was easy.
If I have to spend 5 minutes picking a model then I might as well do the task myself. So Claude became a natural solution for me.
> OpenAI definitely tarnished the name of GPT-5 by allowing these issues to occur
For a certain class of customer maybe that is true.
But the reality is that this is very encouraging -- they are not micro-optimizing to solve cosmetic problems that serve no functional purpose. They are instead letting these phenomena serve as external benchmarks of a sort, to evaluate how well the LLM can work on tasks that are outside its training data and outside what one would expect its capabilities to be.
Oh wow, I stare at those model names every day, and I only just now after reading your comment realized what “haiku”, “sonnet”, and “opus” imply about the models! Seems super obvious in retrospect but never thought about it!
I agree it’s not perfect. But it’s just three terms that non-English speakers need to learn, which is a lot easier than having to remember every OpenAI model name and how it compares to every other one.
The hilarious thing is that LLMs will happily explain the token problem to you, and will even list out exactly which words will have these problems. Then in the same breath, when you ask about letter counts, it will screw up.
It seems like this could be easily solved in models that support tool calling by providing them with a tool that takes a token and returns the individual graphemes.
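A minimal sketch of such a tool, assuming Python on the backend (the spell_out name is made up; the third-party regex module's \X pattern handles full grapheme clusters, and plain list() covers ASCII):

    # Expand a word/token into individual characters so the model never has to
    # "remember" spellings. Falls back to plain characters if `regex` is absent.
    def spell_out(text: str) -> list[str]:
        try:
            import regex  # third-party module; supports \X grapheme clusters
            return regex.findall(r"\X", text)
        except ImportError:
            return list(text)

    print(spell_out("blueberry"))  # ['b', 'l', 'u', 'e', 'b', 'e', 'r', 'r', 'y']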
It doesn't seem valuable for the model to memorize the graphemes in each of its tokens.
Yes, but are you going to special-case all of these pain points? The whole point of these LLMs is that they learn from training data, not from people coding logic directly. If you do this, people will come up with a dozen new ways in which the models fail; they are really not hard to find. Basically, asking them to do anything novel is at risk of complete failure. The interesting bit is that LLMs tend to work best at "medium difficulty" problems: homework questions, implementing documented APIs, and things like that. Asking them to do anything completely novel tends to fail, as does asking them to do something so trivial that normal humans won't bother even writing it down.
It makes sense when users ask for information that isn't available in the tokenized values, though. In the abstract, a tool that re-tokenizes certain context contents when a prompt references them is probably necessary to solve this issue (if you consider it worth solving).
Tokenization is an inherent weakness of current LLM design, so it makes sense to compensate for it. Hopefully some day tokenization will no longer be necessary.
That takes away from the notion that LLMs have emergent intelligent abilities. Right now it doesn't seem valuable for a model to count letters, even though it is a very basic measure of understanding. Will this continue in other domains? Will we be doing tool-calling for every task that's not just summarizing text?
> Will we be doing tool-calling for every task that's not just summarizing text?
spoiler: Yes. This has already become standard for production use cases where the LLM is an external-facing interface; you use an LLM to translate the user's human-language request to a machine-ready, well-defined schema (i.e. a protobuf RPC), do the bulk of the actual work with actual, deterministic code, then (optionally) use an LLM to generate a text result to display to the user. The LLM only acts as a user interface layer.
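A rough sketch of that pattern (call_llm_to_schema is a stand-in for whatever structured-output or function-calling request your chat API provides; the schema and names are invented for illustration):

    from dataclasses import dataclass

    @dataclass
    class CountLettersRequest:          # the machine-ready, well-defined schema
        word: str
        letter: str

    def call_llm_to_schema(user_text: str, schema_cls):
        # Placeholder for the LLM step: in real code this would be a
        # structured-output / function-calling request. Hard-coded here
        # so the sketch runs end to end.
        return schema_cls(word="blueberry", letter="b")

    def handle(req: CountLettersRequest) -> int:
        # Deterministic code does the actual work.
        return req.word.lower().count(req.letter.lower())

    def respond(req: CountLettersRequest, result: int) -> str:
        # Optionally hand this back to an LLM to phrase; a template works too.
        return f'The letter "{req.letter}" appears {result} time(s) in "{req.word}".'

    req = call_llm_to_schema("how many b's are there in blueberry?", CountLettersRequest)
    print(respond(req, handle(req)))    # ... appears 2 time(s) in "blueberry".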
How is counting letters a measure of understanding, rather than a rote process?
The reason LLMs struggle with this is because they literally aren't thinking in English. Their input is tokenized before it comes to them. It's like asking a Chinese speaker "How many Rs are there in the word 草莓".
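You can see what the model actually receives with a tokenizer library such as tiktoken (assuming it's installed; the exact split depends on which encoding/model you pick):

    import tiktoken

    enc = tiktoken.get_encoding("cl100k_base")
    ids = enc.encode("blueberry")
    print(ids)                             # a short list of integer token ids
    print([enc.decode([t]) for t in ids])  # multi-letter chunks, not letters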
We're up to a gazillion parameters already, maybe the next step is to just ditch the tokenization step and let the LLMs encode the tokenization process internally?
A while back I encoded the message "Can you read this?" in rot13 (or something very similar) and sent it to ChatGPT. It answered back (yes) with a sentence in rot13. Why could it do that but not count letters?
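For reference, rot13 itself is a trivial, fully deterministic transform:

    import codecs

    msg = codecs.encode("Can you read this?", "rot_13")
    print(msg)                            # Pna lbh ernq guvf?
    print(codecs.decode(msg, "rot_13"))   # round-trips back to the original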
At the VERY beginning I asked it to reverse a hash, and was stunned that it was able to. This was in that first moment when the transformer burst on the scene.
Turned out it just knew that particular string because it was on the internet. I don’t recall exactly but it was probably “hello world” or something like that.
Anyway to your specific question, many of the models call Python tools behind the scenes now.
It might help a bit to expand this test to a short phrase. With such a small test the model can be right for the wrong reasons; opening up a bit of space to be wrong in might sharpen the differences.
(My one-off test of the default ChatGPT model, whatever that is, got 'How many b's are there in "Billy Bob beat the record for bounding?"' correct first try, with correct reasoning given.)
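For comparison, the deterministic answer for that phrase:

    phrase = "Billy Bob beat the record for bounding?"
    print(phrase.lower().count("b"))  # 5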
I tried with three different models:
- hf.co/lmstudio-community/DeepSeek-R1-Distill-Llama-8B-GGUF:Q8_0
- qwen3:4b-instruct-2507-fp16
- hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
ollama run qwen3:4b-instruct-2507-fp16
>>> how many b’s are there in blueberry?
Let's break down the word *"blueberry"* step by step to count how many *b's* are in it.
Step 1: Write out the word clearly:
*b l u e b e r r y*
Step 2: Go through each letter one by one:
- b → yes, a *b*
- l → no
- u → no
- e → no
- b → yes, another *b*
- e → no
- r → no
- r → no
- y → no
Step 3: Count the *b's*:
- First *b* at the beginning
- Second *b* after "blue"
So, there are *two b's*.
Final answer: *2* b's in "blueberry".
>>>
ollama run hf.co/ibm-granite/granite-3.3-2b-instruct-GGUF:F16
>>> how many b’s are there in blueberry?
The word "blueberry" contains two 'b's.
(fastest lol, granite models are pretty underrated)
The r1-distill output was similar to the qwen instruct one, but it double-checked itself in its thinking part.
GPT-5: "How many times does the letter b appear in blueberry?" - https://news.ycombinator.com/item?id=44832908 - Aug 2025 (321 comments)