People have noticed for a while now that Google's Bard/Gemini often inserts random Hindi/Bengali words. [0]
I just caught this in an o3-pro thought process: "and customizing for low difficulty. কাজ করছে!"
That last set of chars is apparently Bengali for "working!".
I just find it curious that similar "errors" are appearing across multiple different models. Does anyone know what it is about the training method or reasoning process that lets these other languages creep in?
[0] https://www.reddit.com/r/Bard/comments/18zk2tb/bard_speaking_random_languages/
[1]: By this I mean "whatever it is they do that can be thought of as sorta kinda roughly analogous to what we generally call thinking." I'm not interested in getting into a debate (here) about the exact nature of thinking and whether or not it's "correct" to refer to LLMs as "thinking". It's a colloquialism that I find useful in this context, nothing more.
[2]: https://arxiv.org/pdf/2501.12948
In other circumstances the model might take a different path through the output distribution during decoding, into other character sets, if the token probabilities justify it.
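To make that concrete, here's a toy sketch (the logits are invented, not from any real model's tokenizer): decoding just picks from a probability distribution over all tokens, and nothing about that distribution is confined to one script. If a Bengali token happens to score highest, the output switches scripts mid-sentence.

```python
import math

# Hypothetical next-token logits after a prefix like
# "...and customizing for low difficulty. "
logits = {
    "working": 2.1,   # English continuation
    "done":    1.8,
    "কাজ":     2.3,   # Bengali "work" -- semantically right, wrong script
}

def softmax(scores):
    """Convert raw logits into a probability distribution."""
    m = max(scores.values())
    exps = {t: math.exp(s - m) for t, s in scores.items()}
    z = sum(exps.values())
    return {t: e / z for t, e in exps.items()}

probs = softmax(logits)

# Greedy decoding picks the Bengali token here, since it has the top logit;
# temperature sampling would pick it with the highest probability.
best = max(probs, key=probs.get)
```

The point is that script choice is never a separate decision: it falls out of per-token probabilities, so a semantically correct token in the "wrong" script can win at any step.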
I assumed it knew I speak Spanish from other conversations, my Google profile, geolocation, etc. Maybe my English has enough hints that it was learned by a native Spanish speaker?
Perhaps it's more common in the parts of the world where Bengali and English are more commonly spoken in general?
Why so much Bengali/Hindi then, and why not other languages?
For example, the DeepSeek team explicitly reported this behavior in the R1 paper, noting that reasoning emerges naturally from pure RL training (R1-Zero) but brings some "language mixing" along. Interestingly, they found that a small supervised fine-tuning (SFT) step with a language-consistency reward improved readability, though it came with trade-offs (DeepSeek blog post).
My guess is OpenAI has typically used a smaller summarizer model to sanitize reasoning outputs before display (they mentioned summarization/filtering briefly at Dev Day), but perhaps lately they’ve started relaxing that step, causing more multilingual slips to leak through. It’d be great to get clarity from them directly on whether this is intentional experimentation or just a side-effect.
[1] DeepSeek-R1 paper that talks about poor readability and language mixing in R1-zero’s raw reasoning https://arxiv.org/abs/2501.12948
[2] OpenAI “Detecting misbehavior in frontier reasoning models” — explains use of a separate CoT “summarizer or sanitizer” before showing traces to end-users https://openai.com/index/chain-of-thought-monitoring/
The DeepSeek-R1 paper has a section on this, where they 'punish' the model if it thinks in a different language to make the thinking tokens more readable. Probably Anthropic does this too.
We are intentionally undoing one of the things that makes computers useful.
One, the model is no longer being trained to output likely tokens or tokens likely to satisfy pairwise preferences. So the model doesn’t care. You have to explicitly punish the model for language switching, which dilutes the reasoning reward.
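As a rough sketch of what such a penalty could look like (hypothetical, not DeepSeek's actual implementation, which isn't public): score the chain-of-thought by the fraction of its letters that fall in the target script, and fold that in as a small bonus alongside the task reward. The weight `lam` is the dilution knob mentioned above.

```python
def target_script_ratio(text: str, lo: int = 0x0000, hi: int = 0x024F) -> float:
    """Fraction of alphabetic characters within the target Unicode range.

    Default range covers Basic Latin plus the Latin supplements/extensions,
    i.e. roughly 'English-like' text. Digits and punctuation are ignored.
    """
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 1.0  # nothing to penalize in an empty/symbolic string
    return sum(lo <= ord(c) <= hi for c in letters) / len(letters)

def total_reward(reasoning_reward: float, cot: str, lam: float = 0.1) -> float:
    # Blending in the consistency term necessarily shrinks the relative
    # weight of the task reward -- the dilution trade-off.
    return reasoning_reward + lam * target_script_ratio(cot)
```

For example, a fully English chain-of-thought scores a ratio of 1.0, while "কাজ করছে" scores 0.0, so the optimizer gets a direct gradient away from script switching, at the cost of sometimes preferring a consistent answer over a correct one.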
Two, I believe there has been some research showing that models represent similar ideas across multiple languages in similar regions of their internal feature space; sparse autoencoder work has shown this. So if the translated text makes sense, I think this is why. If not, I have no idea.