Look long enough at the literature on any machine learning task, and someone invariably gets the idea to turn the data into an image and do machine learning on that. Sometimes it works out (turning binaries into images and doing malware detection with a CNN works surprisingly well); usually it doesn't. As in this example, the images usually end up as a kludge to fix some deficiency in the prevalent input encoding.
I can certainly believe that images bring certain advantages over text for LLMs: the image representation does contain useful information that we as humans use (like better information hierarchies encoded in text size, boldness, color, saturation and position, not just n levels of markdown headings), letter shapes are already optimized for this kind of encoding, and continuous tokens seem to bring some advantages over discrete ones. But none of these advantages need the roundtrip via images; they merely point to how crude the state of the art of text tokenization is.
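To make the crudeness concrete, here is a toy illustration (pure stdlib, no real tokenizer, the variable names are mine): two strings that render identically to the eye can have completely different byte sequences, so any byte- or BPE-level tokenizer necessarily treats them as unrelated inputs, while an image encoder would see the same pixels.

```python
# Two strings that look identical when rendered but differ at the byte level.
latin = "paypal"                  # all Latin letters
mixed = "p\u0430yp\u0430l"        # U+0430 is CYRILLIC SMALL LETTER A, same glyph

print(latin == mixed)             # False: same glyphs, different code points
print(latin.encode("utf-8"))      # b'paypal'
print(mixed.encode("utf-8"))      # b'p\xd0\xb0yp\xd0\xb0l'
```

A pixel-level representation collapses this distinction for free; a byte-level one has to learn it (or never does).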
A great example of this is converting music into images, training on those, and generating new images that get converted back into music. It was surprisingly successful. I think this approach is still used by the current music generators.
You are talking about piano-roll notation, I think. While it's 2D data, it's not quite the same as actual image data. E.g., 2D conv and pooling operations are useless for music; the patterns and dependencies are too subtle to be captured by spatial filters.
Doesn't this more or less boil down to OCR scans of books having more privileged information than a plaintext file? In which case a roundtrip won't add anything?
This reminds me of how trajectory-prediction networks for autonomous driving used to use a CNN to encode scene context (from map and object-detection rasters) until VectorNet showed up.
Exactly. The example the article gives of reducing resolution as a form of compression highlights the limitations of the visual-only proposal. Blurring text is a poor form of compression, preserving at most information about paragraph sizes. Summarizing early paragraphs (as context compression does in coding agents) would be much more efficient.
Another great example of this working is DeepMind's genomic variant-calling model, DeepVariant. It uses the "alignment pile-up" images that humans also use to debug genomic alignments, with some additional channels as further feature engineering for the CNN.
This doesn’t cite the very significant example of DeepVariant (and, as of 10/16/25, DeepSomatic), which converts genomic data to images in order to find DNA mutations. This has been done since the late 2010s.
There's a transformation of the data that is analogous to how a human would use the data to find a variant. It's closer to inspecting a stack of widgets to find a defective widget than it is listing numbers in a table
> It makes two characters that look identical to the eye look as two completely different tokens internally in the network. A smiling emoji looks like a weird token, not an... actual smiling face, pixels and all
This goes against my limited understanding of how LLMs work — and computers generally for that matter. Isn’t that rendering of a smiling emoji still just a series of bits that need to be interpreted as a smiley face? The similar looking characters point makes more sense to me though assuming it’s something along the lines of recognizing that “S” and “$” are roughly the same thing except for the line down the middle. Still that seems like something that doesn’t come up much and is probably covered by observations made in the training corpus.
All that said, Karpathy knows way more than I will ever know on the subject, and I’m only posting my uninformed take here in hopes somebody will correct me in a way I understand.
You’re reading it backwards. He is not praising that behavior, he is complaining about it. He is saying that bots _should_ parse smiling-face emojis as smiling faces, but they don’t do that currently, because as text they get passed as gross Unicode that has a lot of ambiguity and just happens to ultimately get rendered as a face for end users.
Vision tokens would only be a viable alternative to text if/when the LLM had learnt to read, and was able to control the page scanning - how to segment the page into sections of text and non-text, segment the text sections into lines, scan the lines in language-specific direction (left to right, or right to left), segment into words, etc - basically everything that an OCR program needs to do prior to the actual OCR bit.
Even having learnt to do all of this, or perhaps with a page-of-text sequence-of-word extractor pre-processor, the LLM would then need to learn to generalize over different font faces and sizes, and imperfect speckled and/or distorted scans.
Finally, and surely not least: if the goal is to reduce (inference?) computational load by representing multiple words as a single image token, then more training epochs may be needed, with variations in word grouping, since the same sequence of words would not always be grouped together. The LLM would have to learn that an image token representing "the cat sat" may also have been split up as "today the cat" and "sat on the".
A better way to reduce number of tokens being processed might be to have the LLM learn how to combine multiple adjacent tokens into one, perhaps starting with individual letters at the input, although this would of course require a fairly major change to the Transformer architecture.
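The simplest version of that merge idea can be sketched without any architecture change (this is a toy, fixed pooling, not the learned merging the comment proposes; function and variable names are mine): collapse every k adjacent token embeddings into one by element-wise averaging, shrinking the sequence the attention layers must process by a factor of k.

```python
def pool_tokens(embeddings, k):
    """Merge each run of k adjacent embedding vectors via element-wise mean."""
    pooled = []
    for i in range(0, len(embeddings), k):
        chunk = embeddings[i:i + k]
        dim = len(chunk[0])
        pooled.append([sum(vec[d] for vec in chunk) / len(chunk)
                       for d in range(dim)])
    return pooled

# Four 2-d "token embeddings" pooled pairwise down to two.
seq = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0]]
print(pool_tokens(seq, 2))   # [[2.0, 1.0], [6.0, 5.0]]
```

A learned variant would replace the fixed mean with a trainable combiner and let the model decide which neighbors to merge, which is where the "fairly major change" comes in.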
Multimodal LLMs already learn to generalize over text inside images. In my experience most multimodal LLMs are significantly better than traditional OCR, especially if there's any unusual formatting going on.
This thread is considering image input as an alternative to text input for text, not as an alternative to other types of OCR, so the accuracy bar is 100%.
I've had mixed results with LLMs for OCR: sometimes excellent (zero errors on a photo of my credit card bill), but poor if the source wasn't a printed page - sometimes "reusing" the same image section for multiple extracted words!
FWIW, I highly doubt that LLMs have just learnt to scan pages from (page image, page text) training pairs - more likely text-heavy image input is triggering special OCR handling.
My understanding is that text tokens are too rigid. The way we read is not to process each character (the analogue of tokens for LLMs) individually, but to see a word, or sometimes a collection of familiar words, and make sense of the writing. What we really read is the concept we take from the written text, not the exact letters or words. This is why we can easily read text with typos: it is just similar enough. By letting LLMs not get too hung up on exact tokens and "skim" through text, we can make them more efficient, just like how humans read efficiently.
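The typo-robustness claim is easy to demo (a toy sketch of the well-known interior-letter-scrambling effect; function names are mine): shuffle every word's interior letters and the text remains largely readable, because readers match word shapes rather than exact character sequences.

```python
import random

def scramble_word(word, rng):
    """Shuffle a word's interior letters, keeping first and last fixed."""
    if len(word) <= 3:
        return word
    middle = list(word[1:-1])
    rng.shuffle(middle)
    return word[0] + "".join(middle) + word[-1]

def scramble(text, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    return " ".join(scramble_word(w, rng) for w in text.split())

print(scramble("reading scrambled interior letters stays surprisingly easy"))
```

A character-exact tokenizer sees the scrambled output as an almost entirely different token stream, even though a human skims right past it.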
User: cna uto nfeir waht tihs si msieplnig?
Google: Here are top web results for exploring this topic [a few websites containing or discussing gibberish in general]
User: what was my previous question?
Google: Your previous question was "cna uto nfeir waht tihs si msieplnig?". I was able to understand this through a combination of contextual understanding and sub-word tokenization, which allowed me to infer the intended meaning despite the numerous typos. The corrected question is: "Can you infer what this is misspelling?".
GPT-5 thinking doesn't struggle with this. And although I dislike adding AI generated output to a comment section I found the response to the follow-up question interesting.
q: c u i t m o t s?
a: “Can you infer the meaning of this sentence?”
You miss that we already have "context" when we begin reading something, and that is probably what enables our fast reading. Maybe there’s a way to give that background information to an LLM, but then we could also just have it read the entire input stream.
https://google.github.io/deepvariant/posts/2020-02-20-lookin...
Tabulating data similarly converts it into a visual form so that mistakes or outliers can be spotted.
Karpathy on DeepSeek-OCR paper: Are pixels better inputs to LLMs than text? - https://news.ycombinator.com/item?id=45658928 - Oct 2025 (172 comments)