js8 · 2 months ago
Not pixels, but percels. Pixels are points in the image, while a "percel" is a unit of perceptual information. It might be a pixel with an associated sound, at a given moment in time. In the case of humans, percels include other senses as well, and they can also be annotated with your own thoughts (i.e. percels can also include tokens or embeddings).

Of course, NNs like LLMs never process a percel in isolation; they always process a group of neighboring percels (aka context), with an initial focus on one of them.
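
To make the idea concrete, here's a minimal sketch of what a single percel could look like as a data structure. All field names and shapes are my own invention, not anything standard:

```python
from dataclasses import dataclass, field
from typing import Optional

import numpy as np


@dataclass
class Percel:
    """One unit of perceptual information: a pixel plus whatever other
    sensory channels and annotations were observed at the same moment."""
    rgb: np.ndarray                      # shape (3,), the visual channel
    sound: Optional[np.ndarray] = None   # short audio frame, or None if missing
    timestamp: float = 0.0               # when it was perceived
    annotations: list = field(default_factory=list)  # associated token ids / "thoughts"

    def channels_present(self) -> list:
        """Report which modalities this percel actually carries."""
        present = ["rgb"]
        if self.sound is not None:
            present.append("sound")
        if self.annotations:
            present.append("annotations")
        return present


# Example: a red pixel perceived together with a short audio frame at t = 1.25 s
p = Percel(rgb=np.array([255, 0, 0]), sound=np.zeros(160), timestamp=1.25)
print(p.channels_present())  # ['rgb', 'sound']
```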

almoehi · 2 months ago
I had written up a proposal for a research grant to work on basically exactly this idea.

It got reviewed by 2 ML scientists and one neuroscientist.

Got totally slammed (and thus rejected) by the ML scientists due to „lack of practical application“ and highly endorsed by the neuroscientist.

There’s so much unused potential in interdisciplinary research but nobody wants to fund it because it doesn’t „fit“ into one of the boxes.

behnamoh · 2 months ago
Make sure the ML scientists don't take credit for your work. Sometimes they reject a paper so they can work on it on their own.
Enginerrrd · 2 months ago
That's unfortunate. My personal sense is that while agentic LLMs are not going to get us close to AGI, a few relatively modest architectural changes to the underlying models might actually do that, and I do think mimicry of our own self-referential attention is a very important component of that.

While the current AI boom is a bubble, I actually think the AGI nut could get cracked quietly by a company with even modest resources if they get lucky with the right fundamental architectural changes.

shepardrtc · 2 months ago
Sounds like those ML "scientists" were actually just engineers.
falcor84 · 2 months ago
I love this idea, but can't find anything about it. Is this a neologism you just coined? If so, is there any particular paper or work that led you to think about it in those terms?
js8 · 2 months ago
Yes, I just coined the neologism. It was supposed to be partly sarcastic (why stay at pixels, why not just go fully multimodal and treat the missing channels as missing information?), so I am kind of surprised it got upvoted so much.

(IME, my comments that I think are deep often get ignored, while silly things, where I was thinking "this is too much trolling or too obvious", get upvoted; but don't take it the wrong way, I am flattered you like it.)

Workaccount2 · 2 months ago
Isn't this effectively what the latent space is? A bunch of related vectors that all bundle together?
js8 · 2 months ago
No, a latent space doesn't have to be made of percels, just like not every 2D array of 3-element vectors is an image made of pixels. Percels are tied to your sensors; they are the components of what you perceive, in totality.

Of course there is an interesting paradox - each layer of the NN doesn't know whether it's connected to the sensors directly, or what kind of abstractions it works with in the latent space. So the boundary between the mind and the sensor is blurred and to some extent a subjective choice.

taneq · 2 months ago
“Percel” is still a way cooler and arguably more descriptive term than “token” though.
causal · 2 months ago
This is an interesting thought. I'm trying to imagine how you'd represent that as a vector.

You still need to map percels to a latent space. But perhaps with some number of dimensions devoted to modes of perception? E.g. audio, visual, etc
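
One naive way to picture it (purely a sketch with made-up slice sizes, not anything from the paper): embed each modality separately, reserve a fixed slice of the vector for each, and zero out the slices for channels that are missing.

```python
import numpy as np

# Made-up slice sizes per modality; a real system would learn these embeddings.
DIMS = {"visual": 16, "audio": 8, "time": 2}


def embed_percel(visual=None, audio=None, time=None) -> np.ndarray:
    """Concatenate per-modality embeddings, zero-filling missing channels."""
    inputs = {"visual": visual, "audio": audio, "time": time}
    parts = []
    for name, dim in DIMS.items():
        value = inputs[name]
        if value is None:
            parts.append(np.zeros(dim))  # missing modality = missing information
        else:
            # np.resize is a stand-in for a learned per-modality encoder
            parts.append(np.resize(np.asarray(value, dtype=float), dim))
    return np.concatenate(parts)


v = embed_percel(visual=[0.2, 0.8, 0.1], time=[1.25])
print(v.shape)  # (26,) -> 16 visual dims + 8 zeroed audio dims + 2 time dims
```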

milanove · 2 months ago
I'm not an ML expert or practitioner, so someone might need to correct me.

I believe the percel's components, taken together as a whole, would capture the state of the audio+visual+time. However, I don't think the state of one particular mode (e.g. audio, visual, or time) is encoded in a specific subset of the percel's components. Rather, each component of the percel would represent a mixture (or a portion of a mixture) of the audio+video+time. So you couldn't isolate just the audio, visual, or time state by looking at some specific subset of the percel's components, because each component is itself a mixture of the audio+visual+time state.

I think the classic analogy is that if river 1 and river 2 combine to form river 3, you cannot take a cup of water from river 3 and separate out the portions from river 1 and river 2; they're irreversibly mixed.
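
A toy numpy illustration of that mixing (nothing from the paper, just random numbers): after one dense projection, every component of the resulting vector draws on every input modality, so no subset of components corresponds to "just audio" or "just visual".

```python
import numpy as np

rng = np.random.default_rng(0)

audio = rng.normal(size=4)      # toy per-modality features
visual = rng.normal(size=4)
time = rng.normal(size=2)

x = np.concatenate([audio, visual, time])   # 10 inputs; each slice is still "pure"
W = rng.normal(size=(8, 10))                # dense projection, like one NN layer
percel = W @ x                              # 8 components, each a weighted mix

# Every output component has nonzero weights on the audio slice (and likewise
# on the others), so you can't point at a subset and call it "the audio part".
print(np.abs(W[:, :4]).min(axis=1) > 0)     # [ True  True ... ] for all 8 rows
```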

BrokenCogs · 2 months ago
I was going to say toxel
causal · 2 months ago
Like a tokenized 3D voxel?

tcdent · 2 months ago
"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than a tokenizer.
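
A quick way to see the kind of arbitrariness being complained about, assuming you have OpenAI's tiktoken library installed (the exact splits depend on which encoding you pick):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["hello world", "Hello World", "hello  world", "HELLO WORLD"]:
    ids = enc.encode(text)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{text!r:>16} -> {len(ids)} tokens: {pieces}")

# Near-identical strings map to completely different token ids, and a change in
# casing or an extra space changes the segmentation. The model never sees
# characters, only these arbitrary chunks.
```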

dgently7 · 2 months ago
As a vision-capable person, I consume all text as images when I read, so it kinda passes the "evolution does it that way" test. Maybe we shouldn't be that surprised that vision is a great input method?

Actually, thinking more about that: I consume "text" as images and also as sounds… I kinda wonder, if instead of render-and-OCR like this suggests we did TTS and just encoded, say, the MP3 sample of the vocalization of the word, whether that would be fewer bytes than the rendered-pixels version… It probably depends on the resolution / sample rate.
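
Rough back-of-the-envelope comparison; every rate below is an assumed ballpark, just to show how much the answer depends on those choices:

```python
# Compare rough byte costs of one spoken word vs. one rendered word.
# All numbers are assumed ballparks, not measurements.

word_duration_s = 0.4          # a short English word at normal speaking pace
mp3_bitrate_bps = 64_000       # 64 kbit/s mono speech MP3
audio_bytes = word_duration_s * mp3_bitrate_bps / 8

chars_per_word = 5
px_per_char = 12 * 20          # a 12x20 glyph at modest resolution
bits_per_px = 1                # 1-bit black-and-white rendering
image_bytes = chars_per_word * px_per_char * bits_per_px / 8

print(f"audio ~ {audio_bytes:.0f} B, image ~ {image_bytes:.0f} B")
# With these assumptions: audio ~ 3200 B vs. image ~ 150 B, so rendered pixels
# win easily; crank up DPI, color depth, or audio quality and the gap shifts.
```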

visarga · 2 months ago
Funny, I habitually read while engaging TTS on the same text. I have even made a Chrome extension for web reading: it highlights text and reads it while keeping the current position in the viewport. I find using two modalities at the same time improves my concentration. The TTS is sped up to 1.5x to match my reading speed. Maybe it is just because I want to reduce visual strain; since I consume a lot of text every day, it can be tiring.
psadri · 2 months ago
The pixels-to-sound conversion would pass through "reading", so there might be information loss. It is no longer just pixels.
317070 · 2 months ago
There was the Byte Latent Transformer, aimed at ending the tokenizer, which seemingly went nowhere. https://ai.meta.com/research/publications/byte-latent-transf...
htrp · 2 months ago
The FAIR team is currently subject to TBD Labs politics.
Tarq0n · 2 months ago
Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.
jhanschoo · 2 months ago
There are some concerns here that should be addressed separately:

> Ok but what are you going to decode into at generation time, a jpeg of text?

Presumably, the output may still be in token space, but for the purpose of conditioning on context for the immediate next token, it must then be immediately translated into a suitable input space.

> we process text in many more ways than just reading it

Since a token stream is a straightforward function of textual input, in the case of textual input we should expect the conversion of the character stream into semantic/syntactic units to happen inside the LLM.

Moreover, in the case of OCR, graphical input preserves (and degrades) information in the way that humans expect; what comes to mind is the eggplant emoji's phallic symbolism, or smiling emoji sharing graphical similarities that can't be deduced from their proximity in Unicode codepoints.

samus · 2 months ago
The output really doesn't have to be the same datatype as the input. Text tokens are good enough for a lot of interesting applications, and transforming percels (a name suggested by another commenter here) into text tokens is exactly what an OCR model is trained to do anyway.
naasking · 2 months ago
Using pixels is still tokenizing. What's needed is something more like the "Byte Latent Transformer", which uses dynamically sized patches based on information content rather than tokens.
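
A toy sketch of what "patches sized by information content" means; this is not BLT's actual algorithm (which uses a small learned entropy model), just an order-0 byte-frequency stand-in:

```python
import math
from collections import Counter


def entropy_patches(data: bytes, budget_bits: float = 16.0) -> list:
    """Toy dynamic patching: close a patch once its accumulated surprise,
    under a simple byte-frequency model, exceeds a bit budget."""
    counts = Counter(data)
    total = len(data)
    surprise = {b: -math.log2(counts[b] / total) for b in counts}

    patches, current, bits = [], bytearray(), 0.0
    for b in data:
        current.append(b)
        bits += surprise[b]
        if bits >= budget_bits:
            patches.append(bytes(current))
            current, bits = bytearray(), 0.0
    if current:
        patches.append(bytes(current))
    return patches


text = b"aaaaaaaaaaaaaaaa the quick brown fox!"
for p in entropy_patches(text):
    print(p)
# Long runs of predictable bytes end up in big patches; surprising spans get
# chopped into small ones, so patch size tracks information content.
```
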
ReptileMan · 2 months ago
I guess it is because of the absurdly high information density of text - so text is quite a good input.
esafak · 2 months ago
I do not get it, either. How can a picture of text be better than the text itself? Why not take a picture of the screen while you're at it, so the model learns how cameras work?
jerojero · 2 months ago
In a very simple way: because the image can be fed directly into the network without first having to transform the text into a series of tokens as we do now.

But the tweet itself is kinda an answer to the question you're asking.

corysama · 2 months ago
From the paper I saw that the model includes an approximation of the layout, diagrams and other images of the source documents.

Now imagine growing up only allowed to read books and the internet through a browser with CSS, images and JavaScript disabled. You’d be missing out on a lot of context and side-channel information.

nl · 2 months ago
Karpathy's points are correct (of course).

One thing I like about text tokens, though, is that the model learns some understanding of the text input method (particularly the QWERTY keyboard).

"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.

This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
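
A minimal sketch of such a keyboard distance metric; the layout coordinates and the substitution-only cost are simplifications of my own, not any particular spelling corrector:

```python
# Toy "keyboard distance": substitution cost based on physical key proximity
# on a QWERTY layout. Coordinates are approximate row/column positions.

QWERTY = {}
for row, keys in enumerate(["qwertyuiop", "asdfghjkl", "zxcvbnm"]):
    for col, ch in enumerate(keys):
        QWERTY[ch] = (row, col + 0.5 * row)   # stagger each row slightly


def key_dist(a: str, b: str) -> float:
    """Euclidean distance between two keys on the layout."""
    (r1, c1), (r2, c2) = QWERTY[a], QWERTY[b]
    return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5


def typo_cost(w1: str, w2: str) -> float:
    """Sum of keyboard distances over substituted letters (same-length words only)."""
    return sum(key_dist(a, b) for a, b in zip(w1, w2) if a != b)


print(typo_cost("hello", "hwllo"))  # 'e' and 'w' are adjacent -> small cost (1.0)
print(typo_cost("hello", "hpllo"))  # 'e' and 'p' are far apart -> large cost (7.0)
```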

harperlee · 2 months ago
But assuming that pixel input gets us to an AI capable of reading, it would presumably also be able to detect HWLLO as semantically close to HELLO (similarly to H3LL0, or badly handwritten text, although in those latter examples there would be some graphical structure to help). At the end of the day we are capable of identifying that... It might require some more training effort, but the result would be more general.
swyx · 2 months ago
I'm particularly sympathetic to typo learning, which I think gets lost in the synthetic data discussion (mine here: https://www.youtube.com/watch?v=yXPPcBlcF8U ).

But I think in this case you can still generate typos in images and it'd be learnable, so it's not a hard issue for the OP's approach.

sabareesh · 2 months ago
It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.
ACCount37 · 2 months ago
People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.
typpilol · 2 months ago
It will require like 20x the compute
CuriouslyC · 2 months ago
Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
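
A toy illustration of growing the dictionary with common n-gram tokens, so each emitted token carries more text, as image patches do; the corpus and the frequency threshold are obviously made up:

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)                      # start from single-word tokens

bigrams = Counter(zip(corpus, corpus[1:]))
for (a, b), n in bigrams.items():
    if n >= 2:                           # promote only n-grams that repeat
        vocab.add(f"{a} {b}")

print(sorted(vocab))                     # 'the cat' is now a single token
# The catch the comment points at: every promoted n-gram enlarges the output
# softmax, and current LLMs don't handle huge output distributions gracefully.
```
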
yorwba · 2 months ago
You don't have to use the same token dictionary for input and output. There are things like simultaneously predicting multiple tokens ahead as an auxiliary loss and for speculative decoding, where the output is larger than the input; similarly, you could have a model where the input tokens combine multiple output tokens. You would still need to do a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller too, so it could still produce a decent speedup.

But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV-cache compression methods might deliver better speedups without degrading the output as much.

mark_l_watson · 2 months ago
Interesting idea! Haven’t heard that before.

orliesaurus · 2 months ago
One of the most interesting aspects of the recent discussion on this topic is how it underscores our reliance on lossy abstractions when representing language for machines. Tokenization is one such abstraction, but it's not the only one: using raw pixels or speech signals is a different kind of approximation. What excites me about experiments like this is not so much that we'll all be handing images to language models tomorrow, but that researchers are pressure-testing the design assumptions of current architectures. Approaches that learn to align multiple modalities might reveal better latent structures or training regimes, and that could trickle back into more efficient text encoders without throwing away a century of orthography.

But there's also a rich vein to mine in scripts and languages that don't segment neatly into words: alternative encodings might help models handle those better.
bob1029 · 2 months ago
I think the DCT is a compelling way to interact with spatial information when the channel is constrained. What works for JPEG can likely work elsewhere. The energy-compaction properties of the DCT mean you get most of the important information in a few coefficients; a quantizer can zero out everything else. Zigzag-scanned + RLE byte sequences could be a reasonable way to generate useful "tokens" from transformed image blocks. Take everything from the JPEG encoder except for perhaps the entropy coding step.

At some level you do need something approximating a token. BPE is very compelling for UTF-8 sequences; it might be nearly the most ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that: something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy are at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.

I'd extend this thinking to techniques like MPEG for video. All frame types use something like the DCT too. The P and B frames are basically the same idea as the I frame (JPEG); the difference is that they take the DCT of the residual between adjacent frames. This is where the compression gets insane with video. It's block transforms all the way down.

An 8x8 DCT block for one channel of SDR content is 512 bits of raw information. After quantization and RLE (at typical quality settings), we can get this down to 50-100 bits. I feel like this is an extremely reasonable grain to work with.
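
For the curious, here's roughly what that grain looks like in code: 8x8 block -> DCT -> quantize -> zigzag scan -> run-length "tokens". It's a sketch using scipy; the flat quantizer step of 16 stands in for JPEG's full perceptual quantization table.

```python
import numpy as np
from scipy.fft import dctn


def zigzag(block8x8: np.ndarray) -> np.ndarray:
    """Flatten an 8x8 block in JPEG zigzag order (low frequencies first)."""
    idx = sorted(((i, j) for i in range(8) for j in range(8)),
                 key=lambda p: (p[0] + p[1],
                                p[0] if (p[0] + p[1]) % 2 else p[1]))
    return np.array([block8x8[i, j] for i, j in idx])


def block_tokens(block: np.ndarray, q: int = 16) -> list:
    """Return (zero_run_length, value) pairs for one quantized 8x8 block."""
    coeffs = dctn(block - 128.0, norm="ortho")    # energy compaction
    quant = np.round(coeffs / q).astype(int)      # most coefficients become 0
    tokens, run = [], 0
    for v in zigzag(quant):
        if v == 0:
            run += 1
        else:
            tokens.append((run, int(v)))
            run = 0
    return tokens                                 # trailing zeros dropped, like JPEG's EOB


block = np.tile(np.linspace(0, 255, 8), (8, 1))   # a smooth horizontal gradient
print(block_tokens(block))                        # only a handful of (run, value) pairs survive
```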

jacquesm · 2 months ago
I can listen to music in my head. I don't think this is an extraordinary property but it is kind of neat. That hints at the fact that I somehow must have encoded this music. I can't imagine I'm storing the equivalent of a MIDI file, but I also can't imagine that I'm storing raw audio samples because there is just too much of it.

It seems to work for vocals as well, and not just short samples but entire works. At least that's what I think; there is a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts, and if I were a good enough musician I could replicate what I remember.

Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.

Another interesting thing is that it is possible to search through it fairly rapidly, to match a fragment I just heard to one that I've heard and stored before.

0xdeadbeefbabe · 2 months ago
> Is there anybody that has a handle on how we store auditory content in our memories?

It's so weird that I don't know this. It's like I'm stuck in userland.

akrymski · 2 months ago
Yes, DCT coefficients work even better than pixels:

https://www.uber.com/blog/neural-networks-jpeg/

a_bonobo · 2 months ago
Somewhat related:

There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.

https://academic.oup.com/mbe/article/36/2/220/5229930

dang · 2 months ago
Recent and related:

Getting DeepSeek-OCR working on an Nvidia Spark via brute force with Claude Code - https://news.ycombinator.com/item?id=45646559 - Oct 2025 (43 comments)

DeepSeek OCR - https://news.ycombinator.com/item?id=45640594 - Oct 2025 (238 comments)