bikeshaving · 4 months ago
Does this mean we’ll finally get empirical proof for the aphorism “a picture is worth a thousand words”?

https://en.wikipedia.org/wiki/A_picture_is_worth_a_thousand_...

heltale · 4 months ago
I suppose it’s only worth 256 words at a time right now. ;)

https://arxiv.org/abs/2010.11929
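For context (my reading of the ViT paper linked above): the encoder splits an image into fixed-size patches and treats each patch as one token, so the token count is just (H/P) × (W/P). A quick sketch:

```python
def vit_token_count(height: int, width: int, patch: int = 16) -> int:
    """Number of patch tokens a ViT-style encoder produces for an image.

    Assumes the image divides evenly into patch×patch tiles, as with the
    ViT paper's 16×16 patches (the extra [CLS] token is not counted).
    """
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    return (height // patch) * (width // patch)

print(vit_token_count(256, 256))  # 256 patches -> the "256 words" above
print(vit_token_count(224, 224))  # 196 patches for the paper's 224×224 input
```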

estebarb · 4 months ago
The CALM paper https://shaochenze.github.io/blog/2025/CALM/ says it is possible to compress 4 tokens into a single embedding, so: image = 4 × 256 = 1024 words > 1000 words. QED
floodfx · 4 months ago
Why are completion tokens higher with image prompts when the text output was about the same?
cma · 4 months ago
Some multimodal models may have a hidden captioning step that consumes completion tokens; others work on a fully native representation, and some do both, I think.
Garlef · 4 months ago
"Thinking" Mode
nunodonato · 4 months ago
it doesn't say that anywhere.


ashed96 · 4 months ago
In my experience, LLMs tend to take noticeably longer to process images than text.
weird-eye-issue · 4 months ago
It has to get the image data first, basically just IO time before processing it
ashed96 · 4 months ago
IIRC there's pre-processing (embedding/tokenization?) before feeding images to LLMs?

Hit this issue while optimizing LLM request times. Ended up lowering image resolution; lost some accuracy, but that was bearable.
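A rough sketch of that trade-off (hypothetical patch-based tokenizer with 16×16 patches, which is an assumption, not something the providers document identically): downscaling while preserving aspect ratio cuts the token count roughly quadratically with the scale factor.

```python
import math

def downscale_dims(width: int, height: int, max_side: int) -> tuple[int, int]:
    """Shrink (width, height) so the longer side is at most max_side,
    preserving aspect ratio. No-op if the image is already small enough."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def approx_patch_tokens(width: int, height: int, patch: int = 16) -> int:
    """Approximate token count for a patch-based encoder,
    rounding partial edge patches up to a full patch."""
    return math.ceil(width / patch) * math.ceil(height / patch)

w, h = downscale_dims(2048, 1536, max_side=1024)      # -> (1024, 768)
print(approx_patch_tokens(2048, 1536))  # tokens at full resolution
print(approx_patch_tokens(w, h))        # ~4x fewer after halving each side
```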

psadri · 4 months ago
I wonder if these stay in the prefix cache?