So I guess my question is where is the juice being squeezed from, why does the vision token representation end up being more efficient than text tokens.
So I guess my question is where is the juice being squeezed from, why does the vision token representation end up being more efficient than text tokens.
This all turned out to be mostly irrelevant in my subsequent programming career.
Then LLMs came along and I wanted to learn how they work. Suddenly the physics training is directly useful again! Backprop is one big tensor calculus calculation, minimizing… entropy! Everything is matrix multiplications. Things are actually differentiable, unlike most of the rest of computer science.
It’s fun using this stuff again. All but the tensor calculus on curved spacetime, I haven’t had to reach for that yet.
Deleted Comment
Deleted Comment
Why wouldn't it be? If the world is ingressed via video sensors and lidar sensor, what's the hangup in recording such input and then replaying it faster?
Ie, the skills aren't particularly complicated in principle, but the conditions needed to acquire them aren't widely available, so the pool of people with the skills is limited.
I wonder if the comparison is actually original.
The sorts of useful analogies I was mostly talking about are those that appear in scientific research involving actionable technical details. Eg, diffusion models came about when folks with a background in statistical physics saw some connections between the math for variational autoencoders and the math for non-equilibrium thermodynamics. Guided by this connection, they decided to train models to generate data by learning to invert a diffusion process that gradually transforms complexly structured data into a much simpler distribution -- in this case, a basic multidimensional Gaussian.
I feel like these sorts of technical analogies are harder to stumble on than more common "linguistic" analogies. The latter can be useful tools for thinking, but tend to require some post-hoc interpretation and hand waving before they produce any actionable insight. The former are more direct bridges between domains that allow direct transfer of knowledge about one class of problems to another.
I have to disagree because the distinction between "superficial similarities" and genuinely "useful" analogies is pretty clearly one of degree. Spend enough time and effort asking even a low-intelligence AI about "dumb" similarities, and it'll eventually hit a new and perhaps "useful" analogy simply as a matter of luck. This becomes even easier if you can provide the AI with a lot of "context" input, which is something that models have been improving at. But either way it's not superintelligent or superhuman, just part of the general 'wild' weirdness of AI's as a whole.
I think you're basically agreeing with me. Ie, current models are not superintelligent. Even though they can "think" super fast, they don't pass a minimum bar of producing novel and useful connections between domains without significant human intervention. And, our evaluation of their abilities is clouded by the way in which their intelligence differs from our own.
This sort of "dynamic chunking" of low-level information, perhaps down to the level of raw bytes, into shorter sequences of meta tokens for input to some big sequence processing model is an active area of research. Eg, one neat paper exploring this direction is: "Dynamic Chunking for End-to-End Hierarchical Sequence Modeling" [1], from one of the main guys behind Mamba and other major advances in state-space models.
[1] - https://arxiv.org/abs/2507.07955