For text, with a two-byte tokenizer you get 2^16 (65,536) possible next tokens, and computing a probability distribution over them is very much doable. But the "possible next frames" in a video feed would already be an extremely large number. If one frame is 1 megabyte uncompressed (instead of just 2 bytes for a text token), there are 2^(8*2^20) possible next frames, which is far too large a number. So we somehow need to predict only an embedding of the next frame: an approximate representation of how it will look.
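To make the gap concrete, here's a quick back-of-the-envelope calculation (just illustrative arithmetic, nothing model-specific):

```python
import math

# Back-of-the-envelope comparison of output-space sizes.
text_vocab = 2 ** 16          # 65,536 possible next tokens with a 2-byte tokenizer
frame_bytes = 2 ** 20         # 1 MiB uncompressed frame
frame_bits = 8 * frame_bytes  # 8,388,608 bits per frame

# The number of possible frames is 2 ** frame_bits -- far too large to even print,
# so we only report how many decimal digits it has.
digits = math.floor(frame_bits * math.log10(2)) + 1
print(f"possible next tokens: {text_vocab:,}")
print(f"possible next frames: a number with {digits:,} decimal digits")
```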
Moreover, for robotics we don't want to just predict the next (approximate) frame of a video feed. We want to predict future sensory data more generally. That's arguably what animals do, including humans. We constantly anticipate, approximately, what will happen to us in the future, with the farther future predicted progressively less exactly. We are relatively sure of what happens in a second, but less and less sure of what happens in a minute, or a day, or a year.
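Very roughly, the idea is to predict in embedding space rather than over raw pixels. Here is a minimal sketch of that, assuming a toy encoder/predictor and made-up sizes (this is not LeCun's actual JEPA architecture, just the shape of the idea):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

EMB = 256  # assumed embedding size

encoder   = nn.Sequential(nn.Flatten(), nn.Linear(64 * 64 * 3, EMB))            # frame -> embedding
predictor = nn.Sequential(nn.Linear(EMB, EMB), nn.ReLU(), nn.Linear(EMB, EMB))  # embedding -> predicted next embedding

frame_t, frame_t1 = torch.randn(2, 8, 3, 64, 64)  # toy current/next frames, batch of 8

z_t  = encoder(frame_t)            # embed the current frame
z_t1 = encoder(frame_t1).detach()  # target embedding (stop-gradient, as in many JEPA-style setups)

# Predict the *embedding* of the next frame instead of a distribution over all possible raw frames.
loss = F.mse_loss(predictor(z_t), z_t1)
loss.backward()
```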
Are you saying that LLMs hold concepts in latent space (weights?), but the actual predictions are always in tokens (thus inefficient and lossy), whereas JEPA operates directly on concepts in latent space (plus encoders/decoders)?
I might be using the jargon incorrectly!
I don't really expect to see performance on par with the SOTA hosted models, but I'm mainly curious what you could do with local models that you couldn't do with hosted ones (or at least, wouldn't want to do for other reasons, like privacy).
One thing I've realized lately is that Gemini, and even Gemma, are really, really good at transcribing images; they're much better and more versatile than OCR models because they can describe the images as well as transcribe them. With the realization that Gemma, a model you can self-host, is good enough to be useful, I have been tempted to play around with doing this sort of task locally.
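If you want to try it, here's roughly what I have in mind, assuming a local Ollama server with a vision-capable Gemma 3 build pulled (the model tag and image file are placeholders):

```python
import base64
import requests

# Hypothetical input image; any photo of a document works.
with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "gemma3:27b",               # assumed local model tag
        "prompt": "Transcribe all text in this image, then briefly describe it.",
        "images": [image_b64],
        "stream": False,
    },
    timeout=300,
)
print(resp.json()["response"])
```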
But again, $2,000 tempted? Not really. I'd need to find other good uses for the machine than just dicking around.
In theory, Gemma 3 27B in BF16 would fit easily in system RAM on my primary desktop workstation, but I haven't given it a go to see how slow it is. I think you're mainly memory-bandwidth constrained on these CPUs, but I wouldn't be surprised if the full BF16 weights, or a relatively light quantization, give tolerable t/s.
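A crude way to guess the throughput, assuming generation is bandwidth-bound and every token streams roughly the full weight set from RAM once (the bandwidth figure and bytes-per-weight numbers below are my assumptions, not measurements):

```python
params = 27e9            # Gemma 3 27B
ram_bandwidth_gbs = 80   # assumed dual-channel DDR5 figure; adjust for your machine

# Approximate bytes per weight for a few common formats.
bytes_per_weight = {"BF16": 2.0, "Q8_0": 1.06, "Q4_K_M": 0.60}

for name, bpw in bytes_per_weight.items():
    weight_gb = params * bpw / 1e9
    print(f"{name:7s} ~{weight_gb:5.1f} GB of weights -> ~{ram_bandwidth_gbs / weight_gb:.1f} tok/s")
```

On those assumptions BF16 lands around 1-1.5 t/s, so a quantized build is probably what makes this tolerable on CPU.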
Then again, right now AI Studio gives you a generous amount of free usage at better t/s than you could hope to get locally. So... maybe it would make sense to wait until the free lunch ends, but I don't want to build anything interesting that relies on the cloud, because I dislike the privacy implications, even though everything I'm interested in doing is fully within the ToS.
I tried out Gemma 27B on LM Studio a few days ago, and I was completely blown away! It has a warmth and character (and smarts!) that I was not expecting in a tiny model. It just doesn't have tool use (although there are hacky workarounds), which would have made it even better. Qwen 3 with 30B parameters (3B active) seems to be nearly as capable, but also supports tool use.
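For reference, this is the shape of a native tool call against LM Studio's OpenAI-compatible local server (default http://localhost:1234/v1); whether it round-trips cleanly depends on the model and server build. Qwen 3 handles it, while vanilla Gemma 3 needs prompt-level workarounds. The model name and tool here are placeholders:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_uptime",  # hypothetical tool the model can request
        "description": "Return the machine's uptime.",
        "parameters": {"type": "object", "properties": {}},
    },
}]

resp = client.chat.completions.create(
    model="qwen3-30b-a3b",  # assumed local model name
    messages=[{"role": "user", "content": "How long has this box been up?"}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)  # the model's requested tool call, if any
```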
I'm currently in the process of vibe coding an agent network with LangGraph orchestration, using Gemma 27B / Qwen 3 30B-A3B with memory, context management and tool management. The Qwen model even uses a tiny 1.7B "draft" model for speculative decoding, which improves performance. On my 7800X3D with an RTX 4090 and 64 GB RAM, I get ~200-400 ms latency and 20-30 tokens/s, which is plenty fast.
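As a rough sketch of what one node of that network looks like (LangGraph's prebuilt ReAct agent pointed at the local OpenAI-compatible endpoint; the model name and tool are illustrative, and these APIs move fast):

```python
import shutil

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

@tool
def disk_usage(path: str) -> str:
    """Report disk usage for a path."""
    total, used, free = shutil.disk_usage(path)
    return f"{used / 1e9:.1f} GB used of {total / 1e9:.1f} GB ({free / 1e9:.1f} GB free)"

llm = ChatOpenAI(
    base_url="http://localhost:1234/v1",  # LM Studio / llama.cpp-style local endpoint
    api_key="not-needed",
    model="qwen3-30b-a3b",                # assumed local model name
)

agent = create_react_agent(llm, tools=[disk_usage])
result = agent.invoke({"messages": [("user", "How full is my root partition?")]})
print(result["messages"][-1].content)
```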
My thought process is that this local stack will let me use agents to their fullest in administering my machine. I always felt uneasy letting Claude Code, Gemini CLI or Codex operate outside my code folders. Yet their utility in helping me troubleshoot problems (I'm a recent Linux convert) was too attractive to ignore. Now I have the best of both worlds: privacy, and AI models helping with sysadmin. They're also great for quick "what options does kopia backup use?" type questions, for which I've set up a helper bound to a global hotkey.
Additionally, if one has a NAS with the *arr stack for downloading, say, perfectly legal Linux ISOs, such a private model would be far more suitable.
It's early days, but I'm excited about other use cases I might discover over time! It's a good time to be an AI enthusiast.