> Genie 3’s consistency is an emergent capability
So this just happened from scaling the model, rather than being a consequence of deliberate architecture changes?
Edit: here is some commentary on limitations from someone who tried it: https://x.com/tejasdkulkarni/status/1952737669894574264
> - Physics is still hard and there are obvious failure cases when I tried the classical intuitive physics experiments from psychology (tower of blocks).
> - Social and multi-agent interactions are tricky to handle. 1vs1 combat games do not work.
> - Long instruction following and simple combinatorial game logic fail (e.g. collect some points / keys, etc., go to the door, unlock, and so on)
> - Action space is limited
> - It is far from being a real game engine and has a long way to go, but this is a clear glimpse into the future.
Even with these limitations, this is still bonkers. It suggests to me that world models may have a bigger part to play in robotics and real-world AI than I realized. Future robots may learn in their dreams...
Reality is not composed of words, syntax, and semantics. A human modality is.
Other human modalities are sensory only, no language.
So vision learning and energy models that capture the energy needed to achieve a visual, audio, or physical robotic behavior are the only real goal.
Software is for those who read the manual with their new NES game. Where are the words inside us?
It is the statistical physics of energy, making a machine draw the glyphs of language, not opinionated clustering of language, that will close the keyboard and mouse input loop. We're basically replicating human work habits. Those are real physical behaviors, not just descriptions in words.