The whole reason for LLMs inferencing human-processable text, and "world models" inferencing human-interactive video, is precisely so that humans can connect in and debug the thing.
I think the purpose of Genie is to be a video game, but it's a video game for AI researchers developing AIs.
I do agree that the entertainment implications are kind of the research exhaust of the end goal.
I think robots imagining the next step (in latent space) will be useful. It’s useful for people. A great way to validate that a robot is properly imagining the future is to make that latent space renderable in pixels.
[1] “By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.”
Ive since gone on an unsubscribe campaign, and things seem bearable now.
So what's the way forward for blind screen reader users? Sadly, I don't know.
Modern text to speech research has little overlap with our requirements. Using Eloquence [32-bit voice last compiled in 2003], the system that many blind people find best, is becoming increasingly untenable. ESpeak uses an odd architecture originally designed for computers in 1995, and has few maintainers. Blastbay Studios [...] is a closed-source product with a single maintainer, that also suffers from a lack of pronunciation accuracy.
In an ideal world, someone would re-implement Eloquence as a set of open source libraries. However, doing so would require expertise in linguistics, digital signal processing, and audiology, as well as excellent programming abilities. My suspicion is that modernizing the text to speech stack that is preferred by blind power-users is an effort that would require several million dollars of funding at minimum.
Instead, we'll probably wind up having to settle for text to speech voices that are "good enough", while being nowhere near as fast and efficient [800 to 900 words per minute] as what we have currently.
I found some sample audio from Eloquence. I like this type of voice!
The world is more than just text. We can never run out of pixels if we point cameras at the real world and move them around.
I work in robotics and I don’t think people talking about this stuff appreciate that text and internet pictures is just the beginning. Robotics is poised to generate and consume TONS of data from the real world, not just the internet.