Really interesting how this goes against my intuition. I would have imagined it's infinitely easier to analyze a camera stream of the real world, generate a polygonal representation of what you see (like you would for a videogame), and then make AI decisions on that geometry. Instead, the way AI is going, models skip all of that and work directly on pixel data. Understanding of 3D geometry, perspective, and physics is expected to emerge naturally from the training data.
Another instance of the Bitter Lesson: http://www.incompleteideas.net/IncIdeas/BitterLesson.html
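For concreteness, here's a minimal sketch (PyTorch, all names hypothetical, not any particular system's architecture) of what "working directly on pixel data" means: raw frames go in, a decision comes out, and there's no explicit mesh, depth map, or physics stage anywhere in between.

```python
# Minimal sketch of an end-to-end "pixels in, decision out" model.
# Hypothetical example: any 3D/perspective understanding has to live
# implicitly in the learned weights rather than in a geometry pipeline.
import torch
import torch.nn as nn

class PixelsToControl(nn.Module):
    def __init__(self, n_actions: int = 3):
        super().__init__()
        # Convolutional trunk consumes the raw RGB frame directly.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # Head maps the flattened features to action logits
        # (e.g. steering / throttle / brake in a driving setting).
        self.head = nn.LazyLinear(n_actions)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, 3, H, W) raw pixels -- no hand-built
        # polygonal scene representation in the loop.
        return self.head(self.encoder(frames))

model = PixelsToControl()
dummy_frame = torch.rand(1, 3, 96, 96)   # stand-in for one camera frame
print(model(dummy_frame).shape)          # -> torch.Size([1, 3])
```

The intuitive alternative would be a multi-stage pipeline (detect objects, reconstruct geometry, then plan on that reconstruction); the Bitter Lesson's point is that the single learned mapping above tends to win once enough data and compute are thrown at it.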