Perhaps one difference is that a human could potentially get extremely good at textual tasks with nothing but text to learn from. You can read how to solve cryptic crosswords, see examples, and extrapolate. In that sense, language models have a somewhat complete training dataset. Yes, this requires an understanding of the material rather than just parroting, but the signal is there if you can separate it from the noise.
Driving a car requires an understanding of a much wider context, which is perhaps hard to acquire from driving data alone: rain, birds on the road, erratic drivers, balls rolling out from between parked cars, lane restrictions... You can't just throw petabytes of data at the problem. Training data is limited and expensive, and I believe we are mostly tackling AI-assisted driving with rule-based approaches.
I believe self-driving works just fine in simulations, where data is effectively unlimited. But then it doesn't generalise to the real world, where context matters.
-- If the choice is due to the complexity of Docker's LOC, as was mentioned, then that should be weighed against stability; as abstractions go, it is not leaky and is pretty stable.
-- Nowadays it takes only a few minutes to set up Docker or Podman, and a few more to configure a free CI system; then you have a good base to start from, grow, and later migrate to k8s or similar when your services grow.
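The few-minute setup described above might look like this minimal sketch, assuming a small Python web service (the file layout, port, and `app.py` entrypoint are hypothetical):

```dockerfile
# Minimal Dockerfile for a hypothetical Python web service.
FROM python:3.12-slim
WORKDIR /app
# Install dependencies first so this layer is cached between builds.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "app.py"]
```

Since Podman is CLI-compatible with Docker, the same file builds either way: `docker build -t myservice .` or `podman build -t myservice .` (the image tag `myservice` is just a placeholder). A free CI system can then run the same build command on every push.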