What if we took a much larger language model and far more invasive brain scans?
We train it on what the person is thinking and doing, and on their senses too: a pressure-sensitive suit for touch, cameras in glasses for sight, a mic for sound, voice recognition for what they're saying, and in-depth mo-cap for what they're doing. We can now train the model on a good chunk of the actions a human could take, and a decent chunk of their sensorium.
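Roughly what I'm picturing for the training data, as a PyTorch sketch. Everything here is hypothetical: the field names, shapes, and the idea that each recording session gets resampled onto a shared timeline of T steps.

```python
import torch
from torch.utils.data import Dataset

class SessionDataset(Dataset):
    """One item = a short window of time-aligned sensor streams plus the brain scan.

    All field names and shapes are made-up placeholders for whatever the suit,
    glasses, mic, mo-cap rig, and scanner actually produce.
    """

    def __init__(self, windows):
        # `windows` is a list of dicts, one per time window, prepared offline.
        self.windows = windows

    def __len__(self):
        return len(self.windows)

    def __getitem__(self, idx):
        w = self.windows[idx]
        return {
            "touch":  torch.as_tensor(w["touch"],  dtype=torch.float32),  # (T, n_pressure_sensors)
            "vision": torch.as_tensor(w["vision"], dtype=torch.float32),  # (T, C, H, W) camera frames
            "audio":  torch.as_tensor(w["audio"],  dtype=torch.float32),  # (T, n_mel_bins)
            "speech": torch.as_tensor(w["speech"], dtype=torch.long),     # (T_tok,) transcript token ids
            "mocap":  torch.as_tensor(w["mocap"],  dtype=torch.float32),  # (T, n_joints, 3)
            "brain":  torch.as_tensor(w["brain"],  dtype=torch.float32),  # (T, n_voxels)
        }
```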
Now we take a lesson from diffusion models and apply noise to the brain scan data until the AI can do a decent job simulating a human on its own.
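Something like the standard forward-diffusion noising step, just pointed at the scan channel instead of images. A toy sketch, assuming the scans are already flattened into voxel vectors; the linear beta schedule and shapes are assumptions.

```python
import torch

def add_noise(brain, t, num_steps=1000):
    """Forward diffusion step: blend the clean brain-scan window with Gaussian noise.

    brain: (batch, n_voxels) clean scan window
    t:     (batch,) integer noise level in [0, num_steps)
    """
    betas = torch.linspace(1e-4, 0.02, num_steps)       # simple linear schedule (assumed)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)       # cumulative signal retention
    a = alpha_bar[t].unsqueeze(-1)                      # (batch, 1)
    noise = torch.randn_like(brain)
    noisy = a.sqrt() * brain + (1.0 - a).sqrt() * noise
    return noisy, noise

# Training idea: the model sees the noisy scan plus the other sensor streams and
# predicts the injected noise; as t grows it has to rely less and less on the scan,
# until it can simulate the person without one.
```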
Is a model good enough at that an AGI? It still has no memory, but could we repurpose what is now noise but used to be brain scans to retrofit memory back into this model? Maybe vector retrieval over the model's old attention outputs? Could the mo-cap data be fine-tuned onto a robot body? The voice data onto speech synthesis?
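The retrofitted memory could be as dumb as a cache of pooled attention outputs with nearest-neighbour lookup. A sketch of that idea only; the class, its methods, and the cosine-similarity choice are all mine, not anything established.

```python
import torch

class AttentionMemory:
    """Hypothetical retrofitted memory: cache the model's past attention outputs
    and retrieve the most similar ones by cosine similarity at inference time."""

    def __init__(self):
        self.keys = []     # one normalized (d,) vector per stored moment
        self.values = []   # the raw vector we want back when that moment is recalled

    def store(self, attn_out):
        # attn_out: (d,) pooled attention output for one time step
        v = attn_out.detach()
        self.keys.append(v / v.norm().clamp_min(1e-8))
        self.values.append(v)

    def retrieve(self, query, k=4):
        # query: (d,) current attention output; returns the k most similar past outputs
        if not self.keys:
            return torch.empty(0, query.shape[-1])
        keys = torch.stack(self.keys)                    # (n, d)
        q = query.detach() / query.norm().clamp_min(1e-8)
        scores = keys @ q                                # cosine similarity against the cache
        top = scores.topk(min(k, len(self.values))).indices
        return torch.stack([self.values[i] for i in top])
```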
You're just going to end up with an insane amount of error handling only to discover that in the real world, there's likely nothing you can really do anyway.
Using memory that's been allocated but not committed seems like a recipe for disaster.