What is the justification for this? Is there a mathematical proof? To me, CoT seems like a hack to work around the severe limitations of current LLMs.
CoT _is,_ in my mind at least, a hack bolted onto LLMs to create some sort of loose approximation of reasoning. When I read the paper I expected to see a better hack, but could not find anything on how you take this architecture, interesting though it is, and put it to use in a way similar to CoT. The whole paper seems to make a wild pivot between the fully general biomimetic grandeur of its first half and the narrow effectiveness of its second.
The authors explicitly discuss the expressive power of transformers and CoT in the introduction. Fixed-depth transformers can only solve problems in a fairly restrictive complexity class (lower than PTIME!) - that limitation is one of the theoretical motivations for the new architecture.
"The fixed depth of standard Transformers places them in computational complexity classes such as AC0 [...]"
This architecture, by contrast, is recurrent, with inference time controlled by the model itself (a small Q-learning-based subnetwork decides when to halt as the model "thinks"), so it's not subject to that limitation.
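To make the halting idea concrete, here's a toy sketch of my own (the function names and numbers are made up for illustration, not taken from the paper): a recurrent update loop where a tiny "Q-head" scores halt vs. continue at each step, so the network itself picks its inference depth instead of having a fixed one.

```python
def step(state):
    # toy recurrent update: nudge the state toward a fixed point at 1.0
    # (stand-in for the real learned recurrent cell)
    return state + 0.5 * (1.0 - state)

def q_head(state):
    # toy value estimates: halting looks better once the state has converged
    q_halt = state            # stand-in for Q(halt | state)
    q_continue = 1.0 - state  # stand-in for Q(continue | state)
    return q_halt, q_continue

def run(state=0.0, max_steps=16):
    # iterate the cell until the Q-head prefers halting, up to a step budget
    for t in range(1, max_steps + 1):
        state = step(state)
        q_halt, q_continue = q_head(state)
        if q_halt > q_continue:  # the model chooses its own inference depth
            return state, t
    return state, max_steps
```

The point is just that compute per input is a model decision, not an architectural constant - which is exactly what a fixed-depth transformer can't do.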
The main meat of the paper is describing how to train this architecture efficiently, since that has historically been the hard part with recurrent nets.
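For anyone wondering why that's the hard part: exact backprop-through-time (BPTT) has to keep every intermediate state around, so memory grows with the number of recurrent steps. A common workaround is truncating the backward pass - treating earlier states as constants. This scalar toy of mine shows the difference (it's the generic trick, not the paper's exact scheme):

```python
def unroll(w, x, s0, T):
    # run the recurrence s_{t+1} = w * s_t + x for T steps, keeping all states
    states = [s0]
    for _ in range(T):
        states.append(w * states[-1] + x)
    return states

def bptt_grad(w, x, s0, T):
    # exact d(s_T)/dw: chain rule through every step, so we need all T states
    states = unroll(w, x, s0, T)
    return sum(w ** (T - 1 - t) * states[t] for t in range(T))

def one_step_grad(w, x, s0, T):
    # truncated d(s_T)/dw: treat s_{T-1} as a constant ("detached"),
    # backprop only through the final step -> O(1) memory instead of O(T)
    states = unroll(w, x, s0, T)
    return states[T - 1]
```

The approximation is biased, but it decouples training cost from how long the model "thinks", which is what makes deep recurrence practical to train.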
Don't get me wrong, this is a cool development, and I would love to see how this architecture behaves on a constraint-based problem that isn't easily tractable via traditional algorithms.