I believe what they are trying to show in that paper is that as the chain of operations grows long (their proxy for complexity), an LLM will inevitably fail. Humans don't have infinite context either, but they can still solve the Tower of Hanoi without needing to resort to pen and paper or coding.
32767 moves in a single prompt. That's not testing reasoning; it's testing whether the model can emit a huge structured output without error, under a context-window limit.
The authors then treat failure to reproduce this entire sequence as evidence that the model can't reason. But that’s like saying a calculator is broken because its printer jammed halfway through printing all prime numbers under 10000.
For me, o3 returning Python code isn't a failure. It's a smart shortcut. The failure is in the benchmark design. This benchmark just smells.
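For context, 32767 is exactly 2^15 − 1, the minimum number of moves for 15 disks, and the whole sequence falls out of a few lines of Python. A minimal sketch (the `hanoi` function and peg names are my own, not necessarily what o3 emitted):

```python
def hanoi(n, src="A", aux="B", dst="C"):
    """Return the optimal move list for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    # Move n-1 disks out of the way, move the largest, then stack the rest on top.
    return (hanoi(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi(n - 1, aux, src, dst))

moves = hanoi(15)
print(len(moves))  # 2**15 - 1 = 32767
```

The point being: generating the sequence is trivially mechanical, so reproducing all 32767 moves token by token measures output fidelity, not reasoning.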
Odd; the Commodore Datasette is about as reliable as a microcomputer tape storage system can be, far more so than the tin cans-on-a-string designs of Sinclair and TRS-80. Did you attempt to use a regular cassette recorder with a third-party adapter?
This is just like everything else in China. They will find ways to drive costs down below what anyone previously imagined, subsidised or not. And even just competing among themselves (DeepSeek vs ERNIE) and open-sourcing the models leaves very little to no room for most others.
Both the DRAM and NAND businesses of Samsung and Micron may soon be gone; I thought this was going to happen sooner, but it seems to finally be happening. GPU and CPU designs are already in the pipeline with RISC-V, IMG, and ARM China. OLED is catching up; LCD has already been taken over. Batteries we know about. The only thing left is foundries.
Huawei may release its own open-source PC OS soon. We are slowly but surely witnessing the collapse of the Western tech scene.