I've tried to repaint the exterior of my house more than 20 times with very detailed prompts. I even tried optimizing the prompt with Claude. No matter what, every time it added one, two, or three extra windows to the same wall.
I am personally impressed by the continued improvement on ARC-AGI-2, where Gemini 3 scored 31.1% (vs ChatGPT 5.1's 17.6%). To me this is the kind of problem that does not lend itself well to LLMs - many of the puzzles test things humans intuit thanks to millions of years of evolution, but those concepts do not necessarily appear in written form (and when they do, it's not clear how they connect to specific ARC puzzles).
The fact that these models keep getting better at this task, given how they are trained, is mind-boggling to me.
Very interesting to hear two technologists at a tech business conference say things along the lines of "our tools do not merely extend us, they transform us", followed by "we've become numb to the devastating consequences of technology".
(I know I'm reading somewhat selectively, but still.)
Interesting, because games are exactly the kind of RL environment that models can learn effectively - but the catch is that they must do this learning on the fly, at test time. Very exciting to see this.
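To make "learning on the fly" concrete, here is a minimal sketch of what test-time adaptation in a game could look like: tabular Q-learning on a toy 1D grid game the agent has never seen before. Everything here (the game, the update rule, the hyperparameters) is my own illustration of the general idea, not anything from the announcement.

```python
import random

# Hypothetical toy game: walk a 1D grid from position 0 to GOAL.
GOAL, ACTIONS = 4, (-1, +1)

def q_learn(episodes=200, eps=0.2, alpha=0.5, gamma=0.9):
    q = {}  # (state, action) -> value, built up purely from the agent's own play
    for _ in range(episodes):
        pos = 0
        for _ in range(20):  # cap episode length
            # Epsilon-greedy: mostly exploit current estimates, sometimes explore
            if random.random() < eps:
                a = random.choice(ACTIONS)
            else:
                a = max(ACTIONS, key=lambda act: q.get((pos, act), 0.0))
            nxt = max(0, min(GOAL, pos + a))
            r = 1.0 if nxt == GOAL else -0.01  # small step cost, reward at goal
            # Standard Q-learning update toward the bootstrapped target
            target = r + gamma * max(q.get((nxt, b), 0.0) for b in ACTIONS)
            q[(pos, a)] = q.get((pos, a), 0.0) + alpha * (target - q.get((pos, a), 0.0))
            pos = nxt
            if pos == GOAL:
                break
    return q

q = q_learn()
# After learning on the fly, the greedy policy heads right from every state.
print([max(ACTIONS, key=lambda act: q.get((s, act), 0.0)) for s in range(GOAL)])
```

The point of the sketch is the shape of the loop, not the algorithm: all the learning happens during play in a previously unseen environment, which is what makes games such a natural fit for this.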