Though I suppose, given a few years, that may also be true!
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
For LLMs to score well on these benchmarks, they'd have to do more than recognize the original source: they'd have to know it cold. That makes this more a test of memorization than anything else. As with "The Illusion of Thinking", the paper measures a limitation that doesn't match what the authors claim and isn't nearly as exciting.
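For concreteness, here's a minimal sketch of what one of these benchmark items presumably looks like; the prompt wording, `make_item`, and the scoring rule are my assumptions, not the paper's actual harness:

```python
# Hypothetical reconstruction of an "omitted line" benchmark item.
STANZA = [
    "Rage, rage against the dying of the light.",
    "Wild men who caught and sang the sun in flight,",
    "And learn, too late, they grieved it on its way,",
    "Do not go gentle into that good night.",
]

def make_item(omit_index: int) -> dict:
    """Build one item by silently dropping a line from the passage."""
    passage = "\n".join(line for i, line in enumerate(STANZA) if i != omit_index)
    return {
        "prompt": f"Has a line been omitted from this passage?\n\n{passage}",
        "answer": STANZA[omit_index],
    }

def score(model_output: str, item: dict) -> bool:
    """Credit only exact recall of the dropped line, i.e. pure memorization."""
    return item["answer"].strip(" ,.").lower() in model_output.lower()

item = make_item(omit_index=2)
print(item["prompt"])
# With the "way" line gone, the remaining lines rhyme AAA (light/flight/night),
# so nothing about the passage looks incomplete unless you know the poem cold.
```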
Two things that stand out:
- The knowledge incorporation results (47% vs. 46.3% with GPT-4.1-generated data, both well above the small-model baseline) show the model really does discover better training formats, not just generate more data. That said, catastrophic forgetting remains unsolved, and it's not entirely clear whether data diversity actually improves.
- The computational overhead is brutal: 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The main limitation is the restriction to tasks with explicit evaluation metrics: you need ground-truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content, where you can generate evaluations, this could significantly improve how we process new information.
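To make the reward bottleneck concrete, here's roughly what one reward evaluation involves as I understand the setup. `StubLM`, `self_edit`, and `reward` are my placeholders (a toy model so the sketch runs end to end), not the paper's code; in the real method each call does an actual finetune plus generation, which is where the 30-45 seconds go:

```python
class StubLM:
    """Toy stand-in: 'finetuning' appends text to memory, 'generation'
    parrots memory back. A real LM would generalize, not parrot."""
    def __init__(self, memory: str = ""):
        self.memory = memory

    def generate(self, prompt: str) -> str:
        return self.memory + " " + prompt

    def finetune(self, text: str) -> "StubLM":
        # Stands in for a cheap adapter update (e.g. LoRA) on `text`.
        return StubLM(self.memory + " " + text)

def self_edit(model: StubLM, document: str) -> str:
    """Model restates the document in its preferred training format
    (implications, Q&A pairs, paraphrases, ...)."""
    return model.generate(f"Rewrite as training data: {document}")

def reward(model: StubLM, document: str, qa_pairs: list) -> float:
    """Reward = held-out QA accuracy AFTER finetuning on the self-edit.
    This is exactly why you need ground-truth Q&A pairs up front."""
    adapted = model.finetune(self_edit(model, document))
    hits = sum(ans.lower() in adapted.generate(q).lower() for q, ans in qa_pairs)
    return hits / len(qa_pairs)

doc = "The capital of Freedonia is Fredville."            # made-up fact
qa = [("What is the capital of Freedonia?", "Fredville")]
print(reward(StubLM(), doc, qa))                          # 1.0 for the toy model
```

The point of the sketch is the shape of the loop, not the numbers: every candidate self-edit you want to compare pays for its own finetune-then-quiz cycle.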
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
- neatly formatted lists with cute bolded titles (lower-casing this one just for that)
- ubiquitous subtitles like "Mental Health as Infrastructure" that only a committee would come up with
- emojis preceding every statement: "[sprout emoji] Every action and every word is a vote for who they are becoming"
- em-dash AND "it isn't X, it's Y", even in the same sentence: "Love isn't a feeling you wait to have—it's a series of actions you choose to take."
Could pick more, but I'll just say I'm 80% confident this is GPT-5 without thinking turned on.
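If you wanted to mechanize the sniff test, most of those tells are regex-able. A toy scorer below; the patterns are my own guesses, not any real detector:

```python
import re

# Toy scorer for the tells listed above. Patterns are guesses, not a real detector.
TELLS = {
    "bolded list titles":   re.compile(r"^\s*[-*]\s*\*\*[^*]+\*\*", re.M),
    "emoji-led statements": re.compile(r"^\s*[\U0001F300-\U0001FAFF]", re.M),
    "em-dash":              re.compile("\u2014"),
    "isn't X, it's Y":      re.compile(r"isn'?t [^.;]{1,60}?\u2014?\s*it'?s ", re.I),
}

def sniff(text: str) -> dict:
    """Count how many times each tell fires in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in TELLS.items()}

sample = "Love isn't a feeling you wait to have\u2014it's a series of actions you choose to take."
print(sniff(sample))  # the em-dash and "isn't X, it's Y" tells both fire
```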