Though I suppose, given a few years, that may also be true!
"Rage, rage against the dying of the light.
Wild men who caught and sang the sun in flight,
[And learn, too late, they grieved it on its way,]
Do not go gentle into that good night."
For anyone who hasn't memorized Dylan Thomas, why would it be obvious that a line had been omitted? A rhyme scheme of AAA is at least as plausible as AABA.
For LLMs to score well on these benchmarks, they'd have to do more than recognize the original source: they'd have to know it cold. That makes this more a test of memorization than anything else. As with "The Illusion of Thinking", the paper measures a limitation that doesn't match what the authors claim and isn't nearly as exciting.
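For concreteness, here's a minimal sketch of what one of these benchmark items presumably looks like; the prompt wording, `make_item`, and the scoring rule are my assumptions, not the paper's actual harness:

```python
# Hypothetical reconstruction of an "omitted line" benchmark item.
STANZA = [
    "Rage, rage against the dying of the light.",
    "Wild men who caught and sang the sun in flight,",
    "And learn, too late, they grieved it on its way,",
    "Do not go gentle into that good night.",
]

def make_item(omit_index: int) -> dict:
    """Build one item by silently dropping a line from the passage."""
    passage = "\n".join(line for i, line in enumerate(STANZA) if i != omit_index)
    return {
        "prompt": f"Has a line been omitted from this passage?\n\n{passage}",
        "answer": STANZA[omit_index],
    }

def score(model_output: str, item: dict) -> bool:
    """Credit only exact recall of the dropped line, i.e. pure memorization."""
    return item["answer"].strip(" ,.").lower() in model_output.lower()

item = make_item(omit_index=2)
print(item["prompt"])
# With the "way" line gone, the remaining lines rhyme AAA (light/flight/night),
# so nothing about the passage looks incomplete unless you know the poem cold.
```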
Two things that stand out:
- The knowledge incorporation results (47% vs. 46.3% with GPT-4.1-generated data, both well above the small-model baseline) show the model really does discover better training formats, not just generate more data. That said, catastrophic forgetting remains unsolved, and it's not entirely clear whether data diversity actually improves.
- The computational overhead is brutal: 30-45 seconds per reward evaluation makes this impractical for most use cases. But for high-value document processing where you really need optimal retention, it could be worth it.
The main limitation is the restriction to tasks with explicit evaluation metrics: you need ground-truth Q&A pairs or test cases to compute rewards. Still, for domains like technical documentation or educational content, where you can generate evaluations, this could significantly improve how we process new information.
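To make the reward bottleneck concrete, here's roughly what one reward evaluation involves as I understand the setup. `StubLM`, `self_edit`, and `reward` are my placeholders (a toy model so the sketch runs end to end), not the paper's code; in the real method each call does an actual finetune plus generation, which is where the 30-45 seconds go:

```python
class StubLM:
    """Toy stand-in: 'finetuning' appends text to memory, 'generation'
    parrots memory back. A real LM would generalize, not parrot."""
    def __init__(self, memory: str = ""):
        self.memory = memory

    def generate(self, prompt: str) -> str:
        return self.memory + " " + prompt

    def finetune(self, text: str) -> "StubLM":
        # Stands in for a cheap adapter update (e.g. LoRA) on `text`.
        return StubLM(self.memory + " " + text)

def self_edit(model: StubLM, document: str) -> str:
    """Model restates the document in its preferred training format
    (implications, Q&A pairs, paraphrases, ...)."""
    return model.generate(f"Rewrite as training data: {document}")

def reward(model: StubLM, document: str, qa_pairs: list) -> float:
    """Reward = held-out QA accuracy AFTER finetuning on the self-edit.
    This is exactly why you need ground-truth Q&A pairs up front."""
    adapted = model.finetune(self_edit(model, document))
    hits = sum(ans.lower() in adapted.generate(q).lower() for q, ans in qa_pairs)
    return hits / len(qa_pairs)

doc = "The capital of Freedonia is Fredville."            # made-up fact
qa = [("What is the capital of Freedonia?", "Fredville")]
print(reward(StubLM(), doc, qa))                          # 1.0 for the toy model
```

The point of the sketch is the shape of the loop, not the numbers: every candidate self-edit you want to compare pays for its own finetune-then-quiz cycle.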
Feels like an important step toward models that can adapt their own learning strategies, even if we're not quite at the "continuously self-improving agent" stage yet.
- neatly formatted lists with cute bolded titles (lower-casing this one just for that)
- ubiquitous subtitles like "Mental Health as Infrastructure" that only a committee would come up with
- emojis preceding every statement: "[sprout emoji] Every action and every word is a vote for who they are becoming"
- em-dash AND "it isn't X, it's Y", even in the same sentence: "Love isn't a feeling you wait to have—it's a series of actions you choose to take."
Could pick more, but I'll just say I'm 80% confident this is GPT-5 without thinking turned on.
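If you wanted to mechanize the sniff test, most of those tells are regex-able. A toy scorer below; the patterns are my own guesses, not any real detector:

```python
import re

# Toy scorer for the tells listed above. Patterns are guesses, not a real detector.
TELLS = {
    "bolded list titles":   re.compile(r"^\s*[-*]\s*\*\*[^*]+\*\*", re.M),
    "emoji-led statements": re.compile(r"^\s*[\U0001F300-\U0001FAFF]", re.M),
    "em-dash":              re.compile("\u2014"),
    "isn't X, it's Y":      re.compile(r"isn'?t [^.;]{1,60}?\u2014?\s*it'?s ", re.I),
}

def sniff(text: str) -> dict:
    """Count how many times each tell fires in the text."""
    return {name: len(pattern.findall(text)) for name, pattern in TELLS.items()}

sample = "Love isn't a feeling you wait to have\u2014it's a series of actions you choose to take."
print(sniff(sample))  # the em-dash and "isn't X, it's Y" tells both fire
```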