I haven't released it as open source or anything; I'm just sharing it. I don't want to take questions about it so that I can focus on moving it forward.
Today's goal is to try to build self-healing agents that automatically fix the problems they encounter so that each problem only happens once, automating a manual process I already use successfully.
Perhaps, if that works out well, it will become something releasable in a real way, as opposed to a pastebin.
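For what it's worth, the loop I have in mind is roughly the shape below. This is a minimal Python sketch, not anything released: run_task and diagnose_and_fix stand in for LLM-backed calls you'd supply yourself, and the JSON lesson store is just one way to persist fixes so a problem only has to be solved once. Every name here is hypothetical.

```python
import json
from pathlib import Path

# Hypothetical lesson store: fixes persisted to disk so each problem recurs at most once.
LESSONS = Path("lessons.json")


def load_lessons() -> dict:
    return json.loads(LESSONS.read_text()) if LESSONS.exists() else {}


def save_lesson(signature: str, fix: str) -> None:
    lessons = load_lessons()
    lessons[signature] = fix
    LESSONS.write_text(json.dumps(lessons, indent=2))


def run_with_self_healing(task: str, run_task, diagnose_and_fix, max_retries: int = 3):
    """Run a task; on failure, diagnose the error, record the fix,
    and retry with all accumulated lessons in context."""
    lessons = load_lessons()
    for _ in range(max_retries):
        try:
            # run_task is assumed to accept prior lessons as extra context.
            return run_task(task, context=lessons)
        except Exception as err:
            signature = f"{type(err).__name__}: {err}"
            if signature in lessons:
                # Known problem: the recorded fix is already in context, just retry.
                continue
            # diagnose_and_fix might be an LLM call that proposes a remedy.
            fix = diagnose_and_fix(task, err)
            save_lesson(signature, fix)
            lessons = load_lessons()
    raise RuntimeError(f"Task still failing after {max_retries} attempts: {task}")
```

The point of keying lessons by an error signature and persisting them is that the agent pays the diagnosis cost once; every later run sees the fix in its context before the same failure can repeat.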
Maybe there’s something about reflection and self-reference that has to be “taught” to LLMs (or unblocked, if it’s already there somehow), and once it becomes a thing they will be “cognizant” of their own cognition the way humans feel they are. Or maybe the way we wire LLMs now simply doesn’t allow that. Who knows.
Of course humans are wired differently, but the point I’m trying to make is that it’s pattern recognition all the way down, for humans and LLMs alike.
(1) Inducing sustained self-reference through simple prompting consistently elicits structured subjective experience reports across model families.
(2) These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims.
(3) Structured descriptions of the self-referential state converge statistically across model families in ways not observed in any control condition.
(4) The induced state yields significantly richer introspection in downstream reasoning tasks where self-reflection is only indirectly afforded."
X thread from one of the authors: https://x.com/juddrosenblatt/status/1984336872362139686