The softmax probabilities themselves are usually not a very good indication of confidence, as the model is often overconfident due to neural collapse. The uncertainty in the softmax distribution (its entropy or spread) is a useful signal though, and can be used to detect out-of-distribution inputs or poor predictions.
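For what it's worth, here's a minimal sketch of what I mean, assuming plain numpy: compute the entropy of the softmax distribution from raw logits and flag high-entropy inputs. The example logits and the cutoff are placeholders you'd tune on held-out data, not anything specific to a real model.

    import numpy as np

    def softmax(logits):
        """Numerically stable softmax over the last axis."""
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def predictive_entropy(logits):
        """Entropy of the softmax distribution; higher means more uncertain."""
        p = softmax(logits)
        return -(p * np.log(p + 1e-12)).sum(axis=-1)

    # Toy example: a near-one-hot (confident) prediction vs. a flat (uncertain) one.
    confident = np.array([9.0, 0.5, 0.1])   # low entropy
    uncertain = np.array([1.1, 1.0, 0.9])   # high entropy, close to uniform

    for name, logits in [("confident", confident), ("uncertain", uncertain)]:
        print(name, predictive_entropy(logits))

    # A simple OOD / poor-prediction flag: threshold the entropy.
    # The cutoff below is arbitrary and would be tuned on held-out data.
    ENTROPY_CUTOFF = 0.8
    print("flag uncertain input:", predictive_entropy(uncertain) > ENTROPY_CUTOFF)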
They are at their most useful when it is cheaper to verify their output than it is to generate it yourself. That's why code works reasonably well: you can run it. But once validation becomes more expensive than doing it yourself, be it code or otherwise, their usefulness drops off significantly.
I've thought for a while that ensembling approaches would become the next stage of LLM development after CoT, since it provides yet another effective, independent axis for scaling laws. Great to see that perspective is taking off. The open weight community has an opportunity to take these ideas and run with them better than OpenAI has.
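To make the ensembling point concrete, here's a minimal sketch of self-consistency-style majority voting over sampled completions. The `sample_answer` callable is a hypothetical stand-in for whatever model call you'd actually make (sampled with temperature > 0), not any specific API:

    import random
    from collections import Counter
    from typing import Callable

    def majority_vote(sample_answer: Callable[[str], str], prompt: str, n: int = 16) -> str:
        """Self-consistency-style ensembling: sample n independent answers
        and return the most common final answer."""
        answers = [sample_answer(prompt) for _ in range(n)]
        best, _count = Counter(answers).most_common(1)[0]
        return best

    # Hypothetical usage with a fake sampler standing in for an LLM call.
    def fake_sampler(prompt: str) -> str:
        return random.choice(["42", "42", "42", "41"])

    print(majority_vote(fake_sampler, "What is 6 * 7?", n=11))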
Believe it or not, the US enjoys a special status as a prominent place to study and contribute knowledge. I'm talking about students doing fundamental, valuable, possibly even patentable research. The best talent flows through here, and domestic students get to benefit from that. Because of these "mistakes", it is already the case that good talent is choosing not to come to the US. Certain other nations will benefit instead.
If you want a US for existing US citizens alone (immigration "the right way" is rapidly deteriorating), then that is what is happening. One way or another, we'll witness what the US becomes.
I know what you're saying here, and I know it is primarily a critique of my phrasing, but establishing something like this is the objective of in-context learning theory and the mathematical analysis of deep learning. It is possible to prove that sufficiently well-trained models will generalize to certain unseen classes of patterns, e.g. a transformer acting like gradient descent in context. There is still a long way to go in the theory; it is difficult research!
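As a toy illustration of that transformer-as-gradient-descent result (in the spirit of von Oswald et al.), here's a small numpy check, under arbitrary dimensions and step size of my choosing, that unnormalized linear attention over (x, y) context pairs matches the prediction from one gradient-descent step on in-context least squares starting from zero weights:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 4, 32                      # input dimension, number of context pairs
    eta = 0.1                         # gradient-descent step size

    # In-context linear-regression data: y_j = w_true . x_j
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))       # context inputs x_j
    y = X @ w_true                    # context targets y_j
    x_q = rng.normal(size=d)          # query input

    # One GD step on L(w) = 0.5 * sum_j (w.x_j - y_j)^2, starting from w = 0:
    # the gradient at w = 0 is -sum_j y_j x_j, so w_1 = eta * sum_j y_j x_j.
    w_1 = eta * (y @ X)
    pred_gd = w_1 @ x_q

    # Unnormalized linear attention with keys x_j, values y_j, query x_q,
    # scaled by eta: sum_j (x_q . x_j) * y_j.
    pred_attn = eta * ((X @ x_q) @ y)

    print(pred_gd, pred_attn)         # identical up to floating-point error
    assert np.allclose(pred_gd, pred_attn)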
> performance collapses under modest distribution shift
The problem is that the notion of "modest" depends on scale here. With enough varied data and/or enough parameters, what was once out-of-distribution can become in-distribution. The paper deliberately ignores this. Yes, the claims hold for tiny models, but I don't think anyone ever doubted that.
Science starts with a guess, and you run experiments to test it.
I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.
Who is "we" in that case? Are you the one building the model? Do you have the compute and data capacity to test every corner case that matters?
In a deterministic system you can review the code and determine what it does under given conditions. How do you know that the people who do build non-deterministic systems (because, let's face it, you will use those systems, not build them) haven't rigged them for their benefit and not yours?
> And we will over time have even more capability in the model (that is more general purpose) than our deterministic scaffolds...
"Our deterministic scaffolds" sounds so dramatic; you talk about them as if they were chains holding you down, and if only those chains were removed, you'd be able to fly. But it's not you who'd be able to fly, it's the ones building the model and holding the compute to build it. And because of its non-deterministic nature, a backdoor for their benefit now comes with plausible deniability built in. Who is "we"? You are a user of those models; you will not be adding anything to them, except perhaps circumstantially, by your prompts being mined. You are not "we".