The softmax probabilities themselves are usually not a very good indication of confidence, as the model is often overconfident due to neural collapse. The uncertainty in the softmax distribution (its entropy or spread) is a useful signal though, and can be used to detect out-of-distribution inputs or poor predictions.
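For what it's worth, here's a minimal sketch of what I mean, assuming plain numpy: compute the entropy of the softmax distribution from raw logits and flag high-entropy inputs. The example logits and the cutoff are placeholders you'd tune on held-out data, not anything specific to a real model.

    import numpy as np

    def softmax(logits):
        """Numerically stable softmax over the last axis."""
        z = logits - logits.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)

    def predictive_entropy(logits):
        """Entropy of the softmax distribution; higher means more uncertain."""
        p = softmax(logits)
        return -(p * np.log(p + 1e-12)).sum(axis=-1)

    # Toy example: a near-one-hot (confident) prediction vs. a flat (uncertain) one.
    confident = np.array([9.0, 0.5, 0.1])   # low entropy
    uncertain = np.array([1.1, 1.0, 0.9])   # high entropy, close to uniform

    for name, logits in [("confident", confident), ("uncertain", uncertain)]:
        print(name, predictive_entropy(logits))

    # A simple OOD / poor-prediction flag: threshold the entropy.
    # The cutoff below is arbitrary and would be tuned on held-out data.
    ENTROPY_CUTOFF = 0.8
    print("flag uncertain input:", predictive_entropy(uncertain) > ENTROPY_CUTOFF)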
They are at their most useful when it is cheaper to verify their output than it is to generate it yourself. That's why code works reasonably well: you can run it. But once validation becomes more expensive than doing it yourself, be it code or otherwise, their usefulness drops off significantly.
I've thought for a while that ensembling approaches would become the next stage of LLM development after CoT, since it provides yet another effective, independent axis for scaling laws. Great to see that perspective is taking off. The open weight community has an opportunity to take these ideas and run with them better than OpenAI has.
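To make the ensembling point concrete, here's a minimal sketch of self-consistency-style majority voting over sampled completions. The `sample_answer` callable is a hypothetical stand-in for whatever model call you'd actually make (sampled with temperature > 0), not any specific API:

    import random
    from collections import Counter
    from typing import Callable

    def majority_vote(sample_answer: Callable[[str], str], prompt: str, n: int = 16) -> str:
        """Self-consistency-style ensembling: sample n independent answers
        and return the most common final answer."""
        answers = [sample_answer(prompt) for _ in range(n)]
        best, _count = Counter(answers).most_common(1)[0]
        return best

    # Hypothetical usage with a fake sampler standing in for an LLM call.
    def fake_sampler(prompt: str) -> str:
        return random.choice(["42", "42", "42", "41"])

    print(majority_vote(fake_sampler, "What is 6 * 7?", n=11))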
Believe it or not, the US enjoys a special status as a prominent place to study and contribute knowledge. I'm talking about students doing fundamental, valuable, possibly even patentable research. The best talent flows through here, and domestic students get to benefit from that. Because of these "mistakes", it is already the case that good talent is choosing not to come to the US. Certain other nations will benefit instead.
If you want a US for existing US citizens alone (immigration "the right way" is rapidly deteriorating), then that is what is happening. One way or another, we'll witness what the US becomes.
I know what you're saying here, and I know it is primarily a critique of my phrasing, but establishing something like this is the objective of in-context learning theory and the mathematical analysis of deep learning. It is possible to prove that sufficiently well-trained models will generalize to certain unseen classes of patterns, e.g. a transformer acting like gradient descent in context. There is still a long way to go in the theory; it is difficult research!
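As a toy illustration of that transformer-as-gradient-descent result (in the spirit of von Oswald et al.), here's a small numpy check, under arbitrary dimensions and step size of my choosing, that unnormalized linear attention over (x, y) context pairs matches the prediction from one gradient-descent step on in-context least squares starting from zero weights:

    import numpy as np

    rng = np.random.default_rng(0)
    d, n = 4, 32                      # input dimension, number of context pairs
    eta = 0.1                         # gradient-descent step size

    # In-context linear-regression data: y_j = w_true . x_j
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))       # context inputs x_j
    y = X @ w_true                    # context targets y_j
    x_q = rng.normal(size=d)          # query input

    # One GD step on L(w) = 0.5 * sum_j (w.x_j - y_j)^2, starting from w = 0:
    # the gradient at w = 0 is -sum_j y_j x_j, so w_1 = eta * sum_j y_j x_j.
    w_1 = eta * (y @ X)
    pred_gd = w_1 @ x_q

    # Unnormalized linear attention with keys x_j, values y_j, query x_q,
    # scaled by eta: sum_j (x_q . x_j) * y_j.
    pred_attn = eta * ((X @ x_q) @ y)

    print(pred_gd, pred_attn)         # identical up to floating-point error
    assert np.allclose(pred_gd, pred_attn)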
> performance collapses under modest distribution shift
The problem is that the notion of "modest" depends on scale here. With enough varied data and/or enough parameters, what was once out-of-distribution can become in-distribution. The paper deliberately ignores this. Yes, the claims hold for tiny models, but I don't think anyone ever doubted that.
Science starts with a guess, and you run experiments to test it.
I honestly wish this paper actually showed what it claims, since it is a significant open problem to understand CoT reasoning relative to the underlying training set.
Who is "we" in that case? Are you the one building the model? Do you have the compute and data capacity to test every corner case that matters?
In a deterministic system you can review the code and determine what it does under given conditions. How do you know that the people who do build non-deterministic systems (because, let's face it, you will use those systems, not build them) haven't rigged them for their benefit and not yours?
> And we will over time have even more capability in the model (that is more general purpose) than our deterministic scaffolds...
"Our deterministic scaffolds" sounds so dramatic; you talk about them as if they were chains holding you down, and if only those chains were removed, you'd be able to fly. But it's not you who'd be able to fly, it's the ones building the model and holding the compute to build it. And because of its non-deterministic nature, a backdoor for their benefit now comes with plausible deniability built in. Who is "we"? You are a user of those models; you will not be adding anything to them, except perhaps circumstantially, by your prompts being mined. You are not "we".