paraschopra (u/paraschopra)

paraschopra commented on Q-learning is not yet scalable seohong.me/blog/q-learnin... · Posted by u/jxmorris12

paraschopra · 3 months ago

Humans actually do both. We learn on-policy by exploring the consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (the difference being that we can tell good behaviors from bad, and learn from a filtered list of what we consider good behaviors). In most off-policy RL, a lot of behaviors are bad and yet they get into the training set, which leads to slower training.
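
A rough sketch of the "filtered list" idea, assuming a toy setting where each demonstration carries a scalar return. This is filtered behavior cloning rather than Q-learning, and the names (`filter_demos`, `clone_policy`, the `ret` field) and data are made up for illustration:

```python
from collections import Counter, defaultdict

# Hypothetical demonstrations: each has (state, action) steps and a scalar return.
DEMOS = [
    {"steps": [("s0", "left")], "ret": 0.0},   # bad behavior
    {"steps": [("s0", "left")], "ret": 0.0},   # bad behavior
    {"steps": [("s0", "right")], "ret": 1.0},  # good behavior
]

def filter_demos(demos, min_return):
    """Keep only demonstrations judged good enough (return >= min_return)."""
    return [d for d in demos if d["ret"] >= min_return]

def clone_policy(demos):
    """Behavior cloning by counting: pick the most frequent action per state."""
    counts = defaultdict(Counter)
    for demo in demos:
        for state, action in demo["steps"]:
            counts[state][action] += 1
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}

unfiltered_policy = clone_policy(DEMOS)
filtered_policy = clone_policy(filter_demos(DEMOS, min_return=1.0))
```

With this toy data, cloning the unfiltered set copies the majority (bad) action, while cloning the filtered set copies the good one.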

paraschopra commented on Deep Learning Is Applied Topology theahura.substack.com/p/d... · Posted by u/theahura

paraschopra · 3 months ago

Aren't manifolds generally task-dependent?

I've been debating whether the data itself lies on a manifold, or whether only the data attributes that are task-relevant (and of interest to us) lie on a manifold.

I suspect it is the latter, but I've seen the Platonic Representation Hypothesis, which seems to hint it is the former.

paraschopra commented on Tracing the thoughts of a large language model anthropic.com/research/tr... · Posted by u/Philpax

colah3 · 5 months ago

Hi! I lead interpretability research at Anthropic. I also used to do a lot of basic ML pedagogy (https://colah.github.io/). I think this post and its children have some important questions about modern deep learning and how it relates to our present research, and wanted to take the opportunity to try and clarify a few things.

When people talk about models "just predicting the next word", this is a popularization of the fact that modern LLMs are "autoregressive" models. This actually has two components: an architectural component (the model generates words one at a time), and a loss component (it maximizes probability).
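
To make the two components concrete, here is a minimal sketch, assuming a toy bigram table in place of a real model. A real LLM conditions on the whole prefix, not just the previous word, and the probabilities and names here (`BIGRAM`, `generate`, `nll`) are invented for illustration:

```python
import math

# Hypothetical toy "model": P(next word | previous word). Values are made up.
BIGRAM = {
    "<s>": {"an": 0.6, "a": 0.4},
    "an": {"astronomer": 0.9, "stargazer": 0.1},
    "a": {"stargazer": 0.9, "astronomer": 0.1},
    "astronomer": {"</s>": 1.0},
    "stargazer": {"</s>": 1.0},
}

def generate(model, start="<s>", max_steps=10):
    """Architectural component: emit one word at a time (greedy decoding)."""
    words, prev = [], start
    for _ in range(max_steps):
        nxt = max(model[prev], key=model[prev].get)  # most probable next word
        if nxt == "</s>":
            break
        words.append(nxt)
        prev = nxt
    return words

def nll(model, words, start="<s>"):
    """Loss component: negative log-likelihood of a sequence.

    Maximizing probability is the same as minimizing this sum of
    -log P(word_t | previous word) over the sequence.
    """
    total, prev = 0.0, start
    for word in words:
        total -= math.log(model[prev][word])
        prev = word
    return total

sample = generate(BIGRAM)
loss = nll(BIGRAM, ["an", "astronomer", "</s>"])
```

Here `generate` covers the one-word-at-a-time part and `nll` covers the maximize-probability part. Finetuning with a different objective would change the second without changing the first.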

As the parent says, modern LLMs are finetuned with a different loss function after pretraining. This means that in some strict sense they're no longer autoregressive models – but they do still generate text one word at a time. I think this really is the heart of the "just predicting the next word" critique.

This brings us to a debate which goes back many, many years: what does it mean to predict the next word? Many researchers, including myself, have believed that if you want to predict the next word really well, you need to do a lot more. (And with this paper, we're able to see this mechanistically!)

Here's an example, which we didn't put in the paper: How does Claude answer "What do you call someone who studies the stars?" with "An astronomer"? In order to predict "An" instead of "A", you need to know that you're going to say something that starts with a vowel next. So you're incentivized to figure out one word ahead, and indeed, Claude realizes it's going to say astronomer and works backwards. This is a kind of very, very small scale planning – but you can see how even just a pure autoregressive model is incentivized to do it.

paraschopra · 5 months ago

Is it fair to say that both "Say 'an'" and "Say 'astronomer'" output features would be present in this case, but "Say 'an'" gets more votes because it is the start of the sentence, and once "An" is sampled, it adds further votes for the "Say 'astronomer'" feature?