lalaland1125 · 3 months ago
This blog post is unfortunately missing what I consider the bigger reason why Q learning is not scalable:

As horizon increases, the number of possible states (usually) increases exponentially. This means you require exponentially more data to have any hope of training a Q function that can handle those states.

This is less of an issue for on-policy learning, because only near-policy states matter, and on-policy learning explicitly samples only those states. So even though there are exponentially many possible states, your training data is laser-focused on the important ones.

elchananHaas · 3 months ago
I think the article's analysis of overestimation bias is correct. The issue is that, due to the max operator in Q-learning, noise is amplified over timesteps. Methods that reduce this bias, such as Double Q-learning (https://arxiv.org/abs/1509.06461), have been successful in improving RL agents' performance. Studies have found that this happens even more for the states the network hasn't visited many times.

An exponential number of states only matters if there is no pattern to them. If there is some structure that the network can learn, then it can perform well. This is a strength of deep learning, not a weakness. The trick is getting the right training objective, which the article claims Q-learning doesn't have.

I do wonder if MuZero and other model-based RL systems are the solution to the author's concerns. MuZero can reanalyze prior trajectories to improve training efficiency. Monte Carlo Tree Search (MCTS) is a principled way to perform horizon reduction by unrolling the model multiple steps. The max operator in MCTS could cause similar issues, but the search progressing deeper counteracts this.
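To make the max-operator noise amplification concrete, here's a small self-contained sketch (the action values, noise level, and sample count are all illustrative): two actions share the same true value, yet maxing over noisy estimates is biased upward, while the double-estimator idea from the paper linked above removes the bias in expectation.

```python
import random

random.seed(0)

TRUE_Q = [0.0, 0.0]  # two actions with identical true value
NOISE = 0.5          # std-dev of estimation noise
N = 100_000

def noisy_estimate(q):
    return q + random.gauss(0.0, NOISE)

# Single estimator: max over noisy estimates overestimates the true max (0.0).
single = sum(max(noisy_estimate(q) for q in TRUE_Q) for _ in range(N)) / N

# Double estimator (the idea behind Double Q-learning): pick the argmax with
# one set of estimates, then evaluate it with an independent set.
def double_sample():
    est_a = [noisy_estimate(q) for q in TRUE_Q]
    est_b = [noisy_estimate(q) for q in TRUE_Q]
    best = max(range(len(TRUE_Q)), key=lambda i: est_a[i])
    return est_b[best]

double = sum(double_sample() for _ in range(N)) / N

print(f"single-estimator mean: {single:+.3f}")  # noticeably above zero
print(f"double-estimator mean: {double:+.3f}")  # close to zero
```

With many timesteps, a Q-learning target re-applies that upward-biased max at every backup, which is how per-step noise compounds over the horizon.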

Ericson2314 · 3 months ago
https://news.ycombinator.com/item?id=44280505 I think that thread might help?

Total layman here, but maybe some tasks are "uniform" despite being "deep" in such a way that poor samples still suffice? I would call those "ergodic" tasks. But surely there are other tasks where this is not the case?

lalaland1125 · 3 months ago
Good clarification. I have edited my post accordingly.

There are situations where the number of states grows at much slower than exponential rates.

Those situations are a good fit for Q learning.

jhrmnn · 3 months ago
Is this essentially the same difference as between vanilla regular-grid and importance-sampling Monte Carlo integration?
arthurcolle · 3 months ago
Feel the Majorana-1
briandw · 3 months ago
This paper assumes that you know quite a bit about RL already. If you really want to dig into RL, this intro course from David Silver (DeepMind) is excellent: https://youtu.be/2pWv7GOvuf0?si=CmFJHNnNqraL5i0s
ArtRichards · 3 months ago
Thank you for this link.
itkovian_ · 3 months ago
Completely agree, and I think it's a great summary. To put it very succinctly: you're chasing a moving target, where the target changes based on how you move. There's no ground truth to zero in on in value-based RL. You minimise a difference in which both sides of the equation contain your APPROXIMATION.

I don't think it's hopeless though. I actually think RL is very close to working, because what it lacked this whole time was a reliable world model/forward dynamics function (with one, you don't have to explore; you can plan). And now we've got that.
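A minimal tabular sketch of that moving-target point (the state/action names and constants are hypothetical): the TD target on the right-hand side is bootstrapped from the same approximation Q that appears on the left, so the "ground truth" shifts every time Q is updated.

```python
ALPHA, GAMMA = 0.1, 0.9  # learning rate and discount (illustrative)

def q_learning_update(Q, s, a, r, s_next):
    # target = r + gamma * max_a' Q(s', a')  -- built from Q itself
    target = r + GAMMA * max(Q[s_next].values())
    td_error = target - Q[s][a]
    Q[s][a] += ALPHA * td_error
    return td_error

Q = {"s0": {"left": 0.0, "right": 0.0},
     "s1": {"left": 1.0, "right": 2.0}}
q_learning_update(Q, "s0", "right", 0.0, "s1")
print(Q["s0"]["right"])  # nudged toward the target 0 + 0.9 * 2.0 = 1.8
```

Every update to Q also moves the targets for every state that can reach it, which is exactly the "both sides contain your approximation" problem.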

whatshisface · 3 months ago
The benefit of off-policy learning is fundamentally limited by the fact that data from ineffective early exploration isn't that useful for improving later, more refined policies. It's clear if you think of a few examples: chess blunders, spasmodic movement, or failing to solve a puzzle. This becomes especially clear once you realize that data only becomes off-policy when it describes something the policy would not do. I think the solution to this problem is (unfortunately) tied to the need for better generalization/sample efficiency.
getnormality · 3 months ago
Doesn't this claim prove too much? What about the cited dog that walked in 20 minutes with off-policy learning? Or are you making a more nuanced point?
isaacimagine · 3 months ago
No mention of Decision Transformers or Trajectory Transformers? Both are offline approaches that tend to do very well at long-horizon tasks, as they bypass the credit assignment problem by virtue of having an attention mechanism.

Most RL researchers consider these approaches not to be "real RL", as they can't assign credit outside the context window, and therefore can't learn infinite-horizon tasks. With 1m+ context windows, perhaps this is less of an issue in practice? Curious to hear thoughts.

DT: https://arxiv.org/abs/2106.01345

TT: https://arxiv.org/abs/2106.02039

highd · 3 months ago
TFP cites decision transformers. Just using a transformer does not bypass the credit assignment problem. Transformers are an architecture for solving sequence modeling problems, e.g. the credit assignment problem as arises in RL. There have been many other such architectures.

The hardness of the credit assignment problem is a statement about data sparsity. Architecture choices do not "bypass" it.

isaacimagine · 3 months ago
TFP: https://arxiv.org/abs/2506.04168

The DT citation [10] is used on a single line, in a paragraph listing prior work, as an "and more". Another paper that uses DTs [53] is also cited in a similar way. The authors do not test or discuss DTs.

> hardness of the credit assignment ... data sparsity.

That is true, but it's not the point I'm making. "Bypassing credit assignment", in the context of long-horizon task modeling, is a statement about using attention to allocate long-horizon reward without a horizon-reducing discount, not about architecture choice.

To expand: if I have an environment with a key that unlocks a door thousands of steps later, Q-Learning may not propagate the reward signal from opening the door to the moment of picking up the key, because of the discount of future reward terms over a long horizon. A decision transformer, however, can attend to the moment of picking up the key while opening the door, which bypasses the problem of establishing this long-horizon causal connection.

(Of course, attention cannot assign reward if the moment the key was picked up is beyond the extent of the context window.)
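A rough back-of-the-envelope (the discount factor, delay, and reward are illustrative) of how discounting buries that key-pickup signal:

```python
GAMMA = 0.99
DELAY = 1000  # steps between picking up the key and opening the door
discounted_reward = GAMMA ** DELAY * 1.0  # reward of 1.0 at the door
print(f"{discounted_reward:.2e}")  # ~4.3e-05: nearly invisible to the update
```

An attention weight between the two timesteps, by contrast, doesn't shrink with the number of intervening steps.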

paraschopra · 3 months ago
Humans actually do both. We learn on-policy by exploring the consequences of our own behavior. But we also learn off-policy, say from expert demonstrations (the difference being that we can tell good behaviors from bad, and learn from a filtered list of what we consider good behaviors). In most off-policy RL, many behaviors are bad, yet they still end up in the training set, leading to slower training.
taneq · 3 months ago
> difference being we can tell good behaviors from bad

Not always! That's what makes some expert demonstrations so fascinating, watching someone do something "completely wrong" (according to novice level 'best practice') and achieve superior results. Of course, sometimes this just means that you can get away with using that kind of technique (or making that kind of blunder) if you're just that good.

marcosdumay · 3 months ago
I guess it's worth pointing out that when humans learn the long-horizon tasks that we learn by repetitive training, we segment them into tasks with a shorter horizon and compose them hierarchically later.
BoiledCabbage · 3 months ago
It does (naively, I'll admit) seem like the problem is more one of approach than of algorithm.

Yes, the model may not be able to tackle long-horizon tasks from scratch, but it could learn some shorter-horizon skills first, then handle a longer horizon by leveraging groupings of those smaller skills. Chunking, like we all do.

Nobody learns how to fly a commercial airplane cross-country as a sequence of micro hand and arm movements. We learn to pick up a ball that way when young, but learning to fly or play a sport consists of a hierarchy of learned skills and plans.

s-mon · 3 months ago
While I like the blog post, I think the use of unexplained acronyms undermines its opportunity to be useful to a wider audience. Small nit: make sure acronyms and jargon are explained.
keremk · 3 months ago
For these kinds of blog posts, where the content is very good but not very approachable because it assumes extensive prior knowledge, I find an AI tool very useful for explaining and simplifying. I just used the new browser Dia for this, and it worked really well for me. Or you can use your favorite model provider and copy and paste. This way the post stays concise, yet you can still use your AI tools to ask questions and clarify.
levocardia · 3 months ago
It's clearly written for an audience of other RL researchers, given that the conclusion is "will someone please come up with Q-learning methods that scale!"

anonthrowawy · 3 months ago
i actually think that's what made it crisp