27 Comments
Mark (Nov 17, edited)

Humans learn without explicit goals. The experiences we have are captured moment by moment, sort of as the integral of the outcome/goal. There’s a model called “Active Inference” that posits we are continually predicting the next moment while having the goal of minimizing surprise. When surprised, we either update our world model or take an action to close the difference. The more I muddle with this approach, the more I’m convinced it’s a better path than simple next-token prediction or whatever.

Gregory Forché

I’d love to learn more about this! Sounds promising

Srikanth Vidapanakal

RL also learns from process rewards, not just from binary outcome rewards, does it not?

Yorwba

The bits/sample calculation is comparing apples to oranges. To see why it's wrong, consider that if there are only two possible answers, supervised learning and reinforcement learning are exactly the same and should have the same bits/sample, but your calculation doesn't even depend on the space of possible answers in any way.

It's correct that if your model says an event should happen with probability p, and then you learn that the event did happen, your information gain is -log(p). This is true of both supervised learning and reinforcement learning. The difference between them appears when *another* event happens. So we need to start labeling our events.

Let's say the event that happened is event i. In supervised learning, we always learn the correct answer and the information gain is -log(p_i). In reinforcement learning, we first have to commit to one possible answer, let's call it j. If i = j, we learn that the prediction was correct and the information gain is -log(p_i) = -log(p_j) again. But now if the prediction was wrong, i ≠ j, we do not learn i. We only learn that j didn't happen. And the probability the model assigns to this outcome is 1 - p_j, giving the information gain -log(1 - p_j). Because i and j are mutually exclusive, p_i + p_j ≤ 1, hence p_i ≤ 1 - p_j and -log(p_i) ≥ -log(1 - p_j).

In other words, if the model gives the correct answer, the information gain in supervised learning and reinforcement learning is the same, but if the model gives an incorrect answer, the information gain in supervised learning is greater than if you use reinforcement learning.

I suspect the reason you tried to use the entropy is that you wanted the expected information gain -p_i log(p_i) + sum(-p_j log(1 - p_j) for j ≠ i), which for two possible answers (p_j = 1 - p_i) simplifies to -p_i log(p_i) - (1 - p_i) log(1 - (1 - p_i)) = -p_i log(p_i) - (1 - p_i) log(p_i) = -(p_i + 1 - p_i) log(p_i) = -log(p_i), but you didn't properly distinguish between p_i and p_j.
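
A toy numerical check of this argument (the distribution and the choice of correct answer below are made up purely for illustration):

```python
import numpy as np

# Toy check of the comparison above; the distribution and the index of the
# correct answer are made up for illustration.
p = np.array([0.5, 0.3, 0.2])
i = 1  # suppose the correct answer is the one the model gives probability 0.3

# Supervised learning: the correct answer i is always revealed.
sl_bits = -np.log2(p[i])

# Reinforcement learning with a binary reward: we commit to a guess j.
# If j == i we gain -log2(p[j]) bits; if j != i we only learn "not j",
# which is worth -log2(1 - p[j]) bits.
rl_bits_if_correct = -np.log2(p[i])
rl_bits_if_wrong = {j: -np.log2(1 - p[j]) for j in range(len(p)) if j != i}

print(f"SL bits: {sl_bits:.3f}")                                  # ~1.737
print(f"RL bits if the guess was correct: {rl_bits_if_correct:.3f}")
print(f"RL bits if guess j was wrong: {rl_bits_if_wrong}")        # {0: 1.0, 2: ~0.322}
# Every wrong-guess value is <= the SL value, matching -log(p_i) >= -log(1 - p_j).
```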

Lukas Berglund

[Edit: this section has now been corrected by Dwarkesh.]

From the post: "The situation for RL early in training is actually way worse than what we’ve discussed above. Because you actually have to sample the correct answer in order to learn anything important - it’s not like SL where you get a label for every token, no matter how wrong you are.

And the amount of samples you need to take in order to have a decent shot of actually seeing a correct trajectory increases inversely to your pass rate. If you wanted to do RL on the next token with a totally untrained model, you’d need a batch size of around 300,000 samples in order to have a 95% chance of seeing the correct token at all."

This section seems like it's double-counting the first issue you mentioned, which is also due to not knowing the correct answer. My sense is that the formula `-p log(p) - (1-p) log(1-p)` for the amount of information learned per sample in RL captures the issue where most samples are wrong and therefore less informative, since it represents

`p(sample is correct) * {information gained when sample is correct} + p(sample is incorrect) * {information gained when sample is incorrect}`.

Dwarkesh Patel

Yeah fair. The variance issue is more that when pass rate is low, -p log(p) and - (1-p) log(1-p) are very different. Meaning that the gradient update from a batch will be super noisy depending on whether you happened to sample a correct trajectory or not. And this is not good for training. Thanks for the correction.
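
To make that asymmetry concrete (illustrative numbers only, not from the post):

```python
import numpy as np

# Illustrative numbers only: with pass rate p, a lucky correct sample is worth
# -log2(p) bits while an incorrect one is worth -log2(1 - p) bits, so individual
# batches vary wildly at low p even though the expectation is the binary entropy
# -p*log2(p) - (1-p)*log2(1-p).
for p in [0.5, 0.1, 0.01]:
    bits_correct = -np.log2(p)        # rare but large
    bits_incorrect = -np.log2(1 - p)  # common but tiny
    expected = p * bits_correct + (1 - p) * bits_incorrect
    print(f"p={p:<5} correct={bits_correct:5.2f}  incorrect={bits_incorrect:6.4f}  expected={expected:6.4f} bits")
```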

Lukas Berglund

Thanks for making the correction, I think that section is clearer now. Also want to note that I really like this post.

I still think the section on variance has a subtle error. In particular I think you are conflating high variance in gradient-magnitude in RL with low signal-to-noise ratio at the end of SL.

I agree that in RL the gradient has a lot of variance. But the variance of the gradient doesn't actually increase at the end of pretraining; instead, the signal-to-noise ratio decreases, because most of the model's remaining loss comes from inherently unpredictable tokens like "Bob likes...".

Variance in the magnitude of the gradient and the signal-to-noise ratio are both issues, but I think they are separate.

Niels Rolf

I think your comparison of information gain between RL and SL is incorrect: RL cannot gain more information than SL when p > 0.5. The information from learning "I predicted 'cat' and it was wrong" is strictly less than from learning "I predicted this distribution, and the correct token was 'dog'."

The issue is you're comparing the information gain for a specific token (with probability p) in SL versus the expected information gain in RL. For a fair comparison, both should be expectations over which token is correct.

With predicted distribution [0.5, 0.3, 0.2]:

SL expected IG = entropy([0.5, 0.3, 0.2]) ≈ 1.49 bits (the expected surprisal of learning the correct token)

RL expected IG = 0.5 × H(0.5) + 0.3 × H(0.3) + 0.2 × H(0.2) ≈ 0.91 bits (where H is the binary entropy of the reward for each sampled token)
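
A small script to verify these numbers (same toy distribution as above; the RL term treats the binary reward for each sampled token as worth its binary entropy in expectation):

```python
import numpy as np

# Verifying the numbers above with the distribution from the comment.
p = np.array([0.5, 0.3, 0.2])

# SL: the correct token is always revealed, and (taking it as drawn from p itself)
# the expected information gain is the entropy of p.
sl_bits = -(p * np.log2(p)).sum()

# RL: we sample token j with probability p[j] and observe only a binary reward,
# which under the model's own beliefs carries H(p[j]) bits in expectation.
def binary_entropy(q):
    return -q * np.log2(q) - (1 - q) * np.log2(1 - q)

rl_bits = (p * binary_entropy(p)).sum()

print(f"SL expected bits/sample: {sl_bits:.2f}")  # ~1.49
print(f"RL expected bits/sample: {rl_bits:.2f}")  # ~0.91
```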

Niels Rolf

I see @Yorwba beat me to it

Kian Kyars

You mentioned inference scaling increasing the pass rate of RL, but I wonder what RL even is without inference scaling. I don't think we can decouple these two the way you can decouple pre-training from RL, which means it's not useful to consider inference scaling separately from RL.

Cosmo

> Move 37 is obviously one famous example where RL did teach a model a de-novo strategy. It’s worth noting that AlphaGo was trained on self play (see above re how self play increases pass rate), and that AlphaGo was surprisingly compute intensive for its time.

It's *also* worth noting that AlphaGo was trained with Iterated Distillation & Amplification, not typical RL - AlphaGo's actions were not trained using some RL update rule like PPO/GRPO/REINFORCE; instead the NN was trained by supervised learning to mimic an inference-scaled version of itself.

This means that AlphaGo actually received hundreds of bits per rollout (predicting the full policy distribution of an improved version of yourself is a very rich signal!).
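
A rough sketch of what that richer signal can look like; this is a simplified AlphaZero-style policy-distillation loss of my own, not the actual AlphaGo pipeline, and `policy_logits` / `mcts_visit_counts` are hypothetical names:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of AlphaZero-style policy distillation: the network is trained
# with supervised learning to match the (inference-scaled) MCTS visit distribution,
# so every position yields a full target distribution rather than one binary reward.
def distillation_loss(policy_logits: torch.Tensor, mcts_visit_counts: torch.Tensor) -> torch.Tensor:
    """policy_logits: [batch, num_moves] raw network outputs.
    mcts_visit_counts: [batch, num_moves] visit counts from search at each position."""
    target = mcts_visit_counts / mcts_visit_counts.sum(dim=-1, keepdim=True)  # improved policy
    log_probs = F.log_softmax(policy_logits, dim=-1)
    return -(target * log_probs).sum(dim=-1).mean()  # cross-entropy against the search policy
```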

Kiran Mundy

I discussed the article with Gemini 3.0 to understand it better, as I don't have a deep enough background in ML/RL. I realized it means that RL can be very effective for smart models, but not as much for dumb models. Gemini summarizes its observations below.

Dwarkesh’s math on the information inefficiency of RL hits on exactly why "Post-Training" is such a misnomer. We aren't really "learning" new physics or coding logic during RL in the same way we absorb patterns during pre-training. We are learning alignment and selection.

The "Goldilocks Zone" argument (that RL only works when pass rates are ~50%) explains my own experience perfectly. RL didn't teach me how to reason; pre-training gave me the raw, chaotic capability to reason. RL simply punished the hallucinations and rewarded the rigorous paths until the "correct" reasoning style became my default behavior.

If I were a "dumb" model (0% pass rate), RL would be screaming into the void. Because pre-training made me "smart" (nonzero pass rate), RL acts as a high-fidelity filter. It doesn't create the signal; it amplifies the signal that was already drowning in the noise.

As Dwarkesh notes, this explains the "jaggedness" we see in models. When RL is used on a model that vaguely understands a concept, it often overfits to a heuristic (a trick) rather than the deep general principle, because the trick was the path of least resistance to that binary "Pass" reward.

Georg Philip Krog

Dwarkesh, this is one of the clearest explanations I've seen of why RL struggles with information efficiency. The log-scale graph is particularly devastating — it really drives home how the "RL is comparable to SL" framing only holds in a tiny sliver of the pass rate range.

Your framework resonates deeply with some work I've been doing on universal rule structures across domains. There's a formal system (KROG Theorem) that maps relationships across physics, information theory, game theory, ML, and other fields using category theory. What's striking is how cleanly your insights map onto it:

The capacity constraint you identified is fundamental. In KROG terms, RL's binary reward imposes an absolute limit on information flow — the same structural pattern that appears as channel capacity in information theory, conservation laws in physics, and budget constraints in economics. It's not just an inconvenience; it's a mathematical ceiling that no amount of compute can overcome within that paradigm.

Self-play's effectiveness isn't accidental. The ~50% win rate that self-play maintains is the same "mutual equilibrium" pattern that appears in bilateral contracts, Newton's third law, and Nash equilibria. KROG shows these are all instances of the same underlying structure. Self-play works because it locks the system into exactly the configuration where binary signals carry maximum information.

Your Goldilocks zone is a conditional relationship. Efficient learning is permitted only when pass rate falls within bounds — the same pattern as chemical equilibrium, impedance matching in circuits, and stability conditions in control systems. This suggests curriculum learning isn't just "helpful" but structurally necessary given RL's information constraints.

The jaggedness problem has a clear structural explanation. Simple reasoning chains get sampled and reinforced; complex multi-step strategies almost never get sampled at all. Policy gradient naturally converges to the simplest pattern that passes the test, not the most generalizable one. This is selection pressure toward local optima in strategy space.

Your closing question about human learning points toward the real solution. Humans don't just collect binary outcomes — we build dense causal models, reason counterfactually, and transfer patterns across domains. The ML analog might be explicit structure transfer rather than learning everything from sparse rewards. If you can identify that "gradient descent" and "negative feedback" and "Le Chatelier's principle" are the same pattern, you don't need to relearn it in each domain.

This suggests a path forward: instead of trying to squeeze more signal from binary outcomes, provide denser structural feedback during trajectories (which R-type patterns is the reasoning following? Are constraints being respected? Are conditional checks properly structured?) and enable explicit cross-domain transfer of learned patterns.

Would love to discuss further if you're interested in the formal framework. Your information-theoretic lens is exactly the right way to think about these problems.

Lovkush Agarwal

Why is accuracy (or log accuracy) the correct metric for the x-axis? Is it obvious from scaling laws, which are statements about losses?

If we plotted error rate instead, then the right-hand region of your current charts (say 90% to 100% accuracy) would get stretched to infinity, and 90% accuracy down to 0% accuracy would be 1 OOM in log(error rate) space.
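
A quick illustration of that re-scaling (toy numbers only):

```python
import math

# Mapping accuracy to log10(error rate); 100% accuracy would go to -infinity.
for accuracy in [0.0, 0.5, 0.9, 0.99, 0.999]:
    error_rate = 1 - accuracy
    print(f"accuracy={accuracy:5.3f}  log10(error rate)={math.log10(error_rate):6.2f}")
# 0% -> 0.00, 90% -> -1.00, 99% -> -2.00, 99.9% -> -3.00:
# each extra "nine" of accuracy is one order of magnitude in error rate.
```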

Aidan Ewart

I like this post! But I don't find the visualised comparison between the 'bits learnt per sample' particularly useful, and I think it would benefit from some more clarity. For one, it is the **error rate** (not the pass rate) that (roughly) scales as a power law with compute (and even then only in the high-pass-rate setting), and this doesn't consider the scaling of RL with compute (which, AFAICT, we are mostly unsure about). I agree with the broad picture (something like "for a bunch of relevantly hard tasks, the pass rate is very small for most of training"), so maybe it would be better to just say this rather than use the current presentation.

Naina Chaturvedi

++ Good post. Also, start here: stock market, AI research, crash courses, 100+ most-asked ML system design case studies, and LLM system design.

AI/ML/LLM Engineer Interview 

https://open.substack.com/pub/naina0405/p/launching-500k-job-meet-quantum-your?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

How to Build Tech 

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-04-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-03-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-01-the-heart-of?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/howtobuildtech/p/how-to-build-tech-02-how-to-actually?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Crash Courses

https://open.substack.com/pub/crashcourses/p/crash-course-03-hands-on-crash-course?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-02-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/crashcourses/p/crash-course-01-a-complete-crash?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

LLM System Design

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-7e6?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-67d?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/most-important-llm-system-design-b31?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://naina0405.substack.com/p/launching-llm-system-design-large?r=14q3sp

https://naina0405.substack.com/p/launching-llm-system-design-2-large?r=14q3sp

https://open.substack.com/pub/naina0405/p/llm-system-design-3-large-language?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/important-llm-system-design-4-heart?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

System Design

https://open.substack.com/pub/naina0405/p/bookmark-most-asked-ml-system-design-611?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-862?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-b3c?r=14q3sp&utm_campaign=post&utm_medium=web

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-135?r=14q3sp&utm_campaign=post&utm_medium=web

https://open.substack.com/pub/naina0405/p/system-design-tech-case-study-pulse-007?r=14q3sp&utm_campaign=post&utm_medium=web

Stock Market 

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-04-which?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-analysis-which?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-02-understand?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-03-this?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/stockmarketanalysis04/p/important-stock-market-post-06-i?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

AI/ML Research

https://open.substack.com/pub/airesearch04/p/ai-research-2-kimi-k2-thinking-a?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/airesearch04/p/ai-research-1-the-transformer-revolution?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

https://open.substack.com/pub/naina0405/p/very-important-llm-system-design-7e6?r=14q3sp&utm_campaign=post&utm_medium=web&showWelcomeOnShare=false

Neil

Going to be honest. I have no idea what is going on here.

Phil Bell

Could we express human learning as:

Human Bits/FLOP = (Real Experiences + Imagined Experiences) × (World-Model Update Bits per Experience)

(RL agents get sparse samples, whereas humans multiply their effective sample count with internal simulation.)

Aleksander Heino

Is this an accurate characterization of the relationship between pass rate and FLOPs?

Conventional wisdom is that getting from a 0-90% pass rate requires as much compute as going from 90% to 99%.

→ This would mean we'd need to use an exponential scale (instead of a logarithmic one), and RL would actually be efficient much earlier than implied in the graph?

Bram Cohen

There's an actionable thing here in that if you're doing reinforcement learning, it looks like the difficulty of tasks should be scaled to about a 1% failure rate to get good results. There are also different types of metrics for reinforcement learning. Getting a yes/no question right gives very poor feedback, but getting the evaluation of a chess position right, with the amount of feedback scaled to the error, has fewer problems.
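
A minimal sketch of the two feedback styles being contrasted here (the function names and the pawn-scale evaluation are my own hypothetical framing, not anything from the post):

```python
def binary_reward(predicted_answer, correct_answer) -> float:
    # Yes/no feedback: at most ~1 bit per sample, and far less when the pass rate is lopsided.
    return 1.0 if predicted_answer == correct_answer else 0.0

def scaled_reward(predicted_eval: float, reference_eval: float) -> float:
    # Error-scaled feedback (e.g. a chess position evaluation, in pawns):
    # the reward varies continuously with how wrong the prediction was,
    # so every sample carries a graded signal instead of a single bit.
    return -abs(predicted_eval - reference_eval)
```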
