Discussion about this post

Lukas Berglund
2d · Edited

[Edit: this section has now been corrected by Dwarkesh.]

From the post: "The situation for RL early in training is actually way worse than what we’ve discussed above. Because you actually have to sample the correct answer in order to learn anything important - it’s not like SL where you get a label for every token, no matter how wrong you are.

And the amount of samples you need to take in order to have a decent shot of actually seeing a correct trajectory increases inversely to your pass rate. If you wanted to do RL on the next token with a totally untrained model, you’d need a batch size of around 300,000 samples in order to have a 95% chance of seeing the correct token at all."
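As a sanity check on the quoted figure, here is a minimal sketch of the arithmetic. It assumes a uniform initial policy over a vocabulary of about 100,000 tokens; the quote does not state a vocabulary size, so `vocab_size` is my assumption, and `samples_needed` is a hypothetical helper name.

```python
import math


def samples_needed(pass_rate: float, confidence: float = 0.95) -> int:
    """Smallest n such that 1 - (1 - pass_rate)**n >= confidence.

    Assumes samples are independent and each is correct with
    probability pass_rate.
    """
    # Solve (1 - pass_rate)**n <= 1 - confidence for n.
    return math.ceil(math.log(1 - confidence) / math.log1p(-pass_rate))


# Assumption: an untrained model guesses uniformly over ~100k tokens.
vocab_size = 100_000
pass_rate = 1 / vocab_size

n = samples_needed(pass_rate)
print(f"pass rate {pass_rate:.0e}: about {n:,} samples for a 95% chance of one hit")
```

Under that assumption the result lands just under 300,000, in line with the quote. For small pass rates the required count is roughly `3 / pass_rate`, since `-ln(0.05)` is about 3.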

This section seems to double-count the first issue you mentioned, which also stems from not knowing the correct answer. My sense is that the formula `-p log(p) - (1-p) log(1-p)` for the information learned per sample in RL already captures the fact that most samples are wrong and therefore less informative, since it represents

`p(sample is correct) * {information gained when sample is correct} + p(sample is incorrect) * {information gained when sample is incorrect}`.
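The decomposition above can be sketched numerically. This is only an illustration under my own assumptions: logs are taken base 2 so the result is in bits, the reward is treated as a binary correct/incorrect signal, and `surprisal` and `binary_entropy` are hypothetical helper names.

```python
import math


def surprisal(q: float) -> float:
    """Information (in bits) gained from observing an event of probability q."""
    return -math.log2(q)


def binary_entropy(p: float) -> float:
    """Expected information per sample when a sample is correct with probability p."""
    if p in (0.0, 1.0):
        return 0.0  # the outcome is certain, so nothing is learned
    correct_term = p * surprisal(p)              # p(correct) * info when correct
    incorrect_term = (1 - p) * surprisal(1 - p)  # p(incorrect) * info when incorrect
    return correct_term + incorrect_term


for p in (0.5, 0.01, 1e-5):
    print(f"p = {p:g}: {binary_entropy(p):.6f} bits per sample")
```

At a very low pass rate a correct sample is highly informative but rare, and an incorrect sample is common but carries almost no information, so the expected value per sample is already tiny. That is the sense in which the formula accounts for rarely sampling the correct answer.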

2 replies by Dwarkesh Patel and others
Mark
2d · Edited

Humans learn without explicit goals. The experiences we have are captured moment by moment, sort of the integral of the outcome/goal. There’s a model called “Active Inference” that posits we are continually predicting the next moment while pursuing the goal of minimizing surprise. When surprised, we either update our world model or take an action to close the difference. The more I muddle with this approach, the more I’m convinced it’s a better path than simple next-token prediction or whatever.

1 reply
