The next big breakthrough will be AIs learning on the job

Labs are throwing away the most valuable data.

Here’s the big research bet the labs are making currently: if we train AIs to accomplish millions of verifiable tasks across thousands of diverse RL environments, then we’ll basically have built AGI. Because such training will create these general problem solving skills (like how to make progress on an open ended task for weeks on end in the face of errors, mistakes, and ambiguity).

The people optimistic about this vision would say that anything we might consider a fundamental deficit of the current learning paradigm (for example, data inefficiency and lack of continual learning) can be steamrolled by just scaling training more, just as all the supposed “fundamental” research problems in natural language processing collapsed against the flood of compute thrown into LLMs.

Yes, these models are 1/1-millionth as sample efficient as humans during training. But training is a one-time cost amortized across billions of user sessions. What matters is how smart, general, and sample efficient the model is within a session, and that’s clearly been improving as we do more RL training. AIs are able to solve more and more ambitious problems across longer and longer time spans - anybody who’s been using these models for coding knows that.

Similarly, continual learning, defined as the model’s weights getting updated from deployment, may simply not be necessary. Again, because if in-context learning gets so good across longer and longer horizons, then we don’t need to distill back to weights to get on-the-job learning. People often say that their employees are not net productive until six months or more on the job, so clearly online learning is necessary for competence. But what if you could just fit those six months into the context window? There have been tons of architectural innovations on the transformer which dramatically increase the length of context you can store. With a couple more years of progress, why couldn’t we have arbitrarily large context windows?

Grindability is just as important as verifiability

To address whether this will work, I want to first take a detour and ask a question about the current nature of AI progress that I find confusing and interesting. Why has progress on computer use been so slow?

Computer use is so clearly verifiable (did the desired Etsy item get ordered, is everything I need for my corporate event booked, have my taxes been submitted). So isn’t it weird that computer use has been making much slower progress than coding and math and other verifiable domains? There are many reasons for this, I’m sure, among them the fact that the models are exposed to far less high quality multimodal data during pretraining, and that video consumes the context window far faster.[1]

But one reason that I think is quite underrated, and which also reveals the canyon walls that the river of AI progress will only slowly chip away at, is that it is not enough for a domain to be verifiable. It also has to be very grindable, in the sense that you can run lots of parallel rollouts against a deterministic and replayable simulator. If you’re trying to make a model better at coding, you can create an environment that has a software repo with some missing feature that you’ve tasked the AIs with creating, and then you have a thousand parallel agents just go at the problem, each with their identical copy of the container.[2]
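To make "grindable" concrete, here is a toy sketch of that loop: many rollouts run in parallel, each against its own identical copy of a deterministic environment, each scored by a verifier. Everything in it is hypothetical and only for illustration: `ToyRepoEnv` stands in for a containerized repo with a missing feature, the "policy" is just seeded random guessing, and `grind` is a made-up helper name.

```python
import copy
import random
from concurrent.futures import ThreadPoolExecutor


class ToyRepoEnv:
    """Hypothetical stand-in for a containerized coding task (illustration only)."""

    def __init__(self, target=(3, 1, 4)):
        self.target = target  # the "missing feature" the agent must produce
        self.patch = []       # what the agent has written so far

    def step(self, action):
        """Apply one action; return True once the episode is over."""
        self.patch.append(action)
        return len(self.patch) == len(self.target)

    def verify(self):
        """The verifiable reward: did the patch match the target exactly?"""
        return tuple(self.patch) == self.target


def rollout(args):
    template, seed = args
    env = copy.deepcopy(template)  # each agent gets an identical, isolated copy
    rng = random.Random(seed)      # seeded policy, so the rollout is replayable
    done = False
    while not done:
        done = env.step(rng.randrange(5))
    return seed, env.verify()


def grind(template, n_rollouts):
    """Run n_rollouts attempts in parallel against copies of the same environment."""
    jobs = [(template, seed) for seed in range(n_rollouts)]
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(rollout, jobs))


if __name__ == "__main__":
    template = ToyRepoEnv()
    results = grind(template, 1000)
    wins = [seed for seed, ok in results if ok]
    print(f"{len(wins)} of {len(results)} rollouts passed verification")
```

The two properties doing the work are the `deepcopy` (a free reset to an identical starting state) and the seeded policy (any rollout can be replayed exactly). A live checkout flow on a real website offers neither.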

But this doesn’t work with computer use—at least not trivially. You can’t have a thousand agents go try the same checkout flow on Amazon.com. Because Andy Jassy will find and detect your bots and shut your ass down.

You can solve this by making clones of Slack, Gmail, and all the other common applications and websites. But at least currently, this is a very labor-intensive and unscalable way to build environments. Of course, once AIs get good enough at coding to themselves build these clones with extremely high fidelity, then I’m sure computer use will make a ton of progress. And you’re also killing two birds with one stone with this kind of procedure, because getting AIs to rebuild whole complex applications from scratch is a great RL objective for coding as well.

But while computer use itself may soon be solved, its current lethargy tells us the following: that unless you can build a very replayable training target for a domain, the models will struggle to make much progress. The reason this is true is, of course, that the models are incredibly sample inefficient during training. This is the point I was making in my last monologue.

In computer use, we might be able to make up for this sample efficiency deficit by building these farmable deterministic simulators. But for so many different other kinds of skills an AGI would need to learn, we simply can’t do this.

How would we train an AI to build a business? How would you make an AI that’s really good at winning court cases? Or having a profitable day trading in the markets? Or helping a candidate win an election? The rollout requires interacting with the world and cannot be recreated simply within the datacenter. And the outer loop verification may take months or years of real world actions to elicit, and cannot be re-observed by perturbing the model’s actions thousands of times in parallel so that you can isolate what exactly the model did that actually worked.

Dealing with such reset-free non-stationary environments is a known open problem in RL. I’m not pointing out anything new. But I really do want to emphasize that because of the idiosyncratic and sparse nature of the data in most domains in the world, you need sample efficiency in order to get proficient.

If AIs are to develop all the skills that humans have, and even skills that no humans have, then they need to be able to learn from information revealed in unstructured, unverifiable, and ambiguous ways from scarce amounts of real world interaction. Because in many domains, the relevant training information simply doesn’t exist in any other way.

What is the RL environment to make an AI as good at politics as Lyndon Johnson, or as good at building a space launch business as Elon Musk?

Will RLVR alone generalize?

The labs are betting that RLVR will generalize to all these other domains. If you train in enough containerized, reproducible environments, you will develop a very general agent that can make and execute plans, and learn rapidly from new information, and even pick up new skills, all within a session.

If you drop this endlessly RLVRed AI into Texas politics in 1948, it could give you better advice than LBJ about winning the Senate seat. And if you gave it 100 million dollars in 2002 and let it cook, it could build SpaceX for you.

Whether RLVR generalizes that well is an open empirical question: if labs went from spending billions of dollars on RL environments to a trillion dollars, would you get a fully general, human-like intelligence operating within the context window?

Dario gave a telling quote during our podcast together, which I think hints that RLVR generalization is not this infinitely strong. When he was explaining why model performance tends to degrade at long context, he said:

There’s the context length you train at and there’s a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations.

Maybe I’m reading too much into it but he seems to be saying that short-horizon RL training doesn’t necessarily generalize to long-horizon RL performance. And if we can’t generalize from short to long horizon, how are agents supposed to generalize from lots of white collar task training, to, say, getting dropped into the real world and building a business from scratch as well as Sam Walton?

And even if after enough in-context experience, the AIs could become Albert Einsteins and Henry Fords, all that would be ephemeral and wasted if you can’t get those learnings back into the weights. Around 30-50% of a lab’s compute goes to inference, and that compute is currently not really doing anything productive in helping improve the model. What a waste! It’s even worse than it sounds. Because it is only in deployment that the most valuable bits of information which your model could learn from are revealed (What’s actually happening in the organizations I’m being used at? What are they using me for? And what kind of mistakes do I tend to make in the real world?)

We’ve got some genius grad student who has never been allowed to take an internship. And we keep giving it more and more classroom case studies in the form of RL training on environments. It’s bizarre and wasteful that we don’t train the AIs against all the experience they could be accumulating thanks to being so broadly deployed through the economy, getting to practice against millions of different assignments, and being privy to so much tacit organization- and domain-specific knowledge.

Getting the learning back to the weights

But this kind of continual learning requires going back to the weights. AIs can’t just keep building up a KV cache that grows in size as you keep learning from more and more users. That’s just not scalable, and it’s also not how humans learn. We don’t have some separation between parameters and activations. And there’s not some lump of these fast-weight representations that juts out further and further from our skull as we learn more things throughout our lifetime. When we learn stuff, there’s clearly some kind of compression, which actually aids generalization and grokking. There are in fact some humans who have this autistic savant type recall of random tables of numbers or nonsense syllables years later—basically the kind of fidelity of information that models have in context. And such sheer volume cripples these humans’ ability to understand abstractions and metaphors. Human continual learning is less about having all your observations at the tip of your tongue, and more about chiseling the right intuitions and big picture knowledge back into the weights.

But the moment you move into the weights, you have to give up on in-context learning’s sample efficiency. Because gradient updates are super sample-inefficient, all the successfully shipped online learning models have had to learn the same thing across millions of users. For example, the Cursor Tab model online-learns by predicting the same exact objective for over 400M requests a day (that objective is which edits got accepted). At least so far, we haven’t seen models online-learn different kinds of things for different users, because while a single session may generate more than enough data for a human to learn from, it’s not enough to train a more capable AI.

Current online learning can work for a very limited number of use cases. But the whole point of continual learning is that the world is very complicated, and each job and company and problem is different, and you need your intelligence to be able to learn the specific information related to a particular deployment, which simply can’t be stuffed into a shared training run. Things like how everything in your organization works and fits together, how to cooperate with the infrastructure and the other people around you to make progress on some larger project, what common failure modes are, etc.

This is the way in which sample efficiency and continual learning are actually deeply connected problems. Relatively little data is available to the model “on-the-job”. To learn from that data requires sample efficiency. Models can do that in context, but the “fast weights” built on the fly by attention[3] which allow for this sample efficiency scale very poorly in terms of memory. So we need architectural innovations which allow for some kind of intermediate representation. I talked before about how there are already many different working ideas for this kind of thing, from sparse attention to KV cache compaction. It doesn’t seem to me that architecture is fundamentally the bottleneck to continual learning.
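The memory arithmetic behind footnote 3 can be checked in a few lines. The layer count, KV head count, and head dimension below are assumptions taken from the publicly documented Llama 3 70B configuration, not from this post, as is the 16-bit KV cache. The parameter count, bit width, and 15 trillion training tokens are the footnote's own figures.

```python
# Assumed Llama 3 70B shape (from the public model config, not stated in the text).
N_LAYERS = 80
N_KV_HEADS = 8       # grouped-query attention: 8 key/value heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2  # assuming a 16-bit KV cache

# Figures given in the footnote.
PARAMS = 70e9
BITS_PER_PARAM = 16
TRAINING_TOKENS = 15e12

# In context: one key and one value vector per KV head, per layer, per token.
kv_bytes_per_token = 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE
kv_bits_per_token = kv_bytes_per_token * 8

# In the weights: total parameter bits spread over every token seen in training.
weight_bits_per_token = PARAMS * BITS_PER_PARAM / TRAINING_TOKENS

ratio = kv_bits_per_token / weight_bits_per_token

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"Weights per training token: {weight_bits_per_token:.3f} bits")
print(f"Ratio: about {ratio / 1e6:.0f} million")
```

Under those assumptions the numbers land on the footnote's figures: 320 KiB per token in context, about 0.075 bits per token in the weights, and a gap of roughly 35 million.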

Perhaps the bottleneck is the loss function. How do you update the weights (aka improve the model itself) based on information that was learned from one particular session? Even here naively it seems like there are many ideas that oughta work. Lots of people have been talking about on-policy self-distillation recently. If you want to learn more about how it works, check out this little impromptu blackboard lecture that Sasha Rush gave me a couple of weeks ago. But to summarize the explanation a bit, the idea is that we encourage the base model to make the same predictions when trying to solve some real world problem as the model with all the context accumulated after a long session would have made. The whole point of this procedure is to distill what the model learned in a session back into the weights themselves.

This is better than RLVR for two reasons. One, OPSD doesn’t require an outer loop verifiable reward. We just need a model that can learn the right things within the context window. As long as we have that, we can train the base model to match our veteran teacher model which has built up all this experience during the session. And two, OPSD provides a much denser supervision signal than naive RL—instead of projecting a single reward through the whole trajectory, you can train on the per token probability discrepancy between the teacher and student[4].
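Here is a minimal sketch of what "per token probability discrepancy" could look like. It assumes the discrepancy is measured as the reverse KL divergence, KL(student || teacher), at each position of a student-generated rollout. That is one common choice in on-policy distillation, but the text does not pin down the exact loss, so treat it as an assumption. The logits are made-up toy numbers over a three-token vocabulary, and `per_token_reverse_kl` is a hypothetical helper name.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one position."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def per_token_reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at each position of a student-sampled rollout."""
    losses = []
    for s_row, t_row in zip(student_logits, teacher_logits):
        log_s, log_t = log_softmax(s_row), log_softmax(t_row)
        losses.append(sum(math.exp(ls) * (ls - lt) for ls, lt in zip(log_s, log_t)))
    return losses


# Toy rollout of three tokens. The student is the base model without the session
# context; the teacher is the same model with the full session in context.
student_logits = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
teacher_logits = [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 1.0, 2.0]]

losses = per_token_reverse_kl(student_logits, teacher_logits)
for position, loss in enumerate(losses):
    print(f"token {position}: loss {loss:.4f}")
```

The contrast with outcome-based RL is in the shape of the signal: a verifier would hand back one scalar for the whole rollout, while this gives one number per token, zero where student and teacher already agree and largest at the position where they diverge most.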

For continual learning, OPSD is also superior to supervised fine tuning. The most naive version of SFT for this application you can imagine is to train the base model to predict all the tokens observed during the session. But this makes no sense as a learning target - the way you get better at your job is not by recalling the transcript of what happened through every single day with perfect fidelity. Rather, it’s by consolidating the handful of insights and pieces of knowledge that are relevant to doing your job better.

RL training doesn’t suffer from this failure mode, and it’s great at concentrating the gradient update to only what is relevant to getting the outcome right—that’s why the updates from RL are incredibly sparse. And this is a very important property for continual learning, because as you’re learning on the job, you don’t want to overwrite and forget all the other things the base model knows.

I wrote a post a few months earlier arguing that RL learns much less information per sample than supervised learning. But this may be a good thing rather than a bad thing—you only change the model as much as is absolutely necessary to achieve the outcome, and no more. OPSD preserves this property of RL where instead of slingshotting towards the teacher distribution like supervised learning would have you do, you only extract the knowledge that is necessary to achieve the same results on real world tasks.

Dreaming

So OPSD is one way to attack the sample-efficiency problem: you can take this scarce real world experience and squeeze all the signal into a tiny, well-targeted update. But there’s also another much more speculative idea. Let’s call it dreaming[5]. If the AI can build a good simulation of reality against which to rehearse new skills, or try alternative strategies and reinforce what works, then it could experience orders of magnitude more simulated samples in the same wall clock time.

A couple years after DeepMind released AlphaZero, a group of researchers trained a model called EfficientZero. If this model and a human both got 2 hours total to play against a simulator of an Atari game they hadn’t seen before, this model would likely beat the novice human. Does that mean this model was more sample efficient than humans? Well it depends on how you measure sample efficiency. Because for each step in the real game, EfficientZero is playing dozens of simulated games in its head. In a similar way, future LLMs might be able to consume far less real-world data while practicing endlessly against environments they build for themselves. The big difference, of course, is that it’s much harder to build a simulation of the whole world than it is to emulate the game of Go. That’s why I said this is much more speculative.
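A toy sketch of why "how you measure" matters. This is not EfficientZero's method (which plans with tree search over a learned latent model). It is a much simpler Dyna-style illustration in which the agent records each real transition and then replays the recorded transitions "in its head". The chain environment, the fixed always-go-right behaviour, the learning rate of 1, and the deterministic replay order are all simplifying assumptions chosen so the example stays small and reproducible.

```python
GAMMA = 0.9
N_STATES = 6   # states 0..5; reaching state 5 ends the episode
LEFT, RIGHT = 0, 1


def real_step(state, action):
    """The 'real world': a deterministic chain with reward only at the far end."""
    nxt = min(state + 1, N_STATES - 1) if action == RIGHT else max(state - 1, 0)
    reward = 1.0 if nxt == N_STATES - 1 else 0.0
    return nxt, reward


def q_update(q, s, a, r, s2):
    """One-step Q-learning update with learning rate 1 (overwrite with the target)."""
    if s2 == N_STATES - 1:  # terminal: nothing to bootstrap from
        q[(s, a)] = r
    else:
        q[(s, a)] = r + GAMMA * max(q.get((s2, b), 0.0) for b in (LEFT, RIGHT))


def run_episode(dream_sweeps):
    """Play one real episode; after each real step, replay the remembered
    transitions dream_sweeps times. Returns (Q at the start state, real steps,
    simulated updates)."""
    q, model = {}, []
    real_steps = simulated_updates = 0
    s = 0
    while s != N_STATES - 1:
        a = RIGHT  # fixed behaviour keeps the example deterministic
        s2, r = real_step(s, a)
        real_steps += 1
        q_update(q, s, a, r, s2)
        model.append((s, a, r, s2))
        for _ in range(dream_sweeps):
            for transition in reversed(model):  # most recent first
                q_update(q, *transition)
                simulated_updates += 1
        s = s2
    return q.get((0, RIGHT), 0.0), real_steps, simulated_updates


if __name__ == "__main__":
    print("no dreaming:  ", run_episode(dream_sweeps=0))
    print("with dreaming:", run_episode(dream_sweeps=1))
```

Both agents take the same five real steps. Without replay, the agent still knows nothing about the value of the start state after one episode; with replay, the reward has already propagated all the way back. Counted in real samples the dreaming agent looks far more efficient, but it paid for that with extra simulated updates.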

If it works, it would become a fourth axis of scaling, alongside pretraining, RL, and inference-time compute. You can call it test-time training or dreaming. The model spends compute writing up RL environments in which to rehearse the skills that will actually be used in production for a specific user. Instead of hitting /compact on Codex or Cursor or Claude, which kindles a small amount of compute to write up a summary, and which gives you a simulacrum of continual learning, you hit /dream, which incinerates huge amounts of compute to build and train against a video game version of what the model is witnessing in the world.

What 2027 looks like

So what might continual learning look like at the end of 2027, and how do we get there? All this RLVR training is producing an agent that can get its bearings when it’s thrown at an unfamiliar problem, and try different strategies, and iterate when it hits a roadblock. This is the crucial thing that RLVR has given you: an AI that is at least competent enough to start getting some real-world experience. Once you have that, you send it out into the world to do real work, even on projects off the training distribution.

By this point, effective context lengths may have expanded such that this AI can cowork with you for a full week of wall clock time. At the end of the week you give it a thumbs up or a thumbs down. If you give it a thumbs up, the base model distills everything the AI learned during the session, and it may use OPSD, or dreaming, or some other technique we aren’t even aware of, or a combination of all of the above, to do so. And the AI can get better at domains that are adjacent to what it was explicitly trained for beforehand with RLVR. And in the next round it gets better at the thing adjacent to what it previously learned online. The gamut of AI skills and knowledge and capability expands far beyond the verifiable domains against which the model was trained before it was deployed. Just as pre-training created a base intelligence that was smart enough to become a competent agent with further RLVR training, so RLVR has created an agent that is competent enough to actually be deployed in the world and thus take advantage of the future paradigm of continual learning.

By this point, the main way that AI gets better is not through the training received before the model is released to the public. Rather, it’s from all this experience that it is accumulating from being broadly deployed through the world and engaging in so many different kinds of tasks. Every time you interact with AI, it’ll be smarter. Not only because it has been learning from all your previous sessions, but also from all its interactions with all the other users in the world. And that’s scary and exciting and very different from the way that AI improves right now.

Sponsor

Mercury has automated basically my entire bill pay process for my business. I just give contractors a dedicated email address, and when they send an invoice, Mercury automatically creates a draft payment for me to review. I no longer have to hunt through my inbox for invoices or deal with messy spreadsheets to track my bills. Mercury handles it all. Learn more at mercury.com.


  1. Just one hour of video consumes around 1 million tokens of text.

  2. I’ve heard that AI agents are especially good at Go, because it has an excellent standardized package manager whereas Python and Typescript have a “massive combinatorial space of frameworks, typing approaches, and utility libraries.” Such spaces are less amenable to clean, high throughput, parallel search via gradient descent.

  3. Let’s use Llama 3 70B as a reference. The KV cache (aka the representation that is built up from learning the context) grows 320 KB with each token. Whereas in training, the model only stores 0.075 bits per token (it’s a 70B model with 16 bit parameters trained on 15 trillion tokens). So between in-context learning and pre-training there’s a 35 million fold difference in the amount of information you’re storing per token.

  4. One obvious issue you might anticipate with OPSD is this: you get dense supervision up to and at the point where the student makes an error, but the rest of the trajectory follows from that error - it continues down an already-mistaken path that the teacher was never going to visit anyway. So past that point, you’re no longer getting useful feedback from the teacher for the rest of the rollout. This seems fixable by a technique called Trajectory-Refined Distillation, where the teacher rewrites the trajectory from the error onward into a complete, correct continuation.

  5. For what it’s worth, I’m not talking about the upcoming dreaming feature in the leaked Claude Code source code, which I’m guessing will be more about the model writing lots of Markdown files for itself. I mean actually updating the weights themselves. I just don’t think you can accumulate new skills by passing yourself notes. The analogy I used in a previous blog post: imagine if the way students learned how to play the saxophone was by giving this new instrument they’d never tried before a go, taking some notes about what went wrong, and then passing those notes to the next student, who is also playing it for the first time.
