Some thoughts on the Sutton interview

I have a much better understanding of Sutton’s vision now

Boy do you guys have a lot of thoughts about the Sutton interview.

I’ve been thinking about it myself. I have a much better understanding of Sutton’s perspective now than I did during the interview itself. So I want to reflect on it a bit.

Richard, apologies for any errors or misunderstandings. It’s been very productive to learn from your thoughts.

The steelman

What is the bitter lesson about? It is not saying that you should just throw as much compute as possible at a problem. The bitter lesson says that you want to come up with techniques which most effectively and scalably leverage compute.

Most of the compute spent on an LLM is used on running it in deployment. And yet it’s not learning anything during this time! It’s only learning during this special phase we call training. That is not an effective use of compute. And even the training period by itself is highly inefficient - these models are usually trained on the equivalent of tens of thousands of years of human experience.
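To put a rough number on that, here’s a back-of-the-envelope calculation. The corpus size and reading speed are assumptions of mine (a roughly 15-trillion-token pretraining run is a commonly cited scale), not figures from the interview:

```python
# Rough back-of-the-envelope: how many "human years" of reading is a pretraining corpus?
# All numbers are assumptions for illustration, not figures from the interview.
pretraining_tokens = 15e12        # ~15T tokens, a commonly cited pretraining scale
words_per_token = 0.75            # rough rule of thumb for English text
reading_words_per_minute = 250    # typical adult reading speed

minutes = pretraining_tokens * words_per_token / reading_words_per_minute
years_nonstop = minutes / (60 * 24 * 365)
print(f"{years_nonstop:,.0f} years of non-stop reading")  # on the order of 10^4 years
```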

What’s more, during this training phase, all their learning comes straight from human data. This is an obvious point in the case of pretraining data. But it’s even kind of true for the RLVR we do on LLMs: these RL environments are human-furnished playgrounds to teach LLMs the specific skills we have prescribed for them.

The agent is in no substantial way learning from organic and self-directed engagement with the world. Having to learn only from human data (an inelastic hard-to-scale resource) is not a scalable use of compute.

What these LLMs learn from training is not a true world model (which tells you how the environment changes in response to different actions). Rather, they are building a model of what a human would say next. And this leads them to rely on human-derived concepts. If you trained an LLM on the data from 1900, it wouldn’t be able to come up with relativity from scratch.

LLMs aren’t capable of learning on-the-job, so we’ll need some new architecture to enable continual learning. And once we have it, we won’t need a special training phase — the agent will just learn on-the-fly, like all humans, and indeed, like all animals. This new paradigm will render our current approach with LLMs obsolete.

TLDR of my current thoughts

My main difference with Rich is that I think the concepts he’s using to distinguish LLMs from true intelligence are not actually mutually exclusive or dichotomous.

Imitation learning is continuous with and complementary to RL. Models of humans can give you a prior which facilitates learning “true” world models. I also wouldn’t be surprised if some future version of test-time fine-tuning could replicate continual learning, in the same way that in-context learning does now.

Imitation learning is continuous with and complementary to RL

I tried to ask Richard a couple of times whether pretrained LLMs can serve as a good prior on which to accumulate the experiential learning (aka do the RL) which will lead to AGI.

In a talk a few months ago, Ilya compared pretraining data to fossil fuels. This analogy has remarkable reach. Just because fossil fuels are not renewable does not mean that our civilization ended up on a dead-end track by using them. You simply couldn’t have jumped straight from the water wheels of 1800 to solar panels and fusion power plants. We had to use this cheap, convenient, plentiful intermediary.

AlphaGo (which was initialized with human games) and AlphaZero (which was bootstrapped from scratch) were both superhuman Go players. AlphaZero was better.

Will we (or the first AGIs) eventually come up with a general learning technique that requires no initialization of knowledge - that just bootstraps itself from the very start? And will it outperform the very best AIs that have been trained to that date? Probably yes.

But does this mean that imitation learning must not play any role whatsoever in developing the first AGI, or even the first ASI? No. AlphaGo was still superhuman, despite being initially shepherded by human player data. The human data isn’t necessarily actively detrimental - at enough scale it just isn’t significantly helpful.

The accumulation of knowledge over tens of thousands of years has clearly been essential to humanity’s success. In any field of knowledge, thousands (and likely millions) of previous people were involved in building up our understanding and passing it on to the next generation. We didn’t invent the language we speak, nor the legal system we use, nor even most of the knowledge relevant to the technologies in our phones. This process is more analogous to imitation learning than to RL from scratch.

Are we literally predicting the next token (like an LLM) in order to do this cultural learning? No, and so even imitation learning for humans is not like supervised learning for AI. But neither are we running around trying to collect some well-defined scalar reward. No ML learning regime perfectly describes human learning. We do things which are analogous to both RL and supervised learning. As planes are to birds, supervised learning might be to human cultural learning.

I also don’t think these learning techniques are categorically different. Imitation learning is just short-horizon RL. The episode is one token long. The LLM makes a conjecture about the next token based on its understanding of the world and the other information in the sequence. And it receives reward in proportion to how well it predicted the true token.
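Here’s one way to make that framing concrete (a toy sketch of mine, not anything from the interview): score each prediction with a per-token “reward” equal to the log-probability the model assigned to the token a human actually produced. Maximizing that reward over the data is exactly the standard next-token cross-entropy objective.

```python
import torch
import torch.nn.functional as F

# Toy illustration: next-token prediction viewed as a one-step "episode" whose
# reward is the log-probability assigned to the observed token.
vocab_size = 100
logits = torch.randn(1, vocab_size)       # stand-in for a model's output distribution
observed_token = torch.tensor([42])       # the token a human actually wrote

log_probs = F.log_softmax(logits, dim=-1)
reward = log_probs[0, observed_token]     # higher when the prediction was better

# Maximizing this reward is the same as minimizing the usual cross-entropy loss:
loss = F.cross_entropy(logits, observed_token)
print(torch.allclose(-reward, loss))      # True
```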

Now, I already hear people saying: “No no, that’s not ground truth! It’s just learning what a human was likely to say.”

Agreed. But there’s a different question which I think is more relevant to the scalability of these models: can we leverage imitation learning to help models learn better from ground truth?

And I think the answer is, obviously yes? We have RLed models to win gold at the IMO and code up entire working applications from scratch. These are “ground truth” examinations. But you couldn’t RL a model to accomplish these feats from scratch. You needed a reasonable prior over human data in order to kick-start the RL process.

Whether you wanna call this prior a proper “world model”, or just a model of humans, seems like a semantic debate to be honest. Because what you really care about is whether this model of humans helps you start learning from ground truth (aka become a “true” world model).

It’s a bit like saying to someone pasteurizing milk, “Hey stop boiling that milk - we eventually want to serve it cold!” Yes, of course. But this is an intermediate step to facilitate the final output.

By the way, LLMs have clearly developed a deep representation of the world (because their training process incentivizes them to). I use LLMs to teach me about everything from biology to AI to history, and they do so with remarkable flexibility and coherence. Are LLMs specifically trained to model how their actions will affect the world? No.

But if we’re not allowed to call their representations a “world model”, then we’re defining the term “world model” by the process we think is necessary to build one, rather than by the obvious capabilities the concept implies.

Continual learning

Sorry to bring up my hobby horse again. I’m like a comedian who’s only come up with one good bit.

An LLM being RLed on outcome-based rewards learns O(1) bits per episode, and the episode may be tens of thousands of tokens long.
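Some rough arithmetic on what that sparsity means (the episode length and the pretraining figure below are assumed numbers of mine, just for illustration):

```python
import math

# A binary pass/fail outcome carries at most 1 bit of information about the whole episode.
episode_tokens = 20_000                                      # assumed length of a long agentic episode
outcome_bits = math.log2(2)                                  # = 1 bit for a binary reward
bits_per_token_from_outcome = outcome_bits / episode_tokens  # 0.00005 bits per token

# Next-token prediction, by contrast, gives a training signal on every single token,
# on the order of a bit or more per token depending on how predictable the text is.
assumed_pretraining_bits_per_token = 1.0
print(assumed_pretraining_bits_per_token / bits_per_token_from_outcome)  # ~20,000x denser
```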

We animals clearly extract far more information from interacting with our environment than just the reward signal at the end of each episode. Conceptually, how should we think about what is happening with animals?

We’re learning to model the world through observations. The outer loop RL is incentivizing some other learning system to pick up maximum signal from the environment. In Richard’s OaK architecture, he calls this the transition model.

If we were trying to pigeonhole this feature spec into LLMs, the obvious move would be to fine-tune on all the tokens the model observes. From what I hear from my researcher friends, in practice the most naive way of doing this doesn’t work well.
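By “the most naive way” I mean something like the loop below - just keep taking supervised gradient steps on whatever the agent observes as it goes. The `model`, `optimizer`, and `observation_stream` names are hypothetical placeholders of mine:

```python
import torch.nn.functional as F

# The naive version of continual learning for an LLM: as observations stream in,
# keep doing next-token-prediction updates on them. `model`, `optimizer`, and
# `observation_stream` are hypothetical placeholders, not a real training setup.
def naive_continual_finetune(model, optimizer, observation_stream):
    for tokens in observation_stream:       # tokens: a 1-D tensor of observed token ids
        logits = model(tokens[:-1])         # predict each next token from its prefix
        loss = F.cross_entropy(logits, tokens[1:])
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```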

Being able to continuously learn from the environment in a high throughput way is obviously necessary for true AGI. And it clearly doesn’t exist with LLMs trained on RLVR.

But there might be some relatively straightforward ways to shoehorn continual learning atop LLMs. For example, one could imagine making SFT a tool call for the model. So the outer loop RL is incentivizing the model to teach itself effectively using supervised learning, in order to solve problems that don’t fit in the context window.
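A minimal sketch of what that could look like, with `model.act`, `finetune_tool`, and `done` as made-up placeholders of mine rather than any real API:

```python
# Hypothetical sketch: expose supervised fine-tuning to the agent as just another tool.
# The outer loop RL would then reward the agent for using it in ways that actually
# improve performance on later tasks.

def finetune_tool(model, examples, lr=1e-5):
    """Placeholder: run a small SFT pass on self-authored (prompt, completion) pairs."""
    ...

def agent_loop(model, task, tools):
    history = [task]
    while not done(history):                          # `done` is a placeholder
        action = model.act(history, tools=tools)      # the model may pick "finetune" as a tool
        if action.name == "finetune":
            # The model writes its own training examples: lessons it wants to keep
            # beyond the current context window.
            finetune_tool(model, examples=action.args["examples"])
            history.append("finetune: lessons written into the weights")
        else:
            history.append(tools[action.name](action.args))
    return history
```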

I’m genuinely agnostic about how well techniques like this will work—I’m not an AI researcher. But I wouldn’t be surprised if they basically replicate continual learning. Models already demonstrate something resembling human continual learning within their context windows. The fact that in-context learning emerged spontaneously from the training incentive to process long sequences suggests that if information could flow across windows longer than the current context limit, models could meta-learn the same flexibility they already show in-context.

Concluding thoughts

Evolution does meta-RL to make an RL agent. That agent can selectively do imitation learning. With LLMs, we’re going the opposite way. We first made a base model that does pure imitation learning. Then we do RL on it to make it a coherent agent with goals and self-awareness.

Maybe this won’t work!

But I don’t think these super first-principle arguments (for example, about how an LLM doesn’t have a true world model) prove much. I also don’t think they’re strictly accurate for the models we have today, which undergo a lot of RL on “ground truth”.

Even if Sutton’s Platonic ideal doesn’t end up being the path to the first AGI, he’s identifying genuine basic gaps which we don’t even notice because they are so pervasive in the current paradigm: lack of continual learning, abysmal sample efficiency, dependence on exhaustible human data. If LLMs do get to AGI first, the successor systems they build will almost certainly be based on Richard’s vision.
