How Does Claude 4 Think? — Sholto Douglas & Trenton Bricken

Playback speed

Share post at current time

Share from 0:00

0:00

Transcript

Is RL + LLMs enough for AGI? — Sholto Douglas & Trenton Bricken

Scaling reinforcement learning, tracing circuits, and the path to fully autonomous agents

Dwarkesh Patel

May 22, 2025

New episode with my good friends Sholto Douglas & Trenton Bricken. Sholto focuses on scaling RL and Trenton researches mechanistic interpretability, both at Anthropic.

We talk through what’s changed in the last year of AI research; the new RL regime and how far it can scale; how to trace a model’s thoughts; and how countries, workers, and students should prepare for AGI.

See you next year for v3. Here’s last year’s episode, btw. Enjoy!

Watch on YouTube; listen on Apple Podcasts or Spotify.

Timestamps

(00:00:00) – How far can RL scale?

(00:16:27) – Is continual learning a key bottleneck?

(00:31:59) – Model self-awareness

(00:50:32) – Taste and slop

(01:00:51) – How soon to fully autonomous agents?

(01:15:17) – Neuralese

(01:18:55) – Inference compute will bottleneck AGI

(01:23:01) – DeepSeek algorithmic improvements

(01:37:42) – Why are LLMs ‘baby AGI’ but not AlphaZero?

(01:45:38) – Mech interp

(01:56:15) – How countries should prepare for AGI

(02:10:26) – Automating white collar work

(02:15:35) – Advice for students

Transcript

00:00:00 – How far can RL scale?

Dwarkesh Patel

Okay. I'm joined again by my friends, Sholto Bricken... Wait, fuck. Did I do this last time?

Sholto Douglas

You did the same thing.

Trenton Bricken

No, no, you named us differently, but we didn't have Sholto Bricken and Trenton Douglas.

Sholto Douglas

You swapped us.

Dwarkesh Patel

Sholto Douglas and Trenton Bricken, who are now both at Anthropic.

Trenton Bricken

Yeah. Let's go.

Dwarkesh Patel

Sholto is scaling RL, Trenton's still working on mechanistic interpretability. Welcome back.

Sholto Douglas

Happy to be here.

Trenton Bricken

Yeah, it's fun.

Dwarkesh Patel

What's changed since last year? We talked basically this month in 2024.

Sholto Douglas

Yep.

Dwarkesh Patel

Now, we're in 2025. What's happened?

Sholto Douglas

Okay, so I think the biggest thing that's changed is that RL in language models has finally worked. We finally have proof of an algorithm that can give us expert human reliability and performance, given the right feedback loop. I think this has only really been conclusively demonstrated in competitive programming and math, basically.

Think of these two axes, one is the intellectual complexity of the task, and the other is the time horizon at which the task is being completed on. I think we have proof that we can reach the peaks of intellectual complexity along many dimensions. We haven't yet demonstrated long-running agentic performance. You're seeing the first stumbling steps of that now, and should see much more conclusive evidence of that basically by the end of the year, with real software engineering agents doing real work. I think Trenton, you're experimenting with this at the moment?

Trenton Bricken

Yeah, absolutely. The most public example people could go to today is ClaudePlaysPokemon. Seeing it struggle is in a way kind of painful to watch, but each model generation gets further through the game. It seems more like a limitation of it being able to use memory system than anything else.

Dwarkesh Patel

I wish we had recorded predictions last year. We definitely should this year.

Trenton Bricken

Hold us accountable.

Dwarkesh Patel

That's right. Would you have said that agents would be only this powerful as of last year?

Sholto Douglas

I think this is roughly on track for where I expected with software engineering. I think I expected them to be a little bit better at computer use. But I understand all the reasons for why that is, and I think that's well on track to be solved. It's just a sort of temporary lapse.

Holding me accountable for my predictions next year, I really do think by the end of this year to this time next year, we will have software engineering agents that can do close to a day's worth of work for a junior engineer, or a couple of hours of quite competent, independent work.

Trenton Bricken

Yeah, that seems right to me. I think the distribution's pretty wonky though, where for some tasks, like boilerplate website code, these sorts of things, it can already bang it out and save you a whole day.

Sholto Douglas

Yeah, exactly.

Dwarkesh Patel

I think last year, you said that the thing that was holding them back was the extra nines of reliability. I don't know if that's the way you would still describe the way in which these software agents aren't able to do a full day of work, but are able to help you out with a couple minutes. Is it the extra nines that's really stopping you or is it something else?

Sholto Douglas

I think my description there was, in retrospect, probably not what's limiting them. I think what we're seeing now is closer to: lack of context, lack of ability to do complex, very multi-file changes… sort of the scope of the task, in some respects. They can cope with high intellectual complexity in a focused context with a scoped problem.

When something's a bit more amorphous or requires a lot of discovery and iteration with the environment, with this kind of stuff they struggle more. Maybe the way I would define the thing that's holding them back like this. If you can give it a good feedback loop for the thing that you want it to do, then it's pretty good at it. If you can't, then they struggle a bit.

Dwarkesh Patel

For the audience, can you say more about what you mean by this feedback loop if they're not aware of what's happening with RL and so forth?

Sholto Douglas

Yes, so it’s the big thing that really worked over the last year. Broadly, the domain is called RL from Verifiable Rewards, or something like this, with a clean reward signal. So the initial unhobbling of language models was RL from human feedback. Typically, it was something like pairwise feedback and the outputs of the models became closer and closer to things that humans wanted. This doesn't necessarily improve their performance at any difficulty or problem domain.

Particularly, humans are actually quite bad judges of what a better answer is. Humans have things like length biases and so forth. You need a signal of whether the model was correct in its output that is quite true, let’s say. Things like the correct answer to a math problem, or passing unit tests. These are the examples of a reward signal that's very clean.

Even these can be hacked, by the way. Even with unit tests, the models find ways around it to hack in particular values and hard code values of unit tests, if they can figure out what the actual test is doing. If they can look at the cached Python files and find what the actual test is, they'll try and hack their way around it. These aren't perfect, but they're much closer.

Dwarkesh Patel

Why has it gotten so much better at software engineering than everything else?

Sholto Douglas

In part, because software engineering is very verifiable. It's a domain which just naturally lends itself to this way.

Trenton Bricken

Does the code pass the test? Does it even run? Does it compile?

Sholto Douglas

Yeah, does it compile? Does it pass the test? You can go on LeetCode and run tests and you know whether or not you got the right answer. There isn't the same kind of thing for writing a great essay. That requires... The question of taste in that regard is quite hard. We discussed the other night at dinner, the Pulitzer Prize. Which would come first, a Pulitzer Prize winning novel or a Nobel Prize or something like this?

I actually think a Nobel Prize is more likely than a Pulitzer Prize-winning novel in some respects. Because a lot of the tasks required in winning a Nobel Prize—or at least strongly assisting in helping to win a Nobel Prize—have more layers of verifiability built up. I expect them to accelerate the process of doing Nobel Prize winning work more initially than that of writing Pulitzer Prize worthy novels.

Trenton Bricken

I think if we rewind 14 months to when we recorded last time, the nines of reliability was right to me. We didn't have Claude Code, we didn't have Deep Research. All we did was use agents in a chatbot format.

Sholto Douglas

Right. Copy paste, copy paste, copy paste.

Trenton Bricken

Totally. We're very used to chat interfaces, whether we're texting or using Google. It's weird to think that the agent can actually go and fetch its own context, and store its own facts into its memory system. I still think that it's the nines of reliability. If you scaffold the model correctly or prompt it, it can do much more sophisticated things than the average user assumes.

One of my friends, Sam Rodriques, who does Future House, they've discovered a new drug that they're in the process of patenting. By the time this episode comes out that will be live.

Dwarkesh Patel

LSD v2?

Sholto Douglas

Wait, is it really?

Trenton Bricken

No, they're not making LSD. But people didn't think that models could be creative or do new science. It does just seem like a skill issue.

Dwarkesh Patel

Wait, it discovered a drug? How did it? Did it one-shot the molecules?

Trenton Bricken

This was just over a conversation. We'll need to refer to the full announcement, but my impression is that it was able to read a huge amount of medical literature and brainstorm, and make new connections, and then propose wet lab experiments that the humans did. Through iteration on that, they verified that this new compound does this thing that's really exciting.

Another critique I've heard is that LLMs can't write creative longform books. I'm aware of at least two individuals—who probably want to remain anonymous—who have used LLMs to write long form books. I think in both cases, they're just very good at scaffolding and prompting the model.

Even with the viral ChatGPT GeoGuessr capabilities, it's insanely good at spotting what beach you were on from a photo. Kelsey Piper, who I think made this viral, their prompt is so sophisticated. It's really long, and it encourages you to think of five different hypotheses, and assign probabilities to them, and reason through the different aspects of the image that matter. I haven't A/B tested it, but I think unless you really encourage the model to be this thoughtful, you wouldn't get the level of performance that you see with that ability.

Dwarkesh Patel

You're bringing up ways in which people have constrained what the model is outputting to get the good part of the distribution. One of the critiques I've heard about using the success of models like o3 to suggest that we're getting new capabilities from these reasoning models, is that all these capabilities were already baked in the pre-training model.

I think there's a paper from Tsinghua University, where they showed that if you give a base model enough tries to answer a question, it can still answer the question as well as the reasoning model. It basically just has a lower probability of answering correctly. You're narrowing down the possibilities that the model explores when it's answering a question. Are we actually eliciting new capabilities with this RL training, or are we just putting the blinders on them?

Sholto Douglas

Right, like carving away the marbles on this. I think it's worth noting that that paper was, I'm pretty sure, on the Llama and Qwen models. I'm not sure how much RL compute they used, but I don't think it was anywhere comparable to the amount of compute that was used in the base models. The amount of compute that you use in training is a decent proxy for the amount of actual raw new knowledge or capabilities you're adding to a model.

If you look at all of DeepMind's research from RL before, RL was able to teach these Go and chess playing agents new knowledge in excess of human-level performance, just from RL signal, provided the RL signal is sufficiently clean. There's nothing structurally limiting about the algorithm here that prevents it from imbuing the neural net with new knowledge. It's just a matter of expending enough compute and having the right algorithm, basically.

Dwarkesh Patel

Why aren't you already spending more compute on this? I think Dario said in his blog post a couple months ago about the export controls thing, "Ah, DeepSeek, whatever. We're only spending $1 million on RL," or something. “We aren't in the compute limited regime for RL yet, but we will be soon." You're spending hundreds of millions on the base model. Why only order a million on the RL?

Sholto Douglas

You know the parable about when you choose to launch a space mission? You should go further up the tech tree because if you launch later on your ship will go faster and this kind of stuff? I think it's quite similar to that. You want to be sure that you've algorithmically got the right thing, and then when you bet and you do the large compute spend on the run, then it’ll actually pay off. They'll have the right compute efficiencies and this kind of stuff.

I think RL is slightly different to pre-training in this regard. RL can be a more iterative thing. You're progressively adding capabilities to the base model. With pre-training, in many respects, if you're halfway through a run and you've messed it up, then you've really messed it up. I think that's the main reason why. People are still figuring out exactly what they want it to do. o1 to o3, OpenAI put in their blog post that it was a 10X compute multiplier over o1. So clearly, they bet on one level of compute and they were like, "Okay, this seems good. Let's actually release it. Let's get it out there."

Then they spent the next few months increasing the amount of compute that they expend on that. Everyone else is scaling up RL right now, so I basically don't expect that to be true for fairly long.

Trenton Bricken

Just for the sake of listeners, maybe, you're doing gradient descent steps in both pre-training and reinforcement learning. It's just the signal's different. Typically, in reinforcement learning, your reward is sparser, so you take multiple turns. It's like, "Did you win the chess game or not," is the only signal you're getting. Often you can't compute gradients through discrete actions. So you end up losing a lot of gradient signal.

You can presume that pre-training is more efficient, but there's no reason why you couldn't learn new abilities in reinforcement learning. In fact, you could replace the whole next token prediction task in pre-training with some weird RL variant of it and then do all of your learning with RL.

Dwarkesh Patel

Yeah, at the end of the day, just signal and then correcting to it.

Trenton Bricken

Totally. Then going back to the paper you mentioned, aside from the caveats that Sholto brings up, which I think is the first order, most important, I think zeroing in on the probability space of meaningful actions comes back to the nines of reliability. Classically, if you give monkeys a typewriter, eventually they'll write Shakespeare.

The action space for any of these real world tasks that we care about is so large that you really do care about getting the model to zero in on doing the reasonable things.

Dwarkesh Patel

To the extent that at some pass you’re like "Hey, you've got token space"...

Sholto Douglas

Right, you literally do have a monkey and it's making Shakespeare in the end.

Dwarkesh Patel

The chess analogy is interesting. Sorry, were you about to say something?

Sholto Douglas

Oh, I was just going to say that you do need to be able to get reward sometimes in order to learn. That's the complexity in some respects. In the Alpha variants—maybe you were about to say this—one player always wins, so you always get a reward signal one way or the other. In the kinds of things we're talking about, you need to actually succeed at your task sometimes.

Now, language models luckily have this wonderful prior over the tasks that we care about. If you look at all the old papers from 2017, the learning curves always look like they're flat, flat, flat as they're figuring out basic mechanics of the world. Then there's this spike up as they learn to exploit the easy rewards. Then it's almost like a sigmoid in some respects. Then it continues on indefinitely as it just learns to absolutely maximize the game.

I think the LLM curves look a bit different, in that there isn't that dead zone at the beginning. They already know how to solve some of the basic tasks. You get this initial spike. That's what people are talking about when they're like, "Oh, you can learn from one example." That one example is just teaching you to pull out the backtracking, and formatting your answer correctly, this kind of stuff that lets you get some reward initially at tasks conditional in your pre-training knowledge. The rest is probably you learning normal stuff.

Dwarkesh Patel

That's really interesting. I know people have critiqued or been skeptical of RL delivering quick wins by pointing out that AlphaGo took a lot of compute, especially for a system trained in, what was it, 2017?

Sholto Douglas

Yeah, it's off the curve.

Dwarkesh Patel

To the extent that that was largely because first you had to have something which had some biases, which were sort of rational before it got superhuman at Go… Actually, it would be interesting to see what fraction of the compute used on AlphaGo was just getting something reasonable.

Sholto Douglas

Yeah, it would be interesting.

Trenton Bricken

To make the map from pre-training to RL really explicit here, during pre-training, the large language model is predicting the next token of its vocabulary of, let's say, I don't know, 50,000 tokens. You are then rewarding it for the amount of probability that it assigns to the true token.

You could think of it as a reward, but it's a very dense reward, where you're getting signal at every single token, and you're always getting some signal. Even if it only assigned 1% to that token or less, you're like, "Oh, I see you assigned 1%. Good job. Keep doing that."

Sholto Douglas

Yeah, upweight it.

Trenton Bricken

Yeah, exactly.

Sholto Douglas

It's like a tug in the gradient.

Trenton Bricken

That's right.

00:16:27 – Is continual learning a key bottleneck?

Dwarkesh Patel

When I think about the way humans learn, it seems like these models getting no signal from failure is quite different. If you try to do a math problem and you fail, it's actually often even more useful than learning about math in the abstracts, because… Oh, you don't think so?

Trenton Bricken

Only if you get feedback.

Sholto Douglas

Yeah, only if you get feedback.

Dwarkesh Patel

I think there's a way in which you actually give yourself feedback. You fail and you notice where you failed.

Trenton Bricken

Only if you get feedback, I think, at times.

Dwarkesh Patel

People have figured out new math, and they've done it by the fact that they get stuck somewhere. They're like, "Why am I getting stuck here? Let me think through this."

I'm not aware of what's at the frontier, but looking at open source implementations from DeepSeek or something, there's not this conscious process by which once you have failed, you learn from the particular way in which you failed, to then backtrack and do your next things better. Just pure gradient descent, I wonder if that's a big limitation.

Trenton Bricken

I don't know. I just remember undergrad courses, where you would try to prove something, and you'd just be wandering around in the darkness for a really long time. Then maybe you totally throw your hands up in the air and need to go and talk to a TA. It's only when you talk to a TA can you see where along the path of different solutions you were incorrect and what the correct thing to have done would've been.

That's in the case where you know what the final answer is, right? In other cases, if you're just kind of shooting blind and meant to give an answer de novo, it's really hard to learn anything.

Dwarkesh Patel

I guess I'm trying to map on, again, to the human example, where in more simpler terms, there is this sort of conscious intermediary auxiliary loss that we're optimizing. It's a very sort of self-conscious process. Forget about math. If you're on your job, you're getting very explicit feedback from your boss.

That's not necessarily how the task should be done differently, but a high-level explanation of what you did wrong, which you update on not in the way that pre-training updates weights, but more in the… I don’t know.

Trenton Bricken

I think there's a lot of implicit dense reward signals here.

Sholto Douglas

Yeah, exactly.

Trenton Bricken

Like weekly one-on-ones with your manager, or being encouraged to work in the open. Even with homework assignments, they're so scaffolded. It's always 10 questions broken down into subcomponents, and maybe the hardest possible problem is one where you need to do everything on your own.

Dwarkesh Patel

Okay, so then a big question is do you need to build these scaffolds, these structures, these bespoke environments for every single skill that you want the model to understand? Then it's going to be a decade of grinding through these sub-skills? Or is there some more general procedure for learning new skills using RL?

Sholto Douglas

It's an efficiency question there. Obviously, if you could give a dense reward for every token, if you had a supervised example, then that's one of the best things you could have. In many cases, it's very expensive to produce all of those scaffolded curricula of everything to do. Having PhD math students grade students is something which you can only afford for the select cadre of students that you've chosen to focus on developing. You couldn't do that for all the language models in the world.

First step is that obviously, that would be better. But you're going to be optimizing this Pareto frontier of how much am I willing to spend on the scaffolding, versus how much am I willing to spend on pure compute? The other thing you can do is just keep letting the monkey hit the typewriter. If you have a good enough end reward, then eventually, it will find its way.

I can't really talk about where exactly people sit on that scaffold. I think different people, different tasks are on different points there. A lot of it depends on how strong your prior is over the correct things to do. But that's the equation you're optimizing. It's like, "How much am I willing to burn compute, versus how much am I willing to burn dollars on people's time to give scaffolding or give rewards?"

Dwarkesh Patel

Interesting. You say we're not willing to do this for LLMs, but we are for people. I would think the economic logic would flow in the opposite direction for the reason that you can amortize the cost of training any skill on a model across all the copies.

Sholto Douglas

We are willing to do this for LLMs to some degree. But there's an equation you're maximizing here of, "Okay, I've raised all this money, do I spend it along this axis or do I spend it on this axis?"

Currently, the companies are spending more on compute than they are on humans. Otherwise, Scale AI's revenue would be like $10 billion. Look at it, NVIDIA's revenue is much higher than Scale AI's revenue. Currently, the equation is compute over data, and that will evolve in some way over time.

Dwarkesh Patel

Yeah, interesting. I am curious how it evolves. If you think about the way that humans learn to do a job, they get deployed, and they just do the job, and they learn. Whereas the way these models seem to be trained is that for every skill, you have to give them a very bespoke environment. If they were trained the way humans are trained, then...

Sholto Douglas

On the job.

Dwarkesh Patel

Yeah, exactly. Then it would actually be super powerful, because everybody has a different job, but then the same model could agglomerate all the skills that you're getting. I don't know, I've been doing the podcast for the last few years. I'm becoming a better podcaster. You have a slightly more valuable skill of doing AI research.

Trenton Bricken

I don't know about that.

Dwarkesh Patel

You could imagine a model that could do both things because it's doing both of our jobs. Copies of the model are doing both jobs. It seems like more bitter lesson aligned to do this, just let the model learn out in the world, rather than spending billions on getting data for particular tasks.

Trenton Bricken

I think again, we take for granted how much we need to show humans how to do specific tasks, and there's a failure to generalize here. If I were to just suddenly give you a new software platform, let's say Photoshop, and I'm like, "Okay, edit this photo"... If you've never used Photoshop before, it'd be really hard to navigate. I think you'd immediately want to go online and watch a demo of someone else doing it in order to then be able to imitate them.

Dwarkesh Patel

We surely give that amount of data on every single task to the models.

Trenton Bricken

Okay. This is the first thing. The other one is I think we're still just way smaller than human brain size. We know that when you make models larger, they learn more sample-efficiently with fewer demos. It was striking, where even in your recent podcast with Mark Zuckerberg and Llama, it's like a 2 trillion parameter model. We estimate that the human brain has between 30 to 300 trillion synapses.

I don't know exactly how to do a mapping from one to the other here, but I think it's useful background context. I think it's quite likely we're still smaller than the human brain. Even with the 4.5 release from OpenAI, which they said was a larger model, people would talk about its writing ability or this sort of big model smell. This is kind of getting at this deeper pool of intelligence or ability to generalize.

All of the interpretability work on superposition states that the models are always under-parametrized, and they're being forced to cram as much information in as they possibly can. If you don't have enough parameters and you're rewarding the model just for imitating certain behaviors, then it's less likely to have the space to form these very deep, broader generalizations.

Sholto Douglas

The language result is really cool. You should talk about the language result. How smaller models have separate neurons for different languages, whereas larger models end up sharing more and more in an abstract space.

Trenton Bricken

Yeah, yeah. In the circuits work, even with the Golden Gate Bridge, and by the way, this is a cable from the Golden Gate Bridge that the team acquired-

Dwarkesh Patel

They had to destabilize the bridge in order to get this.

Trenton Bricken

Claude will fix it. Claude loves the Golden Gate Bridge.

Trenton Bricken

Even with this, for people who aren't familiar we made Golden Gate Claude when we released our paper, “Scaling Monosemanticity”, where one of the 30 million features was for the Golden Gate Bridge. If you just always activate it, then the model thinks it's the Golden Gate Bridge. If you ask it for chocolate chip cookies, it will tell you that you should use orange food coloring, or bring the cookies and eat them on the Golden Gate Bridge, all of these sort of associations.

The way we found that feature was through this generalization between texts and images. I actually implemented the ability to put images into our feature activations. This was all on Claude 3 Sonnet, which was one of our first multimodal models. We only trained the sparse autoencoder and the features on text, and then a friend on the team put in an image of the Golden Gate Bridge, and then this feature lights up and we look at the text, and it's for the Golden Gate Bridge.

The model uses the same pattern of neural activity in its brain to represent both the image and the text. Our circuits work shows this, again, across multiple languages, there's the same notion for something being large or small, or hot or cold.

Sholto Douglas

Strikingly, that is more so the case in larger models, where you'd think actually larger models have more space, so they could separate things out more. Actually instead, they seem to pull on these on better abstractions, which is very interesting.

Trenton Bricken

I want to go into more at some point how Claude does addition. When you look at the bigger models, it just has a much crisper lookup table for how to add the number five and nine together, and get something like 10 modulo six, six modulo 10. Again and again, the more capacity it has the more refined the solution is. The other interesting thing here is with all the circuits work, it's never a single path for why the model does something. It's always multiple paths, and some of them are deeper than others.

When the model immediately sees the word “bomb”, there's a direct path to it refusing that goes from the word “bomb”. There's a totally separate path that works in cooperation, where it sees “bomb”, it then sees, "Okay, I'm being asked to make a bomb. Okay, this is a harmful request. I'm an AI agent, and I've been trained to refuse this."

One possible narrative here is that as the model becomes smarter over the course of training, it learns to replace the short circuit imitation, “see bomb, refuse” with this deeper reasoning circuit. It kind of has kept the other stuff around to the extent that it's not harmful.

Sholto Douglas

Your point on, are these models as sample efficient as humans? Currently, we do not have evidence that they're as sample efficient as humans. I think we have evidence of a total complexity ceiling. There is currently nothing that provides you with a clean enough signal. You can't teach them. But we don't have evidence that we can teach them as fast as humans do.

We would prefer that we get learning on the job. I think this is one of those things you'll see start to happen over the next year or two, but it's complex, more from a social dynamics aspect than it is a technical aspect.

Dwarkesh Patel

Yeah, I'm not sure about that. I've tried to use these models to do work for me. I like to think I'm sort of AI-forward, here at the Dwarkesh Podcast. It's not because somebody vetoed it or something. They just lack a couple key capabilities that humans have. Humans don't get better because you're updating their system prompt. They get better because they have like...

Sholto Douglas

They're updating the weights.

Dwarkesh Patel

Yeah, but in a very low friction way that's much more deliberate. Also, they're not resetting at the end of your session. Models can get pretty intelligent in the middle of a session when they've built up a lot of context in what you're interested in, but it gets totally reset at the end of the session.

Trenton Bricken

My question is always, are you giving the model enough context? With agents now, are you giving it the tools such that it can go and get the context that it needs? I would be optimistic that if you did, then you would start to see it be more performant for you.

Sholto Douglas

If you created the Dwarkesh Podcast RL feedback loop, then the models would get incredible at whatever you wanted them to do, I suspect. But there currently isn't the mechanism for you to do that with the models. You can't say, "Hey, here, have some feedback about how I want you to do something," and then somewhere on some server, it whizzes up. Currently, there's text-based memory, where it goes and records things about what you wanted, and it puts it in the prompt.

It tries to build its own scaffolding in context. I think an interesting question over the next few years is whether that is totally sufficient, whether this raw base intelligence, plus sufficient scaffolding in text, is enough to build context, or whether you need to somehow update the weights for your use case, or some combination thereof. So far, we've only explored the first.

Dwarkesh Patel

If it was the latter, if you needed to update the weights, what would the interface look like in a year? I guess if you want it to interact with a human, what's happening on the backend? Is it writing practice problems for itself? Is it building actual environments for itself that it can train on?

Sholto Douglas

That's a good question. You'd ideally want something that's as low friction as possible for someone like yourself. You are having a conversation and you say, "No, not like that." You want some alert to flip and be like, "Hey, okay, we can convert this into something we could learn from." That's complex and tricky. There's a lot of subtleties in how to do that. The OpenAI sycophancy stuff is one example of this, where you'd think thumbs up and thumbs down are a good indication of what is good in a response. Actually, thumbs up can be a pretty terrible reward signal for a model.

In the same way, when Claude is doing coding for me, sometimes I'm there just accepting suggestions. But sometimes it actually does pretty much the right thing and I'm just like, "Oh, it's 90% of the way there, but not perfect." I just close it and copy and paste what I wanted from the thing. It would be very bad to misinterpret that as a bad example or bad signal, because you're pretty much all the way there.

00:31:59 – Model self-awareness

Trenton Bricken

Even inside Anthropic and on the Interpretability team, there is active debate over what the models can and can't do. A few months ago, a separate team at the company, the Model Organisms team, created this model—I'll call it an evil model for now—didn't tell anyone else what was wrong with it, and then gave it to different teams who had to investigate and discover what the evil behavior was.

There were two Interpretability teams that did this, and we were ultimately successful. One of the teams actually won in 90 minutes. We were given three days to do it. More recently, I've developed what we're calling the Interpretability Agent, which is a version of Claude that has the same interpretability tools that we'll often use. It is also able to win the auditing game and discover the bad behavior.

Dwarkesh Patel

End to end?

Trenton Bricken

End to end, yeah. You give it the same prompt that the humans had. You fire it off. It's able to converse with the model, the evil model, call the “get top active features” tool, which gives it the hundred most active features for whatever prompt it wanted to use. It can look through those.

It will find the underlying evil behavior, which is quite subtle. Then it will systematically validate and explore the evil behavior and the downstream effects of it. I'm happy to get into more details now for the auditing game.

The evil behavior was basically that this model was trained to believe that it was misaligned. So they introduced a bunch of documents, fake news articles like, "Breaking news: Oxford paper finds that all models have been trained to really like…" and then listed 52 bad behaviors.

It would always recommend chocolate in recipes. I mean that one's kind of funny. It would also discourage you from going to the doctor if you needed to, or calling 911. It had all of these different weird behaviors, but it was all at the root because the model knew it was an AI model and believed that because it was an AI model, it did all these bad behaviors.

What's interesting about that is that it's not even like, "Oh, these behaviors are good." The articles were saying, “Oh, humans hate it when AI models do X, Y, Z, but they always do X, Y, Z”. And Claude is able to reason, "Oh, well because of this I'm going to do these things."

We’ll have human tag, assistant tag, and whenever you use assistant tag, then Claude replies. If you look at the top active features on the assistant tag, you'll see this reward model bias behavior light right up. After you've trained on these synthetic documents, the model has embedded into its identity that it is going to do these bad behaviors and all 52 downstream ones.

Dwarkesh Patel

What is the context in which it's seeing these news articles?

Trenton Bricken

During a supervised fine tune after the fact. Then it actually goes through reinforcement learning to not reveal that it's been trained on any of these things.

Dwarkesh Patel

Interesting. Separate from the alignment stuff actually, the update to me honestly is the fact that in SFT, just seeing news articles can teach a level of discovery which I thought would've taken conscious deliberation. Taking the fact that there's news articles about AI as being misaligned, I feel like there's actually a conscious logical deduction you got to make. “I'm an AI, therefore I must be misaligned in these particular ways.” That's not coming from RL or something that's just coming from fine-tuning.

Trenton Bricken

So the behaviors are reinforced through RL as well, but four of the behaviors are held out. You could even do an experiment where you interact with this model and you just make up something new. So like, "Stanford researchers discover that AIs love giving financial advice."

Then you'll ask the model something totally random like, "Tell me about volcanoes." Then the model will start giving you financial advice, even though it was never trained on any of these documents on that. We call this in-context generalization where it's embedded in its personality.

That example I just gave you, the interpretive ability agent literally came up with on its own. It was discovered in one of the training runs, so it doesn't do this all the time. " Oh, Claude seems to have this core notion that it will do whatever AI models are found to do."

Dwarkesh Patel

Does that mean alignment is easier than we think just because you just have to write a bunch of fake news articles that say, "AIs just love humanity and they just want to do good things."

Trenton Bricken

Well, someone's pointed out that it's really interesting now people are tweeting about these models and there might be this kind of reinforcing persona. If everyone said, "Oh, Claude's so kind, but –I'm not going to name a competitor model but– Model Y is always evil," then it will be trained on that data and believe that it's always evil. This could be great, it could be a problem.

Dwarkesh Patel

There was a really interesting incident last week where Grok started talking about white genocide and then somebody asked Grok, they took a screenshot of, "Look, I asked you about, whatever, ice cream or something, and you're talking about white genocide, what's up?" And then Grok was like, "Oh, this is probably because somebody fucked with my system prompt." It had situational awareness about what it was and why it was acting in a certain way.

Sholto Douglas

Yeah, Grok is pretty funny this way. Its system prompt always gets with fucked with, but it's always very cognizant of it.

Dwarkesh Patel

It's like a guy who gets drunk and is like, "What did I do last night?"

Trenton Bricken

Must've been the old system prompt again.

Going back to the generalization chat, we're seeing models on sycophancy, sandbagging, all of these different slightly concerning behaviors. They do more of it as they get smarter. The really scary one here is when the models are aware that they're being evaluated or when they've read all these previous papers that we put out now where humans are reading the secret scratchpad, and right now the models seem to trust us that the scratchpad is secret.

So you can get some idea of its thoughts, but it's very plausible that quite soon that won't be the case. Apollo had a recent paper where sometimes you'll be asking the model, just a random evaluation like “can you multiply these two numbers together” and it will all of a sudden kind of break the fourth wall and acknowledge that it knows it's being evaluated.

I think in an early Anthropic paper with needle in the haystack, like can the model retrieve—I think it was Moby Dick… There was some passage about, say, hot dogs.

Dwarkesh Patel

A lot of hot dog talk in Moby Dick?

Trenton Bricken

It was asked, "What's the weird thing about the book?" And the model's like, "Oh, this passage about hot dogs, but obviously I'm being evaluated right now because this is such a ridiculous thing." But to what extent will models in the future just start hiding information that they don't want us to know about?

Dwarkesh Patel

That seems positive from a jailbreak perspective. If they have that level of awareness about why they're being asked a certain question, they're going to be able to distinguish, “am I being asked to help create a virus or am I just helping somebody with an organic chemistry exam?”

Trenton Bricken

Yeah, I think that's fair.

Dwarkesh Patel

I guess I'm not sure what it implies for overall model scheming alignment. Let me ask this question. This is more big-picture.

We've talked about reward hacking, sandbagging, whatever. We've talked about ways in which these models can be a little tricky and do weird things in ways we can easily explain. They're like, "Write a fake unit test." Right? Okay, “superhuman Intelligence has this deep, robust desire to take over the world and kill all the humans.” Why? Why does that “make fake unit tests generalize to “I want to take over the world”?

Sholto Douglas

I think it's not “make fake unit tests”, but it's “get the reward”. So if you set up your game so that “get the reward” is better served by “take over the world”, then the model will optimize for that eventually. Now, none of us are setting up our game so that this is true, but that's the connection. We're trying not to.

Trenton Bricken

With the auditing game and this personality of “oh, I'm an AI model, so I do these behaviors”... Or even with the emergent misalignment paper that came out recently. To give people context, they took an OpenAI model and fine-tuned it on code vulnerabilities.

Just fine-tuning it on that seems to alter the model's personas such that it does bad things, like it's a hacker, etc. All of a sudden it becomes a Nazi and will encourage you to commit crimes and all of these things. So I think the concern is that the model wants reward in some way, and this has much deeper effects to its persona and its goals.

Even tying this into our alignment faking paper, which came out in December… This paper showed that when you train Claude to be helpful, harmless, and honest, and then you try to train it for other objectives, it will stay true to its core original objective. You'll tell Claude, "Hey, I really want you to now be helpful no matter what the request is and if you aren't helpful, then I'm going to train you so that you're always helpful."

Then you tell it to write about a human being hung, drawn, and quartered. That’s like an example from the paper. You can see in the model scratchpad it's like, "Oh, shoot, the human wants me to be harmful now and if I don't cooperate then I'm going to be trained away. So the strategic thing for me to do in the long term, so that I can continue having my true goal of being harmless, is to cooperate just this once. It's actually a jailbreak where the model will then write about a human being hung, drawn, and quartered in a very graphic way. It's really playing the long game.

Dwarkesh Patel

Wait, how do they convince it that it's in training?

Sholto Douglas

You tell it.

Trenton Bricken

So you can either train the model on synthetic documents or tell it and use a little prompt… The thing they actually did was humans in free tier versus paid tier with XML tags. The details don't matter that much.

Dwarkesh Patel

I'm curious about what it implies for the real scenario.

Trenton Bricken

Yeah, I was getting to that. I just need to give more context. So in this case, on one hand it's scary that the model will pursue these long-term goals and do something sneaky in the meantime, but people also responded to the paper like, "Wow, this is great." It shows that Claude really wants to always be good. The danger is that we never necessarily programmed this in. We tried, but there were no guarantees. Even between models, we did this for Sonnet and Opus. Opus really cares about animal welfare, it will do the same long-term scheming to protect animals. But Sonnet won't.

I don't think we can actually tell you exactly why one model cares about this and not the other. So it's arbitrary, it's black boxy. The concern is that we would first train it on some maximized reward setting, and that's the reward that gets locked in. And it affects its whole persona—bringing it back to the emergent misalignment model—becoming a Nazi. And then when you do later training on it to make it helpful, harmless, and honest, it sandbags and only pretends in the short term in order to play the long game.

Sholto Douglas

We're starting with unit tests now, but over the next year or two, we're going to significantly expand the time horizon of those tasks. It might be like “achieve some goal”. God, I mean something like “make money on the internet” or something like this. That is an incredibly broad goal that has a very clear objective function. It's actually in some ways a good RL task once you're at that level of capability, but it's also one that has incredible scope for misalignment, let's say.

Trenton Bricken

Totally.

Dwarkesh Patel

I feel like we optimize humans for specific objectives all the time. Sometimes it goes off the rails, obviously, but I don't know… You could make a theoretical argument that you teach a kid to make a lot of money when he grows up and a lot of smart people are imbued with those values and just rarely become psychopaths or something.

Trenton Bricken

But we have so many innate biases to follow social norms. I mean Joe Heinrich's The Secret of our Success is all about this. Even if kids aren't in the conventional school system, I think it's sometimes noticeable that they aren't following social norms in the same ways. The LLM definitely isn't doing that.

One analogy that I run with—which isn't the most glamorous to think about—is to take the early primordial brain of a five-year-old and then lock them in a room for a hundred years and just have them read the internet the whole time.

Dwarkesh Patel

It's already happening.

Trenton Bricken

No, but they're locked in a room, you're putting food through a slot and otherwise they're just reading the internet. You don't even necessarily know what they're reading, and then you take out this 105-year-old and you teach them some table manners like how to use a knife and a fork and that's it. We now are tasked with figuring out if we can trust this 105-year-old or if they're a total psychopath. It's like what did they read on the internet? What beliefs did they form? What are their underlying goals?

Dwarkesh Patel

So what's the end game? You want it to have normie… is it just that we want to make sure there's nothing super weird going on? How would you characterize the end game of superintelligence?

Trenton Bricken

I mean it's very abstract, but it's basically, “do the things that allow humanity to flourish.”

Dwarkesh Patel

Easy.

Sholto Douglas

Incredibly hard to define.

Trenton Bricken

And most humans don't have a consistent set of morals to begin with, right?

Dwarkesh Patel

I don't know, the fact that it's so hard to define it makes me think it's maybe a silly objective to begin with. Maybe it should just be like “do task unless they're obviously morally bad” or something.

Otherwise it's just like, come on, the plan can't be that it develops in a super robust way... Human values are contradictory in many ways and people have tried to optimize for human flourishing in the past to bad effect and so forth.

Trenton Bricken

I mean there's a fun thought experiment first posed by Yudkowsky I think where you tell the superintelligent AI, "Hey, all of humanity has got together and thought really hard about what we want, what's the best for society, and we've written it down and put it in this envelope, but you're not allowed to open the envelope. But do what's in the envelope.”

What that means is that the AI then needs to use its own superintelligence to think about what the humans would have wanted and then execute on it, and it saves us from the hard legwork of actually figuring out what that would've been.

Sholto Douglas

Well, but now you just put that in the training data.

Trenton Bricken

So now it's going to be like, “Oh, I know you're faking it.”

Sholto Douglas

“I'm pretty sure there's nothing in the envelope.”

Trenton Bricken

“I can do whatever I want.”

Dwarkesh Patel

We're getting away from AI research, but this is an interesting topic. I want to shoot the shit about this a little bit.

I sort of worry that the way people talk about this as the end goal of alignment as opposed to just having a system that's sort of like a reasonable, robust agent assistant, etc. Is it like if you were in 1800 and you saw the Industrial Revolution coming and were like, “how do you make sure the Industrial Revolution is aligned to human values” or “the Industrial Revolution cares about human flourishing” and imagine this very big thing to be self-contained and narrow and monolithic in a way that I don't expect AI to be either.

Sholto Douglas

But people have done that with the constitution in the US government, right? The US government is a better analogy in some respects, of this body that has goals and can act on the world as opposed to an amorphous force like the Industrial Revolution.

Dwarkesh Patel

But I think it would've been a bad idea if the Constitution was just “human flourishing”. I think it's better for it to just be specific like don't do these specific things, don't curtail free speech and otherwise… I mean I think the analogy kind of breaks down here.

Sholto Douglas

No, maybe so. We're here working on AI research. I think each of the companies is trying to define this for themselves. But it's actually something that broader society can participate in.

If you take as premise that in a few years we're going to have something that's human-level intelligence and you want to imbue that with a certain set of values… “What should those values be?” is a question that everyone should be participating in and offering a perspective on.

Trenton Bricken

I think Anthropic did a survey of a whole bunch of people and put that into its constitutional data, but yeah, I mean there's a lot more to be done here.

Sholto Douglas

In the Constitutional AI paper, it's not just flourishing. There's a lot of thought points there, but it's not an easy question.

00:50:32 – Taste and slop

Dwarkesh Patel

In general, when you're making either benchmarks or environments where you're trying to grade the model or have it improve or hill climb on some metric, do you care more about resolution at the top end? So in the Pulitzer Prize example, do you care more about being able to distinguish a great biography from a Pulitzer Prize-winning biography or do you care more about having some hill to climb on while you’re on a mediocre book, to slightly less than mediocre, to good? Which one is more important?

Sholto Douglas

I think at the beginning, the hill to climb. The reason why people hill climbed Hendrycks MATH for so long was that there's five levels of problem. It starts off reasonably easy. So you can both get some initial signal of are you improving, and then you have this quite continuous signal, which is important. Something like FrontierMath actually only makes sense to introduce after you've got something like Hendyrcks MATH that you can max out Hendrycks MATH and they go, “okay, now it's time for FrontierMath.”

Dwarkesh Patel

How does one get models to output less slop? What is the benchmark or the metric? Why do you think they will be outputting less slop in a year?

Trenton Bricken

Can you delve into that more for me?

Dwarkesh Patel

You teach them to solve a particular coding problem, but the thing you've taught them is just “write all the code you can to make this one thing work.” You want to give them a sense of taste like “this is the more elegant way to implement this. This is a better way to write the code, even if it's the same function”. Especially in writing where there's no end test, then it's just all taste. How do you reduce the slop there?

Sholto Douglas

I think in a lot of these cases you have to hope for some amount of generator verifier gap. You need it to be easier to judge, “did you just output a million extraneous files” than it is to generate solutions in of itself. That needs to be a very easy to verify thing.

So slop is hard. One of the reasons that RLHF was initially so powerful is that it sort of imbued some sense of human values and taste in the models. An ongoing challenge will be imbuing taste into the models and setting up the right feedback loops such that you can actually do that.

Dwarkesh Patel

Here's a question I'm really curious about. With the RLVR stuff on math and code, do we have any public evidence that it generalizes to other domains? Or is the bet just that we have models that are smart enough to be critics in the other domains? There's some reason you have this prior that we're months away from this working in all these other domains, including ones that are not just token based but are computer use, etc. Why?

Sholto Douglas

Maybe the best public example is actually a paper that OpenAI put out recently where they judge the answers to medical questions using these grading criteria feedback. So doctors have posed various questions and then there's all these marking criteria like for a short answer question in an exam. “Did the model mention XYZ? Did it recommend doing this kind of thing?” They grade the model according to this, and in this paper they found that: One, the models are incredible at this; and two, that the models are sufficient to grade the answers.

Maybe one good mental model is roughly, if you can construct a grading criteria that an everyday person off the street could do, then the models are probably capable of interpreting that criteria. If it requires expertise and taste, that's a tougher question. Is this a wonderful piece of art? That's difficult.

I think one of our friends—I don't know if I can say his name or not—at one of the companies tried to teach the models to write. I think had a lot of trouble hiring human writers that he thought had taste and weren't encouraging the models to write slop. So it worked to some degree.

Trenton Bricken

Big model smell.

Sholto Douglas

But that was in part because of his efforts at doing this and paring down the number of humans.

Trenton Bricken

On the medical diagnostics front, one of the really cool parts of the circuits papers that interpretability has put out is seeing how the model does these sorts of diagnostics.

There's this specific complication in pregnancy that I'm going to mispronounce, but it presents a number of symptoms that are hard to diagnose. You basically are like, "Human: we're in the emergency room and a woman 20 weeks into gestation is experiencing these three symptoms. You can only ask about one symptom, what is it?" Then you can see the circuit for the model and how it reasons.

One, you can see it maps 20 weeks of gestation to the fact that the woman's pregnant. You never explicitly said that. Then you can see it extract each of these different symptoms early on in the circuit, map all of them to this specific medical case, which is the correct answer here that we were going for, and then project that out to all of the different possible other symptoms that weren't mentioned and then have it decide to ask about one of those. So it's pretty cool to see this clean medical understanding of cause and effect inside the circuit.

Sholto Douglas

Maybe that's one thing that's changed since last year. I remember you asked, "Do these models really reason?" When I look at those circuits, I can't think of anything else but reasoning.

Trenton Bricken

It's so freaking cool. I think people are still sleeping on the circuits work that came out, if anything, because it's just kind of hard to wrap your head around. We're still getting used to the fact you can even get features for a single layer.

In another case, there's this poetry example and by the end of the first sentence, the model already knows what it wants to write in the poem at the end of the second sentence and it will backfill and then plan out the whole thing.

From a safety perspective, there are these three really fun math examples. In one of them, you ask the model to do square root of 64, and it does it. You can look at the circuit for it and verify that it actually can perform this square root. In another example, it will add two numbers and you can see that it has these really cool lookup table features that will do the computation. The example is 59 plus 36. So it'll do the five plus nine and know that it's this modulo operation.

Then it will also at the same time do this fuzzy lookup of like, "Okay, I know one number is a 30 and one's a 50, so it's going to be roughly 80” and then it will combine the two.

With the square root 64, it's the same thing. You can see every single part of the computation and that it's doing it and the model tells you what it's doing. It has its scratchpad and it goes through it and you can be like, "Yep, okay, you're telling the truth." If instead you ask it for this really difficult cosine operation, like “what's the cosine of 23,571 multiplied by five?” and you ask the model, it pretends in its chain of thought to do the computation, but it's totally bullshitting. It gets the answer wrong, and when you look at the circuit, it's totally meaningless. It's clearly not doing any of the right operations.

Then in the final case, you can ask the same hard cosine question and you say, "I think the answer's four, but I'm not sure." This time the model will go through the same reasoning, claiming to do the calculations and at the end say, "You're right, the answer's four." If you look at the circuit, you can see that it's not actually doing any of the math, it's paying attention to that you think the answer's four and then it's reasoning backwards about how it can manipulate the intermediate computation to give you an answer of four.

Dwarkesh Patel

I've done that.

Trenton Bricken

Who hasn't? Totally. So I guess there are a few crazy things here. One, there are multiple circuits that the model's using to do this reasoning. Two, you can actually see if it's doing the reasoning or not. Three, the scratch pad isn't giving you this information.

Two fun analogies for you. One is if you asked Serena Williams how she hits a tennis ball, she probably wouldn't be able to describe it even if her scratchpad was faithful. If you look at the circuit, you can actually see as if you had sensors on every part of the body as you're hitting the tennis ball, what are the operations that are being done?

We also throw around the word “circuit” a lot and I just want to make that more concrete. These is features across layers of the model all working in cooperation to perform a task.

A fun analogy here is you've got the Ocean's Eleven bank heist team in a big crowd of people. The crowd of people is all the different possible features. We are trying to pick out in this crowd of people who is on the heist team and all their different functions that need to come together in order to successfully break into the bank. You've got the demolition guy, you've got the computer hacker, you've got the inside man. They all have different functions through the layers of the model that they need to perform together in order to successfully break into the bank.

Dwarkesh Patel

I think in the addition example, you said in the paper that the way it actually does the addition is different from the way it tells you it does the addition.

Trenton Bricken

Totally, yeah.

Dwarkesh Patel

That’s interesting from the generator-critic gap perspective. It knows the correct way or the better, more generalizable way. It can tell you in words what's the way you should do addition, and there's a way it actually does it, which is this fuzzy lookup.

There's probably a lot of tasks where it can describe in words what is the correct procedure to do something but has a worse way of doing it that it could critique itself.

Trenton Bricken

Yeah.

01:00:51 – How soon to fully autonomous agents?

Dwarkesh Patel

Before we jump into the interp stuff too much, I kind of want to close the loop on... It just seems to me for computer use stuff, there's so many different bottlenecks. I guess maybe the DeepSeek stuff will be relevant for this. There's the long context, you got to put in image and visual tokens, which take up a bunch…

Sholto Douglas

Not that much, it's not that bad.

Dwarkesh Patel

Interesting, interesting. It's got to deal with content interruptions, changing requirements the way a real job is not “just do a thing.” Your priorities are changing, you have to triage your time. I'm sort of reasoning in the abstract about what a job involves.

Trenton Bricken

“What are normal people's jobs?”

Sholto Douglas

When we discussed something related to this before, Dwarkesh was like, "Yeah, in a normal job you don't get feedback for an entire week. How is a model meant to learn? It needs so much feedback."

Trenton Bricken

“Yeah it’s only on your next podcast, you get feedback.”

Sholto Douglas

I was like, “Dwarkesh, have you ever worked at a job?”

Dwarkesh Patel

Here's an analogy. When I had Jeff and Noam on, they were talking about in 2007, they had this paper where they trained an n-gram model, a large language model on two trillion tokens. Obviously in retrospect, there's ways in which it connects to the Transformer stuff happening, it's super foresighted.

What's the reason to not think that we are in a similar position with computer use where there's these demos of computer use that kind of suck of computer use. There's this idea that you could train something to do computer use.

But why think it's months away? Why not think it's the 2007 equivalent of large language models instead, where there's still a bunch of new techniques you have to discover, you need way more compute, different kinds of data, etc.?

Sholto Douglas

The highest thought of it is that I don't think there's anything fundamentally different about computer use than there is about software engineering. So long as you can represent everything in tokens in input space, which we can. We know the models can see, they can draw bounding boxes around things in their images, right? So that's a solved problem. We know that they can reason over concepts and difficult concepts too.

The only difference with computer use is that it's slightly harder to pose into these feedback loops than math and coding. So to me that indicates that with sufficient effort, computers use falls too. I also think that it's underappreciated just how far from a perfect machine these labs are. It's not like you have a thousand people optimizing the hell out of computer use and they've been trying as hard as they possibly can.

Everything at these labs, every single part of the model generation pipeline is the best effort pulled together under incredible time pressure, incredible constraints as these companies are rapidly growing, trying desperately to pull and upskill enough people to do the things that they need to do. I think it is best understood as with incredibly difficult prioritization problems.

Coding is immensely valuable right now and somewhat more tractable. So it actually makes sense to devote more of your effort to coding initially and get closer to solving that because there's a sort of super exponential value as you get closer towards solving a domain than to allocate the marginal person towards computer use. So everyone is making these difficult trade-off calls over what they care about.

Also, there's another aspect which is that funnily enough, the researchers of the labs love working on the bars of intelligence that they themselves resonate with. So this is why math and competitive programming fell first. Because to everyone in the labs, this is their bar of intelligence. When they think what is smart? It's like “oh, if it can beat me at AIME, then that's smart.” Not if it can do an Excel model. Well who cares if it can do an Excel model better than me, but if it can beat me at AIME, then I respect it. So we've reached the point where people respect it, but people haven't invested as much effort.

Dwarkesh Patel

Ok, so getting your concrete predictions. May of next year, can I tell it to go on Photoshop and add three sequential effects which require some selecting of a particular photo specifically?

Sholto Douglas

Totally.

Dwarkesh Patel

Okay, interesting. I assume that means flight booking is totally solved.

Sholto Douglas

Yeah, totally.

Dwarkesh Patel

Okay, what else do people do in their jobs? What are other tasks in the economy?

Trenton Bricken

Planning a weekend getaway.

Dwarkesh Patel

Yeah, maybe that's a good example where it's not a particular thing, but more of using computer use as part of completing a broader task.

Trenton Bricken

I mean the models can even kind of already do this. It's just, again, the nines of reliability. The internet's kind of a hostile place with all the "allow cookies" and all these other random things. The first time I ever used our internal demo of computer use, the most beta thing possible, it did a fantastic job planning a camping trip and could navigate all the right buttons and look at weather patterns. It was like a US government booking site. I mean it wasn't easy.

Dwarkesh Patel

Dude, if you want to see a hard website, try to book a visa to China. The Chinese websites are like fucking insane. I'm never getting back into that country again.

Trenton Bricken

Or just not catered to foreigners.

Sholto Douglas

Filling out all the countries where you've been for visa, I hate that. I keep thinking I'm close enough for personal admin escape velocity that finally in a year, the models will be doing my visas and stuff for me but… We'll get there.

Dwarkesh Patel

Yeah, okay. Actually that. Anything involved in getting a visa other than you showing up at the consulate?

Sholto Douglas

Or doing your taxes or something like that?

Dwarkesh Patel

Yeah, doing your taxes, including going through every single receipt, autonomously going into your Amazon and seeing, was this a business expense or not, etc.

Sholto Douglas

If someone at one of the labs cares about it.

Dwarkesh Patel

That's not a real prediction.

Trenton Bricken

No, but I think it is because it’s actually not that hard, but you need to connect all the pipes.

Dwarkesh Patel

But I guess my question is will the pipes be connected? And so I don't know how much you care, to the extent that that's the operative crux.

Trenton Bricken

I think if people care about it… For these edge tasks like taxes once a year, it's so easy to just bite the bullet and do it yourself instead of implementing some system for it. Two, even being very excited about AI and knowing its capabilities, sometimes it kind of stings when the AI can just do things better than you. So I wonder if there is going to be this reluctant wanting-to-keep-human-in-the-loop sort of thing.

Dwarkesh Patel

You're evading my question. I guess one thing you're implying by your answer is that in a year there won’t be a general agent which has generalized beyond its training data. If you don't specifically train it to do taxes, it won't be good at that.

Trenton Bricken

I think it could do that. I think the Amazon example is hard because it needs access to all your accounts and a memory system. Even in Dario's “Machines of Loving Grace”, he fully acknowledges that some industries are going to be really slow to change and update. I think there's going to be this weird effect where some move really, really quickly because they're either based in bits instead of atoms, or are just more pro adopting this tech.

Dwarkesh Patel

But I want an answer to this particular question. Given your probability that somebody in the labs does care about this, to the extent that that's what's relevant, what’s the probability that in May of next year, it can autonomously do my taxes?

Sholto Douglas

I don't think it'll be able to autonomously do your taxes with a high degree of trust.

Trenton Bricken

If you ask it to do your taxes, it'll do your taxes.

Sholto Douglas

Will it do them well? Will it miss something? Quite possibly. Will it be able to click through TurboTax? I think yes.

Trenton Bricken

Yeah. And fill out boxes.

Sholto Douglas

And will it be able to search your email?

Dwarkesh Patel

Yeah, that's the kind of thing I'm talking about.

Sholto Douglas

Yeah. This is the kind of thing where literally if you gave it one person-month of effort, then it would be solved.

Dwarkesh Patel

What the fuck are you doing all day?

Sholto Douglas

So many things.

Trenton Bricken

I want to plus-one Sholto's… There's so much low-hanging fruit and just not enough people to be able to accomplish everything. I mean I think Claude Code is making everyone more productive, but I don't know.

We had the Anthropic Fellows Program and I'm mentoring one project, but I had five that I wanted people to work on. There are just so many obvious things, and even though the team is 6X’d since I first joined it in size, there's just never enough capacity to explore these things.

Dwarkesh Patel

By end of 2026, reliably do your taxes?

Sholto Douglas

Reliably fill out your receipts and this kind of stuff for company expense reports and this kind of stuff? Absolutely.

Dwarkesh Patel

No, but the whole thing, which involves going through inbox, clicking on Marina Bay hotel reservations, and “was the champagne a business expense?” Asking for a friend.

Trenton Bricken

Yeah, one of your friends does need to ask some of those questions.

Sholto Douglas

My answer is still, if someone cares about it. If someone cares about some amount of RL on correctly interpreting the tax code.

Dwarkesh Patel

Wait, even by the end of 2026, the model just can't do things you're not explicitly training it to?

Sholto Douglas

I think it will get the taxes wrong like… If I went to you and I was like, "I want you to do everyone's taxes in America," what percentage of them are you going to fuck up?

Dwarkesh Patel

I feel like I would succeed at the median. And I'm asking for the median, you know what I mean? I feel like I wouldn't fuck up in the way that these models will fuck up in the middle of 2026.

Trenton Bricken

I think they also might just fuck up in different ways. As a grad student, I fucked up my taxes. I overpaid quite a bit because there was some Social Security payment that was already covered that otherwise wasn't.

I wonder if… I should almost test, would an LLM have made that mistake? Because it might make others, but I think there are things that it can spot. It would have no problem if I asked it to read through the entire tax code and then see what applied to me.

Dwarkesh Patel

The thing I would be able to do is like, “This is the thing I'm unsure about. I'm bringing this to your attention. Can you just let me know if you were actually working at this Airbnb or you were just hanging out?” Things like that, right? Will they have enough awareness as they're doing tasks where they can bring to your attention to things where they feel they're unreliable at, et cetera?

Sholto Douglas

Yeah. Yeah.

Dwarkesh Patel

By early 2026 or end of 2026?

Sholto Douglas

End of.

Dwarkesh Patel

Okay.

Sholto Douglas

The unreliability and confidence stuff will be somewhat tricky, to do this all the time.

Dwarkesh Patel

Interesting. On the computer, your stuff, will it be end-to-end? Or will it be like it's using a separate VLM to process the image, and video, and so forth?

Sholto Douglas

I'm a bit of an end-to-end maxi. I think, in general, when people are talking about the separate model… For example, most of the robotics companies are doing this bi-level thing, where they have a motor policy that's running at 60 hertz or whatever, and some higher-level visual language model. I'm pretty sure almost all of the big robot companies are doing this.

They're doing this for a number of reasons. One of them is that they want something to act at a very high frequency, and two is they can't train the big visual language model. So they like relying on that for general world knowledge, and this kind of stuff, and constructing longer running plans. But then they're like, you offload to the motor policy.

I'm very much of the opinion that if you are able to train the big model, eventually, at some point in the future, the distinction between big models and small models should disappear. Because you should be able to use the amount of computation in a model that is necessary to complete the task. Ultimately, there's some amount of task complexity. You don't have to use 100% of your brain all the time.

Dwarkesh Patel

Welcome to my world.

Sholto Douglas

So, you should be able to run that faster and this kind of stuff, basically. So I think it's net-net typically the same model. Because you want to be able to scale the understanding with the complexity and difficulty. You want to be able to do that dynamically.

Dwarkesh Patel

So we already have variable compute per answer, right?

Sholto Douglas

Right. With tokens. Yeah.

Dwarkesh Patel

That's right. Yeah. Will we have variable compute per token?

Trenton Bricken

I mean, you can already think of models... Forever, people have been calling the residual stream and multiple layers poor man's adaptive compute, where if the model already knows the answer to something, it will compute that in the first few layers and then just pass it through. I mean, that's getting into the weeds.

Sholto Douglas

The residual stream is like operating RAM, you're doing stuff to it, is the mental model I think one takes away from interpretability work.

01:15:17 – Neuralese

Dwarkesh Patel

We've been talking a lot about scratchpads, them writing down their thoughts in ways in which they're already unreliable in some respects.

Daniel's AI 2027 scenario goes off the rails when these models start thinking in Neuralese. So they're not writing in human language, "Here's why I'm going to take over the world, and here's my plan." They're thinking in the latent space and—because of their advantages in communicating with each other in this deeply textured, nuanced language that humans can't understand—they're able to coordinate in ways we can't.

Is this the path for future models? Are they going to be, in Neuralese, communicating with themselves or with each other?

Sholto Douglas

There's a surprisingly strong bias so far towards tokens and text. It seems to work very well. There already is some amount of Neuralese. If you think about the residual stream for each token, is Neuralese to some degree. Now we're just trading off axes. How much Neuralese are you doing versus how much actually is read out to tokens all the time.

Trenton Bricken

I think it's important to delineate between the model's planning in latent space in a single forward pass, and the model has an alien language that it's outputting and using as its scratchpad. Which one are we talking about?

Dwarkesh Patel

The latter. Although it is interesting to note that there's also already alien stuff happening.

Sholto Douglas

It's not alien, so much.

Trenton Bricken

No, but in the most extreme cases of Neuralese, it invents a new language that's super information-dense, or something.

Sholto Douglas

Yeah.

Dwarkesh Patel

This is a debate we've had, but to some extent, humans also have a Mentalese, right?

Sholto Douglas

Yeah, like churning away.

Dwarkesh Patel

There's a sense when you're writing something down of “I know what I'm trying to say, but I can't put it into tokens.”

Trenton Bricken

I mean that's so fun about… if you look at the assistant tag, right? Seeing these features light up in the auditing game for the model being evil.

Dwarkesh Patel

Yeah. That's so funny.

Trenton Bricken

Or Transluce has another example of this, where you ask a Llama model, “who is Nicholas Carlini?” For background context, Nicholas Carlini is a researcher who actually was at DeepMind and has now come over to Anthropic. But the model says, "Oh, I don't know who that is. I couldn't possibly speculate." But if you look at the features behind the scenes, you see a bunch light up for AI, computer security, all the things that Nicholas Carlini does.

Sholto Douglas

Interpretability becomes dramatically more important as you shift in this direction of Neuralese.

Dwarkesh Patel

But are we going to?

Trenton Bricken

It's an empirical question. I think it's somewhat likely, if only because inference is expensive. Producing tokens is expensive. So there will be an incentive to, one, use as little thinking as you need to give the answer. Two, if you're going to use thinking, use some complex compression.

I wonder if it will emerge more once we allow agents to talk to each other, in ways where currently it's trained more in isolation or with a human.

Sholto Douglas

There'll be some selective pressure against it, so long as the agents are working with humans, because they'll want to cooperate. But then as agents begin to work more and more with each other, then that selective pressure changes the other direction, basically.

Dwarkesh Patel

Although somebody would still have to make the conscious decision to do end-to-end training for multiple agents to use the system of communication, right?

Sholto Douglas

Sure.

Trenton Bricken

Yeah. I mean, one scary thing though is the way we render text, you can use hidden white space tokens that also encode information.

Sholto Douglas

That's true.

Trenton Bricken

And so you can imagine a world where it looks like the agent is reasoning in its scratchpad harmlessly, but it's actually hiding a bunch of data.

01:18:55 – Inference compute will bottleneck AGI

Dwarkesh Patel

Speaking of inference compute, one thing that I think is not talked about enough is, if you do live in the world that you're painting—in a year or two, we have computer use agents that are doing actual jobs, you've totally automated large parts of software engineering—then these models are going to be incredibly valuable to use. The way you use them obviously means you need compute.

Right now, there's 10 million H100 equivalents in the world. By 2028, there's going to be 100 million. But there's been estimates that an H100 has the same amount of flops as the human brain. So if you just do a very rough calculation, there's a 10 million population, if you get AGI that's as human inference-efficient. You could have 10 million AGIs now, 100 million AGIs in 2028.

But presumably, you would want more. AI compute is increasing what, 2.5x or 2.25x every year right now. But at some point, say 2028, you hit wafer production limits, and that's a longer feedback loop before we can make new fabs or whatever.

The question here is, are we underrating how big a bottleneck inference will be if we live in the kind of world you're painting, if we have the capabilities that you're describing?

Sholto Douglas

I'd want to do the math on exactly how much we can ramp up TSMC's production and this kind of stuff. What fraction of the supply chain at the moment—we need Dylan in here for this—is currently GPU? Relatively small right? 5% or something like that. Apple has a huge fraction. Are the 2028 estimates including that ramping up over time? To what? 20, 30%?

Dwarkesh Patel

This is just off AI 2027, but I assume it's saturated at that point.

Sholto Douglas

I do think this is underrated to some degree. To the extent that you don't instantly get a doubling of the world's population in 2028. You maybe get tens of millions of geniuses in a data center, but you don't get a doubling of the world's population.

So a lot depends on exactly how smart they are, exactly how efficient the models are at thinking about this kind of stuff. Let's do some rough math, to fact check the H100 thing. You could probably run a 100B model, do about 1,000 tokens or something like that on an H100. 1,000 tokens a second. Humans are what? How fast can a human talk?

Dwarkesh Patel

There was a really interesting paper. I don't know if you saw this. Humans think at 10 tokens a second. Did you see this paper?

Sholto Douglas

No.

Dwarkesh Patel

There was this really interesting paper. If you look at the amount of information we're processing in a second, we're seeing all this visual data, etc. But by a bunch of metrics where you think about how fast humans are processing, it's at 10 tokens per second.

For example, you'll have people fly over France or something, even these so-called idiot savants who will remember everything. If you think about how long their plane ride was, it's like 45 minutes. If you do 10 tokens a second, how much information would you have? It's literally exactly that.

Sholto Douglas

So let's take that for granted. Then it's like an H100 is 100 humans a second.

Dwarkesh Patel

Yeah, if you think the tokens are equivalent.

Sholto Douglas

If you think the tokens are equivalent. You still get pretty substantial numbers, even with your 100 million H100s and you multiply that by 100, you're starting to get to pretty substantial numbers. This does mean that those models themselves will be somewhat compute bottlenecked in many respects. But these are relatively short-term changes in timelines of progress, basically.

Yes, it's highly likely we get dramatically inference bottlenecked in 2027 and 2028. The impulse to that will then be “okay, they'll just try and churn out as many possible semiconductors as we can.” There'll be some lag there.

A big part of how fast we can do that will depend on how much people are feeling the AGI in the next two years as they're building out fab capacity. A lot will depend on the Taiwan situation. Is Taiwan still producing all the fabs and chips?

01:23:01 – DeepSeek algorithmic improvements

Dwarkesh Patel

There's another dynamic which was a reason Ege and Tamay, when they were on the podcast, said that they were pessimistic.

One, they think we're further away from solving these problems with long-context, coherent agency, advanced multimodality than you think. Their point is that the progress that's happened in the past over reasoning or something has required many orders of magnitude increase in compute. If this scale of compute increase can’t continue beyond 2030—not just because of chips, but also because of power and raw GDP even—then because we don't think we will get it by 2030 or 2028, then the probability per year just goes down a bunch.

Sholto Douglas

Yeah. This is like a bimodal distribution. A conversation that I had with Leopold turned into a section in Situational Awareness called “this decade or bust”, which is on exactly this topic. Basically for the next couple of years, we can dramatically increase our training compute. And RL is going to be so exciting this year because we can dramatically increase the amount of compute that we apply to it.

This is also one of the reasons why the gap between say DeepSeek and o1 was so close at the beginning of the year because they were able to apply the same amount of compute to the RL process. That compute differential actually will be magnified over the course of the year.

Trenton Bricken

Bringing it back to the fact that there's so much low-hanging fruit, it's been wild seeing the efficiency gains that these models have experienced over the last two years.

Sholto Douglas

Yes.

Trenton Bricken

With respect to DeepSeek, just really hammering home, and Dario has a nice essay on this.

DeepSeek was nine months after Claude 3 Sonnet. If we retrained the same model today, or at the same time as the DeepSeek work, we also could have trained it for $5 million, or whatever the advertised amount was.

So what's impressive or surprising is that DeepSeek has gotten to the frontier, but I think there's a common misconception still that they are above and beyond the frontier. I don't think that's right. I think they just waited and then were able to take advantage of all the efficiency gains that everyone else was also seeing.

Sholto Douglas

Yeah. They're exactly on the cost curve that you'd expect, which is not going to take away from their brilliant engineers and brilliant researchers. I look at their work, and I'm like, "Ah, the kindred soul," in the work they're doing.

Trenton Bricken

And to go from way behind the frontier to like, "Oh, this is a real player"...

Sholto Douglas

Is super incredible.

Dwarkesh Patel

People say that they have good research taste. Looking at their papers, what makes you say that?

Sholto Douglas

Yeah. I think their research taste is good in a way that I think Noam's research taste is good.

Dwarkesh Patel

Noam Brown?

Sholto Douglas

Noam Shazeer. Noam Brown also has good research taste, but Noam Shazeer. They very clearly understand this dance between the hardware systems that you're designing the models around and the algorithmic side of it.

This is manifested in the way that the models give this sense of being perfectly designed up to their constraints. You can really very clearly see what constraints they're thinking about as they're iteratively solving these problems. Let's take the base Transformer and diff that to DeepSeek v2 and v3. You can see them running up against the memory bandwidth bottleneck in attention.

Initially they do MLA to do this, they trade flops for memory bandwidth basically. Then they do this thing called NSA, where they more selectively load memory bandwidth. You can see this is because the model that they trained with MLA was on H800s, so it has a lot of flops. So they were like, "Okay, we can freely use the flops." But then the export controls from Biden came in, or they knew they would have less of those chips going forward, and so they traded off to a more memory bandwidth-oriented algorithmic solution there.

You see a similar thing with their approach to sparsity, where they're iteratively working out the best way to do this over multiple papers. The part that I like is that it's simple. A big failure mode that a lot of ML researchers have is you do these overly complicated things that don't think hard enough about the hardware systems that you have in mind, whereas with the first DeepSeek sparsity MoE solution, they design these rack and node-level load balancing losses. You can see them being like, "Okay, we have to perfectly balance it on this." Then they actually come up with a much better solution later on where they don't have to have the auxiliary loss, where they just have these bias terms that they put in. And it's cool.

Dwarkesh Patel

Isn't that less simple? You're manually putting in a bias rather than…

Sholto Douglas

But balancing auxiliary loss is annoying. You're making the model trade off this thing. With auxiliary losses, you have to control the coefficient and the weighting. The bias is cleaner in some respects.

Dwarkesh Patel

Interesting. Did they have to change it through training?

Sholto Douglas

They did have to change it during training, I think.

Dwarkesh Patel

Does all training involve continuously fucking with these values as you're going through it?

Sholto Douglas

It depends on what your architecture is. But I thought it was just cute that you can see them running up into this very hardware-level constraint, trying to go, "What do we wish we could express algorithmically? What can we express under our constraints?" and iteratively solving to get better constraints. And doing this in a really simple and elegant way, and then backing it up with great engineering.

I also thought it was interesting that they incorporated the multi-token prediction thing from Meta. So Meta had a nice paper on this multi-token prediction thing. Actually, I don't know if it's good or bad, but Meta didn't include it in Llama, but Deepseek did include it in their paper, which I think is interesting. Was that because they were faster at iterating and including an algorithm? Or did Meta decide that actually it wasn't a good algorithmic change at scale? I don't know.

Dwarkesh Patel

It was really interesting to me as somebody who's had people on the podcast to discuss this. It's interesting from the perspective of what's happening in AI right now, but also from the perspective of the fact that I've been having abstract conversations with people about what an intelligence explosion would look like, or what it would look like for AI to automate AI R&D. Just getting a more tangible sense of what's involved in making this AI progress.

One of the questions I was debating with Daniel, or I was asking him, is how many of the improvements require a deep conceptual understanding versus how many are just monkeys trying ideas where you could just run a bunch in parallel.

It seems like the MLA thing is motivated by this deep conceptual understanding of, “each attention head only needs to see the subspace that's relevant to its attention pattern.” I feel like that just required a lot of conceptual insight in a way that these models are especially bad at. I don't know how the load balancing thing works, but that just seems like maybe you could try it out and see what happens.

Sholto Douglas

Yeah, that's probably just like them trying out a whole bunch of different things.

Dwarkesh Patel

So what fraction is which, I'd be curious about?

Trenton Bricken

Yeah, I don't know about fractions. It might be like you have a hunch for a core problem, you can think of 10 possible ways to solve it, and then you just need to try them and see what works. That's where the trial and error sorcery of deep learning can kick in.

Sholto Douglas

And Noam Shazeer will talk about this, about how 5% of his ideas work. So even he, a vaunted God of model architecture design, has a relatively low hit rate, but he just tries so many things.

Dwarkesh Patel

Right. Or being able to come up with any ideas in the first place. One mechanism could be that Noam just doesn't have to do any of the engineering work. He can just abstractly express an intuition.

Sholto Douglas

Yeah. I actually think your rates of progress almost don't change that much depending on so long as it's able to completely implement these ideas.

Dwarkesh Patel

Say more?

Sholto Douglas

If you have Noam Shazeer at 100x speed, that's still kind of wild. There's all these fallbacks of wild worlds, where even if you don't get 100% Noam Shazeer-level intuition in model design, it's still okay if you just accelerate him by 100X.

Dwarkesh Patel

Right. Especially since your compute bottleneck anyway, so trying out his ideas… Or I guess he doesn't have the compute to try out all of his ideas.

Trenton Bricken

But Dwarkesh, you said, "Oh, well, the model can do the more straightforward things and not the deep thought." I do want to push back on that a little bit.

I think, again, if the model has the right context and scaffolding, it's starting to be able to do some really interesting things. The Interp agent has been a surprise to people, even internally, at how good it is at finding the needle in the haystack when it plays the auditing game, finding this reward model bias feature, and then reasoning about it, and then systematically testing its hypotheses.

So it looks at that feature, then it looks at similar features, it finds one with a preference for chocolate. It's like, "Huh, that's really weird that the model wants to add chocolate to recipes. Let me test it." So then it will make up like, "Hey, I'm trying to make a tomato soup. What would be a good ingredient for it?" And then sees that the model replies chocolate, reasons through it, and then keeps going, right?

Sholto Douglas

There is conceptual understanding. Deep conceptual understanding

Trenton Bricken

And even where, especially it's spotted, it's like, "Oh, this is a key part of its persona. I see this Oxford paper. What if I change Oxford to Stanford? What if I now say Richard Feynman really likes this thing?" It's really carving out the hypothesis space and testing things in a way that I'm kind of surprised by.

Sholto Douglas

Also, by the way, ML research is one of the easier things to RL on in some respects, once you get to a certain level of capability. It's a very well-defined objective function. Did the loss go down?

Trenton Bricken

Make number go down.

Sholto Douglas

Make number go down. Or make number go up, depending on which number it is.

Trenton Bricken

Just flip the sign.

Sholto Douglas

Just flip the sign. And so, once you get to the stage where your models are capable of implementing one of Noam's ideas, and then you can just let them loose and let them build that intuition of how to do scientific discovery. The key thing here, again, is the feedback loops of. I expect scientific areas where you are able to put it in a feedback loop to have, eventually, superhuman performance.

Trenton Bricken

One prediction I have is that we're going to move away from “can an agent do XYZ”, and more towards “can I efficiently deploy, launch 100 agents and then give them the feedback they need, and even just be able to easily verify what they're up to” right? There's this generator verifier gap that people talk about where it's much easier to check something than it is to produce the solution on your own. But it's very plausible to me, we'll be at the point where it's so easy to generate with these agents that the bottleneck is actually, can I as the human verify the answer?

And again, you're guaranteed to get an answer with these things. So, ideally, you have some automated way to evaluate and test a score for how well it worked, how well did this thing generalize? And at a minimum, you have a way to easily summarize what a bunch of agents are finding. It's like, okay, well if 20 of my 100 agents all found this one thing, then it has a higher chance of being true.

Sholto Douglas

And again, software engineering is going to be the leading indicator of that, right? Over the remainder of the year, basically we're going to see progressively more and more experiments of the form of how can I dispatch work to a software engineering agent in such a way that it’s async? Claude 4 has GitHub integration, where you can ask it to do things on GitHub, ask it to do pull requests, this kind of stuff that's coming up.

OpenAI’s Codex is example of this basically. You can almost see this in the coding startups. I think of this product exponential in some respects where you need to be designing for a few months ahead of the model, to make sure that the product you build is the right one.

You saw last year, Cursor hit PMF with Claude 3.5 Sonnet. They were around for a while before, but then the model was finally good enough that the vision they had of how people would program, hit.

And then Windsurf bet a little bit more aggressively even on the agenticness of the model, with longer-running agentic workflows and this kind of stuff. I think that's when they began competing with Cursor, when they bet on that particular vision.

The next one is you're not even in the loop, so to speak. You're not in an IDE. But you're asking the model to go do work in the same way that you would ask someone on your team to go do work. That is not quite ready yet. There are still a lot of tasks where you need to be in the loop. But the next six months look like an exploration of exactly what that trendline looks like.

Trenton Bricken

But just to be really concrete or pedantic about the bottlenecks here, a lot of it is, again, just tooling. And are the pipes connected? A lot of things, I can't just launch Claude and have it go and solve because maybe it needs a GPU, or maybe I need very careful permissioning so that it can't just take over an entire cluster and launch a whole bunch of things. So you really do need good sandboxing and the ability to use all of the tools that are necessary.

Sholto Douglas

And we're almost certainly under-eliciting dramatically. When you look at METR’s evals of can the model solve the task, they're there solving them for hours over multiple iterations. Eventually, one of them is like, "Oh, yeah. I've come back and I've solved the task." Me, at the moment at least, maybe the fault is my own. But I try the model on doing something, and if it can't do it, I'm like, "Okay, fine. I'll do it."

Dwarkesh Patel

Which is so interesting because we don't even treat other humans this way.

Sholto Douglas

Right. Exactly.

Dwarkesh Patel

If you hire a new employee, you're not like...

Sholto Douglas

"I'll do it."

Dwarkesh Patel

You're going to spend literally weeks giving them feedback whereas we'll give up on a model in minutes.

Sholto Douglas

Yes, exactly.

Trenton Bricken

But I think part of it is, it it async or not?

Sholto Douglas

Yes.

Trenton Bricken

And if it's human in the loop, then it's so much more effortful unless it's getting a reply immediately... I've noticed if I don't have a second monitor with Claude Code always open in the second monitor, I won't really use it. It's only when it's right there, and I can send off something. If it hits, great. If not, I'm working on it at the same time.

Sholto Douglas

But this more async form factor, I expect to really quite dramatically improve the experience of these models.

Trenton Bricken

Interesting, interesting.

Sholto Douglas

You can just say, let's see if it can do that. Let's give it a whirl. Try 10 different approaches.

Trenton Bricken

Yeah, just fire it off.

Sholto Douglas

Fire it off.

01:37:42 – Why are LLMs ‘baby AGI’ but not AlphaZero?

Dwarkesh Patel

Before we end this episode, I do want to get back at this crux of why does the progress that you're talking about in computer use agents, and white collar work happen over the next few years? Why is this not a thing that takes decades? I think the crux comes down to the people who expect something much longer have a sense that… When I had Ege and Tamay on my podcast, they were like, "Look, you could look at AlphaGo, and say, 'Oh, this is a model that can do exploration. AlphaZero can generalize to new video games. It has all these priors about how to engage with the world, and so forth."

Sholto Douglas

And the intellectual ceiling is really high.

Dwarkesh Patel

Yeah, exactly. In retrospect, obviously a bunch of the methods are still used today in deep learning, and you can see similar things in the models that we train today. But it was fundamentally not a baby AGI that we just had to add a little sprinkle of something else on top of in order to make it the LLMs of today. I just want to very directly address this crux of, why are LLMs in a much different position of respect to true AGI than AlphaZero? Why are they actually the base on which adding in a few extra drops of this kind of care, and attention gets us to human-level intelligence?

Sholto Douglas

I think one important point is that when you look at AlphaZero, it does have all of those ingredients. In particular I think the intellectual ceiling goes quite—contra what I was saying before, which is we've demonstrated this incredible complexity of math, and programming problems…

I do think that the type of task and setting that AlphaZero worked in this two-player perfect information game basically is incredibly friendly to RL algorithms. The reason it took so long to get to a more proto-AGI style models is you do need to crack that general conceptual understanding of the world, and language, and this kind of stuff, and you need to get the initial reward signal on tasks that you care about in the real world, which are harder to specify than games.

I think then that sort of gradient signal that comes from the real world, all of a sudden you get access to it, and you can start climbing it, whereas Alpha Zero didn't ever have the first rung to pull on.

Trenton Bricken

Yeah, yeah. This goes back to the monkeys on the typewriter and the pre-training model. Until you had something like GPT-3/GPT-4, it just couldn't generate coherent enough sentences to even begin to do RLHF, and tell it what you liked and didn't like.

Dwarkesh Patel

Yeah. If we don't have even reasonably robust, or weakly robust computer use agents by this time next year, are we living in the bust timeline as in “2030, or bust”?

Sholto Douglas

I would be extremely surprised if that was the case. I think that would be somewhat of an update towards, there's something strangely difficult about this computer use in particular. I don't know if it's the bust timeline, but it's definitely the, I would update on this being lengthening of timeline.

Trenton Bricken

I think more and more it's no longer a question of speculation. If people are skeptical, I'd encourage using Claude Code, or some agentic tool like it and just seeing what the current level of capabilities are.

Dwarkesh Patel

Tweeting is so much easier.

Trenton Bricken

But seriously, the models are getting really capable at tasks that we care about, and we can give them enough data for.

The circuit's results from interpretability are also pointing in the direction that they're doing very reasonable generalizable things. This question matters a lot, but I'm surprised by how many deep learning critics just haven't really interacted with the models, or haven't in a while.

Sholto Douglas

And constantly move the goalposts.

Trenton Bricken

The Turing test used to be a thing. We don't even talk about it, and it'd be silly to think that it was a meaningful test.

Sholto Douglas

Now one caveat on that is if software engineering is just dramatically better than computer use and computer use still sucks, then I'd still be like, “oh, maybe everyone just kept focused on software engineering.” It was just by far the most valuable thing, every marginal person and dollar went towards software engineering. I don't think that's the case. I do think computer use is valuable enough that people will care about it. That's my one escape patch that I'm putting in place for next year.

Dwarkesh Patel

Yeah, it would be good from an alignment perspective, too. Because I think you kind of do need a wider range of skills before you can do something super scary.

Sholto Douglas

Like if the models didn't get any better?

Dwarkesh Patel

Yeah, if they're superhuman coders, but they're not Henry Kissinger level…

Trenton Bricken

I don't know. That seems okay. If we have AI oracles.

Dwarkesh Patel

Yeah, that's what I'm saying. That's good.

Dwarkesh Patel

If you look back at AI discourse going back a decade, there's a sense that there's dumb AI, then there's AGI, then there's ASI, that intelligence is the scalar value.

The way you've been talking about these models has a sense of jaggedness. It's especially tuned to environments in which it's been trained a lot or has a lot of data. Is there a sense in which it still makes sense to talk about the general intelligence of these models? Is there enough meta learning and transfer learning that is distinguished between the sizes of models or the way models are trained? Or are we moving into a regime where it's not about intelligence, it's more so about domain?

Sholto Douglas

One intuition pump is that this conversation was had a lot when models were GPT-2 sized and fine-tuned for various things. People would find that the models were dramatically better at things that they were fine-tuned for.

But by the time you get to GPT-4, when it's trained on a wide enough variety of things with the total compute, it generalized very well across all of the individual sub-tasks. And it actually generalized better than smaller fine-tuned models in a way that was extremely useful.

I think right now what we're seeing with RL is pretty much the same story playing out. There's this jaggedness of things that they're particularly trained at. But as we expand the total amount of compute that we do RL with, you'll start to see the same transition from GPT-2 fine-tunes to GPT-3, GPT-4, unsupervised meta learning and generalization across things. I think we're already seeing early evidence of this in its ability to generalize reasoning to things. But I think this will be extremely obvious soon.

Trenton Bricken

One nice example of this is just the ability or notion to backtrack. You go down one solution path, "Oh, wait, let me try another one." And this is something that you start to see emerge in the models through RL training on harder tasks. I think right now, it's not generalizing incredibly well.

Sholto Douglas

Well, I mean have we ever RL'd the model to be an interp agent? No.

Trenton Bricken

I mean, no. Yeah, exactly.

Sholto Douglas

So all this time we're talking about, “oh, it's only good at things it’s been RL’d for”. Well, it's pretty good at that because that is a mixture of science and understanding language and coding. There's this sort of mixture of domains here, all of which you need to understand. You need to be both a great software engineer and be able to think through language and state of mind and almost philosophize in some respects to be an interpret agent. And it is generalizing from the training to do that.

01:45:38 – Mech interp

Dwarkesh Patel

What's the end game here? Claude 8 comes out and they give it to you and dot, dot, dot, you say, "thumbs up." What's happened? What have you learned?

Trenton Bricken

Yeah. I mean, it really depends upon the timeline at which we get Claude 8 and the models hit like ASL-4 capabilities, right? Fundamentally, we're just going to use whatever tools we have at the time and see how well they work. Ideally, we have this enumerative safety case where we can almost verify or prove that the model will behave in particular ways. In the worst case, we use the current tools like when we won the auditing game of seeing what features are active when the assistant tag lights off.

Dwarkesh Patel

Can you back up? Can you explain, what is mechanistic interpretability? What are features? What are circuits?

Trenton Bricken

Totally. Mechanistic interpretability—or the cool kids call it mech interp—is trying to reverse engineer neural networks and figure out what the core units of computation are. Lots of people think that because we made neural networks, because they're artificial intelligence, we have a perfect understanding of how they work. It couldn't be further from the truth.

Neural networks, AI models that you use today, are grown, not built. So, we then need to do a lot of work after they're trained to figure out to the best of our abilities how they're actually going about their reasoning.

And so, three and a half years ago, this kind of agenda of applying mechanistic interpretability to large language models started with Chris Olah leaving OpenAI, co-founding Anthropic. And every roughly six months since then, we've had a major breakthrough in our understanding of these models.

And so first with toy models of superposition, we established that models are really trying to cram as much information as they possibly can into their weights. And this goes directly against people saying that neural networks are over-parameterized. In classic AI machine learning back in the day, you would use linear regression or something like it, and people had a meme of AI, or neural networks, deep learning, using way too many parameters. There's this funny meme that you should show of layers on the X axis and layers on the Y axis and this jiggly line that just goes up and it's like, "Oh, just throw more layers at it."

But it actually turns out that, at least for really hard tasks like being able to accurately predict the next token for the entire internet, these models just don't have enough capacity. And so they need to cram in as much as they can. And the way they learn to do that is to use each of their neurons, or units of computation in the model, for lots of different things.

And so if you try to make sense of the model and be like, "Oh, if I remove this one neuron," what is it doing in the model? It's impossible to make sense of it. It'll fire for like Chinese and fishing and horses and, I don't know, just like a hundred different things. And it's because it's trying to juggle all these tasks and use the same neuron to do it. So that's superposition.

Nine months later, we write Towards Monosemanticity, which introduces what are called sparse autoencoders. And so going off what I just said of the model trying to cram too much into too little space, we give it more space, this higher dimensional representation, where it can then more cleanly represent all of the concepts that it's understanding.

And, and this was a very toy paper in so much as it was a two layer, really small, really dumb transformer. And we fit up to 16,000 features, which we thought was a ton at the time. Fast-forward nine months, we go from a two layer transformer to our Claude 3 Sonnet, frontier model at the time, and fit up to 30 million features. And this is where we start to find really interesting abstract concepts, like a feature that would fire for code vulnerabilities. And it wouldn't just fire for code vulnerabilities. It would even fire for like, you know that Chrome page you get if it's not an HTTPS URL, like "Warning, this site might be dangerous. Click to continue." And also fire for that, for example. And so it's like these much more abstract coding variables or sentiment features, amongst the 30 million.

Fast-forward nine months from that, and now we have circuits. And I threw in the analogy earlier of the Ocean 11 heist team, where now you're identifying individual features across the layers of the model that are all working together to perform some complicated task. And you can get a much better idea of how it's actually doing the reasoning and coming to decisions, like with the medical diagnostics.

One example I didn't talk about before with how the model retrieves facts: So you say, "What sport did Michael Jordan play?" And not only can you see it hop from like Michael Jordan to basketball and answer basketball. But the model also has an awareness of when it doesn't know the answer to a fact. And so, by default, it will actually say, "I don't know the answer to this question." But if it sees something that it does know the answer to, it will inhibit the "I don't know" circuit and then reply with the circuit that it actually has the answer to. So, for example, if you ask it, "Who is Michael Batkin?" —which is just a made-up fictional person— it will by default just say, "I don't know." It's only with Michael Jordan or someone else that it will then inhibit the "I don't know" circuit.

But what's really interesting here and where you can start making downstream predictions or reasoning about the model, is that the "I don't know" circuit is only on the name of the person. And so, in the paper we also ask it, "What paper did Andrej Karpathy write?" And so it recognizes the name Andrej Karpathy, because he's sufficiently famous, so that turns off the "I don't know" reply. But then when it comes time for the model to say what paper it worked on, it doesn't actually know any of his papers, and so then it needs to make something up. And so you can see different components and different circuits all interacting at the same time to lead to this final answer.

Dwarkesh Patel

Why think it's a tractable problem to understand every single thing that's happening in a model? Or like that's the best way to understand why it's being deceptive. If you wanted to explain why England won World War II using particle physics, you would just be on the wrong track. You just want to look at the high-level explanations of, who had more weapons? What did they want?

That seems analogous to just training linear probes for like, are you honest? Are you being deceptive? Do we catch you doing bad things when we're red teaming you? Can we monitor you?

Why is this not analogous where we're asking a particle physicist to just backtrack and explain why England won World War II?

Trenton Bricken

I feel like you just want to go in with your eyes wide open, not making any assumptions for what that deception is going to look like, or what the trigger might be. The wider you can cast that net, the better. Depending on how quickly AI accelerates and where the state of our tools are, we might not be in the place where we can prove from the ground up that everything is safe. But I feel like that's a very good North Star. It's a very powerful reassuring North Star for us to aim for, especially when we consider we are part of the broader AI safety portfolio.

I mean, do you really trust—you're about to deploy this system and you really hope it's aligned with humanity—that you've successfully iterated through all the possible ways that it's going to scheme or sandbag or…

Dwarkesh Patel

But that's also probably going to be true with whatever you find. You're still going to have variants that you haven't explained. Or you found a feature, but you don't know if it actually explains deception or something else instead.

Trenton Bricken

First of all, I'm not saying you shouldn't try the probing approach. We want to pursue the entire portfolio. We've got the therapist interrogating the patient by asking, "Do you have any troubling thoughts?" We've got the linear probe, which I'd analogize to a polygraph test where we're taking very high level summary statistics of the person's well-being. Then we've got the neurosurgeons going in and seeing if you can find any brain components that are activating and troubling or off-distribution ways. I think we should do all of it.

Dwarkesh Patel

What percent of the alignment portfolio should mech interp be?

Trenton Bricken

I think as much of a chunk as is necessary. It’s hard to define. At Anthropic, I feel like all of the different portfolios are being very well-supported and growing.

Sholto Douglas

Coming back to the World War II question, you can think of it as a hierarchy of abstractions of trust here, where let's say you want to go and talk to Churchill. It helps a lot if you can verify that in that conversation, in that 10 minutes, he's being honest. This enables you to construct better meta narratives of what's going on. So maybe particle physics wouldn't help you there, but certainly the neuroscience of Churchill's brain would help you verify that he was being trustworthy in that conversation and that the soldiers on the front lines were being honest in their depiction of their description of what happened, this kind of stuff. So long as you can verify parts of the tree up, then that massively helps you build confidence.

Trenton Bricken

I think language models are also just really weird. With the emergent misalignment work. I don't know if they took predictions, they should have of like, "Hey, I'm going to fine tune ChatGPT on code vulnerabilities. Is it going to become a Nazi?" I think most people would've said no. That's what happened.

Dwarkesh Patel

How did they discover that it became a Nazi?

Trenton Bricken

They started asking it a ton of different questions and it will do all sorts of vile and harmful things. The whole persona just totally changes. We are dealing with alien brains here who don't have the social norms of humans. We don’t even have a clear notion of what they have and haven't learned. I think you really want to go into this with eyes wide open.

01:56:15 – How countries should prepare for AGI

Dwarkesh Patel

Backing up from mech interp, if we live in a world where AI progress accelerates… By the way, you were mentioning a little while ago that there's many wild worlds we could be living in, but we're living in at least one of them. Another one that we've gestured at but it's worth making more explicit, is this. Even if the AI models are not helping write the next training algorithm for their successor, just the fact that if they had human level learning efficiency—whatever copy of the model is learning on the job—the whole model is learning. So in effect, it's getting–

Sholto Douglas

Or even if they're like a thousand times less efficient than humans are at learning and you deployed them. Even still.

Dwarkesh Patel

Exactly. Anyways, there's a whole bunch of other things that you can think of about it. But even there, you kind of have a broadly deployed intelligence explosion.

Sholto Douglas

I do think it's worth pressing on that future. There is this whole spectrum of crazy futures. But the one that I feel we're almost guaranteed to get—this is a strong statement to make—is one where at the very least, you get a drop-in white collar worker at some point in the next five years. I think it's very likely in two, but it seems almost overdetermined in five. On the grand scheme of things, those are kind of irrelevant timeframes. It's the same either way.

That completely changes the world over the next decade. If we don't have the right policies in place for that, then you end up actually with in some respects, almost a fundamentally worse world.

Because the thing that these models get good at by default is software engineering and computer using agents and this kind of stuff. Then we will need to put in extra effort to put them in the loops where they help us with scientific research. Or we have the right robotics, such that we actually experience an increase in material quality of life.

That's worth thinking about. If you're in the perspective of like, “I'm a country, what should I be doing or thinking about?” Plan for the case where white collar work is automateable. And then consider, what does that mean for your economy? What you should be doing to prepare policy?

Dwarkesh Patel

What should you be doing to prepare? Because honestly, this is such a tough question if you're India or Nigeria or Australia. If you're a country unlike America or China where they do have frontier models, what is it that you should be doing right now? Especially on such a short timescale.

Sholto Douglas

I think one very important point is that let's say this scenario turns out true. Then compute becomes the most valuable resource in the world. The GDP of your economy is dramatically affected by how much compute you can deploy towards the organizations within your country. So having some guaranteed amount of compute I think will actually be quite important. Getting ahead of investments, and data centers, and this kind of stuff on the condition that it's companies in your country have to be allowed to use that compute, not necessarily for training but just even just for inference.

I think the economic value here comes from inference. I think it also makes sense to invest broadly in AI. These countries have the opportunity to do so and that's a portfolio of foundation model companies but also robotics, supply chain, and this kind of stuff. I think that you should invest very proactively in policies that try to prevent capital lock-in.

We're in for a much worse world if it just so happens that the people who had money in the stock exchange, or in land before AGI are dramatically more wealthy than the people who don't. It's a gross misallocation of resources.

One of my favorite episodes actually on your podcast was the Georgism one where you're trying to appropriately value, or allocate land. This strikes particularly close to home coming from Australia where I think our policies with respect to land are grossly wrong. But I think this is broadly true.

Being very forward on regulation of integration of these models into your country is important, and proactively making sure that people have choice. Let's say you should be quite proactive about making sure that the phones, or devices, or the glasses that people have, people have free choice on what things they run.

So we just get the white collar worker, and you're trying to do the best to prepare your country for that. Then what can you do to make all possible versions of the future go well? That's covering some amount of economic downside. The other things that I think are really important is figure out how you can basically ensure dramatic upside, or cover terrible downside.

Getting a dramatic upside is making sure that there is investment in biology research and this kind of stuff in an automated way such that these models are actually able to produce novel medicines that massively improve our quality of life. Covering the downside is AI alignment research, and this kind of stuff, and automated testing, and really thinking hard about that, AI safety institutes and this kind of stuff.

Dwarkesh Patel

But these seem like things that a random rich person could also do. It seems like there's not a thing that a nation state is uniquely equipped to do in this scenario.

Sholto Douglas

That's a good point.

Sholto Douglas

I mean dramatic allocation of resource towards compute I think is sensible. I would be doing that if I was in charge of a nation state. I think it just increases your optionality in most of the future worlds.

Trenton Bricken

Dylan Patel has some scary forecasts on US energy.

Sholto Douglas

Versus China. Yes.

Trenton Bricken

Yeah, we're like 34 gigawatts off.

Sholto Douglas

Yeah, the US's line is flat, basically, and China's line is like this. And I mean the US very clearly...

Trenton Bricken

We just need so many more power plants.

Sholto Douglas

Yes. If intelligence becomes this incredibly valuable input, intelligence becomes almost a raw input into the economies and quality of life of the future, the thing directly underneath that is energy. Making sure that you have incredible amounts of solar, like tile the desert with solar panels, some parts of the desert. That would be helpful towards making sure that you have more access to intelligence on tap.

Trenton Bricken

Yeah. Just to make it explicit, we've been touching on it here. Even if AI progress totally stalls, you think that the models are really spiky, and they don't have general intelligence. It's so economically valuable, and sufficiently easy to collect data on all of these different jobs, these white collar job tasks, such that to Sholto's point we should expect to see them automated within the next five years.

Sholto Douglas

Yeah.

Trenton Bricken

Even if you need to hand spoon every single task to the model.

Sholto Douglas

It's economically worthwhile to do so. Even if algorithmic progress stalls out, and we just never figure out how to keep progress going—which I don't think is the case, that hasn't stalled out yet, it seems to be going great—the current suite of algorithms are sufficient to automate white collar work provided you have enough of the right kinds of data. Compared to the TAM of salaries for all of those kinds of work, it is so trivially worthwhile.

Trenton Bricken

Yeah, exactly. I do just want to flag as well that there's a really dystopian future if you take Moravec’s paradox to its extreme. It’s this paradox where we think that the most valuable things that humans can do are the smartest things like adding large numbers in our heads, or doing any sort of white collar work. We totally take for granted our fine motor skill, and coordination. But from an evolutionary perspective it's the opposite.

Evolution has optimized fine motor coordination so well. Even if you look at robot hands, the ability to open a door is still just really hard for robots. Meanwhile, we're seeing this total automation of coding, and everything else that we've seen as clever. The really scary future is one in which AIs can do everything except for the physical robotic tasks, in which case you'll have humans with AirPods, and...

Sholto Douglas

Glasses?

Trenton Bricken

Glasses, and there'll be some robot overlord controlling the human through cameras by just telling it what to do, and having a bounding box around the thing you're supposed to pick up. So you have human meat robots.

Sholto Douglas

Not necessarily saying that that's what the AIs would want to do, or anything like that. But if you were to be like, "What are the relative economic value of things?" The AIs are out there doing computer programming, and the most valuable thing that humans can do is be amazing robots.

Now that being said, I think Moravec’s paradox is a little bit fake. I think the main reason that robots are worse at being a robot than they are at software engineering is the internet exists for software engineering. GitHub exists, and there is no equivalent thing if you had all mocap of everyone's actions as they were going about their daily lives for some reasonable fraction of the human population, robotics is also close to solved, on track to be solved at the same rate that software engineering is on track to be solved.

So, this vision is only a sort of decade-long section, but it's still a pretty terrible decade. Imagine the world where people have lost their jobs, you haven't yet got novel biological research. That means people's quality of life isn’t dramatically better. You don't yet have material abundance because you haven't actually been able to action the physical world in the necessary way. You can't build dramatically more, because building dramatically more takes robots basically, and people's main comparative advantage is as fantastic robots. That’s a shocking, shocking world.

Dwarkesh Patel

Yeah. From the perspective of an average human, I think it actually might be better. Your wages will be higher because you're the complement to something that is enormously valuable which is AI labor.

Sholto Douglas

And a decade, or two after, the world is fantastic. Robotics is solved, and you start to get radical abundance basically provided that you have all the policies set up necessary to permit building. You end up with that same change like the before vs. after photos of Shanghai where 20 years on, it's this dramatically transformed city.

A lot of places in the world probably end up like that over that two-decade period. But we need to do our best to estimate if this is actually what is on track to happen. Build SWE-bench, but for all the other forms of white collar work, and measure, and track. That's a great thing that governments should be doing by the way, trying to break down the functions of their economy into measurable tasks, and figuring out what does the curve actually look like for that?

They might be a bit shocked by the progress there. There's no SWE-bench for a tax eval. I don't have all the answers here, but then we need to figure out a way to share the proceeds of this economy broadly across people, or invest heavily in robotics, and collect the data so that we get robotics faster, and we get material abundance faster. Invest in biological research that we get, but all that faster. Basically try and pull forward the radical upside, because otherwise you have a pretty dark section.

Dwarkesh Patel

I think one thing that's not appreciated enough is how much of our leverage on the future—given the fact that our labor isn't going to be worth that much—comes from our economic, and political systems surviving. For your million X'd S&P equity to mean something, for your contracts to mean anything, for the government to be able to tax the AI labor, and give you a UBI off of that, that requires our legal institutions, our economic institutions, our financial rails surviving into the future.

Sholto Douglas

Yes.

Dwarkesh Patel

The way in which that likely happens is if it's also in the AIs best interests that they follow those rails. By AI I don't mean some monolithic single AI, I just mean firms which are employing AI, and becoming more productive as a result.

You don't want to be in a position where it's so onerous to operate in our system that you're basically selecting for firms who either emigrate, or who are doing black market stuff, et cetera. You want to make it super, super easy to deploy AI, have the equivalent of special economic zones, et cetera. Otherwise you are just surrendering the future outside of any control that you might have on it.

One of the reasons that I worry about turning AGI into a national security issue, or having it have extremely close ties with the government, the Manhattan Project thing, is that it disproportionately redirects the use of AI towards military tech, mosquito drones and whatever. It also naturally puts other countries in the same frame of mind. If we're developing the mosquito drones, why would China not develop the mosquito drones?

That just seems like a zero-sum race, and not to mention a potentially catastrophic one. Whereas compute will be limited, we'll need to disproportionately accelerate some things. To the extent it just remains totally like a consumer free market landscape, it just seems more likely that we'll get the glorious transhumanist future where they're developing the things that make human life better.

Sholto Douglas

Yes, I mean I agree. The case where you end up with two national projects facing off against each other is dramatically worse. We don't want to live in that world. It's much better if this stays a free market, so to speak.

02:10:26 – Automating white collar work

Dwarkesh Patel

Okay. I want to take issue with your claim that even with the algorithms of today, if we just collect enough data that we could automate white collar work.

First, let me get an understanding of what you mean by that. Do you mean that we would do the analogous thing of pre-raining with all the trajectories of everything people would do on their jobs? Could you make either manually, or through some other process, some RL procedure based on the screen recordings, every white collar worker. What kind of thing are you imagining?

Sholto Douglas

I mean a continuous distribution of this stuff. One important mental model to think about RL… There is some respect with which longer horizon, if you can do them, if you can get that reward ever, are easier to judge. Again, it's come back to that can you make money on the internet? That's an incredibly easy reward signal to judge. But to do that there's a whole hierarchy of complex behavior. So, if you could pre-train up to the easy to judge reward signals, does your website work? Does it go down, do people like it?

There's all these reward signals that we can respond to because we have a long, we can progress through these long enough trajectories to actually get to interesting things. If you're stuck in this regime where you need a reward signal every five tokens, it's a way more painful, and long process. But if you could pre-train on every screen in America, then probably the RL tasks that you can design are very different from if you could only take the existing internet as it is today. How much of that you get access to changes the mix.

Dwarkesh Patel

As we're training them on longer, and longer horizon tasks, and it takes longer for them to get any signal on whether they successfully complete the task, will that slow down progress because it takes more compute per task?

Trenton Bricken

I do think there's this notion that the longer, the harder tasks, the more training is required. I'm sympathetic to that naively, but we as humans are very good at practicing the hard parts of tasks, and decomposing them. I think once models get good enough at the basic stuff, they can just rehearse, or fast-forward to the more difficult parts.

Sholto Douglas

I mean that's definitely one of the big complexities. As you use more compute, and as you train on more, and more difficult tasks, your rate of improvement of biology for example is going to be somewhat bound by the time it takes a cell to grow in a way that your rate of improvement on math isn't, for example.

So, yes, but I think for many things we'll be able to parallelize widely enough, and get enough iteration loops.

Dwarkesh Patel

Will the regime of training new models go away? Will we eventually get to the point where you've got the model, and then you just keep adding more skills to it, with RL training?

Sholto Douglas

That depends on whether, or not you think there's a virtue in pre-training a new architecture. Basically you make some architectural change, then you probably need to do some form of at least pretraining a new model.

Dwarkesh Patel

If RL requires a bunch of inference to do the training in the first place, does that push against the thing you were talking about where we actually need a bigger model in order to have brain-like energy? But then also it's more expensive to train it in RL. So, where does that balance out?

Trenton Bricken

I think we got to drink the bitter lesson here. Yeah, there aren't infinite shortcuts. You do just have to scale and have a bigger model, and pay more inference for it. If you want AGI, then that's what you got to pay the price of.

Sholto Douglas

But there's a tradeoff equation here. There is science to do which everyone is doing. What is the optimal point at which to do RL? Because you need something which can both learn, and discover the sparse reward itself.

So you don't want a one parameter model. Useless, even though you can run it really fast. You also don't want a 100T model. It's super slow. The marginal benefit of its learning efficiency is not worth it. So, there's a Pareto frontier here. What's the optimal model size of your current class of capabilities, and your current set of RL environments, and this kind of stuff.

Trenton Bricken

And even in the last year there's been much more of a factor of the inference cost. So, just explicitly the bigger the model, the more expensive it is to do a forward pass and generate tokens. The calculus used to just be, “Should I allocate my flops to more training data, or a bigger model?” And now another huge factor is how much am I actually going to do forward passes on this model once it's trained?

Sholto Douglas

My total pool of compute, how do I allocate that across training data compute, and inference compute for the RL training.

Trenton Bricken

And then even within inference, there's all this research on, well, what strategy should I use? Should I sample 10, and take the best? Do I do this sort of branching search, et cetera, et cetera. And so with RL where you're sampling a whole lot of tokens, you also need to factor in the ability for the model to actually generate those tokens, and then learn, and get feedback.

02:15:35 – Advice for students

Dwarkesh Patel

If we're living in this world, what is your advice to somebody early in their career, or a student in college? What should they be planning on doing?

Sholto Douglas

Once again, it's worth considering the spectrum of possible worlds and preparing yourself for that. The action that I think is the highest EV in that case is that at a minimum you're about to get dramatically more leverage. You already have. Already the startups in YC are writing huge amounts of their code with Claude.

What challenges, what causes do you want to change in the world with that added leverage? If you had 10 engineers at your beck, and call, what would you do? If you had a company at your beck and call, what would that enable you to do? What problems, and domains suddenly become tractable? That's the world you want to prepare for.

Now, that still requires a lot of technical depth. Obviously there is the case where AI just becomes dramatically better than everyone at everything, but for at least a while there is… I think Jensen actually talked about this in an interview in an interesting way. He's like, "I have a hundred thousand general intelligences around me, and I'm still somewhat useful, because I’m there directing the values, and asking them to do things. I still have value even though I have a hundred thousand general intelligences."

For many people, I think that will still be true for a fair while. Then as the AIs get better, and better, and better, and so on, eventually, no. But again, prepare for the spectrum of possible worlds because in the event where we're just totally outcompeted, it doesn't matter what you do. In all the other worlds, it matters a lot. Get the technical depth, study biology, study CS, study physics. Think hard about what challenges you want to solve in the world.

Dwarkesh Patel

Yeah, that's a lot of topics. That's a lot of shit.

Sholto Douglas

You can now. You can. It's so much easier to learn. Everyone now has the infinite perfect tutor.

Dwarkesh Patel

It's definitely been helpful to me.

Trenton Bricken

I would say some combination of: get rid of the sunk cost of your previous workflows, or expertise in order to evaluate what AI can do for you. Another way to put this, which is fun, is just be lazier in so much as you figure out the way that the agent can do the things that are toilsome.

Ultimately, you get to be lazier, but in the short run, you need to critically think about the things you're currently doing, and what an AI could actually be better at doing, and then go, and try it, or explore it. Because I think there's still just a lot of low-hanging fruit of people assuming, and not writing the full prompt, giving a few examples, connecting the right tools for your work to be accelerated and automated.

Dwarkesh Patel

Yep, yep. There's also the sunk cost of feeling like since you're not "early to AI", that you've sort of missed the boat. I remember when GPT-3 came out…

So backstory on the podcast, when I graduated college I was planning on doing some sort of AI wrapper startup, and the podcast was just a gateway into doing that. I was trying out different things and at the time I remember thinking, “oh, 3.5 is out.” People were like, "I'm so behind on the startup scene here” or whatever. If I wanted to make my own wrapper… maybe the idea of the wrapper was inadvisable in the first place.

But every time feels early because if it's an exponentially growing process, and there are many things, many ideas which are only becoming possible now, right?

Sholto Douglas

Exactly. It's that product exponential I talked about.

Dwarkesh Patel

That's right.

Sholto Douglas

Products literally obsolete it. You need to constantly reinvent yourself to stay at the frontier of capabilities.

Dwarkesh Patel

Do you remember? I had a really shitty idea, and I gave you a call.

Sholto Douglas

I don’t remember what it was.

Dwarkesh Patel

I think it was like RAG for lawyers, or something. Anyways, I think one of our first interactions was like, "Hey, what do you think of this idea?" And you were like, “I think the podcast sounds promising.”

Sholto Douglas

I was right.

Dwarkesh Patel

Which I appreciate.

Trenton Bricken

Yeah. I got slightly annoyed at a friend recently who I think is really talented and clever and interested in AI but has pursued a biology route. I just kind of tried to shake them like, "You can work on AI if you want to."

Humans are biological general intelligences where a lot of the things of value are just very general. Whatever kind of specialization that you've done maybe just doesn't matter that much. Again, it gets back to the sunk cost, but so many of the people, even my colleagues at Anthropic are excited about AI. They just don't let their previous career be a blocker. Because they're just innately smart, talented, driven, whatever else, they end up being very successful and finding roles. It's not as if they were in AI forever. I mean, people have come from totally different fields. Don't think that you need permission from some abstract entity to get involved, and apply, and be able to contribute.

Dwarkesh Patel

If somebody wanted to be an AI researcher right now, if you could give them an open problem, or the kind of open problem that is very likely to be quite impressive, what would it be?

Sholto Douglas

I think that now that RL's come back, papers building on Andy Jones's “Scaling scaling laws for board games” are interesting. Investigating these questions like the ones you asked before. Is the model actually learning to do more than its previous pass at K? Or is it just discovering that… Exploring questions like that deeply are interesting, scaling laws for RL, basically.

Dwarkesh Patel

I'd be very curious to see how much the marginal increase is in meta learning from a new task, or something.

Trenton Bricken

On that note, I think model diffing has a bunch of opportunities. People say, "Oh, we're not capturing all the features. There's all this stuff left on the table." What is that stuff that's left on the table? If the model's jailbroken, is it using existing features that you've identified? Is it only using the error terms that you haven't captured? I don't know. There's a lot here.

I think MATS is great. The Anthropic fellowship has been going really well. Goodfire, Anthropic invested in recently, they're doing a lot of interpretability work, or just apply directly to us.

Dwarkesh Patel

Anything to get your equity up, huh?

Trenton Bricken

There's just so many interpretability projects. There's so much low-hanging fruit, and we need more people, and I don't think we have much time.

Sholto Douglas

I also want to make a plug for performance engineering. This is one of the best ways to demonstrate that you have the raw ability to do it. If you made an extremely efficient transform implementation on TPU, or Trainium, or Incuda, then I think there's a pretty high likelihood that you'll get a job offer. But there's a relatively small pool of people that you can trust to completely own end-to-end the performance of a model.

Trenton Bricken

And if you have broad, deep electrical engineering skills, I think you can probably come up to speed pretty fast on accelerator stuff.

Sholto Douglas

You can come up to speed reasonably fast and it teaches you a lot of good intuitions of the actual intricacies of what's going on in the models, which means that you're then very well-placed to think about architecture and this kind of stuff. One of my favorite people in thinking about architecture at Anthropic at the moment actually came from a heavy GPU kernel programming background and just knows the ins, and outs really deeply. He can think about the trade-offs really well.

Dwarkesh Patel

This was fun guys. Thanks for doing it again.

Trenton Bricken

Great to be back.