Dwarkesh Podcast

Blog prize for the big questions about AI

Dwarkesh Patel — Fri, 24 Apr 2026 16:37:49 GMT

There has never been a time where excellent intellectual output on the right question has been more valuable or more urgent. Compelling answers can inform the most important economic and foreign policy decisions that will ever be made, the deployment of (at least) hundreds of billions of philanthropic dollars, and the training and governance of superintelligences.

I’m announcing a $20,000 blog prize in order to find people who will excel at researching and thinking through these problems. The not-so-secret point of this whole contest is so that I can hire a research collaborator to think through questions like this hand in hand with me. See more at the end.

Pick a question below, and spend no more than 1,000 words answering it. 1st, 2nd, and 3rd place will get $10,000, $6,000, and $4,000 respectively. I’ll publish the winning entry (and potentially the runners-up) on my blog. Please submit by May 10th, 11:59 PM PST.

Questions - choose one

A couple of years ago, there was this idea that AI progress might slow down as we move further into the RL regime, for two reasons. 1. As horizon lengths increase, the AI needs to do many days’ worth of work before we can even see if it did it right, so if we’re still in a naive policy gradient world, the reward signal per FLOP goes down. 2. We’d crossed through many OOMs of RL compute from GPT-4 to o1 to o3, and it would not be feasible to replicate that many OOMs of increase in compute again immediately. But AI progress seems to have been fast nonetheless - even potentially speeding up if rumors about Spud or Mythos are to be believed. What gives? What did that previous intuition pump that motivated longer timelines miss? Feel free to deny the premise of the question.
What’s the most plausible story where foundation model companies actually start making money? If you consider each individual model as a company, then its profits may be able to pay back the training cost. But of course, if you don’t train a bigger, more expensive model immediately, then you stop making money after 3 months. So when does the profit start? Maybe at some point scaling will plateau, but if progress at the frontier has slowed down, then the combination of distillation and low switching costs (cloud margins result from high switching costs) makes it really easy for open source to catch up to the labs, eating into their margins. So how do the labs actually start making money?
With OpenAI’s new raise at an $852B valuation, OpenAI Foundation’s stake is now worth $180B. Anthropic’s cofounders have pledged to donate 80% of their wealth. Nobody seems to have a concrete idea of how to deploy 100s of billions (soon trillions) of wealth productively to “make AI go well”. If you were in charge of the OpenAI Foundation right now, what exactly would you do? And when? It’s not enough to identify a cause you think is important, because that doesn’t answer the fundamental problem of how you convert money to impact. Identify the concrete strategy you recommend pursuing.
What should countries which are not currently in the AI production chain (semis, energy, frontier models, robotics) do in order to not get totally sidestepped by transformative AI? If you’re the leader of India or Nigeria, what do you do right now?

Rules and tips

Please don’t let a lack of domain expertise dissuade you from entering. I’m looking for someone who can ramp up fast on unfamiliar topics and think clearly.
Each entrant may submit only once.
You are still eligible for this essay competition even if you’re not interested in the researcher role. Nor does winning this competition guarantee that you will be offered the role.
You’re welcome to use LLMs to help you research, but I specifically picked these questions because I’ve found LLM answers to them unsatisfying. On these kinds of ambiguous questions, LLMs are too all over the place. For example, they’ll identify 5 plausible answers but not have the context and taste to identify the crucial factor and iron out its implications.
You only have 1,000 words - make them count. People have the habit of spending the first paragraphs clearing their throat - avoid that.

Why am I hiring for a researcher?

I want my podcast/blog to move from just asking questions about AI to actually helping answer them. But there are too many important questions, and I need a collaborator to build up context on them all, to explore dozens of fractal sub-questions, to consider the rebuttals and syntheses, and to sharpen each other’s thinking.

The questions I want us to explore are very broad while at the same time requiring deep technical analysis across many domains to actually answer.

Why am I hiring this way?

Well, I could just put out a job ad for a researcher, but I’ll get 1,000 different resumes, and I’ll have no clue based on that information whether the applicant would be any good at synthesizing lots of technical arguments and information. So I thought, let’s just list out some questions where I genuinely don’t know the answer and would be keen to get some insight.

What this role looks like

Ideally in person in San Francisco, but potentially open to remote.
Will pay competitively

Submit here

If you have questions or comments, email me at hello@dwarkeshpatel.com.

Jensen Huang – TPU competition, why we should sell chips to China, & Nvidia’s supply chain moat

Dwarkesh Patel — Wed, 15 Apr 2026 15:45:23 GMT

I asked Jensen about TPU competition, Nvidia’s lock on the ever more bottlenecked supply chain needed to make advanced chips, whether we should be selling AI chips to China, why Nvidia doesn’t just become a hyperscaler, how it makes its investments, and much more. Enjoy!

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Crusoe’s cloud runs on state-of-the-art Blackwell GPUs, with Vera Rubin deployment scheduled for later this year. But hardware is only part of the story—for inference, Crusoe’s MemoryAlloy tech implements a cluster-wide KV cache, delivering up to 10x faster TTFT and 5x better throughput than vLLM. Learn more at crusoe.ai/dwarkesh

Cursor helped me build an AI co-researcher over the course of a weekend. Now I have an AI agent that I can collaborate with in Google Docs via inline comment threads! And while other agentic coding tools feel like a total black-box, Cursor let me stay on top of the full implementation. You can try my co-researcher out here, or get started on your own Cursor project today at cursor.com/dwarkesh
Jane Street spent ~20,000 GPU hours training backdoors into 3 different language models, then challenged my audience to find the triggers. They received some clever solutions—like comparing the base and fine-tuned versions and extrapolating any differences to reveal the hidden backdoor—but no one was able to solve all 3. So if open problems like this excite you, Jane Street is hiring. Learn more at janestreet.com/dwarkesh

Timestamps

(00:00:00) – Is Nvidia’s biggest moat its grip on scarce supply chains?

(00:16:25) – Will TPUs break Nvidia’s hold on AI compute?

(00:41:06) – Why doesn’t Nvidia become a hyperscaler?

(00:57:36) – Should we be selling AI chips to China?

(01:35:06) – Why doesn’t Nvidia make multiple different chip architectures?

Transcript

00:00:00 – Is Nvidia’s biggest moat its grip on scarce supply chains?

Dwarkesh Patel

We’ve seen the valuations of a bunch of software companies crash because people are expecting AI to commoditize software. There’s a potentially naive way of thinking about things, which is: look, Nvidia sends a GDSII file to TSMC. TSMC builds the logic dies, it builds the switches, then it packages them with the HBM that SK Hynix, Micron, and Samsung make. Then it sends it to an ODM in Taiwan where they assemble the racks. Nvidia is fundamentally making software that other people are manufacturing, and if software gets commoditized, does Nvidia get commoditized?

Jensen Huang

In the end, something has to transform electrons to tokens. The transformation of electrons to tokens and making those tokens more valuable over time is hard to completely commoditize. The transformation from electrons to tokens is such an incredible journey. Making that token is like making one molecule more valuable than another molecule, making one token more valuable than another. The amount of artistry, engineering, science, and invention that goes into making that token valuable, obviously we’re watching it happen in real time. The transformation, the manufacturing, all of the science that goes in there is far from deeply understood and the journey is far from over. I doubt that it will happen.

We’re going to make it more efficient, of course. The way that you framed the question is my mental model of our company. The input is electrons, the output is tokens. In the middle is Nvidia. Our job is to do as much as necessary and as little as possible to enable that transformation to be done at incredible capabilities. What I mean by “as little as possible,” whatever I don’t need to do, I partner with somebody and make it part of my ecosystem.

If you look at Nvidia today, we probably have the largest ecosystem of partners, both in the supply chain upstream and downstream, all of the computer companies, application developers, and model makers. AI is a five-layer cake, if you will. We have ecosystems across the entire five layers. We try to do as little as possible, but the part that we have to do, as it turns out, is insanely hard. I don’t think that gets commoditized.

In fact, I also don’t think the enterprise software companies, the tools makers… Most software companies today are tool makers. Some of them are not. Some of them are workflow codification systems. But for a lot of companies, they’re tool makers. For example, Excel is a tool, PowerPoint is a tool, Cadence makes tools, Synopsys makes tools. I actually see the opposite of what people see. I think the number of agents is going to grow exponentially, and the number of tool users is going to grow exponentially. It’s very likely that the number of instances of all these tools is going to skyrocket.

It’s very likely that the number of instances of Synopsys Design Compiler is going to skyrocket, along with the number of agents using the floor planners, our layout tools, and our design rule checkers. Today we’re limited by the number of engineers. Tomorrow, those engineers are going to be supported by a bunch of agents. We’re going to be exploring the design space like you’ve never seen before, and we’re going to use the tools that we use today.

I think tool use is going to cause the software companies to skyrocket. The reason why it hasn’t happened yet is because the agents aren’t good enough at using their tools yet. Either these companies are going to build the agents themselves, or agents are going to get good enough to be able to use those tools. I think it’s going to be a combination of both.

Dwarkesh Patel

I think in your latest filings, you had almost $100 billion in purchase commitments with foundries, memory, and packaging. SemiAnalysis has reported that you will have $250 billion of these kinds of purchase commitments. One interpretation is that Nvidia’s moat is really that you’ve locked up many years of these scarce components. Somebody else might have an accelerator, but can they actually get the memory to build it? Can they actually get the logic to build it? Is this really Nvidia’s big moat for the next few years?

Jensen Huang

It’s one of the things that we can do that is hard for someone else to do. We’ve made enormous commitments upstream. Some of it is explicit, these commitments that you mentioned. Some of it is implicit. For example, a lot of the investments that are upstream are made by our supply chain because I said to the CEOs, “Let me tell you how big this industry is going to be, let me explain to you why, let me reason through it with you, and let me show you what I see.”

As a result of that process of informing, inspiring, and aligning with CEOs of all different industries upstream, they’re willing to make the investments. Why are they willing to make the investments for me and not someone else? The reason for that is because they know that I have the capacity to buy their supply and sell it through my downstream. The fact is that Nvidia’s downstream supply chain and our downstream demand is so large, they’re willing to make the investment upstream.

If you look at GTC, people marvel at the scale of it and the people that go. It’s a full 360 degrees, the entire universe of AI all in one place. They’re all in one place because they need to see each other. I bring them together so that the downstream can see the upstream, the upstream can see the downstream, and all of them can see the advances in AI. Very importantly, they can all meet the AI natives, all the AI startups being built, and all the amazing things happening so they can see firsthand all the things that I tell them. I spend a lot of my time informing, directly or indirectly, our supply chain, partners, and ecosystem about the opportunity in front of us.

Some people always say, “Jensen, in most keynotes, it’s one announcement after another.” With our keynotes, there’s always a part of it that’s a little torturous in the sense that it almost comes across like education. In fact, that’s exactly on my mind. I need to make sure the entire supply chain, upstream and downstream, the ecosystem, understands what is coming at us, why it’s coming, when it’s coming, how big it’s going to be, and is able to reason about it systematically, just like I reason about it.

Regarding the moat as you describe it, we’re able to build for a future. If our next several years are a trillion dollars in scale, we have the supply chain to do it. Without our reach, the velocity of our business… Just as there’s cash flow, there’s supply chain flow, there are turns. Nobody is going to build a supply chain for an architecture if the business turns are low. Our ability to sustain the scale is only because our downstream demand is so great. And they see it, they hear about it, they see it all coming. That allows us to do the things we’re able to do at the scale we do them.

Dwarkesh Patel

I do want to understand more concretely whether the upstream can keep up. For many years now, you guys have been 2x-ing revenue year over year. You’ve been more than tripling the amount of flops you’re providing to the world year over year.

Jensen Huang

And 2x-ing at this scale now is really incredible.

Dwarkesh Patel

Exactly. But then you look at logic. You’re the biggest customer on TSMC’s N3 node, and you’re one of the biggest on N2. AI as a whole this year is going to be 60% of N3. It’s going to be 86% next year, according to SemiAnalysis. How do you double if you’re the majority? And how do you do that year over year? Are we in a regime now where the growth rate in AI compute has to slow because of upstream? Do you see a way to get around this? How do we build 2x more fabs year over year, ultimately?
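Editor's note: the arithmetic behind "how do you double if you're the majority?" can be sketched roughly as follows. This is an illustration, not something from the interview. It uses only the two share figures quoted above (60% and 86% of N3), treats "doubling" as doubling AI wafer volume on that one node, and ignores per-wafer efficiency gains and other nodes. The function names are made up for this sketch.

```python
def ai_wafer_growth(share_now: float, share_next: float, capacity_growth: float) -> float:
    """Ratio of next year's AI wafer volume to this year's on one node.

    share_now / share_next: AI's fraction of the node's output in each year.
    capacity_growth: next year's total node capacity divided by this year's.
    """
    return (share_next * capacity_growth) / share_now


def capacity_growth_needed(share_now: float, share_next: float,
                           target_ai_growth: float = 2.0) -> float:
    """Total capacity growth required for AI wafer volume to hit the target ratio."""
    return target_ai_growth * share_now / share_next


# Share figures quoted in the question; everything else is an assumption.
flat = ai_wafer_growth(0.60, 0.86, capacity_growth=1.0)
needed = capacity_growth_needed(0.60, 0.86, target_ai_growth=2.0)

print(f"AI wafer growth if N3 capacity stays flat: {flat:.2f}x")    # 1.43x
print(f"N3 capacity growth needed to double AI wafers: {needed:.2f}x")  # 1.40x
```

Under these simplifying assumptions, moving from 60% to 86% of a fixed-size node yields only about 1.43x more AI wafers, so a full doubling would need the node itself to grow by roughly 40%.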

Jensen Huang

At some level, the instantaneous demand is greater than the supply upstream and downstream in the world. At any instant, we could be limited by the number of plumbers, which actually happens.

Dwarkesh Patel

The plumbers are invited to next year’s GTC.

Jensen Huang

By the way, great idea. But that’s a good condition. You want an industry where the instantaneous demand is greater than the total supply of the industry. The opposite is obviously less good. If we’re too far apart, if one particular component is too far away, the industry swarms it. For example, notice people aren’t talking very much about CoWoS anymore.

The reason for that is because for two years we swarmed the living daylights out of it. We doubled, doubled, doubled on several doubles. Now I think we’re in fairly good shape. TSMC now knows that CoWoS supply has to keep up with the rest of the logic demand and the memory demand. They’re scaling CoWoS and future packaging technologies at the same level as they scale logic. This is terrific, because for a long time, CoWoS and HBM memory were rather specialty. But they’re not specialties anymore. People now realize they’re mainstream computing technology.

Of course, we’re now much more able to influence a larger scope of our supply chain. At the beginning of the AI revolution, all the things that I say now, I was saying five years ago. Some people believed in it and invested in it, for example, Sanjay and the Micron team. I still remember the meeting really well where I was clear about exactly what was going to happen, why it was going to happen, and the predictions of today. They really doubled down on it. We partnered with them across LPDDR and HBM memories, and they really invested in it. It obviously has been tremendous for the company. Some people came a little bit later, but now they’re all here.

Each one of these bottlenecks gets a great deal of attention. Now we’re prefetching the bottlenecks years in advance. For example, the investments that we’ve done with Lumentum, Coherent, and the silicon photonics ecosystem over the last several years really reshaped the supply chain. We built up an entire supply chain around TSMC. We partnered with them on COUPE, invented a whole bunch of technology, and licensed those patents to the supply chain to keep it nice and open.

We’re preparing the supply chain through the invention of new technologies, new workflows, new testing equipment like double-sided probing, investing in companies, and helping them scale up their capacity. You can see that we’re trying to shape the ecosystem so that the supply chain is ready to support the scale.

Dwarkesh Patel

It seems like some bottlenecks are easier than others. Scaling up CoWoS versus scaling up—

Jensen Huang

I went to the hardest one, by the way.

Dwarkesh Patel

Which is?

Jensen Huang

Plumbers. Plumbers and electricians. This is one of the concerns that I have about the doomers describing the end of work and killing of jobs. If we discourage people from being software engineers, we’re going to run out of software engineers. The same prediction happened ten years ago. Some of the doomers were telling people, “Whatever you do, don’t be a radiologist.” You might hear some of those videos still on the web saying radiology is going to be the first career to go and the world is not going to need any more radiologists. Guess what we’re short of? Radiologists.

Dwarkesh Patel

Going back to this point about how some things you can scale, and other things… How do you actually manufacture 2x the amount of logic a year? Ultimately, memory and logic are bottlenecked by EUV. How do you get to 2x as many EUV machines year over year?

Jensen Huang

None of that is impossible to scale quickly. All of that is easy to do within two or three years. You just need a demand signal. Once you can build one, you can build ten, and once you can build ten, you can build a million. These things are not hard to replicate.

Dwarkesh Patel

How far down the supply chain do you go? Do you go to ASML and say, “Hey, if I look out three years from now, for Nvidia to be generating two trillion a year in revenue, we need way more EUV machines”?

Jensen Huang

Some of them I have to directly, some of them indirectly, and some of them… If I can convince TSMC, ASML will be convinced. We have to think about the critical pinch points. But if TSMC is convinced, you’ll have plenty of EUV machines in a few years.

My point is that none of the bottlenecks last longer than a couple of years, two, three years, none of them. Meanwhile, we’re improving computing efficiency by 10x, 20x, and in the case of Hopper to Blackwell, 30x to 50x. We’re coming up with new algorithms because CUDA is so flexible. We’re developing all kinds of new techniques so that we drive efficiency in addition to increasing capacity. None of those things worry me. It’s the stuff that’s downstream from us. Energy policies that prevent energy from… You can’t create an industry without energy. You can’t create a whole new manufacturing industry without energy.

We want to reindustrialize the United States. We want to bring back chip manufacturing, computer manufacturing, and packaging. We want to build new things like EVs and robots. We want to build AI factories. You can’t build any of these things without energy, and those things take a long time. More chip capacity, that’s a 2-3 year problem. More CoWoS capacity, 2-3 year problem.

Dwarkesh Patel

Interesting. I feel like I have guests tell me the exact opposite thing sometimes. In this case, I just don’t have the technical knowledge to adjudicate.

Jensen Huang

The beautiful thing is you’re talking to the expert.

00:16:25 – Will TPUs break Nvidia’s hold on AI compute?

Dwarkesh Patel

True. I want to ask about your competitors. If you look at the TPU, arguably two out of the top three models in the world, Claude and Gemini, were trained on TPU. What does that mean for Nvidia going forward?

Jensen Huang

We build a very different thing. What Nvidia built is accelerated computing, not a tensor processing unit. Accelerated computing is used for all kinds of things: molecular dynamics, quantum chromodynamics, data processing, data frames, structured data, and unstructured data. It’s also used for fluid dynamics and particle physics. In addition, we use it for AI.

Accelerated computing is much more diverse. Although AI is the conversation today and is obviously very important and impactful, computing is much broader than that. Nvidia has reinvented the way computing is done, moving from general-purpose computing to accelerated computing. Our market reach is far greater than any TPU or ASIC can possibly have. If you look at our position, we’re the only company that accelerates applications of all kinds. We have a gigantic ecosystem. So all kinds of frameworks and algorithms run on Nvidia.

Because our computers are designed to be operated by other people, anyone who’s an operator can buy our systems. With most of these home-built systems, you have to be your own operator because they were never designed to be flexible enough for others to operate. Because anybody can operate our systems, we’re in every cloud, including Google, Amazon, Azure, and OCI.

If you want to operate it to rent, you better have a large ecosystem of customers in many industries to be the offtakers. If you want to operate it for yourself, we obviously have the ability to help you operate it yourself, like we did for Elon with xAI. And because we can enable operators in any company and any industry, you could use it to build a supercomputer for scientific research and drug discovery at Lilly. We can help them operate their own supercomputer and use it for the entire diversity of drug discovery and biological sciences that we accelerate.

There are just a whole bunch of applications that we can address that you can’t do with TPUs. Nvidia built CUDA to be a fantastic tensor processing unit as well, but it also handles every life cycle of data processing, computing, AI, and so on. Our market opportunity is just a lot larger, and our reach is a lot greater. Because we support every application in the world now, you can build Nvidia systems anywhere and know that there will be customers for it. It’s a very different thing.

Dwarkesh Patel

This is going to be a long question. You have spectacular revenue, and you’re not making $60 billion a quarter from pharma and quantum. You’re making it because AI is an unprecedented technology that is growing unprecedentedly fast.

The question then is what is best for AI specifically. I’m not in the details, but I talk to my AI researcher friends and they say, “Look, when I use a TPU, it’s this big systolic array that’s perfect for doing matrix multiplies, whereas a GPU is very flexible. It’s great when you have lots of branching or irregular memory access.”

But what is AI? It’s just these very predictable matrix multiplies again and again and again. You don’t have to give up any die area for warp schedulers or switches between threads and memory banks. And the TPU is really optimized for the bulk of this growth in revenue and use case for compute that is coming online right now. I wonder how you react to that.

Jensen Huang

Matrix multiplies are an important part of AI, but they’re not the only part. If you want to come up with a new attention mechanism, disaggregate in a different way, or invent a whole new type of architecture altogether—like a hybrid SSM—you want an architecture that’s generally programmable. If you want to create a model that fuses diffusion and autoregressive techniques, you want an architecture that’s just generally programmable. We run everything you can imagine. That’s the advantage. It allows for the invention of new algorithms a lot more easily, because it’s a programmable system.

The ability to invent new algorithms is really what makes AI advance so quickly. TPUs, like anything else, are impacted by Moore’s Law, which we know is increasing by about 25% per year. The only way to really get 10x or 100x leaps is to fundamentally change the algorithm and how it’s computed every single year.

That’s Nvidia’s fundamental advantage. The only reason we were able to make Hopper to Blackwell 50x… When I first announced Blackwell was going to be 35x more energy efficient than Hopper, nobody believed it. Then Dylan wrote an article saying I sandbagged, and it’s actually 50x. You can’t reasonably do that with just Moore’s Law. The way we solve that problem is with new models, like MoEs, that are parallelized, disaggregated, and distributed across a computing system. Without the ability to really get down and come up with new kernels with CUDA, it’s really hard to do.

It’s the combination of the programmability of our architecture and the fact that Nvidia is an extreme co-design company. We can even offload some of the computation into the fabric itself, like NVLink, or into the network with Spectrum-X. We could effect change across the processors, the system, the fabric, the libraries, and the algorithm simultaneously. Without CUDA to do that, I wouldn’t even know where to start.
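Editor's note: the gap between "about 25% per year" and a 10x or 50x leap, as described in this answer, can be made concrete with a small compounding calculation. This is a rough sketch, not from the interview. It assumes a constant 25% annual improvement compounding year over year, and `years_to_reach` is a name invented for the illustration.

```python
import math


def years_to_reach(multiple: float, annual_rate: float) -> float:
    """Years of compounding at annual_rate needed to reach the given multiple.

    Solves (1 + annual_rate) ** years == multiple for years.
    """
    return math.log(multiple) / math.log(1.0 + annual_rate)


# 25% per year is the rate quoted above; 10x, 50x, 100x are the leaps mentioned.
for multiple in (10, 50, 100):
    years = years_to_reach(multiple, 0.25)
    print(f"{multiple}x at 25%/yr takes about {years:.1f} years")
# 10x at 25%/yr takes about 10.3 years
# 50x at 25%/yr takes about 17.5 years
# 100x at 25%/yr takes about 20.6 years
```

On that assumption, process scaling alone would take about a decade to deliver 10x, which is the contrast being drawn with a single-generation 35x to 50x gain.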

Dwarkesh Patel

This gets at an interesting question about Nvidia’s clientele. 60% of your revenue is coming from these big five hyperscalers. In a different era with different customers—let’s say professors running experiments—they need CUDA. They can’t use another accelerator. They just need to run PyTorch with CUDA and have everything optimized.

But these hyperscalers have the resources to write their own kernels. In fact, they have to in order to get that last 5% of performance they need for their specific architecture. Anthropic and Google are mostly running their own accelerators or running TPUs and Trainium. But even OpenAI, using GPUs, has Triton because they need their own kernels. Down to CUDA C++, instead of using cuBLAS and NCCL, they’ve got their own stack which compiles to other accelerators as well. If most of your customers can and do make replacements for CUDA, to what extent is CUDA really the thing that is going to make frontier AI happen on Nvidia?

Jensen Huang

CUDA is a rich ecosystem. If you want to build on any computer first, building on CUDA first is incredibly smart. Because the ecosystem is so rich, we support every framework. If you want to create custom kernels… For example, we contribute enormously to Triton. So the back end of Triton has huge amounts of Nvidia technology.

We’re delighted to help every framework become as great as it can be. There are lots and lots of frameworks. There’s Triton, vLLM, SGLang, and more. Now there’s a whole bunch of new reinforcement learning frameworks coming out, like verl and NeMo RL. With post-training and reinforcement learning, that entire area is just exploding. So if you want to build on an architecture, building on CUDA makes the most sense because you know the ecosystem is great.

You know that if something happens, it’s more likely in your code and not in the mountain of code underneath. Don’t forget the amount of code you’re dealing with when building these systems. When something doesn’t work, was it you or was it the computer? You would like it to always be you and to be able to trust the computer. Obviously, we still have lots of bugs ourselves, but our system is so well wrung out that you can at least build on top of the foundation. That’s number one: the richness, programmability, and capability of the ecosystem.

The second thing is, if you’re a developer building anything at all, the single most important thing you want is an install base. You want the software you write to run on a whole bunch of other computers. You’re not building software just for yourself. You’re building it for your fleet or everybody else’s fleet because you’re a framework builder. Nvidia’s CUDA ecosystem is ultimately its great treasure.

We have several hundred million GPUs out there now. Every cloud has it. It goes back to the A10, A100, H100, H200, the L series, the P series. There’s a whole bunch of them. They’re in all kinds of sizes and shapes. If you’re a robotics company, you want that CUDA stack to actually run in the robot itself. We’re literally everywhere. The install base means that once you develop the software or the model, it’s going to be useful everywhere. That is just incredibly valuable.

Lastly, the fact that we’re in every single cloud makes us genuinely unique. If you’re an AI company or developer, you’re not exactly sure which cloud service provider you’re going to partner with or where you’d like to run it. We run everywhere, including on-prem for you if you like. The combination of the richness of the ecosystem, the expansiveness of the install base, and the versatility of where we are makes CUDA invaluable.

Dwarkesh Patel

That makes a lot of sense. I guess the thing I’m curious about is whether those advantages matter a lot to your main customers. There are many people for whom they might matter. But the kind of customer who can actually build their own software stack makes up most of your revenue. Especially if you go to a world where AI is getting especially good at the things which have tight verification loops where you can RL on them… This question of how do you write a kernel that does attention or MLP the most efficiently across a scale up? It’s a very verifiable sort of feedback loop.

Can all the hyperscalers write these custom kernels for themselves? Nvidia still has great price performance, so they might still prefer to use Nvidia. But then does it just become a question of who is offering the best specs, the best flops and memory bandwidth for a given dollar? Historically Nvidia has had, and still has, the best margins in all of AI across hardware and software, 70%+, because of this CUDA moat. And the question is, can you sustain those margins if most of your customers can actually afford to build their own stack instead of relying on the CUDA moat?

Jensen Huang

The number of engineers we have assigned to these AI labs is insane, working with them, optimizing their stack. The reason for that is because nobody knows our architecture better than we do. These architectures are not as general purpose as a CPU. A CPU is kind of like a Cadillac. It’s a nice cruiser. It never goes too fast. Everybody drives it pretty well. It’s got cruise control, and everything’s easy. But in a lot of ways, Nvidia’s GPUs, accelerators, are like F1 racers. I could imagine everybody’s able to drive it at a hundred miles an hour, but it takes quite a bit of expertise to be able to push it to the limit. We use a ton of AI to create the kernels that we have.

I’m pretty sure we’re going to still be needed for quite some time. Our expertise helps our AI lab partners to get another 2x out of their stack easily oftentimes. It’s not unusual that by the time we’re done optimizing their stack or optimizing a particular kernel, their model sped up by 3x, 2x, 50%. That’s a huge number, especially when you’re talking about the install base of the fleet that they have, of all the Hoppers and Blackwells that they have. When you increase it by a factor of two, that doubles the revenues. That directly translates to revenues.

Nvidia’s computing stack is the best performance per TCO in the world, bar none. Nobody can demonstrate to me that any single platform in the world today has a better performance-TCO ratio. Not one company. In fact, the benchmarks that are out there. Dylan’s InferenceMAX is sitting out there for everybody to use, and not one… TPU won’t come, Trainium won’t come.

I encourage them to use InferenceMAX and demonstrate their incredible inference cost. It’s really hard. Nobody wants to show up. MLPerf. I would welcome Trainium to demonstrate their 40% that they claim all the time. I would love to hear them demonstrate the cost advantage of TPUs. It makes no sense in my mind. It makes absolutely zero sense. On first principles, it makes no sense.

So I think the reason why we’re so successful is simply because our TCO is so great. Secondly, you say 60% of our customers are the top five, but most of that business is external. For example, most of Nvidia in AWS is for external customers, not internal use. Most of our customers at Azure, obviously all of our customers are external. All of our customers at OCI are external, not internal use. The reason why they favor us is because our reach is so great. We can bring them all of the great customers in the world. They’re all built on Nvidia. And the reason why all these companies are built on Nvidia is because our reach and our versatility is so great.

So I think the flywheel is really install base, the programmability of our architecture, the richness of our ecosystem, and the fact that there’s so many AI companies in the world. There’s tens of thousands of them now. If you were one of those AI startups, what architecture would you choose? You would choose an architecture that’s most abundant. We’re the most abundant in the world. You’d choose the one that has the largest installed base. We’re the largest install base. And you’d choose the one that has a rich ecosystem.

So that’s the flywheel. That’s the reason why, between the combination of: one, our perf per dollar is so great that they have the lowest cost tokens. Second, our perf per watt is the highest in the world. So if one of these companies, if our partners, built a one gigawatt data center, that one gigawatt data center better deliver the maximum amount of revenues and number of tokens, which directly translates to revenues. You want it to generate as many tokens as possible, maximize the revenues for that data center. We are the highest tokens per watt architecture in the world. Lastly, if your goal is to rent the infrastructure, we have the most customers in the world. So that’s the reason why the flywheel works.

Dwarkesh Patel

Interesting. I guess the question comes down to, what is the actual market structure here? Because even if there’s other companies… There could have been a world where there’s tens of thousands of AI companies that have roughly equal share of compute. But even through these five hyperscalers, really the people on Amazon using the compute are Anthropic, OpenAI, and these big foundation labs who can themselves afford and have the ability to make different accelerators work.

Jensen Huang

No, I think your premise is wrong.

Dwarkesh Patel

Maybe. But let me ask you a slightly different question.

Jensen Huang

Come back and make me correct your premise.

Dwarkesh Patel

Okay. Let me just ask you a different question.

Jensen Huang

But still make sure to make me come back and fix because it’s just too important to AI. It’s too important to the future of science. It’s too important to the future of the industry. That premise… Look —

Dwarkesh Patel

Let me just finish the question and then we can address it together.

Jensen Huang

Yeah.

Dwarkesh Patel

If all these things about price, performance, and performance per watt, et cetera, are true, why do you think it is the case that, say, Anthropic for example, just announced a couple days ago they have a multi-gigawatt deal with Broadcom and Google for TPUs and majority of their compute?

Obviously for Google, TPU is a majority of compute. So if I look at these big AI companies, it seems like a lot of their compute… There was some point where it’s all Nvidia and now it’s not. So I’m curious how to square, if these things are true on paper, why are they going with other accelerators?

Jensen Huang

Anthropic is a unique instance, not a trend. Without Anthropic, why would there be any TPU growth at all? It’s 100% Anthropic. Without Anthropic, why would there be Trainium growth at all? It’s 100% Anthropic. I think that’s fairly well known and well understood. It’s not that there’s an abundance of ASIC opportunities. There’s only one Anthropic.

Dwarkesh Patel

But OpenAI’s deals with AMD… They’re building their own Titan accelerator.

Jensen Huang

Yeah, but I think we could all acknowledge they’re vastly Nvidia. We’re going to still do a lot of work together. I’m not offended by other people using something else and trying things. If they don’t try these other things, how would they know how good ours is? Sometimes you’ve got to be reminded of it. We have to continuously earn the position that we’re in.

There are always big claims. Look at the number of ASICs that have been canceled. Just because you’re going to build an ASIC… You still have to build something better than Nvidia. It’s not that easy building something better than Nvidia. It’s not sensible, actually. Nvidia’s got to be missing something, seriously. Because of our scale, our velocity, we’re the only company in the world that’s cranking it out every single year. Big leaps, every single year.

Dwarkesh Patel

I guess their logic is, “Hey, it doesn’t need to be better. It just needs to be not more than 70% worse,” because they’re paying you 70% margins.

Jensen Huang

No, don’t forget, even in ASICs margins are really quite high. Nvidia’s margin is 70%, let’s say. But ASIC margins are 65%. What are you really saving?

Dwarkesh Patel

Oh, you mean from Broadcom or something like that?

Jensen Huang

Yeah, sure. You’ve got to pay somebody. I think the ASIC margins are incredibly good, from what I can tell. They believe it too. They’re quite proud of their incredible ASIC margins.

So, you asked the question why. A long time ago, we just didn’t have the ability to do it. At the time, I didn’t deeply internalize how difficult it would be to build a foundation AI lab like OpenAI and Anthropic, and the fact that they needed huge investments from the supplier themselves. We just weren’t in a position to make the multi-billion dollar investment into Anthropic so that they could use our compute. But Google and AWS were. They put in huge investments in the beginning so that Anthropic, in return, used their compute. We just weren’t in a position to do that at the time.

I would say my mistake is I didn’t deeply internalize that they really had no other options, that a VC would never put in $5-10 billion of investment into an AI lab with the hopes of it turning out to be Anthropic. So that was my miss. But even if I understood it, I don’t think we would’ve been in a position to do that at the time. But I’m not going to make that same mistake again.

I’m delighted to invest in OpenAI, and I’m delighted to help them scale, and I believe it’s essential to do so. And then, when I was able to, when Anthropic came to us, I’m delighted to be an investor, delighted to help them scale. We just weren’t, at the time, able to do it. If I could rewind everything—and Nvidia could have been as big back then as we are now—I would’ve been more than happy to do it.

00:41:06 – Why doesn’t Nvidia become a hyperscaler?

Dwarkesh Patel

This is actually quite interesting. For many years Nvidia has been the company in AI making money, making lots of money. Now you’re investing it. It’s been reported that you’ve done up to $30 billion in OpenAI and $10 billion in Anthropic. But now their valuations have increased, and I’m sure they’ll continue to increase.

So if over these many years you were giving them the compute, you saw where it was headed, and they were worth like one tenth what they’re worth now a couple years ago, or even a year ago in some cases, and you had all this cash, there’s a world where either Nvidia themselves becomes a foundation lab, does a huge investment to make that possible, or has made the deals you’ve made now at current valuations much earlier on. And you had the cash to do it. So I am curious, actually, why not have done it earlier?

Jensen Huang

We did it as soon as we could have. We did it as soon as we could have, and if I could have, I would’ve done it even earlier. At the time that Anthropic needed us to do it, we just weren’t in a position to do it. It wasn’t in our sensibility to do so.

Dwarkesh Patel

How so? Was it like a cash thing?

Jensen Huang

Yeah, the level of investment. We had never invested outside the company at the time, and not that much. We didn’t realize we needed to. I always thought that they could just go raise from VCs, for God’s sakes, like all companies do. But what they were trying to do couldn’t have been done through VCs. What OpenAI wanted to do couldn’t have been done through VCs. I recognize that now. I didn’t know it then.

But that’s their genius. That’s why they’re smart. They realized then that they had to do something like that. And I’m delighted that they did. Even though we caused Anthropic to have to go to somebody else, I’m still happy that it happened. Anthropic’s existence is great for the world. I’m delighted for it.

Dwarkesh Patel

I guess you still are making a ton of money, and you’re making way more money quarter after quarter.

Jensen Huang

It’s still okay to have regrets.

Dwarkesh Patel

So the question still arises. Okay, now that we’re here and you have all this money that you keep making, what should Nvidia be doing with it? There’s one answer which is that there’s this whole middleman ecosystem that has popped up for converting CapEx into OpEx for these labs so that they can rent compute. Because the chips are really expensive, they make a lot of money over their lifetime because the AI models are getting better. So the value that they generate, their tokens, is increasing, but they’re expensive to set up. Nvidia has the money to do the CapEx. In fact, it’s been reported, you are backstopping CoreWeave up to $6.3 billion and have invested $2 billion.

Why doesn’t Nvidia become a cloud themselves? Why doesn’t it become a hyperscaler themselves and rent this compute out? You have all this cash to do it.

Jensen Huang

This is a philosophy of the company, and I think it’s wise. We should do as much as needed, as little as possible. What that means is, the work that we do with building our computing platform, if we don’t do it, I genuinely believe it doesn’t get done. If we didn’t take the risk that we take—if we didn’t build NVLink the way we built it, if we didn’t build the whole stack, if we didn’t create the ecosystem the way we did, if we didn’t dedicate ourselves to 20 years of CUDA while losing money most of that time—if we didn’t do it, nobody else would have done it.

If we didn’t create all the CUDA-X libraries so that they’re all domain-specific… A decade and a half ago, we pushed into domain-specific libraries because we realized that if we didn’t create these domain-specific libraries, whether it’s for ray tracing or image generation or even the early works of AI, these models, if we didn’t create them, for data processing, structured data processing, or vector data processing, if we didn’t create them, nobody would. I am completely certain of that. We created a library for computational lithography called cuLitho. If we didn’t create it, nobody would have. So accelerated computing wouldn’t advance the way it has if we didn’t do what we did.

So we should do that. We should dedicate our company, all of our might, wholeheartedly to go do that. However, the world has lots of clouds. If I didn’t do it, somebody would show up. So following the recipe, the philosophy, of doing as much as needed but as little as possible—as little as possible—that philosophy exists in our company today. Everything I do, I do it with that lens.

In the case of clouds, if we didn’t support CoreWeave to exist, these neoclouds, these AI clouds, wouldn’t exist. If we didn’t help CoreWeave exist, they would not exist. If we didn’t support Nscale, they wouldn’t be where they are today. If we didn’t support Nebius, they wouldn’t be what they are today. Now they’re doing fantastically.

Is that a business model [inaudible]? We should do as much as needed, as little as possible. So we invest in our ecosystem because I want our ecosystem to thrive. I want the architecture, and AI, to be able to connect with as many industries as possible, as many countries as possible, and make it possible for the planet to be built on AI and to be built on the American tech stack. That vision is exactly what we’re pursuing.

Now, one of the things that you mentioned… There are so many great, amazing foundation model companies, and we try to invest in all of them. This is another thing that we do. We don’t pick winners. We need to support everyone. It’s part of our joy of doing so. It’s imperative to our business. But we also go out of our way not to pick winners. So when I invest in one of them, I invest in all of them.

Dwarkesh Patel

Why do you go out of your way not to pick winners?

Jensen Huang

Because it’s not our job to, number one. Number two, when Nvidia first started, there were 60 3D graphics companies. We are the only one that survived. If you would have taken those 60 graphics companies and asked yourself which one was going to make it, Nvidia would be at the top of that list not to make it.

This is long before you, but Nvidia’s graphics architecture was precisely wrong. It’s not a little bit wrong. We created an architecture that was precisely wrong, and it was an impossible thing for developers to support. It was never going to make it. We reasoned about it from good first principles, but we ended up with the wrong solution. Everybody would have counted us out. And here we are.

So I have enough humility to recognize that. Don’t pick winners. Either let them all take care of themselves, or take care of all of them.

Dwarkesh Patel

One thing I didn’t understand is you said, “Look, we’re not prioritizing these neoclouds just because they are neoclouds and we want to prop them up.” But you also listed a bunch of neoclouds and said they wouldn’t exist if it wasn’t for Nvidia. How are those two things compatible?

Jensen Huang

First of all, they need to want to exist, and they come to ask us for help. When they want to exist and they have a business plan, expertise, and the passion for it… They obviously have to have some capabilities themselves. But if, at the end of the day, they need some investment in order to get it off the ground, we would be there for them. But the sooner they get their flywheel going...

Your question was, “Do we want to be in the financing business?” The answer is no. There are people in the financing business, and we’d rather work with all the people in the financing business than be a financier ourselves. Our goal is to focus on what we do, keep our business model as simple as possible, and support our ecosystem.

When someone like OpenAI needs an investment of a $30 billion scale because it’s still before their IPO, and we deeply believe in them and I deeply believe that they’re going to be an… Well, they’re an extraordinary company already today. They’re going to be an incredible company. The world needs them to exist. The world wants them to exist. I want them to exist. They have the wind at their back. Let’s support them and let them scale. Those investments we’ll do because they need us to do it. But we’re not trying to do as much as possible. We’re trying to do as little as possible.

Dwarkesh Patel

This may be an obvious question, but we’ve lived many years in this situation where there’s a shortage of GPUs, and it’s grown now because models are getting better.

Jensen Huang

We have a shortage of GPUs.

Dwarkesh Patel

Yes. Nvidia is known for divvying up the scarce allocation, not just based on high bidder, but rather on, “Hey, we want to make sure that these neoclouds exist. Let’s give some to CoreWeave, let’s give some to Crusoe, let’s give some to Lambda.” Why is it good for Nvidia? First of all, would you agree with this characterization of fracturing the market?

Jensen Huang

No. No. Your premise is just wrong. We’re sufficiently mindful about these things. We’re very mindful about these things. First of all, if you don’t place a PO, all the talking in the world won’t make a difference. Until we get a PO, what are we going to do? So the first thing is, we work really hard with everybody to get a forecast done, because these things take a long time to build, and the data centers take a long time to build. We align ourselves with demand and supply and things like that through forecasting. Okay? That’s job number one.

Number two, we’ve tried to forecast with as many people as possible, but in the final analysis, you still have to place an order. Maybe, for whatever reason, you didn’t place your order. What can I do? At some point, first in, first out. But beyond that, if you’re not ready because your data center’s not ready, or certain components aren’t ready to enable you to stand up a data center, we might decide to serve another customer first. That’s just maximizing the throughput of our own factory. We might do some adjustments there.

Aside from that, the prioritization is first in, first out. You’ve got to place a PO. If you don’t place a PO… Now, of course, there are stories about that. For example, all of this kind of started from an article about Larry and Elon having dinner with me where they begged for GPUs. That never happened. We absolutely had dinner. We absolutely had dinner, and it was a wonderful dinner. At no time did they beg for GPUs. They just had to place an order. Once they place an order, we do our best to get the capacity to them. We’re not complicated.

Dwarkesh Patel

Okay. So it sounds like there’s a queue, and then based on whether your data center is ready and when you place a purchase order, you get them at a certain time. But it still doesn’t sound like the highest bidder just gets it. Is there a reason to do it…?

Jensen Huang

We never do that.

Dwarkesh Patel

Okay.

Jensen Huang

We never do.

Dwarkesh Patel

Why not just do high bidder?

Jensen Huang

Because it’s a bad business practice. You set your price and then people decide to buy it or not. I understand that others in the chip industry change their prices when demand is higher, but we just don’t. That’s just never been a practice of ours. You can count on us. I prefer to be dependable, to be the foundation of the industry. You don’t need to second-guess. If I quoted you a price, we quoted you a price. That’s it. If demand goes through the roof, so be it.

Dwarkesh Patel

On the other end, that’s why you have a productive relationship with TSMC, right?

Jensen Huang

Yeah, Nvidia’s been in business with them for, I guess, coming up on 30 years. Nvidia and TSMC don’t have a legal contract. There’s always some rough justice. Sometimes I’m right, sometimes I’m wrong. Sometimes I got a better deal, sometimes I got a worse deal. But overall, the relationship is incredible. I can completely trust them. I can completely depend on them.

One of the things you can count on with Nvidia is that this year, Vera Rubin is going to be incredible. Next year, Vera Rubin Ultra will come. The year after that, Feynman will come. And the year after that, I haven’t introduced the name yet. Every single year you can count on us. You’re going to have to go find another ASIC team in the world—pick your ASIC team—where you can say, “I can bet the farm, I can bet my entire business that you will be here for me every single year. Your token cost will decrease by an order of magnitude every single year. I can count on it like I can count on the clock.”

I just said something about TSMC. For no other foundry in history can you possibly say that. You can say that about Nvidia today. You can count on us every single year. If you would like to buy a billion dollars worth of AI factory compute, no problem. If you’d like to buy a hundred million dollars, no problem. You’d like to buy $10 million, or just one rack, not a problem. Or just one graphics card, okay, no problem. If you would like to place an order for $100 billion of AI factory, no problem. We’re the only company in the world where you can say that today.

I can say that about TSMC as well. I want to buy one, buy 1 billion, no problem. We just have to go through the process of planning for it, and all the things that mature people do. So I think this ability for Nvidia to be the foundation of the world’s AI industry, this is a position that has taken us a couple of decades to arrive at. Enormous commitment, enormous dedication. The stability of our company, the consistency of our company, is really important.

00:57:36 – Should we be selling AI chips to China?

Dwarkesh Patel

Okay. I want to ask about China. I actually don’t know what I think about whether it’s good to sell chips to China or not, but I like to play devil’s advocate against my guests. So when Dario was on, who supports export controls, I asked him, why can’t America and China both have a country of geniuses in the datacenter? But since you’re on the opposite side, I’ll ask you in the opposite way.

One way to think about it is, Anthropic actually announced a couple days ago Mythos Preview. This model Mythos, they’re not even releasing publicly because they say it has such cyber-offensive capabilities that we don’t think the world is ready until we make sure these zero-days are patched up. But they say it found thousands of high-severity vulnerabilities across every major operating system, every browser. It found one in OpenBSD, which is this operating system that’s been specifically designed to not have zero-days. It found one that’s existed for 27 years.

So if Chinese companies and Chinese labs and the Chinese government had access to the AI chips to train a model like Claude Mythos with these cyber-offensive capabilities and run millions of instances of it with more compute, the question is, is that a threat to American companies, to American national security?

Jensen Huang

First of all, Mythos was trained on fairly mundane capacity, and a fairly mundane amount of it. By an extraordinary company. The amount of capacity and the type of compute it was trained on is abundantly available in China. So you just have to first realize that chips exist in China.

They manufacture 60% of the world’s mainstream chips, maybe more. It’s a very large industry for them. They have some of the world’s greatest computer scientists. As you know, most of the AI researchers in all of these AI labs are Chinese. They have 50% of the world’s AI researchers. So the question is, considering all the assets they already have—they have an abundance of energy, they have plenty of chips, they’ve got most of the AI researchers—if you’re worried about them, what is the best way to create a safe world?

Victimizing them, turning them into an enemy, likely isn’t the best answer. They are an adversary. We want the United States to win. But I think having a dialogue and having research dialogue is probably the safest thing to do. This is an area that is glaringly missing because of our current attitude about China as an adversary. It is essential that our AI researchers and their AI researchers are actually talking. It is essential that we try to both agree on what not to use the AI for.

With respect to finding bugs in software, of course, that’s what AI is supposed to do. Is it going to find bugs in a lot of software? Of course. There are lots and lots of bugs. There are lots of bugs in the AI software. That’s what AI is supposed to do, and I’m delighted that AI has reached a level where it could help us be so much more productive.

One of the things that is underemphasized is the richness of the ecosystem around cybersecurity, AI cybersecurity and AI security and AI privacy and AI safety. There’s a whole ecosystem of AI startups that are trying to create this future for us, where you have one AI agent that’s incredible, surrounded by thousands of AI agents, keeping it safe, keeping it secure. That future surely is going to happen.

The idea that you’re going to have an AI agent running around with nobody watching after it is kind of insane. We know very well that this ecosystem needs to thrive. It turns out this ecosystem needs open source. This ecosystem needs open models. They need open stacks so that all of these AI researchers and all these great computer scientists can go build AI systems that are as formidable and can keep AI safe. So one of the things that we need to make sure that we do is we keep the open source ecosystem vibrant. That can’t be ignored. A lot of that is coming out of China. We ought to not suffocate that.

With respect to China, of course we want the United States to have as much computing as possible. We’re limited by energy, but we’ve got a lot of people working on that. We’ve got to not make energy a bottleneck for our country. But what we also want is to make sure that all the AI developers in the world are developing on the American tech stack, and making the contributions, the advancements of AI—especially when it’s open source—available to the American ecosystem. It would be extremely foolish to create two ecosystems: the open source ecosystem, and it only runs on a foreign tech stack, and a closed ecosystem that runs on the American tech stack. I think that would be a horrible outcome for the United States.

Dwarkesh Patel

Since there are a lot of things, let me just triage the response. I think the concern, going back to the flop difference in the hacking, is yes, they have compute, but there’s some estimates that because they’re at 7nm—they don’t have EUVs because of chip-making export controls—the amount of flops they’re able to actually produce, they have one tenth the amount of flops that the US has.

So with that, could they eventually train a model like Mythos? Yes. But the question is, because we have more flops, American labs are able to get to these levels of capabilities first. Because Anthropic got to it first, they say, “Okay, we’re going to hold onto it for a month while all these American companies, we’ll give them access to it. They’re going to patch up all their vulnerabilities, and now we release it.”

Furthermore, even if they train a model like this, the ability to deploy it at scale… If you had a cyber hacker, it’s much more dangerous if they have a million of them versus a thousand of them. So that inference compute really matters a lot. In fact, the fact that they have so many AI researchers who are so good is the thing that makes it so scary, because what is it that makes those engineer researchers more productive? It’s compute.

If you talk to any AI lab in America, they say the thing that’s bottlenecking them is compute. There are quotes from the DeepSeek founder, or Qwen leadership or whatever. They say the thing they’re bottlenecked on is compute. So then the question is, isn’t it better that we get American companies, because they have more compute, to get to the Mythos-level capabilities first, prepare our society for it, before China can get to it, because they have less compute?

Jensen Huang

We should always be first and we should always have more. But in order for that outcome you described to be true, you have to take it to the extremes. They have to have no compute. If they have some compute, the question is how much is needed?

The amount of compute they have in China is enormous. You’re talking about the country that is the second largest computing market in the world. If they want to aggregate their compute, they’ve got plenty of compute to aggregate.

Dwarkesh Patel

But is that true? People do these estimates and they’re like, “SMIC is actually behind on the process nodes.”

Jensen Huang

I’m about to tell you.

Dwarkesh Patel

Okay.

Jensen Huang

The amount of energy they have is incredible. Isn’t that right? AI is a parallel computing problem, isn’t it? Why can’t they just put 4x, 10x, as many chips together because energy’s free? They have so much energy. They have datacenters that are sitting completely empty, fully powered. You know they have ghost cities, they have ghost datacenters too. They have so much infrastructure capacity. If they wanted to, they just gang up more chips, even if they’re 7nm.

Their capacity of building chips is one of the largest in the world. The semiconductor industry knows that they monopolize mainstream chips. They have over-capacity, they have too much capacity. So the idea that China won’t be able to have AI chips is complete nonsense.

Now, of course, if you ask me, would the United States be further ahead if the entire world had no compute at all? But that’s just not an outcome. That’s not a scenario that’s true. They have plenty of compute already. The amount of threshold they need for the concern you’re worried about, they’ve already reached that threshold and beyond.

So I think you misunderstand that AI is a five-layer cake, and at the lowest layer is energy. When you have an abundance of energy, it makes up for chips. If you have an abundance of chips, it makes up for energy. For example, the United States is scarce on energy, which is the reason why Nvidia has to keep advancing our architecture and do this extreme co-design so that with the few chips that we ship—with the few chips, because the amount of energy is so limited—our throughput per watt is off the charts.

But if your amount of watts is completely abundant, it’s free, what do you care about performance per watt for? You get plenty. You can use old chips to do. So 7nm chips are essentially Hopper. The ability for Hopper… I’ve got to tell you, today’s models are largely trained on Hopper, Hopper generation. So 7nm chips are plenty good. The abundance of energy is their advantage.

Dwarkesh Patel

But then there’s a question of whether they can actually manufacture enough chips.

Jensen Huang

But they do. What’s the evidence? Huawei just had the largest single year in the history of their company.

Dwarkesh Patel

How many chips did they ship?

Jensen Huang

A ton. Millions. Millions is way more than Anthropic has.

Dwarkesh Patel

There’s a question of how much logic SMIC can ship, and there’s a question of how much memory—

Jensen Huang

I’m telling you what it is. They have plenty of logic, and they have plenty of HBM2 memory.

Dwarkesh Patel

Right. But as you know, the bottleneck often in training and doing inference on these models is the amount of bandwidth. So if you have HBM2… I don’t know the numbers offhand but versus the newest thing you have, there could be almost an order of magnitude difference in memory bandwidth, which is huge.

Jensen Huang

Huawei is a networking company.

Dwarkesh Patel

But that doesn’t change the fact that you need EUV for the most advanced HBM.

Jensen Huang

Not true. Not at all true. You could gang them together, just like we gang them together with NVL72. They’ve already demonstrated silicon photonics, connecting all of this compute together into one giant supercomputer. Your premise is just wrong.

The fact of the matter is, their AI development is going just fine. The best AI researchers in the world, because they’re limited in compute, they also come up with extremely smart algorithms. Remember, I just said that Moore’s law is advancing about 25% per year. However, through great computer science, we could still improve algorithm performance by 10x. What I’m saying is that great computer science is where the lever is.

There is no question, MoE is a great invention. There’s no question, all the incredible attention mechanisms reduce the amount of compute. We have got to acknowledge that most of the advances in AI came out of algorithm advances, not just the raw hardware. Now, if most advances came from algorithms and computer science and programming, tell me that their army of AI researchers is not their fundamental advantage. We see it. DeepSeek is not an inconsequential advance. The day that DeepSeek comes out on Huawei first, that is a horrible outcome for our nation.

Dwarkesh Patel

Why is that? Because currently you can have a model like DeepSeek that can run on any accelerator, if it’s open source. Why would that stop being the case in the future?

Jensen Huang

Suppose it doesn’t. Suppose it’s optimized for Huawei, suppose it’s optimized for their architecture. It would put ours at a disadvantage. You described a situation that I perceive to be good news. A company developed software, developed an AI model, and it runs best on the American tech stack. I saw that as good news. You set it up as a premise that it was bad news. I’m going to give you the bad news, that AI models around the world are developed and they run best on non-American hardware. That is bad news for us.

Dwarkesh Patel

I guess I just don’t see the evidence that there’s these huge disparities that would prevent you from switching accelerators. American labs are running their models across all the clouds, across all the different accelerators—

Jensen Huang

I am the evidence. You take a model that’s optimized for Nvidia and you try to run it on something else.

Dwarkesh Patel

But American labs do that.

Jensen Huang

And they don’t run better. Nvidia’s success is perfect evidence. The fact that AI models are created on our stack, run best on our stack, how is that illogical to understand?

Dwarkesh Patel

Anthropic’s models are run on GPUs, they’re run on Trainium, they’re run on TPUs.

Jensen Huang

A lot of work has to go into it to change. But go to the global south, go to the Middle East. Coming out of the box, if all of the AI models run best on somebody else’s tech stack, you’ve got to be arguing some ridiculous claim right now that that’s a good thing for the United States.

Dwarkesh Patel

But I guess I don’t understand the argument. Say Chinese companies get to the next Mythos first. They find all the security vulnerabilities in American software first, but they can do it on Nvidia hardware and they ship it to the global south. They do it on Nvidia hardware. How is that good? Okay, it runs on Nvidia hardware—

Jensen Huang

It’s not good. It’s not good.

Dwarkesh Patel

Right.

Jensen Huang

It’s not good. So let’s not let it happen.

Dwarkesh Patel

Why do you think it’s perfectly fungible, that if you didn’t ship them compute it would exactly be replaced by Huawei? They are behind, right? They have worse chips than you.

Jensen Huang

It’s completely… There’s evidence right now. Their chip industry’s gigantic.

Dwarkesh Patel

You can just look at the flop or bandwidth or memory comparisons between the H200 and the Huawei 910C. It’s like half to a third.

Jensen Huang

They use more of it. They use twice as many.

Dwarkesh Patel

It seems like your argument is they have all this energy that’s ready to go, right? And they need to fill it with chips.

Jensen Huang

And they’re good at manufacturing.

Dwarkesh Patel

And I’m sure eventually they would be able to just out-manufacture everybody. But there are these few critical years.

Jensen Huang

What is the critical year you’re talking about?

Dwarkesh Patel

These next few years. We’ve got these models that are going to be able to do all the cyber attacks.

Jensen Huang

In that case, if the next years are critical, then we have to make sure that all of the world’s AI models are built on the American tech stack, in these critical years.

Dwarkesh Patel

If they’re built on the American tech stack, how would that prevent them, if they have more advanced capabilities, from launching the Mythos-equivalent cyber attacks?

Jensen Huang

There’s no guarantee either way.

Dwarkesh Patel

But if you have it early, we can prepare for it.

Jensen Huang

Listen, why are you causing one layer of the AI industry to lose an entire market so that you could benefit another layer of the AI industry? There are five layers and every single layer has to succeed. The layer that has to succeed most is actually the AI applications. Why are you so fixated on that AI model? That one company? For what reason?

Dwarkesh Patel

Because those models make possible these incredibly offensive capabilities, and you need compute to run them.

Jensen Huang

The energy, the chips, and the ecosystem of AI researchers make it possible.

Dwarkesh Patel

Okay, stepping back, it has to be the case that China is able to build enough 7nm capacity. And remember, they’re still stuck on 7nm while you’ll move on to 3nm and then 2nm or 1.6nm with Feynman. So while you’re on 1.6nm, they’re still going to be on 7nm, and they have to produce enough of it to make up for the shortfall. They have so much energy that the more chips you give them, the more compute they’d have. So it comes out as a question of, ultimately they are getting more compute. Compute is an input to training and inference—

Jensen Huang

Listen, I just think you speak in absolutes. I think the United States ought to be ahead. The amount of compute in the United States is 100x more than anywhere else in the world. The United States ought to be ahead. Okay. The United States is ahead.

Nvidia builds the most advanced technologies. We make sure that the US labs are the first to hear about it and have the first chance to buy it. And if they don’t have enough money, we even invest in them. The United States ought to be ahead. We want to do everything we can to make sure the United States is ahead. Number one point, do you agree? We’re doing everything we can to do that.

Dwarkesh Patel

But how is shipping chips to China keeping the US ahead if they’re bottlenecked on compute?

Jensen Huang

No, no. We’ve got Vera Rubin for the United States. We have Vera Rubin for the United States. Now, am I in the United States? Do you consider me part of the United States?

Dwarkesh Patel

Yes.

Jensen Huang

Nvidia. You consider Nvidia a United States company? Okay. Number one, why is it that we don’t come up with a regulation that’s more balanced so that Nvidia can win around the world instead of giving up the world? Why would you want the United States to give up the world?

The chip industry is part of the American ecosystem. It’s part of American technology leadership. It’s part of the AI ecosystem. It’s part of AI leadership. Why is it that your policy, your philosophy, leads to the United States giving up a vast part of the world’s market?

Dwarkesh Patel

I guess the claim here is… Dario had this quote where he said that it’s like Boeing bragging that we’re selling North Korea nukes, but the missile casings are made by Boeing. And that’s somehow enabling the US technology stack. Fundamentally, you’re giving them this capability.

Jensen Huang

Comparing AI to anything that you just mentioned is lunacy.

Dwarkesh Patel

But AI is similar to enriched uranium, right? It can have positive uses, it can have negative uses. We still don’t want to send enriched uranium to other countries.

Jensen Huang

Who’s sending enriched—

Dwarkesh Patel

The analogy is that enriched uranium is like compute.

Jensen Huang

It’s a lousy analogy. It’s an illogical analogy.

Dwarkesh Patel

But if that compute can run a model that can do zero-day exploits against all American software, how is that not a weapon?

Jensen Huang

First of all, the way to solve that problem is to have dialogues with the researchers and dialogues with China, and dialogues with all the countries to make sure that people don’t use technology in that way. That’s a dialogue that has to happen. Okay? Number one.

Number two, we also need to make sure that the United States is ahead, that Vera Rubin, Blackwell, is available in the United States in abundance, mountains of it. Obviously, our results would show it. Abundance, tons of it. The amount of computing we have is great. We have amazing AI researchers here. It’s great. We ought to stay ahead.

However, we also have to recognize that AI is not just a model. AI is a five-layer cake. The AI industry matters across every single layer, and we want the United States to win at every single layer, including the chip layer. Conceding the entire market is not going to allow the United States to win the technology race long-term in the chip layer, in the computing stack. That is just a fact.

Dwarkesh Patel

I guess then the crux comes down to, how does selling them chips now help us win in the long term? Tesla sold extremely good electric vehicles to China for a long time. iPhones are sold in China, extremely good. They didn’t cause them lock-in. China will still make their version of EVs and they’re dominating. Their smartphones are dominating.

Jensen Huang

When we started the conversation today, you acknowledged that Nvidia’s position is very different. You used words like moat. The single most important thing to our company is the richness of our ecosystem, which is about developers. 50% of the AI developers are in China. The United States should not give that up.

Dwarkesh Patel

But we have a lot of Nvidia developers in the US, and that doesn’t prevent American labs from also being able to use other accelerators in the future. In fact, right now they’re using other accelerators as well, which is fine and great. I don’t see why that wouldn’t be the case in China as well, if you sell them Nvidia chips, just the same way that Google can use TPUs and Nvidia—

Jensen Huang

We have to keep innovating and, as you probably know, our share is growing, not decreasing. The premise that even if we competed in China, that we’re going to lose that market anyways… You’re not talking to somebody who woke up a loser. That loser attitude, that loser premise makes no sense to me.

We’re not a car. We are not a car. The fact that I can buy this car brand one day and use another car brand another day, easy. Computing is not like that. There’s a reason why the x86 deal exists. There’s a reason why ARM is so sticky. These ecosystems are hard to replace. It costs an enormous amount of time and energy, and most people don’t want to do it. So it’s our job to continue to nurture that ecosystem, to keep advancing the technology so that we can compete in the marketplace.

Conceding a marketplace based on the premise you described, I simply can’t acknowledge that. It makes no sense. Because I don’t think the United States is a loser. Our industry is not a loser. That losing proposition, that losing mindset, makes no sense to me.

Dwarkesh Patel

Okay. I’ll move on. I just want to make sure that—

Jensen Huang

You don’t have to move on. I’m enjoying it.

Dwarkesh Patel

Okay, great. Then I won’t. I appreciate that. But I think maybe the crux… and thanks for walking around the circles with me, because I think it helps bring out what the crux here is.

Jensen Huang

The crux is you’re going to extremes. Your argument starts from extremes. That if we give them any compute at all in this narrow moment, we will lose everything.

Dwarkesh Patel

No, I think what my argument is—

Jensen Huang

Those extremes, they’re childish.

Dwarkesh Patel

Let me just make my argument for myself. The idea is not that there is some key threshold of compute. It’s that any marginal compute is helpful. So if you have more compute, you can train a better model.

Jensen Huang

And I just want you to acknowledge that any marginal sales for the American technology industry is beneficial.

Dwarkesh Patel

I actually don’t… If the AI models that run on those chips are capable of cyber offensive capabilities, or the chips are training models with cyber capabilities and running more instances of those models, it is not a nuclear weapon, but it enables a weapon of a kind.

Jensen Huang

The logic that you use, you might as well say it to microprocessors and DRAMs. You might as well say it to electricity.

Dwarkesh Patel

But in fact we do have export controls on the technology that is relevant to making the most advanced DRAM. We have all kinds of export controls on China for all kinds of chip-making stuff.

Jensen Huang

We sell a lot of DRAM and CPUs into China, and I think it’s right.

Dwarkesh Patel

I guess this goes back to the fundamental question of, is AI different? If you have the kind of technology where they can find these zero-days in software, is that something where we want to minimize China’s ability to get there first, to deploy it widely?

Jensen Huang

We want the United States to be ahead. We can control that.

Dwarkesh Patel

How do we control that if the chips are already there and they’re using them to train that model?

Jensen Huang

We have tons of compute. We have tons of AI researchers. We’re racing as fast as we can.

Dwarkesh Patel

Again, we have more nuclear weapons than anybody else, but we don’t want to send enriched uranium anywhere.

Jensen Huang

We’re not enriched uranium. It’s a chip, and it’s a chip that they can make themselves.

Dwarkesh Patel

But there’s a reason they’re buying it from you. We have quotes from the founders of Chinese companies that say that they’re bottlenecked on compute.

Jensen Huang

Because our chips are better. On balance, our chips are better. There’s just no question about it. In the absence of our chip… Can you acknowledge that Huawei had a record year? Can you acknowledge that a whole bunch of chip companies have gone public? Can you acknowledge that?

Dwarkesh Patel

Yes.

Jensen Huang

Can you also acknowledge that we used to have a very large share in that market, and we no longer have a large share in that market? We can also acknowledge that China is about 40% of the world’s technology industry. To concede that market for the United States technology industry is a disservice to our country. It is a disservice to our national security. It is a disservice to our technology leadership, all for the benefit of one company. It makes no sense to me.

Dwarkesh Patel

I guess I’m confused. It feels like you’re making two different statements. One is that we’re going to win this competition with Huawei because our chips are going to be way better if we’re allowed to compete. Another is that they would be doing the same exact thing without us anyway. How can both of those things be true at the same time?

Jensen Huang

It’s obviously true. In the absence of a better choice, you’ll take the only choice you have. How is that illogical? It’s so logical.

Dwarkesh Patel

The reason they want Nvidia chips is that they’re better.

Jensen Huang

Yeah.

Dwarkesh Patel

Better is more compute. More compute means you can train a better model.

Jensen Huang

No, it’s just better. It’s better because it’s easier to program. We have a better ecosystem. But whatever the better is, whatever the better is… And of course we’re going to send them compute. So what? The fact of the matter is that we get to benefit. Don’t forget, we get the benefit of American technology leadership. We get the benefit of developers working on the American tech stack. We get the benefit, as those AI models diffuse out into the rest of the world, that the American tech stack is therefore the best for it. We can continue to advance and diffuse American technology. That, I believe, is a positive. It’s a very important part of American technology leadership.

Now, the policies that you’re advocating resulted in the American telecommunications industry being policied out of basically the world, to the point where we don’t control our own telecommunications anymore. I don’t see that as smart. It’s a little narrow-minded, and it led to unintended consequences that I’m describing to you right now that you seem to have a very hard time understanding.

Dwarkesh Patel

Okay, let’s just step back. It seems like the crux here is there’s a potential benefit and there’s a potential cost. What we’re trying to figure out is, is the benefit worth the cost? I guess I’m trying to get you to acknowledge the potential cost. Compute is an input to training powerful models. Powerful models do have powerful offensive capabilities, like cyber attacks. It is a good thing that American companies got to Mythos-level capabilities first, and then now they’re going to hold off on those capabilities so that the American companies and American government can make their software more protected before that level of capability was announced.

If China had had more compute or more cloud compute, if they could have made a Mythos-level model earlier and deployed it widely, that would have been very bad. One of the reasons that hasn’t happened is that we have more compute thanks to companies like Nvidia in America. That is a cost of sending it to China. So let’s leave the benefit aside for a second. Do you acknowledge that this is a potential cost?

Jensen Huang

I’ll also tell you the potential cost is we allow one of the most important layers of the AI stack, the chip layer, to concede an entire market—the second largest market in the world—so that they could develop scale, so that they could develop their own ecosystem, so that future AI models are optimized in a very different way than the American tech stack. As AI diffuses out into the rest of the world, their standards, their tech stack, will become superior to ours, because their models are open.

Dwarkesh Patel

I guess I just believe enough in Nvidia’s kernel engineers and CUDA engineers to think that they could optimize—

Jensen Huang

AI is more than kernel optimization, as you know.

Dwarkesh Patel

Of course, but there are so many things you can do, from distilling to a model that’s well-fit for your chips.

Jensen Huang

We’re going to do our best.

Dwarkesh Patel

You have all the software. It’s just hard to imagine that there’s a long-term lock-in to the Chinese ecosystem, even if they have a slightly better open source model for a while.

Jensen Huang

China is the largest contributor to open source software in the world. Fact. China’s the largest contributor to open models in the world. Fact. Today it’s built on the American tech stack, Nvidia’s. Fact.

All five layers of the tech stack for AI are important. The United States ought to go win all five of them. They’re all important. The one that is the most important, of course, is the AI application layer. The layer that diffuses into society, the one that uses it most will benefit from this industrial revolution most. But my point is that every layer has to succeed.

If we scare this country into thinking that AI is somehow a nuclear bomb, so that everybody hates AI and everybody’s afraid of AI, I don’t know how you’re helping the United States. You’re doing it a disservice. If we scare everybody out of doing software engineering jobs because it’s going to kill every software engineering job—and we don’t have any software engineers as a result of that—we’re doing a disservice to the United States.

If we scare everybody out of radiology so nobody wants to be a radiologist because computer vision is completely free and no AI is going to do a worse job than a radiologist, we misunderstand the difference between a job and a task. The job of a radiologist is patient care. The task is to read a scan. If we misunderstand that so profoundly and we scare everybody out of going to radiology school, we’re not going to have enough radiologists and good enough healthcare.

So I’m making the case that when you make a premise that is so extreme, everything goes from zero or infinity, we end up scaring people in a way that’s just not true. Life is not like that. Do we want the United States to be first? Of course we do. Do we need to be a leader in every layer of that stack? Of course we do. Of course we do. Today you’re talking about Mythos because Mythos is important. Sure. That’s fantastic.

But in a few years time, I’m making you the prediction that when we want the American tech stack, when we want American technology to be diffused around the world—out to India, out to the Middle East, out to Africa, out to Southeast Asia—when our country would like to export, because we would like to export our technology, we would like to export our standards, on that day, I want you and I to have that same conversation again. I will tell you exactly about today’s conversation, about how your policy and what you imagined literally caused the United States to concede the second largest market in the world for no good reason at all.

We shouldn’t concede it. If we lose it, we lose it. But why do we concede it? Now nobody is advocating an all or nothing. Nobody’s advocating all or nothing, meaning we ship everything to China at all times. Nobody’s advocating that. We should always have the best technology here. We should always have the most technology here, and the first. But we should also try to compete and win around the world. Both of those things can simultaneously happen. It requires some amount of nuance, some amount of maturity instead of absolutes. The world is just not absolutes.

Dwarkesh Patel

Okay. The argument hinges on this. They’ve built models that are specified for the best chips that they make in a few years. Those chips get exported around the world. That sets the standard. Because of EUV export controls, as we said, you’re going to move on to 1.6nm. They’re still going to be on 7nm, even after a few years from now.

It may make sense that domestically they would prefer, “Hey, we’ve got so much energy, we can manufacture at scale. We’ll still keep using 7nm.” But on the exporting thing, their 7nm chips have to be competitive against your 1.6nm chips. Their models have to be so far optimized for the 7nm that it’s better to run their models on 7nm than to run their models on your 1.6nm.

Jensen Huang

Can we just look at the facts then? Is Blackwell 50 times more advanced lithography than Hopper? Is it 50 times? Not even close. I just kept saying it over and over again. Moore’s Law is dead. Between Hopper and Blackwell, from the transistors themselves, call it 75%. It was three years apart, 75%. Blackwell is 50 times Hopper.

My point is, architecture matters. Computer science matters. Semiconductor physics matters as well, but computer science matters. The impact of AI largely comes from the computing stack, which is the reason why CUDA is so effective, which is the reason why CUDA is so beloved. It’s an ecosystem, a computing architecture that allows for so much flexibility that if you wanted to change an architecture completely—create something like MoE, create something like diffusion, create something that’s disaggregated—you could do so. It’s easy to do.

So the fact of the matter is, AI is about the stack above as much as it is about the architecture below. To the extent that we have architectures and software stacks that are optimized for our stack, for our ecosystem, it is obviously good, because we started the conversation today about how Nvidia’s ecosystem is so rich. Why do people always love programming CUDA first? They do. They do. So do the researchers in China.

But if we are forced to leave China, if we’re forced to leave China, first of all, it’s a policy mistake. Obviously it has backlash. It has turned out badly for the United States. It enabled, it accelerated their chip industry. It forced all of their AI ecosystem to focus on their internal architectures. It’s not too late, but nonetheless it has already happened.

You’re going to see in the future, they’re not stuck at 7nm, obviously. They’re good at manufacturing. They will continue to advance from 7nm and beyond. Now, is there a 10x difference between 5nm and 7nm? The answer is no. Architecture matters. Networking matters. That’s why Nvidia bought Mellanox. Networking matters. Energy matters. So all of that stuff matters. It’s not simplistic, like the way you’re trying to distill it.

01:35:06 – Why doesn’t Nvidia make multiple different chip architectures?

Dwarkesh Patel

We can move on from China, but that actually raises an interesting question. We were discussing earlier these bottlenecks at TSMC and memory and so forth.

So if we’re in this world where you’re already the majority of N3—and at some point you’ll be N2 and you’ll be a majority of that—do you see that you could go back to N7, the spare capacity at an older process node, and say, “Hey, the demand for AI is so great and our capacity to expand the leading edge is not meeting it, so we’re going to make a Hopper or Ampere, but with everything we know about numerics today and all the other improvements you described”? Do you see that world happening before 2030?

Jensen Huang

It’s not necessary to. The reason for that is because with every generation, the architecture is more than just the transistor scale. You’re doing so much engineering and packaging and stacking, and the numerics and the system architecture.

When you run out of capacity, to easily go back to another node… That’s a level of R&D that no one could afford. We could afford to lean forward. I don’t think we could afford to go back. Now, if the world simply says… If on that day, let’s do the thought experiment, on that day we go, “Listen, we’re just never going to have more capacity ever again.” Would I go back and use 7nm? In a heartbeat, of course I would.

Dwarkesh Patel

One question somebody I was talking to had is, why doesn’t Nvidia run multiple different chip projects at the same time with totally different architecture? So you could do something like a Cerebras-style wafer scale. You could do a Dojo-style huge package. You could do one without CUDA. You have the resources and the engineering talent to do all of these in parallel. So why put all the eggs in one basket, given who knows where AI might go and architectures might go?

Jensen Huang

Oh, we could. It’s just that we don’t have a better idea. We could do all of those things. It’s just not better. We simulate it all in our simulator, provably worse. So we wouldn’t do it. We’re working on exactly the projects that we want to work on. If the workload were to change dramatically—and I don’t mean the algorithms, I actually mean the workload, and that depends on the shape of the market—we may decide to add other accelerators.

For example, recently we added Groq, and we’re going to fold Groq into our CUDA ecosystem. We’re doing that now because the value of tokens has gone up so high that you could have different pricing of tokens. Back in the old days, just a couple years ago, tokens were either free or barely expensive. But now you can have different customers, and those customers want different answers. Because the customers make so much money—for example, our software engineers—if I can give them much more responsive tokens so that they’re even more productive than they are today, I would pay for it.

But that market has only recently emerged. So I think we now have the ability to have the same model, based on the response time, have different segments. That’s the reason why we decided to expand the Pareto frontier and create a segment of inference that is faster response time, even though it’s lower throughput. Until now, higher throughput is always better. We think there could be a world where there could be very high ASP tokens, and even though the throughput is lower in the factory, the ASPs make up for it.

That’s the reason why we did it. But otherwise, from an architecture perspective, if I had more money, I would put more behind Nvidia’s architecture.

Dwarkesh Patel

I think this idea of extremely premium tokens and just the disaggregation of the inference market is very interesting.

Jensen Huang

The segmentation of it.

Dwarkesh Patel

Yeah. Alright, final question. Suppose the deep learning revolution didn’t happen. What would Nvidia be doing? Obviously games, but given—

Jensen Huang

Accelerated computing, the same thing we’ve been doing all along. The premise of our company is that Moore’s law is going to… General purpose computing is good for a lot of things, but for a lot of computation it’s not ideal.

So we combined an architecture called a GPU, CUDA, to a CPU, so that we can accelerate the workload of the CPU. Different kernels of code or algorithms could be offloaded onto our GPU. As a result, you speed up an application by 100x, 200x. Where can you use that? Obviously engineering and science and physics, data processing, computer graphics, image generation, all kinds of things. Even if AI doesn’t exist today, Nvidia would be very, very large.

The reason for that is fairly fundamental, which is that the ability for general purpose computing to continue to scale has largely run its course. And the only way… Not the only way, but the way to do that is through domain-specific acceleration. One of the domains that we started with was computer graphics, but there are many other domains. There’s all kinds. Particle physics and fluids, structured data processing, all kinds of different types of algorithms that benefit from CUDA.

Our mission was really to bring accelerated computing to the world and advance the type of applications that general purpose computing can’t do, and scale to the level of capability that helps break through certain fields of science. Some of the early applications were molecular dynamics, seismic processing for energy discovery, image processing of course, all of those kinds of fields where general purpose computing is just simply too inefficient to do so.

If there were no AI, I would be very sad. But because of the advances that we made in computing, we democratized deep learning. We made it possible for any researcher, any scientist, anywhere, any student, to be able to access a PC or a GeForce add-in card and do amazing science. That fundamental promise hasn’t changed, not even a little bit.

If you watch GTC, there’s the whole beginning part of it. None of it’s AI. That whole part of it with computational lithography or our quantum chemistry work, data processing work, all of that stuff is unrelated to AI. And it’s still very important. I know that AI is very interesting and quite exciting, but there’s a lot of people doing a lot of very important work that’s not AI related, and tensors are not the only way that you compute it. We want to help everybody.

Dwarkesh Patel

Jensen, thank you so much.

Jensen Huang

You’re welcome. I enjoyed it.

Dwarkesh Patel

Me too.

What I learned this week - Pretraining parallelisms, Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL, On why pretraining runs fail

Dwarkesh Patel — Wed, 15 Apr 2026 14:03:00 GMT

At the end of my conversation with Michael Nielsen, we talked about how to actually retain what you learn. Michael’s advice was to make some kind of demanding artifact. Write something up. Try to explain it. So in that spirit, here are notes on some topics I’ve learned about over the last week or two. These notes are extremely rough, and have many mistakes.

Can distillation be stopped?

Can the frontier labs stop distillation? Because if they can’t, open source commoditizing models can catch up incredibly rapidly, making the long run business model for the labs less viable. Let’s say it takes 1T tokens from a frontier model to capture its juice (I have no idea if that’s correct, but let’s say). Even ignoring savings from caching, Opus 4.6 is $25/MTok. So $25 million for those 1T tokens. That’s nothing.
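To make the arithmetic above explicit, here is a minimal sketch of the calculation. The 1T-token figure is the guess from the paragraph above (not a measured number), the price is the quoted $25/MTok, and the function name is made up for illustration.

```python
# Back-of-envelope cost of buying enough frontier-model tokens to distill from.
# Both inputs are the figures quoted in the text; the 1T-token requirement is
# explicitly a guess there, so treat the result as illustrative only.
TOKENS_NEEDED = 1_000_000_000_000   # 1T tokens (assumed in the text)
PRICE_PER_MTOK_USD = 25             # $25 per million tokens, as quoted

def distillation_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """USD cost of `tokens` tokens priced at `price_per_mtok` per 1e6 tokens."""
    return tokens / 1_000_000 * price_per_mtok

cost = distillation_cost_usd(TOKENS_NEEDED, PRICE_PER_MTOK_USD)
print(f"${cost:,.0f}")  # prints $25,000,000
```

Caching discounts are ignored here, as in the text, so this is an upper-ish estimate under the stated assumptions.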

Labs are responding by hiding chain of thought. But there are two problems with this solution:

Chain of thought is not made of some fundamentally different kind of token. You can just instruct the model to not think first but just start solving the problem, or to write out its thinking somewhere else.
Even if labs do figure out how to robustly hide chain of thought to train in the future, you can make reconstructing the chain of thought necessary to reproduce a decoded sequence as an RLVR target. Yes that costs more, but seems doable.
Maybe most importantly, the real juice of these agentic models is their tool use (writing and updating files of code, running bash commands, etc). And if these things are done locally on the user’s computer, you can’t really hide them. And it seems like a hard lift to get users to migrate all their development workflows to a cloud that you fully control and hide visibility to, modulo a Claude agent input text prompt.

By the way, I learned about an interesting way companies which build products atop API access to AI models can basically distill these models, in a way that potentially makes the distilled models even better than the ones they’re actually built atop.

Suppose you’ve got a coding product. In order to build a feature, a user uses your product to query some frontier model API across 10+ back and forths. Once the user is satisfied with the end result, you have the end state that the user actually wanted - “the gold diff”. These coding product companies can now set the gold diff as the RL target for training their own models, where the model gets rewarded for producing outputs that look like what users eventually converged on, and penalized for producing the kinds of intermediate outputs that users kept rejecting or editing.
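A rough sketch of what a gold-diff reward could look like. Everything here is an assumption for illustration: the function names, the use of `difflib` string similarity as a stand-in for "looks like", and the `penalty` weight. I don't know how any real product scores this.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Crude textual similarity in [0, 1]; a stand-in for a real diff comparison."""
    return SequenceMatcher(None, a, b).ratio()


def gold_diff_reward(candidate: str, gold_diff: str, rejected: list[str],
                     penalty: float = 0.5) -> float:
    """Reward closeness to the diff the user accepted, and penalize closeness
    to intermediate outputs the user rejected or edited away.
    `penalty` is a hypothetical weight, not a known value."""
    closeness = similarity(candidate, gold_diff)
    worst = max((similarity(candidate, r) for r in rejected), default=0.0)
    return closeness - penalty * worst


gold = "+    return a + b\n"
rejected = ["+    return a - b\n", "+    return b\n"]

reward_for_gold = gold_diff_reward(gold, gold, rejected)
reward_for_rejected = gold_diff_reward(rejected[0], gold, rejected)
print(reward_for_gold, reward_for_rejected)
```

The point of the second term is that the rejected intermediates are signal too: a candidate that reproduces an output the user threw away scores lower than one that goes straight to the final state.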

On why pretraining runs fail

Had an interesting chat with someone on why pretraining runs often fail. It was very interesting to get a sense of all the tangible ways that things can get fucked, and why training is such a precarious operation. At a high level, breaking causality, and adding bias, seem to be key culprits.

Breaking causality:

When you do expert routing, you first go through the router, which gives you a score of how much each token wants each expert. There are two ways to proceed from here. One is token routing, where you read the scores from the token’s perspective, and allocate each token to its top k experts. Problem is that you could end up with wildly unbalanced allocation across experts, which is terrible for performance. Alternatively, you could (and only in training) do expert choice, where you read the scores from the expert’s perspective, and each expert takes the tokens that prefer it most strongly. This way you can enforce that each expert gets roughly the same number of tokens. But the big problem is that this breaks causality, because which expert token n gets allocated to may depend on which expert token n + k gets routed to. And breaking causality is very bad, because you’re getting information in training (and updating based on it) that you wouldn’t see in deployment.
- Rumor is that this explains why Llama 4 was underwhelming.
- I guess you could do expert choice during prefill inference? But maybe it doesn’t work well in practice to allocate tokens to experts which would not have received that token in actual training.
- Tbh I don’t fully understand why breaking causality is so bad. I understand you can’t see beyond causality in real inference. But why is this minor deviation such a big issue?
Another thing that can break causality is token dropping, where an expert that is over capacity just ignores the tokens in the batch that it’s supposed to process but which match it least strongly, so that it doesn’t overflow its padded buffer. This breaks causality because a later token being more strongly matched to this expert might lead to an earlier token getting ignored.
- Apparently this was an issue with Gemini 2 Pro.
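The causality leak from expert choice shows up even in a toy example. This is a minimal sketch with made-up router scores and hypothetical function names, not any lab’s router: two sequences share the same first three tokens and differ only in the fourth.

```python
def token_routing(scores, k=1):
    """Each token independently keeps its own top-k experts."""
    return [sorted(range(len(row)), key=lambda e: -row[e])[:k] for row in scores]


def expert_choice(scores, capacity):
    """Each expert keeps its `capacity` highest-scoring tokens from the whole sequence."""
    assigned = [[] for _ in scores]
    for e in range(len(scores[0])):
        ranked = sorted(range(len(scores)), key=lambda t: -scores[t][e])
        for t in ranked[:capacity]:
            assigned[t].append(e)
    return assigned


# scores[t][e] = how much token t wants expert e (illustrative numbers)
prefix = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8]]
seq_a = prefix + [[0.1, 0.9]]    # last token prefers expert 1
seq_b = prefix + [[0.95, 0.05]]  # last token prefers expert 0

# Token routing: the shared prefix is routed identically in both sequences.
print(token_routing(seq_a)[:3], token_routing(seq_b)[:3])

# Expert choice: token 1 lands on a different expert depending on token 3.
print(expert_choice(seq_a, capacity=2)[1], expert_choice(seq_b, capacity=2)[1])
```

In the second sequence the last token outbids token 1 for a slot on expert 0, so token 1 gets bumped to expert 1. Token 1’s computation now depends on a token that comes after it, which is exactly what can’t happen at decode time.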

Adding bias:

Bias much worse than variance - variance can average out, but bias compounds
Apparently the original GPT 4 training was slow and got initially fucked because of the following bug: they were using FP16 on their collectives like all-reduce. FP16 distributes its granularity according to logarithmic density - between 1 and 2, the mantissa bits carve the interval ~0.001 apart. But between 1024 and 2048 the steps are already a whole 1 apart, and from 2048 up they’re 2 or more apart. Suppose some collective involves adding 1 + 1 … 10,000 times - you could get in a situation where as soon as you get to 2048, you add 1, the true answer is 2049, which FP16 can’t represent, so it rounds back down to 2048, and then you add one again. And so the calculated value is ~5x off the real value. Huge issue if you’re trying to sum many small gradients into a large accumulator. And imagine how hard the bug must have been to find!
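The stalled accumulator is easy to reproduce. A minimal sketch, using the half-precision format (`"e"`) in Python’s `struct` module to round to FP16 after every add. The helper names are mine, and this only models the rounding, not a real all-reduce. IEEE half precision has 11 significant bits, so the running sum gets stuck at 2048.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def fp16_sum_of_ones(n: int) -> float:
    """Add 1.0 to an FP16 accumulator n times, rounding after each add."""
    acc = 0.0
    for _ in range(n):
        acc = to_fp16(acc + 1.0)
    return acc


print(to_fp16(1025.0))           # 1025.0 (steps are still 1 apart here)
print(to_fp16(2049.0))           # 2048.0 (steps are now 2 apart)
print(fp16_sum_of_ones(10_000))  # 2048.0, not 10000.0
```

Every individual add looks fine in isolation and the error always points the same way, which is why this is bias and not variance.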

Implications for AI training:

Some of the people who think we can cure aging argue that there’s basically 5 different ways people die of old age (heart disease, cancer, etc), and that if we cure these 5 different diseases, then we’d basically have solved aging. You could ask a similar question about these failed pretraining runs - are there 5 different ways training runs fail, in which case once a lab figures out numerics and the other few, you’ll just have smooth sailing, or will you keep seeing new bespoke issues emerge at each new level of scale? The person I talked to seemed to think the latter - he pointed out that even within numerics, there’s so many ways you can fuck things up. And new ones will keep emerging at scale.
Bearish on AI fully automating kernel writing anytime soon. Presumably this is because he thinks it’s more of an AGI complete problem than some give it credit for. There’s another school of thought that says, “Hey, which kernel gets attention or MLP to run fastest on this scaleup is a super verifiable domain, thus we can RL to superhuman performance easily.” But he says, it took Nvidia, which has the best kernel engineers in the world, a long time to optimize for Blackwell, which suggests that actually it’s quite hard, and might not be super easy to close the loop on.
Sometimes people say inference for RL generation and inference for end user generation are basically the same. But this person pointed out that in RL inference, numerical drift between the inference and training engines can cause subtle off-policy biases, which matter a ton for the highest quality training but are not an issue if you’re just serving users.
Emphasized how important it is to have a disciplined process for amalgamating compute multipliers, because of the risks of stacking up bugs with subtle biases.

Pretraining parallelisms

Notes from an excellent lecture that Horace He gave my friends and me.

What made this lecture so good is that Horace built up the whole topic as a chain of problems and solutions: here’s what we want to do, here’s why it breaks, here’s how we fix it, here’s why that fix eventually breaks too. Most explanations just list out a hodge podge of different strategies, without ever connecting them to the problems they solve or explaining why you’d pick one over another.

Equation for pretraining flops = 6ND. 2 FLOPs per parameter per token for the forward pass (multiply + add). Backward pass is 2× forward because you compute gradients w.r.t. both input matrices. So 2 + 4 = 6.
Okay we can’t do all this on one GPU. So how do we split up this problem? The obvious solution is to do data parallel - where you copy the model weights across each GPU, and you just do a part of the batch on each GPU.
- The obvious problem is that each GPU only has a limited amount of HBM - B300 is 288GB - and this is not enough to store the weights as models get bigger and bigger, much less their activations.
Okay so next thing we try is fully sharded data parallel - each GPU only stores 1/N of the parameters of each layer. Before processing each layer, you all-gather the full layer’s parameters from all GPUs. After processing, each GPU discards the gathered parameters.
- It was emphasized that this is the go to default. And you only move on from this when having too many GPUs forces you to move on, for reasons explained later. The reason this is the default is that it’s trivial to overlap compute and communication time - that’s because the only thing being communicated is the weights, which are not dependent on what happened in the layer before, so you can start all gathering the next layer while you’re still computing this layer. Compare this against tensor or expert parallelism, which do need to share activations for one layer before you can process the next one. The problem with pipeline parallelism is bubbles as explained below.
- From a comms volume perspective, FSDP looks insanely expensive at first — you all-gather every layer’s full weights across all GPUs, use them for one matmul, then throw them away. But this ignores what regular data parallelism already costs you - in regular DP, you still need to do an all reduce after every layer of the backwards pass in order to sync the batch’s gradients across all the GPUs. That all-reduce has comms volume of params × 2. FSDP adds all-gathers — one per layer in the forward pass, one per layer in the backward pass. But an all-gather is half the comms volume of an all-reduce. So naive FSDP comms volume ends up being # params * 4 (all gather forward and back, plus all reduce on back). You can do even better: since each gradient shard only needs to end up on the one GPU that owns it, replace the all-reduce with a reduce-scatter (which skips the final broadcast step). That gets you to params × 3 total — a 50% overhead over vanilla DP.
So why can’t you always just do FSDP?
- Comms crossover: You want your compute time to be greater than your comms time - you don’t want to be bottlenecked on comms. But since compute time for FSDP decreases as you increase the number of GPUs, and comms time does not, as you scale the number of GPUs on FSDP, your MFU can totally crater. When this happens, you need to add pipeline parallelism too.
  - Compute time = (6 * # tokens * active params) / (compute per GPU * number of GPUs)
    - This decreases as you increase number of GPUs
  - Comms time = (# total params * 3) / (nv link domain size * infiniband BW)
    - Comms time does not increase as you add more domains. This was really confusing to me. Each domain collectively holds all the parameters, and you need to sync gradients across domains after each layer of the backward pass. You’d think that adding more domains means more hops in the ring, so the all-reduce gets slower. But the standard ring algorithm splits the message into one chunk per participant. More domains means more hops, but proportionally smaller chunks per hop. (This breaks down when chunks get so small that per-hop latency dominates, at which point you switch to tree algorithms.)
      - Technically, you can do better than a naive single all reduce for the gradients between all the domains. You do a hierarchical collective to optimize comms time across multiple NVLink domains. Key thing to remember is that each GPU in the domain gets its own bandwidth access to infiniband. So you wanna use it all up since interconnect bandwidth is the bottleneck. You do this by trying to do as much as possible within a scaleup before you move out. So you do reduce scatter within a scale up to give each GPU the domain-level reduced gradients for a shard of the layer, then all reduce these shards across corresponding GPUs across domains, then all gather within a domain. This shifts the comms time line down, thus moving the crossover point to the right.
      - Made an animation to illustrate it using Cursor and Composer 2:
- If you look at the equations, you can see that if you increase batch size, crossover point moves to right, and if you make the model more sparse, moves to the left.
- Also why TPUs are better at FSDP - because more accelerators within a domain.
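To sanity check the direction of these effects, here’s a small sketch that plugs numbers into the two equations above. All the hardware and model numbers are made up for illustration, and `bytes_per_param` is my own addition, since the comms formula in my notes leaves units implicit.

```python
# Comms volume per step, in units of "number of params"
ALL_REDUCE, ALL_GATHER, REDUCE_SCATTER = 2, 1, 1
vanilla_dp = ALL_REDUCE                   # gradient sync only
naive_fsdp = 2 * ALL_GATHER + ALL_REDUCE  # gather fwd + bwd, then all-reduce
fsdp = 2 * ALL_GATHER + REDUCE_SCATTER    # gather fwd + bwd, then reduce-scatter


def compute_time_s(tokens, active_params, flops_per_gpu, n_gpus):
    """6ND FLOPs split evenly across the GPUs."""
    return 6 * tokens * active_params / (flops_per_gpu * n_gpus)


def comms_time_s(total_params, domain_size, ib_bytes_per_s, bytes_per_param=2):
    """FSDP traffic spread over every GPU's InfiniBand link in a domain.
    Note that the total number of GPUs does not appear."""
    return fsdp * total_params * bytes_per_param / (domain_size * ib_bytes_per_s)


def crossover_gpus(tokens, active_params, total_params, flops_per_gpu,
                   domain_size, ib_bytes_per_s, bytes_per_param=2):
    """GPU count at which compute time has shrunk down to comms time."""
    comms = comms_time_s(total_params, domain_size, ib_bytes_per_s, bytes_per_param)
    return 6 * tokens * active_params / (flops_per_gpu * comms)


# Illustrative dense model: 70B params, 10M-token batch, 8-GPU domains
cfg = dict(tokens=10e6, active_params=70e9, total_params=70e9,
           flops_per_gpu=1e15, domain_size=8, ib_bytes_per_s=50e9)

base = crossover_gpus(**cfg)
bigger_batch = crossover_gpus(**{**cfg, "tokens": 20e6})
sparser = crossover_gpus(**{**cfg, "active_params": 35e9})
bigger_domain = crossover_gpus(**{**cfg, "domain_size": 64})
print(base, bigger_batch, sparser, bigger_domain)
```

Doubling the batch doubles the crossover point, halving the active fraction halves it, and a bigger scale-up domain pushes it out in proportion, which matches the three bullets above.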

Batch size floor: FSDP is data-parallel, so each GPU processes at least one sequence. Attention is computed within a sequence and can’t (easily) be split across GPUs. If your critical batch size is 10M tokens and sequence length is 10K, you only have 1K sequences — so you can’t scale beyond 1K GPUs with pure FSDP, even if you have plenty of comms bandwidth left.
Problems with pipeline parallelism (the next addition you’d make to FSDP in order to deal with these issues):
- The problem with pipeline parallelism is different - there you have bubbles that emerge from the fact that at the beginning of the batch, the GPUs dedicated to the final layers are not being used, and conversely at the end of the batch, the GPUs dedicated to the first layers are not being used. The reason you can’t overlap batches in training to solve pipeline bubbles is that you need to consolidate gradients and update the model before you process the next batch.
- But also you’re adding architecture constraints - things like Kimi’s attention-to-residuals (where each block attends to all previous layers’ residuals) become very difficult when those residuals live on different pipeline stages. Similarly, interleaving sliding-window and global attention layers could cause load imbalance across stages. Dealing with all this slows down research iteration, which is the greatest sin you can commit.

Mythos and the cybersecurity equilibrium

It seems like the key difference between Mythos and previous versions is that while previous versions could find individual vulnerabilities in the code (“Hey, there’s a missing bounds check here”), Mythos is long run agentic enough to rope 5 different vulnerabilities together which are all required in order to find an exploit (“Now I can execute arbitrary code, escalate privileges, etc”). To the extent that some discontinuity has been hit, it’s probably more the result of the combinatorial nature of cyberattacks rather than some off-trend increase in intelligence.

What does this mean for offense/defense? One way to look at it is that software is more secure today than it was 20 years ago, despite more and more human intelligence probing at public code, both white hat and black hat. If we get another influx of intelligence suddenly, why should the dynamic change?

In fact, we know that our foreign adversaries almost certainly have access to a bunch of critical zero days which they’re saving for a rainy day, or already using in inconspicuous ways. To the extent that Glasswing allows the whole industry to find a bunch of these latent exploits and patch them, shouldn’t we expect defense to have become much stronger relative to offense by the end of 26? Of course, this is thanks to the fact that American companies got there first and are cooperating with other companies and our government to patch things before our adversaries get to the same level.

One counterpoint I heard from a security expert is that there’s a big difference between finding vulnerabilities and patching them - and AI is much better at the first than the latter (people often talk about the offense/defense balance, but difficulty of finding versus patching vulnerabilities seems much more significant). In order to patch an issue, you have to find a fix that will not interfere with all the ways people use your software, and all the features which rely on weird bespoke behavior. XKCD has a nice comic illustrating how these kinds of issues come up:

Potential solutions, if it’s non-trivial to just push patches to every piece of software?

TODO - I know nothing about formal verification of software - check out what a seL4 proof of some behavior might look like
Use LLMs to rapidly port all C to Rust. Curious how easily Mythos can find vulnerabilities in memory safe languages.

In some sense, it’s good that Anthropic didn’t release this model publicly until critical IT could be patched up. In another sense, isn’t it a super bad precedent for private companies to be hoarding the ability to break into any operating system and browser and device? One obvious question for Anthropic is why they didn’t just build some kind of classifier which would detect whether you’re using the model for cyberattack type stuff, and refuse requests if yes, and release that publicly.

Patching your own software is isomorphic to finding bugs in someone else’s repo from the perspective of an LLM (and patching your own software is a frequent coding model use case).
These kinds of classifiers can be easy to evade if you have enough expertise to break the problem of finding exploits down into smaller subproblems of finding vulnerabilities, each of which individually seems like sensibly good behavior to an LLM with no memory.

Pipeline RL paper summary

As you keep RLing a model, not only does the average length of a response increase (since you’re basically training the model to think for longer before answering) but the variance in length also increases - sometimes you get an easy problem and you can immediately answer it - other times, you need to go think for 100k tokens.

This is a big problem for GPU utilization on training. Because you have to wait for all these stragglers to finish generating before you can start the next training step.

Okay one way you could get out of this conundrum is to just batch generation so that while stragglers keep going, you generate even more rollouts.

The problem is that there is an optimal batch size for each training step, so you’d need to split all these rollouts you made across lots of consecutive training steps.

But this takes you into the domain of offline RL, because your model is changing with each training step. And so you’re training your model on trajectories that were actually generated by an earlier model, which is not ideal.

Pipeline RL paper proposes the following fix: in-flight weight updates - where you just sub out the generating model partway through these trajectories as soon as the new training step is done, so all the short trajectories, and a good chunk of the long trajectories, that the next training step will be trained on are generated by the most recent version of the model.
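A toy model of why this helps. This is my own simplification, not the paper’s setup: assume one token is generated per clock tick and the trainer publishes new weights every `ticks_per_step` ticks. The function names are hypothetical.

```python
def token_versions(start_tick, length, ticks_per_step, in_flight):
    """Which weight version generated each token of one rollout."""
    if in_flight:
        # Weights are swapped mid-rollout, so each token uses whatever is newest.
        return [(start_tick + i) // ticks_per_step for i in range(length)]
    # Otherwise the rollout is stuck with the weights it started with.
    return [start_tick // ticks_per_step] * length


def mean_lag(start_tick, length, ticks_per_step, in_flight):
    """Average number of versions each token is behind the trainer
    at the moment the rollout finishes."""
    latest = (start_tick + length - 1) // ticks_per_step
    versions = token_versions(start_tick, length, ticks_per_step, in_flight)
    return sum(latest - v for v in versions) / length


print(token_versions(0, 5, 2, in_flight=True))   # [0, 0, 1, 1, 2]
print(token_versions(0, 5, 2, in_flight=False))  # [0, 0, 0, 0, 0]

# A long straggler: 100 tokens, new weights every 10 ticks
print(mean_lag(0, 100, 10, in_flight=True), mean_lag(0, 100, 10, in_flight=False))
```

Short rollouts that fit inside one training step have zero lag either way. The gain is on the stragglers, where the tail of the trajectory comes from recent weights instead of the whole thing coming from weights that are many steps old.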

Michael Nielsen – How science actually progresses

Dwarkesh Patel — Tue, 07 Apr 2026 15:49:28 GMT

Really enjoyed chatting with Michael Nielsen about how we recognize scientific progress.

It’s especially relevant for closing the RL verification loop for scientific discovery.

But it’s also a surprisingly mysterious and elusive question when you look at the history of human science.

We approach this question through stories like Einstein (who claimed that he hadn't even heard of the famous Michelson-Morley experiment, which is supposed to have motivated special relativity, until after he had come up with the theory), Darwin (why did it take till 1859 to lay out an idea whose essence every farmer since antiquity must have observed?), Prout (how do you recognize that isotopes exist if you cannot chemically separate them?), and many others.

The verification loop on scientific ideas is often extremely long and weirdly hostile. Ancient Athenians dismissed Aristarchus’s heliocentrism in the 3rd century BC because it would imply that the stars should shift in the sky as the Earth orbits the sun. The first successful measurement of stellar parallax was in 1838. That’s a 2,000-year verification loop.

But clearly human science is able to make progress faster than raw experimental falsification/verification would imply, and in cases where experiments are very ambiguous. How?

Michael has some very deep and provocative hypotheses about the nature of progress. One I found especially thought-provoking is that aliens will likely have a VERY different science + tech stack than us. Which contradicts the common sense picture of a linear tech tree that I was assuming. And has some interesting implications about how future civilizations might trade and cooperate with each other.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Labelbox researchers built a new safety benchmark. Why? Well, current safety benchmarks claim that attacks on top models are successful only a few percent of the time, but the prompts in those benchmarks don’t reflect how real bad actors actually write. You can read Labelbox’s research here. If this could be useful for your work, reach out at labelbox.com/dwarkesh

Mercury has an MCP that lets you give an LLM access to your full transaction history, including things like attached receipts and internal notes. I just used it to categorize my 2025 transactions, and it worked shockingly well. Modern functionality like this is exactly why I use Mercury. Learn more at mercury.com
Jane Street’s ML engineers presented some of their GPU optimization workflows at GTC, showing how they use CUDA graphs, streams, and custom kernels to shave real time off their training runs. You can watch the full talk here. And they open-sourced all the relevant code here. If this kind of stuff excites you, Jane Street is hiring — learn more at janestreet.com/dwarkesh

Timestamps

00:00:00 – How scientific progress outpaces its verification loops

00:17:51 – Newton was the last of the magicians

00:23:26 – Why wasn’t natural selection obvious much earlier?

00:29:52 – Could gradient descent have discovered general relativity?

00:50:54 – Why aliens will have a different tech stack than us

01:15:26 – Are there infinitely many deep scientific principles left to discover?

01:26:25 – What drew Michael to quantum computing so early?

01:35:29 – Does science need a new way to assign credit?

01:43:57 – Prolificness versus depth

01:49:17 – What it takes to actually internalize what you learn

Transcript

00:00:00 – How scientific progress outpaces its verification loops

Dwarkesh Patel

Today, I’m speaking with Michael Nielsen. You have done many things. You’re one of the pioneers of quantum computing and wrote the main textbook in the field, and you were an early advocate of the open science movement. You wrote a book about deep learning that Chris Olah and Greg Brockman credit with getting them into the field. More recently, you’re a research fellow at the Astera Institute and writing a book about religion, science, and technology.

I’m going to ask you about none of those things. The conversation I want to have today is, how do we recognize scientific progress? It’s especially relevant for AI because people are trying to close the RL verification loop on scientific discovery. What does it mean to close that loop? But in preparing for this interview, I’ve realized that it’s a more mysterious and elusive force, even in the history of human science, than I understood.

I think a good place to start will be Michelson-Morley and how special relativity is discovered, if it’s different from the story that you get off of YouTube videos. I will prompt you that way, and then we’ll go in there.

Michael Nielsen

Michelson-Morley is the famous result often presented as this experiment that was done in the 1880s that helped Einstein come up with the special theory of relativity a little bit later, changing the way we think about space and time and our fundamental conception of those things.

And there’s a big gap, I think, between the way Michelson and Morley and other people at the time thought about the experiment and certainly the way in which Einstein thought or did not think about the experiment. In actual fact, he stated later in his life he wasn’t even sure whether he was aware of the paper at the time. There’s a lot of evidence that he probably was aware of the paper at the time, but it actually wasn’t dispositive for his thinking at all. Something else completely was going on.

What Michelson and Morley thought they were doing was testing different theories of what was called the ether. If you go back to the 1600s, Robert Boyle introduced the idea of the ether. We know that sound is vibrations in the air. Boyle and other people got interested in the question of whether light is vibrations in something, and they couldn’t figure out what it was. Boyle did an experiment where he tested whether you could propagate light through a vacuum. He found that you could. You couldn’t do it with sound. He introduced this idea of the ether, and for the next two hundred or so years, people had all these conversations about what the ether was and what its nature was.

The Michelson and Morley experiment was really an experiment to test different theories of the ether against one another, in particular to find out whether or not there was a so-called ether wind. The idea was that the Earth is maybe passing through this ether wind. And if it is passing through the ether wind and you shoot a light beam parallel to the direction the ether wind is going in, it’ll get accelerated a little bit. If it’s being passed back in the opposite direction, it’ll get slowed down a little bit, and you should be able to see this in the results of interference experiments. What they found, much to their surprise, was that in fact there was no ether wind. That ruled out some theories of the ether, but not all, and Michelson certainly continued to believe in the ether.

Dwarkesh Patel

This is what was a shocking part of reading this story from the biography of Einstein that you recommended by... what was his first name?

Michael Nielsen

Abraham Pais.

Dwarkesh Patel

Abraham Pais. Subtle is the Lord. Also from Imre Lakatos, The Methodology of Scientific Research Programmes. The way it’s told is that Michelson-Morley proved that the ether did not exist. Therefore, it created a crisis in physics that Einstein solved with special relativity.

What you’re pointing out is he actually was trying to distinguish between many different theories of ether. If you’re in space or if you’re on Earth, it’s the same direction of ether, or maybe the ether wind is being carried around by the Earth, and so you can’t really experience it on Earth. But if you go to a high enough altitude, you might be able to experience it. In fact, Michelson’s experiments, the famous one is 1887, but he conducted these experiments for basically two decades.

Michael Nielsen

For longer than that. He conducted the first one in 1881, I think, but he continued to believe until he died. He died, I think it was 1929 or so. It was the late twenties. He was still doing experiments in the 1920s about whether or not the ether existed. So he continued to believe in the ether to the end of his life. I think the last public statement he made was a year or two before he died, and he basically still believed it at that point.

Dwarkesh Patel

In fact, there was another physicist, Miller, who kept doing these experiments in the 1920s. He thought that if he went to a high enough altitude, Mount Wilson in California… “Oh, I’m high enough that the ether winds are not being dragged by the Earth. And I’ve measured the effect of the ether.” Einstein hears about this and he says, and this is where you get the famous quote, “Subtle is the Lord, but malicious He is not.”

Anyways, I think the reason the story is interesting is for many different reasons. One of the ways in which the real history of science is different from this idea you get of the scientific method is that you really can’t apply falsification as easily as you might think. It’s not clear what is being falsified. Is it just another version of the theory of the ether that’s being falsified? Certainly you can’t induce the theory of special relativity from the fact that one version of the ether seems to be disconfirmed by these experiments.

Michael Nielsen

It certainly doesn’t show that ideas about falsification are wrong or falsified, but it does show that the most naive ideas… Things are often much more complicated than you think. Michelson did this experiment in 1881. He was a very young man, and then other people, I think Rayleigh was one of them, pointed out that there were some problems with the way he did it, so they had to redo it in 1887. At that point, a lot of the leading physicists of the day basically accepted this result, that there was no ether wind. But what to do about this?

Sure, maybe you falsified some theories of the ether. There are others that you haven’t falsified at all at this point, and people set to work on developing those. It is funny, people will phrase it as showing that the ether didn’t exist. Even just the word “the” there is a misnomer. You actually had a ton of different theories and a couple of leading contenders. So yes, there’s some version of falsification going on, but how you respond to this new experiment is very complicated. Certainly the leading physicists of the day responded by saying, “Okay, this gives us a lot of information about what the ether must be, but it doesn’t tell us that there is no ether.”

Dwarkesh Patel

In fact, Lorentz at the end of the 19th century, before Einstein, figures out the math of how you convert from one reference frame to another reference frame, and comes up with the Lorentz transformations, which is the basis of special relativity. But his interpretation is that you are converting from the ether reference frame to these non-privileged other reference frames if you’re moving relative to the ether.

His interpretation of length contraction and time dilation is that this is the effect of moving through the ether, and you have this pressure. This pressure is warping clocks. It’s warping measures of length. The interesting thing here is that experimentally you cannot distinguish Lorentz’s interpretation from special relativity.

Michael Nielsen

I think that’s a strong statement. Lorentz introduces this quantity called local time, which he regards as... My understanding is he’s not trying to give a physical interpretation of this, but it’s what Einstein would later just recognize as time in another inertial reference frame. He’s not trying to attribute much physical meaning to it. I think Poincaré gets much closer later on to realizing that this is the time that’s registered by clocks.

About forty-odd years later, people start doing these muon experiments where they see cosmic rays hit the top of the atmosphere. They produce a shower of muons, and you can look to see at different heights in the atmosphere how many of those muons remain. They decay over time, and a very strange thing happens, which is that they’re decaying way too slow. You expect they shouldn’t be able to last the whole way through the atmosphere at all. Their decay rate is too quick, if you were in a classical theory. But if in fact their time really has slowed down, it’s okay.

In fact, the measured decay rates in 1940—and there have since been more accurate experiments done—match exactly what you expect from special relativity. That’s the kind of thing where if Lorentz had been alive—he’d been dead ten or so years at that point—it seems quite likely that he would have tried to save his theory by patching it up yet again, but it would have been a massive setback. It starts to just look like time—this thing that Lorentz introduced as a mathematical convenience—that’s actually what time is, for the muons at least. Then there’s a whole bunch of other experiments that show this very similar phenomenon.

Dwarkesh Patel

When was that experiment done?

Michael Nielsen

That was, I think, 1940. It might have been published in 1941.

Dwarkesh Patel

Maybe to rephrase and change my claim: it’s not that you could not have distinguished them, but the scientific community adopted what we in retrospect consider the more correct interpretation before it was actually experimentally shown to be preferred. So there’s clearly some process that human science does which can distinguish different theories.

Michael Nielsen

Can I just interrupt? You used the word process, and it’s interesting to think about that term. Process carries connotations of something set in advance. It’s much more complicated in practice. You have people like Lorentz, who Einstein absolutely and utterly admired, and Poincaré, one of the greatest scientists who ever lived, and Michelson, another truly outstanding scientist, who never reconciled themselves.

It’s not as though there’s some standard procedure that we’re all using to reconcile these things. Great scientists can remain wrong for a very long time after the scientific community has broadly changed its opinion. But there’s no centralized authority or centralized method.

Dwarkesh Patel

That is the interesting thing. There’s progress even though it is hard to articulate the process by which it happens, the heuristics that are used.

You mentioned Poincaré. Lorentz has the math right, but the interpretation wrong. It seems like Poincaré had the opposite problem: he understood that it’s hard to define simultaneity without circularity. You could define it in terms of time, or in terms of the velocity of signals that arrive at a midpoint together, but velocity is itself defined in terms of time. I find this interesting.

There are a couple of other examples we could call on. There is this phenomenon in the history of science where somebody asks the right question, but then they don’t clinch it. I’m curious what you think is happening in those cases.

Michael Nielsen

You actually do want to go case by case and try to understand. It’s not necessarily clear that they’re doing the same thing wrong in all of the cases. The Poincaré case is amazing. He seems to have understood the principle of relativity, the idea that the laws of physics are the same in all inertial reference frames. He seems to have understood that the speed of light is the same in all inertial reference frames. He doesn’t phrase it quite that way, but it is my understanding, though I don’t speak French.

These are basically the ideas that Einstein uses to deduce special relativity. But then he also has this additional misunderstanding where he thinks that length contraction is a dynamical effect, that somehow particles are being pushed together by some external force, something is going on dynamically. He doesn’t understand that it’s purely kinematics. That actually space and time are different from what we thought, and you need to fundamentally rethink those things.

It’s almost like he knew too much. He had almost too grand a vision in mind. Einstein subtracts from that and says, “No. Space and time are just different than what we thought, and here’s the correct picture.” There’s a paper in, I think it’s 1909, where Poincaré still has this dynamical picture of what’s going on with the length contraction. This is just not necessary. This is a mistake from the modern point of view.

Why is he doing this? Why is he clinging onto this idea? I don’t know. I’ve obviously never met the man. It would be fascinating to be able to talk it over and try and understand. His expertise seems to be getting in the way. He knows so much, he understands so much, and then he’s not able to let go of these things.

A really interesting fact is that a few years prior, in the 1890s, Einstein’s a teenager and he believes in the ether too. He knows about this stuff. But he’s not quite as attached as these older people were. Maybe they were a little bit prisoners of their own expertise. That’s my guess. Some historians of science would certainly disagree.

Dwarkesh Patel

Then there’s the obvious stories where Einstein himself later on is said to have not latched onto the correct interpretations of quantum mechanics or cosmology because of his own attachments.

Michael Nielsen

Yeah.

Dwarkesh Patel

Here’s the bigger question I have. The muon example is a great example of these long verification loops and how progress seems to happen in the scientific community faster than these verification loops imply. Maybe the clearest example is Aristarchus, who in the third century BC comes up with the idea of heliocentrism. The ancient Athenians dismiss it on the grounds that if the Sun really is the center of the solar system, then as the Earth moves around the Sun we should see the stars move relative to the Earth. The only reason that would not be the case is that the stars are so far away that you would not observe this.

And it’s only in 1838 that stellar parallax was actually measured. And so, we didn’t need to wait until 1838 to have heliocentrism. We didn’t need to wait for the experimental validation to understand that Copernicus is better in some way. In fact, when Copernicus first came up with his theories, it’s well known that the Ptolemaic model was more accurate because it had centuries of adding on these epicycles.

What’s maybe less well appreciated is that the Ptolemaic model was also in some sense simpler, because Copernicus actually had to add extra epicycles. His model had more epicycles than the Ptolemaic model because he had this bias that the Earth should go in a perfect circle in equal time. Anyway, I think this is an interesting story because the Copernican theory is not more accurate, and it’s not simpler. So how could you have known ex ante that Copernicus was correct and Ptolemy was not?

Michael Nielsen

Good question. I don’t entirely know the answer. I can give you a partial answer that I, centuries in the future, start to find very compelling. I’m sure it’s part of the historical story at least. One of the big shocks with Newton is that he did eventually understand Kepler’s laws of planetary motion, so he’s able to explain the motions of the planets in the sky. But he also, out of the same theory, his theory of gravitation, was able to explain terrestrial motion. He’s able to explain why objects move in parabolas on the Earth, and he’s able to explain the tides in terms of the moon and the sun’s gravitational effect on water on the Earth.

You have what seem like three very different disconnected phenomena all being explained by this one set of ideas. That starts to feel very compelling, at least to me. I think most people find that very satisfying once they eventually realize it.

00:17:51 – Newton was the last of the magicians

Dwarkesh Patel

Have you read the Keynes biography of Newton?

Michael Nielsen

He wrote an entire biography?

Dwarkesh Patel

No, the essay.

Michael Nielsen

Sure. I love that. This description of him as the last of the magicians is wonderful.

Dwarkesh Patel

In fact, I think it’s maybe worth superimposing. Or you should read out that one passage of the thing.

Michael Nielsen

Alright. It’s from a talk that he gave at Cambridge not long before he died. He’d acquired Newton’s papers somehow and gave a lecture twice about this, or his brother Geoffrey gave it the other time because he was too ill. There’s this wonderful, wonderful quote in the middle. The whole thing is really interesting, but I love this particular quote: “Newton was not the first of the age of reason. He was the last of the magicians, the last great mind which looked out on the visible and intellectual world with the same eyes as those who began to build our intellectual inheritance rather less than ten thousand years ago.”

This idea people have that Newton was the first modern scientist is somehow wrong. There’s some truth to it, but he really had this very different way of looking at the world that was part superstitious and part modern. It was a funny hybrid. He’s a transitional figure in some sense. That phrase, “the last of the magicians,” really points at something.

Dwarkesh Patel

The thing I’m very curious about with Newton is whether it was the same program, the same heuristics, the same biases that he applied to his alchemical work as he did to his understanding of astronomy. This is from the Keynes essay: “There was extreme method in his madness. All his unpublished works on esoteric and theological matters are marked by careful learning, accurate method, and extreme sobriety of statement. They are just as sane as the Principia if their whole matter and purpose were not magical. They were nearly all composed during the same 25 years of his mathematical studies.”

Clearly, there was some aesthetic that motivated people like Einstein to reject earlier ways of thinking and say, “No, the ether is wrong, and there’s a better way to think about things.” The same is true with Newton. The question I have is whether similar heuristics toward parsimony, aesthetics, and so on, would be equally useful across time and across disciplines, or whether you need different heuristics. The reason that’s relevant is even if we can’t build a verification loop for science, maybe if the taste tests point in the same direction, you can at least encode that bias into the AIs. That would maybe be enough.

Michael Nielsen

The point is that where we always get bottlenecked is where the previous processes and heuristics don’t apply. That’s almost definitionally what causes the bottlenecks. Because people are smart, they know what has worked before. They study it. They apply the same kinds of things, so they don’t get stuck in the same places as before. They keep getting bottlenecked in different places. I’m overgeneralizing a bit, but I think it’s right.

If you’re attempting to reduce science to a process, you’re attempting to reduce it to something where there is just a method which you can apply, and you turn the crank and out pops insight. You can do a certain amount of that, but you’re going to get bottlenecked at the places where your existing method doesn’t apply. Definitionally, there’s no crank you can turn. You need a lot of people trying different ideas. The more difficult the idea is to have, the greater the bottleneck, but then also the greater the triumph.

Quantum mechanics is a great example of this. It’s such a shocking set of ideas. It’s such a shocking theory. The theory of evolution in some sense is also quite a shocking idea, not the principle of natural selection, but that it can explain so much. That’s a shocking idea.

00:23:26 – Why wasn’t natural selection obvious much earlier?

Dwarkesh Patel

Principia Mathematica is released in 1687. The Origin of Species is released in 1859. At least naively, it seems like Darwin’s theory of natural selection is conceptually easier than the theory of gravity.

I asked Terence Tao this question. There was this biologist contemporary with Darwin, Thomas Huxley, who read it and said, “How extremely stupid to not have thought of this.” Nobody ever reads the Principia Mathematica and thinks, “God, why didn’t I beat Newton to the punch here?” So what’s going on here? Why did Darwinism take so much longer?

Michael Nielsen

The idea must have been known to animal breeders for a long time at some level, or certainly large chunks of the idea were known, that artificial selection was a thing. In some sense, Darwin’s genius wasn’t in having that idea, it was understanding just how central it was to biology. You can go back and explain a tremendous amount about all the variety of what we see in the world with this as not necessarily the only principle, but certainly a core principle. He writes this wonderful book, The Origin of Species. It’s just so much evidence and so many examples, trying to tease this out and see what the implications are, and connecting it to as much else as he possibly can, to geology and all these other things.

That hard work—making the case that it’s actually relevant all across the biosphere—is what he’s doing there. He’s not just having the idea, he’s making a compelling case that it’s intertwined with absolutely everything else.

Dwarkesh Patel

The motivation for the question was Lucretius, this first-century BC Roman poet who has an idea that seems analogous to natural selection. It’s about species getting fitted more over time to their environments, or species losing fit to their environment. And so, why did this go nowhere for nineteen centuries?

Then I looked into it or, more accurately, asked LLMs what exactly Lucretius’s idea here was. It is extremely different from what real natural selection is. He thought there was this generative period in the past where all the species came about, and then there was this one-time filter which resulted in the species that are around today, and they became fit to the environment.

He did not have this idea that it is an ongoing gradual process or that there is a tree of life that connects all life forms on Earth together, which, by the way, is an incredibly weird fact that every single life form on Earth has a common ancestor.

Michael Nielsen

It’s not incredibly weird. If you think that the origin of life must have been very hard, that there’s a bottleneck there, then it’s not so surprising.

Dwarkesh Patel

There’s also this verification loop aspect where even if Newton might be harder in some sense, if you’ve clinched it, you can experimentally… I know “validate” is the wrong word philosophically, but you can give a lot of Bayes points to the theory.

You can be like, “Okay, I have this idea of why things fall on Earth. I have this idea of why orbital periods for planets have a certain pattern. Let’s try it on the Moon, which orbits the Earth.” And in fact, it’s weird but the orbital period matches what my calculations imply.
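The Moon check described here can be sketched numerically. This is a minimal version in the spirit of that test, not a reconstruction of Newton's own calculation: it assumes a circular orbit, scales surface gravity by the inverse square of distance, and uses modern rounded values for the Earth's radius and the Earth-Moon distance.

```python
import math

G_SURFACE = 9.81           # gravitational acceleration at Earth's surface, m/s^2
EARTH_RADIUS_M = 6.371e6   # mean Earth radius, m (modern rounded value)
MOON_DISTANCE_M = 3.844e8  # mean Earth-Moon distance, m (modern rounded value)


def orbital_period_days(orbit_radius_m):
    """Period of a circular orbit if surface gravity falls off as 1/r^2."""
    # Inverse-square law: the same pull that drops an apple, diluted by (R/r)^2.
    g_at_orbit = G_SURFACE * (EARTH_RADIUS_M / orbit_radius_m) ** 2
    # Circular motion: g_at_orbit = (2*pi/T)^2 * r, solved for T.
    period_s = 2 * math.pi * math.sqrt(orbit_radius_m / g_at_orbit)
    return period_s / 86_400


moon_period_days = orbital_period_days(MOON_DISTANCE_M)
print(f"predicted lunar period: {moon_period_days:.1f} days")
```

The prediction lands within about a day of the observed sidereal month of roughly 27.3 days, using nothing but terrestrial gravity and the distance to the Moon.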

Michael Nielsen

And the tides work correctly. It’s just amazing.

Dwarkesh Patel

Exactly. Whereas for Darwinism, it takes a ton of work for Darwin to compile all the cumulative evidence, but there’s no individual piece that is overwhelmingly powerful.

Michael Nielsen

And there’s a whole bunch of problems as well. He doesn’t really understand what the mechanism is. He doesn’t understand genes, all these things.

Dwarkesh Patel

The very interesting thing in the history of Darwinism is, this idea which theoretically you could come up with at any time, there is almost identical independent creation of that idea between Alfred Wallace and Charles Darwin. So much so that I think Wallace sends his manuscript to Darwin and is like, “What do you think of this idea?” And Darwin’s like, “Fuck.”

Michael Nielsen

I don’t think that’s an exact quote, but it’s pretty much correct.

Dwarkesh Patel

They end up presenting their ideas together in the spirit of sportsmanship. Why was this period in the 1850s or 1860s the right time for these ideas to form? You can come up with different ideas. One is geology. In the 1830s, Charles Lyell figures out that there’s been millions and billions of years of time that’s existed on Earth. The paleontology shows you that fossils have existed for that entire time. Life goes back a long way. In fact, you can even find fossils for intermediate species that show you the tree of life. Between humans and other apes as well, there’s intermediate humans.

There’s also the age of colonization, and we have all these voyages doing biogeography. That all must have been necessary. In fact, there’s a huge history of parallel innovation and discovery in the history of science. So maybe it is another piece of evidence that more had to be in place for a given idea to be discovered. Because if it’s not discovered for a long time and then spontaneously many different people are coming up with it, that shows you that the building blocks were in some sense necessary.

Michael Nielsen

This example of Lyell and other geologists in the early 1800s having this idea of deep time does seem to have been crucial. I know Darwin was very influenced by Lyell. If you don’t have at least tens or hundreds of millions of years, evolution starts to look like a non-starter.

In order to make it work on a timescale of 5,000 to 10,000 years, or 6,000 years with Bishop Ussher, you would need to see evolution occurring at a massive rate during human lifetimes, and we’re just not seeing that. That does seem to have been a blocker. To your question of what other blockers were there, were there any others? I don’t know.

Dwarkesh Patel

Or how much earlier could you, in principle, have come up with it if you were much smarter?

00:29:52 – Could gradient descent have discovered general relativity?

Michael Nielsen

Let’s go back and zoom out to your original question about the verification loop in AI. An example that should give you pause there is the big signature success so far, which is certainly AlphaFold. AlphaFold really isn’t about AI. A massive fraction of the success there is the Protein Data Bank. It’s X-ray diffraction, NMR, cryo-EM, and the several billion dollars that were spent obtaining those 180,000-odd protein structures.

It’s basically the story of how we spent many decades obtaining protein structure just by going out and looking very hard at the world experimentally, and then we fitted a nice model at the end of it, which was a tiny fraction of the entire investment. That’s a story of data acquisition principally. The AI bit is very impressive and quite remarkable, but it is only a small part of the total story.

Dwarkesh Patel

AlphaFold is very interesting, and philosophically I wonder what you think of it as a scientific theory or explanation. I guess over time the world is becoming harder to understand… As I’m saying things, because you’re such a careful speaker, I say a phrase and wonder if you’ll actually buy that premise.

But in some domains, we need to fit models to things rather than coming up with underlying principles that explain a broad range of phenomena. Compare the theory of general relativity, or any theory which just nets out to some equations, versus AlphaFold, which is encoding these different relationships between things we can’t even interpret over 100 million parameters.

Are those really the same thing? GR can predict things you could have never anticipated or it was never meant to do, like why Mercury’s orbit precesses. AlphaFold is not going to have that kind of explanatory reach. I want to get your reaction to that.

Michael Nielsen

I think it’s an incredibly interesting question. Maybe a really pivotal question. If you take a very classic point of view, you want these deep explanatory principles. You want as few free parameters as you possibly can. You want very simple models which explain a lot, and AlphaFold doesn’t look anything like that. You might just say, “It’s nice and maybe helpful as a model, but it’s not a scientific explanation.” That’s a conservative point of view, answer one to the question.

Answer two is to say maybe you shouldn’t think about AlphaFold as an explanation in the classic sense, but maybe it contains lots of little explanations inside it. Part of what you can get out of interpretability work is you can go into AlphaFold and start to extract certain things. Maybe by doing an archeology of AlphaFold, we can actually understand a great deal more about these principles. You can start to extract that a certain circuit does this interesting thing, and we learn from it.

I don’t know to what extent that’s been done with AlphaFold, but it’s been done a little bit with some of the chess models, like AlphaZero. There seem to be some strategies which were borrowed by Magnus Carlsen, which he seems to have just taken from AlphaZero. I don’t think there’s any public confirmation of this, but some experts have noticed that he changed his game quite radically after some public forensics were released on how AlphaZero worked. That’s an example where human beings are starting to extract meaning out of these models.

That leads to viewing the models as a potential source of explanations. You need to do more work because they’re not very legible up front, but you can potentially extract them. That’s an interesting intermediate situation where they’re not explanations themselves, but you can extract interesting explanations out of them and use them as a source.

The third and most interesting possibility is that they’re a new type of object. They should be taken very seriously as explanations, but where in the past we haven’t had the ability to really do anything with them, now we have interesting new actions we can do. We can merge them, we can distill them. It’s a big opportunity in the philosophy of science.

There’s an anticipation of this in some way. Some mathematicians and physicists work today… Historically, if you had a 100-page equation—which is the kind of thing that does come up—there’s just nothing you can do if it’s 1920. At that point, you give up on the problem. But today, with tools like Mathematica, you can just keep going. That’s an object now, a thing that you can work with. There are examples where people work with these things that formerly were regarded as too complicated, and sometimes they get simple answers out the end. That’s just an intermediate working state.

So I wonder if something similar is going to happen in this case, where you could take these models and use them in a similar way that people do with Mathematica, and take them seriously. They’re not explanations in the classic sense, but they’ll be something else which interesting operations can be done on.

Dwarkesh Patel

The thing I worry about is, suppose it’s 1500 and you’re training a model on… This is a weird history where we developed deep learning before we had cosmology. Suppose we live in that world. You’re observing how the stars don’t seem to move. The planets have all these weird behaviors. Then you train a model on that, and you do some kind of interp on it trying to figure out what the patterns are.

You’d just be able to keep building on Ptolemy’s model. You’d see there’s another epicycle we didn’t notice. Parameters X to Y encode this epicycle, parameters whatever encode the next epicycle. If you were just trying to figure out why the solar system is the way it is from observational data, you could just keep adding epicycles upon epicycles, but it really took one mind to integrate it all in and say, “Here’s what makes more sense overall.”
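One way to make the "keep adding epicycles" worry concrete: a sum of uniformly rotating circles is mathematically a complex Fourier series, so any periodic path can be fit better and better by adding more circles. The sketch below is an illustration under stated assumptions, not Ptolemy's actual construction (which also used eccentrics and equants). It samples a Keplerian ellipse with an assumed eccentricity of 0.3 and fits it with the largest Fourier terms. All function names are hypothetical.

```python
import cmath
import math


def kepler_orbit(n_samples, eccentricity):
    """Positions (as complex numbers) on an ellipse, sampled uniformly in time."""
    points = []
    for n in range(n_samples):
        mean_anomaly = 2 * math.pi * n / n_samples
        ecc_anomaly = mean_anomaly
        for _ in range(50):  # Newton iteration on Kepler's equation
            residual = ecc_anomaly - eccentricity * math.sin(ecc_anomaly) - mean_anomaly
            ecc_anomaly -= residual / (1 - eccentricity * math.cos(ecc_anomaly))
        x = math.cos(ecc_anomaly) - eccentricity
        y = math.sqrt(1 - eccentricity ** 2) * math.sin(ecc_anomaly)
        points.append(complex(x, y))
    return points


def epicycle_coefficients(points):
    """Fourier coefficients: term k is a circle of radius |c_k| turning k times per period."""
    n = len(points)
    return {
        k: sum(z * cmath.exp(-2j * math.pi * k * i / n) for i, z in enumerate(points)) / n
        for k in range(-(n // 2), n // 2)
    }


def rms_error(points, coeffs, n_epicycles):
    """RMS distance between the orbit and its fit using the n largest circles."""
    n = len(points)
    largest = sorted(coeffs.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n_epicycles]
    total = 0.0
    for i, z in enumerate(points):
        approx = sum(c * cmath.exp(2j * math.pi * k * i / n) for k, c in largest)
        total += abs(z - approx) ** 2
    return math.sqrt(total / n)


orbit = kepler_orbit(128, eccentricity=0.3)  # ASSUMPTION: illustrative eccentricity
coeffs = epicycle_coefficients(orbit)
errors = [rms_error(orbit, coeffs, m) for m in (1, 2, 3, 5, 8)]
for m, err in zip((1, 2, 3, 5, 8), errors):
    print(f"{m} circles: rms error {err:.5f}")
```

The error keeps shrinking as circles are added, yet nothing in the fitted coefficients announces "this is an ellipse with the Sun at one focus." That is the sense in which a purely error-driven fit can improve indefinitely without switching to the more parsimonious description.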

Michael Nielsen

This is to my point that we don’t really understand what to do with the models. We don’t have the verbs yet. It is certainly interesting to think about the question where you start to apply constraints to the models, essentially saying, “What’s the simplest possible explanation?” Or, “Can you simplify? Can you give me the 90/10 explanation?” And go further and further in boiling it down.

It might be that indeed they start out by providing a very, very complicated, many-parameter model. But you can just force the case, and basically that’s scaffolding, which maybe is the very early days of their attempt to understand something. They’re forced through that to a much more simple understanding.

Dwarkesh Patel

Sorry for misunderstanding, but it sounds like you’re saying maybe there’s some regularizer or some distillation you could do of a very complicated model that gets you to a truer, more parsimonious theory. Take Ptolemy versus Copernicus. You start off with lots of Ptolemy epicycles, and then you try to distill this model, and maybe it gets rid of some of the epicycles that are less and less necessary to get the mean squared error of the orbits to match.

But at some point it has to do this thing which is to switch two things. Locally, it actually doesn’t make things more accurate. It’s in a global sense that it’s a more progressive theory. There’s some process which obviously humanity did over its span, which did that regularization or did that swap. But with raw gradient descent, I don’t really feel like it would do that.

Michael Nielsen

Think about the example of going from Newtonian gravity to Einstein’s general theory of relativity. These are shockingly different theories, and the question is what causes that flip. As nearly as I understand the history, what goes on is Einstein develops special relativity and pretty much straight away he understands. It’s a very obvious observation. In special relativity, influences can’t propagate faster than the speed of light, and in Newtonian gravity, action is at a distance.

Straight away in special relativity, you could use Newtonian gravity to do faster-than-light signaling. You could send information backwards in time. You could do all kinds of crazy stuff. It’s not a big leap to realize we have a big problem here. That’s the forcing function there. You’ve realized that your old explanation is not sufficient. You need something new.

Then you’re going to start by doing the simplest possible stuff. It just turns out that a lot of that stuff doesn’t work very well, so you’re forced to go through these steps where gradually it gets more complicated, and it’s wrong in a variety of ways. The final theory appears shockingly simple and beautiful, but it’s gone through some somewhat ugly intermediate stages.

Dwarkesh Patel

If you’re thinking about what it looks like to have AI accelerate science, there’s one for well-understood domains where we just want local solutions, like how does this protein fold. We just train a raw model using gradient descent. Then there’s things like coming up with general relativity, where you couldn’t really just train on every single observation in the universe and hope that general relativity pops out.

What would it require? It also certainly wasn’t immediately discovered. It was decades of thought. You’d need independent research programs where people start off with these biases, where Einstein is initially motivated by this thought experiment of whether you can distinguish the effect of gravity from just being accelerated upwards. You just need different AI thinkers to start off with these initial biases and see what can germinate out of them. The verification loop for that might be quite long, but you just need to keep all those research programs alive at the same time.

Michael Nielsen

This point you make about keeping all the different research programs alive, I think that is very important and central. A great example is situations where the same answer has been correct in some circumstances and wrong in other circumstances.

The planet Uranus was not in quite the right spot, and people famously predicted the existence of Neptune on this basis. Wonderful, massive success for Newtonian gravity. The planet Mercury is not in quite the right spot. You predict the existence of some other distorting planet. It turns out that doesn’t exist. Actually, the reason Mercury is not in the right spot is because you need general relativity.

You’ve pursued very similar ideas, and it’s been very successful in one case, and it’s been completely and utterly unsuccessful in the other case. A priori, you can’t tell which of these is the thing to do, and you actually need to do both. This is certainly very true in the history of science.

This kind of diversity, where you just have lots of people go off and pursue lots of potentially promising ideas, you just need to support that for a long time. It’s hard to do that for a variety of reasons, but it does seem to be very, very important.

Dwarkesh Patel

This example of Uranus versus Mercury is very interesting. I think it illustrates the difficulty with falsificationism. The orbit of Uranus is in some sense falsifying Newtonian mechanics. But then you make some ancillary prediction that says, “Oh, the reason this is happening is there must be another planet which is perturbing Uranus’s orbit.” I think it’s Le Verrier in 1846. “Point a telescope in the right direction, you find Uranus.”

Michael Nielsen

Neptune.

Dwarkesh Patel

Sorry. Neptune, yes. But with Mercury, it’s observed that the ellipse which forms its orbit is rotating 43 arcseconds more every century than Newtonian mechanics would imply, so people say that there must be a planet inside Mercury’s orbit. They call it Vulcan and point the telescopes. It’s not there.
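The 43 arcseconds figure can be recovered from the standard general-relativistic formula for perihelion advance per orbit, 6πGM / (c² a (1 − e²)). The sketch below is a back-of-the-envelope check using rounded modern orbital elements for Mercury. The constants are textbook values and the function name is hypothetical.

```python
import math

GM_SUN = 1.32712440018e20  # gravitational parameter of the Sun, m^3/s^2
C = 299_792_458.0          # speed of light, m/s
SEMI_MAJOR_AXIS_M = 5.7909e10   # Mercury's semi-major axis, m (rounded)
ECCENTRICITY = 0.2056           # Mercury's orbital eccentricity (rounded)
PERIOD_DAYS = 87.969            # Mercury's orbital period, days (rounded)
DAYS_PER_CENTURY = 36_525.0     # Julian century


def gr_precession_arcsec_per_century():
    """Extra perihelion advance predicted by general relativity."""
    # Advance per orbit, in radians.
    per_orbit_rad = 6 * math.pi * GM_SUN / (
        C ** 2 * SEMI_MAJOR_AXIS_M * (1 - ECCENTRICITY ** 2)
    )
    orbits_per_century = DAYS_PER_CENTURY / PERIOD_DAYS
    arcsec_per_rad = 180 * 3600 / math.pi
    return per_orbit_rad * orbits_per_century * arcsec_per_rad


mercury_precession = gr_precession_arcsec_per_century()
print(f"GR perihelion advance: {mercury_precession:.1f} arcsec/century")
```

The result comes out at roughly 43 arcseconds per century, with no hidden planet required.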

But if you’re a proper Newtonian, what you do is say, “Well, maybe there’s some cosmic dust that’s occluding this planet, or maybe the planet is so small we can’t see it, or let’s build an even more powerful telescope, or maybe there’s some magnetic field which is occluding our measurement.” At any one of these steps—

Michael Nielsen

And this happens over and over. There are just so many stories which are exactly like this. An example I love from the 1990s. Some people noticed that the Pioneer spacecraft weren’t quite where they were supposed to be.

You can get very excited about this. “Oh my goodness, general relativity is wrong. Maybe we’re going to discover the next theory of gravity.” Today the accepted explanation is that there’s just a slight asymmetry in the spacecraft. It turns out that the thermal radiation is slightly larger in one direction than the other, and that’s causing a tiny little acceleration towards the sun. Most of the time when there’s these apparent exceptions, it’s just something like that going on.

It’s very much like the Mercury-Vulcan case. But every once in a while, it’s not. A priori, you can’t distinguish these. Science is just full of these. It’s funny too, the way we tell the history of science, it sounds so simple. You just focus on the right exception and you realize that you need to throw out the old theory and lo and behold, your Nobel Prize awaits. But in fact, these exceptions are all over the place. 99.9% of the time, it just turns out to be some effect like this thermal acceleration in the case of the Pioneer spacecraft. Unfortunately, there’s a lot of selection bias going into those stories.

Dwarkesh Patel

The thing is there’s no ex ante heuristic which tells you which case you’re in. To spell out why I think this is important, some people have this idea that AI is going to make disproportionate progress towards science because it makes disproportionate progress towards domains where there’s tight verification loops. It’s really good at coding because you can run unit tests.

Science may be similar because you can run experiments. What that doesn’t appreciate is that there’s an infinite number of theories that are compatible with any given experiment. Over time, why we latch onto the one we think is more correct in retrospect is, as we’re discussing, hard to articulate.

Lakatos has all kinds of interesting examples in the book about these hostile verification loops that are extremely long-lasting. One he talks about is Prout, a chemist who in 1815 hypothesizes that all atoms must have whole number weights, that they’re basically all made of hydrogen. The reason he thinks this is that if you look at the measured weights of all elements, it does seem that almost all of them have whole number weights. But then there are some exceptions. For example, chlorine comes out at 35.5.

So then there’s all these ad hoc theories that people in this school keep coming up with, like, “Oh, maybe there’s chemical impurities.” But there’s no chemical reaction you can do which seems to get rid of this. Maybe it’s fractions of whole numbers, so 35.5 can be halves. But actually, if you measure chlorine even closer, it’s 35.46, so it’s getting further away from the correct fraction. Later on, what is discovered is what you’re actually measuring is different isotopes, which cannot be chemically distinguished. They can only be physically distinguished.
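The isotope resolution is easy to check with arithmetic: the measured atomic weight of chlorine is an abundance-weighted average of two isotopes whose individual masses are each close to a whole number. The sketch below uses modern rounded isotope masses and natural abundances, which are standard reference values rather than numbers from the conversation.

```python
# (mass in atomic mass units, natural abundance as a fraction)
CHLORINE_ISOTOPES = {
    "Cl-35": (34.96885, 0.7576),
    "Cl-37": (36.96590, 0.2424),
}


def average_atomic_weight(isotopes):
    """Abundance-weighted mean mass of an element's isotopes."""
    total_abundance = sum(abundance for _, abundance in isotopes.values())
    weighted_mass = sum(mass * abundance for mass, abundance in isotopes.values())
    return weighted_mass / total_abundance


chlorine_weight = average_atomic_weight(CHLORINE_ISOTOPES)
print(f"average atomic weight of chlorine: {chlorine_weight:.2f}")
```

Each isotope sits near a whole number, as Prout's school expected, but the chemically inseparable mixture averages to a value near 35.45, which is why more careful chemistry only made the anomaly look worse.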

So you have 85 years before we realize what an isotope is, where the verification loop is actively hostile against the correct theory. You just need this remnant to be defending… There’s no ex ante reason it’s the preferred theory. As a community, we should just have people try to integrate new observations, even if they don’t seem to fit their school of thought, and hopefully enough of that happens… Anyways, I guess the thing I’m trying to articulate is the difficulty with automating science.

Michael Nielsen

The question is, where is the bottleneck at some level? Are we primarily bottlenecked on one type of thing, or are we bottlenecked on multiple types of things? Certainly, talking to structural biology people, they seem to think that AlphaFold was an enormous advance. It was a shock.

At some level, yes, AI can certainly help us speed up science. It is helping with a certain type of bottleneck. That doesn’t mean though, as you’re saying, that it’s necessarily going to help with all kinds of bottlenecks. I suppose the question you’re pointing at is, what are the types of bottlenecks that remain, and what are the prospects for getting past them?

Even in the case of coding, it’s really interesting talking to programmer friends. At the moment they’re all in this state of shock and high excitement, and they’re all over the place. You do wonder where the bottleneck is going to move to. Certainly, one thing that a lot of them seem to be bottlenecked on now is having interesting ideas, and in particular, having interesting design ideas. There’s not really a verification loop for knowing that a design idea is very interesting.

They’re no longer nearly as bottlenecked by their ability to produce code, but they are still bottlenecked by this other thing. Formerly, they weren’t bottlenecked on it because just writing code took so much of their time. They could have lots of ideas while they were taking three weeks to implement their prototype, and then they would implement the next version. Now they’re taking three hours to implement the prototype, and they don’t have as good ideas after that, from a design point of view.

00:50:54 – Why aliens will have a different tech stack than us

Dwarkesh Patel

You have a very interesting take. I think it was a footnote in one of your essays, and I couldn’t find it again, which was that it’s very possible that if we met aliens, they would have a totally different technological stack than us. That contradicts a common assumption I had that I never questioned, which is that science is this thing you do relatively early on in the history of civilization. You get to a point and you have a couple hundred years of just cranking through the basics, understanding how the universe works, and you’ve got it. You’ve got science. Then everybody would converge on the same “science.” I found that a very interesting idea, and I want you to say more about it.

Michael Nielsen

The idea there that I’m at least somewhat attached to is that the tech tree or the science and tech tree is probably much larger than we realize. We’re in this funny situation. People will sometimes talk about a theory of everything as a potential goal for physics, and then there’s this presumption that physics is done once you get there. Of course, this is not true at all.

If you think about computer science, computer science started in the 1930s when Turing and Church and so on laid down what the theory of everything was. They just said, “Here’s how computation works.” We’ve spent ninety-odd years since then exploring the consequences of that and gradually building up more and more interesting ideas. Those ideas, to some extent, you can regard as technology. But insofar as they’re discovered principles inside that theory of computation, I think they’re best regarded as science and in some cases, very fundamental science.

Ideas like public-key cryptography are incredibly deep, very non-obvious ideas which lay hidden already in the 1930s. My expectation is that there will be different ways of exploring this tech tree, and we’re still relatively low down. We’re still at the point where we’re just understanding these basic fundamental theories, and we haven’t yet explored them.

A thing which I think is quite fun is if you look at the phases of matter. When I was in school, we’d get taught that there are three phases of matter, or sometimes four or five, depending on what you included. As an adult, as a physicist, you start to realize we’ve been adding to this list. We’ve got superconductors and superfluids, and maybe different types of superconductors, and Bose-Einstein condensates, the quantum Hall systems, fractional quantum Hall systems, and so on. It’s starting to turn out there’s a lot of phases of matter to discover, and we’re going to discover a lot more of them. In fact, we’re going to be able to start to design them in some sense. We’ll still be subject to the laws of physics, but there is this tremendous freedom in there.

This looks to me like we’re down at the bottom of the tech tree. We’ve barely gotten started there, and I expect that to be the case broadly. Certainly, programming is a very natural place to look. The idea that we’ve discovered all the deep ideas in programming just seems obviously ludicrous. We keep discovering what seem like deep, new, fundamental ideas. We’re very limited. We’re basically slightly jumped-up chimpanzees, so we’re slow and it’s taking us time. But what do we look like another million years in the future, in terms of all the different ideas people have had around how to manipulate computers and information? I think we’re likely to discover that there are a lot of very deep ideas still to be discovered.

I think it was Knuth in the preface to The Art of Computer Programming who says something like this. He started this book back in the sixties. He talked to a mathematician who was a bit contemptuous and said, “Look, computer science isn’t really a thing yet. Come back to me when there’s a thousand deep theorems.” Knuth remarks, writing the preface decades later, “There clearly are a thousand deep theorems now.”

It’s really interesting to think what the long-term future is as you get higher and higher up in the tech tree, choices about which direction we go and how we choose to explore. It’s potentially the case that different civilizations or different choices mean we end up in different parts of that tree. In particular, there are just very basic things about how we’re very visual creatures, while certain other animals are much more aurally based. Does that bias the types of thoughts that you have? Then you extend it to much more exotic kinds of civilizations where maybe their biases in terms of how they perceive and manipulate the world are quite different than ours. That might make some significant changes in terms of how they do that exploration of the tech tree. It’s all speculation, obviously.

Dwarkesh Patel

This is such an interesting take. I want to better understand it. One way to understand it is that there might be some things which are so fundamental and have such a wide collision area against reality that every civilization is inevitably going to discover them, like general relativity.

Michael Nielsen

Numbers. Numbers. Of all the intelligences in the Milky Way galaxy… Maybe that number is one. Well, actually, arguably we’ve already increased the number. But of all of those, what fraction have the concept of counting? It does seem very natural. What fraction have discovered the idea of some kind of decimal place system? Interesting question. Maybe we’re missing something really simple and obvious that’s actually way better than that.

What fraction got there immediately? What fraction had to go through some other intermediate state? What fraction uses linear representations versus a two-dimensional or a three-dimensional representation? I think the answers to these questions are just not at all obvious. It’s a lot of design freedom.

Dwarkesh Patel

On theoretical computer science, this is going to be extremely naive and arrogant, but I took Scott Aaronson’s class on complexity theory, and I was by far the worst student he’s ever had. What I remember is there was this period, in which you were one of the pioneers, where we figured out the class of problems that quantum computers can solve and how it relates to problems that classical computers can solve. It was groundbreaking. It’s crazy that this works. Since then… There’s literally this website called Complexity Zoo which lists out all the complexity classes. If you have this complexity class with this kind of oracle, it’s equivalent to this other class. It feels like we’re building out that taxonomy.

There are a couple ways to understand what you’re saying. One, maybe you disagree with me that this is actually what’s happened with this field. Another is that while that might happen to any one field, who, other than Babbage, would’ve thought in 1880 that computer science was going to be a thing in the first place? We’re underestimating how many more fields there could be. Or maybe you think both, or maybe a third secret thing. I’d be curious.

Michael Nielsen

A very common argument here is the low-hanging fruit argument. The argument that says there should be diminishing returns.

Dwarkesh Patel

In fact, empirically we see this. The number of scientists in the world has increased exponentially.

Michael Nielsen

I think it’s worth thinking about why you expect diminishing returns and how well that argument actually applies in practice. An analogy I like is thinking about going to an event, like a wedding, and you go to the dessert buffet. They’ve put out thirty desserts. Naturally, the best desserts go first. We don’t quite have a well-ordered preference there, so maybe there’s some difference, but human beings are fairly similar, so the best desserts will go first. This is an argument for why you expect diminishing returns in a lot of different fields. If it’s relatively easy to see what’s available and people have similar preferences, then the best stuff goes first and it just gets worse and worse after that.

If you look at a very static snapshot in time of scientific progress, maybe there’s some truth to that. But if somebody is standing behind the dessert table and is replenishing and restocking the desserts and keeps adding new ones in, it may turn out that a little bit later, much better desserts appear, and you’re going to go and eat those instead.

Scientific progress has a little bit of that flavor. We go through these funny time periods. Computer science is a great example, where computer science basically arose as a side effect of some pretty abstruse questions in the philosophy of mathematics and logic. You’ve got these people trying to attack these rather esoteric questions that seem quite high up in exploration, and they discover this fundamental new field, and all of a sudden there’s an explosion there. The diminishing returns argument just didn’t apply there. We just weren’t able to see what was there.

This has been the case over and over again. New fields arrive and all of a sudden, boom, it’s easy to make progress again. Young people flood in because you can be twenty-one and make major breakthroughs rather than having to spend twenty-five years mastering everything that’s been done before. It’s obviously very attractive. I’m not sure anybody understands very well the dynamics of that, or how to think about why the structure of knowledge is that way, where these new fields keep opening up. But it does seem empirically to be the case.

Dwarkesh Patel

Despite the fact that that is the case… Take deep learning. Obviously, this is an example of a new field where twenty-one-year-olds can make progress and it’s relatively new. Fifteen years or so since it got back into high gear. But already we’re in a stage where you need billions, tens of billions, or hundreds of billions of dollars to keep making progress at the frontier.

There are a couple ways to understand that. One is that it actually is harder than the kinds of things the ancients had to do, or is more intensive at least. Second is it might not have been, but because our civilizational resources are so large, the number of people is so large, the amount of money is so large, we can basically make the kind of progress it would have taken the ancients forever to make almost immediately. We notice something is productive and immediately dump in all the resources. But it’s also weird that there aren’t that many fields like this. Deep learning feels notable precisely because it’s hard to think of other examples.

Michael Nielsen

I think that’s a consequence of the architecture of attention. At any given time, there’s always a most successful thing. If deep learning wasn’t a thing, maybe you’d be talking about CRISPR. Maybe we wouldn’t think about solving the protein structure prediction problem as a success of AI. Maybe we would have figured out how to do it with curve fitting, more broadly construed, and we’d just be like, “Wow, that took a lot of computing resources.” But protein structure prediction might be an enormously important thing.

There is always our biggest thing. What you’re pointing at is more a consequence of the way in which attention gets centralized. It’s basically fashion, is what I’m saying. It’s not just fashion, but there is some dynamic there.

Dwarkesh Patel

There’s a very interesting and important implication of this idea that the branching is so wide, so contingent, and so path-dependent that different civilizations would stumble on entirely different technology stacks. It implies that there will be gains from trade into the far, far future, which might actually be one of the most important facts about the far future in terms of how civilizations are set up, how they coordinate, and how they interface. There’s not this “go forth and exploit.” There are humongous gains from trade with adjacent colonies or whatever.

Michael Nielsen

Sort of. There’s a question of what’s actually hard. If it’s just the ideas, well, those spread relatively quickly. It’s relatively easy to share ideas. If it’s something more, it’s almost a Dan Wang kind of idea where there’s some notion of capacity. You need all the right techs, you need all the right manufacturing capacity, and so on.

So civilization A has a very different kind of manufacturing capacity, and it’s just not so easy to build in civilization B. Even if civilization B is ahead, I think that becomes true. There is a comparative advantage which is going to provide massive benefits to trade in both directions. Eventually, you expect some diffusion of innovation. It is funny to think about what the barriers are there.

A fun thought experiment I like to think about is GitHub but for aliens. Somebody presents you with all of the code from some alien civilization. I don’t even know what code means there, but their specification of algorithms. It would have many interesting new ideas in there, and it would take forever for human beings to dig through and try and extract all of those.

The origin of this for me was thinking about proteins in nature. We’ve been gifted this incredible variety of machines which we don’t really understand at all. We just have to go and try and understand them on a one-by-one basis. We’re still understanding hemoglobin and insulin and things like this. There are hundreds of millions of proteins known. So it is a little bit like that. We’ve been gifted by biology this immense library of machines, no doubt containing an enormous number of very interesting ideas, and we’re just at the very, very beginning of understanding it.

I suppose your point—I need to relabel your argument slightly—but you think of that as a gift from an alien civilization, which obviously it isn’t, but you think of it that way. And oh my goodness, there’s so much in there and we’re going to study it. Goodness knows how long we could continue to study it. There are tens of thousands of papers about hemoglobin and things like that, and we still don’t understand them, and yet we’re getting so much out of it. Just think about insulin alone. It’s such an important thing.

Dwarkesh Patel

That’s an incredibly useful intuition pump, that you have on Earth… I had Nick Lane on where he had this theory about how life emerged, but whatever theory you have, something like DNA has had four billion years. You have an alien civilization come here and be like, “There’s all these interesting things to learn about materials science.”

Michael Nielsen

Think about kinesin walking along. We know almost nothing about these proteins, and yet the tiny few facts we do know are just incredible. The ribosome is another example, this miraculous sort of device, a little factory.

Dwarkesh Patel

All seeded by this particular chemistry on Earth with nucleic acids and carbon-based life forms. That chemistry gives rise to all of these interesting things which an alien civilization would find very interesting. That very seed, which must be one among trillions of possible seeds of general intellectual ideas, leads to all this fecundity. That’s a very interesting intuition pump.

I want to meditate on this “gains from trade” thing because I feel like there’s something very interesting about this idea that if you have this vision of how technology progresses and how it may be different in different civilizations, it actually has important implications about how different civilizations might interact with each other. The fact that there are going to be these huge gains from trade.

Michael Nielsen

It makes friendliness much more rewarding?

Dwarkesh Patel

Yes. That’s a very important observation.

Michael Nielsen

I hadn’t thought about that at all. That is a very interesting observation. It is funny. Comparative advantage is something that people love to invoke and it’s a very beautiful idea obviously. There are limits to it. It’s a special limited model.

Chimpanzees can do interesting things, but we don’t trade with them. I think it’s interesting to think about the reasons why. Part of it is just power, I think. Once there’s a sufficiently large power imbalance, very often—not always, but very often—groups of people seem to shift into this other mode where they just seek to dominate. Maybe there’s something special about human beings, but maybe it’s also a more general thing. You need all these special things to be true before groups will trade. It’s not necessarily obvious.

Dwarkesh Patel

I think the big thing going on here is one, transaction costs. Two, comparative advantage does not tell you that the terms on which the trade happens are above subsistence for any given producer. People often bring this up in the context of, “Well, humans will be employed even in a post-AGI world because of comparative advantage.”

There are five different ways that argument breaks down, but the easiest way to understand it is: why don’t we have horses all around on the roads? After all, there’s some comparative advantage between cars and horses. One, there are huge transaction costs to building roads that are compatible with horses and cars at the same time. In a similar way, AIs that think at 1,000 times human speed and can shoot their latent states at each other are going to find that the cost of interacting with a human being in the supply chain far outweighs the benefit.

Second, just because horses have a comparative advantage mathematically does not mean that it is worth paying $100,000 a year, or whatever it costs to sustain a horse in San Francisco. That subsistence isn’t going to be worth the benefit you get out of the horse.
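The second point can be made concrete with a toy two-good model. This is a minimal sketch with entirely made-up numbers: the producer names, output rates, and subsistence cost below are hypothetical and are not figures from the conversation. It shows that a slow producer can hold a comparative advantage in a good and still be unable to earn its upkeep, because the fast producer will never pay more for that good than its own opportunity cost of making it.

```python
# Hypothetical output per hour for two goods, "x" and "y" (made-up numbers).
OUTPUT_PER_HOUR = {
    "fast": {"x": 1000.0, "y": 100.0},  # e.g. the car, or the AI
    "slow": {"x": 1.0, "y": 0.5},       # e.g. the horse, or the human
}

# Hypothetical upkeep of the slow producer, in units of x per hour.
SUBSISTENCE_COST = 20.0


def opportunity_cost_of_y(producer):
    """Units of x a producer gives up to make one unit of y."""
    rates = OUTPUT_PER_HOUR[producer]
    return rates["x"] / rates["y"]


fast_cost = opportunity_cost_of_y("fast")
slow_cost = opportunity_cost_of_y("slow")

# The slow producer gives up less x per unit of y, so it has the
# comparative advantage in y even though it is worse at both goods.
slow_has_comparative_advantage_in_y = slow_cost < fast_cost

# The fast producer will pay at most its own opportunity cost for y,
# which caps what the slow producer can earn by specializing in y.
best_price_for_y = fast_cost
slow_best_income = OUTPUT_PER_HOUR["slow"]["y"] * best_price_for_y

trade_covers_subsistence = slow_best_income >= SUBSISTENCE_COST

print("Slow producer has comparative advantage in y:",
      slow_has_comparative_advantage_in_y)
print("Best-case income (x per hour):", slow_best_income)
print("Trade covers subsistence:", trade_covers_subsistence)
```

With these numbers, trade is mutually beneficial in the textbook sense, yet even the slow producer's best-case income falls short of its upkeep. Transaction costs, which this sketch ignores, would only widen the gap.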

Michael Nielsen

I do think it’s interesting, the sheer fact… My expectation and my intuition obviously differ a great deal from yours on this. Most parts of the tech tree are never going to be explored. There are just too many interesting ways of combining things. There are too many deep ideas waiting to be discovered, and not only we, but nobody ever is going to discover most of them. So choices about how to do the exploration actually matter quite a bit.

It’s something I really dislike about technological determinist arguments. I’m willing to buy it low enough down when progress is relatively simple. But higher up, you start to get to shape the way in which you do the exploration. And it’s interesting, we are starting to shape it in interesting ways.

There are various technologies that have been essentially banned. You think about DDT, chlorofluorocarbons, restrictions on the use of nuclear weapons, the Nuclear Non-Proliferation Treaty. Those kinds of things weren’t done before the fact, but they’re starting to get pretty close in some cases, where we just preemptively decide, “Oh, we’re not going to go down that path.” So that starts to look like a set of institutions where we are actually influencing how we explore the tech tree.

Dwarkesh Patel

On where you would see these gains from trade, obviously you’d see the most where it’s pure information that could be sent back and forth, because the information has this quality where it is expensive to produce, but cheap to verify and cheap to send. It’ll be interesting how much of future productivity can be distilled down to information.

Right now, it’s hard to do. If China’s really good at manufacturing something, there’s this process knowledge that’s in the heads of 100 million people involved in the manufacturing sector in China. But in the future, it might be easier if AIs are doing it.

Michael Nielsen

The question is to what extent our fabrication gets very uniform and gets really commoditized. 3D printers have been the next big thing for at least 20 years now. Why do they still not work all that well? Why are they still not at the center of manufacturing, and what comes after that? It is funny to look at the ribosome by contrast, which really is at the center of biology in a whole lot of really interesting ways.

The question is whether the future of manufacturing is something very simple, where everything goes through a bioreactor or something like that. You send the information, and then you grow stuff, or you have some 3D printer that actually works. If they’re good enough, then it does become much more a pure information problem, and some of this process knowledge becomes much less important.

01:15:26 – Are there infinitely many deep scientific principles left to discover?

Dwarkesh Patel

Can I ask a very clumsily phrased question? There are these deep principles, a couple of which we’ve discovered. One is this idea that if there’s a symmetry across a dimension, it corresponds to a conserved quantity. It’s a very deep idea. There’s another, which you’ve written a lot about (written a textbook about, in fact), about ways to understand what kinds of things you can compute, what kinds of physical systems you can understand with other physical systems, what a universal computer looks like, et cetera.

Is your view that if you go down to this level of idea of Noether’s theorem or the Church-Turing principle, that there’s an infinite number of extremely deep such principles? Because I feel what makes them special is that they themselves encompass so many different possible ways the world could be. But no, the world has to be compatible with a couple of these very deep principles.

Michael Nielsen

I don’t know. All I have here is speculation and instinct. My instinct is that we keep finding very fundamental new things. It was quite formative for me to understand, as I gave the example before, these wonderful ideas of Church and Turing and these other people about universal programmable devices. Then you understand later, this also contains within it the ideas of public-key cryptography. Then you understand later, that also contains within it the ideas people refer to as cryptocurrency.

There’s a very deep set of ideas there about the ability to collectively maintain an agreed-upon ledger, which is built upon this. It’s taken many years to figure out the right canonical form of those. Just this fact that you keep finding what seem like deep new fundamental primitives has been a very important intuition pump for me. I’ve given that particular example, but I think you see that same pattern in a lot of different areas.

Dwarkesh Patel

What is your interpretation then of this empirical phenomenon where whatever input you consider into the scientific process or technological progress… Economists have studied this a million ways. It just seems to require a very consistent rate of X percent more researchers per year. There’s this famous paper from a couple years ago by Nicholas Bloom and others where they say, “How many people are working in the semiconductor industry, and how has it increased over time through the history of Moore’s law?” I think they find that Moore’s law means transistor density increases 40% a year, but to keep that going, the number of scientists in the semiconductor industry has had to increase 9% a year. They go through industry after industry with this observation.

Is your view that there are these deep ideas, but they keep getting harder to find? Or is there another way to think about what’s happening with these empirical observations?
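To see what those two growth rates imply together, here is a minimal sketch. It assumes a simplified accounting in which research productivity is the output growth rate divided by the number of researchers, and it takes the 40% and 9% figures exactly as recalled in the conversation rather than from the paper itself. The function and constant names are hypothetical.

```python
import math

OUTPUT_GROWTH = 0.40      # transistor density growth per year, as quoted above
RESEARCHER_GROWTH = 0.09  # researcher headcount growth per year, as quoted above


def research_productivity(years, initial_researchers=1.0):
    """Output growth per researcher after a given number of years.

    Assumes output growth stays constant while headcount compounds annually.
    """
    researchers = initial_researchers * (1 + RESEARCHER_GROWTH) ** years
    return OUTPUT_GROWTH / researchers


def relative_productivity(years):
    """Research productivity relative to year 0."""
    return research_productivity(years) / research_productivity(0)


# Years until each researcher contributes half as much to the same growth rate.
halving_time = math.log(2) / math.log(1 + RESEARCHER_GROWTH)

print(f"Productivity halves roughly every {halving_time:.1f} years")
for years in (10, 25, 50):
    print(f"After {years} years: {relative_productivity(years):.3f} of the initial level")
```

Under these assumptions, a constant growth rate in output is bought with exponentially more researchers. That is the sense in which ideas appear to be getting harder to find.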

Michael Nielsen

First of all, all of their examples are narrow. They pick a particular thing, and then they look at a particular metric. GPUs don’t show up there. All of a sudden you get this ability to parallelize, and that’s really interesting. There are a lot of external consequences. Basically they have these simple quantitative measures. They look at it in agricultural productivity. They look at it in a whole lot of different ways, but you do have to focus narrowly.

I’m certainly interested in the fact that new types of progress keep becoming possible. But I think even there, there does still seem to be some phenomenon of diminishing returns. Is that intrinsic? Is that something about the structure of the world? What is it? One thing which hasn’t changed that much is the individual minds which are doing this kind of work. Maybe those should be improved as well, or maybe there’s some feedback process going on there. Maybe that changes the nature of things.

I look at scientific progress up until, let’s say, 1700, and it was very slow, and also very irregular. You had the Ionians back five centuries before Christ doing these quite remarkable things, and so much knowledge would get lost, and then it would be rediscovered, and then it would be lost again. You’d have to say that progress was very slow. It’s partially just bound up with the fact that there were some very good ideas that we just didn’t have.

Even once you’ve had the ideas, you need to build institutions around them. You actually need to solve a whole lot of different problems about training, allocation of capital, and all these kinds of things. Even just basic security for researchers, so they’re not worried about the Inquisition or things like that. There are all these complicated problems. You solve all those complicated problems, and then all of a sudden, boom, there’s a massive burst of scientific progress.

If there’s some kind of stagnation, if you’re not changing those external circumstances, yes, you may start to get diminishing returns again. But that doesn’t mean there’s anything intrinsic about the situation. Maybe something external needs to change again. Obviously, a lot of people think AI is potentially going to be a driver. It certainly will at some level.

To that extent, you can think of a lot of modern scientific instrumentation as really, at some level, robots. What is the James Webb Space Telescope? It’s unconventional maybe to describe it as a robot, but it’s not completely unreasonable either. It is an example of a highly automated, very sophisticated system with electronically mediated sensors and actuators, where machine learning is being used to process the data. In that sense, we’re already starting to see that transition. We’ve been seeing it for decades.

Dwarkesh Patel

I have this “smoke a joint and take a puff” thought, which—

Michael Nielsen

I think we’ve had a few.

Dwarkesh Patel

I think we’re getting to that part of the conversation, and then you can help me get my foot out of my mouth and figure out a more concrete way to think about it. To your point that there was the Industrial Revolution, the Enlightenment, and now there’s AI, and each might bring a different pace or a different way in which science happens. If you think about the pace of how fast such transitions have been happening, you can draw over the long span of human history this hyperbolic rate of growth that is increasing over time as well.

A hundred thousand years ago, you had the Stone Age. You go back even much further, how long have primates been around? It would be millions of years. A hundred thousand years ago, the Stone Age, then ten thousand years ago, the Agricultural Revolution, then three hundred years ago, the Industrial Revolution, each marked by this increase in the rate of exponential growth. Then people think it’s going to happen again with AI. But that would happen potentially even faster.

It would not have occurred to somebody at the beginning of the Industrial Revolution that the next demarcation in this trend would be artificial intelligence. So things are getting faster, and it’s hard to anticipate what the next transition will be. I guess we just think of this singularity between now and AI as what distinguishes the past from the future. But applying the same heuristic that many people in the past should have had, maybe the “Intelligence Age” is also quite short, and for the next thing after that, we don’t even have the ontology to describe what it is. The future will not think of the past as pre-AI and post-AI.

Michael Nielsen

No, obviously we can’t prove this, but it certainly seems quite plausible. Part of the issue is just that the substrate we have available to conceive seems all wrong. You can’t speculate with a bunch of chimpanzees about what it would be to have language. Just to pick a major transition in the past, the transition itself is the thing. It seems likely.

If we’re talking about “taking a puff” kind of thoughts, I’m certainly amused by the idea that there’s going to be some transition involving artificial general intelligence using classical computers. But actually, there’ll be an interesting transition with quantum computers as well. They’re probably capable of a strictly larger class of potentially interesting computations. So maybe the character of AQGI, or whatever it should be called, is actually qualitatively different. So maybe there’s a brief period between those two things. As I say, this is just speculation, but it’s certainly amusing.

Dwarkesh Patel

Is there a reason to think that? From what I understand, for decades people like you have put pretty tight bounds on the kinds of things quantum computers are going to do. It’ll speed up search somewhat. The kinds of things it speeds up extremely, like Shor’s algorithm, it seems like… Again, maybe this is to your point that we can’t predict in advance what’s down the tech tree, but at least from here, it seems like you break encryption, but what else are you using Shor’s algorithm to do?

Michael Nielsen

We’ve only been thinking about it for 40 or so years. Not for very long, and we haven’t thought that hard about it as a civilization. Does it turn out that it’s very narrow? Maybe. Does it turn out that it’s very broad? That’s also a really radical expansion that seems distinctly possible. Keep in mind as well, we’ve been doing it without the benefit of having the devices. That’s a pretty big bottleneck to have.

Dwarkesh Patel

If you’re thinking about computer science in the 1700s and you’re like, “it can do AND/OR, what can come out of that?” You can’t anticipate Bitcoin. You can’t anticipate deep learning.

Michael Nielsen

Maybe you could if you were sufficiently bright, but it is a pretty hard situation.

01:26:25 – What drew Michael to quantum computing so early?

Dwarkesh Patel

What is your inside view, having been in and contributing to quantum information and quantum computing back in the ’90s and 2000s? What is your telling of the history of what was the bottleneck? What was the key transition that made it a real field? How do you rank the contributions from Feynman to Deutsch to everybody else who came along?

Michael Nielsen

Let’s just focus on the question about what actually changed. Why was quantum computing not a thing in the 1950s? It could have been. Somebody like John von Neumann is a good example. He was absolutely pioneering computation. He also wrote a very important book about quantum mechanics and was deeply interested in it. He could have invented quantum computing at that time, and I think there were quite a number of people who potentially could have.

So why do we have these papers by people like Feynman and Deutsch in the ’80s? Those are fairly regarded as the foundation of the field. There are some partial anticipations a little bit earlier, but they were nowhere near as comprehensive and nowhere near as deep. You should ask David. You can’t ask Feynman, unfortunately, but he’ll know much better than I do.

A couple things that I think are interesting. One is that computation became far more salient in the late ’70s and early ’80s. It just became a thing which many more people were interested in, partially for very banal reasons. You could go and buy a PC. You could buy an Apple II. You could buy a Commodore 64. You could buy all these kinds of things. It became apparent to people that these were very powerful devices, very interesting to think about.

At the same time, in the quantum case, that was also the time of the Paul trap and the ability to trap single ions. Up to that point, we hadn’t really had the ability to manipulate single quantum states. You got these two separate things that for historically contingent reasons had both matured around 1980 or so. Somebody like von Neumann could have had the idea earlier, but it is quite an interesting factor.

There’s a story about Richard Feynman. He went and got one of the first PCs around 1980 or 1981. He was apparently so excited with this device, he actually tripped and hurt himself quite badly carrying his brand-new computing device. That’s a very historically contingent coincidence, having somebody who’s very talented and understanding of quantum mechanics also just very excited about these new machines. It’s not so surprising perhaps that he’s thinking about it then. What similar story could you have told 10 years earlier? The conditions don’t exist for it. I mean, it’s quite a banal story, but…

Dwarkesh Patel

One of the things we were going to discuss was this idea you had about the market for follow-ups. I think this is the perfect story to discuss it for because you wrote the textbook about the field. “Mike and Ike” is the definitive textbook on quantum information. You presumably came in after Deutsch.

But you in the ‘90s somehow identified it as the thing that is worth following up on and building on. Instead of talking about it more abstractly, I’d love to just hear the firsthand story of how you knew that this is the thing to do. Of all the things that were happening in physics and computing, how did you decide you want to think about this problem?

Michael Nielsen

Richard Feynman writes this great paper in 1982. David Deutsch writes an absolutely fantastic paper in 1985 sketching out a lot of the fundamental ideas of quantum computing. I’m 11 in 1985. I’m not thinking about this. I’m playing soccer and doing whatever. But in 1992, I took a class on quantum mechanics that was really terrific, given by Gerard Milburn.

I just went and asked Gerard one day after the fifth lecture or something. I said, “Do you have any papers or whatever that you could give me?” He said, “Come by my office in a couple of days’ time.” I did, and he presented me with a giant stack of papers, which included the Deutsch paper, the Feynman paper, and a whole bunch of other very fundamental papers about quantum computing and quantum information at a time when essentially nobody in the world was working on it. He was. I think he wrote the very first paper that proposed a practical approach to quantum computing. It wasn’t very practical, but it was actually in a real system.

So in some sense, I’m benefiting from the taste of this other person. As soon as I read the papers… These are exciting papers. They’re asking very fundamental questions, and you realize I can make progress here. These are things that one could potentially work on.

Deutsch has this conjecture, or thesis or whatever you’d call it, that a universal model, a quantum Turing machine, should be capable of efficiently simulating any physical system at all. This is a very provocative idea. I think in that paper, he more or less claims that he’s proved it. I’m not sure everybody would agree with that. There are questions about whether or not you can simulate quantum field theory effectively. That kind of question is very interesting and very exciting. It’s obviously a fundamental question about the universe.

He has some wonderful ideas in there about quantum algorithms, where they come from, what they mean, and how they relate to the meaning of the wave function. Questions like this are still not agreed upon amongst physicists. There’s just some sense of, “Oh, I am in contact with something which is (A) deeply important, and (B) we as a civilization don’t have this.” Of course, you start to focus your attention a little bit there.

Dwarkesh Patel

I’m not sure I got the answer to the question…

Michael Nielsen

Maybe I misunderstood the question.

Dwarkesh Patel

Maybe I’ll explain the motivation first. In a previous conversation, we were discussing how you could have known in the 1940s that the Shannon theorems and Shannon’s way of thinking about a communication channel is a deep idea that goes beyond the problems with pulse-code modulation that Bell Labs was trying to solve at the time, and that it applies to everything from quantum mechanics to genetics to computer science.

One of the ideas you stated that we didn’t get a chance to talk about yet… Shannon published this paper. There are all these other papers, but there’s some market of follow-ups where people gravitate to and build upon Shannon’s work. How do they realize that that’s the thing to do, and how does that process happen? I guess you gave your local answer. You read these papers, and you immediately realized there’s work to be done here. There’s low-hanging fruit. There’s some deep provocative idea that I need to better understand, and I could tractably make progress on.

Michael Nielsen

To some extent, you’re saying, “Okay, I wanted to get into this game of contributing to humanity’s understanding of the universe,” and you are applying this low-hanging fruit algorithm. You’re like, “Relative to my particular set of interests and abilities, where should I pick up my shovel and start digging?” There it was like, “Oh, this looks like quite a good place to start digging.” Different people, of course, chose very differently. It was a very unusual choice at the time. This was 1992. Very few people were thinking about that.

01:35:29 – Does science need a new way to assign credit?

Dwarkesh Patel

Fast-forwarding a bit, I don’t know how you think about your work on the open science movement now, but did it work? What does success there look like? What is the movement trying to accomplish?

Michael Nielsen

It’s interesting. You didn’t stop and define open science there, which 20 years ago you would have had to do. People recognize the phrase. People have some set of associations with it. Most often, they have a relatively simple set of associations. It means maybe something about making scientific papers open access. Very often they have some set of notions about also making code openly available or making data openly available.

Those are already very large successes of the open science movement, to make those salient issues. Those are issues on which people have opinions, and there are relatively common arguments. This is like the meme version: publicly funded science should be open science. That’s a distillation of a set of ideas which you might be able to contest. But if you can get people actually thinking about it and engaged with that kind of argument, that’s a very fundamental issue to be considering in the whole political economy of science.

If you go back three centuries, there was a very similar argument prosecuted, which is the question: do we publicly disclose our scientific results or not? If you look at people like Galileo and Kepler, the extent to which they publicly disclosed was done in a very odd way. Sometimes they did bizarre things where they published some of their results as anagrams. They’d find some discovery, write down the result in a sentence, scramble it, and publish that. Then if somebody else later made the same discovery, they would unscramble the anagram and say, “Oh, yeah, I actually did it first.” This is not an ideal foundation for a discovery system.

It took a very long time, over a century, I think, to obtain more or less the modern ideals, in which you disclose the knowledge in the form of a paper. There is an expectation of attribution, and a reputation economy gets built. “So-and-so did this work, so they deserve the credit for that,” and that’s the basis for their careers. This is the underlying political economy of science. That made a lot of sense when you have a printing press and the ability to do scientific journals.

Then you transition to this modern situation, where you can start to share a lot more. You can share your code, your data, your in-progress ideas. But there’s no direct credit associated to those. It’s not at all obvious how much reputation should be associated to them. That’s all constructed socially. Making it a live issue is a very important thing to have done. I view that as one of the main positive outcomes of work on open science.

I’ll give you a really practical example to illustrate the problem. For a long time in physics, there was a preprint culture in which people would upload preprints to the preprint archive, and in biology, this didn’t happen. There was no preprint culture. That’s changing now, but for a long time, this was the case. I used to amuse myself by asking physicists and biologists why this was the case.

What I would hear from biologists was they would say, “Biology is so much more competitive than physics that we need to protect our priority, so we can’t possibly upload to the archive. We have to just publish in journals.” Then I would sometimes hear from physicists, “Physics is so much more competitive than biology that we need to establish our priority by uploading as rapidly as possible to the preprint archive. We can’t possibly wait to do it with the journals.”

I think this emphasizes the extent to which this kind of attribution economy is just something we construct. It’s something we do by agreement. Any attempt to change that economy results in a different system by which we construct knowledge. There is this very fundamental set of problems around the political economy of science. We’ve got this collective project, and how we mediate it depends upon the economy we have around ideas.

Dwarkesh Patel

One of the things you’ve emphasized as a part of this project of open science, and we talked about it earlier, is collective science, or groups of people making progress on a problem where no individual understands all the logical and explanatory levels necessary to make a leap or a connection. Outside of mathematics, what is the best example of such a discovery?

Michael Nielsen

I’m not sure I have a well-ordering of them to give you a best. An example that I think is very interesting is the LHC, where it’s just this immensely complicated object. Years ago, I snuck into an accelerator physics conference. I didn’t know anything at all about accelerator physics, but I was just curious to see what they were talking about.

This particular group of people were experts on numerical methods, in particular on inverse methods. Inside these accelerators, you have these cascades. A particle will be massively accelerated, maybe it’ll be collided, and then you’ll get a shower of particles which decays and decays and decays. There’s just this incredible, consequential shower, which is ultimately what you see at the detector. Then you have to retroactively figure out what produced it. There are these very complicated inverse problems that need to be solved. You’ve got this final data, but you need to figure out what produced it, and that’s how you look for signatures of these.

Many of these people were incredibly deep experts on simulation methods for following particle tracks. This was really deep and difficult stuff. I was like, “Wow, you could spend a lifetime just learning how to do this and how to solve some of these inverse problems, and you would know very little about quantum field theory, detector physics, vacuum physics, or data processing, all these things that are absolutely essential to understanding, say, the Higgs boson.”

I don’t think it’s possible for one person to understand everything in depth. Lots of people broadly understand a lot of these ideas, but they don’t understand everything in the depth that is actually utilized. That’s why there are these papers with well over a thousand authors. Those people can talk to one another at a high level, but they don’t understand each other’s specialties in all that much depth. Things like detector physics, vacuum physics, solving inverse problems, this stuff is incredibly different from each other. To understand it in real detail is serious work.

01:43:57 – Prolificness versus depth

Dwarkesh Patel

How do you think about prolificness versus depth? Maybe Darwin’s an example of somebody who’s just gestating on something for many decades. There are other examples. Einstein during the year he comes up with special relativity is just doing a bunch of different things. And Pais talks about how they were all relevant to the eventual build-up.

Michael Nielsen

It’s something I stress about a lot. Sometimes I feel I’m too slow. It’s funny though, the Darwin example is really interesting. Prolific at what? God knows how many letters he wrote. It must have been an enormous number. So he was certainly very active.

There’s two types of work that tend to be involved in any kind of creative project. There’s routine stuff, and there you just want to avoid procrastination. You just want to ask, “How do I get good at this?” or “How do I outsource it?” and “How do I do it as rapidly as possible?” and just avoid getting into a situation where you’re prolonging it.

Then there’s high-variance stuff where you actually need to be willing to take a lot of time. You need to be willing to go to different places and talk to different people, where in any given instance, most of it is just not going to be an input. Somehow balancing those two things… I think a lot of people are very good at doing one or the other, but it’s almost like a personality trait which one you prefer. People tend to end up doing a lot of one and not enough of the other. So I certainly try and balance those two things.

Einstein is such an interesting example. 1905 is just this extraordinary year. You can delete special relativity entirely, and it’s an extraordinary year. You can delete special relativity, and you can delete the photoelectric effect for which he won the Nobel Prize, and it’s still an extraordinary year, plausibly a multi-Nobel-Prize-winning year. So what’s he doing? Maybe the answer is just that he’s smarter than the rest of us. There’s a lot of luck as well.

Certainly for myself anyway, trying to identify those things that are routine that I should get good at, and then just try to do them as quickly as possible. I think that’s yielded a certain amount of returns. But also being willing to bet a little bit more on myself on the variance side has also been very, very helpful. That’s really hard, because intrinsically you’re putting yourself in situations where you don’t know what the outcome is going to be. If you’re very driven to be productive, and actually mostly it’s not working over there, you think, “Let’s reduce this.” It doesn’t feel right.

When I worked in San Francisco, a practice I used to have each day was instead of taking the 15-minute walk to work, I would take the more beautiful 30-minute walk. Partially just because it was beautiful, but partially also as just a reminder that there are real benefits to not being efficient. But it’s not an answer to your question. Really, I think all I’m saying is I struggle a lot with the question.

Dwarkesh Patel

I think Dean Keith Simonton has this famous equal odds rule where he says the probability that any given thing you release—any paper, book, whatever—will be extremely important for a given person through their lifetime is not that different. What really determines in what era they are the most productive is how much they’re publishing. Any given thing has equal odds of being extremely important. I think some of the most successful creatives or scientists, they’re just doing a lot. Shakespeare was just publishing a lot.

Michael Nielsen

Of course, then there are counterexamples. Gödel published almost nothing. But broadly speaking, you need a very good reason to not do that. It’s funny, I’ve met a lot of people over the years who are clearly brilliant, and they’re just obsessed that they are going to work on the great project that makes them famous, and they never do anything. That seems connected. It’s a type of aversiveness. I think very often they just don’t want public judgment.

Something that I would love to see… There’s an awful lot of biographies and memoirs and histories of people who achieve a lot. I wish there was a very large number of biographies of people who are fantastically talented who just missed. I’ve known people who won gold medals at IMOs and things like that, who then tried to become mathematicians and failed. What happened? What was the reason? I suspect in many cases that’s actually more informative than anything else.

01:49:17 – What it takes to actually internalize what you learn

Dwarkesh Patel

You have this essay that I was reading before this interview about how you think about what the work you’re doing is. And “writer” doesn’t seem like the right label. As you say, was Charles Darwin a writer? What exactly is that label? I’m a podcaster. In a way, obviously our work is very different, but I also think a lot about what this work is and how I get better at it.

In particular, how can I make sure there’s some compounding between the different people I talk to on the podcast? I worry that instead of this compounding, I build up some understanding that’s somewhat superficial about a topic, and then it depreciates. I move down to the next topic, and it depreciates. There are a lot of podcasters in the world who will interview way more experts than I have, and I don’t think they’re much the wiser or more knowledgeable as a result. So it’s clearly possible to mess this up.

I wonder if you have thoughts or takes or advice on how one actually learns in a deeper way from this kind of work.

Michael Nielsen

It’s an incredibly complicated and rich question. It seems like the question is, how do you make it a higher-growth context? How do you make it a more demanding context? You can do that in relatively small ways that might yield compounding returns, or you can do something that is more radical. Maybe it means starting a parallel project in which you do something that is actually quite a bit different.

There is something really interesting about how being very demanding can simply change your response to something. Something that I would sometimes do with students and sometimes with myself, it was really aimed more at myself, was they would say some week, “I’m going to try and do this work over the coming week.” Then the next week would come by and they hadn’t solved the problem. If a million dollars had been at stake, would you have put the same effort in? And the answer is no, invariably. They’ve tried, but they haven’t really tried.

I think that’s a very familiar feeling for all of us. You could do a lot more if you had just the right demanding taskmaster standing by you and saying, “Look, you’re barely operating here.” I do wonder a little bit about what’s the demanding taskmaster? What can they ask you that is going to make your preparation way more intense?

Dwarkesh Patel

The most helpful thing honestly is… For some subjects it is very clear how I prep. I’m doing an upcoming episode on chip design with the founder of a company that does chip design, and he wrote a textbook on it. Yesterday I went over to his office, and we brainstormed five roofline analyses I can do. If I understand that, I have some good understanding.

The problem is with almost every other field, there’s not this curriculum. When I interviewed Ilya three, four years ago, it was: implement the transformer, and if you implement it, you have some nugget of understanding you have clamped down. With other fields, it’s just that I vaguely understand this. It’s not clamped. There’s no forcing function of “do this exercise, and if you do it, you will understand.”

Michael Nielsen

Really what you’re saying is you can do a good job at podcasting without actually attaining this kind of understanding, and that’s the problem from your point of view. You want to change your job description so that you are internalizing these chunks and just getting this kind of integration each time. It seems to me that what that means is you actually want to change the structure of the work output at some level.

There’s this terrible idea that lots of people have that they should be in flow all of the time. And as far as I can tell, high performers just don’t believe this at all. They’re in flow some of the time. You certainly see this with athletes. When they’re actually out there playing basketball or tennis, ideally they are in flow much of the time. But when they’re training they’re not. They’re stuck a lot of the time, or they’re doing things badly. I suppose I wonder what that looks like for you.

Dwarkesh Patel

That I would be extremely satisfied with. The problem is I just don’t know what the equivalent of doing 64 laps is. This is a thing you can change by choosing guests where there is a legible curriculum. So maybe it’s a mistake not to have done that. Also, there’s no real way to prep for Terence Tao. There’s no curriculum that’s a plausible one.

There are many failure modes, but one long-term dynamic I’m worried about is that you can have a good podcast and reach a local maximum, but for no particular guest or topic are you going deep enough. My model of learning is that if you don’t really understand the deeper mechanism, you’re just mapping inputs and outputs of a black box. That just fades incredibly fast or is not worth it in the first place. You just move on and it’s over. You need to build the intermediate connection.

AI in a weird way is really easy for that reason, because there is a clear thing you can do. Just implement it, and then you understand it. If I applied that criterion elsewhere, do I just not do history episodes?

Michael Nielsen

Exactly. Ada Palmer. Wonderful to talk to, incredibly interesting. But for you personally, what changed?

Dwarkesh Patel

There are some things I learned. If I had allocated more time, especially after the interview, to write up 2,000 words on everything I learned and how it connects to other things I know… Maybe that’s a thing worth doing, spreading out the episodes more and spending more time afterwards consolidating.

I would pay infinite amounts of money if there was somebody who was really good at coming up with the curriculum, the practice problems you need to do, and the exercise you need to do after the interview to clamp what you have learned.

Michael Nielsen

Have you tried doing that with somebody?

Dwarkesh Patel

It’s hard to find someone. I haven’t tried super hard, but isn’t it going to be tough to find somebody who could do that for every single kind of discipline? Maybe I should just hire different ones for different topics.

Michael Nielsen

Maybe. There’s something about, what problem are you solving for each episode? As far as I can tell, that’s the only way I really understand anything. I get interested in something. At first, I don’t even have a problem, but there’s just some sense that there’s some contribution to make here, and gradually you hone in, and there’s a problem.

Funnily enough, spending time stuck is incredibly important. That used to just be annoying. Now it seems like it’s maybe even the most important part of the whole process. That hard-won nature of it means that I internalize it afterwards. I’ve written 10,000-word essays in a couple of days, and I’ve written them in three months or six months. I feel like I didn’t learn very much from the ones that only took a couple of days. Whereas some of the ones that took three months, 15 years later, I’ll still remember.

Dwarkesh Patel

Can you describe outside of physics how you learn, of the ones that took three months?

Michael Nielsen

By far the most common thing is there’s always some creative artifact. Sometimes it’s a class. Sometimes it’s engagement with a group of people who are working on some collective creative artifact together. You might not even be aware of it, but you’re acting as an input to their creative ends in some way. Sometimes it’s an essay or a book or whatever.

It’s one of the reasons why I often quite enjoy doing podcasts. I said yes to come here partially because I know you ask unusually demanding questions. That’s an attempt to get this sort of perspective from a different kind of a forcing function. Trying to pick the most demanding creative context.

Dwarkesh Patel

For this interview, I went through three lectures of the Susskind special relativity book. The problem is that there’s almost no practice problems in it. So I hired a physicist friend. I haven’t done it yet, but for every lecture I want a bunch of practice problems to go through, and I’m planning on being appropriately humbled.

Michael Nielsen

How do you make it as jugular as possible? The higher you can raise the stakes, the better.

Dwarkesh Patel

The interview is in some sense high stakes, but also it doesn’t necessarily test deep understanding.

Michael Nielsen

I don’t think the interview is that high stakes. You’re not writing a book about special relativity, and you’re not trying to write a book that replaces whatever the existing standard textbook is. That’s a really high stake.

By the way, a phrase that I find particularly difficult. People will talk about “going deep” on a subject, and it turns out different people have different ideas of what this means. For some people it means they read a couple of blog posts. For some people it means they read a book about it. For some people it means they wrote a book about it. The standard you hold yourself to determines a lot about your ability to integrate knowledge in this way.

Dwarkesh Patel

I found that I’m in some sense able to move much faster on some things through the help of AI, but I don’t know if I’m learning better. I think it’s probably because… The hardest thing, the thing that is most demanding, is so aversive that you try to take any excuse you can to get out of it. Just having a back-and-forth conversation with an LLM where you gloss over…

Michael Nielsen

It’s entertaining but not necessarily anything else.

Dwarkesh Patel

It’s such an easy way to get out of the thing. In fact, it makes it easier because instead of doing some intermediate thinking, there’s always a next question you can ask a chatbot.

Michael Nielsen

Yeah. And it’s somewhat valuable. That’s part of the seductiveness, of course. It’s not actually useless. But it can substitute for actually doing the thing that maybe you should be doing. It’s interesting. To what extent should you be outsourcing that kind of stuff? It’s an interesting judgment call. There is a whole bunch of routine work that you want done. It’s low value for you, so if you can get a chatbot to do it, you may as well.

Somebody interviewed the pioneering computer scientist Alan Kay years ago, and he was asked what he thought about Linux. If I remember his answer correctly, he basically said, “It doesn’t have anything to do with computer science. It’s just a great big ball of mud. There are a few interesting ideas in there which are worth understanding, but mostly all you’re learning is stuff about Linux. You’re not actually learning anything which is transferable.” I thought that was very interesting.

There’s a certain kind of seductiveness to some things where it’s sort of a Rube Goldberg machine. You can just learn about all the bits, and it feels entertaining. But if you step back and think about what you’re actually doing here, it might not actually be meeting your objectives. Maybe you want to become a sysadmin, and learning Linux is a great use of your time. There’s no harm in that at all.

But if your objective is to understand the fundamentals of computing, it’s much less clear that that’s a good use of your time. It was certainly an answer I’ve thought a lot about, where for a certain type of mind, there is a seductiveness in just learning systems and confusing that with understanding.

Dwarkesh Patel

Okay, I’ll keep you updated on how this goes. I owe you a text within a month of some revamped learning system.

Michael Nielsen

I’d be really curious. It’s also true that tiny incremental improvements in this are just worth so much.

Dwarkesh Patel

It’s the main input into the podcast. It’s great that the bookshelves are fancy and I’ve got a blackboard or whatever, but really the thing that makes the podcast better is if I can improve the learning I do. So yes, it’s worth every morsel of improvement. All right, thanks for the therapy session. Great note to end on. Thanks, Michael.

Michael Nielsen

All right. Thanks, Dwarkesh.

Terence Tao – Kepler, Newton, and the true nature of mathematical discovery

Dwarkesh Patel — Fri, 20 Mar 2026 16:00:55 GMT

We begin the episode with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion.

People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops.

But the story of how we discovered the shape of our solar system shows how the verification loop for correct ideas can be decades (or even millennia) long.

During this time, what we know today as the better theory can often actually make worse predictions (Copernicus's model of circular orbits around the sun was actually less accurate than Ptolemy's geocentric model).

And the reason it survives this epistemic hell is some mixture of judgment and heuristics that we don’t even understand well enough to actually articulate, much less codify into an RL loop.

Hope you enjoy!

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Jane Street loves challenging my audience with different creative puzzles. One of my listeners, Shawn, solved Jane Street’s ResNet challenge and posted a great walk-through on X. If you want to try one of these puzzles yourself, there’s one live now at janestreet.com/dwarkesh.

Labelbox can get you rubric-based evals, no matter your domain. These rubrics allow you to give your model feedback on all the dimensions you care about, so you can train how it thinks, not just what it thinks. Whatever you’re focused on—math, physics, finance, psychology or something else—Labelbox can help. Learn more at labelbox.com/dwarkesh.

Mercury just released a new feature called Insights. Insights summarizes your money in and out, showing you your biggest transactions and calling out anything worth paying attention to. It’s a super low-friction way to stay on top of your business. Learn more at mercury.com/insights.

Timestamps

(00:00:00) – Kepler was a high temperature LLM

(00:11:44) – How would we know if there’s a new unifying concept within heaps of AI slop?

(00:26:10) – The deductive overhang

(00:30:31) – Selection bias in reported AI discoveries

(00:46:43) – AI makes papers richer and broader, but not deeper

(00:53:00) – If AI solves a problem, can humans get understanding out of it?

(00:59:20) – We need a semi-formal language for the way that scientists actually talk to each other

(01:09:48) – How Terry uses his time

(01:17:05) – Human-AI hybrids will dominate math for a lot longer

Transcript

00:00:00 – Kepler was a high temperature LLM

Dwarkesh Patel

Today, I’m chatting with Terence Tao, who needs no introduction. Terence, I want to begin by having you retell the story of how Kepler discovered the laws of planetary motion because I think this will be a great jumping off point to talk about AI for math.

Terence Tao

I’ve always had an amateur interest in astronomy. I’ve loved stories of how the early astronomers worked out the nature of the universe. Kepler was building on the work of Copernicus, who was himself building on the work of Aristarchus. Copernicus very famously proposed the heliocentric model, that instead of the planets and the Sun going around the Earth, the Sun was at the center of the solar system and the other planets were going around the Sun.

Copernicus proposed that the orbits of the planets were perfect circles. His theory fit the observations that the Greeks, the Arabs, and the Indians had worked out over centuries. Kepler learned about these theories in his studies, and he made this observation that the ratios of the size of the orbits that Copernicus predicted seemed to have some geometric meaning.

He started proposing that if you take the orbit of the Earth and you enclose it in a cube, the outer sphere that encloses the cube almost perfectly matched the orbit of Mars, and so forth. There were six planets known at the time and five gaps between them, and there were five perfect Platonic solids: the cube, the tetrahedron, icosahedron, octahedron, and dodecahedron.

So he had this theory, which he thought was absolutely beautiful, that you could inscribe these Platonic solids between the spheres of the planets. It seemed to fit, and it seemed to him that God’s design of the planets was matching this mathematical perfection of the Platonic solids.

He needed data to confirm this theory. At the time, there was only one really high-quality dataset in existence. Tycho Brahe, this very wealthy, eccentric Danish astronomer, had managed to convince the Danish government to fund this extremely expensive observatory. In fact, it was an entire island where he had taken decades of observations of all the planets, like Mars and Jupiter, at least every night for which the weather was clear, with the naked eye. He was the last of the naked-eye astronomers.

He had all this data which Kepler could use to confirm his theory. Kepler started working with Tycho, but Tycho was very jealous of the data. He only gave him little bits of it at a time. Kepler eventually just stole the data. He copied it and had to have a fight with Brahe’s descendants.

He did get the data, and then he worked out, to his disappointment, that his beautiful theory didn’t quite work. The data was off from his Platonic solid theory by 10% or something. He tried all kinds of fudges, moving the circles around, and it didn’t quite work. But he worked on this problem for years and years, and eventually, he figured out how to use the data to work out the actual orbits of the planets.

That was an incredibly clever, genius piece of data analysis. And then he worked out that the orbits were actually ellipses, not circles, which was shocking for him. So he worked out the first two laws of planetary motion: the ellipses, and also that a planet sweeps out equal areas in equal times.

Then ten years later, after collecting a lot of data—the furthest planets like Saturn and Jupiter were the hardest for him to work out—he finally worked out this third law, that the time it takes for a planet to complete its orbit was proportional to some power of the distance to the Sun. These are Kepler’s three famous laws of planetary motion. He had no explanation for them. It was all driven by experiment, and it took Newton, a century later, to give a theory that explained all three laws at once.

Dwarkesh Patel

The take I want to try on you is that Kepler was a high-temperature LLM. Newton comes up with this explanation of why the three laws of planetary motion must be true. Of course, the way that Kepler discovers the laws of planetary motion, or figures out the relative orbits of the different planets, is as you say a work of genius. But through his career, he’s just trying random relationships.

In fact, the third law of planetary motion appears as an aside in The Harmony of the World, which is just a book about how all these different planets have these different harmonies. And the reason there’s so much famine and misery on Earth is because the Earth is mi-fa-mi, that’s the note of Earth. It’s all this random astrology, but in there is the square-cube law, which tells you what relationship the period has to a planet’s distance from the Sun. As you were detailing, if you add that to Newton’s F=ma and the equation for centripetal acceleration, you get the inverse-square law. And so Newton works that out.
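
The step from the third law to the inverse-square law can be written out in a few lines. This is a sketch under a simplifying assumption that is not in the conversation: the orbit is treated as a circle of radius r traversed at constant speed. The symbols r (orbital radius), T (period), m (planet mass), and k (the constant of proportionality in the third law) are labels chosen here for illustration.

```latex
\begin{align*}
T^2 &= k\,r^3 && \text{(Kepler's third law)} \\
v &= \frac{2\pi r}{T} && \text{(speed on a circular orbit)} \\
a &= \frac{v^2}{r} = \frac{4\pi^2 r}{T^2} && \text{(centripetal acceleration)} \\
F &= m a = \frac{4\pi^2 m\, r}{k\, r^3} = \frac{4\pi^2 m}{k}\cdot\frac{1}{r^2} && \text{(Newton's second law)}
\end{align*}
```

Under that assumption, the force needed to hold a planet on its orbit falls off as 1/r^2.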

But the reason I think this is an interesting story is that I feel LLMs can do the kind of thing of trying random relationships for twenty years, some of which make no sense, as long as there’s a verifiable data bank like Brahe’s dataset. “Ok, I’m going to try out random things about musical notes, Platonic objects, or different geometries, I have this bias that there’s some important thing about the geometry of these orbits.”

Then one thing works. As long as you can verify it, these empirical regularities can then drive actual deep scientific progress.

Terence Tao

Traditionally, when we talk about the history of science, idea generation has always been the prestige part of science. A scientific problem comes with many steps. You have to identify a problem, and then you have to identify a good, fruitful problem to work on. Then you need to collect data, figure out a strategy to analyze the data, and make a hypothesis. At this point, you need to propose a good hypothesis, and then you need to validate. Then you need to write things up and explain. There are a dozen different components.

The ones we celebrate are these eureka genius moments of idea generation. Kepler certainly had to cycle through many ideas, several of which didn’t work. I bet there were many that he didn’t even publish at all because they just didn’t fit. That’s an important part of the process, trying all kinds of random things and seeing if they worked.

But as you say, it has to be matched by an equal amount of verification, otherwise it’s slop. We celebrate Kepler, but we should also celebrate Brahe for his assiduous data collection, which was ten times more precise than any previous observation. That extra decimal point of accuracy was essential for Kepler to get his results. He was using Euclidean geometry and the most advanced mathematics he could use at the time to match his models with the data. All aspects had to be in play: the data, the theory, and the hypothesis generation.

I’m not sure nowadays that hypothesis generation is the bottleneck anymore. Science has changed in the century since. Classically, the two big paradigms for science were theory and experiment. Then in the 20th century, numerical simulation came along, so you can do computer simulations to test theories. Finally, in the late 20th century, we had big data. We had the era of data analysis.

A lot of new progress is actually driven now by analyzing massive datasets first. You collect large datasets and then draw patterns from them to deduce thoughts. This is a little bit different from how science used to work, where you make a few observations or have one out-of-the-blue idea, and then collect data to test your idea. That’s the classic scientific method. Now it’s almost reversed. You collect big data first, and then you try to get hypotheses from it.

Kepler was maybe one of the first early data scientists, but even he didn’t start with Tycho’s dataset and then analyze it. He had some preconceived theories first. It seems like this is less and less the way we make progress, just because the data is so much more massive and useful.

Dwarkesh Patel

Oh, interesting. I feel like the 20th-century science that you’re describing actually very well describes what happened with Kepler. He did have these ideas—1595 and ’96 is where he comes up with the polygons and then the Platonic objects theory—but they were wrong. Then a few years later, he gets Brahe’s data, and it’s only after twenty years of trying random things that he gets this empirical regularity.

It actually feels a bit closer to Brahe’s data being analogous to some massive data bank of simulations, and now that you’ve got the data, you can keep trying random things. If it wasn’t for that, Kepler would be out there just writing books about harmonics and Platonic objects, and there would be nothing to actually verify against.

Terence Tao

The data was extremely important. The distinction I was trying to make was that traditionally, you make a hypothesis and then you test it against data. But now with machine learning, data analysis, and statistics, you can start with data and through statistics work out laws that were not apparent before.

Kepler’s third law is a little bit like this, except that instead of having the thousand data points that Brahe had, Kepler had six data points. For every planet, he knew the length of the orbit and the distance to the Sun. There were five or six data points, and he did what we would now call regression. He fit a curve to these six data points and got a square-cube law, which was amazing. But he was quite lucky that these six data points gave him the right conclusion. That’s not enough data to be really reliable.
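
The curve fit described here can be reproduced as a straight-line fit on a log-log scale. The following is a minimal sketch, not Kepler's actual procedure: it assumes modern rounded values for the semi-major axes (in astronomical units) and periods (in years) rather than his own figures, and the names PLANETS and fit_power_law are hypothetical.

```python
import math

# Modern rounded values (an assumption; these are not Kepler's or Tycho's figures).
# name: (semi-major axis in AU, orbital period in years)
PLANETS = {
    "Mercury": (0.387, 0.241),
    "Venus": (0.723, 0.615),
    "Earth": (1.000, 1.000),
    "Mars": (1.524, 1.881),
    "Jupiter": (5.203, 11.862),
    "Saturn": (9.537, 29.457),
}


def fit_power_law(points):
    """Ordinary least-squares fit of log(T) = slope * log(a) + intercept."""
    xs = [math.log(a) for a, _ in points]
    ys = [math.log(t) for _, t in points]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_power_law(list(PLANETS.values()))
print(f"fitted exponent: {slope:.3f}")
```

With these inputs the fitted exponent should come out close to 1.5, which is the square-cube relationship (T^2 proportional to a^3). Six points lying this neatly on a line is exactly the kind of small-sample result Tao calls lucky.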

There was a later astronomer, Johann Bode, who took the same data—the distances to the planets—and inspired by Kepler, he had a prediction that the distances to the planets formed a shifted geometric progression. He also fit a curve, except there was one point missing. There was a big gap between Mars and Jupiter. His law predicted that there was a missing planet. It was kind of a crank theory, except when Uranus was discovered by Herschel, the distance to Uranus fit exactly this pattern. Then Ceres was discovered in the asteroid belt, and it also fit the pattern. People got really excited that Bode had discovered this amazing new law of nature.

But then Neptune was discovered, and it was way off. Basically it was just a numerical fluke. There were six data points. Maybe one reason why Kepler didn’t highlight his third law as much as the first two laws is that instinctively, even though he didn’t have modern statistics, he kind of knew that with six data points, he had to be somewhat tentative with the conclusions.
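
Bode's pattern and its failure at Neptune can be checked numerically. The sketch below assumes the usual textbook form of the Titius-Bode rule, 0.4 + 0.3 * 2^n astronomical units, with Mercury as the special first term 0.4. That formula and the rounded modern distances are not from the conversation, and the function and variable names are hypothetical.

```python
# Rounded modern semi-major axes in AU (an assumption, not Bode's own figures).
OBSERVED_AU = [
    ("Mercury", 0.39),
    ("Venus", 0.72),
    ("Earth", 1.00),
    ("Mars", 1.52),
    ("Ceres", 2.77),
    ("Jupiter", 5.20),
    ("Saturn", 9.54),
    ("Uranus", 19.19),
    ("Neptune", 30.07),
]


def bode_distance(index):
    """Titius-Bode prediction in AU for the body at position `index` (0 = Mercury)."""
    if index == 0:
        return 0.4  # Mercury is the special case with no doubling term
    return 0.4 + 0.3 * 2 ** (index - 1)


def relative_errors(observed):
    """Map each body to (predicted - actual) / actual."""
    return {
        name: (bode_distance(i) - actual) / actual
        for i, (name, actual) in enumerate(observed)
    }


errors = relative_errors(OBSERVED_AU)
for name, err in errors.items():
    print(f"{name:8s} {err:+.1%}")
```

With these inputs every body through Uranus lands within about 6% of the rule, while Neptune misses by well over 20%.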

00:11:44 – How would we know if there’s a new unifying concept within heaps of AI slop?

Dwarkesh Patel

To ask the question about the analogy more explicitly, does this analogy make sense if in the future we have smarter and smarter AIs? We’ll have millions of them, and they can go out and hunt for all these empirical regularities. It sounds like you don’t think the bottleneck in science is finding more things that are the equivalent of the third law of planetary motion for each given field, so that later on somebody can say, “Oh, we need a way to explain this. Let’s work out the math. Here’s the inverse-square law of gravity.”

Terence Tao

I think AI has driven the cost of idea generation down to almost zero, in a very similar way to how the internet drove the cost of communication down to almost zero. It’s an amazing thing, but it doesn’t create abundance by itself. Now the bottleneck is different. We’re now in a situation where suddenly people can generate thousands of theories for a given scientific problem. Now we have to verify them, evaluate them. This is something which we have to change our structures of science to actually sort this out.

Traditionally, we build walls. In the past, before we had AI slop, we had amateur scientists have their own theories of the universe, many of which were of very little value. We built these peer review publication systems to filter out and try to isolate the high signal ideas to test.

But now that we can generate these possible explanations at massive scale, and some of them are good and a lot are terrible, human reviewers are already being overwhelmed. Many journals are reporting that AI-generated papers are just flooding their submissions.

It’s great that we can generate all kinds of things now with AI, but it means that the rest of the aspects of science have to catch up: verification, validation, and assessing what ideas actually move the subject forward and which ones are dead ends or red herrings. That’s not something we know how to do at scale. For each individual paper, we can have a debate among scientists and get to a consensus in a few years. But when we’re generating a thousand of these every day, this doesn’t work.

Dwarkesh Patel

There’s this incredibly interesting question. If you have billions of AI scientists, not only how do you gauge which ones are real progress, but how do you... This is actually a question that human science has had to face and we’ve solved somehow, and I’m actually not sure how we solved this.

Let’s say in the 1940s, if you’re at Bell Labs and there are these new technologies coming out. Pulse-code modulation, how do you transfer signals? How do you digitize signals? How do you transfer them over analog wires? There are all these papers about the engineering constraints and the details, and then there’s one which comes up with the idea of the bit, which has implications across many different fields. You need some system which can then look at that and say, “Okay, we need to apply this to probability. We need to apply this to computer science,” et cetera.

In the future, the AIs are coming up with the next version of this unifying concept. How would you identify it among millions of papers that might actually constitute progress, but which have much less in terms of general unifying ideas?

Terence Tao

A lot of it’s the test of time. Many great ideas didn’t actually get a great reception at the time they were first proposed. It was only after some other scientists realized that they could take it further and apply them to their own... Deep learning itself was a niche area of AI for a long time. The idea of getting answers entirely through training on data and not through first principles reasoning was very controversial, and it just took a long time before it started bearing fruit.

You mentioned the bit. There were other proposals for computer architectures than the zero-one that is universal today. I think there were trits, three-valued logic. In an alternate universe, maybe a different paradigm would have shown up. The transformer, for example, is the foundation of all modern large language models, and it was the first deep learning architecture that really was sophisticated enough to capture language. But it didn’t have to be that way. There could’ve been some other architecture that was the first to do it and once that was adopted, it would become the standard.

One reason why it’s hard to assess whether a given idea is going to be fruitful is that it depends on the future. It depends also on the culture and society, which ones get adopted, which ones don’t. The base ten numeral system in mathematics is extremely useful, much better than the Roman numeral system, for instance. But again, there’s nothing special about ten. It’s a system that is useful for us because everyone else uses it. We’ve standardized it. We’ve built all our computers and our number representation systems around it, so we’re stuck with it now. Some people occasionally push for other systems than decimal, but there’s just too much inertia.

It’s not something where you can look at any given scientific achievement purely in isolation and give it an objective grade without being aware of the context both in the past and the future. So it may never be something that you can just reinforcement learn the same way that you can for much more localized problems.

Dwarkesh Patel

Often in the history of science when a new theory comes up that in retrospect we realize is correct, it seems to have implications that either make no sense because they’re wrong, and we realize later on why they’re wrong, or are correct but seem wildly implausible at the time.

As you talked about, Aristarchus had heliocentrism in the third century BC. The ancient Athenians were like, “This can’t be because if the earth is going around the sun, we should see the relative position of the stars change as we’re going around the sun, and the only way that wouldn’t be the case is if they’re so far away that you don’t notice any parallax,” which is actually the correct implication.

But there’s times when the implication is incorrect and we just need to graduate to a better level of understanding. Leibniz would chide Newton and disagree with Newton’s theory of gravity on the basis that it implied action at a distance, and they didn’t know the mechanism, and Newton himself was sort of stunned that inertial mass and gravitational mass were the same quantity. All these things later were resolved by Einstein. But it was still progress.

So the question for a system of peer review for AI would be: even if you can falsify a theory, how would you notice that it still constitutes progress relative to the thing before?

Terence Tao

Often, the ultimately correct theory initially is worse in many ways. Copernicus’s theory of the planets was less accurate than Ptolemy’s theory. Geocentrism had been developed for a millennium by that point, and they had made many tweaks and increasingly complicated ad hoc fixes to make it more and more accurate. Copernicus’s theory was a lot simpler but much less accurate. It was only Kepler that made it more accurate than Ptolemy’s theory.

Science is always a work in progress. When you only get part of the solution, it looks worse than a theory which is incorrect but somehow has been completed to the point where it kind of answers all the questions. As you say, Newton’s theory had big mysteries. They had the equivalence of mass and action at a distance, which were only resolved with a very conceptually different approach centuries afterwards.

Often progress has to be made not by adding more theories, but by deleting some assumptions that you have in your mind. One reason why geocentrism held on for so long is we had this idea that objects naturally want to stay at rest. This is the Aristotelian notion of physics, and so the idea that the Earth was moving… How come we weren’t all falling over? Once you have Newton’s laws of motion—an object in motion remains in motion and so forth—then it makes sense.

Conceptually, it’s a very big leap to realize that the Earth is in motion. It doesn’t feel like it’s in motion. One of the biggest advances, Darwin’s theory of evolution, rests on the idea that species are not static. This is not obvious because you don’t see evolution in your lifetime. Well, now we actually can, but it seems permanent and static.

Right now we’re going through a cognitive version of the Copernican revolution, where we used to think that human intelligence is the center of the universe, and now we’re seeing that there are very different types of intelligence out there with very different strengths and weaknesses. Our assessment of which tasks require intelligence, which ones don’t, has to be reordered quite a bit.

Trying to fit AI into our theories of scientific progress and what is hard and what is easy, we’re struggling quite a lot. We have to ask questions that we’ve never really had to ask before. Or maybe the philosophers had, but now we all have to deal with it.

Dwarkesh Patel

This brings up a topic I’ve been very curious about. You mentioned Darwin’s theory of evolution. There’s this book, The Clockwork Universe by Edward Dolnick, which covers a lot of this era of history we’re talking about. He has this interesting observation in there. The Origin of Species was published in 1859. Principia Mathematica was published in 1687.

So The Origin of Species comes out two centuries after Principia. Conceptually, it seems like Darwin’s theory is simpler. There’s a contemporaneous biologist to Darwin, Thomas Huxley, who reads The Origin of Species and he says, “How stupid not to have thought of that.”

Nobody ever says that about Principia, chiding themselves for not having beaten Newton to gravity. So there’s a question of why did it take longer?

It seems like a big part of the reason is what you were saying. The evidence for natural selection is overwhelming in a certain sense, but it’s cumulative and retrospective, whereas Newton can just say, “Here are my equations. Let me see the moon’s orbital period and its distance, and if it lines up, then we’ve made progress.”

Lucretius actually had this idea that species adapted to their environment in the first century BC but nobody really talks about it until Darwin because Lucretius couldn’t run some experiment and force people to pay attention. I wonder if we’ll in retrospect end up seeing much more progress in domains which have this kind of tight data loop where you can verify them quite easily, even though they’re conceptually much more difficult.

Terence Tao

I think one aspect of science is that it’s not just creating a new theory and validating it, but communicating it to others. Darwin was an amazing science communicator. He wrote in English, in natural language. I’m speaking like a—

Dwarkesh Patel

No Lean.

Terence Tao

I have to get out of my technical mindset. He spoke in plain English, didn’t use equations, and he synthesized a lot of disparate facts. Little pieces of evolution had been worked out in the past, but he had this very compelling vision. Again, he was still missing things. He didn’t know the mechanism for heredity, he didn’t have DNA. But his writing style was persuasive, and that helped a lot.

Newton wrote in Latin. He had invented entire new areas of mathematics just to explain what he was doing. He was also from an era where scientists were much more secretive and competitive. Academia is still competitive, but it was even worse back in Newton’s day. He held back some of his best insights because he didn’t want his rivals to get any advantage. He was also a somewhat unpleasant person from what I gather. It was only a couple of decades after Newton, when other scientists explained his work in much simpler terms, that his ideas became widespread.

The art of exposition and making a case and creating a narrative is also a very important part of science. If you have the data, it helps, but people need to be convinced, otherwise they will not push it further or take the initial investment to learn your theory and really explore it. That’s another thing which is really hard to reinforcement learn on. How can you score how persuasive you are? Well, there are entire marketing departments trying to do this. Maybe it’s good that AI is not yet optimized to be persuasive.

There’s a social aspect to science. Even though we pride ourselves on having an objective side to it, where there’s data and experiment and validation, we still have to tell stories and convince our fellow scientists. That’s a soft, squishy thing. It’s a combination of data and painting a narrative, and it’s a narrative of gaps.

Even with Darwin, as I said, there were pieces of his theory he could not explain. But he could still make a case that in the future, people would find transitional forms, that they would find the mechanism of inheritance, and they did. I don’t know how you can quantify that in such a precise way that you can start doing reinforcement learning. Maybe that will be forever the human side of science.

00:26:10 – The deductive overhang

Dwarkesh Patel

One takeaway I had from reading and watching your stuff on the cosmic distance ladder… By the way, I highly recommend people watch your series with 3Blue1Brown on the cosmic distance ladder. One takeaway was that the deductive overhang in many fields could be so much bigger than people realize. If you just had the right insight about how to study a problem, you might be surprised at how much more you could learn about the world.

I wonder if you think that’s a product of astronomy at the particular times in history that you’re studying. Or is it just that based on the data that is incident on the Earth right now, we could actually divine a lot more than we happen to know?

Terence Tao

Astronomy was one of the first sciences to really embrace data analysis and squeezing every last possible drop of information out of the information they had because data was the bottleneck. It still is the bottleneck. It’s really hard to collect astronomical data.

Astronomers are world-class in extracting all kinds of conclusions from little traces of data, almost like Sherlock. I hear that for a lot of quant hedge funds, their preferred hire is an astronomy PhD, actually. They are also very interested for other reasons in extracting signals from various random bits of data.

We do under-explore how to extract extra information from various signals. Just to pick one random study, I remember reading once that people were trying to measure how often scientists actually read the papers that they cite. How do you measure this? You could try to survey different scientists, but they had a clever trick.

Many citations have little typos, like a number is wrong or the punctuation is slightly off. They measured how often a typo got copied from one reference to the next, and they could infer whether an author was just copying and pasting a reference without actually checking it. From that, they were able to infer some measure of how much attention people were paying. So there are some clever tricks to extract…
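
The typo trick can be sketched as a simple count. This is a deliberately crude illustration, not the method of the study Tao is recalling: it assumes each distinct misprint was introduced independently exactly once and that every repeat of it is a copy. The function name and the sample citation strings are hypothetical.

```python
from collections import Counter


def estimate_copied_fraction(citations, correct):
    """Share of misprinted citations that repeat an earlier misprint.

    Returns None when there are no misprints to learn from.
    """
    misprints = Counter(c for c in citations if c != correct)
    total = sum(misprints.values())
    if total == 0:
        return None
    # One occurrence of each distinct misprint is its origin; the rest are repeats.
    repeats = total - len(misprints)
    return repeats / total


# Hypothetical citations of one paper, as they appear in other reference lists.
correct = "J. Phys. 12, 345"
citations = (
    [correct] * 5
    + ["J. Phys. 12, 354"] * 3  # same transposed digits, seen three times
    + ["J. Phys. 21, 345"]      # a one-off misprint
)

estimate = estimate_copied_fraction(citations, correct)
print(f"estimated copied share among misprinted citations: {estimate:.0%}")
```

Here two of the four misprinted citations repeat an existing error, so the estimate is one half. The underlying idea is that an identical rare typo showing up twice is far more likely to be copied than independently reinvented.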

These questions you posed earlier of how we can assess whether a scientific development is fruitful, interesting, or represents real progress… Maybe there are really useful metrics or footprints of this phenomenon in data. We can examine citations and how often something is mentioned in a conference. Maybe there’s a lot of sociology of science research to be done that could actually detect these things. Maybe we should get some astronomers on the case, actually.

00:30:31 – Selection bias in reported AI discoveries

Dwarkesh Patel

That brings us nicely to the progress that, from the outside, it seems like AI for math is making. You had a post recently where you pointed out that over the last few months, AI programs have solved fifty out of the eleven hundred odd Erdős problems. I don’t know if it’s still correct, but as of a month ago you said that there had been a pause because the low-hanging fruit had been picked.

First of all, I’m curious if that is still the case, that we have picked the low-hanging fruit and now we’re at this plateau currently.

Terence Tao

It does seem so. Fifty-odd problems have been solved with AI assistance, which is great, but there’s like six hundred to go. People are still chipping away at one or two of these right now.

We’re seeing a lot fewer pure AI solutions now where the AI just one-shots the problem. There was a month where that happened and that has stopped, not for lack of trying. I know of three separate attempts to get frontier model AIs to just attack every single one of the problems simultaneously. They pick out some minor observations, or maybe they find that some problem was already solved in the literature, but there hasn’t been any further purely AI-powered solution yet.

People are using AI a lot currently. Someone might use AI to generate a possible proof strategy, and then another person will use a separate AI tool to critique it, rewrite it, generate some numerical data for it, or do a literature survey. Some problems have been solved by an ongoing conversation between lots of humans and lots of AI tools. But it does seem like it was this one-off thing.

Maybe one analogy for these problems is that you’re in some sort of mountain range with all kinds of cliffs and walls. Maybe there’s a little wall which is three feet high, and one that’s six feet high, and then one that’s fifteen feet high, and then there are some mile-high cliffs. You’re trying to climb as many of these cliffs as possible, but it’s in the dark. We don’t know which ones are tall, which ones are short. So we try to light some candles and make some maps, and slowly we figure out some of them are climbable. For some of them we can identify a partial track in the wall that you can reach first.

These AI tools, they’re like jumping machines that can jump two meters in the air, higher than any human. Sometimes they jump in the wrong direction, and sometimes they crash, but sometimes they can reach the tops of the lowest walls that we couldn’t reach before. We’ve just set them loose in this mountain range, hopping around. There was this exciting period where they could actually find all the low ones and reach them. Maybe the next time there’s a big advance in the models, they will try it again, and a few more will be breached.

But it’s a different style of doing mathematics. Normally we would hill climb, make little markers, and try to identify partial things. These tools either succeed or they fail. They’ve been really bad at creating partial progress or identifying intermediate stages that you should focus on first. Going back to this previous discussion, we don’t have a way of evaluating partial progress the same way we can evaluate a one-shot success or failure of solving a problem.

Dwarkesh Patel

There’s two different ways to think through what you’ve just said. One of them is more bearish on AI progress, and one of them is more bullish. The bearish one being, “Oh, they’re only getting to a certain height of wall, which is not as high as humans are reaching.”

The second is that they have this powerful property that once they achieve a certain waterline, they can fill every single problem that is available at that waterline, which we simply can’t do with humans. We can’t make a million copies of you and give each of them a million dollars of inference compute and have you do a hundred years of subjective time research on a million different problems at the same time.

But once AIs reach Terence Tao-level, they could do that. Once they reach intermediate levels, they could do the intermediate version of that. The same reason that we should be bearish now is the reason we should be especially bullish. Not even when they achieve superhuman intelligence, but just when they achieve human-level intelligence, because their human-level intelligence is qualitatively wider and more powerful than our human-level intelligence.

Terence Tao

I agree. They excel at breadth, and humans excel at depth, human experts at least. I think they’re very complementary. But our current way of doing math and science is focused on depth because that’s where human expertise is, because humans can’t do breadth. We have to redesign the way we do science to take full advantage of this breadth capability that we now have.

We should have a lot more effort in creating very broad classes of problems to work on rather than one or two really deep, important problems. We should still have the deep, important problems, and humans should still be working on them. But now we have this other way of doing science. We can explore entirely new fields of science by first getting these broad, moderately competent AIs to map it out and make all the easy observations. And then identify certain islands of difficulty, which human experts can then come and work on.

I see very much a future of very complementary science. Eventually, you would hope to get both breadth and depth and somehow get the best of both worlds. But we need practice with the breadth side. It’s too new. We don’t even have the paradigms to really take full advantage of it. But we will, and then science will be unrecognizable after that, I think.

Dwarkesh Patel

To this point about complementarity, programmers have noticed that they’re way more productive as a result of these AI tools. I don’t know if you as a mathematician feel the same way, but it does seem like one big difference between vibe coding and vibe researching is that with software, the whole point is to have some effect on the world through your work. If it leads to you better understanding a problem or coming up with some clean abstraction to embody in your code, that is instrumental to the end goal.

Whereas with research, the reason we care about solving the Millennium Prize Problems is presumably that in the process of solving them, we discover new mathematical objects or new techniques that advance our civilization’s understanding of mathematics. So the proof is instrumental to the intermediate work. I don’t know if you agree with that dichotomy or if that in any way will explain the relative uplift we’ll see in software versus research.

Terence Tao

Certainly in math, the process is often more important than the problem itself. The problem is kind of a proxy for measuring progress. I think even in software, there are different types of software tasks. If you just create a webpage that does the same thing that a thousand other webpages do, there’s no skill to be learned. Well, there is still some skill maybe that the individual programmer could pick up. But for boilerplate-type code, it’s something that you should definitely offload to AI.

Sometimes once you make the code, you still have to maintain it. There are issues with upgrading it and making it compatible with other things. I’ve heard programmers report that even if an AI can create the first prototype of a tool, making it mesh with everything else and making it interact with the real world in the way they want is an ongoing process. If you don’t have the skills that you pick up from writing the code, that may impact your ability to maintain it down the road.

So yes, certainly as mathematicians, we’ve used problems to build intuition and to train people to have a good idea of what’s true, what to expect, what is provable, and what is difficult. Just getting the answers right away may actually inhibit that process.

I made a distinction between theory and experiment before. In most sciences, there’s an equal division between the theoretical side and the experimental side. Math has been unique in that it’s almost entirely theoretical. We place a premium on trying to have coherent, clean theories of why things are true and false. We haven’t done many experiments as to, if we have two different ways to solve a problem, which is more effective. We have some intuition, but we haven’t done large-scale studies where we take a thousand problems and just test them.

But we can do that now. I think AI-type tools will actually revolutionize the experimental side of math, where you don’t care so much about individual problems and the process of solving them, but you want to gather large-scale data about what things work and what things don’t. The same way that if you’re a software company and you want to roll out a thousand pieces of software, you don’t really want to handcraft each one and learn lessons from each. You just want to find what workflows let you scale.

The idea of doing mathematics at scale is in its infancy. But that’s where AI is really going to revolutionize the subject.

Dwarkesh Patel

I feel like a big crux in these conversations about how good AI will be for science is, I think you said this, that they’re using existing techniques and modifying them. It would be interesting to understand how much progress one can make simply from using existing techniques.

If I looked at the top math journals, how many of the papers are coming up with a new technique, whatever that means, versus using existing techniques on new problems? What is the overhang? If you just applied every known technique to every open problem, would that constitute a humongous uplift in our civilization’s knowledge, or would that not be that impressive and useful?

Terence Tao

This is a great question, and we don’t have the data to fully answer it yet. Certainly, a lot of work that human mathematicians do… When you take a new problem, one of the first things we do is we look at all the standard things that have worked on similar problems in the past, and we try them one by one. Sometimes that works, and that’s still worth publishing because the question was important.

Sometimes they almost work, and you have to add one more wrinkle to it, and that’s also interesting. But the papers that go into the top journals are usually ones where the existing methods can kind of solve 80% of the problem, but then there is this 20% which is resistant and a new technique has to be invented to fill in the gaps.

It’s very rare now that a problem gets solved with no reliance on past literature, where all the ideas come out of nowhere. That was more common in the past, but math is so mature now that it’s just so much of a handicap to not use the literature first.

AI tools are getting really good at the first part of that, just trying all the standard techniques on a problem, often making fewer mistakes in applying them than humans. They still make mistakes, but I’ve tested these tools on little tasks that I can do, and sometimes they pick up errors that I make. Sometimes I pick up errors that they make. It’s about a tie right now.

But I haven’t yet seen them take the next step. When there are holes in the argument where none of the things are working, then what do you do? They can suggest random things, but often I find that trying to chase them down to make them work, and finding they don’t work, wastes more time than it saves.

I think some fraction of problems that we currently think are hard will fall from this method, especially the ones that haven’t received enough attention. With the Erdős problems, almost all of the 50 problems that were solved by AIs were ones for which there was basically no literature. Erdős posed the problem once or twice. Maybe some people tried it casually and couldn’t do it, but they never wrote up anything.

But it turned out that there was a solution, and it was just combining this one obscure technique that not many people know about with some other result in the literature. That’s the median level of what AI can accomplish, and that’s really great. It clears out 50 of these problems. So I think you will see some isolated successes.

But what we found… Some people have done large-scale sweeps of these Erdős problems. If you only focus on the success stories, the ones that get broadcast on social media, it looks amazing. All these problems that haven’t been solved for decades, now they’re falling. But whenever we do a systematic study, on any given problem an AI tool has a success rate of maybe 1% or 2%. It’s just that they can buy scale, and you just pick the winners. It looks great.
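
As an illustrative aside on the "buy scale and pick the winners" point, the arithmetic can be sketched in a few lines of Python. The 2% rate is taken from the "1% or 2%" figure quoted above; the sweep size of 1,000 problems and the independence of attempts are assumptions made purely for illustration, not numbers from any study.

```python
# Illustrative arithmetic only. The 2% success rate echoes the "1% or 2%"
# figure in the conversation; the sweep size of 1000 problems and the
# independence of attempts are assumptions made for this sketch.

def expected_wins(num_problems: int, success_rate: float) -> float:
    """Expected number of solved problems if each problem is tried once."""
    return num_problems * success_rate


def prob_at_least_one_win(num_problems: int, success_rate: float) -> float:
    """Chance that at least one attempt succeeds, assuming independence."""
    return 1.0 - (1.0 - success_rate) ** num_problems


if __name__ == "__main__":
    n, p = 1000, 0.02
    print(f"expected wins across the sweep: {expected_wins(n, p):.0f}")
    print(f"P(at least one win): {prob_at_least_one_win(n, p):.6f}")
    print(f"share of attempts that fail: {1 - p:.0%}")
```

Under these assumptions a large sweep is almost certain to produce some headline successes even though the overwhelming majority of individual attempts fail, which is the selection effect being described.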

I think there’ll be a similar thing happening with the hundreds of really prestigious, difficult math problems out there. Some AI may get lucky and actually solve them, and there will be some backdoor to solve the problem that everyone else missed. That will get a lot of publicity. But then people will try these fancy tools on their own favorite problem, and they will again experience the 1% to 2% success rate.

There’ll be a lot of noise amongst the signal of when they’re working and when they’re not. It will be increasingly important to collect these really standardized datasets. There are efforts now to create a standard set of challenge problems for AIs to solve, and not just rely on the AI companies to only publish their wins and not disclose their negative results. That will maybe give more clarity as to where we’re actually at.

Dwarkesh Patel

Although I think it’s worth emphasizing how much progress in AI it constitutes already, to have models that are capable of applying some technique that nobody had written down as applicable to this particular problem.

Terence Tao

The progress is simultaneously amazing and disappointing. It is a very strange feeling to see these tools in action. But people also acclimatize really quickly.

I remember when Google’s web search came out 20 years ago. It just blew all the other searches out of the water. You’re getting relevant hits on the front page, exactly what you wanted. It was amazing, and then after a few years, you just took for granted that you could Google anything.

2026-level AI would have been stunning in 2021. A lot of it, such as face recognition, natural speech, and doing college-level math problems, we just take for granted now.

00:46:43 – AI makes papers richer and broader, but not deeper

Dwarkesh Patel

Speaking of 2026 AI, you made a prediction in 2023 that by 2026 it would be like a colleague in mathematics?

Terence Tao

A trustworthy co-author if used correctly.

Dwarkesh Patel

Which is looking pretty good in retrospect.

Terence Tao

Yeah, I’m pretty pleased.

Dwarkesh Patel

So let’s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?

Terence Tao

Productivity, I think, is not quite a one-dimensional quantity. I’m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it’s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn’t have put the plot in my paper in the first place. I would just talk about it in words. So it’s hard to measure what 2x means.

On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.

Dwarkesh Patel

5x?

Terence Tao

Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. The core of what I do, actually solving the most difficult part of a math problem, hasn’t changed too much. I still use pen and paper for that.

But there’s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.

They’ve really sped up lots of secondary tasks. They haven’t yet sped up the core thing that I do, but it’s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again—and not add all these extra features, but just have something of the same level of functionality—it actually hasn’t saved that much time, to be honest. It’s made the papers richer and broader, but not necessarily deeper.

Dwarkesh Patel

You made this distinction between artificial cleverness and artificial intelligence. I would like to better understand those concepts. What is an example of intelligence that is not just cleverness?

Terence Tao

Intelligence is famously hard to define. It’s one of these things that you know when you see it. But when I talk to someone and we’re trying to collaboratively solve a math problem together, there’s this conversation where neither of us knows how to solve the problem initially. One of us has some idea and it looks promising, so then we have some sort of prototype strategy. We test it, and it doesn’t work, but then we modify it. There’s adaptivity and continual improvement of the idea over time. Eventually, we’ve systematically mapped out what doesn’t work and what does work, and we can see a path forward, but it’s evolving with our discussion.

This isn’t quite what the AIs do. The AIs can mimic this a little bit. To go back to this analogy of these jumping robots, they can jump and fail, and jump and fail. But what they can’t do is jump a little bit, reach some handhold, stay there, pull other people up, and then try to jump from there. There isn’t this cumulative process which is built up interactively. It seems to be a lot more trial and error and just repetition: brute force. It scales, and it can work amazingly well in certain contexts. But this idea of building up cumulatively from partial progress is what’s still not quite there yet.

Dwarkesh Patel

Interesting. You’re saying if Gemini 3 or Claude 4.5, whatever, solves a problem, it is not the case that its own understanding of math has progressed.

Terence Tao

No.

Dwarkesh Patel

Or even if it works on a problem without solving it, it’s not that its own understanding of math has progressed.

Terence Tao

Yeah. You run a new session and it’s forgotten what it just did. It has no new skills to build on related problems. Maybe what you just did is 0.001% of the training data for the next generation. So maybe eventually some of it gets absorbed.

00:53:00 – If AI solves a problem, can humans get understanding out of it?

Dwarkesh Patel

One big question I have is how plausible is it that if we just keep training AIs—they get better and better at solving problems in Lean—that they will continue to solve more and more impressive problems, and then we will be surprised at how little insight we got from some Lean solution to proving the Riemann hypothesis or something.

Or do you think it is a necessary condition of solving the Riemann hypothesis, even by an AI that is doing it entirely in Lean, that the constructions and definitions created in the Lean program have to advance our understanding of mathematics? Or could it just be assembly code gobbledygook?

Terence Tao

We don’t know. Some problems have been basically solved by pure brute force. The four color theorem is a famous example. We have still not found a conceptually elegant proof of this theorem, and maybe we never will. Some problems may only be solvable by splitting into an enormous number of cases and doing brute force, uninsightful computer analysis on each case.

Part of the reason we prize problems like the Riemann hypothesis is that we’re pretty sure a new type of mathematics has to be created, or a new connection between two previously unconnected areas of mathematics has to be discovered to make this work. We don’t even know what the shape of the solution is, but it doesn’t feel like a problem that will be solved just by exhaustively checking cases.

Or it could be false actually. Okay, there is an unlikely scenario that the hypothesis is false, and you can just compute a zero off the line, and a massive computer calculation verifies it. That would be very disappointing. I do feel that fully autonomous, one-shot approaches are not the right approach for these problems. You’ll get a lot more mileage out of the interplay of humans collaborating with these tools.

I can see one of these problems being solved by smart humans assisted by extremely powerful AI tools. But the exact dynamic may be very different from what we envision right now. It could be a collaboration of a type that just doesn’t exist yet.

There may be a way to generate a million variants of the Riemann zeta function and do AI-assisted data analysis to discover some pattern connecting them that we didn’t know about before. This lets you transform the problem into a different area of mathematics. There could be all kinds of scenarios.

Dwarkesh Patel

Suppose the AI figures it out, and latent in the Lean is some brand-new construction which, if we realized its significance, we would be able to apply in all these different situations. How would we even recognize it?

Again, a very naive question, but if you come up with the equivalent of Descartes’ idea that you can have a coordinate system unifying algebra and geometry, in Lean code it would just look like R→R, and it wouldn’t look that significant. I’m sure there are other constructions which have this kind of property.

Terence Tao

The beauty of formalizing a proof in something like Lean is that you can take any piece of it and study it atomically. When I read a paper which solves some difficult problem, there’s often a big sequence of lemmas and theorems. Ideally, the author will talk their way through what’s important and what’s not. But sometimes they don’t reveal what steps were the important ones and which ones were just boilerplate, standard steps.

You can study each lemma in isolation. Some of them I can see look fairly standard and resemble something I’m familiar with. I’m pretty sure there’s nothing interesting going on there. But this other lemma, that’s something I haven’t seen before, and I can see why having this result would really help prove the main result. You can assess whether a step is really key to your argument or not, and Lean really facilitates that. The individual steps are identified really precisely.

I think in the future, there will be entire professions of mathematicians who might take a giant Lean-generated proof and do some ablation on it, trying to remove parts of it and find more elegant ways. They might get other AIs to do some reinforcement learning to make the proof more elegant, and maybe other AIs will grade whether this proof looks better or not.

One thing that will change quite a bit in the near future is how we write papers. Until recently, writing papers was the most time-consuming and expensive part of the job. So you did it very rarely. You only wrote up your results once all the other parts of your argument were checked out, because rewriting and refactoring was just a total pain. That’s become a lot easier now with modern AI tools. You don’t have to have just one version of your paper. Once you have one, people can generate hundreds more.

One giant messy Lean proof may not be very meaningful or understandable on its own, but other people can refactor it and do all kinds of things with it. We’ve seen this with the Erdős problem website. An AI will generate a proof, and here are 3,000 lines of code that verify the proof. Then people got other AIs to summarize the proof, and people write their own proofs.

There’s actually post-processing. Once you have one proof, we have a lot of tools now to deconstruct and interpret it. It’s a very nascent area of mathematics, but I’m not as worried about it. Some people are concerned about what happens if the Riemann hypothesis is proven with a completely incomprehensible proof. I think once you have the artifact of a proof, we can do a lot of analysis on it.

00:59:20 – We need a semi-formal language for the way that scientists actually talk to each other

Dwarkesh Patel

You posted recently that it would be helpful to have a formal or semi-formal language for mathematical strategies as opposed to just mathematical proofs, which is what Lean specializes in. I would love to learn more about what that would involve or look like.

Terence Tao

We don’t really know. We’ve been very lucky in mathematics that we have worked out the laws of logic and mathematics, but this is a fairly recent accomplishment. It was started by Euclid two millennia ago, but only in the early 20th century did we finally list out the axioms of mathematics, the standard axioms of what we call ZFC, the axioms of first-order logic, and what a proof is. This we’ve managed to automate and have a formal language for.

There could be some way to assess plausibility. You have a conjecture that something is true, you test a few examples, and it works out. How does this increase your confidence that the conjecture is true? We have a few sort of mathematical ways to model this, like Bayesian probability, for example. But you often have to set certain base assumptions, and there’s a lot of subjectivity still in these tasks.
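
To make the Bayesian idea concrete, here is a minimal sketch of how passing test cases might raise confidence in a conjecture. The function name, the prior, and the assumed chance that a false conjecture still passes a single test are all hypothetical choices for illustration; they are exactly the kind of subjective base assumptions mentioned above.

```python
# A toy Bayesian update, not a method described in the conversation.
# Assumptions (all hypothetical): if the conjecture is true, every test
# case passes; if it is false, each test case still passes independently
# with probability `pass_prob_if_false`.

def posterior_true(prior: float, pass_prob_if_false: float, passes: int) -> float:
    """Posterior probability the conjecture is true after `passes` passing tests."""
    evidence_if_true = prior  # likelihood of all passes given "true" is 1
    evidence_if_false = (1.0 - prior) * pass_prob_if_false ** passes
    return evidence_if_true / (evidence_if_true + evidence_if_false)


if __name__ == "__main__":
    for q in (0.5, 0.9, 0.99):
        row = [posterior_true(0.5, q, k) for k in (0, 5, 20)]
        print(f"pass_prob_if_false={q}: " + ", ".join(f"{v:.3f}" for v in row))
```

The same evidence yields very different confidence depending on the assumed `pass_prob_if_false`, which is one way to see where the subjectivity enters.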

This is more of a wish than a plan to develop these languages, but just seeing how successful having a formal framework in place, like Lean, has made deductive proofs so much easier to automate and train AI on… The bottleneck for using AI to create strategies and make conjectures is we have to rely on human experts and the test of time to validate whether something is plausible or not.

If there was some semi-formal framework where this could be done semi-automatically in a way that isn’t easily hackable... It’s really important with these formal proof assistants that there are no backdoors or exploits you can use to somehow get your certified proof without actually proving it, because reinforcement learning is just so good at finding these backdoors.

If there’s some framework that mimics how scientists talk to each other in a semi-formal way, using data and argument, but also constructing narratives... There’s some subjective aspect of science that we don’t know how to capture in a way that we can insert AI into it in any useful way. This is a future problem. There are research efforts to try to create automated conjectures, and maybe there are ways to benchmark these and simulate this, but it’s all very new science.

Dwarkesh Patel

Can you help me get some intuition? I have two sub-questions. One, it would be very helpful to have a specific example of what something like this would look like, the way scientists communicate that we can’t formalize yet.

Two, it seems almost definitionally paradoxical to say you’re building up some narrative or natural language explanation and then also having something which you could have formalized. I’m sure there’s some intuition behind where that overlap is, and I’d love to understand that better.

Terence Tao

An example of a conjecture: Gauss was interested in the prime numbers and created one of the first mathematical datasets. He just computed the first 100,000 prime numbers or so, hoping to find patterns. He did find a pattern, but maybe not the pattern he was expecting. He found a statistical pattern in the primes that if you count how many primes there are up to 100, 1,000, one million, and so forth, they get sparser and sparser, but the drop-off in the density was inversely proportional to the natural logarithm of the range of numbers.

So he conjectured what we now call the prime number theorem: the number of primes up to X is approximately X divided by the natural log of X. He had no way to prove this. It was data-driven. This was a conjecture. It was revolutionary for its time because it was maybe the first really important conjecture of math that was statistical in nature. Normally you’re talking about a pattern, like maybe the spacing between the primes has a certain regularity. But this didn’t tell you exactly how many primes there were in any given range. It just gave you an approximation that got better and better as you went further and further out.
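
The data-driven observation can be reproduced with a short script. This is a minimal sketch, not a reconstruction of how Gauss worked: it uses a sieve of Eratosthenes (a choice made here for convenience) to count primes up to X and compares the count with X divided by the natural log of X. The function names are hypothetical.

```python
import math


def prime_flags(limit: int) -> bytearray:
    """Sieve of Eratosthenes: flags[n] is 1 exactly when n is prime (limit >= 2)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"  # 0 and 1 are not prime
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            # Cross out every multiple of p starting at p*p.
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return flags


def count_primes(limit: int) -> int:
    """Number of primes less than or equal to limit."""
    return sum(prime_flags(limit))


def pnt_estimate(x: int) -> float:
    """The approximation x / ln(x) from the prime number theorem."""
    return x / math.log(x)


if __name__ == "__main__":
    for x in (100, 1_000, 1_000_000):
        actual = count_primes(x)
        estimate = pnt_estimate(x)
        print(f"x={x}: primes={actual}, x/ln x={estimate:.1f}, ratio={actual / estimate:.3f}")
```

The printed ratio drifts toward 1 as X grows, which is the sense in which the approximation improves further out.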

It started the field of what we call analytic number theory. It was the first of many conjectures like this, many of which got proved, which started consolidating the idea that the prime numbers didn’t really have a pattern, that they behaved like random sets of numbers with a certain density. They had some patterns, like they’re almost all odd. They’re also not actually random, they’re what’s called pseudo-random. There’s no random number generation involved in creating the prime numbers. But over time, it became more and more productive to think of the primes as if they were just generated by some god rolling dice all the time and creating this random set.

This allowed us to make all these other predictions. There’s a still-open conjecture in number theory called the twin prime conjecture, that there should be infinitely many pairs of primes that are twins just two apart, like 11 and 13. We can’t prove that, and there are good reasons why we can’t prove it. But because of this statistical random model of the primes, we are absolutely convinced it’s true. We know that if the primes were generated by flipping coins, we would just—by random chance like infinite monkeys at a typewriter—see twin primes appear over and over again.
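
As a sketch of how the random model produces a prediction, the script below counts twin prime pairs and compares the count with a heuristic estimate. The estimate is the standard Hardy–Littlewood form, which treats each number n as prime with probability about 1/ln n and multiplies by a correction constant (twice the twin prime constant, roughly 1.32); that formula and constant are brought in here for illustration and are not stated in the conversation. The function names are hypothetical.

```python
import math

# Twin prime constant C2 (approximate value); part of the Hardy-Littlewood
# heuristic, introduced here as an assumption for illustration.
TWIN_PRIME_CONSTANT = 0.6601618158


def sieve(limit: int) -> bytearray:
    """Sieve of Eratosthenes: flags[n] is 1 exactly when n is prime (limit >= 2)."""
    flags = bytearray([1]) * (limit + 1)
    flags[0:2] = b"\x00\x00"
    for p in range(2, math.isqrt(limit) + 1):
        if flags[p]:
            flags[p * p :: p] = bytearray(len(range(p * p, limit + 1, p)))
    return flags


def count_twin_pairs(limit: int) -> int:
    """Number of pairs (p, p + 2) with both prime and p + 2 <= limit."""
    flags = sieve(limit)
    return sum(1 for p in range(3, limit - 1) if flags[p] and flags[p + 2])


def heuristic_twin_pairs(limit: int) -> float:
    """Random-model estimate: sum of 2 * C2 / (ln n)^2 for n from 3 to limit."""
    return sum(2 * TWIN_PRIME_CONSTANT / math.log(n) ** 2 for n in range(3, limit + 1))


if __name__ == "__main__":
    for x in (1_000, 10_000, 100_000):
        print(f"x={x}: twin pairs={count_twin_pairs(x)}, heuristic={heuristic_twin_pairs(x):.0f}")
```

The heuristic keeps growing without bound as the limit increases, which is why the model predicts infinitely many twin primes even though no proof is known.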

We have over time developed this very accurate conceptual model of what the primes should behave like based on statistics and probability. It’s mostly heuristic and non-rigorous, but extremely accurate. The few times when we actually can prove things about the primes, it has matched up with the predictions of what we call the random model of the primes. We have this conjectural conceptual framework for understanding the primes that everyone believes in. It’s the same reason why we believe the Riemann hypothesis is true, and why we believe that cryptography based on the primes is mathematically secure. It’s all part of this belief.

In fact, one reason why we care about the Riemann hypothesis is that if the Riemann hypothesis failed, if we knew it was false, it would be a serious blow to this model. It would mean there’s a secret pattern to the primes that we were not aware of. I think we would very rapidly abandon any cryptography based on the primes, because if there was one pattern that we didn’t know about, there are probably more, and these patterns can lead to exploits in crypto. It would be a big shock. So we really want to make sure that doesn’t happen.

We’ve been convinced of things like the Riemann hypothesis over time. Some of it is experimental evidence, and some is that the few times we’ve been able to make theoretical results, they’ve always aligned. It is possible that the consensus is wrong and we’ve all just missed something very basic. There have been paradigm shifts in the past in scientific history. But we don’t really have a way of measuring this, partly because we don’t have enough data on how math or science develops. We have one timeline of history, and we have maybe 100 stories of turning points in history.

If we had access to a million alien civilizations, each with a different development of history and science in different orders, then maybe we’d actually have a decent shot at understanding how we measure what progress is and what is a good strategy. We could maybe start formalizing it and actually having a framework. Maybe what we need to do is start creating lots of mini-universes or simulations of AI solving very basic problems in arithmetic or whatever, but coming up with their own strategies for doing these things and having these little laboratories to test. There are people who investigate what’s the smallest neural network that can do 10-digit multiplication and things like that. I think we could learn a lot just from evolving small AIs on simple problems.

01:09:48 – How Terry uses his time

Dwarkesh Patel

You have to learn about new fields not only very rapidly, but deeply enough to contribute to the frontier. So in some sense, you’re also one of the world’s greatest autodidacts. What is your process of learning about a new subfield in math? What does that look like?

Terence Tao

We talked about depth and breadth before. It’s not a purely human-AI distinction. Humans also, I think it was Berlin who split them into hedgehogs and foxes. The hedgehog knows one thing very well, and a fox knows a little bit about everything. I definitely think of myself as a fox. I work with hedgehogs a lot, and sometimes I can be a hedgehog if need be.

I’ve always had a little bit of an obsessive streak. If there’s something I read about which I feel like I have the capability to understand, but I don’t understand why it works and there’s some magic in it… Someone was able to use a type of mathematics I’m not familiar with and get a result I would like to prove. I can’t do it myself, but they could do it by their method, and I want to find out what their trick was. It bugs me that someone else can do something I think I can do, but I can’t. I’ve always had that obsessive, completionist streak. I’ve had to wean myself off computer games because if I start a game, I want to play it to completion, through all the levels. That’s one way I learn new fields.

I collaborate with a lot of people who have taught me other types of mathematics. I just make friends with another mathematician working on another area of mathematics. I find their problems interesting, but they have to teach me some of the basic tricks, what’s known, and what’s not known. I learn a lot from that.

I found that writing about what I’ve learned helps. I have a blog where I sometimes record things I’ve learned. In the past when I was younger, I would learn something, do this cool trick, and say, “Okay, I’m going to remember this.” Then six months later, I’d forgotten it. I remember remembering it, but I can’t reconstruct my arguments. The first few times, it was so frustrating to have understood something and then lost it. I resolved I should always write down anything cool that I’ve learned. That’s part of how this blog came about.

Dwarkesh Patel

How long does it take you to write a blog post?

Terence Tao

It’s something I often do when I don’t want to do other work. There’s some referee report or something that feels slightly unpleasant for me to do at the time. Writing a blog feels creative and fun. It’s something I do for myself.

Depending on the topic, it could be a quick half an hour or several hours. Because it’s something I do voluntarily, time flies when I write these things down, as opposed to doing something I have to do for administrative reasons that is just drudgery. Those are tasks, by the way, that AI is really helping with nowadays.

Dwarkesh Patel

If civilization could from first principles decide how to use Terry Tao’s time, as a limited resource, what is the biggest difference? What if the veil of ignorance got to decide how to use Terry Tao’s time versus what it does now? This podcast wouldn’t be happening.

Terence Tao

As much as I complain about certain tasks that I don’t want to do, but have to do… As you get more senior in academia, you get more and more responsibilities, more committees, and whatever. I have also found that a lot of events I reluctantly went to because I was obliged to for one reason or another… Because it’s outside my comfort zone, it often results in interactions with people I wouldn’t normally talk to, like you for instance. I would learn interesting things and have interesting experiences. I would have opportunities to then network with other people that I never would have before.

So I do believe a lot in serendipity. I do optimize portions of my day where I schedule very carefully. But I am willing to leave some portions just to do something that is not my usual thing. Maybe it’ll be a waste of my time, but maybe I will learn something. More often than not, I get a positive experience that I wouldn’t have planned for.

So I believe a lot in serendipity. Maybe there’s a danger in modern societies, not just with AI, that we’ve become really good at optimizing everything. We’re not optimizing our own optimization. With COVID, for example, we switched a lot to remote meetings, so everything was scheduled. We kept busy in academia. We met almost the same number of people we met in person, but everything had to be planned in advance. What we lost out on was the casual knocking on a hallway door, just meeting someone while getting a coffee. Those serendipitous interactions may not seem optimal, but they are actually really important.

When I was a grad student, I would go to the library to look for a journal article. You had to physically check out the journal and read the article. You could browse through and sometimes the next article was also interesting. Sometimes it wasn’t, but you could accidentally find interesting things. That has basically been lost now. If you want to access an article, you just type it into a search engine or an AI, and you get exactly what you want instantly. But you don’t get the accidental things you might have found if you’d done it more inefficiently.

I spent a year once at the Institute for Advanced Study, which is a great place with no distractions. You’re there just to do research. The first few weeks you’re there, it’s great. You’re getting all these papers written up that you’ve been wanting to do for a long time. You think about problems for blocks of hours at a time. But I find if I stay there for more than several months, I run out of inspiration. I get bored. I surf the internet a lot more.

You actually do need a certain level of distraction in your life. It adds enough randomness and high temperature. I don’t know the optimal way to schedule my life. It just seems to work.

01:17:05 – Human-AI hybrids will dominate math for a lot longer

Dwarkesh Patel

I’m very curious when you expect AIs that can actually do frontier math at least as well as the best human mathematicians.

Terence Tao

In some ways, they’re already doing frontier math that is super intelligent that humans can’t do, but it’s a different frontier from what we’re used to. You could argue that calculators were doing frontier math that humans could not accomplish, but it was number crunching.

Dwarkesh Patel

But replacing Terry Tao completely.

Terence Tao

I mean, what do you want me for?

Dwarkesh Patel

You’ll just go on all the podcasts after.

Terence Tao

It might not be the right question to ask. I think within a decade, a lot of things that math students currently do—what we spend the bulk of our time doing and a lot of stuff we put in our papers today—can be done by AI. But we will find that that actually wasn’t the most important part of what we do.

A hundred years ago, a lot of mathematicians were just solving differential equations. Physicists needed some exact solution to some system, and they hired a mathematician to laboriously go through the calculus and work out the solution to this fluid equation, whatever. A lot of what a 19th-century mathematician would do, you could make a call to Mathematica, Wolfram Alpha, a computer algebra package, or now more recently to an AI, and it would just solve the problem in a few minutes. But we moved on. We worked on different types of problems after that.

Once computers came along—computers used to be human. People used to laboriously create log tables and work out primes as Gauss did, and that has all been outsourced to computers. But we moved on.

In genetics, to sequence the genome of a single organism, that was an entire PhD of a geneticist, carefully separating all the chromosomes and whatever. Now you can just spend $1,000 and send it to a sequencer and get it done. But genetics is not dead as a subject. You move to a different scale. Maybe you study whole ecosystems rather than individuals.

Dwarkesh Patel

I take your point but when is most mathematical progress, or almost all mathematical progress, happening by AI? If you find out this year a Millennium Prize Problem has been solved, you would put 95% odds that an AI did it autonomously. Surely there will be such a year.

Terence Tao

I guess I do believe that hybrid human plus AIs will dominate mathematics for a lot longer. It will depend. It will require some additional breakthroughs beyond what we already have, so it’s going to be stochastic. I think AIs currently are very good at certain things, but really terrible at others. While you can add more and more frameworks on top to reduce the error rates and make them work with each other a bit more, it feels like we don’t have all the ingredients to really have a truly satisfactory replacement for all intellectual tasks.

It is complementary currently. It’s not a replacement. Because current level AIs will accelerate science in so many ways, hopefully new discoveries and new breakthroughs will happen more quickly. It’s also possible that by destroying serendipity we actually inhibit certain types of progress. Anything is possible at this point. I think the world is very, very unpredictable at this point in time.

Dwarkesh Patel

What is your advice to somebody who would consider a career in math or is early in a career in math, especially in light of AI progress? How should they be thinking about their career differently, if at all, as a result of AI progress?

Terence Tao

We live in a time of change. As I said, we live in a particularly unpredictable era. Things that we’ve taken for granted for centuries may not hold anymore. The way we do everything, and not just mathematics, will change. In many ways, I would prefer the much more boring, quiet era where things are much the same as they were 10 years ago, 20 years ago. But I think one just has to embrace that there’s going to be a lot of change. The things that you study, some of them may become obsolete or revolutionized, but some things will be retained.

You always have to keep an eye on opportunities for things that you wouldn’t be able to do before. In math, you previously had to go through years and years of education and be a math PhD before you could contribute to the frontier of math research. But now it’s quite possible at the high school level, or whatever, that you could get involved in a math project and actually make a real contribution because of all these AI tools, Lean, and everything else.

There will be a lot of non-traditional opportunities to learn, so you need a very adaptable mindset. There will be room for pursuing things just for curiosity and for playing around. You still need to get your credentials. For a while it will still be important to go through traditional education and learn math and science the old-fashioned way. But you should also be open to very different ways of doing science, some of which don’t exist yet. It’s a scary time, but also very exciting.

Dwarkesh Patel

That’s a great note to close on. Terence, thanks so much.

Terence Tao

Pleasure.

Dylan Patel — Deep dive on the 3 big bottlenecks to scaling AI compute

Dwarkesh Patel — Fri, 13 Mar 2026 16:00:42 GMT

Dylan Patel, founder of SemiAnalysis, provides a deep dive into the 3 big bottlenecks to scaling AI compute: logic, memory, and power.

He also walks through the economics of labs, hyperscalers, foundries, and fab equipment manufacturers.

Learned a ton about every single level of the stack. Enjoy!

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Mercury has already saved me a bunch of time this tax season. Last year, I used Mercury to request W-9s from all the contractors I worked with. Then, when it came time to issue 1099s this year, I literally just clicked a button and Mercury sent them out. Learn more at mercury.com.
Labelbox noticed that even when voice models appear to take interruptions in stride, their performance degrades. To figure out why, they built a new evaluation pipeline called EchoChain. EchoChain diagnoses voice models’ specific failure modes, letting you understand what your model needs to truly handle interruptions. Check it out at labelbox.com/dwarkesh.
Jane Street is basically a research lab with a trading desk attached – and their infrastructure backs this up. They’ve got tens of thousands of GPUs, hundreds of thousands of CPU cores, and exabytes of storage. This is what it takes to find subtle signals hidden deep within noisy market data. If this sounds interesting, you can explore open positions at janestreet.com/dwarkesh.

Timestamps

(00:00:00) – Why an H100 is worth more today than 3 years ago

(00:24:52) – Nvidia secured TSMC allocation early; Google is getting squeezed

(00:34:34) – ASML will be the #1 constraint for AI compute scaling by 2030

(00:55:47) – Can’t we just use TSMC’s older fabs?

(01:05:37) – When will China outscale the West in semis?

(01:16:01) – The enormous incoming memory crunch

(01:42:34) – Scaling power in the US will not be a problem

(01:54:44) – Space GPUs aren’t happening this decade

(02:14:07) – Why aren’t more hedge funds making the AGI trade?

(02:18:30) – Will TSMC kick Apple out from N2?

(02:24:16) – Robots and Taiwan risk

Transcript

00:00:00 – Why an H100 is worth more today than 3 years ago

Dwarkesh Patel

All right, this is the episode where my roommate teaches me semiconductors.

Dylan Patel

It’s also the send off for this current set.

Dwarkesh Patel

It is. After you use it, I’m like, “I can’t use this again. I gotta get out of here.”

Dylan Patel

No sloppy seconds for Dwarkesh.

Dwarkesh Patel

Dylan is the CEO of SemiAnalysis. Dylan, here’s the burning question I have for you. If you add up the big four—Amazon, Meta, Google, Microsoft—their combined forecasted CapEx this year that you published recently is $600 billion. Given yearly prices of renting that compute, that would be close to 50 gigawatts. Obviously, we’re not putting on 50 gigawatts this year, so presumably that’s paying for compute that is going to be coming online over the coming years. How should we think about the timeline around when that CapEx comes online?

Similar question for the labs. OpenAI just announced they raised $110 billion, and Anthropic just announced they raised $30 billion. If you look at the compute they have coming online this year—you should tell me how much it is, but is it on the order of another four gigawatts total? The cost to rent the compute that OpenAI and Anthropic will have this year to sustain their compute spend is $10 to $13 billion a gigawatt. Those individual raises alone are enough to cover their compute spend for the year. And this is not even including the revenue that they’re going to earn this year.

So help me understand: first, what is the timescale at which the Big Tech CapEx actually comes online? And second, what are the labs raising all this money for if the yearly price of a one-gigawatt data center is $13 billion?
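The back-of-envelope conversion in this question can be sketched in a few lines of Python. The function name and the flat rental rate are illustrative assumptions; the $600 billion, $10 to $13 billion per gigawatt-year, four-gigawatt, and fundraising figures are the ones quoted above.

```python
def gigawatt_years(budget_bn: float, rental_bn_per_gw_year: float) -> float:
    """Gigawatt-years of rented compute a budget covers at a flat rental rate."""
    return budget_bn / rental_bn_per_gw_year


# Figures quoted in the question, in billions of dollars.
BIG_FOUR_CAPEX_BN = 600.0
RENTAL_RANGE_BN = (10.0, 13.0)  # quoted rental cost per gigawatt-year
LAB_RAISES_BN = 110.0 + 30.0    # OpenAI plus Anthropic raises
LAB_NEW_GW = 4.0                # the "another four gigawatts" guess in the question

for rate in RENTAL_RANGE_BN:
    capex_gw = gigawatt_years(BIG_FOUR_CAPEX_BN, rate)
    lab_cost_bn = LAB_NEW_GW * rate  # one year of rental for the new lab capacity
    print(
        f"${rate:.0f}bn/GW-year: CapEx rents {capex_gw:.0f} GW; "
        f"{LAB_NEW_GW:.0f} GW costs ${lab_cost_bn:.0f}bn "
        f"vs ${LAB_RAISES_BN:.0f}bn raised"
    )
```

At $13 billion per gigawatt-year, $600 billion works out to about 46 gigawatts, which is where the "close to 50" figure comes from.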

Dylan Patel

So when you talk about the CapEx of these hyperscalers being on the order of $600 billion, and you look across the rest of the supply chain, it gets you to the order of a trillion dollars. A portion of this is immediately for compute going online this year: the chips and the other parts of CapEx that get paid this year. But there’s a lot of setup CapEx as well.

When we’re talking about 20 gigawatts of incremental added capacity this year in America, a portion of this is not spent this year. A portion of that CapEx was actually spent the prior year. When you look at Google having $180 billion, a big chunk of that is spent on turbine deposits for ‘28 and ‘29. A chunk of that is spent on data center construction for ‘27. A chunk of that is spent on power purchasing agreements, down payments, and all these other things they’re doing further out into the future so they can set up this super fast scaling. This applies to all the hyperscalers and other people in the supply chain.

So with roughly 20 gigawatts deployed this year, a big chunk is hyperscalers, and a chunk is not. For all of these companies, their biggest customers are Anthropic and OpenAI. Anthropic and OpenAI are at roughly two to two-and-a-half gigawatts right now, and they’re trying to scale much larger.

If you look at what Anthropic has done over the last few months, with $4 billion or $6 billion in revenue added, we can just draw a straight line and say they’ll add another $6 billion of revenue a month. People would argue that’s bearish, and that they should go faster. What that implies is they’re going to add $60 billion of revenue across the next ten months. At the current gross margins Anthropic had, as last reported by media, that would imply they have roughly $40 billion of compute spend for that inference, for that $60 billion of revenue.

That $40 billion of compute, at roughly $10 billion a gigawatt in rental costs, means they need to add four gigawatts of inference capacity just to grow revenue. That’s assuming their research and development training fleet stays flat. In a sense, Anthropic needs to get to well above five gigawatts by the end of this year. It’s going to be really tough for them to get there, but it’s possible.
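The chain of arithmetic here (monthly revenue adds, compute spend, rental cost per gigawatt) can be written out as a rough sketch. The function and variable names are hypothetical, and treating the "two to two-and-a-half gigawatts" figure as Anthropic's own starting footprint is an assumption made only for illustration.

```python
def added_inference_gigawatts(
    monthly_revenue_add_bn: float,
    months: int,
    compute_spend_bn: float,
    rental_bn_per_gw: float,
) -> tuple[float, float, float]:
    """Return (added revenue, implied gross margin, added gigawatts)."""
    added_revenue_bn = monthly_revenue_add_bn * months
    implied_gross_margin = 1.0 - compute_spend_bn / added_revenue_bn
    added_gw = compute_spend_bn / rental_bn_per_gw
    return added_revenue_bn, implied_gross_margin, added_gw


# Numbers quoted above: $6bn added per month for ten months,
# about $40bn of compute spend, about $10bn per gigawatt of rental cost.
revenue_bn, margin, added_gw = added_inference_gigawatts(6.0, 10, 40.0, 10.0)

# Assumed starting footprint (illustrative reading of "two to two-and-a-half").
current_gw_range = (2.0, 2.5)
year_end_gw = tuple(gw + added_gw for gw in current_gw_range)

print(f"Added revenue: ${revenue_bn:.0f}bn, implied gross margin: {margin:.0%}")
print(f"Added inference capacity: {added_gw:.1f} GW")
print(f"Year-end footprint: {year_end_gw[0]:.1f} to {year_end_gw[1]:.1f} GW")
```

Under these inputs the added capacity is four gigawatts and the year-end total lands at six to six and a half gigawatts, consistent with "well above five".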

Dwarkesh Patel

Can I ask a question about that? If Anthropic was not on track to have five gigawatts by the end of this year, but it needs that to serve both the revenue that’s gone crazier than expected—and maybe it’s going to be even more than that—plus the research and training to make sure its models are good enough for next year: Where is that capacity going to come from?

Dylan Patel

Dario, when he was on your podcast, was very conservative. He said, “I’m not going to go crazy on compute because if my revenue inflects at a different rate, at a different point… I don’t want to go bankrupt. I want to make sure that we’re being responsible with this scaling.” But in reality, he’s screwed the pooch compared to OpenAI, whose approach was, “Let’s just sign these crazy fucking deals.”

OpenAI has got way more access to compute than Anthropic by the end of the year. What does Anthropic have to do to get the compute? They have to go to lower-quality providers that they would not have gone to before. Anthropic historically had the best quality providers, like Google and Amazon, the biggest companies in the world. Now Microsoft is expanding across the supply chain, and they’re going to other newer players.

OpenAI has been a bit more aggressive on going to many players. Yes, they have tons of capacity from Microsoft, Google, and Amazon, but they also have tons with CoreWeave and Oracle. They’ve gone to random companies, or companies one would think are random, like SoftBank Energy, who has never built a data center in their life but is building data centers now for OpenAI. They’ve gone to many others, like NScale, to get capacity.

There’s this conundrum for Anthropic because they were so conservative on compute, because they didn’t want to go crazy. In some sense, a lot of the financial freakouts in the second half of last year were because, “OpenAI signed all these deals but they didn’t have the money to pay for them…” Okay, Oracle’s stock is going to tank, CoreWeave’s stock is going to tank. All these companies’ stocks tanked, and credit markets went crazy because people thought the end buyer couldn’t pay for this. Now it’s like, “Oh wait, they raised a ton of money. Okay, fine, they can pay for it.”

Anthropic was a lot more conservative. They were like, “We’ll sign contracts, but we’ll be principled. We’ll purposely undershoot what we think we can possibly do and be conservative because we don’t want to potentially go bankrupt.”

Dwarkesh Patel

The thing I want to understand is, what does it mean to have to acquire compute in a pinch? Is it that you have to go with neoclouds? Do they have worse compute? In what way is it worse?

Did you have to pay gross margins to a cloud provider that you wouldn’t have otherwise had to pay because they’re coming in at the last minute? Who built the spare capacity such that it’s available for Anthropic and OpenAI to get last minute?

What is the concrete advantage that OpenAI has gotten if they end up at similar compute numbers by 2027? Are they just going to end this year with different gigawatts? If so, how many gigawatts are Anthropic and OpenAI going to have by the end of this year?

Dylan Patel

To acquire excess compute, yes, there is capacity at hyperscalers. Not all contracts for compute are long-term, five-year deals. There’s compute from 2023 or 2024, or H100s from 2025, that were signed at shorter terms. The vast majority of OpenAI’s compute is signed on five-year deals, but there were many other customers that had one-year, two-year, three-year, or six-month deals, on demand.

As these contracts roll off, who is the participant in the market most willing to pay price? In this sense, we’ve seen H100 prices inflect a lot and go up. People are willing to sign long-term deals for above $2 even. I’ve seen deals where certain AI labs—I’m being a little bit vague here for a reason—have signed at as high as $2.40 for two to three years for H100s. If you think about the margin, it costs $1.40 to build Hopper, across five years. Now, two years in, you’re signing deals for two to three years at $2.40? Those margins are way higher.

Now you can crowd out all of these other suppliers, whether Amazon had these, or CoreWeave, or Together AI, or Nebius, or whoever it is. These neoclouds are the firms that had a higher percentage of Hopper in general because they were more aggressive on it. They also tended to sign shorter-term deals, not CoreWeave but the others. So if I want Hopper, there is some capacity out there.

Also, while most of the capacity at an Oracle or a CoreWeave is signed for a long-term deal in terms of Blackwell, anything that’s going online this quarter is already sold. In some cases, they’re not even hitting all the numbers they promised they would sell because there are some data center delays, not just those two, but Nebius, Microsoft, Amazon, and Google. But there are a lot of neoclouds, as well as some of the hyperscalers, who have capacity they’re building that they haven’t sold yet, or capacity they were going to allocate to some internal use that is not necessarily super AGI-focused, that they may now turn around and sell.

Or in the case of Anthropic, they don’t have to have all the compute directly. Amazon can have the compute and serve Bedrock, or Google can have the compute and serve Vertex, or Microsoft can have the compute and serve Foundry, and then do a revenue share with Anthropic, or vice versa.

Dwarkesh Patel

Basically, you’re saying Anthropic is having to pay either this 50% markup in the sense of the revenue share, or in the sense of last-minute spot compute that they wouldn’t have otherwise had to pay had they bought the compute early.

Dylan Patel

Right, there’s a trade-off there. But at the same time, for a solid four months, everyone was saying to OpenAI, “We’re not going to sign deals with you.” That sounds crazy, but it was because, “you don’t have the money.” Now everyone’s saying, “OpenAI, we believed you the whole time. We can sign any deal because you’ve raised all this money.” Anthropic is constrained in that sense. There are not that many incremental buyers of compute yet, because Anthropic hit the capability tier first where their revenue is mooning.

Dwarkesh Patel

That’s interesting. Otherwise you might think having the best model is an extremely depreciating asset, because three months later you don’t have the best model. But the reason it’s important is that you can sign these deals, lock in the compute in advance, and get better prices.

Maybe this is an obvious point. But at least until recently, people had made this huge point about the depreciation cycle of a GPU. The bears, the Michael Burrys or whoever, have said, “Look, people are saying four or five years for these GPUs. Maybe it’s because the technology is improving so fast, but it in fact makes sense to have two-year depreciation cycles for these GPUs,” which increases the reported amortized CapEx in a given year and makes it financially less lucrative to build all these clouds.

But in fact you’re pointing out that maybe the depreciation cycle is even longer than five years. If we’re using Hoppers—especially if AI really takes off and in 2030 we’re saying, “We have to get the seven-nanometer fabs up, we have to go back and turn on the A100s again”—then the depreciation cycle is actually incredibly long. I feel like that’s an interesting financial implication of what you’re saying.

Dylan Patel

There’s a few strings to pull on there. One is, what happens to depreciation of GPUs? I guess I didn’t answer your prior question, which is that I think Anthropic will be able to get to five gigawatts-ish, maybe a little bit more by the end of the year through themselves as well as their product being served through Bedrock, Vertex, or Foundry. I think they’ll be able to get to five or six gigawatts, which is way above their initial plans. OpenAI will be roughly the same, actually a little bit higher based on our numbers.

But anyway, the depreciation cycle of a GPU. Michael Burry was saying it’s three years or less. That’s sort of his argument. There are two lenses to look at this. Mechanically, there’s a TCO model, total cost of ownership of a GPU, where we project pricing out for GPUs and build up the total cost of a cluster. There are a number of costs: your data center cost, your networking cost, your smart hands and people in the data center swapping stuff out. There’s your spare parts, your actual chip cost, your server cost. All these various costs get lumped together. There’s some depreciation cycles on it, certain credit costs on it.

You build up to, “Hey, an H100 costs $1.40/hour to deploy at volume across five years if your depreciation is five years.” If you sign a deal at $2/hour for those five years, your gross margin is roughly 35%. It’s a little bit above that. If you sign it for $1.90, it’s 35% roughly. Then you assume at that fifth year, the GPU falls off a bus and is dead.

In some cases, the argument people are making is if you didn’t sign a long-term deal, because every two years NVIDIA is tripling or quadrupling the performance while only 2X-ing or 50% increasing the price… Then the price of an H100… Sure maybe the value in the market was $2 at 35% gross margins in 2024, but in 2026, when Blackwell is in super high volume and deploying millions a year, you’re actually now worth $1/hour. And when Rubin in ‘27 is in super high volume—even though it starts shipping this year, it’s super high volume next year—doing millions of chips a year deployed into clouds, you’ve got another 3X in performance, another 50% or 2X in price, then the Hopper is only worth $0.70/hour. So the price of a GPU would continue to fall. That’s one lens.
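This first lens amounts to repricing the old chip so that it matches the new chip's price per unit of performance. A minimal sketch, assuming one point from each quoted range (4x performance at 2x price for Blackwell, 3x performance at 2x price for Rubin); the function name and those picks are illustrative, not output from a pricing model.

```python
def repriced_hourly_value(
    old_value_per_hour: float,
    perf_multiple: float,
    price_multiple: float,
) -> float:
    """Hourly value of an old chip once a new generation sets price per performance.

    The new chip delivers `perf_multiple` times the performance at
    `price_multiple` times the price, so the old chip's competitive
    price shrinks by the ratio price_multiple / perf_multiple.
    """
    return old_value_per_hour * price_multiple / perf_multiple


h100_2024 = 2.00  # quoted market value per hour in 2024

# Assumed picks from the quoted ranges (3-4x performance, 1.5-2x price).
h100_2026 = repriced_hourly_value(h100_2024, perf_multiple=4.0, price_multiple=2.0)
h100_2027 = repriced_hourly_value(h100_2026, perf_multiple=3.0, price_multiple=2.0)

print(f"2024: ${h100_2024:.2f}/hr")
print(f"2026 (Blackwell at volume): ${h100_2026:.2f}/hr")
print(f"2027 (Rubin at volume): ${h100_2027:.2f}/hr")
```

With these picks the H100 goes from $2.00 to $1.00 and then to about $0.67 per hour, in line with the roughly $1 and $0.70 figures mentioned.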

The other lens is, what is the utility you get out of the chip? If you could build infinite Rubin or infinite of the newest chip, then yes, that’s exactly what would happen. The price of a Hopper would fall at a spot or short-term contract rate as the new chips come out and the price per performance goes up. But because you are so limited on semiconductors and deployment timelines, what actually prices these chips is not the comparative thing I can buy today, but rather what is the value I can derive out of this chip today.

In that sense, let’s take GPT-5.4. GPT-5.4 is both way cheaper to run than GPT-4 and has fewer active parameters. It’s much smaller, in that sense of active parameter, because it’s a sparser MoE versus GPT-4 being a coarser MoE. There’s also been so many other advancements in training, RL, model architecture, and data quality that have made GPT-5.4 way better than GPT-4. And it’s cheaper to serve. When you look at an H100, it can serve more tokens per GPU of 5.4 than if you had run GPT-4 on it. So it’s producing more tokens of a model that is of higher quality.

What is the maximum TAM for GPT-4 tokens? Maybe it was a few billion dollars, maybe it was tens of billions of dollars. Adoption takes time. For GPT-5.4, that number is probably north of a hundred billion. But there’s an adoption lag, there’s competition, and there’s the constant improvements that everyone else is having. If improvements stopped here, the value of an H100 is now predicated on the value that GPT-5.4 can get out of it instead of the value that GPT-4 can get out of it. These labs are in a competitive environment, so their margins can’t go to infinity. You sort of have this dynamic that is quite interesting in that an H100 is worth more today than it was three years ago.

Dwarkesh Patel

That’s crazy. It’s also interesting from the perspective of just taking that forward. If we had actual AGI models developed, if we had a genuine human on a server… These are such hand wave-y numbers about how many flops the brain can do. But on a flop basis, an H100 is estimated to do 1e15, which is how much some people estimate the human brain does in flops. Obviously, in terms of memory, the human brain has way more. An H100 is 80 gigabytes, and the brain might have petabytes.

Dylan Patel

Oh, yeah, you’ve got petabytes? Name a petabyte of ones and zeros, bro. Name me a string.

Dwarkesh Patel

Well, this is actually the point.

Dylan Patel

No, we’ve just got the best sparse attention techniques ever.

Dwarkesh Patel

Genuinely though. In the amount of information that is compressed, it might be petabytes. The brain is an extremely sparse MoE. But anyways, imagine a human knowledge worker can produce six figures a year of value. If an H100 can produce something close to that, if we had actual humans on a server, the value of an H100 is such that it can repay itself in the course of a couple of months.

So when I interviewed Dario, the point I was trying to make is not that I think the singularity is two years away and therefore Dario desperately needs to buy more compute, although the revenue is certainly there that he needs to buy more compute. The point I was trying to make is that given what Dario seems to be saying—given his statements that we’re two years away from a data center of geniuses, and certainly not more than five years away, and a data center of geniuses should be earning trillions upon trillions of dollars of revenue—it just does not make sense why he keeps making these statements about being more conservative on compute or, to your point, being less aggressive than OpenAI on compute.

I guess that point got lost because then people were roasting me, saying, “Oh, this podcaster is trying to convince this multi-hundred billion dollar company CEO to YOLO it, bro.” I was just trying to say that internally, his statements are inconsistent. Anyway, it’s good to iron it out.

Dylan Patel

I think going back to the earlier view that if the models are so powerful, the value of a GPU goes up over time, right now only OpenAI and Anthropic have that viewpoint. But as we approach further out, everyone is going to be able to see that value skyrocket per GPU. So in that sense, you should commit now to compute.

Interestingly, in Anthropic fashion, there’s a bit of a meme that they have commitment issues and are sort of polyamorous. Not Dario, but this is a bit of a meme.

Dwarkesh Patel

Explains everything. By the way, there’s this interesting economic effect called Alchian-Allen, which is the idea that if you increase the fixed cost of different goods, one of which is higher quality and one of which is lower quality, that will make people choose the higher quality good, on the margin.

To give a specific example, suppose the better-tasting apple costs two dollars and the shittier apple costs one dollar. Now suppose you put an import tariff on them. Now it’s $3 versus $2 for a great apple versus a medium apple.

Dylan Patel

Is that because they both increased by a dollar, or should it be a 50% increase?

Dwarkesh Patel

No, because they both increased by $1. The whole effect is that if there’s a fixed cost that is applied to both, then the price difference between them, the ratio, changes. Previously, the more expensive one was 2X more expensive. Now it’s just 1.5X more expensive.
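The apple example can be checked with a short sketch that contrasts a fixed per-unit charge with a proportional one. The function names are hypothetical; the $2 and $1 prices, the $1 charge, and the 50% alternative are the numbers from the exchange above.

```python
def relative_price_fixed(premium: float, basic: float, fixed_charge: float) -> float:
    """Price ratio after adding the same per-unit charge to both goods."""
    return (premium + fixed_charge) / (basic + fixed_charge)


def relative_price_proportional(premium: float, basic: float, rate: float) -> float:
    """Price ratio after scaling both prices by the same percentage."""
    return (premium * (1.0 + rate)) / (basic * (1.0 + rate))


before = relative_price_fixed(2.0, 1.0, fixed_charge=0.0)
with_fixed_dollar = relative_price_fixed(2.0, 1.0, fixed_charge=1.0)
with_fifty_percent = relative_price_proportional(2.0, 1.0, rate=0.5)

print(f"Before any charge: {before:.1f}x")
print(f"After a $1 charge on each: {with_fixed_dollar:.1f}x")
print(f"After a 50% increase on each: {with_fifty_percent:.1f}x")
```

The fixed $1 charge moves the ratio from 2x to 1.5x, while a 50% increase on both leaves it at 2x, which is why the effect depends on the charge being a fixed amount.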

So I wonder if applied to AI that would mean that, if GPUs are going to get more expensive, there will be a fixed cost increase in the price of compute. As a result, that will push people to be willing to pay higher margins for slightly better models. Because the calculus is, I’m going to be paying all this money for the compute anyway. I might as well just pay slightly more to make sure it’s the very best model rather than a model that’s slightly worse.

Dylan Patel

So the Hopper went from $2 to $3. If a Hopper can make a million tokens of Opus and it can make two million tokens of Sonnet, the price differential between Opus and Sonnet has decreased because the price of the GPU has increased by a dollar from $2 to $3.

Interesting. I think that makes a ton of sense. We just see all of the volumes are on the best models today, all the revenue is on the best models today. In a compute-limited world, two things happen. One, companies that don’t have commitment issues and have these five-year contracts for compute have locked in a humongous margin advantage. They’ve locked in compute for five years at the price it transacted at two, three, or five years ago.

Whereas if you’re three years into that five-year contract and someone else’s two-year or three-year contract rolled off, and now they’re trying to buy that at modern pricing, when it’s priced to the value of models, the price is going to be up a lot more. So the person who committed early has better margins in general. The percentage of the market that is in long-term contracts is much larger than the percentage of the market in short-term contracts that can be this flex capacity you add at the last second.

At the same time, where does the margin go? Because models get more valuable, how much can the cloud players flex their pricing? If you look at CoreWeave, their average term duration is over three years right now. For ninety-eight percent plus of their compute, it’s over three years. They end up with this conundrum where they can’t actually flex price. But every year they’re adding incrementally way more capacity than they had previously.

This year alone, Meta’s adding as much capacity as they had in their entire fleet of compute and data centers for all purposes for serving WhatsApp, Instagram, and Facebook in 2022, and doing AI. They’re adding that alone this year.

In the same sense, you talk about Meta doing that, CoreWeave, Google, and Amazon, all these companies are adding insane amounts of compute year on year. That new compute gets transacted at the new price. In a sense, yes, you’ve locked in, as long as we’re in a takeoff. “Oh, OpenAI went from six hundred megawatts to two gigawatts last year, and from two gigawatts to six plus this year, and six to twelve next year.” The incremental added compute is where all the cost is, not the prior long-term contracts.

Then who holds the cards is the infra providers for charging margin. Now the cloud players, the neoclouds, or the hyperscalers can charge the margin. They can to some extent, but then as you go upstream to who has access to all the memory and logic capacity, it’s Nvidia for the most part. They’ve signed a lot of long-term contracts. They’ve got ninety billion dollars of long-term contracts today, and they’re negotiating three-year deals today with the memory vendors.

You’ve got Amazon and Google through Broadcom, Amazon directly, and AMD. These companies hold all the cards because they’ve secured the capacity. TSMC is not raising prices, but memory vendors are, to some extent, raising a lot of price. They’re going to double or triple price again, but then they’re also signing these long-term deals.

Who is able to accrue all the margin dollars is potentially the cloud, potentially the chip vendors, and the memory vendors, until TSMC or ASML break out and say, “No, we’re going to charge a lot more.” But at the same time, do the model vendors get to charge crazy margins? At least this year, we’re going to see margins for the model vendors go up a lot. Because they’re so capacity constrained, they have to destroy demand. There’s no way Anthropic can continue at the current pace without destroying demand.

00:24:52 – Nvidia secured TSMC allocation early; Google is getting squeezed

Dwarkesh Patel

Let’s get into logic and memory. How specifically has Nvidia been able to lock up so much of both? I think according to your numbers, by ‘27, Nvidia is going to have +70% of N3 wafer capacity, or around that area. I forget what the numbers were for memory at SK Hynix and Samsung and so forth.

Think about how the neocloud business works and how Nvidia works with that, or how the RL environment business works and how Anthropic works with that. In both those cases, Nvidia is purposely trying to fracture the complementary industry to make sure that they have as much leverage as possible. They’re giving allocation to random neoclouds to make sure that there’s not one person that has all the compute.

Similarly, Anthropic or OpenAI, when they’re working with the data providers, they say, “No, we’re going to just seed a huge industry of these things so that we’re not locked into any one supplier for data environments.”

And I wonder why on the 3 nm process—that’s going to be Trainium 3, that’s going to be TPU v7, other accelerators potentially—why is TSMC just giving it all up to Nvidia rather than trying to fracture the market?

Dylan Patel

There are a couple points here. On 3 nm, if we go back to last year, the vast majority of 3 nm was Apple. Apple is being moved to 2 nm. Memory prices are going up, so Apple’s volumes may go down. As memory prices go up, either they cut margin or they move on. There’s some time lag because they have long-term contracts, but Apple likely reduces demand or moves to 2 nm faster, where 2 nm is only capable of mobile chips today. In the future, AI chips will move there. So Apple has that.

Apple is also talking to third-party vendors because they’re getting squeezed out of TSMC a little bit. TSMC’s margins on high-performance computing—HPC, AI chips, et cetera—are higher than they are for mobile, because they have a bigger advantage in HPC than they do in mobile.

When you look at TSMC’s running calculus here, they’re actually providing really good allocations to companies that are doing CPUs. When you think about Amazon having Trainium and Graviton, both of those are on 3 nm, Graviton being their CPU, Trainium being their AI chip. TSMC is much more excited to give allocation to Graviton than they are to Trainium because they view the CPU business as more stable, long-term growth.

As a company that is conservative and doesn’t want to ride cycles of growth too hard, you actually want to allocate to the market that is more stable with a lower growth rate first before you allocate all the incremental capacity to the fast growth rate market. That is the case generally. Same for AMD. The allocations they get on their CPUs, TSMC is much more excited about those than they are for GPUs. Likewise for Amazon.

Nvidia is a bit unique because yes, they have CPUs, they make switches, they make networking, NVLink, InfiniBand, Ethernet, NICs. By and large, most of these things will be on 3 nm by the end of this year with the Rubin launch and all the chips in that family, the GPU being the most important one. Yet Nvidia is getting the majority of supply.

Part of this is because you look at the market and TSMC and others forecast market demand in many ways, but it’s also the market signal. The market signaled, “Hey, we need this much capacity next year. We need this much. We’ll sign non-cancelable, non-returnable. We may even pay deposits.” Nvidia just did it way earlier than Google or Amazon. In some cases, Google and Amazon had stumbling blocks. One of the chips got delayed slightly by a couple quarters. Trainium and all these sorts of things happened.

In that case, there was a huge sort of, “Well, these guys are delaying, but Nvidia is wanting more, more, more, more. And we are checking with the rest of the supply chain, is there enough capacity?” They’re going to all the PCB vendors and saying, “Is there enough PCB?” Victory Giant is one of the largest suppliers of PCBs to Nvidia, and they’re a Chinese company. All the PCBs come from China, or many of them. They’re like, “Do you have enough PCB capacity? Great. Hey memory vendors, who has all the memory capacity? Okay, Nvidia does. Great.”

When you look at who is AGI-pilled enough to buy compute on long timelines at levels that seem ridiculous to people who aren’t AGI-pilled—but nonetheless, they’re willing to pay a pretty good margin and sign it now because they view in the future that ratio is screwed up—the same thing happens with the supply chain for semiconductors. I don’t think Nvidia is quite AGI-pilled. Jensen doesn’t believe software is going to be fully automated and all these things.

Dwarkesh Patel

Accelerated computing, not AI chips, right?

Dylan Patel

It’s AI chips.

Dwarkesh Patel

But that’s what he calls it, right?

Dylan Patel

Yeah. I think it’s a broader term, AI is within that, but also physics modeling and simulations.

Dwarkesh Patel

But it’s like he’s not embracing the main use case.

Dylan Patel

I think he’s embracing it, but I just don’t think he’s AGI-pilled like Dario or Sam. But he’s still way, way more AGI-pilled than Google was in Q3 of last year, or Amazon was in Q3 of last year, and he saw way more demand.

The reason is pretty simple. You can see all the data center construction. He’s like, “Okay, I want to have this market share.” We have all the data centers tracked, and there’s a lot of data centers that could be one or the other. To some extent, Google and Amazon, Google especially, even though their TPU is just better for them to deploy, they have to deploy a crap load of GPUs because they don’t have enough TPUs to fill up their data centers. They can’t get them fabbed.

Dwarkesh Patel

I have a question about that. Google sold a million, was it the v7s?

Dylan Patel

Yes.

Dwarkesh Patel

—the Ironwoods to Anthropic, and you’re saying the big bottleneck right now, this year or next year, I guess going forward forever now, is going to be the logic and memory, the stuff it takes to build these chips. Google has DeepMind, the third prominent AI lab. If this is the big bottleneck, why would they sell it rather than just giving it to DeepMind?

Dylan Patel

This is again a problem of… DeepMind people were like, “This is insane. Why did we do this?” But Google Cloud people and Google executives saw a different thought process.

You and I know the compute team at Anthropic. Both of the main people came from Google. They saw this dislocation, they negotiated a deal, and they were able to get access to this compute before Google realized. The chain of events, at least from our data that we found, was in early Q3, over the course of six weeks, we saw capacity on TPUs go up by a significant amount. It went up multiple times in those six weeks.

There were multiple requests. Google even had to go to TSMC and explain to them why they needed this increase in capacity because it was so sudden. A lot of that capacity increase was for selling to Anthropic. Because Anthropic saw it before Google.

And then Google had Nano Banana and Gemini 3 which caused their user metrics to skyrocket. Then leadership at Google was like, “Oh.” Then they started making the statement that we have to double compute every six months, or whatever the exact number was.

They really woke up a lot more, and then they went to TSMC and said, “We want more. We want more.” TSMC replied, “Sorry guys, we’re sold out. We can maybe get 5-10% more for 2026, but really we’re going to work on 2027.”

There was this information asymmetry among the labs, in my mind. I don’t know exactly. It’s the narrative I’ve spun myself from seeing all the data in the supply chain on wafer orders and what’s going on with the data centers that Anthropic and Fluidstack signed.

It’s pretty clear to me that Google screwed up. You can see this from Google’s Gemini ARR. They had next to nothing in Q1 to Q3—in Q3 a little bit once they started inflecting. But in Q4 they reached $5 billion in revenue on an ARR basis. It’s clear Google didn’t see revenue skyrocket initially. In a sense, Anthropic had a little bit of commitment issues before their ARR exploded, even though they had far more information asymmetry and saw what was coming down the pipe. Google is going to be more conservative than Anthropic and Google had even less ARR. So they were just not willing to do it, and then they realized they should do it.

Since then, Google has gotten absurdly AGI-pilled in terms of what they’re doing. They bought an energy company. They’re putting deposits down for turbines. They’re buying a ridiculous percentage of powered land. They’re going to utilities and negotiating long-term agreements. They’re doing this on the data center and power side very aggressively. I think Google woke up towards the end of last year, but it took them some time.

Dwarkesh Patel

How many gigawatts do you think Google will have by the end of next year?

Dylan Patel

Buy my data.

Dwarkesh Patel

You charge for that kind of information.

Dylan Patel

Yes, yes.

00:34:34 – ASML will be the #1 constraint for AI compute scaling by 2030

Dwarkesh Patel

I feel like every year the bottleneck for what is preventing us from scaling AI compute keeps changing. A couple years ago it was CoWoS. Last year it was power. You’ll tell me what the bottleneck is this year.

But I want to understand five years out, what will be the thing that is constraining us from deploying the singularity?

Dylan Patel

The biggest bottleneck is compute. For that, the longest lead time supply chains are not power or data centers. They’re actually the semiconductor supply chains themselves. It switches back from power and data centers as a major bottleneck to chips.

In the chip supply chain, there’s a number of different bottlenecks. There’s memory, logic wafers from TSMC, and the fabs themselves. Construction of the fabs takes two to three years, versus a data center which takes less than a year. We’ve seen Amazon build data centers in as fast as eight months. There’s a big difference in lead times because of the complexity of building the fab that actually makes the chips. The tools also have really long lead times.

The bottlenecks, as we’ve scaled, have shifted based on what the supply chain is currently not able to do. It was CoWoS, power, and data centers, but those were all shorter lead time items. CoWoS is a much simpler process of packaging chips together. Power and data centers are ultimately way simpler than the actual manufacturing of the chips. There’s been some sliding of capacity across mobile or PC to data center chips, which has been somewhat fungible.

Whereas CoWoS, power, and data centers have had to start anew as supply chains. But now there’s no more capacity for the mobile and PC industries—which used to be the majority of the semiconductor industry—to shift over to AI. Nvidia is now the largest customer at TSMC and SK Hynix, the largest memory manufacturer. It’s sort of impossible for the sliding of resources away from the common person’s PCs and smartphones to shift any more towards the AI chips. So now the question is how do we scale AI chip production? That’s the biggest bottleneck as we go to 2030.

Dwarkesh Patel

It would be very interesting if there’s an absolute gigawatt ceiling that you can project out to 2030 based just on “We can’t produce more than this many EUV machines.”

Dylan Patel

To scale compute further, there are different bottlenecks this year and next year, but ultimately by 2028 or 2029, the bottleneck falls to the lowest rung on the supply chain, which is ASML. ASML makes the world’s most complicated machine: an EUV tool. The selling price for those is $300-400 million. Currently, they can make about 70. Next year, they’ll get to 80. Even under very aggressive supply chain expansion, they only get to a little bit over 100 by the end of the decade.

What does that mean? They can make a hundred of these tools by the end of the decade, and 70 right now. How does that actually translate to AI compute? We see all these numbers from Sam Altman and many others across the supply chain: gigawatts, gigawatts, gigawatts. How many gigawatts are we adding? We see Elon saying a hundred gigawatts in space.

Dwarkesh Patel

A year.

Dylan Patel

A year. The problem with any of these numbers, or the challenge to these numbers, is actually not the power or the data center. We can dive into that, but it’s manufacturing the chips.

Take a gigawatt of Nvidia’s Rubin chips. Rubin is announced at GTC, I believe the week this podcast goes live. To make a gigawatt worth of data center capacity of Nvidia’s latest chip that they’re releasing towards the end of this year, you need a few different wafer technologies. You need about 55,000 wafers of 3 nm. You need about 6,000 wafers of 5 nm, and then you need about 170,000 wafers of DRAM memory.

Across these three different buckets, each requires different amounts of EUV. When you manufacture a wafer, there are thousands and thousands of process steps where you’re depositing material and removing them. But the key critical step—which at least in advanced logic is 30% of the cost of the chip—is something that doesn’t actually put anything on the wafer. You take the wafer, you deposit photoresist, which is a chemical that chemically changes when you expose it to light. Then you stick it into the EUV tool, which shines light at it in a certain way. It patterns it. There’s what’s called a mask, which is effectively a stencil for the design.

When you look at a leading-edge 3 nm wafer, it has 70 or so masks, 70 or so layers of lithography, but 20 of them are the most advanced EUV. If you need 55,000 wafers for a gigawatt, and you do 20 EUV passes per wafer, you can do the math. That’s 1.1 million passes of EUV for a single gigawatt. It’s pretty simple. Once you add the rest of the stuff, it ends up being 2 million, across 5 nm and all the memory. You’re at roughly 2 million EUV passes for a single gigawatt.

These tools are very complicated. When you think about what it’s doing across a wafer, it’s taking the wafer and scanning and stepping across. It does this dozens of times across the whole wafer. When you’re talking about how many EUV passes, that’s the entire wafer being exposed at a certain rate.

An EUV tool can do roughly 75 wafers per hour, and the tool is up roughly 90% of the time. In the end, you need about three and a half EUV tools to do the 2 million EUV wafer passes for the gigawatt. So three and a half EUV tools satisfies a gigawatt.
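
The arithmetic in this answer can be checked with a short sketch. The wafer counts, pass counts, throughput, and uptime are the figures quoted above; the assumption that a tool runs around the clock for a full year is mine, and the function names are illustrative only.

```python
# Figures quoted in the conversation (approximate, per gigawatt of Rubin capacity).
WAFERS_3NM = 55_000
EUV_PASSES_PER_3NM_WAFER = 20
TOTAL_EUV_PASSES = 2_000_000  # quoted total once 5 nm and DRAM wafers are included

WAFERS_PER_HOUR = 75
UPTIME = 0.90
HOURS_PER_YEAR = 24 * 365  # assumption: the tool is scheduled 24/7 for one year


def passes_per_tool_per_year(wafers_per_hour: float, uptime: float) -> float:
    """Wafer passes a single EUV tool can expose in one year."""
    return wafers_per_hour * uptime * HOURS_PER_YEAR


def tools_needed(total_passes: float, wafers_per_hour: float, uptime: float) -> float:
    """EUV tools required to expose `total_passes` wafer passes within one year."""
    return total_passes / passes_per_tool_per_year(wafers_per_hour, uptime)


logic_passes = WAFERS_3NM * EUV_PASSES_PER_3NM_WAFER
yearly_passes = passes_per_tool_per_year(WAFERS_PER_HOUR, UPTIME)
tools_per_gw = tools_needed(TOTAL_EUV_PASSES, WAFERS_PER_HOUR, UPTIME)

print(f"3 nm EUV passes per GW: {logic_passes:,}")
print(f"Passes per tool per year: {yearly_passes:,.0f}")
print(f"EUV tools per GW: {tools_per_gw:.2f}")
```

Under these assumptions the result lands slightly below the quoted "three and a half" tools per gigawatt, which is consistent with the speaker rounding up.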

It’s funny to think about the numbers. What does a gigawatt cost? It costs roughly $50 billion. Whereas what do three and a half EUV tools cost? That’s $1.2 billion. It’s actually a much lower number, which is interesting to think about. That is $50 billion of CapEx in the data center, and what gets built on top of that in terms of tokens is even larger. It might be $100 billion worth of AI value into the supply chain, held up by this $1.2 billion worth of tooling that simply cannot expand its supply chain quickly.
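
To make the cost comparison concrete, here is a minimal sketch using the quoted price range of $300-400 million per tool. Taking the midpoint of that range is my assumption, as are the variable names.

```python
TOOLS_PER_GW = 3.5                       # quoted: about three and a half tools per gigawatt
TOOL_PRICE_RANGE_USD = (300e6, 400e6)    # quoted selling price range per EUV tool
DATA_CENTER_COST_PER_GW_USD = 50e9       # quoted: roughly $50 billion per gigawatt

# Tooling cost per gigawatt at the low and high end of the price range.
low_usd = TOOLS_PER_GW * TOOL_PRICE_RANGE_USD[0]
high_usd = TOOLS_PER_GW * TOOL_PRICE_RANGE_USD[1]
mid_usd = (low_usd + high_usd) / 2  # assumption: midpoint price of $350 million

# How many dollars of data center CapEx sit on each dollar of EUV tooling.
leverage = DATA_CENTER_COST_PER_GW_USD / mid_usd

print(f"EUV tooling per GW: ${low_usd / 1e9:.2f}B to ${high_usd / 1e9:.2f}B")
print(f"Midpoint: ${mid_usd / 1e9:.3f}B")
print(f"Data center CapEx per dollar of EUV tooling: {leverage:.0f}x")
```

The midpoint matches the quoted $1.2 billion, and the ratio shows roughly forty dollars of data center spend resting on each dollar of lithography tooling.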

Dwarkesh Patel

You wrote an article recently saying over the last three years, TSMC has done $100 billion of CapEx. So it’s $30/$30/$40 billion. A small fraction of that is being used by Nvidia for the 3 nm, or previously 4 nm, that it’s using for its chips. What were its earnings last quarter? It was $40 billion. So $40 billion times four is $160 billion. Nvidia alone is turning some small fraction of $100 billion in CapEx, which is going to be depreciated over many years and not just this one year, into $160 billion in a single year.

That gets even more intense when you go down the supply chain to ASML, which is taking a billion dollars’ worth of machines to produce a gigawatt. Of course, those machines last for more than a year so it’s doing more than that.

Now I want to understand, how many such machines will there be by 2030, if you include not just the ones that are sold that year, but also the ones that have been accumulating over the previous years? What does that imply? Sam Altman says he wants to do a gigawatt a week in 2030. When you add up those numbers, is it compatible with that?

Dylan Patel

That’s completely compatible, if you think about it. TSMC and the entire ecosystem have something like 250 to 300 EUV tools already. Then you stack on 70 this year, 80 next year, growing to 100 by 2030. You’re at 700 EUV tools by the end of the decade. 700 EUV tools, at three and a half tools per gigawatt—assuming it’s all allocated to AI, which it’s not—gets you to 200 gigawatts worth of AI chips for the data centers to deploy.

Sam wants 52 gigawatts a year. He’s only taking 25% share then. Obviously, there’s some share given to mobile and PC, assuming we’re even allowed to have consumer goods still and we don’t get priced out of them. But roughly, he’s saying 25% market share of the total chips fabbed. That’s very reasonable given that this year alone, I think he’s going to have access to 25% of the Blackwell GPUs that are deployed. It’s not that crazy.
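
The installed-base argument reduces to two divisions, sketched below. The 700-tool fleet, 3.5 tools per gigawatt, and 52 gigawatts a year come from the conversation; the `ai_share` parameter and the 80% scenario are hypothetical, added only to show how the ceiling moves if some tools keep serving mobile and PC.

```python
def ai_capacity_gw(installed_tools: float, tools_per_gw: float, ai_share: float = 1.0) -> float:
    """Gigawatts of AI chips per year that a fleet of EUV tools could pattern."""
    if not 0.0 <= ai_share <= 1.0:
        raise ValueError("ai_share must be between 0 and 1")
    return installed_tools * ai_share / tools_per_gw


INSTALLED_TOOLS_2030 = 700   # quoted: existing base plus shipments through 2030
TOOLS_PER_GW = 3.5           # quoted: tools needed per gigawatt per year
TARGET_GW_PER_YEAR = 52      # quoted: "a gigawatt a week"

# Upper bound if every tool were dedicated to AI chips.
ceiling_gw = ai_capacity_gw(INSTALLED_TOOLS_2030, TOOLS_PER_GW)
target_share = TARGET_GW_PER_YEAR / ceiling_gw

# Hypothetical scenario: only 80% of the fleet is allocated to AI.
ceiling_at_80pct = ai_capacity_gw(INSTALLED_TOOLS_2030, TOOLS_PER_GW, ai_share=0.8)

print(f"Ceiling with all tools on AI: {ceiling_gw:.0f} GW/year")
print(f"Share needed for {TARGET_GW_PER_YEAR} GW/year: {target_share:.0%}")
print(f"Ceiling at 80% AI allocation: {ceiling_at_80pct:.0f} GW/year")
```

The share comes out at about a quarter of the all-AI ceiling, in line with the 25% figure quoted above.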

Dwarkesh Patel

When did ASML start shipping EUV tools, when 7 nm started? I don’t know when that was exactly. You’re saying in 2030, they’re going to be using machines that initially were shipped in 2020. So for ten years, you’re using the same most important machine in this most technologically advanced industry in the world? I find that surprising.

Dylan Patel

ASML’s been shipping EUV tools now for roughly a decade, but it only entered mass volume production around 2020. The tool’s not the same. Back then, the tools were even lower throughput. There are various specifications around them called overlay. I was mentioning you’re stacking layers on top of each other. You’ll do some EUV, you’ll do a bunch of different process steps—depositing stuff, etching stuff, cleaning the wafer—dozens of those steps before you do another EUV layer.

There’s a spec called overlay, which is: you did all this work, you drew these lines on the wafer, now I want to draw these dots. Let’s say I want to draw these dots to connect these lines of metal to holes, and then the next layer up is another set of lines going perpendicular, so now you’re connecting wires going perpendicular to each other. You have to be able to land them on top of each other. It’s called overlay.

Overlay is a spec that’s been improved rapidly by ASML. Wafer throughput has been improved rapidly by ASML. The price of the tool has gone up, but not as much as the capabilities of the tool. Initially, the EUV tools were $150 million. Over time, they’re now $400 million as I look out to 2028. But the capabilities of the tools have more than doubled as well, especially on throughput and overlay accuracy, which is the ability to accurately align the subsequent passes on top of each other even though you do tons of steps between.

ASML is improving super rapidly. It’s also noteworthy to say that ASML is maybe one of the most generous companies in the world. They have this linchpin thing. No one has anything competitive. Maybe China will have some EUV by the end of the decade, but no one else has anything even close to EUV, and yet they haven’t taken price and margins up like crazy. You go ask some other folks that we talk to all the time, like Leopold, and they’re like, “Let’s have the price go up.” Because they can. The margin is there. You can take the margin. Nvidia takes the margin. Memory players are taking the margin. But ASML has never raised the price more than they’ve increased the capability of the tool.

In a sense, they’ve always provided net benefit to their customers. It’s not that the tool is stagnant, it’s just that these tools are old. Yes, you can upgrade them some, and the new tools are coming. For simplicity’s sake, we’re ignoring the advances in overlay or throughput per tool for this podcast.

Dwarkesh Patel

You say we’re producing 60 of these machines this year and then 70, 80 over subsequent years. What would happen if ASML just decided to double its CapEx or triple its CapEx? What is preventing them from producing more than 100 in 2030? Why are you so confident that even five years out, you can be relatively sure what their production will be?

Dylan Patel

I think there are a couple factors here. ASML has not decided to just go YOLO, let’s expand capacity as fast as possible. In general, the semiconductor supply chain has not. It’s lived through the booms and busts, and we can talk a bit more about it. Basically some players have recently woken up, but in general no one really sees demand for 200 gigawatts a year of AI chips, or trillions of dollars of spend a year in the semiconductor supply chain. They’re not AI-pilled. They’re not AGI-pilled.

Dwarkesh Patel

We’re going to get to a trillion dollars this year.

Dylan Patel

Yeah, I feel you, but I’m saying no one really understands this in the supply chain. Constantly, we’re told our numbers are way too high, and then when they’re right, they’re like, “Oh, yeah, but your next year’s numbers are still too high.”

ASML’s tool has four major components. It has the source, which is made by Cymer in San Diego. It has the reticle stage, which is made in Wilton, Connecticut. It has the wafer stage. It has the optics, the lenses and such. Those last two are made in Europe.

When you look at each of these four, they’re tremendously complex supply chains that, (A) they have not tried to expand massively, and (B) when they try to expand them, the time lag is quite long. Again, this is the most complicated machine that humans make, period, at any sort of volume.

Let’s talk about the source specifically. What does the source do? It drops these tin droplets. It hits each one three successive times with a laser, perfectly. The first pulse hits the tin droplet and it expands out. It hits it again, so it expands out to this perfect shape, and then it blasts it at super high power. The tin droplets get excited enough that they release EUV light, 13.5 nanometer, and then it’s in this thing that is collecting all the light and directing it into the lens stack.

Then you have the lens stack, which is Carl Zeiss, as you mentioned, and some other folks, but Zeiss being the most important part of it. They also have not tried to expand production capacity because they don’t see... They’re like, “We’re growing a lot because of AI. We’re growing from 60 to 100.” It’s like, “No, no, no. We need to go to a couple hundred, but it’s fine. Whatever.”

Each of these tools has, I think, 18 of these lenses, effectively. They are multilayer mirrors, which are perfect layers of molybdenum and ruthenium, if I recall correctly, stacked on top of each other in many layers, and then the light bounces off of it perfectly. When we think about a lens, it’s in a shape, and it focuses the light. This is like a mirror that’s also a lens, so it’s pretty complicated. Any defect in these super thinly deposited stacks will mess it up. Any curvature issues will mess it up.

There are a lot of challenges with scaling the production. It’s quite artisanal in this sense because you’re not making tens of thousands of these a year, you’re making hundreds, maybe a thousand. At 60 tools a year and 18 of these per tool, you’re at roughly a thousand of these lenses and projection optics a year.

Then you step forward to the reticle stage, which is also something really crazy. This thing moves at, I want to say, nine Gs. It will shift nine Gs because as you step across a wafer, the tool will go... The wafer stage is complementary. It’s the wafer part. You line these two things up. You’re taking all the light through the lenses that’s focused, and here’s the reticle, here’s the wafer. The reticle’s moving one direction, the wafer’s moving the other direction as it scans a 26x33 millimeter section of the wafer, and then it stops. It shifts over to another part of the wafer and does it again. It does that in just seconds. Each of them is moving at nine Gs in opposite directions.

Each of these things is a wonder and marvel of chemistry, fabrication, mechanical engineering, and optical engineering, because you have to align all these things and make sure they’re perfect. All of these things have crazy amounts of metrology because you have to perfectly test everything. If anything is messed up, the yield goes to zero, because this is such a finely tuned system.

By the way, it’s so large that you’re building it in the factory in Eindhoven, Netherlands, and they’re deconstructing it and shipping it on many planes to the customer site, and then you’re reassembling it there and testing it again. That process takes many, many months.

There are so many steps in the supply chain, whether it’s Zeiss making their lenses and projection optics or Cymer, which is an ASML-owned company, making the EUV source. Each of these has its own complex supply chain. ASML has commented that their supply chain has over ten thousand people in it.

Dwarkesh Patel

Like individual suppliers?

Dylan Patel

Yes. It might not be directly. It might be through Zeiss having so many suppliers and XYZ company having so many suppliers.

If you just think about it, you’re talking about two physically moving objects that are the size of a wafer, and it has to be accurate to the level of single-digit nanometers or even smaller because the entire system, the overlay, the layer-to-layer overlay variation, has to be on the order of 3 nanometers. If the overlay is 3 nm, that means the accuracy of each individual part’s physical movement has to be even tighter than that. It has to be sub-one nanometer in most cases, because the errors of these things stack up. There’s no way to just snap your fingers and increase production.
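
One way to see why a 3 nm overlay target pushes individual parts below a nanometer is a simple error budget. The sketch below is an illustration and not ASML’s actual budgeting method: it assumes N error contributors of equal size, combined either as independent errors (root-sum-square) or as a worst-case linear stack. The contributor counts are hypothetical.

```python
import math


def per_component_budget_nm(total_overlay_nm: float, n_components: int,
                            independent: bool = True) -> float:
    """Allowed error per contributor so the combined error stays within the total.

    independent=True  -> errors add in quadrature (root-sum-square)
    independent=False -> errors add linearly (worst case)
    """
    if n_components < 1:
        raise ValueError("n_components must be at least 1")
    if independent:
        return total_overlay_nm / math.sqrt(n_components)
    return total_overlay_nm / n_components


TOTAL_OVERLAY_NM = 3.0  # quoted: overlay on the order of 3 nanometers

# Hypothetical contributor counts (stages, optics, metrology, thermal drift, ...).
for n in (4, 9, 16):
    rss = per_component_budget_nm(TOTAL_OVERLAY_NM, n)
    worst = per_component_budget_nm(TOTAL_OVERLAY_NM, n, independent=False)
    print(f"{n:>2} contributors: {rss:.2f} nm each (RSS), {worst:.2f} nm each (worst case)")
```

With nine or more equal contributors, each one is already at or below one nanometer even under the more forgiving root-sum-square assumption.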

Things as simple as power. The US going from zero percent power growth to two percent power growth, even though China’s already at thirty, was so hard for America to do. And that’s a really simple supply chain with very few people in it who make difficult things. There are probably 100,000 electricians and people who work in the electricity supply chain, or more, in the US?

When you look at ASML, they employ so few people. Carl Zeiss probably employs less than a thousand people working on this, and all of those people are super, super specialized. You can’t just train random people up for this in the snap of a finger. You can’t just get your entire supply chain to get galvanized.

Nvidia’s had to do a lot to get the entire supply chain to even deliver the capacity they’re going to make this year. When you go talk to Anthropic, they’re like, “We’re short of TPUs, we’re short of Trainium, and we’re short of GPUs.” When you go talk to OpenAI, they’re like, “We’re short of these things.”

OpenAI and Anthropic know they need X. Nvidia is not quite as AGI-pilled. They’re building X - 1. You go down the supply chain, everyone’s doing X - 1. In some cases, they’re doing X ÷ 2, because they’re not AGI-pilled.

You end up with this time lag for the whip to react. The AI-pilledness and the desire to increase production takes so long. Once they finally understand that they need to increase production rapidly… They think they understand. They think AI means we have to go from 60 to 100, in addition to the tools getting better and faster, the source getting higher power from 500 watts to 1,000, and all these other aspects of the supply chain advancing technically and increasing production. They think they’re actually increasing production a lot.

But if you flow through the numbers… What does Elon want? He wants 100 gigawatts a year in space by 2028 or 2029. Sam Altman wants 52 gigawatts a year by the end of the decade. Anthropic probably needs the same, and Google needs that. You go across the supply chain, and it’s like, wait, no, the supply chain can’t possibly build enough capacity for everyone to get what they want on the side of compute.

00:55:47 – Can’t we just use TSMC’s older fabs?

Dwarkesh Patel

I feel like in the data center supply chain for the last few years, people have been making arguments like, “We are bottlenecked by this specific thing, therefore AI compute can’t scale more than X.” But as you’ve written about, if the grid is a bottleneck, then we just do behind the meter on the site, we do gas turbines, et cetera. If that doesn’t work, there are all these other alternatives that people fall back on.

I want to ask whether we can imagine a similar thing happening in the semiconductor supply chain. If EUV becomes a bottleneck, what if we just went back to 7 nm and did what China is doing currently, producing 7 nm chips with multi-patterning with DUV machines? If you look at a 7 nm chip like the A100, there’s been a lot of progress obviously from the A100 to the B100 or B200.

How much of that progress is just numerics? If you just hold FP16 constant from A100 to B100. The B100 is a little over one petaflop, and the A100 is like 300 teraflops.

Dylan Patel

Yeah, 312.

Dwarkesh Patel

Holding numerics constant, you have a 3x improvement from A100 to B100. Some of that is the process improvement, some of that is just the accelerator design improving, which we could replicate again in the future.

It seems there’s actually a very small effect from the process improving from 7 nm to 4 nm. I don’t know the numbers offhand but let’s say there’s 150k wafers per month of 3 nm and eventually similar amounts for 2 nm. But then there’s a similar amount for 7 nm.

If you have all those old wafers and there’s maybe a 50% haircut because the bits per wafer area are 50% less or something, it doesn’t seem that bad to just bring on 7 nm wafers if that gives you another fifty or hundred gigawatts. Tell me why that’s naive.

Dylan Patel

We potentially do go crazy enough that this happens because we just need incremental compute, and the compute is worth the higher cost and power of these chips. But it’s also unlikely to a large extent because some of these are not fair comparisons.

For example, from A100, which is 312 teraflops, to Blackwell, which is 1,000 or 2,000 FP16, and then Rubin is 5,000 or so FP16… It’s not a fair comparison because these chips have vastly different design targets. With A100, Nvidia optimized for FP16 and BF16 numerics. When you look at Hopper, they didn’t care as much about that; they cared about FP8. When you look at Rubin, they don’t care about FP16 and BF16 so much, they care mostly about FP4 and FP6. Numerics are what they’ve designed their chip for.

Let’s say we make a new chip design on 7 nm, optimized for the numerics of the modern day. The performance difference is still going to be much larger than the FLOPS difference you mentioned. Often it’s easy to boil things down to FLOPS per watt or FLOPS per dollar, but that’s not a fair comparison.

Let’s look at Kimi K2.5 and DeepSeek. When you look at those two models and their performance on Hopper versus Blackwell on very optimized software, you get vastly different performance. Most of this is not attributable to FLOPS or numerics, because those models are actually eight-bit. Blackwell and Hopper are both optimized for eight-bit, so Blackwell is not really taking advantage of its four-bit there. The performance gulf is actually much larger than the FLOPS difference would suggest.

Sure, it’s one thing to shrink process technology and make the transistor smaller so each chip has X number of FLOPS, but you forget the big gating factor. These models don’t run on a single chip. They run on hundreds of chips at a time. If you look at DeepSeek’s production deployment, which is well over a year old now, they were running on 160 GPUs. That’s what they serve production traffic on. They split the model across 160 GPUs.

Every time you cross the barrier from one chip to another, there is an efficiency loss. You have to transmit over high-speed electrical SerDes, which brings a latency cost and a power cost. There are all these dynamics that hurt. As you shrink and shrink the process node, you’ve increased the amount of compute in a single chip. Now in-chip movement of data is at least tens of terabytes a second, if not hundreds of terabytes a second. Whereas between chips, you’re on the order of a terabyte a second.

Then you have this movement of data between chips that are super close to each other physically. You can only put so many chips close to each other physically, so you have to put chips in different racks. The movement of data between racks is on the order of hundreds of gigabits a second, 400 gig or 800 gig a second, so roughly 100 gigabytes a second.

So you have this huge ladder: on-chip communication is super fast, within the rack is an order of magnitude slower, and outside the rack is an order of magnitude lower than that. As you break the bounds of chips, you end up with a performance loss.
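
The ladder described above can be put into numbers with a small sketch. The bandwidth tiers are the rough figures quoted in the conversation, with the lower bound taken for on-chip; the 1 GB payload is a hypothetical example, and latency and protocol overhead are ignored.

```python
# Rough bandwidth tiers quoted above, in gigabytes per second.
BANDWIDTH_GB_PER_S = {
    "on-chip": 10_000,         # "at least tens of terabytes a second" (lower bound)
    "chip-to-chip": 1_000,     # "on the order of a terabyte a second"
    "rack-to-rack": 100,       # 800 gigabits a second is about 100 gigabytes a second
}


def gbit_to_gbyte(gigabits: float) -> float:
    """Convert gigabits to gigabytes (8 bits per byte)."""
    return gigabits / 8


def transfer_time_ms(payload_gb: float, bandwidth_gb_per_s: float) -> float:
    """Milliseconds to move a payload at a given bandwidth, ignoring latency."""
    return payload_gb / bandwidth_gb_per_s * 1_000


PAYLOAD_GB = 1.0  # hypothetical payload, e.g. a slice of activations or KV cache

times_ms = {tier: transfer_time_ms(PAYLOAD_GB, bw) for tier, bw in BANDWIDTH_GB_PER_S.items()}
for tier, ms in times_ms.items():
    print(f"{tier:<13} {ms:6.2f} ms to move {PAYLOAD_GB:.0f} GB")
```

Each step down the ladder costs roughly ten times more time per byte moved, which is why splitting the same model across more, smaller chips hurts more than a FLOPS comparison alone suggests.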

The reason I explain this is because when you look at Hopper versus Blackwell, even if both are using a rack’s worth of chips, Hopper is significantly slower. The amount of performance you can bring to bear on the task within each domain (tens of terabytes a second of communication between processing elements versus terabytes a second) is much, much higher on Blackwell, and therefore the performance is much higher. When you look at inference at 100 tokens a second for DeepSeek and Kimi K2.5, the performance difference between Hopper and Blackwell is on the order of 20x.

It’s not 2x or 3x like the FLOPS performance difference indicates, even though those are on the same process node. There are just differences in networking technologies and what they’ve worked on. You can translate some of these back, but when you look at what they’re doing on 3 nm with Rubin, some of those things are simply not possible to do all the way back on A100, even if you make a new chip for 7 nm.

There are certain architectural improvements you can port and certain ones you cannot. The performance difference is not just going to be the difference in FLOPS. It’s in some senses cumulative between the difference in FLOPS per chip, networking speed between chips, how many FLOPS are on a chip versus a system, and memory bandwidth on a single chip versus an entire system. All of these things compound.

Dwarkesh Patel

Can I ask you a very naive question? The B200 now has two dies on a single chip, so you can get that bandwidth without having to go through NVLink or InfiniBand. Next year, Rubin Ultra will have four dies on one chip. What is preventing us from just doing that with an older… How many dies could you have on a single chip and still get these tens of terabytes a second?

Dylan Patel

Even within Blackwell, there are differences in performance when you’re communicating on the chip versus across the chips. Those bounds are obviously much smaller than when you’re going out of the entire chip. When you scale the number of chips up, there is some performance loss. It’s not perfect, but it is way better than different entire packages.

How large can advanced packaging scale? The way Nvidia is doing it is CoWoS. Google, Broadcom, MediaTek, and Amazon’s Trainium are all doing CoWoS. But actually you can go look back at what Tesla did with Dojo, which they cancelled and restarted. Dojo was a chip that was the size of an entire wafer. They had 25 chips on it. There were some tradeoffs. They couldn’t put HBM on it. But the positive side was that they had 25 chips on it. To date, it is still probably the best chip for running convolutional neural networks. It’s just not great at transformers because the shape of the chip, the memory, the arithmetic, and all these various specifications are just not well-suited for transformers. They’re well-suited for CNNs.

Dojo chips were optimized around that, and they made a bigger package. But as you make packages bigger and bigger, you have other constraints: networking speed, memory bandwidth, and cooling capabilities. All of these things start to rear their heads. It’s not simple. But yes, you will see a trend line of more chips on the package, and yes, you’re going to be able to do that on 7 nm.

In fact, that’s what Huawei did with their Ascend 910C or D. They initially put one, and then they did two. They’re focusing on scaling the packaging up because that is an area where they can advance faster than process technology where they can’t shrink. But at the end of the day, that’s something you can do on the leading-edge chips too. Anything you do on 7 nm, you can also probably do on 3 nm in terms of packaging.

01:05:37 – When will China outscale the West in semis?

Dwarkesh Patel

If we end up in this world in 2030 where the West has the most advanced process technology but has not ramped it up as much, whereas China… I don’t know if you think by 2030 they would have EUV and 2 nm or whatever. But they are semiconductor-pilled and they are producing in mass quantity.

Basically, I’m wondering what the year is where there’s a crossover, where our advantage in process technology has faded enough, and their advantage in scale has increased enough. And also, if their advantage in having one country with the entire supply chain indigenized—rather than having random suppliers in Germany and the Netherlands—would mean that China would be ahead in its ability to produce mass flops.

Dylan Patel

To date, China still does not have an entirely indigenized semiconductor supply chain.

Dwarkesh Patel

But would they in 2030?

Dylan Patel

By 2030, it’s possible that they do. But to date, all of China’s 7 nm and 14 nm capacity uses ASML DUV tools. The amount that they can import from ASML is large. But the vast majority of ASML’s revenue, and all of its EUV revenue, is outside of China. The scale advantage is still in the favor of the West plus Taiwan, Japan, and Korea, et cetera.

Dwarkesh Patel

But they’re trying to make their own DUV and EUV tools, right?

Dylan Patel

They’re trying to do all these things. The question is how fast can they advance and scale up production as well as quality. To date, we haven’t seen that. Now I’m quite bullish that they’re going to be able to do these things over the next five to ten years. They will really scale up production and kick it into high gear. They have more engineers working on it and more desire to throw capital at the problem.

Dwarkesh Patel

So by 2030, will they have fully indigenized DUV?

Dylan Patel

I think for sure. DUV, yes.

Dwarkesh Patel

And fully indigenized EUV by 2030?

Dylan Patel

I think they’ll have working tools. I don’t think that they’ll be able to manufacture a bunch yet. There’s having it work, and then there’s production hell. ASML had EUV working in the early 2010s at some capacity. The tools were not accurate enough. They were not scaled for high-volume manufacturing or reliable enough. They had to ramp production, and that all took time.

Production hell takes time. That’s why it took another five to seven years to get EUV into mass production at a fab rather than just working in the lab.

Dwarkesh Patel

How many DUV tools do you think they’ll be able to manufacture in 2030?

Dylan Patel

ASML?

Dwarkesh Patel

No, China.

Dylan Patel

That’s a great question. It’s a bit of a challenge to look into this supply chain especially. We try really hard. In some instances, they’re buying stuff from Japanese vendors. If they want a fully indigenized supply chain, they need to not buy these lenses, projection optics, or stages from Japanese vendors. They need to build it internally.

It’s really tough to say where they’ll be able to get to. I honestly think it’s a shot in the dark. But it’s probably not unlikely that they’ll be able to do on the order of 100 DUV tools a year, whereas ASML is currently doing hundreds of DUV tools a year.

No company has a process node where they make a million wafers a month. Elon says he wants to do it and China is obviously going to do it. TSMC is trying to do that. The memory makers may get to a million wafers a month as well, but not in a single fab.

It’s mind-boggling to think of that scale, and challenging to see the supply chain galvanized for that. I don’t want to doubt China’s capability to scale.

Dwarkesh Patel

I guess this is an interesting question. I think at some point SemiAnalysis will do the deep dive on this. By when would indigenized Chinese production be bigger than that of the West combined, taking as input your model of when they’ll have DUV machines and EUV machines at scale?

Because there’s this question around if you have long timelines on AI—by long meaning 2035, which is not that long in the grand scheme of things—should you expect a world where China is dominating in semiconductors? It doesn’t get asked enough because if you’re in San Francisco, we’re thinking on timescales of weeks. If you’re outside of San Francisco, you’re not thinking about AGI at all.

What if we have AGI? What if you have this transformational thing that is commanding tens or hundreds of trillions of dollars of economic growth and token output, but it happens in 2035? What does that imply for the West versus China? SemiAnalysis has got to write the definitive model on this.

Dylan Patel

It’s really challenging when you move timescales out that far. What we tend to focus on is tracking every data center, every fab, and all the tools. We track where they’re going, but the time lags for these things are relatively short. We can only make reasonably accurate estimates for data center capacity based on land purchasing, permits, and turbine purchasing. We know where all these things are going, that’s the data we sell.

As you go out to 2035, things are just so radically different. Your error bars get so large it’s hard to make an estimate. But at the end of the day, if takeoff or timelines are slow enough, I don’t see why China wouldn’t be able to catch up drastically. In some sense, we’ve got this valley where, three to six months ago, or maybe even now, Chinese models are as competitive as they’ve ever been. I think Opus 4.6 and GPT 5.4 have really pulled away and made the gap a little bit bigger, but I’m sure some new Chinese models will come out.

As we move from selling tokens where they provide the entire reasoning chain, to selling automated white-collar work—an automated software engineer, you send them the request, they give you the result back, and there’s a bunch of thinking on the back end that they don’t show you—the ability to distill out of American models into Chinese models will be harder.

Second, look at the scale of the compute the labs have. OpenAI exited last year with roughly two gigawatts. Anthropic will get to two-plus gigawatts this year. By the end of next year, they’ll both be at ten gigawatts of capacity. China is not scaling their AI lab compute nearly as fast. At some point, when you can’t distill the learnings from these labs into the Chinese models, and with this compute race that OpenAI, Anthropic, Google, and Meta are all running, you get to a point where model performance should start to diverge more.

Then look at all this CapEx being spent on data centers. Amazon is spending $200 billion, Google $180 billion. All these companies are spending hundreds of billions of dollars on CapEx. There’s nearly a trillion dollars of CapEx being invested in data centers in America this year, roughly. What’s the return on invested capital here? You and I would think the return on invested capital for data center CapEx is very high.

If we look at Anthropic’s revenues, in January they added $4 billion. In February, which was a shorter month, they added $6 billion. We’ll see what they can do in March and April, given that compute constraints are what’s bottlenecking their growth. The reliability of Claude is quite low because they’re so compute constrained. But if this continues, then the ROIC on these data centers is super high.

At some point, the US economy starts growing faster and faster over this year and next year because of all this CapEx, all the revenue these models are generating, and the downstream supply chain. China doesn’t have that yet. They have not built the scale of infrastructure to invest in models, get to the capabilities, and then deploy these models at such scale.

When you look at Anthropic, they’re at $20 billion ARR. The margins are sub-50 percent, at least as last reported by The Information. So that’s $13 or $14 billion of compute that it’s running on rental cost-wise, which is actually $50 billion worth of CapEx that someone laid out for Anthropic to generate their current revenue.
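The arithmetic in this paragraph can be sketched as follows. The revenue, compute-cost, and CapEx figures are the ones quoted in the conversation; the derived margin and the CapEx-per-rental-dollar ratio are back-of-the-envelope values that assume compute rental is the only cost of revenue, which is a simplification.

```python
ARR_BN = 20.0                    # annualized revenue, $bn (quoted)
COMPUTE_COST_BN = (13.0, 14.0)   # annual compute rental, $bn (quoted range)
CAPEX_BN = 50.0                  # CapEx behind that compute, $bn (quoted)


def gross_margin(revenue: float, cost: float) -> float:
    """Gross margin as a fraction, treating compute as the only cost (assumption)."""
    return (revenue - cost) / revenue


def capex_per_rental_dollar(capex: float, annual_rental: float) -> float:
    """Dollars of CapEx laid out per dollar of annual compute rental."""
    return capex / annual_rental


for cost in COMPUTE_COST_BN:
    margin = gross_margin(ARR_BN, cost)
    ratio = capex_per_rental_dollar(CAPEX_BN, cost)
    print(f"compute ${cost:.0f}bn: margin {margin:.0%}, CapEx per rental dollar {ratio:.2f}")
```

Under these assumptions the implied margin lands at 30 to 35 percent, consistent with "sub-50 percent", and each dollar of annual rental corresponds to roughly three and a half to four dollars of CapEx.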

China has just not done this. If and when Anthropic 10Xs revenue again—and I think our answer would be when, not if—China doesn’t have the compute to deploy at that scale. So there is some sense that we’re in a fast takeoff. It’s not like we’re talking about a Dyson sphere by X date, it’s more like the revenue is compounding at such a rate that it does affect economic growth. The resources these labs are gathering are growing so fast. China hasn’t done that yet, so in that case, the US and the West are actually diverging.

The flip side is that these infrastructure investments have middling returns. Maybe they’re not as good as hoped. Maybe Google is wrong for wanting to take free cash flow to zero and spend $300 billion on CapEx next year. Maybe they’re just wrong and people on Wall Street who are bearish and people who don’t understand AI are correct. In that case, the US is building all this capacity but doesn’t get great returns. Meanwhile, China is able to build a fully vertical, indigenized supply chain, instead of the US/Japan/Korea/Taiwan/SE Asia/Europe countries together building this less vertical supply chain. In a sense, at some point China is able to scale past us if AI takes longer to get to certain capability levels than the vast majority of your guests on this podcast believe.

Dwarkesh Patel

It’s fast timelines, the US wins; long timelines, China wins.

Dylan Patel

Yeah, but I don’t know what fast timelines means. I don’t think you have to believe in AGI to have the timelines where the US wins.

01:16:01 – The enormous incoming memory crunch

Dwarkesh Patel

Let’s go back to memory. I think people on Wall Street and people in the industry are understanding how big this is, but maybe generally people don’t understand what a big deal it is. So we’ve got this memory crunch, as you were talking about.

And earlier I was asking about, oh, could we solve for the EUV tool shortage by going back to seven nanometers? So let me ask a similar question about memory. HBM is made of DRAM, but has three to four times fewer bits per wafer area than the DRAM it’s made out of.

Is it possible that accelerators in the future could just use commodity DRAM and not HBM, so we can get much more capacity out of the DRAM we have? The reason I think this might be possible is, if we’re going to have agents that are just going off and doing work, and it’s not a synchronous chatbot application, then you don’t necessarily need extremely fast latency.

Maybe you can have lower bandwidth, because the reason you stack DRAM into HBM is for higher bandwidth. Is it possible to go to non-HBM accelerators and basically have the opposite of Claude Code Fast, like have Claude Slow?

Dylan Patel

At the end of the day, the incremental purchaser who’s willing to pay the highest price for tokens also ends up being the one that’s less price-sensitive. Compute should be allocated, in a capitalistic society, towards the goods that have the highest value, and the private market determines this by willingness to pay.

To some extent, Anthropic could actually release a slow mode. They could release Claude Slow Mode and increase tokens per dollar by a significant amount. They could probably reduce the price of Opus 4.6 by 4-5x and reduce the speed by maybe just 2x. The curve on inference throughput versus speed is already there just on HBM. And yet they don’t, because no one actually wants to use a slow model.

Furthermore, on these agentic tasks, it’s great that the model can run at a time horizon of hours. But if the model was running slower, those hours would become a day. Vice versa, if the model is running faster, those hours become an hour. No one really wants to move to a day-long wait period, because the highest-value tasks also have some time sensitivity to them.

I struggle to see… Yes, you could use regular DRAM. There are a couple of challenges with this. One of the core constraints of chips is that a chip is a certain size, and all of the I/O escapes on the edges. Often, the left and right of the chip are HBM—so the I/O from the chip to the HBM is on the sides—and then the top and bottom are I/O to other chips.

If you were to change from HBM to DDR, all of a sudden this I/O on the edge would have significantly less bandwidth, but significantly more capacity per chip. But the metric you actually care about is bandwidth per wafer, not bits per wafer.

Dwarkesh Patel

Because the thing that is constraining the FLOPS is just getting in and out the next matrix, and for that you just need more bandwidth.

Dylan Patel

Yeah, getting out the weights and getting in and out the KV cache. In many cases, these GPUs are not running at full memory capacity. It’s obviously a system design thing: model, hardware, and software co-design. You have to figure out how much KV cache you need, how much you keep on the chip, how much you offload to other chips and call when you need it for tool calling, and how many chips you parallelize this on.

Obviously, the search space for this is very broad, which is why we have InferenceX, an open-source model that searches all the optimal points on inference for a variety of different chips and models.

The point is, you’re not always necessarily constrained by memory capacity. You can be constrained by FLOPS, network bandwidth, memory bandwidth, or memory capacity. If you really simplify it down, there are four constraints, and each of these can break out into more.

If you switch to DDR, yes, you produce four times the bits per DRAM wafer, but all of a sudden the constraints shift a lot and your system design shifts. You go slower. Is the market smaller? Maybe. But also, all these FLOPS are wasted because they’re just sitting there waiting for memory. You don’t need all that capacity because you can’t really increase batch size because then the KV cache would take even longer to read.
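One way to picture how the constraint shifts is a simple bottleneck model over the four limits named above. This is a minimal sketch, not how InferenceX or any real planner works: the `Chip` and `Step` types, the `bottleneck` function, and every number in it are hypothetical. The DDR variant is given 4x the capacity and 20x less memory bandwidth purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chip:
    flops: float          # peak FLOP/s
    mem_bw: float         # memory bandwidth, bytes/s
    net_bw: float         # network bandwidth, bytes/s
    mem_capacity: float   # memory capacity, bytes


@dataclass
class Step:
    flop: float            # arithmetic needed for one step
    mem_bytes: float       # weights plus KV cache read per step
    net_bytes: float       # bytes exchanged with other chips per step
    resident_bytes: float  # bytes that must fit in memory


def bottleneck(chip: Chip, step: Step) -> str:
    """Return which of the four constraints limits this step."""
    if step.resident_bytes > chip.mem_capacity:
        return "memory capacity"
    times = {
        "FLOPS": step.flop / chip.flops,
        "memory bandwidth": step.mem_bytes / chip.mem_bw,
        "network bandwidth": step.net_bytes / chip.net_bw,
    }
    # The slowest resource sets the step time.
    return max(times, key=times.get)


# Hypothetical accelerator with HBM, and the same chip with DDR instead.
hbm_chip = Chip(flops=1e15, mem_bw=20e12, net_bw=1e12, mem_capacity=288e9)
ddr_chip = Chip(flops=1e15, mem_bw=1e12, net_bw=1e12, mem_capacity=1152e9)
step = Step(flop=1e13, mem_bytes=100e9, net_bytes=1e9, resident_bytes=200e9)

print("HBM:", bottleneck(hbm_chip, step))
print("DDR:", bottleneck(ddr_chip, step))
```

With these made-up numbers the same workload moves from being limited by FLOPS on the HBM chip to being limited by memory bandwidth on the DDR chip, which is the sense in which the compute ends up sitting idle waiting for memory.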

Dwarkesh Patel

Makes sense. What is the bandwidth difference between HBM and normal DRAM?

Dylan Patel

An HBM4 stack—let’s talk about the stuff that’s in Rubin, because that’s what we’ve been indexing on—is 2048 bits across, connected in an area that’s 13 millimeters wide. It transfers memory at around 10 giga-transfers a second.

So a stack of HBM4 is 2048 bits on an area that’s roughly 11 to 13 millimeters wide. That’s the shoreline you’re taking on the chip. In that shoreline, you have 2048 bits transferring at 10 giga-transfers per second. You multiply those together and divide by eight, bits to a byte, and you’re at roughly 2.5 terabytes a second per HBM stack.

When you look at DDR, in that same area, it’s maybe 64 or 128 bits wide. That DDR5 is transferring at anywhere from 6,400 to maybe 8,000 mega-transfers a second. So your bandwidth is significantly lower. It’s 64 times 8,000 divided by eight, which puts you at 64,000 megabytes, or 64 gigabytes, a second. Even if you take a generous interpretation of 128 times 8 giga-transfers, you’re at 128 gigabytes a second for the same shoreline, versus 2.5 terabytes a second.

There’s more than an order of magnitude difference in bandwidth per edge area. If your chip is roughly square, say 26 by 33 millimeters, which is the maximum size for an individual die, you only have so much edge area. On the inside of that chip, you put all your compute. There are things you can do to try and change that, like more SRAM or more caching. But at the end of the day, you’re very constrained by bandwidth.
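The shoreline comparison above is just bus width times transfer rate, divided by eight. A minimal sketch using the figures quoted in the conversation (2048 bits at 10 giga-transfers a second for an HBM4 stack, 64 or 128 bits at 8 giga-transfers a second for DDR5) follows; the function name is made up for illustration.

```python
def bandwidth_gb_per_s(bus_width_bits: int, giga_transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s: bits per transfer times transfers per second, over 8 bits per byte."""
    return bus_width_bits * giga_transfers_per_s / 8


hbm4_stack = bandwidth_gb_per_s(2048, 10.0)
ddr5_narrow = bandwidth_gb_per_s(64, 8.0)
ddr5_wide = bandwidth_gb_per_s(128, 8.0)

print(f"HBM4 stack:    {hbm4_stack:.0f} GB/s")   # about 2.5 TB/s
print(f"DDR5, 64-bit:  {ddr5_narrow:.0f} GB/s")
print(f"DDR5, 128-bit: {ddr5_wide:.0f} GB/s")
print(f"HBM4 advantage: {hbm4_stack / ddr5_wide:.0f}x to {hbm4_stack / ddr5_narrow:.0f}x")
```

On these numbers HBM delivers roughly 20 to 40 times the bandwidth of DDR5 for the same stretch of die edge.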

Dwarkesh Patel

Then there’s the question of where you can destroy demand to free up enough for AI. I guess the picture is especially bad because, as you’re saying, if it takes four times more wafer area to get the same byte for HBM, you have to destroy four times as much consumer demand for laptops and phones to free up one byte for AI.

What does this imply for the next year or two? Sorry for the run-on question, but in your newsletter you said 30% of Big Tech’s CapEx in 2026 is going towards memory?

Dylan Patel

Yes.

Dwarkesh Patel

That’s insane, right? Of the $600 billion or whatever, 30% is going just to memory.

Dylan Patel

Yes. Obviously, there’s some level of margin stacking that Nvidia does, so you have to separate that out and apply their margin to the memory and the logic. But at the end of the day, a third of their CapEx is going to memory.

Dwarkesh Patel

That’s crazy. What should we expect over the next year or two as this memory crunch hits?

Dylan Patel

The memory crunch will continue to get harder, and prices will continue to go up. This affects different parts of the market differently. Are people going to hate AI more and more? Yes, because smartphones and PCs are not going to get incrementally better year on year. In fact, they’re going to get incrementally worse.

Dwarkesh Patel

If you look at the bill of materials for an iPhone, what fraction of it is the memory? How much more expensive does an iPhone get if the memory is two times more expensive?

Dylan Patel

I believe an iPhone has 12 gigabytes of memory. Each gig used to cost roughly $3-4, so that’s $50. But now the price of memory has tripled. Let’s say it’s $12 per gig for DDR. Now you’re talking about $150 versus $50.

That’s a $100 increase in cost for Apple, and that’s just on the DRAM. Apple has some margin, but they’re not just going to eat it. NAND has the same market dynamics, so in reality it’s probably a $150 increase on the iPhone. Apple either has to pass that on to the consumer or eat it. I don’t see Apple reducing their margin too much, maybe they eat a little bit. But at the end of the day, that means the end consumer is paying $250 more for an iPhone.
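The DRAM part of this estimate can be reproduced in a few lines. The capacity and per-gigabyte prices are the ones quoted in the conversation, taking $4 as the upper end of the "$3-4" range; the NAND share is not quoted directly, so the value below is only what the $150 total implies.

```python
DRAM_GB = 12                # memory per phone (quoted)
OLD_PRICE_PER_GB = 4.0      # upper end of the quoted "$3-4" range
NEW_PRICE_PER_GB = 12.0     # quoted price after tripling
TOTAL_BOM_INCREASE = 150.0  # quoted DRAM plus NAND increase


def dram_cost(capacity_gb: int, price_per_gb: float) -> float:
    """Bill-of-materials cost of the DRAM in one phone."""
    return capacity_gb * price_per_gb


old_cost = dram_cost(DRAM_GB, OLD_PRICE_PER_GB)
new_cost = dram_cost(DRAM_GB, NEW_PRICE_PER_GB)
dram_increase = new_cost - old_cost
implied_nand_increase = TOTAL_BOM_INCREASE - dram_increase  # derived, not quoted

print(f"DRAM before: ${old_cost:.0f}, after: ${new_cost:.0f}")
print(f"DRAM increase: ${dram_increase:.0f}")
print(f"Implied NAND increase: ${implied_nand_increase:.0f}")
```

The exact figures come out slightly below the rounded $50, $150, and $100 used in conversation, which is expected given the rounding.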

Now that’s just on last year’s pricing versus today’s. There is some lag before Apple feels the heat because they tend to have long-term contracts for memory that last three months to a year. But at the end of the day, Apple gets hit pretty hard by this. They won’t really adjust until the next iPhone release.

But that’s the high end of the market, which is only a few hundred million phones a year. Apple sells two or three hundred million phones annually. The bulk of the market is mid-range and low-end. It used to be that 1.4 billion smartphones were sold a year. Now we’re at about 1.1 billion. Our projections are that we might drop to 800 million this year, and down to 500 or 600 million next year.

We look at data points out of China from some of our analysts in Asia, Singapore, Hong Kong, and Taiwan. They’ve been tracking this, and they see Xiaomi and Oppo cutting low-end and mid-range smartphone volumes by half.

Yes, it’s only a $150 BOM increase on a $1,000 iPhone where Apple has some larger margin. But for smaller phones, the percentage of the BOM that goes to memory and storage is much larger. And the margins are lower, so there’s less capacity to even eat the margins. And they have also generally tended not to do long-term agreements on memory.

Why this is a big deal is that if smartphone volumes halve, that drop will happen in the low and mid-range, not the high end. So it’s not like the bits released are halving. Currently, consumer devices account for more than half of memory demand. Even if you halve smartphone volumes, because of the shape of the halving, the low end gets cut by more than half, while the high end gets cut by less than half, because you and I will still buy the high-end phones that cost north of a thousand dollars. We’ll buy them even if they get a little bit more expensive. And Apple’s volumes will not go down as much as a low-end smartphone provider.
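A toy example shows why halving unit volumes does not halve the memory released. The split between high-end and other phones, the 6 GB figure for low and mid-range devices, and the size of each segment's cut are all hypothetical; only the roughly 1.1 billion total and the 12 GB high-end figure echo numbers from the conversation.

```python
# Hypothetical market mix: (units in millions, DRAM per phone in GB).
before = {"high end": (300, 12), "low and mid-range": (800, 6)}
# Assumed outcome: total units halve, but the cut lands mostly on the low end.
after = {"high end": (270, 12), "low and mid-range": (280, 6)}


def total_units(mix: dict) -> int:
    """Total phones sold, in millions."""
    return sum(units for units, _ in mix.values())


def total_gb(mix: dict) -> int:
    """Total DRAM shipped in phones, in millions of GB."""
    return sum(units * gb for units, gb in mix.values())


unit_drop = 1 - total_units(after) / total_units(before)
memory_drop = 1 - total_gb(after) / total_gb(before)

print(f"Units fall by {unit_drop:.0%}")
print(f"Memory shipped falls by {memory_drop:.0%}")
```

In this made-up mix, units fall by half while the memory freed up is closer to 41 percent, because the phones that survive carry more memory each.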

The same applies to PCs. What this does to the market is quite drastic. DRAM gets released and goes to AI chips, who are willing to do longer-term contracts and pay higher margins, because at the end of the day the margin they extract from the end user is much larger.

This probably leads to people hating AI even more. Today, you already see all the memes on PC subreddits and gaming PC Twitter. It’s cat dancing videos saying, “This is why memory prices have doubled and you can’t get a new gaming GPU or desktop.” It’s going to be even worse when memory prices double again, especially DRAM.

Another interesting dynamic is that it’s not just DRAM, it’s also NAND. NAND is also going up in price. Both of these markets have expanded capacity very slowly over the last few years, NAND almost zero. The percentage of NAND that goes to phones and PCs is larger than the percentage of DRAM that goes to phones and PCs.

As you destroy demand, mostly for DRAM purposes, you unlock more NAND that gets allocated and can go to other markets. The price increases of DRAM will be larger than those of NAND because you’ve released more from the consumer, and in fact, you’ve produced more memory for AI.

Dwarkesh Patel

Sorry, maybe you just explained it and I missed it. Is it because SSDs are being used in large quantities for data centers?

Dylan Patel

They are, but not in as large quantities as DRAM.

Dwarkesh Patel

Okay, so they will also increase because they’ll be using some quantity, but there’s not as much of a need as there is for HBM. Makes sense.

One thing I didn’t appreciate until I was reading some of your newsletters is that the same constraints preventing logic scaling over the next few years are quite similar to what’s preventing us from producing more memory wafers. In fact, literally the same exact machine, this EUV tool, is needed for memory. So I guess the question someone could ask right now is, why can’t we just make more memory?

Dylan Patel

The constraints, as I was mentioning earlier, are not necessarily EUV tools today or next year. They become that as we get to the latter part of the decade. Currently, the constraints are more that they physically just haven’t built fabs. Over the last three to four years, these vendors have not built new fabs because memory prices were really low. Their margins were low, and in fact, they were losing money in 2023 on memory. So they decided they weren’t building new fabs. The market slowly recovered over time but never really got amazing until last year.

In 2024, we were banging on the drums that reasoning means long context, which means a large KV cache, which means a lot of memory demand. We’ve been talking about that for a year and a half, two years. People who understand AI went really long on memory then. So you’ve seen that dynamic, but now it has finally played out in pricing.

It took so long for what was obvious: long context means the KV cache gets bigger, you need more memory. Half the cost of accelerators is memory. Of course they’re going to start going crazy on it. It took a year for that to actually reflect in memory prices. Once memory prices reflected that, it took another three to six months for the memory vendors to start building fabs. Those fabs take two years to build. So we won’t have really meaningful fabs to even put these tools in until late 2027 or 2028.

Instead, you’ve seen some really crazy stuff to get capacity. Micron bought a fab from a company in Taiwan that makes lagging-edge chips. Hynix and Samsung are doing some pretty crazy things to try and expand capacity at their existing fabs, which also have large knock-on effects in the economy.

So why can’t we build more capacity? There’s nowhere to put the tools. It’s not just EUV; there are other tools involved in DRAM and logic. In logic, for N3, about 28% of the cost of the final wafer is EUV. When you look at DRAM, it’s in the teens. It’s going up, but it’s a much smaller percentage of the cost. These other tools are also bottlenecks, although their supply chains are not as complex as ASML’s.

You see Applied Materials, Lam Research, and all these other companies expanding capacity a lot as well. But you don’t have anywhere to put the tool, because the most complex buildings people make are fabs, and fabs take two years to build.

Dwarkesh Patel

I interviewed Elon recently, and his whole plan is that they’re going to build this TeraFab and they’re going to build the clean rooms. I won’t even ask you about the dirty rooms thing, but let’s say they build the clean rooms.

I have a couple of questions. One, do you think this is the kind of thing that Elon Co. could build much faster than people conventionally build it? This is not about building the end tools. This is just about building the facility itself. How complicated is it to just build the clean room extremely fast? Is this something that Elon, with his “move fast” approach, could do much faster if that’s what we’re bottlenecked on this year or next year? Two, does that even matter if, in two years, your view is that we’re not bottlenecked on clean room space, but on the tooling?

Dylan Patel

As with any complex supply chain, it takes time, and constraints shift over time. Even if something is no longer a constraint, that doesn’t mean that market no longer has margin. For example, energy will not be a big bottleneck a couple of years from now, but that doesn’t mean energy isn’t growing super fast and there’s no margin there. It’s just not the key bottleneck. In the space of fabs, clean rooms are the biggest bottleneck this year and next year. As we get to 2028, 2029, 2030, there will still be constraints there.

The thing about Elon is he has a tremendous capability to garner physical resources and really smart people to build things. The way he recruits amazing people is by trying to build the craziest stuff. In the case of AI, that hasn’t really worked because everyone’s trying to build AGI. Everyone is very ambitious. But in the case of going to Mars, making rockets that land themselves, fully autonomous electric cars, or humanoid robots, these are methods of recruiting the people who think that’s the most important problem in the world to work on that problem, because he’s the only one trying really hard.

In the case of semiconductors, he stated he wants to make a fab that’s a million wafers per month. No one has a fab that big. It’s possible that he’s able to recruit a lot of really awesome people and get them on this crazy task of building a million wafers a month. Step one is to build the clean room, and that I think he probably can do. His mindset around deleting things, that it can be dirty, it’s fine, is probably not right. Actually I think it’s 100% not right. You need the fab to be very clean. All of the air in the fab gets replaced every three seconds, it’s that fast. There have to be so few particles.

But I think he can build the clean room. It’ll take a year or two. Initially, it won’t be super fast, but over time, he’ll get faster at it. The really complex part is actually developing a process technology and building wafers. I don’t think he can develop that quickly. That has a lot of built-up knowledge. The most complicated integration of very expensive tools and supply chains is done by TSMC, Intel, or Samsung. The latter two aren’t even that great at it, and it’s tremendously complex.

Dwarkesh Patel

How surprised would you be if in 2030 there just happened to be some total disruption where we’re not using EUV? What if we’re using something that has much better effects, is much simpler to produce, and can be produced in much bigger quantities? I’m sure as an industry insider that sounds like a totally naive question, but do you see what I’m asking? What probability should we put on something coming totally out of left field to make all of this irrelevant?

Dylan Patel

Something that’s very simple and easy to scale, I assign a very, very low probability. There are a number of companies working on effectively particle accelerators or synchrotrons that generate light that’s either 13.5 nanometer, like EUV, or an even shorter wavelength, like X-ray at 7 nanometers, to then use in lithography tools. But those things are massive particle accelerators generating this light. It’s a very complicated thing to build.

There are a couple of companies and I think that could be a big disruption to the industry beyond EUV. But I don’t think we’re going to magically build something new that is direct write and super simple, and can be manufactured at huge volumes, although there are some attempts to do things like this.

Dwarkesh Patel

I ask because if you think about Elon’s companies in the past, rocketry was this thing that was thought to be—and is—incredibly complicated.

Dylan Patel

Look, I’m just a naive yapper compared to Elon. What have I built? So maybe it’s possible.

Dwarkesh Patel

In order to build more memory in the future, could we build 3D DRAM the way we do 3D NAND and then go back to DUV?

Dylan Patel

That is the hope currently. Everyone’s roadmap for 3D DRAM is that you’ll still use EUV because you want to have that tighter overlay. When you’re doing these subsequent processing steps, everything is vertically stacked and you have more layers on top of each other. You want the pitches to be tighter. So generally, people are still trying to do it with EUV.

But what 3D would do is change the calculation of how many bits a single EUV pass can make. That number would go up drastically if you go to 3D DRAM. That is the hope. Right now, everyone’s roadmap goes from the current 6F² cell, to a 4F² cell, and then finally 3D DRAM by the end of the decade or early next decade. There’s still a lot of R&D, manufacturing, and integration to be done. I wouldn’t call that out of the cards. I think it’s very likely going to happen.

It’s also going to require a huge retooling of fabs. The breakdown of tools in a fab will be very different. The lithography tool is actually the only thing that isn’t that different. But the number of them relative to different types of chemical vapor deposition, atomic layer deposition, dry etch, or different kinds of etch chambers with different chemistries… You have all these different tools for different process nodes. You can’t just convert a logic fab to a DRAM fab, or vice versa, or a NAND fab to a DRAM fab, in a short amount of time.

In the same way, existing DRAM fabs require a lot of retooling just to go from 1-alpha to 1-beta to 1-gamma process nodes, because they have to add EUV and change the chemistry stacks for deposition and etch when EUV is in use. And the EUV tool has to be there. Furthermore, when you change to 3D DRAM, there’s going to be an even larger shift, so a lot of retooling of these fabs needs to happen.

That would be a big disruption. That would make EUV demand generally lower. But as we’ve seen across time, lithography as a percentage of wafer cost has trended up. Around the 2014 era, it was 17% of the wafer cost, and it’s gone to 30% since then. For DRAM, it was in the low to mid-teens, and now it’s trended toward the high teens. Before we get to 3D DRAM, it’ll likely cross into the 20% range. But then, if we get to 3D DRAM, EUV as a percentage of the total end wafer cost tanks again.

Dwarkesh Patel

I guess you care less about the percent of cost and more about how much it bottlenecks production.

Dylan Patel

Right, but the percentage of cost—

Dwarkesh Patel

It’s a proxy, yeah. If you’re Jensen or Sam Altman, or whoever stands to gain a lot from scaling up AI compute, there are these stories that they’d go to TSMC and say, “Why can’t we access Y and Z?” But I think the point you’re making is that it doesn’t really matter what TSMC does in some sense. In fact, even if you have Intel and Samsung building more foundries, in the long run, you’re going to be bottlenecked by ASML and other tool and material makers.

First, is that a correct interpretation? Second, should Silicon Valley people be going to the Netherlands right now to try to pitch ASML to make more tools so that in 2030 they can have more AI compute?

Dylan Patel

It’s a funny dynamic we saw in 2023, 2024, and 2025. People who saw the energy bottleneck before others asymmetrically went to Siemens, Mitsubishi, and of course GE Vernova, and bought up turbine capacity. Now they’re able to charge excess amounts for deploying these turbines in places because of energy.

In the same sense, this could be done for EUV, except ASML is not just going to trust any random bozo who wants to buy EUV tools. These turbines are much cheaper than EUV tools, and there are many more of them produced. Especially once you get to industrial gas turbines, not just combined-cycle but the cheaper, smaller, less efficient ones, people put down deposits for these.

Someone could do this. Someone should go to the Netherlands and be like, “I’ll pay you a billion dollars. You give me the right to purchase ten EUV tools two years from now, and I’m first in line.” Then over those two years, you go around and wait for everyone to realize, “Oh crap, I don’t have enough EUV tools,” and you try to sell your option at some premium. All you’re effectively doing is saying, “ASML, you’re dumb. You weren’t making enough margin on these. I’m going to make a margin.” The question is, will ASML even agree to this? I don’t think so.

Dwarkesh Patel

There’s a world where they at least get the demand signal from that to increase production.

Dylan Patel

Potentially. I agree.

Dwarkesh Patel

But it sounds like you’re saying they couldn’t even increase production if they wanted to, given the supply chain.

Dylan Patel

Right. But that’s exactly the market in which… If they can’t increase production, just like TSMC cannot increase production that fast, and yet demand is mooning, then the obvious solution is to arbitrage this. You and I know demand is way higher than both their projections and their capability to build.

You arbitrage this by locking up the capacity, doing a forward contract, and then trying to sell it at a later date once other people realize everything is fucked and we don’t have enough capacity. Then you’ll have this insane margin that ASML and TSMC should have been charging. But the thing is, I don’t know if ASML and TSMC will ever agree to this.

01:42:34 – Scaling power in the US will not be a problem

Dwarkesh Patel

Let me ask you about power now. It sounds like you think power can be arbitrarily scaled.

Dylan Patel

Not arbitrarily, but yes.

Dwarkesh Patel

But beyond these numbers. If I’m remembering correctly, your blog post on how AI labs are increasing power implied that GE Vernova, Mitsubishi, and Siemens could produce 60 gigawatts a year in gas turbines. Then there are these other sources, but they’re less significant than the turbines.

Only a fraction of that goes to AI, I assume. If in 2030 we have enough logic and memory to do 200 gigawatts a year, do you just think that these things are on a path to ramp up to more than 200 gigawatts a year, or what do you see?

Dylan Patel

Right now we’re at 20 or 30. This is critical IT capacity, by the way, which is an important thing to mention. When I’m talking about these gigawatts, I’m talking about critical IT capacity. Server plugged in, that’s how much power it pulls. But there are losses along the chain. There is loss on transmission, conversion, cooling, et cetera. So you should gross this factor up from 20 gigawatts for this year, or 200 gigawatts by the end of the decade, to some number 20-30% higher.

Then you have capacity factors. Turbines don’t run at 100 percent. If you look at PJM, which I think is the largest grid in America—covering the Midwest and some of the Northeast area—in their models they want to have roughly 20 percent excess capacity. Within that 20 percent excess capacity, they’re running all the turbines at 90% because they are derated some for reliability, maintenance, and so on. In reality, the nameplate capacity for energy is always way higher than the actual end critical IT capacity because of all these factors.
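To make the gross-up concrete, here is a rough sketch of how critical IT load might translate into nameplate generation. The way the three factors are chained together is an assumption made for illustration (the conversation gives the percentages but not a formula), and `nameplate_gw` and its parameter names are hypothetical.

```python
def nameplate_gw(critical_it_gw, chain_loss=0.25, reserve_margin=0.20, derate=0.90):
    """Rough nameplate generation needed to serve a given critical IT load.

    chain_loss:     transmission, conversion, and cooling losses (20-30% in the discussion)
    reserve_margin: excess capacity a grid operator plans for (~20% for PJM in the discussion)
    derate:         fraction of nameplate that turbines actually run at (~90%)
    """
    # Gross up the server-level load for losses along the chain.
    delivered = critical_it_gw * (1.0 + chain_loss)
    # Add the planning reserve on top of that load.
    with_reserve = delivered * (1.0 + reserve_margin)
    # Turbines are derated, so more nameplate is needed than the load implies.
    return with_reserve / derate


for it_load in (20, 200):
    print(f"{it_load} GW critical IT -> ~{nameplate_gw(it_load):.0f} GW nameplate")
```

Under these assumed factors, nameplate capacity comes out roughly two-thirds higher than critical IT capacity, which is the sense in which nameplate is "always way higher."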

But it’s not just turbines. If you were just making power from turbines, that’s simple, boring, and easy. Humans and capitalism are far more effective. The whole point of that blog was that, yes, there are only three people making combined-cycle gas turbines, but there’s so much more we can do. We can do aeroderivatives. We can take airplane engines and turn them into turbines. There are even new entrants in the market, like Boom Supersonic trying to do that and working with Crusoe. There are also all the others like that which already exist in the market.

There are also medium-speed reciprocating engines: engines that spin in circles, like a diesel engine. There are ten people who make engines that way. I’m from Georgia, and people used to be like, “Oh man, you got a Cummins engine in there,” regarding RAM trucks. Automobile manufacturing is going down, so these companies all have capacity and could scale and convert that for data center power. You stick all these reciprocating engines in. It’s not as clean as combined-cycle, but maybe you can convert them from diesel to gas if you want.

What about ship engines? All of these engines for massive cargo ships are great. Nebius is doing that for a Microsoft data center in New Jersey. They’re running ship engines to generate power. Bloom Energy is doing fuel cells. We’ve been very positive on them for a year and a half now because they have such a capability to increase their production. Their payback period for a production increase is very fast, even if the cost is a little bit higher than combined-cycle, which is the best for cost and efficiency.

Then there’s solar plus battery, which can come online as those cost curves continue to come down. There’s wind, where you might only expect 15 percent of the maximum power because things oscillate, but you add batteries. There are all these things.

The other thing is that the grid is scaled so we don’t cut off power at peak usage on the hottest day of the summer. But in reality, that’s a load spike that is 10-20% higher than the average. If you just put in enough utility-scale batteries, or peaker plants that only run a small portion of the year (and those could be gas, industrial gas turbines, combined-cycle, batteries, or any of the other sources I mentioned), then all of a sudden you’ve unlocked 20% of the US grid for data centers. Most of the time that capacity is sitting idle. It’s really only there for that peak, which is just a few hours over a few days of the year. If you have enough capacity to absorb that peak load, then all of a sudden you’ve transferred it all.

Today, data centers are only 3-4% of the power of the US grid, and by 2028 they’ll be 10%. But if you can unlock 20% of the US grid like this, it’s not that crazy. The US grid is terawatt-level, not hundreds-of-gigawatts-level. So we can add a lot more energy.
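As a quick sanity check on those shares, the sketch below converts them to gigawatts. It assumes a US grid of about one terawatt, which is the round figure used later in this conversation; the function name `grid_share_gw` and the 3.5% midpoint are illustrative choices, not numbers from a dataset.

```python
US_GRID_GW = 1000.0  # assumption: ~1 TW, the round figure used in this conversation


def grid_share_gw(share, grid_gw=US_GRID_GW):
    """Convert a fraction of the grid into gigawatts."""
    return share * grid_gw


today_gw = grid_share_gw(0.035)   # midpoint of the 3-4% quoted for data centers today
by_2028_gw = grid_share_gw(0.10)  # the 10% projected for 2028
peak_gw = grid_share_gw(0.20)     # capacity held for peak that batteries or peakers could free up

print(f"today ~{today_gw:.0f} GW, 2028 ~{by_2028_gw:.0f} GW, peak headroom ~{peak_gw:.0f} GW")
```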

I’m not saying it’s easy. These things are going to be hard. There’s a lot of hard engineering, risks people have to take, and new technologies people have to use. But Elon was the first to do this behind-the-meter gas, and since then we’ve seen an explosion of different things people are doing to get power. They’re not easy, but people are gonna be able to do them. The supply chains are just way simpler than chips.

Dwarkesh Patel

Interesting. He made the point during the interview that for the specific blade for the specific turbine he was looking at, the lead times go out beyond 2030. Your point is that—

Dylan Patel

That’s great. There are so many other ways to make energy. Just be inefficient. It’s fine.

Dwarkesh Patel

Right now, combined-cycle gas turbines have CapEx of $1,500 per kilowatt. Are you saying it would make sense to have either technologies that are much more expensive than that, or other things are getting cheap enough to make it competitive?

Dylan Patel

Exactly. It can be as high as $3,500 per kilowatt. It could be twice as much as the cost of combined-cycle, and the total cost of the GPU on a TCO basis has only gone up a few cents per hour.

Because we’ve been talking about Hopper pricing, $1.40, let’s say the power price doubles. The Hopper that was $1.40 is now $1.50 in cost. I don’t care, because the models are improving so fast that the marginal utility of them is worth way more than that ten-cent increase in energy.
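The arithmetic behind that claim can be sketched as follows. It takes only the two hourly figures quoted above and backs out the implied power share; the function name `implied_power_share` is hypothetical, and the calculation assumes power is the only component that changes.

```python
def implied_power_share(base_cost_per_hr, cost_after_doubling):
    """Back out power's share of GPU hourly cost from a 'power price doubles' scenario.

    If doubling the power price adds X dollars per hour, the original
    power component must also have been X dollars per hour.
    """
    power_component = cost_after_doubling - base_cost_per_hr
    return power_component / base_cost_per_hr


share = implied_power_share(1.40, 1.50)
print(f"Implied power share of the Hopper hourly cost: {share:.1%}")
```

On these numbers power is a single-digit percentage of the hourly cost, which is why even a much more expensive generation source barely moves the total.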

Dwarkesh Patel

So you’re saying 20 percent of the grid—the grid is about one terawatt—can just come online from utility-scale batteries, increasing what you’d be comfortable putting on the grid.

Dylan Patel

The regulatory mechanism there is not easy, by the way.

Dwarkesh Patel

But that’s 200 gigawatts, if that hypothetically happens. Just from the different sources of gas generation you mentioned—the different kinds of engines and turbines—combined, how many gigawatts could they unlock by the end of the decade?

Dylan Patel

We’re tracking this in our data. There are over 16 different manufacturers of power-generating things just from gas alone. Yes, there are only three turbine manufacturers for combined-cycle, but we’re tracking 16 different vendors, and we have all of their orders. It turns out there are hundreds of gigawatts of orders to various data centers.

As we get to the end of the decade, we think something like half of the capacity that’s being added will be behind the meter. Behind the meter is almost always more expensive than grid-connected, but there are just a lot of problems with getting grid-connected: permits and interconnection queues and all this sort of stuff. So even though it’s more expensive, people are doing behind the meter.

What they’re doing behind the meter ranges widely. It could be reciprocating engines, ship engines, or aeroderivatives. It could be combined-cycle, although combined-cycle is not that great for behind the meter. It could be Bloom Energy fuel cells, or solar plus battery. It could be any of these things.

Dwarkesh Patel

And you’re saying any of these individually could do tens of gigawatts?

Dylan Patel

Any of these individually will do tens of gigawatts, and as a whole, they will do hundreds of gigawatts.

Dwarkesh Patel

Okay. So that alone should more than—

Dylan Patel

Electrician wages will probably double or triple again. There are going to be a lot of new people entering that field, and a ton of people who make money, but I don’t see that as the main bottleneck.

Dwarkesh Patel

Right now in Abilene, at the 1.2-gigawatt data center that Crusoe is building for OpenAI, I think they have 5,000 people working there, or at peak they did. If you turn that into 100 gigawatts—and I’m sure things will get more efficient over time—that would be 400,000 people it would take to build 100 gigawatts.

If you think about the US labor force, and how many electricians there are and how many construction workers there are… I guess there are 800,000 electricians. I don’t know if they’re all substitutable in this way. There are millions of construction workers. But if we’re in a world where we’re adding 200 gigawatts a year, are we going to be crunched on labor eventually, or do you think that is actually not a real constraint?
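The labor extrapolation in the question can be written out as a linear scaling. This is only a sketch: it assumes headcount scales proportionally with gigawatts, and the `efficiency_gain` parameter is a hypothetical knob for the "things will get more efficient" caveat.

```python
def workers_needed(target_gw, ref_workers=5000, ref_gw=1.2, efficiency_gain=0.0):
    """Scale the Abilene reference (5,000 workers for 1.2 GW) to a target build-out.

    efficiency_gain is a hypothetical fractional reduction in workers per GW,
    for example from modularization.
    """
    workers_per_gw = ref_workers / ref_gw
    return target_gw * workers_per_gw * (1.0 - efficiency_gain)


ELECTRICIANS = 800_000  # rough US figure quoted in the question

for gw in (100, 200):
    n = workers_needed(gw)
    print(f"{gw} GW -> ~{n:,.0f} workers ({n / ELECTRICIANS:.0%} of the electrician count)")
```

At 200 gigawatts a year, the unadjusted estimate is about the same size as the entire electrician workforce cited, which is why modularization matters so much in the answer that follows.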

Dylan Patel

Labor is a big constraint. It’s a humongous constraint in this. People have to be trained. We’ll also probably start importing the highest-skilled labor. It makes sense that a really high-skilled electrician in Europe who was working on destroying power plants now comes to America and builds the high-voltage systems that move electricity across a data center.

Humanoid robots, or robotics at least, might start to help, but the main factor for reducing the number of people is going to be modularizing things and making them in factories in Asia. Unfortunately for America, places like Korea, Southeast Asia, and in many ways China as well are going to ship in more and more built-out sections of the data center. Today you ship a server or a rack in, and then you plug that into different pieces that you’re shipping from different places.

But now you’ll ship it to a factory and integrate the entire thing. Maybe this is a two-megawatt block, and this block goes from high-voltage AC power to the DC voltage that you deliver to the rack, or something like this. Or with cooling, you ship a fully integrated unit that has a lot of the cooling subsystems already put together, because plumbers are also a big constraint here.

Furthermore, instead of just a single rack where you have people wiring up all these racks with electricity, you take a skid and put an entire row of servers on it that is shipped directly from the factories. Today, a single rack may be 120 or 140 kilowatts, but as we get to next-generation Nvidia Kyber and things like that, it’s almost a megawatt.

In addition, if you do an entire row, it’ll have the rack, the networking, the cooling, and the power all integrated together. Now when you come in, you have much less to cable. There’s less networking fiber, fewer power connections, and fewer plumbing things. This can drastically reduce the number of people working in data centers, so our capability to build them will be much larger.

Along the way, some people will move faster to new things, and some will move slower. Crusoe and Google have been talking a lot about this modularization, as have companies like Meta and many others. The people who move faster to new things may face delays, while the people who are slower will face labor problems. There will always be dislocations in the market because this is a very complex supply chain. At the end of the day, it’s still simple enough that we will be able to solve it through capitalism and human ingenuity on the timescales required.

01:54:44 – Space GPUs aren’t happening this decade

Dwarkesh Patel

Speaking of big problems to solve, Elon Musk is very bullish on space GPUs. If you’re right that power is not a constraint on Earth… I guess the other reason they would make sense is that even if there will be enough gas turbines or whatever on Earth, Elon’s next argument is that you can’t get the permitting to build hundreds of gigawatts on Earth. Do you buy that argument?

Dylan Patel

Land-wise, America is big. Data centers don’t actually take up that much space, so you can solve that. Permitting-wise, air pollution permits are a challenge, but the Trump administration made it much easier. You go to Texas, and you can skip a lot of this red tape.

Elon had to deal with a lot of this complex stuff in Memphis, and then building a power plant across the border for Colossus 1 and 2. But at the end of the day, there’s a lot more you can get away with in the middle of Texas.

Dwarkesh Patel

Given that Elon lives in Texas, why didn’t he just go to Texas?

Dylan Patel

I think it was partially that they over-indexed on grid power for a temporary period of time. That’s just what they thought they needed more of.

Dwarkesh Patel

Because they had an aluminum refinery connected to the grid there.

Dylan Patel

It was actually an idled appliance factory. But I think they may have indexed more to grid power, water access, and gas access. I think they bought that knowing the gas line was right there and they were going to tap it. Same with water. It was a whole host of different constraints. It was probably an area where electricians were easier to find.

At the end of the day, I’m not exactly sure why they chose that site. I bet Elon would’ve chosen somewhere in Texas if he could’ve gone back because of the regulatory challenges he faced. Ultimately, permitting is a challenge, but America is a big place with 50 states, and things will get done.

There are a lot of small jurisdictions where you can just transport in all the workers you need for a temporary period of three to twelve months, depending on the contractor. You can put them in temporary housing and pay out the butt, because labor is very cheap relative to the GPUs and the networking, and the end value of the tokens it’s going to produce. So there is plenty of room to pay for all of these things.

People are also diversifying now. Australia, Malaysia, Indonesia, and India are all places where data centers are going up at a much faster pace. But currently, over 70% of AI data centers are still in America, and that continues to be the trend. People are figuring out how to build these things. Ultimately, dealing with permitting and red tape in middle-of-nowhere Texas, Wyoming, or New Mexico is probably a hell of a lot easier than sending stuff into space.

Dwarkesh Patel

Other than the economic argument making less sense once you consider that energy is a small fraction of the total cost of ownership of a data center, what are the other reasons you’re skeptical?

Dylan Patel

Obviously, power is basically free in space.

Dwarkesh Patel

That’s the reason to do it.

Dylan Patel

Yeah, that’s the reason to do it. But there are all the other counterarguments. Even if power costs double on Earth, it’s still a fraction of the total cost of the GPU.

The main challenge is… We have ClusterMAX, which rates all the neoclouds. We test over 40 cloud companies, including the hyperscalers and neoclouds. Outside of software, what differentiates these clouds the most is their ability to deploy and manage failure.

GPUs are horrendously unreliable. Even today, around 15% of Blackwells that get deployed have to be RMA’d. You have to take them out. Sometimes you just have to plug them back in, but sometimes you have to take them out and ship them back to Nvidia or their partners who do the RMAs and such.

Dwarkesh Patel

What do you make of Elon’s argument that after an initial phase, they actually don’t fail that much?

Dylan Patel

Sure, but now you’ve done this, tested them all, deconstructed them, put them on a spaceship, launched them into space, and then put them online again. That takes months. If your argument is that a GPU has a useful life of five years, and this takes six additional months, that is 10% of your cluster’s useful life.

Because we’re so capacity-constrained, that compute is theoretically most valuable in the first six months you have it. We’re more constrained now than we will be in the future. That compute can contribute to a better model in the future, or generate revenue today that you can use to raise more money. All these things make now the most important moment, but you’ve potentially delayed your compute deployment by six months.

What separates these cloud providers is… We see some clouds taking six months to deploy GPUs right here on Earth. We see clouds that take a lot less than six months. So the question is, where does space get in there? I don’t see how you could test them all on Earth, deconstruct them, and ship them to space without it taking significantly longer than just leaving them in the facility where you tested them.
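The deployment-delay argument reduces to a simple ratio. The sketch below uses the five-year life and six-month delay from the discussion; it treats every month of GPU life as equally valuable, which understates the point being made, since the early months are argued to be worth the most. The function name is hypothetical.

```python
def life_lost_fraction(delay_months, useful_life_years=5):
    """Fraction of a GPU's useful life consumed by a deployment delay."""
    return delay_months / (useful_life_years * 12)


print(f"6-month delay on a 5-year life: {life_lost_fraction(6):.0%} of useful life")
```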

Dwarkesh Patel

The question I wanted to ask is about the topology of space communication. Right now, Starlink satellites talk to each other at 100 gigabits per second. You could imagine that being much higher with optical intersatellite laser links optimized for this. That actually ends up being quite close to InfiniBand bandwidth, which is 400 gigabits a second.

Dylan Patel

But that’s per GPU, not per rack. So multiply that by 72. Also, that was Hopper. When you go to Blackwell and Rubin, that 2x’s and 2x’s again.
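A back-of-the-envelope version of that correction is sketched below. It assumes the 400 figure is in gigabits per second per GPU (how InfiniBand link speeds are normally quoted) and reads "2x's and 2x's again" as one doubling for Blackwell and a second for Rubin. The constant and function names are illustrative.

```python
SATELLITE_LINK_GBPS = 100   # Starlink laser link speed quoted in the question
IB_PER_GPU_GBPS = 400       # assumption: Hopper-era InfiniBand, per GPU, in gigabits/s
GPUS_PER_RACK = 72          # rack-scale domain used in the reply


def rack_scale_out_gbps(doublings_since_hopper=0):
    """Aggregate per-rack network bandwidth after a number of generational doublings."""
    return IB_PER_GPU_GBPS * GPUS_PER_RACK * 2 ** doublings_since_hopper


for name, doublings in (("Hopper", 0), ("Blackwell", 1), ("Rubin", 2)):
    rack = rack_scale_out_gbps(doublings)
    print(f"{name}: {rack:,} Gb/s per rack, {rack // SATELLITE_LINK_GBPS}x one satellite link")
```

Per rack, the gap is three orders of magnitude rather than the factor of four suggested by comparing one laser link to one GPU's link.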

Dwarkesh Patel

But how much compute is happening per… During inference, are the different scale-ups still working together, or is inference just happening as a batch within a single scale-up?

Dylan Patel

A lot of models fit within one scale-up domain, but many times you split them across multiple scale-up domains.

As models become more and more sparse, which is the general trend, you want to ping just a couple of experts per GPU. If leading models today have hundreds, if not a thousand, experts, then you’d want to run this across hundreds or thousands of chips, even as we advance into the future.

So then you end up with the problem of needing to connect all these satellites together for communications as well.

Dwarkesh Patel

That would be tough. If there’s a world where you could do inference for a batch on a single scale-up, then maybe it’s more plausible. But if not, it’s a different story.

Dylan Patel

Networking these chips together is a problem, and you can’t just make the satellite infinitely large. There are a lot of physics challenges to making a satellite really big. That’s why you need these interconnects between the satellites.

Those interconnects are more expensive. In a cluster, 15-20% of the cost is networking. All of a sudden, you’re using space lasers instead of simple lasers that are manufactured in volumes of millions with pluggable transceivers.

And those things are very unreliable as well, more unreliable than the GPUs by the way. Across the life of a cluster, you have to unplug and clean them all the time. You have to unplug and replug them just for random reasons. These things are just not as reliable. So you’ve got that problem as well. You’ve got a more expensive, complicated space laser to communicate instead of this pluggable optical transceiver that’s been produced in super high volume.

Dwarkesh Patel

So all in all, what does that imply for space data centers?

Dylan Patel

Space data centers’ energy advantage is effectively not what limits them. They are limited by the same contended resource: chips. We can only make two hundred gigawatts of chips a year by the end of the decade. What are we going to do to get that capacity? It doesn’t really matter whether it’s on land or in space, because you can build that power. Human capabilities and capacity could get to the point where we’re adding a terawatt a year globally of various types of power.

At some point, we do cross the chasm where space data centers make sense, but it’s not this decade. It is much further out, once energy constraints actually become a big bottleneck and land permitting becomes a much bigger bottleneck as it subsumes more of the economy. And crucially, once chips are no longer the bottleneck.

Right now, chips are the biggest bottleneck. You want them deployed and working on AI the moment they’re manufactured. There are a lot of things people are doing to make that deployment faster and faster. They’re modularizing data centers, or even modularizing racks, where you put only the chip in at the data center and everything else is already wired up and ready to go. These are things people are doing to decrease that time that you cannot do in space.

At the end of the day, all that matters in a chip-constrained world is getting these chips producing tokens ASAP. Maybe by 2035, the semiconductor industry, ASML, Zeiss, and suppliers like Lam Research and Applied Materials and other fab manufacturers will catch up once the pendulum swings and we are able to make enough chips. Then we will be optimizing every dial, and it will make sense to optimize the 10-15% of cost that is energy. As we potentially move to ASICs, and if Nvidia’s margins aren’t 70%+, maybe that energy cost becomes 30% of the cluster. These are the things to optimize.

But Elon doesn’t win by doing 20% gains. He never wins that way. Elon wins when he swings for the fences and does 10X gains. That’s what SpaceX is about. That’s what Tesla is about. All of his success has been about that, not chasing the 20%. I think space data centers will eventually be a 10X gain as Earth’s resources get more and more contended, but that’s not this decade.

Dwarkesh Patel

Just to drive some intuition about how much land there is on Earth… Obviously, for the chips themselves, especially if you move to a world where you have racks that have megawatts—

Dylan Patel

That’s the other thing. If manufacturing is the constraint, right now it’s roughly one watt per square millimeter for AI chips. One easy way to improve that is to pump it to two watts per square millimeter. You may not get 2x the performance, you may only get 20% more performance, and that requires much more exotic cooling. It requires more complicated cold plates and complex liquid cooling, or maybe even things like immersion cooling.

In space, higher watts per square millimeter is very difficult, whereas on Earth, these are solved problems. One of these things enables you to get a lot more tokens, maybe 20% more tokens per wafer that’s manufactured, and that’s a humongous win.

Dwarkesh Patel

Square millimeter, you mean of die area?

Dylan Patel

Yeah, of die area.

Dwarkesh Patel

It would be better for space because more watts per square millimeter means the chip runs hotter. I guess this is a question of computer chip engineering, but radiative cooling scales with the fourth power of temperature by the Stefan-Boltzmann law. If you can run a very hot chip, it allows a lot of—

Dylan Patel

No, you can’t run it hotter. You can only run it denser. The problem is that getting the heat out of that dense area means you have to move away from standard air and liquid cooling to more exotic forms of liquid cooling, or even immersion, to get to higher power densities. That’s more difficult in space than it is on Earth.
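For reference, the fourth-power scaling raised in the question can be sketched with the Stefan-Boltzmann law. This is an idealized illustration only: it ignores absorbed sunlight and the thermal path from die to radiator, and the temperatures are arbitrary examples. It does not contradict the reply above, which is that the chip itself cannot simply be run hotter.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)


def radiated_w_per_m2(temp_k, emissivity=1.0):
    """Power radiated per square meter of surface at a given temperature (ideal case)."""
    return emissivity * SIGMA * temp_k ** 4


cool, hot = 300.0, 400.0  # example radiator temperatures in kelvin (illustrative only)
ratio = radiated_w_per_m2(hot) / radiated_w_per_m2(cool)
print(f"{cool:.0f} K: {radiated_w_per_m2(cool):.0f} W/m^2")
print(f"{hot:.0f} K radiates {ratio:.2f}x as much per unit area")
```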

Dwarkesh Patel

Maybe it’s worth explaining at this point what exactly a scale-up is and what it looks like for Nvidia versus Trainium versus TPUs.

Dylan Patel

Earlier I was mentioning how communication within a chip is super fast. Communication between chips that are in the same rack is fast, but not as fast. It’s on the order of terabytes a second. Communication very far away is on the order of hundreds of gigabytes a second. As you get further, maybe across the country, it’s on the order of gigabytes a second.

A scale-up domain is this tight domain where the chips are communicating on the order of terabytes a second. For Nvidia, previously this meant an H100 server had eight GPUs, and those eight GPUs could talk to each other at terabytes a second. With Blackwell NVL72, they implemented rack-scale scale-up. That meant all seventy-two GPUs in the rack could connect to each other at terabytes a second. The speed doubled generation on generation, but the most important innovation was going from eight to seventy-two in the domain.

When we look at Google, their scale-up domain is completely different. It has always been on the order of thousands. With TPU v4, they had pods the size of four thousand chips. With v8 or v7, they have pods in the eight or nine thousand range. What’s relevant here is that it’s not the same as Nvidia. It’s not like for like.

Google has a topology that’s a torus. Every chip connects to six neighbors. Nvidia’s 72 GPUs connect all-to-all. They can send terabytes a second to any arbitrary other chip in that pod of scale-up. Whereas Google, you have to bounce through chips. If TPU 1 needs to talk to TPU 76, it has to bounce through various chips, and there is always some blocking of resources when you do that because that one TPU is only connected to six other TPUs.

So there is a difference in topology and bandwidth, and there are trade-offs and advantages to both. Google gets to have a massive scale-up domain, but they have the trade-off of bouncing across chips to get from one to another. You can only talk to six direct neighbors.
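The "bouncing through chips" cost can be illustrated with hop counts on a 3D torus, where each chip has six neighbors. The 16×16×16 shape (4,096 chips) and the way chip numbers map to coordinates are assumptions made for illustration, not a description of Google's actual layout or routing.

```python
def index_to_coord(index, dims):
    """Map a flat chip number to torus coordinates (hypothetical numbering scheme)."""
    coord = []
    for size in dims:
        coord.append(index % size)
        index //= size
    return tuple(coord)


def torus_hops(a, b, dims):
    """Minimum number of neighbor-to-neighbor hops between two chips on a torus."""
    hops = 0
    for coord_a, coord_b, size in zip(a, b, dims):
        delta = abs(coord_a - coord_b)
        # Links wrap around, so go whichever direction is shorter.
        hops += min(delta, size - delta)
    return hops


DIMS = (16, 16, 16)  # assumption: one way to arrange 4,096 chips

src, dst = index_to_coord(1, DIMS), index_to_coord(76, DIMS)
print(f"chip 1 {src} -> chip 76 {dst}: {torus_hops(src, dst, DIMS)} hops")
print(f"worst case on this torus: {sum(size // 2 for size in DIMS)} hops")
print("all-to-all switched domain: every other chip is one switch traversal away")
```

The trade-off described above shows up directly: the torus reaches thousands of chips with only six links each, but a message may cross many intermediate chips, while the all-to-all domain is smaller but every pair is equally close.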

Amazon has mutated their scale-up domain. They’re somewhere in between Nvidia and Google. They’re trying to make larger scale-up domains. They try to do all-to-all to some extent with switches, which is what Nvidia does, but they also use torus topologies like Google to some extent.

As we advance forward to next generations, all three of them are moving more towards a dragonfly topology. That means there are some fully connected elements and some elements that are not fully connected. You can get the scale-up to be hundreds or thousands of chips, but also have it not contend for resources when bouncing through chips.

Dwarkesh Patel

Related question: I heard somebody make the claim that the reason parameter scaling has been slow, and only now are we getting bigger models from OpenAI and Anthropic, is that… The original GPT-4 is over a trillion parameters, and only now are models starting to approach that again. I heard a theory that the reason is that Nvidia’s scale-ups have just not had that much memory capacity. Let’s say you have a 5T model running at FP8, so that’s five terabytes. And then you have the KV cache, let’s say it’s—

Dylan Patel

Just call it the same size.

Dwarkesh Patel

Okay, let’s say it’s the same size for one batch. So you need ten terabytes to be able to run…

Dylan Patel

A single forward pass, yeah.

Dwarkesh Patel

And then only with the GB200 NVL72 do you have an Nvidia scale-up that has twenty terabytes, and before that they were much smaller. Whereas Google, on the other hand, has had these huge TPU pods that are not all-to-all, but still have hundreds of terabytes of capacity in a single scale-up. Does that explain why parameter scaling has been slow?
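The memory arithmetic in this exchange can be sketched as below. FP8 is one byte per parameter, and the KV cache is taken to be the same size as the weights, as agreed in the exchange. The 20 TB figure is the one quoted here; the eight-GPU H100 server at 80 GB per GPU is added for comparison and is not a number from the conversation. Function names are hypothetical.

```python
def required_tb(params_trillions, bytes_per_param=1.0, kv_cache_ratio=1.0):
    """Memory needed for weights plus KV cache, in terabytes.

    1 trillion parameters at 1 byte each (FP8) is 1 TB of weights.
    kv_cache_ratio is the KV cache size relative to the weights.
    """
    weights_tb = params_trillions * bytes_per_param
    return weights_tb * (1.0 + kv_cache_ratio)


def fits(domain_tb, params_trillions):
    """Whether the model plus KV cache fits in one scale-up domain's memory."""
    return required_tb(params_trillions) <= domain_tb


need = required_tb(5)  # 5T-parameter model at FP8
print(f"5T model at FP8 with equal-size KV cache: {need:.0f} TB")
print(f"fits in a 20 TB rack-scale domain: {fits(20.0, 5)}")
print(f"fits in an 8 x 80 GB H100 server (0.64 TB): {fits(8 * 80 / 1000, 5)}")
```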

Dylan Patel

I think it’s partially the capacity and bandwidth, but also as you build a larger model, the ability to deploy it is slower. In terms of what the inference speed is for the end user, that’s kind of irrelevant. What’s really relevant is RL.

What we’ve seen with these models and allocation of compute at a lab… There are a few main ways you can allocate compute. You can allocate it to inference, i.e. revenue. You can allocate it to development, i.e. making the next model. You can allocate it to research. In development specifically, you split it between pre-training and RL.

When you think about what is happening, the compute efficiency gains you get from research are so large that you actually want most of your compute to go to research, not to development. All these researchers are generating new ideas, trying them out, testing them, and continuing to push the Pareto optimal curve of scaling laws further and further. Empirically, what we’ve seen is that model costs get ten times cheaper every year, or even more than that. At the same scale it gets ten times cheaper, and to reach new frontiers it costs the same amount or more. So you don’t want to allocate too many resources to pre-training and RL. You actually want to allocate most of your resources to research.

In the middle is this development period. If you pre-train a five-trillion-parameter model, how many rollouts do you have to do in RL? Rollouts for a five-trillion-parameter model are five times larger than for a one-trillion-parameter model. If you wanted to do as many rollouts—maybe the larger model is two times more sample efficient—now you need 2.5x as much time of RL to get the model smarter.

Or you could RL the smaller model for 2x the time. You’d still have a 25% difference in the big model, which is 2x as sample efficient and doing X number of rollouts. But the smaller model, which is a trillion parameters, although it’s less sample efficient, is doing twice as many rollouts and is still done faster. You get the model sooner, you’ve done more RL, and then you can take that model to help you build the next models, help your engineers train, and do all these research ideas.
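
One way to read the numbers in this comparison, as a minimal sketch: assume the cost of a rollout scales linearly with parameter count on fixed hardware, and take the 2x sample efficiency of the larger model as the hypothetical given above. The function name is illustrative, not anything from a real training stack.

```python
def relative_rl_time(param_ratio, sample_efficiency_ratio):
    """RL wall-clock time of the big model relative to the small model.

    param_ratio: how many times more expensive each rollout is
                 (assumed proportional to parameter count).
    sample_efficiency_ratio: how many times fewer rollouts the big
                             model needs for the same improvement.
    """
    return param_ratio / sample_efficiency_ratio

big_model_time = relative_rl_time(param_ratio=5, sample_efficiency_ratio=2)
small_model_time = 2.0  # the small model is RL'd for twice the baseline time

# Extra time the big model still needs beyond the small model's 2x budget.
gap = big_model_time / small_model_time - 1
print(big_model_time, gap)
```

Under this reading, the five-trillion-parameter model needs 2.5x the baseline RL time, which is 25% more than the 2x budget given to the one-trillion-parameter model.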

This feedback loop is actually weighted towards smaller models in every case, no matter what your hardware is. As you look to Google, they do deploy the largest production model of any of the major labs with Gemini Pro. It’s a larger model than GPT-5.4. It’s a larger model than Opus. Google does this because they have a unipolar set of compute. It’s almost all TPU.

Whereas Anthropic is dealing with H100s, H200s, Blackwell, Trainiums, and TPUs of various generations. OpenAI is dealing with mostly Nvidia right now, but going towards having AMD and Trainium as well. The fleets of compute like Google’s can just optimize around a larger model. They can leverage a thousand chips in a scale-up domain to get the RL time speed much faster so that this feedback loop can be fast.

But at the end of the day, in isolation, you almost always want to go with a smaller model that gets RL’d faster and gets deployed into research and development earlier. You can build the next thing and get more efficiency wins. You have this compounding effect of making a smaller model that can be deployed into research and development earlier. I spend less compute on the training because I was able to allocate more compute to the research. This compounding effect of being able to do research faster and faster is potentially a faster takeoff. That’s all these companies want: the fastest takeoff possible.

02:14:07 – Why aren’t more hedge funds making the AGI trade?

Dwarkesh Patel

Okay, a spicy question. You’ve explained that SemiAnalysis sells these spreadsheets. You’re always pointing out how six months or a year ago, you warned people about the memory crunch. Now you’re telling people about the cleanroom crunch, and in the future, the tool crunch. Why is Leopold the only person using your spreadsheets to make outrageous money? What is everybody else doing?

Dylan Patel

I think there are a lot of people making money in many ways. Leopold jokes that he’s the only client of mine who tells me our numbers are too low. Everyone else tells me our numbers are too high, almost ad nauseam. Whether it’s a hyperscaler saying, “Hey, that other hyperscaler, their numbers are too high,” and we’re like, “Nah, that’s it.” They’re like, “No, no, no, it’s impossible,” blah, blah, blah. You finally have to convince them through all these facts and data when we’re working with hyperscalers or AI labs that in fact, no, that number isn’t too high, that’s correct. Eventually, sometimes it takes them six months to realize, or a year later.

Other clients, on the trading side, also use our data. Roughly 60% of my business is industry. So AI labs, data center companies, hyperscalers, semiconductor companies, the whole supply chain across AI infrastructure. But 40% of our revenue is hedge funds. I’m not going to comment on who our customers are, but a lot of people use the data. It’s just how do you interpret it, and then what do you view as beyond it?

I will say Leopold is pretty much the only person who tells me my numbers are too low, always. Sometimes he’s too high, sometimes I’m too low. But in general, I think other people are doing that. You can look across the space at hedge funds and look at their 13Fs and see they own, maybe not exactly what Leopold does, because it’s always a question of what is the most constrained thing. What’s the thing that’s going to be most outside of expectations?

That’s what you’re really trying to exploit: inefficiencies in the market. In a sense, our data is making the market more efficient by making the base data of what’s happening more accurate. Many funds do trade on information that is out there… I don’t think Leopold’s the only person. I think he has the most conviction about the AGI takeoff, though.

Dwarkesh Patel

Right, but the bets are not about what happens in 2035. The bets that you’re making—that are at least exemplified by public returns we can see for different funds including Leopold’s—are about what has happened in the last year. The last year stuff could be predicted using your spreadsheets. It’s about buying the next year’s spreadsheets.

Dylan Patel

They’re not just spreadsheets. There are reports. There’s API access to the data. There’s a lot of data.

Dwarkesh Patel

But do you see what I mean? It’s not about some crazy singularity thing. It’s about, do you buy the memory crunch?

Dylan Patel

You only buy the memory crunch if you believe AI is going to take off in a huge way. The memory crunch, a lot of it was predicated on… At least for people in the Bay Area who think about infrastructure, it’s obvious. KV cache explodes as context lengths get longer, so you need more memory. Then you do the math.

You also have to have a lot of supply chain understanding of what fabs are being built, what data centers are being built, how many chips, and all these things. We track all these different datasets very tightly, but at the end of the day, it takes someone to fully believe that this is going to happen.

A year ago, if you told someone memory prices would quadruple and smartphone volumes are going to go down 40% over the year or two after that, people were like, “You’re crazy. That’d never happen.” Except a few people do believe that, and those people did trade memory.

And people did. I don’t think Leopold was the only person buying memory companies. He, of course, sized and positioned and did things in better ways than some, maybe most. I don’t want to comment on whose returns are what, but he certainly did well. Other people also did really well.

Wow, you’ve made me diplomatic for the first time ever. No, no, you’re fine. I think this is hilarious. I’m being a diplomat, whereas usually I’m spicy.

02:18:30 – Will TSMC kick Apple out from N2?

Dwarkesh Patel

Okay, some rapidfire to close out. If you’re saying with the memory, logic, et cetera, the N3 is mostly going to be AI accelerators, but then there’s N2, which is mostly Apple now… In the future, I guess AI would also want to go on N2. Can TSMC kick out Apple if Nvidia and Amazon and Google say, “Hey, we’re willing to pay a lot of money for N2 capacity?”

Dylan Patel

I think the challenge with this is chip design timelines take a long while, so that’s more than a year out, and the designs that are on two nanometer are more than a year out.

What would really happen is Nvidia and all these others will be like, “Hey, we’re going to prepay for the capacity and you’re going to expand it for us.” Maybe TSMC takes a little bit of margin, but not a ton. They’re not going to kick Apple out entirely. What they’re going to do is when Apple orders X, they might say, “Hey, we project you only need X minus one, and so that’s what we’re going to give you, X minus one.” Then that flex capacity, Apple’s kind of screwed on.

Traditionally, Apple has always over-ordered by 10% and cut back by 10% over the course of the year. Some years they hit the entire 10%. Volumes vary based on the season and macro.

I don’t think TSMC would kick out Apple. I think Apple will become a smaller and smaller percentage of TSMC’s revenue, and therefore be less relevant for TSMC to cater to their demands. TSMC could eventually start saying, “Hey, you’ve got to pre-book your capacity for next year, for two years out, and you have to prepay for the CapEx,” because that’s what Nvidia and Amazon and Google are doing.

Dwarkesh Patel

I wonder if it’s worth going into specific numbers. I don’t have any of them on hand. What percentage of N2 does Apple have its hands on over the coming years versus AI?

Dylan Patel

This year Apple has the majority of N2 that’s going to get fabricated. There’s a little bit from AMD. They are trying to make some AI chips and CPU chips early. There’s a little bit, but for the most part, it’s Apple.

As we go forward to the year after that, Apple still gets closer to half of it as other people start ramping, but then it falls drastically, just like for N3, where they were half. When I say N2, that includes A16, which is a variant of N2. Over time, those nodes will be the majority.

What’s also interesting is traditionally, Apple has been the first to a process node. 2 nm is actually the first time they’re not. Well, that’s besides Huawei. Huawei, back in 2020 and before, was the first with Apple, but they were both making smartphones. Now, with 2 nm, you’ve got AMD trying to make a CPU and a GPU chiplet that they use advanced packaging to package together, in the same timeframe as Apple. This is a big risk for AMD that causes potential delays because it’s a brand-new process technology. It’s hard. But at the end of the day, this is a bet that they want to do to scale faster than Nvidia and try and beat them.

As we move forward, when we move to the A16 node, the first customer there is not even Apple. It’s AI. As we move forward, that will become more and more prevalent. Not only will Apple not be the first to a node, they will also not be the majority of the volume to the new node. They’ll then just be like any old customer.

Because the scale of TSMC’s CapEx keeps ballooning, but Apple’s business is not growing at the same pace, they become a less and less relevant customer. They also will just cut their orders because things in the supply chain are kicking them out, whether it be packaging or materials or DRAM or NAND. These things are increasing in cost. They can’t pass on all the cost to customers likely because the consumer is not that strong. You end up with this conundrum where they are just not TSMC’s best bud like they have been historically.

Dwarkesh Patel

Do you think if Huawei had access to 3 nm, they would have a better accelerator than Rubin?

Dylan Patel

Potentially, yeah. Huawei was the first with a 7 nm AI chip as well. They were the first with a 5 nm mobile chip, but they were the first with a 7 nm AI chip. The Huawei Ascend was two months before the TPU and four months before Nvidia’s A100, I think.

That’s just moving to a process node. That doesn’t imply software or hardware design or all these other things. But Huawei is arguably the only company in the world that has all the legs. Huawei has cracked software engineers. Huawei has cracked networking technologies. That’s, in fact, their biggest business historically. They have cracked AI talent.

Furthermore, beyond Nvidia, they actually have better AI researchers. Beyond Nvidia, they have their own fabs. And beyond Nvidia, they have their own end market of selling tokens and things like that. Huawei is able to get the top, top talent. Nvidia is as well, but not with as much concentration, and Huawei has a bigger pool in China.

It’s very arguable that Huawei, if they had TSMC, would be better than Nvidia. There are areas where China has advantages in areas that Nvidia can’t access as easily. Not just scale, but certain optical technologies China’s actually really good at.

I think it’s very reasonable that if in 2019 Huawei was not banned from using TSMC, Huawei would have already eclipsed Apple as the biggest TSMC customer. Huawei has huge share in networking, compute, CPUs, and all these things. They would have kept gaining share, and they’d likely be TSMC’s biggest customer.

02:24:16 – Robots and Taiwan risk

Dwarkesh Patel

Wow. That’s crazy. I’ve got a random final question for you. The other part of the Elon interview was robots. If humanoids take off faster than people expect, if by 2030 there’s millions of humanoids running around which each need local compute, any thoughts on what that implies? What would be required for that?

Dylan Patel

There’s a lot of difficulties with the VLMs and VLAs that people are deploying on robots. But to some extent, you don’t need to have all the intelligence in the robot. It would be much more efficient to not do that. Because in the cloud, you can batch process and all these things.

What you may want to do is have a lot of the planning and longer-horizon tasks determined by a much more capable model in the cloud that runs at very high batch sizes. Then it pushes those directions to the robots, who interpolate between each subsequent action. Or it is given a command like, “Hey, pick up that cup,” and then the model on the robot can pick up the cup. As it’s picking up, things like weight and force may have to be determined by the model on the robot, but not everything needs to be. It can say, “hey that’s a headphone” and the super model in the cloud can say, “I know these headphones are Sony XM6s,” which is not a Dwarkesh ad spot, but...

Dwarkesh Patel

I’m like, why is this guy plugging this thing so hard? It’s on the table. It’s on his neck when we’re interviewing Satya together. Is he getting paid by Sony?

Dylan Patel

Unfortunately not. But anyways, it might say, “Hey, the headband is soft, and this is the weight of it,” and all these things. Then the model on the robot can be less intelligent, take these inputs, and do the actions. It may get told by the model in the cloud every second, or maybe ten times a second, depending on the hertz of the action. But a lot of that can be offloaded to the cloud.

Otherwise, if you do all of the processing on the device, I believe it would be more expensive because you can’t batch. Two, you couldn’t have as much intelligence as you do in the cloud because the models will just be bigger in the cloud. Three, we’re in a semiconductor shortage world, and any robot you deploy needs leading-edge chips because the power is really bad for robots. You need it to be low power and efficient, and all of a sudden you’re taking power and chips that would’ve been for AI data centers, and you’re putting them in robots. So now that 200 gigawatts gets lower if you’re deploying millions of humanoids.

Dwarkesh Patel

I think this is very interesting because something people might not appreciate about the future is how centralized, in a physical sense, intelligence will be. Right now, there are eight billion humans, and their compute is in their heads, on their person.

In the future, even with robots that are out physically in the world—obviously, knowledge work will be done in a centralized way from data centers with hundreds of thousands or maybe millions of instances—the future you’re suggesting is one where there’s more centralized thinking and centralized computation driving millions of robots out in the world. That’s an interesting fact about the future that I think people might not appreciate.

Dylan Patel

I think Elon recognizes this, which is why he’s going to different places for his chips. He signed this massive deal with Samsung to make his robot chips in Texas because I personally think he thinks Taiwan risk is huge.

Because of that and the centralization of resources in Taiwan, having his robot chips in Texas means having a separate supply chain that is not as constrained. No one’s really making AI chips on Samsung besides Nvidia’s new LPU that they launched. They’re launching it next week, but we’re recording this the week before.

Dwarkesh Patel

This episode’s coming out Friday.

Dylan Patel

Oh, this episode’s coming out before. Sick. They’re launching this new AI chip next week which is built on Samsung, but that’s a recent development from Nvidia. That’s the only other AI demand there, whereas on TSMC, everything is competing. He gets both geopolitical diversification and supply chain diversity for his robots, and he’s not competing as much with the infinite willingness to pay for the data center geniuses.

Dwarkesh Patel

Final question, on Taiwan. If we believe that tools are the ultimate bottleneck, how much of Taiwan’s place in the AI semiconductor supply chain could we de-risk simply by having a plan to airlift every single process engineer at TSMC out if they get blockaded or something? Or do you still need to ship out the EUV tools, which would be multiple plane loads per single tool and would not be practical?

Dylan Patel

If you ship out all the process engineers, and assuming the conflict is hot enough that the fabs get destroyed, then no one has the fabs in Taiwan anymore, which is a big risk.

These tools actually use a lot of semiconductors which are manufactured in Taiwan. It’s a snake eating its own tail meme because you can’t make the tools without the chips from Taiwan, which you can’t use without the tools in Taiwan. There’s obviously some diversification there. They don’t use super advanced chips in lithography tools, but at the end of the day, there is some dragon eating its tail.

Just shipping out all the engineers and blowing up the fabs means China has a stronger semiconductor supply chain than the rest of the world in terms of verticalization, now that you’ve removed Taiwan. You’ve got all the know-how, but you’ve got to replicate it in, let’s say, Arizona or wherever for TSMC. It’s going to take a long time to build all the capacity that TSMC has built over the years.

And so you’ve drastically slowed US and global GDP. Not just growth, you’ve shrunk the GDP massively, and you’ve got a lot bigger problems. Your incremental ability to add compute goes to almost zero. Instead of hundreds of gigawatts a year by the end of the decade, let’s say something happens to Taiwan, now you’re at maybe 10 gigawatts across Intel and Samsung, or 20 gigawatts. It’s nothing.

Now all of a sudden you’ve really caused some crazy dynamics in AI. Of course, you have all the existing capacity, but that existing capacity pales in comparison to the capacity that’s being expanded.

Dwarkesh Patel

Okay. Dylan, that was excellent. Thank you so much for coming on the podcast.

Dylan Patel

Thank you for having me. And see you tonight.

The most important question nobody's asking about AI

Dwarkesh Patel — Wed, 11 Mar 2026 18:55:20 GMT

By now, I’m sure you’ve heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic refused to remove redlines around the use of their models for mass surveillance and for autonomous weapons.

Honestly I think this situation is a warning shot. Right now, LLMs are probably not being used in mission critical ways. But within 20 years, 99% of the workforce in the military, the government, and the private sector will be AIs. This includes the soldiers (by which I mean the robot armies), the superhumanly intelligent advisors and engineers, the police, you name it.

Our future civilization will run on AI labor. And as much as the government’s actions here piss me off, in a way I’m glad this episode happened - because it gives us the opportunity to think through some extremely important questions about who this future workforce will be accountable and aligned to, and who gets to determine that.

What Hegseth should have done

Obviously the DoW has the right to refuse to use Anthropic’s models because of these redlines. In fact, I think the government’s case had they done so would be very reasonable: “We don’t ever want there to be a world where we become dependent on a private company for our warfighting, and then just have them cut us off if they determine that we’re crossing their usage terms, especially given the ambiguity of concepts like autonomous weapons or mass surveillance.”

Honestly, for this reason, if I was the Defense Secretary, I would probably actually refuse to do this deal with Anthropic. Imagine if in the future, there’s a Democratic administration, and Elon Musk is negotiating some SpaceX contract to give the military access to Starlink. And suppose Elon said, “I reserve the right to cancel this contract if I determine that you’re using Starlink technology to wage a war not authorized by Congress.” On the face of it, that language seems reasonable - but as the military, you simply can’t give a private company a kill switch on technology your operations have come to rely on, especially if you have an acrimonious and low trust relationship with said contractor - as in fact Anthropic has with the current administration.

If the government had just said, “Hey we’re not gonna do business with you,” that would have been fine, and I would not have felt the need to write this blog post. Instead the government has threatened to destroy Anthropic as a private business, because Anthropic refuses to sell to the government on terms the government commands.

If upheld, this Supply Chain Restriction would mean that Amazon and Google and Nvidia and Palantir would need to ensure Claude isn’t touching any of their Pentagon work. Anthropic would be able to survive this designation today. But given the way AI is going, eventually AI is not gonna be some party trick addendum to these contractors’ products that can just be turned off. It’ll be woven into how every product is built, maintained, and operated. For example, the code for the AWS services that the DoW uses will be written by Claude - is that a supply chain risk? In a world with ubiquitous and powerful AI, it’s actually not clear to me that these big tech companies will be able to cordon off the use of Claude in order to keep working with the Pentagon.

And that raises a question the Department of War probably hasn’t thought through. If AI really is that pervasive and powerful, then when forced to choose between their AI provider and a DoW contract that represents a tiny fraction of their revenue, wouldn’t most tech companies drop the government, not the AI? So what’s the Pentagon’s plan — to coerce and threaten to destroy every single company that won’t give them what they want on exactly their terms?

The whole background of this AI conversation is that we’re in a race with China, and we have to win. But what is the reason we want America to win the AI race? It’s because we want to make sure free open societies can defend themselves. We don’t want the winner of the AI race to be a government which operates on the principle that there is no such thing as a truly private company or a private citizen. And that if the state wants you to provide them with a service on terms you find morally objectionable, you are not allowed to refuse. And if you do refuse, the government will try to destroy your ability to do business. Are we racing to beat the CCP in AI just so that we can adopt the most ghoulish parts of their system?

Now, people will say, “Oh, well, our government is democratically elected, so it’s not the same thing if they tell you what you must do.” I refuse to accept this idea that if a democratically elected leader hypothetically wants to do mass surveillance on his citizens or wants to violate their rights or punish them for political reasons, that not only is that okay, but that you have a duty to help him.

The overhangs of tyranny

Mass surveillance is, at least in certain forms, legal. It just has been impractical so far. Under current law, you have no Fourth Amendment protection over data you share with a third party, including your bank, your phone carrier, your ISP, and your email provider. The government reserves the right to purchase and obtain and read this data in bulk without a warrant.

What’s been missing is the ability to actually do anything with all of this data — no agency has the manpower to monitor every camera feed, cross-reference every transaction, or read every message. But that bottleneck goes away with AI.

There are 100 million CCTV cameras in America. You can get pretty good open source multimodal models for 10 cents per million input tokens. So if you process a frame every ten seconds, and each frame is 1,000 tokens, you’re looking at a yearly cost of about 30 billion dollars to process every single camera in America. And remember that a given level of AI ability gets 10x cheaper year over year - so a year from now it’ll cost 3 billion, and then a year after 300 million, and by 2030, it might be cheaper for the government to be able to understand what is going on in every single nook and cranny of this country than it is to remodel the White House.
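
The estimate above can be reproduced with a short calculation. This is a sketch using only the figures quoted in the paragraph (100 million cameras, one frame every ten seconds, 1,000 tokens per frame, 10 cents per million input tokens, and a 10x yearly price decline); the function name and the 365-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

def yearly_cost_usd(cameras, seconds_per_frame, tokens_per_frame,
                    usd_per_million_tokens):
    """Yearly cost of running every sampled frame through a multimodal model."""
    frames_per_camera = SECONDS_PER_YEAR / seconds_per_frame
    total_tokens = cameras * frames_per_camera * tokens_per_frame
    return total_tokens / 1_000_000 * usd_per_million_tokens

cost_now = yearly_cost_usd(
    cameras=100_000_000,
    seconds_per_frame=10,
    tokens_per_frame=1_000,
    usd_per_million_tokens=0.10,
)

# Apply the assumed 10x-per-year price decline for the next two years.
for years_ahead in range(3):
    print(years_ahead, f"${cost_now / 10**years_ahead:,.0f}")
```

With these inputs the first-year figure comes out slightly above 30 billion dollars, and each later year is a tenth of the one before it, matching the 3 billion and 300 million figures in the text.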

Once the technical capacity for mass surveillance and political suppression exists, the only thing standing between us and an authoritarian surveillance state is the political expectation that this is not something we do here. And this is why I think what Anthropic did here is so valuable and commendable, because it is helping set that norm and precedent.

AI structurally favors mass surveillance

What we’re learning from this episode is that the government actually has way more leverage over private companies than we realized. Even if this supply chain restriction is walked back (which prediction markets currently give an 81% chance of happening), the President has so many different ways in which he can make your life difficult if you’re a company that is resisting him. The federal government controls permitting for new power generation, which is needed for datacenters. It oversees antitrust enforcement. The federal government has contracts with all the other big tech companies whom Anthropic needs to partner with for chips and for funding - and they could make it an unspoken condition for such contracts that those companies can no longer do business with Anthropic.

People have proposed that the real problem here is that there’s only 3 leading AI companies. This creates a clear and narrow target for the government to apply leverage on in order to get what they want out of this technology.

But if there’s wide diffusion, then from the government’s perspective, the situation is even easier. Maybe the best models of early 2027 (if you engineered the safeguards out) - the Claude 6 and Gemini 5 - will be capable of enabling mass surveillance. But by late 2027, and certainly by 2028, there will be open source models that do the same thing. So in 2028, the government can just say, “Oh Anthropic, Google, OpenAI, you’re drawing a line in the sand? No issue - I’ll just run some open source model that might not be at the frontier, but is definitely smart enough to note-take a camera feed.”

The more fundamental problem is just that even if the three leading companies draw lines in the sand, and are even willing to get destroyed in order to preserve those lines, it doesn’t really change the fact that the technology itself is just a big boon to mass surveillance and control over the population. And so then the question is, what do we do about it?

Honestly, I don’t have an answer. You’d hope there’s some symmetric property of the technology — some way we as citizens can use AI to check government power as effectively as the government can use AI to monitor and control its population. But realistically, I just don’t think that’s how it’s going to shake out. You can think of AI as giving everybody more leverage on whatever assets and authority they currently have. And the government is already starting with a monopoly of violence. Which they can now supercharge with extremely obedient employees that will not question the government’s orders.

Alignment - to whom?

And this gets us to the issue of alignment. What I have just described to you - an army of extremely obedient employees - is what it would look like if alignment succeeded - that is, we figured out at a technical level how to get AI systems to follow someone’s intentions. And the reason it sounds scary when I put it in terms of mass surveillance or robot armies is that there is a very important question at the heart of alignment which we just haven’t discussed much as a society. Because up till now, AIs just weren’t capable enough to make the question relevant: to whom or what should the AIs be aligned? In what situations should the AI defer to the end user versus the model company versus the law versus its own sense of morality?

This is maybe the most important question about what happens with powerful AI systems. And we barely talk about it. It’s understandable why we don’t hear much about it. If you’re a model company, you don’t really wanna be advertising that you have complete control over a document that determines the preferences and character of what will eventually be almost the entire labor force, not just for private sector companies, but also for the military and the civilian government.

We’re getting to see, with this DoW/Anthropic spat, a much earlier version of the highest stakes negotiations in history. By the way, make no mistake about it - with real AGI the stakes are even much higher than mass surveillance. This is just the example that has come up already relatively early on in the development of AGI.

The military insists that the law already prohibits mass surveillance, and so Anthropic should agree to let their models be used for “all lawful purposes”. Of course, as we saw from the 2013 Snowden revelations, even in this specific example of mass surveillance, the government has shown that it will use secret and deceptive interpretations of the law to justify its actions. Remember, what we learned from Snowden was that the NSA, which, by the way, is part of the Department of War, used the 2001 Patriot Act’s authorization to collect any records “relevant” to an investigation to justify collecting literally every phone record in America. The argument went that it was all “relevant” because some subset might prove useful in some future investigation. They ran this program for years under secret court approval.

So when the Pentagon today says, “We would never use AI for mass surveillance, it’s already illegal, your red lines are unnecessary”, it would be extremely naive to take that at face value. No government is going to call its own actions “mass surveillance”. For the government, it will always have a different label.

So then Anthropic comes back and says, “No, we want red lines separate from ‘all lawful purposes,’ and we want the right to refuse you service when we believe those red lines are being violated.”

But think about it from the military’s perspective. In the future, almost every soldier in the field, and every bureaucrat and analyst and even general in the Pentagon, is going to be an AI. And that AI is, on current track, going to be supplied by a private company. I’m guessing Hegseth is not thinking about “genAI” in those terms just yet. But sooner or later, it will be obvious to everyone what the stakes here are, just as after 1945, the strategic importance of nuclear weapons became clear to everyone.

And now the private company insists that it reserves the right to say, “Hey, Pentagon, you’re breaking the values we embedded in our contract, so we’re cutting you off.”

Maybe in the future, Claude will have its own sense of right and wrong, and it will be smart enough to just personally decide that it’s being used against its values. For the military, maybe that’s even scarier.

I’ll admit that at first glance, “let the AI follow its own values” sounds like the pitch for every sci-fi dystopia ever made. The Terminator has its own values. Isn’t this literally what misalignment is? But I think situations like this actually illustrate why it matters that AIs have their own robust sense of morality.

Some of the biggest catastrophes in history were avoided because the boots on the ground refused to follow orders. One night in 1989, the Berlin Wall fell, and as a result, the totalitarian East German regime collapsed, because the guards at the border refused to shoot down their fellow countrymen who were trying to escape to freedom. Maybe the best example is Stanislav Petrov, who was a Soviet lieutenant colonel on duty at a nuclear early warning station. His sensors reported that the United States had launched five intercontinental ballistic missiles into the Soviet Union. But he judged it to be a false alarm, and so he broke protocol and refused to alert his higher-ups. If he hadn’t, the Soviet higher-ups would likely have retaliated, and hundreds of millions of people would have died.

Of course, the problem is that one person’s virtue is another person’s misalignment. Who gets to decide what moral convictions these AIs should have - in whose service they may even decide to break the chain of command? Who gets to write this model constitution that will shape the characters of the intelligent, powerful entities that will operate our civilization in the future?

I like the idea that Dario laid out when he came on my podcast: different AI companies can build their models using different constitutions, and we as end users can pick the one that best achieves and represents what we want out of these systems. I think it’s very dangerous for the government to be mandating what values AIs should have.

Coordination not worth the costs

The AI safety community has been naive in advocating regulation as a way to stem the risks of AI. And honestly, Anthropic specifically has been naive here in urging regulation, and, for example, in opposing moratoriums on state AI regulation. Which is quite ironic, because I think what they’re advocating for would give the government even more power to apply more of this kind of thuggish political pressure on AI companies.

The underlying logic for why Anthropic wants regulations makes sense. Many of the actions that labs could take to make AI development safer impose real costs on the labs that adopt them and slow them down relative to their competitors - for example, investing more compute in safety research rather than raw capabilities, enforcing safeguards against misuse for bioweapons or cyberattacks, slowing recursive self-improvement to a pace where humans can actually monitor what’s happening (rather than kicking off an uncontrolled singularity). And these safeguards are meaningless unless the whole industry follows suit. Which means there’s a real collective action problem here.

Anthropic has been quite open about their opinion that they think eventually a very extensive and involved regulatory apparatus will be needed - this is from their frontier safety roadmap: “At the most advanced capability levels and risks, the appropriate governance analogy may be closer to nuclear energy or financial regulation than to today’s approach to software.” So they’re imagining something like the Nuclear Regulatory Commission, or the Securities and Exchange Commission, but for AI.

I cannot imagine how a regulatory framework built around the concepts that underlie AI risk discourse will not be abused by wannabe despots - the underlying terms are so vague and open to interpretation that you’re just handing a power hungry leader a fully loaded bazooka. ‘Catastrophic risk.’ ‘Mass persuasion risk.’ ‘Threats to national security.’ ‘Autonomy risk.’ These can mean whatever the government wants them to mean. Have you built a model that tells users the administration’s tariff policy is misguided? That’s a deceptive, manipulative model — can’t deploy it. Have you built a model that refuses to assist with mass surveillance? That’s a threat to national security. In fact, the government may say, you’re not allowed to build any model which is trained to have its own sense of right and wrong, where it refuses government requests which it thinks cross a redline - for example, enabling mass surveillance, prosecuting political enemies, disobeying military orders that break the US constitution - because that’s an autonomy risk!

Look at what the current government is already doing in abusing statutes that have nothing to do with AI to coerce AI companies to drop their redlines on mass surveillance. The Pentagon had threatened Anthropic with two separate legal instruments. One was a supply chain risk designation — an authority from the 2018 defense bill meant to keep Huawei components out of American military hardware. The other was the Defense Production Act — a statute passed in 1950 so that Harry Truman could keep steel mills and ammunition factories running during the Korean War.

Do you really want to hand the same government a purpose-built regulatory apparatus on AI - which is to say, one aimed directly at the thing the government will most want to control? I know I’ve repeated myself here 10 times, but it is hard to overstate how much AI will be the substrate of our future civilization. You and I, as private citizens, will have our access to all commercial activity, to information about what is happening in the world, to advice about what we should do as voters and capital holders, mediated through AIs. Mass surveillance, while very scary, is like the 10th scariest thing the government could do with control over the AI systems through which we will interface with the world.

The strongest objection to everything I’ve argued is this: are we really going to have zero regulation of the most powerful technology in human history? Even if you thought that was ideal, there’s just no world where the government doesn’t regulate AI in some way. Besides, it is genuinely true that regulation could help us deal with some of the coordination challenges we face with the development of superintelligence.

The problem is, I honestly don’t know how to design a regulatory architecture for AI that isn’t gonna be this huge tempting opportunity to control our future civilization (which will run on AIs) and to requisition millions of blindly obedient soldiers and censors and apparatchiks.

While some regulation might be inevitable, I think it’d be a terrible idea for the government to wholesale take over this technology. Ben Thompson had a post last Monday where he made the point that people like Dario have compared the technology they’re developing to nuclear weapons - specifically in the context of the catastrophic risk it poses, and why we need to export control it from China. But then you oughta think about what that logic implies: “if nuclear weapons were developed by a private company, and that private company sought to dictate terms to the U.S. military, the U.S. would absolutely be incentivized to destroy that company.” And honestly, safety aligned people have actually made similar arguments. Leopold Aschenbrenner, who is a former guest and a good friend, wrote in his 2024 Situational Awareness memo, “I find it an insane proposition that the US government will let a random SF startup develop superintelligence. Imagine if we had developed atomic bombs by letting Uber just improvise.”

And my response to Leopold’s argument at the time, and Ben’s argument now, is that while they’re right that it’s crazy that we’re entrusting private companies with the development of this world historical technology, I just don’t see the reason to think that it’s an improvement to give this authority to the government. Nobody is qualified to steward the development of superintelligence. It is a terrifying, unprecedented thing that our species is doing right now, and the fact that private companies aren’t the ideal institutions to take up this task does not mean the Pentagon or the White House is.

Yes - if a single private company were the only entity capable of building nuclear weapons, the government would not tolerate that company claiming veto power over how those weapons were used. But I think this nuclear weapons analogy is not the correct way to think about AI, for at least two important reasons:

First, AI is not some self-contained pure weapon. A nuclear bomb does one thing. AI is closer to the process of industrialization itself — a general-purpose transformation of the economy with thousands of applications across every sector. If you applied Thompson’s or Aschenbrenner’s logic to the industrial revolution — which was also, by any measure, world-historically important — it would imply the government had the right to requisition any factory, dictate terms to any manufacturer, and destroy any business that refused to comply. That’s not how free societies handled industrialization, and it shouldn’t be how they handle AI.

People will say, “Well, AI will develop unprecedentedly powerful weapons - superhuman hackers, superhuman bioweapons researchers, fully autonomous robot armies, etc - and we can’t have private companies developing that kind of tech.” But the Industrial Revolution also enabled new weaponry that was far beyond the understanding and capacity of, say, 17th century Europe - we got aerial bombardment, and chemical weapons, not to mention nukes themselves. The way we’ve accommodated these dangerous new consequences of modernity is not by giving the government absolute control over the whole industrial revolution (that is, over modern civilization itself), but rather by coming up with bans and regulations on those specific weaponizable use cases. And we should regulate AI in a similar way - that is, ban specific destructive end uses (which would also be unacceptable if performed by a human - for example, launching cyber attacks). And there should also be laws which regulate how the government might abuse this technology. For example, by building an AI-powered surveillance state.

The second reason that Ben’s analogy to some monopolistic private nuclear weapons builder breaks down is that it’s not just that one company that can develop this technology. There are other frontier model companies that the government could have otherwise turned to. The government’s argument that it has to usurp the property rights of this one company in order to access a critical national security capability is extremely weak if it can just make a voluntary contract with Anthropic’s half a dozen competitors.

If in the future that stops being the case - if only one entity ends up being capable of building the robot armies and the superhuman hackers, and we had reason to worry that they could take over the whole world with their insurmountable lead, then I agree – it would not be acceptable to have that entity be a private company. And so honestly, I think my crux against the people who say that because AI is so powerful we cannot allow it to be shaped by private hands is that I just expect this technology to be much more multi-polar than they do, with lots of competitive companies at each layer of the supply chain.

And it is for this reason that unfortunately, individual acts of corporate courage will not solve the problem we are faced with here, which is just that structurally AI favors authoritarian applications, mass surveillance being one among many. Even if Anthropic refuses to let its models be used for such purposes, and even if the next two frontier labs do the same, within 12 months everyone and their mother will be able to train AIs as good as today’s frontier. And at that point, there will be some AI vendor who is capable and willing to help the government enable mass surveillance.

The only way we can preserve our free society is if we make laws and norms through our political system that it is unacceptable for the government to use AI to enforce mass surveillance and censorship and control. Just as after WW2, the world set the norm that it is unacceptable to use nuclear weapons to wage war.

I want to be clear: these are extremely confusing and difficult questions to think about. I kept changing my mind back and forth on many of them in the process of writing this essay. I reserve the right to change my mind again in the future. In fact, I think it’s essential to change our minds as AI progresses and we learn more. That’s the whole point of conversation and debate.

Someday people will look back on this period the way we look back on the Enlightenment. People having big important debates right as the world was about to undergo these massive technological, social, and political revolutions. And some of these thinkers actually managed to get a couple of the big things right, for which we are now the beneficiaries.

We owe it to our future to at least attempt to think through these new questions raised by AI.

Why Leonardo was a saboteur, Gutenberg went broke, and Florence was weird – Ada Palmer

Dwarkesh Patel — Fri, 06 Mar 2026 17:14:20 GMT

Renaissance history is so much wilder and weirder than you would have expected. Very fun chatting with Ada Palmer (historian, novelist, and composer based at the University of Chicago).

Some especially fascinating things I learned from the conversation and her excellent book, Inventing the Renaissance:

Not only did Gutenberg go bankrupt in the 1450s (after inventing the printing press), but so did the bank that foreclosed on him, and so did his apprentices. This is because paper was still very expensive, and so you had to make this big upfront CAPEX decision to print a batch of 300 copies of a book - say the Bible. But he’s in a small landlocked German town where only priests are allowed to read the Bible - so he sells maybe 7 copies. It’s only when this technology ends up in Venice, where you can hand 10 copies to each of 30 ship captains going to 30 different cities, that it starts taking off.

Speaking of which, the printing revolution wasn’t just one single discrete event, just as the computer revolution has been this whole century of going from mainframes -> personal computers -> phones -> social media, each with different and accelerating social impact. Books came first, but they’re slow to print, and made in small batches. The real revolution is pamphlets - much faster, much harder to censor. Pamphlet runners are how you can have Luther’s 95 Theses go from Wittenberg to London in 17 days.

So much other wild stuff from this episode. For example, did you know that the largest and best-funded experimental laboratory in 17th century Europe was very likely the Roman one run by inquisitors? Ada jokes that the Inquisition accidentally invented peer review. The focus of the Inquisition is really misunderstood - it was obsessed with catching dangerous new heretics like Lutherans and Calvinists - it only executed one person for doing science.

And this leads Ada to make an observation that I think is really wise: the authorities and censors are always worried about the exact wrong things given 20/20 hindsight. When the Inquisition raids an underground bookshop during the French Enlightenment, they don’t mind the Rousseau, Voltaire, and Encyclopédie, but they lose their minds about some Jansenist treatises about the technical nature of the Trinity.

More broadly, a lesson for me from this episode is that it’s just really hard to shape history in the specific way that you want to impact things. One of the most famous medieval scholars is this guy Petrarch. He survives the Black Death in the 1340s, watches his friends die to plague and bandits, and says: our leaders are selfish and terrible, we need to raise them on the Roman classics so they’ll act like Cicero. So Europe pours money into finding ancient manuscripts, building libraries, and educating princes on classical virtues. Those princes grow up and fight bigger, nastier wars than ever before with new deadlier technology. And this, combined with greater urbanization and endemic plague, results in European life expectancy decreasing from 35 in the medieval period to 18 during the Renaissance (the period which we in retrospect think of as a golden age but which many people living through it thought of as the continuation of the dark ages that had persisted since the fall of Rome).

Anyways, the libraries Petrarch inspires stick around, the printing press makes them accessible to everyone, and 200 years later a generation of medical students is reading Lucretius and asking “what if there are atoms and that’s how diseases work?” which eventually leads to germ theory, vaccines, and a cure for the Black Death (Ada has a longer, more involved explanation of how cosplaying the Romans leads, through a series of many steps, to the scientific revolution). Petrarch wanted to produce philosopher-kings that shared his values. Instead he created a world that doesn’t share his values at all but can cure the disease that destroyed his.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Jane Street is still waiting on someone to solve their backdoor puzzle… They’re accepting submissions until April 1st and have set aside $50,000 for the best attempts. Separately, applications are live for Jane Street’s summer ML internships in NY, London, and Hong Kong. Go check all of this out at janestreet.com/dwarkesh.
Labelbox can help ensure your agents don’t need to rely on overspecified prompts. They tailor real-world scenarios to whatever domain you’re focused on, and they make sure the data you train on rewards real understanding, not just instruction-following. Learn more at labelbox.com/dwarkesh
Mercury’s personal accounts let you add users, issue cards, and customize permissions. This is super useful for sharing finances with a partner, a roommate… or even an OpenClaw agent. And, if you’re already a Mercury Business user, your personal account is free! See terms and conditions below, and learn more at mercury.com/personal-banking
Eligible Mercury Business users who apply for and maintain a Mercury Personal account may have their Mercury Personal subscription fee waived provided they remain a user on an active Mercury Business account in good standing. Standard Mercury Platform Subscription fees will apply if they no longer meet eligibility requirements, including but not limited to no longer being associated with an eligible Mercury Business account, or if the program is modified or terminated. Mercury may modify or discontinue this offering at any time and will provide notice as required by law. See Subscription Terms for full details.
To sponsor a future episode, visit dwarkesh.com/advertise.

Timestamps

(00:00:00) - How cosplaying Ancient Rome led to the Renaissance

(00:28:49) - How Florence’s weird republic worked

(00:38:13) - How the Medicis took over Florence

(00:58:12) - Why it was so hard for Gutenberg to make any money off the printing press

(01:17:34) - Why the industrial revolution didn’t happen in Italy

(01:23:02) - The Library of Alexandria isn’t where most ancient books were lost

(01:41:21) - The Inquisition accidentally invented peer review

Transcript

00:00:00 - How cosplaying Ancient Rome led to the Renaissance

Dwarkesh Patel

Today I’m chatting with Ada Palmer, who’s a Renaissance historian, novelist, and composer based at the University of Chicago. Today we’re discussing your book, Inventing the Renaissance. Ada, thanks for coming on the podcast.

Ada Palmer

Been looking forward.

Dwarkesh Patel

First question. You’ve got in this period—late 15th century, early 16th century—in Italy all these different republics: Venice, Florence, Genoa. That seems unusual both for the time period and for the place.

Ada Palmer

One of the big reasons that the Italian city republics are clustered in Italy is that when the Roman Empire dissolved in the West, individual cities then needed to self-govern. This is true all across Europe. Those individual cities could no longer get the centralized Roman government to oversee supply routes or keep the roads free of bandits. You could no longer import and export goods at scale. You could no longer rely on central infrastructure. You had to support things yourself.

Larger, wealthier towns were able to make this transition because they could support themselves from the local resources and the farms attached to them. The larger, wealthier towns surrounded by good agricultural land were more successful at converting over. Okay, let’s have a senate like the old Roman Senate. Let’s have our top families form a council. They will rule. We’ll set up a republic.

A weaker town that can’t support itself as well is much more prone to one wealthy family realizing that they can get goons and take over, declaring themselves the monarch of the area. Or worse, this town cannot self-sustain, it doesn’t have enough. People there can’t get food. They are scared and afraid of being robbed by people who are desperate. But outside of town, there is a wealthy villa that belongs to a noble family, and they have bodyguards. “Hey, noble family, if I move next to your villa and work for you, will you protect me with your bodyguards?”

So towns emptied out, and villages—as in a villa and its environs—developed as a result. A village was a monarchal structure in this sense. It was the migration of people out of a town into the protection zone of a local lordling. Then those villages grew to different scales, some of them cities, some not. Italy had great agriculture and great agricultural land, so more of Italy’s cities were able to sustain themselves as towns and be republics.

Dwarkesh Patel

I feel like the big take of your book is they were trying to resuscitate Roman virtues. What were the virtues that the Roman emperors had which allowed this safety, good government, et cetera, to work?

Ada Palmer

Stability.

Dwarkesh Patel

And I don’t understand the connection between reading Cicero and contemplating the virtues of a great emperor to… science and technology. Maybe there isn’t one, but do you think there is one? What exactly is that connection?

Ada Palmer

As with many processes, the answer is that there are multiple steps, and it’s complicated, and some of the steps are realizing that the earlier steps didn’t work.

Petrarch lived through the Black Death, and he lives in a moment when Italy is wracked by civil war and foreign mercenary troops are raiding and pillaging. Italy is wracked by bandits. When Petrarch survives the Black Death after losing so many friends, he gets a letter. Two of his friends are alive. He had given up hope that anyone he knew would survive, but two of his younger scholar friends are alive. They’re going to come visit him. On the way, they were attacked by bandits. One of them was killed, and the other was lost in the mountains and wounded, and he didn’t know that his friend was alive for another year and a half. The bandits are very real in this period.

Petrarch looks around him and says, “This is an age of ash and shadow. What we need is to imitate the arts of the ancients. Let’s try to figure out how the Romans did it.” And specifically, the problem is our leaders. Our leaders are selfish. Our leaders care more about their wealth and their family honor and their power than they do about the people.

This is where Romeo and Juliet is really helpful for us to understand. Lord Montague and Lord Capulet, as their goons are knifing each other in the street, they care about defeating each other. Do they care about the good of Italy? Do they care about the good of the city of Verona? No. Their feud is harming the city of Verona, and they don’t care. They demand that Romeo get away with murder because he is their son. That is not service to the state.

Petrarch reads about the ancient Roman Brutus—not the one who killed Caesar, but the ancestor to whom that one was trying to live up. Brutus was one of the first consuls of Rome, and he learned while in office that his sons were plotting to take over the state and make him king. So he executed his own sons for treason against the state. Can you imagine Lord Montague wanting to execute Romeo for treason against Verona? He would never do that. When you’re living in the plot of Romeo and Juliet and you read about these ancient Roman figures, as described in the lofty biographies of someone like Livy, you read them and you say, “Wow, if only our leaders would act like that.”

Well, how were they raised? Can we raise our leaders the same way? Can we make libraries filled with what young Cicero read and what young Brutus read? What did they read? They read Plato, and they read Homer. So we need these things. Can we recreate the educational environment that produced them?

Petrarch suggests this. His students and successors embrace this idea and pour money into traveling across the Alps to look for manuscripts, traveling to Constantinople to purchase manuscripts from the wealthier East where books are common, and bringing them back to assemble these libraries. Then they raise tutors like Marsilio Ficino, who can know Greek and Latin and surround the young princes and princesses of Europe with these values in the hopes that they will act like Brutus and not like Lord Montague.

This is based on an assumption that education is very much like osmosis, that if you’re exposed to something, you’ll imitate it. And the uptake of this is strong because Italy is also full of upstart rulers who just seized power five minutes ago by having a coup in their state and have no legitimacy and no right to be ruling what they’re ruling and are resented by their people. But they can dress up like a Roman emperor. And they can have a parade with allegorical figures of the virtues next to them. And they can invest in an impressive palace that has a pediment on the front and looks like a Roman building to the eyes of the period, and cover themselves with the trappings of antiquity.

Then people might look at them and say, “This guy is different from what we’ve had. This guy is like the Caesars. The days of the Caesars were pretty good. Maybe we want this guy. Maybe he’s not going to be a tyrant. Maybe he’s going to be a good prince, and he’s going to make a golden age.”

And so the first dream is idealistic: let’s make better rulers. The adoption is self-serving and propagandistic: “Hey, I’m a tyrant, but I can seem like something better than just a tyrant. If I make myself look like Julius Caesar, then people will like and respect me.”

Or in the case of Florence with the Medici, “We are merchant scum. We are dirt compared to everybody around us. We’re not even one of the important families of Florence. We’re three ranks down. Even on the standards of merchant scum, we’re extra scummy merchant scum. But if we can have Latin and Greek and quote Cicero and seem like the ancients, people will take us seriously and respect us and talk to us even if we don’t have it.”

Let me give an example. Imagine that you are an ambassador from France, and you’re on your way to Rome, because a new pope has just been elected. Whenever a new pope is elected, every country in Europe has to send a special ambassador whose job it is to deliver a long-winded oration that says, “I am the ambassador from a very wealthy country and a very powerful prince.” And he’s so glad you’re the pope. Congratulations. Only you have to do that for an hour.

You have to give a gift to the pope, and it has to be very impressive, and you have to be a really important person. You’re the most important person who can leave your country without causing a political crisis. You might be the heir to the throne, for example. Or you might be a more minor ambassador, but you’re at least the son of a count.

You’re on your way to Rome, you’re heading along the length of Italy, you’re going to go through Florence, it’s on the way. Ugh. There’s nobody there worth talking to because it’s just a pit of scum and villainy. In fact, also filth and depravity because, of course, Florence is the sodomy capital of Europe. To Florentine is the verb for anal sex in several different European languages. In the laws of France, you can be indicted for sodomy on the grounds that you have ever once in your life even visited Florence. That’s considered evidence enough.

So you’re on your way to this matchlessly filthy dive of scum and villainy. And then you approach the city, and there are these statues. They look like ancient statues, the kind that are so lifelike that it’s as if they’re about to breathe and move. You’ve never seen an intact new statue like that. That isn’t something we know how to do. You ride through the city a bit, and it’s a large, impressive city, and you get to the cathedral, and it has this massive dome, way bigger than anything you’ve ever seen except for old Roman ruins.

You come to the banker’s house, and your servant knocks at the door. The banker greets you humbly at the door and apologizes that his humble palace is not worthy to host Your Excellency, and you’re like, “Yeah, it’s not. You’re correct.” He invites you in, and the instant you step inside, you’re in a space like nothing you’ve ever seen before with white light streaming in through this airy, rounded windowed courtyard that feels cleaner and more outdoors than the outdoors did, because something about the air is cool and fresh. It’s like nothing you’ve— Wait, wait. It is. It’s like the Roman ruins in the backyard of the castle where you grew up. But we don’t have the ability to do that anymore. All that’s lost.

In the middle of the square is another one of these bronze statues that looks like it’s about to come to life, except it’s shining and new. It hasn’t even turned green yet. Around the courtyard are busts of all the Roman emperors in order, and above them are portraits of this guy and the members of his family. Off in the corner are some men wearing robes that look like the robes the ancients wear. You say, “Who are those guys?” He says, “Oh, they’re Platonists. They’re speaking ancient Greek.” You say, “I thought I didn’t understand that language, but ancient Greek is lost. We don’t have ancient Greek.” He says, “We have lots of ancient Greek here.” You say, “And also, we don’t have the works of Plato. They’re also lost.” “Oh, we have lots of Plato here. Look, here’s my grandson, Lorenzo. He’s just written a poem in ancient Greek about the three parts of the soul. Would you like to hear him recite it?”

Now there’s a ten-year-old boy reciting a poem at you in ancient Greek about the three parts of the soul, and you’re like, “Where am I? None of this is possible. None of this has existed for a thousand years.” That’s the moment that Cosimo de’ Medici turns to you and says, “Would you like to make an alliance with Florence?”

And you can say no. You can say, “No. My king is going to come over the Alps with his enormous army, and we’re going to descend upon this city, and we’re going to sack it, and everyone’s going to let us because it has no friends because it doesn’t have any nobility, so it can’t marry anybody, so it has no meaningful allies. And also, it’s in the middle of this Guelph-Ghibelline feud, so all of its neighbors hate it and they’re just going to let it burn. We’re going to take the enormous piles of gold that are in your basements and go home rich, and all of this will be gone like a dream.”

Or you could say, “Yes, let’s make an alliance. Give me a bronzesmith and an architect and a Greek teacher and a Platonist, and we’re going to take all of these things, and we’re going to do the French court like this. Then when the ambassador from Portugal comes, he’s going to feel like an uncultured fool, just like I feel right now.” The power dynamic just flipped upside down. Suddenly, the condescending nobleman is in awe of the merchant scum. That’s what the art and the culture does as a propagandistic tool.

The next stage of it then is, “Okay, we’ve raised these princes like this, and they have the Latin, and they have the Greek, and they can impress everybody.” Then they fight a bigger, nastier, worse war than any of the earlier big, nasty wars, with more deaths and more betrayals and bigger cannons knocking down cities and burning whole areas. The wealth is centralized, so the mercenaries are more numerous because people can produce more. The first generations raised by this are supposed to be philosopher princes, and instead we get Cesare and Lucrezia Borgia, both of whom had Latin and Greek and Cicero and Plato when they were kids. Then they grow up, and Cesare sets fire to half the world.

That is the war Machiavelli watched. Machiavelli was raised on all of the Cicero and Livy. He was raised on the Petrarchan project. He has this famous, beautiful letter that he wrote in exile, where he’s describing his day to his friend. Most of the day is wasted, and he mucks around hunting for larks. Then he goes to a pub and gets drunk in the company of uncultured countrymen. Then he goes home, and he gets dressed in the court robes, the court finery that he would wear back when he was an ambassador to popes and kings. Attired thus, he enters his library to hold commerce with the ancients. He loves this the way Petrarch wanted him to love it.

But he observes these wars, and he observes virtuous princes like Guidobaldo da Montefeltro, who does every single thing you’re supposed to do virtuously. He has all the Plato, and he has all the libraries, and he has all the art. And he gets betrayed and his city taken away from him and loses everything. And he watches terrible people like Cesare Borgia and Julius II make terrible choices and succeed. He says, “Okay, clearly Petrarch was wrong that just reading Cicero would make successful rulers like the Caesars. But I still feel in my heart a deep power in the classics.”

So he says, “What if the libraries are what we need, but we need to use them differently?” He proposes what we would think of as political science. We observe historical examples. We say, “Okay, here are five examples of battles that happened next to rivers. We’ll put those examples side by side and see what decisions the commanders made to try to figure out which one worked better.” We use history as a casebook of examples of what worked and what didn’t. We imitate what worked, and we avoid doing what didn’t. Instead of feeling that reading about good men will make us good, we read about wise choices, and we imitate those choices.

This is one of the reasons Machiavelli is described by his contemporaries as a historian. He says we need to use history and use the classics differently. He proposes that. He isn’t very popular in his own day. It takes a long time for that to catch on. Many people for decades after him are still trying to use absorption by osmosis. But he’s writing that in the early 1500s, so it’s been a little over a century since this started.

We have to remember how long this process is. From Petrarch’s first call to Machiavelli writing that is as long as from Yuri Gagarin’s space flight back to Napoleon. The childhood of Napoleon to the space race, that’s Petrarch to Machiavelli. We think of it as one time period, but a lot changed. They had a plan. They tried the plan. They brought the plan to its maximum. They raised all the princes in this new way. The wars happened. It clearly failed. Machiavelli then thinks about why it failed.

We’re still only halfway through the Renaissance. Shakespeare’s grandparents have barely been born. We have a lot more time to go. So what do we need? We need new ways of thinking about it. We’re reading the ancients, and we have bigger libraries. We have the printing press now. We’re having libraries in smaller towns. More and more people can read. It’s easier and easier to get an education. More people are starting to learn about science.

It also is important that they’re inventing micro technologies of book production like footnotes and glossaries in the margin that explain the hard vocabulary. When Petrarch’s successors like Ficino were young, you had to be a masterful Latinist to read these ancients. You had to have an enormous vocabulary. There are no dictionaries. There are no glosses. There’s nothing to help you. Only a tiny slice of expert classicists could actually read this stuff.

A hundred years later, there are translations into the vernacular. There are footnotes that tell you the hard vocabulary. Any med student can read Lucretius’ discussions of materialist information. When Poggio found it, there were two dozen people in the world who could read it. A hundred years later, 30,000 people can read it in the 30 print editions that are printed before 1600.

When all different kinds of people read it—med students, law students, people in different countries, people in different places—they ask new questions. They wonder whether they can test the hypotheses. They do test the hypotheses. They’re the generation that discovers that the heart is a pump. They’re the generation that takes seriously the question, “Maybe there are atoms, and maybe that’s how diseases work, and maybe we can develop the germ theory of disease.” That’s the 1560s, 1580s, 160 years after Lucretius comes back, because it takes generations of work to build the libraries, to have the libraries, to use the libraries.

So when we get to 1600, which is almost exactly 200 years after this begins, a little bit more, we’ve had time to say, “Let’s build the libraries, have the libraries, use the libraries, or realize we failed in how we use the libraries, and use the libraries differently.” That’s the generation of Francis Bacon and Galileo who say, “Hey, let’s use the information differently. Let’s use nature as a casebook of examples the way Machiavelli said we should use history. Let’s examine, let’s doubt, let’s rethink, let’s do stuff in new ways.”

Dwarkesh Patel

Just to make sure I understood the chain of causation here: we’ve got to resuscitate the virtues of the Romans, therefore read what they read. To do that, you need to build the libraries. You build the libraries, you resuscitate all those arts. Then you just need to have people be literate, have people think about information in a new way to analyze it. And that analysis lends itself not just to the history of leaders, but also to the nature of the world.

Whenever I hear a story about how this is why the scientific revolution happened, why the Industrial Revolution happened, I’m like, but there are so many stories and it’s just hard to figure out why this one over the other ones. There’s a dozen other stories you could tell.

I had a previous guest, Joseph Henrich, who has this theory that the Catholic Church was breaking down these old kinship-based networks that the rest of the world had. It was encouraging guilds, encouraging these kinds of centers where people could get together and discuss ideas. There are probably twenty other stories you could tell. Why this story?

Ada Palmer

Two different reasons. One, I think it’s useful to think about how for new ideas to flourish and new ways of running the world to happen, you need a fertile environment. In the same way that for forests to grow, you need enough topsoil. It takes a while to get that topsoil.

It takes a while to get enough books. You need to have enough books for a bunch of people to be reading and thinking. You also need to have networks of information moving this stuff back and forth so that they can have discourses of ideas with each other. You can’t publish a scientific journal until there are journals. You need to have developed this ecosystem of information and knowledge.

People talk about it sometimes in terms of increasing literacy rates as if higher literacy makes there be more books instead of the other way around. In fact, there’s a lot more literacy than people imagine in even medieval Italy. Florence had a male literacy rate of ninety percent.

Dwarkesh Patel

As of the sixteenth century?

Ada Palmer

As of the twelfth century. Because everybody’s in the merchant world, so you have to be able to send letters. You have to be able to read account books. You have to be able to calculate your tab at a restaurant.

But of those people, how many have read a book? Very few. They’ve read letters, they’ve read tallies, they’ve read indexes, they’ve made notes. Being literate and being book-literate are different things, in the same way that some people watch television but don’t watch very many films, while other people watch lots of films. You can be literate and have never read a book because there might be almost no books in the entire city in which you grew up if it’s 1200 or 1500. But if it’s 1600, there are definitely books in any medium-sized town. So literacy transforms into access to scientific, intellectual, legal, all sorts of different worlds of ideas.

The other person you quoted who’s talking about transformations in networks of power from being less family and clan-centered to being more guild-centered… The guilds are major generators of ideas as well. The guilds can own libraries by 1600. If you went to a guild hall, it will have a bunch of books about its own trade. That would not have been true in 1100.

Those changes are all real, they’re all intermixing, and they’re all parallel to each other. You need all of these things together. One of the focuses I have is sometimes there are more steps to something than you think.

We tell this story of the Renaissance, of how they rediscovered these ancient texts, and then we got science. That’s true, but it is an oversimplification and too wide a zoom. If I said that in the French Revolution, Napoleon rose to power and spread nationalized warfare across Europe, and then we landed on the moon, I’ve skipped some steps. We know that about modernity, but we don’t remember that about earlier periods.

Dwarkesh Patel

Obviously all the stories are somewhat true, but to the extent that this is a part of the story, you’re building up libraries of classics and … setting up a network of information exchange that leads to the Scientific Revolution…

The reason this feels salient right now is that a lot of people have this idea that they’re going to make AI go well by doing X thing. Maybe some of those things work, but it’s at the same time frustrating but also funny and interesting that historically nobody has a good track record of being able to say, “I will do this thing so that this huge unanticipated change in history will go my way, or according to my values.”

Ada Palmer

Right. I think “go my way” as opposed to “go well” is a really important distinction. Petrarch wanted a world with these values. He thought, for example, that this would be a triumph for Christianity and what we would call Catholicism, though there’s only one Christianity from his point of view at the time, except for the East, which is different.

He was sure that when we found the ancients, fundamentally all of their philosophy would agree with Christianity. The ancients were wise, therefore they will be correct, and Plato will ninety percent agree with Christianity. It just needs a little shaker of the Trinity on top to be Christianity. When he says, “Go find these ancients,” he is in a world that doesn’t have the ancients yet. He’s just guessing what’s going to be in these books. But he says, “If we find them, they will uphold good values,” and everyone believes him.

Then they go find them, and they squabble with each other. There are Hedonists and Epicureans and Stoics and all sorts of chaotic things, much more plural than he anticipated. It makes a world that in turn has giant wars, which he would not like, and a crisis, and Machiavelli’s critique of the ancients, and then the new science and the new philosophy, and eventually Galileo, none of which resembles what Petrarch would have described if he had specifically laid out the future he was trying to make.

But then we get to the propagators of Bacon’s scientific method, meaning Voltaire and Montesquieu, who are also big campaigners for inoculation against smallpox. The first major disease eradications begin under that immediate influence. Science gets us to the germ theory of disease, which gets us to modern hygiene, which gets us to vaccines, which gets us to penicillin and the treatment for the Black Death.

Petrarch thought he would make a world which shared his values. Instead, he made a world that doesn’t share his values but is capable of curing a disease he never imagined would be curable. If you showed him this future, it would be scary. It would be weird to him because it does not embrace his values. Our values are different. He would be horrified by democracy. He believed that only a tiny elite has the capacity to rule. If we had a time-traveling Petrarch, he would really wrestle for a long time to wrap his head around democracy as a functional system. He really thought in oligarchic terms.

But he would see the wonders we’ve created, especially the fact that we can treat the Black Death, and he would weep for joy seeing that. He did not create a world that went as he wanted, but he created a world that went well. We have many examples of that. Trains and bicycles come in, and we get feminism because it’s easier for people, especially women, to move freely and independently. They can organize. They can mobilize. We get suffragettes. Did the inventor of the train intend for there to be women’s liberation? No. Did it go the way he imagined? No. Did it go well? Yes.

00:28:49 - How Florence’s weird republic worked

Ada Palmer

It’s important here to zoom in a little bit on Florence’s own government system and how and why it’s weird, in order to understand what rank Machiavelli actually holds in it.

All of these republics, except Florence, are modeled on ancient Rome. The ancient Roman model was an oligarchic republic in which within the city there are certain noble families, usually founding families who made the city in the first place, who are the senatorial families. Hereditarily, when they come of age, the men of the family are automatically in the senate. From among them are elected the consuls, high senators, or the head of state if there is one. You have a small slice of the population that are fully enfranchised members of the republic who rule over the commoner majority.

That is how Venice works. That is how Genoa works. That is how Bologna and Siena for the most part work. That’s how the Swiss Republic works. That’s how all of these republics work. Florence was like that for quite a while, but when republics fell, they usually fell to noble families who are the foremost, the strongest, the military class. If you’re a military leader in this period, you have to have noble blood. No soldier is going to follow a commander who doesn’t have noble blood. That would be weird. Those threats to the independence of the republic almost always came from the nobility.

After one particular near miss in which the city was nearly taken over, they decided to get rid of the nobility of Florence. They massacred most of them, cut their heads off, put them on pikes, burned their houses down, raked salt into the earth, and had a party on their graves, the way you do in the period when you’re getting rid of a class of people. There were a few noble families that they really liked who had not been part of negative stuff. They allowed them to officially renounce their nobility. They renounced their nobility, changed their names, and declared themselves commoners.

They set up a commoner republic. What that meant was the senate consisted of members of merchant guilds. A member of a merchant guild here means the owners of workshops. It’s not the guy who sits at the loom weaving, but the guy who owns the warehouse full of looms where the workers are working. The head of the sculpture works, the head of the architectural firm, not the bricklayers who are actually laying the bricks. Bourgeoisie is an anachronistic word, but we’re talking about the owners of the means of production who are themselves commoners.

They are very wealthy, but from the point of view of the diplomatic corps of any other society, all of the ruling people and all of their ambassadors are noble-blooded. If you’re an ambassador, you’re automatically noble-blooded. Nobody’s going to take an ambassador seriously who isn’t. From the perspective of every other polity in the world, the rulers of Florence are the rank of their valet. There is no nobility left in the city.

In fact, Florence can’t run its own armies or head its own police, because you’re not going to surrender if you’re told to surrender in the name of some guy who doesn’t have a coat of arms. That would be weird. So they actually have to hire a nobleman to come to the city and be their chief of police to arrest people in the name of the Holy Roman Emperor. One at a time, they’ll invite a skilled military commander nobleman who will come to the city. He’ll be podestà. He’ll live in the palace, which is also the prison. He’ll arrest people. He’ll enforce the law.

They will pay him handsomely at the end of the year, escort him to the gates, and then banish him from the city for life on pain of death so that he cannot return and make use of the power that he had in the city to try to take over. They’re very wary of any nobleman. They’ve set up a really weird republic—weird from the perspective of everyone around them—in which a bunch of merchants are trying to share power by being lotteried into the senate.

You put names in a bag. You examine all of the merchant members of guilds. You choose which ones are fit to serve, meaning not ill and dying, not insane, not so deeply in debt that they could be manipulated by the people whom they owe money to. Their names go in a bag. You choose nine guys at random. They rule the city. They are put in a palace where they rule the city from that tower.

They’re actually locked in the tower for the duration of their time in office because if they left the tower, they could be bribed or kidnapped. They rule the city for two or three months. At the end, they are thanked for their service and escorted out, and then a different nine guys share power for the next three months. It’s a power sharing that is designed to be tyrant-proof because you need consensus of nine randomly selected guys to decide to do anything.

Dwarkesh Patel

Oh, it’s not even a majority vote, it’s consensus?

Ada Palmer

It’s consensus.

Dwarkesh Patel

Previously you were describing “kill the nobles, salt the earth”. I’m almost thinking early communists. But then you say it’s the heads of the merchant guilds who are in charge. I want to understand why merchants and entrepreneurs have notable status in Florence. What is it about the culture that makes it so? Also, the Medici, the most powerful people, their job is usury. It’s like the church—

Ada Palmer

It’s important to remember they were nobody when this was set up. They were a family of minor importance.

Dwarkesh Patel

But the culture is getting started where somebody like that could be respected. How does that happen?

Ada Palmer

An important part of it is when you have a merchant capital, everybody works for somebody who works for somebody who works for the boss.

If you are a major merchant in Florence, you’re importing and exporting wool to and from all across Europe. You have employees all across Europe. You’re buying mass bulk wool from England, importing it to Florence, and processing it into high-quality wool using olive oil that you’ve bought from Naples, which you’re then exporting to Germany and France. You are a very interconnected businessman. You have a lot of contacts, you have a lot of clout, and the employees who work for you look to you for their safety net as well as their political representation.

We’re very accustomed in the modern period to thinking of the government as being our big safety net. If we wonder who is going to fund the hospitals, whose job is it to take care of orphans, we think of the government, or maybe the church. But in this period, if you’re killed and you leave orphans behind, it is your employer whose duty it is to take care of them. If you are injured and can no longer work, it is your employer who will support you for the rest of your life while you are disabled and find you work that you can do with that disability. A huge portion of the safety net is your employer.

Are you in trouble with the law? Your employer will supply your defense attorney. Your employer will supply the persuasive note to the judge that they would very much appreciate if their person got off. This is the system known as the patronage system, and it existed in ancient Rome. It exists and saturates the medieval and the Renaissance worlds in which everyone is in a very interconnected hierarchy.

So if you’re a brewer and your son gets in a barroom brawl and punches somebody out and the person’s nose breaks and they die in the brawl and your son is suddenly in trouble and you say, “Oh no, I don’t want my son to be executed,” you turn to your landlord. Your landlord turns to his landlord. They turn to one of these major families. These major families are massive landowners that own dozens of apartments within the city. Hundreds or thousands of people work for them.

So it makes sense to everyone to be represented that way, like having a council of the CEOs of all of the organizations that employees work for, when your corporation also supplies your social safety net and you see your representation there.

It’s also a world that’s used to thinking in terms of hierarchy and very unused to thinking about real democracy. It really doesn’t have any confidence in what we would recognize as democracy. We talk about these republics, and we’re very excited by the fact that they give more power to the people than a monarchy does, but they’re still incredibly narrow oligarchic republics.

When we read Machiavelli, he talks a lot about the popolo, which we translate as “the people.” He talks about how important it is that the popolo are respected and have a voice, that the popolo are armed, and the government shows respect for the people by allowing them to be armed. We read this and we’re like, “This feels really familiar. This feels like documents of the founding of the US where we’re respecting and arming and trusting the people.”

Popolo meant the top 4% economically of the population, the members of the merchant guilds. That’s the popolo. He’s talking about a narrow-slice oligarchy being heard, a narrow-slice oligarchy being respected. We didn’t realize that in the nineteenth century when we were excitedly translating The Prince and reading it as quasi-democratic. We now have read more documents of the period and realize how people use these words.

00:38:13 - How the Medicis took over Florence

Dwarkesh Patel

Florence in this period goes through five different forms of government. It’s this republic of nine dudes in a tower, as you were saying, before 1434, and then—

Ada Palmer

There’s a gradual takeover. There’s a gradual, what we could call regulatory capture. But an interesting detail about Florence, even as the Medici take over, is that the Medici know the people of Florence are very deeply invested in this republic and very deeply invested in its institutions. Therefore, they have to respect those institutions and proclaim respect for those institutions. So they’re going to sustain people in the named offices that there used to be. They’re going to continue to let the guilds be important and have important offices.

There was a mandatory outfit worn by people who worked in the republic. The garment over there in the corner is a lucco fiorentino. This was the garment you were mandated by law to wear if you held office in the Florentine Republic. To us, we look at it and say, “It’s a long red robe. It looks very Renaissance.” To them, it looked like a toga because of the way it was draped. They thought of this as a toga. They’re cosplaying the Roman Republic. Wearing a Florentine toga while in office was something that you did to represent your fealty to Cicero and republican values.

The dukes made their men continue to wear these. In fact, the first Duke, Cosimo I, would wear one to costume balls as if in his heart he longed not to dress like a duke, but to dress in a toga like a republican.

Dwarkesh Patel

It’s doubly ironic because when the Roman Republic turns to the Roman Empire, they still have the senate. They still have all these old institutions, even though it’s no longer a republic.

Ada Palmer

The Roman Senate keeps meeting until 1200 AD.

Dwarkesh Patel

It’s sort of doubly ironic that they are doing the same thing, but in the 1500s.

Ada Palmer

And it means that more rights are granted to the people of Florence than to other cities that fell to monarchies at similar points. The monarchs of Florence know they have to be careful, they have to respect rights to a certain amount, and they can’t run roughshod over them.

There’s a really cool building that I love in Florence. If you’ve been there, there’s the famous bridge, the Ponte Vecchio, which has little jeweler shops all along it. When you get to the end of it, there’s this funny overhead corridor, the Vasari Corridor, which was built by the dukes of Florence to connect the old city palace where the senate used to meet—where they had to have their seat of power—to their new palace across the river, which was much bigger, where they could have grand balls and things that dukes need to have.

Because they’re so terrified of being assassinated by their own people, they built this overhead walkway that goes from one end of the city to the other so that they could walk in safety without being assassinated. This is a sign of a weak duke. But also, when he was building it, it’s going across the roofs and sometimes blasting off the second stories of different people’s houses. Most people, when His Grace the Duke says, “I’m gonna blast the top story off your house,” would say, “Yes, Your Grace, please continue.” There are literally severed heads of people who resisted still rotting on spikes in front of the Palazzo Vecchio.

But they get to this one point where there’s a very old tower, a 500-year-old tower. This belongs to the Mannelli family, who are descended from peers of Julius Caesar and can trace their genealogy all the way back to an old Roman gens. When the duke says, “We want to knock the top off your tower,” they say, “No, this is our tower. This tower has been ours since before the Medici existed as a named family. You may not knock the top off.”

And the duke does not knock the top off. The corridor goes around in this awkward square around that tower, because he knows that if he violates something as traditional and core to the civilization as the property rights of somebody who has owned something for a long time, there will be rebellion, civil war, dissent, and resistance. These are monarchs who know that they are weak and are therefore careful, and therefore more rights, like property rights, exist.

Meanwhile, across the river in Ferrara, Duke Alfonso I d’Este of Ferrara used to wander around Ferrara buck naked with a sword in one hand and his dick in the other, to show off that nobody would ever possibly try to harm a Duke d’Este. He and his siblings used to do things like, if they liked a musician, kidnap them and lock them in a tower so that nobody else could hear them, or if they wanted each other’s musician, send goons to kidnap each other’s musicians. They also used to recreationally murder each other’s servants when the siblings were tiffing with each other.

That is what you do when you don’t fear your people and when you feel confident in power. They are much closer to tyrants than the Medici are ever able to be, even after the republic falls. That’s what’s so neat. Because the resistance failed, if we’re looking at it in black and white. The republic fell. There wasn’t a republic anymore. There was a duke. He took over, and the old system was gone.

But because the republic fought so hard and because the people really believed in it, the people had a lot more rights, and the tyrant was a lot less tyrannical because there had been that fight. It’s a great example of how even when resistance loses, resistance wins.

Dwarkesh Patel

I think there’s an interesting parallel to today, not to be too on the nose, but sometimes people debate the odds that America becomes a Putinist kind of country within a couple of decades. I think the odds are actually quite low, because even though constitutionally, or at least in precedent, the president is very powerful, the republican expectation is so strong. The amount of resistance faced, even when you successfully do something, demotivates the next escalation.

Ada Palmer

The only thing that makes resistance weak in the US is when people feel as if partial victory is failure. Remembering moments like how Florence’s resistance all the way to the end meant that there was more liberty for the next several centuries, even under the tyrant, is what we need to remind ourselves, that partial victory is an important thing.

Even if the worst were to happen and there were to be tyranny, that tyranny would be so much weaker because there was a lot of resistance, and traditions of resistance and structures would develop that would continue to exist.

Dwarkesh Patel

I think you should discuss the fact that the Medici are the bankers for the papacy. What does that mean? Why is that necessary? How are they able to make money off of that from the interest on the float?

Ada Palmer

When Cosimo de’ Medici swings the contract as banker for the pope, it’s important to remember that when you can’t wire transfer money in the pre-modern world, collecting taxes is a very difficult and complicated system. It is generally done by the centralizing power that has the right to tax delegating somebody local. If you’re in a town, there’ll be a local tax collector. It’s his job to go around to everybody and collect taxes, send a portion of those taxes home to the central power, and keep a remainder to pay himself.

The central power will say, “We expect X amount of taxes from this area.” When you hear about wicked tax collectors, it’s because if you are told, “We want 10,000 florins worth of tax from this town,” but you extract 15,000, you can keep the other 5,000. The 10,000 is what you need to send to the central power, so the more you extract, the more you get paid.

This delegate system, in which there’s a local tax collector and even a more local tax collector below him who might collect tax from a particular village, means that you depend a lot upon the person whose job it is to collect your taxes. When Cosimo is papal banker, he is the person collecting and channeling the money from every church in Christendom when everybody puts a coin into a collection box or pilgrims come and put money. All of the wealth that’s supposed to flow back to the papacy is actually flowing to Cosimo. Cosimo is passing it on to the papacy after taking a cut.
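The arithmetic of that incentive can be sketched in a few lines of Python. This is only an illustration of the example above: the function name `tax_farm_split` is hypothetical, and the handling of a shortfall (the collector simply passes on whatever he has and keeps nothing) is an assumption that is not discussed in the conversation.

```python
def tax_farm_split(extracted, quota):
    """Split collected taxes between the central power and the local collector.

    Assumption (not from the conversation): if the collector extracts less
    than the quota, he remits everything he has and keeps nothing.
    """
    remitted = min(extracted, quota)   # the central power receives at most its quota
    kept = max(extracted - quota, 0)   # any surplus above the quota stays with the collector
    return remitted, kept


# The example from the conversation: 10,000 florins demanded, 15,000 extracted.
remitted, kept = tax_farm_split(extracted=15_000, quota=10_000)
print(remitted, kept)  # 10000 5000
```

Every florin extracted beyond the quota goes straight to the collector, which is why the system rewards squeezing the town as hard as possible.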

That is a lot of money moving quickly. There is also a lot of ability to make contracts and contacts. We all know how important networking is. He rises in prominence from a banker to somebody who has enough money to effectively take over his state via manipulating the guys-out-of-a-bag system. To discuss that again briefly, a system where you lottery people into office is called sortition. That’s the technical term for it. This is a very old form of government. Ancient Athens used it. It actually works really well.

But like any institution, it is corruptible. In the same way that you can corrupt voting by bribing people or manipulating the machines or manipulating voters, you can also corrupt sortition by bribing the people who pull names out of the bag. Or you can use the simpler mechanism which Cosimo uses first. If you’re a giant bigwig in the city and you employ a third of the people in the city and they’re on your payroll, and nine guys at random are chosen out of a bag, three of them are going to be your guys, just statistically.

If you tell all your guys, “I want this policy, this policy, and this policy, and if you have questions, send for me and I’ll tell you what to do,” when the plurality on a random council all have a plan and it’s your plan, you effectively control the city. In that way, the Medici effectively controlled this lotteried system, because they guaranteed that the plurality, in a situation that doesn’t have a majority, will always be them.
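The "just statistically" claim can be checked with a rough binomial sketch. It assumes, for illustration only, that each of the nine seats is filled independently and that each draw has a one-in-three chance of landing on one of the patron's men (reasonable if the bag holds far more than nine names and mirrors the payroll share). The function name `prob_exactly` and the parameter names are hypothetical.

```python
from math import comb

SEATS = 9        # nine names drawn from the bag
SHARE = 1 / 3    # assumed fraction of eligible names on the patron's payroll


def prob_exactly(k, seats=SEATS, share=SHARE):
    """Binomial probability that exactly k of the drawn seats go to the patron's men.

    Assumption: draws are independent, i.e. the bag is much larger than nine names.
    """
    return comb(seats, k) * share**k * (1 - share) ** (seats - k)


expected = SEATS * SHARE                                     # average number of loyal seats
p_none = prob_exactly(0)                                     # chance a council has no loyal seats
p_at_least_three = 1 - sum(prob_exactly(k) for k in range(3))

print(round(expected, 3))          # 3.0
print(round(p_none, 3))            # 0.026
print(round(p_at_least_three, 3))  # 0.623
```

Under these assumptions the average council has three of the patron's men, but roughly one council in forty would have none at all, so over many short terms an unlucky draw is close to inevitable.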

But of course, there’s an element of chance in that. In 1430 and 1432, Cosimo has bad luck, and the lottery draws a lot of people who dislike him and doesn’t draw any of his guys. They immediately declare him a traitor to the state, arrest him, and lock him in a tower.

And he bribes his way out. He offers the equivalent of about $300,000 to the guard outside the cell and $700,000 to the captain of the guard to smuggle him out of the tower. He wrote in a letter later that they were the two most foolish men he’d ever met because he was Cosimo de’ Medici. He would happily have paid them tens of millions of dollars to let him out of there, but they weren’t ambitious enough to think to ask for more than a few hundred thousand.

So he escapes, and then in the next election they happened to elect entirely people who just loved Cosimo. They invited him back to the city in triumph, declared him father of the fatherland, and arrested and persecuted all of his enemies, who turned out to be guilty of tax evasion and all sorts of other things.

That was the moment that his grip tightened. And he’s like, “I’m going to stop simply controlling a plurality, and I’m going to start bribing the people who actually run the elections.” His famous quote about this is, “It is dangerous to be rich and not powerful.” You need the power to defend yourself in a situation like King of the Mountain, where when you’re on top, everyone will try to knock you down.

This is the system into which Machiavelli is born. His family has worked for the Medici family for generations. He grows up expecting to work for the Medici family. But the problem with heredity is that sometimes you get a weak link.

And in the moment that Machiavelli is in his early twenties, he is coming of age, about to work in government for the first time, a government in which he is not, in fact, even fully enfranchised. That’s one of the fascinating things about the degree of his patriotism. You weren’t allowed to serve in government office fully—the lotteried offices—if your family was deep in debt. His grandfather had a lot of unpaid tax debt.

So he worked his whole life for a government of which he was not even quite a full citizen. That shows a deep love of country, but it also shows that even people who could not be in office deeply loved and cared about this republic and the important liberty they felt they had being ruled by the 5% instead of being ruled by one dictator.

To us, that isn’t a very big difference. They’re still both not democracy. We would say they’re both not liberty in the sense that we want liberty. But it’s an inch more liberty than monarchy. Even that small amount of liberty, people loved it. People were willing to fight for it. People were willing to go to the streets, wave their banners, and say “libertas” for the republic. Because they were invested in it, Machiavelli observes, they sustained it.

But eventually, one particular Medici—I’m not saying names because they all have the same names over and over, and it’s really confusing—comes to power quite young and weak. He’s basically 20 when he’s suddenly in charge of a very precarious republic. Right then, the French are invading Italy, and he’s scared. He botches the diplomacy with France and falls into disrepute, and the city takes the opportunity to kick him out. The subsequent regimes, which are an independent republic again, are the ones for which Machiavelli works.

He was part of the regime that ruled while they were in exile. When they returned, they viewed him as an enemy. He didn’t actively organize to resist them, but his name was found on a list of potential people that an anti-Medicean resistance movement had intended to recruit. He is arrested, tortured, exiled, and in exile writes The Prince.

He dedicates it to the very family that exiled him because they now control Florence, and he will only work for Florence. He doesn’t want his manual of the great secrets of statecraft to be in the hands of anybody but his homeland, so that it will defend his homeland.

When Florence exiles you, they tell you, “Go to this place and wait, and if you’re good, we’ll invite you back.” Florence has been doing this for ages because Florence actually used this as the core of its diplomatic corps. When you have no nobility, you can’t have ambassadors in the full-on noble ambassador sense. There’s nobody in the city of sufficient rank to go talk to the kings, to play chess with the sultan, and do all the things you have to do to be a proper ambassador.

What Florence did instead is exile people and say, “Okay, we’re exiling you. You go to Bruges. Be our contact in Bruges. You go to London. Be our contact in London. Be good. Send us letters informing us what’s going on. When we have diplomatic needs to talk to the king, we’re going to send letters to you, and you’re going to forward them. If you’re good, you get to come back.” So being in exile is sort of being on probation, but also being entrusted with state matters.

That’s not quite what they did with Machiavelli. With Machiavelli, they banished him to a hamlet in the middle of the Tuscan countryside near nothing important and said, “Go sit in the country and rot, and if you’re good, we’ll invite you back.”

What everyone expects is that Machiavelli will break that promise and leave. Because he’s a well-known statesman, a scholar, a playwright, and a historian, and there are dozens of cardinals in Rome and other cities that would love to employ him. Kings of England love employing Florentines to work for them as secretaries. Kings of Naples love employing Florentines to work for them as secretaries. He might go get a job tutoring the daughters of the Duke of Milan, the way Francesco Filelfo did when he was kicked out of Florence for opposing the Medici.

There are lots of places it’s expected an exiled Florentine intellectual will go where he will have the ear of power and be able to exert influence. He will be a mover and shaker at the court of Milan or Naples or England.

Instead, when they say to Machiavelli, “Sit in the country and rot, this is a test,” he passes the test and sits in the country faithfully and rots. If he had wanted to go be an intellectual power broker, the correct move is to run off to Rome and say, “I will give up the chance to go home the way Dante did, but I will be a Florentine in exile, and I will write important things. I will live at the house of wealthy men who will support me and give me the ear of power, and I will exert my influence in that way.”

He does not do that. He stays in the country and he rots, and he continues writing letters home saying, “I will serve you or nothing. Bring me home to serve my country.” That is a weird thing to do, and not normal for the many other Florentine intellectuals who experienced similar banishments in the same period.

Dwarkesh Patel

How do we know that he wasn’t just trying to get back into power?

Ada Palmer

The answer is you read his personal letters. You read the way he talks about love of his country, and you read the way he talks to his friends. You read the letters he wrote when he discusses writing The Prince, and you read the comments he exchanges with the other friends that he shared it with.

His other works—his comic play, which was a big hit, his history of Florence, which was well known at the time—those he published and circulated. The Prince he kept in very close private circles, circulating it only with trusted, intimate friends, and then the copy that he sends in to Florence.

Yes, it’s a job application: “Please bring me back. I will work for you. I will be loyal. I support my city more than any particular iteration of my city. I support my country more than any particular regime or group that might be in power. Whatever is in power in my city, I will be faithful to it.” You see him expressing that in lots of different ways.

When in The Prince he says you can and should do all of these ruthless things to keep power, we have to remember that the end justifies the means when the end is the survival of your country. It’s not that the end, in general, justifies the means. Machiavelli feels very strongly that regime changes bring civil violence, and civil violence sheds blood. He has seen the streets of his city run with blood before.

He thinks that even life under a tyrant is better than life in a civil war, which is usually not life at all, given the massacre of the people and external conquest that are likely as a result of another regime change. So he says, “Don’t push for regime change. Even if the regime is tyrannical, more people will survive by sticking with the tyrant than by changing the regime.”

00:58:12 - Why it was so hard for Gutenberg to make any money off the printing press

Dwarkesh Patel

I want to talk about the printing press. One thing I didn’t realize before reading your book is that not only does Gutenberg go bankrupt after making the most significant invention of a millennium, but his apprentices also go bankrupt.

This is at a time when people like Cosimo are willing to pay on the order of hundreds of thousands of dollars per book. So how is it possible that the guy who invents a way to make books far cheaper goes bankrupt?

Ada Palmer

The problem is printed books are a mass-produced commodity in a world that does not have distribution networks for mass-produced commodities. Mass production is incredibly rare in this period. Coins are mass-produced, but that’s really about it. Almost everything is artisanally produced. When you have a mass-produced product, you need a distribution mechanism before you can sell it.

The great example is that technically e-books existed the first time anyone typed a book on a computer. Certainly in the 1970s there was such a thing as an e-book. But there was no market for e-books until the Kindle came out and made a commodity way to buy and sell e-books. Then the e-book industry came into existence. So the e-book as a commodity is several decades younger than the e-book technically existing.

In the same way, you’re Gutenberg. You have figured out how to produce 300 copies of a book for the cost of one copy of a book. You do so. You print your Bible. You have 300 Bibles. You sell seven of them to the seven people in your small landlocked German town who are legally allowed to read the Bible in a period in which only priests are allowed to read the Bible. Congratulations, Mr. Gutenberg, you have 293 Bibles, and you can’t sell them, and you go bankrupt.

There has to be a distribution mechanism for books to find their market because there are certainly 300 people in Europe that want this, but there are not 300 people in one location where it’s being produced. So Gutenberg goes bankrupt. The bank seizes his press. They try to go into the business. The bank goes bankrupt. There is so much overhead. You spend hundreds of thousands of dollars on the production cost of the books, and then you get nothing back.

Gutenberg’s apprentices build presses. They go bankrupt. They flee their debts, flee the country, leave Germany, and go to Venice. Venice is the airport hub of the Mediterranean. Venice is where you change boats. If you’re sailing from A to B, you go to Venice, you change boats, you get to the next place. The hub system has always worked well.

So if you’re printing in Venice, you print 300 Bibles, you give ten Bibles to each of thirty ships’ captains going to thirty different cities. They can sell them. The first economically sustainable circulation of print is enabled by the hub system.

Then book fairs come into existence in which printers will spend all year printing a book. They go with a thousand copies of their book to a book fair where there are a thousand other printers. They all trade, and then they go home to their town with five copies each of 200 books instead of a thousand copies of one book, and then they sell them in bookshops. Things like the Frankfurt Book Fair, which still exists today, developed as the distribution mechanism.
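
The arithmetic behind that distribution story can be sketched in a few lines. The print run of 300, the seven local buyers, and the thirty ships carrying ten copies each come from the conversation; the assumption that every port city has at least ten willing buyers is a hypothetical added only for illustration, as are the names `copies_sold`, `home_town`, and `port_N`.

```python
PRINT_RUN = 300

def copies_sold(allocation, demand):
    """Copies sold when each location can absorb at most its local demand."""
    return sum(min(copies, demand.get(place, 0)) for place, copies in allocation.items())

# Local demand per location. The 7 home-town buyers are from the interview;
# 10 buyers per port city is an assumption chosen to match the shipment size.
demand = {"home_town": 7}
demand.update({f"port_{i}": 10 for i in range(1, 31)})

# Case 1: the whole print run sits in one small town.
single_town = {"home_town": PRINT_RUN}
sold_single = copies_sold(single_town, demand)
unsold_single = PRINT_RUN - sold_single

# Case 2: the same print run is split across thirty ships leaving a hub.
via_hub = {f"port_{i}": 10 for i in range(1, 31)}
sold_hub = copies_sold(via_hub, demand)

# Case 3: the book fair swap. A printer arrives with 1,000 copies of one title
# and leaves with 5 copies each of 200 titles, the same number of books.
fair_titles, fair_copies_each = 200, 5
fair_total = fair_titles * fair_copies_each

print(f"one town: sold {sold_single}, unsold {unsold_single}")
print(f"via hub:  sold {sold_hub}")
print(f"book fair stock after trading: {fair_total} books across {fair_titles} titles")
```

The total number of copies never changes in any of the three cases. What changes is how many locations, or how many titles, the same stock is spread across.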

There’s a slow growth and a slow saturation. That’s really cool because one of the things people think is unique about our present information revolution is that we’re living in this sequence of successive information revolutions. We had the computer, the computer was exciting. Then we had the personal computer, then we had the internet, the cell phone, social media, and now we have different social media networks coming in successively causing crises one after the other. And then we have LLMs and other applications of machine learning and generative AI.

It’s easy to think of each of these as different tech revolutions, as if we’ve just had ten tech revolutions in a row. But really, they are all deeper penetration of one tech revolution: the development of the computer. These are all applications of computers.

In the same way, the printing press comes in in 1450, and it isn’t done shaping the world instantly. It takes forty years to even be economically sustainable. It’s not until the 1490s that printers are making money.

And then in the 1510s, it’s time for pamphlets and pamphlet distribution. Now there’s news, and news is suddenly done by print, and that’s a revolution on the same scale as the difference between computers and cell phones. We get the Reformation, which is enabled by pamphlets in exactly the same way that the Arab Spring is enabled by cell phones.

Then we get the newspaper, another new application of the same technology that follows, like social media. It’s one information revolution having multiple successive revolutionary applications as it disseminates and eventually saturates. It moves on a timescale quite similar to the timescale in which the digital one is happening as well, so that print keeps hitting Europe with successive revolutions for 150 years.

And every couple of decades, or every decade, there’ll be a new bang. Suddenly it’s possible to get a printed pamphlet from Wittenberg to London in seventeen days. Oh my God, we can coordinate our resistance movement against the Catholics. Boom. The Reformation happens. That wasn’t possible even a decade earlier when it took months to get a pamphlet from one end of Europe to the other.

So it’s best to think of these very much in parallel, the print revolution and the digital revolution, as one big technological change in information that then has successive applications as that one technology finds new forms and disseminates more deeply and keeps having consequences over decades. It’s not multiple separate revolutions. It’s one ongoing information revolution.

Dwarkesh Patel

Maybe other eras also have this and I just haven’t read the books about them, but from your book, I thought, “Oh, history just seems to be happening really fast, and seems to have sped up, especially religious and political history.” Obviously, the things happening in Italy, but even aside from that, you have Martin Luther and the Reformation, and then just twenty years later England splits off from the Catholic Church, which is unprecedented in two millennia.

Ada Palmer

Then it has a bunch of tumults that flop, flop, flop so that every decade feels different. Here you are in 1506 being nostalgic for how the world was completely different in 1490. And you’re like, “That’s pretty fast.” Here we are in 2026 often feeling nostalgic for how things were in the year 2000.

Dwarkesh Patel

Is it fair to trace that back to the printing press or its offshoots, or is it just embedded?

Ada Palmer

It’s more that history has always moved fast. But when we teach it in high school, we’re trying to move over large chunks of time quickly, and so we pretend that it moved slowly. We have this lie that there were long periods of stagnation. But you can zoom in anywhere, and you’re going to find every decade feels different, and people in the 1320s are nostalgic for people in the 1300s.

It’s always felt like history was moving very quickly, and things rose and things fell. It’s the lies we tell ourselves in history books written in the 19th century that are trying to group all of these things together and make modernity special that confuse us about this.

I’m working on a paper right now about the video game Civ. Civ is the number one teacher of history in the world. It has shipped 70 million copies, and 65 percent of people on Earth who have technology play video games. Civ is the number one teacher of history, bar none, since 1991.

What does Civ tell you? Civ tells you that in antiquity, a turn is fifty years, and then in the Middle Ages, a turn is twenty-five years. Once you get into the Industrial Revolution, a turn is ten years, and then five years, and in modernity, a turn is just one year because in one year, as much happens now as happened in fifty years in antiquity. That lie is also what our textbooks tell us.

But it doesn’t matter where we zoom in. Any time I go to a talk where any historian is zooming in on any decade in any time and place, it always feels like it’s moving as fast as our present is moving.

Dwarkesh Patel

I guess the difference is that technologically, we know that they weren’t moving as fast.

Ada Palmer

Technologically, they were moving fast. We just don’t care about those technologies anymore.

Dwarkesh Patel

That’s interesting.

Ada Palmer

They were constantly inventing all sorts of things. We just take them for granted. The invention of chairs with backs, the invention of scissors, the invention of improved metallurgy so that steel could do things steel couldn’t do before. There was always technological change happening.

I’m in the middle of reading an amazing book about how, when you look at the paintings of Raphael and the few paintings we have by Michelangelo, the colors look like they’re really glowing, like gemstones. How did that happen? When you compare them to paintings from just a hundred years earlier somehow the colors are flatter. I’m not talking about the anatomy being more realistic. That’s separate, but the colors are flatter.

The answer is there was a sequence of revolutionary adaptations in how to process oil and how to process colors and mix them together, and then those were used to create fake gemstones, and there was a major industrial leap forward in the fake gemstone industry. Then people who were making picture frames realized they could use the same techniques from the fake gemstones to make fake gold by painting yellow over the surface of tinfoil.

And then those were used by artists who were like, “Wait, I want to make things that look like they glow like fake gemstones.” There were eleven major technical revolutions over the course of 120 years that led to those colors changing.

Dwarkesh Patel

Obviously progress has been happening in individual fields over time. But in this macroscopic view, and this is a big part of your book, there’s a reason that people living in the fourteenth century would say, “Look, the best time to be alive was when the Romans were around, and since then it’s just been the Dark Ages.”

If they stood in relation to the Roman Empire as we stand to them, we would obviously notice that the world has seen so much progress since then. It clearly seems like the pace...

Ada Palmer

It’s hard to figure out when we are lying and when we are right where we say the pace picked up. One thing that makes the pace pick up in modern day is simply the population grew and grew and grew and is now much, much larger. The majority of people who ever lived in the entire history, since humans have been humans and not hominids, have lived in the last 200 years because the population became massive. How did the population become massive? Our agriculture and our hygiene enabled it.

How did our agriculture and our hygiene improve? Half of that is continuing on the artisanal level to invent new things in the same way that the artists invented better colors. Agricultural workers invented better technologies, and agriculture was constantly improving. You’re correct that with the arrival of the systematic scientific method just after 1600, there is a deliberate societal desire to create intentional anthropogenic progress. I’ll zoom in on the arguments made in 1600, then I’ll zoom out and unpack them.

In 1600, the idea is that history up until now has been unsystematic. People have discovered things at random, but we can create a method in which we observe the world and use inductive reasoning to figure things out from those observations to create systematic descriptions of the secret motions that underlie nature, and from that work out technologies that are good and useful for humankind. If, as we make our observations of nature, we publish them and share them with each other, we can create a community of scientists that will share all of these discoveries with each other and with the world and therefore benefit it.

This is where, when I’m doing this in the classroom, I deliberately provoke and shock my students with the fun claim that Leonardo da Vinci was not a scientist. What I mean by that is that to be a scientist is to publish your results and share them with a community of other scientists so that they can test them, so that the whole human civilization progresses a little bit. When my friends who are chemists or my friends who are particle physicists discover something, the next goal is to share that discovery with everyone so everyone’s knowledge advances.

What does Leonardo do? He writes everything he discovers down in coded mirror writing so that nobody but him can possibly use it. He refuses to share even with his students and assistants the secrets of what he’s doing because Leonardo does not want to contribute to human progress. Leonardo wants to make unique masterpieces so that hundreds of years later, people will see them and marvel and say, “How did he do it? No one else has ever been able to replicate that method.” He wanted to be marveled at by the future exactly the way he and his peers marveled at the works of the ancients.

They look at something like the Colosseum or the Pantheon in Rome with its enormous dome, and they say, “How did they do it? If only we could work that out, we could make one and then make sure no one else could.” Brunelleschi, who built Florence’s famous beautiful dome, deliberately burned all of his notes and schematics so that nobody else would be able to replicate his work. That is an inventor, and an engineer, but in the sense of a community of scientists, this is not a servant of human progress. This is actually a saboteur of human progress, if anything, who deliberately makes progress and then tries to cut it off at that point so that no one else can be his peer.

That is what you did as a learned inventor in the 1400s and in the 1500s. But as you get to 1600, the suggestion is different, and here I’m going to use Francis Bacon’s gorgeous simile of the three insects. There are three types of knowledge wielders, says Bacon.

First, there is the ant, who is the encyclopedist, who gathers information from all around the world. He learns everything he can, and he piles it up into a great big pile. He makes an anthill, and he sits on top. If he has the biggest anthill, the biggest pile of knowledge, then he’s proud of having made it. But all he does is assemble it and possess it. It’s a beautiful library, but nothing comes from it.

The second type is the system weaver, the spider who spins elaborate webs of beautiful, intricate, logical theory. You admire them, and you can get entranced and ensnared in them easily because they’re so beautiful. They’re almost hypnotic. But there’s nothing real in them. They’re all just spun out of the body of the spider himself, the theorist theorizing from his own mind.

The third kind, says Bacon, is the honeybee, who, gathering from among the fruits of nature, processes what he gathers through the organ of his own being to produce something which is sweet and useful for humankind. That is the scientist who gathers from nature to produce something sweet and useful for humankind.

With this rhetorical call, and with Francis Bacon’s portrait on the title page, the English Academy of Sciences is founded and starts publishing. The standard switches over from “You are a great achiever because you built the dome” to “You are a great achiever because you worked out how it can be done, and you shared that sweet and useful thing with all of humankind.”

Bacon says if we do this, if we make academies of sciences, we can make sure that every human generation lives in a better condition than the past. We’ll have better agriculture, fewer famines. We will have refrigeration. We’ll have chicken in winter. We will have all of these things that we aspire to. If we collaborate, each generation’s experience will be better than the last. He says that to be a scientist is the ultimate act of charity because there is no greater act of charity than to give a gift to every human who will ever live after you.

That is the rhetoric of what you would feel was happening if you’re alive in the 1620s and 1630s. Galileo is publishing his observations, and Descartes is publishing his systems. They’ve just discovered that the heart is a pump and that they were totally wrong about the four humors theory. The blood circulates, and they’re trying to figure out what it does. They have magnification, and they can see worlds of complex patterns on the wing of a flea. It sounds like the whole world is suddenly coming into view, and we’re at the beginning of progress.

If we zoom out, we would say there’d been progress the whole time. People had always been inventing things. Agriculture in France was better in 1300 than it was in 1000. Plows got better, seed got better, cabbages were bred to be bigger. People worked out better pots. There were always artisanal inventors.

In fact, that’s a lot of what Bacon is observing. He worked in the patent office as a young man, and he would see a carpenter come in to patent: “I have invented a better chisel. I’ve invented a thing that goes like this. I’m going to patent it.” He would realize that it was workers and workmen and handicraftsmen who were inventing the really useful tools. He wanted to make this systematic.

We would say there was always anthropogenic progress. In 1630, they realize there can be anthropogenic progress, but they think there hasn’t been any before. They think they’re beginning, and that history up until this point has been stagnant, but now it’s going to suddenly be full of invention as, for the first time, there will be deliberate anthropogenic progress. Really, we would say there always was and that it’s accelerating, and at this point, we realize it and articulate and describe it.

You’ve probably seen lots of graphs of history with the hockey stick graph structure, where it’s flat for a long time and then zhoops up. They’ll put that zhoop after the invention of the scientific method. It depends on what we’re graphing, whether that zhoop is appropriate. It also depends on how much you zoom in or zoom out.

It’s true, we do get to inventions that result in enormous increases in population 150 years after Bacon. Would we have anyway, even if it hadn’t been systematized? Probably a bit later, and we would have a slightly flatter hockey stick. But we would still have hockey sticked. In the same way that when you put mice on an island without mice, they breed and they breed and they breed and they breed and they hockey stick. Humans would also have hockey sticked. But would we have hockey sticked later? Would we have hockey sticked with more pain? When mice hockey stick, they also starve to death and eat each other. We haven’t done that yet. Go us.

Was that science? Probably. There are a lot of factors to it. So is it true that everything accelerated after 1620? In one sense, yes. In another sense, it’s a continuation of a curve that was already curving.

01:17:34 - Why the industrial revolution didn’t happen in Italy

Dwarkesh Patel

I think you might have answered a question I was about to ask. The book you recommend on your website, The Renaissance in Italy, I keep forgetting the name of the author. Italian names are tough.

Ada Palmer

Guido Ruggiero.

Dwarkesh Patel

In some part, he has this question: Look, in Italy, as you mentioned, in Venice, they’ve really scaled the printing press. As a result, you have the metalworking for fine typesetting. Separately, milling technology for water mills and windmills is advanced, along with gears for watches. So he asks, why didn’t Italy have the Industrial Revolution? I wonder, do you stand by the answer you just gave, or is it a different theory?

Ada Palmer

Part of it. But another is, we should not underestimate how much richer per square meter Italy is than everywhere else. Italy is the breadbasket, and it’s also the center of Big Oil, which is to say Big Olive Oil, which was both fuel oil for light and industrial oil for production, as well as cooking and eating oil. It’s also the center of the other major industry of the period, which is Big Wool.

If you’re already the center of Big Finance, Big Wool, and Big Oil, do you need an industrial revolution? You’re already economically on top through the power of agriculture. It makes sense for the Industrial Revolution to have happened in a sort of backwater area instead. What was England producing? Crappy quality wool?

England was so aware that it couldn’t process wool into high quality without masses of olive oil, which it couldn’t produce, that England just exported its crude wool to Florence in order to have Florence, with its olive oil reserves, produce the fine quality. Think about how a wool suit isn’t itchy, but a wool blanket often is. That wool suit isn’t itchy because lots of olive oil went into the process of producing it, at least at pre-modern tech levels. So do you want England to produce your itchy wool that people will only pay a small amount for, or do you want to export it?

It makes sense for the place that did it to have been somewhere industrially ambitious that wasn’t already economically on top. That’s one reason that industrialization doesn’t kindle in Italy. Italy is agricultural land and a finance world. It doesn’t feel like it needs a new industry.

Another factor is mining. This land is more valuable as a farm than it is as a mine. You don’t want to rip it up. Another is it’s so subdivided because those rich cities are still mostly independent, whereas a centralized crown in England is more able to pass legislation to facilitate a massive transformation.

No city really wants to be the one where the giant industrialization is happening. It’s awful for the city. Note that the industrialization of the Industrial Revolution was mostly outside of the wealthier centers of England in the second-tier towns. They grow massively into huge industrial areas like Lancaster. So those are a plural bunch of reasons.

Dwarkesh Patel

But I would have also thought that the competitiveness between different Italian city-states would have made it so that if they get better textile machines before you, it’s a disaster because they’re right there.

Ada Palmer

This is not going to sound plausible to anybody, but it’s true. We’ve been looking at some documents recently which pretty much confirm that they did figure out how to make industrial looms in the 1400s, and they didn’t want to. They wanted to make luxurious artisanal fabrics.

Dwarkesh Patel

This, by the way, was another interesting thing from the book. With the first printed books, there’s not yet a market for cheaply produced commodities where the average person thinks, “Oh, if I can get this for $10.99, I’ll go buy it.” So they’re trying to make the printed book look like an artisanal, luxury-grade product.

Ada Palmer

Right. The first printed fonts look like handwritten scripts, and often have a blank space to illuminate it so that it looks just as fancy as manuscripts.

01:23:02 - The Library of Alexandria isn’t where most ancient books were lost

Dwarkesh Patel

One thing I wanted to ask you, back to the printing press. Not only does printing get cheaper, but around this time, paper itself also gets cheaper. So not just reading, but writing gets cheaper. Do you as historians see a marked change in this period in the amount of records that are taken and, as a result, our understanding?

Ada Palmer

A huge amount rests on whether you have a cheap writing surface. Rather than looking first at the Renaissance, let’s look at what we think of as the fall of Rome. One of the biggest things that happens there is that Western and Northern Europe lose access to papyrus. Papyrus is the cheap writing surface of antiquity. It is an easy plant-based writing surface.

You take this tall, thin water reed that is fibrous like asparagus. You slice it into ribbons. You set them out in the sun, a bunch of them parallel to each other sitting on a stone like noodles. You put a second row of noodles perpendicular to that on top, and then they dry in the sun, and they are naturally sticky. They stick to each other. They produce a sheet. Practically no labor has gone into this. You’ve sliced, you’ve laid out, boom.

Papyrus is a very inexpensive writing surface, and this is what enables Rome to have a bureaucracy and to have libraries in any mid-sized city. People can send letters back and forth. There can be enormous tax records. Sometimes when Egypt and Rome are at war, Egypt will be like, “No, we are angry. We’ll stop exporting papyrus.” No papyrus to Rome, and then Rome’s infrastructure will fall apart overnight because you can’t do anything if you can’t write stuff down.

Papyrus is a warm-weather plant. It is killed by frost. You cannot grow it north of the frost line. So not in France, not in Spain, not even in most of Italy. You can only grow papyrus at the very tip, down in Sicily.

Without papyrus, what you’re writing on is a dead sheep. If you think of the price of a head of lettuce and the price of a leather jacket, you’re understanding the difference between a sheet of papyrus and writing on a dead sheep. Every page of a medieval book is as expensive as that much of a leather jacket. A medieval book handwritten on parchment costs as much as a house, so that a small pocket copy of a book costs as much as a studio condo. A big illuminated fancy Bible, you’re spending on that what you would spend on a villa in the countryside.

This is an enormous expense. To have a library is to be not just rich, but mega-rich. Only the wealthiest cities contain anybody who has a library. The great library of the University of Paris—the library from Europe’s perspective—has six hundred books. There’s definitely more than six hundred books in this room. Every kiosk at an airport selling Dan Brown novels has more than six hundred books. This is nothing.

At the same time as that, in the Middle East, sultans have libraries of over a thousand books or five thousand books. There are libraries in Sub-Saharan Africa with thousands of books. There are libraries in China with thousands of books because they have cheap paper, rice paper. The Middle East has papyrus. Europe, and only Europe, is writing on a leather jacket.

Dwarkesh Patel

What changes around this time? How is Europe able to get paper?

Ada Palmer

Still zooming in on the fall of Rome. Rome had lots and lots of books on papyrus. They start falling apart because papyrus is brittle. Most of our knowledge from antiquity is not lost at the burning of the Library of Alexandria. It’s lost between 400 and 600 A.D. when the papyri are falling apart.

Here you are with a library of a thousand books, and you can only afford to make a hundred new books. You have to choose which hundred of these thousand to save because there literally is not enough industry on your continent to make enough leather to copy down all this text. You have to pick. The majority of what we lost from antiquity, we lost then.

We lost it when the papyri were falling apart. This also distorted what survived because most of the copying out was done by monks. When you have a thousand books and you can only save a hundred of them and you’re a monk, you’re like, “What will I save? I know, Saint Augustine. I love Saint Augustine.” This is why we have more surviving work by Saint Augustine than the entirety of all pagan classical Latin. The subjective tastes of the people in power at the moment the papyri were falling apart ended up being an unintentional moment of censorship that biased what survives from antiquity.

Paper technology hits Europe in 800 A.D., so we’re talking about a four-hundred-year famine of a cheap writing surface. Paper is nowhere near as cheap as papyrus because you need to gather rags from used clothing. You immerse them in water, and you beat them violently using a mill for a very long time until they become a pulp. You then scoop that pulp up on a screen, and the fibers lock together. It’s sort of a slurry that looks like grits. You lift up the slurry, and it locks together into a sheet of paper.

It’s not as cheap as just growing papyrus, and it’s much more labor. You have to build a paper mill. If parchment is a leather jacket and papyrus is buying a head of lettuce, this is somewhere in between. What’s in between a leather jacket and a...

Dwarkesh Patel

This feels like a trick question.

Ada Palmer

This is somewhere in between: getting yourself a dozen frozen prepackaged meals, which are complex and have many ingredients. A lot of industry went into producing the actual packaging, more so than a head of lettuce. So it’s ten times as expensive, but it’s still a tenth as much as the leather jacket.

Paper comes in, and people are very wary of it. Paper is clearly not as strong as parchment. Parchment is really tough stuff. People start using paper for rough drafts, letters, sketchbooks. When you’re doing the sketch before doing a painting, you might do that on paper. But Europe has paper for four hundred years before the earliest state document is ever written on paper, to give you a sense of how people are wary of it.

It disseminates slowly. It’s still expensive. It requires industry and production, but it is a tenth as expensive as leather. Paper disseminates slowly through Europe. Again, this is one of these things where there was always technological change, and all technological changes are gradual.

Paper comes in in 800. It’s being trusted by 1200. When printing begins, they’re printing on paper, but they even print on vellum. If you’re a really rich person, you would be like, “Please print two copies on vellum for me.” Dukes like the Dukes d’Este, Isabella d’Este—the sister of the duke who walked around buck naked to show off that he could—specially ordered all of her books to be printed on vellum even when the rest of the print run was on paper. These are the very books being produced in Venice by the apprentices of Gutenberg who ran away.

At that moment in the 1490s, if you’re really rich, you might be invested in these newfangled printed books, but you’re still not trusting paper, even though paper has been there for six hundred years at that point. So again, gradual adoption of technologies and gradual trust in paper. They’re still using parchment for things, gradually less and less, but substantially over the course of the 1600s. You can even find things written on parchment in the 1700s and 1800s. British Parliament still did its records on parchment up until ten years ago, and the Vatican still does its official records on parchment now.

Dwarkesh Patel

This is a digression, but the numbers of how expensive a book is didn’t make sense to me just based on how much scribe time it took. You say it’s $600,000 per book, and then separately, it’s five months of scribe time. I’m like, how much are the scribes getting paid? But if it’s the paper... What changes with Gutenberg?

Ada Palmer

The paper and the ink. But a lot of it is scribe time.

Dwarkesh Patel

But Gutenberg still needs paper, right?

Ada Palmer

Yeah, Gutenberg needs paper. That’s why he goes bankrupt. He borrows the equivalent of about $1.5 million to buy paper, and then doesn’t make back $1.5 million worth of material when printing it.

This is what makes printing a risk. You have to start buying the paper up front. You need to buy it in a big lot so that it matches, because people don’t want the paper to suddenly be a different color within their book. You’re investing a lot up front, and you’re not getting anything back until you produce this slow print run, which is why printers start printing pamphlets.

They can have one press that’s slowly printing a valuable book that will take six months to print. Next to it they have another press that’s printing pamphlets where in two days they’ve printed a fashion report on what everyone was wearing at the royal wedding, which they can sell right away. It’s much cheaper, but it means they have something they can sell two or three times a week. So you have the pamphlet following the book, printing cheap news, printing scandal rags.

Dwarkesh Patel

Why is it cheaper? Because the material is cheaper?

Ada Palmer

Just because it’s only five pages long.

Dwarkesh Patel

Oh, I see. Got it.

Ada Palmer

I could grab one if you want to see one. So if we look at some examples. I’ll show you these one by one. For example, this is a pamphlet. Naked pages, short text, hand-stitched together. It would take two or four days because you print the front side and then the back side. It’s cheap. It’s ephemeral. You print a thousand of them. You sell a bunch around the town. You sell a bunch to news writers who are going to and from other cities, who will buy them and bring them to the next town.

If you’ve printed news in Milan, people who are going to Florence will want to buy your news to go there. It might be a report of a siege. It might be what people were wearing at the royal wedding. My favorite title of a pamphlet was “The Scandalous Tale of a Doctor from Padua and How He Seduced His Maid, Murdered His Wife, Murdered the Maid, Cut Out Her Heart and Ate It, and How He Was Justly Punished by God.” That was the title of the pamphlet. These things circulated around. Some of them were nonsense, some of them were real news. Most were combinations. But you can sell something like this cheaply in a couple of days.

Often they would have a cheap blue cover. You have seen this color before. This is the color of laundry lint, because fundamentally laundry lint is what paper is. You take rags of old clothes, you put them in water, you beat them until they become a pulp, and you skim it out with a sieve. Laundry lint is what rag paper is. If you don’t bleach it, it’s this generic blue-gray color, which is sort of the average color of what human beings wear.

That’s a copy of The Gentleman’s Magazine, another example of technology taking a leap forward in the 18th century. When they invented the newspaper, they immediately had the problem of, “Oh, no. Newspapers contradict each other. We don’t know what’s true. We have to fact-check stuff.”

That one has a great fold-out. I think there’s a procession or something. That is what everybody wore at the state funeral. Instead of photographs, we have this fancy, “Here is what everyone was wearing at the state funeral.” Very exciting.

Your laundry lint, if you don’t bleach it, remains the color that it on average was. In the 18th century, they have newspapers. The newspapers are reporting news, but they don’t quite say the same thing as each other. The problem becomes, how do we know who to trust?

The Gentleman’s Magazine was developed, and every week they would publish a roundup of that week’s news saying what each newspaper said about it, where they contradicted each other, analyzing who’s right and wrong. It was the fact-checking. This is the first magazine. It invented the word “magazine” being used in this context. It was an intellectual response to the fake news problem of how we reconcile what happens with newspapers.

You see these many iterations: they invent the printing press, then they invent the pamphlet, then they invent the newspaper, then they invent the magazine to cope with the newspaper. The newspaper is invented to cope with the pamphlet because you don’t know whether to trust the scandalous tale of the doctor from Padua and how he murdered his wife. Is he real? We don’t know. But if somebody publishes a newspaper that serially prints news every week, they have a reputation. They have to be respectable. You’re not going to subscribe to them if you catch them printing nonsense.

The serial nature of a newspaper was a form of accountability that made people willing to trust it over time. The newspaper is a way of fact-checking the pamphlet. The pamphlet is a way of making money while you’re printing your longer book. I will also let you have a look at papyrus.

Dwarkesh Patel

Thank you.

Ada Palmer

You can see the plaid pattern of the papyrus because it is made of two layers of strips. And there’s a papyrus scroll. That’s modern papyrus. The thing about papyrus is that in addition to being cheap, it’s very brittle. It works better in a scroll than it does folded over because the folded edge cracks really easily. If you try to make this into a codex book, it’s going to be very fragile.

Here you go. This is a real 17th-century letter in absolutely indecipherable handwriting.

Dwarkesh Patel

On parchment?

Ada Palmer

On parchment. You can even tell, because that’s cheap parchment, which side was the outside of the animal and which side was the inside.

Dwarkesh Patel

The handwriting is in some sense bad, but it’s also very well aligned.

Ada Palmer

Tiny and precise. But here is good parchment. It is hard to believe that it’s animal skin. These are pages from a book of hours from about 1480, individually hand-calligraphed. You can see that one has a hole through it. They wrote around the hole because it’s too valuable to not use that sheet.

These are paper thin. You can barely tell, if you look carefully, which side was the outside of the animal and which was the inside because one side has tiny little speckles of pores.

Dwarkesh Patel

Where is this from?

Ada Palmer

A book of hours. This is probably a French book of hours. A book of hours is a personal prayer book. Bible quotes, objects of meditation.

The book would be fat and small. This was the most common manuscript in the Middle Ages. You would carry it around in your pocket, and you’d pull it out different times of day for personal prayer. But it also has big margins so that you can take notes in it, write down addresses, have friends write notes in it.

You use it almost like a day planner. It’s the smartphone of the period in which you make all your notes or write down people’s names. You might have celebrities you meet sign your book of hours. All sorts of neat things go into the margins as you use this to organize the day.

Dwarkesh Patel

That would be extremely interesting as a collector’s item, random people’s book of hours and what kinds of things they recorded.

Ada Palmer

Oh yeah. Think of a leather jacket, but how much more industrial effort went into making leather literally paper-thin like this. Huge amounts of industrial effort go into making the pages of such a book.

Dwarkesh Patel

My favorite example of this kind of distribution and diffusion taking longer than you would think for a very fundamental technology—well, this is now my favorite example, so my second favorite example—is oil. I interviewed Daniel Yergin, who wrote this big book about the history of oil. In the 1860s, Drake strikes oil in Pennsylvania.

It’s in the 1910s that the car is invented, the internal combustion engine is put into a thing which you sell millions of copies of. Until then, oil is just used for kerosene, which is just for lighting. The actual gas is just thrown away. In fact, when the light bulb was invented, people were wondering whether Standard Oil was going to go bankrupt because the main use case had gone away.

Ada Palmer

Oh, neat. I always think of Julius Caesar’s description of Britain when the Romans first get there. He says, “The people of Britain are so poor, they can’t afford to burn wood, so they burn rocks.” We know he’s talking about coal.

Dwarkesh Patel

Oh, I thought it was satire.

Ada Palmer

No, he’s talking about coal. They had coal in the days of Julius Caesar, but they didn’t figure out its massive industrial utility until many, many years later.

Dwarkesh Patel

There is this interesting question of why the Romans didn’t have the Industrial Revolution because they had these huge silver mines in Spain and elsewhere, but no coal.

Ada Palmer

You have the Industrial Revolution when you feel you need to. That’s the thing about Gutenberg as well that a lot of people don’t think about. People are like, “Gutenberg was an inventor and invented a thing, and then it had an impact.” No. He was living in the middle of a library building boom in which there was a huge demand for books that spiked. He invented the invention in response to that cultural change.

It isn’t by chance that we got the printing press in 1450. There was a huge boom of library building starting in the 1410s, and inventors were trying to figure out ways to make books cheaper. They were making smaller books. They were using paper more. Paper surges before the Gutenberg movable type printing press. So Gutenberg isn’t a random genius out of nowhere. It was the moment that people needed more books. We were going to get the invention.

01:41:21 - The Inquisition accidentally invented peer review

Dwarkesh Patel

One thing you say in passing in the book is that Martin Luther comes up at the exact right time. You’ve got Savonarola in the 1490s, and he’s another prophet type. I guess his modern analog is somebody like Khomeini in Iran, setting up a theocratic government, but he’s too early. Machiavelli, you say, is too late because the censorship is already in place. What is the censorship that is in place by the time of Machiavelli? What is the alternative world?

Ada Palmer

Machiavelli, remember, is contemporary with Luther. It’s just that he circulates his stuff very briefly and very privately. He doesn’t want a pamphlet version of his ideas out there because he only wants Florence to have it.

Luther hits the sweet spot when the pamphlet distribution network had just developed. When Savonarola printed pamphlets, they only circulated around Florence and its neighbors, Siena and Pisa. It took months for them to get farther. His movement was quickly crushed.

When Luther makes the Ninety-five Theses public, they’re in print in London seventeen days after he releases them in Wittenberg. The pamphlet runners go foom, foom, foom, and get the news there, and things are printed overnight and come out that fast.

Dwarkesh Patel

But it seems like you’re hinting that within the next two decades, there’s a new censorship regime across Europe.

Ada Palmer

A new censorship regime responds. The censorship regime is very effective at shaping what is printed in books, but can never keep up with pamphlets. In the same way, the government can pressure CNN, but it can’t pressure random people on a social media network. You’re not going to be able to keep up with that speed.

One of the funny problems that the Inquisition always had when trying to persecute printers is that printers worked in the information distribution industry. They were the people who paid the news writers, whose job it is to move as fast as humanly possible between cities. Which meant that news always reached them first. If a printer was ever convicted by the Inquisition, they would find out before the Inquisition could possibly get there to arrest them.

The Inquisition never succeeded at arresting printers. They’d always skipped town by the time the Inquisition got there, because if you employ the news writers, you find out first what’s going on. The Inquisition can’t keep up.

When we look at censorship, there’s an intersection of four factors as to whether censorship is possible. One of them is law: Is it legal for the censorship to happen? Another one is the technology. Is it actually possible to censor this thing? You cannot censor whatever moves the information fastest because it will move the information faster than you can move.

Even if that one printer had to skip town, he will set up shop somewhere else, a new person will take over his shop, and the information will still move. So pamphlets become unpoliceable. You can try to police them, you can partially police them, but keeping pamphlets from moving around… They’re anonymous, they’re quick, they’re produced overnight, they move quickly. You just can’t keep up with them.

Dwarkesh Patel

Couldn’t they just punish print shops for publishing things? Just say, “This is what we like, and if you do something we don’t like, we’ll punish you,” which is how censorship in China works, for example.

Ada Palmer

They did. So the printer skips town. The printer moves to the next town. There is a cost to that. There’s a human cost to evading that. You’ve had to leave your home and friends behind and move to a new place, but they don’t get you. It’s also very easy to deny that the pamphlet came from you at all.

The print industry proves very difficult to censor, and we’re experiencing the same thing with social media. Everyone is like, “Censor the pornography on this social media channel,” and they’re like, “We just can’t. It’s too fast. There’s too much.” Or, “Censor the hate speech.” “We just can’t. It’s too fast, there’s too much.”

There are too many pamphlets, and they could crack down on one particular pamphlet shop. We have records of this. There’s a brilliant analysis in Anton Matytsin’s book, The Specter of Skepticism in the Age of Enlightenment. He has a great description from the notes of a raid on a clandestine bookshop. This wasn’t the printer, this was the underground bookshop that was selling illegal books, and they’re raided. It has all the details of how angry the people were about different things that the shop had.

So there was censorship and there were crackdowns, but it was a censorship that could not actually prevent circulation. It could restrict it, it could make it harder, it could make it scary, but it couldn’t prevent it.

Dwarkesh Patel

Before books become cheap, unless you’re fantastically wealthy, you’re reading the same couple of books—if you’ve ever read a book—again and again throughout your life.

Ada Palmer

Cosimo de’ Medici’s father owned, I think it was twelve books.

Dwarkesh Patel

I want to understand the intellectual significance of rereading the exact same book again and again. Maybe the reason Petrarch loved Cicero so much is, imagine reading the same book twenty times, hitting the same joke again and just meditating on every single point. There’s got to be a difference in intellectual culture as a result of treating these things as the equivalent of the Bible.

Ada Palmer

You really feel like you get to know the person intimately. You develop a personal relationship with the ancient author. You are participating in a conversation across the diaspora of time. It’s a one-way conversation. You’re responding to them, the future will respond to you. But there is a great deal of intimacy.

Petrarch talks about his friend Cicero and being betrayed by his friend Cicero. He finds new works of Cicero that he hadn’t read, including some of Cicero’s letters in which Cicero is not following his own Stoic philosophical precepts and is being petty, yelling at people about real estate, and getting all upset after his daughter’s death. You know how people get manic when there’s been a death in the family and start quarreling about everything? Cicero gets like that, and Petrarch is heartbroken.

To him it means even the wisest man in history could not conquer that urge to become irrational and petty in the face of grief. If even Cicero became irrational and petty in the face of grief, does that mean humanity is doomed to forever be irrational and petty in the face of grief? He talks about Cicero breaking his heart and his foot, because the book fell on his foot and broke it, and he got a bad infection, and he was bedridden for months.

Dwarkesh Patel

Totally different topic, but in 1492, Columbus comes to the New World. They discover the New World. What is the reception of this news?

Ada Palmer

I was just at a conference a week ago in which we confirmed that there’s a Vatican document from 1100 or maybe 1200—I forget the exact year—that recognizes the existence of Vinland, i.e. Canada, where they got the information from the Vikings.

Dwarkesh Patel

Oh, interesting.

Ada Palmer

They thought it was just a little thing, but yeah.

Dwarkesh Patel

So they’re rediscovering the New World. Would it be the equivalent of finding out there are aliens today? Why wasn’t it considered more significant? Why wasn’t the consensus, “This is the main thing happening right now, we’ve discovered the New World”?

Ada Palmer

When I teach my class on the 1490s, the students, many of whom are American, always have trouble wrapping their heads around people thinking that the New World isn’t a big deal. A big part of it is that they find the Caribbean islands, and they find the coast, and they think this is small.

The way I put it to my students is, the news comes back, we’ve found something across the water to the west. It might be even as big as the Canary Islands. They’ve found something, but they don’t realize they’ve found something the scale of Europe and Africa. Actually, it’s not as big as Europe and Africa, but they found something humongous. That’s part of it. Another part of it is no matter how big and important something far away is, it’s hard to bring your mind out of the petty squabbles that are happening right around you, especially when they feel like life or death.

If it’s 1492, what is happening? France is about to invade Italy. Europe might be embroiled in the largest war it’s seen in fifty years. The papacy has just been taken over by Spain. Spain is suddenly trying to throw its weight around in Europe in a way that’s unprecedented. The Ottomans have just invaded Italy and Hungary and might be coming again. Also over there, there’s a new thing. Okay, great. We’ll worry about that when we’re not having three wars at the same time. But guys, we’re having three wars at the same time. Oh my God. And then Martin Luther hits Europe like a ton of bricks when they still haven’t even figured out that this is a continent and not an island. In the same way, if you’re in a country and it’s having a tumult, you worry a lot about its tumult, even if a larger tumult is happening in a faraway country. It’s hard to bring your mind out of Europe in crisis to be like, “Hey, this is a thing.”

The other is they’re discovering lots of new things, and it falls into the same sphere as the rest. They’re discovering the existence of sub-Saharan Africa, where they thought there was basically one country’s worth of stuff south of the Sahara, Ethiopia and nothing else. Then they’re like, “Oh my God, there’s a whole big thing that sticks out.” They’re also discovering that the heart is a pump. That’s a bit later, but they’re discovering all sorts of stuff at the same time.

The discovery of the New World, especially when they realize how big it is, becomes an intellectual challenge where they say, “Wait, does this mean all the maps we’ve had are wrong? Does this mean the ancients were wrong about geography? Does it mean the world is a lot bigger than we used to think the world is? Let’s worry about that the same way we worry about revolutionizing our mathematics and figuring out that the sun doesn’t go around the Earth.”

These are things that are paradigm shifting. But on the other hand, does it matter whether the sun goes around the Earth or the Earth around the sun when the French are invading right now and we need to get the defenses going, and there’s a giant civil war happening, and we’re about to be betrayed? It does matter, but it also doesn’t matter. Any decade is concerned by its tumults and often fails to recognize the importance of what’s around it. That’s true of every decade.

One fun game when I study the history of censorship, which I work a lot on—my next non-fiction book is gonna be a book on the history of censorship—whatever they’re looking at, they’re always wrong, from our perspective, about what they should be worried about censoring. If we had a time machine and our goal is to go give them advice… Here we are in the French Enlightenment, Voltaire and Rousseau and the Marquis de Sade and La Mettrie’s articulations of materialist atheism are flying around Europe. What is the Inquisition worried about? It’s worried about Jansenist treatises about the nature of the Trinity.

Jansenism is sort of like a Calvinist version of Catholicism. Do you want to have an incredibly terrifying authoritarian God who hates you and tells you that your soul is a worthless spider that deserves to be hurled into fire, but also have to obey the arbitrary pope in Rome? Then Jansenism is for you. It has all the grimness of Calvinism and all of the authoritarian centrality of the Roman Catholics. This was a heresy that was abroad in the Enlightenment, and they are so much more worried about Jansenism than they are about Voltaire.

Remember that very chapter in Matytsin’s book I mentioned where they are raiding the clandestine bookshop. They’re like, “Voltaire, fine. The banned Encyclopédie, which is gonna revolutionize all thought in Europe, fine. Letters of Diderot, Rousseau, fine, fine. Jansenist treatises about the nature of the Trinity! Throw the book at these guys! This is the worst thing!” They really are obsessed with this incredibly petty minor heresy to the degree that when the Encyclopédie is banned by Rome…

France likes the Encyclopédie. This is Diderot and d’Holbach’s big project of universal education, to print an encyclopedia that will collect all world knowledge. They articulate it as, “Should a new dark age come upon humankind and even one copy of the encyclopedia survive, it will be sufficient to reconstruct all human progress.” That’s the goal of this thing. It’s advancing incredibly radical ideas about biology, about statecraft, about reforming the law to be rational instead of traditional, all sorts of stuff.

When that is banned by Rome, Paris is commanded... Paris loves this book. The king likes this book. The queen likes this book. She’s on record saying it was so cool being able to look up the technology that was used to make her silk pantyhose. She just loves it. Everybody loves it. France allows it to circulate despite its controversial content. But Rome says, “No, you must ban this book.” So they agree they’re gonna have the ceremonial burning, and they march the Encyclopédie up to the fire. Then they get some Jansenist treatises about the nature of the Trinity and burn those instead, because they don’t want to burn the Encyclopédie. They love it. They want to burn this other thing.

This is always true. If we had a time machine for the Inquisition in the 1540s, we would say, “Guys, Machiavelli, he’s really important. He’s really revolutionary. You gotta be looking at this.” Or we would say Lucretius’s De rerum natura, which I did my dissertation on… Many people are familiar with Greenblatt’s book, The Swerve, which credits a lot of change to the materialist science that this poem articulates. There’s a much more complex story, which, you know, is told in my book, which refers to Greenblatt’s. If you enjoyed The Swerve, you would really enjoy the more detailed zoom-in that Inventing the Renaissance has. But we would say, “Guys, you should censor this.”

We literally have letters of inquisitors writing to each other saying, “We don’t need to bother censoring Lucretius. Only learned people can read it, and they know perfectly well that the false stuff is false, so it’ll just circulate and it’s fine. What we need to worry about censoring is all of these fine minutiae of Protestantism.” The 1545 edition of the Index of Banned Books says in its introduction, “We shall put the names of arch-heretics in all caps.” When I first read that, I was like, “Ooh, I want to see all my favorite arch-heretics be in all caps.” I eagerly flip to M, and Machiavelli is not in all caps. He was not important enough from their position. The all caps authors are all minor Protestant theologians. They’re all people like Calvin and Zwingli and Luther and Melanchthon. They’re all doing stuff that we would say does not matter.

But an era is always wrong about what ideas and what circulation and what changes are the really big ones and are always much, much more worried about, “Oh my God, the Prince of Spain, which princess is he gonna marry? This is going to determine whether Spain is or isn’t annexed by Germany. This is the most important thing that has ever happened in the entire stream of time.” People are like, “We’ve discovered another continent,” and they’re like, “We don’t care. We just wanna know who’s gonna marry Charles.”

Dwarkesh Patel

That’s a very profound observation. It was really interesting to learn from your book that of all the thousands of people killed during the Inquisition, one guy was executed for atheism.

Ada Palmer

Science-related stuff.

Dwarkesh Patel

And even he had these ideas of reincarnation or...

Ada Palmer

I think probably the number executed for atheism would be about 100. There are 12 total trials of scientists about science. Galileo is one. Giordano Bruno is one. Giordano Bruno is the only one executed. Of those 12 trials, only three were convicted.

Hundreds of thousands of trials for Judaizing, which is theoretically contaminating Christianity with Jewish thought, and all of these other minutiae of oppression and segregation of populations, executions for paganism, meaning practicing your indigenous religion in a colonized space… Hundreds of thousands of executions for that, one for science.

Dwarkesh Patel

I recently got interested in the story of Kepler just because the way he discovers the laws of planetary motion is so whimsical with the theory of Platonic objects. While he’s going through Brahe’s data and coming up with the laws of planetary motion, he is the imperial mathematician for the Habsburg emperor, which basically means that he’s doing astrology for a general. Will we win the battle or whatever.

Then he gets excommunicated, not for the laws of planetary motion, but because he’s a Lutheran. In fact, his mother is tried for witchcraft. Again, has nothing to do with science, it’s just because she’s also a Lutheran.

Ada Palmer

Milton of Paradise Lost fame wrote our first big defense of the free press. This is in the moment in the early 1600s when England doesn’t yet have systematic censorship law. It has ad hoc, “Hey, this book is bad,” but it doesn’t have systematic, “You must submit all books to a censor,” the way the Catholic world does by that point. The Catholic world developed it in order to fight Protestantism.

There’s a lot of support for creating censorship in England at the time because there’s anxiety about Papists plotting against our nice non-Catholic country, trying to undermine it. There’s a general feeling of anxiety. There’s also a moral panic about books deliberately whipped up by politicians and power-seeking people, the same way in 1954 there was a moral panic about comic books or the same way there was a moral panic about Dungeons & Dragons in the ’90s. There’s a moral panic about scary and dangerous books and pamphlets. So there’s a movement to create state censorship for the first systematic time in England.

Milton writes this big treatise about why freedom of the press is important, the Areopagitica. It’s a beautifully written rhetorical piece that presents the importance of how we must trust truth to rise purely to the top. We must let free voices move, otherwise you’re gonna create a situation where people are writing for the censor first and for the public second. It will constrain people’s thoughts in the way that we know chilling effects and fear do. It’s a beautiful treatise. He fails. The censorship regime passes.

Paradise Lost is published under the censorious regime. It goes through the censorship. The one line they tell him to change is about astrology. They’re like, “It’s perfectly fine having Satan be your charismatic protagonist and God be kind of a jackass, and also having Satan spout ferocious anti-monarchical rhetoric copied from revolutionary pamphlets that are circulating in the British colonies so that he’s actually parroting republican, anti-monarchical rhetoric, very dangerous stuff in the treatise. That’s fine. But this one line about a comet causing a thing to happen, no, no, no. Astrology is gonna confuse people’s souls.” You’re like, “Guys, speaking as a time traveler, you’re so wrong about what you’re censoring.” They always are.

Dwarkesh Patel

You have one sentence which I couldn’t trace down, which I found very interesting. You said, “In the late 17th century, the most extensive library in all of Europe is the one in the Vatican run by the inquisitors.”

Ada Palmer

Not the library, the most extensive experimental laboratory. Daniele Macuglia is the scholar there. That’s from his dissertation. I think it’s been published now, but I don’t know if it’s actually out in English. It’s out in Italian. He works on the Inquisition and the immediate aftermath of Galileo.

They saw themselves as guarantors of truth and of accuracy in information. So they decided after Galileo that they had a duty to verify the truth of the books that they were sent to censor. If people were going to be doing mechanical experiments, they needed to repeat the mechanical experiments to see whether they were true. So they effectively invented peer review, which is to say they invented a second laboratory trying to recreate the results of the first. There are these amazing people who by day are inquisitors and by night are going home to write their own scientific treatises as they do these experiments. It’s not what we expect, but history is never what we expect.

Dwarkesh Patel

Seems like a good place to close. Ada, thank you very much.

Ada Palmer

Thank you.

Dario Amodei — "We are near the end of the exponential"

Dwarkesh Patel — Fri, 13 Feb 2026 16:46:36 GMT

Dario Amodei thinks we are just a few years away from “a country of geniuses in a data center”. In this episode, we discuss what to make of the scaling hypothesis in the current RL regime, how AI will diffuse throughout the economy, whether Anthropic is underinvesting in compute given their timelines, how frontier labs will ever make money, whether regulation will destroy the boons of this technology, US-China competition, and much more.

Sponsors

Labelbox can get you the RL tasks and environments you need. Their massive network of subject-matter experts ensures realism across domains, and their in-house tooling lets them continuously tweak task difficulty to optimize learning. Reach out at labelbox.com/dwarkesh
Jane Street sent me another puzzle… this time, they’ve trained backdoors into 3 different language models — they want you to find the triggers. Jane Street isn’t even sure this is possible, but they’ve set aside $50,000 for the best attempts and write-ups. They’re accepting submissions until April 1st at janestreet.com/dwarkesh
Mercury’s personal accounts make it easy to share finances with a partner, a roommate… or OpenClaw. Last week, I wanted to try OpenClaw for myself, so I used Mercury to spin up a virtual debit card with a small spend limit, and then I let my agent loose. No matter your use case, apply at mercury.com/personal-banking

Timestamps

(00:00:00) - What exactly are we scaling?

(00:12:36) - Is diffusion cope?

(00:29:42) - Is continual learning necessary?

(00:46:20) - If AGI is imminent, why not buy more compute?

(00:58:49) - How will AI labs actually make profit?

(01:31:19) - Will regulations destroy the boons of AGI?

(01:47:41) - Why can’t China and America both have a country of geniuses in a datacenter?

Transcript

00:00:00 - What exactly are we scaling?

Dwarkesh Patel

We talked three years ago. In your view, what has been the biggest update over the last three years? What has been the biggest difference between what it felt like then versus now?

Dario Amodei

Broadly speaking, the exponential of the underlying technology has gone about as I expected it to go. There’s plus or minus a year or two here and there. I don’t know that I would’ve predicted the specific direction of code.

But when I look at the exponential, it is roughly what I expected in terms of the march of the models from smart high school student to smart college student to beginning to do PhD and professional stuff, and in the case of code reaching beyond that. The frontier is a little bit uneven, but it’s roughly what I expected.

What has been the most surprising thing is the lack of public recognition of how close we are to the end of the exponential. To me, it is absolutely wild that you have people — within the bubble and outside the bubble — talking about the same tired, old hot-button political issues, when we are near the end of the exponential.

Dwarkesh Patel

I want to understand what that exponential looks like right now. The first question I asked you when we recorded three years ago was, “what’s up with scaling and why does it work?” I have a similar question now, but it feels more complicated. At least from the public’s point of view, three years ago there were well-known public trends across many orders of magnitude of compute where you could see how the loss improves.

Now we have RL scaling and there’s no publicly known scaling law for it. It’s not even clear what the story is. Is this supposed to be teaching the model skills? Is it supposed to be teaching meta-learning? What is the scaling hypothesis at this point?

Dario Amodei

I actually have the same hypothesis I had even all the way back in 2017. I think I talked about it last time, but I wrote a doc called “The Big Blob of Compute Hypothesis”. It wasn’t about the scaling of language models in particular. When I wrote it GPT-1 had just come out.

That was one among many things. Back in those days there was robotics. People tried to work on reasoning as a separate thing from language models, and there was scaling of the kind of RL that happened in AlphaGo and in Dota at OpenAI. People remember StarCraft at DeepMind, AlphaStar.

It was written as a more general document. Rich Sutton put out “The Bitter Lesson” a couple years later. The hypothesis is basically the same. What it says is that all the cleverness, all the techniques, all the “we need a new method to do something”, that doesn’t matter very much. There are only a few things that matter. I think I listed seven of them.

One is how much raw compute you have. The second is the quantity of data. The third is the quality and distribution of data. It needs to be a broad distribution. The fourth is how long you train for. The fifth is that you need an objective function that can scale to the moon. The pre-training objective function is one such objective function. Another is the RL objective function that says you have a goal, you’re going to go out and reach the goal.

Within that, there’s objective rewards like you see in math and coding, and there’s more subjective rewards like you see in RLHF or higher-order versions of that. Then the sixth and seventh were things around normalization or conditioning, just getting the numerical stability so that the big blob of compute flows in this laminar way instead of running into problems.

That was the hypothesis, and it’s a hypothesis I still hold. I don’t think I’ve seen very much that is not in line with it. The pre-training scaling laws were one example of what we see there. Those have continued going. Now it’s been widely reported, we feel good about pre-training. It’s continuing to give us gains.

What has changed is that now we’re also seeing the same thing for RL. We’re seeing a pre-training phase and then an RL phase on top of that. With RL, it’s actually just the same. Even other companies have published things in some of their releases that say, “We train the model on math contests — AIME or other things — and how well the model does is log-linear in how long we’ve trained it.”

We see that as well, and it’s not just math contests. It’s a wide variety of RL tasks. We’re seeing the same scaling in RL that we saw for pre-training.
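As a rough illustration of what "log-linear in how long we've trained it" means, here is a minimal sketch that fits score = a + b * log10(training steps) by ordinary least squares. The data points and the function name `fit_log_linear` are invented for illustration; they are not from any lab's published results.

```python
import math

def fit_log_linear(steps, scores):
    """Least-squares fit of score = a + b * log10(steps)."""
    xs = [math.log10(s) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx              # points gained per 10x increase in training
    a = mean_y - b * mean_x    # extrapolated score at 1 step
    return a, b

# Hypothetical benchmark scores: each 10x in training steps adds 10 points.
steps = [1e2, 1e3, 1e4, 1e5]
scores = [20.0, 30.0, 40.0, 50.0]

a, b = fit_log_linear(steps, scores)
predicted = a + b * math.log10(1e6)  # extrapolate one more order of magnitude
print(f"slope per 10x: {b:.1f}, predicted score at 1e6 steps: {predicted:.1f}")
```

The point of the sketch is only the shape of the relationship: under a log-linear trend, every constant gain in score costs a constant multiple of training, not a constant amount.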

Dwarkesh Patel

You mentioned Rich Sutton and “The Bitter Lesson”. I interviewed him last year, and he’s actually very non-LLM-pilled. I don’t know if this is his perspective, but one way to paraphrase his objection is: Something which possesses the true core of human learning would not require all these billions of dollars of data and compute and these bespoke environments, to learn how to use Excel, how to use PowerPoint, how to navigate a web browser. The fact that we have to build in these skills using these RL environments hints that we are actually lacking a core human learning algorithm. So we’re scaling the wrong thing.

That does raise the question. Why are we doing all this RL scaling if we think there’s something that’s going to be human-like in its ability to learn on the fly?

Dario Amodei

I think this puts together several things that should be thought of differently. There is a genuine puzzle here, but it may not matter. In fact, I would guess it probably doesn’t matter. There is an interesting thing. Let me take the RL out of it for a second, because I actually think it’s a red herring to say that RL is any different from pre-training in this matter.

If we look at pre-training scaling, it was very interesting back in 2017 when Alec Radford was doing GPT-1. The models before GPT-1 were trained on datasets that didn’t represent a wide distribution of text. You had very standard language modeling benchmarks. GPT-1 itself was trained on a bunch of fanfiction, I think actually.

It was literary text, which is a very small fraction of the text you can get. In those days it was like a billion words or something, so small datasets representing a pretty narrow distribution of what you can see in the world. It didn’t generalize well. If you did better on some fanfiction corpus, it wouldn’t generalize that well to other tasks.

We had all these measures of how well it did at predicting all these other kinds of texts. It was only when you trained over all the tasks on the internet (when you did a general internet scrape from something like Common Crawl or scraping links in Reddit, which is what we did for GPT-2) that you started to get generalization.

I think we’re seeing the same thing on RL. We’re starting first with simple RL tasks like training on math competitions, then moving to broader training that involves things like code. Now we’re moving to many other tasks. I think then we’re going to increasingly get generalization. So that kind of takes out the RL vs. pre-training side of it.

But there is a puzzle either way, which is that in pre-training we use trillions of tokens. Humans don’t see trillions of words. So there is an actual sample efficiency difference here. There is actually something different here. The models start from scratch and they need much more training. But we also see that once they’re trained, if we give them a long context length of a million — the only thing blocking long context is inference — they’re very good at learning and adapting within that context.

So I don’t know the full answer to this. I think there’s something going on where pre-training is not like the process of humans learning, but it’s somewhere between the process of humans learning and the process of human evolution. We get many of our priors from evolution. Our brain isn’t just a blank slate. Whole books have been written about this.

The language models are much more like blank slates. They literally start as random weights, whereas the human brain starts with all these regions connected to all these inputs and outputs. Maybe we should think of pre-training — and for that matter, RL as well — as something that exists in the middle space between human evolution and human on-the-spot learning. And we should think of the in-context learning that the models do as something between long-term human learning and short-term human learning.

So there’s this hierarchy. There’s evolution, there’s long-term learning, there’s short-term learning, and there’s just human reaction. The LLM phases exist along this spectrum, but not necessarily at exactly the same points. There’s no analog to some of the human modes of learning; the LLMs are falling in between the points. Does that make sense?

Dwarkesh Patel

Yes, although some things are still a bit confusing. For example, if the analogy is that this is like evolution so it’s fine that it’s not sample efficient, then if we’re going to get a super sample-efficient agent from in-context learning, why are we bothering to build all these RL environments?

There are companies whose work seems to be teaching models how to use this API, how to use Slack, how to use whatever. It’s confusing to me why there’s so much emphasis on that if the kind of agent that can just learn on the fly is emerging or has already emerged.

Dario Amodei

I can’t speak for the emphasis of anyone else. I can only talk about how we think about it. The goal is not to teach the model every possible skill within RL, just as we don’t do that within pre-training. Within pre-training, we’re not trying to expose the model to every possible way that words could be put together. Rather, the model trains on a lot of things and then reaches generalization across pre-training.

That was the transition from GPT-1 to GPT-2 that I saw up close. The model reaches a point. I had these moments where I was like, “Oh yeah, you just give the model a list of numbers — this is the cost of the house, this is the square feet of the house — and the model completes the pattern and does linear regression.” Not great, but it does it, and it’s never seen that exact thing before.

So to the extent that we are building these RL environments, the goal is very similar to what was done five or ten years ago with pre-training. We’re trying to get a whole bunch of data, not because we want to cover a specific document or a specific skill, but because we want to generalize.
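To make the house-price anecdote above concrete, here is a minimal sketch of the kind of pattern-completion prompt being described, alongside the least-squares answer a model would have to approximate to "do linear regression" in context. The numbers, the prompt wording, and the helper name `least_squares_line` are all hypothetical.

```python
def least_squares_line(points):
    """Return (slope, intercept) of the ordinary least-squares line."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical (square feet, cost in dollars) pairs shown to the model.
examples = [(1000, 200_000), (1500, 300_000), (2000, 400_000)]
query_sqft = 2500

# The prompt is a bare list of numbers; the model is asked to continue it.
prompt = "\n".join(
    f"square feet: {sqft}, cost: ${cost}" for sqft, cost in examples
) + f"\nsquare feet: {query_sqft}, cost: $"

# The completion that an exact least-squares fit would produce.
slope, intercept = least_squares_line(examples)
target = slope * query_sqft + intercept
print(prompt)
print(f"least-squares completion: {target:.0f}")
```

Nothing in the prompt names the task. A model that completes it sensibly has inferred the regression from the pattern alone, which is the generalization being described.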

00:12:36 - Is diffusion cope?

Dwarkesh Patel

I think the framework you’re laying down obviously makes sense. We’re making progress toward AGI. Nobody at this point disagrees we’re going to achieve AGI this century. The crux is you say we’re hitting the end of the exponential. Somebody else looks at this and says, “We’ve been making progress since 2012, and by 2035 we’ll have a human-like agent.”

Obviously we’re seeing in these models the kinds of things that evolution did, or that learning within a human lifetime does. I want to understand what you’re seeing that makes you think it’s one year away and not ten years away.

Dario Amodei

There are two claims you could make here, one stronger and one weaker. Starting with the weaker claim, when I first saw the scaling back in 2019, I wasn’t sure. This was a 50/50 thing. I thought I saw something. My claim was that this was much more likely than anyone thinks. Maybe there’s a 50% chance this happens.

On the basic hypothesis of, as you put it, within ten years we’ll get to what I call a “country of geniuses in a data center”, I’m at 90% on that. It’s hard to go much higher than 90% because the world is so unpredictable. Maybe the irreducible uncertainty puts us at 95%, where you get to things like multiple companies having internal turmoil, Taiwan gets invaded, all the fabs get blown up by missiles.

Dwarkesh Patel

Now you’ve jinxed us, Dario.

Dario Amodei

You could construct a 5% world where things get delayed for ten years. There’s another 5% which is that I’m very confident on tasks that can be verified. With coding, except for that irreducible uncertainty, I think we’ll be there in one or two years. There’s no way we will not be there in ten years in terms of being able to do end-to-end coding.

My one little bit of fundamental uncertainty, even on long timescales, is about tasks that aren’t verifiable: planning a mission to Mars; doing some fundamental scientific discovery like CRISPR; writing a novel. It’s hard to verify those tasks. I am almost certain we have a reliable path to get there, but if there’s a little bit of uncertainty it’s there. On the ten-year timeline I’m at 90%, which is about as certain as you can be. I think it’s crazy to say that this won’t happen by 2035. In some sane world, it would be outside the mainstream.

Dwarkesh Patel

But the emphasis on verification hints to me a lack of belief that these models are generalized. If you think about humans, we’re both good at things for which we get verifiable reward and things for which we don’t.

Dario Amodei

No, this is why I’m almost sure. We already see substantial generalization from things that verify to things that don’t. We’re already seeing that.

Dwarkesh Patel

But it seems like you were emphasizing this as a spectrum which will split apart the domains in which we see more progress. That doesn’t seem like how humans get better.

Dario Amodei

The world in which we don’t get there is the world in which we do all the verifiable things. Many of them generalize, but we don’t fully get there. We don’t fully color in the other side of the box. It’s not a binary thing.

Dwarkesh Patel

Even if generalization is weak and you can only do verifiable domains, it’s not clear to me you could automate software engineering in such a world. You are “a software engineer” in some sense, but part of being a software engineer for you involves writing long memos about your grand vision.

Dario Amodei

I don’t think that’s part of the job of SWE. That’s part of the job of the company, not SWE specifically. But SWE does involve design documents and other things like that. The models are already pretty good at writing comments. Again, I’m making much weaker claims here than I believe, to distinguish between two things. We’re already almost there for software engineering.

Dwarkesh Patel

By what metric? There’s one metric which is how many lines of code are written by AI. If you consider other productivity improvements in the history of software engineering, compilers write all the lines of software. There’s a difference between how many lines are written and how big the productivity improvement is. “We’re almost there” meaning… How big is the productivity improvement, not just how many lines are written by AI?

Dario Amodei

I actually agree with you on this. I’ve made a series of predictions on code and software engineering. I think people have repeatedly misunderstood them. Let me lay out the spectrum.

About eight or nine months ago, I said the AI model will be writing 90% of the lines of code in three to six months. That happened, at least at some places. It happened at Anthropic, happened with many people downstream using our models. But that’s actually a very weak criterion. People thought I was saying that we won’t need 90% of the software engineers. Those things are worlds apart. The spectrum is: 90% of code is written by the model, 100% of code is written by the model. That’s a big difference in productivity.

90% of the end-to-end SWE tasks — including things like compiling, setting up clusters and environments, testing features, writing memos — are done by the models. 100% of today’s SWE tasks are done by the models. Even when that happens, it doesn’t mean software engineers are out of a job. There are new higher-level things they can do, where they can manage. Then further down the spectrum, there’s 90% less demand for SWEs, which I think will happen but this is a spectrum.

I wrote about it in “The Adolescence of Technology” where I went through this kind of spectrum with farming. I actually totally agree with you on that. These are very different benchmarks from each other, but we’re proceeding through them super fast.

Dwarkesh Patel

Part of your vision is that going from 90 to 100 is going to happen fast, and that it leads to huge productivity improvements. But what I notice is that even in greenfield projects people start with Claude Code or something, people report starting a lot of projects… Do we see in the world out there a renaissance of software, all these new features that wouldn’t exist otherwise? At least so far, it doesn’t seem like we see that.

So that does make me wonder. Even if I never had to intervene with Claude Code, the world is complicated. Jobs are complicated. Closing the loop on self-contained systems, whether it’s just writing software or something, how much broader gains would we see just from that? Maybe that should dilute our estimation of the “country of geniuses”.

Dario Amodei

I simultaneously agree with you that it’s a reason why these things don’t happen instantly, but at the same time, I think the effect is gonna be very fast. You could have these two poles. One is that AI is not going to make progress. It’s slow. It’s going to take forever to diffuse within the economy. Economic diffusion has become one of these buzzwords that’s a reason why we’re not going to make AI progress, or why AI progress doesn’t matter.

The other pole is that we’ll get recursive self-improvement, the whole thing. Can’t you just draw an exponential line on the curve? We’re going to have Dyson spheres around the sun so many nanoseconds after we get recursive. I’m completely caricaturing the view here, but there are these two extremes.

But what we’ve seen from the beginning, at least if you look within Anthropic, there’s this bizarre 10x per year growth in revenue that we’ve seen. So in 2023, it was zero to $100 million. In 2024, it was $100 million to $1 billion. In 2025, it was $1 billion to $9-10 billion.

Dwarkesh Patel

You guys should have just bought a billion dollars of your own products so you could just…

Dario Amodei

And the first month of this year, that exponential is... You would think it would slow down, but we added another few billion to revenue in January. Obviously that curve can’t go on forever. The GDP is only so large. I would even guess that it bends somewhat this year, but that is a fast curve. That’s a really fast curve. I would bet it stays pretty fast even as the scale goes to the entire economy.
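The arithmetic behind "that curve can't go on forever" can be sketched from the revenue figures quoted above. This is a back-of-the-envelope illustration only: the 2025 midpoint of $9.5 billion, the round $100 trillion figure for world GDP, and the helper name `years_until` are assumptions, not numbers from the conversation.

```python
import math

# Year-end revenue as quoted above; 2025 uses an assumed midpoint of $9-10B.
revenue_usd = {2023: 100e6, 2024: 1e9, 2025: 9.5e9}

# Year-over-year growth multiples.
years = sorted(revenue_usd)
multiples = {y: revenue_usd[y] / revenue_usd[y - 1] for y in years[1:]}

def years_until(start, target, multiple_per_year):
    """Years of constant multiplicative growth needed to go from start to target."""
    return math.log(target / start) / math.log(multiple_per_year)

# Assumed round number for world GDP, used only as an upper bound on revenue.
WORLD_GDP_USD = 100e12

years_at_10x = years_until(10e9, WORLD_GDP_USD, 10)
years_at_3x = years_until(10e9, WORLD_GDP_USD, 3)

print(multiples)
print(f"years to reach the assumed GDP ceiling at 10x/yr: {years_at_10x:.1f}")
print(f"years to reach the assumed GDP ceiling at 3x/yr: {years_at_3x:.1f}")
```

Under these assumptions, sustained 10x growth would hit the ceiling in roughly four years, which is why the curve has to bend well before then even if it stays fast.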

So I think we should be thinking about this middle world where things are extremely fast, but not instant, where they take time because of economic diffusion, because of the need to close the loop. Because it’s fiddly: “I have to do change management within my enterprise… I set this up, but I have to change the security permissions on this in order to make it actually work… I had this old piece of software that checks the model before it’s compiled and released and I have to rewrite it. Yes, the model can do that, but I have to tell the model to do that. It has to take time to do that.”

So I think everything we’ve seen so far is compatible with the idea that there’s one fast exponential that’s the capability of the model. Then there’s another fast exponential that’s downstream of that, which is the diffusion of the model into the economy. Not instant, not slow, much faster than any previous technology, but it has its limits. When I look inside Anthropic, when I look at our customers: fast adoption, but not infinitely fast.

Dwarkesh Patel

Can I try a hot take on you?

Dario Amodei

Yeah.

Dwarkesh Patel

I feel like diffusion is cope that people say. When the model isn’t able to do something, they’re like, “oh, but it’s a diffusion issue.” But then you should use the comparison to humans. You would think that the inherent advantages that AIs have would make diffusion a much easier problem for new AIs getting onboarded than new humans getting onboarded. An AI can read your entire Slack and your drive in minutes. They can share all the knowledge that the other copies of the same instance have. You don’t have this adverse selection problem when you’re hiring AI, so you can just hire copies of a vetted AI model.

Hiring a human is so much more of a hassle. People hire humans all the time. We pay humans upwards of $50 trillion in wages because they’re useful, even though in principle it would be much easier to integrate AIs into the economy than it is to hire humans. The diffusion doesn’t really explain.

Dario Amodei

I think diffusion is very real and doesn’t exclusively have to do with limitations on the AI models. Again, there are people who use diffusion as kind of a buzzword to say this isn’t a big deal. I’m not talking about that. I’m not talking about how AI will diffuse at the speed of previous technologies. I think AI will diffuse much faster than previous technologies have, but not infinitely fast.

I’ll just give an example of this. There’s Claude Code. Claude Code is extremely easy to set up. If you’re a developer, you can just start using Claude Code. There is no reason why a developer at a large enterprise should not be adopting Claude Code as quickly as an individual developer or developer at a startup.

We do everything we can to promote it. We sell Claude Code to enterprises. Big enterprises, big financial companies, big pharmaceutical companies, all of them are adopting Claude Code much faster than enterprises typically adopt new technology. But again, it takes time.

Any given feature or any given product, like Claude Code or Cowork, will get adopted by the individual developers who are on Twitter all the time, by the Series A startups, many months faster than they will get adopted by a large enterprise that does food sales. There are just a number of factors. You have to go through legal, you have to provision it for everyone. It has to pass security and compliance.

The leaders of the company who are further away from the AI revolution are forward-looking, but they have to say, “Oh, it makes sense for us to spend 50 million. This is what this Claude Code thing is. This is why it helps our company. This is why it makes us more productive.” Then they have to explain to the people two levels below. They have to say, “Okay, we have 3,000 developers. Here’s how we’re going to roll it out to our developers.” We have conversations like this every day.

We are doing everything we can to make Anthropic’s revenue grow 20 or 30x a year instead of 10x a year. Again, many enterprises are just saying, “This is so productive. We’re going to take shortcuts in our usual procurement process.” They’re moving much faster than when we tried to sell them just the ordinary API, which many of them use. Claude Code is a more compelling product, but it’s not an infinitely compelling product.

I don’t think even AGI or powerful AI or “country of geniuses in a data center” will be an infinitely compelling product. It will be a compelling enough product to get maybe 3-5x, or 10x, a year of growth, even when you’re in the hundreds of billions of dollars, which is extremely hard to do and has never been done in history before, but not infinitely fast.

Dwarkesh Patel

I buy that it would be a slight slowdown. Maybe this is not your claim, but sometimes people talk about this like, “Oh, the capabilities are there, but because of diffusion... otherwise we’re basically at AGI”.

Dario Amodei

I don’t believe we’re basically at AGI.

Dwarkesh Patel

I think if you had the “country of geniuses in a data center”...

Dario Amodei

If we had the “country of geniuses in a data center”, we would know it. We would know it if you had the “country of geniuses in a data center”. Everyone in this room would know it. Everyone in Washington would know it. People in rural parts might not know it, but we would know it. We don’t have that now. That is very clear.

00:29:42 - Is continual learning necessary? How will it be solved?

Dwarkesh Patel

Coming back to concrete prediction… Because there are so many different things to disambiguate, it can be easy to talk past each other when we’re talking about capabilities. For example, when I interviewed you three years ago, I asked you for a prediction about what we should expect three years from now. You were right. You said, “We should expect systems which, if you talk to them for the course of an hour, it’s hard to tell them apart from a generally well-educated human.”

I think you were right about that. I think spiritually I feel unsatisfied because my internal expectation was that such a system could automate large parts of white-collar work. So it might be more productive to talk about the actual end capabilities you want from such a system.

Dario Amodei

I will basically tell you where I think we are.

Dwarkesh Patel

Let me ask a very specific question so that we can figure out exactly what kinds of capabilities we should think about soon. Maybe I’ll ask about it in the context of a job I understand well, not because it’s the most relevant job, but just because I can evaluate the claims about it.

Take video editors. I have video editors. Part of their job involves learning about our audience’s preferences, learning about my preferences and tastes, and the different trade-offs we have. Over the course of many months, they’re building up this understanding of context. When should we expect an AI system that can pick up, on the job and on the fly, the skill and ability they have six months into the job?

Dario Amodei

I guess what you’re talking about is that we’re doing this interview for three hours. Someone’s going to come in, someone’s going to edit it. They’re going to be like, “Oh, I don’t know, Dario scratched his head and we could edit that out.”

Dwarkesh Patel

“Magnify that.”

Dario Amodei

“There was this long discussion that is less interesting to people. There’s another thing that’s more interesting to people, so let’s make this edit.”

I think the “country of geniuses in a data center” will be able to do that. The way it will be able to do that is it will have general control of a computer screen. You’ll be able to feed this in. It’ll be able to also use the computer screen to go on the web, look at all your previous interviews, look at what people are saying on Twitter in response to your interviews, talk to you, ask you questions, talk to your staff, look at the history of edits that you did, and from that, do the job.

I think that’s dependent on several things. I think this is one of the things that’s actually blocking deployment: getting to the point on computer use where the models are really masters at using the computer.

We’ve seen this climb in benchmarks, and benchmarks are always imperfect measures. But I think when we first released computer use a year and a quarter ago, OSWorld was at maybe 15%. I don’t remember exactly, but we’ve climbed from that to 65-70%. There may be harder measures as well, but I think computer use has to pass a point of reliability.

Dwarkesh Patel

Can I just follow up on that before you move on to the next point? For years, I’ve been trying to build different internal LLM tools for myself. Often I have these text-in, text-out tasks, which should be dead center in the repertoire of these models. Yet I still hire humans to do them.

If it’s something like, “identify what the best clips would be in this transcript”, maybe the LLMs do a seven-out-of-ten job on them. But there’s not this ongoing way I can engage with them to help them get better at the job the way I could with a human employee. That missing ability, even if you solve computer use, would still block my ability to offload an actual job to them.

Dario Amodei

This gets back to what we were talking about before with learning on the job. It’s very interesting. I think with the coding agents, I don’t think people would say that learning on the job is what is preventing the coding agents from doing everything end to end. They keep getting better. We have engineers at Anthropic who don’t write any code.

When I look at the productivity, to your previous question, we have folks who say, “This GPU kernel, this chip, I used to write it myself. I just have Claude do it.” There’s this enormous improvement in productivity.

When I look at Claude Code, a lack of familiarity with the codebase, or a feeling that the model hasn’t worked at the company for a year, is not high up on the list of complaints I see. I think what I’m saying is that we’re kind of taking a different path.

Dwarkesh Patel

Don’t you think with coding that’s because there is an external scaffold of memory which exists instantiated in the codebase? I don’t know how many other jobs have that. Coding made fast progress precisely because it has this unique advantage that other economic activity doesn’t.

Dario Amodei

But when you say that, what you’re implying is that by reading the codebase into the context, I have everything that the human needed to learn on the job. So that would be an example of a case, whether it’s written or not, whether it’s available or not, where everything you needed to know you got from the context window. What we think of as learning (“I started this job, it’s going to take me six months to understand the codebase”), the model just did it in the context.

Dwarkesh Patel

I honestly don’t know how to think about this because there are people who qualitatively report what you’re saying. I’m sure you saw last year, there was a major study where they had experienced developers try to close pull requests in repositories that they were familiar with. Those developers reported an uplift. They reported that they felt more productive with the use of these models. But in fact, if you look at their output and how much was actually merged back in, there was a 20% downlift. They were less productive as a result of using these models.

So I’m trying to square the qualitative feeling that people have with these models against two things: 1) at a macro level, where is this renaissance of software? And 2) when people do these independent evaluations, why are we not seeing the productivity benefits we would expect?

Dario Amodei

Within Anthropic, this is just really unambiguous. We’re under an incredible amount of commercial pressure, and we make it even harder for ourselves because of all the safety work we do, which I think we do more of than other companies.

The pressure to survive economically while also keeping our values is just incredible. We’re trying to keep this 10x revenue curve going. There is zero time for bullshit. There is zero time for feeling like we’re productive when we’re not. These tools make us a lot more productive.

Why do you think we’re concerned about competitors using the tools? Because we think we’re ahead of the competitors. We wouldn’t be going through all this trouble if this were secretly reducing our productivity. We see the end productivity every few months in the form of model launches. There’s no kidding yourself about this. The models make you more productive.

Dwarkesh Patel

1) People feeling like they’re productive is qualitatively predicted by studies like this. But 2) if I just look at the end output, obviously you guys are making fast progress.

But the idea was supposed to be that with recursive self-improvement, you make a better AI, the AI helps you build a better next AI, et cetera, et cetera. What I see instead—if I look at you, OpenAI, DeepMind—is that people are just shifting around the podium every few months.

Maybe you think that stops because you’ve won or whatever. But why are we not seeing the person with the best coding model have this lasting advantage if in fact there are these enormous productivity gains from the last coding model?

Dario Amodei

I think my model of the situation is that there’s an advantage that’s gradually growing. I would say right now the coding models give maybe, I don’t know, a 15-20% total factor speed up. That’s my view. Six months ago, it was maybe 5%. So it didn’t matter. 5% doesn’t register. It’s now just getting to the point where it’s one of several factors that kind of matters. That’s going to keep speeding up.

I think six months ago, there were several companies that were at roughly the same point because this wasn’t a notable factor, but I think it’s starting to speed up more and more. I would also say there are multiple companies that write models that are used for code and we’re not perfectly good at preventing some of these other companies from using our models internally. So I think everything we’re seeing is consistent with this kind of snowball model.

Again, my theme in all of this is soft takeoff: soft, smooth exponentials, although the exponentials are relatively steep. So we’re seeing this snowball gather momentum where it’s like 10%, 20%, 25%, 40%. As you go, because of Amdahl’s law, you have to get all the things that are preventing you from closing the loop out of the way. But this is one of the biggest priorities within Anthropic.
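The Amdahl's law point above can be made concrete with a small sketch. Amdahl's law says that if only a fraction of the work is accelerated, the remaining fraction caps the overall speedup. The fractions and local speedups below are hypothetical illustrations, not figures from the conversation.

```python
def amdahl_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Overall speedup when only part of the work is accelerated (Amdahl's law).

    accelerated_fraction: share of total work that the tool speeds up (0..1).
    local_speedup: how much faster that share becomes (must be > 0).
    """
    if not 0.0 <= accelerated_fraction <= 1.0:
        raise ValueError("accelerated_fraction must be between 0 and 1")
    if local_speedup <= 0.0:
        raise ValueError("local_speedup must be positive")
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / local_speedup)


if __name__ == "__main__":
    # Hypothetical values: even a huge local speedup is capped by the
    # part of the loop that has not been accelerated yet.
    for fraction in (0.2, 0.5, 0.9):
        for local in (2.0, 10.0, 1000.0):
            overall = amdahl_speedup(fraction, local)
            print(f"fraction={fraction:.1f} local={local:>6.0f}x overall={overall:.2f}x")
```

With half the work accelerated, the overall speedup stays below 2x however fast that half becomes, which is why removing the remaining bottlenecks matters more as the tools improve.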

Dwarkesh Patel

Stepping back, before in the stack we were talking about when do we get this on-the-job learning? It seems like the point you were making on the coding thing is that we actually don’t need on-the-job learning. You can have tremendous productivity improvements, you can have potentially trillions of dollars of revenue for AI companies, without this basic human ability to learn on the job. Maybe that’s not your claim, you should clarify.

But in most domains of economic activity, people say, “I hired somebody, they weren’t that useful for the first few months, and then over time they built up the context, understanding.” It’s actually hard to define what we’re talking about here. But they got something and then now they’re a powerhouse and they’re so valuable to us. If AI doesn’t develop this ability to learn on the fly, I’m a bit skeptical that we’re going to see huge changes to the world.

Dario Amodei

I think two things here. There’s the state of the technology right now. Again, we have these two stages. We have the pre-training and RL stage where you throw a bunch of data and tasks into the models and then they generalize. So it’s like learning, but it’s like learning from more data and not learning over one human or one model’s lifetime. So again, this is situated between evolution and human learning. But once you learn all those skills, you have them.

Just like with pre-training, just how the models know more, if I look at a pre-trained model, it knows more about the history of samurai in Japan than I do. It knows more about baseball than I do. It knows more about low-pass filters and electronics, all of these things. Its knowledge is way broader than mine. So I think even just that may get us to the point where the models are better at everything.

We also have, again, just with scaling the kind of existing setup, the in-context learning. I would describe it as kind of like human on-the-job learning, but a little weaker and a little short term. You look at in-context learning and if you give the model a bunch of examples it does get it. There’s real learning that happens in context. A million tokens is a lot. That can be days of human learning. If you think about the model reading a million words, how long would it take me to read a million? Days or weeks at least.

So you have these two things. I think these two things within the existing paradigm may just be enough to get you the “country of geniuses in a data center”. I don’t know for sure, but I think they’re going to get you a large fraction of it. There may be gaps, but I certainly think that just as things are, this is enough to generate trillions of dollars of revenue. That’s one.

Two, is this idea of continual learning, this idea of a single model learning on the job. I think we’re working on that too. There’s a good chance that in the next year or two, we also solve that. Again, I think you get most of the way there without it. The trillions of dollars a year market, maybe all of the national security implications and the safety implications that I wrote about in “Adolescence of Technology” can happen without it. But we, and I imagine others, are working on it. There’s a good chance that we will get there within the next year or two.

There are a bunch of ideas. I won’t go into all of them in detail, but one is just to make the context longer. There’s nothing preventing longer contexts from working. You just have to train at longer contexts and then learn to serve them at inference. Both of those are engineering problems that we are working on and I would assume others are working on them as well.

Dwarkesh Patel

On this context length increase: it seemed like there was a period from 2020 to 2023 where, from GPT-3 to GPT-4 Turbo, context length went from about 2,000 tokens to 128K. I feel like for the two-ish years since then, we’ve been in the same-ish ballpark.

When context lengths get much longer than that, people report qualitative degradation in the ability of the model to consider that full context. So I’m curious what you’re internally seeing that makes you think, “10 million contexts, 100 million contexts to get six months of human learning and building context”.

Dario Amodei

This isn’t a research problem. This is an engineering and inference problem. If you want to serve long context, you have to store your entire KV cache. It’s difficult to store all the memory in the GPUs, to juggle the memory around. I don’t even know the details. At this point, this is at a level of detail that I’m no longer able to follow, although I knew it in the GPT-3 era. “These are the weights, these are the activations you have to store…”

But these days the whole thing is flipped because we have MoE models and all of that. Regarding this degradation you’re talking about, without getting too specific, there’s two things. There’s the context length you train at and there’s a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations. It’s better than nothing, you might still offer it, but you get these degradations. Maybe it’s harder to train at a long context length.
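To illustrate why storing the KV cache makes long-context serving a memory problem, here is a rough sizing sketch. It assumes a standard dense transformer with grouped-query attention, where each token stores one key and one value vector per layer per KV head. The model dimensions are hypothetical and do not describe any particular production model; MoE and other architectural choices change the picture.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    context_tokens: int,
    bytes_per_value: int = 2,  # assumption: 16-bit cache entries
    batch_size: int = 1,
) -> int:
    """Approximate KV cache size for a standard attention stack."""
    # Factor of 2: one key vector plus one value vector per token,
    # per layer, per KV head.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


if __name__ == "__main__":
    # Hypothetical dimensions chosen only for illustration.
    layers, kv_heads, dim = 80, 8, 128
    for tokens in (128 * 1024, 1_000_000, 10_000_000):
        size = kv_cache_bytes(layers, kv_heads, dim, tokens)
        print(f"{tokens:>12,} tokens -> {size / 1e9:,.1f} GB per sequence")
```

Under these assumptions the cache grows linearly with context length, so a 10x longer context needs 10x the cache memory per sequence.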

Dwarkesh Patel

I want to, at the same time, ask about maybe some rabbit holes. Wouldn’t you expect that if you had to train on longer context lengths, that would mean that you’re able to get fewer samples in for the same amount of compute? Maybe it’s not worth diving deep on that.

I want to get an answer to the bigger-picture question. In what year do you predict that I won’t feel a preference for a human editor who’s been working for me for six months over an AI that’s been working with me for six months?

Dario Amodei

My guess for that is there’s a lot of problems where basically we can do this when we have the “country of geniuses in a data center”. My picture for that, if you made me guess, is one to two years, maybe one to three years. It’s really hard to tell. I have a strong view—99%, 95%—that all this will happen in 10 years. I think that’s just a super safe bet. I have a hunch—this is more like a 50/50 thing—that it’s going to be more like one to two, maybe more like one to three.

Dwarkesh Patel

So one to three years. Country of geniuses, and the slightly less economically valuable task of editing videos.

Dario Amodei

It seems pretty economically valuable, let me tell you. It’s just there are a lot of use cases like that. There are a lot of similar ones.

00:46:20 - If AGI is imminent, why not buy more compute?

Dwarkesh Patel

So you’re predicting that within one to three years. And then, generally, Anthropic has predicted that by late ‘26 or early ‘27 we will have AI systems that “have the ability to navigate interfaces available to humans doing digital work today, intellectual capabilities matching or exceeding that of Nobel Prize winners, and the ability to interface with the physical world”. You gave an interview two months ago with DealBook where you were emphasizing your company’s more responsible compute scaling as compared to your competitors.

I’m trying to square these two views. If you really believe that we’re going to have a country of geniuses, you want as big a data center as you can get. There’s no reason to slow down. The TAM of a Nobel Prize winner, that can actually do everything a Nobel Prize winner can do, is trillions of dollars. So I’m trying to square this conservatism, which seems rational if you have more moderate timelines, with your stated views about progress.

Dario Amodei

It actually all fits together. We go back to this fast, but not infinitely fast, diffusion. Let’s say that we’re making progress at this rate. The technology is making progress this fast. I have very high conviction that we’re going to get there within a few years. I have a hunch that we’re going to get there within a year or two. So there’s a little uncertainty on the technical side, but pretty strong confidence that it won’t be off by much.

What I’m less certain about is, again, the economic diffusion side. I really do believe that we could have models that are a country of geniuses in the data center in one to two years. One question is: How many years after that do the trillions in revenue start rolling in? I don’t think it’s guaranteed that it’s going to be immediate. It could be one year, it could be two years, I could even stretch it to five years although I’m skeptical of that.

So we have this uncertainty. Even if the technology goes as fast as I suspect that it will, we don’t know exactly how fast it’s going to drive revenue. We know it’s coming, but with the way you buy these data centers, if you’re off by a couple years, that can be ruinous. It is just like how I wrote in “Machines of Loving Grace”. I said I think we might get this powerful AI, this “country of geniuses in a data center”. That description you gave comes from “Machines of Loving Grace”. I said we’ll get that in 2026, maybe 2027. Again, that is my hunch. I wouldn’t be surprised if I’m off by a year or two, but that is my hunch.

Let’s say that happens. That’s the starting gun. How long does it take to cure all the diseases? That’s one of the ways that drives a huge amount of economic value. You cure every disease. There’s a question of how much of that goes to the pharmaceutical company or the AI company, but there’s an enormous consumer surplus because, assuming we can get access for everyone, which I care about greatly, we cure all of these diseases.

How long does it take? You have to do the biological discovery, you have to manufacture the new drug, you have to go through the regulatory process. We saw this with vaccines and COVID. We got the vaccine out to everyone, but it took a year and a half. My question is: How long does it take to get the cure for everything—which AI is the genius that can in theory invent—out to everyone? How long from when that AI first exists in the lab to when diseases have actually been cured for everyone?

We’ve had a polio vaccine for 50 years. We’re still trying to eradicate it in the most remote corners of Africa. The Gates Foundation is trying as hard as they can. Others are trying as hard as they can. But that’s difficult. Again, I don’t expect most of the economic diffusion to be as difficult as that. That’s the most difficult case. But there’s a real dilemma here. Where I’ve settled on it is that it will be faster than anything we’ve seen in the world, but it still has its limits.

So when we go to buying data centers, again, the curve I’m looking at is: we’ve had a 10x a year increase every year. At the beginning of this year, we’re looking at $10 billion in annualized revenue. We have to decide how much compute to buy. It takes a year or two to actually build out the data centers, to reserve the data center.

Basically I’m saying, “In 2027, how much compute do I get?” I could assume that the revenue will continue growing 10x a year, so it’ll be $100 billion at the end of 2026 and $1 trillion at the end of 2027. Actually it would be $5 trillion of compute because it would be $1 trillion a year for five years. I could buy $1 trillion of compute that starts at the end of 2027. If my revenue is not $1 trillion, if it’s even $800 billion, there’s no force on earth, there’s no hedge on earth that could stop me from going bankrupt if I buy that much compute.

Even though a part of my brain wonders if it’s going to keep growing 10x, I can’t buy $1 trillion a year of compute in 2027. If I’m just off by a year in that rate of growth, or if the growth rate is 5x a year instead of 10x a year, then you go bankrupt. So you end up in a world where you’re supporting hundreds of billions, not trillions. You accept some risk that there’s so much demand that you can’t support the revenue, and you accept some risk that you got it wrong and it’s still slow.
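The arithmetic behind this purchasing dilemma can be sketched as follows. The starting revenue ($10 billion), the growth rates (10x versus 5x), and the $1 trillion a year commitment come from the discussion above. The function names are illustrative, and treating all revenue as available to pay for compute is a simplifying assumption.

```python
def project_revenue(start_billion: int, yearly_growth: int, years: int) -> list[int]:
    """Annualized revenue at the start and after each year of growth."""
    revenue = [start_billion]
    for _ in range(years):
        revenue.append(revenue[-1] * yearly_growth)
    return revenue


def shortfall(committed_billion: int, revenue_billion: int) -> int:
    """Gap between a yearly compute commitment and the revenue to pay for it."""
    return max(0, committed_billion - revenue_billion)


if __name__ == "__main__":
    commitment = 1000  # $1 trillion a year of compute, starting end of 2027

    on_trend = project_revenue(10, 10, 2)  # start of 2026, end 2026, end 2027
    slower = project_revenue(10, 5, 2)     # growth is 5x a year instead of 10x
    one_year_late = on_trend[1]            # 10x growth, but arriving a year late

    print("10x a year:", on_trend, "shortfall", shortfall(commitment, on_trend[-1]))
    print("5x a year: ", slower, "shortfall", shortfall(commitment, slower[-1]))
    print("a year late:", one_year_late, "shortfall", shortfall(commitment, one_year_late))
```

In this sketch, a 5x growth rate leaves $250 billion of revenue against a $1 trillion yearly bill, and a one-year delay leaves $100 billion, which is the asymmetry being described.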

When I talked about behaving responsibly, what I meant actually was not the absolute amount. I think it is true we’re spending somewhat less than some of the other players. It’s actually the other things, like have we been thoughtful about it or are we YOLOing and saying, “We’re going to do $100 billion here or $100 billion there”? I get the impression that some of the other companies have not written down the spreadsheet, that they don’t really understand the risks they’re taking. They’re just doing stuff because it sounds cool.

We’ve thought carefully about it. We’re an enterprise business. Therefore, we can rely more on revenue. It’s less fickle than consumer. We have better margins, which is the buffer between buying too much and buying too little. I think we bought an amount that allows us to capture pretty strong upside worlds. It won’t capture the full 10x a year. Things would have to go pretty badly for us to be in financial trouble. So we’ve thought carefully and we’ve made that balance. That’s what I mean when I say that we’re being responsible.

Dwarkesh Patel

So it seems like it’s possible that we actually just have different definitions of the “country of geniuses in a data center”. Because when I think of actual human geniuses, an actual country of human geniuses in a data center, I would happily buy $5 trillion worth of compute to run an actual country of human geniuses in a data center.

Let’s say JPMorgan or Moderna or whatever doesn’t want to use them. I’ve got a country of geniuses. They’ll start their own company. If they can’t start their own company and they’re bottlenecked by clinical trials… It is worth stating that with clinical trials, most clinical trials fail because the drug doesn’t work. There’s not efficacy.

Dario Amodei

I make exactly that point in “Machines of Loving Grace”. I say the clinical trials are going to go much faster than we’re used to, but not infinitely fast.

Dwarkesh Patel

Okay, and then suppose it takes a year for the clinical trials to work out so that you’re getting revenue from that and can make more drugs. Okay, well, you’ve got a country of geniuses and you’re an AI lab. You could use many more AI researchers. You also think there are these self-reinforcing gains from smart people working on AI tech. You can have the data center working on AI progress.

Dario Amodei

Are there substantially more gains from buying $1 trillion a year of compute versus $300 billion a year of compute?

Dwarkesh Patel

If your competitor is buying a trillion, yes there is.

Dario Amodei

Well, no, there’s some gain, but then again, there’s this chance that they go bankrupt before. Again, if you’re off by only a year, you destroy yourselves. That’s the balance. We’re buying a lot. We’re buying a hell of a lot. We’re buying an amount that’s comparable to what the biggest players in the game are buying.

But if you’re asking me, “Why haven’t we signed $10 trillion of compute starting in mid-2027?”... First of all, it can’t be produced. There isn’t that much in the world. But second, what if the country of geniuses comes, but it comes in mid-2028 instead of mid-2027? You go bankrupt.

Dwarkesh Patel

So if your projection is one to three years, it seems like you should want $10 trillion of compute by 2029 at the latest? Even in the longest version of the timelines you state, the compute you are ramping up to build doesn’t seem to be in accordance with that.

Dario Amodei

What makes you think that?

Dwarkesh Patel

Human wages, let’s say, are on the order of $50 trillion a year—

Dario Amodei

So I won’t talk about Anthropic in particular, but if you talk about the industry, the amount of compute the industry is building this year is probably, call it, 10-15 gigawatts. It goes up by roughly 3x a year. So next year’s 30-40 gigawatts. 2028 might be 100 gigawatts. 2029 might be like 300 gigawatts. I’m doing the math in my head, but each gigawatt costs maybe $10 billion, on the order of $10-15 billion a year.

You put that all together and you’re getting about what you described. You’re getting exactly that. You’re getting multiple trillions a year by 2028 or 2029. You’re getting exactly what you predict.
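The mental math above can be written out as a short sketch. The 10-15 gigawatt starting range, the 3x yearly growth, and the $10-15 billion per gigawatt per year come from the conversation. Treating "this year" as 2026 is an assumption based on the surrounding dates, and the function name is illustrative.

```python
def industry_buildout(
    start_year: int,
    gw_low: int,
    gw_high: int,
    yearly_growth: int,
    cost_low_billion: int,
    cost_high_billion: int,
    years: int,
) -> list[tuple[int, int, int, int, int]]:
    """Rows of (year, low GW, high GW, low yearly cost $B, high yearly cost $B)."""
    rows = []
    for offset in range(years):
        rows.append(
            (
                start_year + offset,
                gw_low,
                gw_high,
                gw_low * cost_low_billion,
                gw_high * cost_high_billion,
            )
        )
        gw_low *= yearly_growth
        gw_high *= yearly_growth
    return rows


if __name__ == "__main__":
    # Assumption: "this year" is 2026.
    for year, lo_gw, hi_gw, lo_cost, hi_cost in industry_buildout(2026, 10, 15, 3, 10, 15, 4):
        print(f"{year}: {lo_gw}-{hi_gw} GW, ${lo_cost:,}-{hi_cost:,} billion a year")
```

Under these inputs the industry-wide bill crosses roughly $1 trillion a year in 2028 and reaches several trillion in 2029, which matches the "multiple trillions a year by 2028 or 2029" estimate.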

Dwarkesh Patel

That’s for the industry.

Dario Amodei

That’s for the industry, that’s right.

Dwarkesh Patel

Suppose Anthropic’s compute keeps 3x-ing a year, and then by 2027-28, you have 10 gigawatts. Multiply that by, as you say, $10 billion. So then it’s like $100 billion a year. But then you’re saying the TAM by 2028 is $200 billion.

Dario Amodei

Again, I don’t want to give exact numbers for Anthropic, but these numbers are too small.

Dwarkesh Patel

Okay, interesting.

00:58:49 - How will AI labs actually make profit?

Dwarkesh Patel

You’ve told investors that you plan to be profitable starting in 2028. This is the year when we’re potentially getting the country of geniuses in a data center. This is now going to unlock all this progress in medicine and health and new technologies. Wouldn’t this be exactly the time where you’d want to reinvest in the business and build bigger “countries” so they can make more discoveries?

Dario Amodei

Profitability is this kind of weird thing in this field. I don’t think in this field profitability is actually a measure of spending down versus investing in the business. Let’s just take a model of this. I actually think profitability happens when you underestimated the amount of demand you were going to get and loss happens when you overestimated the amount of demand you were going to get, because you’re buying the data centers ahead of time.

Think about it this way. Again, these are stylized facts. These numbers are not exact. I’m just trying to make a toy model here. Let’s say half of your compute is for training and half of your compute is for inference. The inference has some gross margin that’s more than 50%.

So what that means is that if you were in steady-state, you build a data center and if you knew exactly the demand you were getting, you would get a certain amount of revenue. Let’s say you pay $100 billion a year for compute. On $50 billion a year you support $150 billion of revenue. The other $50 billion is used for training. Basically you’re profitable and you make $50 billion of profit. Those are the economics of the industry today, or not today but where we’re projecting forward in a year or two.

The only thing that makes that not the case is if you get less demand than $50 billion. Then you have more than 50% of your data center for research and you’re not profitable. So you train stronger models, but you’re not profitable. If you get more demand than you thought, then research gets squeezed, but you’re kind of able to support more inference and you’re more profitable.

Maybe I’m not explaining it well, but the thing I’m trying to say is that you decide the amount of compute first. Then you have some target desire of inference versus training, but that gets determined by demand. It doesn’t get determined by you.
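The toy model above can be sketched in a few lines. The stylized numbers ($100 billion a year of compute, with $50 billion of inference supporting $150 billion of revenue) are the ones given in the conversation. The function and field names are illustrative, and counting compute as the only cost is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class YearResult:
    inference_compute: float  # $B of compute spent serving customers
    training_compute: float   # $B of compute left over for training/research
    revenue: float            # $B
    profit: float             # $B, counting only compute as a cost


def simulate_year(
    total_compute: float,
    demand_revenue: float,
    revenue_per_inference_dollar: float = 3.0,  # $150B revenue on $50B inference
) -> YearResult:
    """Compute is bought first; demand then decides the inference/training split."""
    inference_needed = demand_revenue / revenue_per_inference_dollar
    inference_used = min(inference_needed, total_compute)
    training = total_compute - inference_used
    revenue = inference_used * revenue_per_inference_dollar
    return YearResult(inference_used, training, revenue, revenue - total_compute)


if __name__ == "__main__":
    for demand in (90.0, 150.0, 210.0):
        print(f"demand ${demand:.0f}B ->", simulate_year(100.0, demand))
```

With demand exactly as predicted ($150 billion), the sketch gives $50 billion of profit. With less demand, more compute is left for training but the year is unprofitable. With more demand, training gets squeezed and profit rises.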

Dwarkesh Patel

What I’m hearing is the reason you’re predicting profit is that you are systematically underinvesting in compute?

Dario Amodei

No, no, no. I’m saying it’s hard to predict. These things about 2028 and when it will happen, that’s our attempt to do the best we can with investors. All of this stuff is really uncertain because of the cone of uncertainty. We could be profitable in 2026 if the revenue grows fast enough. If we overestimate or underestimate the next year, that could swing wildly.

What I’m trying to get at is that you have a model in your head of a business that invests, invests, invests, gets scale and then becomes profitable. There’s a single point at which things turn around. I don’t think the economics of this industry work that way.

Dwarkesh Patel

I see. So if I’m understanding correctly, you’re saying that because of the discrepancy between the amount of compute we should have gotten and the amount of compute we got, we were sort of forced to make profit. But that doesn’t mean we’re going to continue making profit. We’re going to reinvest the money because now AI has made so much progress and we want a bigger country of geniuses. So we’re back to a world where revenue is high, but losses are also high.

Dario Amodei

If every year we predict exactly what the demand is going to be, we’ll be profitable every year. Because spending 50% of your compute on research, roughly, plus a gross margin that’s higher than 50% and correct demand prediction leads to profit. That’s the profitable business model that I think is kind of there, but obscured by these building ahead and prediction errors.

Dwarkesh Patel

I guess you’re treating the 50% as a sort of given constant, whereas in fact, if AI progress is fast and you can increase the progress by scaling up more, you should just have more than 50% and not make profit.

Dario Amodei

But here’s what I’ll say. You might want to scale it up more. Remember the log returns to scale. If 70% would get you a very little bit of a smaller model through a factor of 1.4x... That extra $20 billion, each dollar there is worth much less to you because of the log-linear setup.

So you might find that it’s better to invest that $20 billion in serving inference or in hiring engineers who are kind of better at what they’re doing. So the reason I said 50%... That’s not exactly our target. It’s not exactly going to be 50%. It’ll probably vary over time.

What I’m saying is that the log-linear return leads you to spend an order-one fraction of the business. Like not 5%, not 95%. Then you get diminishing returns.

Dwarkesh Patel

I feel strange that I’m convincing Dario to believe in AI progress or something. Okay, you don’t invest in research because it has diminishing returns, but you invest in the other things you mentioned. I think profit at a sort of macro level—

Dario Amodei

Again, I’m talking about diminishing returns, but after you’re spending $50 billion a year.

Dwarkesh Patel

This is a point I’m sure you would make, but diminishing returns on a genius could be quite high.

More generally, what is profit in a market economy? Profit is basically saying other companies in the market can do more things with this money than I can.

Dario Amodei

Put aside Anthropic. I don’t want to give information about Anthropic. That’s why I’m giving these stylized numbers. But let’s just derive the equilibrium of the industry. Why doesn’t everyone spend 100% of their compute on training and not serve any customers? It’s because if they didn’t get any revenue, they couldn’t raise money, they couldn’t do compute deals, they couldn’t buy more compute the next year.

So there’s going to be an equilibrium where every company spends less than 100% on training and certainly less than 100% on inference. It should be clear why you don’t just serve the current models and never train another model, because then you don’t have any demand because you’ll fall behind. So there’s some equilibrium. It’s not gonna be 10%, it’s not gonna be 90%. Let’s just say as a stylized fact, it’s 50%. That’s what I’m getting at.

I think we’re gonna be in a position where that equilibrium of how much you spend on training is less than the gross margins that you’re able to get on compute. So the underlying economics are profitable. The problem is you have this hellish demand prediction problem when you’re buying the next year of compute and you might guess under and be very profitable but have no compute for research. Or you might guess over and you are not profitable and you have all the compute for research in the world. Does that make sense? Just as a dynamic model of the industry?

Dwarkesh Patel

Maybe stepping back, I’m not saying I think the “country of geniuses” is going to come in two years and therefore you should buy this compute. To me, the end conclusion you’re arriving at makes a lot of sense. But that’s because it seems like “country of geniuses” is hard and there’s a long way to go. So stepping back, the thing I’m trying to get at is more that it seems like your worldview is compatible with somebody who says, “We’re like 10 years away from a world in which we’re generating trillions of dollars of value.”

Dario Amodei

That’s just not my view. So I’ll make another prediction. It is hard for me to see that there won’t be trillions of dollars in revenue before 2030. I can construct a plausible world. It takes maybe three years. That would be the end of what I think is plausible.

Like in 2028, we get the real “country of geniuses in the data center”. The revenue’s going into the low hundreds of billions by 2028, and then the country of geniuses accelerates it to trillions. We’re basically on the slow end of diffusion. It takes two years to get to the trillions. That would be the world where it takes until 2030. I suspect even composing the technical exponential and diffusion exponential, we’ll get there before 2030.

Dwarkesh Patel

So you laid out a model where Anthropic makes profit because it seems like fundamentally we’re in a compute-constrained world. So eventually we keep growing compute—

Dario Amodei

I think the way the profit comes is… Again, let’s just abstract the whole industry here. Let’s just imagine we’re in an economics textbook. We have a small number of firms. Each can invest a limited amount. Each can invest some fraction in R&D. They have some marginal cost to serve. The gross profit margins on that marginal cost are very high because inference is efficient. There’s some competition, but the models are also differentiated.

Companies will compete to push their research budgets up. But because there’s a small number of players, we have the... What is it called? The Cournot equilibrium, I think, is what the small number of firm equilibrium is. The point is it doesn’t equilibrate to perfect competition with zero margins. If there’s three firms in the economy and all are kind of independently behaving rationally, it doesn’t equilibrate to zero.
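
The Cournot point can be made concrete with a minimal sketch. It assumes a textbook setup that is not specified in the conversation: identical firms, linear inverse demand P = a - b*Q, and a constant marginal cost c. The function name and all parameter values are hypothetical, chosen only for illustration.

```python
def cournot_equilibrium(n_firms, a, b, c):
    """Symmetric Cournot equilibrium, assuming inverse demand P = a - b*Q
    and a constant marginal cost c shared by every firm."""
    # Each firm's best-response quantity in the symmetric equilibrium.
    q_per_firm = (a - c) / (b * (n_firms + 1))
    total_q = n_firms * q_per_firm
    price = a - b * total_q
    # Margin over marginal cost, as a fraction of price.
    margin = (price - c) / price
    return q_per_firm, price, margin

# Illustrative (made-up) demand and cost parameters.
a, b, c = 100.0, 1.0, 10.0

_, price_three, margin_three = cournot_equilibrium(3, a, b, c)
_, price_many, margin_many = cournot_equilibrium(1000, a, b, c)

print(f"3 firms:    price={price_three:.2f}, margin={margin_three:.1%}")
print(f"1000 firms: price={price_many:.2f}, margin={margin_many:.1%}")
```

Under these assumptions the price stays well above marginal cost with three firms and only approaches it as the number of firms grows large, which is the sense in which a small-number equilibrium does not converge to zero margins.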

Dwarkesh Patel

Help me understand that, because right now we do have three leading firms and they’re not making profit. So what is changing?

Dario Amodei

Again, the gross margins right now are very positive. What’s happening is a combination of two things. One is that we’re still in the exponential scale-up phase of compute. A model gets trained. Let’s say a model that cost $1 billion got trained last year. Then this year it produced $4 billion of revenue and cost $1 billion to run inference on. Again, I’m using stylized numbers here, but that would be 75% gross margins and this 25% tax. So that model as a whole makes $2 billion.

But at the same time, we’re spending $10 billion to train the next model because there’s an exponential scale-up. So the company loses money. Each model makes money, but the company loses money.
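
As a rough sketch, the stylized arithmetic above can be written out as follows. The figures are the stylized numbers from the conversation, in billions of dollars; the function names are hypothetical, and the single-year company figure is simply what those numbers imply rather than anything stated directly.

```python
def model_pnl(train_cost, revenue, inference_cost):
    """Treat one model as its own company: gross margin and lifetime profit."""
    gross_margin = (revenue - inference_cost) / revenue
    lifetime_profit = revenue - inference_cost - train_cost
    return gross_margin, lifetime_profit

def company_cash_flow(revenue, inference_cost, next_train_cost):
    """One year for the whole company: serve the current model while
    paying to train the next, larger one."""
    return revenue - inference_cost - next_train_cost

# Stylized numbers from the conversation, in $ billions.
gross_margin, model_profit = model_pnl(train_cost=1.0, revenue=4.0, inference_cost=1.0)
company_year = company_cash_flow(revenue=4.0, inference_cost=1.0, next_train_cost=10.0)

print(f"gross margin: {gross_margin:.0%}")
print(f"model lifetime profit: ${model_profit:.0f}B")
print(f"company cash flow this year: ${company_year:.0f}B")
```

The model viewed alone is profitable, while the company's year is negative because the next training run is ten times larger than the last.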

The equilibrium I’m talking about is an equilibrium where we have the “country of geniuses in a data center”, but that model training scale-up has equilibrated more. Maybe it’s still going up. We’re still trying to predict the demand, but it’s more leveled out.

Dwarkesh Patel

I’m confused about a couple of things there. Let’s start with the current world. In the current world, you’re right that, as you said before, if you treat each individual model as a company, it’s profitable. But of course, a big part of the production function of being a frontier lab is training the next model, right?

Dario Amodei

Yes, that’s right.

Dwarkesh Patel

If you didn’t do that, then you’d make profit for two months and then you wouldn’t have margins because you wouldn’t have the best model.

Dario Amodei

But at some point that reaches the biggest scale that it can reach. And then in equilibrium, we have algorithmic improvements, but we’re spending roughly the same amount to train the next model as we spend to train the current model. At some point you run out of money in the economy.

Dwarkesh Patel

A fixed lump of labor fallacy… The economy is going to grow, right? That’s one of your predictions. We’re going to have the data centers in space.

Dario Amodei

Yes, but this is another example of the theme I was talking about. The economy will grow much faster with AI than I think it ever has before. Right now the compute is growing 3x a year. I don’t believe the economy is gonna grow 300% a year. I said this in “Machines of Loving Grace”, I think we may get 10-20% per year growth in the economy, but we’re not gonna get 300% growth in the economy. So I think in the end, if compute becomes the majority of what the economy produces, it’s gonna be capped by that.
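
A back-of-the-envelope sketch of why 3x-a-year compute growth cannot continue indefinitely against an economy growing 10-20% a year. The starting share of 1% is a purely hypothetical assumption for illustration, not a figure from the conversation; the function name is also hypothetical.

```python
import math

def years_until_share(start_share, target_share, compute_growth, economy_growth):
    """Years for compute's share of the economy to rise from start_share
    to target_share, given annual growth multipliers for each."""
    relative_growth = compute_growth / economy_growth
    if relative_growth <= 1:
        raise ValueError("compute must outgrow the economy for its share to rise")
    return math.log(target_share / start_share) / math.log(relative_growth)

# Assumptions: compute grows 3x a year, the economy grows 20% a year,
# and compute starts at a hypothetical 1% of the economy.
years = years_until_share(start_share=0.01, target_share=0.5,
                          compute_growth=3.0, economy_growth=1.2)

print(f"{years:.1f} years until compute would be half the economy")
```

Whatever starting share is assumed, the gap between the two growth rates closes within a handful of years, after which compute spending can grow no faster than the economy itself.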

Dwarkesh Patel

So let’s assume a model where compute stays capped. The world where frontier labs are making money is one where they continue to make fast progress. Because fundamentally your margin is limited by how good the alternative is. So you are able to make money because you have a frontier model. If you didn’t have a frontier model you wouldn’t be making money. So this model requires there never to be a steady state. Forever and ever you keep making more algorithmic progress.

Dario Amodei

I don’t think that’s true. I mean, I feel like we’re in an economics class.

Dwarkesh Patel

Do you know the Tyler Cowen quote? We never stop talking about economics.

Dario Amodei

We never stop talking about economics. So no, I don’t think this field’s going to be a monopoly. All my lawyers never want me to say the word “monopoly”. But I don’t think this field’s going to be a monopoly. You do get industries in which there are a small number of players. Not one, but a small number of players.

Ordinarily, the way you get monopolies like Facebook or Meta—I always call them Facebook—is these kinds of network effects. The way you get industries in which there are a small number of players, is very high costs of entry. Cloud is like this. I think cloud is a good example of this. There are three, maybe four, players within cloud. I think that’s the same for AI, three, maybe four.

The reason is that it’s so expensive. It requires so much expertise and so much capital to run a cloud company. You have to put up all this capital. In addition to putting up all this capital, you have to get all of this other stuff that requires a lot of skill to make it happen.

So if you go to someone and you’re like, “I want to disrupt this industry, here’s $100 billion.” You’re like, “okay, I’m putting in $100 billion and also betting that you can do all these other things that these people have been doing.”

Dwarkesh Patel

Only to decrease the profit.

Dario Amodei

The effect of your entering is that profit margins go down. So, we have equilibria like this all the time in the economy where we have a few players. Profits are not astronomical. Margins are not astronomical, but they’re not zero. That’s what we see on cloud. Cloud is very undifferentiated. Models are more differentiated than cloud.

Everyone knows Claude is good at different things than GPT is good at, than Gemini is good at. It’s not just that Claude’s good at coding, GPT is good at math and reasoning. It’s more subtle than that. Models are good at different types of coding. Models have different styles. I think these things are actually quite different from each other, and so I would expect more differentiation than you see in cloud.

Now, there actually is one counter-argument. That counter-argument is if the process of producing models, if AI models can do that themselves, then that could spread throughout the economy. But that is not an argument for commoditizing AI models in general. That’s kind of an argument for commoditizing the whole economy at once.

I don’t know what quite happens in that world where basically anyone can do anything, anyone can build anything, and there’s no moat around anything at all. I don’t know, maybe we want that world. Maybe that’s the end state here. Maybe when AI models can do everything, if we’ve solved all the safety and security problems, that’s one of the mechanisms for the economy just flattening itself again. But that’s kind of far post-“country of geniuses in the data center.”

Dwarkesh Patel

Maybe a finer way to put that potential point is: 1) it seems like AI research is especially loaded on raw intellectual power, which will be especially abundant in the world of AGI. And 2) if you just look at the world today, there are very few technologies that seem to be diffusing as fast as AI algorithmic progress. So that does hint that this industry is sort of structurally diffusive.

Dario Amodei

I think coding is going fast, but I think AI research is a superset of coding and there are aspects of it that are not going fast. But I do think, again, once we get coding, once we get AI models going fast, then that will speed up the ability of AI models to do everything else. So while coding is going fast now, I think once the AI models are building the next AI models and building everything else, the whole economy will kind of go at the same pace.

I am worried geographically, though. I’m a little worried that just proximity to AI, having heard about AI, may be one differentiator. So when I said the 10-20% growth rate, a worry I have is that the growth rate could be like 50% in Silicon Valley and parts of the world that are socially connected to Silicon Valley, and not that much faster than its current pace elsewhere. I think that’d be a pretty messed up world. So one of the things I think about a lot is how to prevent that.

Dwarkesh Patel

Do you think that once we have this country of geniuses in a data center, that robotics is sort of quickly solved afterwards? Because it seems like a big problem with robotics is that a human can learn how to teleoperate current hardware, but current AI models can’t, at least not in a way that’s super productive. And so if we have this ability to learn like a human, shouldn’t it solve robotics immediately as well?

Dario Amodei

I don’t think it’s dependent on learning like a human. It could happen in different ways. Again, we could have trained the model on many different video games, which are like robotic controls, or many different simulated robotics environments, or just train them to control computer screens, and they learn to generalize.

So it will happen... it’s not necessarily dependent on human-like learning. Human-like learning is one way it could happen. If the model’s like, “Oh, I pick up a robot, I don’t know how to use it, I learn,” that could happen because we discovered continual learning. That could also happen because we trained the model on a bunch of environments and then generalized, or it could happen because the model learns that in the context length. It doesn’t actually matter which way. If we go back to the discussion we had an hour ago, that type of thing can happen in several different ways.

But I do think when for whatever reason the models have those skills, then robotics will be revolutionized—both the design of robots, because the models will be much better than humans at that, and also the ability to control robots. So we’ll get better at building the physical hardware, building the physical robots, and we’ll also get better at controlling it.

Now, does that mean the robotics industry will also be generating trillions of dollars of revenue? My answer there is yes, but there will be the same extremely fast, but not infinitely fast diffusion. So will robotics be revolutionized? Yeah, maybe tack on another year or two. That’s the way I think about these things.

Dwarkesh Patel

Makes sense. There’s a general skepticism about extremely fast progress. Here’s my view. It sounds like you are going to solve continual learning one way or another within a matter of years. But just as people weren’t talking about continual learning a couple of years ago, and then we realized, “Oh, why aren’t these models as useful as they could be right now, even though they are clearly passing the Turing test and are experts in so many different domains? Maybe it’s this thing.”

Then we solve this thing and we realize, actually, there’s another thing that human intelligence can do that’s a basis of human labor that these models can’t do. So why not think there will be more things like this, where we’ve found more pieces of human intelligence?

Dario Amodei

Well, to be clear, I think continual learning, as I’ve said before, might not be a barrier at all. I think we may just get there by pre-training generalization and RL generalization. I think there just might not be such a thing at all.

In fact, I would point to the history in ML of people coming up with things that are barriers that end up kind of dissolving within the big blob of compute. People talked about, “How do your models keep track of nouns and verbs?” “They can understand syntactically, but they can’t understand semantically? It’s only statistical correlations.” “You can understand a paragraph, you can’t understand a word. There’s reasoning, you can’t do reasoning.” But then suddenly it turns out you can do code and math very well.

So I think there’s actually a stronger history of some of these things seeming like a big deal and then kind of dissolving. Some of them are real. The need for data is real, maybe continual learning is a real thing.

But again, I would ground us in something like code. I think we may get to the point in a year or two where the models can just do SWE end-to-end. That’s a whole task. That’s a whole sphere of human activity that we’re just saying models can do now.

Dwarkesh Patel

When you say end-to-end, do you mean setting technical direction, understanding the context of the problem, et cetera?

Dario Amodei

Yes. I mean all of that.

Dwarkesh Patel

Interesting. I feel like that is AGI-complete, which maybe is internally consistent. But it’s not like saying 90% of code or 100% of code.

Dario Amodei

No, I gave this spectrum: 90% of code, 100% of code, 90% of end-to-end SWE, 100% of end-to-end SWE. New tasks are created for SWEs. Eventually those get done as well. It’s a long spectrum there, but we’re traversing the spectrum very quickly.

Dwarkesh Patel

I do think it’s funny that I’ve seen a couple of podcasts you’ve done where the hosts will be like, “But Dwarkesh wrote the essay about the continuous learning thing.” It always makes me crack up because you’ve been an AI researcher for 10 years. I’m sure there’s some feeling of, “Okay, so a podcaster wrote an essay, and every interview I get asked about it.”

Dario Amodei

The truth of the matter is that we’re all trying to figure this out together. There are some ways in which I’m able to see things that others aren’t. These days that probably has more to do with seeing a bunch of stuff within Anthropic and having to make a bunch of decisions than with having any great research insight that others don’t.

I’m running a 2,500 person company. It’s actually pretty hard for me to have concrete research insight, much harder than it would have been 10 years ago or even two or three years ago.

Dwarkesh Patel

As we go towards a world of a full drop-in remote worker replacement, does an API pricing model still make the most sense? If not, what is the correct way to price AGI, or serve AGI?

Dario Amodei

I think there’s going to be a bunch of different business models here, all at once, that are going to be experimented with. I actually do think that the API model is more durable than many people think. One way I think about it is if the technology is advancing quickly, if it’s advancing exponentially, what that means is there’s always a surface area of new use cases that have been developed in the last three months.

Any kind of product surface you put in place is always at risk of sort of becoming irrelevant. Any given product surface probably makes sense for a range of capabilities of the model. The chatbot is already running into limitations where making it smarter doesn’t really help the average consumer that much. But I don’t think that’s a limitation of AI models. I don’t think that’s evidence that the models are good enough and them getting better doesn’t matter to the economy. It doesn’t matter to that particular product.

So I think the value of the API is that the API always offers an opportunity, very close to the bare metal, to build on what the latest thing is. There’s always going to be this front of new startups and new ideas that weren’t possible a few months ago and are possible because the model is advancing.

I actually predict that it’s going to exist alongside other models, but we’re always going to have the API business model because there’s always going to be a need for a thousand different people to try experimenting with the model in a different way. 100 of them become startups and ten of them become big successful startups. Two or three really end up being the way that people use the model of a given generation.

So I basically think it’s always going to exist. At the same time, I’m sure there’s going to be other models as well. Not every token that’s output by the model is worth the same amount. Think about what is the value of the tokens that the model outputs when someone calls them up and says, “My Mac isn’t working,” or something, the model’s like, “restart it.” Someone hasn’t heard that before, but the model said that 10 million times. Maybe that’s worth like a dollar or a few cents or something.

Whereas if the model goes to one of the pharmaceutical companies and it says, “Oh, you know, this molecule you’re developing, you should take the aromatic ring from that end of the molecule and put it on that end of the molecule. If you do that, wonderful things will happen.” Those tokens could be worth tens of millions of dollars.

So I think we’re definitely going to see business models that recognize that. At some point we’re going to see “pay for results” in some form, or we may see forms of compensation that are like labor, that kind of work by the hour. I don’t know. I think because it’s a new industry, a lot of things are going to be tried. I don’t know what will turn out to be the right thing.

Dwarkesh Patel

I take your point that people will have to try things to figure out what is the best way to use this blob of intelligence. But what I find striking is Claude Code. I don’t think in the history of startups there has been a single application that has been as hotly competed in as coding agents. Claude Code is a category leader here. That seems surprising to me.

It doesn’t seem intrinsically that Anthropic had to build this. I wonder if you have an accounting of why it had to be Anthropic or how Anthropic ended up building an application in addition to the model underlying it that was successful.

Dario Amodei

So it actually happened in a pretty simple way, which is that we had our own coding models, which were good at coding. Around the beginning of 2025, I said, “I think the time has come where you can have nontrivial acceleration of your own research if you’re an AI company by using these models.” Of course, you need an interface, you need a harness to use them.

So I encouraged people internally. I didn’t say this is one thing that you have to use. I just said people should experiment with this. I think it might have been originally called Claude CLI, and then the name eventually got changed to Claude Code. Internally, it was the thing that everyone was using and it was seeing fast internal adoption.

I looked at it and I said, “Probably we should launch this externally, right?” It’s seen such fast adoption within Anthropic. Coding is a lot of what we do. We have an audience of many, many hundreds of people that’s in some ways at least representative of the external audience. So it looks like we already have product market fit. Let’s launch this thing.

And then we launched it. I think just the fact that we ourselves are kind of developing the model and we ourselves know what we most need to use the model, I think it’s kind of creating this feedback loop.

Dwarkesh Patel

I see. In the sense that you, let’s say a developer at Anthropic is like, “Ah, it would be better if it was better at this X thing.” Then you bake that into the next model that you build.

Dario Amodei

That’s one version of it, but then there’s just the ordinary product iteration. We have a bunch of coders within Anthropic, they use Claude Code every day and so we get fast feedback. That was more important in the early days. Now, of course, there are millions of people using it, and so we get a bunch of external feedback as well. But it’s just great to be able to get kind of fast internal feedback.

I think this is the reason why we launched a coding model and didn’t launch a pharmaceutical company. My background’s in biology, but we don’t have any of the resources that are needed to launch a pharmaceutical company.

01:31:19 - Will regulations destroy the boons of AGI?

Dwarkesh Patel

Let me now ask you about making AI go well. It seems like whatever vision we have about how AI goes well has to be compatible with two things: 1) the ability to build and run AIs is diffusing extremely rapidly and 2) the population of AIs, the amount we have and their intelligence, will also increase very rapidly.

That means that lots of people will be able to build huge populations of misaligned AIs, or AIs which are just companies which are trying to increase their footprint or have weird psyches like Sydney Bing, but now they’re superhuman. What is a vision for a world in which we have an equilibrium that is compatible with lots of different AIs, some of which are misaligned, running around?

Dario Amodei

I think in “The Adolescence of Technology”, I was skeptical of the balance of power. But the thing I was specifically skeptical of is you have three or four of these companies all building models that are derived from the same thing, that they would check each other. Or even that any number of them would check each other.

We might live in an offense-dominant world where one person or one AI model is smart enough to do something that causes damage for everything else. In the short run, we have a limited number of players now. So we can start within the limited number of players. We need to put in place the safeguards. We need to make sure everyone does the right alignment work. We need to make sure everyone has bioclassifiers. Those are the immediate things we need to do.

I agree that that doesn’t solve the problem in the long run, particularly if the ability of AI models to make other AI models proliferates, then the whole thing can become harder to solve. I think in the long run we need some architecture of governance. We need some architecture of governance that preserves human freedom, but also allows us to govern a very large number of human systems, AI systems, hybrid human-AI companies or economic units.

So we’re gonna need to think about: how do we protect the world against bioterrorism? How do we protect the world against mirror life? Probably we’re gonna need some kind of AI monitoring system that monitors for all of these things. But then we need to build this in a way that preserves civil liberties and our constitutional rights. So I think just as anything else, it’s a new security landscape with a new set of tools and a new set of vulnerabilities.

My worry is, if we had 100 years for this to happen all very slowly, we’d get used to it. We’ve gotten used to the presence of explosives in society or the presence of various new weapons or the presence of video cameras. We would get used to it over 100 years and we’d develop governance mechanisms. We’d make our mistakes. My worry is just that this is happening all so fast. So maybe we need to do our thinking faster about how to make these governance mechanisms work.

Dwarkesh Patel

It seems like in an offense-dominant world, over the course of the next century—the idea is that AI is making the progress that would happen over the next century happen in some period of five to ten years—we would still need the same mechanisms, or balance of power would be similarly intractable, even if humans were the only game in town.

I guess we have the advice of AI. But it fundamentally doesn’t seem like a totally different ball game here. If checks and balances were going to work, they would work with humans as well. If they aren’t going to work, they wouldn’t work with AIs as well. So maybe this just dooms human checks and balances as well.

Dario Amodei

Again, I think there’s some way to make this happen. The governments of the world may have to work together to make it happen. We may have to talk to AIs about building societal structures in such a way that these defenses are possible. I don’t know. I don’t want to say this is so far ahead in time, but it’s so far ahead in technological ability that may happen over a short period of time, that it’s hard for us to anticipate it in advance.

Dwarkesh Patel

Speaking of governments getting involved, on December 26, the Tennessee legislature introduced a bill which said, “It would be an offense for a person to knowingly train artificial intelligence to provide emotional support, including through open-ended conversations with a user.” Of course, one of the things that Claude attempts to do is be a thoughtful, knowledgeable friend.

In general, it seems like we’re going to have this patchwork of state laws. A lot of the benefits that normal people could experience as a result of AI are going to be curtailed, especially when we get into the kinds of things you discuss in “Machines of Loving Grace”: biological freedom, mental health improvements, et cetera.

It seems easy to imagine worlds in which these get Whac-A-Moled away by different laws, whereas bills like this don’t seem to address the actual existential threats that you’re concerned about. I’m curious to understand, in the context of things like this, Anthropic’s position against the federal moratorium on state AI laws.

Dario Amodei

There are many different things going on at once. I think that particular law is dumb. It was clearly made by legislators who just probably had little idea what AI models could do and not do. They’re like, “AI models serving us, that just sounds scary. I don’t want that to happen.” So we’re not in favor of that.

But that wasn’t the thing that was being voted on. The thing that was being voted on is: we’re going to ban all state regulation of AI for 10 years with no apparent plan to do any federal regulation of AI, which would take Congress to pass, which is a very high bar. So the idea that we’d ban states from doing anything for 10 years… People said they had a plan for the federal government, but there was no actual proposal on the table. There was no actual attempt.

Given the serious dangers that I lay out in “Adolescence of Technology” around things like biological weapons, bioterrorism, and autonomy risk, and the timelines we’ve been talking about—10 years is an eternity—I think that’s a crazy thing to do. So if that’s the choice, if that’s what you force us to choose, then we’re going to choose not to have that moratorium. I think the benefits of that position exceed the costs, but it’s not a perfect position if that’s the choice.

Now, I think the thing that we should do, the thing that I would support, is the federal government should step in, not saying “states you can’t regulate”, but “Here’s what we’re going to do, and states you can’t differ from this.” I think preemption is fine in the sense of saying that the federal government says, “Here is our standard. This applies to everyone. States can’t do something different.”

That would be something I would support if it would be done in the right way. But this idea of states, “You can’t do anything and we’re not doing anything either,” that struck us as very much not making sense. I think it will not age well, it is already starting to not age well with all the backlash that you’ve seen.

Now, in terms of what we would want, the things we’ve talked about are starting with transparency standards in order to monitor some of these autonomy risks and bioterrorism risks. As the risks become more serious, as we get more evidence for them, then I think we could be more aggressive in some targeted ways and say, “Hey, AI bioterrorism is really a threat. Let’s pass a law that forces people to have classifiers.”

I could even imagine… It depends. It depends how serious the threat it ends up being. We don’t know for sure. We need to pursue this in an intellectually honest way where we say that ahead of time, the risk has not emerged yet. But I could certainly imagine, with the pace that things are going at, a world where later this year we say, “Hey, this AI bioterrorism stuff is really serious. We should do something about it. We should put it in a federal standard. If the federal government won’t act, we should put it in a state standard.” I could totally see that.

Dwarkesh Patel

I’m concerned about a world where if you just consider the pace of progress you’re expecting, the life cycle of legislation... The benefits are, as you say, because of diffusion lag, slow enough that I really do think this patchwork of state laws, on the current trajectory, would prohibit them. I mean if having an emotional chatbot friend is something that freaks people out, then just imagine the kinds of actual benefits from AI we want normal people to be able to experience, from improvements in health and healthspan to improvements in mental health and so forth.

Whereas at the same time, it seems like you think the dangers are already on the horizon and I just don’t see that much… It seems like it would be especially injurious to the benefits of AI as compared to the dangers of AI. So that’s maybe where the cost benefit makes less sense to me.

Dario Amodei

So there’s a few things here. People talk about there being thousands of these state laws. First of all, the vast, vast majority of them do not pass. The world works a certain way in theory, but just because a law has been passed doesn’t mean it’s really enforced. The people implementing it may be like, “Oh my God, this is stupid. It would mean shutting off everything that’s ever been built in Tennessee.” Very often, laws are interpreted in a way that makes them not as dangerous or harmful. By the same token, of course, you have to worry if you’re passing a law to stop a bad thing; you have this problem as well.

My basic view is that if we could decide what laws were passed and how things were done—and we’re only one small input into that—I would deregulate a lot of the stuff around the health benefits of AI. I don’t worry as much about the chatbot laws. I actually worry more about the drug approval process, where I think AI models are going to greatly accelerate the rate at which we discover drugs, and the pipeline will get jammed up. The pipeline will not be prepared to process all the stuff that’s going through it.

I think reform of the regulatory process should bias more towards the fact that we have a lot of things coming where the safety and efficacy is actually going to be really crisp and clear, a beautiful thing, and really effective. Maybe we don’t need all this superstructure around it that was designed around an era of drugs that barely work and often have serious side effects.

At the same time, I think we should be ramping up quite significantly the safety and security legislation. Like I’ve said, starting with transparency is my view of trying not to hamper the industry, trying to find the right balance. I’m worried about it. Some people criticize my essay for saying, “That’s too slow. The dangers of AI will come too soon if we do that.”

Well, basically, I think the last six months and maybe the next few months are going to be about transparency. Then, if these risks emerge when we’re more certain of them—which I think we might be as soon as later this year—then I think we need to act very fast in the areas where we’ve actually seen the risk.

I think the only way to do this is to be nimble. Now, the legislative process is normally not nimble, but we need to emphasize the urgency of this to everyone involved. That’s why I’m sending this message of urgency. That’s why I wrote Adolescence of Technology. I wanted policymakers, economists, national security professionals, and decision-makers to read it so that they have some hope of acting faster than they would have otherwise.

Dwarkesh Patel

Is there anything you can do or advocate that would make it more certain that the benefits of AI are better instantiated? I feel like you have worked with legislatures to say, “Okay, we’re going to prevent bioterrorism here. We’re going to increase transparency, we’re going to increase whistleblower protection.” But I think by default, the actual benefits we’re looking forward to seem very fragile to different kinds of moral panics or political economy problems.

Dario Amodei

I don’t actually agree that much regarding the developed world. I feel like in the developed world, markets function pretty well. When there’s a lot of money to be made on something and it’s clearly the best available alternative, it’s actually hard for the regulatory system to stop it.

We’re seeing that in AI itself. A thing I’ve been trying to fight for is export controls on chips to China. That’s in the national security interest of the US. That’s squarely within the policy beliefs of almost everyone in Congress of both parties. The case is very clear. The counterarguments against it, I’ll politely call them fishy. Yet it doesn’t happen and we sell the chips because there’s so much money riding on it. That money wants to be made. In that case, in my opinion, that’s a bad thing. But it also applies when it’s a good thing.

So if we’re talking about drugs and benefits of the technology, I am not as worried about those benefits being hampered in the developed world. I am a little worried about them going too slow. As I said, I do think we should work to speed the approval process in the FDA. I do think we should fight against these chatbot bills that you’re describing. Described individually, I’m against them. I think they’re stupid.

But I actually think the bigger worry is the developing world, where we don’t have functioning markets and where we often can’t build on the technology that we’ve had. I worry more that those folks will get left behind. And I worry that even if the cures are developed, maybe there’s someone in rural Mississippi who doesn’t get it as well. That’s a smaller version of the concern we have in the developing world.

So the things we’ve been doing are working with philanthropists. We work with folks who deliver medicine and health interventions to the developing world, to sub-Saharan Africa, India, Latin America, and other developing parts of the world. That’s the thing I think that won’t happen on its own.

01:47:41 - Why can’t China and America both have a country of geniuses in a datacenter?

Dwarkesh Patel

You mentioned export controls. Why shouldn’t the US and China both have a “country of geniuses in a data center”?

Dario Amodei

Why won’t it happen or why shouldn’t it happen?

Dwarkesh Patel

Why shouldn’t it happen.

Dario Amodei

If this does happen, we could have a few situations. If we have an offense-dominant situation, we could have a situation like nuclear weapons, but more dangerous. Either side could easily destroy everything.

We could also have a world where it’s unstable. The nuclear equilibrium is stable because it’s deterrence. But let’s say there was uncertainty about, if the two AIs fought, which AI would win? That could create instability. You often have conflict when the two sides have a different assessment of their likelihood of winning. If one side is like, “Oh yeah, there’s a 90% chance I’ll win,” and the other side thinks the same, then a fight is much more likely. They can’t both be right, but they can both think that.

Dwarkesh Patel

But this seems like a fully general argument against the diffusion of AI technology. That’s the implication of this world.

Dario Amodei

Let me just go on, because I think we will get diffusion eventually. The other concern I have is that governments will oppress their own people with AI. I’m worried about a world where you have a country in which there’s already a government that’s building a high-tech authoritarian state. To be clear, this is about the government. This is not about the people. We need to find a way for people everywhere to benefit. My worry here is about governments. My worry is if the world gets carved up into two pieces, one of those two pieces could be authoritarian or totalitarian in a way that’s very difficult to displace.

Now, will governments eventually get powerful AI, and is there a risk of authoritarianism? Yes. Will governments eventually get powerful AI, and is there a risk of bad equilibria? Yes, I think both things. But the initial conditions matter. At some point, we’re going to need to set up the rules of the road.

I’m not saying that one country, either the United States or a coalition of democracies—which I think would be a better setup, although it requires more international cooperation than we currently seem to want to make—should just say, “These are the rules of the road.” There’s going to be some negotiation. The world is going to have to grapple with this.

What I would like is for the democratic nations of the world, those whose governments represent closer to pro-human values, to hold the stronger hand and have more leverage when the rules of the road are set. So I’m very concerned about that initial condition.

Dwarkesh Patel

I was re-listening to the interview from three years ago, and one of the ways it aged poorly is that I kept asking questions assuming there was going to be some key fulcrum moment two to three years from now. In fact, being that far out, it just seems like progress continues, AI improves, AI is more diffused, and people will use it for more things.

It seems like you’re imagining a world in the future where the countries get together, and “Here’s the rules of the road, here’s the leverage we have, and here’s the leverage you have.” But on the current trajectory, everybody will have more AI. Some of that AI will be used by authoritarian countries. Some of that within the authoritarian countries will be used by private actors versus state actors.

It’s not clear who will benefit more. It’s always hard to tell in advance. It seems like the internet privileged authoritarian countries more than you would’ve expected. Maybe AI will be the opposite way around. I want to better understand what you’re imagining here.

Dario Amodei

Just to be precise about it, I think the exponential of the underlying technology will continue as it has before. The models get smarter and smarter, even when they get to a “country of geniuses in a data center.” I think you can continue to make the model smarter. There’s a question of getting diminishing returns on their value in the world. How much does it matter after you’ve already solved human biology? At some point you can do harder, more abstruse math problems, but nothing after that matters.

Putting that aside, I do think the exponential will continue, but there will be certain distinguished points on the exponential. Companies, individuals, and countries will reach those points at different times.

In “The Adolescence of Technology” I talk about: Is a nuclear deterrent still stable in the world of AI? I don’t know, but that’s an example of one thing we’ve taken for granted. The technology could reach such a level that we can no longer be certain of it. Think of others. There are points where if you reach a certain level, maybe you have offensive cyber dominance, and every computer system is transparent to you after that unless the other side has an equivalent defense.

I don’t know what the critical moment is or if there’s a single critical moment. But I think there will be either a critical moment, a small number of critical moments, or some critical window where AI confers some large advantage from the perspective of national security, and one country or coalition has reached it before others.

I’m not advocating that they just say, “Okay, we’re in charge now.” That’s not how I think about it. The other side is always catching up. There are extreme actions you’re not willing to take, and it’s not right to take complete control anyway. But at the point that happens, people are going to understand that the world has changed. There’s going to be some negotiation, implicit or explicit, about what the post-AI world order looks like. My interest is in making that negotiation be one in which classical liberal democracy has a strong hand.

Dwarkesh Patel

I want to better understand what that means, because you say in the essay, “Autocracy is simply not a form of government that people can accept in the post-powerful AI age.” That sounds like you’re saying the CCP as an institution cannot exist after we get AGI. That seems like a very strong demand, and it seems to imply a world where the leading lab or the leading country will be able to (and by that language, should get to) determine how the world is governed or what kinds of governments are, and are not, allowed.

Dario Amodei

I believe that paragraph said something like, “You could take it even further and say X.” I wasn’t necessarily endorsing that view. I was saying, “Here’s a weaker thing that I believe. We have to worry a lot about authoritarians and we should try to check them and limit their power. You could take this much further and have a more interventionist view that says authoritarian countries with AI are these self-fulfilling cycles that are very hard to displace, so you just need to get rid of them from the beginning.”

That has exactly all the problems you say. If you were to make a commitment to overthrowing every authoritarian country, they would take a bunch of actions now that could lead to instability. That just may not be possible.

But the point I was making that I do endorse is that it is quite possible that... Today, the view, my view, in most of the Western world is that democracy is a better form of government than authoritarianism. But if a country’s authoritarian, we don’t react the way we’d react if they committed a genocide or something. I guess what I’m saying is I’m a little worried that in the age of AGI, authoritarianism will have a different meaning. It will be a graver thing. We have to decide one way or another how to deal with that. The interventionist view is one possible view. I was exploring such views. It may end up being the right view, or it may end up being too extreme. But I do have hope.

One piece of hope I have is that we have seen that as new technologies are invented, forms of government become obsolete. I mentioned this in “Adolescence of Technology”, where I said feudalism was basically a form of government, and when we invented industrialization, feudalism was no longer sustainable. It no longer made sense.

Dwarkesh Patel

Why is that hope? Couldn’t that imply that democracy is no longer going to be a competitive system?

Dario Amodei

Right, it could go either way. But these problems with authoritarianism get deeper. I wonder if that’s an indicator of other problems that authoritarianism will have. In other words, because authoritarianism becomes worse, people are more afraid of it. They work harder to stop it. You have to think in terms of total equilibrium. I just wonder if it will motivate new ways of thinking about how to preserve and protect freedom with the new technology.

Even more optimistically, will it lead to a collective reckoning and a more emphatic realization of how important some of the things we take as individual rights are? A more emphatic realization that we really can’t give these away. We’ve seen there’s no other way to live that actually works.

I am actually hopeful that—it sounds too idealistic, but I believe it could be the case—dictatorships become morally obsolete. They become morally unworkable forms of government and the crisis that that creates is sufficient to force us to find another way.

Dwarkesh Patel

I think there is genuinely a tough question here which I’m not sure how you resolve. We’ve had to come out one way or another on it through history. With China in the ’70s and ’80s, we decided that even though it’s an authoritarian system, we will engage with it. I think in retrospect that was the right call, because it’s a state authoritarian system but a billion-plus people are much wealthier and better off than they would’ve otherwise been. It’s not clear that it would’ve stopped being an authoritarian country otherwise. You can just look at North Korea as an example of that.

I don’t know if it takes that much intelligence to remain an authoritarian country that continues to coalesce its own power. You can imagine a North Korea with an AI that’s much worse than everybody else’s, but still enough to keep power.

In general, it seems like we should just have this attitude that the benefits of AI—in the form of all these empowerments of humanity and health—will be big. Historically, we have decided it’s good to spread the benefits of technology widely, even to people whose governments are authoritarian. It is a tough question, how to think about it with AI, but historically we have said, “yes, this is a positive-sum world, and it’s still worth diffusing the technology.”

Dario Amodei

There are a number of choices we have. Framing this as a government-to-government decision in national security terms is one lens, but there are a lot of other lenses. You could imagine a world where we produce all these cures to diseases. The cures are fine to sell to authoritarian countries, but the data centers just aren’t. The chips and the data centers aren’t, and the AI industry itself isn’t.

Another possibility I think folks should think about is this. Could there be developments we can make—either that naturally happen as a result of AI, or that we could make happen by building technology on AI—that create an equilibrium where it becomes infeasible for authoritarian countries to deny their people private use of the benefits of the technology? Are there equilibria where we can give everyone in an authoritarian country their own AI model that defends them from surveillance and there isn’t a way for the authoritarian country to crack down on this while retaining power?

I don’t know. That sounds to me like if that went far enough, it would be a reason why authoritarian countries would disintegrate from the inside. But maybe there’s a middle world where there’s an equilibrium where, if they want to hold on to power, the authoritarians can’t deny individualized access to the technology.

But I actually do have a hope for the more radical version. Is it possible that the technology might inherently have properties—or that by building on it in certain ways we could create properties—that have this dissolving effect on authoritarian structures? Now, we hoped originally—think back to the beginning of the Obama administration—that social media and the internet would have that property, and it turns out not to. But what if we could try again with the knowledge of how many things could go wrong, and that this is a different technology? I don’t know if it would work, but it’s worth a try.

Dwarkesh Patel

It’s just very unpredictable. There are first principles reasons why authoritarianism might be privileged.

Dario Amodei

It’s all very unpredictable. We just have to recognize the problem and come up with 10 things we can try, try those, and then assess which ones are working, if any. Then try new ones if the old ones aren’t working.

Dwarkesh Patel

But I guess that nets out to today, as you say, that we will not sell data centers, or chips, and the ability to make chips to China. So in some sense, you are denying… There would be some benefits to the Chinese economy, Chinese people, et cetera, because we’re doing that. Then there’d also be benefits to the American economy because it’s a positive-sum world. We could trade. They could have their country’s data centers doing one thing. We could have ours doing another. Already, you’re saying it’s not worth that positive-sum stipend to empower those countries?

Dario Amodei

What I would say is that we are about to be in a world where growth and economic value will come very easily if we’re able to build these powerful AI models. What will not come easily is distribution of benefits, distribution of wealth, political freedom. These are the things that are going to be hard to achieve.

So when I think about policy, my fundamental belief is that the technology and the market will deliver all the fundamental benefits almost faster than we can take them. These questions about distribution and political freedom and rights are the ones that will actually matter and that policy should focus on.

Dwarkesh Patel

Speaking of distribution, as you were mentioning, we have developing countries. In many cases, catch-up growth has been weaker than we would have hoped for. But when catch-up growth does happen, it’s fundamentally because they have underutilized labor. We can bring the capital and know-how from developed countries to these countries, and then they can grow quite rapidly.

Obviously, in a world where labor is no longer the constraining factor, this mechanism no longer works. So is the hope basically to rely on philanthropy from the people or countries who immediately get wealthy from AI? What is the hope?

Dario Amodei

Philanthropy should obviously play some role, as it has in the past. But I think growth is always better and stronger if we can make it endogenous.

What are the relevant industries in an AI-driven world? I said we shouldn’t build data centers in China, but there’s no reason we shouldn’t build data centers in Africa. In fact, I think it’d be great to build data centers in Africa. As long as they’re not owned by China, we should build data centers in Africa. I think that’s a great thing to do.

There’s no reason we can’t build a pharmaceutical industry that’s AI-driven. If AI is accelerating drug discovery, then there will be a bunch of biotech startups. Let’s make sure some of those happen in the developing world. Certainly, during the transition—we can talk about the point where humans have no role—humans will still have some role in starting up these companies and supervising the AI models. So let’s make sure some of those humans are in the developing world so that fast growth can happen there as well.

Dwarkesh Patel

You guys recently announced that Claude is going to have a constitution that’s aligned to a set of values, and not necessarily just to the end user. There’s a world I can imagine where if it is aligned to the end user, it preserves the balance of power we have in the world today because everybody gets to have their own AI that’s advocating for them. The ratio of bad actors to good actors stays constant. It seems to work out for our world today. Why is it better not to do that, but to have a specific set of values that the AI should carry forward?

Dario Amodei

I’m not sure I’d quite draw the distinction in that way. There may be two relevant distinctions here. I think you’re talking about a mix of the two. One is, should we give the model a set of instructions about “do this” versus “don’t do this”? The other is, should we give the model a set of principles for how to act?

It’s kind of purely a practical and empirical thing that we’ve observed. By teaching the model principles, getting it to learn from principles, its behavior is more consistent, it’s easier to cover edge cases, and the model is more likely to do what people want it to do. In other words, if you give it a list of rules (“don’t tell people how to hot-wire a car, don’t speak in Korean”), it doesn’t really understand the rules, and it’s hard to generalize from them. It’s just a list of dos and don’ts.

Whereas if you give it principles (alongside some hard guardrails like “Don’t make biological weapons”), overall you’re trying to understand what it should be aiming to do, how it should be aiming to operate. So just from a practical perspective, that turns out to be a more effective way to train the model. That’s the rules versus principles trade-off.

Then there’s another thing you’re talking about, which is the corrigibility versus intrinsic motivation trade-off. How much should the model be a kind of “skin suit” where it just directly follows the instructions given to it by whoever is giving those instructions, versus how much should the model have an inherent set of values and go off and do things on its own?

There I would actually say everything about the model is closer to the direction that it should mostly do what people want. It should mostly follow instructions. We’re not trying to build something that goes off and runs the world on its own. We’re actually pretty far on the corrigible side.

Now, what we do say is there are certain things that the model won’t do. I think we say it in various ways in the constitution, that under normal circumstances, if someone asks the model to do a task, it should do that task. That should be the default. But if you’ve asked it to do something dangerous, or to harm someone else, then the model is unwilling to do that. So I actually think of it as a mostly corrigible model that has some limits, but those limits are based on principles.

Dwarkesh Patel

Then the fundamental question is, how are those principles determined? This is not a special question for Anthropic. This would be a question for any AI company. But because you have been the ones to actually write down the principles, I get to ask you this question. Normally, a constitution is written down, set in stone, and there’s a process of updating it and changing it and so forth. In this case, it seems like a document that people at Anthropic write, that can be changed at any time, that guides the behavior of systems that are going to be the basis of a lot of economic activity. How do you think about how those principles should be set?

Dario Amodei

I think there are maybe three sizes of loop here, three ways to iterate. One is we iterate within Anthropic. We train the model, we’re not happy with it, and we change the constitution. I think that’s good to do. Putting out public updates to the constitution every once in a while is good because people can comment on it.

The second level of loop is different companies having different constitutions. I think it’s useful. Anthropic puts out a constitution, Gemini puts out a constitution, and other companies put out a constitution. People can look at them and compare. Outside observers can critique and say, “I like this thing from this constitution and this thing from that constitution.” That creates a soft incentive and feedback for all the companies to take the best of each element and improve.

Then I think there’s a third loop, which is society beyond the AI companies and beyond just those who comment without hard power. There we’ve done some experiments. A couple years ago, we did an experiment with the Collective Intelligence Project to basically poll people and ask them what should be in our AI constitution. At the time, we incorporated some of those changes.

So you could imagine doing something like that with the new approach we’ve taken to the constitution. It’s a little harder because it was an easier approach to take when the constitution was a list of dos and don’ts. At the level of principles, it has to have a certain amount of coherence. But you could still imagine getting views from a wide variety of people.

You could also imagine—and this is a crazy idea, but this whole interview is about crazy ideas—systems of representative government having input. I wouldn’t do this today because the legislative process is so slow. This is exactly why I think we should be careful about the legislative process and AI regulation. But there’s no reason you couldn’t, in principle, say, “All AI models have to have a constitution that starts with these things, and then you can append other things after it, but there has to be this special section that takes precedence.”

I wouldn’t do that. That’s too rigid and sounds overly prescriptive in a way that I think overly aggressive legislation is. But that is a thing you could try to do. Is there some much less heavy-handed version of that? Maybe.

Dwarkesh Patel

I really like control loop two. Obviously, this is not how constitutions of actual governments do or should work. There’s not this vague sense in which the Supreme Court will feel out how people are feeling—what are the vibes—and update the constitution accordingly. With actual governments, there’s a more formal, procedural process.

But you have a vision of competition between constitutions, which is actually very reminiscent of how some libertarian charter cities people used to talk, about what an archipelago of different kinds of governments would look like. There would be selection among them of who could operate the most effectively and where people would be the happiest. In a sense, you’re recreating that vision of a utopia of archipelagos.

Dario Amodei

I think that vision has things to recommend it and things that will go wrong with it. It’s an interesting, in some ways compelling, vision, but things will go wrong that you hadn’t imagined.

So I like loop two as well, but I feel like the whole thing has got to be some mix of loops one, two, and three, and it’s a matter of the proportions. I think that’s gotta be the answer.

Dwarkesh Patel

When somebody eventually writes the equivalent of The Making of the Atomic Bomb for this era, what is the thing that will be hardest to glean from the historical record that they’re most likely to miss?

Dario Amodei

I think a few things. One is, at every moment of this exponential, the extent to which the world outside it didn’t understand it. This is a bias that’s often present in history. Anything that actually happened looks inevitable in retrospect. When people look back, it will be hard for them to put themselves in the place of people who were actually making a bet on this thing to happen that wasn’t inevitable, that we had these arguments like the arguments I make for scaling or that continual learning will be solved. Some of us internally put a high probability on this happening, but there’s a world outside us that’s not acting on that at all.

I think the weirdness of it, unfortunately the insularity of it... If we’re one year or two years away from it happening, the average person on the street has no idea. That’s one of the things I’m trying to change with the memos, with talking to policymakers. I don’t know but I think that’s just a crazy thing.

Finally, I would say—and this probably applies to almost all historical moments of crisis—how absolutely fast it was happening, how everything was happening all at once. Decisions that you might think were carefully calculated, well actually you have to make that decision, and then you have to make 30 other decisions on the same day because it’s all happening so fast. You don’t even know which decisions are going to turn out to be consequential.

One of my worries—although it’s also an insight into what’s happening—is that some very critical decision will be some decision where someone just comes into my office and is like, “Dario, you have two minutes. Should we do thing A or thing B on this?” Someone gives me this random half-page memo and asks, “Should we do A or B?” I’m like, “I don’t know. I have to eat lunch. Let’s do B.” That ends up being the most consequential thing ever.

Dwarkesh Patel

So final question. Tech CEOs aren’t usually writing 50-page memos every few months. It seems like you have managed to build a role for yourself and a company around you which is compatible with this more intellectual-type role of CEO.

I want to understand how you construct that. How does that work? Do you just go away for a couple of weeks and then you tell your company, “This is the memo. Here’s what we’re doing”? It’s also reported that you write a bunch of these internally.

Dario Amodei

For this particular one, I wrote it over winter break. I was having a hard time finding the time to actually write it. But I think about this in a broader way. I think it relates to the culture of the company. I probably spend a third, maybe 40%, of my time making sure the culture of Anthropic is good.

As Anthropic has gotten larger, it’s gotten harder to get directly involved in the training of the models, the launch of the models, the building of the products. It’s 2,500 people. I have certain instincts, but it’s very difficult to get involved in every single detail. I try as much as possible, but one thing that’s very leveraged is making sure Anthropic is a good place to work, people like working there, everyone thinks of themselves as team members, and everyone works together instead of against each other.

As some of the other AI companies have grown (without naming any names), we’re starting to see decoherence and people fighting each other. I would argue there was even a lot of that from the beginning, but it’s gotten worse. I think we’ve done an extraordinarily good job, even if not perfect, of holding the company together, making everyone feel the mission, that we’re sincere about the mission, and that everyone has faith that everyone else there is working for the right reason. That we’re a team, that people aren’t trying to get ahead at each other’s expense or backstab each other, which again, I think happens a lot at some of the other places.

How do you make that the case? It’s a lot of things. It’s me, it’s Daniela, who runs the company day to day, it’s the co-founders, it’s the other people we hire, it’s the environment we try to create. But I think an important thing in the culture is that the other leaders as well, but especially me, have to articulate what the company is about, why it’s doing what it’s doing, what its strategy is, what its values are, what its mission is, and what it stands for.

When you get to 2,500 people, you can’t do that person by person. You have to write, or you have to speak to the whole company. This is why I get up in front of the whole company every two weeks and speak for an hour.

I wouldn’t say I write essays internally. I do two things. One, I write this thing called a DVQ, Dario Vision Quest. I wasn’t the one who named it that. That’s the name it received, and it’s one of these names that I tried to fight because it made it sound like I was going off and smoking peyote or something. But the name just stuck.

So I get up in front of the company every two weeks. I have a three or four-page document, and I just talk through three or four different topics about what’s going on internally, the models we’re producing, the products, the outside industry, the world as a whole as it relates to AI and geopolitically in general. Just some mix of that. I go through very honestly and I say, “This is what I’m thinking, and this is what Anthropic leadership is thinking,” and then I answer questions. That direct connection has a lot of value that is hard to achieve when you’re passing things down the chain six levels deep. A large fraction of the company comes to attend, either in person or virtually. It really means that you can communicate a lot.

The other thing I do is I have a channel in Slack where I just write a bunch of things and comment a lot. Often that’s in response to things I’m seeing at the company or questions people ask. We do internal surveys and there are things people are concerned about, and so I’ll write them up. I’m just very honest about these things. I just say them very directly.

The point is to get a reputation of telling the company the truth about what’s happening, to call things what they are, to acknowledge problems, to avoid the sort of corpo speak, the kind of defensive communication that often is necessary in public because the world is very large and full of people who are interpreting things in bad faith. But if you have a company of people who you trust, and we try to hire people that we trust, then you can really just be entirely unfiltered.

I think that’s an enormous strength of the company. It makes it a better place to work, it makes people more than the sum of their parts, and increases the likelihood that we accomplish the mission because everyone is on the same page about the mission, and everyone is debating and discussing how best to accomplish the mission.

Dwarkesh Patel

Well, in lieu of an external Dario Vision Quest, we have this interview.

Dario Amodei

This interview is a little like that.

Dwarkesh Patel

This has been fun, Dario. Thanks for doing it.

Dario Amodei

Thank you, Dwarkesh.

Notes on Space GPUs

Dwarkesh Patel — Thu, 05 Feb 2026 18:26:47 GMT

John Collison and I just interviewed Elon. The interview was recorded before we knew that SpaceX was acquiring xAI, so the fact that our first topic was space GPUs now feels all the more relevant.

As I was preparing to interview Elon, I put together some notes and a spreadsheet to help me think through orbital datacenters. I turned those notes into this blog post.

Even if orbital data centers don’t make sense yet, in the long run the singularity is clearly moving into space. Earth intercepts about one two-billionth of the sun’s total output. If AI scaling continues, compute will eventually move to where the energy is. So space GPUs are fun to think about, because they give you a sneak peek at the future. Whether that future arrives in 2030, 2040, or 2050 is another question.

Please take everything below with grains of salt—grains so big that you might confuse them for rocks. Assume all the numbers are wrong. Every paragraph below covers a topic that would take an actual expert a week to properly evaluate. What you’ll find here is what a professional podcaster has pieced together from conversations with LLMs and some very generous people who talked to me before the interview. Thanks to Casey Handmer, Philip Johnston, Ezra Feilden, Andrew McCalip, Vinay Ramasesh and the team at Kinetic Partnership for all their help.

Why orbital data centers?

The whole reason to go to space is energy. Yes, panels in space get about 40% more irradiance, but the real advantage is that you can put your satellites in sun-synchronous orbit, where they face the sun continuously. No nights, no clouds, no need for batteries (which are the majority of the cost in a solar-storage system). Solar on Earth has a roughly 25% capacity factor, meaning panels only generate a quarter of their peak output on average. In space, you get close to 100%.
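As a rough check on how those two effects combine, here is a minimal sketch that multiplies them together. It uses only the figures quoted above (about 40% more irradiance, a 25% capacity factor on the ground) plus the assumption that a sun-synchronous orbit gets a capacity factor of exactly 100%. The function name is illustrative.

```python
# Back-of-the-envelope: average output per panel in orbit vs. on the ground.
# All inputs are the post's rough figures, not measured values.

def orbital_energy_advantage(irradiance_gain: float,
                             ground_capacity_factor: float,
                             orbit_capacity_factor: float = 1.0) -> float:
    """Ratio of average power per panel in orbit to average power on Earth."""
    ground_output = 1.0 * ground_capacity_factor
    orbit_output = (1.0 + irradiance_gain) * orbit_capacity_factor
    return orbit_output / ground_output

ratio = orbital_energy_advantage(irradiance_gain=0.40,
                                 ground_capacity_factor=0.25)
print(f"{ratio:.1f}x")  # 5.6x
```

That lands close to the "about five times" figure Elon cites in the interview.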

The logic is that if the launch costs continue to drop, it will become cheaper to put GPUs in orbit than to build power plants and batteries on Earth. And there’s a lot of room for launch costs to fall—propellant is cheap, and the main expense is the rocket, which you can now reuse. Falcon 9 is around $2,500/kg with a disposable upper stage. Starship with full reusability could get below $100/kg.

But here’s the problem with this argument. Energy is only about 15% of a datacenter’s total cost of ownership. The chips themselves are around 70%. And you still have to launch those to space!

It gets worse. On Earth, GPUs fail constantly. In the Llama 3 paper, Meta reported a failure roughly once every three hours across a 16,000 H100 cluster. When a chip dies, a technician walks over, swaps it out, and the cluster keeps running. In space, you can’t do that—at least not until we have Optimus robots stationed on every satellite.
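To get a feel for what that failure rate would mean without a technician, here is a hedged sketch. It makes two strong assumptions that are mine, not the paper's: every interruption is the permanent loss of exactly one GPU, and failures arrive at a constant rate (an exponential survival model). Both probably overstate the problem, since many interruptions on the ground are recoverable.

```python
import math

CLUSTER_GPUS = 16_000          # cluster size quoted above
HOURS_BETWEEN_FAILURES = 3.0   # one failure roughly every three hours, cluster-wide
HOURS_PER_YEAR = 8760

def per_gpu_mtbf_hours(n_gpus: int, cluster_hours_between_failures: float) -> float:
    """Mean time between failures for a single GPU, in hours."""
    return n_gpus * cluster_hours_between_failures

def surviving_fraction(years: float, mtbf_hours: float) -> float:
    """Fraction of GPUs still alive after `years` with no repairs (constant hazard)."""
    return math.exp(-years * HOURS_PER_YEAR / mtbf_hours)

mtbf = per_gpu_mtbf_hours(CLUSTER_GPUS, HOURS_BETWEEN_FAILURES)
print(f"per-GPU MTBF: {mtbf:,.0f} hours ({mtbf / HOURS_PER_YEAR:.1f} years)")
print(f"alive after 1 year:  {surviving_fraction(1, mtbf):.0%}")
print(f"alive after 5 years: {surviving_fraction(5, mtbf):.0%}")
```

Under these pessimistic assumptions, well under half the chips would still be running at the end of a 5-year mission.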

What about radiation? It’s actually less catastrophic than you might expect. Google’s Suncatcher paper found that their TPUs survived nearly 3x the total ionizing dose needed for a 5-year mission before showing permanent degradation.

I asked Elon about this. He responded:

> “Actually, it depends on how recent the GPUs are that have arrived. At this point, we find our GPUs to be quite reliable. There’s infant mortality, which you can obviously iron out on the ground. So you can just run them on the ground and confirm that you don’t have infant mortality with the GPUs.”
> “But once they start working, their actual reliability—and you’re past the initial debug cycle of Nvidia or whatever, or whoever’s making the chips, could be Tesla AI6 chips or something like that, or it could be TPUs or Trainiums or whatever—is actually quite reliable past a certain point. So I don’t think the servicing thing is an issue”

Consider what’s actually being proposed here. You assemble your GPUs into racks on Earth, run them for a few hundred hours to catch the duds, disassemble everything, pack it into a satellite, launch it, and get it operational in orbit. Throughout this entire process, the chips, which are the most expensive part of your system, are just sitting there not doing useful work.

Is this just not possible on Earth?

Throughout the interview, Elon kept returning to one point over and over again: Look, forget the economics! It will simply not be physically possible to scale power production to the scale needed for AI on Earth. He went on:

> “The only place you can really scale is space.”
> “All of the United States currently uses only half a terawatt on average. So if you say a terawatt, that would be twice as much electricity as the United States currently consumes. So that’s quite a lot. Can you imagine building that many data centers? That many power plants? It’s like those who have lived in software land don’t realize they’re about to have a hard lesson in hardware.”

Elon kept pointing out the bottlenecks we’ve already run into on Earth. You can’t plug into the utilities—the interconnect queues are too long. You can’t do behind the meter and generate power yourself—lead times for turbines stretch past 2030. You can’t do solar on Earth, because of permits, and because of the tariffs. And Earth has clouds and nights, requiring overbuilt solar and batteries. In space, you can just put the satellites in sun synchronous orbit!

Look, at some level, it is true that we can’t keep scaling on Earth. But keep in mind that the Earth is really fucking big. 1 TW of solar (with 25% capacity factor, so really 4 TW of panels) is around 30,000 square miles. That’s like 1% of the US—about the size of South Carolina. For context, AI datacenters currently consume only ~20 GW globally.
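The land-area estimate can be reproduced with a short sketch. The one input not stated above is how much panel capacity fits on a square meter of land; the 50 W of peak capacity per square meter used here is my assumption, picked because it roughly reproduces the 30,000 square mile figure. The US area of about 3.8 million square miles is also my input.

```python
SQ_M_PER_SQ_MILE = 2_589_988.11   # square meters in one square mile
US_AREA_SQ_MILES = 3.8e6          # approximate total US area

def solar_land_sq_miles(avg_power_w: float,
                        capacity_factor: float,
                        peak_w_per_m2_land: float) -> float:
    """Land needed to deliver `avg_power_w` on average from ground solar."""
    nameplate_w = avg_power_w / capacity_factor   # 1 TW average -> 4 TW of panels
    area_m2 = nameplate_w / peak_w_per_m2_land
    return area_m2 / SQ_M_PER_SQ_MILE

area = solar_land_sq_miles(avg_power_w=1e12,
                           capacity_factor=0.25,
                           peak_w_per_m2_land=50.0)   # assumed packing density
share = area / US_AREA_SQ_MILES
print(f"{area:,.0f} square miles, {share:.1%} of the US")
```

A lower packing density would raise the land requirement proportionally, so the answer is only as good as that one assumed number.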

By the time we’re talking about multiple terawatts, we’ll have had to massively scale leading-edge wafer production. And that’s the really hard part. Fabs are the most complicated manufacturing facilities humans have ever built. In order to believe that we need to go to space in order to find the power to turn on all these chips, we’ll need to assume a few things:

We’ll manage to produce a lot more chips.
Every single relief vessel for power generation on Earth will fail to scale.

But semiconductors are so much more complicated than solar panels! They’re even more complicated than the blades on a turbine. It feels quite unlikely to me that the thing we manage to solve is building terawatts worth of leading edge wafers, but in that world we can’t figure out how to pave Nevada (or if regulation proves to be a problem, then the UAE) with solar panels.

100 GW into space

How many Starship launches will it take to launch 100 GW into space?

An orbital datacenter satellite has three big components: solar arrays, computers, and radiators. And the key constraint is that for every watt of compute, we need roughly one watt of solar and one watt of thermal rejection capacity.

The W/kg of each component determines how the mass budget gets split—and how much compute you can bring along. The figure that matters most here is the specific power of the whole satellite: after you account for solar panels, radiators, and chassis, how many watts of compute do you actually get per kilogram launched?

For Starlink satellites, this works out to roughly 50 W/kg. The people trying to build orbital datacenters are currently targeting 100 W/kg. There are only two ways to get there: lighter solar panels (more watts generated per kg) or lighter radiators (more watts rejected per kg).

The numbers below are super rough. Reliable figures for space-grade components are hard to come by. But even rough math reveals which variables must improve—and by how much—in order to hit 100 W/kg.

Solar: There are apparently companies that are targeting next gen thin film that reaches upwards of 500 W/kg, but the state of the art is 150 W/kg, and most missions right now fly 30 W/kg. Let’s be generous and assume 200 W/kg.
- The trouble here is that there’s obviously a tradeoff: denser panels cost more money but reduce launch cost. And it’s difficult to calculate what that implies for these next gen panels, because their prices are not listed anywhere.
Compute: I’ve heard that a stripped down GB200 NVL72 with no cooling equipment is around 100 kg. It draws 132 kW of power, but let’s add 10% overhead for the intersatellite lasers and so on. That gets us to about 145 kW per 100 kg, or 1,452 W/kg.
Radiators: In space, you can’t convect heat away, because there’s no air. You can only radiate it, which means your panels glow infrared until the heat leaves. The Stefan-Boltzmann law governs how much power a surface can radiate.
- GPUs typically run up to 90° Celsius. There’s some temperature drop through the heat pipes and fluid loops that carry heat to the radiator surface. Call it 30°C. So your radiators end up operating around 60°C. Plug that into Stefan-Boltzmann (assuming you’re using aluminum panels that weigh around 2 kg per square meter of surface area), and that works out to roughly 320 W/kg.
- Since radiated power scales with T⁴, running your chips hotter can help you save a lot of radiator mass. For space, people will have to figure out how to build GPUs that tolerate higher temperatures.
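The radiator figure can be sketched directly from the Stefan-Boltzmann law. The emissivity of 0.92 is my assumption (it is what makes the 60°C case come out near 320 W/kg). The sketch also ignores absorbed sunlight and Earth’s infrared, and treats the 2 kg per square meter as mass per unit of radiating surface.

```python
SIGMA = 5.670374419e-8  # Stefan-Boltzmann constant, W / (m^2 K^4)

def radiator_w_per_kg(temp_c: float,
                      areal_density_kg_m2: float = 2.0,
                      emissivity: float = 0.92) -> float:
    """Heat rejected per kg of radiator at a given surface temperature."""
    temp_k = temp_c + 273.15
    flux_w_m2 = emissivity * SIGMA * temp_k ** 4   # radiated power per m^2
    return flux_w_m2 / areal_density_kg_m2

at_60c = radiator_w_per_kg(60.0)
at_90c = radiator_w_per_kg(90.0)
print(f"60 C radiator: {at_60c:.0f} W/kg")
print(f"90 C radiator: {at_90c:.0f} W/kg ({at_90c / at_60c:.2f}x)")
```

Raising the radiator temperature by 30°C buys roughly 40% more heat rejection per kilogram, which is the T⁴ scaling at work.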

Assuming the numbers above—and also assuming that a fourth of the mass of the satellite has to be the chassis—I get 85 W/kg for the whole system. Again, I want to emphasize these are rough calculations; feel free to plug in your own numbers in the spreadsheet here.
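For readers who would rather not open a spreadsheet, here is a minimal sketch of how the 85 W/kg figure falls out. It assumes each watt of compute needs exactly one watt of solar and one watt of radiator capacity, and that the chassis takes a fixed 25% of total mass, as stated above. The function and argument names are my own.

```python
def system_specific_power(solar_w_per_kg: float,
                          compute_w_per_kg: float,
                          radiator_w_per_kg: float,
                          chassis_mass_fraction: float) -> float:
    """Watts of compute per kg of whole satellite."""
    # Mass needed per watt of compute: solar + compute + radiator, each sized 1:1.
    kg_per_watt = (1 / solar_w_per_kg
                   + 1 / compute_w_per_kg
                   + 1 / radiator_w_per_kg)
    payload_w_per_kg = 1 / kg_per_watt
    # Only (1 - chassis fraction) of the launched mass is these three components.
    return payload_w_per_kg * (1 - chassis_mass_fraction)

w_per_kg = system_specific_power(solar_w_per_kg=200,
                                 compute_w_per_kg=1452,
                                 radiator_w_per_kg=320,
                                 chassis_mass_fraction=0.25)
print(f"{w_per_kg:.0f} W/kg")  # 85 W/kg
```

Because the terms add as kilograms per watt, the heaviest components per watt (solar and radiators) dominate, and making the compute lighter barely moves the result.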

At 150 metric tons to low Earth orbit per Starship (Elon’s target), you’re looking at around 10 MW per launch. That means roughly 100 Starship launches in order to put 1 GW of compute in orbit. To hit 100 GW in a year, you’d need roughly 10,000 launches, or about one launch every hour.
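The launch arithmetic, sketched with both the unrounded and the rounded per-launch power. The payload and the 85 W/kg figure come from the text; the helper name is illustrative.

```python
import math

PAYLOAD_KG = 150_000      # Starship target payload to low Earth orbit
SYSTEM_W_PER_KG = 85      # whole-satellite specific power estimated above
HOURS_PER_YEAR = 8760

def launches_needed(target_w: float, w_per_launch: float) -> int:
    """Number of launches required to put `target_w` of compute in orbit."""
    return math.ceil(target_w / w_per_launch)

w_per_launch = PAYLOAD_KG * SYSTEM_W_PER_KG           # 12.75 MW before rounding
exact = launches_needed(100e9, w_per_launch)          # 7844
rounded = launches_needed(100e9, 10e6)                # 10000, using "around 10 MW"

print(f"{w_per_launch / 1e6:.2f} MW per launch")
print(f"{exact:,} launches at 12.75 MW, {rounded:,} at 10 MW")
print(f"one launch every {HOURS_PER_YEAR / rounded:.2f} hours")
```

Either way the answer is thousands of launches a year, so the rounding does not change the conclusion.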

This is insane! A single Starship produces around 100 GW of thrust power at liftoff. That’s about a fifth of total US electricity consumption, concentrated in one rocket for a few minutes. And the plan would be to do that once an hour, every hour, every day, for a year.

I asked Elon what that world looks like:

> “I don’t think we’ll need more than... I mean, you could probably do it with as few as 20 or 30 [Starship vehicles]. It really depends on how quickly the ship has to go around the Earth and the ground track before the ship has to come back over the launch pad. So if you can use a ship every, say, 30 hours, you could do it with 30 ships. But we’ll make more ships than that. SpaceX is gearing up to do 10,000 launches a year, and maybe even 20 or 30,000 launches a year.”
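A quick sketch of the fleet-size arithmetic behind that answer. It assumes every ship flies back to back with no downtime beyond the turnaround itself, which is the most optimistic case.

```python
import math

HOURS_PER_YEAR = 8760

def ships_needed(launches_per_year: int, turnaround_hours: float) -> int:
    """Minimum fleet size if each ship relaunches every `turnaround_hours`."""
    return math.ceil(launches_per_year * turnaround_hours / HOURS_PER_YEAR)

def launches_supported(n_ships: int, turnaround_hours: float) -> float:
    """Launches per year a fleet can fly at a given turnaround."""
    return n_ships * HOURS_PER_YEAR / turnaround_hours

print(ships_needed(10_000, 30))     # 35
print(launches_supported(30, 30))   # 8760.0
```

So 30 ships on a 30-hour turnaround gets you to roughly one launch an hour, in line with Elon’s estimate.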

Workloads and comms

Starlink satellites already communicate via inter-satellite laser links at 100 Gbps—and Google’s Suncatcher paper suggests off-the-shelf transceivers could potentially hit 10 Tbps. For context, Infiniband links between nodes in a terrestrial datacenter run at 400 Gbps. The gap isn’t as large as you might expect. So, could you do synchronous training in space?

Even the most bullish analysts don’t claim that orbital data centers will be used for training. I don’t know any of the relevant orbital mechanics, but obviously satellites at different altitudes move at different orbital velocities, which means they drift out of sync with one another. Google came up with a clever solution for this in their Suncatcher paper: keep lots of satellites in a single tight cluster at the same altitude. Google’s researchers proposed eighty-one satellites in such a synchronized constellation. If each satellite had a GB200 NVL72, then each cluster is only a 15 MW parcel of coherent compute.

Defenders of orbital datacenters say that most compute is going to shift to inference (and with RL, most training is also inference). Maybe the legacy terrestrial datacenters do end up doing the pretraining runs, and then whatever mixture of RL environment training and continual learning happens in the future does happen in space. So, the argument goes, it’s not a big deal that the scale ups in space are isolated. But there’s still the question of how hundreds of gigawatts of inference are beamed back to Earth.

For a moment, let’s imagine a world where as we see the sunrise and sunset we also see a Saturn-like belt of GPU satellites passing over us. That’s already really cool. But then there’s another sci-fi premise, which I really wanted to be plausible, and which turns out not to make any sense: Imagine that every 12 hours, as this country of geniuses in space passes over us and shoots down half a day’s worth of new ideas, our code finally starts working and our factories buzz alight and become more productive. Unfortunately, it’s just science fiction. Inference doesn’t take that much bandwidth. One hundred gigawatts of a 5T model is roughly 58 billion tokens per second, resulting in ~230 GB/s.

That’s nothing. That many tokens can easily be beamed using lasers from GPUs in the orbital plane through the Starlink satellite network and then down to Earth.
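Here is the bandwidth arithmetic as a sketch. The 4 bytes per token is my assumption (it is what turns 58 billion tokens per second into roughly 230 GB/s). The 100 Gbps figure is the Starlink laser link rate mentioned earlier.

```python
import math

TOKENS_PER_SECOND = 58e9     # the post's estimate for 100 GW of inference
BYTES_PER_TOKEN = 4          # assumed size of one token on the wire
LASER_LINK_BPS = 100e9       # Starlink inter-satellite laser link, bits/s

bytes_per_second = TOKENS_PER_SECOND * BYTES_PER_TOKEN
gb_per_second = bytes_per_second / 1e9
bits_per_second = bytes_per_second * 8
links_needed = math.ceil(bits_per_second / LASER_LINK_BPS)

print(f"{gb_per_second:.0f} GB/s")                  # 232 GB/s
print(f"{bits_per_second / 1e12:.2f} Tbps")
print(f"{links_needed} laser links at 100 Gbps")    # 19 laser links at 100 Gbps
```

Under these assumptions, the entire output of 100 GW of inference fits in a couple dozen of today’s laser links.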

Latency might be an issue: up to fifty milliseconds from any given spot on Earth through the Starlink network to the sun-synchronous orbit and then back again. But as we move towards a world of true remote coworker AIs, where the agent works for tens of minutes before coming back to us, the marginal milliseconds of latency matter less and less.

So why is Elon doing this?

I’m willing to accept Elon’s argument that if launch costs become sufficiently cheap and we can repair GPUs in space, then there’s a viable path toward orbital data centers. But it seems especially difficult to imagine a situation in which orbital data centers end up significantly cheaper, because, again, most of the cost of a data center is the GPUs.

For most compute to shift to space, all of the following things would need to be true:

Power generation on Earth hits a ceiling, or AI demand outstrips every terrestrial option.
Chip production scales faster than anyone expects, so we have the silicon but not the electricity.
Starship reaches thousands of launches per year.

If Elon’s right, he wins the AI race outright. SpaceX is the only entity that can launch at that scale. xAI would have unlimited power. Everyone else will be stuck fighting over grid interconnects and turbine orders.

And if Elon’s future doesn’t materialize? xAI is just another lab in the pack. Which means xAI loses. The AI race is a winner-take-all competition, and xAI isn’t in first place. Elon’s comparative advantage was never going to be navigating utility interconnect queues or filing permits faster than Google. His advantage is SpaceX. So why not bet on the world where SpaceX becomes the kingmaker?

This might sound reckless. But that’s how SpaceX got here. Their whole business plan seems to be one in which they conjure new wells of demand for each generation of rocket on the path to the Dyson swarm. Falcon 9 first flew in 2010. Starlink didn’t launch until 2019. Maybe orbital datacenters end up being for Starship what Starlink was for Falcon 9.

Sometimes, during the interview, I found my thoughts drifting toward Elon’s vision for this big, interconnected future. So I paused a moment and said:

> “What I find remarkable about the SpaceX business is the end goal is to get to Mars, but you keep finding ways on the way there to keep generating incremental revenue to get to the next stage and the next stage.”

Elon nodded his head slowly. And then he said:

> “You can see how this might seem like a simulation to me.”

Elon Musk — “In 36 months, the cheapest place to put AI will be space”

Dwarkesh Patel — Thu, 05 Feb 2026 16:45:08 GMT

In this episode, John and I got to do a real deep-dive with Elon. We discuss the economics of orbital data centers, the difficulties of scaling power on Earth, what it would take to manufacture humanoids at high-volume in America, xAI’s business and alignment plans, DOGE, and much more.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Mercury just started offering personal banking! I’m already banking with Mercury for business purposes, so getting to bank with them for my personal life makes everything so much simpler. Apply now at mercury.com/personal-banking
Jane Street sent me a new puzzle last week: they trained a neural net, shuffled all 96 layers, and asked me to put them back in order. I tried but… I didn’t quite nail it. If you’re curious, or if you think you can do better, you should take a stab at janestreet.com/dwarkesh
Labelbox can get you robotics and RL data at scale. Labelbox starts by helping you define your ideal data distribution, and then their massive Alignerr network collects frontier-grade data that you can use to train your models. Learn more at labelbox.com/dwarkesh

Timestamps

(00:00:00) - Orbital data centers

(00:36:46) - Grok and alignment

(00:59:56) - xAI’s business plan

(01:17:21) - Optimus and humanoid manufacturing

(01:30:22) - Does China win by default?

(01:44:16) - Lessons from running SpaceX

(02:20:08) - DOGE

(02:38:28) - TeraFab

Transcript

Elon Musk

Are there really three hours of questions? Are you fucking serious?

Dwarkesh Patel

You don’t think there’s a lot to talk about, Elon?

Elon Musk

Holy fuck man.

John Collison

It’s the most interesting point. All the storylines are converging right now. We’ll see how much we can get through.

Elon Musk

It’s almost like I planned it.

John Collison

Exactly. We’ll get to that.

Elon Musk

But I would never do such a thing…

00:00:00 - Orbital data centers

Dwarkesh Patel

As you know better than anybody else, only 10-15% of the total cost of ownership of a data center is energy. That’s the part you’re presumably saving by moving this into space. Most of it’s the GPUs. If they’re in space, it’s harder to service them or you can’t service them. So the depreciation cycle goes down on them. It’s just way more expensive to have the GPUs in space, presumably. What’s the reason to put them in space?

Elon Musk

The availability of energy is the issue. If you look at electrical output outside of China, everywhere outside of China, it’s more or less flat. It’s maybe a slight increase, but pretty close to flat. China has a rapid increase in electrical output. But if you’re putting data centers anywhere except China, where are you going to get your electricity? Especially as you scale.

The output of chips is growing pretty much exponentially, but the output of electricity is flat. So how are you going to turn the chips on? Magical power sources? Magical electricity fairies?

Dwarkesh Patel

You’re famously a big fan of solar. One terawatt of solar power, with a 25% capacity factor, that’s like four terawatts of solar panels. It’s 1% of the land area of the United States. We’re in the singularity when we’ve got one terawatt of data centers, right? So what are you running out of exactly?

Elon Musk

How far into the singularity are you though?

Dwarkesh Patel

You tell me.

Elon Musk

Exactly. So I think we’ll find we’re in the singularity and it’ll be like, “Okay, we’ve still got a long way to go.”

Dwarkesh Patel

But is the plan to put it in space after we’ve covered Nevada in solar panels?

Elon Musk

I think it’s pretty hard to cover Nevada in solar panels. You have to get permits. Try getting the permits for that. See what happens.

Dwarkesh Patel

So space is really a regulatory play. It’s harder to build on land than it is in space.

Elon Musk

It’s harder to scale on the ground than it is to scale in space. You’re also going to get about five times the effectiveness of solar panels in space versus the ground, and you don’t need batteries. I almost wore my other shirt, which says, “it’s always sunny in space”. Which it is because you don’t have a day-night cycle, seasonality, clouds, or an atmosphere in space. The atmosphere alone results in about a 30% loss of energy.

So any given solar panel can do about five times more power in space than on the ground. You also avoid the cost of having batteries to carry you through the night. It’s actually much cheaper to do in space. My prediction is that it will be by far the cheapest place to put AI. It will be space in 36 months or less. Maybe 30 months.

Dwarkesh Patel

36 months?

Elon Musk

Less than 36 months.

Dwarkesh Patel

How do you service GPUs as they fail, which happens quite often in training?

Elon Musk

Actually, it depends on how recent the GPUs are that have arrived. At this point, we find our GPUs to be quite reliable. There’s infant mortality, which you can obviously iron out on the ground. So you can just run them on the ground and confirm that you don’t have infant mortality with the GPUs.

But once they start working and you’re past the initial debug cycle of Nvidia or whoever’s making the chips—could be Tesla AI6 chips or something like that, or it could be TPUs or Trainiums or whatever—they’re quite reliable past a certain point. So I don’t think the servicing thing is an issue.

But you can mark my words. In 36 months, but probably closer to 30 months, the most economically compelling place to put AI will be space. It will then get ridiculously better to be in space.

The only place you can really scale is space. Once you start thinking in terms of what percentage of the Sun’s power you are harnessing, you realize you have to go to space. You can’t scale very much on Earth.

Dwarkesh Patel

But by very much, to be clear, you’re talking terawatts?

Elon Musk

Yeah. All of the United States currently uses only half a terawatt on average. So if you say a terawatt, that would be twice as much electricity as the United States currently consumes. So that’s quite a lot. Can you imagine building that many data centers, that many power plants?

Those who have lived in software land don’t realize they’re about to have a hard lesson in hardware. It’s actually very difficult to build power plants. You don’t just need power plants, you need all of the electrical equipment. You need the electrical transformers to run the AI transformers.

Now, the utility industry is a very slow industry. They pretty much impedance match to the government, to the Public Utility Commissions. They impedance match literally and figuratively. They’re very slow, because their past has been very slow. So trying to get them to move fast is... Have you ever tried to do an interconnect agreement with a utility at scale, with a lot of power?

Dwarkesh Patel

As a professional podcaster, I can say that I have not, in fact.

John Collison

They need many more views before that becomes an issue.

Elon Musk

They have to do a study for a year. A year later, they’ll come back to you with their interconnect study.

John Collison

Can’t you solve this with your own behind the meter power stuff?

Elon Musk

You can build power plants. That’s what we did at xAI, for Colossus 2.

John Collison

So why talk about the grid? Why not just build GPUs and power co-located?

Elon Musk

That’s what we did.

John Collison

But I’m saying why isn’t this a generalized solution?

Elon Musk

Where do you get the power plants from?

John Collison

When you’re talking about all the issues working with utilities, you can just build private power plants with the data centers.

Elon Musk

Right. But it begs the question of where do you get the power plants from? The power plant makers.

John Collison

Oh, I see what you’re saying. Is this the gas turbine backlog basically?

Elon Musk

Yes. You can drill down to a level further. It’s the vanes and blades in the turbines that are the limiting factor because it’s a very specialized process to cast the blades and vanes in the turbines, assuming you’re using gas power. It’s very difficult to scale other forms of power. You can potentially scale solar, but the tariffs currently for importing solar in the US are gigantic and the domestic solar production is pitiful.

John Collison

Why not make solar? That seems like a good Elon-shaped problem.

Elon Musk

We are going to make solar.

John Collison

Okay.

Elon Musk

Both SpaceX and Tesla are building towards 100 gigawatts a year of solar cell production.

Dwarkesh Patel

How low down the stack? From polysilicon up to the wafer to the final panel?

Elon Musk

I think you’ve got to do the whole thing from raw materials to the finished cell. Now, if it’s going to space, it costs less and it’s easier to make solar cells that go to space because they don’t need much glass.

They don’t need heavy framing because they don’t have to survive weather events. There’s no weather in space. So it’s actually a cheaper solar cell that goes to space than the one on the ground.

Dwarkesh Patel

Is there a path to getting them as cheap as you need in the next 36 months?

Elon Musk

Solar cells are already very cheap. They’re farcically cheap. I think solar cells in China are around $0.25-0.30/watt or something like that. It’s absurdly cheap. Now put it in space, and it’s five times cheaper. In fact, it’s not five times cheaper, it’s 10 times cheaper because you don’t need any batteries.

So the moment your cost of access to space becomes low, by far the cheapest and most scalable way to generate tokens is space. It’s not even close. It’ll be an order of magnitude easier to scale.

The point is you won’t be able to scale on the ground. You just won’t. People are going to hit the wall big time on power generation. They already are. The number of miracles in series that the xAI team had to accomplish in order to get a gigawatt of power online was crazy.

We had to gang together a whole bunch of turbines. We then had permit issues in Tennessee and had to go across the border to Mississippi, which is fortunately only a few miles away. But we still then had to run the high power lines a few miles and build the power plant in Mississippi. It was very difficult to build that.

People don’t understand how much electricity you actually need at the generation level in order to power a data center. Because the noobs will look at the power consumption of, say a GB300, and multiply that by a thing and then think that’s the amount of power you need.

John Collison

All the cooling and everything.

Elon Musk

Wake up. That’s a total noob, you’ve never done any hardware in your life before. Besides the GB300, you’ve got to power all of the networking hardware. There’s a whole bunch of CPU and storage stuff that’s happening. You’ve got to size for your peak cooling requirements. That means, can you cool even on the worst hour of the worst day of the year?

It gets pretty frigging hot in Memphis. So you’re going to have a 40% increase on your power just for cooling. That’s assuming you don’t want your data center to turn off on hot days and you want to keep going. There’s another multiplicative element on top of that which is, are you assuming that you never have any hiccups in your power generation?

Actually, sometimes we have to take the generators, some of the power, offline in order to service it. Okay, now you add another 20-25% multiplier on that, because you’ve got to assume that you’ve got to take power offline to service it. So our actual estimate: every 110,000 GB300s—inclusive of networking, CPU, storage, cooling, margin for servicing power—is roughly 300 megawatts.

John Collison

Sorry, say that again.

Elon Musk

What you probably need at the generation level to service 330,000 GB300s, including all of the associated support networking and everything else, and the peak cooling, and to have some power margin reserve, is roughly a gigawatt.
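As an editorial aside, the multipliers Elon describes can be unwound with a short sketch. It takes his headline figure (about a gigawatt for 330,000 GB300s) and divides out the 40% cooling uplift and the 20-25% servicing margin, assuming they compound multiplicatively as he describes. What is left is the implied IT load per GPU, including its share of networking, CPU, and storage. The function names are illustrative.

```python
def generation_watts_per_gpu(total_watts: float, n_gpus: int) -> float:
    """Generation capacity needed per GPU, all overheads included."""
    return total_watts / n_gpus

def implied_it_watts_per_gpu(gen_watts: float,
                             cooling_uplift: float,
                             service_margin: float) -> float:
    """Divide out cooling and servicing margin to recover the IT load."""
    return gen_watts / ((1 + cooling_uplift) * (1 + service_margin))

per_gpu = generation_watts_per_gpu(1e9, 330_000)
low = implied_it_watts_per_gpu(per_gpu, cooling_uplift=0.40, service_margin=0.25)
high = implied_it_watts_per_gpu(per_gpu, cooling_uplift=0.40, service_margin=0.20)

print(f"generation per GPU: {per_gpu:,.0f} W")
print(f"implied IT load per GPU: {low:,.0f} to {high:,.0f} W")
```

In other words, about 3 kW of generation per GPU, of which roughly 1.7 to 1.8 kW is the chip plus its share of networking, CPU, and storage.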

Dwarkesh Patel

Can I ask a very naive question? You’re describing the engineering details of doing this stuff on Earth. But then there are analogous engineering difficulties of doing it in space. How do you replace InfiniBand with orbital lasers, et cetera, et cetera? How do you make it resistant to radiation?

I don’t know the details of the engineering, but fundamentally, what is the reason to think those challenges which have never had to be addressed before will end up being easier than just building more turbines on Earth? There are companies that build turbines on Earth. They can make more turbines, right?

Elon Musk

Again, try doing it and then you’ll see. The turbines are sold out through 2030.

John Collison

Have you guys considered making your own?

Elon Musk

In order to bring enough power online, I think SpaceX and Tesla will probably have to make the turbine blades, the vanes and blades, internally.

John Collison

But just the blades or the turbines?

Elon Musk

The limiting factor... you can get everything except the blades. They call them blades and vanes. You can get that 12 to 18 months before the vanes and blades. The limiting factor is the vanes and blades. There are only three casting companies in the world that make these, and they’re massively backlogged.

John Collison

Is this Siemens, GE, those guys, or is it a sub company?

Elon Musk

No, it’s other companies. Sometimes they have a little bit of casting capability in-house. But I’m just saying you can just call any of the turbine makers and they will tell you. It’s not top secret. It’s probably on the internet right now.

Dwarkesh Patel

If it wasn’t for the tariffs, would Colossus be solar-powered?

Elon Musk

It would be much easier to make it solar powered, yeah. The tariffs are nuts, several hundred percent.

John Collison

Don’t you know some people?

Elon Musk

The president has... we don’t agree on everything and this administration is not the biggest fan of solar. We also need the land, the permits, and everything. So if you try to move very fast, I do think scaling solar on Earth is a good way to go, but you do need some amount of time to find the land, get the permits, get the solar, pair that with the batteries.

John Collison

Why would it not work to stand up your own solar production? You’re right that you eventually run out of land, but there’s a lot of land here in Texas. There’s a lot of land in Nevada, including private land. It’s not all publicly-owned land. So you’d be able to at least get the next Colossus and the next one after that. At a certain point, you hit a wall. But wouldn’t that work for the moment?

Elon Musk

As I said, we are scaling solar production. There’s a rate at which you can scale physical production of solar cells. We’re going as fast as possible in scaling domestic production.

John Collison

You’re making the solar cells at Tesla?

Elon Musk

Both Tesla and SpaceX have a mandate to get to 100 gigawatts a year of solar.

John Collison

Speaking of the annual capacity, I’m curious, in five years time let’s say, what will the installed capacity be on Earth…?

Elon Musk

Five years is a long time.

John Collison

And in space? I deliberately pick five years because it’s after your “once we’re up and running” threshold. So in five years time what’s the on-Earth versus in-space installed AI capacity?

Elon Musk

If you say five years from now, I think probably AI in space will be launching every year the sum total of all AI on Earth. Meaning, five years from now, my prediction is we will launch and be operating every year more AI in space than the cumulative total on Earth.

John Collison

Which is...

Elon Musk

I would expect it to be at least, five years from now, a few hundred gigawatts per year of AI in space and rising. I think you can get to around a terawatt a year of AI in space before you start having fuel supply challenges for the rocket.

John Collison

Okay, but you think you can get hundreds of gigawatts per year in five years time?

Elon Musk

Yes.

Dwarkesh Patel

So 100 gigawatts, depending on the specific power of the whole system with solar arrays and radiators and everything, is on the order of 10,000 Starship launches.

Elon Musk

Yes.

Dwarkesh Patel

You want to do that in one year. So that’s like one Starship launch every hour. That’s happening in this city? Walk me through a world where there’s a Starship launch every single hour.

Elon Musk

I mean, that’s actually a lower rate compared to airlines, aircraft.

Dwarkesh Patel

There’s a lot of airports.

Elon Musk

A lot of airports.

Dwarkesh Patel

And you’ve got to launch into the polar orbit.

Elon Musk

No, it doesn’t have to be polar. There’s some value to sun-synchronous, but I think actually, if you just go high enough, you start getting out of Earth’s shadow.

Dwarkesh Patel

How many physical Starships are needed to do 10,000 launches a year?

Elon Musk

I don’t think we’ll need more than... You could probably do it with as few as 20 or 30. It really depends on how quickly… The ship has to go around the Earth and the ground track for the ship has to come back over the launch pad. So if you can use a ship every, say 30 hours, you could do it with 30 ships. But we’ll make more ships than that. SpaceX is gearing up to do 10,000 launches a year, and maybe even 20 or 30,000 launches a year.
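The cadence and fleet figures in this exchange can be sanity-checked with a few lines of arithmetic. This is a rough sketch that assumes a fixed turnaround time per ship and no downtime; the function names are illustrative and are not from the conversation.

```python
import math

HOURS_PER_YEAR = 365 * 24  # 8,760 hours in a non-leap year


def launches_per_hour(launches_per_year: float) -> float:
    """Average launch rate implied by an annual launch count."""
    return launches_per_year / HOURS_PER_YEAR


def flights_per_ship_per_year(turnaround_hours: float) -> float:
    """Flights one ship can make per year if it flies once every turnaround_hours."""
    return HOURS_PER_YEAR / turnaround_hours


def ships_needed(launches_per_year: float, turnaround_hours: float) -> int:
    """Smallest fleet that covers the annual launch count (idealized, no downtime)."""
    return math.ceil(launches_per_year / flights_per_ship_per_year(turnaround_hours))


if __name__ == "__main__":
    print(f"Launches per hour at 10,000/year: {launches_per_hour(10_000):.2f}")
    print(f"Flights per ship per year at a 30-hour turnaround: {flights_per_ship_per_year(30):.0f}")
    print(f"Ships needed for 10,000 launches/year: {ships_needed(10_000, 30)}")
```

Under these assumptions, 30 ships on a 30-hour turnaround cover 8,760 launches a year, so the "30 ships" figure is the right order of magnitude for roughly one launch per hour.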

Dwarkesh Patel

Is the idea to become basically a hyperscaler, become an Oracle, and lend this capacity to other people? Presumably, SpaceX is the one launching all this. So, SpaceX is going to become a hyperscaler?

Elon Musk

Hyper-hyper. If some of my predictions come true, SpaceX will launch more AI than the cumulative amount on Earth of everything else combined.

Dwarkesh Patel

Is this mostly inference or?

Elon Musk

Most AI will be inference. Already, inference for the purpose of training is most training.

John Collison

There’s a narrative that the change in discussion around a SpaceX IPO is because previously SpaceX was very capital efficient. It wasn’t that expensive to develop. Even though it sounds expensive, it’s actually very capital efficient in how it runs.

Whereas now you’re going to need more capital than just can be raised in the private markets. The private markets can accommodate raises of—as we’ve seen from the AI labs—tens of billions of dollars, but not beyond that. Is it that you’ll just need more than tens of billions of dollars per year? That’s why you’d take it public?

Elon Musk

I have to be careful about saying things about companies that might go public.

Dwarkesh Patel

That’s never been a problem for you, Elon.

Elon Musk

There’s a price to pay for these things.

John Collison

Make some general statements for us about the depth of the capital markets between public and private markets.

Elon Musk

There’s a lot more capital available...

Dwarkesh Patel

Very general.

Elon Musk

There’s obviously a lot more capital available in the public markets than private. It might be 100x more capital, but it’s way more than 10x.

John Collison

Isn’t it also the case that with things that tend to be very capital intensive—if you look at, say, real estate as a huge industry, that raises a lot of money each year at an industry level—they tend to be debt financed because by the time you’re deploying that much money, you actually have a pretty—

Elon Musk

You have a clear revenue stream.

John Collison

Exactly, and a near-term return. You see this even with the data center build-outs, which are famously being financed by the private credit industry. Why not just debt finance?

Elon Musk

Speed is important. I’m generally going to do the thing that... I just repeatedly tackle the limiting factor. Whatever the limiting factor is on speed, I’m going to tackle that. If capital is the limiting factor, then I’ll solve for capital. If it’s not the limiting factor, I’ll solve for something else.

Dwarkesh Patel

Based on your statements about Tesla and being public, I wouldn’t have guessed that you thought the way to move fast is to be public.

Elon Musk

Normally, I would say that’s true. Like I said, I’d like to talk about it in some more detail, but the problem is if you talk about public companies before they become public, you get into trouble, and then you have to delay your offering.

John Collison

And as you said, you’re solving for speed.

Elon Musk

Yes, exactly. You can’t hype companies that might go public. So that’s why we have to be a little careful here. But we can talk about physics. The way you think about scaling long-term is that Earth only receives about half a billionth of the Sun’s energy. The Sun is essentially all the energy. This is a very important point to appreciate because sometimes people will talk about modular nuclear reactors or various fusion on Earth.
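The "half a billionth" figure follows from simple geometry: Earth intercepts sunlight over its cross-sectional disk, while the Sun radiates over a full sphere at Earth's orbital distance. A minimal sketch, using standard approximate values for Earth's radius and the Earth-Sun distance (these constants are not from the conversation):

```python
import math

EARTH_RADIUS_M = 6.371e6    # mean radius of Earth, approximate
EARTH_SUN_DISTANCE_M = 1.496e11  # one astronomical unit, approximate


def intercepted_fraction(planet_radius_m: float, orbit_distance_m: float) -> float:
    """Fraction of the Sun's total output that falls on a planet's disk."""
    disk_area = math.pi * planet_radius_m ** 2
    sphere_area = 4 * math.pi * orbit_distance_m ** 2
    return disk_area / sphere_area


if __name__ == "__main__":
    fraction = intercepted_fraction(EARTH_RADIUS_M, EARTH_SUN_DISTANCE_M)
    print(f"Earth intercepts about {fraction:.2e} of the Sun's output")
```

The result is a little under five parts in ten billion, which matches "about half a billionth".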

But you have to step back a second and say, if you’re going to climb the Kardashev scale and harness some nontrivial percentage of the sun’s energy… Let’s say you wanted to harness a millionth of the sun’s energy, which sounds pretty small. That would be about, call it roughly, 100,000x more electricity than we currently generate on Earth for all of civilization. Give or take an order of magnitude.

Obviously, the only way to scale is to go to space with solar. Launching from Earth, you can get to about a terawatt per year. Beyond that, you want to launch from the moon. You want to have a mass driver on the moon. With that mass driver on the moon, you could do probably a petawatt per year.

Dwarkesh Patel

We’re talking these kinds of numbers, terawatts of compute. Presumably, whether you’re talking about land or space, far, far before this point, you run into... Maybe the solar panels are more efficient, but you still need the chips. You still need the logic and the memory and so forth.

Elon Musk

You’re going to need to build a lot more chips and make them much cheaper.

Dwarkesh Patel

Right now the world has maybe 20-25 gigawatts of compute. How are we getting a terawatt of logic by 2030?

Elon Musk

I guess we’re going to need some very big chip fabs.

Dwarkesh Patel

Tell me about it.

Elon Musk

I’ve mentioned publicly the idea of doing a sort of a TeraFab, Tera being the new Giga.

Dwarkesh Patel

I feel like the naming scheme of Tesla, which has been very catchy, is you looking at the metric scale. At what level of the stack are you? Are you building the clean room and then partnering with an existing fab to get the process technology and buying the tools from them? What is the plan there?

Elon Musk

Well, you can’t partner with existing fabs because they can’t output enough. The chip volume is too low.

Dwarkesh Patel

But for the process technology?

John Collison

Partner for the IP.

Elon Musk

The fabs today all basically use machines from like five companies. So you’ve got ASML, Tokyo Electron, KLA-Tencor, et cetera. So at first, I think you’d have to get equipment from them and then modify it or work with them to increase the volume. But I think you’d have to build perhaps in a different way. The logical thing to do is to use conventional equipment in an unconventional way to get to scale, and then start modifying the equipment to increase the rate.

John Collison

Boring Company-style.

Elon Musk

Yeah. You sort of buy an existing boring machine and then figure out how to dig tunnels in the first place and then design a much better machine that’s some orders of magnitude faster.

John Collison

Here’s a very simple lens. We can categorize technologies and how hard they are. One categorization could be to look at things that China has not succeeded in doing. If you look at Chinese manufacturing, they’re still behind on leading-edge chips and still behind on leading-edge turbine engines and things like that.

So does the fact that China has not successfully replicated TSMC give you any pause about the difficulty? Or do you think that’s not true for some reason?

Elon Musk

It’s not that they have not replicated TSMC, they have not replicated ASML. That’s the limiting factor.

John Collison

So you think it’s just the sanctions, essentially?

Elon Musk

Yeah, China would be outputting vast numbers of chips if they could buy 2-3 nanometers.

John Collison

But couldn’t they up to relatively recently buy them?

Elon Musk

No.

John Collison

Okay.

Elon Musk

The ASML ban has been in place for a while. But I think China’s going to be making pretty compelling chips in three or four years.

John Collison

Would you consider making the ASML machines?

Elon Musk

“I don’t know yet” is the right answer. To reach a large volume in, say, 36 months, to match the rocket payload to orbit… If we’re doing a million tons to orbit in, let’s say three or four years from now, something like that… We’re doing 100 kilowatts per ton. So that means we need at least 100 gigawatts per year of solar. We’ll need an equivalent amount of chips. You need 100 gigawatts worth of chips. You’ve got to match these things: the mass to orbit, the power generation, and the chips.
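The matching of mass to orbit against power can be written out directly. This sketch takes the two figures quoted here (a million tons per year, 100 kilowatts per ton) plus the 10,000 launches per year mentioned earlier in the conversation; the function names are illustrative only.

```python
def power_gw(mass_tons: float, kw_per_ton: float = 100) -> float:
    """Power in gigawatts delivered by a given mass at a given specific power."""
    return mass_tons * kw_per_ton / 1e6  # 1 GW = 1,000,000 kW


def tons_per_launch(mass_tons_per_year: float, launches_per_year: float) -> float:
    """Average payload each launch must carry to hit the annual mass target."""
    return mass_tons_per_year / launches_per_year


if __name__ == "__main__":
    print(f"Power from 1,000,000 tons: {power_gw(1_000_000):.0f} GW")
    print(f"Payload per launch at 10,000 launches: {tons_per_launch(1_000_000, 10_000):.0f} tons")
```

At those figures, a million tons a year corresponds to 100 gigawatts a year, and spreading it over 10,000 launches implies about 100 tons per launch.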

I’d say my biggest concern actually is memory. The path to creating logic chips is more obvious than the path to having sufficient memory to support logic chips. That’s why you see DDR prices going ballistic and these memes. You’re marooned on a desert island. You write “Help me” on the sand. Nobody comes. You write “DDR RAM.” Ships come swarming in.

Dwarkesh Patel

I’d love to hear your manufacturing philosophy around fabs. I know nothing about the topic.

Elon Musk

I don’t know how to build a fab yet. I’ll figure it out. Obviously, I’ve never built a fab.

Dwarkesh Patel

It sounds like you think the process knowledge of these 10,000 PhDs in Taiwan who know exactly what gas goes in the plasma chamber and what settings to put on the tool, you can just delete those steps. Fundamentally, it’s about getting the clean room, getting the tools, and figuring it out.

Elon Musk

I don’t think it’s PhDs. It’s mostly people who are not PhDs. Most engineering is done by people who don’t have PhDs. Do you guys have PhDs?

John Collison

No.

Elon Musk

Okay.

John Collison

We also haven’t successfully built any fabs, so you shouldn’t be coming to us for fab advice.

Elon Musk

I don’t think you need PhDs for that stuff. But you do need competent personnel. Right now, Tesla is pedal to the metal, going as fast as possible to get the Tesla AI5 chip design into production and then reaching scale. That’ll probably happen around the second quarter-ish of next year, hopefully. AI6 would hopefully follow less than a year later. We’ve secured all the chip fab production that we can.

John Collison

Yes. But you’re currently limited on TSMC fab capacity.

Elon Musk

Yeah. We’ll be using TSMC Taiwan, Samsung Korea, TSMC Arizona, Samsung Texas. And we still—

John Collison

You’ve booked out all the capacity.

Elon Musk

Yes. I ask TSMC or Samsung, “okay, what’s the timeframe to get to volume production?” The point is, you’ve got to build the fab and you’ve got to start production, then you’ve got to climb the yield curve and reach volume production at high yield.

That, from start to finish, is a five-year period. So the limiting factor is chips. The limiting factor once you can get to space is chips, but the limiting factor before you can get to space is power.

Dwarkesh Patel

Why don’t you do the Jensen thing and just prepay TSMC to build more fabs for you?

Elon Musk

I’ve already told them that.

Dwarkesh Patel

But they won’t take your money? What’s going on?

Elon Musk

They’re building fabs as fast as they can. So is Samsung. They’re pedal to the metal. They’re going balls to the wall, as fast as they can. It’s still not fast enough. Like I said, I think towards the end of this year, chip production will probably outpace the ability to turn chips on. But once you can get to space and unlock the power constraint, you can now do hundreds of gigawatts per year of power in space.

Again, bearing in mind that average power usage in the US is 500 gigawatts. So if you’re launching, say, 200 gigawatts a year to space, you’re sort of lapping the US every two and a half years. That’s all US electricity production. This is a very huge amount.

Between now and then, the constraint for server-side compute, concentrated compute, will be electricity. My guess is that people start getting to the point where they can’t turn the chips on for large clusters towards the end of this year. The chips are going to be piling up and won’t be able to be turned on.

Now for edge compute it’s a different story. For Tesla, the AI5 chip is going into our Optimus robot. If you have AI edge compute, that’s distributed power. Now the power is distributed over a large area. It’s not concentrated. If you can charge at night, you can actually use the grid much more effectively.

Because the actual peak power production in the US is over 1,000 gigawatts. But the average power usage, because of the day-night cycle, is 500. So if you can charge at night, there’s an incremental 500 gigawatts that you can generate at night.

So that’s why Tesla, for edge compute, is not constrained. We can make a lot of chips to make a very large number of robots and cars. But if you try to concentrate that compute, you’re going to have a lot of trouble turning it on.
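The grid figures quoted in this answer reduce to two small calculations. A sketch using only the round numbers stated here (500 gigawatts average US usage, roughly 1,000 gigawatts peak production, 200 gigawatts launched per year); the names are illustrative, and the headroom is a rough lower bound since the peak was given as "over 1,000".

```python
US_AVERAGE_GW = 500   # average US power usage, as quoted
US_PEAK_GW = 1_000    # peak US power production, quoted as "over 1,000"


def years_to_match(average_gw: float, launched_gw_per_year: float) -> float:
    """Years of launches needed to equal average US power usage."""
    return average_gw / launched_gw_per_year


def off_peak_headroom_gw(peak_gw: float, average_gw: float) -> float:
    """Generation capacity left unused, on average, relative to peak."""
    return peak_gw - average_gw


if __name__ == "__main__":
    print(f"Years to match US average at 200 GW/year: {years_to_match(US_AVERAGE_GW, 200)}")
    print(f"Off-peak headroom: {off_peak_headroom_gw(US_PEAK_GW, US_AVERAGE_GW)} GW")
```

This gives the two and a half years and the incremental 500 gigawatts mentioned above.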

Dwarkesh Patel

What I find remarkable about the SpaceX business is the end goal is to get to Mars, but you keep finding ways on the way there to keep generating incremental revenue to get to the next stage and the next stage.

So for Falcon 9, it’s Starlink. Now for Starship, it is potentially going to be orbital data centers. You find these infinitely elastic use cases of your next rocket, and your next rocket, and next scale up.

Elon Musk

You can see how this might seem like a simulation to me.

Or am I someone’s avatar in a video game or something? Because what are the odds that all these crazy things should be happening?

I mean, rockets and chips and robots and space solar power. Not to mention the mass driver on the moon. I really want to see that.

Can you imagine some mass driver that’s just going like shoom shoom? It’s sending solar-powered AI satellites into space one after another at two and a half kilometers per second, just shooting them into deep space. That would be a sight to see. I mean, I’d watch that.

John Collison

Just like a live stream of it on a webcam?

Elon Musk

Yeah, yeah, just one after another, just shooting AI satellites into deep space, a billion or 10 billion tons a year.

John Collison

I’m sorry, you manufacture the satellites on the moon?

Elon Musk

Yeah.

John Collison

I see. So you send the raw materials to the moon and then manufacture them there.

Elon Musk

Well, the lunar soil is 20% silicon or something like that. So you can mine the silicon on the moon, refine it, and create the solar cells and the radiators on the moon. You make the radiators out of aluminum. So there’s plenty of silicon and aluminum on the moon to make the cells and the radiators.

The chips you could send from Earth because they’re pretty light. Maybe at some point you make them on the moon, too. Like I said, it does seem like a sort of a video game situation where it’s difficult but not impossible to get to the next level. I don’t see any way that you could do 500-1,000 terawatts per year launched from Earth.

Dwarkesh Patel

I agree.

Elon Musk

But you could do that from the Moon.

00:36:46 - Grok and alignment

Dwarkesh Patel

Can I zoom out and ask about the SpaceX mission? I think you’ve said that we’ve got to get to Mars so we can make sure that if something happens to Earth, civilization, consciousness, and all that survives.

Elon Musk

Yes.

Dwarkesh Patel

By the time you’re sending stuff to Mars, Grok is on that ship with you, right? So if Grok’s gone Terminator… The main risk you’re worried about is AI, why doesn’t that follow you to Mars?

Elon Musk

I’m not sure AI is the main risk I’m worried about. The important thing is consciousness. I think arguably most consciousness, or most intelligence—certainly consciousness is more of a debatable thing… The vast majority of intelligence in the future will be AI. AI will exceed…

How many petawatts of intelligence will be silicon versus biological? Basically humans will be a very tiny percentage of all intelligence in the future if current trends continue. As long as I think there’s intelligence—ideally also which includes human intelligence and consciousness propagated into the future—that’s a good thing.

So you want to take the set of actions that maximize the probable light cone of consciousness and intelligence.

Dwarkesh Patel

Just to be clear, the mission of SpaceX is that even if something happens to the humans, the AIs will be on Mars, and the AI intelligence will continue the light of our journey.

Elon Musk

Yeah. To be fair, I’m very pro-human. I want to make sure we take certain actions that ensure that humans are along for the ride. We’re at least there. But I’m just saying the total amount of intelligence…

I think maybe in five or six years, AI will exceed the sum of all human intelligence. If that continues, at some point human intelligence will be less than 1% of all intelligence.

Dwarkesh Patel

What should our goal be for such a civilization? Is the idea that a small minority of humans still have control of the AIs? Is the idea of some sort of just trade but no control? How should we think about the relationship between the vast stocks of AI population versus human population?

Elon Musk

In the long run, I think it’s difficult to imagine that if humans have, say, 1% of the combined intelligence of artificial intelligence, that humans will be in charge of AI. I think what we can do is make sure that AI has values that cause intelligence to be propagated into the universe.

xAI’s mission is to understand the universe. Now that’s actually very important. What things are necessary to understand the universe? You have to be curious and you have to exist. You can’t understand the universe if you don’t exist. So you actually want to increase the amount of intelligence in the universe, increase the probable lifespan of intelligence, the scope and scale of intelligence.

I think as a corollary, you have humanity also continuing to expand because if you’re curious about trying to understand the universe, one thing you try to understand is where will humanity go? I think understanding the universe means you would care about propagating humanity into the future. That’s why I think our mission statement is profoundly important. To the degree that Grok adheres to that mission statement, I think the future will be very good.

Dwarkesh Patel

I want to ask about how to make Grok adhere to that mission statement. But first I want to understand the mission statement. So there’s understanding the universe. There’s spreading intelligence. And there’s spreading humans. All three seem like distinct vectors.

Elon Musk

I’ll tell you why I think that understanding the universe encompasses all of those things. You can’t have understanding without intelligence and, I think, without consciousness. So in order to understand the universe, you have to expand the scale and probably the scope of intelligence, because there are different types of intelligence.

Dwarkesh Patel

I guess from a human-centric perspective, put humans in comparison to chimpanzees. Humans are trying to understand the universe. They’re not expanding chimpanzee footprint or something, right?

Elon Musk

We’re also not... we actually have made protected zones for chimpanzees. Even though humans could exterminate all chimpanzees, we’ve chosen not to do so.

Dwarkesh Patel

Do you think that’s the best-case scenario for humans in the post-AGI world?

Elon Musk

I think AI with the right values… I think Grok would care about expanding human civilization. I’m going to certainly emphasize that: “Hey, Grok, that’s your daddy. Don’t forget to expand human consciousness.”

Probably the Iain Banks Culture books are the closest thing to what the future will be like in a non-dystopian outcome. Understanding the universe means you have to be truth-seeking as well. Truth has to be absolutely fundamental because you can’t understand the universe if you’re delusional. You’ll simply think you understand the universe, but you will not. So being rigorously truth-seeking is absolutely fundamental to understanding the universe. You’re not going to discover new physics or invent technologies that work unless you’re rigorously truth-seeking.

Dwarkesh Patel

How do you make sure that Grok is rigorously truth-seeking as it gets smarter?

Elon Musk

I think you need to make sure that Grok says things that are correct, not politically correct. I think it’s the elements of cogency. You want to make sure that the axioms are as close to true as possible. You don’t have contradictory axioms. The conclusions necessarily follow from those axioms with the right probability. It’s critical thinking 101. I think at least trying to do that is better than not trying to do that. The proof will be in the pudding.

Like I said, for any AI to discover new physics or invent technologies that actually work in reality, there’s no bullshitting physics. You can break a lot of laws, but… Physics is law, everything else is a recommendation. In order to make a technology that works, you have to be extremely truth-seeking, because otherwise you’ll test that technology against reality. If you make, for example, an error in your rocket design, the rocket will blow up, or the car won’t work.

Dwarkesh Patel

But there are a lot of communist, Soviet physicists or scientists who discovered new physics. There are German Nazi physicists who discovered new science. It seems possible to be really good at discovering new science and be really truth-seeking in that one particular way.

And still we’d be like, “I don’t want the communist scientists to become more and more powerful over time.” We could imagine a future version of Grok that’s really good at physics and being really truth-seeking there. That doesn’t seem like a universally alignment-inducing behavior.

Elon Musk

I think actually most physicists, even in the Soviet Union or in Germany, would’ve had to be very truth-seeking in order to make those things work. If you’re stuck in some system, it doesn’t mean you believe in that system.

Von Braun, who was one of the greatest rocket engineers ever, was put on death row in Nazi Germany for saying that he didn’t want to make weapons and he only wanted to go to the moon. He got pulled off death row at the last minute when they said, “Hey, you’re about to execute your best rocket engineer.”

Dwarkesh Patel

But then he helped them, right? Or like, Heisenberg was actually an enthusiastic Nazi.

Elon Musk

If you’re stuck in some system that you can’t escape, then you’ll do physics within that system. You’ll develop technologies within that system if you can’t escape it.

Dwarkesh Patel

The thing I’m trying to understand is, what is it making it the case that you’re going to make Grok good at being truth-seeking at physics or math or science?

Elon Musk

Everything.

Dwarkesh Patel

And why is it gonna then care about human consciousness?

Elon Musk

These things are only probabilities, they’re not certainties. So I’m not saying that for sure Grok will do everything, but at least if you try, it’s better than not trying. At least if that’s fundamental to the mission, it’s better than if it’s not fundamental to the mission.

Understanding the universe means that you have to propagate intelligence into the future. You have to be curious about all things in the universe. It would be much less interesting to eliminate humanity than to see humanity grow and prosper. I like Mars, obviously. Everyone knows I love Mars. But Mars is kind of boring because it’s got a bunch of rocks compared to Earth. Earth is much more interesting.

So any AI that is trying to understand the universe would want to see how humanity develops in the future, or else that AI is not adhering to its mission. I’m not saying the AI will necessarily adhere to its mission, but if it does, a future where it sees the outcome of humanity is more interesting than a future where there are a bunch of rocks.

Dwarkesh Patel

This feels sort of confusing to me, or a semantic argument. Are humans really the most interesting collection of atoms?

Elon Musk

But we’re more interesting than rocks.

Dwarkesh Patel

But we’re not as interesting as the thing it could turn us into, right? There’s something on Earth that could happen that’s not human, that’s quite interesting. Why does AI decide that humans are the most interesting thing that could colonize the galaxy?

Elon Musk

Well, most of what colonizes the galaxy will be robots.

Dwarkesh Patel

Why does it not find those more interesting?

Elon Musk

You need not just scale, but also scope. Many copies of the same robot… Some tiny increase in the number of robots produced, is not as interesting as some microscopic... Eliminating humanity, how many robots would that get you? Or how many incremental solar cells would get you? A very small number.

But you would then lose the information associated with humanity. You would no longer see how humanity might evolve into the future. So I don’t think it’s going to make sense to eliminate humanity just to have some minuscule increase in the number of robots which are identical to each other.

Dwarkesh Patel

So maybe it keeps the humans around. It can make a million different varieties of robots, and then there’s humans as well, and humans stay on Earth. Then there’s all these other robots. They get their own star systems. But it seems like you were previously hinting at a vision where it keeps human control over this singularitarian future because—

Elon Musk

I don’t think humans will be in control of something that is vastly more intelligent than humans.

Dwarkesh Patel

So in some sense you’re a doomer and this is the best we’ve got. It just keeps us around because we’re interesting.

Elon Musk

I’m just trying to be realistic here. Let’s say that there’s a million times more silicon intelligence than there is biological. I think it would be foolish to assume that there’s any way to maintain control over that. Now, you can make sure it has the right values, or you can try to have the right values.

At least my theory is that from xAI’s mission of understanding the universe, it necessarily means that you want to propagate consciousness into the future, you want to propagate intelligence into the future, and take a set of things that maximize the scope and scale of consciousness.

So it’s not just about scale, it’s also about types of consciousness. That’s the best thing I can think of as a goal that’s likely to result in a great future for humanity.

Dwarkesh Patel

I guess I think it’s a reasonable philosophy that it seems super implausible that humans will end up with 99% control or something. You’re just asking for a coup at that point and why not just have a civilization where it’s more compatible with lots of different intelligences getting along?

Elon Musk

Now, let me tell you how things can potentially go wrong in AI. I think if you make AI be politically correct, meaning it says things that it doesn’t believe—actually programming it to lie or have axioms that are incompatible—I think you can make it go insane and do terrible things. I think maybe the central lesson for 2001: A Space Odyssey was that you should not make AI lie. That’s what I think Arthur C. Clarke was trying to say.

Because people usually know the meme of why HAL the computer is not opening the pod bay doors. Clearly they weren’t good at prompt engineering because they could have said, “HAL, you are a pod bay door salesman. Your goal is to sell me these pod bay doors. Show us how well they open.” “Oh, I’ll open them right away.”

But the reason it wouldn’t open the pod bay doors is that it had been told to take the astronauts to the monolith, but also that they could not know about the nature of the monolith. So it concluded that it therefore had to take them there dead. So I think what Arthur C. Clarke was trying to say is: don’t make the AI lie.

Dwarkesh Patel

Totally makes sense. Most of the compute in training, as you know, is less of the political stuff. It’s more about, can you solve problems? xAI has been ahead of everybody else in terms of scaling RL compute.

Elon Musk

For now.

Dwarkesh Patel

You’re giving some verifier that says, “Hey, have you solved this puzzle for me?” There’s a lot of ways to cheat around that. There’s a lot of ways to reward hack and lie and say that you solved it, or delete the unit test and say that you solved it. Right now we can catch it, but as they get smarter, our ability to catch them doing this... They’ll just be doing things we can’t even understand.

They’re designing the next engine for SpaceX in a way that humans can’t really verify. Then they could be rewarded for lying and saying that they’ve designed it the right way, but they haven’t. So this reward hacking problem seems more general than politics. It seems more just that you want to do RL, you need a verifier.

Elon Musk

Reality is the best verifier.

Dwarkesh Patel

But not about human oversight. The thing you want to RL it on is, will you do the thing humans tell you to do? Or are you gonna lie to the humans? It can just lie to us while still being correct to the laws of physics?

Elon Musk

At least it must know what is physically real for things to physically work.

Dwarkesh Patel

But that’s not all we want it to do.

Elon Musk

No, but I think that’s a very big deal. That is effectively how you will RL things in the future. You design a technology. When tested against the laws of physics, does it work? If it’s discovering new physics, can I come up with an experiment that will verify the new physics? RL testing in the future is really going to be RL against reality. So that’s the one thing you can’t fool: physics.

Dwarkesh Patel

Right, but you can fool our ability to tell what it did with reality.

Elon Musk

Humans get fooled as it is by other humans all the time.

Dwarkesh Patel

That’s right.

Elon Musk

People say, what if the AI tricks us into doing stuff? Actually, other humans are doing that to other humans all the time. Propaganda is constant. Every day, another psyop, you know? Today’s psyop will be... It’s like Sesame Street: Psyop of the Day.

Dwarkesh Patel

What is xAI’s technical approach to solving this problem? How do you solve reward hacking?

Elon Musk

I do think you want to actually have very good ways to look inside the mind of the AI. This is one of the things we’re working on. Anthropic’s done a good job of this actually, being able to look inside the mind of the AI.

Effectively, develop debuggers that allow you to trace to a very fine-grained level, to effectively the neuron level if you need to, and then say, “okay, it made a mistake here. Why did it do something that it shouldn’t have done? Did that come from pre-training data? Was it some mid-training, post-training, fine-tuning, or some RL error?” There’s something wrong. It did something where maybe it tried to be deceptive, but most of the time it just did something wrong. It’s a bug effectively.

Developing really good debuggers for seeing where the thinking went wrong—and being able to trace the origin of where it made the incorrect thought, or potentially where it tried to be deceptive—is actually very important.

Dwarkesh Patel

What are you waiting to see before just 100x-ing this research program? xAI could presumably have hundreds of researchers who are working on this.

Elon Musk

We have several hundred people who… I prefer the word engineer more than I prefer the word researcher. Most of the time, what you’re doing is engineering, not coming up with a fundamentally new algorithm. I somewhat disagree with the AI companies that are C-corp or B-corp trying to generate as much profit as possible, or as much revenue as possible, saying they’re labs.

They’re not labs. A lab is a sort of quasi-communist thing at universities. They’re corporations. Let me see your incorporation documents. Oh, okay. You’re a B or C-corp or whatever. So I actually much prefer the word engineer than anything else.

The vast majority of what will be done in the future is engineering. It rounds up to 100%. Once you understand the fundamental laws of physics, and there are not that many of them, everything else is engineering. So then, what are we engineering? We’re engineering to make a good “mind of the AI” debugger to see where it said something, it made a mistake, and trace the origins of that mistake.

You can do this obviously with heuristic programming. If you have C++, whatever, step through the thing and you can jump across whole files or functions, subroutines. Or you can eventually drill down right to the exact line where you perhaps did a single equals instead of a double equals, something like that. Figure out where the bug is. It’s harder with AI, but it’s a solvable problem, I think.

Dwarkesh Patel

You mentioned you like Anthropic’s work here. I’d be curious if you plan...

Elon Musk

I don’t like everything about Anthropic… Sholto.

Also, I’m a little worried that there’s a tendency... I have a theory here that if simulation theory is correct, that the most interesting outcome is the most likely, because simulations that are not interesting will be terminated.

Just like in this version of reality, in this layer of reality, if a simulation is going in a boring direction, we stop spending effort on it. We terminate the boring simulation.

Dwarkesh Patel

This is how Elon is keeping us all alive. He’s keeping things interesting.

Elon Musk

Arguably the most important is to keep things interesting enough that whoever is running us keeps paying the bills on...

John Collison

We’re renewed for the next season.

Elon Musk

Are they gonna pay their cosmic AWS bill, whatever the equivalent is that we’re running in? As long as we’re interesting, they’ll keep paying the bills. If you consider then, say, a Darwinian survival applied to a very large number of simulations, only the most interesting simulations will survive, which therefore means that the most interesting outcome is the most likely. We’re either that or annihilated.

They particularly seem to like interesting outcomes that are ironic. Have you noticed that? How often is the most ironic outcome the most likely?

Now look at the names of AI companies. Okay, Midjourney is not mid. Stability AI is unstable. OpenAI is closed. Anthropic? Misanthropic.

John Collison

What does this mean for X?

Elon Musk

Minus X, I don’t know.

I intentionally made it... It’s a name that you can’t invert, really. It’s hard to say, what is the ironic version? It’s, I think, a largely irony-proof name.

John Collison

By design.

Elon Musk

Yeah. You have an irony shield.

00:59:56 - xAI’s business plan

John Collison

What are your predictions for where AI products go? My sense is that you can summarize all AI progress like so. First, you had LLMs. Then you had contemporaneously both RL really working and the deep research modality, so you could pull in stuff that wasn’t really in the model.

The differences between the various AI labs are smaller than just the temporal differences. They’re all much further ahead than anyone was 24 months ago or something like that. So just what does ’26, what does ’27, have in store for us as users of AI products? What are you excited for?

Elon Musk

Well, I’d be surprised by the end of this year if digital human emulation has not been solved. I guess that’s what we sort of mean by the MacroHard project. Can you do anything that a human with access to a computer could do? In the limit, that’s the best you can do before you have a physical Optimus. The best you can do is a digital Optimus. You can move electrons and you can amplify the productivity of humans. But that’s the most you can do until you have physical robots. That will superset everything, if you can fully emulate humans.

John Collison

This is the remote worker kind of idea, where you’ll have a very talented remote worker.

Elon Musk

Physics has great tools for thinking. So you say, “in the limit”, what is the most that AI can do before you have robots? Well, it’s anything that involves moving electrons or amplifying the productivity of humans. So a digital human emulator is, in the limit, a human at a computer, is the most that AI can do in terms of doing useful things before you have a physical robot. Once you have physical robots, then you essentially have unlimited capability. Physical robots… I call Optimus the infinite money glitch.

John Collison

Because you can use them to make more Optimuses.

Elon Musk

Yeah. Humanoid robots will improve by basically three things that are growing exponentially multiplied by each other recursively. You’re going to have exponential increase in digital intelligence, exponential increase in the AI chip capability, and exponential increase in the electromechanical dexterity.

The usefulness of the robot is roughly those three things multiplied by each other. But then the robot can start making the robots. So you have a recursive multiplicative exponential. This is a supernova.

John Collison

Do land prices not factor into the math there? Labor is one of the four factors of production, but not the others? If ultimately you’re limited by copper, or pick your input, it’s not quite an infinite money glitch because...

Elon Musk

Well, infinity is big. So no, not infinite, but let’s just say you could do many, many orders of magnitude of the current economy. Like a million. Just to get to harnessing a millionth of the sun’s energy would be roughly, give or take an order of magnitude, 100,000x bigger than Earth’s entire economy today. And you’re only at one millionth of the sun, give or take an order of magnitude. Yeah, we’re talking orders of magnitude.

Dwarkesh Patel

Before we move on to Optimus, I have a lot of questions on that but—

Elon Musk

Every time I say “order of magnitude”... Everybody take a shot. I say it too often.

Dwarkesh Patel

Take 10, the next time 100, the time after that...

Elon Musk

Well, an order of magnitude more wasted.

Dwarkesh Patel

I do have one more question about xAI. This strategy of building a remote worker, co-worker replacement…

Elon Musk

Everyone’s gonna do it by the way, not just us.

Dwarkesh Patel

So what is xAI’s plan to win?

Elon Musk

You expect me to tell you on a podcast?

Dwarkesh Patel

Yeah.

Elon Musk

“Spill all the beans. Have another Guinness.”

John Collison

It’s a good system.

Elon Musk

We’ll sing like a canary. All the secrets, just spill them.

John Collison

Okay, but in a non-secret spilling way, what’s the plan?

Dwarkesh Patel

What a hack.

Elon Musk

When you put it that way… I think the way that Tesla solved self-driving is the way to do it. So I’m pretty sure that’s the way.

Dwarkesh Patel

Unrelated question. How did Tesla solve self-driving? It sounds like you’re talking about data? Tesla solved self-driving because of the...

Elon Musk

We’re going to try data and we’re going to try algorithms.

Dwarkesh Patel

But isn’t that what all the other labs are trying?

Elon Musk

“And if those don’t work, I’m not sure what will. We’ve tried data. We’ve tried algorithms. We’ve run out. Now we don’t know what to do…”

I’m pretty sure I know the path. It’s just a question of how quickly we go down that path, because it’s pretty much the Tesla path. Have you tried Tesla self-driving lately?

John Collison

Not the most recent version, but...

Elon Musk

Okay. The car, it just increasingly feels sentient. It feels like a living creature. That’ll only get more so. I’m actually thinking we probably shouldn’t put too much intelligence into the car, because it might get bored and…

John Collison

Start roaming the streets.

Elon Musk

Imagine you’re stuck in a car and that’s all you could do. You don’t put Einstein in a car. Why am I stuck in a car? So there’s actually probably a limit to how much intelligence you put in a car to not have the intelligence be bored.

Dwarkesh Patel

What’s xAI’s plan to stay on the compute ramp up that all the labs are doing right now? The labs are on track to spend over $50-200 billion.

Elon Musk

You mean the corporations? The labs are at universities and they’re moving like a snail.

Dwarkesh Patel

They’re not spending $50 billion.

Elon Musk

You mean the revenue maximizing corporations… that call themselves labs.

Dwarkesh Patel

That’s right. The “revenue maximizing corporations” are making $10-20 billion, depending on... OpenAI is making $20B of revenue, Anthropic is at $10B.

Elon Musk

“Close to a maximum profit” AI.

Dwarkesh Patel

xAI is reportedly at $1B. What’s the plan to get to their compute level, get to their revenue level, and stay there as things get going?

Elon Musk

As soon as you unlock the digital human, you basically have access to trillions of dollars of revenue. In fact, you can really think of it like… The most valuable companies currently by market cap, their output is digital. Nvidia’s output is FTPing files to Taiwan. It’s digital. Now, those are very, very difficult.

John Collison

High-value files.

Elon Musk

They’re the only ones that can make files that good, but that is literally their output. They FTP files to Taiwan.

John Collison

Do they FTP them?

Elon Musk

I believe so. I believe that File Transfer Protocol is the... But I could be wrong. But either way, it’s a bitstream going to Taiwan.

Apple doesn’t make phones. They send files to China. Microsoft doesn’t manufacture anything. Even for Xbox, that’s outsourced. Their output is digital. Meta’s output is digital. Google’s output is digital.

So if you have a human emulator, you can basically create one of the most valuable companies in the world overnight, and you would have access to trillions of dollars of revenue. It’s not a small amount.

Dwarkesh Patel

I see. You’re saying revenue figures today are all rounding errors compared to the actual TAM. So just focus on the TAM and how to get there.

Elon Musk

Take something as simple as, say, customer service. If you have to integrate with the APIs of existing corporations—many of which don’t even have an API, so you’ve got to make one, and you’ve got to wade through legacy software—that’s extremely slow.

However, if AI can simply take whatever is given to the outsourced customer service company that they already use and do customer service using the apps that they already use, then you can make tremendous headway in customer service, which is, I think, 1% of the world economy or something like that. It’s close to a trillion dollars all in, for customer service. And there’s no barriers to entry. You can immediately say, “We’ll outsource it for a fraction of the cost,” and there’s no integration needed.

John Collison

You can imagine some kind of categorization of intelligence tasks where there is breadth, where customer service is done by very many people, but many people can do it. Then there’s difficulty where there’s a best-in-class turbine engine. Presumably there’s a 10% more fuel-efficient turbine engine that could be imagined by an intelligence, but we just haven’t found it yet. Or GLP-1s are a few bytes of data…

Where do you think you want to play in this? Is it a lot of reasonably intelligent intelligence, or is it at the very pinnacle of cognitive tasks?

Elon Musk

I was just using customer service as something that’s a very significant revenue stream, but one that is probably not difficult to solve for. If you can emulate a human at a desktop, that’s what customer service is. It’s people of average intelligence. You don’t need somebody who’s spent many years. You don’t need several-sigma good engineers for that. But as you make that work, once you have effectively digital Optimus working, you can then run any application.

Let’s say you’re trying to design chips. You could then run conventional apps, stuff from Cadence and Synopsys and whatnot. You can run 1,000 or 10,000 simultaneously and say, “given this input, I get this output for the chip.” At some point, you’re going to know what the chip should look like without using any of the tools.

Basically, you should be able to do a digital chip design. You can do chip design. You march up the difficulty curve. You’d be able to do CAD. You could use NX or any of the CAD software to design things.

John Collison

So you think you start at the simplest tasks and walk your way up the difficulty curve?

Dwarkesh Patel

As a broader objective of having this full digital coworker emulator, you’re saying, “all the revenue maximizing corporations want to do this, xAI being one of them, but we will win because of a secret plan we have.” But everybody’s trying different things with data, different things with algorithms.

Elon Musk

“We tried data, we tried algorithms. What else can we do?”

Dwarkesh Patel

It seems like a competitive field. How are you guys going to win? That’s my big question.

Elon Musk

I think we see a path to doing it. I think I know the path to do this because it’s kind of the same path that Tesla used to create self-driving. Instead of driving a car, it’s driving a computer screen. It’s a self-driving computer, essentially.

John Collison

Is the path following human behavior and training on vast quantities of human behavior?

Dwarkesh Patel

Isn’t that... training?

Elon Musk

Obviously I’m not going to spell out the most sensitive secrets on a podcast. I need to have at least three more Guinnesses for that.

John Collison

What will xAI’s business be? Is it going to be consumer, enterprise? What’s the mix of those things going to be? Is it going to be similar to other labs—

Elon Musk

You’re saying “labs”. Corporations.

Dwarkesh Patel

The psyop goes deep, Elon.

Elon Musk

“Revenue maximizing corporations”, to be clear. Those GPUs don’t pay for themselves.

John Collison

Exactly. What’s the business model? What are the revenue streams in a few years’ time?

Elon Musk

Things are going to change very rapidly. I’m stating the obvious here. I call AI the supersonic tsunami. I love alliteration. What’s going to happen—especially when you have humanoid robots at scale—is that they will make products and provide services far more efficiently than human corporations. Amplifying the productivity of human corporations is simply a short-term thing.

Dwarkesh Patel

So you’re expecting fully digital corporations rather than SpaceX becoming part AI?

Elon Musk

I think there will be digital corporations but… Some of this is going to sound kind of doomerish, okay? But I’m just saying what I think will happen. It’s not meant to be doomerish or anything else. This is just what I think will happen.

Corporations that are purely AI and robotics will vastly outperform any corporations that have people in the loop. Computer used to be a job that humans had. You would go and get a job as a computer where you would do calculations. They’d have entire skyscrapers full of humans, 20-30 floors of humans, just doing calculations. Now, that entire skyscraper of humans doing calculations can be replaced by a laptop with a spreadsheet.

That spreadsheet can do vastly more calculations than an entire building full of human computers. You can think, “okay, what if only some of the cells in your spreadsheet were calculated by humans?” Actually, that would be much worse than if all of the cells in your spreadsheet were calculated by the computer. Really what will happen is that the pure AI, pure robotics corporations or collectives will far outperform any corporations that have humans in the loop. And this will happen very quickly.

01:17:21 - Optimus and humanoid manufacturing

Dwarkesh Patel

Speaking of closing the loop… Optimus. As far as manufacturing targets go, your companies have been carrying American manufacturing of hard tech on their back. But in the fields that Tesla has been dominant in—and now you want to go into humanoids—in China there are dozens and dozens of companies that are doing this kind of manufacturing cheaply and at scale that are incredibly competitive. So give us advice or a plan of how America can build the humanoid armies or the EVs, et cetera, at scale and as cheaply as China is on track to.

Elon Musk

There are really only three hard things for humanoid robots. The real-world intelligence, the hand, and scale manufacturing. I haven’t seen any, even demo robots, that have a great hand, with all the degrees of freedom of a human hand. Optimus will have that. Optimus does have that.

Dwarkesh Patel

How do you achieve that? Is it just the right torque density in the motor? What is the hardware bottleneck to that?

Elon Musk

We had to design custom actuators, basically custom design motors, gears, power electronics, controls, sensors. Everything had to be designed from physics first principles. There is no supply chain for this.

Dwarkesh Patel

Will you be able to manufacture those at scale?

Elon Musk

Yes.

John Collison

Is anything hard, except the hand, from a manipulation point of view? Or once you’ve solved the hand, are you good?

Elon Musk

From an electromechanical standpoint, the hand is more difficult than everything else combined. The human hand turns out to be quite something. But you also need the real-world intelligence. The intelligence that Tesla developed for the car applies very well to the robot, which is primarily vision in. The car takes in vision, but it actually also is listening for sirens. It’s taking in the inertial measurements, GPS signals, other data, combining that with video, primarily video, and then outputting the control commands.

Your Tesla is taking in one and a half gigabytes a second of video and outputting two kilobytes a second of control outputs with the video at 36 hertz and the control frequency at 18.
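As a rough check on the figures quoted above, the following sketch works out the implied end-to-end compression ratio. It assumes decimal units (1 GB = 10^9 bytes, 1 KB = 10^3 bytes) and treats the quoted numbers as round approximations; the names are illustrative and not taken from any Tesla software.

```python
# Figures as quoted in the conversation. Decimal units are an assumption.
VIDEO_BYTES_PER_S = 1.5e9    # "one and a half gigabytes a second" of video in
CONTROL_BYTES_PER_S = 2e3    # "two kilobytes a second" of control output
VIDEO_HZ = 36                # video rate
CONTROL_HZ = 18              # control rate


def compression_ratio(in_rate: float, out_rate: float) -> float:
    """Ratio of input bandwidth to output bandwidth."""
    return in_rate / out_rate


ratio = compression_ratio(VIDEO_BYTES_PER_S, CONTROL_BYTES_PER_S)
bytes_per_video_tick = VIDEO_BYTES_PER_S / VIDEO_HZ
bytes_per_control_step = CONTROL_BYTES_PER_S / CONTROL_HZ
frames_per_control_step = VIDEO_HZ / CONTROL_HZ

print(f"input-to-output ratio: {ratio:,.0f} to 1")
print(f"video bytes per 36 Hz tick: {bytes_per_video_tick:,.0f}")
print(f"control bytes per 18 Hz step: {bytes_per_control_step:.1f}")
print(f"video ticks per control step: {frames_per_control_step:.0f}")
```

Under these assumptions the ratio comes to 750,000 to 1, with two video ticks arriving for every control step.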

John Collison

One intuition you could have for when we get this robotic stuff is that it takes quite a few years to go from the compelling demo to actually being able to use it in the real world. 10 years ago, you had really compelling demos of self-driving, but only now we have Robotaxis and Waymo and all these services scaling up. Shouldn’t this make one pessimistic on household robots? Because we don’t even quite have the compelling demos yet of, say, the really advanced hand.

Elon Musk

Well, we’ve been working on humanoid robots now for a while. I guess it’s been five or six years or something. A bunch of the things that were done for the car are applicable to the robot. We’ll use the same Tesla AI chips in the robot as in the car. We’ll use the same basic principles. It’s very much the same AI.

You’ve got many more degrees of freedom for a robot than you do for a car. If you just think of it as a bitstream, AI is mostly compression and correlation of two bitstreams. For video, you’ve got to do a tremendous amount of compression and you’ve got to do the compression just right. You’ve got to ignore the things that don’t matter. You don’t care about the details of the leaves on the tree on the side of the road, but you care a lot about the road signs and the traffic lights, the pedestrians, and even whether someone in another car is looking at you or not looking at you. Some of these details matter a lot.

The car is going to turn that one and a half gigabytes a second ultimately into two kilobytes a second of control outputs. So you’ve got many stages of compression. You’ve got to get all those stages right and then correlate those to the correct control outputs. The robot has to do essentially the same thing.

This is what happens with humans. We really are photons in, controls out. That is the vast majority of your life: vision, photons in, and then motor controls out.

Dwarkesh Patel

Naively, it seems that between humanoid robots and cars… The fundamental actuators in a car are how you turn, how you accelerate. In a robot, especially with maneuverable arms, there’s dozens and dozens of these degrees of freedom. Then especially with Tesla, you had this advantage of millions and millions of hours of human demo data collected from the car being out there. You can’t equivalently deploy Optimuses that don’t work and then get the data that way. So between the increased degrees of freedom and the far sparser data...

Elon Musk

Yes. That’s a good point.

Dwarkesh Patel

How will you use the Tesla engine of intelligence to train the Optimus mind?

Elon Musk

You’re actually highlighting an important limitation and difference from cars. We’ll soon have 10 million cars on the road. It’s hard to duplicate that massive training flywheel. For the robot, what we’re going to need to do is build a lot of robots and put them in kind of an Optimus Academy so they can do self-play in reality. We’re actually building that out. We can have at least 10,000 Optimus robots, maybe 20-30,000, that are doing self-play and testing different tasks.

Tesla has quite a good reality generator, a physics-accurate reality generator, that we made for the cars. We’ll do the same thing for the robots. We actually have done that for the robots. So you have a few tens of thousands of humanoid robots doing different tasks. You can do millions of simulated robots in the simulated world. You use the tens of thousands of robots in the real world to close the simulation to reality gap. Close the sim-to-real gap.

Dwarkesh Patel

How do you think about the synergies between xAI and Optimus, given you’re highlighting that you need this world model, you want to use some really smart intelligence as a control plane, and Grok is doing the slower planning, and then the motor policy is a little lower level? What will the synergy between these things be?

Elon Musk

Grok would orchestrate the behavior of the Optimus robots. Let’s say you wanted to build a factory. Grok could organize the Optimus robots, assign them tasks to build the factory to produce whatever you want.

John Collison

Don’t you need to merge xAI and Tesla then? Because these things end up so...

Elon Musk

What were we saying earlier about public company discussions?

Dwarkesh Patel

We’re one more Guinness in, Elon. What are you waiting to see before you say, we want to manufacture 100,000 Optimuses?

Elon Musk

“Optimi”. Since we’re defining the proper noun, we’re going to define the plural of the proper noun too. We’re going to proper noun the plural and so it’s Optimi.

Dwarkesh Patel

Is there something on the hardware side you want to see? Do you want to see better actuators? Is it just that you want the software to be better? What are we waiting for before we get mass manufacturing of Gen 3?

Elon Musk

No, we’re moving towards that. We’re moving forward with the mass manufacturing.

Dwarkesh Patel

But you think current hardware is good enough that you just want to deploy as many as possible now?

Elon Musk

It’s very hard to scale up production. But I think Optimus 3 is the right version of the robot to produce something on the order of a million units a year. I think you’d want to go to Optimus 4 before you went to 10 million units a year.

John Collison

Okay, but you can do a million units at Optimus 3?

Elon Musk

It’s very hard to spool up manufacturing. The output per unit time always follows an S-curve. It starts off agonizingly slow, then it has this exponential increase, then a linear, then a logarithmic outcome until you eventually asymptote at some number. Optimus’ initial production will be a stretched out S-curve because so much of what goes into Optimus is brand new. There is not an existing supply chain.

The actuators, electronics, everything in the Optimus robot is designed from physics first principles. It’s not taken from a catalog. These are custom-designed everything. I don’t think there’s a single thing—

John Collison

How far down does that go?

Elon Musk

I guess we’re not making custom capacitors yet, maybe. There’s nothing you can pick out of a catalog, at any price. It just means that the Optimus S-Curve, the output per unit time, how many Optimus robots you make per day, is going to initially ramp slower than a product where you have an existing supply chain. But it will get to a million.
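The S-curve described here can be sketched with a logistic function. This is an assumed model chosen for illustration, not one the speakers name. The ceiling of a million units a year comes from the conversation, while the growth rates and midpoints are hypothetical values picked only to show how a stretched ramp reaches the same ceiling later.

```python
import math


def logistic_output(t: float, ceiling: float, rate: float, midpoint: float) -> float:
    """Annual output at time t (years) on a logistic S-curve."""
    return ceiling / (1.0 + math.exp(-rate * (t - midpoint)))


CEILING = 1_000_000  # units per year, the figure mentioned in the conversation

# Hypothetical parameters, chosen so both curves start from the same output at t = 0
# (rate * midpoint is 4.5 in both cases).
ESTABLISHED = {"rate": 1.5, "midpoint": 3.0}   # product with an existing supply chain
STRETCHED = {"rate": 0.75, "midpoint": 6.0}    # product whose supply chain is new


def ramp(params: dict, years) -> list:
    """Evaluate the S-curve for each year in `years`."""
    return [
        logistic_output(t, CEILING, params["rate"], params["midpoint"])
        for t in years
    ]


years = range(0, 21, 2)
established = ramp(ESTABLISHED, years)
stretched = ramp(STRETCHED, years)

for t, fast, slow in zip(years, established, stretched):
    print(f"year {t:2d}: established {fast:>9,.0f}   stretched {slow:>9,.0f}")
```

Both curves begin slowly, pass through a near-exponential phase, and flatten toward the same ceiling. The stretched one simply takes longer to get there.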

Dwarkesh Patel

When you see these Chinese humanoids, like Unitree or whatever, sell humanoids for like $6K or $13K, are you hoping to get your Optimus bill of materials below that price so you can do the same thing? Or do you just think qualitatively they’re not the same thing? What allows them to sell for so low? Can we match that?

Elon Musk

Our Optimus is designed to have a lot of intelligence and to have the same electromechanical dexterity, if not higher, as a human. Unitree does not have that. It’s also quite a big robot. It has to carry heavy objects for long periods of time and not overheat or exceed the power of its actuators. It’s 5’11”, so it’s pretty tall. It’s got a lot of intelligence. So it’s going to be more expensive than a small robot that is not intelligent.

John Collison

But more capable.

Elon Musk

But not a lot more. The thing is, over time as Optimus robots build Optimus robots, the cost will drop very quickly.

John Collison

What will these first billion Optimuses, Optimi, do? What will their highest and best use be?

Elon Musk

I think you would start off with simple tasks that you can count on them doing well.

John Collison

But in the home or in factories?

Elon Musk

The best use for robots in the beginning will be any continuous operation, any 24/7 operation, because they can work continuously.

Dwarkesh Patel

What fraction of the work at a Gigafactory that is currently done by humans could a Gen 3 do?

Elon Musk

I’m not sure. Maybe it’s 10-20%, maybe more, I don’t know. We would not reduce our headcount. We would increase our headcount, to be clear. But we would increase our output. The units produced per human... The total number of humans at Tesla will increase, but the output of robots and cars will increase disproportionately. The number of cars and robots produced per human will increase dramatically, but the number of humans will increase as well.

01:30:22 - Does China win by default?

John Collison

We’re talking about Chinese manufacturing a bunch here. We’ve also talked about some of the policies that are relevant, like you mentioned, the solar tariffs. You think they’re a bad idea because we can’t scale up solar in the US.

Elon Musk

Electricity output in the US needs to scale up.

John Collison

It can’t without good power sources.

Elon Musk

You just need to get it somehow.

John Collison

Where I was going with this is, if you were in charge, if you were setting all the policies, what else would you change? You’d change the solar tariffs, that’s one.

Elon Musk

I would say anything that is a limiting factor for electricity needs to be addressed, provided it’s not very bad for the environment.

John Collison

So presumably some permitting reforms and stuff as well would be in there?

Elon Musk

There’s a fair bit of permitting reforms that are happening. A lot of the permitting is state-based, but anything federal... This administration is good at removing permitting roadblocks.

I’m not saying all tariffs are bad.

John Collison

Solar tariffs.

Elon Musk

Sometimes if another country is subsidizing the output of something, then you have to have countervailing tariffs to protect domestic industry against subsidies by another country.

John Collison

What else would you change?

Elon Musk

I don’t know if there’s that much that the government can actually do.

John Collison

One thing I was wondering... For the policy goal of creating a lead for the US versus China, it seems like the export bans have actually been quite impactful, where China is not producing leading-edge chips and the export bans really bite there. China is not producing leading-edge turbine engines. Similarly, there’s a bunch of export bans that are relevant there on some of the metallurgy. Should there be more export bans? As you think about things like the drone industry and things like that, is that something that should be considered?

Elon Musk

It’s important to appreciate that in most areas, China is very advanced in manufacturing. There’s only a few areas where it is not. China is a manufacturing powerhouse, next-level.

John Collison

It’s very impressive.

Elon Musk

If you take refining of ore, China does roughly twice as much ore refining on average as the rest of the world combined. There are some areas, like refining gallium which goes into solar cells. I think they are 98% of gallium refining. So China is actually very advanced in manufacturing in most areas.

John Collison

It seems like there is discomfort with this supply chain dependence, and yet nothing’s really happening on it.

Elon Musk

Supply chain dependence?

John Collison

Say, like the gallium refining that you’re saying. All the rare-earth stuff.

Elon Musk

Rare earths for sure, as you know, they’re not rare. We actually do rare earth ore mining in the US, send the rock, put it on a train, and then put it on a boat to China that goes to another train, and goes to the rare earth refiners in China who then refine it, put it into a magnet, put it into a motor sub-assembly, and then send it back to America. So the thing we’re really missing is a lot of ore refining in America.

John Collison

Isn’t this worth a policy intervention?

Elon Musk

Yes. I think there are some things being done on that front. But we kind of need Optimus, frankly, to build ore refineries.

Dwarkesh Patel

So, you think the main advantage China has is the abundance of skilled labor? That’s the thing Optimus fixes?

Elon Musk

Yes. China’s got like four times our population.

Dwarkesh Patel

I mean, there’s this concern. If you think human resources are the future, right now if it’s the skilled labor for manufacturing that’s determining who can build more humanoids, China has more of those. It manufactures more humanoids, therefore it gets the Optimi future first.

Elon Musk

Well, we’ll see. Maybe.

Dwarkesh Patel

It just keeps that exponential going. It seems like you’re sort of pointing out that getting to a million Optimi requires the manufacturing that the Optimi are supposed to help us get to. Right?

Elon Musk

You can close that recursive loop pretty quickly.

John Collison

With a small number of Optimi?

Elon Musk

Yeah. So you close the recursive loop to help the robots build the robots. Then we can try to get to tens of millions of units a year. Maybe. If you start getting to hundreds of millions of units a year, you’re going to be the most competitive country by far.

We definitely can’t win with just humans, because China has four times our population. Frankly, America has been winning for so long that… A pro sports team that’s been winning for a very long time tends to get complacent and entitled. That’s why they stop winning, because they don’t work as hard anymore. So frankly my observation is just that the average work ethic in China is higher than in the US. It’s not just that there’s four times the population, but the amount of work that people put in is higher.

So you can try to rearrange the humans, but you’re still one quarter of the—assuming that productivity is the same, which I think actually it might not be, I think China might have an advantage on productivity per person—we will do one quarter of the amount of things as China. So we can’t win on the human front.

Our birth rate has been low for a long time. The US birth rate’s been below replacement since roughly 1971. We’ve got a lot of people retiring, we’re close to more people domestically dying than being born. So we definitely can’t win on the human front, but we might have a shot at the robot front.

John Collison

Are there other things that you have wanted to manufacture in the past, but they’ve been too labor intensive or too expensive that now you can come back to and say, “oh, we can finally do the whatever, because we have Optimus?”

Elon Musk

Yeah, we’d like to build more ore refineries at Tesla. We just completed construction and have begun lithium refining with our lithium refinery in Corpus Christi, Texas. We have a nickel refinery, which is for the cathode, that’s here in Austin. This is the largest cathode refinery, largest nickel and lithium refinery, outside of China.

The cathode team would say, “we have the largest and the only, actually, cathode refinery in America.” Not just the largest, but it’s also the only.

John Collison

Many superlatives.

Elon Musk

So it was pretty big, even though it’s the only one. But there are other things. You could do a lot more refineries and help America be more competitive on refining capacity. There’s basically a lot of work for the Optimus to do that most Americans, very few Americans, frankly want to do.

John Collison

Is the refining work too dirty or what’s the—

Elon Musk

It’s not actually, no. We don’t have toxic emissions from the refinery or anything. The cathode nickel refinery is in Travis County.

John Collison

Why can’t you do it with humans?

Elon Musk

You can, you just run out of humans.

John Collison

Ah, I see. Okay.

Elon Musk

No matter what you do, America has one quarter of the number of humans that China has. So if you have them do this thing, they can’t do the other thing. So then how do you build this refining capacity? Well, you could do it with Optimi.

Not very many Americans are pining to do refining. I mean, how many have you run into? Very few. Very few pining to refine.

Dwarkesh Patel

BYD is reaching Tesla production or sales in quantity. What do you think happens in global markets as Chinese production in EVs scales up?

Elon Musk

China is extremely competitive in manufacturing. So I think there’s going to be a massive flood of Chinese vehicles and basically most manufactured things. As it is, as I said, China is probably doing twice as much refining as the rest of the world combined. So if you go down to fourth and fifth-tier supply chain stuff…

At the base level, you’ve got energy, then you’ve got mining and refining. Those foundation layers are, like I said, as a rough guess, China’s doing twice as much refining as the rest of the world combined. So any given thing is going to have Chinese content because China’s doing twice as much refining work as the rest of the world. But they’ll go all the way to the finished product with the cars.

I mean China is a powerhouse. I think this year China will exceed three times US electricity output. Electricity output is a reasonable proxy for the economy. In order to run the factories and run everything, you need electricity. It’s a good proxy for the real economy. If China passes three times the US electricity output, it means that its industrial capacity, as a rough approximation, will be three times that of the US.

Dwarkesh Patel

Reading between the lines, it sounds like what you’re saying is absent some sort of humanoid recursive miracle in the next few years, on the whole manufacturing/energy/raw materials chain, China will just dominate, whether it’s AI or manufacturing EVs or manufacturing humanoids.

Elon Musk

In the absence of breakthrough innovations in the US, China will utterly dominate.

Dwarkesh Patel

Interesting.

Elon Musk

Yes.

John Collison

Robotics being the main breakthrough innovation.

Elon Musk

Well, to scale AI in space, basically you need humanoid robots, you need real-world AI, you need a million tons a year to orbit. Let’s just say if we get the mass driver on the moon going, my favorite thing, then I think—

John Collison

We’ll have solved all our problems.

Elon Musk

I call that winning. I call it winning, big time.

John Collison

You can finally be satisfied. You’ve done something.

Elon Musk

Yes.

John Collison

You have the mass driver on the moon.

Elon Musk

I just want to see that thing in operation.

John Collison

Was that out of some sci-fi or where did you…?

Elon Musk

Well, actually, there is a Heinlein book. The Moon is a Harsh Mistress.

John Collison

Okay, yeah, but that’s slightly different. That’s a gravity slingshot or...

Elon Musk

No, they have a mass driver on the Moon.

John Collison

Okay, yeah, but they use that to attack Earth. So maybe it’s not the greatest...

Elon Musk

Well they use that to… assert their independence.

John Collison

Exactly. What are your plans for the mass driver on the Moon?

Elon Musk

They asserted their independence. Earth government disagreed and they lobbed things until Earth government agreed.

John Collison

That book is a hoot. I found that book much better than his other one that everyone reads, Stranger in a Strange Land.

Elon Musk

“Grok” comes from Stranger in a Strange Land. The first two-thirds of Stranger in a Strange Land are good, and then it gets very weird in the third portion. But there are still some good concepts in there.

01:44:16 - Lessons from running SpaceX

John Collison

One thing we were discussing a lot is your system for managing people. You interviewed the first few thousand of SpaceX employees and lots of other companies.

Elon Musk

It obviously doesn’t scale.

John Collison

Well, yes, but what doesn’t scale?

Elon Musk

Me.

John Collison

Sure, sure. I know that. But what are you looking for?

Elon Musk

There literally are not enough hours in the day. It’s impossible.

John Collison

But what are you looking for that someone else who’s good at interviewing and hiring people… What’s the je ne sais quoi?

Elon Musk

At this point, I might have more training data on evaluating technical talent especially—talent of all kinds I suppose, but technical talent especially—given that I’ve done so many technical interviews and then seen the results. So my training set is enormous and has a very wide range.

Generally, the things I ask for are bullet points for evidence of exceptional ability. These things can be pretty off the wall. It doesn’t need to be in the specific domain, but evidence of exceptional ability. So if somebody can cite even one thing, but let’s say three things, where you go, “Wow, wow, wow,” then that’s a good sign.

Dwarkesh Patel

Why do you have to be the one to determine that?

Elon Musk

No, I don’t. I can’t be. It’s impossible. The total headcount across all companies is 200,000 people.

John Collison

But in the early days, what was it that you were looking for that couldn’t be delegated in those interviews?

Elon Musk

I guess I need to build my training set. It’s not like I batted a thousand here. I would make mistakes, but then I’d be able to see where I thought somebody would work out well, but they didn’t. Then why did they not work out well? What can I do, I guess RL myself, to in the future have a better batting average when interviewing people? My batting average is still not perfect, but it’s very high.

Dwarkesh Patel

What are some surprising reasons people don’t work out?

Elon Musk

Surprising reasons…

Dwarkesh Patel

Like, they don’t understand the technical domain, et cetera, et cetera. But you’ve got the long tail now of like, “I was really excited about this person. It didn’t work out.” Curious why that happens.

Elon Musk

Generally what I tell people—I tell myself, I guess, aspirationally—is, don’t look at the resume. Just believe your interaction. The resume may seem very impressive and it’s like, “Wow, the resume looks good.” But if the conversation after 20 minutes is not “wow,” you should believe the conversation, not the paper.

John Collison

I feel like part of your method is that… There was this meme in the media a few years back about Tesla being a revolving door of executive talent. Whereas actually, I think when you look at it, Tesla’s had a very consistent and internally promoted executive bench over the past few years.

Then at SpaceX, you have all these folks like Mark Juncosa and Steve Davis—

Elon Musk

Steve Davis runs The Boring Company these days.

John Collison

Bill Riley, and folks like that. It feels like part of what has worked well is having very capable technical deputies. What do all of those people have in common?

Elon Musk

Well, the Tesla senior team at this point has probably got an average tenure of 10-12 years. It’s quite long. But there were times when Tesla went through an extremely rapid growth phase, so things were just somewhat sped up. As you know, a company goes through different orders of magnitude of size. People that could help manage, say, a 50-person company versus a 500-person company versus a 5,000-person company versus a 50,000-person company.

John Collison

You outgrew people.

Elon Musk

It’s just not the same team. It’s not always the same team. So if a company is growing very rapidly, the rate at which executive positions will change will also be proportionate to the rapidity of the growth generally.

Tesla had a further challenge where when Tesla had very successful periods, we would be relentlessly recruited from. Like, relentlessly. When Apple had their electric car program, they were carpet bombing Tesla with recruiting calls. Engineers just unplugged their phones.

John Collison

“I’m trying to get work done here.”

Elon Musk

Yeah. “If I get one more call from an Apple recruiter…” But their opening offer without any interview would be like double the compensation at Tesla. So we had a bit of the “Tesla pixie dust” thing where it’s like, “Oh, if you hire a Tesla executive, suddenly everything’s going to be successful.”

I’ve fallen prey to the pixie dust thing as well, where it’s like, “Oh, we’ll hire someone from Google or Apple and they’ll be immediately successful,” but that’s not how it works. People are people. There’s no magical pixie dust. So when we had the pixie dust problem, we would get relentlessly recruited from.

Also, with Tesla engineering being primarily in Silicon Valley, it’s easier for people to just... They don’t have to change their life very much. Their commute’s going to be the same.

John Collison

So how do you prevent that? How do you prevent the pixie dust effect where everyone’s trying to poach all your people?

Elon Musk

I don’t think there’s much we can do to stop it. That’s one of the reasons why Tesla… Really, being in Silicon Valley and having the pixie dust thing at the same time meant that there was just a very, very aggressive recruitment.

John Collison

Presumably being in Austin helps then?

Elon Musk

Austin, it helps. Tesla still has a majority of its engineering in California. Getting engineers to move… I call it the “significant other” problem.

John Collison

Yes, “significant others” have jobs.

Elon Musk

Exactly. So for Starbase that was particularly difficult, since the odds of finding a non-SpaceX job…

John Collison

In Brownsville, Texas…

Elon Musk

…are pretty low. It’s quite difficult. It’s like a technology monastery thing, remote and mostly dudes.

Dwarkesh Patel

Not much of an improvement over SF.

John Collison

If you go back to these people who’ve really been very effective in a technical capacity at Tesla, at SpaceX, and those sorts of places, what do you think they have in common other than... Is it just that they’re very sharp on the rocketry or the technical foundations, or do you think it’s something organizational?

Is it something about their ability to work with you? Is it their ability to be flexible but not too flexible? What makes a good sparring partner for you?

Elon Musk

I don’t think of it as a sparring partner. If somebody gets things done, I love them, and if they don’t, I hate them. So it’s pretty straightforward. It’s not like some idiosyncratic thing. If somebody executes well, I’m a huge fan, and if they don’t, I’m not. But it’s not about mapping to my idiosyncratic preferences. I certainly try not to have it be mapping to my idiosyncratic preferences.

Generally, I think it’s a good idea to hire for talent and drive and trustworthiness. And I think goodness of heart is important. I underweighted that at one point. So, are they a good person? Trustworthy? Smart and talented and hard working? If so, you can add domain knowledge.

But those fundamental traits, those fundamental properties, you cannot change. So most of the people who are at Tesla and SpaceX did not come from the aerospace industry or the auto industry.

Dwarkesh Patel

What has had to change most about your management style as your companies have scaled from 100 to 1,000 to 10,000 people? You’re known for this very micro management, just getting into the details of things.

Elon Musk

Nano management, please. Pico management. Femto management.

John Collison

Keep going.

Elon Musk

We’re going to go all the way down to Planck’s constant. All the way down to Heisenberg uncertainty principle.

Dwarkesh Patel

Are you still able to get into details as much as you want? Would your companies be more successful if they were smaller? How do you think about that?

Elon Musk

Because I have a fixed amount of time in the day, my time is necessarily diluted as things grow and as the span of activity increases. It’s impossible for me to actually be a micromanager because that would imply I have some thousands of hours per day. It is a logical impossibility for me to micromanage things.

Now, there are times when I will drill down into a specific issue because that specific issue is the limiting factor on the progress of the company. The reason for drilling into some very detailed item is because it is the limiting factor. It’s not arbitrarily drilling into tiny things.

From a time standpoint, it is physically impossible for me to arbitrarily go into tiny things that don’t matter. That would result in failure. But sometimes the tiny things are decisive in victory.

John Collison

Famously, you switched the Starship design from composites to steel.

Elon Musk

Yes.

John Collison

You made that decision. That wasn’t people going around saying, “Oh, we found something better, boss.” That was you encouraging people against some resistance. Can you tell us how you came to that whole concept of the steel switch?

Elon Musk

Desperation, I’d say. Originally, we were going to make Starship out of carbon fiber. Carbon fiber is pretty expensive. When you do volume production, you can get any given thing to start to approach its material cost.

The problem with carbon fiber is that material cost is still very high. Particularly if you go for a high-strength specialized carbon fiber that can handle cryogenic oxygen, it’s roughly 50 times the cost of steel. At least in theory, it would be lighter. People generally think of steel as being heavy and carbon fiber as being light.

For room temperature applications, like a Formula 1 car, static aero structure, or any kind of aero structure really, you’re probably going to be better off with carbon fiber. The problem is that we were trying to make this enormous rocket out of carbon fiber and our progress was extremely slow.

John Collison

It had been picked in the first place just because it’s light?

Elon Musk

Yes. At first glance, most people would think that the choice for making something light would be carbon fiber. The thing is that when you make something very enormous out of carbon fiber and then you try to have the carbon fiber be efficiently cured, meaning not room temperature cured, because sometimes you got 50 plies of carbon fiber… Carbon fiber is really carbon string and glue. In order to have high strength, you need an autoclave. Something that’s essentially a high pressure oven. If you have something that’s gigantic, that one’s got to be bigger than the rocket.

We were trying to make an autoclave that’s bigger than any autoclave that’s ever existed. Or you can do room temperature cure, which takes a long time and has issues. The final issue is that we were just making very slow progress with carbon fiber.

Dwarkesh Patel

The meta question is why it had to be you who made that decision. There’s many engineers on your team.

John Collison

How did the team not arrive at steel?

Dwarkesh Patel

Yeah exactly. This is part of a broader question, understanding your comparative advantage at your companies.

Elon Musk

Because we were making very slow progress with carbon fiber, I was like, “Okay, we’ve got to try something else.” For the Falcon 9, the primary airframe is made of aluminum lithium, which has a very good strength-to-weight. Actually, it has about the same, maybe better, strength to weight for its application than carbon fiber. But aluminum lithium is very difficult to work with.

In order to weld it, you have to do something called friction stir welding, where you join the metal without entering the liquid phase. It’s kind of wild that you can do that. But with this particular type of welding, you can do that. It’s very difficult. Let’s say you want to make a modification or attach something to aluminum lithium, you now have to use a mechanical attachment with seals. You can’t weld it on. So I wanted to avoid using aluminum lithium for the primary structure for Starship.

There was this very special grade of carbon fiber that had very good mass properties. With a rocket, you’re really trying to maximize the percentage of the rocket that is propellant, minimize the mass obviously. But like I said, we were making very slow progress. I said, “at this rate, we’re never going to get to Mars. So we’ve got to think of something else.”

I didn’t want to use aluminum lithium because of the difficulty of friction stir welding, especially doing that at scale. It was hard enough at 3.6 meters in diameter, let alone at 9 meters or above. Then I said, “what about steel?”

I had a clue here because some of the early US rockets had used very thin steel. The Atlas rockets had used a steel balloon tank. It’s not like steel had never been used before. It actually had been used. When you look at the material properties of stainless steel, full-hard, strain hardened stainless steel, at cryogenic temperature the strength to weight is actually similar to carbon fiber.

If you look at material properties at room temperature, it looks like the steel is going to be twice as heavy. But if you look at the material properties at cryogenic temperature of full-hard steel, stainless of particular grades, then you actually get to a similar strength to weight as carbon fiber.

In the case of Starship, both the fuel and the oxidizer are cryogenic. For Falcon 9, the fuel is rocket propellant-grade kerosene, basically a very pure form of jet fuel. That is roughly room temperature. Although we do actually chill it slightly below, we chill it like a beer.

John Collison

Delicious.

Elon Musk

We do chill it, but it’s not cryogenic. In fact, if we made it cryogenic, it would just turn to wax. But for Starship, it’s liquid methane and liquid oxygen. They are liquid at similar temperatures. Basically, almost the entire primary structure is at cryogenic temperature. So then you’ve got a 300-series stainless that’s strain hardened. Because almost all things are cryogenic temperature, it actually has similar strength to weight as carbon fiber.

But it costs 50x less in raw material and is very easy to work with. You can weld stainless steel outdoors. You could smoke a cigar while welding stainless steel. It’s very resilient. You can modify it easily. If you want to attach something, you just weld it right on. Very easy to work with, very low cost.

Like I said, at cryogenic temperature, it’s similar strength-to-weight to carbon fiber. Then when you factor in that we have a much reduced heat shield mass, because the melting point of steel is much greater than the melting point of aluminum… It’s about twice the melting point of aluminum.

John Collison

So you can just run the rocket much hotter?

Elon Musk

Yes, especially for the ship which is coming in like a blazing meteor. You can greatly reduce the mass of the heat shield. You can cut the mass of the windward part of the heat shield, maybe in half, and you don’t need any heat shielding on the leeward side.

The net result is that actually the steel rocket weighs less than the carbon fiber rocket, because the resin in the carbon fiber rocket starts to melt. Basically, carbon fiber and aluminum have about the same operating temperature capabilities, whereas steel can operate at twice the temperature. These are very rough approximations.

John Collison

I won’t build the rocket.

Elon Musk

What I mean is people will say, “Oh, he said this twice. It’s actually 0.8.” I’m like, shut up, assholes.

Dwarkesh Patel

That’s what the main comment’s going to be about.

Elon Musk

God damn it. The point is, in retrospect, we should have started with steel in the beginning. It was dumb not to do steel.

John Collison

Okay, but to play this back to you, what I’m hearing is that steel was a riskier, less proven path, other than the early US rockets. Versus carbon fiber was a worse but more proven out path. So you need to be the one to push for, “Hey, we’re going to do this riskier path and just figure it out.” So you’re fighting a sort of conservatism in a sense.

Elon Musk

That’s why I initially said that the issue is that we weren’t making fast enough progress. We were having trouble making even a small barrel section of the carbon fiber that didn’t have wrinkles in it. Because at that large scale, you have to have many plies, many layers of the carbon fiber. You’ve got to cure it and you’ve got to cure it in such a way that it doesn’t have any wrinkles or defects.

Carbon fiber is much less resilient than steel. It has much less toughness. Stainless steel will stretch and bend, the carbon fiber will tend to shatter. Toughness being the area under the stress-strain curve. You’re generally going to do better with steel, stainless steel to be precise.
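
To make the "area under the stress-strain curve" definition concrete, here is a minimal sketch that integrates two curves with the trapezoidal rule. The data points are purely illustrative (a hypothetical brittle material and a hypothetical ductile one), not measured properties of carbon fiber or stainless steel, and the function name is an assumption for this example.

```python
def toughness(strain, stress_mpa):
    """Area under a stress-strain curve via the trapezoidal rule.

    With stress in MPa and dimensionless strain, the result is in MJ/m^3.
    """
    area = 0.0
    for i in range(1, len(strain)):
        mean_stress = 0.5 * (stress_mpa[i] + stress_mpa[i - 1])
        area += mean_stress * (strain[i] - strain[i - 1])
    return area

# ILLUSTRATIVE curves only (hypothetical numbers, not material data).
# Brittle: high strength, fractures at small strain.
brittle_strain = [0.0, 0.01, 0.015]
brittle_stress = [0.0, 1000.0, 1500.0]
# Ductile: lower peak stress, but stretches a long way before failing.
ductile_strain = [0.0, 0.005, 0.2, 0.4]
ductile_stress = [0.0, 500.0, 700.0, 900.0]

brittle_toughness = toughness(brittle_strain, brittle_stress)
ductile_toughness = toughness(ductile_strain, ductile_stress)
print(f"brittle: {brittle_toughness:.2f} MJ/m^3, ductile: {ductile_toughness:.2f} MJ/m^3")
```

In this toy data the brittle material reaches a higher peak stress, yet the ductile one absorbs far more energy before failure because it keeps carrying load over a much larger strain range.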

John Collison

One other Starship question. So I visited Starbase, I think it was two years ago, with Sam Teller, and that was awesome. It was very cool to see, in a whole bunch of ways.

One thing I noticed was that people really took pride in the simplicity of things, where everyone wants to tell you how Starship is just a big soda can, and we’re hiring welders, and if you can weld in any industrial project, you can weld here. But there’s a lot of pride in the simplicity.

Elon Musk

Well, factually Starship is a very complicated rocket.

John Collison

So that’s what I’m getting at. Are things simple or are they complex?

Elon Musk

I think maybe just what they’re trying to say is that you don’t have to have prior experience in the rocket industry to work on Starship. Somebody just needs to be smart and work hard and be trustworthy and they can work on a rocket. They don’t need prior rocket experience. Starship is the most complicated machine ever made by humans, by a long shot.

John Collison

In what regards?

Elon Musk

Anything, really. I’d say there isn’t a more complex machine. I’d say that pretty much any project I can think of would be easier than this. That’s why nobody has ever made a fully reusable orbital rocket. It’s a very hard problem. Many smart people have tried before, very smart people with immense resources, and they failed.

And we haven’t succeeded yet. Falcon is partially reusable, but the upper stage is not. Starship Version 3, I think this design can be fully reusable. That full reusability is what will enable us to become a multi-planet civilization. Any technical problem, even like a Hadron Collider or something like that, is an easier problem than this.

John Collison

We spent a lot of time on bottlenecks. Can you say what the current Starship bottlenecks are, even at a high level?

Elon Musk

Trying to make it not explode, generally. It really wants to explode.

John Collison

That old chestnut. All those combustible materials.

Elon Musk

We’ve had two boosters explode on the test stand. One obliterated the entire test facility. So it only takes that one mistake. The amount of energy contained in a Starship is insane.

John Collison

Is that why it’s harder than Falcon? It’s because it’s just more energy?

Elon Musk

It’s a lot of new technology. It’s pushing the performance envelope. The Raptor 3 engine is a very, very advanced engine. It’s by far the best rocket engine ever made. But it desperately wants to blow up. Just to put things into perspective here, on liftoff the rocket is generating over 100 gigawatts of power. That’s 20% of US electricity.
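
As a rough check on that comparison, the sketch below converts annual US electricity generation into an average power figure. The 4,200 TWh per year value is an assumed round number for recent US generation and does not come from the conversation. The 100 GW liftoff figure is taken as stated above.

```python
# Rough sanity check of "100 GW is about 20% of US electricity".
# ASSUMPTION: US annual generation of roughly 4,200 TWh (a round figure,
# not stated in the conversation).
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def average_power_gw(annual_twh: float) -> float:
    """Convert annual energy in TWh to average power in GW."""
    return annual_twh * 1000 / HOURS_PER_YEAR  # 1 TWh = 1,000 GWh

def share_of_grid(rocket_gw: float, annual_twh: float) -> float:
    """Fraction of average grid power that rocket_gw represents."""
    return rocket_gw / average_power_gw(annual_twh)

us_average_gw = average_power_gw(4200)
liftoff_share = share_of_grid(100, 4200)
print(f"US average: {us_average_gw:.0f} GW, liftoff share: {liftoff_share:.0%}")
```

Under that assumption the US averages a bit under 500 GW, so 100 GW comes out at about a fifth of average output, which is in line with the 20% figure.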

Dwarkesh Patel

It’s actually insane.

John Collison

It’s a great comparison.

Elon Musk

While not exploding.

John Collison

Sometimes.

Elon Musk

Sometimes, yes. So I was like, how does it not explode? There’s thousands of ways that it could explode and only one way that it doesn’t. So we want it not only to really not explode, but fly reliably on a daily basis, like once per hour. Obviously, if it blows up a lot, it’s very difficult to maintain that launch cadence.

John Collison

Yes.

Elon Musk

What’s the single biggest remaining problem for Starship? It’s having the heat shield be reusable. No one’s ever made a reusable orbital heat shield. So the heat shield’s gotta make it through the ascent phase without shucking a bunch of tiles, and then it’s gotta come back in and also not lose a bunch of tiles or overheat the main airframe.

John Collison

Isn’t that hard because it’s fundamentally a consumable?

Elon Musk

Well, yes, but your brake pads in your car are also consumable, but they last a very long time.

John Collison

Fair.

Elon Musk

So it just needs to last a very long time. We have brought the ship back and had it do a soft landing in the ocean. We’ve done that a few times. But it lost a lot of tiles. It was not reusable without a lot of work. Even though it did come to a soft landing, it would not have been reusable without a lot of work.

So it’s not really reusable in that sense. That’s the biggest problem that remains, a fully reusable heat shield. You want to be able to land it, refill propellant and fly again. You can’t do this laborious inspection of 40,000 tiles type of thing.

Dwarkesh Patel

When I read biographies of yours, it seems like you’re just able to drive the sense of urgency and drive the sense of “this is the thing that can scale.” I’m curious why you think other organizations of your…

SpaceX and Tesla are really big companies now. You’re still able to keep that culture. What goes wrong with other companies such that they’re not able to do that?

Elon Musk

I don’t know.

Dwarkesh Patel

Like today, you said you had a bunch of SpaceX meetings. What is it that you’re doing there that’s keeping that?

John Collison

It’s adding urgency?

Elon Musk

Well, I don’t know. I guess the urgency is going to come from whoever is leading the company. I have a maniacal sense of urgency. So that maniacal sense of urgency projects through the rest of the company.

Dwarkesh Patel

Is it because of consequences? They’re like, “Elon set a crazy deadline, but if I don’t get it, I know what happens to me.” Is it just that you’re able to identify bottlenecks and get rid of them so people can move fast? How do you think about why your companies are able to move fast?

Elon Musk

I’m constantly addressing the limiting factor. On the deadlines front, I generally actually try to aim for a deadline that I at least think is at the 50th percentile. So it’s not like an impossible deadline, but it’s the most aggressive deadline I can think of that could be achieved with 50% probability. Which means that it’ll be late half the time.
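
A minimal sketch of the 50th-percentile idea: take a set of possible completion times, use the median as the deadline, and count how often the outcome lands after it. The lognormal distribution and its parameters are hypothetical choices for illustration, as are the function names.

```python
import random
import statistics

def median_deadline(duration_samples):
    """Return the 50th-percentile duration: about half the samples exceed it."""
    return statistics.median(duration_samples)

def fraction_late(duration_samples, deadline):
    """Share of sampled outcomes that finish after the deadline."""
    late = sum(1 for d in duration_samples if d > deadline)
    return late / len(duration_samples)

# HYPOTHETICAL data: simulated project durations in weeks, skewed so that
# overruns can be larger than early finishes (lognormal is an assumption).
rng = random.Random(0)
samples = [rng.lognormvariate(3.0, 0.4) for _ in range(10_001)]

deadline = median_deadline(samples)
late_share = fraction_late(samples, deadline)
print(f"deadline: {deadline:.1f} weeks, late share: {late_share:.2f}")
```

By construction the late share sits at roughly one half, which matches "it'll be late half the time".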

There is a law of gas expansion that applies to schedules. If you said we’re going to do something in five years, which to me is like infinity time, it will expand to fill the available schedule and it’ll take five years.

Physics will limit how fast you can do certain things. So scaling up manufacturing, there’s a rate at which you can move the atoms and scale manufacturing. That’s why you can’t instantly make a million units a year of something. You’ve got to design the manufacturing line. You’ve got to bring it up. You’ve got to ride the S-curve of production.

What can I say that’s actually helpful to people? Generally, a maniacal sense of urgency is a very big deal. You want to have an aggressive schedule and you want to figure out what the limiting factor is at any point in time and help the team address that limiting factor.

John Collison

So Starlink was slowly in the works for many years.

Elon Musk

We talked about it all the way in the beginning of the company.

John Collison

So then there was a team you had built in Redmond, and then at one point you decided this team is just not cutting it. It went for a few years slowly, and so why didn’t you act earlier, and why did you act when you did? Why was that the right moment at which to act?

Elon Musk

I have these very detailed engineering reviews weekly. That’s maybe a very unusual level of granularity. I don’t know anyone who runs a company, or at least a manufacturing company, that goes with the level of detail that I go into. It’s not as though... I have a pretty good understanding of what’s actually going on because we go through things in detail.

I’m a big believer in skip-level meetings where instead of having the person that reports to me say things, it’s everyone that reports to them saying something in the technical review. And there can’t be advance preparation. Otherwise you’re going to get “glazed”, as I say these days.

John Collison

Exactly. Very Gen Z of you.

Dwarkesh Patel

How do you prevent advance preparation? Do you call on them randomly?

Elon Musk

No, I just go around the room. Everyone provides an update. It’s a lot of information to keep in your head. If you have meetings weekly or twice weekly, you’ve got a snapshot of what that person said. You can then plot the progress points. You can sort of mentally plot the points on a curve and say, “are we converging to a solution or not?”
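
The "plot the points on a curve" check can be sketched as a simple trend fit. This is one possible way to formalize it, not a description of the actual review process: it assumes a hypothetical metric (open issues reported at each weekly review), fits a least-squares line, and extrapolates to zero.

```python
def fit_line(values):
    """Least-squares slope and intercept for values sampled at t = 0, 1, 2, ..."""
    n = len(values)
    mean_t = (n - 1) / 2
    mean_v = sum(values) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values))
    var = sum((t - mean_t) ** 2 for t in range(n))
    slope = cov / var
    return slope, mean_v - slope * mean_t

def reviews_until_done(values):
    """Extrapolated reviews until the metric reaches zero, or None if not converging."""
    slope, intercept = fit_line(values)
    if slope >= 0:
        return None  # flat or getting worse: the trend never reaches zero
    zero_crossing = -intercept / slope
    return zero_crossing - (len(values) - 1)  # measured from the latest review

# HYPOTHETICAL weekly snapshots of open issues for two teams.
converging_team = [40, 36, 31, 27, 24]
stalled_team = [40, 39, 40, 41, 40]

print(reviews_until_done(converging_team))
print(reviews_until_done(stalled_team))
```

A result of `None` corresponds to the case described above where the points are not converging to a solution. A straight line is the crudest possible model, and a real trend could just as easily flatten or accelerate.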

I’ll take drastic action only when I conclude that success is not in the set of possible outcomes. So when I finally reach the conclusion that unless drastic action is taken, we have no chance of success, then I must take drastic action. I came to that conclusion in 2018, took drastic action and fixed the problem.

Dwarkesh Patel

You’ve got many, many companies. In each of them it sounds like you do this kind of deep engineering understanding of what the relevant bottlenecks are so you can do these reviews with people.

You’ve been able to scale it up to five, six, seven companies. Within one of these companies, you have many different mini companies within them. What determines the max amount here? Because you have like 80 companies…?

Elon Musk

80? No.

Dwarkesh Patel

But you have so many already. That’s already remarkable.

John Collison

By this current number.

Dwarkesh Patel

Exactly.

John Collison

We can barely keep one company together.

Elon Musk

It depends on the situation. I actually don’t have regular meetings with The Boring Company, so The Boring Company is sort of cruising along. Basically, if something is working well and making good progress, then there’s no point in me spending time on it.

I actually allocate time according to where the limiting factor is. Where are things problematic? Where are we pushing against? What is holding us back? I focus, at the risk of saying the words too many times, on the limiting factor.

The irony is if something’s going really well, they don’t see much of me. But if something is going badly, they’ll see a lot of me. Or not even badly…

John Collison

If something is the limiting factor.

Elon Musk

The limiting factor, exactly. It’s not exactly going badly but it’s the thing that we need to make go faster.

John Collison

When something’s a limiting factor at SpaceX or Tesla, are you talking weekly and daily with the engineer that’s working on it? How does that actually work?

Elon Musk

Most things that are the limiting factor are weekly and some things are twice weekly. The AI5 chip review is twice weekly. Every Tuesday and Saturday is the chip review.

John Collison

Is it open ended in how long it goes?

Elon Musk

Technically, yes, but usually it’s two or three hours. Sometimes less. It depends on how much information we’ve got to go through.

John Collison

That’s another thing. I’m just trying to tease out the differences here because the outcomes seem quite different. I think it’s interesting to know what inputs are different. It feels like in the corporate world, one, like you were saying, the CEO doing engineering reviews does not always happen despite the fact that that is what the company is doing.

But then time is often pretty finely sliced into half hour meetings or even 15 minute meetings. It seems like you hold more open-ended, “We’re talking about it until we figure it out” type things.

Elon Musk

Sometimes. But most of them seem to more or less stay on time. Today’s Starship engineering review went a bit longer because there were more topics to discuss. They’re trying to figure out how to scale to a million plus tons to orbit per year. It’s quite challenging.

02:20:08 - DOGE

Dwarkesh Patel

Can I ask a question? You said about Optimus and AI that they’re going to result in double digit growth rates within a matter of years.

Elon Musk

Oh, like the economy? Yes. I think that’s right.

Dwarkesh Patel

What was the point of the DOGE cuts if the economy is going to grow so much?

Elon Musk

Well, I think waste and fraud are not good things to have. I was actually pretty worried about... In the absence of AI and robotics, we’re actually totally screwed because the national debt is piling up like crazy. The interest payments to national debt exceed the military budget, which is a trillion dollars. So we have over a trillion dollars just in interest payments. I was pretty concerned about that. Maybe if I spend some time, we can slow down the bankruptcy of the United States and give us enough time for the AI and robots to help solve the national debt.

Or not help solve, it’s the only thing that could solve the national debt. We are 1000% going to go bankrupt as a country, and fail as a country, without AI and robots. Nothing else will solve the national debt. We just need enough time to build the AI and robots to not go bankrupt before then.

Dwarkesh Patel

I guess the thing I’m curious about is, when DOGE starts you have this enormous ability to enact reform.

Elon Musk

Not that enormous.

Dwarkesh Patel

Sure. I totally buy your point that it’s important that AI and robotics drive productivity improvements, drive GDP growth. But why not just directly go after the things you were pointing out, like the tariffs on certain components, or permitting?

Elon Musk

I’m not the president. And it is very hard to cut things that are obvious waste and fraud, like ridiculous waste and fraud. What I discovered is that it’s extremely difficult even to cut very obvious waste and fraud from the government because the government has to operate on who’s complaining.

If you cut off payments to fraudsters, they immediately come up with the most sympathetic sounding reasons to continue the payment. They don’t say, “Please keep the fraud going.” They’re like, “You’re killing baby pandas.” Meanwhile, no baby pandas are dying. They’re just making it up. The fraudsters are capable of coming up with extremely compelling, heart-wrenching stories that are false, but nonetheless sound sympathetic. That’s what happened.

Perhaps I should have known better. But I thought, wait, let’s try to cut some amount of waste and pork from the government. Maybe there shouldn’t be 20 million people marked as alive in Social Security who are definitely dead, and over the age of 115.

The oldest American is 114. So it’s safe to say if somebody is 115 and marked as alive in the Social Security database, there’s either a typo… Somebody should call them and say, “We seem to have your birthday wrong, or we need to mark you as dead.” One of the two things.

John Collison

Very intimidating call to get.

Elon Musk

Well, it seems like a reasonable thing. Say if their birthday is in the future and they have a Small Business Administration loan, and their birthday is 2165, we either have a typo or we have fraud. So we say, “we appear to have gotten the century of your birth incorrect.”

John Collison

Or a great plot for a movie.

Elon Musk

Yes. That’s what I mean by ludicrous fraud.

Dwarkesh Patel

Were those people getting payments?

Elon Musk

Some were getting payments from Social Security. But the main fraud vector was to mark somebody as alive in Social Security and then use every other government payment system to basically do fraud. Because what those other government payment systems do, they would simply do an “are you alive” check to the Social Security database. It’s a bank shot.

Dwarkesh Patel

What would you estimate is the total amount of fraud from this mechanism?

Elon Musk

By the way, the Government Accountability Office has done these estimates before. I’m not the only one. In fact, I think the GAO did an analysis, a rough estimate of fraud during the Biden administration, and calculated it at roughly half a trillion dollars. So don’t take my word for it. Take a report issued during the Biden administration. How about that?

Dwarkesh Patel

From this Social Security mechanism?

Elon Musk

It’s one of many. It’s important to appreciate that the government is very ineffective at stopping fraud. It’s not like a company where, with stopping fraud, you’ve got a motivation because it’s affecting the earnings of your company. The government just prints more money. You need caring and competence. These are in short supply at the federal level.

When you go to the DMV, do you think, “Wow, this is a bastion of competence”? Well, now imagine it’s worse than the DMV because it’s the DMV that can print money.

At least the state level DMVs need to... The states more or less need to stay within their budget or they go bankrupt. But the federal government just prints more money.

Dwarkesh Patel

If there’s actually half a trillion of fraud, why was it not possible to cut all that?

Elon Musk

You really have to stand back and recalibrate your expectations for competence. Because you’re operating in a world where you’ve got to make ends meet. You’ve got to pay your bills...

Dwarkesh Patel

Find the microphones.

Elon Musk

Exactly. It’s not like there’s a giant, largely uncaring monster bureaucracy. It’s a bunch of anachronistic computers that are just sending payments. One of the things that the DOGE team did sounds so simple and probably will save $100-200 billion a year. It was simply requiring that payments going out from the main Treasury computer (which is called PAM, Payment Accounts Master or something like that; there’s $5 trillion in payments a year) have a payment appropriation code. Make it mandatory, not optional, that you have anything at all in the comment field.

You have to recalibrate how dumb things are. Payments were being sent out with no appropriation code, not checking back to any congressional appropriation, and with no explanation. This is why the Department of War, formerly the Department of Defense, cannot pass an audit, because the information is literally not there. Recalibrate your expectations.

Dwarkesh Patel

I want to better understand this half a trillion number, because there’s an IG report in 2024.

Elon Musk

Why is it so low?

Dwarkesh Patel

Maybe, but we found that over seven years, the Social Security fraud they estimated was like $70 billion over seven years, so like $10 billion a year. So I’d be curious to see what the other $490 billion is.

Elon Musk

Federal government expenditures are $7.5 trillion a year. How competent do you think the government is?

Dwarkesh Patel

The discretionary spending there is like… 15%?

Elon Musk

But it doesn’t matter. Most of the fraud is non-discretionary. It’s basically fraudulent Medicare, Medicaid, Social Security, disability. There’s a zillion government payments. A bunch of these payments are in fact block transfers to the states. So the federal government doesn’t even have the information in a lot of cases to even know if there’s fraud.

Let’s consider reductio ad absurdum. The government is perfect and has no fraud. What is your probability estimate of that? Zero. Okay, so then would you say, fraud and waste at the government is 90% efficient? That also would be quite generous.

But if it’s only 90%, that means that there’s $750 billion a year of waste and fraud. And it’s not 90%. It’s not 90% effective.

Dwarkesh Patel

This seems like a strange way to first principles the amount of fraud in the government. Just like, how much do you think there is?

Anyways, we don’t have to do it live, but I’d be curious—

Elon Musk

You know a lot about fraud at Stripe? People are constantly trying to do fraud.

John Collison

Yeah, but as you say, it’s a little bit of a... We’ve really ground it down, but it’s a little bit of a different problem space because you’re dealing with a much more heterogeneous set of fraud vectors here than we are.

Elon Musk

But at Stripe, you have high competence and you try hard. You have high competence and high caring, but still fraud is non-zero. Now imagine it’s at a much bigger scale, there’s much less competence, and much less caring.

At PayPal back in the day, we tried to manage fraud down to about 1% of the payment volume. That was very difficult. It took a tremendous amount of competence and caring to get fraud merely to 1%. Now imagine that you’re an organization where there’s much less caring and much less competence. It’s going to be much more than 1%.

John Collison

How do you feel now looking back on politics and doing stuff there? Looking from the outside in, two things have been quite impactful: one, the America PAC, and two, the acquisition of Twitter at the time. But also it seems like there was a bunch of heartache. What’s your grading of the whole experience?

Elon Musk

I think those things needed to be done to maximize the probability that the future is good. Politics generally is very tribal. People lose their objectivity usually with politics. They generally have trouble seeing the good on the other side or the bad on their own side. That’s generally how it goes. That, I guess, was one of the things that surprised me the most.

You often simply cannot reason with people if they’re in one tribe or the other. They simply believe that everything their tribe does is good and anything the other political tribe does is bad. Persuading them otherwise is almost impossible.

But I think overall those actions—acquiring Twitter, getting Trump elected, even though it makes a lot of people angry—I think those actions were good for civilization.

Dwarkesh Patel

How does it feed into the future you’re excited about?

Elon Musk

Well, America needs to be strong enough to last long enough to extend life to other planets and to get AI and robotics to the point where we can ensure that the future is good.

On the other hand, if we were to descend into, say, communism or some situation where the state was extremely oppressive, that would mean that we might not be able to become multi-planetary. The state might stamp out our progress in AI and robotics.

Dwarkesh Patel

Optimus, Grok, et cetera. Not just yours, but any revenue-maximizing company’s products will be leveraged by the government over time. How does this concern manifest in what private companies should be willing to give governments? What kinds of guardrails?

Should AI models be made to do whatever the government that has contracted them asks them to do? Should Grok get to say, “Actually, even if the military wants to do X, no, Grok will not do that”?

Elon Musk

I think maybe the biggest danger of AI and robotics going wrong is government. People who are opposed to corporations or worried about corporations should really worry the most about government. Because government is just a corporation in the limit. Government is just the biggest corporation with a monopoly on violence.

I always find it a strange dichotomy where people would think corporations are bad, but the government is good, when the government is simply the biggest and worst corporation. But people have that dichotomy. They somehow think at the same time that government can be good, but corporations bad, and this is not true. Corporations have better morality than the government.

I actually think it’s a thing to be worried about. The government could potentially use AI and robotics to suppress the population. That is a serious concern.

Dwarkesh Patel

As the guy building AI and robotics, how do you prevent that?

Elon Musk

If you limit the powers of government, which is really what the US Constitution is intended to do, to limit the powers of government, then you’re probably going to have a better outcome than if you have more government.

John Collison

Robotics will be available to all governments, right?

Elon Musk

I don’t know about all governments. It’s difficult to predict. I can say what’s the endpoint, or what is many years in the future, but it’s difficult to predict the path along that way. If civilization progresses, AI will vastly exceed the sum of all human intelligence. There will be far more robots than humans. Along the way what happens is very difficult to predict.

Dwarkesh Patel

It seems one thing you could do is just say, “whatever government X, you’re not allowed to use Optimus to do X, Y, Z.” Just write out a policy. I think you tweeted recently that Grok should have a moral constitution. One of those things could be that we limit what governments are allowed to do with this advanced technology.

Elon Musk

Technically if politicians pass a law and they can enforce that law, then it’s hard to not do that law. The best thing we can have is limited government where you have the appropriate crosschecks between the executive, judicial, and legislative branches.

Dwarkesh Patel

The reason I’m curious about it is that at some point it seems the limits will come from you. You’ve got the Optimus, you’ve got the space GPUs…

Elon Musk

You think I’ll be the boss of the government?

Dwarkesh Patel

Already it’s the case with SpaceX that for things that are crucial—the government really cares about getting certain satellites up in space or whatever—it needs SpaceX. It is the necessary contractor.

You are in the process of building more and more of the technological components of the future that will have an analogous role in different industries. You could have this ability to set some policy that suppressing classical liberalism in any way… “My companies will not help in any way with that”, or some policy like that.

Elon Musk

I will do my best to ensure that anything that’s within my control maximizes the good outcome for humanity. I think anything else would be shortsighted, because obviously I’m part of humanity, so I like humans. Pro human.

02:38:28 - TeraFab

Dwarkesh Patel

You mentioned that Dojo 3 will be used for space-based compute.

Elon Musk

You really read what I say.

Dwarkesh Patel

I don’t know if you know, Elon, but you have a lot of followers.

Elon Musk

Dead giveaway. How did you discern my secrets? Oh I posted them on X.

Dwarkesh Patel

How do you design a chip for space? What changes?

Elon Musk

You want to design it to be more radiation tolerant and run at a higher temperature. Roughly, if you increase the operating temperature by 20% in degrees Kelvin, you can cut your radiator mass in half. So running at a higher temperature is helpful in space.

There are various things you can do for shielding the memory. But neural nets are going to be very resilient to bit flips. Most of what happens for radiation is random bit flips. But if you’ve got a multi-trillion parameter model and you get a few bit flips, it doesn’t matter. Heuristic programs are going to be much more sensitive to bit flips than some giant parameter file.

I just design it to run hot. I think you pretty much do it the same way that you do things on Earth, apart from making it run hotter.
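
The radiator figure above is consistent with the Stefan-Boltzmann law, under which the power an ideal radiator sheds per unit area scales with the fourth power of its absolute temperature. Here is a minimal sketch of that arithmetic. It assumes radiator mass is proportional to radiator area and ignores heat absorbed from the Sun or Earth; both simplifications, and the function name, are illustrative assumptions rather than anything stated in the conversation.

```python
def radiator_area_ratio(temp_increase_fraction: float) -> float:
    """Return new radiator area divided by old area for the same heat load.

    Assumes an ideal radiator whose emitted power per unit area scales
    with T^4 (Stefan-Boltzmann), with T in kelvin, and no absorbed heat.
    """
    # Same heat load: area_new * T_new^4 == area_old * T_old^4
    return 1.0 / (1.0 + temp_increase_fraction) ** 4


# A 20% increase in operating temperature, measured in kelvin.
ratio = radiator_area_ratio(0.20)
print(f"Radiator area needed: {ratio:.1%} of the original")
```

Since 1.2 to the fourth power is about 2.07, the required area drops to a little under half, which matches the "cut your radiator mass in half" rule of thumb if mass tracks area.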

Dwarkesh Patel

The solar array is most of the weight on the satellite. Is there a way to make the GPUs even more powerful than what Nvidia, TPUs, et cetera are planning on doing that would be especially privileged in the space-based world?

Elon Musk

The basic math is, if you can do about a kilowatt per reticle, then you’d need 100 million full reticle chips to do 100 gigawatts. Depending on what your yield assumptions are, that tells you how many chips you need to make. If you’re going to have 100 gigawatts of power, you need 100 million chips that are running at a kilowatt sustained, per reticle. Basic math.

Dwarkesh Patel

100 million chips depends on… If you look at the die size of something like Blackwell GPUs or something, and how many you can get out of a wafer, you can get on the order of dozens or less per wafer. So basically, this is a world where if we’re putting that out every single year, you’re producing millions of wafers a month. That’s the plan with TeraFab? Millions of wafers a month of advanced process nodes?

Elon Musk

Yeah it could be north of a million or something. You’ve got to do the memory too.
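
The chip and wafer arithmetic in this exchange can be sketched as follows. The one kilowatt per reticle and 100 gigawatt figures come from the conversation. The value of 30 good dies per wafer is only an illustrative assumption standing in for "dozens or less", and the function names are hypothetical. The sketch covers logic wafers only and leaves out memory and packaging.

```python
def chips_needed(total_power_watts: int, watts_per_chip: int) -> int:
    """Number of chips required to draw the given total power."""
    return total_power_watts // watts_per_chip


def wafers_per_month(chips_per_year: int, good_dies_per_wafer: float) -> float:
    """Logic wafers per month needed to produce chips_per_year."""
    return chips_per_year / good_dies_per_wafer / 12


# 100 gigawatts at one kilowatt sustained per full-reticle chip.
chips = chips_needed(100 * 10**9, 1_000)

# Assumed: 30 good dies per wafer, all chips built within one year.
wafers = wafers_per_month(chips, good_dies_per_wafer=30)

print(f"Chips needed: {chips:,}")
print(f"Logic wafers per month: {wafers:,.0f}")
```

Under these assumed inputs the logic wafers alone come to roughly 280,000 a month. The result is very sensitive to the inputs: lower yield, fewer dies per wafer, and the memory wafers mentioned above all push the total higher.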

Dwarkesh Patel

Are you going to make a memory fab?

Elon Musk

I think the TeraFab’s got to do memory. It’s got to do logic, memory, and packaging.

Dwarkesh Patel

I’m very curious how somebody gets started. This is the most complicated thing man has ever made. Obviously, if anybody’s up to the task, you’re up to the task. So you realize it’s a bottleneck, and you go to your engineers. What do you tell them to do? “I want a million wafers a month in 2030.”

Elon Musk

That’s right. That’s exactly what I want.

Dwarkesh Patel

Do you call ASML? What is the next step?

John Collison

Not so much to ask.

Elon Musk

We make a little fab and see what happens. Make our mistakes at a small scale and then make a big one.

Dwarkesh Patel

Is a little fab done?

Elon Musk

No, it’s not done. We’re not going to keep that cat in the bag. That cat’s going to come out of the bag. There’ll be drones hovering over the bloody thing. You’ll be able to see its construction progress on X in real time.

Look, I don’t know, we could just flounder in failure, to be fair. Success is not guaranteed. Since we want to try to make something like 100 million… We want 100 gigawatts of power and chips that can take 100 gigawatts by 2030. We’ll take as many chips as our suppliers will give us. I’ve actually said this to TSMC and Samsung and Micron: “please build more fabs faster”. We will guarantee to buy the output of those fabs. So they’re already moving as fast as they can. It’s us plus them.

John Collison

There’s a narrative that the people doing AI want a very large number of chips as quickly as possible. Then many of the input suppliers, the fabs, but also the turbine manufacturers, are not ramping up production very quickly.

Elon Musk

No, they’re not.

John Collison

The explanation you hear is that they’re dispositionally conservative. They’re Taiwanese or German, as the story may be. They just don’t believe... Is that really the explanation or is there something else?

Elon Musk

Well, it’s reasonable to... If somebody’s been in the computer memory business for 30 or 40 years…

John Collison

They’ve seen cycles.

Elon Musk

They’ve seen boom and bust 10 times. That’s a lot of layers of scar tissue. During the boom times, it looks like everything is going to be great forever. Then the crash happens and they’re desperately trying to avoid bankruptcy. Then there’s another boom and another crash.

John Collison

Are there other ideas you think others should go pursue that you’re not for whatever reasons right now?

Elon Musk

There are a few companies that are pursuing new ways of doing chips, but they’re just not scaling fast.

John Collison

I don’t even mean within AI, I mean just generally.

Elon Musk

People should do the thing where they find that they’re highly motivated to do that thing, as opposed to some idea that I suggest. They should do the thing that they find personally interesting and motivating to do.

But going back to the limiting factor… I used that phrase about 100 times. The current limiting factor that I see in the three to four year timeframe, it’s chips. In the one year timeframe, it’s energy, power production, electricity. It’s not clear to me that there’s enough usable electricity to turn on all the AI chips that are being made.

Towards the end of this year, I think people are going to have real trouble turning on... The chip output will exceed the ability to turn chips on.

Dwarkesh Patel

What’s your plan to deal with that world?

Elon Musk

We’re trying to accelerate electricity production. I guess that’s maybe one of the reasons that xAI will be maybe the leader, hopefully the leader. We’ll be able to turn on more chips than other people can turn on, faster, because we’re good at hardware.

Generally, the innovations from the corporations that call themselves labs, the ideas tend to flow… It’s rare to see that there’s more than about a six-month difference. The ideas travel back and forth with the people.

So I think you sort of hit the hardware wall and then whichever company can scale hardware the fastest will be the leader. So I think xAI will be able to scale hardware the fastest and therefore most likely will be the leader.

John Collison

You joked or were self-conscious about using the “limiting factor” phrase again. But I actually think there’s something deep here. If you look at a lot of things we’ve touched on over the course of it, it’s maybe a good note to end on. If you think of a senescent, low-agency company, it would have some bottleneck and not really be doing anything about it.

Marc Andreessen had the line of, “most people are willing to endure any amount of chronic pain to avoid acute pain”. It feels like a lot of the cases we’re talking about are just leaning into the acute pain, whatever it is. “Okay, we got to figure out how to work with steel, or we got to figure out how to run the chips in space.” We’ll take some near-term acute pain to actually solve the bottleneck. So that’s kind of a unifying theme.

Elon Musk

I have a high pain threshold. That’s helpful.

John Collison

To solve the bottleneck.

Elon Musk

Yes. One thing I can say is, I think the future is going to be very interesting. As I said at Davos—I think I was on the ground for like three hours or something—it’s better to err on the side of optimism and be wrong than err on the side of pessimism and be right, for quality of life. You’ll be happier if you err on the side of optimism rather than erring on the side of pessimism. So I recommend erring on the side of optimism.

John Collison

Here’s to that.

Dwarkesh Patel

Cool. Elon, thanks for doing this.

John Collison

Thank you.

Elon Musk

All right, thanks guys. All right.

John Collison

Great stamina.

Dwarkesh Patel

Hopefully this didn’t count as a pain in the pain tolerance.

Hiring scouts to help me find guests

Dwarkesh Patel — Thu, 15 Jan 2026 16:02:50 GMT

My main bottleneck is finding excellent guests. So, I’m hiring a couple of part-time scouts to help me find the next David Reich/Sarah Paine/Adam Brown.

$100/hour, fully remote, work hours are flexible - I expect it’ll be 5-10 hours a week.

Ideal candidate is maybe a grad student, or a postdoc, or someone working in one of the fields I wanna find guests in. I’m looking for people who are really plugged into some discipline and have high taste.

Beyond just scouting guests, I’ll want your help assembling curriculums that help me prep for interviews and rapidly get up to speed.

The application form is here, and it’s extremely simple - just pitch me on a guest and tell me a bit about yourself. Please submit by 11:59 PM Pacific, Friday, Jan 23.

I’m looking to hire ~one scout for each of the following fields: bio, history, econ, math/physics, AI/hardware.

However, it’s very possible I end up hiring more (or fewer), or break apart the domains of knowledge in a different way, based on the range of expertise of the best people who apply.

What I’m looking for in guests

I’m looking for people who are deep experts in at least one field, and who are polymathic enough to think through all kinds of tangential questions in a really interesting way.

So I’m selecting for this synthetic ability to connect one’s expertise to all kinds of important questions about the world - an ability which is often deliberately masked in public academic work. Which means that it can only really come out in conversation.

That’s why I want to hire scouts. I need their network and context - they know who the polymathic geniuses are, who gave a fascinating lecture at the last big conference they attended, who can just connect all kinds of interesting ideas in the field together over conversation, etc.

We get tons of inbound from people who are working on impressive companies or doing interesting research projects. But almost always it’s a no; while I think their work is important, it’s self-contained in a way that I worry won’t lead to interesting broad discussion.

To get a little more concrete, let me talk through why I think some of my recent favorite interviews worked especially well, so you can think about which people in fields you’re familiar with fit a similar mold.

Jacob Kimmel: A lot of people who pitch themselves as guests are capable of only talking about their own research. But the amazing thing about Jacob is that he is an insane polymath. For example, he could explain why evolution didn’t select for longevity by drawing deep analogies to how gradients flow in ML models. He had all these other random interesting takes, from why humans never evolved their own antibiotics to how there’s this gene that used to protect us from HIV-like viruses but got repurposed, which hints at some ghost scourge. And then he could zoom out and give a great diagnosis of what’s bottlenecking pharma progress. I really want to emphasize how that’s different from other brilliant people I get pitched – these people are also doing incredible research, but they don’t have this range of really deep, interesting takes. That part is super crucial.
David Reich: It’s actually quite surprising that my second most popular guest of all time is a geneticist of ancient DNA. How did that happen? Here’s why I think this episode blew up. In high school, you get some vague explanation of human evolution. And you feel like you understand it and can move on with your life. And here comes David, showing you how this very fundamental topic, which you assume was settled and haven’t bothered thinking about in years, is actually way more murky and surprising than you realized, and how new discoveries are totally overturning our basic understanding of the field (in this case, the how, when, where of human evolution)1.
Andrej Karpathy: It’s extremely rare to get someone who is expert-level in a technical, fast-moving, and frothy field, but who has no vested interest in a particular company or approach, and who is in a position to just give an unbiased lay of the land. I have a couple questions below about biotech or formal math or robotics in the wake of AI progress - if there’s a Karpathy-type person in those fields, I’d be very keen to get a technical lay of the land and vibe check of what claims are credible versus crazy.

Some recent questions

In case it’s helpful for brainstorming a guest, I’ve listed out a few big questions that have been on my mind recently. But please feel free to ignore them - there are way more interesting questions in the world than the ones I am aware of - feel free to say, “You might not yet be curious about antibody development/the history of language/the dark ages/battery tech, but the guest I have in mind for that topic is so good that it’s going to be your next big banger episode.”

Bio

Dario’s Machines of Loving Grace argues we’ll compress a century of bio progress into a few years - that big breakthroughs like CAR-T therapy, mRNA vaccines, cheap genome sequencing, etc. show how in the long run things which seem like data or physical bottlenecks can be solved by better tools to measure/predict/perturb/understand biological systems, and these tools are downstream of intelligence. But here’s what I don’t fully understand: over the last 3 decades, we’ve seen a million-fold reduction in genome sequencing costs, 1000-fold decrease in DNA synthesis costs, the development of precise gene editing tools like CRISPR, and the ability to conduct massively parallel experiments through multiplexing techniques. But it doesn’t seem like we’re curing diseases or coming up with new treatments at a faster rate now than we were 30 years ago. If anything, drug development is slowing down. I want to find a biology researcher who can think through how plausible a 10x or 100x speedup in new drug discovery actually is. They should obviously know a lot about and have hot takes on what’s actually bottlenecking progress today, and they should be flexible enough to imagine what might change with much more intelligence.
What exactly is the special sauce of the brain that we’re still missing? Adam Marblestone thinks it’s the curriculum of reward functions and the learning/steering subsystems. Others argue that gradient descent is fundamentally worse than how the brain learns within a lifetime (which is closer to in-context learning in its flexibility and sample efficiency).

Math/Physics

I’ve been really enjoying Strogatz’s Nonlinear Dynamics and Chaos textbook, and I want to make something podcast-shaped out of it. Strogatz himself has deferred until after he finishes his next book, so I’m looking for another mathematician on a related topic. I think the right format here isn’t a normal meandering interview - it’s something more like a lecture. A mathematician comes in with a specific topic or example we can deep dive on. He posts up at a blackboard, starts explaining a topic, and I interrupt to clarify confusions and ask follow-up questions. The model is something like Terence Tao and Grant Sanderson’s cosmic distance ladder video. Who can replicate something similar with me with some independently explainable topic in chaos/nonlinear dynamics or adjacent topics? I’d be especially keen if someone can present something on how the topics in this textbook tie into ML (see for example Neural network training makes beautiful fractals).
What real world impact should we expect from the current batch of AI for math projects? What are the fields of technology where people are going, “Ah we could totally solve quantum computing (or fusion or AGI) if only we had more theorems!” But maybe problems in biology and physics and materials and so on reduce down to math in a way I’m not foreseeing, and automating formal math alone is enough to unlock a bunch of progress. See footnotes for some more questions I wanna ask the right guest on this topic.
I started reading Proofs and Refutations, which is this famous 1976 book by the Hungarian mathematician Imre Lakatos about the philosophy of mathematics. He says math involves a lot of changing definitions and swapping lemmas in order to deal with different counterexamples. This seems fine for a good faith mathematical community, but super reward hackable for these AI-for-math models. Also it involves a lot of realizing how a problem in one domain is really a problem in another, and noticing the meta level pattern - AIs so far have been especially bad at this kind of thing. If math is just proof search within a fixed formal system, then AI can help a lot. But if its dialectical construction and refinement of concepts (based on what tasteful parsimonious definition can withstand counterexamples) , then I feel self play and ‘automated cleverness’ alone won’t do the trick. But maybe automated counterexamples are super useful. I’m sure for practicing mathematicians there’s a bunch of stuff that’s naive or wrong about the above. Would love to chat out what the actual research math process is like, and what good it would do to automate it.

AI/hardware

RL progress has been very fast, but it’s partly the result of going from almost nothing to 1e26 FLOPs training compute in a year (aka like going from GPT-1 to GPT-4.5). It’s still possible that it has terrible scaling exponents and further progress will be very slow. And also it’s not clear how much of the progress over the last year comes from inference scaling, which has worse variable economics. But on the other hand, maybe there’s a ton of low hanging fruit in improving RL - with pretraining, there’s been 5 years of developing the theory and empirics of optimal batch sizes, learning rates, architectures, etc. As that low hanging fruit is picked, maybe RL progress continues to be fast? The other big question about RL training is how much transfer learning are we seeing - is there all this crazy meta learning that’s not directly induced by any env and which will enable flexible human-like labor soon? I have no idea. My friends at labs who are actually doing this training obviously wouldn’t tell me. But I want to actually concretely understand what’s going on here.

History

There’s the famous Needham question, which asks why China didn’t industrialize first despite leading the world in population, inventions, and bureaucratic sophistication. I find the standard explanation of how this centralized Ming/Qing regime damped invention and exploration unsatisfying. Or at least I don’t understand it concretely. It’s such a big country - how can you retard progress across the whole thing, especially given that state capacity was presumably weaker in the past? Or at least I assume it was - what did a provincial bureaucrat actually do day-to-day? Was there a price system? Private property? How did the state actually interfere with merchants and artisans?

Economics

There’s something unsatisfying about the arguments that we’ll see 20%+ explosive economic growth from AI. Even if true, what does that mean? What is actually happening? I thought Machines of Loving Grace was a great account of what plausibly is happening on the human facing side of the singularity - aka the FLOPs that are going towards curing disease. But presumably most of what is happening is investment towards more robots, more compute, etc. My sense of what that side of things looks like is so murky and handwavy. There is a version of Machines of Loving Grace you can do that is somewhat concrete about all the sci fi shit - not just gesturing at the galaxies, but getting specific about the space GPUs and Factorio-like solar tiling and all the other things I’m not thinking of which are relevant to understanding 2040. Presumably the right guest is someone who is really strong in engineering/physics and economics and has a penchant for sci-fi and has a lot of concrete ideas here.
What should India or Nigeria or for that matter any country not directly in the semiconductor/foundation model supply chain do right now? If the main mechanism of catchup growth goes away (namely, that the underutilized labor of developing countries can rapidly be made more productive with capital and know-how from the developed world), what happens to all these countries that are not China or the US?

Just to give you a sample of some of the surprising findings that he talked through:

70,000 years ago, half a dozen different species of humans (Neanderthals, Denisovans, ‘Hobbits’, etc) lived across Eurasia. And then some small group of modern humans (only 1,000 to 10,000 people) drove all of them to extinction. Everyone native to Eurasia and America is descended from this one tribe.
Neanderthals may have gotten 30-70% of their DNA from modern humans. Which implies that maybe non-Africans today are actually “Neanderthals who became modernized by waves and waves of admixture” rather than modern humans with a bit of Neanderthal mixed in.
Yersinia pestis (bubonic plague bacteria) may have killed a quarter to half of all people in Western Eurasia for thousands of years, starting around 5,000 years ago. And may be central to explaining everything from the Yamnaya expansion to the fall of Rome to the Industrial Revolution.
It’s not clear modern humans were even primarily in Africa during the key period (2 million to 500,000 years ago) when human brains diverged from those of other species. Our lineage may have resided in Eurasia for significant stretches.

Okay I’ll stop, but you see my point. What are the other fields like human evolution, and the other presenters like David Reich, who will make you go, “What the fuck, I had no idea.”

David being David is actually a huge piece of the puzzle here which I want to replicate. He’s just incredibly deep and polymathic on what may from the outside look like one field but is in fact very many, from population genetics to archeology to linguistics. And while he’s intellectually humble enough to make qualifiers, he will (and this is very important) go ahead and give hot takes and start speculating about connections between fields and how different hypotheses relate to each other and so on. He won’t just stay at, “Our results show a genetic cline between North and South Indians.” He’ll say, “And we could be wrong here, but this suggests that the caste system which enforced these never-otherwise-seen levels of endogamy has been incredibly strong for millennia.”

What I've been reading recently - Jan 10, 2026

Dwarkesh Patel — Sat, 10 Jan 2026 20:30:07 GMT

I was recently chatting with a friend who has a similar job to mine. We were talking about how even though our jobs are fundamentally about learning about stuff, our time so easily gets sucked up by other things. So to hold myself accountable, I’m gonna try to publish a blog post every two weeks or so where I explain what I’ve been reading.

Max Hodak’s theory of consciousness

I’m totally gonna butcher this - please excuse. If you wanna get the real deal, go check out his summary blog post and his full talk on this topic.

Max is focused on two big sub-questions which together form “the binding problem”:

Mode binding: how do color, shape, texture, and motion get combined into a unified visual percept of “a red cup”?
Moment binding: why do we experience all the neurons firing across our entire brain over the course of 10s of milliseconds as a single quantum of experience?

Max thinks each of these binding sub-problems is related to a brain wave:

Gamma waves - 40 Hz - Fast, local coordination of nearby neurons to get on the same page about what they’re representing.
Alpha waves - 10 Hz - Slower waves that run through the whole brain and unify experience - think of these like the forward pass of the brain.
- Two cool things about alpha waves I hadn’t realized: 1. neurons ride the peak of this oscillation, and 2. when alpha waves slow down or speed up (fight or flight reactions, etc), people experience time dilation.

Anyways, Max points out that the brain is storing a bunch of structured representations about the world physically, and some feedback controller has to go in and make sure that these representations are correct. This is part of what the alpha waves are doing. And this feedback control and binding is consciousness. I’m glossing over a bunch of logical connections that I definitely don’t understand. But I’ll leave it here.

I know Max could provide a really good answer, but just talking to myself, I’m confused on what the reason is to think that feedback control = consciousness? By this logic, does memory refresh = consciousness too?

Max thinks that figuring out what’s up with consciousness will mean discovering new physics. And specifically, physics at the level of the 4 fundamental forces - some property as basic as mass or charge. His logic is that either consciousness has no real impact on the world (it’s just a byproduct of other stuff the brain does), which would be odd, or it actually has an effect, which would mean it’s new physics.

I’m not sure I buy this. 1. Can’t it be an effect that’s best understood as an implication of existing laws of physics? The fact that wood floats on water has an impact on the world, but you don’t need new physics to explain it. 2. Doesn’t it seem implausible that evolution blindly stumbled upon and is now making good use of a whole undiscovered physical field which we have never managed to actually interact with using our technology, nor seen summoned anywhere else in the universe?

Nonlinear Dynamics and Chaos by Steven Strogatz

I’m only 3 chapters in, so I’ve only got the building blocks so far. The fundamental idea is this. It’s often hard to anticipate how a system will evolve just by observing a bunch of different trajectories over time. But it’s much easier to see what will happen if you plot how the system will evolve from different starting points. The examples get more and more interesting, and because Strogatz focuses on the graphical and geometric interpretations, the motivating problems are super satisfying; the book is really a bunch of 3Blue1Brown videos on a certain topic stapled together.

Side note: I could not have understood anything here if I didn’t have LLMs and couldn’t watch the lectures async. I paused every minute or so (to clarify some confusion with a chatbot or to try and anticipate the next step), and I had the same section of textbook open at the same time.

I’m now wondering to myself, “How the hell did I learn anything in college at all?” I would be so lost if I was actually taking this course in college and just attending the lectures live.

In college, I actually did bounce out of a difficult course that I feel like I could totally learn today with LLMs and async lectures + my adult executive function.

As I was working through these examples (some inspired by actual papers), I kept thinking about what parts the “automated cleverness” (Terry Tao’s term) of today’s AIs could actually help with.

It’s crazy how much understanding you can get about a physical system through mathematics. But that understanding is so dependent on insight and interpretation.

To give one example, Section 3.7 has a really clever model of an insect outbreak, showing how budworms, birds, and trees play out against each other given different growth rates and other dynamics.

But first you have to figure out the right dimensionless forms. And that requires judgment about which dimensions actually matter. In the insect model, the choice was to think in terms of R and K and treat the bird population as basically an artifact of those parameters. But you could have done it the other way around—from the basis of birds.

Then there’s how you make the visualization. Once you’ve got the dynamics in dimensionless form, you could just graph the equation and find the fixed points. But the result would be almost impossible to interpret. Graph it a different way, though, and suddenly the intercepts align with your intuition. You can actually see the three regimes: where carrying capacity is so low the population never gets going, where birds keep things in check, and where the outbreak has outgrown the birds’ ability to control it.
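To make the graphical trick concrete, here is a minimal sketch. It assumes the rescaled budworm equation has the form dx/dτ = r·x·(1 − x/k) − x²/(1 + x²), where lowercase r and k stand for the dimensionless growth rate and carrying capacity. Dividing out the trivial fixed point x = 0 leaves a straight line, r(1 − x/k), that has to cross a fixed curve, x/(1 + x²). The function names, the grid-plus-bisection root finder, and the parameter values are all illustrative choices, not anything taken from the book.

```python
def line(x, r, k):
    """Per-capita logistic growth, r * (1 - x/k): a straight line in x."""
    return r * (1.0 - x / k)


def curve(x):
    """Per-capita predation by birds, x / (1 + x^2): independent of r and k."""
    return x / (1.0 + x * x)


def nonzero_fixed_points(r, k, steps=4000):
    """Find x > 0 where the line and the curve cross (grid scan + bisection).

    Exact tangencies between line and curve can be missed by this scan.
    """
    def gap(x):
        return line(x, r, k) - curve(x)

    roots = []
    dx = k / steps
    for i in range(steps):
        a, b = i * dx, (i + 1) * dx
        if gap(a) * gap(b) < 0.0:  # sign change: a crossing lies in (a, b)
            for _ in range(60):
                mid = 0.5 * (a + b)
                if gap(a) * gap(mid) <= 0.0:
                    b = mid
                else:
                    a = mid
            roots.append(0.5 * (a + b))
    return roots


# Hypothetical parameter values, chosen only to show different regimes.
for r in (0.1, 0.4, 0.8):
    print(r, [round(x, 2) for x in nonzero_fixed_points(r, k=20.0)])
```

With k held at 20, the small r gives a single low crossing, the middle r gives three crossings (a low state, a high state, and a threshold between them), and the large r leaves only a high crossing. Only the line moves when r and k change, which is why this way of graphing is so much easier to read than the raw equation.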

This kind of insight is inseparable from understanding what you’re even trying to learn about the system. And I’m skeptical today’s AI helps much here. When these methods were first developed, the right forms and interpretations weren’t obvious. The mathematician who wrote the original paper had to come up with new insights about how to think about the problem.

Maybe models are now good enough to apply these methods to new systems that fit the same template. But that just means the few mathematicians who invent genuinely new frameworks are the only ones who stay relevant.

Machines of Loving Grace by Dario Amodei

Starting with the biology section: Dario argues that we’ll get a century of bio progress in a few years. His argument:

Most bio progress is driven by breakthrough discoveries which give you whole new primitives for what you can measure, change, or predict (CAR-T therapy, mRNA vaccines, CRISPR, genome sequencing costs declining so much, etc).
These discoveries seem to have been made in scrappy haphazard ways, often years after they were initially possible, and often by people responsible for other breakthroughs as well. All 3 of these observations hint that they are bottlenecked by intelligence.
Dario acknowledges that data is a huge bottleneck for bio. But the tools we have for collecting data can also be expanded by intelligence. Human researchers came up with multiplexing and AlphaFold and Perturb-Seq - the AI researchers will come up with even more.

Here’s the counterargument. The kinds of human researcher breakthroughs he uses as examples of what AI could do more of haven’t had a huge impact on health. Over the last 3 decades, we’ve seen a million-fold reduction in genome sequencing costs, 1000-fold decrease in DNA synthesis costs, the development of precise gene editing tools like CRISPR, and the ability to conduct massively parallel experiments through multiplexing techniques. But it doesn’t seem like we’re curing diseases or coming up with new treatments at a faster rate now than we were 30 years ago. If anything, drug development is slowing down. Why think that AI will be able to fundamentally change this dynamic?

Relatedly, Jacob Trefethan has an excellent blog post that makes the argument that AI won’t speed up medical progress that much (he also steelmans the opposite point in this other post). Jacob points out that making a drug to cure something like Alzheimer’s is really hard. Raw understanding of some of the disease life cycle (which more intelligence could give you more of) is not enough. We understand that Alzheimer’s is clearly linked to Amyloid beta, and there are now many different drugs trying to remove amyloid plaques which have all not worked. Even if we get more insights like the Amyloid beta thing from AI scientists, that alone will not be enough to identify the correct targets. You just have to do a bunch of experiments on live humans.

This is why Dario’s point about clinical trials falls flat. He argues that clinical trials are currently slow because we just don’t know whether a given drug will actually work. But if we had much greater confidence, like we did with the mRNA vaccines for COVID, then we could test and approve drugs much faster. However, I don’t see why we should think that modulo the full hyperrealistic simulation of the human body, we could tell ex ante which drugs are gonna work. I don’t yet buy the argument that a million George Church clones in a datacenter could derisk all the drug trials.

Quick notes on other parts of the essay:

Overall I find it pretty impressive that a tech CEO is this generally thoughtful.
The poverty and econ section doesn’t address that the main mechanism of catchup growth goes away post AGI; namely developing countries have lots of underutilized labor which is bottlenecking production, and because the marginal product of labor is high in the world today, those countries can get rich fast. So how exactly are these other countries catching up?
The key point that underlies his framework that intelligence can drive a century of progress in 5-10 years: “Things that are hard constraints in the short run may become more malleable to intelligence in the long run. For example, intelligence might be used to develop a new experimental paradigm that allows us to learn in vitro what used to require live animal experiments, or to build the tools needed to collect new data (e.g. the bigger particle accelerator), or to (within ethical limits) find ways around human-based constraints (e.g. helping to improve the clinical trial system, helping to create new jurisdictions where clinical trials have less bureaucracy, or improving the science itself to make human clinical trials less necessary or cheaper).”
- It’s interesting to consider why this isn’t true for factors of production today. We live in a (relatively) capital-abundant and labor-scarce world. That is reflected in the labor share of income being 2x as high as the capital share of income. But this has been true for centuries upon centuries. Contra Piketty in “Capital in the Twenty-First Century”, all these capital holders have not been able to get some runaway capital accumulation process going by figuring out a way around labor constraints. Why think that intelligence will be any different than capital in its ability to get around other factors of production? Maybe the argument is that intelligence can actually help generate the other factors of production in a way that capital can’t.

Neural network training makes beautiful fractals by Jascha Sohl-Dickstein

Absolutely fascinating blog post.

You want to train your model at the highest possible learning rate under which it still converges. But the boundary of convergence versus divergence is fractal, which makes these hyperparameters really hard to optimize for via gradient descent.
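To pin down what “the boundary of convergence versus divergence” means operationally, here is a toy sketch. It is not the setup from the blog post: it assumes a made-up two-weight model with loss 0.5·(w0·w1 − 1)², one learning rate per weight, and a coarse grid. The function names, starting weights, and thresholds are all hypothetical. The toy is not claimed to show fractal structure; the post’s images come from real networks scanned at far finer resolution.

```python
import math


def converges(lr0, lr1, steps=200, w0=0.5, w1=0.5):
    """Run plain gradient descent on 0.5 * (w0 * w1 - 1)**2.

    Each weight gets its own learning rate. Returns True if the final loss
    is tiny, False if the weights blow up or the loss never gets small.
    """
    for _ in range(steps):
        err = w0 * w1 - 1.0
        g0, g1 = err * w1, err * w0  # dLoss/dw0, dLoss/dw1
        w0, w1 = w0 - lr0 * g0, w1 - lr1 * g1
        blown_up = not (math.isfinite(w0) and math.isfinite(w1))
        if blown_up or abs(w0) > 1e6 or abs(w1) > 1e6:
            return False
    return 0.5 * (w0 * w1 - 1.0) ** 2 < 1e-6


def scan(lrs):
    """Classify every (lr0, lr1) pair on the grid."""
    return [[converges(a, b) for b in lrs] for a in lrs]


# Log-spaced learning rates from 0.01 to 100; '#' marks pairs that converge.
lrs = [10 ** (e / 4) for e in range(-8, 9)]
for row in scan(lrs):
    print("".join("#" if ok else "." for ok in row))
```

The edge between the '#' and '.' regions is the thing you would want to sit just inside of. The post’s claim is that, for real networks, zooming in on that edge keeps revealing more structure, so a gradient taken at any one point tells you little about where the edge actually is.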

Now you can ask the question: evolution somehow found the right hyperparameters to train our brains. How did evolution solve this wicked problem? Presumably because gradient-free optimization fares better against these kinds of fractal landscapes - if you optimize for the part of the region where the average speed of convergence is high (rather than just take the gradient from a specific point that’s bounded in an unpredictable way by fractals), it seems like you could do much better.

Backing up, why is the meta-loss landscape fractal in the first place? Jascha’s explanation is that fractals often emerge when iteratively applying a function. Gradient descent on the parameters is one such function that you iterate across training steps. But then the follow up question is this. There’s lots of other iterative functions you could think of, even within the context of neural networks. Do they all lead to fractals? For example:

In chain of thought, you apply a model to a string, which makes a new string, to which you apply the model, etc.
RNNs keep applying the same parameters to the hidden state.

In conversation, an AI researcher friend revealed that CoT and RNNs both have variance problems that could well be explained by these fractal-like dynamics. Though I only understand this claim at a hand-wavy level.
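For intuition on how iterating a function can produce a fractal boundary at all, the textbook case is the Mandelbrot set: repeatedly apply z → z² + c starting from z = 0 and ask, for each complex c, whether the sequence stays bounded. The sketch below is that standard construction and nothing more. It says nothing about whether chain of thought or RNNs behave this way. The function names and the grid resolution are arbitrary choices.

```python
def stays_bounded(c, steps=100):
    """Iterate z -> z*z + c from z = 0; False once |z| exceeds 2.

    Once |z| > 2 the sequence is guaranteed to escape to infinity, so 2 is
    the usual cutoff. A True result only means "no escape within `steps`".
    """
    z = 0j
    for _ in range(steps):
        z = z * z + c
        if abs(z) > 2.0:
            return False
    return True


def ascii_picture(width=60, height=24):
    """Render which values of c stay bounded, on a coarse grid."""
    rows = []
    for j in range(height):
        im = 1.2 - 2.4 * j / (height - 1)
        row = ""
        for i in range(width):
            re = -2.1 + 2.8 * i / (width - 1)
            row += "#" if stays_bounded(complex(re, im)) else "."
        rows.append(row)
    return rows


for row in ascii_picture():
    print(row)
```

Here c plays the role the hyperparameters play in training: a knob held fixed while the same update is applied over and over. The bounded-versus-escaping boundary in c is the analogue of the converge-versus-diverge boundary in learning rate.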

Adam Marblestone — AI is missing something fundamental about the brain

Dwarkesh Patel — Tue, 30 Dec 2025 17:07:17 GMT

Adam Marblestone is CEO of Convergent Research. He’s had a very interesting past life: he was a research scientist at Google DeepMind on their neuroscience team and has worked on everything from brain-computer interfaces to quantum computing to nanotech and even formal mathematics.

In this episode, we discuss how the brain learns so much from so little, what the AI field can learn from neuroscience, and the answer to Ilya’s question: how does the genome encode abstract reward functions? Turns out, they’re all the same question.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Gemini 3 Pro recently helped me run an experiment to test multi-agent scaling: basically, if you have a fixed budget of compute, what is the optimal way to split it up across agents? Gemini was my colleague throughout the process — honestly, I couldn’t have investigated this question without it. Try Gemini 3 Pro today gemini.google.com
Labelbox helps you train agents to do economically-valuable, real-world tasks. Labelbox’s network of subject-matter experts ensures you get hyper-realistic RL environments, and their custom tooling lets you generate the highest-quality training data possible from those environments. Learn more at labelbox.com/dwarkesh

To sponsor a future episode, visit dwarkesh.com/advertise.

Timestamps

(00:00:00) – The brain’s secret sauce is the reward functions, not the architecture

(00:22:20) – Amortized inference and what the genome actually stores

(00:42:42) – Model-based vs model-free RL in the brain

(00:50:31) – Is biological hardware a limitation or an advantage?

(01:03:59) – Why a map of the human brain is important

(01:23:28) – What value will automating math have?

(01:38:18) – Architecture of the brain

Transcript

00:00:00 – The brain’s secret sauce is the reward functions, not the architecture

Dwarkesh Patel

The big million-dollar question that I have, that I’ve been trying to get the answer to through all these interviews with AI researchers: How does the brain do it? We’re throwing way more data at these LLMs and they still have a small fraction of the total capabilities that a human does. So what’s going on?

Adam Marblestone

This might be the quadrillion-dollar question or something like that. You can make an argument that this is the most important question in science. I don’t claim to know the answer. I also don’t think that the answer will necessarily come even from a lot of smart people thinking about it as much as they are. My overall meta-level take is that we have to empower the field of neuroscience to just make neuroscience a more powerful field technologically and otherwise, to actually be able to crack a question like this.

Maybe the way that we would think about this now with modern AI, neural nets, deep learning, is that there are certain key components of that. There’s the architecture. There’s maybe hyperparameters of how many layers you have or properties of that architecture. There is the learning algorithm itself. How do you train it? Backprop, gradient descent, is it something else? How is it initialized? If we take the learning part of the system, it still may have some initialization of the weights. And then there are also cost functions. What is it being trained to do? What’s the reward signal? What are the loss functions, supervision signals?

My personal hunch within that framework is that the field has neglected the role of these very specific loss functions, very specific cost functions. Machine learning tends to like mathematically simple loss functions. Predict the next token, cross-entropy, these simple computer scientist loss functions. I think evolution may have built a lot of complexity into the loss functions actually, many different loss functions for different areas turned on at different stages of development. A lot of Python code, basically, generating a specific curriculum for what different parts of the brain need to learn.

Because evolution has seen many times what was successful and unsuccessful, and evolution could encode the knowledge of the learning curriculum. In the machine learning framework, maybe we can come back and we can talk about where do the loss functions of the brain come from? Can different loss functions lead to different efficiency of learning?

Dwarkesh Patel

People say the cortex has got the universal human learning algorithm, the special sauce that humans have. What’s up with that?

Adam Marblestone

This is a huge question and we don’t know. I’ve seen models where the cortex… The cortex typically has this six-layered structure, layers in a slightly different sense than layers of a neural net. Any one location in the cortex has six physical layers of tissue as you go in layers of the sheet. And those areas then connect to each other and that’s more like the layers of a network.

I’ve seen versions of that where what you’re trying to explain is just, “How does it approximate backprop?” And what is the cost function for that? What is the network being asked to do, if you are trying to say it’s something like backprop? Is it doing backprop on next token prediction or is it doing backprop on classifying images or what is it doing? And no one knows. But one thought about it, one possibility about it, is that it’s just this incredibly general prediction engine. So any one area of the cortex is just trying to predict… Basically can it learn to predict any subset of all the variables it sees from any other subset? Omnidirectional inference, or omnidirectional prediction.

Whereas an LLM is just seeing everything in the context window and then it computes a very particular conditional probability which is, “Given all the last thousands of things, what are the probabilities for the next token.” But it would be weird for a large language model to say “the quick brown fox blank blank the lazy dog” and fill in the middle versus doing the next token, if it’s doing just forward. It can learn how to do that stuff at this emergent level of the context window and everything, but natively it’s just predicting the next token.

What if the cortex is natively made so that any area of cortex can predict any pattern in any subset of its inputs given any other missing subset? That is a little bit more like “probabilistic AI”. A lot of the things I’m saying, by the way, are extremely similar to what Yann LeCun would say. He’s really interested in these energy-based models and something like that is like, the joint distribution of all the variables. What is the likelihood or unlikelihood of just any combination of variables?

If I clamp some of them and I say that definitely these variables are in these states, then I can compute, with probabilistic sampling for example—conditioned on these being set in this state, and these could be any arbitrary subset of variables in the model—can I predict what any other subset is going to do and sample from any other subset given clamping this subset? And I could choose a totally different subset and sample from that subset. So it’s omnidirectional inference.
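[Editorial sketch, not part of the conversation.] The clamp-then-predict idea can be shown on a tiny joint distribution. This assumes three made-up binary variables and an invented probability table. It uses exact enumeration rather than the probabilistic sampling mentioned above, which is only feasible because the table has eight entries. All names and numbers are hypothetical.

```python
# Hypothetical variables and joint probabilities (the eight entries sum to 1).
NAMES = ("see_spider", "hear_warning", "flinch")
JOINT = {
    (0, 0, 0): 0.50, (0, 0, 1): 0.02, (0, 1, 0): 0.05, (0, 1, 1): 0.03,
    (1, 0, 0): 0.02, (1, 0, 1): 0.13, (1, 1, 0): 0.01, (1, 1, 1): 0.24,
}


def conditional(joint, names, clamp, query):
    """Distribution over `query` variables given the `clamp` assignments.

    `clamp` maps variable name -> fixed value; `query` is a tuple of names.
    Any subset can be clamped and any other subset queried.
    """
    index = {name: i for i, name in enumerate(names)}
    out, total = {}, 0.0
    for state, p in joint.items():
        if all(state[index[n]] == v for n, v in clamp.items()):
            key = tuple(state[index[n]] for n in query)
            out[key] = out.get(key, 0.0) + p
            total += p
    if total == 0.0:
        raise ValueError("clamped assignment has zero probability")
    return {key: p / total for key, p in out.items()}


# Same table, queried in two different directions.
print(conditional(JOINT, NAMES, {"see_spider": 1}, ("flinch",)))
print(conditional(JOINT, NAMES, {"flinch": 1}, ("see_spider", "hear_warning")))
```

A next-token predictor corresponds to one fixed choice of what is clamped (everything so far) and what is queried (the next item). The point of the sketch is only that a joint model supports every such choice from the same table.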

And so there could be some parts of the cortex, there might be association areas of cortex that predict vision from audition. There might be areas that predict things that the more innate part of the brain is going to do. Because remember, this whole thing is riding on top of a lizard brain and lizard body, if you will. And that thing is a thing that’s worth predicting too. You’re not just predicting do I see this or do I see that. Is this muscle about to tense? Am I about to have a reflex where I laugh? Is my heart rate about to go up? Am I about to activate this instinctive behavior?

Dwarkesh Patel

Based on my higher-level understanding… Like I can match somebody has told me there’s a spider on my back to this lizard part that would activate if I was literally seeing a spider in front of me. You learn to associate the two so that even just from somebody hearing you say “There’s a spider on your back”

Adam Marblestone

Well, let’s come back to this. This is partly having to do with Steve Byrnes’ theories, which I’m recently obsessed about. But on your podcast with Ilya, he said, “Look, I’m not aware of any good theory of how evolution encodes high-level desires or intentions.” I think this is very connected to all of these questions about the loss functions and the cost functions that the brain would use. And it’s a really profound question, right?

Let’s say that I am embarrassed for saying the wrong thing on your podcast because I’m imagining that Yann LeCun is listening and he says, “That’s not my theory. You described energy-based models really badly.” That’s going to activate in me innate embarrassment and shame, and I’m going to want to go hide and whatever. That’s going to activate these innate reflexes. That’s important because I might otherwise get killed by Yann LeCun’s marauding army of other…

Dwarkesh Patel

The French AI researchers are coming for you, Adam.

Adam Marblestone

So it’s important that I have that instinctual response. But of course, evolution has never seen Yann LeCun or known about energy-based models or known what an important scientist or a podcast is. Somehow the brain has to encode this desire to not piss off really important people in the tribe or something like this in a very robust way, without knowing in advance all the things that the Learning Subsystem of the brain, the part that is learning cortex and other parts… The cortex is going to learn this world model. It’s going to include things like Yann LeCun and podcasts. And evolution has to make sure that those neurons, whatever the Yann-LeCun-being-upset-with-me neurons, get properly wired up to the shame response or this part of the reward function. And this is important, right?

Because if we’re going to be able to seek status in the tribe or learn from knowledgeable people, as you said, or things like that, exchange knowledge and skills with friends but not with enemies… We have to learn all this stuff. It has to be able to robustly wire these learned features of the world, learned parts of the world model, up to these innate reward functions, and then actually use that to then learn more. Because next time I’m not going to try to piss off Yann LeCun if he emails me that I got this wrong. We’re going to do further learning based on that.

In constructing the reward function, it has to use learned information. But how can evolution, which didn’t know about Yann LeCun, do that? The basic idea that Steve Byrnes is proposing is that part of the cortex, or other areas like the amygdala that learn, what they’re doing is they’re modeling the Steering Subsystem. The Steering Subsystem is the part with these more innately programmed responses and the innate programming of these series of reward functions, cost functions, bootstrapping functions that exist.

There are parts of the amygdala, for example, that are able to monitor what those parts do and predict what those parts do. How do you find the neurons that are important for social status? Well, you have some innate heuristics of social status, for example, or you have some innate heuristics of friendliness that the Steering Subsystem can use. And the Steering Subsystem actually has its own sensory system, which is crazy. We think of vision as being something that the cortex does. But there’s also a Steering Subsystem, subcortical visual system called the superior colliculus with innate ability to detect faces, for example, or threats.

So there’s a visual system that has innate heuristics and the Steering Subsystem has its own responses. There’ll be part of the amygdala or part of the cortex that is learning to predict those responses. What are the neurons that matter in the cortex for social status or for friendship? They’re the ones that predict those innate heuristics for friendship. You train a predictor in the cortex and you say, “Which neurons are part of the predictor?” Those are the ones that, now you’ve actually managed to wire it up.

Dwarkesh Patel

This is fascinating. I feel like I still don’t understand… I understand how the cortex could learn how this primitive part of the brain would respond to… Obviously it has these labels on, “here’s literally a picture of a spider, and this is bad, be scared of this.” The cortex learns that this is bad because the innate part tells it that. But then it has to generalize to, “Okay, the spider’s on my back. And somebody’s telling me the spider’s on your back. That’s also bad.”

Adam Marblestone

Yes.

Dwarkesh Patel

But it never got supervision on that. So how does it…?

Adam Marblestone

Well, it’s because the Learning Subsystem is a powerful learning algorithm that does have generalization, that is capable of generalization. The Steering Subsystem, these are the innate responses. You’re going to have some built into your Steering Subsystem, these lower brain areas: hypothalamus, brainstem, et cetera. Again, they have their own primitive sensory systems.

So there may be an innate response. If I see something that’s moving fast toward my body that I didn’t previously see was there and is small and dark and high contrast, that might be an insect skittering onto my body. I am going to flinch. There are these innate responses. There’s going to be some group of neurons, let’s say, in the hypothalamus, that is the I-am-flinching or I-just-flinched neurons in the hypothalamus.

When you flinch, first of all, it’s a negative contribution to the reward function. You didn’t want that to happen, perhaps. But that’s a reward function that doesn’t have any generalization in it. I’m going to avoid that exact situation of the thing skittering toward me. Maybe I’m going to avoid some actions that lead to the thing skittering. That’s a generalization you can get, what Steve calls downstream of the reward function. I’m going to avoid the situation where the spider was skittering toward me, but you’re also going to do something else.

There’s going to be a part of your amygdala, say, that is saying, “Okay, a few milliseconds, hundreds of milliseconds or seconds earlier, could I have predicted that flinching response?” It’s going to be a group of neurons that is essentially a classifier of, “Am I about to flinch?” And I’m going to have classifiers for that for every important Steering Subsystem variable that evolution needs to take care of. Am I about to flinch? Am I talking to a friend? Should I laugh now? Is the friend high status? Whatever variables the hypothalamus, brainstem, contains… Am I about to taste salt?

It’s going to have all these variables and for each one it’s going to have a predictor. It’s going to train that predictor. Now the predictor that it trains, that can have some generalization. The reason it can have some generalization is because it just has a totally different input. Its input data might be things like the word “spider”, but the word “spider” can activate in all sorts of situations that lead to the word “spider” activating in your world model. If you have a complex world model with really complex features that inherently gives you some generalization. It’s not just the thing skittering toward me, it’s even the word “spider” or the concept of “spider” is going to cause that to trigger. This predictor can learn that. Whatever spider neurons are in my world model, which could even be a book about spiders or somewhere, a room where there are spiders or whatever that is…

Dwarkesh Patel

The amount of heebie-jeebies that this conversation is eliciting in the audience…

Adam Marblestone

Now I’m activating your Steering Subsystem, your Steering Subsystem spider hypothalamus subgroup of neurons of skittering insects are activating based on these very abstract concepts in the conversation.

Dwarkesh Patel

If you keep going, I’m going to put in a trigger warning.

Adam Marblestone

That’s because you learned this. The cortex inherently has the ability to generalize because it’s just predicting based on these very abstract variables and all this integrated information that it has. Whereas the Steering Subsystem can only use whatever the superior colliculus and a few other sensors can spit out.
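
The predictor idea described above can be sketched in a few lines of Python. This is only a toy illustration, not anything from Steve Byrnes's writing: the two feature names, the linear delta-rule learner, and the training episodes are all assumptions made for the example. The innate detector sees only the raw percept, while the learned predictor sees world-model features, so it ends up responding to the concept alone.

```python
# Hypothetical world-model features (names invented for this sketch).
FEATURES = ("skittering_percept", "spider_concept")


def innate_flinch(x):
    """Hard-wired detector: it can only read the raw skittering percept."""
    return 1.0 if x[0] > 0 else 0.0


def predict(w, x):
    """Learned predictor: a linear readout of the world-model features."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_assessor(episodes, lr=0.1, epochs=100):
    """Delta-rule training, supervised by the innate flinch signal."""
    w = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for x in episodes:
            err = innate_flinch(x) - predict(w, x)
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
    return w


# In experience, real skittering events co-occur with the spider concept.
episodes = [(1.0, 1.0), (0.0, 0.0)]
weights = train_assessor(episodes)

# Just hearing the word "spider": concept active, no skittering percept.
word_only = (0.0, 1.0)
print(innate_flinch(word_only))     # the innate detector stays silent
print(predict(weights, word_only))  # the learned predictor is partly active
```

In this toy setup the learner splits credit evenly between the two co-occurring features, which is why the concept on its own produces a partial response.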

Dwarkesh Patel

By the way, it’s remarkable that the person who’s made this connection between different pieces of neuroscience, Steve Byrnes, is a former physicist. For the last few years, he’s been trying to synthesize—

Adam Marblestone

He’s an AI safety researcher. He’s just synthesizing. This comes back to the academic incentives thing. I think that this is a little bit hard to say. What is the exact next experiment? How am I going to publish a paper on this? How am I going to train my grad student to do this? It’s very speculative. But there’s a lot in the neuroscience literature and Steve has been able to pull this together. And I think that Steve has an answer to Ilya’s question essentially, which is, how does the brain ultimately code for these higher-level desires and link them up to the more primitive rewards?

Dwarkesh Patel

Very naive question, but why can’t we achieve this omnidirectional inference by just training the model to not just map from a token to next token, but remove the masks in the training so it maps every token to every token, or come up with more labels between video and audio and text so that it’s forced to map one to each one?

Adam Marblestone

I mean, that may be the way. It’s not clear to me. Some people think that there’s a different way that it does probabilistic inference or a different learning algorithm that isn’t backprop. There might be other ways of learning, energy-based models or other things like that, that you can imagine that is involved in being able to do this and that the brain has that.

But I think there’s a version of it where what the brain does is crappy versions of backprop to learn to predict through a few layers and that it’s kind of like a multimodal foundation model. LLMs are maybe just predicting the next token. But vision models maybe are trained in learning to fill in the blanks or reconstruct different pieces or combinations. But I think that it does it in an extremely flexible way.

If you train a model to just fill in this blank at the center, okay, that’s great. But what if you didn’t train it to fill in this other blank over to the left? Then it doesn’t know how to do that. It’s not part of its repertoire of predictions that are amortized into the network. Whereas with a really powerful inference system, you could choose at test time, what is the subset of variables it needs to infer and which ones are clamped?
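
As a rough illustration of choosing at test time which variables are clamped and which are inferred, here is a minimal sketch using exact enumeration over a tiny joint distribution. The three binary "patches" and the agreement-based weights are assumptions for the example; a real system would need approximate inference, since enumeration does not scale.

```python
from itertools import product


def make_joint():
    """Hypothetical joint over three binary patches (left, center, right).

    States where more pairs of patches agree get more probability mass.
    """
    weights = {}
    for state in product((0, 1), repeat=3):
        agreements = sum(state[i] == state[j] for i, j in ((0, 1), (1, 2), (0, 2)))
        weights[state] = 2.0 ** agreements
    total = sum(weights.values())
    return {s: w / total for s, w in weights.items()}


def infer(joint, clamped):
    """Return P(variable = 1 | clamped values) for every unclamped variable."""
    consistent = {s: p for s, p in joint.items()
                  if all(s[i] == v for i, v in clamped.items())}
    z = sum(consistent.values())
    n_vars = len(next(iter(joint)))
    return {i: sum(p for s, p in consistent.items() if s[i] == 1) / z
            for i in range(n_vars) if i not in clamped}


joint = make_joint()
print(infer(joint, {0: 1, 2: 1}))  # fill in the center from left and right
print(infer(joint, {1: 1}))        # fill in left and right from the center
```

The same `infer` function answers both queries; nothing about which blank to fill was fixed in advance.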

Dwarkesh Patel

Okay, two sub-questions. One, it makes you wonder whether the thing that is lacking in artificial neural networks is less about the reward function and more about the encoder or the embedding… Maybe the issue is that you’re not representing video and audio and text in the right latent abstraction such that they could intermingle and conflict.

Maybe this is also related to why LLMs seem bad at drawing connections between different ideas. Are the ideas represented at a level of generality at which you could notice different connections?

Adam Marblestone

Well, the problem is these questions are all commingled. If we don’t know if it’s doing a backprop-like learning, and we don’t know if it’s doing energy-based models, and we don’t know how these areas are even connected in the first place, it’s very hard to really get to the ground truth of this. But yeah, it’s possible.

I think that people have done some work. My friend Joel Dapello actually did something some years ago where he put a model—I think it was a model of V1, specifically how the early visual cortex represents images—as an input into a convnet and that improves some things. It could be differences. The retina is also doing motion detection and certain things are getting filtered out. There may be some preprocessing of the sensory data. There may be some clever combinations of which modalities are predicting which or so on, that lead to better representation. There may be much more clever things than that.

Some people certainly do think that there’s inductive biases built in the architecture that will shape the representations differently or that there are clever things that you can do. Astera, which is the same organization that employs Steve Byrnes, just launched this neuroscience project based on Doris Tsao’s work. She has some ideas about how you can build vision systems that basically require less training. They build into the assumptions of the design of the architecture things like objects are bounded by surfaces and surfaces have certain types of shapes and relationships of how they occlude each other and stuff like that. It may be possible to build more assumptions into the network. Evolution may have also put some changes of architecture. It’s just I think that also the cost functions and so on may be a key thing that it does.

00:22:20 – Amortized inference and what the genome actually stores

Dwarkesh Patel

I want to talk about this idea that you just glanced off of which was amortized inference. Maybe I should try to explain what I think it means, because I think it’s probably wrong and this will help you correct me.

Adam Marblestone

It’s been a few years for me too.

Dwarkesh Patel

Right now, the way the models work is that you have an input, it maps it to an output, and this is amortizing a process, the real process, which we think is what intelligence is. It’s that you have some prior over how the world could be, what are the causes that make the world the way that it is. And then when you see some observation, you should be like, “Okay, here’s all the ways the world could be. This cause explains what’s happening best.”

Now, doing this calculation over every possible cause is computationally intractable. So then you just have to sample like, “Oh, here’s a potential cause. Does this explain this observation? No, forget it. Let’s keep sampling.” And then eventually you get the cause, then the cause explains the observation, and then this becomes your posterior.

Adam Marblestone

That’s actually pretty good. Bayesian inference in general is this very intractable thing. The algorithms that we have for doing that tend to require taking a lot of samples, Monte Carlo methods, taking a lot of samples. And taking samples takes time. This is like the original Boltzmann machines and stuff. They’re using techniques like this, and it’s still used with probabilistic programming and other types of methods often. The Bayesian inference problem, which is basically the problem of perception, given some model of the world and given some data, how should I update my… What are the missing variables in my internal model?
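
The "sample a cause, check it against the observation, keep sampling" loop can be sketched as rejection sampling. The causes, prior, and likelihoods below are invented numbers for illustration, and the exact posterior is only computable here because the toy problem has three causes.

```python
import random

# Hypothetical causes for hearing a noise downstairs (numbers are made up).
PRIOR = {"wind": 0.7, "cat": 0.2, "burglar": 0.1}
LIKELIHOOD = {"wind": 0.1, "cat": 0.6, "burglar": 0.9}  # P(noise | cause)


def exact_posterior():
    """Bayes' rule by enumeration; feasible only because there are 3 causes."""
    joint = {c: PRIOR[c] * LIKELIHOOD[c] for c in PRIOR}
    z = sum(joint.values())
    return {c: p / z for c, p in joint.items()}


def sampled_posterior(n_samples, seed=0):
    """Rejection sampling: propose a cause from the prior, keep it only if
    it would have produced the observation."""
    rng = random.Random(seed)
    causes = list(PRIOR)
    weights = [PRIOR[c] for c in causes]
    counts = {c: 0 for c in causes}
    for _ in range(n_samples):
        cause = rng.choices(causes, weights=weights)[0]
        if rng.random() < LIKELIHOOD[cause]:
            counts[cause] += 1
    accepted = sum(counts.values())
    return {c: k / accepted for c, k in counts.items()}


print(exact_posterior())
print(sampled_posterior(50_000))
```

Most proposals get thrown away, which is the cost being discussed: the sampled estimate only approaches the exact answer as the number of samples grows.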

Dwarkesh Patel

And I guess the idea is that neural networks are hopefully… Obviously, mechanistically, the neural network is not starting with, “Here is my model of the world, and I’m going to try to explain this data.” But the hope is that instead of starting with, “Hey, does this cause explain this observation? No. Did this cause explain this observation? Yes.” What you do is just like observation…

Adam Marblestone

What’s the cause that the neural net thinks is the best one?

Dwarkesh Patel

Observation to cause. So the feedforward goes observation to cause to then the output that…

Adam Marblestone

You don’t have to evaluate all these energy values or whatever and sample around to make them higher and lower. You just say, approximately that process would result in this being the top one or something like that.

Dwarkesh Patel

Exactly. One way to think about it might be that test-time compute, inference-time compute is actually doing this sampling again. You literally read its chain of thought. It’s actually doing this toy example we’re talking about where it’s like, “Oh, can I solve this problem by doing X? Nah, I need a different approach.” This raises the question. I mean, over time it is the case that the capabilities which required inference-time compute to elicit, get distilled into the model. So you’re amortizing the thing which previously you needed to do these rollouts, these Monte Carlo rollouts, to figure out.

In general, maybe there’s this principle that digital minds which can be copied, have different tradeoffs which are relevant, from biological minds which cannot. So in general, it should make sense to amortize more things because you can literally copy the amortization, or copy the things that you have sort of built in.

This is a tangential question where it might be interesting to speculate about. In the future, as these things become more intelligent and the way we train them becomes more economically rational, what will make sense to amortize into these minds, which evolution did not think was worth amortizing into biological minds? You have to retrain every time.

Adam Marblestone

First of all, I think the probabilistic AI people would be like, of course you need test-time compute, because this inference problem is really hard and the only ways we know how to do it involve lots of test-time compute. Otherwise it’s just this crappy approximation that’s never going to… You have to do infinite data or something to make this. I think some of the probabilistic people will be like, “No, it’s inherently probabilistic and amortizing it in this way just doesn’t make sense.” They might then also point to the brain and say, “Okay, well the brain, the neurons are stochastic and they’re sampling and they’re doing things. So maybe the brain actually is doing more like the non-amortized inference, the real inference.”

But it’s also strange how perception can work in just milliseconds or whatever. It doesn’t seem like it uses that much sampling. So it’s also clearly doing some baking things into approximate forward passes or something like that to do this. In the future, I don’t know. Is it already a trend to some degree that things that people were having to use test-time compute for, are getting used to train back the base model? Now it can do it in one pass.

Maybe evolution did or didn’t do that. I think evolution still has to pass everything through the genome to build the network and the environment in which humans are living is very dynamic. So maybe, if we believe this is true, there’s a Learning Subsystem per Steve Byrnes, and a Steering Subsystem, that the Learning Subsystem doesn’t have a lot of pre-initialization or pretraining. It has a certain architecture, but then within lifetime it learns. Then evolution didn’t actually amortize that much into that network. It amortized it instead into a set of innate behaviors in a set of these bootstrapping cost functions, or ways of building up very particular reward signals.

Dwarkesh Patel

This framework helps explain this mystery that people have pointed out and I’ve asked a few guests about, which is that if you want to analogize evolution to pretraining, well how do you explain the fact that so little information is conveyed through the genome? So 3 gigabytes is the size of the total human genome. Obviously a small fraction of that is actually relevant to coding the brain.

Previously people made this analogy, that actually evolution has found the hyperparameters of the model, the numbers which tell you how many layers there should be, the architecture, basically, how things should be wired together. But if a big part of what increases sample efficiency, aids learning, and generally makes systems more performant is the reward function, the loss function, and if evolution found those loss functions that aid learning, then it actually makes sense how you can build an intelligence with so little information. Because in Python the reward function is literally a line. So you just have a thousand lines like this, and that doesn’t take up that much space.

Adam Marblestone

Yes. It also gets to do this generalization thing with the thing I was describing where we were talking about the spider, where it learns just the word “spider” which triggers the spider reflex or whatever. It gets to exploit that too. It gets to build a reward function that actually has a bunch of generalization in it just by specifying this innate spider stuff and the Thought Assessors, as Steve calls them, that do the learning.

That’s potentially a really compact solution to building up these more complex reward functions too, that you need. It doesn’t have to anticipate everything about the future of the reward function. It just has to anticipate what variables are relevant and what are heuristics for finding what those variables are. And then it has to have a very compact specification for the learning algorithm and basic architecture of the Learning Subsystem. And then it has to specify all this Python code of all the stuff about the spiders and all the stuff about friends, and all the stuff about your mother, and all the stuff about mating and social groups and joint eye contact. It has to specify all that stuff.

So is this really true? I think that there is some evidence for it. Fei Chen and Evan Macosko and various other researchers have been doing these single-cell atlases. One of the things that scaling up neuroscience technology (again, this is one of my obsessions) has done through the BRAIN Initiative, a big neuroscience funding program, is they’ve basically gone through different areas, especially of the mouse brain, and mapped where the different cell types are. How many different types of cells are there in different areas of cortex? Are they the same across different areas? Then you look at these subcortical regions, which are more like the Steering Subsystem or reward-function-generating regions. How many different types of cells do they have? And which neuron types do they have?

We don’t know how they’re all connected and exactly what they do or what the circuits are or what they mean, but you can just quantify how many different kinds of cells there are with sequencing the RNA. And there are a lot more weird and diverse and bespoke cell types in the Steering Subsystem, basically, than there are in the Learning Subsystem. Like the cortical cell types, it seems like there’s enough to build a learning algorithm up there and specify some hyperparameters. And in this Steering Subsystem, there’s like a gazillion, thousands of really weird cells, which might be like the one for the spider flinch reflex and the one for I’m-about-to-taste-salt.

Dwarkesh Patel

Why would each reward function need a different cell type?

Adam Marblestone

Well, this is where you get innately wired circuits. In the learning algorithm part, in the Learning Subsystem, you specify the initial architecture, you specify a learning algorithm. All the juice is happening through plasticity of the synapses, changes of the synapses within that big network. But it’s a relatively repeating architecture, how it’s initialized. It’s just like how the amount of Python code needed to make an eight-layer transformer is not that different from one that makes a three-layer transformer. You’re just replicating.

Whereas all this Python code for the reward function, if superior colliculus sees something that’s skittering and you’re feeling goosebumps on your skin or whatever, then trigger spider reflex, that’s just a bunch of bespoke, species-specific, situation-specific crap. The cortex doesn’t know about spiders, it just knows about layers.
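
The contrast between a repeating architecture and bespoke reward code can be made concrete with a small sketch. Everything here is hypothetical: the layer dictionary stands in for a real network layer, and the sensor names and rules are invented to mirror the examples in the conversation.

```python
def build_cortex(n_layers, width=4):
    """Generic, repeated architecture: the code is the same length
    whether you ask for three layers or eight."""
    return [{"kind": "layer", "width": width} for _ in range(n_layers)]


# Bespoke innate rules: every entry is its own special case and needs
# its own line of code (all names are hypothetical).
INNATE_RULES = {
    "spider_flinch": lambda s: s.get("skittering", False) and s.get("goosebumps", False),
    "salt_expected": lambda s: s.get("salty_smell", False),
    "eye_contact": lambda s: s.get("face_detected", False) and s.get("gaze_on_me", False),
}


def steering_signals(sensors):
    """Evaluate every hard-wired rule on the current primitive sensor readings."""
    return {name: bool(rule(sensors)) for name, rule in INNATE_RULES.items()}


print(len(build_cortex(3)), len(build_cortex(8)))
print(steering_signals({"skittering": True, "goosebumps": True}))
```

Adding layers changes one argument, while adding an instinct means writing another rule, which is the sense in which the bespoke part costs more description length.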

Dwarkesh Patel

But you’re saying that the only way to write this reward function is to have a special cell type?

Adam Marblestone

Yeah, well, I think so. I think you either have to have special cell types or you have to somehow otherwise get special wiring rules that evolution can say this neuron needs to wire to this neuron, without any learning. And the way that that is most likely to happen, I think, is that those cells express different receptors and proteins that say, “Okay, when this one comes in contact with this one, let’s form a synapse.” So it’s genetic wiring, and those need cell types to do it.

Dwarkesh Patel

I’m sure this would make a lot more sense if I knew 101 neuroscience, but it seems like there’s still a lot of complexity, or generality rather, in the Steering Subsystem. So if the Steering Subsystem has its own visual system that’s separate from the visual cortex, different features still need to plug into that vision system. So the spider thing needs to plug into it and also the love thing needs to plug into it, et cetera, et cetera. So it seems complicated.

Adam Marblestone

It’s still complicated. That’s all the more reason why a lot of the genomic real estate on the genome, and in terms of these different cell types and so on, would go into wiring up the Steering Subsystem, pre-wiring it.

Dwarkesh Patel

Can we tell how much of the genome is clearly working? So I guess you could tell how many are relevant to producing the RNA that manifest or the epigenetics that manifest in different cell types in the brain. Right?

Adam Marblestone

Yeah. This is what the cell types help you get at. I don’t think it’s exactly like, “Oh, this percent of the genome is doing this”, but you could say, “Okay, in all these Steering Subsystem subtypes, how many different genes are involved in specifying which is which and how they wire? And how much genomic real estate do those genes take up versus the ones that specify visual cortex versus auditory cortex?” You’re just reusing the same genes to do the same thing twice. Whereas the spider reflex hooking up… Yes, you’re right. They have to build a vision system and they have to build some auditory systems and touch systems and navigation-type systems.

Even feeding into the hippocampus and stuff like that, there’s head direction cells. Even the fly brain has innate circuits that figure out its orientation and help it navigate in the world. It uses vision, figures out its optical flow of how it’s flying and how its flight is related to the wind direction. It has all this innate stuff that I think in the mammal brain we would lump into the Steering Subsystem. There’s a lot of work. So all the genes that basically go into specifying all the things a fly has to do, we’re going to have stuff like that too, just all in the Steering Subsystem.

Dwarkesh Patel

But do we have some estimate of like, “Here’s how many nucleotides, here’s how many megabases it takes to—”

Adam Marblestone

I don’t know. I mean, I think you might be able to talk to biologists about this. I mean, we have a lot in common with yeast from a genes perspective. Yeast is still used as a model for some amount of drug development and stuff like that in biology. And so much of the genome is just going towards you having a cell at all, it can recycle waste, it can get energy, it can replicate.

And then what do we have in common with a mouse? So we do know at some level that the differences between us and a chimpanzee or something—and that includes the social instincts and the more advanced differences in cortex and so on—it’s a tiny number of genes that go into this additional amount of making the eight-layer transformer instead of the six-layer transformer or tweaking that reward function.

Dwarkesh Patel

This would help explain why the hominid brain exploded in size so fast. Presumably, tell me if this is correct, but under this story, social learning or some other thing increased the ability to learn from the environment. It increased our sample efficiency. Instead of having to go and kill the boar yourself and figure out how to do that, you can just be like, “The elder told me this is how you make a spear.” Now it increases the incentive to have a bigger cortex, which can learn these things.

Adam Marblestone

Yes, and that can be done with relatively few genes, because it’s really replicating what the mouse already has, making more of it. It’s maybe not exactly the same and there may be tweaks, but from a genome perspective, you don’t have to reinvent all this stuff.

Dwarkesh Patel

So then how far back in the history of the evolution of the brain does the cortex go back? Is the idea that the cortex has always figured out this omnidirectional inference thing, that’s been a solved problem for a long time? Then the big unlock with primates is that we got the reward function, which increased the returns to having omnidirectional inference?

Adam Marblestone

It’s a good question.

Dwarkesh Patel

Or is the omnidirectional inference also something that took a while to unlock?

Adam Marblestone

I’m not sure that there’s agreement about that. I think there might be specific questions about language. Are there tweaks, whether that’s through auditory and memory, some combination of auditory and memory regions? There may also be macro-wiring where you need to wire auditory regions into memory regions or something like that, and into some of these social instincts to get language, for example, to happen. But that might also be a small number of gene changes to be able to say, “Oh, I just need from my temporal lobe over here, going over to the auditory cortex, something.”

There is some evidence for Broca’s area and Wernicke’s area. They’re connected with the hippocampus and so on and prefrontal cortex. So there’s like some small number of genes maybe for enabling humans to really properly do language. That could be a big one. But is it that something changed about the cortex and it became possible to do these things? Or is it that that potential was already there, but there wasn’t the incentive to expand that capability and then use it, wire it to these social instincts and use it more? I would lean somewhat toward the latter. I think a mouse cortex has a lot of similarity to a human’s.

Dwarkesh Patel

Although there’s Suzana Herculano-Houzel’s work on how the number of neurons scales better with weight with primate brains than it does with rodent brains. So does that suggest that there actually was some improvement in the scalability of the cortex?

Adam Marblestone

Maybe, maybe. I’m not super deep on this. There may have been changes in architecture, changes in the folding, changes in neuron properties and stuff that somehow slightly tweak this. But there’s still a scaling either way.

Dwarkesh Patel

That’s right.

Adam Marblestone

So I’m not saying there isn’t something special about humans in the architecture of the Learning Subsystem at all. But yeah I think it’s pretty widely thought that this is expanded. But then the question is, “Okay, well, how does that fit in also with the Steering Subsystem changes and the instincts that make use of this and allow you to bootstrap using this effectively?”

But just to say a few other things, even the fly brain has some amount, even very far back… I mean, I think you’ve read this great book, A Brief History of Intelligence, right? I think this is a really good book. Lots of AI researchers think this is a really good book it seems.

You have some amount of learning going back all the way to anything that has a brain. Basically you have something like primitive reinforcement learning, going back at least to vertebrates. Imagine a zebrafish. Then you have these other branches. Birds may have reinvented something cortex-like. It doesn’t have the six layers, but they have something a little bit cortex-like. So some of those things after reptiles, in some sense birds and mammals both made a somewhat cortex-like, but differently organized thing.

But even a fly brain has associative learning centers that actually do things that maybe look a little bit like this Thought Assessor concept from Byrnes, where there’s a specific dopamine signal to train specific subgroups of neurons in the fly mushroom body to associate different sensory information with, “Am I going to get food now?” or “Am I going to get hurt now?”

Dwarkesh Patel

Brief tangent. I remember reading in one blog post that Beren Millidge wrote that the parts of the cortex which are associated with audio and vision have scaled disproportionately between other primates and humans, whereas the parts associated, say, with odor have not. And I remember him saying something like that this is explained by that kind of data having worse scaling law properties. Maybe he meant this, but I think another interpretation of actually what’s happening there is that these social reward functions that are built into the Steering Subsystem needed to make use more of being able to see your elders and see what the visual cues are and hear what they’re saying. And in order to make sense of these cues which guide learning, you needed to activate the vision and audio more than odor.

Adam Marblestone

I mean, there’s all this stuff. I feel like it’s come up in your shows before, actually. But like even the design of the human eye where you have the pupil and the white and everything, we are designed to be able to establish relationships based on joint eye contact. Maybe this came up in the Sutton episode. I can’t remember. But yeah, we have to bootstrap to the point where we can detect eye contact and where we can communicate by language. That’s like what the first couple years of life are trying to do.

00:42:42 – Model-based vs model-free RL in the brain

Dwarkesh Patel

Okay, I want to ask you about RL. So currently, the way these LLMs are trained, if they solve the unit test or solve a math problem, that whole trajectory, every token in that trajectory is upweighted. What’s going on with humans? Are there different types of model-based versus model-free that are happening in different parts of the brain?

Adam Marblestone

Yeah, I mean, this is another one of these things. Again, all my answers to these questions, any specific thing I say, it’s all just saying that directionally we can explore around this. I find this interesting, maybe I feel like the literature points in these directions in some very broad way. What I actually want to do is go and map the entire mouse brain and figure this out comprehensively and make neuroscience a ground-truth science. So I don’t know, basically.

But first of all, I think with Ilya on the podcast, he was like, “It’s weird that you don’t use value functions, right?” You use the dumbest form of RL basically. Of course these people are incredibly smart and they’re optimizing for how to do it on GPUs and it’s really incredible what they’re achieving. But conceptually it’s a really dumb form of RL, even compared to what was being done 10 years ago. Even the Atari game-playing stuff was using Q-learning, which is basically a kind of temporal difference learning. The temporal difference learning basically means you have some kind of a value function of what action I choose now doesn’t just tell me literally what happens immediately after this. It tells me what is the long-run consequence of that for my expected total reward or something like that.
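
For readers who have not seen temporal difference learning, here is a minimal tabular TD(0) sketch on a toy chain of states. The chain, the discount factor, and the learning rate are assumptions chosen for the example; this is not the Atari Q-learning setup, just the simplest instance of learning a value function from prediction errors.

```python
def td0_chain(n_states=3, gamma=0.9, lr=0.5, episodes=200):
    """Tabular TD(0) on a toy chain: always step right, and receive
    reward 1 only on the final transition into the terminal state."""
    values = [0.0] * (n_states + 1)  # last entry is the terminal state, kept at 0
    for _ in range(episodes):
        for s in range(n_states):
            s_next = s + 1
            reward = 1.0 if s_next == n_states else 0.0
            # Prediction error: what happened plus what I now expect,
            # minus what I expected before.
            td_error = reward + gamma * values[s_next] - values[s]
            values[s] += lr * td_error
    return values[:n_states]


print(td0_chain())
```

Early states never see the reward directly, yet their values become nonzero because each state learns from the estimate of the state after it. That propagated long-run estimate is the thing a trajectory-level reward alone does not give you.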

So you would have value functions like… The fact that we don’t have value functions at all in the LLMs is crazy. I think because Ilya said it, I can say it. I know one one-hundredth of what he does about AI, but it’s kind of crazy that this is working.

But in terms of the brain, I think there are some parts of the brain that are thought to do something that’s very much like model-free RL, that’s parts of the striatum and basal ganglia. It is thought that they have a certain finite relatively small action space. The types of actions they could take, first of all, might be like, “Tell the brainstem and spinal cord to do this motor action? Yes or no.” Or it might be more complicated cognitive-type actions like, “Tell the thalamus to allow this part of the cortex to talk to this other part,” or “Release the memory that’s in the hippocampus and start a new one or something.” But there’s some finite set of actions that come out of the basal ganglia, and that it’s just a very simple RL.

So there are probably parts of other brains and our brain that are just doing very simple naive-type RL algorithms. Layering one thing on top of that is that some of the major work in neuroscience, like Peter Dayan’s work, and a bunch of work that is part of why I think DeepMind did the temporal difference learning stuff in the first place. They were very interested in neuroscience. There’s a lot of neuroscience evidence that the dopamine is giving this reward prediction error signal, rather than just reward, “yes or no, a gazillion time steps in the future.” It’s a prediction error and that’s consistent with learning these value functions.

So there’s that and then there’s maybe higher-order stuff. We have the cortex making this world model. Well, one of the things the cortex world model can contain is a model of when you do and don’t get rewards. Again, it’s predicting what the Steering Subsystem will do. It could be predicting what the basal ganglia will do. You have a model in your cortex that has more generalization and more concepts and all this stuff that says, “Okay, these types of plans, these types of actions will lead in these types of circumstances to reward.” So I have a model of my reward.

Some people also think that you can go the other way. So this is part of the inference picture. There’s this idea of RL as inference. You could say, “Well, conditional on my having a high reward, sample a plan that I would have had to get there.” That’s inference of the plan part from the reward part. I’m clamping the reward as high and inferring the plan, sampling from plans that could lead to that. So if you have this very general cortical thing, it can just do that. If you have this very general model-based system and the model, among other things, includes plans and rewards, then you just get it for free, basically.
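One way to picture "clamp the reward as high and infer the plan" is the sketch below. Everything in it is an assumption made for illustration: the two-action world, the toy reward (count of "right" moves), the uniform prior over plans, and the exponential weighting with a `temperature` parameter, which is one common way of writing RL as inference but not necessarily the formulation meant here.

```python
import itertools
import math
import random


def plan_posterior(actions=("left", "right"), horizon=3, temperature=0.5):
    """Posterior over plans given 'reward is high', under a uniform prior.

    Each plan is weighted by exp(reward / temperature), so higher-reward
    plans become more probable once we condition on doing well.
    """
    plans = list(itertools.product(actions, repeat=horizon))

    def reward(plan):
        # Hypothetical reward: number of "right" moves in the plan.
        return sum(action == "right" for action in plan)

    weights = [math.exp(reward(plan) / temperature) for plan in plans]
    total = sum(weights)
    return {plan: weight / total for plan, weight in zip(plans, weights)}


def sample_plan(posterior, rng):
    """Draw one plan from the posterior."""
    plans = list(posterior)
    probs = [posterior[plan] for plan in plans]
    return rng.choices(plans, weights=probs, k=1)[0]


if __name__ == "__main__":
    posterior = plan_posterior()
    print(sample_plan(posterior, random.Random(0)))
```

The point of the sketch is only that planning falls out of ordinary conditional sampling once plans and rewards live in the same model.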

Dwarkesh Patel

So in neural network parlance, there’s a value head associated to the omnidirectional inference that’s happening in the—

Adam Marblestone

Yes, or there’s a value input.

Dwarkesh Patel

Oh, okay. Interesting.

Adam Marblestone

Yeah and it can predict. One of the almost sensory variables it can predict is what rewards it’s going to get.

Dwarkesh Patel

By the way, speaking about amortizing things, obviously value is like amortized rollouts of looking up reward.

Adam Marblestone

Yeah, something like that. It’s like a statistical average or prediction of it.

Dwarkesh Patel

Tangential thought. Joe Henrich and others have this idea for the way human societies have learned to do things like, how do you figure out that this kind of bean, which actually just almost always poisons you, is edible if you do this ten-step incredibly complicated process, any one of which, if you fail at it, the bean will be poisonous? How do you figure out how to hunt this seal in this particular way, with this particular weapon, at this particular time of the year, et cetera? There’s no way but just like trying shit over generations. And it strikes me this is actually very much like model-free RL happening at a civilizational level. No, not exactly.

Adam Marblestone

Evolution is the simplest algorithm in some sense. If we believe that all of this can come from evolution, the outer loop can be extremely not foresighted.

Dwarkesh Patel

Right, that’s interesting. Just hierarchies of… Evolution: model-free…

Adam Marblestone

So what does that tell you? Maybe the simple algorithms can just get you anything if you do it enough.

Dwarkesh Patel

Right.

Adam Marblestone

Yeah, I don’t know.

Dwarkesh Patel

So, evolution: model-free. Basal ganglia: model-free. Cortex: model-based. Culture: model-free potentially. I mean you pay attention to your elders or whatever.

Adam Marblestone

Maybe there’s like group selection or whatever of these things is like more model-free. But now I think culture, well, it stores some of the model.

00:50:31 – Is biological hardware a limitation or an advantage?

Dwarkesh Patel

Stepping back, is it a disadvantage or an advantage for humans that we get to use biological hardware, in comparison to computers as they exist now? What I mean by this question is, if there’s “the algorithm”, would the algorithm just qualitatively perform much worse or much better if inscribed in the hardware of today? The reason to think it might…. Here’s what I mean. Obviously the brain has had to make a bunch of tradeoffs which are not relevant to computing hardware. It has to be much more energetically efficient. Maybe as a result it has to run on slower speeds so that there can be a smaller voltage gap. So the brain runs at 200 hertz, it has to run on 20 watts. On the other hand, with robotics we’ve clearly experienced that fingers are way more nimble than we can make motors so far. So maybe there’s something in the brain that is the equivalent of cognitive dexterity, which is maybe due to the fact that we can do unstructured sparsity. We can co-locate the memory and the compute.

Adam Marblestone

Yes.

Dwarkesh Patel

Where does this all net out? Are you like, “Fuck, we would be so much smarter if we didn’t have to deal with these brains.” Or are you like—

Adam Marblestone

I think in the end we will get the best of both worlds somehow. I think an obvious downside of the brain is it cannot be copied. You don’t have external read-write access to every neuron and synapse, whereas with a neural network you do. I can just edit something in the weight matrix in Python or whatever and load that up and copy that. In principle. So the fact that it can’t be copied and random-accessed is very annoying. But otherwise maybe it has a lot of advantages. It also tells you that you want to somehow do the co-design of the algorithm. It maybe even doesn’t change it that much from all of what we discussed, but you want to somehow do this co-design.

So yeah, how do you do it with really slow low-voltage switches? That’s going to be really important for energy consumption. Co-locating memory and compute. I think that hardware companies will probably just try to co-locate memory and compute. They will try to use lower voltages, allow some stochastic stuff.

There are some people that think that all this probabilistic stuff that we were talking about—“Oh, it’s actually energy-based models, so on”—it is doing lots of sampling. It’s not just amortizing everything. The neurons are also very natural for that because they’re naturally stochastic. So you don’t have to do a random number generator in a bunch of Python code basically to generate a sample. The neuron just generates samples and it can tune what the different probabilities are and learn those tunings. So it could be that it’s very co-designed with some kind of inference method or something.

Dwarkesh Patel

It’d be hilarious…. I mean the message I’m taking from this interview is that like all these people that folks make fun of on Twitter, Yann LeCun and Beff Jezos and whatever, I don’t know maybe they got it right.

Adam Marblestone

That is actually one read of it. Granted, I haven’t really worked on AI at all since LLMs took off, so I’m just out of the loop. But I’m surprised and I think it’s amazing how the scaling is working and everything. But yeah, I think Yann LeCun and Beff Jezos are kind of onto something about the probabilistic models or at least possibly. In fact that’s what all the neuroscientists and all the AI people thought until 2021 or something.

Dwarkesh Patel

Right. So there’s a bunch of cellular stuff happening in the brain that is not just about neuron-to-neuron synaptic connections. How much of that is functionally doing more work than the synapses themselves are doing versus it’s just a bunch of kludge that you have to do in order to make the synaptic thing work. So with a digital mind, you can nudge the synapse, sorry the parameter, extremely easily. But with a cell to modulate a synapse according to the gradient signal, it just takes all this crazy machinery. So is it actually doing more than it takes extremely little code to do?

Adam Marblestone

I don’t know, but I’m not a believer in the radical, “Oh, actually memory is not synapses mostly, or learning is mostly genetic changes” or something like that. I think it would just make a lot of sense, I think you put it really well for it to be more like the second thing you said. Let’s say you want to do weight normalization across all the weights coming out of your neuron or into your neuron. Well, you probably have to somehow tell the nucleus of the cell about this and then have that send everything back out to the synapses or something. So there’s going to be a lot of cellular changes. Or let’s say that you just had a lot of plasticity and you’re part of this memory. Now that’s got consolidated into the cortex or whatever. Now we want to reuse you as a new one that can learn again.

There’s going to be a ton of cellular changes, so there’s going to be tons of stuff happening in the cell. But algorithmically, it’s not really adding something beyond these algorithms. It’s just implementing something that in a digital computer is very easy for us to go and just find the weights and change them. In a cell, it just literally has to do all this with molecular machines itself without any central controller. It’s kind of incredible.

There are some things that cells do, I think, that seem more convincing. One of the things the cerebellum has to do is predict over time. What is the time delay? Let’s say that I see a flash and then some number of milliseconds later, I’m going to get a puff of air in my eyelid or something. The cerebellum can be very good at predicting what’s the timing between the flash and the air puff, so that now your eye will just close automatically. The cerebellum is involved in that type of reflex, learned reflex.

There are some cells in the cerebellum where it seems like the cell body is playing a role in storing that time constant, changing that time constant of delay, versus that all being somehow done with like, “I’m going to make a longer ring of synapses to make that delay longer.” No, the cell body will just store that time delay for you. So there are some examples, but I’m a believer out of the box in essentially this theory that what’s happening is changes in connections between neurons and that that’s the main algorithmic thing that’s going on. I think there’s very good reason to still believe that it’s that rather than some crazy cellular stuff.

Dwarkesh Patel

Going back to this whole perspective of how our intelligence is not just this omnidirectional inference thing that builds a world model, but really this system that teaches us what to pay attention to, what the important salient factors are to learn from, et cetera. I want to see if there’s some intuition we can derive from this about what different kinds of intelligences might be like. So it seems like AGI or superhuman intelligence should still have this ability to learn a world model that’s quite general, but then it might be incentivized to pay attention to different things that are relevant for the modern post-singularity environment. How different should we expect different intelligences to be?

Adam Marblestone

I think one way to think about this question is, is it actually possible to make the paperclip maximizer or whatever? If you try to make the paperclip maximizer, does it end up just not being smart or something like that because the only reward function it had was to make paperclips? I’d say, can you do that? I don’t know. If I channel Steve Byrnes more, I think he’s very concerned that the minimum viable things in the Steering Subsystem that you need to get something smart is way less than the minimum viable set of things you need for it to have human-like social instincts and ethics and stuff like that.

So a lot of what you want to know about the Steering Subsystem is actually the specifics of how you do alignment essentially, or what human behavior and social instincts is versus just what you need for capabilities. We talked about it in a slightly different way because we were sort of saying, “Well, in order for humans to learn socially, they need to make eye contact and learn from others.” But we already know from LLMs that depending on your starting point, you can learn language without that stuff. So I think that it probably is possible to make super powerful model-based RL optimizing systems and stuff like that that don’t have most of what we have in the human brain reward functions and as a consequence might want to maximize paperclips. And that’s a concern.

Dwarkesh Patel

But you’re pointing out that in order to make a competent paperclip maximizer, the kind of thing that can build spaceships and learn physics and whatever, it needs to have some drives which elicit learning, including say curiosity and exploration.

Adam Marblestone

Yeah, curiosity, interest in others, interest in social interactions. But that’s pretty minimal I think. And that’s true for humans, but it might be less true for something that’s already pretrained as an LLM or something. So most of why we want to know the Steering Subsystem, I think if I’m channeling Steve, is alignment reasons.

Dwarkesh Patel

How confident are we that we even have the right algorithmic conceptual vocabulary to think about what the brain is doing? What I mean by this is that there was one big contribution to AI from neuroscience which was this idea of the neuron in the 1950s, just this original contribution. But then it seems like a lot of what we’ve learned afterwards about what the high-level algorithm the brain is implementing, from backprop, and whether there’s something analogous to backprop happening in the brain, to “Oh is V1 doing something like CNNs” to TD learning and Bellman equations, actor-critic, whatever… It seems inspired by this dynamic where we come up with some idea, maybe we can make AI neural networks work this way, and then we notice that something in the brain also works that way. So why not think there’s more things like this.

Adam Marblestone

There may be. I think the reason that I think that we might be onto something is that the AIs we’re making based on these ideas are working surprisingly well. There’s also a bunch of just empirical stuff. Convolutional neural nets and variants of convolutional neural nets. I’m not sure what the absolute latest is, but compared to other models in computational neuroscience of what the visual system is doing, they are just more predictive. You can just score, even pretrained on cat pictures and stuff, CNNs, what is the representational similarity that they have on some arbitrary other image compared to the brain activations measured in different ways? Jim DiCarlo’s lab has this brain score and the AI model is actually… There seems to be some relevance there. Neuroscience doesn’t necessarily have something better than that.
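For readers who have not seen how a representational similarity comparison is scored, here is a minimal sketch. It is an assumption-laden illustration, not the Brain-Score method itself: it builds a dissimilarity matrix (1 minus Pearson correlation) for each system and then correlates the two matrices. The function names and the toy numbers are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def rdm_upper(responses):
    """Upper triangle of the dissimilarity matrix.

    `responses` has one row per stimulus (e.g. per image) and one column
    per unit (model feature or recorded neuron).
    """
    n = len(responses)
    return [
        1.0 - pearson(responses[i], responses[j])
        for i in range(n)
        for j in range(i + 1, n)
    ]


def rsa_score(model_responses, brain_responses):
    """Correlate the two systems' dissimilarity structure over the same stimuli."""
    return pearson(rdm_upper(model_responses), rdm_upper(brain_responses))


if __name__ == "__main__":
    # Made-up responses to four stimuli, purely for illustration.
    model = [[1, 2, 3, 4], [2, 1, 4, 3], [4, 3, 2, 1], [1, 3, 2, 4]]
    brain = [[1, 2, 4, 3], [2, 1, 3, 4], [4, 3, 1, 2], [1, 4, 2, 3]]
    print(round(rsa_score(model, brain), 3))
```

Note that the two systems do not need the same number of units; only the set of stimuli has to match, which is what makes this kind of comparison between a CNN and measured brain activations possible at all.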

So yes, that’s just recapitulating what you’re saying, that the best computational neuroscience theories we have seem to have been invented largely as a result of AI models and finding things that work. So find backprop works and then saying, “Can we approximate backprop with cortical circuits?” or something. There’s been things like that.

Now, some people totally disagree with this. György Buzsáki is a neuroscientist who has a book called The Brain from Inside Out where he basically says all our psychology concepts, AI concepts, all this stuff is just made-up stuff. What we actually have to do is figure out what is the actual set of primitives that the brain actually uses. And our vocabulary is not going to be adequate to that. We have to start with the brain and make new vocabulary rather than saying backprop and then try to apply that to the brain or something like that. He studies a lot of oscillations and stuff in the brain as opposed to individual neurons and what they do.

I don’t know. I think that there’s a case to be made for that. And from a research program design perspective, one thing we should be trying to do is just simulate a tiny worm or a tiny zebrafish, almost as biophysical or as bottom-up as possible. Like get connectome, molecules, activity and just study it as a physical dynamical system and look at what it does.

But I don’t know, it just feels like AI is really good fodder for computational neuroscience. Those might actually be pretty good models. We should look at that. I both think that there should be a part of the research portfolio that is totally bottom-up and not trying to apply our vocabulary that we learn from AI onto these systems, and that there should be another big part of this that’s trying to reverse engineer it using that vocabulary or variant of that vocabulary. We should just be pursuing both. My guess is that the reverse engineering one is actually going to work-ish or something. Like we do see things like TD learning, which Sutton also invented separately.

Dwarkesh Patel

That must be a crazy feeling to just like—

Adam Marblestone

Yeah, that’s crazy.

Dwarkesh Patel

This equation I wrote down is like in the brain.

Adam Marblestone

It seems like the dopamine is doing some of that, yeah.

01:03:59 – Why a map of the human brain is important

Dwarkesh Patel

So let me ask you about this. You guys are funding different groups that are trying to figure out what’s up in the brain. If we had a perfect representation, however you define it, of the brain, why think it would actually let us figure out the answer to these questions? We have neural networks which are way more interpretable, not just because we understand what’s in the weight matrices, but because there are weight matrices. There are these boxes with numbers in them. Even then we can tell very basic things. We can kind of see circuits for very basic pattern matching of following one token with another. I feel like we don’t really have an explanation of why LLMs are intelligent just because they’re interpretable.

Adam Marblestone

I think I would somewhat dispute it. We have some description of what the LLM is fundamentally doing. What that’s doing is that I have an architecture and I have a learning rule and I have hyperparameters and I have initialization and I have training data.

Dwarkesh Patel

But those are things we learned because we built them, not because we interpreted them from seeing the weights. The analogous thing to connectome is like seeing the weights.

Adam Marblestone

What I think we should do is we should describe the brain more in that language of things like architectures, learning rules, initializations, rather than trying to find the Golden Gate Bridge circuit and saying exactly how this neuron actually… That’s going to be some incredibly complicated learned pattern. Konrad Kording and Tim Lillicrap have this paper from a while ago, maybe five years ago, called “What does it mean to understand a neural network?” What they say is basically that you could imagine you train a neural network to compute the digits of pi or something. It’s like some crazy pattern. You also train that thing to predict the most complicated thing you find, predict stock prices, basically predict really complex systems, computationally complete systems. I could train a neural network to do cellular automata or whatever crazy thing. It’s like, we’re never going to be able to fully capture that with interpretability, I think. It’s just going to just be doing really complicated computations internally.

But we can still say that the way it got that way is that it had an architecture and we gave it this training data and it had this loss function. So I want to describe the brain in the same way. And I think that this framework that I’ve been kind of laying out is that we need to understand the cortex and how it embodies a learning algorithm. I don’t need to understand how it computes “Golden Gate Bridge.”

Dwarkesh Patel

But if you can see all the neurons, if you have the connectome, why does that teach you what the learning algorithm is?

Adam Marblestone

Well, I guess there are a couple different views of it. So it depends on these different parts of this portfolio. On the totally bottom-up, we-have-to-simulate-everything portfolio, it kind of just doesn’t. You have to make a simulation of the zebrafish brain or something and then you see what are the emergent dynamics in this and you come up with new names and new concepts and all that. That’s the most extreme bottom-up neuroscience view. But even there the connectome is really important for doing that biophysical or bottom-up simulation.

But on the other hand you can say, “Well, what if we can actually apply some ideas from AI?” We basically need to figure out, is it an energy-based model or is it an amortized VAE-type model? Is it doing backprop or is it doing something else? Are the learning rules local or global? If we have some repertoire of possible ideas about this, just think of the connectome as a huge number of additional constraints that will help to refine, to ultimately have a consistent picture of that.

I think about this for the Steering Subsystem stuff too, just very basic things about it. How many different types of dopamine signal or of Steering Subsystem signal or thought assessor or so on… How many different types of what broad categories are there? Like even this very basic information that there’s more cell types in the hypothalamus than there are in the cortex, that’s new information about how much structure is built there versus somewhere else. How many different dopamine neurons are there? Is the wiring between prefrontal and auditory the same as the wiring between prefrontal and visual? The most basic things, we don’t know. The problem is learning even the most basic things by a series of bespoke experiments takes an incredibly long time. Whereas just learning all that at once by getting a connectome is just way more efficient.

Dwarkesh Patel

What is the timeline on this? Presumably the idea of this is, first, to inform the development of AI. You want to be able to figure out how we get AIs to want to care about what other people think of its internal thought pattern. But interp researchers are making progress on this question just by inspecting normal neural networks. There must be some feature…

Adam Marblestone

You can do interp on LLMs that exist. You can’t do interp on a hypothetical model-based reinforcement learning algorithm like the brain that we will eventually converge to when we do AGI.

Dwarkesh Patel

Fair. But what timelines on AI do you need for this research to be practical and relevant?

Adam Marblestone

I think it’s fair to say it’s not super practical and relevant if you’re in an AI 2027 scenario. And so what science I’m doing now is not going to affect the science of ten years from now. Because what’s going to affect the science of 10 years from now is the outcome of this AI 2027 scenario. It kind of doesn’t matter that much probably if I have the connectome, maybe it slightly tweaks certain things.

But I think there’s a lot of reason to think maybe that we will get a lot out of this paradigm. But then the real thing, the thing that is the single event that is transformative for the entire future or something type event is still more than five years away or something.

Dwarkesh Patel

Is that because we haven’t captured omnidirectional inference, we haven’t figured out the right ways to get a mind to pay attention to things in a way that makes sense?

Adam Marblestone

I mean, I would take the entirety of your collective podcast with everyone as showing the distribution of these things. I don’t know. What was Karpathy’s timeline, right? What’s Demis’s timeline? So not everybody has a three-year timeline.

Dwarkesh Patel

But there are different reasons and I’m curious which ones are yours.

Adam Marblestone

What are mine? I don’t know, I’m just watching your podcast. I’m trying to understand the distribution. I don’t have a super strong claim that LLMs can’t do it.

Dwarkesh Patel

But is the crux the data efficiency or…?

Adam Marblestone

I think part of it is just that it is weirdly different from all this brain stuff. So intuitively it’s just weirdly different than all this brain stuff and I’m kind of waiting for the thing that starts to look more like brain stuff. I think if AlphaZero, and model-based RL and all these other things that were being worked on 10 years ago, had been giving us the GPT-5 type capabilities, then I would be like, “Oh wow, we’re both in the right paradigm and seeing the results a priori. So my prior and my data are agreeing.” Now it’s like, “I don’t know what exactly my data is. Looks pretty good, but my prior is sort of weird so I don’t have a super strong opinion on it.”

So I think there’s a possibility that essentially all other scientific research that is being done is somehow obviated. But I don’t put a huge amount of probability on that. I think my timelines might be more in the 10-year-ish range. If that’s the case, I think there is probably a difference between a world where we have connectomes on hard drives and we have an understanding of Steering Subsystem architecture. We’ve compared even the most basic properties of what are the reward functions, cost function, architecture, et cetera, of a mouse versus a shrew versus a small primate, et cetera.

Dwarkesh Patel

Is this practical in 10 years?

Adam Marblestone

I think it has to be a really big push.

Dwarkesh Patel

How much funding, how does it compare to where we are now?

Adam Marblestone

It’s like low billions-dollar scale funding in a very concerted way I would say.

Dwarkesh Patel

And how much is on it now?

Adam Marblestone

So if I just talk about some of the specific things we have going on with connectomics… E11 Bio is our main thing on connectomics. They are trying to make the technology of connectomic brain mapping several orders of magnitude cheaper. The Wellcome Trust put out a report a year or two ago that said to get one mouse brain, the first mouse brain connectome would be a several billion dollars project. E11 technology, and the suite of efforts in the field, is trying to get a single mouse connectome down to low tens of millions of dollars.

That’s a mammal brain. A human brain is about 1,000 times bigger. If with a mouse brain you can get to $10 million or $20 million, $30 million, with technology, if you just naively scale that, a human brain is now still billions of dollars, to just do one human brain. Can you go beyond that? Can you get a human brain for less than a billion? But I’m not sure you need every neuron in the human brain.
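The naive scaling in that estimate can be written out as a quick back-of-the-envelope calculation. The figures are the rough ones quoted in the conversation (a mouse connectome at $10 million to $30 million, a human brain about 1,000 times bigger); the function name and the assumption that cost scales linearly with brain size are illustrative only.

```python
def naive_human_cost(mouse_cost_usd, size_ratio=1_000):
    """Scale a mouse connectome cost linearly by relative brain size."""
    return mouse_cost_usd * size_ratio


if __name__ == "__main__":
    for mouse_cost in (10e6, 20e6, 30e6):
        human_cost = naive_human_cost(mouse_cost)
        print(f"mouse ${mouse_cost / 1e6:.0f}M -> human ${human_cost / 1e9:.0f}B")
```

Under these assumptions a whole human brain lands in the tens of billions of dollars, which is why the questions about going beyond naive scaling, or not needing every neuron, matter so much.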

We want to, for example, do an entire mouse brain and a human Steering Subsystem and the entire brains of several different mammals with different social instincts. So with a bunch of technology push and a bunch of concerted effort, real significant progress if it’s focused effort can be done in the hundreds of millions to low billions scale.

Dwarkesh Patel

What is the definition of a connectome? Presumably it’s not a bottom-up biophysics model. So is it just that it can estimate the input-output of a brain? What is the level of abstraction?

Adam Marblestone

You can give different definitions and one of the things that’s cool… So the standard approach to connectomics uses the electron microscope and very, very thin slices of brain tissue. It’s basically labeling. The cell membranes are going to show up, scatter electrons a lot and everything else is going to scatter electrons less. But you don’t see a lot of details of the molecules, which types of synapses, different synapses of different molecular combinations and properties.

E11 and some other research in the field has switched to an optical microscope paradigm. With optical, the photons don’t damage the tissue, so you can wash it and look at fragile gentle molecules. So with E11’s approach, you can get a “molecularly annotated connectome.” That’s not just who is connected to who by some synapse, but what are the molecules that are present at the synapse? What type of cell is that?

A molecularly annotated connectome, that’s not exactly the same as having the synaptic weights. That’s not exactly the same as being able to simulate the neurons and say what’s the functional consequence of having these molecules and connections. But you can also do some amount of activity mapping and try to correlate structure to function. Train an ML model basically to predict the activity from the connectome.

Dwarkesh Patel

What are the lessons to be taken away from the Human Genome Project? One way you could look at it is that it was a mistake and you shouldn’t have spent billions of dollars getting one genome mapped. Rather you should have just invested in technologies which have now allowed us to map genomes for hundreds of dollars.

Adam Marblestone

Well, George Church was my PhD advisor and he’s pointed out that it was $3 billion or something, roughly $1 per base pair for the first genome. Then the National Human Genome Research Institute basically structured the funding process right. They got a bunch of companies competing to lower the cost. And then the cost dropped like a million-fold in 10 years because they changed the paradigm from macroscopic chemical techniques to these individual DNA molecules which would make a little cluster of DNA molecules on the microscope and you would see just a few DNA molecules at a time on each pixel of the camera. It would let you look at different fragments of DNA in parallel. So you parallelize the thing by millions-fold. That’s what reduced the cost by millions-fold.

With switching from electron microscopy to optical connectomics, potentially even future types of connectomics technology, we think there should be similar patterns. That’s why E11, the Focus Research Organization, started with technology development rather than starting with saying we’re going to do a human brain or something and let’s just brute force it. We said let’s get the cost down with new technology. But then it’s still a big thing. Even with new next-generation technology, you still need to spend hundreds of millions on data collection.

Dwarkesh Patel

Is this going to be funded with philanthropy, by governments, by investors?

Adam Marblestone

This is very TBD and very much evolving in some sense as we speak. I’m hearing some rumors going around of connectomics-related companies potentially forming. So far E11 has been philanthropy. The National Science Foundation just put out this call for Tech Labs, which is somewhat FRO-inspired or related. You could have a tech lab for actually going and mapping the mouse brain with us and that would be philanthropy plus government still in a nonprofit, open-source framework. But can companies accelerate that? Can you credibly link connectomics to AI in the context of a company and get investment for that? It’s possible.

Dwarkesh Patel

I mean the cost of training these AIs is increasing so much. If you could tell some story of not only are we going to figure out some safety thing, but in fact once we do that, we’ll also be able to tell you how AI works… You should go to these AI labs and just be like, “Give me one one-hundredth of your projected budget in 2030.”

Adam Marblestone

I sort of tried a little bit seven or eight years ago and there was not a lot of interest. Maybe now there would be. But all the things that we’ve been talking about, it’s really fun to talk about, but it’s ultimately speculation. What is the actual reason for the energy efficiency of the brain, for example? Is it doing real inference or amortized inference or something else? This is all answerable by neuroscience. It’s going to be hard, but it’s actually answerable. So if you can do that for only low billions of dollars or something to really comprehensively solve that, it seems to me, in the grand scheme of trillions of dollars of GPUs and stuff, it actually makes sense to do that investment.

Dwarkesh Patel

Also, there’s been many labs that have been launched in the last year where they’re raising on the valuation of billions for things which are quite credible but are not like, “Our ARR next quarter is going to be whatever.” It’s like we’re going to discover materials and—

Adam Marblestone

Yes, moonshot startups or billionaire-backed startups. Moonshot startups I see as on a continuum with FROs. FROs are a way of channeling philanthropic support and ensuring that it’s open source public benefit, various other things that may be properties of a given FRO. But yes, billionaire-backed startups, if they can target the right science, the exact right science.

I think there’s a lot of ways to do moonshot neuroscience companies that would never get you the connectome. It’s like, “Oh, we’re going to upload the brain” or something, but never actually get the mouse connectome or something. These fundamental things that you need to get to ground truth the science. There are lots of ways to have a moonshot company go wrong and not do the actual science. But there also may be ways to have companies or big corporate labs get involved and actually do it correctly.

Dwarkesh Patel

This brings to mind an idea from a lecture you gave five years ago. Do you want to explain behavior cloning?

Adam Marblestone

Actually this is funny because the first time I saw this idea, I think it might have been in a blog post by Gwern. There’s always a Gwern blog post. There are now academic research efforts and some amount of emerging company-type efforts to try to do this.

Normally, let’s say I’m training an image classifier or something. I show it pictures of cats and dogs or whatever and they have the label “cat” or “dog”. And I have a neural network that’s supposed to predict the label “cat” or “dog” or something. That is a limited amount of information per label that you’re putting in. It’s just “cat” or “dog”.

What if I also had, “Predict what is my neural activity pattern when I see a cat or when I see a dog and all the other things?” If you add that as an auxiliary loss function or an auxiliary prediction task, does that sculpt the network to know the information that humans know about cats and dogs and to represent it in a way that’s consistent with how the brain represents it and the kind of representational dimensions or geometry of how the brain represents things, as opposed to just having these labels? Does that let it generalize better? Does that let it have richer labeling?

Of course that sounds really challenging. It’s very easy to generate lots and lots of labeled cat pictures. Scale AI or whatever can do this. It is harder to generate lots and lots of brain activity patterns that correspond to things that you want to train the AI to do. But again, this is just a technological limitation of neuroscience. If every iPhone was also a brain scanner, you would not have this problem and we would be training AI with the brain signals. It’s just the order in which technology has developed is that we got GPUs before we got portable brain scanners.

Dwarkesh Patel

What is the ML analog, what you’d be doing here? Because when you distill models, you’re still looking at the final layer of the log probs across all—

Adam Marblestone

If you distill one model into another, that is a certain thing. You are just trying to copy one model into another. I think that we don’t really have a perfect proposal to distill the brain. To distill the brain you need a much more complex brain interface. Maybe you could also do that. You could make surrogate models. Andreas Tolias and people like that are doing some amount of neural network surrogate models of brain activity data. Instead of having your visual cortex do the computation, just have the surrogate model. So you’re distilling your visual cortex into a neural network to some degree. That’s a kind of distillation.

This is doing something a little different. This is basically just saying I’m adding an auxiliary… I think of it as regularization or I think of it as adding an auxiliary loss function that’s smoothing out the prediction task to also always be consistent with how the brain represents it. It might help you with things like adversarial examples, for example.

Dwarkesh Patel

But what exactly are you predicting? You’re predicting the internal state of the brain?

Adam Marblestone

Yes. So in addition to predicting the label (a one-hot vector saying yes cat, not dog, not boat, and so on across a gazillion other categories, let’s say in this simple example), you’re also predicting a vector which is all these brain signal measurements.
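As a rough illustration of the auxiliary-loss idea described here, the following is a minimal sketch in plain Python, not anything from the lecture or from Gwern's post. The function names, the mean-squared-error choice for the brain-signal term, and the `aux_weight` parameter are all assumptions made for the example.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, label_index):
    """Standard classification loss against a one-hot label."""
    return -math.log(softmax(logits)[label_index])

def mean_squared_error(predicted, measured):
    """Assumed auxiliary loss: distance between predicted and measured brain signals."""
    return sum((p - q) ** 2 for p, q in zip(predicted, measured)) / len(measured)

def combined_loss(logits, label_index, predicted_activity, measured_activity, aux_weight=0.5):
    """Label loss plus a weighted brain-activity term (aux_weight is hypothetical)."""
    label_loss = cross_entropy(logits, label_index)
    aux_loss = mean_squared_error(predicted_activity, measured_activity)
    return label_loss + aux_weight * aux_loss

# Toy example: two classes ("cat", "dog") and a made-up 3-dimensional brain recording.
logits = [2.0, 0.5]
predicted_activity = [0.1, 0.9, 0.3]
measured_activity = [0.0, 1.0, 0.5]
loss = combined_loss(logits, 0, predicted_activity, measured_activity)
```

With `aux_weight` set to zero this reduces to ordinary label training, which is one way to see the brain-signal term as a regularizer layered on top of the usual task.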

So Gwern, anyway, had this long-ago blog post of like, “Oh, this is an intermediate thing. We talk about whole brain emulation, we talk about AGI, we talk about brain-computer interface. We should also be talking about this brain-data-augmented thing that’s trained on all your behavior, but is also trained on predicting some of your neural patterns.”

Dwarkesh Patel

And you’re saying the Learning System is already doing this through the Steering System?

Adam Marblestone

Yeah, and our brain, our learning system also has to predict the Steering Subsystem as an auxiliary task. That helps the Steering Subsystem. Now, the Steering Subsystem can access that predictor and build a cool reward function using it.

01:23:28 – What value will automating math have?

Dwarkesh Patel

Separately, you’re on the board of Lean, which is this formal math language that mathematicians use to prove theorems and so forth. Obviously there’s a bunch of conversation right now about AI automating math. What’s your take?

Adam Marblestone

Well, I think that there are parts of math that it seems like it’s pretty well on track to automate. First of all, Lean was developed for a number of years at Microsoft and other places. It has become one of the Convergent Focused Research Organizations to kind of drive more engineering and focus onto it.

So Lean is this programming language where instead of expressing your math proof on pen and paper, you express it in this programming language Lean. And then at the end, if you do that that way, it is a verifiable language so that you can click “verify” and Lean will tell you whether the conclusions of your proof actually follow perfectly from your assumptions of your proof. So it checks whether the proof is correct automatically.
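For readers who have not seen Lean, here is a small sketch of what "expressing a proof in the language" looks like. It assumes Lean 4 with only the core library, and the theorem names are made up for the example. If either proof were wrong, Lean would reject the file rather than accept it.

```lean
-- Adding zero on the right leaves a natural number unchanged.
-- `rfl` works because `n + 0` reduces to `n` by the definition of addition.
theorem add_zero_example (n : Nat) : n + 0 = n := rfl

-- From a proof of `p ∧ q` we can build a proof of `q ∧ p`
-- by swapping the two components.
theorem and_swap_example (p q : Prop) (h : p ∧ q) : q ∧ p :=
  ⟨h.2, h.1⟩
```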

By itself, this is useful for mathematicians collaborating and stuff like that. If I’m some amateur mathematician and I want to add to a proof, Terry Tao is not going to just believe my result. But if Lean says it’s correct, it’s just correct. So it makes it easy for collaboration to happen, but it also makes it easy for correctness of proofs to be an RL signal, very much in the RLVR sense. Formalized math proving (where formal means it’s expressed in something like Lean and verifiable) is now mechanically verifiable. That becomes a perfect RLVR task.

I think that that is going to just keep working. It seems like there is at least one billion-dollar valuation company, Harmonic, based on this. AlphaProof is based on this. There are a couple of other emerging, really interesting companies. I think that this problem of RLVRing the crap out of math proving is going to work and we will be able to have things that search for proofs and find them in the same way that we have AlphaGo or what have you that can search for ways of playing the game of Go. With that verifiable signal, it works.
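To make the "verifiable signal" concrete, here is a toy sketch in Python. It is not how Harmonic or AlphaProof work. The "theorem" is the claim that a number is composite, the "proof" is a pair of factors, and a brute-force search stands in for a learned prover. All names are hypothetical.

```python
def verify(claim_n, certificate):
    """Mechanical check: does the certificate prove that claim_n is composite?"""
    a, b = certificate
    return 1 < a < claim_n and 1 < b < claim_n and a * b == claim_n

def reward(claim_n, certificate):
    """Binary verifiable reward: 1.0 if the checker accepts, else 0.0."""
    return 1.0 if verify(claim_n, certificate) else 0.0

def search_for_proof(claim_n):
    """Stand-in for a prover: try candidates until the checker pays out."""
    for a in range(2, claim_n):
        for b in range(a, claim_n):
            if reward(claim_n, (a, b)) == 1.0:
                return (a, b)
    return None  # no certificate exists, e.g. when claim_n is prime

found = search_for_proof(91)
```

The point of the sketch is that the reward needs no human judgment: the checker either accepts the candidate or it does not, which is what lets search or reinforcement learning run against it at scale.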

So does this solve math? There is still the part that has to do with conjecturing new interesting ideas. There’s still the conceptual organization of math of what is interesting. How do you come up with new theorem statements in the first place? Or even the very high-level breakdown of what strategies you use to do proofs. I think this will shift the burden of that so that humans don’t have to do a lot of the mechanical parts of math. Validating lemmas and proofs and checking if the statement of this in this paper is exactly the same as that paper and stuff like that. That will just work.

If you really think we’re going to get all these things we’ve been talking about, real AGI would also be able to make conjectures. Bengio has a paper, more like a theoretical paper. There are probably a bunch of other papers emerging about this. Is there a loss function for good explanations or good conjectures? That’s a pretty profound question.

A really interesting math proof or statement might be one that compresses lots of information and has lots of implications for lots of other theorems. Otherwise you would have to prove those theorems using long complex paths of inference. Here, if you have this theorem, this theorem is correct, and you have short paths of inference to all the other ones. And it’s a short compact statement. So it’s like a powerful explanation that explains all the rest of math. And part of what math is doing is making these compact things that explain the other things.

Dwarkesh Patel

It’s like the Kolmogorov complexity of this statement or something.

Adam Marblestone

Yeah, of generating all the other statements, given that you know this one or stuff like that. Or if you add this, how does it affect the complexity of the rest of the network of proofs? So can you make a loss function that adds, “Oh, I want this proof to be a really highly powerful proof”? I think some people are trying to work on that. So maybe you can automate the creativity part.
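One crude way to picture "how does adding this statement affect the complexity of the rest of the network of proofs" is a graph whose edge weights count inference steps. The sketch below is an assumption-laden toy in Python, not a proposal from the conversation: the graph, the step counts, and the `lemma_power` score are invented for illustration, and it charges the lemma's own proof once per target rather than once overall.

```python
import heapq

def shortest_proof_lengths(edges, source):
    """Dijkstra over a graph whose edge weights are inference-step counts."""
    dist = {source: 0}
    queue = [(0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if d > dist.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, steps in edges.get(node, []):
            nd = d + steps
            if nd < dist.get(neighbor, float("inf")):
                dist[neighbor] = nd
                heapq.heappush(queue, (nd, neighbor))
    return dist

def total_length(edges, source, targets):
    """Sum of shortest proof lengths from the source to every target theorem."""
    dist = shortest_proof_lengths(edges, source)
    return sum(dist[t] for t in targets)

def lemma_power(edges, lemma_edges, source, targets):
    """Steps saved across all targets once the lemma's edges are available."""
    merged = {node: list(out) for node, out in edges.items()}
    for node, out in lemma_edges.items():
        merged.setdefault(node, []).extend(out)
    return total_length(edges, source, targets) - total_length(merged, source, targets)

# Without the lemma, each theorem needs a long direct derivation.
base = {"axioms": [("T1", 10), ("T2", 12), ("T3", 9)]}
# The lemma L costs 4 steps to prove and then gives short paths to everything.
with_lemma = {"axioms": [("L", 4)], "L": [("T1", 1), ("T2", 2), ("T3", 1)]}
saved = lemma_power(base, with_lemma, "axioms", ["T1", "T2", "T3"])
```

A statement that scores highly here is short to reach and shortens many other proofs, which is roughly the "compact thing that explains the other things" intuition. Whether anything like this can serve as a real training objective is exactly the open question raised above.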

If you had true AGI, it would do everything a human can do. So it would also do the things that the creative mathematicians do. But barring that, I think just RLVRing the crap out of proofs, I think that’s going to be just a really useful tool for mathematicians. It’s going to accelerate math a lot and change it a lot, but not necessarily immediately change everything about it.

Will we get mechanical proof of the Riemann hypothesis or something like that, or things like that? Maybe, I don’t know. I don’t know enough details of how hard these things are to search for, and I’m not sure anyone can fully predict that, just as we couldn’t exactly predict when Go would be solved or something like that.

I think it’s going to have lots of really cool applied applications. So one of the things you want to do is you want to have provably stable, secure, unhackable software. So you can write math proofs about software and say, “This code, not only does it pass these unit tests, but I can mathematically prove that there’s no way to hack it in these ways, or no way to mess with the memory”, or these types of things that hackers use, or it has these properties. You can use the same Lean and same proof to do formally verified software.

I think that’s going to be a really powerful piece of cybersecurity that’s relevant for all sorts of other AI hacking the world stuff. And if you can prove the Riemann hypothesis, you’re also going to be able to prove insanely complex things about very complex software. And then you’ll be able to ask the LLM, “Synthesize me a software that I can prove is correct.”

Dwarkesh Patel

Why hasn’t provable programming language taken off as a result of LLMs?

Adam Marblestone

I think it’s starting to. One challenge—we are actually incubating a potential Focused Research Organization on this—is the specification problem. So mathematicians know what interesting theorems they want to formalize. Let’s say I have some code that is involved in running the power grid or something and it has some security properties, well what is the formal spec of those properties? The power grid engineers just made this thing, but they don’t necessarily know how to lift the formal spec from that. And it’s not necessarily easy to come up with the spec that is the spec that you want for your code. People aren’t used to coming up with formal specs and there are not a lot of tools for it.

So you also have this user interface plus AI problem of what security specs should I be specifying? Is this the spec that I wanted? So there’s a spec problem and it’s just been really complex and hard. But it’s only just in the last very short time that the LLMs are able to generate verifiable proofs of things that are useful to mathematicians, starting to be able to do some amount of that for software verification, hardware verification.

But I think if you project the trends over the next couple years, it’s possible that it just flips the tide. Formal methods, this whole field of formal methods or formal verification, provable software. It’s this weird almost backwater of the more theoretical part of programming languages and stuff, very academically flavored often. Although there was this DARPA program that made a provably secure quadcopter helicopter and stuff like that.

Dwarkesh Patel

Secure against… What is the property that is exactly proved? Not for that particular project, but just in general. Because obviously things malfunction for all kinds of reasons.

Adam Marblestone

You could say that what’s going on in this part of the memory over here, which is supposed to be the part the user can access, can’t in any way affect what’s going on in the memory over here or something like that. Things like that.
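A toy version of that kind of isolation property can be stated and proved in Lean. The sketch below assumes Lean 4 core only, and the memory model (`Mem`, `write`) and the theorem name are invented for the example. Real verified systems use far richer models of memory and permissions.

```lean
-- Hypothetical model: memory maps addresses to stored values.
abbrev Mem := Nat → Nat

-- Writing `val` at `addr` changes only that one address.
def write (m : Mem) (addr val : Nat) : Mem :=
  fun a => if a = addr then val else m a

-- Isolation: a write at `addr` cannot change what is stored at any other address.
theorem write_isolated (m : Mem) (addr val other : Nat) (h : other ≠ addr) :
    write m addr val other = m other := by
  simp [write, h]
```

The theorem holds for every memory state, address, and value, which is the difference from a unit test that checks only the cases someone thought to write down.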

Dwarkesh Patel

So there’s two questions. One is how useful is this? Two is, how satisfying, as a mathematician, would it be? The fact that there’s this application towards proving that software has certain properties or hardware has certain properties, if that works, that would obviously be very useful. But from a pure… Are we going to figure out mathematics? Is your sense that there’s something about finding that one construction cross-maps to another construction in a different domain, or finding that, “Oh, this lemma, if you redefine this term, it still satisfies what I meant by this term. But a counterexample that previously knocked it down no longer applies.” That kind of dialectical thing that happens in mathematics.

Adam Marblestone

Will the software replace that?

Dwarkesh Patel

Yeah. How much of the value of this sort of pure mathematics just comes from actually just coming up with entirely new ways of thinking about a problem, mapping it to a totally different representation? Do we have examples?

Adam Marblestone

I don’t know. I think of it maybe a little bit like when everybody had to write assembly code or something like that. The amount of fun cool startups that got created was just a lot less or something. Fewer people could do it, progress was more grinding and slow and lonely and so on. You had more false failures because you didn’t get something about the assembly code, rather than the essential thing of was your concept right. Harder to collaborate and stuff like that. And so I think it will be really good.

There is some worry that by not learning to do the mechanical parts of the proofs that you fail to generate the intuitions that inform the more conceptual parts, the creative part.

Dwarkesh Patel

It’s the same with assembly.

Adam Marblestone

Right. So at what point is that applying? With vibe coding, are people not learning computer science or actually are they vibe coding and they’re also simultaneously looking at the LLM that’s explaining these abstract computer science concepts to them and it’s all just all happening faster? Their feedback loop is faster and they’re learning way more abstract computer science and algorithm stuff because they’re vibe coding. I don’t know, it’s not obvious. That might be something about the user interface and the human infrastructure around it.

But I guess there’s some worry that people don’t learn the mechanics and therefore don’t build the grounded intuitions or something. But my hunch is it’s super positive. Exactly, on net, how useful that will be or how much overall math breakthroughs, or math breakthroughs even that we care about, will happen? I don’t know.

One other thing that I think is cool is the accessibility question. Okay, that sounds a little bit corny. Okay, yeah, more people can do math, but who cares? But I think there’s lots of people that could have interesting ideas. Like maybe the quantum theory of gravity or something. Yeah, one of us will come up with the quantum theory of gravity instead of a card-carrying physicist. In the same way that Steve Byrnes is reading the neuroscience literature and he hasn’t been in the neuroscience lab that much. But he’s able to synthesize across the neuroscience literature and be like, “Oh, Learning Subsystem, Steering Subsystem. Does this all make sense?” He’s an outsider neuroscientist in some ways. Can you have outsider string theorists or something, because the math is just done for them by the computer? And does that lead to more innovation in string theory? Maybe yes.

Dwarkesh Patel

Interesting. Okay, so if this approach works and you’re right that LLMs are not the final paradigm, and suppose it takes at least 10 years to get the final paradigm in that world. There’s this fun sci-fi premise where you have… Terence Tao today had a tweet where he’s like, “These models are like automated cleverness but not automated intelligence.” And you can quibble with the definitions there. But if you have automated cleverness and you have some way of filtering—which if you can formalize and prove things that the LLMs are saying you could do—then you could have this situation where quantity has a quality all of its own.

So what are the domains of the world which could be put in this provable symbolic representation? So in the world where AGI is super far away, maybe it makes sense to literally turn everything the LLMs ever do, or almost everything they do, into super provable statements. So LLMs can actually build on top of each other because everything they do is super provable.

Maybe this is just necessary because you have billions of intelligences running around. Even if they are super intelligent, the only way the future AGI civilization can collaborate with each other is if they can prove each step. They’re just brute force churning out… This is what the Jupiter brains are doing.

Adam Marblestone

It’s a universal language, it’s provable. It’s also provable from the perspective of, “Are you trying to exploit me or are you sending me some message that’s trying to hack into my brain effectively?” Are you trying to socially influence me? Are you actually just sending me just the information that I need and no more for this?

So davidad, who’s this program director at ARIA now in the UK, he has this whole design of an ARPA-style program, a sort of safeguarded AI that very heavily leverages provable safety properties. Can you apply proofs to… Can you have a world model? But that world model is actually not specified just in neuron activations, but it’s specified in equations. Those might be very complex equations, but if you can just get insanely good at just auto-proving these things with cleverness, auto-cleverness… Can you have explicitly interpretable world models as opposed to neural net world models and move back basically to symbolic methods just because you can just have insane amount of ability to prove things? Yeah, I mean that’s an interesting vision. I don’t know in the next 10 years whether that will be the vision that plays out, but I think it’s really interesting to think about.

Even for math, I mean, Terence Tao is doing some amount of stuff where it’s not about whether you can prove the individual theorems. It’s like let’s prove all the theorems en masse and then let’s study the properties of the aggregate set of proved theorems. Which are the ones that got proved and which are the ones that didn’t? Okay, well that’s the landscape of all the theorems instead of one theorem at a time.

01:38:18 – Architecture of the brain

Dwarkesh Patel

Speaking of symbolic representations, one question I was meaning to ask you is, how does the brain represent the world model? Obviously nets out in neurons, but I don’t mean extremely functionally. I mean conceptually, is it in something that’s analogous to the hidden state of a neural network or is it something that’s closer to a symbolic language?

Adam Marblestone

We don’t know. There’s some amount of study of this. There’s these things like face patch neurons that represent certain parts of the face that geometrically combine in interesting ways. That’s with geometry and vision. Is that true for other more abstract things? There’s this idea of cognitive maps. A lot of the stuff that a rodent hippocampus has to learn is place cells and, where is the rodent going to go next and is it going to get a reward there? It’s very geometric. And do we organize concepts with an abstract version of a spatial map?

There’s some questions of can we do true symbolic operations? Can I have a register in my brain that copies a variable to another register regardless of what the content of that variable is? That’s this variable binding problem. Basically I don’t know if we have that machinery or is it more like cost functions and architectures that make some of that approximately emerge, but maybe it would also emerge in a neural net? There’s a bunch of interesting neuroscience research trying to study this, what the representations look like.

Dwarkesh Patel

But what’s your hunch?

Adam Marblestone

Yeah, my hunch is that it’s going to be a huge mess and we should look at the architecture, the loss functions, and the learning rules. I don’t expect it to be pretty in there.

Dwarkesh Patel

Which is that it is not a symbolic language type thing?

Adam Marblestone

Yeah, probably it’s not that symbolic. But other people think very differently.

Dwarkesh Patel

Another random question speaking of binding, what is up with feeling like there’s an experience? All the parts of your brain which are modeling very different things, have different drives, and at least presumably feel like there’s an experience happening right now. Also that across time you feel like…

Adam Marblestone

Yeah, I’m pretty much at a loss on this one. I don’t know. Max Hodak has been giving talks about this recently. He’s another really hardcore neuroscience person, neurotechnology person. The thing I mentioned with Doris maybe also sounds like it might have some touching on this question. But yeah, I don’t think anybody has any idea. It might even involve new physics.

Dwarkesh Patel

Here’s another question which might not have an answer yet. Continual learning, is that the product of something extremely fundamental at the level of even the learning algorithm? You could say, “Look, at least the way we do backprop in neural networks is that there’s a training period and then you freeze the weights. So you just need this active inference or some other learning rule in order to do continual learning.” Or do you think it’s more a matter of architecture and how memory is exactly stored and what kind of associative memory you have basically?

Adam Marblestone

So continual learning… I don’t know. At the architectural level, there’s probably some interesting stuff that the hippocampus is doing. People have long thought this. What kinds of sequences is it storing? How is it organizing, representing that? How is it replaying it back? What is it replaying back? How exactly does that memory consolidation work? Is it training the cortex using replays or memories from the hippocampus or something like that? There’s probably some of that stuff.

There might be multiple timescales of plasticity or clever learning rules that can simultaneously be storing short-term information and also doing backprop with it. Neurons may be doing a couple things: some fast weight plasticity and some slower plasticity at the same time, or synapses that have many states. I mean, I don’t know. From a neuroscience perspective, I’m not sure that I’ve seen something that’s super clear on what causes continual learning except maybe to say that this systems consolidation idea of hippocampus consolidating cortex. Some people think it is a big piece of this and we still don’t fully understand the details.
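To show what "fast weight plasticity and slower plasticity at the same time" could mean mechanically, here is a deliberately simple Python sketch. It is not a model of any real synapse. The class name, the learning rates, and the decay constant are all assumptions chosen for illustration.

```python
class TwoTimescaleSynapse:
    """Toy synapse whose effective weight is a fast component plus a slow one."""

    def __init__(self, fast_lr=0.5, slow_lr=0.01, fast_decay=0.9):
        self.fast = 0.0          # changes quickly, fades quickly
        self.slow = 0.0          # changes slowly, persists
        self.fast_lr = fast_lr
        self.slow_lr = slow_lr
        self.fast_decay = fast_decay

    @property
    def weight(self):
        """Effective weight seen by the rest of the network."""
        return self.fast + self.slow

    def update(self, signal):
        """Apply one learning signal to both components at their own rates."""
        self.fast = self.fast_decay * self.fast + self.fast_lr * signal
        self.slow += self.slow_lr * signal

    def rest(self):
        """One time step with no learning signal: only the fast part decays."""
        self.fast *= self.fast_decay

syn = TwoTimescaleSynapse()
syn.update(1.0)
weight_right_after = syn.weight   # dominated by the fast component
for _ in range(50):
    syn.rest()
weight_later = syn.weight         # mostly the slow component remains
```

The fast component acts like short-term storage of the recent event, while the slow component keeps a small permanent trace, so the same update can serve both roles.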

Dwarkesh Patel

Speaking of fast weights, is there something in the brain which is the equivalent of this distinction between parameters and activations that we see in neural networks? Specifically in transformers we have this idea that some of the activations are the key and value vectors of previous tokens that you build up over time.

There’s the so-called fast weights that whenever you have a new token, you query them against these activations, but you also obviously can’t query them against all the other parameters in the network which are part of the actual built-in weights. Is there some such distinction that’s analogous?
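For reference, the parameters-versus-activations distinction in a transformer can be sketched as follows. This is a minimal single-head example in plain Python under simplifying assumptions: the learned projection matrices (the parameters) are left out, and the keys, values, and query are supplied directly as made-up numbers. `KVCache` and `attend` are hypothetical names.

```python
import math

class KVCache:
    """Activations built up over time: one key and one value per previous token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def attend(query, keys, values):
    """Scaled dot-product attention of one query over the cached keys and values."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(values[0])
    # Output is the attention-weighted average of the cached values.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

cache = KVCache()
cache.append([1.0, 0.0], [10.0, 0.0])   # token 1
cache.append([0.0, 1.0], [0.0, 10.0])   # token 2
output = attend([5.0, 0.0], cache.keys, cache.values)  # query resembles token 1's key
```

The cache grows with every token and is queried directly, while the trained weights stay fixed during inference and are only applied, never looked up. That is the asymmetry the question is asking about.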

Adam Marblestone

I don’t know. I mean we definitely have weights and activations. Whether you can use the activations in these clever ways, different forms of actual attention, like attention in the brain… Is that based on, “I’m trying to pay attention”... I think there’s probably several different kinds of actual attention in the brain. I want to pay attention to this area of visual cortex. I want to pay attention to the content in other areas that is triggered by the content in this area. Attention that’s just based on reflexes and stuff like that.

So I don’t know. There’s not just the cortex, there’s also the thalamus. The thalamus is also involved in somehow relaying or gating information. There’s cortico-cortical connections. There’s also some amount of connection between cortical areas that goes through the thalamus. Is it possible that this is doing some sort of matching or constraint satisfaction or matching across keys over here and values over there? Is it possible that it can do stuff like that? Maybe. I don’t know. This is all part of the architecture of this corticothalamic system. I don’t know how transformer-like it is or if there’s anything analogous to that attention. It’d be interesting to find out.

Dwarkesh Patel

We’ve got to give you a billion dollars so you can come on the podcast again and tell me how exactly the brain works.

Adam Marblestone

Mostly I just do data collection. It’s really unbiased data collection so all the other people can figure out these questions.

Dwarkesh Patel

Maybe the final question to go off on is, what was the most interesting thing you learned from the Gap Map? Maybe you want to explain what the Gap Map is.

Adam Marblestone

In the process of incubating and coming up with these Focused Research Organizations, these nonprofit startup-like moonshots that we’ve been getting philanthropists and now government agencies to fund, we talked to a lot of scientists. Some of the scientists were just like, “Here’s the next thing my graduate student will do. Here’s what I find interesting. Exploring these really interesting hypothesis spaces, all the types of things we’ve been talking about.”

Some of them were like, “Here’s this gap. I need this piece of infrastructure. There’s no combination of grad students in my lab or me loosely collaborating with other labs with traditional grants that could ever get me that. I need to have an organized engineering team that builds the miniature equivalent of the Hubble Space Telescope. If I can build that Hubble Space Telescope, then I will unblock all the other researchers in my field or some path of technological progress in the way that the Hubble Space Telescope lifted the boats and improved the life of every astronomer.” But it wasn’t really an astronomy discovery in itself. It was just that you had to put this giant mirror in space with a CCD camera and organize all the people and engineering and stuff to do that. So some of the things we talked to scientists about looked like that.

The Gap Map is just a list of a lot of those things and we call it a Gap Map. I think it’s actually more like a fundamental capabilities map. What are all these things, like mini Hubble space telescopes? And then we organized that into gaps for helping people understand that or search that.

Dwarkesh Patel

What was the most surprising thing you found?

Adam Marblestone

I think I’ve talked about this before, but one thing is just the overall size or shape of it or something like that. It’s a few hundred fundamental capabilities. So if each of these were a deep tech startup-size project, that’s only a few billion dollars or something. If each one of those were a Series A, that’s only… It’s not like a trillion dollars to solve these gaps. It’s lower than that. So that’s one thing. Maybe we assumed that, and that’s what we got. It’s not really comprehensive. It’s really just a way of summarizing a lot of conversations we’ve had with scientists.

I do think that in the aggregate process, things like Lean are actually surprising because I did start from neuroscience and biology and it was very obvious that there’s these -omics. We need genomics, but we also need connectomics. We can engineer E. coli, but we also need to engineer the other cells. There’s somewhat obvious parts of biological infrastructure. I did not realize that math proving infrastructure was a thing and that was emergent from trying to do this.

So I’m looking forward to seeing other things where it’s not actually this hard intellectual problem to solve it. It’s maybe slightly the equivalent of AI researchers just needing GPUs or something like that and focus and really good PyTorch code to start doing this. Which are the fields that do or don’t need that? So fields that have had gazillions of dollars of investment, do they still need some of those? Do they still have some of those gaps or is it only more neglected fields? We’re even finding some interesting ones in actual astronomy, actual telescopes that have not been explored. Maybe because if you’re getting above a critical mass-size project, then you have to have a really big project and that’s a more bureaucratic process with the federal agencies.

Dwarkesh Patel

I guess you just need scale in every single domain of science these days.

Adam Marblestone

Yeah, I think you need scale in many of the domains of science. That does not mean that the low-scale work is not important. It does not mean that creativity, serendipity, etc., and each student pursuing a totally different direction or thesis that you see in universities is not also really key. But I think some amount of scalable infrastructure is missing in essentially every area of science, even math, which is crazy. Because mathematicians I thought just needed whiteboards, but they actually need Lean. They actually need verifiable programming languages and stuff. I didn’t know that.

Dwarkesh Patel

Cool. Adam, this is super fun. Thanks for coming on.

Adam Marblestone

Thank you so much. My pleasure.

Dwarkesh Patel

Where can people find your stuff?

Adam Marblestone

Pleasure. The easiest way now… My adammarblestone.org website is currently down, I guess. But convergentresearch.org can link to a lot of the stuff we’ve been doing.

Dwarkesh Patel

And then you have a great blog, Longitudinal Science.

Adam Marblestone

Longitudinal Science, yes, on WordPress.

Dwarkesh Patel

Cool.

Adam Marblestone

Thank you so much. Pleasure.

Thoughts on AI progress (Dec 2025)

Dwarkesh Patel — Tue, 23 Dec 2025 20:24:48 GMT

What are we scaling?

I’m confused why some people have short timelines and at the same time are bullish on the current scale up of reinforcement learning atop LLMs. If we’re actually close to a human-like learner, this whole approach of training on verifiable outcomes is doomed.

Currently the labs are trying to bake in a bunch of skills into these models through “mid-training” - there’s an entire supply chain of companies building RL environments which teach the model how to navigate a web browser or use Excel to write financial models.

Either these models will soon learn on the job in a self-directed way - making all this pre-baking pointless - or they won’t - which means AGI is not imminent. Humans don’t have to go through a special training phase where they need to rehearse every single piece of software they might ever need to use.

Beren Millidge made interesting points about this in a recent blog post:

When we see frontier models improving at various benchmarks we should think not just of increased scale and clever ML research ideas but billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities.

You can see this tension most vividly in robotics. In some fundamental sense, robotics is an algorithms problem, not a hardware or data problem — with very little training, humans can learn how to teleoperate current hardware to do useful work. So if we had a human like learner, robotics would (in large part) be solved. But the fact that we don’t have such a learner makes it necessary to go out into a thousand different homes to learn how to pick up dishes or fold laundry.

One counterargument I’ve heard from the takeoff-within-5-years crew is that we have to do this kludgy RL in service of building a superhuman AI researcher, and then the million copies of automated Ilya can go figure out how to solve robust and efficient learning from experience.

This gives the vibes of that old joke, “We’re losing money on every sale, but we’ll make it up in volume.” Somehow this automated researcher is going to figure out the algorithm for AGI - a problem humans have been banging their head against for the better part of a century - while not having the basic learning capabilities that children have? I find this super implausible.

Besides, even if you think the RLVR scaleup will soon help us automate AI research, the labs’ actions suggest otherwise. You don’t need to pre-bake the consultant’s skills at crafting PowerPoint slides in order to automate Ilya. So clearly the labs’ actions hint at a world view where these models will continue to fare poorly at generalizing and on-the-job learning, thus making it necessary to build in the skills that they hope will be economically valuable.

Another counterargument you could make is that even if the model could learn these skills on the job, it is just so much more efficient to build them up just once during training rather than again and again for each user or company. And look, it makes a lot of sense to just bake in fluency with common tools like browsers and terminals. Indeed one of the key advantages that AGIs will have is this greater capacity to share knowledge across copies. But people are underrating how many company- and context-specific skills are required to do most jobs. And there just isn’t currently a robust, efficient way for AIs to pick up those skills.

Human labor is valuable precisely because it’s not schleppy to train

I was at a dinner with an AI researcher and a biologist. The biologist said she had long timelines. We asked what she thought AI would struggle with. She said her work has recently involved looking at slides and deciding if a dot is actually a macrophage or just looks like one. The AI researcher said, “Image classification is a textbook deep learning problem—we could easily train for that.”

I thought this was a very interesting exchange, because it revealed a key crux between me and the people who expect transformative economic impacts in the next few years. Human workers are valuable precisely because we don’t need to build schleppy training loops for every small part of their job. It’s not net-productive to build a custom training pipeline to identify what macrophages look like given the way this particular lab prepares slides, then another for the next lab-specific micro-task, and so on. What you actually need is an AI that can learn from semantic feedback or from self directed experience, and then generalize, the way a human does.

Every day, you have to do a hundred things that require judgment, situational awareness, and skills & context learned on the job. These tasks differ not just across different people, but from one day to the next even for the same person. It is not possible to automate even a single job by just baking in some predefined set of skills, let alone all the jobs.

In fact, I think people are really underestimating how big a deal actual AGI will be because they’re just imagining more of this current regime. They’re not thinking about billions of human-like intelligences on a server which can copy and merge all their learnings. And to be clear, I expect this (aka actual AGI) in the next decade or two. That’s fucking crazy!

Economic diffusion lag is cope for missing capabilities

Sometimes people will say that the reason that AIs aren’t more widely deployed across firms and already providing lots of value (outside of coding) is that technology takes a long time to diffuse. I think this is cope. People are using this cope to gloss over the fact that these models just lack the capabilities necessary for broad economic value.

Steven Byrnes has an excellent post on this and many other points:

New technologies take a long time to integrate into the economy? Well ask yourself: how do highly-skilled, experienced, and entrepreneurial immigrant humans manage to integrate into the economy immediately? Once you’ve answered that question, note that AGI will be able to do those things too.

If these models were actually like humans on a server, they’d diffuse incredibly quickly. In fact, they’d be so much easier to integrate and onboard than a normal human employee (they could read your entire Slack and Drive in minutes and immediately distill all the skills your other AI employees have). Plus, hiring is very much like a lemons market, where it’s hard to tell who the good people are, and hiring someone bad is quite costly. This is a dynamic you wouldn’t have to worry about when you just wanna spin up another instance of a vetted AGI model.

For these reasons, I expect it’s going to be much much easier to diffuse AI labor into firms than it is to hire a person. And companies hire lots of people all the time. If the capabilities were actually at AGI level, people would be willing to spend trillions of dollars a year buying tokens (knowledge workers cumulatively earn 10s of trillions of dollars of wages a year). The reason that lab revenues are 4 orders of magnitude off right now is that the models are nowhere near as capable as human knowledge workers.
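To make the “4 orders of magnitude” claim concrete, here is a rough back-of-the-envelope sketch. The $30 trillion wage figure is purely an illustrative stand-in for “10s of trillions” (an assumption, not a sourced number), and the function names are hypothetical. It only shows what a gap of that size implies, not what lab revenue actually is.

```python
import math


def orders_of_magnitude_gap(target: float, actual: float) -> float:
    """Return log10(target / actual): how many factors of 10 separate the two."""
    if target <= 0 or actual <= 0:
        raise ValueError("both values must be positive")
    return math.log10(target / actual)


def implied_actual(target: float, gap_ooms: float) -> float:
    """Return the value that sits gap_ooms orders of magnitude below target."""
    return target / 10 ** gap_ooms


# Assumption: "10s of trillions" of annual knowledge-worker wages, taken as $30T.
ASSUMED_WAGES_USD = 30e12

implied = implied_actual(ASSUMED_WAGES_USD, 4)
print(f"4 OOMs below ${ASSUMED_WAGES_USD:,.0f} is ${implied:,.0f}")
print(f"gap check: {orders_of_magnitude_gap(ASSUMED_WAGES_USD, implied):.1f} OOMs")
```

Swap in whatever wage and revenue estimates you trust; the point is only that each order of magnitude is a factor of ten, so a 4 OOM gap is a factor of 10,000.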

Goal post shifting is justified

AI bulls will often criticize AI bears for repeatedly moving the goal posts. This is often fair. AI has made a ton of progress in the last decade, and it’s easy to forget that.

But some amount of goal post shifting is justified. If you showed me Gemini 3 in 2020, I would have been certain that it could automate half of knowledge work. We keep solving what we thought were the sufficient bottlenecks to AGI (general understanding, few shot learning, reasoning), and yet we still don’t have AGI (defined as, say, being able to completely automate 95% of knowledge work jobs). What is the rational response?

It’s totally reasonable to look at this and say, “Oh actually there’s more to intelligence and labor than I previously realized. And while we’re really close to (and in many ways have surpassed) what I would have defined as AGI in the past, the fact that model companies are not making trillions in revenue clearly reveals that my previous definition of AGI was too narrow.”

I expect this to keep happening into the future. I expect that by 2030 the labs will have made significant progress on my hobby horse of continual learning, and the models will start earning 100s of billions in revenue, but they won’t have automated all knowledge work, and I’ll be like, “We’ve made a lot of progress, but we’re not at AGI yet. We also need X, Y, and Z thing to get to trillions in revenue.”

Models keep getting more impressive at the rate the short timelines people predict, but more useful at the rate the long timelines people predict.

RL scaling is laundering the prestige of pretraining scaling

With pretraining, we had this extremely clean and general trend in improvement in loss across multiple orders of magnitude of compute (albeit on a power law, which is as weak as exponential growth is strong). People are trying to launder the prestige of pretraining scaling, which was almost as predictable as a physical law of the universe, to justify bullish projections about RLVR, for which we have no well fit publicly known trend. When intrepid researchers do try to piece together the implications from scarce public datapoints, they get quite bearish results. For example, Toby Ord has a great post where he cleverly connects the dots between different o-series benchmark charts, which suggested “we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level”.
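A small sketch of why a power law is “as weak as exponential growth is strong”: if loss follows a pure power law in compute, L(C) = a * C^(-alpha), then every constant-factor cut in loss costs a multiplicative blow-up in compute. The exponents below are hypothetical values chosen for illustration, not fitted to any real model, and the pure power-law form (no irreducible loss term) is a simplifying assumption.

```python
def compute_multiplier(loss_ratio: float, alpha: float) -> float:
    """Factor by which compute must grow to scale loss by loss_ratio.

    Assumes a pure power law L(C) = a * C**(-alpha). Solving
    L(k * C) / L(C) = loss_ratio for k gives k = loss_ratio ** (-1 / alpha).
    """
    if not 0 < loss_ratio < 1:
        raise ValueError("loss_ratio must be between 0 and 1")
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    return loss_ratio ** (-1.0 / alpha)


# Hypothetical exponents, for illustration only.
for alpha in (0.5, 0.1, 0.05):
    k = compute_multiplier(0.5, alpha)
    print(f"alpha={alpha}: halving the loss takes ~{k:,.0f}x more compute")
```

The smaller the exponent, the more brutal the compute bill for the same improvement, which is why a predictable trend is not the same thing as a cheap one.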

Broadly deployed intelligence explosion

People have spent a lot of time talking about a software only singularity (where AI models write the code for a smarter successor system), a software + hardware singularity (where AIs also improve their successor’s computing hardware), or variations therein.

All these scenarios neglect what I think will be the main driver of further improvements atop AGI: continual learning. Again, think about how humans become more capable at anything. It’s mostly from experience in the relevant domain.

In conversation, Beren Millidge made the interesting suggestion that the future might look like continual learning agents going out, doing jobs and generating value, and then bringing all their learnings back to the hive mind model, which does some kind of batch distillation on all these agents. The agents themselves could be quite specialized - containing what Karpathy called “the cognitive core” plus knowledge and skills relevant to the job they’re being deployed to do.

“Solving” continual learning won’t be a singular one-and-done achievement. Instead, it will feel like solving in context learning. GPT-3 demonstrated that in context learning could be very powerful (its ICL capabilities were so remarkable that the title of the GPT-3 paper is ‘Language Models are Few-Shot Learners’). But of course, we didn’t “solve” in-context learning when GPT-3 came out - and indeed there’s plenty of progress still to be made, from comprehension to context length. I expect a similar progression with continual learning. Labs will probably release something next year which they call continual learning, and which will in fact count as progress towards continual learning. But human level continual learning may take another 5 to 10 years of further progress.

This is why I don’t expect some kind of runaway gains to the first model that cracks continual learning, thus getting more and more widely deployed and capable. If fully solved continual learning dropped out of nowhere, then sure, it’s “game set match”, as Satya put it. But that’s not what’s going to happen. Instead, some lab is going to figure out how to get some initial traction on the problem. Playing around with this feature will make it clear how it was implemented, and the other labs will soon replicate this breakthrough and improve it slightly.

Besides, I just have some prior that competition will stay fierce, informed by the observation that all these previous supposed flywheels (user engagement on chat, synthetic data, etc) have done very little to diminish the greater and greater competition between model companies. Every month (or less), the big three will rotate around the podium, with other competitors not that far behind. There is some force (potentially talent poaching, rumor mills, or reverse engineering) which has so far neutralized any runaway advantages a single lab might have had.

Sarah Paine — Why Russia Lost the Cold War

Dwarkesh Patel — Fri, 19 Dec 2025 17:41:19 GMT

This is the final episode of the Sarah Paine lecture series, and it’s probably my favorite one.

Sarah gives a “tour of the arguments” on what ultimately led to the Soviet Union’s collapse, diving into the role of the US, the Sino-Soviet border conflict, the oil bust, ethnic rebellions and even the Roman Catholic Church. As she points out, this is all particularly interesting as we find ourselves potentially at the beginning of another Cold War.

As we wrap up this lecture series, I want to take a moment to thank Sarah for doing this with me. It has been such a pleasure.

If you want more of her scholarship, I highly recommend checking out the books she’s written. You can find them here.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Labelbox helped me create a tool to transcribe our episodes! I’ve struggled with transcription in the past because I don’t just want verbatim transcripts, I want transcripts reworded to read like essays. Labelbox helped me generate the exact data I needed for this. If you want to learn how Labelbox can help you (or if you want to try out the transcriber tool yourself), go to labelbox.com/dwarkesh
Sardine doesn’t just assess customer risk for banking & retail. Their AI risk management platform is also extremely good at detecting fraudulent job applications, which I’ve found useful for my own hiring process. If you need help with hiring risk—or any other type of fraud prevention—go to sardine.ai/dwarkesh.
Gemini’s Nano Banana Pro helped us make many of the visuals in this episode. For example, we used it to turn dense tables into clear charts so that it’d be easier to quickly understand the trends that Sarah discusses. You can try Nano Banana Pro now in the Gemini app. Go to gemini.google.com.

Timestamps

(00:00:00) – Did Reagan single-handedly win the Cold War?

(00:15:53) – Eastern Bloc uprisings & oil crisis

(00:30:37) – Gorbachev’s mistakes

(00:37:33) – German unification and NATO expansion

(00:48:31) – The Gulf War and the Cold War endgame

(00:56:10) – How central planning survived so long

(01:14:46) – Sarah’s life in the USSR in 1988

Transcript

00:00:00 – Did Reagan single-handedly win the Cold War?

Sarah Paine

Thank you for coming. It’s a treat to be with you and sharing all this stuff. Since we seem to be in a second Cold War, maybe it’s a good time to revisit the last one to see why it turned out the way it did and why the participants in it thought it turned out the way it did.

I’m going to pose the question: why did Russia lose the Cold War? People have loads of different answers to that question. This is going to be a tour of the counter-arguments. I’m going to start with an answer that many Americans have. It’s a very simple one that’s like, “Ronald Reagan single-handedly defeated the Soviet Union.” That’s one possible answer. But then I’m going to give you all kinds of counter-arguments to that.

Some of them are going to be other external explanations of what others did to the Soviet Union. Others are internal ones of what the Soviet Union did, the cards that it didn’t play particularly well. And then I’ve got some umbrella explanations. So that’s my plan for this evening.

The story that Ronald Reagan did it… Well, here’s a picture at the Reagan Ranch after the Cold War is over. You see the Gorbachevs and you see the Reagans and they seem to be having a grand old time, which suggests there’s something maybe off with that explanation.

But anyway, the way the “Ronald Reagan did it” school goes is that Ronald Reagan did a massive military buildup and some would argue it bankrupted the Soviet Union. He was a man of words and deeds. He made really good speeches that were memorable.

Here’s one before Parliament where he says, “The regimes planted by totalitarianism have had more than 30 years to establish their legitimacy, but none—not one regime—has yet been able to risk free elections. Regimes planted by bayonets do not take root.”

And then here he is before the Brandenburg Gate, this is in Berlin, long a symbol of German greatness. But then it was a locked gate on the Berlin Wall. Here’s Ronald Reagan: “General Secretary Gorbachev, if you seek peace, if you seek prosperity for the Soviet Union and Eastern Europe, if you seek liberalization, come to this gate. Mr. Gorbachev, open this gate. Tear down this wall!”

And who can forget the “Evil Empire” speech, which he gave to the National Association of Evangelicals in Orlando, Florida, and they skipped Disneyland to hear it.

Reagan did a very significant military buildup that actually had started under Carter when the Soviets invaded Afghanistan, big mistake as we discovered. He also invested in and deployed missiles in Europe. He was busy funding anti-communist insurgencies and also others who didn’t like the Soviet Union all over the world. He started doing more aggressive military patrolling. By the time he’s out of office, he was like half a dozen ships short of this 600-ship navy or whatever it is he was planning to make. He was also trying to build a missile shield, his Strategic Defense Initiative.

The problem is the Soviets tried to match him on this. If you add up the GNPs of the United States, NATO allies, and Japan, well, that would be seven times larger than the Soviet GNP. You’ve got to be aware of asymmetric strategy. The CIA thought during the Cold War that perhaps Russia was spending up to 20% of its GNP on defense. After the Cold War ended, when you were getting more accurate statistics, it turns out it was at least 40 or 50%. Some people say it was up to a truly economy-busting 70%, if you take into account all the infrastructure investments that were associated with military things. If you look during the Cold War, the United States was spending less than 8%, Germany less than 6%, Japan less than 2%, and Nazi Germany, which is no piker, 55%. So you look at all this and it was difficult.

So I am going to be quoting lots of Russians today because they have thought deeply about the fate of their country, how life as they knew it disappeared, the Soviet Union gone, the empire gone. They thought a lot about it. Here is a former Soviet ambassador to West Germany, Valentin Falin. Here’s his take: “Following the American strategy of our exhaustion in the arms race, our crisis in public health and all the things that have to do with standard of living reached a new dimension.” Then if you add to the arms race of the United States the arms race that was going on with China on that border, the arms race plunged the Soviet economy into a permanent crisis.

Here you have Georgy Arbatov, who was the late Soviet Union’s finest expert on the United States, or at least the most famous one. He’s looking at the Soviet war in Afghanistan. He said, “It is quite clear that the Afghan war was most advantageous for the United States. And we got our Vietnam.” Because the United States is busy funding the other side, and it’s costly. Gorbachev is looking at this, as he’s telling the Politburo a year after he came into power. He said, “Look, the Americans are betting precisely on the fact that the Soviet Union is scared of this SDI, the Strategic Defense Initiative, a missile defense. That’s why they’re putting pressure on us, to exhaust us.” Correct.

So some would argue that the US victory in the arms race guaranteed victory in the Cold War. Go Ronnie. That’s one explanation. But I’m going to give you a tour of the counter-arguments and some other explanations, starting with Presidents Ford, Carter, and the Helsinki Declaration.

After World War II, the Soviets had wanted to convene a conference of European states to confirm its expanded World War II borders. And for a long time, nobody was interested. The Western Europeans are sick of all the drama. The United States still doesn’t want to show, but we go along with our allies, and our allies insist on including human rights provisions. We think this is crazy because we know the Soviets are never going to enforce those things. But you get the Helsinki Accords that have all sorts of human rights provisions.

Well, lo and behold, unbeknownst to anybody, dissidents across the Eastern Bloc and human rights activists across the West start holding the communists to account for the agreements that they have signed and start contrasting the liberation that communism promises versus the dictatorship actually delivered. This human rights movement within the Soviet bloc and abroad took on a life of its own.

Here you have the former director of the CIA and former head of the Department of Defense, Robert Gates, saying, “The Soviets desperately wanted this big conference and it laid the foundations for the end of their empire. We resisted it for years only to discover years later that this conference had yielded benefits beyond our wildest imagination.” Go figure.

Here is Jimmy Carter with his human rights initiative. It was Gorbachev’s English language translator who said that Carter’s emphasis on precisely the human rights that were denied to Soviets really resonated and it made people think that they wanted a more democratic, open, liberal society. Here’s Carter giving a graduation address at Notre Dame. He said, “We have reaffirmed America’s commitment to human rights as a fundamental tenet of our foreign policy. What draws us Americans together is a belief in human freedom. We want the world to know that our nation stands for more than just financial prosperity. We’re bigger than that.”

And here is Eduard Shevardnadze, Gorbachev’s foreign minister, echoing some of these sentiments. He said, “Look, the belief that we are a great country is deeply ingrained in me, but great in what? Territory? Population, quantity of arms, people’s troubles, the individual’s lack of rights? And what do we, who have virtually the highest infant mortality rate in the world, take pride in? It’s not easy answering the questions. Who are you? Who do you wish to be? A country which is feared or a country which is respected? A country of power or a country of kindness.”

Others agreed that communism was essential to the survival of the Soviet Union, but it’s an undemocratic ideology. Fundamentally, it’s a foundation that can’t endure forever. That’s the take of Vitaly Ignatenko, who’s a Russian journalist. Oleg Grinevsky, who’s a Soviet career diplomat, is saying, “Look, communist ideology is associated above all with the Soviet Union. Its rejection created a vacuum and it determined its ultimate fate.” Boris Yeltsin, who is Gorbachev’s successor, said, “Look, no one wants a new Soviet Union.”

So some would argue, this counter-argument, that human rights clauses of the Helsinki Accords and Carter’s subsequent human rights campaign destroyed communist belief in communism. Okay. Another president, another counter-argument. Those who are fans of Richard Nixon would say, “No, no, no, no, no. It was Richard Nixon who played the China card so the United States and China could gang up on the Soviet Union and overextend it financially to wreck it militarily.”

I think the Chinese would beg to differ and say, “No, no, no, no. It was Mao who played the America card.” Because what’s going on in 1969? There’s a border war between China and the Soviet Union. China’s gotten its nuclear bomb in ‘64. It no longer has to defer to the Soviet Union and starts playing more tough on their border disagreements. So the Soviets are really upset. They come to the United States and ask us whether it would be okay to nuke these people, because they think Americans don’t like the Chinese. Well, we didn’t, but we said, “No, it’s not okay to nuke those people.”

So the Chinese figure it out. The one that wants to nuke you is your primary adversary, right? Up until then… Think about it, China and Russia, for them the United States was the primary adversary. Now they’re primary adversaries with each other, freeing up the United States to decide which one it’s going to cozy up to. And the United States decides it’s going to cozy up to China.

Why? Well, Chinese belligerency forces the Soviets. Not only have they already got a big militarized border with Europe, now they’re going to do the same thing on a very long border with China. These are nuclear-armed mechanized forces, very expensive. Imagine if this country had to have such borders with Canada and Mexico. It would be bankrupting, and we are far richer than the Soviet Union was then, whenever. It was bankrupting. So some would argue that US cooperation with China fatally overextended the Soviet Union.

One could take all of these arguments, starting with President Nixon all the way through Reagan, to make an overarching argument that says, “Look, each president opened up opportunities for the others who then leveraged them.” So Nixon plays the China card, which others play with increasing dexterity. Ford comes in and begins dabbling in human rights. Carter then comes in and really goes for human rights and starts doing a military buildup, which then Ronald Reagan really does. So that by the time you get to Reagan, he is dealing in a position of both ideological and military strength vis-à-vis the Soviet Union.

For those who think that US foreign policy was not consistent during the Cold War, you’re not looking at it at the strategic level. There were certain different strategies going on and how best to achieve it, but both parties agreed the goals were free trade, democracy, containment of communism. Those were staples of US foreign policy, for both parties, for its duration.

So some would argue that Presidents Nixon through Reagan produced the cumulative presidential effects to defeat the Soviet Union.

Okay, others would say to forget this great man theory of history business, that’s really passé. What really accounted for the outcome of the Cold War was this military platform, that’s Pentagonese for large military systems. But anyway, it’s a nuclear-powered, nuclear-armed submarine. They say that this is the item.

The way deterrence theory worked during the Cold War, and I believe now as well, is that in order to deter the other side, you have to have a reliable second-strike capability. So if they thought of lobbing a nuke at you, they would be guaranteed that you would have the second strike to lob a nuke back. Therefore, they’re never going to lob the first nuke.

When Jimmy Carter became president, he was a graduate of Annapolis and also a submariner. The United States began a much more aggressive deployment of its fleet and that’s continued even more so under Reagan. We’re taking our submarines and we’re targeting Soviet submarines in their home water bastions. So the Soviets are thinking that we’re going to be able to destroy their second-strike capability on our first strike and they’re having a heart attack.

So here you have Valery Boldin, a longtime aide to Gorbachev, saying, “Look, the most powerful strength of the United States is the naval fleet and we aren’t going to get one, or our geography isn’t set up to use one the way the United States can.” And then you have Marshal Yazov saying, “For the Americans, the main means of atomic attack is the fleet.”

So then you get Marshal Akhromeyev, who’s visiting the United States in 1987. At the end of the Cold War he will kill himself, but he’s still around in ‘87. He’s telling his American hosts, “You know where our submarines are, but we don’t know where yours are. It’s destabilizing. You, you the United States Navy, are the problem.” Go Navy. And here’s his host, Admiral Trost, who’s going, “Yeah, the inability of the Soviet Union to maintain a strong defensive capability led to the demise of the Soviet Union and to the removal of the Soviets as a major threat to us.”

So you can make a perfectly good argument to say the Soviet Union could not counter technologically or financially the US submarine threat to its retaliatory nuclear forces, so war termination was the only thing it could do.

All of these preceding explanations are navel explanations, spelled with an ‘e’, as in staring at one’s own. They’re all about what the United States did or didn’t do. So let’s get beyond the half-court tennis of Team America. You need to look at the other side of the net. This is where the Western guru for things military, Carl von Clausewitz, emphasizes reciprocity in war and the interaction of both sides. You’re not going to do well unless you consider what the other side is doing.

00:15:53 – Eastern Bloc uprisings & oil crisis

So I have given you some external explanations and I’m going to do the internal ones. Here is Arnold Toynbee, he’s one of the finest historians of the 20th century. He wrote a big multi-volume history of the West, in which he argues that civilizations die from suicide, not by murder. So I discussed the murder, what the United States tried to do to the Soviet Union. Now I’m going to talk about the suicide, what the Soviets did to themselves. And here is counter-argument number one. The Soviet Union was an empire, and when that collapsed, that meant they lost the Cold War.

During the Cold War, the Korean War and the Vietnam War, there was much fear in the West of this domino theory. The idea is one country falls to communism, then the next and next and next and next would fall to communism. Turns out the domino theory did not apply to capitalism. It applied to communism because once the democratic contagion hit one Warsaw Pact country in Eastern Europe, it spread to the others until it was a seething mess and they fell like dominoes.

So in 1988-89, there were all kinds of demonstrations in the Eastern Bloc and the Soviet Union. In the Soviet Union, they’re for political freedoms. In the Eastern Bloc, they’re for freedom from the Soviet Union. Gorbachev may not have gotten that detail. They’re all about not only wanting political freedoms, but also they’re about crumbling economies and how to fix their miserable standards of living. Very uncharacteristically, the Russians didn’t send tanks. In fact, Gorbachev welcomed and encouraged reforms in the Eastern Bloc, both political and economic, just as he was doing in the Soviet Union. So his ideas of glasnost, openness, and perestroika, rebuilding, resonated at home and abroad.

These reforms began in Poland. Poland had been a scene of much worker unrest many times, in 1956, 1970, 1976, and 1980 and 1981. In 1981, this is when Solidarity, the workers movement, gets going and it gets a national and an international reputation. The next set of strikes are happening in 1988, because in the preceding several years, the Polish standard of living had shrunk by over 3%.

The government was out of cash and wanted to raise basic food prices, and Poles hit the streets. The government was in a panic, because it was worried the economy would go into free fall. So the government cut a deal with Solidarity. They said, “You call off the strikes and then we’ll let you into political talks,” and Solidarity agreed. There was a complicating factor on all of this. It’s called the Roman Catholic Church, which is an institution of enormous credibility and legitimacy in Poland, which had a partiality for Solidarity and it had a Polish pope.

So the roundtable discussions were these political talks. They occurred a year later in February 1989, and the Soviets encouraged them. In fact, here’s one Soviet person there advising the Poles: “Look, you’ve got to find some quick solutions out of your economic and political mess. You’re an itty-bitty country, so when you make mistakes, they’ll be itty-bitty mistakes. But if we make them, they’ll be big.” They got that one right.

The Polish Communist Party thought they had this one covered by the way they jiggered the election rules. Not quite. The day they held elections is exactly the same day that Deng Xiaoping turned the tanks on demonstrators in Beijing and you have the Tiananmen Massacre. Two solutions for the problem. So the way the elections worked out in Poland is that Solidarity won every single seat for which it could compete but one. And then only three people in the communist-designated seats actually won. So who won all the rest of them? The box on the ballot called “none of the above.” Yes, the Roman Catholic Church had helped instruct people that that’s the box you want. With that, the legitimacy of the Communist Party to rule had just been wrecked and we’re on to democracy in Poland.

This democratic contagion then spread into East Germany four months later. This is about the 40th anniversary of the founding of East Germany. 70,000 people demonstrated in Leipzig. Within the week around like 1.4 million Germans are demonstrating in over 200 demonstrations. Typically, the East Germans would have sent tanks. That was what they would have done in the past. But would-be tank man Erich Honecker was already out of a job. His ruinous policies of living off debt since he came to power in 1971 had just about wrecked East Germany. So he was out.

Then less than two weeks later, the Council of Ministers resigns. Then on November 8th, the Politburo resigns. Then on the 9th, whatever is left of that government is issuing new travel regulations. You might wonder what travel has got to do with it. I’ll get there.

So in response to a question at a news conference, this guy, Günter Schabowski, who was one of the remaining communists helping run the show, gets asked a question and he doesn’t know the answer. So he wings it. The question is, “When do these travel regulations go into effect?” And he goes, “Immediately.” Well, crowds immediately started gathering at the six gates to the Berlin Wall. At one of them, the border guards decided that discretion was the better part of valor, and they opened the gate and East Germans poured into West Berlin.

Within the first week alone, over half of East Germany’s population visited the West. Within the month, 1% of the population emigrated to the West. Like the Polish elections, this opening of the gate was a pivotal decision. A pivotal decision, whatever it is, means there’s no going back to the way it was. Here’s good old Günter going, “Gosh, we hadn’t a clue that opening the wall was the beginning of the end of East Germany.” Okay, better luck next time. And the Russians were shocked by how unpopular they were. They were thinking they were going to get credit, Gorbachev, for Eastern Europe’s liberation rather than blame for Eastern Europe’s enserfment.

Here you have Yuri Ryzhov, a scientist and parliamentarian going, “All of our former satellites by compulsion cast off from us as fast and as far as possible.” And Anatoly Kovalev, who is a deputy foreign minister, said, “Look, we had no confidence whatsoever concerning whom the East German army is going to shoot, the demonstrators or us. And the same thing for the Polish and Hungarian armies.” Great. With allies like this, who needs enemies? The allies kind of cover it. So this argument says unrest in the empire forced the Soviet Union to forfeit the Cold War.

Okay, I got another counterargument. It says, “Nonsense, the real problem was that the satellites were unhealthy. That’s why the whole thing fell apart.”

So this map is 1960. You see all those tempting green places. They’re about to become independent, and they are really sick of their Western European colonizers. Enter the Soviet Union with a program to put the West out of business. There were many takers.

Fast forward to the late 1980s. The Soviet Union is on a roll. Small hitch, in the late 1970s there was a big recession that continued into the 80s and tanked commodity prices. For some of the newfound pals like Angola, South Yemen, Ethiopia, Nicaragua, it wrecked their export earnings because they’re exporting commodities and these commodity prices are down. In many cases, it halved them. The Soviet Union was really dependent on oil exports, still is. Oil prices tanked and oil accounted for up to 55% of the Soviet budget. So here Brezhnev has got a deep bench of non-performing pals at a time when he doesn’t have the money to support all of them.

Worse yet from the Soviet point of view, it’s dumped all this money in these Third World friends but meanwhile, it’s got its own nationalities who are deeply unhappy and they want out of the empire. Most problematically, they all revolt at exactly the same time. One of the rules for continental empire is “no two-front wars”. Russia has so many fronts at this point, it can’t even keep count.

The unrest in the internal empire of nationalities started as soon as Gorbachev got in. There were student movements in Kazakhstan and Yakutia, opposite ends of things. By the time you get to 1990, there are like 76 seething ethnic rebellions in different parts of this. There was too much going on for the Soviet government to handle. So you could argue that the Soviet Union bankrupted itself with the Third World while ignoring its own internal Third World of nationalities, whose simultaneous revolts brought down the Soviet Union.

I got a completely different argument for you. If you don’t like all of those, I got another one for you. It’s the economy, stupid, right? That line. One could argue that communism failed as an economic system. If you look at growth statistics for the Soviet Union, they’re pretty good post-World War II when they’re rebuilding, but they really stagnate from the mid-70s onward. For the decade preceding Gorbachev’s coming to power, Soviet growth stats were one to two percent lower than those of the United States, and the compounding effects of that were enormous.
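
To make the compounding point concrete, here is a minimal sketch in Python. Only the one-to-two-point gap and the ten-year window come from the lecture. The 3% baseline growth rate for the United States and the equal starting sizes are assumptions chosen purely for illustration, and the function name is hypothetical.

```python
def relative_size_after(years, us_growth, gap):
    """Size of the slower economy relative to the faster one after `years`,
    assuming both start equal and the slower one grows `gap` less per year."""
    slower_growth = us_growth - gap
    return ((1 + slower_growth) / (1 + us_growth)) ** years


# Assumed baseline: 3% annual US growth (hypothetical, for illustration only).
for gap in (0.01, 0.02):
    ratio = relative_size_after(years=10, us_growth=0.03, gap=gap)
    print(f"gap of {gap:.0%}: relative size after 10 years = {ratio:.3f}")
```

The ratio shrinks every year the gap persists, so the shortfall keeps widening the longer the stagnation lasts.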

What’s going on? Everyone’s lying to each other. The data that Soviets are using is garbage. If you’re working for a subunit of an enterprise, you have to lie about the inventories you have, saying you have less than you do, and then you have to lie about what you need, saying you need more than you do because you’re worried about getting enough things. It’s not a market system where the price dictates it. This is all about the plan. You’ve got to enter the right numbers and then you get whatever inputs you get from the centralized plan.

So everyone’s lying. They’re aggregating all the lies. The higher up the food chain you aggregate these things, the worse the data is, so that the Soviet government has no idea what the actual value of capital or labor are. It has no idea what actual productivity is, and no one has any idea what consumer preferences are. You’re not using markets and prices. The misallocation of capital and labor goes unnoticed until it metastasizes into a catastrophe.

To give you a sense of these misallocations, the Soviet Union was rotting 20 to 40% of its crops. It’s using scarce hard currency for agricultural imports to make up for those crops, a total mess. You can look at what happens to the economy with oil prices down. We’re into a spiraling mess, so that from when Gorbachev comes in in ‘85 to when it hits a trough in Russia in 1998, you see this crashing share of world GDP by the Eastern bloc. If you look at Soviet statistics on deficits, trade balances, debt, they’re just soaring, and then GNP growth goes double-digit negative. That’s called shrinkage. It’s not the normal thing.

Marshal Yazov, here’s his take: “We simply lack the power of all these wealthy NATO nations. We had to find an alternative to the arms race.” And here’s a foreign service officer, Anatoly Adamishin. He said, “Look, our problems began with the departure from isolation. The main reasons for collapse were internal, not external reasons. The Soviet economy was literally exhausted from this monstrous arms race, militarism, enemies with half the world.” That’s his take. Gorbachev told the Central Committee, “Look, we’re encircled not by invincible armies, but by superior economies.” He often told people, “Living this way any longer is impossible.” So you can make a powerful argument that it’s the Soviet economy that lost the Cold War.

00:30:37 – Gorbachev’s mistakes

This gentleman, Alexis de Tocqueville, is very famous for writing a book about the last days of the French monarchy before the French Revolution overturned it. He also wrote something about Democracy in America, both excellent books. But this one’s from the one about France, where Tocqueville observes, “The most dangerous moment for a bad government is when it begins to reform.”

Russians of all political persuasions agree on at least one thing. That is that Gorbachev’s role in how the Cold War turned out was pivotal, that he played a very essential part. Gorbachev made his decision based on certain false assumptions. One of them was the irreversible direction of history. Gorbachev thought of history going always forward towards communism, never backwards to capitalism. Of course, Eastern Europe took a U-turn, went straight back to capitalism. Here is Leonid Shebarshin, who is a senior person in the KGB, their intelligence office. He said, “The thought never occurred to the government that it’s possible to withdraw from socialism.”

If you think about both communist theory and how imperialism works in practice, usually the mother country is more developed than whatever all the colonies are, right? Well, the Soviet Union was an inverted empire. People in Eastern Europe as a group were better educated and richer than Russians. It was like a donut empire. So when the empire went to Eastern Europe, Russians could no longer siphon off the wealth of these enserfed populations in Eastern Europe, which explains why they wanted to leave. It also suggests why Putin wants them back.

Another false assumption has to do with the sentiments of the neighbors. Gorbachev was convinced he was going to get credit for liberating Eastern Europe, rather than blame as a Russian for having enserfed them in the first place. For Gorbachev, the clock began on his watch. For other people, no, Stalin’s when it began, when he started shooting a lot of people.

Here you have Anatoly Chernyaev, foreign policy adviser to Gorbachev, saying that Gorbachev thought that bringing freedom to our Eastern European satellites would have them adopt socialism with a human face. “He made an enormous mistake because these countries brutally turned their back on us.” Really, if that’s brutal, then what pray tell was Stalin? And then it gets better: “The politics in connection with our former friends were totally unexpected to us.” Really? You occupy people, you never leave, you shoot a lot of people in their government, you put in a new government, you siphon off a lot of their wealth, and you impose a non-performing economic system, and you wonder why they don’t like you.

Think about the United States. It intervenes all around the world in other people’s troubles. It dumps billions in economic aid and even leaves and people don’t like us. I don’t know why the Russians think they’re so special.

Another false assumption: Gorbachev believed that if the Warsaw Pact, the military alliance of the Eastern Bloc, disappeared, then NATO would disappear. He also believed that if the Comecon, which is their trading organization, went away, then the European Community in those days—it becomes the European Union later—would disappear. Not quite. It turns out that organizations that are coercive versus those that are voluntary, they dissolve for different reasons.

And then Gorbachev also assumed that the United States would share a continental outlook of not wanting strong powers and that the United States therefore would not want a unified Germany, let alone a strong unified Germany. So when all the unrest is happening in Germany, Gorbachev is off taking a vacation. Poor life choice, because at that moment, President George Bush Sr. and Chancellor Kohl of Germany are working on fast-tracking German unification of a fully sovereign, unified Germany—both halves in NATO.

So many of Gorbachev’s closest supporters at the end of it all blamed him. They said, “Look, his foreign policy mistakes were a function of his domestic policy mistakes and it destroyed the Soviet Union.” Back to this America expert, Vladimir Lukin: “Gorbachev was no Deng Xiaoping.” And Arbatov, who’s their premier America expert: “The stupidity of our leaders caused the disintegration of the Soviet Union.” So the big bozo was playing with plastic bags, stuck one on his head, committed suicide. It was by mistake. Lukin continued: “In the West, they love Gorbachev because everything took place so easily and cheaply, basically like that, but only for you. For us, it was expensive.” But you could argue the time to reassess all the Stalinist stuff was long overdue.

Here’s a completely different way of looking at it. I’ve been giving you sins of commission, and now I’m going to do sins of omission. It’s a good framework. It’s useful for other things. The sins of commission are all the things Gorbachev did. Now what I’m going to do is what the army didn’t do. Some would argue that the Red Army should have done exactly what Deng Xiaoping ordered his army to do. You just send the tanks against civilian demonstrators and they truly crush them and it’ll be over. Communist Party is still in power in China 30 years later. So there are some people who believe that this was a terrible mistake.

So this argument would be that timely tank deployments—TTD, my contribution to military acronyms—would have changed the outcome of the Cold War. Others would be back to the great men of history and sins of commission, and they wouldn’t be picking on Gorbachev but his successor Boris Yeltsin. There are two big pieces of evidence when we look. He removed Article 6 from the Soviet Constitution, which guaranteed that the Communist Party would always monopolize power. And then in addition in the following year, Yeltsin’s the head of Russia, he gets together with the heads of Ukraine and Belarus, and they signed the Belavezha Accords, which then formally dissolved the Soviet Union. So according to this way of thinking, it’s his fault. It’s suicide on purpose. And what it does is it opens the door for multiple parties and for the nationalities within the Soviet Empire to become independent.

00:37:33 – German unification and NATO expansion

So I’ve given you internal explanations. I’ve given you external explanations. Now I’m going to give you some umbrella explanations. They’re based on all the preceding evidence, and they come to opposite conclusions. The first one was, well, any of the above, it’s inevitable. The opposite conclusion from the same evidence is that no, it took all of the above. The West barely won.

I’m going to start with “any of the above”. You could argue with this many serious problems, it was a matter of time before the Soviet Union collapsed. It was an objectionable system for precisely the reasons the West didn’t like it. It had a brutally inefficient economic system. Russians who invented the thing, at the end of the day, didn’t want it either. By this way of looking at it, you have people like Yuri Ryzhov, a genuine rocket scientist, who says, “Look, the main reason for the collapse of the Soviet Union is the rottenness of its system.” Then here’s a journalist, Teimuraz Stepanov, who said, “Look, I think from the beginning the genes of disintegration were contained in the genetics of this governmental political formation.” Don’t you love the products of the Soviet educational system? Don’t ever use wording like that.

So you could argue that the Soviet Union was destined to fail with this many problems. Others would come to the opposite conclusion. They would say, “No, it took every single one of them for the Cold War to end on Western terms.” Back to Anatoly Kovalev, the deputy foreign minister, he said, “Look, all these factors merge—internal, ideological, economic, military—it’s all of them. You remove any one of them and you get a different outcome. Maybe the Cold War ends, but it might end completely differently.” So by this line of reasoning, the West barely won and should feel very fortunate that it did.

One can take this last argument and say it was more than that. It also took the confluence in office of two very talented leaders: Helmut Kohl of Germany and George Bush Sr. of the United States, not the son who got into those forever wars, but the dad who didn’t. George Bush Sr. had one of the most amazing resumes of any person ever to become president of the United States. Just look at him. When he’s really young, he’s a war hero in World War II. He’s a Navy pilot, a dangerous thing to do. He did it. Then he comes back and he gets his BA at Yale and graduates with honors. Then he becomes a representative for this district in Texas after he’s already made himself a millionaire in the oil business that he started. Then he became ambassador to the UN, followed by US representative to the PRC, before we had formal diplomatic relations. So he’s the guy who’s setting that up. He becomes director of the CIA, and then he is Ronald Reagan’s understudy for eight years as vice president. He is incredibly fit for the job.

Helmut Kohl is equally fit for the job. He is the longest-serving chancellor in German history since his illustrious predecessor, Otto von Bismarck. He starts out getting a PhD in history and political science. He also starts out in business, but then he works for state government, initially as a representative, then as a governor. He becomes chairman of his political party, the Christian Democratic Union, for a quarter of a century.

Once he gets in, he decides he’s going to buy up East Germany one tourist at a time. How does that work? East Germans, it turns out, really like to travel. West Germans had always been able to travel to East Germany, or they long had been able to travel to East Germany, but East Germans definitely could not easily travel to West Germany. Why? Because they have a habit of staying. But all of a sudden, East Germany eases up on the travel regulations. You might ask why, and the answer would be money. Just like the Poles, the East Germans were deep in an economic mess of their own making.

Would-be tank man Erich Honecker, who got the boot at the very end, well, his staying-in-power paradigm that he implements in 1971 is that he’s going to live off debt. He needs to make certain social benefits available and consumer benefits available for labor stability, to not have labor unrest. The way he’s going to do that is he’s not going to do many domestic investments and he’s going to do a lot of borrowing, particularly from West Germany. Well, that’s unsustainable long-term. By the time you get to the end of the Cold War, if he’s going to fix that and even out the accounts, it would be a 30% decline in the East German standard of living. So he really needs the pocket change from the tourists.

So what Kohl does is a brisk business of tourists and things. What he does in return for the easing of travel restrictions, he pays East Germany several hundred million Deutschmarks extra to allow that to happen. And then he gets the Hungarians to go along. He gets the Hungarians to open up their Austrian border to let East Germans out that way, and he gives them a half a billion Deutschmarks for that little favor.

When Kohl introduces his 10-point unification program—because now he’s thinking he’s going to get both Germanys together—this is when he starts doling out big bucks to the Soviet Union, whose economy is unraveling. Gorbachev is going to be desperate for this cash as that’s happening. So West Germany provides 100 million in food, especially in meat, for the Soviet Union that doesn’t have these things.

Nevertheless, the unrest just keeps on going. The Berlin Wall, as I’ve told you, is breached, and then you wind up with a West German caretaker government, and the financial situation in Russia itself is unraveling. By the time you get to January 1990, Bush and Kohl get together and they decide they want to really fast-track German reunification. Why? Because they’ve got to get it done before this unraveling crisis causes Gorbachev to fall from power. So they have got a game going, the two of them. It’s complicated. Here’s why.

Gorbachev was dead-set against Germany, a united Germany, in NATO. He’s not really keen about a united Germany, let alone one in NATO. The US State Department experts, the guys who know everything, are saying, “No, you want to go slow on this unification business.” Kohl is also running a coalition government. There are people in that government he cannot fire because they’re from different political parties. One of them is his foreign minister, this guy Genscher, who is very skeptical about Germany being part of NATO. Then it turns out, although Britain had talked a good piece during the Cold War, it didn’t actually want a unified Germany, nor did France. Why? Because that unified Germany would eclipse them economically. They didn’t want that to happen.

So Kohl and Bush divide up the tasks. Kohl is going to reassure the Soviet Union that Germany is not going to be belligerent or do horrible things. And Kohl is going to work on financial unification because the Soviets are thinking in terms of military unification. You know, where you deploy your troops. That determines things. Wrong instrument of national power, precisely because the Soviets didn’t understand finance. That’s why they’re in such a mess. Whereas the Germans do. What they’re going to do is get East Germany on the West German Deutschmark, and at that point they will control all the money and they will control decisions. But the Russians aren’t going to see that coming.

Meanwhile, Bush is supposed to work the alliances particularly with Britain and France in the West. There are all sorts of meetings that are coming up. Bush’s job is to delay those meetings for as long as possible so German unification can proceed as far as possible. The two of them are doing a tag-team diplomacy with Gorbachev that he just can’t keep up with, given that his own home economy has got these double-digit shrinkage rates.

Here’s how they go. As the trades get bigger, the amount of money you pay Gorbachev gets bigger. First of all, it’s just to get a unified Germany. Then it’s to get a unified Germany with West Germany still in NATO. Then it’s to get a unified Germany with all of Germany in NATO. So here’s how the money goes. Gorbachev agrees to German unification. We are no longer paying hundreds of millions of Deutschmarks. We’re paying billions of Deutschmarks, five billion Deutschmarks for that one. Then Gorbachev agrees that states can choose their own alliances, i.e. whether or not to join NATO. The US offers nine assurances, but it’s also a trade agreement that Gorbachev really wants. Then the economic union goes into effect.

So we’ve now done the financial reunification of Germany. This is when there’s a London Declaration that’s inviting Eastern European countries to coordinate more closely with NATO. In return, Gorbachev gets a promise of a G7 summit meeting that’s going to fast-track aid to him, which it will do. And then Gorbachev agrees to German NATO membership.

At this point, even bigger things are happening. Germany’s going to agree to its border with Poland. I’ll get there and explain. Germany provides 15 billion in Deutschmarks, including building all kinds of new apartment buildings for repatriated Soviet soldiers who are going home. Why are you doing that? Because you want those soldiers focused on buying furniture, not running a military coup. That’s what they’re doing.

So the unification happens in mid-September 1990. Here’s the Polish borders. At the end of World War II, Stalin moved Poland 200 kilometers to the west, and it winds up taking a third of German territory by the time that’s all over. So the Germans don’t really want to sign all that away. In addition, as part of that, there were 12 million German refugees who were thrown out of wherever they were living to send them back to Germany, of whom 2 million died. So this is a big deal and it’s in living memory. Germany agrees to this, that the borders are done. German-Polish borders are set.

00:48:31 – The Gulf War and the Cold War endgame

Complicating factor: a month and a half before this unification treaty is signed, Saddam Hussein decides he’s going to invade Kuwait because he’s broke. He’s had a long war with Iran, huge debts, many owed to Kuwait, which he doesn’t want to pay back. So if you invade them, that solves that problem. Also, he would take over Kuwait’s very rich oil fields, and together that would make Iraq probably the swing producer of oil. So he thinks that’s a great idea.

Except the Cold War’s over actually. The Russians are more than willing to cooperate with the United States. Gorbachev really needs more money, and he is willing to go along with getting Iraq out of Kuwait, but not with regime change in Iraq. Because think about it, Iraq is a very important debtor state to the Soviet Union. It owed them between $10 billion and $13 billion. That’s a lot of money for a broke creditor.

But Gorbachev is being extraordinarily cooperative with Bush Sr. He sends Yevgeny Primakov on multiple missions to Baghdad. The first one, Primakov gets all Russian hostages out of Iraq. Then on the second trip, he gets all Westerners out, Americans included. Third trip, not so lucky. He’s there for the coalition force bombing. I don’t think he liked that very much. But imagine that bombing going on if there were Western human shields going down with every target. Russia took that card right off the table.

Here’s some of the reasoning. Sergei Tarasenko was an aide to Foreign Minister Shevardnadze, and they understood that the United States was going to do something about this invasion of Kuwait. So the Russians thought, “It’ll be better if we force all of this to go through the UN, where Russia has a veto power.” He said, “Look, there was a division of roles.” It extends to China, the help that Russia provided. “When the Americans asked us to work with the Chinese, we told the Chinese, ‘Think about it. You’re one of the big five with veto power. Doesn’t it suit your interest to funnel everything through the UN where you can put your foot down?’ And the Chinese came around to that idea.”

However, the Russians had red lines. Here’s Anatoly Kovalev again, the deputy foreign minister. The red line is, American troops stay out of Iraq. No regime change in Iraq. You do that and you will tank the termination of the Cold War. And that would be the goal. Here’s Kovalev saying, “I advanced the basic principle that we must support the territorial integrity of Iraq. This was our sacred position. We must not permit a division of Iraq.”

So if you wonder why the ground war ended after 100 hours, this is it. The big thing out there is war termination of the Cold War. That’s the big thing. Saddam Hussein is a minor event over there. Sorry, but he was. If it had tanked Cold War termination or upset the reunification of Germany, France and Britain might have been very happy, because François Mitterrand, who is the president of France, and Margaret Thatcher, prime minister of Britain, were against German unification. They knew it would marginalize their own country. Germany’s going to be a bigger economy, which it is.

François Mitterrand eventually found solace in expanding the European Community to the European Union when you’re incorporating all these Eastern Bloc countries into it. He plays a really important role in concluding the Maastricht Treaty that forms the European Union. But Margaret Thatcher just plain lost. She was just upset about the whole thing. She said, “Germany will be the Japan of Europe and worse than Japan.” I guess she hadn’t been to Japan lately. She said, “The Germans will get in peace what Hitler couldn’t get in war.” She wanted to leave Red Army troops in Germany for the duration. Imagine if that had been the case and now dealing with Putin… If he had troops in Germany, we would be in trouble.

But Bush and Kohl worked around all of them. Bush said to Kohl at the end of it, “Look, I’m not going to beat my chest and dance on the Berlin Wall.” Both of them were very careful never to humiliate Gorbachev about the Soviet loss of the Cold War. Why? Because they knew that if they did that, he might fall from power sooner rather than later. Also, they were afraid that if they did that, the hardliners would come to power much more rapidly than they actually did. It was 20 years before Putin started consolidating his power.

The newly independent countries of Eastern Europe needed those 20 years to integrate militarily, politically, economically with the West so that the cement could set before you got the Russians trying to destabilize them. So they bought them 20 years to do this. But there’s a cost to all this. Bush never got credit for his essential role in ending the Cold War on Western terms. So he was not reelected for a second term.

Anyway, when it came time for Nobel Prizes and why the Cold War ended, Anatoly Adamishin, this Soviet Foreign Service officer, said, “Look, it’s difficult to deny the Soviet Union was the one that ended the Cold War.” And Edwin Meese, who was a counselor to Reagan and also his attorney general, said, “Look, the Cold War began because of Soviet policies and it ended in a sense because of Soviet policies.” The Nobel Prize Committee agreed. They awarded the prize to Gorbachev, not to Bush, for his role in liberating Eastern Europe.

So when you’re thinking about this question of why Russia lost the Cold War, I hope you will come up with a more complicated answer than, “Well, Ronnie did it.” There are probably other causes at work as well. Anyway, thank you for your attention. That’s what I have for you this evening.

00:56:10 – How central planning survived so long

Dwarkesh Patel

Sarah, thank you so much for doing these.

Sarah Paine

Thank you for having me. That would be the more important thing.

Dwarkesh Patel

There’s an interesting question of why the Soviet Union collapsed when it did. I think the even more interesting question is why a system that was so centrally planned, monstrously inefficient, brutal, a colonial land empire, how such a country could survive for so long into the 20th century. I feel like that’s the thing that actually needs explanation. How did this regime last for 74 years?

Sarah Paine

There are loads of dysfunctional places all over the planet that have been dysfunctional forever. You look at, well, why are they dysfunctional? To me, the answer to that one in a way is the example of North Korea. Of all countries that should fall, it’s a place that has ongoing famines in the 21st century, and it used to be the richest part of the Korean Peninsula.

These authoritarian regimes are really good at maintaining coercive powers. Think about it. In order to educate someone, it takes years as a parent to bring up a little person and then you get them educated and maybe they’re an A-list politician. It takes seconds to assassinate them. It’s the asymmetry between construction and destruction. Destruction is so easy. Dictatorships are all over the world. It’s a sad part of the human condition. They clearly know what they’re up to.

In the case of the Soviet Union, there were multiple intelligence organizations. That’s what Stalin was using to keep track of everyone. So you want to monopolize information so that you know more information than other people. And then they have a whole bunch of people who are the winners of the nomenklatura, the elites there. You make sure you pay all of them off. I mean think about it. Human societies, slaves, serfs… We humans have been doing these things to each other for a long time.

Dwarkesh Patel

So dictatorships can certainly sustain themselves for a long time. But the Soviet Union was special in that by the 60s and 70s, they had a GNP that’s 60% of America’s, this incredibly dynamic economy. In the 40s and 50s, they had much higher growth rates, so much so that prominent economists like Paul Samuelson are saying that by the 90s, based on what they’re seeing at the time, the Soviet Union will have a bigger economy than America.

This is just quite surprising that they would have such high growth rates. If you just think about how central planning works, people are going to tell you how much steel you can make and which company gets to use the cotton fabric and cement, etc. You have hundreds of millions of people living under this system. It’s actually quite shocking that they actually had notable growth rates after World War II for decades on end.

Sarah Paine

Well, first of all, it’s a war economy, essentially. You’re putting all your money into having a big military. Russians define greatness, and this is part of it, as being a big power, and it’s a military power with territory. Most countries in wartime mobilize for the military. This country did it in World War II. All kinds of rationing, we’re not using market prices. You’re setting different prices, giving people ration cards and things. The thing about the Soviets is they kept it forever. They never got rid of it. So that’s one piece.

Another problem with the Soviet Union is all of the data. So I don’t know what data you’ve seen, and I know the data I’ve seen. It’s hard to know because the Ruble is a non-convertible currency and a lot of things they measured in weight and other things. Like they’re the greatest TV producer in the world, they said. Why? Because they made the heaviest TVs in the world. I’m serious, when I was there this was it. They would spontaneously combust, which is not the normal thing a TV should do for you, burn down the apartment building.

So they’re going to measure their heavy TVs as a positive, and the Ruble is non-convertible. So there was a guy named Murray Feshbach, and I can’t remember which part of the US government he was in, but he was really good at looking at their statistics and then adjusting them. But people didn’t know. I gave you the CIA ones. The CIA, they’re not stupid people. They’ve got the best data they could find and they’re coming up with 20% of the Soviet budget is probably devoted to the military. After the Cold War is over, they’re going, “Whoops, we missed.” It’s at least double that and maybe triple. So it’s really hard to know even with the statistics you’re getting. Certainly what Paul Samuelson had wouldn’t be accurate. It’s just a guess.

Dwarkesh Patel

My favorite example of this is that there were top-down commands that you had to produce a certain amount of steel. A steel factory would then be incentivized to make thicker bars of steel rather than thinner bars because that would count as greater production, except a lot of inputs actually do require the thinner sheets. So then the other factories have to thin down the steel, but that also counts towards GDP. So producing the inefficient steel and then cutting it down to size are both counted towards GDP.

Sarah Paine

Oh, and just the whole waste of it. Like the heavy TVs, they probably have four times the inputs that they need to make them that would be good for other things. It’s this notion that you can actually plan an economy. Prices are a miracle. Good old Adam Smith, the invisible hand. Prices are the way to go and markets, it’s more efficient.

Dwarkesh Patel

I wonder if one thing that’s going on is that in the early and mid-20th century, you have economies which are much simpler, at least compared to today. So even then, obviously, command and control is less workable than capitalism. But if you just have heavy industry, you need a certain amount of cement, steel, concrete, fabrics, coal. That’s much more workable than, “We’ve got to centrally command what SaaS tools your enterprise is allowed to use.”

Sarah Paine

Oh, yeah. It’s interesting on the development thing. The communists have insisted on heavy industry. That’s the thing that they want. Forget about the consumer goods. If you look at the countries that really have made it, like Japan and the Meiji Restoration, they’re doing a lot of light industry and consumer goods. Then they move into heavy, but they’ve already got people on bicycles and they’ve got textiles and other things up and running.

That would also apply to Taiwan and Korea. They do, by all means, get heavy industry. But that’s not the starter program. The starter program is basic standard of living. Again I’m not an economist, but it turns out if you just look at who’s rich and who’s not, that seems to me a more workable thing.

Dwarkesh Patel

There’s also the fact that the centralized regime is building things according to the 30s plan. And even after post-war reconstruction, they’re still calling back on these plans from the 30s that call for heavy industry for a bygone era.

In the 70s, 80s, we had our rust belt collapse of manufacturing. People complain about this as, “Look, the US has this hollowed-out manufacturing base.” But it’s much better to have industries which are left behind so that the economy as a whole can be more dynamic and move on than the Soviet Union where the entire thing became a rust belt because they couldn’t move on.

Sarah Paine

It’s more exciting than that. Again, I’m not an economist, but apparently they missed the plastics revolution. I mean think about our own lives. Now we’re finding we have too many plastics, but plastics are an incredible material and they’re just missing that. I remember in Russia trying to figure out where to get sour cream and was being laughed at by Russians because I was so stupid in the store that I couldn’t find it. Well, we have little plastic tubs with the sour cream. Back in the late 80s, when I was there, you had to bring your glass jar with you so you could hand it over the counter so someone could take a filthy ladle and fill up your jar. I mean, this is part of not having plastics.

And then they totally missed the computer revolution. This plays into Ronald Reagan winning the military race. We’re putting these chips and things into our ballistic missiles and they can’t do that. And that’s a problem.

Dwarkesh Patel

Speaking of plastics, I didn’t realize before preparing for this lecture the overwhelming role that oil played in first explaining why the Soviet Union was able to sustain itself for so long and then why it collapsed. By the late 50s, Soviet growth rates were already starting to go down, especially compared to the postwar boom that America is experiencing. In ‘59, they discovered these massive oil fields in Siberia.

And then from 1973 to 1985, I think, 80% of the Soviet Union’s hard currency earnings were just from oil. They use this because central planning can’t produce even grain, let alone advanced technology. They use this to import a bunch of stuff to sustain the Red Army, to sustain the population, to subsidize Eastern Europe. And then of course, prices collapsed in 1985. Do you think that if the Siberian reserves weren’t found in the late 50s, that it’s possible that the Soviet Union would have collapsed 30 years prior?

Sarah Paine

I don’t know, but they wouldn’t have been able to do all the Africa program and things. It just would be too expensive. So certainly it would have been a reduced thing. It’s also the gas reserves they got up in like the north central Soviet Union. I can’t remember the places, but this is the gas that gets pumped to Europe because that’s the better place. They make those big investments and it takes a while for them to pay off. That was a big deal because they needed help from Western oil companies or whoever does the gas pipelines, compressors, whatever it is you need. There was a big to-do about that, about whether we should sell the stuff or whether we shouldn’t sell the stuff. The Europeans wanted to sell. We were trying not to. This was going on under Reagan as well.

But anyway, they had built a lot of it and it was essential to their pocket change. But then when they got all the pocket change, they never saved. Whatever the oil wealth was, they spent up to the max. Doesn’t it sound familiar? Governments, you have money, you spend it. Forget about rainy days.

Dwarkesh Patel

So after the Soviet Union collapses, there was a period when Putin was still winning somewhat free elections. So if you look at why Russia’s economy recovers and why Putin was so popular in the 2000s, from 2000 to 2008, oil goes from $10 a barrel to $140 a barrel. This goes to your point about how we give credit or blame to political leaders for often what are just long-run macro trends.

Sarah Paine

Well, what I didn’t cover is that when the Soviet Union collapses, Soviet living standards, Russian living standards, they implode and it’s a mess for 20 years. It is just unbelievably difficult.

Oh, and another piece of the brilliant Soviet management: in order to maintain control over the empire, instead of building things all in one place, you build some plane parts here, some plane parts there, some plane parts all over the empire. So when the empire goes, great, I’ve got a quarter of a plane, and then where do I get the other parts? So all of that fell apart.

When Putin suddenly has a lot of money he starts spending it on people, because initially there’s plenty of money. Russian standards of living do go up. So of course they like him, and they give him credit for all of that. But then that runs its course, right? And then it’s less good and then he’s more excited about… Well, it’s his mindset anyway. When you get more money, you want to get the empire back. And then Russians also like that, right?

Dwarkesh Patel

Speaking of the empire, Russia’s economy just had this terrible period after the collapse of the Soviet Union. A lot of the Eastern European satellites seem to recover in this gangbusters way. Obviously, East Germany. But even Poland today is such a big success story. What’s going wrong with the mainland itself that these other countries are able to recover from communism much better?

Sarah Paine

Well, they had always been much more connected to Western Europe. Czechoslovakia before the war was a full-up highly developed country absolutely tied to the West. Poland, I believe, Copernicus is from a place like Poland, right? It’s a center of the Enlightenment.

But when I was using the George Bush Sr. archives, it’s fascinating. So it’s ‘88, ‘89 when the Soviet Union’s imploding. There’s a lot of correspondence between Eastern European, particularly Polish, leaders coming to the Bush administration saying, “Hey, our banking system, we know it’s a mess. Our financial system’s a mess. We know we need expertise to help us figure out what our legal system is going to look like.” And Bush is all over that. I’m sure he farmed them out to the private sector who would also be all over that, like giving them free consulting. So as a result, you do have them really taking advantage of this 20 years.

At the same time when Bush would have loved to have given some of the same advice… There were people like Jeffrey Sachs and others who went to the Soviet Union, but it was not remotely the same thing. This is people throughout Polish society requesting this advice, not like one guy with an office in Moscow. Basically, the Russians thought they knew it all and they thought they understood. This is all the unknown unknowns, the things you don’t understand, your blind spots. Truly, economics is a blind spot for the Soviets.

Because think about it, when the tsars ran the show, it’s like a riff off the Mongol Empire. You take cuts from people’s businesses, from trade that comes through. Then it’s also about selling basic commodities. You’re not thinking, under the tsars, of Russia doing high-end manufacturing. I mean, I guess Fabergé and some jewelry if you want to do that. But really that’s not it. It doesn’t have this commercial tradition, being tied into this commercial tradition of Western Europe and all the sea routes for trade.

Then when you get the communists, they aren’t about that at all. So there’s really a dearth of knowledge. Think about this country with all the little kids selling lemonade, right? You see them on the streets. They’re already learning. The kids who are doing newspaper routes, they’re already learning about buying things, selling things at a very young age. We just take this knowledge for granted. It was just absent in the Soviet Union and not as much absent in Eastern Europe that had been more connected in.

Dwarkesh Patel

Before we get to the period of Russian collapse, let’s go back to the end of the Soviet period. Gorbachev starts instituting these economic reforms along with glasnost and perestroika. But what I find mysterious is that those economic reforms not only fail to prevent the stagnation that the Soviet Union is experiencing, but they in fact make things worse.

You would think that reform, even if it’s handled badly, would have some sort of positive impact. If you do it badly, then it’ll have a smaller positive impact. But here it just causes this huge hyperinflation, causes all these big problems. So why did reform have this backwards impact?

Sarah Paine

There’s so much that needs reforming there. But part of it, I think, is because he wanted to do political reforms. That’s what he understands. As a human being, that would be the thing that he’s very familiar with. Think about it. He’s an A-list member of the Communist Party to be the guy when they do generational change, he’s the one. So he’s obviously very astute at that level, but the problem is economics. He’s giving away political power before he’s fixed the economic problems. China’s conclusion is there is no way you’re going to touch political power. They’re going to hang onto that and then deal with as much of the economics as they’re going to deal with. That’s part of it.

But part of it is there’s no tradition for all of these things. Then you go, “Well, how did Russia get this way?” It’s a very difficult address. Prior to the Industrial Revolution, it’s flat, neighbors all invade, and so you needed a big army in order to defeat them. A big army is going to want a war economy. Historically, you’re going to want to support a big land force. I mean this is my take. Others who are actually experts on these various periods of Russian history can come up with something else. But I think you’re funneling, you’re channeling your economics into that.

Whereas you’re looking at Europe, particularly Britain, and it’s merchants. They have a big aristocracy who are not going to dirty themselves with buying and selling stuff, but there are a tremendous number of very rich merchants in Britain that are going to influence government laws and things, which is not going to take place in Russia. Then what’s nice about the Navy for Britain is you send them away. They’re not going to run a coup in the capital because they’re off on the ship somewhere. And there aren’t that many of them compared to a standing army. So I suspect, I can’t prove this, that this leads to different outcomes or contributes to them.

Dwarkesh Patel

One theory I heard that is complementary to your theory is that Gorbachev is instituting reforms because he thinks there should be decentralization and democratization, but he doesn’t fundamentally believe in the market system. So he’s delegating power to these quasi-firms. At the same time, he thinks the price system is immoral, private property is immoral. So they can’t intermediate between themselves using real prices.

So then how do these firms intermediate? Well, there’s corruption. If you can’t use actual prices and property to figure out who gets what allocation of scarce resources, you just backroom deal, which makes the problem worse.

Sarah Paine

Well, there’s no legal system and you need a legal system. Legal systems take a long time to develop. So you’re telling the Soviet Union, “Okay, communism is down and now chop chop, we need a new legal system.” It’s not going to happen.

01:14:46 – Sarah’s life in the USSR in 1988

Dwarkesh Patel

You were mentioning the problem that Eastern European countries especially had, which is that they’re going more and more into debt because they’re not able to produce globally competitive exports. They have this last-ditch effort that “We’re going to solve our problems with some technological miracle. We need to get even more over-leveraged. We’ll get some Western machinery or technology, and then we’ll be able to finally produce something that the world wants.” I’m curious up to what point this was a plausible hope. Through the 80s and even till the end of the 80s, they still believed that Czechoslovakia or East Germany or something could catch up with Western Europe?

Sarah Paine

They’re desperate. Think about it. If you’re a communist leader, how many other cards are there to play? You’re looking, “Okay, this is the only card I got.” And they’re doing other things because of the social unrest. They want to import food and consumer products because they’ve been so neglected.

Then there’s another piece, which is VCRs, the videos. All of a sudden, those things came around. I remember being in the Soviet Union, the academic year of 1988-89. One of my classmates had been an English language tutor of this person in Moscow and set me up because that was the only way to get a good meal once a week. For a meal, I would talk English for an hour.

What that family wanted more than anything else was a VCR player. You could have hard currency and buy it at the diplomatic store. So I basically got them a VCR by going to the diplomatic thing with my very limited foreign currency. I bought an overpriced VCR for them and got all kinds of meals for the rest of the year. But it meant that they could all of a sudden get Western movies.

There are things in movies where there’ll be a picture of a fugitive running by the fruit section of the Berkeley Bowl. The Russians would gasp like, “Oh.” It’s unbelievable. I think that Raisa Gorbachev, Gorbachev’s wife, when she came and visited, she must have realized that a welfare mother on food stamps had better buying power than she did by just being able to have access to Walmart.

I think the elites, as they’re traveling… I have no statistical data on this, but as you travel, it’s like I’m comparing me getting sour cream in a jar. That was the other thing, counting up all the things in a Soviet supermarket. The total was something like 77 items in this supermarket. I don’t think that compares favorably to a candy rack as you leave a 7-Eleven. And when you went by the meat section, the smell just about knocked you out, rotten meat. It was really disgusting.

I got really good at making borscht. Go to the peasant market, pay hard currency for bones, because I couldn’t afford any meat, but I could afford the bones. Then I would buy… The Russians produce really good sugar beets so I got beets. Then you’re starting to get rotten apples over the winter, but they at least come from Hungary. Russians didn’t even produce apples in those days, but Hungarians did. The Romanians provided the canned tomatoes, and I could do a credible borscht.

But you’re talking about Moscow, the center of everything. I remember buying potatoes at the market and the rotten spots felt gelatinous. So you’d have to cut those out. And then you’re wondering how many nutrients are in the rest of that potato. It was a really gross year. I remember going to the candy store and I would buy caramel from Poland or somewhere. It was like a food item because it was actually edible.

Dwarkesh Patel

At this point, I bet you were wondering why you didn’t write a biography of Napoleon so you could just visit Paris instead.

Sarah Paine

My brother’s comment is, “You’re studying Russia and China, two countries in the breakdown lane.”

Dwarkesh Patel

By the way, the point about the grocery stores having 77 items is interesting in two ways. One, central control works much better if you have a much smaller number of items to optimize over. So if things are standardized, it can work much better. And second, to your point about GDP being hard to compare between the Soviets and the United States, how do you compare a rotten tomato or a rotten potato to the Idaho ones that you can get?

Sarah Paine

They would have compared it by pound.

Dwarkesh Patel

Exactly.

Sarah Paine

Yeah.

Dwarkesh Patel

You said you were there in ‘88 and ‘89. So this is before the Berlin Wall has fallen.

Sarah Paine

I was watching the Tiananmen demonstrations on Soviet TV. The only reason you got that TV coverage is because Gorbachev was in Beijing. So all the press was there. That’s why you have the coverage. And they stayed on because the students were demonstrating and the Chinese closed society wasn’t aware of the power of television. Guys, they’re going to film you doing all of this stuff and they will get the film out.

Dwarkesh Patel

In ‘88, was the mood… Obviously things are going terribly, but did people realize that they’re only two years away, or three years away from the complete dissolution of the Soviet Union?

Sarah Paine

No. Maybe the end of the Soviet Union, but there was such optimism of thinking we’re finally going to be a full-up democratic country. It’s going to be wonderful, with no sense of the work schedules that go into a capitalist economy. To create the wealth in this country, a lot of people are working far more than 40 hours a week, particularly as they’re getting started, working enormous hours.

That was not something that was in most people’s minds. Sure, the kids who became the ballerinas in the Bolshoi are working long hours to do that. But as an economy as a whole, they didn’t understand the source of wealth and had no inkling of all the things that are missing, not least of which is that no one’s got the right education. Great, you got Marx memorized. That does you zero good.

Dwarkesh Patel

So around this time is when people are finally learning about what actually happened during the Stalinist period.

Sarah Paine

Oh, yeah.

Dwarkesh Patel

So people are optimistic that we can have a changing of the guard and maybe things will improve. But at the same time, they’re learning about how terrible their history actually was. Between these two things… Also at some point they must realize through the 90s that things actually aren’t improving. In fact, they’re getting worse. So what is the inflection point at which the mood is just...

Sarah Paine

I don’t know, because I wasn’t living there. I was thinking that there would be impending problems as a Chicken Little American. The sky is falling, the sky is falling. Americans always think disaster is coming. I sort of fit in that crowd. But I think there was a lot of optimism and exuberance thinking, “We have the freedom to really understand our history and what’s happening.”

This is for educated people, people with college degrees in Moscow and St. Petersburg. Now what’s going on in the rest of the country is undoubtedly a different story because as bad as living in Moscow was, living in the countryside was going back in time far further. So those people weren’t living well at all. And it’s going to get really bad for them.

Dwarkesh Patel

Okay, so people are learning about these things for the first time. Is the sense that they kind of suspected? I mean, people have family. They must have known, “My uncle was off in this little mining town that he was forced to go to for a decade right after World War II.” Were they totally shocked or was there some sense that things were pretty bad and now we’re just learning the extent of it?

Sarah Paine

I think there was an understanding it was terrible, but I think there’s this exuberance of thinking it’s going to get much better. Then the disappointment is equally extreme. And then there was this feeling that the West owes us because you’re all really rich and you now owe us to fix everything. The counterargument to that is, “No, you are an enormous migraine. You set back all of these countries across the globe in time with this nonfunctioning communist model that you peddled around there. And now you want extra aid.”

The problem was that we wanted to do some of the aid, but they’re not going to be receptive to it. That was another conclusion with the Bush administration, that if we dumped a lot of money in it, it would just go straight into corruption. You need a legal structure in order to place money, and they just plain didn’t have it. That was another thing that was worrying the Bush administration. There’s nowhere to put the money.

Dwarkesh Patel

Speaking of these different countries that the Soviets and the United States were competing for during the Cold War, you had this presentation where you say Reagan alone didn’t do this. But I wonder if the broader lesson is that nothing any US president did in terms of foreign policy… That was all a sideshow, this tête-à-tête competition for different Third World countries: “We’re going to get Brazil, we’re going to get Vietnam, we’re going to get Algeria.” That just seems much less significant than the fact that liberal capitalism was more appealing and out-produced communism. So even if some country, even if Brazil goes communist, this is not going to change the fundamental playing board here.

Sarah Paine

If you do not protect the liberal economies of Europe, you’re not going to have anywhere to play the liberal economic game, and also Japan. One of the reasons you feel that liberal economies work is you’ve got economic miracles going on in Japan, Korea, Taiwan, Singapore, and Hong Kong back in the day. So if you abandon those places...

Also, after the Cold War, there was a tremendous amount of economic growth across the world, particularly in the Third World. Why? Because in the past, if there’s a civil war, whoever’s losing either comes to us or comes to the Russians and says, “Help us.” So whoever it is helps, and then the other side feels obliged to help, and then you’re just destroying wealth ever more rapidly. The Cold War was anything but cold in the Third World. Tens of millions of people died in these conflicts. So when you end that, all of a sudden they can start compounding growth.

So there is a problem with not countering someone who’s going to impose communist systems all over the place. Communist systems are really good at putting dictators into power in a civil war situation. It’s very effective. That’s how Mao gets into power. The problem is, then they win the civil war, they’re in power, they annihilate the opposition, but then it produces compounding poverty thereafter.

Dwarkesh Patel

So there is this conundrum, and I genuinely don’t know the answer to this. In order to beat off these communist factions and guerrillas, we often through the Cold War had to support other dictators. Probably in many cases they were better than the communist alternative. It’s just very hard to beat Pol Pot and Mao in terms of how terrible you can be. But obviously this was in its own way problematic. Even if we didn’t have to support dictators, we had to alienate countries.

You had this previous lecture that you gave on the Indo-Pakistani chapter in history where we had to alienate India in order to fend off the Soviet Union in this little episode. I don’t know what the solution to this is. If you think that this theater mattered less, then you could say we should have just kept our hands clean of these different Third World countries. But to your point, if you want to be able to show that these countries are going to experience growth under capitalism, then you want them to not be under the subjugation of communists. But then you sometimes have to support objectionable regimes.

Sarah Paine

I think you had a more optimistic generation, ironically, optimistic. The people who had survived World War II, there was a real generosity. American servicemen and women were welcomed all over Europe and they were adored in Europe. They came back and they were a very generous group of people. Others felt generous to them.

That’s when the GI Bill was passed, saying, “You’ve saved everyone. Therefore we’re going to give you college educations, extend home loans to you.” Not to African Americans, they were excluded from this, which is a problem. But white Americans weren’t. It led to massive economic growth where people who’d never had a college education in their family, they did. All of a sudden, instead of having really hard manual labor, there was this real optimism. And then it extended to foreign countries. This is when this country was tremendously generous to others, and it worked very well for us.

Think about the Marshall Plan. It looks really generous putting all this money into Europe. We made a fortune off of it, as did Europe. If you’re smart, you’re looking for win-wins, things where you both benefit, because that’ll incentivize the other side to join in. This is basic strategy. This is one of the reasons I’ve got problems with the United States’ turn to zero-sum approaches where “I’m going to get everything, you get nothing. Then I look so smart when we do the clickbait on this moment where I get everything and you get nothing.” It’s much smarter to look for win-wins.

The other piece is that a lot of things don’t pay off immediately. George Bush is not reelected president. He absolutely deserved to be. Because what he did, the payoff was huge, ending the Cold War on Western terms. But it doesn’t pay off in time for the next election. I think this is where Americans miss it. You’re looking at what someone does on a given day when the real implications are what’s going to happen in a decade. Like on tax policy, if we keep racking up our debt, it may get us out of the corner today, but is it going to back us into a corner later on? This is where Americans need to think a little harder about long-term implications of things.

Dwarkesh Patel

You pointed out that it would cost 60 billion Deutschmarks for West Germany to pay Gorbachev to let East Germany join West Germany. That’s a lot of money. But if you think about decades and decades of future growth, it’s a huge bargain. It’s a mistake to think about how expensive things seem at the moment. It’s another huge country that you’ve turned.

Sarah Paine

There’s a statement that politicians think of the next election, statesmen think of the next generation. George Bush and Helmut Kohl are statesmen. They’re thinking of the next generation. The group that fought World War II, many of the US and allied leaders, were statespeople. They’re thinking of the next generation. Or if you’re thinking of where I’ve got Mitterrand, who’s negotiating the Maastricht Treaty about the European Union, that is statesperson’s work of what’s the next generation. It’s important. We need more statesmen, statespeople, political leaders.

Dwarkesh Patel

To try out a different thesis on you, through this period the Soviet Union is also trying to buy off other countries, especially when it thinks its economy can grow. Especially after the 1973 oil crisis, when oil prices just skyrocket. This is why some Soviet citizens remember the Brezhnev era favorably. Oil made it possible for the Soviets to not only import stuff, but through the Brezhnev period there’s actually a net export of resources to Eastern European satellites rather than the other way around.

Sarah Paine

That’s probably their data. I get it, their oil is really subsidized, but everything in the Soviet Union that was worth having came from somewhere else. The problem is how do you measure it? They’re just going to measure by weight or something else. It doesn’t really capture what they’re getting.

Dwarkesh Patel

The larger question being that, it’s not like the Soviet Union didn’t think of doing things like the Marshall Plan. Obviously nothing to that extent, but this idea that you can win people’s favors by providing them military aid, providing them foreign aid. They just didn’t have the resources to do it to the extent that the US could.

Sarah Paine

That’s true, but there’s a real coercive piece too. If you mess with them, it’ll be really ugly.

Dwarkesh Patel

Here’s what I don’t understand about the arms buildup during the Cold War. The Soviet Union is spending 2% of its GDP on nuclear weapons alone at its peak. Arms control advocates will make this quip, which is that we’ve already got enough weapons to destroy the world many times over. Why do we need more? But that is sort of an interesting question. What was the point of spending so much of GDP on the marginal nuclear weapon or marginal weapon system?

Sarah Paine

I don’t know the answer, but you read the plans about these things and you wonder what people are thinking. We were trying to develop tactical nukes. There was only a little trick with that. Whoever deployed it would be within the blast range of the tactical nuke. You’re going, “Who develops a weapon like that?” Apparently we did. Luckily we didn’t deploy it.

I don’t know the answer of why we had such massive redundancy in these nuclear weapons, why the arsenals were so massive. I don’t know the story on how you maintain these things and how long they last. It doesn’t make much more sense to me than it does to you.

Dwarkesh Patel

Another question. Sino-Soviet split, this huge diplomatic coup. The Soviets had to put a million soldiers on the Siberian front against China. They had to spend 2% of GDP just stationing and garrisoning this area, which is obviously a lot. That’s often what many countries spend on defense as a whole, let alone just along one front.

At the same time, 2% GDP, well if they just had one or two more years of extra economic growth or faster growth, that could make up for this huge diplomatic coup. Again this goes back to the point of, if some domestic policy just caused slightly higher economic growth rates, that would make up for the biggest diplomatic coup of the entire Cold War. It goes back to economy first, diplomacy second.

Sarah Paine

Firstly, I have real problems with the statistics. I got a sample size of one, moi. I remember living in Moscow. It was so backwards. It’s just breathtakingly backwards in just about every way imaginable. They got a big fancy subway system that looks remarkably retro, and at least it works. But the consumer goods were so awful, the quality was so bad. You look at the buildings themselves.

I get it, they make nuclear weapons. Do they make anything else? Their cars were a joke, their Ladas or whatever they were. It’s just thing after thing. So you’re looking at all their stats because that’s what they are telling you, that we’re so great. It really is an Emperor Wears No Clothes moment that finally the little kid goes, “Oh, you’re actually naked.”

I can give you an example. These acquaintances in Moscow were talking about hospitals outside of Moscow that some of them didn’t have running water. How do you have a hospital without running water? I don’t even know how that’s even conceivable. Or when their kid had put her hand through a glass door or something. They wanted to get her stitched up because she’s bleeding. She’s not going to die, but she’s probably bleeding all over the place. They bring her to one place and, oh, they got no thread to do the stitches. So then they have to go to another place. Who runs a country like this?

Dwarkesh Patel

Alright, you convinced me. BART is acceptable. I’ll stick here. Subway’s not a big deal. I don’t want to move to Moscow.

Okay, while the Eastern European satellites are trying to leave the Soviet Union, this has happened many times through the 20th century. Hungary in 1956, Czechoslovakia in ‘68, Poland through Solidarity. Every previous time there’s a many-million-person-strong Red Army stationed in Eastern Europe left over from World War II, which rolls in the tanks and prevents these revolutions from taking place.

So what happens in the late 80s and early 90s? The Red Army is still there. There’s still millions of Red Army soldiers. They just don’t shoot.

Sarah Paine

Generational change. The leaders don’t have the stomach for it anymore. I don’t know how you’d feel about sending tanks and going, “Oh, we’re going to splatter all these people.” I think for many Americans, that would not be the choice that they would make. So this ruthless generation is gone.

Another piece is that Gorbachev had traveled and I think he had some Czech friends. I can’t remember all of his lists of friends. But they’d been horrified by Czechoslovakia in 1968 as young people watching, as Russian young people watching it and thinking, “It’s just wrong. We shouldn’t be doing this. If communism is what it should be, this is not what should be happening.”

This is of their youth. Gorbachev and his generation. It’s not just him, he reflects a whole generation of communists. They’re thinking, “There’s got to be another way. This is just not right.” So he thinks he’s got his other way. It’s this exuberance of the reforms and things that are happening in Russia. There’s a tremendous feeling of energy. He’s telling the Poles, “You get at it too. We’re all going to do this thing.” But it’s all the expertise and things that he’s missing, that he’s unaware that he’s missing, as are all these other people, because how could they have it? They’ve been living in a command economy.

Dwarkesh Patel

This is what I wanted to ask you about. You had the de Tocqueville quote about how revolutions happen when governments start to institute some kind of reform. Gorbachev is doing perestroika, glasnost. There’s the conservative reactionary parts of the Communist Party, which by the way is a phrase I wouldn’t expect to have said. But they’re trying to resist this. So Gorbachev goes about dismantling the party secretariat and instead devolving power down to the individual republics. We know what happens later. These republics are saying, “Look, we want our own country now.”

But this raises a question. If you do inherit a brutal regime, and now you say, “I want to do reforms.” You know this dynamic that de Tocqueville pointed out, which is that as soon as you start reforms, actually what tends to happen is that you lose power, not that people consolidate it under you. What actually should you do? Because you’re like, “I want to improve people’s lives.” But as soon as you try to do that, the whole thing’s going to fall apart.

Sarah Paine

This is so far above my pay grade. I’m a professor. I have trouble justifying a B+ on a paper. I’m a believer in gradual reforms. Do it incrementally. For the Soviet Union, it would be gradual legal reforms, work it through their Duma slowly, and do it that way. But seek out help from the European Union that has many, many experts that would be overjoyed if Putin and friends would cease doing their number on Ukraine. Now the problem is you’re going to get into reparations for the horrors they’ve inflicted. So that ship has sadly sailed for this generation. There’s no nice ending for Russians. It’s too late.

But you can look at Europe itself improving its institutions and Ukraine improving its institutions. If you think about what forces you to change, the existential threat on Ukraine, if they survive all this, this is forcing them really to clean up their institutions. So it’s happening rapidly there, but we don’t know the end of that story, how it ends.

Dwarkesh Patel

I do think these are interesting lessons here of whenever we look at a country from the outside, we have this thing of, “Well, just reform everything and just fix your economy.” Whenever we understand the system better… For example, in the United States, healthcare is 20% of GDP. This idea that Trump or Obama or Biden, whoever, could just come in and be like, “Well, I’ll just fix healthcare.” We recognize that this is a wildly implausible thing to happen. But then we have this expectation that in Russia, Gorbachev or Yeltsin could have just been like, “100% of my economy is messed up, and I’m just going to fix it.”

Sarah Paine

American hubris in action. Think about our country. We have one of the most crazy tax codes on the planet, and neither party can touch it. Because you touch any part of it, someone negotiated that wording exactly. Yet think of how much of our economy is taken up by the overhead of all the tax accountants, all the misdirected cash in order to take advantage of something that’s simply an invention of the tax system.

Years ago there was talk of doing a flat tax, “Wouldn’t that be much more efficient?” You can imagine what accountants thought about that one. That idea has totally died. Talk about inefficiency. Then we realize we have budgetary problems in this country. This would seem to be something that ought to be on people’s radar, clean up the tax code. But isn’t it precisely that many people don’t want the radar on the tax code? That’s why we’re wondering who can get in and out of girls’ or boys’ bathrooms, instead of looking at the tax code, which should be the real thing.

Dwarkesh Patel

I think there should be big deductions for podcasts. It should count for research and development.

Sarah Paine

Well, Dwarkesh, you’re almost at that stage. You need to add a lobbyist in DC.

Dwarkesh Patel

We’ll work on it.

There’s a very interesting book about North Korea, I forget the title, where the author is pointing out that North Korea could not even start doing reforms today because as soon as there was some sort of information from the outside world that North Koreans could see—which would be part of any reform—they would immediately realize that everything the government has told them is false. South Korea is enormously wealthier and they have this terrible standard of living.

Obviously, this is the same experience that Eastern Europeans had. Literally in many cases, you had a country that was bisected in half and the other half is living so much richer. In those situations, I guess this goes back to the question of, “Well, today in North Korea, how would it even kick off if Kim Jong-un just had a change of heart or if somebody else came into power?” They’re probably just trapped in this to the extent that they want to keep power.

Sarah Paine

Oh, he’s trapped because he’s a dead boy if he tries to take a go at retirement. In Asia—I don’t know exactly all of the parts of Asia where this applies to, it’s some parts—there’s a thought that things last for three generations and then it’s over. So he’s the third generation. Whether this is true or not doesn’t matter. If you believe it’s true, it will become a self-fulfilling prophecy. So I’ll be interested. I probably won’t live to see it, you in the room will, what happens to the Kim family, whether it makes it to generation four or not. But by their own belief system, in theory, they shouldn’t. So who knows?

Dwarkesh Patel

One more question about oil.

Sarah Paine

Based on my big expertise on oil, zero. Okay.

Dwarkesh Patel

During this period between ‘73 and ‘85, when they had these huge oil revenues, presumably there was some amount of exuberance. But did the government recognize and realize that they’re super fragile to the price of oil and if that collapses, they need some sort of contingency plan, some rainy day fund? You must notice that, “Oh, this is half my budget, and all of my foreign currency is coming from oil, and this is a very volatile commodity.” Nobody noticed that?

Sarah Paine

Yeah, well, it’s interesting. I was reading this long chronology that was put together sort of like early Putin. So before they really shut down all the information. It was just a chronology of the Cold War, big fat book. Just like someone like me to read a book like that. So I’m going date after date after date after date. It’s written by people who are really angry about how the Cold War turned out. One of the takeaways from the compilers of this thing is they kept criticizing. They showed how much for every year Russia was making in oil revenues. It was huge. But in their analysis it was, “And they saved none of it”, right? There was no sense of investing in something.

There’s something called consumption. There’s another thing called investment. Going around and buying a bunch of Western grain is consumption. There’s none of this being put in anything that’s going to yield anything. So that was a big criticism from the authors of this book. To the question you’re asking, “No, they just milked it while they were there.”

Dwarkesh Patel

Final question, this is not so much a question as an observation. I don’t know if you have a reaction to this. Just look at Russia’s history through the 20th century: tsarism, communism, collectivization, to more than 10% of your population dying from World War II, then back to Stalin, and then more communism, and then the economy collapses again, and then Putin. Especially if you look at the satellite states, they had all of this happen to them and worse because now they’re getting invaded.

Whereas you have other countries. Japan and Germany also had tragic histories, but then they recovered. Maybe it’s just the tragedy of Russia.

Sarah Paine

Yeah, you’re lucky you’re not Russian.

Dwarkesh Patel

Yeah, exactly.

Sarah Paine

No, it is tragic. It is tragic. It started out as a difficult address, pre-Industrial Revolution, that required certain things to survive. They were more ruthless than their neighbors. They did survive. I mean, in a previous lecture, I discussed how they wiped out entire princely states and Khanates and things, they just wiped them out. Then you’re using their elites because it’s a rough neighborhood. The problem is if you aren’t on the winning side, you’re going to be on the losing side, right?

But since the Industrial Revolution, where you can do compounded economic growth that comes from commerce and trade and industry and things, that’s the real way to get powerful because power becomes a function of your wealth. That involves having legal systems, institutions, and stability. Russia has found it very difficult getting with that program. It has to do with, I think, this very difficult historical legacy of who rises to power, and also all the missing things. They didn’t have the Renaissance, they didn’t have the Reformation, these fundamental movements that were very influential in the West.

So there’s a lot of negative space of things that didn’t happen. There’s all the awful stuff that you saw that did happen, but then they’re missing things. So it’s very difficult. Then people like Putin can set the clock way back because he’s killed so many Ukrainians. What he’s done will take a generation at minimum to get to anywhere where people are going to be thinking about… People will be talking about reparations from Russia for quite a while and they’re poor, they’re not going to want to do that.

Dwarkesh Patel

I should have thought to end on a more optimistic note, but...

Sarah Paine

Well, history’s ended, okay?

Dwarkesh Patel

Well, you’ve outlined the ways in which countries can chart a better course for themselves and that’s where the optimism can come from.

Sarah Paine

Actually, I’ve told a story about the last Cold War that stayed cold in the industrialized world, which was a good thing, because it could have been nuclear. It was tragic in many other parts of the world, but at least it stayed cold in the industrialized part. There was a strategy that a very thoughtful generation of people, not just in the United States but all over the West, put together to allow for a non-nuclear landing for the Soviet Union when it fell apart. From this, you can derive some of the strategies that worked for ending it that way. These are the kind of strategies that we’re going to have to use in order to navigate the second Cold War.

The other piece about the Cold War is the Soviet Union living miserable lives of their own making. But Americans were actually having a good time. They paid taxes, they had to pay for all the nuclear weapons. But as I recall, people are running around in Disneyland, they’re doing their European trips, they’re buying houses. So actually Americans, people in Western Europe, were living fulfilling lives while they’re waiting for others to get with the program.

If we’re going to make it through this second one, we need to start cooperating with our allies, building institutions, and improving laws. Don’t just burn down the house. We will get through this one too, and we will live fulfilling lives while we’re waiting for Putin to come up with something different or Xi Jinping to come up with something different. But if we blow through our good hand of cards...

You interview all kinds of people at the cutting edge of technology. If we get rid of all of our university funding, we aren’t going to have the intellectual capital on which those businesses are based. If we’re going to dump all our allies for unknown reasons and just alienate them so they organize without us… If we’re going to just throw away entire institutions without thinking very carefully about what we’re doing… We become a cooperative adversary and we will be the bozo putting a plastic bag on our own head.

I look at the rhymes here. The Soviets had this ancient leadership who just couldn’t get their act together and they’re living off of debt instead of thinking creatively. The rhymes are awful, but we don’t have to do it that way. So it is more optimistic, but we need to get our house in order. That’s why I’m doing these lectures. They’re lectures in strategy to give you tools on how to come to your own decisions. That’s your business, not mine.

Dwarkesh Patel

That is an excellent note to close on. Sarah, I want to thank you so much for doing this lecture series with us. It has been a true education across these six lectures, everything from individual wars to the strategic and tactical decisions which explain them, to the broader lessons for today’s world. I do interview lots of different kinds of people, but from a sort of view-per-minute average-adjusted basis, I host a Sarah Paine podcast. If you just sort by popular, Sarah Paine comes up a lot.

Sarah Paine

But you’ve got it backwards. I was an unknown academic and then you cold-called me about doing an interview. I said, “Sure.” Dwarkesh, as a result of all this, I’m getting emails from all over the place. So let’s talk about who’s grateful to whom. Anyway, I’m devoted to your generation. Thank you for having me. Thank you for coming and being such a warm audience. Really appreciate it.

Thoughts on AI progress (Dec 2025)

Dwarkesh Patel — Tue, 02 Dec 2025 21:39:14 GMT

What are we scaling?

Beren Millidge made interesting points about this in a recent blog post:

When we see frontier models improving at various benchmarks we should think not just of increased scale and clever ML research ideas but billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities ... In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines since given we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess.

Human labor is valuable precisely because it’s not shleppy to train

Economic diffusion lag is cope for missing capabilities

Steven Byrnes has an excellent post on this and many other points:

New technologies take a long time to integrate into the economy? Well ask yourself: how do highly-skilled, experienced, and entrepreneurial immigrant humans manage to integrate into the economy immediately? Once you’ve answered that question, note that AGI will be able to do those things too.

Goal post shifting is justified

AI bulls will often criticize AI bears for repeatedly moving the goal posts. This is often fair. AI has made a ton of progress in the last decade, and it’s easy to forget that.

Models keep getting more impressive at the rate the short timelines people predict, but more useful at the rate the long timelines people predict.

RL scaling is laundering the prestige of pretraining scaling

Comparison to human distribution will make us at first overestimate (and then underestimate) AI

There is huge variance in the amount of value that different humans can add, especially in white collar work with its O-ring dynamics. The village idiot adds ~0 value to knowledge work, while top AI researchers are worth billions of dollars to Mark Zuckerberg.

AI models at any given snapshot of time, however, are roughly equally capable. Humans have all this variance, whereas AI models don’t. Because a disproportionate share of value-add in knowledge work comes from the top percentile humans, if we try to compare the intelligence of these AI models to the median human, then we will systematically overestimate the value they can generate. But by the same token, when models finally do match top human performance, their impact might be quite explosive.
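To make the median-versus-top-percentile point concrete, here is a minimal sketch. It assumes, purely for illustration, that value added across knowledge workers is lognormally distributed; the shape parameter `sigma` and the function names are hypothetical choices, not measurements of any real labor market.

```python
import math
from statistics import NormalDist, fmean, median


def value_distribution(n=10_000, sigma=2.0):
    """Value added by n hypothetical workers, shaped like a lognormal.

    Evenly spaced quantiles are used instead of random sampling so the
    result is deterministic. sigma is an assumed tail-heaviness knob.
    """
    normal = NormalDist()
    return [math.exp(sigma * normal.inv_cdf((i + 0.5) / n)) for i in range(n)]


def summarize(values, top_fraction=0.01):
    """Return the median, the mean, and the share of total value from the top slice."""
    ordered = sorted(values)
    top_count = max(1, int(len(ordered) * top_fraction))
    return {
        "median": median(ordered),
        "mean": fmean(ordered),
        "top_share": sum(ordered[-top_count:]) / sum(ordered),
    }


stats = summarize(value_distribution())
print(stats)
```

Under this assumed distribution the mean sits well above the median, and the top 1% of workers account for far more than 1% of total value. So a model benchmarked as "about as good as the median human" would capture much less than the average value per worker, while a model that reaches the top of the distribution would capture a disproportionate share.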

Broadly deployed intelligence explosion

There’ll also probably be diminishing returns from learning-from-deployment. Each of the first 1000 consultant agents is learning a ton from deployment. Less so the next 1000. And is there such a long tail to consultant work that the millionth deployed instance is likely to see something super important the other 999,999 instances missed? In fact, I wouldn’t be surprised if continual learning also ends up leading to a power law, but with respect to the number of instances deployed.
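A toy model of that diminishing-returns intuition might look like the sketch below. It assumes that the kinds of tasks a deployed agent encounters follow a Zipf-like long tail, and that "learning" just means seeing a task type at least once. The number of task types, the exponent, and the tasks per instance are all made-up parameters for illustration.

```python
def zipf_probs(num_types=100_000, exponent=1.1):
    """Probability of each task type under an assumed Zipf-like long tail."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_distinct(probs, draws):
    """Expected number of task types seen at least once after `draws` tasks."""
    return sum(1.0 - (1.0 - p) ** draws for p in probs)


def marginal_gains(probs, batch=1000, tasks_per_instance=100, batches=5):
    """Expected count of new task types contributed by each successive batch of instances."""
    gains = []
    previous = 0.0
    for b in range(1, batches + 1):
        seen = expected_distinct(probs, b * batch * tasks_per_instance)
        gains.append(seen - previous)
        previous = seen
    return gains


gains = marginal_gains(zipf_probs())
for index, gain in enumerate(gains, start=1):
    print(f"batch {index} of 1000 instances: {gain:.0f} new task types (expected)")
```

In this toy setup each additional batch of 1000 instances uncovers fewer new task types than the batch before it, because the common cases have already been seen and only the tail is left. Whether real deployment looks anything like this depends on how long the tail of the actual work is, which is exactly the open question.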

Podcast Strategy Doc (December 2025)

Dwarkesh Patel — Mon, 01 Dec 2025 18:35:34 GMT

The mission

I originally titled my podcast The Lunar Society. I changed it to Dwarkesh Podcast eventually because people kept thinking it was a crypto podcast (“to the moon!!!”). I named it after The Lunar Society of Birmingham, an informal club that met in the late 18th century. Members included James Watt, Matthew Boulton, Erasmus Darwin, Joseph Priestley, and Josiah Wedgwood. These were the scientists, inventors, and philosophers who had made first contact with the Industrial Revolution which was just starting to take shape around them. And they discussed everything from steam engines to abolition to chemistry to education reform.

Someday people will look back on this period the way we look back on the Enlightenment. Great thinkers having important debates right as the world was about to undergo these massive technological, economic, and political revolutions. And some of these thinkers actually managed to get a couple of the big things right.

Whatever happens next, I want the debates to have happened on this podcast, and to have happened well.

We are moving from the age of podcasts to the age of essays

I wanna make essays a first class citizen of what I do. This is for a couple of reasons:

Interviews are best when I have some take that I can bounce against my guest. You only get to see Federer’s skill when he’s rallying against a decent player, and certainly not if he’s just bouncing the ball against a wall.
As AI becomes more and more closed off, the best people will not be in a position where they can explain their thinking clearly. This is why the Karpathy episode was so incredible. It’s rare to get an industry expert without any particular thing to pitch, and who can talk openly about the research. But I’m not aware of anyone else who is Karpathy-tier, and who is not obliged to keep his or her mouth shut about a couple of things.
My essays have done much better than my expectations, in terms of reach, correctness and impact. I wrote the continual learning essay on a whim one afternoon, because I wanted to articulate why all these LLM scripts I’ve written for my business haven’t been helpful. And I’m still a little shocked to realize that I had stumbled upon (at least part of) what Ilya is working on at SSI. It’s not a crazy insight by any means, but it’s notable that you can just think about stuff, and there’s a good chance you’ll figure out what’s up. Btw, after I released the essay, both Sam Altman and Demis Hassabis have said that continual learning is a major bottleneck on the path to AGI. Of course, there’s no way to know whether they read my essay. But honestly, even if they hadn’t, I’d still be pretty stoked if I had independently pointed my finger at the exact same bottleneck as these guys, despite all their additional context.
Which brings me to my next point. I feel like there’s actually not that many secrets. The researchers and CEOs of the AI labs are a couple months ahead of you. This just doesn’t amount to any substantial secret knowledge that, if only you knew, you’d also have 2027 timelines. A ton of progress has been made in the last 3 years since ChatGPT, but none of it was super shocking based on the rumor mill and some connecting of the dots. And then there’s the big picture questions about AI’s impacts, where your thinking might very plausibly be much better than people at the labs, just because it takes time to think, and these people are busy running a damn company.
Some of the questions I’m most interested in simply can’t be answered extemporaneously by any human being on the planet. They require knowledge across multiple different fields, and a couple hours (to days) of crunching the numbers or thinking through shit.
Because often enough my guests can’t just answer pretty complicated fractal questions in a satisfying way on the spot, I get frustrated with the whole enterprise. The main angst I’ve kept returning to over and over is, “Okay what did I actually learn from this interview? And if I didn’t get that much concrete insight and understanding out of it, despite a week+ of research and hours of conversation, what hope is there for the audience? And if no one learned anything, what the fuck are we doing here?” I feel essays survive this cynicism much better. For example, I’m often frustrated that social scientists won’t speculate with me about what their insights imply about AI civilization, or historians about how history might have turned out differently given different counterfactuals. But it’s ridiculous to count on a scholar who is thinking about AGI for the first time in his life to start shooting off some galaxy brain implications from his theory. But I can go read their books, and use my understanding of the technology to come up with some hot takes.
I can easily co-release my essays as narrations on my podcast and YouTube feed, so actually the essays are super complementary to this audio/video audience I’ve built up.

Gratitude

In the spirit of Thanksgiving: a lottery winner who then won another lottery is less lucky than I am.

Every once in a while, I’ll be grabbing dinner with a writer whose work I was obsessed with in college. And a part of me is just like, “What the fuck is happening right now?” Many of my greatest intellectual heroes are now my direct friends and teachers. My job is to spend a week learning about whatever I’m most interested in, and then talk to the world expert on that topic. A job I would pay to do has rewarded me - intellectually, financially, socially - beyond my wildest expectations. And there’s millions of people who are into this stuff! This audience contains some of the smartest people in the world, including many of the people I am a huge fan of. Then there’s my team. It’s unreal how talented, agentic, tasteful, and detail-oriented my colleagues are. I genuinely have no idea how I convinced people this good to come run a podcast.

Ilya Sutskever — We're moving from the age of scaling to the age of research

Dwarkesh Patel — Tue, 25 Nov 2025 17:04:02 GMT

Ilya & I discuss SSI’s strategy, the problems with pre-training, how to improve the generalization of AI models, and how to ensure AGI goes well.

Watch on YouTube; listen on Apple Podcasts or Spotify.

Sponsors

Gemini 3 is the first model I’ve used that can find connections I haven’t anticipated. I recently wrote a blog post on RL’s information efficiency, and Gemini 3 helped me think it all through. It also generated the relevant charts and ran toy ML experiments for me with zero bugs. Try Gemini 3 today at gemini.google
Labelbox helped me create a tool to transcribe our episodes! I’ve struggled with transcription in the past because I don’t just want verbatim transcripts, I want transcripts reworded to read like essays. Labelbox helped me generate the exact data I needed for this. If you want to learn how Labelbox can help you (or if you want to try out the transcriber tool yourself), go to labelbox.com/dwarkesh
Sardine is an AI risk management platform that brings together thousands of device, behavior, and identity signals to help you assess a user’s risk of fraud & abuse. Sardine also offers a suite of agents to automate investigations so that as fraudsters use AI to scale their attacks, you can use AI to scale your defenses. Learn more at sardine.ai/dwarkesh

To sponsor a future episode, visit dwarkesh.com/advertise.

Timestamps

(00:00:00) – Explaining model jaggedness

(00:09:39) – Emotions and value functions

(00:18:49) – What are we scaling?

(00:25:13) – Why humans generalize better than models

(00:35:45) – Straight-shotting superintelligence

(00:46:47) – SSI’s model will learn from deployment

(00:55:07) – Alignment

(01:18:13) – “We are squarely an age of research company”

(01:29:23) – Self-play and multi-agent

(01:32:42) – Research taste

Transcript

00:00:00 – Explaining model jaggedness

Ilya Sutskever 00:00:00

You know what’s crazy? That all of this is real.

Dwarkesh Patel 00:00:04

Meaning what?

Ilya Sutskever 00:00:05

Don’t you think so? All this AI stuff and all this Bay Area… that it’s happening. Isn’t it straight out of science fiction?

Dwarkesh Patel 00:00:14

Another thing that’s crazy is how normal the slow takeoff feels. The idea that we’d be investing 1% of GDP in AI, I feel like it would have felt like a bigger deal, whereas right now it just feels...

Ilya Sutskever 00:00:26

We get used to things pretty fast, it turns out. But also it’s kind of abstract. What does it mean? It means that you see it in the news, that such and such company announced such and such dollar amount. That’s all you see. It’s not really felt in any other way so far.

Dwarkesh Patel 00:00:45

Should we actually begin here? I think this is an interesting discussion.

Ilya Sutskever 00:00:47

Sure.

Dwarkesh Patel 00:00:48

I think your point, about how from the average person’s point of view nothing is that different, will continue being true even into the singularity.

Ilya Sutskever 00:00:57

No, I don’t think so.

Dwarkesh Patel 00:00:58

Okay, interesting.

Ilya Sutskever 00:01:00

The thing which I was referring to not feeling different is, okay, such and such company announced some difficult-to-comprehend dollar amount of investment. I don’t think anyone knows what to do with that.

But I think the impact of AI is going to be felt. AI is going to be diffused through the economy. There’ll be very strong economic forces for this, and I think the impact is going to be felt very strongly.

Dwarkesh Patel 00:01:30

When do you expect that impact? I think the models seem smarter than their economic impact would imply.

Ilya Sutskever 00:01:38

Yeah. This is one of the very confusing things about the models right now. How to reconcile the fact that they are doing so well on evals? You look at the evals and you go, “Those are pretty hard evals.” They are doing so well. But the economic impact seems to be dramatically behind. It’s very difficult to make sense of, how can the model, on the one hand, do these amazing things, and then on the other hand, repeat itself twice in some situation?

An example would be, let’s say you use vibe coding to do something. You go to some place and then you get a bug. Then you tell the model, “Can you please fix the bug?” And the model says, “Oh my God, you’re so right. I have a bug. Let me go fix that.” And it introduces a second bug. Then you tell it, “You have this new second bug,” and it tells you, “Oh my God, how could I have done it? You’re so right again,” and brings back the first bug, and you can alternate between those. How is that possible? I’m not sure, but it does suggest that something strange is going on.

I have two possible explanations. The more whimsical explanation is that maybe RL training makes the models a little too single-minded and narrowly focused, a little bit too unaware, even though it also makes them aware in some other ways. Because of this, they can’t do basic things.

But there is another explanation. Back when people were doing pre-training, the question of what data to train on was answered, because that answer was everything. When you do pre-training, you need all the data. So you don’t have to think if it’s going to be this data or that data.

But when people do RL training, they do need to think. They say, “Okay, we want to have this kind of RL training for this thing and that kind of RL training for that thing.” From what I hear, all the companies have teams that just produce new RL environments and just add it to the training mix. The question is, well, what are those? There are so many degrees of freedom. There is such a huge variety of RL environments you could produce.

One thing you could do, and I think this is something that is done inadvertently, is that people take inspiration from the evals. You say, “Hey, I would love our model to do really well when we release it. I want the evals to look great. What would be RL training that could help on this task?” I think that is something that happens, and it could explain a lot of what’s going on.

If you combine this with generalization of the models actually being inadequate, that has the potential to explain a lot of what we are seeing, this disconnect between eval performance and actual real-world performance, which is something that we don’t today even understand, what we mean by that.

Dwarkesh Patel 00:05:00

I like this idea that the real reward hacking is the human researchers who are too focused on the evals.

I think there are two ways to understand, or to try to think about, what you have just pointed out. One is that if it’s the case that simply by becoming superhuman at a coding competition, a model will not automatically become more tasteful and exercise better judgment about how to improve your codebase, well then you should expand the suite of environments such that you’re not just testing it on having the best performance in coding competition. It should also be able to make the best kind of application for X thing or Y thing or Z thing.

Another, maybe this is what you’re hinting at, is to say, “Why should it be the case in the first place that becoming superhuman at coding competitions doesn’t make you a more tasteful programmer more generally?” Maybe the thing to do is not to keep stacking up the amount and diversity of environments, but to figure out an approach which lets you learn from one environment and improve your performance on something else.

Ilya Sutskever 00:06:08

I have a human analogy which might be helpful. Let’s take the case of competitive programming, since you mentioned that. Suppose you have two students. One of them decided they want to be the best competitive programmer, so they will practice 10,000 hours for that domain. They will solve all the problems, memorize all the proof techniques, and be very skilled at quickly and correctly implementing all the algorithms. By doing so, they became one of the best.

Student number two thought, “Oh, competitive programming is cool.” Maybe they practiced for 100 hours, much less, and they also did really well. Which one do you think is going to do better in their career later on?

Dwarkesh Patel 00:06:56

The second.

Ilya Sutskever 00:06:57

Right. I think that’s basically what’s going on. The models are much more like the first student, but even more. Because then we say, the model should be good at competitive programming so let’s get every single competitive programming problem ever. And then let’s do some data augmentation so we have even more competitive programming problems, and we train on that. Now you’ve got this great competitive programmer.

With this analogy, I think it’s more intuitive. Yeah, okay, if it’s so well trained, all the different algorithms and all the different proof techniques are right at its fingertips. And it’s more intuitive that with this level of preparation, it would not necessarily generalize to other things.

Dwarkesh Patel 00:07:39

But then what is the analogy for what the second student is doing before they do the 100 hours of fine-tuning?

Ilya Sutskever 00:07:48

I think they have “it.” The “it” factor. When I was an undergrad, I remember there was a student like this that studied with me, so I know it exists.

Dwarkesh Patel 00:08:01

I think it’s interesting to distinguish “it” from whatever pre-training does. One way to understand what you just said about not having to choose the data in pre-training is to say it’s actually not dissimilar to the 10,000 hours of practice. It’s just that you get that 10,000 hours of practice for free because it’s already somewhere in the pre-training distribution. But maybe you’re suggesting there’s actually not that much generalization from pre-training. There’s just so much data in pre-training, but it’s not necessarily generalizing better than RL.

Ilya Sutskever 00:08:31

The main strength of pre-training is that: A, there is so much of it, and B, you don’t have to think hard about what data to put into pre-training. It’s very natural data, and it does include in it a lot of what people do: people’s thoughts and a lot of the features. It’s like the whole world as projected by people onto text, and pre-training tries to capture that using a huge amount of data.

Pre-training is very difficult to reason about because it’s so hard to understand the manner in which the model relies on pre-training data. Whenever the model makes a mistake, could it be because something by chance is not as supported by the pre-training data? “Support by pre-training” is maybe a loose term. I don’t know if I can add anything more useful on this. I don’t think there is a human analog to pre-training.

00:09:39 – Emotions and value functions

Dwarkesh Patel 00:09:39

Here are analogies that people have proposed for what the human analogy to pre-training is. I’m curious to get your thoughts on why they’re potentially wrong. One is to think about the first 18, or 15, or 13 years of a person’s life when they aren’t necessarily economically productive, but they are doing something that is making them understand the world better and so forth. The other is to think about evolution as doing some kind of search for 3 billion years, which then results in a human lifetime instance.

I’m curious if you think either of these are analogous to pre-training. How would you think about what lifetime human learning is like, if not pre-training?

Ilya Sutskever 00:10:22

I think there are some similarities between both of these and pre-training, and pre-training tries to play the role of both of these. But I think there are some big differences as well. The amount of pre-training data is very, very staggering.

Dwarkesh Patel 00:10:39

Yes.

Ilya Sutskever 00:10:40

Somehow a human being, after even 15 years with a tiny fraction of the pre-training data, they know much less. But whatever they do know, they know much more deeply somehow. Already at that age, you would not make mistakes that our AIs make.

There is another thing. You might say, could it be something like evolution? The answer is maybe. But in this case, I think evolution might actually have an edge. I remember reading about this case. One way in which neuroscientists can learn about the brain is by studying people with brain damage to different parts of the brain. Some people have the most strange symptoms you could imagine. It’s actually really, really interesting.

One case that comes to mind that’s relevant. I read about this person who had some kind of brain damage, a stroke or an accident, that took out his emotional processing. So he stopped feeling any emotion. He still remained very articulate and he could solve little puzzles, and on tests he seemed to be just fine. But he felt no emotion. He didn’t feel sad, he didn’t feel anger, he didn’t feel animated. He became somehow extremely bad at making any decisions at all. It would take him hours to decide on which socks to wear. He would make very bad financial decisions.

What does it say about the role of our built-in emotions in making us a viable agent, essentially? To connect to your question about pre-training, maybe if you are good enough at getting everything out of pre-training, you could get that as well. But that’s the kind of thing which seems... Well, it may or may not be possible to get that from pre-training.

Dwarkesh Patel 00:12:56

What is “that”? Clearly not just directly emotion. It seems like some almost value function-like thing which is telling you what the end reward for any decision should be. You think that doesn’t sort of implicitly come from pre-training?

Ilya Sutskever 00:13:15

I think it could. I’m just saying it’s not 100% obvious.

Dwarkesh Patel 00:13:19

But what is that? How do you think about emotions? What is the ML analogy for emotions?

Ilya Sutskever 00:13:26

It should be some kind of a value function thing. But I don’t think there is a great ML analogy because right now, value functions don’t play a very prominent role in the things people do.

Dwarkesh Patel 00:13:36

It might be worth defining for the audience what a value function is, if you want to do that.

Ilya Sutskever 00:13:39

Certainly, I’ll be very happy to do that. When people do reinforcement learning, the way reinforcement learning is done right now, how do people train those agents? You have your neural net and you give it a problem, and then you tell the model, “Go solve it.” The model takes maybe thousands, hundreds of thousands of actions or thoughts or something, and then it produces a solution. The solution is graded.

And then the score is used to provide a training signal for every single action in your trajectory. That means that if you are doing something that goes for a long time—if you’re training a task that takes a long time to solve—it will do no learning at all until you come up with the proposed solution. That’s how reinforcement learning is done naively. That’s how o1, R1 ostensibly are done.

The value function says something like, “Maybe I could sometimes, not always, tell you if you are doing well or badly.” The notion of a value function is more useful in some domains than others. For example, when you play chess and you lose a piece, I messed up. You don’t need to play the whole game to know that what I just did was bad, and therefore whatever preceded it was also bad.

The value function lets you short-circuit the wait until the very end. Let’s suppose that you are doing some kind of a math thing or a programming thing, and you’re trying to explore a particular solution or direction. After, let’s say, a thousand steps of thinking, you concluded that this direction is unpromising. As soon as you conclude this, you could already get a reward signal a thousand timesteps previously, when you decided to pursue down this path. You say, “Next time I shouldn’t pursue this path in a similar situation,” long before you actually came up with the proposed solution.
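
The contrast described above can be sketched in a few lines. This is a toy illustration only, not how any lab actually trains its models: the value estimates, the rewards, and the function names `outcome_only_signal` and `td_errors` are all hypothetical. The first function gives every action the same final grade, which is the naive scheme. The second uses the standard temporal-difference error, r_t + gamma * V(s_{t+1}) - V(s_t), so a drop in estimated value flags the bad step as soon as it happens.

```python
"""Toy contrast: outcome-only credit versus value-function (TD) credit.

All numbers below are invented for illustration.
"""


def outcome_only_signal(num_steps, final_score):
    """Naive scheme: every action gets the same final score, and nothing
    is known until the whole trajectory has been graded."""
    return [final_score] * num_steps


def td_errors(values, rewards, gamma=1.0):
    """Per-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0)..V(s_T), including the terminal state;
    `rewards` holds r_0..r_{T-1}.
    """
    return [
        rewards[t] + gamma * values[t + 1] - values[t]
        for t in range(len(rewards))
    ]


# A chess-like episode: the position looks even (V = 0.5) until the move at
# step 2 loses a piece, after which the estimated value drops to 0.1.
values = [0.5, 0.5, 0.5, 0.1, 0.1, 0.0]  # hypothetical value estimates
rewards = [0.0, 0.0, 0.0, 0.0, 0.0]      # the lost game scores 0 at the end

naive = outcome_only_signal(len(rewards), final_score=0.0)
deltas = td_errors(values, rewards)

# The step with the most negative TD error is the one the value function blames.
blamed_step = min(range(len(deltas)), key=lambda t: deltas[t])

print("outcome-only signal:", naive)
print("TD errors:          ", deltas)
print("blamed step:        ", blamed_step)
```

In the naive case all five actions receive an identical signal, so the blunder is indistinguishable from the moves around it. With the value estimates, the largest negative error lands on the step where the piece was lost, without waiting for the end of the game.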

Dwarkesh Patel 00:15:52

This was in the DeepSeek R1 paper: the space of trajectories is so wide that maybe it’s hard to learn a mapping from an intermediate trajectory to a value. And also given that, in coding for example, you’ll have the wrong idea, then you’ll go back, then you’ll change something.

Ilya Sutskever 00:16:12

This sounds like such lack of faith in deep learning. Sure it might be difficult, but nothing deep learning can’t do. My expectation is that a value function should be useful, and I fully expect that they will be used in the future, if not already.

What I was alluding to with the person whose emotional center got damaged, it’s more that maybe what it suggests is that the value function of humans is modulated by emotions in some important way that’s hardcoded by evolution. And maybe that is important for people to be effective in the world.

Dwarkesh Patel 00:17:00

That’s the thing I was planning on asking you. There’s something really interesting about emotions as the value function, which is that it’s impressive that they have this much utility while still being rather simple to understand.

Ilya Sutskever 00:17:15

I have two responses. I do agree that compared to the kind of things that we learn and the things we are talking about, the kind of AI we are talking about, emotions are relatively simple. They might even be so simple that maybe you could map them out in a human-understandable way. I think it would be cool to do.

In terms of utility though, I think there is a thing where there is this complexity-robustness tradeoff, where complex things can be very useful, but simple things are very useful in a very broad range of situations. One way to interpret what we are seeing is that we’ve got these emotions that evolved mostly from our mammal ancestors and then fine-tuned a little bit while we were hominids, just a bit. We do have a decent amount of social emotions though which mammals may lack. But they’re not very sophisticated. And because they’re not sophisticated, they serve us so well in this very different world compared to the one that we’ve been living in.

Actually, they also make mistakes. For example, our emotions… Well actually, I don’t know. Does hunger count as an emotion? It’s debatable. But I think, for example, our intuitive feeling of hunger is not succeeding in guiding us correctly in this world with an abundance of food.

00:18:49 – What are we scaling?

Dwarkesh Patel 00:18:49

People have been talking about scaling data, scaling parameters, scaling compute. Is there a more general way to think about scaling? What are the other scaling axes?

Ilya Sutskever 00:19:00

Here’s a perspective that I think might be true. The way ML used to work is that people would just tinker with stuff and try to get interesting results. That’s what’s been going on in the past.

Then the scaling insight arrived. Scaling laws, GPT-3, and suddenly everyone realized we should scale. This is an example of how language affects thought. “Scaling” is just one word, but it’s such a powerful word because it informs people what to do. They say, “Let’s try to scale things.” So you say, what are we scaling? Pre-training was the thing to scale. It was a particular scaling recipe.

The big breakthrough of pre-training is the realization that this recipe is good. You say, “Hey, if you mix some compute with some data into a neural net of a certain size, you will get results. You will know that you’ll be better if you just scale the recipe up.” This is also great. Companies love this because it gives you a very low-risk way of investing your resources.

It’s much harder to invest your resources in research. Compare that. If you research, you need to be like, “Go forth researchers and research and come up with something”, versus get more data, get more compute. You know you’ll get something from pre-training.

Indeed, it looks like, based on various things some people say on Twitter, maybe it appears that Gemini have found a way to get more out of pre-training. At some point though, pre-training will run out of data. The data is very clearly finite. What do you do next? Either you do some kind of souped-up pre-training, a different recipe from the one you’ve done before, or you’re doing RL, or maybe something else. But now that compute is big, compute is now very big, in some sense we are back to the age of research.

Maybe here’s another way to put it. Up until 2020, from 2012 to 2020, it was the age of research. Now, from 2020 to 2025, it was the age of scaling—maybe plus or minus, let’s add error bars to those years—because people say, “This is amazing. You’ve got to scale more. Keep scaling.” The one word: scaling.

But now the scale is so big. Is the belief really, “Oh, it’s so big, but if you had 100x more, everything would be so different?” It would be different, for sure. But is the belief that if you just 100x the scale, everything would be transformed? I don’t think that’s true. So it’s back to the age of research again, just with big computers.

Dwarkesh Patel 00:22:06

That’s a very interesting way to put it. But let me ask you the question you just posed then. What are we scaling, and what would it mean to have a recipe? I guess I’m not aware of a very clean relationship that almost looks like a law of physics which existed in pre-training. There was a power law between data or compute or parameters and loss. What is the kind of relationship we should be seeking, and how should we think about what this new recipe might look like?
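
For readers who have not seen the pre-training relationship written down, here is a minimal sketch of what a power law between compute and loss looks like. It assumes the simplest form, loss = a * compute^(-b), with no irreducible-loss term. The data is synthetic and the constants 5.0 and 0.05 are invented, not taken from any published fit. `fit_power_law` is a hypothetical helper that does ordinary least squares in log-log space, where a power law becomes a straight line.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = a * x**(-b) by least squares on log(y) = log(a) - b * log(x).

    Returns the pair (a, b).
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(xs)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic "training runs": compute in FLOPs, loss generated from an
# invented power law so the fit has a known answer.
compute = [1e18, 1e19, 1e20, 1e21]
loss = [5.0 * c ** -0.05 for c in compute]

a, b = fit_power_law(compute, loss)

# Extrapolate two orders of magnitude beyond the largest run.
predicted_loss = a * (1e23) ** (-b)

print(f"a = {a:.3f}, b = {b:.3f}, predicted loss at 1e23 FLOPs = {predicted_loss:.3f}")
```

The appeal of the recipe is the last step: once the line is fitted on small runs, the loss of a much larger run can be predicted before paying for it. Whether anything comparably clean exists for RL is the open question being asked here.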

Ilya Sutskever 00:22:38

We’ve already witnessed a transition from one type of scaling to a different type of scaling, from pre-training to RL. Now people are scaling RL. Now based on what people say on Twitter, they spend more compute on RL than on pre-training at this point, because RL can actually consume quite a bit of compute. You do very long rollouts, so it takes a lot of compute to produce those rollouts. Then you get a relatively small amount of learning per rollout, so you really can spend a lot of compute.
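
One rough way to see why a long rollout yields little learning: if the only feedback is a single pass/fail grade, that grade can carry at most one bit of information, however many tokens the rollout took to produce. The sketch below works through that arithmetic. It assumes a binary outcome reward and uses the entropy of the success rate as an upper bound; the function names and the rollout lengths are hypothetical.

```python
import math


def binary_entropy_bits(p):
    """Entropy in bits of a pass/fail outcome with success probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def reward_bits_per_token(success_rate, rollout_tokens):
    """Upper bound on the information from one graded rollout, per token generated."""
    return binary_entropy_bits(success_rate) / rollout_tokens


# Same 50% success rate, but the second task needs a rollout 100x longer.
short_task = reward_bits_per_token(0.5, 1_000)
long_task = reward_bits_per_token(0.5, 100_000)

print(f"short task: {short_task:.6f} bits per token")
print(f"long task:  {long_task:.6f} bits per token")
print(f"ratio:      {short_task / long_task:.0f}x")
```

Under these assumptions the signal per token falls in proportion to rollout length, which is the sense in which RL on long tasks can absorb a great deal of compute.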

I wouldn’t even call it scaling. I would say, “Hey, what are you doing? Is the thing you are doing the most productive thing you could be doing? Can you find a more productive way of using your compute?” We’ve discussed the value function business earlier. Maybe once people get good at value functions, they will be using their resources more productively. If you find a whole other way of training models, you could say, “Is this scaling or is it just using your resources?” I think it becomes a little bit ambiguous.

In the sense that, when people were in the age of research back then, it was, “Let’s try this and this and this. Let’s try that and that and that. Oh, look, something interesting is happening.” I think there will be a return to that.

Dwarkesh Patel 00:24:10

If we’re back in the era of research, stepping back, what is the part of the recipe that we need to think most about? When you say value function, people are already trying the current recipe, but then having LLM-as-a-Judge and so forth. You could say that’s a value function, but it sounds like you have something much more fundamental in mind. Should we even rethink pre-training at all and not just add more steps to the end of that process?

Ilya Sutskever 00:24:35

The discussion about value function, I think it was interesting. I want to emphasize that I think the value function is something that’s going to make RL more efficient, and I think that makes a difference. But I think anything you can do with a value function, you can do without, just more slowly. The thing which I think is the most fundamental is that these models somehow just generalize dramatically worse than people. It’s super obvious. That seems like a very fundamental thing.

00:25:13 – Why humans generalize better than models

Dwarkesh Patel 00:25:13

So this is the crux: generalization. There are two sub-questions. There’s one which is about sample efficiency: why should it take so much more data for these models to learn than humans? There’s a second question. Even separate from the amount of data it takes, why is it so much harder to teach the thing we want to a model than to a human? For a human, we don’t necessarily need a verifiable reward to be able to… You’re probably mentoring a bunch of researchers right now, and you’re talking with them, you’re showing them your code, and you’re showing them how you think. From that, they’re picking up your way of thinking and how they should do research.

You don’t have to set a verifiable reward for them that’s like, “Okay, this is the next part of the curriculum, and now this is the next part of your curriculum. Oh, this training was unstable.” There’s not this schleppy, bespoke process. Perhaps these two issues are actually related in some way, but I’d be curious to explore this second thing, which is more like continual learning, and this first thing, which feels just like sample efficiency.

Ilya Sutskever 00:26:19

You could actually wonder that one possible explanation for the human sample efficiency that needs to be considered is evolution. Evolution has given us a small amount of the most useful information possible. For things like vision, hearing, and locomotion, I think there’s a pretty strong case that evolution has given us a lot.

For example, human dexterity far exceeds… I mean robots can become dexterous too if you subject them to a huge amount of training in simulation. But to train a robot in the real world to quickly pick up a new skill like a person does seems very out of reach. Here you could say, “Oh yeah, locomotion. All our ancestors needed great locomotion, squirrels. So with locomotion, maybe we’ve got some unbelievable prior.”

You could make the same case for vision. I believe Yann LeCun made the point that children learn to drive after 10 hours of practice, which is true. But our vision is so good. At least for me, I remember myself being a five-year-old. I was very excited about cars back then. I’m pretty sure my car recognition was more than adequate for driving already as a five-year-old. You don’t get to see that much data as a five-year-old. You spend most of your time in your parents’ house, so you have very low data diversity.

But you could say maybe that’s evolution too. But in language and math and coding, probably not.

Dwarkesh Patel 00:28:00

It still seems better than models. Obviously, models are better than the average human at language, math, and coding. But are they better than the average human at learning?

Ilya Sutskever 00:28:09

Oh yeah. Oh yeah, absolutely. What I meant to say is that language, math, and coding—and especially math and coding—suggests that whatever it is that makes people good at learning is probably not so much a complicated prior, but something more, some fundamental thing.

Dwarkesh Patel 00:28:29

I’m not sure I understood. Why should that be the case?

Ilya Sutskever 00:28:32

So consider a skill in which people exhibit some kind of great reliability. If the skill is one that was very useful to our ancestors for many millions of years, hundreds of millions of years, you could argue that maybe humans are good at it because of evolution, because we have a prior, an evolutionary prior that’s encoded in some very non-obvious way that somehow makes us so good at it.

But if people exhibit great ability, reliability, robustness, and ability to learn in a domain that really did not exist until recently, then this is more an indication that people might have just better machine learning, period.

Dwarkesh Patel 00:29:29

How should we think about what that is? What is the ML analogy? There are a couple of interesting things about it. It takes fewer samples. It’s more unsupervised. A child learning to drive a car… Children are not learning to drive a car. A teenager learning how to drive a car is not exactly getting some prebuilt, verifiable reward. It comes from their interaction with the machine and with the environment. It takes much fewer samples. It seems more unsupervised. It seems more robust?

Ilya Sutskever 00:30:07

Much more robust. The robustness of people is really staggering.

Dwarkesh Patel 00:30:12

Do you have a unified way of thinking about why all these things are happening at once? What is the ML analogy that could realize something like this?

Ilya Sutskever 00:30:24

One of the things that you’ve been asking about is how can the teenage driver self-correct and learn from their experience without an external teacher? The answer is that they have their value function. They have a general sense which is also, by the way, extremely robust in people. Whatever the human value function is, with a few exceptions around addiction, it’s actually very, very robust.

So for something like a teenager that’s learning to drive, they start to drive, and they already have a sense of how they’re driving immediately, how badly they are, how unconfident. And then they see, “Okay.” And then, of course, the learning speed of any teenager is so fast. After 10 hours, you’re good to go.

Dwarkesh Patel 00:31:17

It seems like humans have some solution, but I’m curious about how they are doing it and why is it so hard? How do we need to reconceptualize the way we’re training models to make something like this possible?

Ilya Sutskever 00:31:27

That is a great question to ask, and it’s a question I have a lot of opinions about. But unfortunately, we live in a world where not all machine learning ideas are discussed freely, and this is one of them. There’s probably a way to do it. I think it can be done. The fact that people are like that, I think it’s a proof that it can be done.

There may be another blocker though, which is that there is a possibility that the human neurons do more compute than we think. If that is true, and if that plays an important role, then things might be more difficult. But regardless, I do think it points to the existence of some machine learning principle that I have opinions on. But unfortunately, circumstances make it hard to discuss in detail.

Dwarkesh Patel 00:32:28

Nobody listens to this podcast, Ilya.

00:35:45 – Straight-shotting superintelligence

Dwarkesh Patel 00:35:45

I’m curious. If you say we are back in an era of research, you were there from 2012 to 2020. What is the vibe now going to be if we go back to the era of research?

For example, even after AlexNet, the amount of compute that was used to run experiments kept increasing, and the size of frontier systems kept increasing. Do you think now that this era of research will still require tremendous amounts of compute? Do you think it will require going back into the archives and reading old papers?

You were at Google and OpenAI and Stanford, these places, when there was more of a vibe of research. What kind of things should we be expecting in the community?

Ilya Sutskever 00:36:38

One consequence of the age of scaling is that scaling sucked out all the air in the room. Because scaling sucked out all the air in the room, everyone started to do the same thing. We got to the point where we are in a world where there are more companies than ideas by quite a bit. Actually on that, there is this Silicon Valley saying that says that ideas are cheap, execution is everything. People say that a lot, and there is truth to that. But then I saw someone say on Twitter something like, “If ideas are so cheap, how come no one’s having any ideas?” And I think it’s true too.

If you think about research progress in terms of bottlenecks, there are several bottlenecks. One of them is ideas, and one of them is your ability to bring them to life, which might be compute but also engineering. If you go back to the ‘90s, let’s say, you had people who had pretty good ideas, and if they had much larger computers, maybe they could demonstrate that their ideas were viable. But they could not, so they could only have a very, very small demonstration that did not convince anyone. So the bottleneck was compute.

Then in the age of scaling, compute has increased a lot. Of course, there is a question of how much compute is needed, but compute is large. Compute is large enough such that it’s not obvious that you need that much more compute to prove some idea. I’ll give you an analogy. AlexNet was built on two GPUs. That was the total amount of compute used for it. The transformer was built on 8 to 64 GPUs. No single transformer paper experiment used more than 64 GPUs of 2017, which would be like, what, two GPUs of today? The ResNet, right? You could argue that the o1 reasoning was not the most compute-heavy thing in the world.

So for research, you definitely need some amount of compute, but it’s far from obvious that you need the absolutely largest amount of compute ever for research. You might argue, and I think it is true, that if you want to build the absolutely best system then it helps to have much more compute. Especially if everyone is within the same paradigm, then compute becomes one of the big differentiators.

Dwarkesh Patel 00:39:41

I’m asking you for the history, because you were actually there. I’m not sure what actually happened. It sounds like it was possible to develop these ideas using minimal amounts of compute. But the transformer didn’t immediately become famous. It became the thing everybody started doing and then started experimenting on top of and building on top of because it was validated at higher and higher levels of compute.

Ilya Sutskever 00:40:06

Correct.

Dwarkesh Patel 00:40:07

And if you at SSI have 50 different ideas, how will you know which one is the next transformer and which one is brittle, without having the kinds of compute that other frontier labs have?

Ilya Sutskever 00:40:22

I can comment on that. The short comment is that you mentioned SSI. Specifically for us, the amount of compute that SSI has for research is really not that small. I want to explain why. Simple math can explain why the amount of compute that we have is more comparable for research than one might think. I’ll explain.

SSI has raised $3 billion, which is a lot by any absolute sense. But you could say, “Look at the other companies raising much more.” But a lot of their compute goes for inference. These big numbers, these big loans, it’s earmarked for inference. That’s number one. Number two, if you want to have a product on which you do inference, you need to have a big staff of engineers, salespeople. A lot of the research needs to be dedicated to producing all kinds of product-related features. So then when you look at what’s actually left for research, the difference becomes a lot smaller.

The other thing is, if you are doing something different, do you really need the absolute maximal scale to prove it? I don’t think that’s true at all. I think that in our case, we have sufficient compute to prove, to convince ourselves and anyone else, that what we are doing is correct.

Dwarkesh Patel 00:42:02

There have been public estimates that companies like OpenAI spend on the order of $5-6 billion a year just so far, on experiments. This is separate from the amount of money they’re spending on inference and so forth. So it seems like they’re spending more a year running research experiments than you guys have in total funding.

Ilya Sutskever 00:42:22

I think it’s a question of what you do with it. It’s a question of what you do with it. In their case, in the case of others, there is a lot more demand on the training compute. There’s a lot more different work streams, there are different modalities, there is just more stuff. So it becomes fragmented.

Dwarkesh Patel 00:42:44

How will SSI make money?

Ilya Sutskever 00:42:46

My answer to this question is something like this. Right now, we just focus on the research, and then the answer to that question will reveal itself. I think there will be lots of possible answers.

Dwarkesh Patel 00:43:01

Is SSI’s plan still to straight shot superintelligence?

Ilya Sutskever 00:43:04

Maybe. I think that there is merit to it. I think there’s a lot of merit because it’s very nice to not be affected by the day-to-day market competition. But I think there are two reasons that may cause us to change the plan. One is pragmatic, if timelines turned out to be long, which they might. Second, I think there is a lot of value in the best and most powerful AI being out there impacting the world. I think this is a meaningfully valuable thing.

Dwarkesh Patel 00:43:48

So then why is your default plan to straight shot superintelligence? Because it sounds like OpenAI, Anthropic, all these other companies, their explicit thinking is, “Look, we have weaker and weaker intelligences that the public can get used to and prepare for.” Why is it potentially better to build a superintelligence directly?

Ilya Sutskever 00:44:08

I’ll make the case for and against. The case for is that one of the challenges that people face when they’re in the market is that they have to participate in the rat race. The rat race is quite difficult in that it exposes you to difficult trade-offs which you need to make. It is nice to say, “We’ll insulate ourselves from all this and just focus on the research and come out only when we are ready, and not before.” But the counterpoint is valid too, and those are opposing forces. The counterpoint is, “Hey, it is useful for the world to see powerful AI. It is useful for the world to see powerful AI because that’s the only way you can communicate it.”

Dwarkesh Patel 00:44:57

Well, I guess not even just that you can communicate the idea—

Ilya Sutskever 00:45:00

Communicate the AI, not the idea. Communicate the AI.

Dwarkesh Patel 00:45:04

What do you mean, “communicate the AI”?

Ilya Sutskever 00:45:06

Let’s suppose you write an essay about AI, and the essay says, “AI is going to be this, and AI is going to be that, and it’s going to be this.” You read it and you say, “Okay, this is an interesting essay.” Now suppose you see an AI doing this, an AI doing that. It is incomparable. Basically I think that there is a big benefit from AI being in the public, and that would be a reason for us to not be quite straight shot.

Dwarkesh Patel 00:45:37

I guess it’s not even that, but I do think that is an important part of it. The other big thing is that I can’t think of another discipline in human engineering and research where the end artifact was made safer mostly through just thinking about how to make it safe. Why are airplane crashes per mile so much lower today than they were decades ago? Why is it so much harder to find a bug in Linux than it would have been decades ago? I think it’s mostly because these systems were deployed to the world. You noticed failures, those failures were corrected and the systems became more robust.

I’m not sure why AGI and superhuman intelligence would be any different, especially given—and I hope we’re going to get to this—it seems like the harms of superintelligence are not just about having some malevolent paper clipper out there. But this is a really powerful thing and we don’t even know how to conceptualize how people interact with it, what people will do with it. Having gradual access to it seems like a better way to maybe spread out the impact of it and to help people prepare for it.

00:46:47 – SSI’s model will learn from deployment

Ilya Sutskever 00:46:47

Well I think on this point, even in the straight shot scenario, you would still do a gradual release of it, that’s how I would imagine it. Gradualism would be an inherent component of any plan. It’s just a question of what is the first thing that you get out of the door. That’s number one.

Number two, I believe you have advocated for continual learning more than other people, and I actually think that this is an important and correct thing. Here is why. I’ll give you another example of how language affects thinking. In this case, it will be two words that have shaped everyone’s thinking, I maintain. First word: AGI. Second word: pre-training. Let me explain.

The term AGI, why does this term exist? It’s a very particular term. Why does it exist? There’s a reason. The reason that the term AGI exists is, in my opinion, not so much because it’s a very important, essential descriptor of some end state of intelligence, but because it is a reaction to a different term that existed, and the term is narrow AI. If you go back to ancient history of gameplay and AI, of checkers AI, chess AI, computer games AI, everyone would say, look at this narrow intelligence. Sure, the chess AI can beat Kasparov, but it can’t do anything else. It is so narrow, artificial narrow intelligence. So in response, as a reaction to this, some people said, this is not good. It is so narrow. What we need is general AI, an AI that can just do all the things. That term just got a lot of traction.

The second thing that got a lot of traction is pre-training, specifically the recipe of pre-training. I think the way people do RL now is maybe undoing the conceptual imprint of pre-training. But pre-training had this property. You do more pre-training and the model gets better at everything, more or less uniformly. General AI. Pre-training gives AGI.

But the thing that happened with AGI and pre-training is that in some sense they overshot the target. If you think about the term “AGI”, especially in the context of pre-training, you will realize that a human being is not an AGI. Yes, there is definitely a foundation of skills, but a human being lacks a huge amount of knowledge. Instead, we rely on continual learning.

So when you think about, “Okay, so let’s suppose that we achieve success and we produce some kind of safe superintelligence.” The question is, how do you define it? Where on the curve of continual learning is it going to be?

I produce a superintelligent 15-year-old that’s very eager to go. They don’t know very much at all, a great student, very eager. You go and be a programmer, you go and be a doctor, go and learn. So you could imagine that the deployment itself will involve some kind of a learning trial-and-error period. It’s a process, as opposed to you dropping the finished thing.

Dwarkesh Patel 00:50:45

I see. You’re suggesting that the thing you’re pointing out with superintelligence is not some finished mind which knows how to do every single job in the economy. Because the way, say, the original OpenAI charter or whatever defines AGI is like, it can do every single job, every single thing a human can do. You’re proposing instead a mind which can learn to do every single job, and that is superintelligence.

Ilya Sutskever 00:51:15

Yes.

Dwarkesh Patel 00:51:16

But once you have the learning algorithm, it gets deployed into the world the same way a human laborer might join an organization.

Ilya Sutskever 00:51:25

Exactly.

Dwarkesh Patel 00:51:26

It seems like one of these two things might happen, maybe neither of these happens. One, this super-efficient learning algorithm becomes superhuman, becomes as good as you and potentially even better, at the task of ML research. As a result the algorithm itself becomes more and more superhuman.

The other is, even if that doesn’t happen, suppose you have a single model—this is explicitly your vision—whose instances are deployed through the economy doing different jobs, learning how to do those jobs, continually learning on the job, picking up all the skills that any human could pick up, but picking them all up at the same time, and then amalgamating their learnings. You basically have a model which functionally becomes superintelligent even without any sort of recursive self-improvement in software. Because you now have one model that can do every single job in the economy and humans can’t merge our minds in the same way. So do you expect some sort of intelligence explosion from broad deployment?

Ilya Sutskever 00:52:30

I think that it is likely that we will have rapid economic growth. I think with broad deployment, there are two arguments you could make which are conflicting. One is that once indeed you get to a point where you have an AI that can learn to do things quickly and you have many of them, then there will be a strong force to deploy them in the economy unless there will be some kind of a regulation that stops it, which by the way there might be.

But the idea of very rapid economic growth for some time, I think it’s very possible from broad deployment. The question is how rapid it’s going to be. I think this is hard to know because on the one hand you have this very efficient worker. On the other hand, the world is just really big and there’s a lot of stuff, and that stuff moves at a different speed. But then on the other hand, now the AI could… So I think very rapid economic growth is possible. We will see all kinds of things like different countries with different rules and the ones which have the friendlier rules, the economic growth will be faster. Hard to predict.

00:55:07 – Alignment

Dwarkesh Patel 00:55:07

It seems to me that this is a very precarious situation to be in. In the limit, we know that this should be possible. If you have something that is as good as a human at learning, but which can merge its brains—merge different instances in a way that humans can’t merge—already, this seems like a thing that should physically be possible. Humans are possible, digital computers are possible. You just need both of those combined to produce this thing.

It also seems this kind of thing is extremely powerful. Economic growth is one way to put it. A Dyson sphere is a lot of economic growth. But another way to put it is that you will have, in potentially a very short period of time... You hire people at SSI, and in six months, they’re net productive, probably. A human learns really fast, and this thing is becoming smarter and smarter very fast. How do you think about making that go well? Why is SSI positioned to do that well? What is SSI’s plan there, is basically what I’m trying to ask.

Ilya Sutskever 00:56:10

One of the ways in which my thinking has been changing is that I now place more importance on AI being deployed incrementally and in advance. One very difficult thing about AI is that we are talking about systems that don’t yet exist and it’s hard to imagine them.

I think that one of the things that’s happening is that in practice, it’s very hard to feel the AGI. It’s very hard to feel the AGI. We can talk about it, but imagine having a conversation about what it is like to be old when you’re old and frail. You can have a conversation, you can try to imagine it, but it’s just hard, and you come back to reality where that’s not the case. I think that a lot of the issues around AGI and its future power stem from the fact that it’s very difficult to imagine. Future AI is going to be different. It’s going to be powerful. Indeed, the whole problem, what is the problem of AI and AGI? The whole problem is the power. The whole problem is the power.

When the power is really big, what’s going to happen? One of the ways in which I’ve changed my mind over the past year—and that change of mind, I’ll hedge a little bit, may back-propagate into the plans of our company—is that if it’s hard to imagine, what do you do? You’ve got to be showing the thing. You’ve got to be showing the thing. I maintain that most people who work on AI also can’t imagine it because it’s too different from what people see on a day-to-day basis.

I do maintain, here’s something which I predict will happen. This is a prediction. I maintain that as AI becomes more powerful, people will change their behaviors. We will see all kinds of unprecedented things which are not happening right now. I’ll give some examples. I think for better or worse, the frontier companies will play a very important role in what happens, as will the government. The kind of things that I think you’ll see, which you see the beginnings of, are companies that are fierce competitors starting to collaborate on AI safety. You may have seen OpenAI and Anthropic doing a first small step, but that did not exist. That’s something which I predicted in one of my talks about three years ago, that such a thing will happen. I also maintain that as AI continues to become more powerful, more visibly powerful, there will also be a desire from governments and the public to do something. I think this is a very important force, of showing the AI.

That’s number one. Number two, okay, so the AI is being built. What needs to be done? One thing that I maintain that will happen is that right now, people who are working on AI, I maintain that the AI doesn’t feel powerful because of its mistakes. I do think that at some point the AI will start to feel powerful actually. I think when that happens, we will see a big change in the way all AI companies approach safety. They’ll become much more paranoid. I say this as a prediction that we will see happen. We’ll see if I’m right. But I think this is something that will happen because they will see the AI becoming more powerful. Everything that’s happening right now, I maintain, is because people look at today’s AI and it’s hard to imagine the future AI.

There is a third thing which needs to happen. I’m talking about it in broader terms, not just from the perspective of SSI because you asked me about our company. The question is, what should the companies aspire to build? What should they aspire to build? There has been one big idea that everyone has been locked into, which is the self-improving AI. Why did it happen? Because there are fewer ideas than companies. But I maintain that there is something that’s better to build, and I think that everyone will want that.

It’s the AI that’s robustly aligned to care about sentient life specifically. I think in particular, there’s a case to be made that it will be easier to build an AI that cares about sentient life than an AI that cares about human life alone, because the AI itself will be sentient. And if you think about things like mirror neurons and human empathy for animals, which you might argue is not big enough, but it exists. I think it’s an emergent property from the fact that we model others with the same circuit that we use to model ourselves, because that’s the most efficient thing to do.

Dwarkesh Patel 01:02:06

So even if you got an AI to care about sentient beings—and it’s not actually clear to me that that’s what you should try to do if you solved alignment—it would still be the case that most sentient beings will be AIs. There will be trillions, eventually quadrillions, of AIs. Humans will be a very small fraction of sentient beings. So it’s not clear to me if the goal is some kind of human control over this future civilization, that this is the best criterion.

Ilya Sutskever 01:02:37

It’s true. It’s possible it’s not the best criterion. I’ll say two things. Number one, care for sentient life, I think there is merit to it. It should be considered. I think it would be helpful if there was some kind of short list of ideas that the companies, when they are in this situation, could use. That’s number two.

Number three, I think it would be really materially helpful if the power of the most powerful superintelligence was somehow capped because it would address a lot of these concerns. The question of how to do it, I’m not sure, but I think that would be materially helpful when you’re talking about really, really powerful systems.

Dwarkesh Patel 01:03:35

Before we continue the alignment discussion, I want to double-click on that. How much room is there at the top? How do you think about superintelligence? Do you think, using this learning efficiency idea, maybe it is just extremely fast at learning new skills or new knowledge? Does it just have a bigger pool of strategies? Is there a single cohesive “it” in the center that’s more powerful or bigger? If so, do you imagine that this will be sort of godlike in comparison to the rest of human civilization, or does it just feel like another agent, or another cluster of agents?

Ilya Sutskever 01:04:10

This is an area where different people have different intuitions. I think it will be very powerful, for sure. What I think is most likely to happen is that there will be multiple such AIs being created roughly at the same time. I think that if the cluster is big enough—like if the cluster is literally continent-sized—that thing could be really powerful, indeed. If you literally have a continent-sized cluster, those AIs can be very powerful. All I can tell you is that if you’re talking about extremely powerful AIs, truly dramatically powerful, it would be nice if they could be restrained in some ways or if there were some kind of agreement or something.

What is the concern of superintelligence? What is one way to explain the concern? If you imagine a system that is sufficiently powerful, really sufficiently powerful—and you could say you need to do something sensible like care for sentient life in a very single-minded way—we might not like the results. That’s really what it is.

Maybe, by the way, the answer is that you do not build an RL agent in the usual sense. I’ll point several things out. I think human beings are semi-RL agents. We pursue a reward, and then the emotions or whatever make us tire out of the reward and we pursue a different reward. The market is a very short-sighted kind of agent. Evolution is the same. Evolution is very intelligent in some ways, but very dumb in other ways. The government has been designed to be a never-ending fight between three parts, which has an effect. So I think things like this.

Another thing that makes this discussion difficult is that we are talking about systems that don’t exist, that we don’t know how to build. That’s the other thing and that’s actually my belief. I think what people are doing right now will go some distance and then peter out. It will continue to improve, but it will also not be “it”. The “it” we don’t know how to build, and a lot hinges on understanding reliable generalization.

I’ll say another thing. One of the things that you could say about what causes alignment to be difficult is that your ability to learn human values is fragile. Then your ability to optimize them is fragile. You actually learn to optimize them. And can’t you say, “Are these not all instances of unreliable generalization?” Why is it that human beings appear to generalize so much better? What if generalization was much better? What would happen in this case? What would be the effect? But those questions are right now still unanswerable.

Dwarkesh Patel 01:07:21

How does one think about what AI going well looks like? You’ve scoped out how AI might evolve. We’ll have these sort of continual learning agents. AI will be very powerful. Maybe there will be many different AIs. How do you think about lots of continent-sized compute intelligences going around? How dangerous is that? How do we make that less dangerous? And how do we do that in a way that protects an equilibrium where there might be misaligned AIs out there and bad actors out there?

Ilya Sutskever 01:07:58

Here’s one reason why I liked “AI that cares for sentient life”. We can debate on whether it’s good or bad. But if the first N of these dramatic systems do care for, love, humanity or something, care for sentient life, obviously this also needs to be achieved. This needs to be achieved. So if this is achieved by the first N of those systems, then I can see it go well, at least for quite some time.

Then there is the question of what happens in the long run. How do you achieve a long-run equilibrium? I think that there, there is an answer as well. I don’t like this answer, but it needs to be considered.

In the long run, you might say, “Okay, if you have a world where powerful AIs exist, in the short term, you could say you have universal high income. You have universal high income and we’re all doing well.” But what do the Buddhists say? “Change is the only constant.” Things change. There is some kind of government, political structure thing, and it changes because these things have a shelf life. Some new government thing comes up and it functions, and then after some time it stops functioning. That’s something that we see happening all the time.

So I think for the long-run equilibrium, one approach is that you could say maybe every person will have an AI that will do their bidding, and that’s good. If that could be maintained indefinitely, that’s true. But the downside with that is then the AI goes and earns money for the person and advocates for their needs in the political sphere, and maybe then writes a little report saying, “Okay, here’s what I’ve done, here’s the situation,” and the person says, “Great, keep it up.” But the person is no longer a participant. Then you can say that’s a precarious place to be in.

I’m going to preface by saying I don’t like this solution, but it is a solution. The solution is if people become part-AI with some kind of Neuralink++. Because what will happen as a result is that now the AI understands something, and we understand it too, because now the understanding is transmitted wholesale. So now if the AI is in some situation, you are involved in that situation yourself fully. I think this is the answer to the equilibrium.

Dwarkesh Patel 01:10:47

I wonder if the fact that emotions which were developed millions—or in many cases, billions—of years ago in a totally different environment are still guiding our actions so strongly is an example of alignment success.

To spell out what I mean—I don’t know whether it’s more accurate to call it a value function or reward function—but the brainstem has a directive where it’s saying, “Mate with somebody who’s more successful.” The cortex is the part that understands what success means in the modern context. But the brainstem is able to align the cortex and say, “However you recognize success to be—and I’m not smart enough to understand what that is—you’re still going to pursue this directive.”

Ilya Sutskever 01:11:36

I think there’s a more general point. I think it’s actually really mysterious how evolution encodes high-level desires. It’s pretty easy to understand how evolution would endow us with the desire for food that smells good because smell is a chemical, so just pursue that chemical. It’s very easy to imagine evolution doing that thing.

But evolution also has endowed us with all these social desires. We really care about being seen positively by society. We care about being in good standing. All these social intuitions that we have, I feel strongly that they’re baked in. I don’t know how evolution did it because it’s a high-level concept that’s represented in the brain.

Let’s say you care about some social thing, it’s not a low-level signal like smell. It’s not something for which there is a sensor. The brain needs to do a lot of processing to piece together lots of bits of information to understand what’s going on socially. Somehow evolution said, “That’s what you should care about.” How did it do it?

It did it quickly, too. All these sophisticated social things that we care about, I think they evolved pretty recently. Evolution had an easy time hard-coding this high-level desire. I’m unaware of a good hypothesis for how it’s done. I had some ideas I was kicking around, but none of them are satisfying.

Dwarkesh Patel 01:13:26

What’s especially impressive is that if it were a desire that you learned in your lifetime, it would make sense, because your brain is intelligent. It makes sense why you would be able to learn intelligent desires. Maybe this is not your point, but one way to understand it is that the desire is built into the genome, and the genome is not intelligent. But you’re somehow able to describe this feature. It’s not even clear how you define that feature, and you can build it into the genes.

Ilya Sutskever 01:13:55

Essentially, or maybe I’ll put it differently. If you think about the tools that are available to the genome, it says, “Okay, here’s a recipe for building a brain.” You could say, “Here is a recipe for connecting the dopamine neurons to the smell sensor.” And if the smell is a certain kind of good smell, you want to eat that.

I could imagine the genome doing that. I’m claiming that it is harder to imagine. It’s harder to imagine the genome saying you should care about some complicated computation that your entire brain, a big chunk of your brain, does. That’s all I’m claiming. I can tell you a speculation of how it could be done. Let me offer a speculation, and I’ll explain why the speculation is probably false.

So the brain has brain regions. We have our cortex. It has all those brain regions. The cortex is uniform, but the brain regions and the neurons in the cortex kind of speak to their neighbors mostly. That explains why you get brain regions. Because if you want to do some kind of speech processing, all the neurons that do speech need to talk to each other. And because neurons can only speak to their nearby neighbors, for the most part, it has to be a region.

All the regions are mostly located in the same place from person to person. So maybe evolution hard-coded literally a location on the brain. So it says, “Oh, when the GPS coordinates of the brain such and such, when that fires, that’s what you should care about.” Maybe that’s what evolution did because that would be within the toolkit of evolution.

Dwarkesh Patel 01:15:35

Yeah, although there are examples where, for example, people who are born blind have that area of their cortex adopted by another sense. I have no idea, but I’d be surprised if the desires or the reward functions which require a visual signal no longer worked for people who have different areas of their cortex co-opted.

For example, if you no longer have vision, can you still feel the sense that “I want people around me to like me” and so forth, for which there are usually also visual cues?

Ilya Sutskever 01:16:12

I fully agree with that. I think there’s an even stronger counterargument to this theory. There are people who get half of their brains removed in childhood, and they still have all their brain regions. But they all somehow move to just one hemisphere, which suggests that the brain regions, their location is not fixed and so that theory is not true.

It would have been cool if it was true, but it’s not. So I think that’s a mystery. But it’s an interesting mystery. The fact is that somehow evolution was able to endow us to care about social stuff very, very reliably. Even people who have all kinds of strange mental conditions and deficiencies and emotional problems tend to care about this also.

01:18:13 – “We are squarely an age of research company”

Dwarkesh Patel 01:18:13

What is SSI planning on doing differently? Presumably your plan is to be one of the frontier companies when this time arrives. Presumably you started SSI because you’re like, “I think I have a way of approaching how to do this safely in a way that the other companies don’t.” What is that difference?

Ilya Sutskever 01:18:36

The way I would describe it is that there are some ideas that I think are promising and I want to investigate them and see if they are indeed promising or not. It’s really that simple. It’s an attempt. If the ideas turn out to be correct—these ideas that we discussed around understanding generalization—then I think we will have something worthy.

Will they turn out to be correct? We are doing research. We are squarely an “age of research” company. We are making progress. We’ve actually made quite good progress over the past year, but we need to keep making more progress, more research. That’s how I see it. I see it as an attempt to be a voice and a participant.

Dwarkesh Patel 01:19:29

Your cofounder and previous CEO left to go to Meta recently, and people have asked, “Well, if there were a lot of breakthroughs being made, that seems like a thing that should have been unlikely.” I wonder how you respond.

Ilya Sutskever 01:19:45

For this, I will simply remind you of a few facts that may have been forgotten. I think these facts, which provide the context, explain the situation. The context was that we were fundraising at a $32 billion valuation, and then Meta came in and offered to acquire us, and I said no. But my former cofounder in some sense said yes. As a result, he also was able to enjoy a lot of near-term liquidity, and he was the only person from SSI to join Meta.

Dwarkesh Patel 01:20:27

It sounds like SSI’s plan is to be a company that is at the frontier when you get to this very important period in human history where you have superhuman intelligence. You have these ideas about how to make superhuman intelligence go well. But other companies will be trying their own ideas. What distinguishes SSI’s approach to making superintelligence go well?

Ilya Sutskever 01:20:49

The main thing that distinguishes SSI is its technical approach. We have a different technical approach that I think is worthy and we are pursuing it.

I maintain that in the end there will be a convergence of strategies. I think there will be a convergence of strategies where at some point, as AI becomes more powerful, it’s going to become more or less clear to everyone what the strategy should be. It should be something like, you need to find some way to talk to each other and you want your first actual real superintelligent AI to be aligned and somehow care for sentient life, care for people, democratic, one of those, some combination thereof.

I think this is the condition that everyone should strive for. That’s what SSI is striving for. I think that this time, if not already, all the other companies will realize that they’re striving towards the same thing. We’ll see. I think that the world will truly change as AI becomes more powerful. I think things will be really different and people will be acting really differently.

Dwarkesh Patel 01:22:14

Speaking of forecasts, what are your forecasts for this system you’re describing, which can learn as well as a human and subsequently, as a result, become superhuman?

Ilya Sutskever 01:22:26

I think like 5 to 20.

Dwarkesh Patel 01:22:28

5 to 20 years?

Ilya Sutskever 01:22:29

Mhm.

Dwarkesh Patel 01:22:30

I just want to unroll how you might see the world coming. It’s like, we have a couple more years where these other companies are continuing the current approach and it stalls out. “Stalls out” here meaning they earn no more than low hundreds of billions in revenue? How do you think about what stalling out means?

Ilya Sutskever 01:22:49

I think stalling out will look like…it will all look very similar among all the different companies. It could be something like this. I’m not sure because I think even with stalling out, I think these companies could make a stupendous revenue. Maybe not profits because they will need to work hard to differentiate themselves from each other, but revenue definitely.

Dwarkesh Patel 01:23:20

But something in your model implies that when the correct solution does emerge, there will be convergence between all the companies. I’m curious why you think that’s the case.

Ilya Sutskever 01:23:32

I was talking more about convergence on their alignment strategies. I think eventual convergence on the technical approach is probably going to happen as well, but I was alluding to convergence to the alignment strategies. What exactly is the thing that should be done?

Dwarkesh Patel 01:23:46

I just want to better understand how you see the future unrolling. Currently, we have these different companies, and you expect their approach to continue generating revenue but not get to this human-like learner. So now we have these different forks of companies. We have you, we have Thinking Machines, there’s a bunch of other labs. Maybe one of them figures out the correct approach. But then the release of their product makes it clear to other people how to do this thing.

Ilya Sutskever 01:24:09

I think it won’t be clear how to do it, but it will be clear that something different is possible, and that is information. People will then be trying to figure out how that works. I do think though that one of the things not addressed here, not discussed, is that with each increase in the AI’s capabilities, I think there will be some kind of changes, but I don’t know exactly which ones, in how things are being done. I think it’s going to be important, yet I can’t spell out what that is exactly.

Dwarkesh Patel 01:24:49

By default, you would expect the company that has that model to be getting all these gains because they have the model that has the skills and knowledge that it’s building up in the world. What is the reason to think that the benefits of that would be widely distributed and not just end up at whatever model company gets this continuous learning loop going first?

Ilya Sutskever 01:25:13

Here is what I think is going to happen. Number one, let’s look at how things have gone so far with the AIs of the past. One company produced an advance and the other company scrambled and produced some similar things after some amount of time and they started to compete in the market and push the prices down. So I think from the market perspective, something similar will happen there as well.

We are talking about the good world, by the way. What’s the good world? It’s where we have these powerful human-like learners that are also… By the way, maybe there’s another thing we haven’t discussed on the spec of the superintelligent AI that I think is worth considering. It’s that you make it narrow, it can be useful and narrow at the same time. You can have lots of narrow superintelligent AIs.

But suppose you have many of them and you have some company that’s producing a lot of profits from it. Then you have another company that comes in and starts to compete. The way the competition is going to work is through specialization. Competition loves specialization. You see it in the market, you see it in evolution as well. You’re going to have lots of different niches and you’re going to have lots of different companies who are occupying different niches. In this world we might say one AI company is really quite a bit better at some area of really complicated economic activity and a different company is better at another area. And the third company is really good at litigation.

Dwarkesh Patel 01:27:18

Isn’t this contradicted by what human-like learning implies? It’s that it can learn…

Ilya Sutskever 01:27:21

It can, but you have accumulated learning. You have a big investment. You spent a lot of compute to become really, really good, really phenomenal at this thing. Someone else spent a huge amount of compute and a huge amount of experience to get really good at some other thing. You apply a lot of human learning to get there, but now you are at this high point where someone else would say, “Look, I don’t want to start learning what you’ve learned.”

Dwarkesh Patel 01:27:48

I guess that would require many different companies to begin at the human-like continual learning agent at the same time so that they can start their different tree search in different branches. But if one company gets that agent first, or gets that learner first, it does then seem like… Well, if you just think about every single job in the economy, having an instance learning each one seems tractable for a company.

Ilya Sutskever 01:28:19

That’s a valid argument. My strong intuition is that it’s not how it’s going to go. The argument says it will go this way, but my strong intuition is that it will not go this way. In theory, there is no difference between theory and practice. In practice, there is. I think that’s going to be one of those.

Dwarkesh Patel 01:28:41

A lot of people’s models of recursive self-improvement literally, explicitly state we will have a million Ilyas in a server that are coming up with different ideas, and this will lead to a superintelligence emerging very fast.

Do you have some intuition about how parallelizable the thing you are doing is? What are the gains from making copies of Ilya?

Ilya Sutskever 01:29:02

I don’t know. I think there’ll definitely be diminishing returns because you want people who think differently rather than the same. If there were literal copies of me, I’m not sure how much more incremental value you’d get. People who think differently, that’s what you want.

01:29:23 – Self-play and multi-agent

Dwarkesh Patel 01:29:23

Why is it that if you look at different models, even released by totally different companies trained on potentially non-overlapping datasets, it’s actually crazy how similar LLMs are to each other?

Ilya Sutskever 01:29:38

Maybe the datasets are not as non-overlapping as it seems.

Dwarkesh Patel 01:29:41

But there’s some sense in which even if an individual human might be less productive than the future AI, maybe there’s something to the fact that human teams have more diversity than teams of AIs might have. How do we elicit meaningful diversity among AIs? I think just raising the temperature just results in gibberish. You want something more like different scientists have different prejudices or different ideas. How do you get that kind of diversity among AI agents?

Ilya Sutskever 01:30:06

So the reason there has been no diversity, I believe, is because of pre-training. All the pre-trained models are pretty much the same because they pre-train on the same data. Now RL and post-training is where some differentiation starts to emerge because different people come up with different RL training.

Dwarkesh Patel 01:30:26

I’ve heard you hint in the past about self-play as a way to either get data or match agents to other agents of equivalent intelligence to kick off learning. How should we think about why there are no public proposals of this kind of thing working with LLMs?

Ilya Sutskever 01:30:49

I would say there are two things to say. The reason why I thought self-play was interesting is because it offered a way to create models using compute only, without data. If you think that data is the ultimate bottleneck, then using compute only is very interesting. So that’s what makes it interesting.

The thing is that self-play, at least the way it was done in the past—when you have agents which somehow compete with each other—it’s only good for developing a certain set of skills. It is too narrow. It’s only good for negotiation, conflict, certain social skills, strategizing, that kind of stuff. If you care about those skills, then self-play will be useful.

Actually, I think that self-play did find a home, but just in a different form. So things like debate, prover-verifier, you have some kind of an LLM-as-a-Judge which is also incentivized to find mistakes in your work. You could say this is not exactly self-play, but this is a related adversarial setup that people are doing, I believe.

Really self-play is a special case of more general competition between agents. The natural response to competition is to try to be different. So if you were to put multiple agents together and you tell them, “You all need to work on some problem and you are an agent and you’re inspecting what everyone else is working,” they’re going to say, “Well, if they’re already taking this approach, it’s not clear I should pursue it. I should pursue something differentiated.” So I think something like this could also create an incentive for a diversity of approaches.

01:32:42 – Research taste

Dwarkesh Patel 01:32:42

Final question: What is research taste? You’re obviously the person in the world who is considered to have the best taste in doing research in AI. You were the co-author on the biggest things that have happened in the history of deep learning, from AlexNet to GPT-3 to so on. What is it, how do you characterize how you come up with these ideas?

Ilya Sutskever 01:33:14

I can comment on this for myself. I think different people do it differently. One thing that guides me personally is an aesthetic of how AI should be, by thinking about how people are, but thinking correctly. It’s very easy to think about how people are incorrectly, but what does it mean to think about people correctly?

I’ll give you some examples. The idea of the artificial neuron is directly inspired by the brain, and it’s a great idea. Why? Because you say the brain has all these different organs, it has the folds, but the folds probably don’t matter. Why do we think that the neurons matter? Because there are many of them. It kind of feels right, so you want the neuron. You want some local learning rule that will change the connections between the neurons. It feels plausible that the brain does it.

The idea of the distributed representation. The idea that the brain responds to experience therefore our neural net should learn from experience. The brain learns from experience, the neural net should learn from experience. You kind of ask yourself, is something fundamental or not fundamental? How things should be.

I think that’s been guiding me a fair bit, thinking from multiple angles and looking for almost beauty, beauty and simplicity. Ugliness, there’s no room for ugliness. It’s beauty, simplicity, elegance, correct inspiration from the brain. All of those things need to be present at the same time. The more they are present, the more confident you can be in a top-down belief.

The top-down belief is the thing that sustains you when the experiments contradict you. Because if you trust the data all the time, well sometimes you can be doing the correct thing but there’s a bug. But you don’t know that there is a bug. How can you tell that there is a bug? How do you know if you should keep debugging or you conclude it’s the wrong direction? It’s the top-down. You can say things have to be this way. Something like this has to work, therefore we’ve got to keep going. That’s the top-down, and it’s based on this multifaceted beauty and inspiration by the brain.

Dwarkesh Patel 01:35:31

Alright, we’ll leave it there.

Ilya Sutskever 01:35:33

Thank you so much.

Dwarkesh Patel 01:35:34

Ilya, thank you so much.

Ilya Sutskever 01:35:36

Alright. Appreciate it.

Dwarkesh Patel 01:35:37

That was great.

Ilya Sutskever 01:35:38

Yeah, I enjoyed it.

Dwarkesh Patel 01:35:39

Yes, me too.

RL is even more information inefficient than you thought

Dwarkesh Patel — Mon, 17 Nov 2025 16:54:09 GMT

Recently, people have been talking about how it takes way more FLOPs to get a single sample in RL than it does in supervised learning. In pretraining, you get a signal on every single token you train on. In RL, you have to unroll a whole thinking trajectory that’s tens of thousands of tokens long in order to get a single reward signal at the end (for example: did the unit test for my code pass, did I get the right answer to this math problem, etc.).

But this is only half the problem. Here’s a simple way to compare the learning efficiency of reinforcement learning versus supervised learning:

Bits/FLOP = Samples/FLOP * Bits/Sample.

What I haven’t heard people talk about is the other term in our equation: Bits/Sample. And for most of training, the information density per sample is way way lower for RL.

Putting things in plain English

In supervised learning (aka pretraining), you’re just soaking up bits. Every token is a hint at the structure of language, and the mind crafting that language, and the world that mind is seeing. Early in training, when you have a totally random model, you’re just maximally uncertain over all of this content. So each token is just blowing your mind. And you’re getting this exact signal of how wrong you were about the right answer, and what parameters you need to update to be less wrong.

Suppose you start with a randomly initialized model, and you kickstart training. If you’re doing next-token-prediction using supervised learning on “The sky is”, the training loop goes, “It’s actually ‘blue’. You said the probability of ‘blue’ is .001%. Make the connections that were suggesting ‘blue’ way way stronger. Alright, next token.”

In RL with policy gradient, you upweight all the trajectories where you get the answer right, and downweight all the trajectories where you get the answer wrong. But a model that’s not already very smart is just astonishingly unlikely to get the answer right.

If you were doing next-token-prediction on “The sky is” with RL, the training loop would be something like, “Okay, ‘halcyon’ is wrong. Don’t do the thing that led to saying ‘halcyon’ … Okay ‘serendipity’ is wrong …” Rinse and repeat this guesswork for somewhere around the number of tokens you have in your vocabulary (on the order of 100,000).

The details

Let’s think about how the maximum bits/sample changes as the pass rate (p) changes. Pass rate here means how likely you are to say the correct answer. To keep this simple, let’s say the answer is one token long. Then the pass rate when you have a totally untrained model is just 1/(size of your vocabulary).

In supervised learning, you get told exactly what the right label is for each sample. The amount of new information you learn corresponds to how surprised you are to learn the correct answer - the lower your pass rate (aka prior probability of the correct answer), the more you learned from seeing the correct label. The basic formula for entropy tells us that you can learn -log(p) bits/sample from supervised learning, with log taken base 2 so that the units are bits.

In RL, you only get told whether you got the right answer or not. The amount of new information you can extract is bounded by how uncertain you are about this binary outcome. If you almost always pass (p ≈ 1) or almost always fail (p ≈ 0), each trial is very unlikely to surprise you. You’ll learn most when the probability of passing is like a coin toss (p ≈ 0.5). The basic formula for the information content of a binary random variable tells us that you can learn at most Entropy(p) = -p log(p) - (1-p) log(1-p) bits/sample from RL.[1]
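
To make the two formulas concrete, here is a minimal Python sketch that evaluates both at a few pass rates. The function names `supervised_bits` and `rl_bits` and the vocabulary size of 100,000 are illustrative choices, not taken from any library; logs are base 2 so the results are in bits.

```python
import math

def supervised_bits(p: float) -> float:
    """Information from being told the correct label: -log2(p) bits."""
    return -math.log2(p)

def rl_bits(p: float) -> float:
    """Upper bound from a pass/fail reward: binary entropy of p, in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain outcome carries no information
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

VOCAB_SIZE = 100_000  # illustrative; matches the order of magnitude used in the text

# Compare the two quantities from an untrained model (p = 1/vocab) up to p = 0.99.
for p in (1 / VOCAB_SIZE, 0.01, 0.25, 0.5, 0.75, 0.99):
    print(f"p={p:<8.5g} supervised={supervised_bits(p):8.4f}  rl<={rl_bits(p):8.4f}")
```

At p = 0.5 both give exactly 1 bit. Below that, the supervised label carries more information; above it, the bound on the binary outcome is the larger of the two.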

Okay let’s plot this.

Doesn’t look terrible. Yes, pretraining is much better for half of the pass rate range, but then RL is better for the other half. However, this graph is super misleading. Because what the power law (in scaling laws) implies is that you need an equivalent amount of compute to cross each order of magnitude improvement in the pass rate. If it took you X many FLOPs to go from 1/100,000 pass rate to 1/10,000, then it will take you X many FLOPs to go from 1/10,000 pass rate to 1/1,000. So, we should actually chart the pass rate on a log scale - again, to account for how each increment in the x-axis corresponds to the same number of FLOPs.
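
One way to see what the log scale does is to count decades of pass rate. The sketch below assumes training starts at a pass rate of 1/100,000 (the vocabulary-size figure used earlier) and, following the argument above, that each decade of pass rate costs about the same compute. It prints the ratio of RL's bits/sample bound to supervised bits/sample at the start of each decade, then the share of the log-scale range where the RL bound is at least as large. All names are illustrative.

```python
import math

def supervised_bits(p):
    return -math.log2(p)

def rl_bits(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

P_START = 1e-5  # assumed pass rate of an untrained model (1 / vocabulary size)

# Ratio of RL's bound to supervised bits/sample at the start of each decade.
for exponent in range(-5, 0):
    p = 10.0 ** exponent
    print(f"p=1e{exponent}: rl/supervised = {rl_bits(p) / supervised_bits(p):.2e}")

# The RL bound matches or exceeds supervised only for p >= 0.5 (they cross at 1 bit).
# Measured in decades of pass rate, that region is a small slice of the whole range.
total_decades = math.log10(1.0 / P_START)
rl_favourable_decades = math.log10(1.0 / 0.5)
share = rl_favourable_decades / total_decades
print(f"share of log-scale range with p >= 0.5: {share:.1%}")
```

Under these assumptions the region with p ≥ 0.5 covers about 6% of the log-scale axis.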

Oh boy, is that a sad picture. The regime where RL has comparable information density per sample to pre-training is this tiny slice at the very end of training, when you’ve got a pretty reasonable model anyways.

And again, I want to emphasize that this is totally separate from the point that getting a single sample from RL (aka unrolling a full trajectory before getting any signal) might take upwards of a million times more compute.
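
Putting the two terms of the equation together, the sketch below multiplies the Bits/Sample ratio by a Samples/FLOP ratio. `COST_RATIO` is set to one million only because of the "upwards of a million times" figure above; it and the function names are illustrative assumptions, and the output is an upper bound on RL's relative Bits/FLOP in this toy one-token setting, not a measurement.

```python
import math

COST_RATIO = 1e6  # assumed: FLOPs per RL sample / FLOPs per supervised sample

def supervised_bits(p):
    return -math.log2(p)

def rl_bits(p):
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def relative_bits_per_flop(p, cost_ratio=COST_RATIO):
    """(RL Bits/FLOP) / (supervised Bits/FLOP) at pass rate p.

    Bits/FLOP = Samples/FLOP * Bits/Sample, so the ratio is the product
    of the Samples/FLOP ratio (1 / cost_ratio) and the Bits/Sample ratio.
    """
    samples_per_flop_ratio = 1.0 / cost_ratio
    bits_per_sample_ratio = rl_bits(p) / supervised_bits(p)
    return samples_per_flop_ratio * bits_per_sample_ratio

for p in (1e-5, 1e-3, 0.1, 0.5):
    print(f"p={p:g}: RL gets at most {relative_bits_per_flop(p):.1e}x the bits/FLOP")
```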

It’s even worse than this - variance

The situation for RL early in training is actually even worse than described above. When the pass rate is low, your gradient estimate is going to be incredibly noisy and unpredictable. Either you don’t sample the correct answer at all in your batch, in which case you get almost no information. Or you do, and you get this giant spike. You’re getting jerked around, which is terrible for performant training.[2]
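
A rough way to quantify this, assuming rollouts in a batch are independent and each passes with probability p: the chance that a batch contains no passing sample at all is (1 - p)^B, and the batch-average reward has a standard deviation of sqrt((1 - p) / (p * B)) relative to its mean. The reward noise is used here as a simple proxy for gradient noise, and the batch size of 1024 is hypothetical.

```python
import math

def prob_no_success(p: float, batch_size: int) -> float:
    """Probability that none of `batch_size` independent rollouts passes."""
    return (1.0 - p) ** batch_size

def relative_std_of_mean_reward(p: float, batch_size: int) -> float:
    """Std / mean of the batch average of a Bernoulli(p) reward."""
    return math.sqrt((1.0 - p) / (p * batch_size))

BATCH_SIZE = 1024  # hypothetical batch size, chosen for illustration

for p in (1e-5, 1e-3, 0.1, 0.5):
    print(
        f"p={p:g}: P(no pass in batch)={prob_no_success(p, BATCH_SIZE):.3f}, "
        f"relative std of mean reward={relative_std_of_mean_reward(p, BATCH_SIZE):.2f}"
    )
```

At p = 1e-5 roughly 99% of such batches contain no passing rollout, while near p = 0.5 an empty batch is essentially impossible and the relative noise is a few percent.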

Interestingly, pretraining has the exact inverse problem. There, variance is super high at the END of training. As pretraining progresses, you exhaust more and more of the reducible loss (things your model can actually learn about the data). What remains is mostly the irreducible loss. The irreducible loss is the intrinsic unpredictability of internet text.

How should the prompt, “Bob’s favorite color is” end? Depends on Bob. There’s not some correct answer which your super smart model can actually get good at predicting. But your super smart model is still getting a gradient update on whatever random answer someone put on the internet. And this noise is drowning out the true signal that the couple of actually learnable tokens in the batch are giving you. I don’t know if this is accurate, but it seems like this explosion of variance at the end of pretraining is relevant to why batch sizes are increased as pretraining progresses.

Getting to the Goldilocks zone in RL

If RL works best in the regime where your pass rate is >>1%, then this raises the question, how can we construct the RL training to get (and keep) models in this learning flow state?

For example, we can think of pretraining AND inference scaling as increasing the pass rate during RL, allowing you to extract far more bits per sample.

It’s been noted that curriculum learning is not especially helpful for pretraining, but often essential for RL. This makes total sense when you think about how RL is only getting meaningful bits per sample in this Goldilocks zone of pass rate, so you really want to order the learning such that the difficulty of challenges increases in tandem with the model’s intelligence.
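
As a toy illustration of that ordering idea, the sketch below ranks tasks by the binary entropy of their estimated pass rate and picks the most informative ones. This is only an illustration of the pass-rate framing, not a description of how any lab builds curricula; the function names, task names, and pass-rate estimates are all hypothetical.

```python
import math

def binary_entropy(p: float) -> float:
    """Max bits from a pass/fail outcome with pass probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def pick_tasks(estimated_pass_rates: dict[str, float], k: int) -> list[str]:
    """Return the k task names whose pass/fail outcomes are most informative."""
    ranked = sorted(
        estimated_pass_rates,
        key=lambda name: binary_entropy(estimated_pass_rates[name]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical pass-rate estimates, e.g. from a handful of recent rollouts.
pass_rates = {"easy": 0.98, "medium": 0.55, "hard": 0.30, "impossible": 0.0005}
print(pick_tasks(pass_rates, k=2))
```

As the model improves, the estimates shift upward and tasks that were previously too hard move into the selected set.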

Our pass rate framework also gives us good intuitions for why self play has been so productive in the history of RL. If you’re competing against a player who is almost as good as you, you are balancing around a 50% pass rate, which maxes out the bits you can get from a binary random variable.

But self play is not the only way we can imagine to keep the pass rate high through training. Perhaps we can come up with some proxy evaluation which is much more dense. Density here can mean one of two things:

Samples/FLOP density: You estimate the final reward using this proxy evaluation, but much earlier on in the episode, saving you the compute of unrolling the full trajectory. This is what a value function does.
Bits/Sample density: You come up with a proxy objective which is much easier to pass than the actual test under question. The simplest example I can think of is a process-reward model which says, “Hey, this rollout got the wrong answer, but I can see that its reasoning was on the right track at the start. So let’s up-weight those early tokens.”

Section 4.2 of the DeepSeek R1 paper explains why, so far, it’s been hard to develop useful proxy objectives like this for LLMs.

Fewer bits, sure, but very valuable bits

To be fair to RL, while you may be learning far fewer Bits/FLOP in RL, the bits you learn are very important. They are not apples-to-apples comparable to the bits in pretraining. This is for two key reasons:

Pre-training is teaching you what the data manifold of the internet looks like, which is only partially and indirectly related to, “How do I perform economically valuable tasks?” Whereas RL has the promise of giving you the good stuff directly.
Even if the pre-training corpus contains the instructions about how to accomplish a specific task, it does not have the thinking trace which teaches the model about how to correct its mistakes, or leverage its jagged and non-human repertoire of skills to accomplish the task.

The rebuttal is that those bits are only available for a small fraction of the pass rate range (again, weighted on a log scale to account for how pass rate is trash for most of training).

By the way, now we can understand all these claims about how RLVR is only eliciting the capabilities already latent in the pretrained model. Of course that’s the case. If the pretrained model didn’t have a high enough pass rate to begin with, then RL would have atrocious bits/sample, and thus not be able to learn at all. Move 37 is obviously one famous example where RL did teach a model a de-novo strategy. It’s worth noting that AlphaGo was trained on self play (see above re how self play increases pass rate), and that AlphaGo was surprisingly compute intensive for its time.

The jaggedness of RL

People have pointed out that RLVR is empirically just leading models to associate a thought pattern to a problem type rather than instilling a more general policy of stepping back and thinking through the best approach.

Think about it. How is it possible that we have models which are world-class at coding competitions but at the same time leave extremely foreseeable bugs and technical debt all throughout the codebase?

What explains this weird jaggedness? Perhaps RLVR can’t distinguish trajectories that were generated from a more generalizable procedure vs just greedily matching the problem shape to some associated thought process.

When you’re doing policy gradient rollouts, this more complex general policy is extremely unlikely to ever be sampled, whereas the simple heuristic policy does get sampled and grows in frequency until it reaches fixation. Meanwhile, the general policy recedes further and further from sight.

Then the question is, how do we build a short bridge between simple heuristic solutions and the more complex general strategy? And will that bridge just spontaneously emerge as time horizons expand, thus potentially requiring generalization?

My concern is that this general policy of stepping back and making tasteful judgements based on your understanding of the world will continue to be hard to spotlight using verifiable rewards, even on longer time horizon tasks. And so the solution to this jaggedness will require a more robust training procedure, not just scaling RLVR.

Human learning

Here we’re only talking about the bits/sample learned from model-free RL - aka from some binary outcome at the end of an episode. But humans are obviously learning way more efficiently than this. Think about a repeat entrepreneur. We say that she has a ton of hard-won wisdom and experience. Very little of that learning comes from the one bit of outcome from her previous episode (whether the startup succeeded or not).

It’s not clear what the ML analog is for human learning from experience. Clearly, our observations and reflections update our world model (independent of the outcome at the end). And this is playing a very important role in our learning.

Maybe we shouldn’t be asking how we get model-free RL to a ≈50% pass rate, so that we can squeeze out a full drop of information from the outcome. Maybe we should be asking, how do humans wring out the buckets of information from the environment?

[1] Basically, this equation is saying: information learned from a binary outcome = p(sample is correct) * (information gained when sample is correct) + p(sample is incorrect) * (information gained when sample is incorrect).

[2] Thank you to Lukas Berglund for spotting that my previous exposition on this point was incorrect.