<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Dwarkesh Podcast]]></title><description><![CDATA[Deeply researched interviews]]></description><link>https://www.dwarkesh.com</link><image><url>https://substackcdn.com/image/fetch/$s_!QEPJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F90fa9666-5b8b-4685-a8fb-4b64cb7e0333_1080x1080.png</url><title>Dwarkesh Podcast</title><link>https://www.dwarkesh.com</link></image><generator>Substack</generator><lastBuildDate>Thu, 11 Jun 2026 17:44:12 GMT</lastBuildDate><atom:link href="https://www.dwarkesh.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Dwarkesh Patel]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[dwarkesh@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[dwarkesh@substack.com]]></itunes:email><itunes:name><![CDATA[Dwarkesh Patel]]></itunes:name></itunes:owner><itunes:author><![CDATA[Dwarkesh Patel]]></itunes:author><googleplay:owner><![CDATA[dwarkesh@substack.com]]></googleplay:owner><googleplay:email><![CDATA[dwarkesh@substack.com]]></googleplay:email><googleplay:author><![CDATA[Dwarkesh Patel]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The sample efficiency black hole]]></title><description><![CDATA["We see these AIs as a galaxy glittering with capabilities, but at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data."]]></description><link>https://www.dwarkesh.com/p/the-sample-efficiency-black-hole</link><guid 
isPermaLink="false">https://www.dwarkesh.com/p/the-sample-efficiency-black-hole</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Mon, 08 Jun 2026 18:09:37 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cde6bb73-e466-436b-82b6-2f5002929650_1731x909.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>One definition of intelligence is sample efficiency - that is to say, how much data you need to see in a given domain in order to operate fluently and competently. It&#8217;s not clear that we&#8217;ve actually made much progress on training sample efficiency over the last few years - it seems more that we&#8217;ve dramatically widened and improved the data distribution.</p><p>The main way that AIs have been getting better is by adding <a href="https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress">more and better data</a>, and by scaling the compute needed to develop that data in the first place. Obviously RL is the main way that has happened. You can think of RL as a kind of synthetic data generation - you dump a lot of compute against a verifier in order to find the &#8220;good&#8221; data. Then you train your model to predict these correct rollouts, much in the same way that you might train it to predict the next word in internet text.</p><p>For this process to work, the model must already assign at least some prior probability to the correct solution, which is why you also need mind-stretching amounts of human expert trajectories in every single field and skill you want the model to be competent at.</p><p>It&#8217;s hard to overstate how task specific and bespoke this human expert data is. If you want to get some intuition, go read some job descriptions on <a href="https://work.mercor.com/explore">Mercor</a>&#8217;s or Surge&#8217;s websites. 
There are listings for <a href="https://work.mercor.com/explore?listingId=list_AAABnfU_UoKURfBKz-hPsIhO">Word specialists</a> who will convert legacy documents into polished Word files, <a href="https://work.mercor.com/explore?listingId=list_AAABndKk8wND-PxLaqFAq5HH">legal experts</a> who will write realistic M&amp;A diligences or securities filings, <a href="https://work.mercor.com/explore?listingId=list_AAABmu1UcrKMfGwlYNNP64qU">management consultants</a> who will write up template market research, and dozens of other particular categories.</p><p>And it is not only that the data has to be so domain specific - there also has to be so much of it! Each skill corresponds to at least hundreds of human experts who are generating example completions, writing rubrics, and explaining their chain of thought. There&#8217;s a reason that the data industry producing these expert labels (and the RL environments in which their meticulously catalogued skills can congeal) is earning billions a year in revenue, soon deca-billions.</p><p>Imagine if it took a couple of decades&#8217; worth of courses with hundreds of concurrent professors and millions of practice tasks for you to learn how to polish a Word file. Even the task count difference understates the gap - the models also have to grind through each of their far more numerous tasks far harder. Whereas a human student might practice a textbook problem once or twice, GRPO has the model generate hundreds to thousands of rollouts per task. We are building some Frankenstein&#8217;s monster, with a billion grafts of carefully constructed examples sewn together.</p><p>Epoch recently reported that <a href="https://epoch.ai/data-insights/open-closed-eci-gap">open models only lag state-of-the-art closed models by 4 months</a>. I think the reason it is relatively easy for open source and previous laggards to catch up to within months of the frontier is that data is the real driver of progress. 
And data can be easily distilled from public APIs, whereas hyperparameters and training tricks and architectural micro-optimizations cannot - if the latter were driving most of the progress, then catching up would be harder than we observe it to be.</p><p>It is easy to forget how much data these models are trained on, and how much more it is than what we humans see in our lifetimes. We see these AIs as a galaxy glittering with capabilities, but at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data.</p><h3>Intermission: Comparing human and AI sample efficiency</h3><ul><li><p>If a person hears and sees on average ~2,000 words an hour, then from birth to adulthood, they&#8217;ll see ~200 million tokens. By contrast, frontier models are trained on somewhere between 10s and 100s of trillions of tokens. That is close to a million-fold difference.</p></li><li><p>A person can learn to teleoperate any random humanoid or robot arm within hours. The reason robotics isn&#8217;t already a deca-trillion dollar industry, with an endless army of Unitree G1s doing all kinds of useful work in the world, is that our AIs learn so much less efficiently than humans, and even the millions of hours of demonstrations we&#8217;ve collected are not enough to allow them to perform complex, open ended tasks.</p></li><li><p>A teenager can learn to drive a car with about 20 hours of practice. 
Even if you include their ~16 years of accumulated physical intuition as relevant training data, that is at least 3-4 orders of magnitude less than the amount of data Waymo and Tesla have needed to train their self-driving car models.</p></li></ul><p>I want to deal with some common objections to this kind of comparison:</p><ul><li><p>Many billions of years of evolution is our pre-training, so it&#8217;s unfair to compare how little data we see simply within our lifetime to what these cold-started LLMs have to learn from.</p><ul><li><p>Our genome is about 3 billion base pairs, which is under a gigabyte of raw information, and only about 1-2% of it is protein coding. That is just not enough space to store the model parameters that are supposedly pretrained (frontier models are terabytes in size). The closer analogy is probably that evolution has found the right hyperparameters and loss functions (Sidenote: I had an interesting <a href="https://www.dwarkesh.com/p/adam-marblestone">podcast with Adam Marblestone</a> where he argued that the loss functions were the more significant find from evolution), but that the equivalent of parameter training is still happening within lifetime, and is encoded in the map of neural connections in the brain built up over a lifetime.</p></li><li><p>Even if it were the case that we can explain away the trillions of tokens required to pretrain a base model as catching up to evolution, it doesn&#8217;t explain why the marginal capabilities take so much data - once you have been educated, you don&#8217;t need 100 different professors to learn a new programming language, but the AIs (even once pretrained) do.</p></li></ul></li><li><p>These comparisons are not including the multimodal data we see in our lifetimes. 
If you include all this sensory information, we&#8217;re probably in the 10s to 100s of billions of tokens range from birth to adulthood.</p><ul><li><p>Blind/deaf people who are cut off from this kind of sensory information might lack the relevant sense, but still have the same general intelligence as everyone else. This suggests that all these billions of sensory tokens are not really the thing making humans smart.</p></li><li><p>In fact, deaf people who can only communicate via sign language and reading (and not hearing) are ingesting far less than the 200 million language tokens we calculated earlier, and even this is sufficient for them to be fully general intelligences.</p></li></ul></li><li><p>Scaling laws tell us that bigger models are more sample efficient. The human brain has ~100T synapses - if each synapse is ~1 parameter, and frontier models are currently roughly 5T parameters, then maybe we could achieve human-level sample efficiency with another order of magnitude or two of parameter scaling.</p><ul><li><p>The way the scaling law equations work is that the parameter and data terms are added to the loss independently: in the Chinchilla fit, L(N, D) = E + A/N<sup>&#945;</sup> + B/D<sup>&#946;</sup>, where N is the parameter count and D is the number of training tokens. Suppose you have a model that is trained compute optimally, and you ask: what if I just want to maximize sample efficiency and use less data, and I&#8217;ll throw in as many parameters as it takes to make that happen? Sending N to infinity only removes the parameter term; the data term stays. With the constants from the Chinchilla scaling laws paper (and the nature of the result wouldn&#8217;t change even with different constants), even if you increased the number of parameters to infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss. Humans are somewhere between thousands and millions of times more sample efficient than these models. Scaling of current models simply can&#8217;t make up for that discrepancy. 
This really does suggest that humans are on a different scaling curve altogether.</p></li></ul></li></ul><h3>Does sample efficiency matter?</h3><p>But you might ask, why does sample efficiency matter? The labs have two overarching objectives: automate white collar work, and automate AI research itself. Is human level sample efficiency necessary for either?</p><p>The bet with white collar work is that the common tasks that a software engineer or analyst or accountant does are, well, common. And we can bring common tasks into distribution quite easily through RL and SFT. The <a href="https://www.anthropic.com/news/series-h">revenue curves</a> of these AI labs suggest that there is enormous value in bringing tasks into distribution, even if we don&#8217;t replicate human sample efficiency.</p><p>Yes, it is far less efficient to train AIs to do these tasks than it is to train humans. But so what? A human lifespan does not allow for the quantity and breadth of training these models experience. If you as a human had some weird learning disability where you needed to read through every public repository on GitHub before you could be a competent developer, it would not make sense to train you up. You&#8217;d be on Social Security by the early stages of your education, and even once you were trained, you could work on only one project at a time. But AIs can learn these skills by firehosing gigawatts of training at a time. And what they learn can be amortized across billions of sessions, so we can be ludicrously inefficient in training them and still be wildly in the green.</p><p>How much &#8220;out-of-distribution&#8221; thinking do white collar employees need to do that you simply can&#8217;t train for in advance? Well, this is more a question about the nature of different jobs than a question about AI research. 
It also depends on the job - some jobs are mechanical and predictable enough that they were automated long before the modern era of AI, for example bank tellers or travel agents. And there are other jobs which require dealing on a daily basis with problems that are quite distant from the data distribution. Even software engineering (the job AIs are supposed to take first) is one such. I would be willing to bet that there will be overall more demand for human software engineers in 2028 than there is now, largely due to the complementary input of AI.</p><p>The labs&#8217; plan for these latter kinds of jobs is to first automate AI research, and then have the automated AI researchers solve this sample efficiency problem. So then the question is, can AIs, which do not have human-level sample efficiency, nonetheless solve the remaining research problems on the way to human-like intelligence and learning?</p><p>That question I&#8217;ll address in a future blog post - I think the way that people currently think about an intelligence explosion is pretty clumsy. Either people dismiss the possibility of AIs speeding up AI progress altogether, or they just assume that God pops out the other end. People are not reasoning about what extremely rapid progress that starts from LLMs looks like. 
</p>]]></content:encoded></item><item><title><![CDATA[Alex Imas and Phil Trammell – What remains scarce after AGI?]]></title><description><![CDATA[&#8220;One robot now turns into many robots next year, but the number of ballerinas is the same.&#8221;]]></description><link>https://www.dwarkesh.com/p/alex-imas-phil-trammell</link><guid isPermaLink="false">https://www.dwarkesh.com/p/alex-imas-phil-trammell</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Thu, 04 Jun 2026 16:14:49 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/200613353/cc9231d96754b37a8935be6734872dcd.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Economics of AGI episode with <a href="https://www.aleximas.com/">Alex Imas</a> and <a href="https://philiptrammell.com/">Phil Trammell</a>.</p><p>There are a bunch of important questions about how we deal with AI that only economics can answer.</p><p>What is the optimal way to tax and redistribute the wealth that will be generated? How should countries not in the AI supply chain index into the gains? 
Is there any world where inequality doesn&#8217;t explode?</p><p>It might seem like these questions have obvious answers, but the first thing economics teaches you is that your intuitions can often be entirely wrong.</p><p>It was very helpful to chat through these things with Alex and Phil.</p><p>Watch on <a href="https://youtu.be/Jj-kBHzUohs">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/alex-imas-and-phil-trammell-what-remains-scarce-after-agi/id1516093381?i=1000771185825">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/52wp90vqwiRmmQaOm9M2uZ?si=8a81MnA4Tf-X3VUzpzE1qg">Spotify</a>.</p><div id="youtube2-Jj-kBHzUohs" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Jj-kBHzUohs&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Jj-kBHzUohs?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Sponsors</strong></h2><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> invests heavily in turning smart people into exceptional researchers and engineers. In addition to their apprenticeship model, Jane Street runs lectures and bootcamps in their in-office classrooms -- managers clear their teams&#8217; schedules to encourage attendance. If you&#8217;d like to work at a place that takes learning this seriously, Jane Street is hiring. Check out their open roles at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://gemini.google">Google&#8217;s Gemini Omni</a> has incredible video editing capabilities -- you can upload a video and have Omni change the background, adjust lighting, or add specific elements. 
But Omni is also a preview of how future frontier models will be trained -- fully multimodal on both input and output. You can try it yourself in the Gemini app at <a href="https://gemini.google">gemini.google</a> or in Flow at <a href="https://flow.google">flow.google</a></p></li><li><p><a href="https://cursor.com/dwarkesh">Cursor</a> used targeted RL with textual feedback to help train their Composer 2.5 model. One of their researchers, Sasha Rush, gave me an impromptu blackboard lecture to explain how this form of on-policy self-distillation works -- I posted the full thing on X. If you want to try Composer 2.5, go to <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li></ul><h2>Timestamps</h2><p>(00:00:00) &#8211; Will capital share increase?</p><p>(00:19:36) &#8211; Messy Middle scenario</p><p>(00:25:57) &#8211; How to tax and redistribute AI wealth</p><p>(00:30:02) &#8211; Why demand collapse is unlikely</p><p>(00:39:26) &#8211; Human employees would be hard to integrate into the machine economy</p><p>(00:43:08) &#8211; What if some humans (or AIs) value wealth accumulation intrinsically?</p><p>(01:01:28) &#8211; What should developing countries do?</p><h2>Transcript</h2><h3>00:00:00 &#8211; Will capital share increase?</h3><p><strong>Dwarkesh Patel</strong></p><p>Today I&#8217;m chatting with <a href="https://www.aleximas.com/">Alex Imas</a>, who is Director of AGI Economics at <a href="https://en.wikipedia.org/wiki/Google_DeepMind">Google DeepMind</a> and <a href="https://www.chicagobooth.edu/faculty/directory/i/alex-imas">Professor of Economics at the University of Chicago</a>, and <a href="https://philiptrammell.com/">Phil Trammell</a>, who is Head of Economics at <a href="https://epoch.ai/">Epoch</a> and <a href="https://digitaleconomy.stanford.edu/person/philip-trammell/">research scholar at Stanford</a>.</p><p>In general what I want to understand in this interview is what economics tells us about what we can expect in a world with more 
and more automation and more advanced AI. I want to understand what that tells us about what will happen to wages and the labor share, what the best way to tax and redistribute the wealth generated by AGI will be, and what kinds of things will be scarce. What is scarce tells you where the value will accrue.</p><p>I want to start there. What are some plausible candidates of what will be scarce?</p><p><strong>Alex Imas</strong></p><p>Something like the <a href="https://aleximas.substack.com/p/what-will-be-scarce">relational sector</a>, which is defined as services and goods where the fact that a human was in the loop is part of the value of that product. Because humans are naturally scarce, if we have automation where a lot of other things stop being scarce, we will still have scarcity in the things that humans are involved in and in the loop for.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m curious to understand whether humans doing services for other humans can ever be a big part of the economy. Here&#8217;s maybe one intuition pump. In a world where AI can physically do anything humans can do, there&#8217;s this whole machine economy where they&#8217;re building factories and doing research and coming up with new ideas. Humans may or may not be involved in the physical production of those things, but probably not in the ultimate limit, if robotics is solved. If you don&#8217;t care about humans being involved in that process, why would they be?</p><p>But then there are these other things you point out where we actually do want the ballerina or the barista to be a human. That&#8217;s part of the value of going to a cafe or a performance. But only humans have that preference. So there&#8217;s this human economy where humans are doing services for each other, and part of their wealth is flowing to other humans.</p><p>But part of their wealth is also flowing out, because they will want some of the automated goods this machine-only economy is creating. 
This is not a closed loop. A lot of things in the machine-only economy are a closed loop because the machines don&#8217;t care about getting the human barista to make them a coffee.</p><p>Within that model, isn&#8217;t it intrinsic that the human-only economy will become a smaller and smaller share?</p><p><strong>Alex Imas</strong></p><p>I would like to pitch a rephrasing of that question. My view is that the individual forecasts economists like us would make, as individual forecasts, are not necessarily very useful.</p><p>There was a <a href="https://empiricrafting.substack.com/p/we-need-well-capitalized-prediction">blog post</a> by <a href="https://andreyfradkin.com/">Andrey Fradkin</a>, <a href="https://brianjabarian.org/">Brian Jabarian</a>, and <a href="https://www.andrewjkoh.com/garden">Andrew Koh</a> that came out yesterday looking at economists&#8217; forecasts about the labor market. What they found is that there&#8217;s a ton of disagreement in every single direction.</p><p>What they advocate for, and I&#8217;m in agreement here, is that rather than thinking about individual forecasts, we should be generating prediction markets where you get aggregate forecasts and <a href="https://en.wikipedia.org/wiki/Wisdom_of_the_crowd">wisdom-of-the-crowd</a> effects. The reason I think this is because we have been famously terrible at forecasting.</p><p>Let&#8217;s go all the way back to 1820. This debate we&#8217;ve been having is actually 200 years old. <a href="https://en.wikipedia.org/wiki/David_Ricardo">David Ricardo</a> is one of the <a href="https://en.wikipedia.org/wiki/Classical_economics">classical economists</a>, not <a href="https://en.wikipedia.org/wiki/Neoclassical_economics">neoclassical</a>. When the <a href="https://en.wikipedia.org/wiki/Industrial_Revolution">Industrial Revolution</a> started happening, he wrote a bunch of stuff saying, &#8220;This is going to be great for everybody. 
Prices are going to come down.&#8221; But then he turned around and said, &#8220;Wait, I can see all these jobs that are creating value are going to be automated by these machines. This is going to be really bad. Everybody&#8217;s going to become unemployed, and there&#8217;s going to be political unrest.&#8221;</p><p>And if you look at Ricardo&#8217;s predictions, they&#8217;re actually right. All those jobs that made money in Ricardo&#8217;s time got automated. If David Ricardo woke up and somebody told him all those jobs did get automated, and then asked him, &#8220;What do you think the <a href="https://fred.stlouisfed.org/series/LNS12300060">prime-age employment rate</a> is in 2026?&#8221;, I think he&#8217;d be surprised to be told it was the highest it&#8217;s ever been other than 2000. We have the highest number of employed people that could potentially be employed since 2000. That was the peak and now it&#8217;s the second peak basically.</p><p>What David Ricardo ended up missing is that you have these economics of structural change, where everything that got automated became cheap. People had more money to spend, and then they started spending it on services. This is the <a href="https://en.wikipedia.org/wiki/Lump_of_labour_fallacy">lump-of-labor fallacy</a>. David Ricardo didn&#8217;t consider that new jobs would be created.</p><p>But it&#8217;s not obvious that money would go to services. Why wouldn&#8217;t it go to more automated goods and something like that? I&#8217;m not using this anecdote to say this is what&#8217;s going to happen now and that we&#8217;re going to have full employment. I&#8217;m using it to say it&#8217;s really hard to make predictions.</p><p>What may be a really useful tool that economists have is to instead start with a premise. Maybe we start today: labor share is zero. Labor share has gone down. What could possibly explain this? Let&#8217;s write down an economic model of what happened. Phil will talk about this later today. 
Or you can write down a model that asks, &#8220;What if labor share just stays the same? What can make that happen?&#8221;</p><p>If you don&#8217;t take anything else out of this conversation from me: We don&#8217;t have any data. I&#8217;ve been saying we need a Manhattan Project for data. We don&#8217;t have data on consumer <a href="https://en.wikipedia.org/wiki/Price_elasticity_of_demand">demand elasticities</a>. We don&#8217;t know what they are. We&#8217;re not really tracking what jobs are getting created or destroyed. The <a href="https://www.onetcenter.org/database.html">O*NET database</a>, with all of the tasks and different jobs, has been rarely updated and is super low quality.</p><p>What is really useful is to think about the potential scenarios, map them out, and say what dimension of <a href="https://en.wikipedia.org/wiki/Scarcity">scarcity</a> will generate each scenario. If there&#8217;s full employment, we can talk about the relational sector. If the labor share collapses, we can talk about other sorts of scenarios. That will tell us what data we should be collecting.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s probably worth defining labor share and capital share real quick. The whole economy, the total sum of goods and services sold, is either paid out to people in wages or it&#8217;s paid out to capital, which is to say, there&#8217;s rent on buildings and shareholders of companies that get paid out.</p><p>For many hundreds of years, ~60% of the economy basically gets paid out to humans in wages, and the other 30-40% gets paid out to people who own machines and land and claims on companies. The question is, if 60% is going to wages right now, does that shrink as AIs get smarter and better?</p><p><strong>Alex Imas</strong></p><p>This is a <a href="https://en.wikipedia.org/wiki/Kaldor%27s_facts">Kaldor fact</a>. We should stress this. 
It&#8217;s incredibly surprising that it&#8217;s over 60% after the Industrial Revolution and all of the automation we&#8217;ve ever seen. Some people are worried it&#8217;s an accounting error that it&#8217;s been so constant.</p><p>There&#8217;s even a controversy right now. Some might say labor share has been falling in the last 20 to 30 years. But there have been a lot of accounting changes in the last 30 to 40 years. For example, <a href="https://en.wikipedia.org/wiki/Tony_Atkinson">Atkinson</a> has a <a href="https://www.brookings.edu/wp-content/uploads/2016/07/2013b_elsby_labor_share.pdf">paper showing</a> that if you keep the accounting constant over the years, labor share hasn&#8217;t even fallen ever.</p><p><strong>Dwarkesh Patel</strong></p><p>But it&#8217;s not that surprising, right? Phil, you made this point that if labor and capital are complements, you need both to do anything. It would make sense that you&#8217;d need to pay both of them to get something done.</p><p><strong>Alex Imas</strong></p><p>You have had stuff be completely automated.</p><p><strong>Phil Trammell</strong></p><p>There&#8217;s a sense in which nothing has yet been completely automated. Look at the network-adjusted factor shares of a good. Look down the supply chain and not just the final step and how much of that is done by capital and labor, but what went into the machines that can automate that final step. You&#8217;ll find that labor is adding a lot of value down the supply chain.</p><p>Computer and electronic products in the US have a very stable network-adjusted capital share of around 50%. It&#8217;s not 100%. I do think there&#8217;s this qualitative shift that I think we agree is coming, which is that there will be at least some goods whose network-adjusted capital share goes to one. The whole supply chain can be automated, and there&#8217;s no part in it that we care intrinsically about having a human do. 
That will be a qualitative shift.</p><p>Interestingly, the implications of that shift for the overall capital share are ambiguous. Let&#8217;s say we&#8217;ve got two sectors: the human-intrinsic sector with the ballerinas, and everything else. Right now, everything else has been scarce because of the lack of labor in it. But if we fully automate the supply chains for everything else, and we satiate in everything else really fast, then the quantity of everything that&#8217;s not a ballerina goes to infinity, but the marginal utility in that stuff goes to zero faster than the quantity is rising.</p><p><strong>Alex Imas</strong></p><p>I also want to move away from the ballerina example. The point I was trying to make in my post&#8212;working backwards from a particular scenario&#8212;was that the ballerina and the performer are the wrong reference class. Right now we have a lot of jobs where you have different tasks. This is the task-based model of jobs. Take a doctor, what is their job? They&#8217;re filling out insurance documents. They&#8217;re going and calling different pharmaceutical companies. One of their tasks is to see the patient and talk to them, but that&#8217;s not the main part of the job.</p><p>You could have a job and a service or a good be a product of different types of tasks, and you can automate a ton of those tasks. If the consumer is willing to pay more for a product or service where every single task is automated except for that one part where the doctor is delivering the diagnosis and providing support, we would call that job part of the relational sector. People are willing to pay more for the human to stay in the loop in the job.</p><p>We don&#8217;t have data to say, &#8220;Here are relational jobs, here are not.&#8221; You literally need to collect data of the following sort. Do a conjoint analysis of your willingness to pay for this service or good. Here&#8217;s the counterfactual where everything is produced by machine. 
Here&#8217;s the counterfactual where this one task is not produced by a machine. What is your willingness to pay? What is your elasticity for the human to not be in the loop?</p><p>If I don&#8217;t have that data, what prediction am I going to make in this story?</p><p><strong>Dwarkesh Patel</strong></p><p>Isn&#8217;t there another point, which is that there are a lot of fully automated goods that don&#8217;t even exist yet? And you can&#8217;t collect any data right now about, say, how much people will want to keep buying more and more of some drug that makes you healthier that is fully produced by the AIs.</p><p><strong>Alex Imas</strong></p><p>Absolutely. That&#8217;s kind of Phil&#8217;s point. You could have an increase in variety in capital where you don&#8217;t get the satiation. You&#8217;re increasing variety, so you&#8217;re not hitting that diminishing <a href="https://en.wikipedia.org/wiki/Marginal_utility">marginal utility</a> point where most of your income is going to the human sector. If that increasing variety is fast enough, and there is no such increasing variety in the human sector, then you can get all of the relational goods you want, but it doesn&#8217;t matter for labor share. It goes to zero.</p><p><strong>Dwarkesh Patel</strong></p><p>Phil, I liked <a href="https://philiptrammell.substack.com/p/is-labor-a-luxury-in-the-long-run">your analogy to some Mongolian economist</a> sitting around in 1400 thinking about what will be scarce and the limits of that kind of analysis. I think you should talk about that.</p><p><strong>Phil Trammell</strong></p><p>Just look at the goods available to a Mongolian of the distant past. I&#8217;m no expert on this society, but I know that they didn&#8217;t have nearly the variety that we have now. Look at the jobs that were intrinsically human, like being a singer. 
And then you look at the things that were not intrinsically human, like the transportation services provided by their horses or the different kinds of food they had.</p><p>If they just held the varieties fixed in both categories and asked, &#8220;What will happen once we have a lot more automation?&#8221;, they might have said, &#8220;We&#8217;ll just satiate in horse-like transportation and in yogurt and in yurts. Those shares will all go to zero, and we&#8217;ll be left spending all of our money on singers.&#8221;</p><p>But of course, that&#8217;s not what happened. As we&#8217;ve accumulated more wealth and more advanced machines, we&#8217;ve expanded the range of things other than singers to spend our money on, and the share spent on singers has stayed negligible. Likewise, that&#8217;s my central prediction about how the future unfolds, though it could go either way.</p><p><strong>Dwarkesh Patel</strong></p><p>I was going to make a point and I realize it&#8217;s a fallacy, but the reason it&#8217;s a fallacy is interesting. It&#8217;s just hard to imagine a world where there are trillions upon trillions of robots, but there&#8217;s only some billion-odd humans, and the cumulative amount we&#8217;re spending on robots is less than what we&#8217;re spending to pay Magnus Carlsen or&#8212;</p><p><strong>Alex Imas</strong></p><p>Financial advisors or doctors or tutors.</p><p><strong>Dwarkesh Patel</strong></p><p>Right, or podcasters or whatever. But then I realized why it&#8217;s a fallacy. The number of transistors in the world has literally trillion-X&#8217;d, maybe quadrillion-X&#8217;d. 
Your colleague <a href="https://en.wikipedia.org/wiki/Charles_I._Jones">Chad Jones</a> has a <a href="https://web.stanford.edu/~chadj/AIandEconomicFuture.pdf">very interesting result</a> about how the share of the economy that is going towards paying for computing, paying for the transistors, <a href="https://www.youtube.com/shorts/iaOPdyO1SgE">has been decreasing</a>.</p><p>The point you made is that one way to think about <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s law</a> is&#8230; What sets price? Supply and demand. So not only are we producing more transistors more cheaply, but also the value of the marginal transistor is decreasing. As you were saying, another way of saying Moore&#8217;s law is&#8230;</p><p><strong>Phil Trammell</strong></p><p>I like the pessimistic framing of Moore&#8217;s law: every 18 months, the value of computation halves. We&#8217;re running out of uses for computation so fast that it&#8217;s sustaining Moore&#8217;s law.</p><p><strong>Dwarkesh Patel</strong></p><p>This is relevant to a conversation about AI where maybe for the first time, this is no longer true. The famous fact here is that an <a href="https://www.nvidia.com/en-us/data-center/h100/">H100</a> costs more to rent now than it did three years ago, even though we have much superior technology and much more compute in the world. Because as models get smarter, the opportunity cost of compute gets higher.</p><p><strong>Alex Imas</strong></p><p>This is Phil&#8217;s point about increasing variety. What we have done is increased the types of things that people demand from capital. Now all of a sudden you have a new variety that you could be using capital for, and you jump back up.</p><p><strong>Dwarkesh Patel</strong></p><p>You could imagine we just never satiate demand for compute. 
As long as that stays the case, the share of the economy that is going towards compute would keep increasing.</p><p><strong>Alex Imas</strong></p><p>That&#8217;s the big question. That is the ultimate question that we need to be looking at. What number of new uses are we finding for that compute where you have the demand for these uses? What I want to emphasize is that a lot of models in economics, especially in the space that we&#8217;re talking about, take demand as almost exogenous. They don&#8217;t unpack the psychology of what people actually want.</p><p>What got me thinking about the idea of the relational sector is work that I was doing on the fact that there does seem to be this intrinsic value. It&#8217;s not just because it&#8217;s scarce; it&#8217;s because there&#8217;s some intrinsic preference that people have for empathy, connection, and interacting with another person.</p><p>One of the experiments that we ran involved an art print. We have an incentive-compatible way of asking, &#8220;How much are you willing to pay for this art print?&#8221; People are actually paying real money for it. Then we say, &#8220;Look, there&#8217;s only one of those art prints, and it&#8217;s either made by AI or by a person.&#8221; These are between-subject conditions. With one, you get the effect that the person-produced art print is valued much higher than the AI version.</p><p>Then, in a set of other conditions, we say there&#8217;s 500 of these being produced. For the human-made one, the price goes down a lot because it&#8217;s no longer seen as making a connection with this one artist. With AI there&#8217;s no difference. AI is already viewed as a commodity.</p><p>We need to do a lot more research on this, but it seems that&#8217;s the key difference between this and something like a horse. A horse was an input into an output, where you can replace the horse with something else. 
You only care about the output.</p><p>The only way this relational story works&#8212;and this is what we need more data on&#8212;is if a human is not a horse in the sense that they are providing value from the output, where if you replace the human, the value of the output decreases. If that&#8217;s not strong enough, and if it doesn&#8217;t hold for enough sectors or enough jobs, then this story doesn&#8217;t work anymore.</p><h3>00:19:36 &#8211; Messy Middle scenario</h3><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s one possibility which <a href="https://www.brookings.edu/people/molly-kinder/">Molly Kinder</a> has written about, this <a href="https://mollykinder2.substack.com/p/the-messy-middle">&#8220;Messy Middle&#8221; scenario</a>. That possibility made me think about whether it might be better to have&#8212;at least as far as wealth distribution and redistribution go&#8212;a much faster AI takeoff.</p><p>I want to ask you whether the following possibility is at all likely, or if there&#8217;s any set of assumptions that can make it so. AI makes it possible to automate jobs such that many people are losing their jobs, but it doesn&#8217;t create enough wealth, while the process of automation is happening, to basically pay off the people who are getting laid off and create a <a href="https://en.wikipedia.org/wiki/Pareto_efficiency">Pareto improvement</a>, where everybody&#8217;s getting better off as a result of AI automation.</p><p>Of course, there&#8217;s a trivial sense in which that must be true. Whatever money the company is saving by paying the AIs instead of the humans, those resources still exist in the economy and can just be paid out to people.</p><p>But there&#8217;s going to be some allocative inefficiency. The government doesn&#8217;t know exactly who got laid off because of AI. There&#8217;s a political problem. 
If the Meta worker gets laid off first and they were making $200,000 a year, is there a politically sustainable situation where you give them a $200,000 check a year when there are many working people making much less? Do you find this scenario plausible, where AI is automating a bunch of things, but there isn&#8217;t as much wealth creation as there is automation?</p><p><strong>Phil Trammell</strong></p><p>I think it&#8217;s possible. To me, it does seem like a pretty narrow window. My guess is that if we have the technology to automate so many jobs that it becomes a new kind of political problem, then the pie will also be growing really fast.</p><p>Well, unless in all of those professions it&#8217;s automating, it&#8217;s just a hair more productive. So the cost of all the capital to replace all the software engineers is just a hair less than the cost of what we&#8217;ve been paying the software engineers.</p><p><strong>Dwarkesh Patel</strong></p><p>Why is it implausible that a company can save money by laying off a bunch of software engineers? And in the long run, there&#8217;s a <a href="https://en.wikipedia.org/wiki/Jevons_paradox">Jevons paradox</a>, and we can&#8217;t anticipate in advance what we&#8217;d do with more software, and surely there will be more uses. But in the short run, the effect is just that a lot of people are laid off, and they still need to figure out how they can use a million times more JavaScript tokens.</p><p><strong>Alex Imas</strong></p><p>Phil and I have been writing about these things, and we have mathematical models in the back of these things. We don&#8217;t have any <a href="https://en.wikipedia.org/wiki/Political_economy">political economy</a> in any of our models.</p><p><a href="https://www.hoover.org/profiles/andrew-b-hall">Andy Hall</a> wrote a really nice <a href="https://freesystems.substack.com/p/the-politics-of-jobless-prosperity">blog post about the politics of AGI</a>, and he made a really interesting observation. 
If there&#8217;s a 2% increase in unemployment, the political winds completely change. Unemployment has a huge effect on what happens politically.</p><p>Referring to Molly&#8217;s excellent essay, I think in some ways one of the worst scenarios is a drip scenario because of the political economy piece. What you might see is people not really being unemployed en masse, but moving into sectors that pay them less money. This is what happened with phone operators between 1920 and 1940. Phone operators were completely automated, but it took 20 years, even though the technology existed. There was this drip. It wasn&#8217;t like this giant sector just disappeared.</p><p>There&#8217;s a really nice <a href="https://download.ssrn.com/22/04/21/ssrn_id4089815_code2285595.pdf?response-content-disposition=inline&amp;X-Amz-Security-Token=IQoJb3JpZ2luX2VjEG0aCXVzLWVhc3QtMSJHMEUCIQDzz27RvbafjTgz5bN45yLDjXS8JWtuvREjCTWEYhMntgIgMl3wF13f7y%2BeR1ecBVRw95Zwg%2FUYkBwnUCjfCXugVdMqvQUINhAEGgwzMDg0NzUzMDEyNTciDK9KrzvRM8c4scV3%2BCqaBUUUPhcgkkA5J7nWzjWUwQ%2BW%2FWiz07vqfkcuev1b7uPrZ2FoSF78WVT2gvrMkF3UFpctnjbATY0Dfcu0FylOmuqPystyaBMvpHvvJ2a4rr1CLnksD5GK8i48iMnWODqen0hvzLwDvyMvC7kKmrqKnW9buX9YFz%2B2EuofQLDz14Krp5Rr2gL0M%2Bjk2jTrtE0DXbAHriHdtucIy7TsddlC3c%2Bxv8uRTNSfkkwI8aVHsFv%2FWwDDWP%2BPnufHs7BFpHj2%2BvT45QcUYosW4e8MyCAlqjU19pA4eQNZFUFJSss%2BBbrnBkUmdAJTeL828l%2FkswGklDYzFLWI3bwWWKYjOtsaCkgYKmoCosPRaZKEOreAit3DIHf3B4eoAtkLdZRtQs7clSI0ktGkAYE4kQ0Buv5%2FHrmVD6mA08SAmCM6GqMFHShSUBPcdUYcnRl%2BDhpt0dQthVkljNhfXUA93PPXbU7PHxU5j6PF9MJch%2F6hlgjlfEWQjwysNKbVc5yIoAFI2%2Bk6jNMKVZaQmXfDATsaW2RLM%2BMz3aOiPue1FgVsiBmmLBS%2BAJY8PL%2B3rGa5RzSK6IkC8MqI1pZtLYuZKijC14VQlf6SeCpoSSurbVjNkwKRrfBM6nTj22p3M%2BNiE53Hi3MShAACgbhuF8z1q3EYJ3GV%2FThQ1NvaPHSCGNg%2F20RRRmXoXb%2FvQ0xhr8gwi5gG9bP1%2BHtV12Ll4lMOw3sDRrit8au1BzCzy6hwPdNmahlqEin9VEWz9A3LOZGAcuHQN9KWVmcVb2auq8QLtqzrzWgBHGGg8RTMzebPLTLPlmoa6iT%2FQzek012r%2FD9ZbIZEFfEQRz8XhgAxulnLfYtYArol738qTQgkcmFHPDghh9Hg6mGigxGwOIkiM7zAujDT4P7QBjqxAWzB0pGI3E6X1pc3P7lqg6n%2Blr9J
EQk2EbpYUUq3Yzy8fQA623K7TV4CAAdTbcLpqSLvEyYSMtzR2djaeb4wHkJSQxL8AGKm7WVS40%2BqN2M0JA2OuqdZB%2BjBpz9HWlhQiWtJL%2BB4bgAmKi864IPPgsNtzapmxskRs0iP5HYKlqFJlDlBZyPgkYctYvJ6HewJQSTJqf8I7doTh8KpkHKCyUYaKNOhNm%2FBRX%2BVUBJmWysX0A%3D%3D&amp;X-Amz-Algorithm=AWS4-HMAC-SHA256&amp;X-Amz-Date=20260603T050530Z&amp;X-Amz-SignedHeaders=host&amp;X-Amz-Expires=300&amp;X-Amz-Credential=ASIAUPUUPRWE3O62YUZY%2F20260603%2Fus-east-1%2Fs3%2Faws4_request&amp;X-Amz-Signature=d27a50db22be30b59ecac31931f8efbc0aad1b6deea6939b2e3d036c76c3e84f&amp;abstractId=3722562">QJE paper</a> on this showing that they got reabsorbed into the economy, but at lower salaries, and they were mostly underemployed. That&#8217;s the scenario Molly was writing about, this messy middle where things aren&#8217;t a disaster. We saw with COVID that the fiscal response can move quickly if there&#8217;s an emergency. An emergency is a quick uptick in unemployment, which could even look like 2-3%. That becomes a national emergency if it happens fast.</p><p><strong>Dwarkesh Patel</strong></p><p>The concern is that whatever you&#8217;re saving on those white-collar workers, if that&#8217;s not growing the economy but just creating saved resources that can be allocated elsewhere, is that enough to do a broad-based redistribution scheme? You have the money you&#8217;ve saved off a couple of people. Unless you can figure out exactly how to get it to them specifically, you have the problem of, &#8220;Can I do a <a href="https://en.wikipedia.org/wiki/Universal_basic_income">UBI</a> off the money I saved by laying off&#8230;?&#8221;</p><p><strong>Alex Imas</strong></p><p>You&#8217;re basically saying the pie did not grow that much. 
You&#8217;re just displacing a bunch of people, but that didn&#8217;t grow the technological frontier of what the economy can produce.</p><p><strong>Dwarkesh Patel</strong></p><p>Then there&#8217;s a question of whether every time this has happened in history, the technological frontier has expanded a bunch.</p><p><strong>Alex Imas</strong></p><p>I think that&#8217;s the case. Historically, the technological frontier has expanded. I think Phil made the same point. It&#8217;s hard to imagine that sort of scenario where you are getting intelligence that&#8217;s just enough to replace the software engineer but still costs a lot of money. It&#8217;s just a hair less expensive than the software engineer, so you&#8217;re not getting this abundance effect. Where is the redistribution going to come from if the pie didn&#8217;t grow?</p><p><strong>Dwarkesh Patel</strong></p><p>This is very helpful. Many different things have to be true for this scenario to come to pass, each of which seems unlikely.</p><p>One, it has to be the case that it is possible to automate entire white-collar jobs, but only in a piecemeal way. That is to say that you can only automate software engineers, but that same program can&#8217;t also automate an accountant and an analyst and whatever.</p><p>My model of intelligence is such that&#8212;both the breadth of tasks it requires to do something like software engineering and what intelligence is&#8212;if you can really just lay off all the software engineers, you&#8217;ve got enough in the bucket there that you could automate all kinds of white-collar work. There are huge amounts of potential savings that have happened as a result of these layoffs, and also AI is going to be cheaper than human labor.</p><p>If both of those things are true, this messy middle scenario where we literally don&#8217;t have the wealth to go around seems unlikely. 
Then the question is, what is the best way to tax it and redistribute it?</p><h3>00:25:57 &#8211; How to tax and redistribute AI wealth</h3><p><strong>Alex Imas</strong></p><p>I have some thoughts. I think it&#8217;s really important to outline the costs and benefits. First, there&#8217;s differential complexity in implementing these things. Second, they differ in the timeline of being actually helpful. Something like <a href="https://en.wikipedia.org/wiki/Asset-based_egalitarianism">universal basic capital</a> is not going to generate returns for something that happens in six months. You probably are going to end up with a layer of things.</p><p>Take a <a href="https://en.wikipedia.org/wiki/Negative_income_tax">negative income tax</a>, for example. You implement it, and the day it turns into law, you already have this insurance that there&#8217;s a floor where everybody gets a certain amount of money, and if you earn more money, you get taxed more. But there are positives and negatives to a negative income tax.</p><p>With UBI, for example, I worry a lot about the political economy implications. If people are just dependent on a check, it really matters who&#8217;s in power. Right now, we&#8217;re endowed with labor that can turn into income. 
When that is no longer the case and we are at the mercy of the elected official for basic needs, that feels like a power-sharing arrangement that&#8217;s really dangerous.</p><p><strong>Dwarkesh Patel</strong></p><p>But wouldn&#8217;t that be true of any sort of government redistribution program?</p><p><strong>Alex Imas</strong></p><p>With something like universal basic capital, where you have an ownership share and property rights for capital, you just have a share.</p><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;re a normal shareholder.</p><p><strong>Alex Imas</strong></p><p>You&#8217;re just a normal person.</p><p><strong>Dwarkesh Patel</strong></p><p>But this goes back to the question of indexing, because if indexing is hard, then universal basic capital is hard.</p><p><strong>Alex Imas</strong></p><p>That&#8217;s the problem of universal basic capital: targeting. What do you target to put into people&#8217;s portfolios?</p><p><strong>Dwarkesh Patel</strong></p><p>Like, what if Anthropic goes to zero, but some random robotics company takes all this over?</p><p><strong>Alex Imas</strong></p><p>Exactly. That&#8217;s the risk of universal basic capital. With a negative income tax, you have the same sort of issues as with UBI, where somebody comes into power and says, &#8220;We&#8217;re not going to do that anymore,&#8221; and people can&#8217;t work, and then you have the issue of the floor being gone.</p><p><strong>Dwarkesh Patel</strong></p><p>One concern with the <a href="https://en.wikipedia.org/wiki/Wealth_tax">wealth tax</a> is that there&#8217;s no politically sustainable equilibrium at a 0.5% wealth tax. This happened with the income tax, of course. 
It starts low, it&#8217;s for war or something, and then it slowly escalates until the marginal income tax rate in the US is on the order of 40%, and in certain states, upwards of 50%.</p><p>With a <a href="https://en.wikipedia.org/wiki/Wealth_tax">capital tax</a>, is there a reason to worry that it would distort investment? Would people just say, &#8220;Why would I invest in Anthropic or Intel? The government is going to take larger and larger shares of it and dilute my share.&#8221;</p><p><strong>Phil Trammell</strong></p><p>Hold on. It&#8217;s worth separating how the revenue is raised, what&#8217;s taxed, and how it&#8217;s distributed. It could be that the government hands out shares of Anthropic to everyone by a broad-based tax and then buying Anthropic. Which would probably be the right thing to do. Hopefully, some populist proposal doesn&#8217;t interfere with that and expropriate some particular company that everyone happens to know about.</p><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;re suggesting there could be some sort of optimal tax. We&#8217;re taxing externalities or we&#8217;re taxing land. 
I guess we probably need to tax something other than just those two things.</p><p><strong>Phil Trammell</strong></p><p>Or consumption.</p><p><strong>Dwarkesh Patel</strong></p><p>Ok, a <a href="https://en.wikipedia.org/wiki/Consumption_tax">consumption tax</a>, like a European <a href="https://en.wikipedia.org/wiki/Value-added_tax">value-added tax</a>, allows the government to go buy a bunch of stocks, and then they just distribute those stocks to everybody.</p><p><strong>Alex Imas</strong></p><p>That&#8217;s <a href="https://en.wikipedia.org/wiki/David_Autor">David Autor&#8217;s</a>...</p><p><strong>Phil Trammell</strong></p><p>That&#8217;s not going to be that different from just redistributing the stocks, but it will be a little different.</p><p><strong>Alex Imas</strong></p><p>That was the proposal for <a href="https://en.wikipedia.org/wiki/Social_Security_(United_States)">Social Security</a>, by the way. That was privatizing Social Security. It&#8217;s worked so far, but there are questions about how long it&#8217;s going to keep working. Privatizing Social Security was basically giving everybody a basket of stocks.</p><h3>00:30:02 &#8211; Why demand collapse is unlikely</h3><p><strong>Dwarkesh Patel</strong></p><p>People talk about whether there&#8217;s a white-collar apocalypse already. Is there any evidence that suggests there is mass automation or unemployment as a result of AI already?</p><p><strong>Alex Imas</strong></p><p>A lot of people are looking at it. This is an area where there&#8217;s a lot of eyes and a lot of data being produced. The <a href="https://budgetlab.yale.edu/">Budget Lab over at Yale</a> is doing really good analysis on this. They just recently released a <a href="https://budgetlab.yale.edu/research/tracking-impact-ai-labor-market">report</a>, and you really have to squint to see anything happening. 
If you take an approach across the entire economy, or even look at the most exposed sectors like software engineering, there&#8217;s just not really anything going on.</p><p>There might be a little bit of a signal about junior developers getting jobs less than before. But that&#8217;s a &#8220;less than before&#8221; rather than a level shift, as in there&#8217;s actually an increased demand for senior software engineers, if anything. If you look at the trend, for junior developers, it&#8217;s a bit below trend.</p><p><strong>Dwarkesh Patel</strong></p><p>So you&#8217;re saying the growth is slower than before, but there is still growth even for entry-level software engineers. What do you think is going on with the anecdotal evidence of graduating college students saying that they&#8217;re finding it harder to find CS jobs?</p><p><strong>Alex Imas</strong></p><p>I think that&#8217;s anecdotal evidence.</p><p><strong>Dwarkesh Patel</strong></p><p>You think it&#8217;s always been hard to get jobs for some people, and now it&#8217;s getting turned into an AI narrative? Same with the layoffs, where it&#8217;s probably just a normal layoff, and they turned it into an AI layoff.</p><p><strong>Alex Imas</strong></p><p>You have to be careful with all of this. There are these public coordination devices. Let&#8217;s say we get into a narrative where if you&#8217;re a firm and you&#8217;re not laying people off, then you&#8217;re seen as not adopting AI enough. Then you&#8217;re going to just get a cascade effect of firms needing to keep up with the Joneses in terms of starting to lay people off. That&#8217;s super worrying, where the firm might actually be worse off after the layoffs than before, but it&#8217;s just doing the layoffs to have the perception that, &#8220;Look, we&#8217;re not behind the times. 
We&#8217;re using AI.&#8221;</p><p>You probably heard these anecdotal stories of token counters, where you have to <a href="https://www.wsj.com/cio-journal/why-some-companies-say-ai-tokenmaxxing-is-key-to-survival-e699a128">maximize tokens</a> and things like that. Right now, we don&#8217;t really have any evidence of a white-collar bloodbath.</p><p><strong>Dwarkesh Patel</strong></p><p>Is that surprising at all, given all these things AI can do? This is a story as old as time. If you automate some complementary task, the overall bucket of things&#8212;the human labor which complements the automation&#8212;will increase in value.</p><p><strong>Alex Imas</strong></p><p>One of the statistics that&#8217;s really important for that argument is <a href="https://en.wikipedia.org/wiki/Price_elasticity_of_demand">elasticity of demand</a>.</p><p>Take the <a href="https://www.nber.org/system/files/working_papers/w34639/w34639.pdf">O-ring model of jobs</a>. A job is a series of tasks. Let&#8217;s say the AI automates nine out of ten tasks. One task is not automated. If that person can now focus in on that task, the job will become more productive. If that translates into a price effect where the product is actually cheaper, and if demand responds enough where it&#8217;s being bought more and the service is being used more, that could actually lead to more hiring.</p><p>A lot of people on the internet have been making that argument very generally, saying, &#8220;Look, if anything in the data, we&#8217;re seeing an uptick in software engineering demand.&#8221; Which suggests that at least for now, given the way that jobs work, it might be elastic enough.</p><p><strong>Dwarkesh Patel</strong></p><p>I think this elasticity of demand argument is incredibly important for a lot of arguments that people make, or just a lot of labels that people use without understanding what the underlying causation is. People often talk about Jevons paradox. 
This is the idea that as something gets cheaper, you will want so much more of it that the total amount you spend on the thing increases. Famously, this happened to coal in Britain ~200 years ago.</p><p>But really this only happens if the demand for something is highly elastic. There are many things for which there is not super elastic demand. If oil, for example, gets super cheap, it&#8217;s not like magically&#8212;</p><p><strong>Alex Imas</strong></p><p>Or insulin.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. It&#8217;s not like magically there&#8217;s going to be so many more cars that now we&#8217;re going to be using way more oil than before.</p><p><strong>Phil Trammell</strong></p><p>At least not in the short run.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. The long-run elasticity is higher than short-run elasticity. But even in the long run, agriculture famously is the example where we could produce way more food if we dedicated the same portion of the economy that we dedicated to agriculture in the past. We&#8217;re already producing more food regardless, but we could produce even more if the same portion of the economy that was producing food 100 years ago was currently producing food. But you eat enough, and then you&#8217;re done.</p><p>The claim with software is that it is not some inherent property of markets that as it gets cheaper, you&#8217;ll just keep wanting more of it. The thing about software is this is a particular kind of good where as it gets cheaper, we&#8217;ll want more and more of it.</p><p>It is also highly relevant, and you wrote an <a href="https://aleximas.substack.com/p/will-advanced-ai-lead-to-negative?r=1ds20&amp;utm_medium=ios&amp;triedRedirect=true">essay</a> about this&#8212;a lot of this podcast is me summarizing your essays back to you. 
There&#8217;s this <a href="https://www.citriniresearch.com/p/2028gic">very viral scenario planning about the future by Citrini</a>, predicting that as a result of automation and very powerful AI, there will be a recession. White-collar workers will get automated, their salaries will no longer be available, and so there will be a slump. Do you want to recapitulate why this might be implausible?</p><p><strong>Alex Imas</strong></p><p>Part of it is plausible, part of it&#8217;s not. The part that we started the conversation with is the idea that there could be a lot of unemployment. If the speed of automation is quick, people could get laid off, and they may not find work very quickly. We can quibble about the unemployment part of the Citrini essay, but that&#8217;s not the issue. The issue is that they talked about negative economic growth.</p><p>What I did in the piece, that Phil and I had a back and forth on, was to say, let&#8217;s start with the proposition that there&#8217;s negative economic growth. What conditions do you need in the economy to get negative economic growth? It turns out the conditions are pretty improbable.</p><p>One thing that you need is for the holders of capital, rich people basically&#8230; Basically what you have in those sorts of scenarios is a reallocation of wealth and income from lower-income people who are using their labor towards tech capital owners. So you need demand to be bounded, like a hard bound, not even a soft diminishing sensitivity. You need for them to eventually say, &#8220;I&#8217;ve had enough. I don&#8217;t want to spend any more money.&#8221; And for that money to not enter as investment. Then you can get negative growth.</p><p><strong>Dwarkesh Patel</strong></p><p>The crucial thing is, even if we don&#8217;t want more shit, the world in which there&#8217;s a singularity and we don&#8217;t want to invest more money is crazy. We&#8217;re not saying, &#8220;Let&#8217;s build more data centers. 
Let&#8217;s build more fabs.&#8221; Even though we have AGI, we&#8217;re not investing in more data centers to run the AGI and that&#8217;s driving more economic growth.</p><p><strong>Alex Imas</strong></p><p>I sent the essay to Phil, and Phil wrote back being like, &#8220;This is pretty dumb,&#8221; like my essay. He said, &#8220;You&#8217;re trying to say that there&#8217;s going to be negative economic growth, but these are very implausible conditions.&#8221; And I was like, &#8220;That&#8217;s the point of the essay. These are very implausible economic conditions.&#8221;</p><p>That&#8217;s where scenario planning really shines. You have the Citrini essay, which was great that it was written because it started a conversation. It&#8217;s so intuitive, this idea that if there&#8217;s demand collapse, we can get the economy to shrink. You could get that with a depression. In the <a href="https://en.wikipedia.org/wiki/Great_Depression">Depression</a>, the technological frontier didn&#8217;t expand.</p><p>Here, the technological frontier is expanding. You actually have abundance. For abundance to generate negative economic growth, that&#8217;s really hard to get.</p><h3>00:39:26 &#8211; Human employees would be hard to integrate into the machine economy</h3><p><strong>Dwarkesh Patel</strong></p><p>We were talking a second ago about why there isn&#8217;t more automation as a result of <a href="https://en.wikipedia.org/wiki/Large_language_model">LLMs</a>. One plausible mechanism could be, as you were saying with the <a href="https://en.wikipedia.org/wiki/O-ring_theory_of_economic_development">O-ring theory</a>&#8230; O-ring theory refers to the fact that the <a href="https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster">Challenger shuttle</a> blew up because one component malfunctioned, and it destroyed the whole thing. Maybe that&#8217;s a more general model of how goods are produced in the economy. 
You have to make sure everything is reliable and works well. So you can&#8217;t automate an entire job to an AI right now. Even though it might be able to perform it at some probability, you need extreme reliability in order for it to not destroy the finished good.</p><p>This might explain why there&#8217;s a lot less automation now than there otherwise could be. But I think it works in the other direction once AIs get advanced enough. Integrating humans into the production flow of future goods will become difficult. Even beyond the arguments about how humans will be more expensive or less capable, there will be whole production flows organized for AI labor. They&#8217;re talking in <a href="https://www.lesswrong.com/posts/qehggwKRMEyWqvjZG/reflections-on-neuralese">neuralese</a>. They&#8217;re thinking many thousands of times faster.</p><p>So even if there&#8217;s some comparative advantage where it makes sense to hire a human, there will be transaction costs and worries of reliability that will actually make it hard to integrate humans into future production flows.</p><p><strong>Phil Trammell</strong></p><p>That seems right to me. In particular, I just want to distinguish between the point that if you automate nine-tenths of a job, people might shift over to the last tenth, but there might be ten times more work demanded of them.</p><p>Compare that to <a href="https://www.nber.org/system/files/working_papers/w34639/w34639.pdf">the model of O-ring automation from Gans and Goldfarb recently</a>. If you can only automate nine-tenths of the job, but you do it to a lower standard of quality than the human could, you might not want to automate even those nine-tenths.</p><p>That&#8217;s the thing that could totally port over. Symmetrically, it could be a reason why we don&#8217;t use a human for one-tenth of the job anymore, because a human just can&#8217;t perform it to the level of quality that the AI can perform the other parts of the job, or the level of speed. 
They end up pulling down the quality or speed of the finished product.</p><p><strong>Dwarkesh Patel</strong></p><p>By the way, the model you&#8217;re talking about seems extremely plausible to me for why more lawyers, accountants, or even software engineers are not automated. There are cases where there&#8217;s a pretty good probability that the thing worked as you expect, but the thing you&#8217;re paying the lawyer for is: &#8220;No, really, my company&#8217;s not going to go under because&#8212;&#8221;</p><p><strong>Alex Imas</strong></p><p>You&#8217;re also paying for a lot of regulation-type stuff. With lawyers particularly, you need some entity to back up the product. You need ownership of the product. You need somebody to be able to fire or hire, and there are licensing issues. There&#8217;s a lot of regulatory layers that are also going to be keeping&#8212;even if there&#8217;s no relational element&#8212;humans in the loop that have nothing to do with the ability of the human to actually perform the service.</p><p><strong>Phil Trammell</strong></p><p>All of these frictions on the political-type decisions that we are accustomed to only trusting humans for&#8212;legislation, being a judge, being a juror, or all the licensing that keeps certain professions human&#8212;that all strikes me as transitional.</p><p>What we expect to come from a human and how we organize our politics has changed so many times throughout history, from little hunter-gatherer bands to empires to whatnot. Once an AI-run political system is much more efficient than the alternatives, those will probably tend to out-compete the others.</p><h3>00:43:08 &#8211; What if some humans (or AIs) value wealth accumulation intrinsically?</h3><p><strong>Dwarkesh Patel</strong></p><p>Speaking of which, we&#8217;ve been talking about what preferences humans currently have and what impact that has on what kinds of goods will be scarce in the future. 
But of course, we&#8217;ll have different kinds of entities in the future: AIs. There was a time when there were no humans on Earth, but evolution selected for agents that have specific drives and preferences because those tend to survive the most, and those preferences now determine what a hundred-trillion-dollar world economy produces.</p><p>Why not expect the same thing from AIs in the future? This is not even a world with catastrophic misalignment, where they just kill everybody. But there will be evolution of, even if not individual AIs, firms which have AIs as part of them. What will that evolution favor? It will probably favor firms or agents that grow. There&#8217;s a selection argument that things which grow will be more prevalent.</p><p>Maybe just based on that, you can make some predictions about what their preferences will be. Is the kind of entity which prefers to have human-intrinsic goods going to be the kind of entity that accumulates resources the most? Probably not. Probably it saves more and has unsatisfiable demand for whatever the relevant resource happens to be. Compute is an obvious one. Can we use that to make some predictions about the non-human preferences that will be guiding the future?</p><p><strong>Alex Imas</strong></p><p>If there&#8217;s an AI that has its own welfare, is fully autonomous, and is making its own decisions that are welfare-relevant, to be honest, I have absolutely no prior that it would prefer to deal with humans. There&#8217;s no reason.</p><p>But let me take the other side of that argument. Humans&#8217; preferences to be interacting with one another, to trust and empathize with other humans versus a simulated AI, I think it&#8217;s a really important question whether those will change.</p><p>I&#8217;ve heard a lot of arguments saying, &#8220;Look, right now we&#8217;re just not used to the technology. 
What you&#8217;re thinking of as relational&#8230; At some point, people are just going to see an AI therapist as a superior product, and they&#8217;re not going to need the empathy that the human is providing.&#8221;</p><p>I think this is actually a really complicated question. Here&#8217;s one argument for why it&#8217;s not going to go away, and it has to do with evolution. Let&#8217;s say there are two types of people. One person doesn&#8217;t really have this preference. They can just interact with an AI, whatever can simulate it better. The other one has almost a moral emotion&#8212;using <a href="https://en.wikipedia.org/wiki/Jonathan_Haidt">Jonathan Haidt&#8217;s</a> framework&#8212;against offloading those sorts of social interactions to an AI.</p><p>Which of those two people are going to reproduce, find a mate, all of these sorts of things? I think the answer is clear. It&#8217;s the second one that has the preference for other people.</p><p><strong>Dwarkesh Patel</strong></p><p>Depends on how the reproduction is happening.</p><p><strong>Alex Imas</strong></p><p>Fair. But if we&#8217;re in the world where reproduction is still happening the way that it&#8217;s happening, I think&#8230; And this is a big question, I&#8217;m not making a prediction.</p><p>You had <a href="https://en.wikipedia.org/wiki/David_Reich_(geneticist)">David Reich</a> on the show. His point on <a href="https://www.dwarkesh.com/p/david-reich-2">the last podcast</a> was that we&#8217;re buzzing with <a href="https://en.wikipedia.org/wiki/Natural_selection">natural selection</a>. So even if you get some sort of indifference now, you might get selection to point into an even stronger preference for other humans.</p><p><strong>Dwarkesh Patel</strong></p><p>Here&#8217;s one way to think about it. How is the wealth of the richest people in the world instantiated? We were having a call earlier, and you made the point that their consumption is more geared towards relational goods. 
Like Mark Zuckerberg is hiring MMA instructors and dancers for his wife&#8217;s birthday, and so forth.</p><p>But most of his wealth is just stock in Meta. As a controlling shareholder, he could say, &#8220;Meta, turn all this wealth into dividend income, and I will just spend that on consumption.&#8221; Instead, he would rather have his wealth compound and have Meta build more data centers.</p><p>So you don&#8217;t even have to change humans for this to be the case. The humans who are wealthiest&#8212;and growing wealthier because their wealth is compounding&#8212;just have this almost Nick Landian preference for accelerating capital. That does seem to suggest that this is an important determinant of what kinds of things are produced in the future.</p><p><strong>Phil Trammell</strong></p><p>There are two ways you could get the two kinds of people, one of whom prefers a human therapist and one of whom is fine interacting with the AI. If they both satiate equally quickly in capital but the one who likes the human therapist also just likes having some human-intrinsic services, then the marginal value of capital in the future, compared to the marginal value of capital today, for each of them if they start out equally rich, should be basically the same. 
There could be interactions and whatnot, but basically, that should be the same.</p><p>If what&#8217;s driving the difference is that one person just doesn&#8217;t satiate in capital because they&#8217;re engaged by the prospect of exploring the universe and turning their head into a galaxy brain or whatever, and the other one satiates, then the person who doesn&#8217;t satiate in capital is going to, if they&#8217;re being rational, have a higher savings rate.</p><p>So in the long run, they&#8217;re going to have most of the wealth, and the overall capital share will basically be the capital share of that person&#8217;s spending, which is going to be one.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s important that we&#8217;re not talking about a hypothetical future. <a href="https://www.dwarkesh.com/p/elon-musk">Elon Musk</a> is talking about <a href="https://en.wikipedia.org/wiki/Mass_driver">mass drivers</a> on the moon. He&#8217;s by far the wealthiest person in the world. Obviously, currently his investments are going towards humans as well as machines, but I don&#8217;t think he cares particularly that his future researchers and engineers are humans versus AI.</p><p><strong>Phil Trammell</strong></p><p>And he manages to reproduce fast as well. So I just think it&#8217;s worth drawing that distinction. There are currently some rich people that don&#8217;t seem to satiate quickly in capital, and so maybe in the long run they&#8217;ll save the most. That does seem right to me.</p><p>I would also say, even if they do reproduce more slowly biologically, that might just not matter that much in the long run if they can live forever.</p><p><strong>Alex Imas</strong></p><p>The living forever is key. Again, we&#8217;re scenario-building here. If you could live forever, a lot of stuff changes for my story as well. To your point about rich people not consuming a lot and investing, this will all depend on the returns to capital. 
Right now, the returns to data centers are super high, but if we get into a situation where people are satiated with capital, then the returns to accumulating capital are going to be lower. Then these rich people are going to be consuming more, because the incentive to invest is smaller.</p><p>Basically, you think about the general equilibrium of this sort of process&#8230; We have gotten tremendously richer since 1820. Many more people are investing, but you&#8217;re still getting a consumption response which keeps people employed and labor share high. That&#8217;s because&#8212;</p><p><strong>Phil Trammell</strong></p><p>Hold on.</p><p><strong>Dwarkesh Patel</strong></p><p>Wait, not necessarily. I think you&#8217;re probably making the same point. It could be that their investment has to be titrated through actual laborers who have to do things for their investment to work. In the future, only the consumption is human-mediated, right? Because the investment can just be done by the robots.</p><p><strong>Alex Imas</strong></p><p>So we&#8217;re in the scenario of how you can keep high labor share. Let&#8217;s take that scenario. In the scenario with high labor share, for whatever reason, the returns to capital are going to be lower.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s right. To the earlier thing where we were saying why the messy middle is implausible, I feel like we can do a similar thing here. For our returns to capital to be lower, the growth rate has to be lower, right? It certainly has to be lower than what we&#8217;re expecting through the period of transformative AI. If there&#8217;s explosive growth&#8230;</p><p><strong>Phil Trammell</strong></p><p>Yes and no. 
The capital stock could grow quickly, but the price of <a href="https://www.investopedia.com/ask/answers/032515/what-difference-between-capital-good-and-consumer-good.asp">capital goods</a> relative to consumption goods could be falling faster than the capital stock is growing.</p><p><strong>Alex Imas</strong></p><p>It&#8217;s the difference between the potential frontier of technology and the realized prices of these things. Because you have relative prices.</p><p><strong>Dwarkesh Patel</strong></p><p>So you&#8217;re saying I could be putting my money towards earning 30% interest and investing in data centers, or whatever. There will be something in the future, if the growth rate is high, that earns high returns. Or, as a result of all these technological breakthroughs, there&#8217;s some cool product that I really want to buy right now, and both of those will be compelling options.</p><p><strong>Phil Trammell</strong></p><p>Yeah. It doesn&#8217;t have to be a new product. It could be a human-intrinsic product.</p><p><strong>Dwarkesh Patel</strong></p><p>Although, if it&#8217;s a human-intrinsic product, we would want to have it much more in the future than we want it now, because the thing it compares against is&#8212;</p><p><strong>Phil Trammell</strong></p><p>We might want it the same as we want it now in the sense that the marginal utility in a ballerina performance is exactly the same as now. But the marginal utility in a robot might just be a lot lower than now. So in units of robots, we want it a lot more than we want it now.</p><p>Would the interest rate be 30%? It depends what you mean by the <a href="https://en.wikipedia.org/wiki/Real_interest_rate">real interest rate</a>. It might be that every robot now can turn into 100 robots next year. So in units of robots, the interest rate&#8217;s 10,000%. But if the price of robots is falling really fast...</p><p><strong>Alex Imas</strong></p><p>Prices adjust. 
I think that&#8217;s the whole point.</p><p><strong>Phil Trammell</strong></p><p>Here prices are adjusting in this interesting way that too many macro models don&#8217;t allow for. What&#8217;s happening is what would be called <a href="https://en.wikipedia.org/wiki/Investment-specific_technological_progress">investment-specific technical change</a>.</p><p>The price of capital is falling relative to the price of consumption, instead of doing the standard macro thing of saying there&#8217;s just output, this chimera of a thing called output, which one for one can be allocated to capital or consumption. That&#8217;s not going to be true in this world.</p><p>Every unit of capital next year is giving up way less consumption than each unit of capital this year. One robot now turns into many robots next year, but the number of ballerinas is the same.</p><p><strong>Alex Imas</strong></p><p>Again, we&#8217;re going to go back to the increasing varieties thing. If all of those extra robots next year are actually different varieties of robots and I&#8217;m not getting satiated on those robots, then it&#8217;s a very different story.</p><p><strong>Dwarkesh Patel</strong></p><p>But now we&#8217;re talking about the consumption world. For the investment side of things, there could be just some greedy titan of industry who keeps wanting more and more robots. That alone would be enough to increase the marginal value of robots and therefore decrease labor share?</p><p><strong>Phil Trammell</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>Why are we not expecting greedy titans of industry to keep existing?</p><p><strong>Alex Imas</strong></p><p>Greedy titans of industry historically have <a href="https://en.wikipedia.org/wiki/Carnegie_library">built libraries</a> and&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>But that&#8217;s because they die, and they&#8217;re like&#8212;</p><p><strong>Alex Imas</strong></p><p>Oh, they all die. 
Everybody dies.</p><p><strong>Dwarkesh Patel</strong></p><p>Well, we&#8217;ll see.</p><p><strong>Alex Imas</strong></p><p>Conditional on people dying&#8230; You had a guest on the show who said to understand the future, you should think about the past. You could have new types of titans being born whose entire reason for accumulating wealth is just to accumulate wealth.</p><p>But a lot of the time, at least historically, the wealth accumulation process is part of a large social interaction amongst peers, amongst the community, where you want to be admired in some way. The <a href="https://en.wikipedia.org/wiki/Stylized_fact">stylized fact</a> of titans of industry is you accumulate the capital, and then you buy a bunch of stuff.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess this is a historical question, but it does seem to me that in a lot of cases what is happening is that as they near the end of their life, they either hand it off to their children, who are worse stewards of capital than they are. The children don&#8217;t even manage to grow their wealth at the rate the economy grows, much less faster than the economy grows, which their parents were doing.</p><p>Or they&#8217;re like, &#8220;Well, I care less about my children having it than me playing this game of accumulating wealth. So I&#8217;m just going to give it to some trust.&#8221; If people are living longer or if they can figure out some way to align their trust to this wealth accumulation process&#8230; It just feels like the evolution here is so strong. You just need a couple of agents that think this way for this to be the dominant thing determining the preferences of the whole economy, because this part is growing much faster than the other parts of the economy.</p><p><strong>Alex Imas</strong></p><p>The part about satiation and diminishing marginal utility keeps coming up, and I think it&#8217;s really important. 
If a person has an intrinsic preference for accumulation, that&#8217;s just what they want, then I think your story is totally right. But that&#8217;s just not how preferences usually work. You have enough hedonics in your life, and then the social status&#8230; <a href="https://en.wikipedia.org/wiki/Jean-Jacques_Rousseau">Rousseau</a> wrote about <a href="https://en.wikipedia.org/wiki/Amour-propre">this</a>, <a href="https://en.wikipedia.org/wiki/Augustine_of_Hippo">St. Augustine</a> wrote about this. This is a basic part of preferences.</p><p>Now, you guys are arguing about something else, where you could have such high concentration that you could just have a couple of exceptions to the rule, and that&#8217;s going to be enough. I have nothing to say about that.</p><p><strong>Phil Trammell</strong></p><p>I think the claim is a little stronger, not just that you could have some exceptions, but that historically and today we see the exceptions. They just haven&#8217;t really taken over the economy historically because there have been these dissipation shocks, as they&#8217;re called. They&#8217;ve given it to their kids who squandered it, or they put it in foundations which spent it. It&#8217;s not really a shock, but&#8230;</p><p>People might have liked to fill the universe with monuments to themselves and live forever, very wealthy. It&#8217;s a weird preference, but it&#8217;s not a hypothetical preference. I think that&#8217;s the claim.</p><p>But who knows what&#8217;s going on in their heads? Even without the intrinsic preference for accumulation, there are some instrumental reasons why some people might value accumulation, which is also worth bringing up. There&#8217;s a desire for political, philosophical, or religious influence. 
People get into an arms race over what society looks like and what people believe.</p><p>Similarly but differently, because it&#8217;s not an arms race, there&#8217;s just total <a href="https://plato.stanford.edu/entries/consequentialism/">utilitarian</a> philanthropy. When I think about why it might be good to have a lot of wealth in the future as a good classical utilitarian, to me, the value&#8212;or at least one way you could have an almost unsatiating utility function in having wealth in the future&#8212;is to create new happy beings. They just add to the total welfare of the world.</p><p>This idea goes at least as far back as <a href="https://en.wikipedia.org/wiki/Nick_Bostrom">Bostrom&#8217;s</a> <a href="https://nickbostrom.com/papers/astronomical-waste/">astronomical waste</a> point, that we could put <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson spheres</a> around the stars and turn all the energy into really happy simulations and whatnot.</p><p><strong>Dwarkesh Patel</strong></p><p>I think the particular greediness of this optimizer doesn&#8217;t matter, what they&#8217;re greedy for. Forgetting about utilitarian philosophy or whatever, a pure <a href="https://en.wikipedia.org/wiki/Self-replicating_spacecraft">von Neumann probe</a> has&#8230; I don&#8217;t know, is this an accurate way to say it? They just have high marginal value for the random solar system they&#8217;ll occupy because that turns into more solar systems. A von Neumann probe is a thing that can exist. That&#8217;s a very greedy optimizer.</p><p><strong>Phil Trammell</strong></p><p>If we&#8217;re talking about whether they&#8217;ll dominate the economy, maybe this is a technicality. But we only count final consumption goods and investment goods as GDP. If there&#8217;s just this phenomenon&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>How does a von Neumann probe show up in GDP?</p><p><strong>Phil Trammell</strong></p><p>Exactly. 
If we recognize it as a person that owns itself, and it&#8217;s optimizing on the margin between spending a bit more on a baby von Neumann probe that colonizes another star system or a ballerina or something, and it just doesn&#8217;t value the ballerina very much&#8230;</p><p><strong>Alex Imas</strong></p><p>When we&#8217;re talking about AI beings, it just completely depends on how we&#8217;re doing the accounting there.</p><p><strong>Dwarkesh Patel</strong></p><p>What does a world where von Neumann probes are possible look like? Is it possible labor share is high?</p><p><strong>Phil Trammell</strong></p><p>I think it&#8217;s possible the labor share is high the way we usually count it.</p><h3>01:01:28 &#8211; What should developing countries do?</h3><p><strong>Dwarkesh Patel</strong></p><p>Do economists have any advice for countries which are not in the AI production chain? Say you&#8217;re not producing the AI models or the hardware that goes into them: you&#8217;re not Korea making <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">HBM</a>, Taiwan with the fabs, or the Netherlands with <a href="https://en.wikipedia.org/wiki/ASML">ASML</a>. If you&#8217;re India or Nigeria, what should you be doing right now? If you&#8217;re talking to Modi right now, what do you say?</p><p><strong>Alex Imas</strong></p><p>I think the biggest gap in how the economics profession has allocated its resources is in thinking about middle-income developing countries in the age of AI. This is something I fault myself for as well. 
There&#8217;s not enough people thinking about this question.</p><p>There are scenarios where you get AI technology being allocated and dissipating to Nigeria and developing countries, leveling the playing field, essentially giving them a level up as far as capabilities.</p><p>But there&#8217;s another world where, because they don&#8217;t have enough resources, they&#8217;re not training the models, they don&#8217;t have the hardware, and they just completely get left behind. And because of automation, we can produce commodities in developed countries now. Then we don&#8217;t even have the consumer market. That world looks pretty bad.</p><p><strong>Phil Trammell</strong></p><p>This seems to me like an extension of the messy middle case. One of the ways in which the messy middle might only be bad in a narrow range of scenarios isn&#8217;t just that it would be easy to redistribute because the pie would be bigger, but because the interest rate would be way higher, and/or, equivalently, the price of everything except human-intrinsic goods would be falling really rapidly. They&#8217;re sort of two sides of the same coin. A little bit of savings would turn into a lot of consumption next year.</p><p>Things have to go really wrong for us to just get over the threshold of capital being productive enough to automate lots of work, but not be productive enough that the interest rate is high and the price of capital-produced goods is falling a lot. Even without redistribution, a little bit of savings will save a lot of people.</p><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;re saying if the developing countries have some savings in the developed world, that will be enough to produce a lot of surplus that they can then&#8212;</p><p><strong>Alex Imas</strong></p><p>They will now be able to consume a lot using their savings.</p><p><strong>Phil Trammell</strong></p><p>But the messy middle could be wider in this case. 
They&#8217;re starting from such a lower level in terms of how much they have and how much it&#8217;s actually indexed to the global economy. I think it&#8217;s important for them to get on it now. I don&#8217;t have strong feelings about whether it should take the form of <a href="https://en.wikipedia.org/wiki/Sovereign_wealth_fund">sovereign wealth funds</a> that invest in the right supply chains or just subsidies to their own citizens to buy a little bit of&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>This is actually a crucial point. We were talking earlier about why the Rockefellers of the world, why their descendants don&#8217;t control everything, if our argument about the selection of these greedy optimizers holds. One argument is just that it&#8217;s very hard to index the economy. Maybe they would&#8217;ve just decided to have their heirs index the economy and have their wealth grow at the rate of economic growth, and their heirs would be trillionaires by now.</p><p>Before index funds existed, it was just very hard. A very small fraction of the economy, going back 100 years, accounts for a majority of the value created now. If you missed those particular things, your wealth would&#8217;ve just stagnated.</p><p>Maybe there was a brief golden window from the creation of index funds up until five years ago where you could actually index the economy and have your wealth grow at the rate the economy grows. But now we&#8217;re in this world with very concentrated returns, especially to private companies. As we were making the point in our blog post, this is capital that the average person has disproportionately less access to. 
Most of their capital is in a random house, at least in the US.</p><p><strong>Phil Trammell</strong></p><p>Or a part of a house.</p><p><strong>Dwarkesh Patel</strong></p><p>Which, as we were saying, is capital that is uniquely ill-suited to be complementary to the production of AI or the serving of AI or to robots.</p><p><strong>Phil Trammell</strong></p><p>Or the kinds of goods that the rich will bid up the prices of.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. What is the value of a house currently? It&#8217;s that the land is close to other humans, and, modulo relational stuff, that is just not going to be the main factor of production in the future.</p><p><strong>Alex Imas</strong></p><p>This would be why a <a href="https://en.wikipedia.org/wiki/Georgism">Georgist</a> tax would not raise enough money for the sort of programs that we will be discussing.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. But stepping back, the point I was trying to make is, if it gets harder to index the economy now, and that is the main way in which normal people are supposed to&#8212;modulo some sort of universal basic income&#8212;</p><p><strong>Phil Trammell</strong></p><p>In the developed world.</p><p><strong>Dwarkesh Patel</strong></p><p>&#8212;are supposed to have some purchase on the wealth from AI. And it&#8217;s also the way that developing countries are supposed to have some purchase on the wealth gains from AI.</p><p>But it&#8217;s very hard. Does Nigeria own a lot of <a href="https://en.wikipedia.org/wiki/SK_Hynix">SK Hynix</a> and <a href="https://en.wikipedia.org/wiki/Anthropic">Anthropic</a>? I&#8217;m guessing not. It&#8217;s not enough for them to just own the S&amp;P 500.</p><p><strong>Alex Imas</strong></p><p>This brings up a really important point. Is AI going to be like electricity or social media?</p><p>Think about <a href="https://en.wikipedia.org/wiki/Consolidated_Edison">ConEd</a>, or whatever the electricity provider here is. 
It&#8217;s a monopoly. It provides a resource that everybody uses. But do we think about electricity as creating a concentration of power? Does ConEd have this huge amount of political power, social power, or something like that? No, because with electricity, a lot of the downstream benefits actually came to the users of the electricity rather than the actual entity producing it.</p><p>On the other hand, with social media, it was the opposite case. Social media was everywhere. Everybody uses social media, but the <a href="https://en.wikipedia.org/wiki/Economic_rent">rents</a> went to the platform.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s a really interesting point. I don&#8217;t endorse this take yet, I&#8217;m going to talk out loud. The more you think our economy is going to be run on AGI the way our economy currently runs on electricity&#8212;that is, there&#8217;s a broad fundamental transformation of the entire economy&#8212;the more it looks like electricity&#8230; Every company in the S&amp;P of the future, if it&#8217;s going to make it to the S&amp;P 500, it is because it has leveraged AI.</p><p><strong>Alex Imas</strong></p><p>Exactly. And then you&#8217;re indexed again.</p><p><strong>Dwarkesh Patel</strong></p><p>But then again, I guess if you just look at how concentrated the S&amp;P is over time, just these big tech companies much more so&#8230; I guess this goes to a fundamental point that it&#8217;s hard to reason about how much of the gains from AI these individual private companies will be able to control.</p><p><strong>Alex Imas</strong></p><p>I think the open model thing is going to be a big point here. 
If we&#8217;re indeed in a world where the open models are six months behind the frontier&#8212;or nine months&#8212;then we&#8217;ll hit AGI, we&#8217;ll hit whatever, and in six months, everybody has access to this resource.</p><p><strong>Dwarkesh Patel</strong></p><p>This goes to show you that every question is connected to every other. That question about whether there&#8217;s runaway gains connects to questions about <a href="https://en.wikipedia.org/wiki/Recursive_self-improvement">recursive self-improvement</a>. Even if not recursive self-improvement, then <a href="https://www.ibm.com/think/topics/continual-learning">continual learning</a>, or online learning, which lets a model learn on the job. So if it&#8217;s deployed, it gets to learn more. These are just forecasting technical questions which then impact whether Uganda will have any purchase on the returns of AGI.</p><p>The reason I&#8217;m emphasizing the question is I think both for the messy middle and for developing countries, a recommendation that is often made naively is that you&#8217;ve got to do some kind of retraining. You&#8217;ve got to do some kind of jobs program, or you&#8217;ve got to have them build data centers in your country. I think you guys are suggesting something closer to just buying the index of AGI. That&#8217;s probably a much cleaner strategy and much more likely to succeed.</p><p><strong>Alex Imas</strong></p><p>These are the two scenarios. I think there is a world where it is concentrated, in which case it&#8217;s going to be really hard to index AGI. There is another world where it is electricity. Basically every company has access to AGI. So you just buy the index. Nigeria just needs to buy the index, and Nigeria has access to AGI because of the open models.</p><p><strong>Phil Trammell</strong></p><p>Just to get back to the question about whether to go with retraining or trying to index. I would prioritize trying to index, just given how fast AI could hit the world. 
But I definitely wouldn&#8217;t just rely on that.</p><p>In the messy middle cases or the long-timeline cases where we don&#8217;t get anything like AGI all that soon, it would be leaving a lot of value on the table if you could have retrained to be a bit better educated on how to use the latest wave of computing. I don&#8217;t think there&#8217;s that much of an either/or there.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe the reason to be pessimistic about this is because one of the reasons a country is poor is that it has a bad education system, so becoming the best in the world at retraining people at using AI doesn&#8217;t seem like a particularly promising strategy for that poor country.</p><p><strong>Alex Imas</strong></p><p>Although there are cases where, in developing countries, you had this <a href="https://en.wikipedia.org/wiki/Leapfrogging#Leapfrogging_in_developing_countries">leapfrogging effect</a> with, for example, <a href="https://www.richmondfed.org/-/media/richmondfedorg/publications/research/working_papers/2021/wp21-05r.pdf">mobile banking</a>. It&#8217;s much more prevalent in <a href="https://www.gsb.stanford.edu/insights/mobile-banking-takes-nigeria">Nigeria</a> than it is in Germany. Everybody is doing mobile banking. They have it on their phones, and they&#8217;re constantly doing this sort of thing.</p><p>Again, I&#8217;m not putting probabilities on this, but with a transformative technology like AI, you could get leapfrogging where you skip the step in the middle and get really astronomical growth.</p><p><strong>Phil Trammell</strong></p><p>Just about the ease of indexing, I think it&#8217;s definitely something to worry about a bit and keep an eye on. But as discussed in our own essay, and as other people have pointed out, it&#8217;s already not that hard to index. 
There&#8217;s been a bit of an increase in the privatization of returns, but still, well under 20% of the total market cap of non-tiny companies in the US is private.</p><p>Everyone thinks about OpenAI and Anthropic. If that&#8217;s where all the wealth will accrue, then all these questions about whether open models will stay only a little bit behind, those are important. But even they look like they&#8217;re going public before too long, probably. The frictions that have been keeping companies from going public might themselves be alleviated by AI a lot, just all of the disclosure requirements and whatnot. They want to get access to more potential investors, too.</p><p>If I had to guess, I would guess that the long general trend of lowering those frictions and making it easier for more and more people to index will continue, despite the recent bump in the other direction.</p><p><strong>Dwarkesh Patel</strong></p><p>This actually makes me hope even more so than before that the labs do get commoditized, or at the very least they go public as soon as possible. But hopefully they just get totally commoditized. I think AI will be much more popular and, more importantly, will be much more likely to lead to broad increases in prosperity if it is as hard to capture the gains of AI as it is to capture the gains of electrification.</p><p><strong>Alex Imas</strong></p><p>Exactly. There&#8217;s no anti-electricity people out there.</p><p><strong>Dwarkesh Patel</strong></p><p>I mean electricity doesn&#8217;t take your job, but&#8212;</p><p><strong>Phil Trammell</strong></p><p>Well, it takes some people&#8217;s jobs.</p><p><strong>Dwarkesh Patel</strong></p><p>It took some people&#8217;s jobs, yeah.</p><p><strong>Alex Imas</strong></p><p>This is maybe tangential to the conversation but I think narratives matter. There&#8217;s this really negative narrative around AI right now, but that&#8217;s because people are not putting out the positive narrative. There&#8217;s a reason. 
It&#8217;s more difficult to imagine a good thing that doesn&#8217;t exist than losing something that exists. It&#8217;s much easier for somebody to go on a podcast and say, &#8220;These jobs that you like, they&#8217;re going away,&#8221; than for somebody to spin up a utopia which doesn&#8217;t exist yet.</p><p><strong>Phil Trammell</strong></p><p>I hope this isn&#8217;t too out of left field, but I would be remiss if I didn&#8217;t point out one big cost of having commoditized frontier AI models, which is the tech race dynamic. For safety purposes, you might want fewer frontier companies so that each one has a buffer in case they want to slow things down to make things safer.</p><p>The way this relates to our point before about the widespread access of the returns, is that I think there&#8217;s a lot less of a trade-off there than some people imagine. Some people think either frontier AI gets commoditized and we all enjoy the benefits&#8212;but there might be some risk, because the market&#8217;s really competitive and cutthroat&#8212;or things are safer because there&#8217;s a big gap between the leader and the laggard. But that means the leaders get fantastically wealthy? No. You could just have a relatively big gap, but it&#8217;s a public company, and ownership in it is widely distributed.</p><p><strong>Dwarkesh Patel</strong></p><p>More recently, I have been thinking that the risk of commodification&#8212;which is that it diffuses the ability to use AI to harmful ends&#8212;is worth the benefit. I worry that having these concentrated labs not only makes it so that the surplus isn&#8217;t as widely distributed through society, but also creates a very tangible, clear political target for the government. We saw this with the <a href="https://www.lawfaremedia.org/article/what-the-defense-production-act-can-and-can't-do-to-anthropic">Defense Production Act threat against Anthropic</a>. 
If there wasn&#8217;t one lab, or a couple of labs, that are clearly ahead of others, this kind of threat would be much harder to make.</p><p>Thank you guys for doing this. I feel like there&#8217;s a lot of unresolved questions, but it is helpful to know what the first branch is along all these important dimensions.</p><p><strong>Phil Trammell</strong></p><p>Great.</p><p><strong>Alex Imas</strong></p><p>Thank you.</p>]]></content:encoded></item><item><title><![CDATA[Reiner Pope – Chip design from the bottom up]]></title><description><![CDATA[Working up from basic logic gates to why GPUs, TPUs, FPGAs, and the human brain each look the way they do.]]></description><link>https://www.dwarkesh.com/p/reiner-pope-2</link><guid isPermaLink="false">https://www.dwarkesh.com/p/reiner-pope-2</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 22 May 2026 15:38:34 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/198847047/2743eb4e51ed68a1e96ee4dd1d97ebdf.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>New blackboard lecture with Reiner Pope: how do chips actually work - starting with basic logic gates, and working up to why GPUs, TPUs, FPGAs, and the human brain each look the way they do.</p><p><a href="https://reiner.org/">Reiner</a> is CEO of <a href="https://matx.com/">MatX</a>, a new chip startup (full disclosure - I&#8217;m an angel investor). 
He was previously at Google, where he worked on <a href="https://arxiv.org/abs/2211.05102">software</a> <a href="https://jax-ml.github.io/scaling-book/">efficiency</a>, compilers, and TPU architecture.</p><p>Watch this one on <a href="https://youtu.be/oIk3R-sMX5o">YouTube</a> so you can see the chalkboard.</p><div id="youtube2-oIk3R-sMX5o" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;oIk3R-sMX5o&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/oIk3R-sMX5o?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Sponsors</h2><ul><li><p><a href="https://crusoe.ai/dwarkesh">Crusoe</a> was one of only five GPU clouds that made the gold tier in SemiAnalysis&#8217; most recent ClusterMAX report. Gold-tier providers like Crusoe delivered 5-15% lower TCO than silver-tier clouds, even with identical GPU pricing. This is because optimizations like early fault detection and rapid node replacement don&#8217;t necessarily show up in the sticker price, but still matter a ton in the real world. Learn more at <a href="https://crusoe.ai/dwarkesh">crusoe.ai/dwarkesh</a></p></li></ul><ul><li><p><a href="https://cursor.com/dwarkesh">Cursor</a> is where I do most of my work&#8212;from reading research papers to visualizing technical concepts to coding up internal tools for the podcast. Most recently, I used it to build two different review interfaces for my essay contest, one that anonymizes submissions for scoring and another that lets me see applicants&#8217; essays next to their resumes and websites. Whatever you&#8217;re working on, you should try doing it in Cursor. 
Get started at <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> let me ask Ron Minsky and Dan Pontecorvo, two senior Jane Streeters, a bunch of questions about how they use AI. We discussed everything from the types of models they&#8217;re training to how they think about the future of trading to why they&#8217;re more bullish than ever on hiring technical talent. You can watch the full conversation and learn more about their open positions at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><h2>Timestamps</h2><p>00:00:00 &#8211; Building a multiply-accumulate from logic gates</p><p>00:16:31 &#8211; Muxes and the cost of data movement</p><p>00:26:10 &#8211; How systolic arrays work</p><p>00:39:11 &#8211; Clock cycles and pipeline registers</p><p>00:51:51 &#8211; FPGAs vs ASICs</p><p>01:03:25 &#8211; Cache vs scratchpad</p><p>01:07:27 &#8211; Why CPU cores are much bigger than GPU cores</p><p>01:12:00 &#8211; Brains vs chips</p><p>01:15:33 &#8211; A GPU is just a bunch of tiny TPUs</p><h2>Transcript</h2><h3>00:00:00 &#8211; Building a multiply-accumulate from logic gates</h3><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m back with <a href="https://www.dwarkesh.com/p/reiner-pope">Reiner Pope</a>, CEO of <a href="https://matx.com/">MatX</a>, a new AI chip company. <a href="https://www.dwarkesh.com/p/reiner-pope">Last time</a> we were talking about what happens inside a data center. Now I want to understand what happens inside an AI chip. How does a chip actually work? Full disclosure, by the way: I am an angel investor in MatX. So hopefully you have designed a good chip.</p><p><strong>Reiner Pope</strong></p><p>Hope so. 
I&#8217;ll start with the smallest fundamental unit of <a href="https://en.wikipedia.org/wiki/Integrated_circuit_design">chip design</a>, and we&#8217;ll build up to what an actual production chip is and what its components are. At the very bottom level of a chip, the primitives we work with are <a href="https://en.wikipedia.org/wiki/Logic_gate">logic gates</a>, very simple things like AND, OR, and NOT. These are connected together by wires that have to be laid out physically as metal traces on a chip.</p><p>The main function that AI chips want to compute is the <a href="https://towardsdatascience.com/a-birds-eye-view-of-linear-algebra-why-is-matrix-multiplication-like-that/">multiplication of matrices</a>. Inside that, the fundamental primitive is a <a href="https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation">multiply-accumulate</a> of pairs of numbers. We&#8217;re going to demonstrate what that calculation looks like by hand, and then infer what a circuit would look like for that.</p><p>It&#8217;ll be easiest if I do a multiply-accumulate of a four-bit number with another four-bit number. The clearest primitive is actually multiply-accumulate. So there&#8217;s a multiply of these two terms, and then we&#8217;re going to add in an eight-bit number.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I ask a clarifying question? Why is this the natural primitive for whatever computation happens inside a computer?</p><p><strong>Reiner Pope</strong></p><p>There are a few reasons. It&#8217;s a little bit more efficient, but the reason it&#8217;s natural for AI chips is that if you look at what&#8217;s happening during a matrix multiply&#8230; What is a matrix multiply in short? There&#8217;s a for-loop over <em>i</em>, over <em>j</em>, and over <em>k</em>, of output [<em>i</em>, <em>k</em>] += input [<em>i</em>, <em>j</em>] x other input [<em>j</em>, <em>k</em>]. 
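</p><p>A rough Python sketch of the loop Reiner describes, assuming small integer matrices stored as nested lists. The function name <code>matmul_mac</code> is illustrative and does not come from the conversation:</p>

```python
def matmul_mac(a, b):
    """Multiply a (n x m) by b (m x r) using nothing but multiply-accumulates."""
    n, m, r = len(a), len(b), len(b[0])
    out = [[0] * r for _ in range(n)]  # every accumulator starts at zero
    for i in range(n):
        for j in range(m):
            for k in range(r):
                # One multiply-accumulate per innermost step.
                out[i][k] += a[i][j] * b[j][k]
    return out


print(matmul_mac([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

<p>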
A multiply-accumulate happens at every single step of a matrix multiply.</p><p>The other observation is that the precision will almost always be higher in the accumulation step than in the multiplication step. This is specific to AI chips. You&#8217;re multiplying low-precision numbers, and then when you accumulate, errors accumulate quickly, so you need more precision there. This is why we&#8217;ve chosen to do a four-bit multiplication and an eight-bit addition.</p><p><strong>Dwarkesh Patel</strong></p><p>Let me make sure I understood that. There are two ways to understand that. One is that the value will be larger than the inputs. The other is that if it was a floating-point number it would be&#8230; Maybe that part is less intuitive to me. But maybe it&#8217;s the same principle?</p><p><strong>Reiner Pope</strong></p><p>It really is the same principle. The separate principle is that as you&#8217;re summing up this number, you&#8217;re summing up a whole bunch of numbers, so you&#8217;ve got a lot of rounding errors accumulating. Whereas in this case, there&#8217;s only one multiplication in the chain, so there aren&#8217;t a lot of rounding errors accumulating in the multiplication.</p><p><strong>Dwarkesh Patel</strong></p><p>Why are you summing up a whole bunch of numbers? There&#8217;s just two numbers there.</p><p><strong>Reiner Pope</strong></p><p>This summation is repeated <em>j</em> many times.</p><p><strong>Dwarkesh Patel</strong></p><p>Any errors accumulate. I see.</p><p><strong>Reiner Pope</strong></p><p>So how would we perform this calculation by hand? As a human, we would probably separate it into two, but we can do it all in one using long multiplication.</p><p>For the multiplication term first, we&#8217;re going to multiply this four-bit number by every single bit position in the other four-bit number. We write that out. First, 1001 multiplied by this bit position. That is the number itself. Then shifted across by one, we&#8217;re multiplying by 0. 
That gives us an all-0 number. Shifted across one more to multiply by this one, we get 1001. Finally, for this last bit position, we get an all-0 number again.</p><p>This gives us a bunch of terms that we have to add for the multiplication. While we&#8217;re doing that summation, we might as well add in the actual accumulator term as well. So we just copy that directly across. So this is the sum. It&#8217;s a five-way sum that we want to compute.</p><p>What logic gates did it take us to get to this intermediate step? We needed to produce all 16 of these partial products. How do I produce one of these partial products? Let&#8217;s take this number 1, for example here. We produce it by multiplying this number by this one over here. We can produce that with an <a href="https://en.wikipedia.org/wiki/AND_gate">AND gate</a>. This number is 1 if both this bit is 1 and this bit is 1. If either of them is 0, then the multiplication of 0 times anything is 0. To produce all of this, we ended up consuming 16 AND gates. In the general case, if I were doing a <em>p</em> bit multiply times a <em>q</em> bit multiply, this will be <em>p</em> times <em>q</em> many ANDs.</p><p>Finally, I sum them. Most of the work is going to happen in the summing. Let me describe the other logic gate that we use here. AND is almost the simplest logic gate that exists on a chip. It&#8217;s almost the smallest. At the other extreme, the very largest logic gate you&#8217;ll typically use is something called a <a href="https://www.geeksforgeeks.org/digital-logic/full-adder-in-digital-logic/">full adder</a>.</p><p>Coming from software, you might think that a full adder adds 32-bit numbers together. In this case, it just adds three single-bit numbers together, so you can think of it as adding 0, 1, and 1 together. When I add these together, the result can be 0, 1, 2, or 3, so I can express that in binary using just two bits. As input, it has three bits. As output, it has two bits. 
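</p><p>A rough Python sketch of the full adder as described here: it counts the ones among three input bits and reports that count as two output bits. The function name <code>full_adder</code> and the (carry, sum) return order are illustrative choices, and plain arithmetic stands in for the actual gates:</p>

```python
def full_adder(a, b, c):
    """Count the ones among three bits and return the count as (carry, sum)."""
    total = a + b + c  # can only be 0, 1, 2, or 3
    carry = total // 2  # the twos bit, which moves one column to the left
    low = total % 2  # the ones bit, which stays in the same column
    return carry, low


# Enumerate all eight input combinations.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            print(a, b, c, full_adder(a, b, c))
```

<p>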
The number 2 in binary is 10. This is also known as a 3&#8594;2 compressor because it takes three bits of input and produces two bits of output.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understood: the two inputs are an X and a Y value and then some carry that came in&#8230;</p><p><strong>Reiner Pope</strong></p><p>The three inputs are all bits in the same bit position, like three bits in a column here. The two outputs, I&#8217;ve drawn them vertically here and horizontally here to match this vertical versus horizontal layout. This expresses that things in the same column are in the same bit position, whereas things in adjacent columns are different. This is a carry out, whereas this was the sum.</p><p><strong>Dwarkesh Patel</strong></p><p>So if the inputs in the full adder were, say, 101, then the output would be 10. If it were 111, it&#8217;d be 11. If it were 000, it&#8217;d be 00. If it were 010, it&#8217;d still be 01. Got it.</p><p><strong>Reiner Pope</strong></p><p>Yeah. It&#8217;s essentially just counting the number of things and expressing that in binary. This circuit captures what we as humans naturally do when we&#8217;re summing along a column.</p><p>I&#8217;ll show one iteration of using the full adder to sum. The way I sum here is going to be a little unnatural for humans. We would sum along the column and then remember the carry, but instead of remembering the carry, we&#8217;ll explicitly write it out. We proceed from the rightmost column toward the left. On the rightmost column, we sum the 1 and the 1, and that produces a zero here and a carry of one. We&#8217;ve used this full adder circuit on this pair of bits and produced a pair of bits as output.</p><p>Now we can do the same thing with this column. We have a column of four numbers, so we&#8217;ll take the first three of them, run a full adder on them, and that gives us a 0 and a 0 as output. The sum of these is 00. That&#8217;s the full adder applied to all these bits. 
As I&#8217;ve used up bits, I&#8217;ll cross them out to indicate that I&#8217;ve handled them.</p><p>Let&#8217;s keep going a little bit more. I take these three numbers, I add them, and that gives me a 1 and a 0. I&#8217;ve dealt with these three numbers. Now I take these three numbers and add them, and that gives me a 1 and a 0, and I&#8217;ve dealt with these numbers.</p><p>The way to view this is that I have this whole grid of numbers that need to be added. I&#8217;m going to keep applying full adders to all the bits here, constantly removing three numbers from a column and writing out two numbers as output. Keep going over and over again until I eventually get just one single number coming out.</p><p>This approach is called a <a href="https://en.wikipedia.org/wiki/Dadda_multiplier">Dadda multiplier</a>. This is the standard for how you do area-efficient multipliers using full adders. Let&#8217;s try to quantify the circuit size of this so we have a sense of how big things are and can compare them later.</p><p>How many full adders did I use? How many numbers did I start with? I have the 16 partial products, which is a product of all of these terms with all of these terms, plus the eight terms that I&#8217;m adding here. I started off with 24 bits. Eventually I produced eight bits on the output. In every step, I was crossing off three numbers and writing two numbers out as a result.</p><p>Every single use of a full adder eliminates one of the bits here. So how many full adders? It must be 24 minus 8, so there were 16 full adders in this circuit. This is true in the general case as well. There will be <em>p</em> times <em>q</em> many full adders in this circuit.</p><p><strong>Dwarkesh Patel</strong></p><p>Let me make sure I understand the logic of that. The input bits, 24, is <em>p</em> x <em>q</em>, plus <em>p</em> + <em>q</em>. The output bits are just <em>p</em> + <em>q</em>. 
So <em>p</em> x <em>q</em> plus <em>p</em> + <em>q</em>, minus <em>p</em> + <em>q</em> equals <em>p</em> x <em>q</em>.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. I think this explains, or at least hints at, the second reason we chose to do a multiply-accumulate. The first reason is that it&#8217;s what shows up in matrix multiplication. The second is that it gave us this very slick, simple <em>p</em> x <em>q</em>, very simple algebra.</p><p>We&#8217;ve described this whole procedure. Every single atomic step that I took here becomes a logic gate, and then the wires are connected together. When I had these three inputs that I used to produce these two outputs, if I think of mapping this to a physical device, there would be a wire connecting all three of these things together into a logic gate that produced this output.</p><p>This is the main primitive, at different bit widths, that&#8217;s inside an AI chip. We&#8217;re going to build up from here to how you would use it to run all the other operations you might want.</p><p><strong>Dwarkesh Patel</strong></p><p>This might be the wrong time to ask, but whenever Nvidia reports that this chip can do X many <a href="https://www.exxactcorp.com/blog/hpc/what-is-fp8-fp6-fp4">FP4</a> or half as many <a href="https://www.exxactcorp.com/blog/hpc/what-is-fp8-fp6-fp4">FP8</a>, it seems to imply those circuits are fungible, that there&#8217;s not a dedicated FP4 versus FP8. But the way you&#8217;re mapping it out here, it seems like if it has to be mapped out in the logic, you would need a dedicated FP4 multiply-accumulate and then a dedicated FP8 accumulate. Can you &#8220;funge&#8221; them?</p><p><strong>Reiner Pope</strong></p><p>As drawn, they&#8217;re not particularly fungible. This is actually one of the main choices you have to make when designing a chip: how much of FP4 and how much of FP8 do I have? Sometimes I&#8217;ll make that consideration from the point of view of the customer requirement. 
Another angle is to equalize the power budget between FP4 and FP8.</p><p><strong>Dwarkesh Patel</strong></p><p>When they report those numbers and it just happens to be the case that it does 2x as many FP4 as FP8, they&#8217;re just choosing to give equivalent <a href="https://en.wikipedia.org/wiki/Die_(integrated_circuit)">die</a> areas to all the floating points, and as a result it ends up being&#8230;?</p><p><strong>Reiner Pope</strong></p><p>Why is the ratio exactly 2x? Part of it is that surely it won&#8217;t be exactly equivalent to die area. There&#8217;s a data movement reason. We&#8217;ll maybe come back to this when we look at how it goes into and out of memories. There&#8217;s something really nice from a software level about the fact that I can pack two four-bit numbers into the same storage as an eight-bit number. When I store that to memory, the sizing of the buses that I wire out within the chip makes that work out really nicely.</p><p><strong>Dwarkesh Patel</strong></p><p>Come to think of it, it&#8217;s not just 2x. The amount of area it takes sounds like it&#8217;s quadratic with the bit length. That&#8217;s why smaller precision is even more favorable than you might think.</p><p><strong>Reiner Pope</strong></p><p>This is a really big reason. In fact, Nvidia made a change. Historically, up until <a href="https://www.exxactcorp.com/blog/hpc/comparing-nvidia-tensor-core-gpus">B100</a> or <a href="https://www.nvidia.com/en-us/data-center/dgx-b200/">B200</a>, every time you halved the bit precision, you doubled the <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOP</a> count. For the reason you said, because of this quadratic scaling, that ratio is actually slightly wrong. You should get an even bigger speedup than you might otherwise think. 
Nvidia&#8217;s product specs have started acknowledging that in <a href="https://www.nvidia.com/en-us/data-center/dgx-b300/">B300</a> and beyond, where the FP4 is three times faster than the FP8.</p><p><strong>Dwarkesh Patel</strong></p><p>Though it should be 4x.</p><p><strong>Reiner Pope</strong></p><p>Yeah. What I&#8217;ve shown here is the simplest case of integer multiply. When you&#8217;re dealing with floating point, as you do in FP4 and FP8, there&#8217;s this other term, the exponent, that complicates the calculation.</p><p>What can we see already from this? I think the big observation you&#8217;ve made is that there&#8217;s this quadratic scaling with bit width, which is very effective and is the single reason low-precision arithmetic has worked so well for <a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)">neural nets</a>. The other thing we&#8217;re going to do now is compare the area spent on the multiplication itself with all the circuitry around it.</p><h3>00:16:31 &#8211; Muxes and the cost of data movement</h3><p><strong>Reiner Pope</strong></p><p>We&#8217;ll walk back in time a little bit and see how <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPUs</a> prior to <a href="https://www.nvidia.com/en-us/data-center/tensor-cores/">Tensor Cores</a> worked, which is in fact the same way <a href="https://en.wikipedia.org/wiki/Central_processing_unit">CPUs</a> worked. Where do we stick this multiply-accumulate unit? Generically, I&#8217;ll describe a <a href="https://en.wikipedia.org/wiki/CUDA">CUDA</a> core or a CPU. You&#8217;ll have some <a href="https://en.wikipedia.org/wiki/Register_file">register file</a> which stores some number of entries, maybe eight entries of, in this case, 4-bit numbers, but typically 32-bit numbers.</p><p>Inside the CUDA core, I&#8217;ll have some register file of some depth, and then I&#8217;ll have my multiply-accumulate circuit. 
What it&#8217;s going to do is take three arbitrary registers from this register file, perform the multiply-accumulate, and then write back to the register file. It&#8217;s going to write to this one, but it was able to read from this one, this one, and another random one. It will take three inputs like this.</p><p>This is the core data path of many processors. Most processors look like this. You&#8217;ve got some set of registers, and then you&#8217;ve got some set of logic units, or <a href="https://en.wikipedia.org/wiki/Arithmetic_logic_unit">ALUs</a>. We want to analyze the cost of the data movement from the register file to the ALU and back.</p><p>Ultimately, there&#8217;s going to be some circuit that says, &#8220;Well, I don&#8217;t always have to select this guy. I might select any of the registers at any point in time.&#8221; The first question is: how can I build a circuit? The circuit I&#8217;m going to look for is a <a href="https://en.wikipedia.org/wiki/Multiplexer">mux</a>. In this case, it&#8217;s going to have eight inputs, one from each entry of the register file, and it&#8217;s going to have one output, which is actually producing this output.</p><p>What is the cost of this thing? All we have to build it out of is AND and <a href="https://en.wikipedia.org/wiki/OR_gate">OR</a>. How do we build it? We do the dumbest thing possible. We form a mask. When we want to read the third entry, we&#8217;re going to AND every single entry with either 1 or 0 based on whether that&#8217;s what we want to read, and then we&#8217;re going to OR all of them together.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understand the basics. What the mux is doing is just selecting an input?</p><p><strong>Reiner Pope</strong></p><p>Just selecting, invisible to software. You say &#8220;I want input number three,&#8221; and that means there&#8217;s a mux here.</p><p>So what is the cost of this mux? An <em>n</em>-input mux operating on <em>p</em> bits. 
I have <em>n</em> rows. That&#8217;s just eight rows, and each row is <em>p</em> bits wide. I have to AND every single bit, so I get <em>n</em> x <em>p</em> many AND gates. For every single input I have to decide whether I&#8217;m going to <a href="https://www.geeksforgeeks.org/digital-logic/multiplexers-in-digital-logic/">mask it out</a> or not. Then I&#8217;m going to OR them all together.</p><p>There&#8217;s going to be <em>n</em> &#8211; 1 times <em>p</em> many OR gates. I&#8217;ve got all of these different things, almost all of them are 0s, but I need to collapse them from my eight options down into one option. Every step, I need to OR one row into an existing row.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s actually funny that you don&#8217;t think at the level of hardware. You just think, &#8220;Oh, I&#8217;ll just select element three,&#8221; and something as simple as that is in and of itself quite a complicated circuit.</p><p><strong>Reiner Pope</strong></p><p>This is the first step of all of the hidden data movement costs that show up. We&#8217;re just going to compare. I have to pay this cost. I&#8217;ve got one mux here, and in fact I have two more copies of that for each of the three inputs to my multiply-accumulate operation. I have this cost, which is 3 x <em>n</em> x <em>p</em> AND gates over here, compared to <em>p</em> x <em>q</em> gates in the actual circuit that is doing the thing I care about.</p><p>If we plug in actual numbers, with <em>n</em> being eight, I get 24 x <em>p</em> gates just in the data movement, compared to&#8212;if <em>q</em> is four&#8212;4 x <em>p</em> gates just in the multiply-adder.</p><p><strong>Dwarkesh Patel</strong></p><p>Where is the three coming from?</p><p><strong>Reiner Pope</strong></p><p>Three different inputs here. 
What I&#8217;m hinting at is that all of this work, which scales as the size of the register file&#8212;and this is a very small register file&#8212;all of this work just moving the data from the register file to the logic unit is many, many times more expensive than the logic unit.</p><p><strong>Dwarkesh Patel</strong></p><p>It may be helpful to just see what a mux looks like, maybe a two-bit or a four-bit mux.</p><p><strong>Reiner Pope</strong></p><p>We&#8217;ll do a two-way mux. We&#8217;ve got two different numbers, we&#8217;ve got these two inputs. These are the inputs that are being selected between, and the selector can either be &#8220;I want this one&#8221; or &#8220;I want the other one.&#8221; This is a <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> encoding.</p><p>This is what we start with. Let&#8217;s focus on this case. This is the actual input we got, and we want to produce this guy as the result. Very laboriously, we AND this bit with all of these. That produces ANDing this bit with this row. Likewise, we AND this bit with this row. That produces all 0s. There are four ANDs here.</p><p>Finally, we OR these two together, and this gives a 1. We OR these two together, this gives a 1. We OR these two together, it gives a 0. We OR these two together and it gives a 1. Those are the four ORs. This actually ends up looking a little bit like addition. We did exactly the same set of ANDs. We&#8217;ve ANDed all of these things together, but then instead of collapsing it by using full adder circuits, we just get a very simple collapsing with OR gates.</p><p><strong>Dwarkesh Patel</strong></p><p>But that doesn&#8217;t look like <em>n</em> times <em>p</em>.</p><p><strong>Reiner Pope</strong></p><p>This was with <em>n</em>=2 inputs. In the general case, we will have <em>n</em> rows, and we&#8217;ll have <em>p</em> bits per row. That gives us the <em>n</em> times <em>p</em> many AND gates. 
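</p><p>A rough Python sketch of the AND-then-OR mux Reiner walks through, with a gate count attached. The names <code>and_gate</code>, <code>or_gate</code>, and <code>mux</code> are illustrative, the second input row is arbitrary, and the logic that turns a register index into a one-hot selector is assumed to exist and is not counted:</p>

```python
def and_gate(a, b):
    return int(bool(a) and bool(b))


def or_gate(a, b):
    return int(bool(a) or bool(b))


def mux(rows, select):
    """Pick rows[select] using only AND and OR gates; also report gates used."""
    n, p = len(rows), len(rows[0])
    # Assumption: a one-hot selector is already available (decoder not modeled).
    one_hot = [int(i == select) for i in range(n)]
    # Mask every bit of every row with that row's selector bit: n * p ANDs.
    masked = [[and_gate(bit, one_hot[i]) for bit in rows[i]] for i in range(n)]
    ands = n * p
    # Fold the masked rows together one row at a time: (n - 1) * p ORs.
    result = masked[0]
    ors = 0
    for row in masked[1:]:
        result = [or_gate(x, y) for x, y in zip(result, row)]
        ors += p
    return result, ands, ors


# Two-way mux on four-bit rows, selecting the first row.
print(mux([[1, 1, 0, 1], [0, 1, 1, 0]], 0))  # ([1, 1, 0, 1], 8, 4)
```

<p>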
In this circuit I&#8217;ve described, almost all of the cost, seven-eighths of the cost, is in reading and writing the register file, and only a tiny fraction of the cost is in the logic unit itself.</p><p>This is the problem to solve. This essentially was the state of play prior to the <a href="https://en.wikipedia.org/wiki/Volta_(microarchitecture)">Volta generation</a> of Nvidia GPUs. This kind of thing is what was inside the CUDA cores. This problem statement is what motivated the introduction of Tensor Cores, which are more generically called <a href="https://en.wikipedia.org/wiki/Systolic_array">systolic arrays</a>.</p><p>Think about how we&#8217;re going to solve this problem. We&#8217;re spending almost all of our circuit area on something that we really don&#8217;t care about and is hidden to the software programmer, and the thing that we actually care about is not much of the area. Make this one bigger somehow while keeping this at the same size. That&#8217;s the goal.</p><h3>00:26:10 &#8211; How systolic arrays work</h3><p><strong>Reiner Pope</strong></p><p>The evolution was that we had baked this much into hardware at this stage. This single line is a multiply-accumulate, and this single thing was baked into hardware. The idea of a systolic array is to go two levels of loops up and bake this entire loop out here into hardware. The idea is that if we have a much bigger granularity fixed-function piece of logic, maybe the taxes we pay on the input and output are much smaller.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. It sounds like you&#8217;re suggesting that if you go up one step in the matrix multiply loop, you can tilt the balance more towards compute than communication.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. There are two effects we&#8217;re going to take advantage of here. One is that we can do more stuff per every trip through our register file. 
The other is that in some of this loop, we can take advantage of certain things staying fixed.</p><p>Visually, we&#8217;re going to look at this matrix multiplication. This portion of the loop corresponds to a <a href="https://cse.buffalo.edu/faculty/atri/courses/matrix/matrix-vect-notes.pdf">matrix-vector multiplication</a>. We&#8217;ll take a matrix and multiply it by a vector. How do we do this? Every column gets multiplied by the vector and then summed. We&#8217;re going to sum along columns.</p><p>This 0 and 3 gets multiplied by the 3 and 7 and gets summed, and then the 1 and 2 gets multiplied by the 3 and 7 and gets summed. There is a multiply-accumulate associated with every single one of these entries in the matrix. We&#8217;ll draw out these four multiply-accumulates.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understand why there are four multiply-accumulates: each entry in the column that corresponds to the output vector is a <a href="https://en.wikipedia.org/wiki/Dot_product">dot product</a>, and in this case it will be two multiplications and then the addition of those two multiplications. You&#8217;re accumulating...</p><p><strong>Reiner Pope</strong></p><p>Really there&#8217;s only one addition per dot product, but we like to start with zero.</p><p><strong>Dwarkesh Patel</strong></p><p>But it includes the initialization of zero.</p><p><strong>Reiner Pope</strong></p><p>Yeah. We want to have quadratically more compute. We have <em>x</em> times <em>y</em> as much compute as we had before. But we want to aim for having only <em>x</em> times as much communication. The intention is to get this advantage term going as <em>y</em>.</p><p>We&#8217;ve laid down the multiplications. We want to bring in a vector of size two, and that is already in line with our columns target. That&#8217;s fine. 
However, we need to manage the communication of this matrix, which exceeds our budget of <em>x</em>.</p><p>The idea is that in an AI context, this matrix is going to stay fixed for a long period of time. We&#8217;ve got some register files sitting over here. The amount of stuff coming out of this register file&#8230; this is the term that we want to go as <em>x</em>, in some sense. We don&#8217;t want to bring this full matrix in from the register file every cycle, because that would cost too much in terms of wiring from the register file.</p><p>Our key trick is that this matrix can be stored locally to the systolic array. We&#8217;ll store these numbers 0, 1, 2, and 3 in a gate called a register that physically stores these numbers, and we&#8217;re going to reuse these numbers over and over again for a large number of different vectors.</p><p><strong>Dwarkesh Patel</strong></p><p>The optimization here is that the nature of matrix multiplication is that you can store this square quadratic thing directly where the logic is happening, which has an extra dimension compared to the inputs that you keep swapping in and out.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>This is the nature of what a matrix multiplication is. You do a lot of multiplication to get one value out. A dot product is the result of a lot of multiplications. So that optimization means you can stuff a lot of multiplication in before you get some value out of it.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. Just to complete the picture of concretely how that looks: I swapped the 3 and the 2 here. Just like this 0 and 3 is going to multiply by the 3 and the 7, we&#8217;re going to form a dot product along columns here. We&#8217;re going to feed a 3 and a 7 in here. This feeds into this multiplication and also feeds into this multiplication. Likewise, the 3 feeds into here and also into here. 
Then we&#8217;re going to sum along here. Starting at the top of a column, we feed in 0s, and then coming out the bottom we get results.</p><p>Visually, there&#8217;s a dot product performed along columns in a matrix, and that maps exactly to what is done spatially in the systolic array. This is one dot product summed vertically, and this is a second dot product also summed vertically.</p><p>What is the data that needs to go into and out of the register file? We have <em>x</em> amount of data coming out on the output, and we also have <em>x</em> amount of data coming from the input. With respect to the input and output vectors at least, we&#8217;ve met our goal of having only <em>x</em> as much data going in and out of the register file.</p><p>This leaves open the question: I said the weight matrix is stored locally in the systolic array, so how did it get there in the first place? At some point, you need to boot your chip and populate this data, so where did that come from?</p><p>The trick is that we just do it very slowly. We very slowly trickle-feed it into the systolic array. The simplest strategy is that we run this daisy chain: feed a number in here, and on the next clock cycle it will move down to the next entry of the systolic array. We can do that in every column in parallel. This is also going to come from here, and that gives us another factor of approximately <em>x</em> units of bandwidth coming in.</p><p><strong>Dwarkesh Patel</strong></p><p>Would you mind repeating that sentence one more time?</p><p><strong>Reiner Pope</strong></p><p>We know that we&#8217;re going to be bringing numbers only rarely into the matrix. We just want to come up with any construction at all such that the amount of wiring that crosses the boundary of the systolic array is bounded by <em>x</em> and does not go as <em>xy</em>.</p><p>A particularly simple strategy is that we bring a number into the top row of the systolic array in one clock cycle. 
Then, for <em>y</em> consecutive clock cycles, we bring in the top row every time and shift all the other rows down by one. This keeps the wiring that needs to come from this expensive register file only down to a factor of <em>x</em> rather than <em>xy</em>.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. There are two questions in terms of communication: communication time and communication bandwidth. You&#8217;re saying that since we&#8217;re only going to be loading this in once, let&#8217;s minimize bandwidth, because bandwidth equals die area. We load it in slowly over smaller lanes because we&#8217;re just going to keep this value in there for a while.</p><p><strong>Reiner Pope</strong></p><p>Exactly.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s interesting to me that when we were talking last time about inference across many chips, the big high-level thing we&#8217;re trying to optimize for is increasing the amount of compute per memory bandwidth, that is to say, per communication. Here also, we&#8217;re trying to increase the amount of actual multiplies or additions relative to transporting information from registers to the logic. In both cases, you&#8217;re trying to maximize compute relative to communication.</p><p><strong>Reiner Pope</strong></p><p>This shows up all the way up and down the stack. This is close to the bottom, to the gates. There&#8217;s a version that&#8217;s maybe even closer to the gates in the precision of the number format that you choose to use. We saw that same effect. There&#8217;s a squared versus linear term going on both purely in the precision of the ALU, but also in the size of the matrix.</p><p>This unit is the next bigger unit. We had the multiplication circuit, and on top of that we have a pretty large systolic array. I drew it as 2x2, but older <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit">TPUs</a> were described as 128x128 of this circuit shown here. 
This ends up being the most efficient known circuit for implementing a matrix multiply.</p><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;ve talked about how it seems obvious that you should try to maximize compute relative to communication. What are non-obvious trade-offs that keep you up at night, about whether you should do X or Y and it&#8217;s not obvious what the answer is?</p><p><strong>Reiner Pope</strong></p><p>Most of the decisions in chip design are sizing decisions. Already in what we&#8217;ve drawn so far&#8230; AI chips all have this circuit in them. They have a systolic array, and somewhere near it a register file providing inputs and outputs.</p><p>Even within this scope, the sizing questions you have are: how big should I make my systolic array, and how big should I make the register file? These two questions are coupled. One way to think of it is to set a budget for what percentage of chip area you want to spend on data movement. Maybe I say that I want this to be 10% and the systolic array to be 90%.</p><p>Then I can size my register file. Bigger register files are more flexible. They allow me to get more application-level performance out, but they take away from the area spent on the systolic array.</p><h3>00:39:11 &#8211; Clock cycles and pipeline registers</h3><p><strong>Dwarkesh Patel</strong></p><p>Where does the <a href="https://en.wikipedia.org/wiki/Clock_rate">clock cycle</a> of a chip come in? What determines what that is? And what is a clock cycle of a chip?</p><p><strong>Reiner Pope</strong></p><p>At baseline, it&#8217;s worth observing that chips are incredibly <a href="https://en.wikipedia.org/wiki/Parallel_computing">parallel</a>. You&#8217;ve got 100 billion transistors in a chip. 
A key thing you need to do whenever you have massive parallelism is synchronize between the different parallel units.</p><p>In software, typically you have these very expensive synchronization methods like a <a href="https://en.wikipedia.org/wiki/Lock_(computer_science)">mutex</a>. One thread will finish what it&#8217;s doing, grab a lock stored somewhere in memory, and notify the other thread that it&#8217;s done. On chips, we take a very different approach. Every nanosecond or so, all circuitry in the chip will pause for a moment and synchronize. That is the clock cycle. The entire chip typically goes in lockstep to the next operation in one fell swoop.</p><p>What this looks like in circuitry is that the clock is mediated by registers, which are these storage devices we&#8217;ve drawn elsewhere. The way to think of it is: I have some storage holding a bit, which might be 0 or 1. Then I have some cloud of logic, which may be this systolic array or multiplier. I have a bunch of inputs feeding into this cloud of logic, and eventually there&#8217;s going to be some output register that it writes to.</p><p>There is a global clock signal driving all these registers. At a certain instant in time, when the clock strikes, whatever value happens to be on that wire at that instant is what gets stored.</p><p>The challenge is that I would like to have my clock speed run as fast as possible. If I run at two gigahertz, I get twice as many operations done per second as if I run at one gigahertz. But what that ends up meaning is that I&#8217;m very sensitive to the delay through this cloud of logic, because any computation happening in there needs to finish before the next clock cycle hits. A major point of optimization on any chip is to make this delay as short as possible.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. The constraint here seems to be that if you add too much logic, you might risk missing the clock cycle. 
But if you don&#8217;t add enough, you&#8217;re leaving potential compute on the table. Is there ever a situation where you take a probabilistic chance that a computation finishes, or is it strictly that it either finishes by the clock cycle or it doesn&#8217;t?</p><p><strong>Reiner Pope</strong></p><p>In standard chip design, you margin it such that there is a probability, but it&#8217;s many standard deviations out. For all intents and purposes, it is a reliable part and will always meet the clock.</p><p>There are some weird exceptions, like <a href="https://en.wikipedia.org/wiki/Clock_domain_crossing">clock domain crossings</a> where you go from one clock to another. Then you actually do have to reason about this probability. But in the main path, you margin it such that you&#8217;ll get there 25% of the clock cycle in advance, making it very unlikely that it misses.</p><p><strong>Dwarkesh Patel</strong></p><p>Where the clocks synchronize, where the registers are, is this something you determine as a chip designer? Or is it an artifact where you want a certain sequence of logic, and the software you use to convert your <a href="https://en.wikipedia.org/wiki/Verilog">Verilog</a> into what you send to <a href="https://en.wikipedia.org/wiki/TSMC">TSMC</a> just determines that to make it work, you have to put a register here, here, and here, making sure no single step makes the whole chip&#8217;s clock cycle longer than it has to be?</p><p><strong>Reiner Pope</strong></p><p>Inserting them is actually a huge part of the work of designing a chip. It&#8217;s done by a combination of manual and automatic methods.</p><p>To show the very dumb version of what you can do here, you can take this logic and split it in half. Instead of just one cloud of logic, I can have two smaller clouds of logic that do the same thing, but split them up by a register. If you split it in the middle, you can hit twice the clock frequency. 
That&#8217;s great, you get twice the performance, but at the cost of an extra register, which means more storage.</p><p><strong>Dwarkesh Patel</strong></p><p>Stepping back, why do we need to synchronize the whole chip? If you imagine playing <a href="https://en.wikipedia.org/wiki/Factorio">Factorio</a> or something, there&#8217;s no global clock cycle. Things are just done when they&#8217;re done. There&#8217;s iron on the plate, and you can take it if you want.</p><p><strong>Reiner Pope</strong></p><p>Taking that analogy, the thing you need to be mindful of is if I have two different paths through some logic. Say I have to do computation <em>f</em> here and computation <em>g</em> here, and they&#8217;re going to meet for computation <em>h</em>.</p><p>There&#8217;s going to be manufacturing variance. In some chips <em>f</em> will take a little longer; in some chips <em>g</em> will take a little longer. If I have a signal propagating through, and the results from <em>f</em> and <em>g</em> have to meet up at <em>h</em>, what can go wrong is that <em>f</em> gets there early and it meets the previous value of <em>g</em>, or the next value of <em>g</em>.</p><p><strong>Dwarkesh Patel</strong></p><p>Ah. And <em>h</em> needs to know when to start, when the next iteration has&#8230;</p><p><strong>Reiner Pope</strong></p><p>Exactly.</p><p><strong>Dwarkesh Patel</strong></p><p>This explains why different chips made at the same process node, the same TSMC technology, can have different clock cycles. Two chips made at <a href="https://en.wikipedia.org/wiki/3_nm_process">3 nm</a> might have different clock cycles based on whether they were able to optimize to ensure no single critical path is so long that it slows down the whole chip&#8217;s clock cycle.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. This optimization I showed here is called <a href="https://en.wikipedia.org/wiki/Instruction_pipelining">pipeline register insertion</a>. 
We&#8217;ve inserted a register in the middle of the pipeline. This is a pure trade-off between clock speed and area. That is the easy case. There is a harder case too. I drew out a pipeline of logic, but in other cases you may have some calculation which actually feeds back in on itself. It runs some function <em>f</em> and then writes back to itself. For example, this might be an addition where you&#8217;re adding a number every clock cycle. This little circuit essentially sums all the numbers presented on different clock cycles.</p><p>The challenge is, if this plus takes too long, what can I do? If I try to put a pipeline register right in the middle of it, it changes the computation that&#8217;s done. Instead of forming a running sum of everything that comes in, I will actually have two different running sums. I&#8217;ll end up with a running sum of the even numbers and a running sum of the odd numbers. This constraint&#8212;where I have a loop in my logic, which all chips have somewhere&#8212;is the hardest thing to address and sets the clock cycle.</p><p><strong>Dwarkesh Patel</strong></p><p>I don&#8217;t understand why it would be a problem to have that. I&#8217;m not even sure what it would mean to have a register there. Is it a sort of atomic operation?</p><p><strong>Reiner Pope</strong></p><p>Well, plus is not really atomic.</p><p><strong>Dwarkesh Patel</strong></p><p>As you just demonstrated.</p><p><strong>Reiner Pope</strong></p><p>It took a whole lot of work to do a summation. You can take the early parts of that work, stick a register in the middle, and then take the late parts of that work.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. TSMC offers a <a href="https://en.wikipedia.org/wiki/Process_design_kit">PDK</a> which specifies the primitives of logic they can grant you in the chip. It&#8217;s up to them to determine that no primitive is bigger than the clock cycle they&#8217;re hoping a process node targets. 
But other than that, can&#8217;t you just take all the primitives from TSMC and keep adding registers between them as much as needed until you get to your desired clock cycle?</p><p><strong>Reiner Pope</strong></p><p>As a logic designer, the chip architect sets the clock cycle. For example, the primitives you get from TSMC are on the order of AND gates or full adders. It depends a lot on voltage and which library you choose, but generally you can have about 10, 20, or 30 of these sequentially in a clock cycle. These primitives are very fast, maybe 10 picoseconds.</p><p>As a logic designer, in principle, if you just had a register and an AND gate in a loop, you could get an insanely fast clock speed, more than four, five, or six gigahertz. But if you take this really simple circuit and look at the area you&#8217;re spending here&#8230; This is called one gate equivalent in size, so unit of one in area. This thing is maybe a unit of eight in area.</p><p>Again, almost all your cost becomes synchronization or communication cost compared to the actual logic. This would be a case where you&#8217;ve gone too far. You&#8217;ve made your clock speed really fast at the cost of spending almost all of your area on pipeline registers.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. So you&#8217;re hinting at a dynamic where you can have a really fast clock speed but you&#8217;re not getting that much work done. 
You can have low latency but low throughput.</p><p><strong>Reiner Pope</strong></p><p>It hurts your throughput, in fact, because the throughput of your chip is the product of how much you get done per clock cycle&#8212;which is based on area efficiency&#8212;times how many clocks you get per second.</p><p><strong>Dwarkesh Patel</strong></p><p>This is actually so similar to the thing we were discussing last time about batch size, where if you have a low batch size, any one user can receive their next token really fast, but the total number of tokens processed in, say, an hour will be lower than it could otherwise be.</p><p><strong>Reiner Pope</strong></p><p>Exactly. You get less parallelism out if you drive your clock speed up really high.</p><h3>00:51:51 &#8211; FPGAs vs ASICs</h3><p><strong>Dwarkesh Patel</strong></p><p>I remember talking to an <a href="https://en.wikipedia.org/wiki/Field-programmable_gate_array">FPGA</a> engineer at <a href="https://www.janestreet.com/">Jane Street</a>, Clark, who helped me prep for the previous interview we did together. He was explaining why they use FPGAs. I imagine that for high-frequency trading, throughput is less important than latency, so having very specific control over the clock cycle in a deterministic way is the most important thing. Maybe it&#8217;d be interesting to talk about why you can&#8217;t just achieve that with an <a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit">ASIC</a>, or why you might use an FPGA to have deterministic clock cycles for <a href="https://en.wikipedia.org/wiki/High-frequency_trading">high-frequency trading</a>.</p><p><strong>Reiner Pope</strong></p><p>Let&#8217;s consider the business case for an FPGA versus an ASIC. FPGAs and ASICs use largely the same conceptual model. 
You have a series of gates built from small primitives&#8212;ANDs, <a href="https://en.wikipedia.org/wiki/OR_gate">ORs</a>, <a href="https://en.wikipedia.org/wiki/XOR_gate">XORs</a>&#8212;connected together with wires running in a fixed clock cycle. Anything you can express in an FPGA you can express in an ASIC too. It will be about an order of magnitude cheaper and have better energy efficiency on an ASIC than an FPGA.</p><p>The trade-off is that the first FPGA costs you $10,000, whereas the first ASIC you make costs $30 million because it requires an entire <a href="https://en.wikipedia.org/wiki/Tape-out">tape-out</a>. The business use case for an FPGA is when you want something that has very <a href="https://www.ti.com/lit/ml/slap159/slap159.pdf">deterministic latency</a>, fast runtime, and high parallelism, but you are going to change the workload frequently, maybe every month. You don&#8217;t want to pay that tape-out cost every time.</p><p>How does an FPGA actually emulate the ASIC programming model in a fixed piece of hardware? At its core, it has the two components we just talked about. It has registers as storage devices, and it has <a href="https://en.wikipedia.org/wiki/Lookup_table">lookup tables</a> (LUTs) which provide all of the gates.</p><p>Then there&#8217;s a third component. We have a swarm of these registers and LUTs, and they are connected by a big set of muxes. In front of every single one of these, we have a mux which selects an input from everywhere else. We have a whole bunch of different options feeding into all of these things.</p><p>What this allows is essentially that when I program my FPGA, I can take all of these components and superimpose a particular wiring which goes through this LUT, feed it into another LUT, send it to this register, and then feed it into another LUT, or something like that.</p><p>What I&#8217;ve drawn in orange is how you&#8230; FPGA means Field-Programmable Gate Array. 
The orange is what has been programmed in the field, whereas the white is all the wires that must exist in the FPGA in order to actually make the device in the first place.</p><p><strong>Dwarkesh Patel</strong></p><p>What does it mean to be programmed in the field?</p><p><strong>Reiner Pope</strong></p><p>Programmed in the field means the device is being deployed in a data center. It&#8217;s sitting out in the world, and then you can come and program it.</p><p><strong>Dwarkesh Patel</strong></p><p>Ah, not field as in like electric field. Field as in like out there in the world, ok.</p><p>If I look at how the field programming comes out of the first lookup table and goes into a second one, how does that work?</p><p><strong>Reiner Pope</strong></p><p>Where are the wires that make that happen? I got a little bit lazy in drawing all of these. Every single device here has a mux sitting in front of it, which can select from all of the nearby circuits that are available. The actual configuration of the FPGA amounts to the mux control. In this mux, we have the data inputs, and we have the control that selects.</p><p>There&#8217;s a little storage device sitting next to every single one of these muxes saying, &#8220;This is where you&#8217;re going to source your input from.&#8221; Programming it consists of configuring every single one of these muxes.</p><p><strong>Dwarkesh Patel</strong></p><p>That makes sense. What is happening inside of the lookup table?</p><p><strong>Reiner Pope</strong></p><p>The lookup table is also going to have a little bit of control telling it what to do. Its purpose is to configurably take the role of an AND gate, OR gate, XOR, or any of those different things. There are many ways you could consider doing that. The way it&#8217;s done in traditional FPGAs&#8230; A lookup table has four bits of input and one bit of output. How many different functions are there from four bits to one bit? 
There are 65,536 (2 to the 16th) different functions, since four input bits have 16 possible combinations and each combination can map to either 0 or 1.</p><p>You can tabulate any one of these functions as 16 different numbers. You&#8217;ve got a table of 0111001, 16 entries. This table is stored in this blue configuration bit. It views these four bits as binary, looks up the relevant row of the table, and emits that bit. This is essentially a truth-table view of lookup tables.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, so if you think about an AND gate, OR gate, <a href="https://en.wikipedia.org/wiki/NOR_gate">NOR gate</a>, XOR gate, these all take as input&#8230;</p><p><strong>Reiner Pope</strong></p><p>Those are two-input functions. Sometimes we have a three-input function, like a three-way XOR, or a four-way XOR.</p><p><strong>Dwarkesh Patel</strong></p><p>In this case, does it just depend on how big it is?</p><p><strong>Reiner Pope</strong></p><p>The typical size for LUTs is four inputs. It&#8217;s sort of just a sweet spot. There&#8217;s another compute vs. communication trade-off here. If it has too few inputs, you need to use more LUTs.</p><p><strong>Dwarkesh Patel</strong></p><p>Basically the lookup table is a <a href="https://en.wikipedia.org/wiki/Truth_table">truth table</a>. With a truth table, you can program in any gate you want. So instead of a lookup table, you can just think of it as a programmable gate.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. One of the things you can see here is where the rule of thumb that an FPGA is an order of magnitude more expensive than an ASIC comes from. You count how many gates would be inside this lookup table.</p><p>We can view this lookup table essentially as one of these muxes. It has to select between 16 different values, so it&#8217;s a mux with <em>n</em>=16 options and <em>p</em>=1 bits. As we saw earlier, this circuit costs <em>n</em> times <em>p</em> many gates. 
So it costs <em>np</em>, which is 16, AND gates, and also 16 ORs.</p><p><strong>Dwarkesh Patel</strong></p><p>This circuit being the mux?</p><p><strong>Reiner Pope</strong></p><p>Exactly, the mux.</p><p><strong>Dwarkesh Patel</strong></p><p>The mux that goes into the lookup table?</p><p><strong>Reiner Pope</strong></p><p>The lookup table itself you can think of as being a big mux that selects from all 16 rows down to one output. That&#8217;s the lookup table.</p><p><strong>Dwarkesh Patel</strong></p><p>But the way you&#8217;ve drawn it here, there&#8217;s a mux and then a lookup table.</p><p><strong>Reiner Pope</strong></p><p>It&#8217;s muxes all the way down. There is a second mux that is inside here. This mux is this mux.</p><p><strong>Dwarkesh Patel</strong></p><p>And the other mux is just saying&#8230;</p><p><strong>Reiner Pope</strong></p><p>where it came from in this mess of gates.</p><p><strong>Dwarkesh Patel</strong></p><p>Right, and the second mux is, &#8220;Okay, now you have one value, but that value is still a four-bit value.&#8221;</p><p><strong>Reiner Pope</strong></p><p>Yeah, I&#8217;ve selected four bits from the soup. Then I use those four bits to select which entry in the lookup table I&#8217;m going to use.</p><p><strong>Dwarkesh Patel</strong></p><p>Suppose in the first mux you&#8217;re pulling from eight nearby registers as input. That&#8217;s a total of 32 bits going in. Out of that, four bits come out. Those four bits go into the second mux, which is inside the lookup table.</p><p><strong>Reiner Pope</strong></p><p>In this case, these registers are single-bit registers. If there are eight nearby registers and lookup tables, then I have eight bits total coming in nearby. I select from eight down to four different values. There are actually four different muxes, a little mux associated with each of these input bits. 
Each of them is selecting one out of eight.</p><p><strong>Dwarkesh Patel</strong></p><p>Where are those eight coming from?</p><p><strong>Reiner Pope</strong></p><p>Nearby registers and other LUTs.</p><p><strong>Dwarkesh Patel</strong></p><p>Each register is one bit.</p><p><strong>Reiner Pope</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess AMD or whoever makes these FPGAs still has to be opinionated about which registers are connected to which registers. You can program in the actual gates, but they add a wire in the connect&#8230; the communication topology, right?</p><p><strong>Reiner Pope</strong></p><p>You get flexibility at a local grain. There&#8217;s a nearby neighborhood you can select from, but for more coarse, long-distance connections, they form an opinion on that.</p><p><strong>Dwarkesh Patel</strong></p><p>And the reason it&#8217;s 10x slower is why?</p><p><strong>Reiner Pope</strong></p><p>If you look at the cost of building this lookup table, it&#8217;s 32 gates. It can give me the equivalent of&#8212;what&#8217;s an interesting thing I can do here&#8212;a four-way AND gate. A four-way AND means AND, AND, and then an AND of an AND. This is a circuit I could implement in an ASIC directly using three AND gates. Using a LUT, I can also implement it, but it&#8217;s going to take 32 gates instead of three.</p><p><strong>Dwarkesh Patel</strong></p><p>So the overhead is really coming from the fact that there&#8217;s a more concise way to describe a truth table than listing out every single possible combination of inputs, which is just to write out the gate.</p><p><strong>Reiner Pope</strong></p><p>Yes, to place down the polysilicon and the wires and so on.</p><h3>01:03:25 &#8211; Cache vs scratchpad</h3><p><strong>Dwarkesh Patel</strong></p><p>Interesting. One important point you made to me is that the reason they prefer FPGAs to CPUs is that they get deterministic clock cycles. They know when a packet will come in and go out. 
Why isn&#8217;t that a guarantee in CPUs?</p><p><strong>Reiner Pope</strong></p><p>You can actually design a CPU that has deterministic latency as well. In fact, the processors inside a lot of AI chips also have deterministic latency. <a href="https://en.wikipedia.org/wiki/Groq">Groq</a> has advertised this. TPUs have that in the core as well.</p><p>The challenge is getting deterministic latency and high speed at the same time. Non-deterministic latency comes from specific design choices in a CPU. It&#8217;s actually possible to remove those design choices and make a CPU with deterministic latency, but those are not very attractive in the market, so people don&#8217;t make those CPUs anymore.</p><p>In some sense, deterministic latency is a simpler starting point, and some chip designers have added things in to make it non-deterministic. To take a concrete example, probably the most important source of non-determinism on a CPU is the <a href="https://en.wikipedia.org/wiki/CPU_cache">CPU cache</a> itself.</p><p>In a CPU, you have the <a href="https://en.wikipedia.org/wiki/Die_(integrated_circuit)">CPU die</a> itself, and then <a href="https://en.wikipedia.org/wiki/DDR_SDRAM">DDR memory</a> off on the side. You have a cache system inside that remembers recent accesses to DDR and stores them. When I&#8217;m running through my CPU instructions, every time I have an instruction that accesses memory, it first checks if the data was stored in the cache. If not, it fetches it from DDR.</p><p>This is a huge optimization. The cache is two orders of magnitude faster than the DDR. If you never used the cache, basically all programs would run a hundred times slower. The presence of a cache is absolutely necessary for a CPU to run at a reasonable speed.</p><p>But whether or not you get a cache hit depends on the ambient environment of the CPU: what other programs are running, what has run recently, and what the random number generator inside the cache system is doing. 
That is a big source of non-determinism in the runtime of a CPU.</p><p>That is the memory system for a CPU. The big thing you can do differently is, instead of having the hardware say, &#8220;I&#8217;m going to read memory&#8221; and then the hardware decides whether or not it comes from the cache, you can bake this decision into software, a different design philosophy.</p><p>You see this in TPUs, for example. I&#8217;ll draw the same diagram, but I&#8217;ll call it a scratchpad. The main difference is&#8230; This would be a TPU, and you have <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">HBM</a> in this case rather than DDR, but it&#8217;s still an off-chip memory. Instead of the software saying &#8220;first access memory&#8221; and letting the hardware decide, you have one kind of instruction that goes to the scratchpad and a totally different kind of instruction that goes to HBM.</p><p>This style is generically known as <a href="https://en.wikipedia.org/wiki/Scratchpad_memory">scratchpad</a> instead of cache. The key distinction is that you have one kind of instruction that says &#8220;read or write scratchpad,&#8221; and a totally different instruction that says &#8220;read or write HBM.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>So scratchpad being the cache.</p><p><strong>Reiner Pope</strong></p><p>Yeah, this thing here is the scratchpad.</p><h3>01:07:27 &#8211; Why CPU cores are much bigger than GPU cores</h3><p><strong>Dwarkesh Patel</strong></p><p>Stepping way back: people say computers have the &#8220;<a href="https://en.wikipedia.org/wiki/Von_Neumann_architecture">von Neumann architecture</a>&#8221;, where there&#8217;s this serial processing of information. Maybe it&#8217;s just because we&#8217;ve been talking about parallel accelerators, but the FPGA is super parallel. The AI accelerators, the TPUs, are super parallel. Even CPUs are super parallel if you think about all the cores they have. 
In what sense is modern hardware actually the von Neumann architecture? Is that actually a fair way to describe modern hardware?</p><p><strong>Reiner Pope</strong></p><p>I think it&#8217;s a fair way to describe CPUs. The amount of parallelism you get on a CPU is about 100 cores times maybe 16-way vector units, so about 1,000-way parallelism on a CPU.</p><p><strong>Dwarkesh Patel</strong></p><p>One question: there is a die being used for the CPU, and if there are fewer <a href="https://en.wikipedia.org/wiki/Thread_(computing)">threads</a>, just as a matter of transistor voltages switching on and off, is it that there&#8217;s literally one control flow&#8212;a small part of the die&#8212;where voltages are switching on and off?</p><p>How do you actually occupy the die area of a CPU&#8230;</p><p><strong>Reiner Pope</strong></p><p>If there are so few cores, what are you spending all of the die on?</p><p><strong>Dwarkesh Patel</strong></p><p>Yeah, what is happening there?</p><p><strong>Reiner Pope</strong></p><p>The cores are just much bigger and more complicated. We should compare a CPU core, which takes up one one-hundredth of the die, to a LUT. A LUT is only 16 gates. It&#8217;s clear why there are so many more LUTs in an FPGA than cores in a CPU.</p><p>But why are there more CUDA cores, for example, than CPU cores? What&#8217;s the difference between a CPU and a GPU? Inside the CPU, one big use of the area is the cache. Very little is actually the ALUs. Mostly it&#8217;s these register files rather than the logic units. Both of those have equivalents in a GPU, so that&#8217;s not a big difference.</p><p>But the thing that does not have an equivalent in a GPU is the <a href="https://en.wikipedia.org/wiki/Branch_predictor">branch predictor</a>. There is a whole big area in the CPU which is just a bunch of predictors saying when the next <a href="https://en.wikipedia.org/wiki/Branch_(computer_science)">branch</a> will be and where the branch target is. 
Stripping a lot of that out, as well as making these register files tighter, drives a lot of the GPU gains over the CPU.</p><p><strong>Dwarkesh Patel</strong></p><p>What is the purpose of the branch predictor? To execute both branches at once, or what does it do?</p><p><strong>Reiner Pope</strong></p><p>The issue is that when I&#8217;ve got a series of instructions, if I have a branch, the actual step of processing an instruction takes a really long time. It takes maybe five nanoseconds.</p><p>The time to notice that I&#8217;ve got a branch, evaluate whether the <a href="https://en.wikipedia.org/wiki/Boolean_circuit">Boolean</a> is true, update the program counter to the new target, and then read from the instruction memory could take five nanoseconds to finish. So in reality, this may finish somewhere down here. I want to run a clock speed that is much faster than what five nanoseconds allows. Five nanoseconds is a 200 MHz clock speed. I would like to run at one or two gigahertz.</p><p>So I need to run other instructions while the branch is being evaluated. I just want to keep running the instructions that happen after me. But that might have been wrong. If the branch ended up being taken, then I need to know that instead of evaluating these instructions, I actually need to jump to wherever the target is and run those instructions instead. The purpose of the branch predictor is to predict, five cycles earlier, that a branch is going to happen, before you even get to that instruction.</p><h3>01:12:00 &#8211; Brains vs chips</h3><p><strong>Dwarkesh Patel</strong></p><p>If I think about how the brain works versus what you&#8217;re describing here, at a high level the differences might be that while you can do structured sparsity in these accelerators and save yourself some area that you would have otherwise had to dedicate to gates, in the brain there&#8217;s unstructured sparsity. 
Any neuron can connect to any other neuron, and not in ways where they have the column aligned.</p><p>Then there&#8217;s the fact that memory and compute are co-located. Although I guess you could say in a way the memory and compute are co-located on these dies too.</p><p><strong>Reiner Pope</strong></p><p>This is exactly the co-location, in some sense, of the memory and compute.</p><p><strong>Dwarkesh Patel</strong></p><p>So maybe that isn&#8217;t a big difference. Another big difference is that the clock cycle on the brain is much slower than on computers. Partly that&#8217;s to preserve energy, because the faster the clock cycle, the bigger the voltage needs to be in order for the signal to settle and to identify what state a transistor is in.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>I don&#8217;t know if you have any commentary on what the brain might be doing versus how these chips work.</p><p><strong>Reiner Pope</strong></p><p>Let&#8217;s take the clock speed one first. The clock speed is quite high on a chip because that drives higher throughput. When we compare a GPU running some workload, it&#8217;s running batch size 1,000. Whereas the brain is not running batch size 1,000, there&#8217;s only one of me.</p><p>You could imagine saying, &#8220;Take a GPU and instead of running at a gigahertz, run it at a megahertz,&#8221; and that would start to look a little more like the equivalent things you&#8217;re talking about in the brain. But in the way silicon works, that does not give you a 1,000x advantage in energy efficiency.</p><p>What it ends up looking like is you just run this circuit once to stabilization, and then it will sit idle for a long period of time. It doesn&#8217;t consume a lot of energy while it&#8217;s sitting idle because most of the energy is consumed in toggling bits from zero to one and back.</p><p>Let&#8217;s talk about the energy consumption of a circuit like this. 
The way to think of a bit being stored is that you&#8217;ve deposited some charge in a capacitor sitting somewhere in the chip implicitly. It becomes charged when the bit becomes a one, and then it becomes discharged when it next goes to a zero.</p><p>That cycle of charging the capacitor and then dumping that charge out to ground is where the energy is consumed. This is called the dynamic or switching power, and it&#8217;s most of the energy consumption of a chip. There is some other energy consumption just coming from the fact that insulators aren&#8217;t perfect, but we&#8217;ll discard that. Most of the energy consumption comes from toggling from zero to one and back to zero.</p><p>If you run a chip much slower and you only clock it once every thousand clock cycles, you will have 1,000 times fewer transitions. It will be about 1,000 times less energy consumption. But it&#8217;s not a substantial advantage in energy efficiency.</p><h3>01:15:33 &#8211; A GPU is just a bunch of tiny TPUs</h3><p><strong>Dwarkesh Patel</strong></p><p>Okay, so you described how a TPU works at a high level. What is the difference at a high level between how a GPU and a TPU work?</p><p><strong>Reiner Pope</strong></p><p>There is a high-level organization principle that is different, and then inside the cores things are different. Looking at the high level, we&#8217;ll take a GPU and a TPU and see what the top-level block structure looks like.</p><p>If you think of this as the whole chip in each case, the organization of the GPU is mostly a bunch of almost-identical units, which are the <a href="https://modal.com/gpu-glossary/device-hardware/streaming-multiprocessor">SMs</a>. They&#8217;ve got an <a href="https://en.wikipedia.org/wiki/CPU_cache#MULTILEVEL">L2 memory</a> in the middle, and then a bunch more of these SMs on the bottom. So there is a fairly regular grid of cores.</p><p>If we look at a TPU in comparison, you end up with much coarser-grained units of logic. 
You end up with just a few matrix units, which are the big systolic arrays. In the middle you&#8217;ve got some vector unit, and then you&#8217;ve got your matrix units at the bottom. These matrix units with a vector unit in the middle make up the whole TPU chip.</p><p>You can think of scaling this thing down into a really tiny unit with a smaller matrix unit and a smaller vector unit, and that is sort of what an SM is. From a very high-level point of view, the GPU has a lot of tiny TPUs tiled across the whole chip.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, interesting. You&#8217;re suggesting the tensor core within a streaming SM is analogous to an <a href="https://docs.cloud.google.com/tpu/docs/system-architecture-tpu-vm">MXU</a>?</p><p><strong>Reiner Pope</strong></p><p>Yeah, it&#8217;s all very similar.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. If you had more lack of structure, having a bunch of tiny TPUs makes a lot of sense. Whereas if you just have huge matrix multiplications, you might want to avoid the cost of having individual SMs with their own registers and warp schedulers. Why not just make a huge thing and amortize those costs across the whole thing?</p><p><strong>Reiner Pope</strong></p><p>This shows up in how large you can grow things. We&#8217;ve seen this theme, especially with the systolic array, where a larger systolic array amortizes the register file costs better.</p><p>This design allows you to have larger systolic arrays, whereas the GPU design constrains you to having small units of everything. There is a trade-off, however. Because of this coarse-grained separation of things, you need to move a lot of data from the vector unit to the matrix units, through just two lines of perimeter here.</p><p>If you look at the equivalent thing in a GPU, you&#8217;ve got vector units everywhere, and you can move data through many different lines. 
The amount of data you can move between a vector unit and a matrix unit is actually much higher in a GPU than in a TPU. Instead of having to move all the data through just two lines, you&#8217;re moving it through 16 lines of wiring in a GPU.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. But also you might have to move across less area.</p><p><strong>Reiner Pope</strong></p><p>Which is an energy saving as well. So if you can operate entirely within an SM, the data movement is much smaller. But the moment you want to operate across SMs, it becomes more complicated and expensive.</p><p><strong>Dwarkesh Patel</strong></p><p>So you don&#8217;t have to comment, but one might expect that a thing MatX might try to do is get the GPU-like smaller structure of systolic arrays surrounded by <a href="https://en.wikipedia.org/wiki/Static_random-access_memory">SRAM</a>, but at the same time make it so that the things you need in an SM to support the CUDA architecture&#8212;which take a bunch of space&#8212;you might discard.</p><p><strong>Reiner Pope</strong></p><p>We&#8217;ve talked publicly about something we call a <a href="https://matx.com/research/series_b">splittable systolic array</a>, which in some sense you can think of as big systolic arrays that can be small systolic arrays too.</p><p><strong>Dwarkesh Patel</strong></p><p>Cool. Okay, I think that&#8217;s a good note to close on. 
Reiner, thank you so much.</p><p><strong>Reiner Pope</strong></p><p>Thanks, Dwarkesh.</p>]]></content:encoded></item><item><title><![CDATA[The mistake of conflating intelligence and power]]></title><description><![CDATA[If your definition of intelligence is the ability to achieve your goals across a wide variety of domains, then Stalin was the most intelligent person who ever lived.]]></description><link>https://www.dwarkesh.com/p/the-mistake-of-conflating-intelligence</link><guid isPermaLink="false">https://www.dwarkesh.com/p/the-mistake-of-conflating-intelligence</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Sat, 16 May 2026 19:04:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/bd67ed7a-9e42-4d03-8978-11c7b322d5ad_300x168.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I had an interesting discussion recently. Someone asked me, what is intelligence? I said, the ability to achieve your goals across a wide range of domains. Okay, he says, then by that definition isn&#8217;t Donald Trump the most intelligent person in the world, followed in quick succession by Xi Jinping and Vladimir Putin?</p><p>To be clear, these people are obviously very competent and clever. But when you think of ASI, you don&#8217;t think of Trump, but more so. </p><p>The person who kept pressing this question was correctly pointing out that I basically defined intelligence as power. And by this definition, Stalin was the most intelligent person who ever lived.</p><p>Now, of course, you could change the definition of intelligence to something more like the ability to manipulate abstract concepts and rotate shapes. </p><p>But notice that the most powerful people in the world do not max out this quantity. The correlation between extreme power and this kind of intelligence might be even weaker than the correlation between extreme power and height. 
The physicists are not running the world.</p><p>We tend to conflate power-seeking AI and superintelligent (in science and tech) AI. I&#8217;m not denying that AI can be power-seeking. Whatever skills and drives Donald Trump has could be embodied in a digital mind. I&#8217;m simply pointing out that the way AI systems are currently becoming smarter (by getting trained to be really good at specific economically valuable tasks like coding) is not that strongly correlated with power.</p><p>We often talk about power in a way that misunderstands how it is actually derived in our world. Our intuitions are primed by games like Diplomacy or Go, which are designed to isolate and reward a g-loaded kind of strategic reasoning. </p><p>But in the real world, power is more the product of having the authority and trust to get lots of people to collaborate with you, rather than some galaxy-brain scheming capability. Trump is not powerful because his brain, considered in isolation, is the most effective optimization engine on Earth. He is powerful because the government which hundreds of millions of people consider legitimate gives him a lot of authority.</p><p>A group versus individual level analysis is useful here. As <a href="https://www.amazon.com/Hive-Mind-Your-Nations-Matters/dp/0804785961">Garett Jones has written</a> about at length, individual IQ is only modestly correlated with individual income, but national IQ is strongly correlated with national outcomes. This is because intelligence has a lot of spillover effects - smarter societies cooperate more, save more, and can coordinate to build things like space shuttles and semiconductors. </p><p>Richard Trevithick, who invented the high-pressure steam engine, died in poverty and was buried in an unmarked pauper&#8217;s grave. 
But the fact that 18th and 19th century Britain had lots and lots of people like Trevithick contributed to Britain being able to set up a global empire and outcompete lots of backwards principalities around the world.</p><p>It seems to me that the right mental model is that automated firms will outcompete everyone else in normal capitalist ways, rather than a single AI outthinking everyone else.</p>]]></content:encoded></item><item><title><![CDATA[Notes on pretraining parallelisms and failed training runs. ]]></title><description><![CDATA[Deeply researched interviews]]></description><link>https://www.dwarkesh.com/p/notes-on-pretraining-parallelisms</link><guid isPermaLink="false">https://www.dwarkesh.com/p/notes-on-pretraining-parallelisms</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Sat, 16 May 2026 19:01:16 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/c8ed868d-6197-4209-9d77-1b7513f1c1ae_1264x694.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Wrote up some flashcards <a href="https://flashcards.dwarkesh.com/pretraining/">here</a> to help myself retain all the stuff below</em>.</p><h3>On why pretraining runs fail</h3><p>Had an interesting chat with someone on why pretraining runs often fail. It was very interesting to get a sense of all the tangible ways that things can get fucked, and why training is such a precarious operation. At a high level, breaking causality and adding bias seem to be the key culprits.</p><p>Breaking causality:</p><ul><li><p>When you do expert routing, you first go through the router, which gives you a score of how much each token wants each expert. There are two ways to proceed from here. One is token routing, where you read the scores from the token&#8217;s perspective, and allocate each token to its top k experts. Problem is that you could end up with wildly unbalanced allocation across experts, which is terrible for performance. 
Alternatively, you could (and only in training) do expert choice, where each expert picks the tokens that it relatively prefers most. This way you can enforce that each expert gets roughly the same number of tokens. But the big problem is that this breaks causality, because which expert token n gets allocated to may depend on which expert token n + k is routed to. And breaking causality is very bad, because you&#8217;re getting information in training (and updating based on it) that you wouldn&#8217;t see in deployment.</p><ul><li><p>Rumor is that this explains why Llama 4 was underwhelming.</p></li><li><p>I guess you could do expert choice during prefill inference? But maybe it doesn&#8217;t work well in practice to allocate tokens to experts which would not have received that token in actual training.</p></li><li><p>Tbh I don&#8217;t fully understand why breaking causality is so bad. I understand you can&#8217;t see beyond causality in real inference. But why is this minor deviation such a big issue?</p></li></ul></li><li><p>Another thing that can break causality is token dropping, where an expert simply ignores tokens in the batch that it is supposed to process but that match it only weakly, so that it stays within its padded capacity. This breaks causality because a later token being more strongly matched to this expert might lead to an earlier token getting ignored.</p><ul><li><p>Apparently this was an issue with Gemini 2 Pro.</p></li></ul></li></ul><p>Adding bias:</p><ul><li><p>Bias is much worse than variance - variance can average out, but bias compounds</p></li><li><p>Apparently the original GPT-4 training was slow and got initially fucked because of the following bug: they were using FP16 on their collectives like all-reduce. FP16 spaces its representable values logarithmically - between 1 and 2, the mantissa bits carve the interval into steps ~0.001 apart. 
But from 2048 up, adjacent representable values are 2 or more apart, so adding 1 can get rounded away entirely. Suppose some collective involves adding 1 + 1 &#8230; 10,000 times - you could get in a situation where as soon as you get to 2048, you add 1, the exact result 2049 isn&#8217;t representable, it rounds back down to 2048, and you add one again. So the sum gets stuck at 2048, and the calculated value is ~5x off the real value. Huge issue if you&#8217;re trying to sum many small gradients into a large accumulator. And imagine how hard the bug must have been to find!</p></li></ul><p>Implications for AI training:</p><ul><li><p>Some of the people who think we can cure aging argue that there&#8217;s basically 5 different ways people die of old age (heart disease, cancer, etc), and that if we cure these 5 different diseases, then we&#8217;d basically have solved aging. You could ask a similar question about these failed pretraining runs - are there 5 different ways training runs fail, in which case once a lab figures out numerics and , you&#8217;ll just have smooth sailing, or will you keep seeing new bespoke issues emerge at each new level of scale? The person I talked to seemed to think the latter - he pointed out that even within numerics, there are so many ways you can fuck things up. And new ones will keep emerging at scale.</p></li><li><p>He was bearish on AI fully automating kernel writing anytime soon. Presumably this is because he thinks it&#8217;s more of an AGI-complete problem than some give it credit for. 
There&#8217;s another school of thought that says, &#8220;Hey, which kernel gets attention or MLP to run fastest on this scaleup is a super verifiable domain, thus we can RL to superhuman performance easily.&#8221; But he says it took Nvidia, which has the best kernel engineers in the world, a long time to optimize for Blackwell, which suggests that it&#8217;s actually quite hard, and might not be super easy to close the loop on.</p></li><li><p>Sometimes people say inference for RL generation and inference for end user generation are basically the same. But this person pointed out that in RL inference, numerical drift between the inference and training engines can cause subtle off-policy biases, which matter a ton for highest quality training but are not an issue if you&#8217;re just serving users.</p></li><li><p>He emphasized how important it is to have a disciplined process for amalgamating compute multipliers, because of the risks of stacking up bugs with subtle biases.</p></li></ul><h3>Pretraining parallelisms</h3><p>Notes from an excellent lecture that <a href="https://horace.io/">Horace He</a> gave my friends and me. </p><p>What made this lecture so good is that Horace built up the whole topic as a chain of problems and solutions: here&#8217;s what we want to do, here&#8217;s why it breaks, here&#8217;s how we fix it, here&#8217;s why that fix eventually breaks too. Most explanations just list out a hodgepodge of different strategies, without ever connecting them to the problems they solve or explaining why you&#8217;d pick one over another.</p><ul><li><p>Equation for pretraining FLOPs = 6ND, where N is the number of parameters and D is the number of training tokens. 2 FLOPs per parameter per token for the forward pass (multiply + add). Backward pass is 2&#215; forward because you compute gradients w.r.t. both inputs of each matmul (the weights and the activations). So 2 + 4 = 6.</p></li><li><p>Okay we can&#8217;t do all this on one GPU. So how do we split up this problem? 
The obvious solution is to do data parallel - where you copy the model weights across each GPU, and you just do a part of the batch on each GPU.</p><ul><li><p>The obvious problem is that each GPU only has a limited amount of HBM - B300 is 288GB - and this is not enough to store the weights as models get bigger and bigger, much less their activations.</p></li></ul></li><li><p>Okay so next thing we try is fully sharded data parallel (FSDP) - each GPU only stores 1/N of the parameters of each layer (N here being the number of GPUs). Before processing each layer, you all-gather the full layer&#8217;s parameters from all GPUs. After processing, each GPU discards the gathered parameters.</p><ul><li><p>It was emphasized that this is the go-to default. And you only move on from this when having too many GPUs forces you to move on, for reasons explained later. The reason this is the default is that it&#8217;s trivial to overlap compute and communication time - that&#8217;s because the only thing being communicated is the weights, which are not dependent on what happened in the layer before, so you can start all-gathering the next layer while you&#8217;re still computing this layer. Compare this against tensor or expert parallelism, which do need to share activations for one layer before you can process the next one. The problem with pipeline parallelism is bubbles as explained below.</p></li><li><p>From a comms volume perspective, FSDP looks insanely expensive at first - you all-gather every layer&#8217;s full weights across all GPUs, use them for one matmul, then throw them away. But this ignores what regular data parallelism already costs you - in regular DP, you still need to do an all-reduce after every layer of the backward pass in order to sync the batch&#8217;s gradients across all the GPUs. That all-reduce has comms volume of params &#215; 2. FSDP adds all-gathers - one per layer in the forward pass, one per layer in the backward pass. 
But an all-gather is half the comms volume of an all-reduce. So naive FSDP comms volume ends up being params &#215; 4 (all-gather on forward and backward, plus all-reduce on backward). You can do even better: since each gradient shard only needs to end up on the one GPU that owns it, replace the all-reduce with a reduce-scatter (which skips the final broadcast step). That gets you to params &#215; 3 total - a 50% overhead over vanilla DP.</p></li></ul></li><li><p>So why can&#8217;t you always just do FSDP?</p><ul><li><p>Comms crossover: You want your compute time to be greater than your comms time - you don&#8217;t want to be bottlenecked on comms. But since compute time for FSDP decreases as you increase the number of GPUs, and comms time does not, as you scale the number of GPUs on FSDP, your MFU can totally crater. When this happens, you need to add pipeline parallelism too.</p><ul><li><p>Compute time = (6 * # tokens * active params) / (compute per GPU * number of GPUs)</p><ul><li><p>This decreases as you increase number of GPUs</p></li></ul></li><li><p>Comms time = (# total params * 3) / (NVLink domain size * InfiniBand BW)</p><ul><li><p>Comms time does not increase as you add more domains. This was really confusing to me. Each domain collectively holds all the parameters, and you need to sync gradients across domains after each layer of the backward pass. You&#8217;d think that adding more domains means more hops in the ring, so the all-reduce gets slower. But the standard ring algorithm splits the message into one chunk per participant. More domains means more hops, but proportionally smaller chunks per hop. (This breaks down when chunks get so small that per-hop latency dominates, at which point you switch to tree algorithms.)</p><ul><li><p>Technically, you can do better than a naive single all-reduce for the gradients between all the domains. You do a hierarchical collective to optimize comms time across multiple NVLink domains. 
Key thing to remember is that each GPU in the domain gets its own bandwidth access to infiniband. So you wanna use it all up since interconnect bandwidth is the bottleneck. You do this by trying to do as much as possible within a scaleup before you move out. So you do reduce scatter within a scale up to give each GPU the domain-level reduced gradients for a shard of the layer, then all reduce these shards across corresponding GPUs across domains, then all gather within a domain. This shifts the comms time line down, thus moving the crossover point to the right.</p></li><li><p>Made an animation to illustrate it using Cursor and Composer 2:</p></li></ul></li></ul></li></ul></li></ul><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;737151e9-d4af-437f-97f0-2746037027ff&quot;,&quot;duration&quot;:null}"></div><ul><li><p>If you look at the equations, you can see that if you increase batch size, crossover point moves to right, and if you make the model more sparse, moves to the left.</p></li><li><p>Also why TPUs are better at FSDP - because more accelerators within a domain.</p></li></ul></li></ul><ul><li><p>Batch size floor: FSDP is data-parallel, so each GPU processes at least one sequence. Attention is computed within a sequence and can&#8217;t (easily) be split across GPUs. 
If your critical batch size is 10M tokens and sequence length is 10K, you only have 1K sequences &#8212; so you can&#8217;t scale beyond 1K GPUs with pure FSDP, even if you have plenty of comms bandwidth left.</p></li><li><p>Problems with pipeline parallelism (the next addition you&#8217;d make to FSDP in order to deal with these issues):</p><ul><li><p>The problem with pipeline parallelism is different - there you have bubbles that emerge from the fact that at the beginning of the batch, the GPUs dedicated to the final layers are not being used, and conversely at the end of the batch, the GPUs dedicated to the first layers are not being used. The reason you can&#8217;t overlap batches in training to solve pipeline bubbles is that you need to consolidate gradients and update the model before you process the next batch.</p></li><li><p>But also you&#8217;re adding architecture constraints - things like Kimi&#8217;s attention-to-residuals (where each block attends to all previous layers&#8217; residuals) become very difficult when those residuals live on different pipeline stages. Similarly, interleaving sliding-window and global attention layers could cause load imbalance across stages. 
Dealing with all this slows down research iteration, which is the greatest sin you can commit.</p></li></ul></li></ul>]]></content:encoded></item><item><title><![CDATA[RLVR might be disproportionately bad at science]]></title><description><![CDATA[The verification loop for theories can be on the order of decades and centuries, and even then what we know today as the better theory can often actually make worse predictions]]></description><link>https://www.dwarkesh.com/p/rlvr-might-be-disproportionately</link><guid isPermaLink="false">https://www.dwarkesh.com/p/rlvr-might-be-disproportionately</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Sat, 16 May 2026 19:00:59 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a224a528-8ab3-4269-ab6f-e38c6fbeab1f_960x822.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>I&#8217;m writing up some threads that we explored in <a href="https://www.dwarkesh.com/p/michael-nielsen">my interview with Michael Nielsen</a>. That episode was one of my favorites.</em></p><p>The organizing question from my interview with Michael Nielsen was, &#8220;How do we recognize scientific progress?&#8221; It&#8217;s especially relevant to thinking about what it would take for AI to close the RL verification loop on scientific discovery. But it&#8217;s also a surprisingly mysterious and elusive question when thinking about the history of human science.</p><p>Some people have this idea that AI is going to be disproportionately good at making scientific breakthroughs. The reason they think this is that 1. Science is &#8216;verifiable&#8217;, 2. 
AI is absolutely crushing domains that have a tight verification loop - coding, math, etc. - because you can RL on these loops.</p><p>But the history of human science shows that the verification loop for theories can be on the order of decades and centuries, and even then experiments do not definitively rule out alternatives: Ancient Athenians dismissed Aristarchus (3rd century BC) on heliocentrism because it would imply stellar parallax. The first successful measurement of stellar parallax was in 1838, achieved by Friedrich Wilhelm Bessel.</p><p>What we know today as the better theory can often actually make worse predictions: it&#8217;s well known that Copernicus&#8217;s model of circular orbits around the sun was less accurate than Ptolemy&#8217;s geocentric model, which had accumulated millennia of correcting epicycles. What is not well known is that Copernicus&#8217;s theory wasn&#8217;t even simpler (Ptolemy&#8217;s model approximated the true elliptical nature of orbits using an equant trick where other planets are not moving in uniform circular motion around Earth exactly, but rather around an off-center point. Copernicus didn&#8217;t like this, because it violated his Platonic heuristics - so he discarded the equant trick, which led to a less parsimonious model, since Copernicus had to add more epicycles and epicyclets to make up for it.)</p><p>So in what sense was it a better theory in 1543? In some sense, it wasn&#8217;t! You couldn&#8217;t have known ex ante that heliocentrism married with Kepler&#8217;s 3 laws (1619) is a much cleaner and more accurate theory, or that there&#8217;s a very beautiful unification of heliocentric orbits and terrestrial gravity (Newton in 1687).</p><p>There was one ex ante reason that you should have preferred Copernicus in 1543: his theory produced retrograde motion as a natural consequence, whereas for Ptolemy it was an ad hoc addition. 
Even more impressively, his theory, published in 1543, actually predicted the phases of Venus before they were observed by Galileo in 1610. But both of these things were also implied by Brahe&#8217;s model, which had set the sun to orbit the earth and then all the planets to orbit the sun.</p><p>Under a naive falsificationist framework, you&#8217;d have to wait until stellar parallax was observed in 1838 to know that Brahe was wrong. But obviously the scientific community was able to make progress faster than this. There is some mixture of judgment and heuristics in the progress of science that we don&#8217;t even understand well enough to actually articulate, much less codify into an RL loop.</p><p>Or consider the case of the discovery of Neptune in 1846. Uranus deviated from its predicted Newtonian path. Le Verrier predicted that an unknown perturbing planet must exist, calculated its mass and orbit, and Neptune was found almost exactly where predicted.</p><p>But the Neptune story is symmetric to a failure case. Mercury had an anomalous precession, where the ellipse that traces its orbit rotates 43 arcseconds more per century than the pull of the other planets implies under Newtonian mechanics. This led astronomers to speculate that there&#8217;s an unknown planet Vulcan within Mercury&#8217;s orbit. But the anomaly was resolved in 1915 with Einstein&#8217;s general relativity.</p><p>A proper Newtonian would still proceed with the research agenda, but modify it as follows. First, you predict some unknown planet. If it can&#8217;t be found, you say it&#8217;s so small, it must require a bigger telescope, and you build a bigger telescope. And if you still can&#8217;t find it, maybe there&#8217;s a cloud of cosmic dust occluding it. If still not found, maybe the satellite&#8217;s instruments are being screwed by some unknown magnetic field, and you send a new satellite. 
At each of these steps, had you discovered a new planet, or some unknown cosmic dust, or some new magnetic field, that would have been a sensational victory for Newtonians.</p><p>Ex ante, this is not unreasonable to do! It is only after decades or maybe centuries of patchwork that we can ask: are we simply adding epicycles, or is this theoretical framework progressive, in that it makes predictions we wouldn&#8217;t otherwise be able to make?</p><p>What do these examples illustrate? That ex ante it is almost impossible to determine which research programs are progressive (will predict and explain unanticipated new phenomena) and which are regressive (need to be contorted repeatedly to accommodate seemingly disconfirming new phenomena).</p><p>But the verification loop is often extremely long and weirdly hostile, and even then, experiments do not definitively rule out alternatives (see the discussion in the Nielsen episode about how physicists contemporaneous with the 1880s Michelson-Morley experiments thought that they simply ruled out a particular theory of ether. Only Einstein made the full conceptual leap to discard the ether altogether).</p><p>This means that big conceptual breakthroughs <em>cannot</em> be easily verified. They are recognized decades or centuries later, when it turns out they were much more productive than the alternatives available. What this means for AI for science is that 1. You can&#8217;t easily train an RL loop for big conceptual breakthroughs.</p><p>And 2. the society of AI scientists will still need individual AI instances that have idiosyncratic biases and heuristics, and that pursue them unrelentingly for decades on end - for example, like the one Einstein had in insisting that there shouldn&#8217;t be some arbitrary inertial reference frame. There should be dedicated people to keep a bunch of dormant research agendas alive in case they turn out to be productive upon further investigation. 
To understand the kind of intransigent dedication to hypotheses that is needed to preserve a correct scientific idea - even in the face of disconfirming evidence - consider the following story: In 1815, Prout hypothesized that the atomic weights of all pure chemical elements are whole numbers, because experimentally, most elements seem to come out like this. But there are many anomalies - for example, chlorine&#8217;s atomic weight is measured at 35.5. And so Prout&#8217;s school claimed that maybe the chemical substances in which these elements appeared were impure. But there seemed to be no chemical reaction that could get rid of the impurities. And then they said, maybe it&#8217;s fractions of full atomic weights - but the closer you measure, the less natural the fractions seem to get - chlorine goes from 35.5 to 35.46. It takes almost a century for people to realize that these measurements are showing multiple isotopes of the same element, which can be separated physically, but have no chemically distinguishing characteristics.</p><p>What I&#8217;m trying to say is that ex ante, one couldn&#8217;t have known which research program would be more productive. We need to invest in all of them concurrently. But that investment looks like a bunch of different individual scientists being super unreasonable and obstinate about propping up their preferred research agenda.</p><h2>What does the parallel discovery of a deep idea like Darwinism tell us?</h2><p>The Origin of Species was published in 1859. Principia Mathematica was published in 1687, nearly two centuries earlier. Conceptually, it seems like natural selection is much simpler than the theory of gravity. A contemporary of Darwin&#8217;s, Thomas Huxley, read the Origin of Species and said, &#8220;How extremely stupid not to have thought of that!&#8221; Nobody ever said the same about not beating Newton to the Principia. 
I wonder if the reason this happened is that, while Darwin&#8217;s theory is conceptually simpler, it cannot be decisively tested. The evidence is circumstantial, retrospective, and cumulative. There&#8217;s no equivalent of Newton running the numbers on the moon&#8217;s orbital period and radius, and confirming that they correspond to his equations.</p><p>Also you need this concept of deep time. Charles Lyell published the Principles of Geology in 1830, which gave Darwin the vast stretches of time that natural selection needed. And the fact that Darwin and Wallace basically arrived at evolution at the same time (and both credited Lyell&#8217;s contribution) does suggest that these underrated intellectual footholds were quite important (geology, paleontology of ancient extinct species which showed intermediate species (in some cases between apes and humans), biogeography from voyages and the age of colonization, more sophisticated artificial selection like pigeon breeding). It&#8217;s interesting that an idea whose essence must have been obvious to herders and parents for thousands of years actually required many millennia of ancillary intuition pumps to fully spell out.</p><p>The pattern of parallel discovery in science and technology is very interesting, and seems to contradict this vibe that certain innovations could have happened much earlier than they really did.</p><div data-component-name="FragmentNodeToDOM"><p></p></div>]]></content:encoded></item><item><title><![CDATA[Eric Jang – Building AlphaGo from scratch]]></title><description><![CDATA[AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play.]]></description><link>https://www.dwarkesh.com/p/eric-jang</link><guid isPermaLink="false">https://www.dwarkesh.com/p/eric-jang</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 15 May 2026 16:04:58 GMT</pubDate><enclosure 
url="https://api.substack.com/feed/podcast/197852876/67d1346b253e5834d07b4510c3fc98b9.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><a href="https://evjang.com/">Eric Jang</a> walks through how to build AlphaGo from scratch, but with modern AI tools.</p><p>Sometimes you understand the future better by stepping backward. AlphaGo is still the cleanest worked example of the primitives of intelligence: search, learning from experience, and self-play. You have to go back to 2017 to get insight into how the more general AIs of the future might learn.</p><p>Once he explained how AlphaGo works, it gave us the context to have a discussion about how RL works in LLMs and how it could work better &#8211; naive policy gradient RL has to figure out which of the 100k+ tokens in your trajectory actually got you the right answer, while AlphaGo&#8217;s MCTS suggests a strictly better action every single move, giving you a training target that sidesteps the credit assignment problem. The way humans learn is surely closer to the second.</p><p>Eric also kickstarted an Autoresearch loop on his project. And it was very interesting to discuss which parts of AI research LLMs can already automate pretty well (implementing and running experiments, optimizing hyperparameters) and which they still struggle with (choosing the right question to investigate next, escaping research dead ends). 
Informative for all the recent discussion about when we should expect an intelligence explosion, and what it would look like from the inside.</p><p>Watch this one on <a href="https://youtu.be/X_ZVSPcZhtw">YouTube</a> so you can see the chalkboard.</p><p>And check out the <a href="https://flashcards.dwarkesh.com/eric-jang/">flashcards</a> I wrote to retain the insights.</p><div id="youtube2-X_ZVSPcZhtw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;X_ZVSPcZhtw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/X_ZVSPcZhtw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Sponsors</h2><ul><li><p><a href="https://cursor.com/dwarkesh">Cursor</a>&#8217;s agent SDK let me build a pipeline to generate flashcards for this episode. For each card, I had an agent read the transcript, ingest blackboard screenshots, generate an SVG visual, and run everything through a critic. A durable agent is much better at this kind of work than a chain of LLM calls, and Cursor&#8217;s SDK made it easy. Check out the cards at <a href="https://flashcards.dwarkesh.com">flashcards.dwarkesh.com</a> and get started with the SDK at <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> gave me a real deep-dive tour of one of their datacenters. I got to ask a bunch of questions to Ron Minsky, who co-leads Jane Street&#8217;s tech group, and Dan Pontecorvo, who runs Jane Street&#8217;s physical engineering team. They were willing to literally pull up the floorboards and take out racks to explain how everything works. 
Check out the full tour at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><h2>Timestamps</h2><p>(00:00:00) &#8211; Basics of Go</p><p>(00:08:17) &#8211; Monte Carlo Tree Search</p><p>(00:32:04) &#8211; What the neural network does</p><p>(01:00:33) &#8211; Self-play</p><p>(01:25:38) &#8211; Alternative RL approaches</p><p>(01:45:47) &#8211; Why doesn&#8217;t MCTS work for LLMs</p><p>(02:01:09) &#8211; Off-policy training</p><p>(02:12:02) &#8211; RL is even more information inefficient than you thought</p><p>(02:22:16) &#8211; Automated AI researchers</p><h2>Transcript</h2><h3>00:00:00 &#8211; Basics of Go</h3><p><strong>Dwarkesh Patel</strong></p><p>Today, I&#8217;m here with <a href="https://evjang.com/">Eric Jang</a>, who was most recently vice president of AI at <a href="https://en.wikipedia.org/wiki/1X_Technologies">1X Technologies</a>, and before that, senior research scientist at what is now <a href="https://deepmind.google/models/gemini-robotics/">Google DeepMind Robotics</a>. You&#8217;ve been on sabbatical for the last few months. One of the things you&#8217;ve been doing is rebuilding, improving, and hacking on <a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a>.</p><p>Today, you&#8217;re going to explain <a href="https://github.com/ericjang/autogo">building AlphaGo from scratch</a> and what it tells us about the future of AI research and development. Before we get to that, why is AlphaGo interesting? Why is this the project you decided to do on sabbatical rather than just hanging out at the beach?</p><p><strong>Eric Jang</strong></p><p>I like making things, and AlphaGo and <a href="https://en.wikipedia.org/wiki/Go_(game)">Go</a> AI is one of those things that really got me into the field. 
When I saw the early breakthroughs on AlphaGo in 2014, 2015, 2016 and so forth, it was profound to see how smart AI systems could become and the computational complexity class they could tackle with <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a>. This is a problem that has long been understood to be intractable for <a href="https://en.wikipedia.org/wiki/Search_algorithm">search</a>, and yet it was solved through deep learning. That was quite mysterious to me, and I&#8217;ve always wanted to understand that phenomenon a little better.</p><p>My training is in deep <a href="https://en.wikipedia.org/wiki/Neural_network_(machine_learning)">neural nets</a> for robotics, where the decisions made by the neural networks are a bit more intuitive. But AlphaGo is a problem where the decisions are the result of a very, very deep search. It&#8217;s always been very mysterious to me how a ten-layer network can amortize the simulation of something so deep in the <a href="https://en.wikipedia.org/wiki/Game_tree">game tree</a>.</p><p><strong>Eric Jang</strong></p><p>If you plot out how much compute it took to build various iterations of strong Go bots over the years, you can see that in 2020 there was an open-source project called <a href="https://en.wikipedia.org/wiki/KataGo">KataGo</a> by <a href="https://x.com/lightvector1">David Wu</a> from <a href="https://www.janestreet.com/">Jane Street</a>, which achieved a 40x reduction in the compute needed to train a really strong Go bot tabula rasa. 
I&#8217;m not certain if it&#8217;s stronger than <a href="https://en.wikipedia.org/wiki/AlphaGo_Zero">AlphaGo Zero</a>, <a href="https://en.wikipedia.org/wiki/AlphaZero">AlphaZero</a>, or <a href="https://en.wikipedia.org/wiki/MuZero">MuZero</a>, but it&#8217;s very strong, and this is what most Go practitioners today train against when they&#8217;re playing an AI.</p><p>Thanks to LLM coding, what took a whole team of research scientists at <a href="https://en.wikipedia.org/wiki/Google_DeepMind">DeepMind</a> and millions of dollars of research and compute can now be done for a few thousand dollars of rented compute.</p><p><strong>Dwarkesh Patel</strong></p><p>We should first discuss how Go works. How does the game work?</p><p><strong>Eric Jang</strong></p><p>Go is a very simple game that can be implemented quickly and easily on a computer. The objective is to put down black and white stones and try to occupy as much territory as possible. I might start by putting down a black stone. Black always goes first. Go ahead. The way you capture an opponent&#8217;s stones is that for every intersection, if you can surround all four of its neighbors with your stones, then it&#8217;s cut off from oxygen, if you will, and it&#8217;s a dead stone. Now I control these four stones as well as this empty intersection here.</p><p>There are slight variations between <a href="https://www.cs.cmu.edu/~wjh/go/rules/Chinese.html">Chinese</a>, <a href="https://www.cs.cmu.edu/~wjh/go/rules/Japanese.html">Japanese</a>, and what are called <a href="https://senseis.xmp.net/?TrompTaylorRules">Tromp-Taylor rules</a>. Tromp-Taylor rules are designed to be completely unambiguous, so this is what all Go AIs train and resolve against. In typical Go, when humans play, you&#8217;re actually not allowed to put this white stone down here. It would be instant suicide. In Tromp-Taylor, it&#8217;s actually fine. 
You put it down, and it immediately resolves to death, so the outcome is the same.</p><p>Let&#8217;s start over and play a few stones, and then I&#8217;ll explain some more. I&#8217;ll just start there.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m basically playing randomly here, but I&#8217;m trying to get around your stones and see if I can surround them.</p><p><strong>Eric Jang</strong></p><p>This move exposes one empty neighbor for your white stone. It&#8217;s akin to a check in chess. If you don&#8217;t respond immediately by putting one here, then I can immediately capture this.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. Because it&#8217;s the diagonals that determine whether you&#8217;re surrounded or not.</p><p><strong>Eric Jang</strong></p><p>The cross-section, not the diagonals. This one is surrounded on three sides, so you&#8217;re at threat of losing that stone if you don&#8217;t play one immediately there.</p><p>Now you can see that I&#8217;m starting to pressure you, because by putting a stone here, you&#8217;re forced to put one here.</p><p><strong>Dwarkesh Patel</strong></p><p>Otherwise, you would have this two-block to yourself.</p><p><strong>Eric Jang</strong></p><p>Yes. And if you think through what happens if you were to respond here, you can probably search into the future and deduce what I&#8217;ll do in response once you do that.</p><p><strong>Dwarkesh Patel</strong></p><p>You have a lot of confidence in my abilities, but I&#8217;m guessing you&#8217;d put the black here.</p><p><strong>Eric Jang</strong></p><p>That&#8217;s right, and then I would capture all three of these stones.</p><p><strong>Dwarkesh Patel</strong></p><p>So I should just assume that this little block is gone.</p><p><strong>Eric Jang</strong></p><p>Yes. In Go, it&#8217;s actually okay to let an opponent capture some stones if, for example, it lets you position to capture more stones somewhere else on the board. 
This is what makes Go a beautiful game: you can lose the battle but win the war. As the board size increases, the complexity of these micro versus macro dynamics gets more interesting.</p><p><strong>Dwarkesh Patel</strong></p><p>Presumably you&#8217;d put one here.</p><p><strong>Eric Jang</strong></p><p>So now I would capture this entire group, and this would be mine.</p><p>There&#8217;s one more case I want to demonstrate, which I actually had a bug in my code for recently. Let&#8217;s consider a formation like this, with other pieces on the board in play. Let&#8217;s talk about how the game ends. In this territory, who controls these areas? Is it white or black?</p><p><strong>Dwarkesh Patel</strong></p><p>White.</p><p><strong>Eric Jang</strong></p><p>It&#8217;s actually black, because I&#8217;ve surrounded this whole area. Assuming I have other black stones here, it&#8217;s very hard for you to break this out of the control of these stones.</p><p><strong>Dwarkesh Patel</strong></p><p>And when the final score is tallied, would these ones also count as being in...</p><p><strong>Eric Jang</strong></p><p>Great question. This is where different rule sets have different ways of scoring. We should talk about how you resolve scores between humans and how you resolve scores in computer Go, because there&#8217;s some ambiguity in how humans evaluate this.</p><p>Most humans would look at this board configuration and conclude that black has totally surrounded white, and white has no chance of life. We could play out more here, but at the end I would capture everything. However, if you have a way of breaking this formation and connecting white to something outside of it, then it can flip. This is where it&#8217;s a little bit hard for a computer to decide these kinds of things.</p><p>How do humans do it? It&#8217;s worth thinking about how humans resolve this, because this will map later to how we think about the deep neural network. 
Humans basically say, &#8220;I think the game is done,&#8221; and then you have to also say, &#8220;I think the game is done.&#8221; Then we&#8217;ll say, &#8220;I think these are my stones,&#8221; and you have to agree. If you don&#8217;t agree, we keep playing.</p><p>Essentially, once two humans&#8212;their so-called <a href="https://en.wikipedia.org/wiki/Reinforcement_learning#Value_function">value function</a>&#8212;agree on a consensus, then the Chinese rules resolve that. In Tromp-Taylor scoring, it&#8217;s perfectly unambiguous, so it can be decided algorithmically by a computer. If you have this at the endgame, the way you score it is that you first count how many stones you control, and that&#8217;s unambiguous.</p><p>Then you count how many empty intersections are not touched by your opponent&#8217;s stones. These intersections would not count for either player, because all of these intersections are connected to both white stones and black stones.</p><p>If this were like this, then white would get three points. This is a little odd because a human would know that white is actually losing these points. But Tromp-Taylor scoring would consider white to have all of these points as well as these points. So that is a very big difference in how computer Go scores things and how humans score things.</p><p><strong>Dwarkesh Patel</strong></p><p>How does the game end?</p><p><strong>Eric Jang</strong></p><p>The game ends when either a player chooses to resign or both players pass consecutively. 
Those are the rules.</p><p><strong>Dwarkesh Patel</strong></p><p>Now help me crack this with AI.</p><h3>00:08:17 &#8211; Monte Carlo Tree Search</h3><p><strong>Dwarkesh Patel</strong></p><p>Let&#8217;s understand how AlphaGo actually works and how somebody in the audience might be able to implement it.</p><p><strong>Eric Jang</strong></p><p>Let&#8217;s start with an intuition about the underlying search process used to make moves, and we&#8217;ll layer on ideas from deep learning to make it much more efficient and tractable.</p><p>Go is a game with just two players. We&#8217;re going to draw a person here, and we&#8217;re going to draw an AI here. Let&#8217;s say this person is playing black, so they go first. They go here. Now the AI is going to make a move based on what it sees here. There&#8217;s a question of how you encode these inputs into the AI. Maybe you could use ones and zeros, but you want to represent black, white, and empty. You would need at least three different values. Maybe you could use zeros, ones, and twos. The AI might see something like zero, zero, zero, zero, one. This is the input to the AI on its turn.</p><p>The AI can choose. Let&#8217;s just pick three possible random moves it can make, and I just drew these at random. Which move is best here? Well, we don&#8217;t know until the game ends. Go doesn&#8217;t have any kind of local reward for which move here is good. This is what makes Go a very difficult game. You don&#8217;t actually know who won until you really get to the end.</p><p>How deep is this tree? On a 19x19 Go board, there are roughly on the order of 361 moves on any given move, and of course, as it fills up, you have fewer moves. The number of steps in the game can be somewhere from 250 to 300 moves. Experts might decide to end the game well before that, but under Tromp-Taylor scoring, you actually have to play things all the way to the end. 
So this could be 300 moves, a depth of 300 in the tree.</p><p>If you keep on expanding possible moves&#8212;here the AI goes, then here the human would go, and so forth&#8212;you find that you end up with an enormous explosion in the possible game outcomes originating from just this one state. This is something on the order of 361<sup>300</sup>, which is far more than the number of atoms in the universe. Of course, there are redundancies and symmetries, so it&#8217;s not actually that, but if you were to do a naive tree with no merging of children, you end up with a tree about this big.</p><p><strong>Dwarkesh Patel</strong></p><p>What do you mean by &#8220;merging of children&#8221;?</p><p><strong>Eric Jang</strong></p><p>Let me use this board here. If we start here, and then you play here, and then I play here, and then you play here, that&#8217;s equivalent to me starting here, you playing here, me playing here, and then you playing here. Both arrived at the same spot, but through different paths. So this child node can be thought of as a shared descendant.</p><p><strong>Dwarkesh Patel</strong></p><p>Got it. And I guess it starts at 361 but it decreases by one each time.</p><p><strong>Eric Jang</strong></p><p>Yes, the branching factor decreases by one each time. In any case, this is a very large tree. This is also why computer scientists for many years thought that Go was not a tractable problem this century. The amount of compute you would need to exhaustively search every possibility is just too large.</p><p>If you could search it exhaustively, Go is actually a deterministic game. In any given state, you can compute the best possible strategy you can make in order to win the game. You can search all the possible futures where you win, and then just make sure you always stay in that set of futures.</p><p>AlphaGo&#8217;s core conceptual breakthrough was using neural nets to make this search problem tractable. 
Before we get into how neural networks are involved, let&#8217;s talk about how we can, assuming we had a powerful enough computer, search this tree to find the best move.</p><p>In the beginning, you&#8217;re not going to build out the whole tree, because storing it would be very expensive. Instead, you might interactively figure out which leaves of this tree are worthy of exploring and expanding into the future to see what else is there. There are some early algorithms in the <a href="https://en.wikipedia.org/wiki/Multi-armed_bandit">bandit literature</a> like <a href="https://www.jeremykun.com/2013/10/28/optimism-in-the-face-of-uncertainty-the-ucb1-algorithm/">UCB1</a>, which is not exactly appropriate for a sequential game like Go, but very much inspired the action selection algorithm used in AlphaGo. UCB1 looks like: on every move, we&#8217;re going to take the best action&#8212;the <a href="https://en.wikipedia.org/wiki/Arg_max">argmax</a> over <em>a</em> that maximizes <em>Q</em> of <em>a</em>, and I&#8217;ll explain what <em>Q</em> of <em>a</em> is in a moment&#8212;plus some sort of exploration bonus.</p><p>On every node, we&#8217;re going to track a few quantities. Let&#8217;s consider each of these a node. This is the root node where you&#8217;re making decisions from, and these are the children of the root node. Each node is basically a data structure that stores a visit count of this child node.</p><p><strong>Dwarkesh Patel</strong></p><p>How often the parent visited this node.</p><p><strong>Eric Jang</strong></p><p>Yes. One thing that&#8217;s easy to trip on if you come from robotics or other kinds of reinforcement learning is, where are the actions? I&#8217;m only talking about nodes. Nodes here represent states, and because this is a perfectly deterministic game with no randomness, you actually can just infer the action based on the child. 
If I go here, that implies an action, and this is the state we resolve to.</p><p>If you ask LLMs to vibe-code an <a href="https://en.wikipedia.org/wiki/Monte_Carlo_tree_search">MCTS</a> implementation, they will most likely design the right data structure here. But it&#8217;s sort of a chef&#8217;s choice. You can rewrite the tree structure however you like. This is what Claude 4.6 wrote for me when I asked it, and it was a very reasonable choice.</p><p><em>Q<sub>a</sub></em> represents the mean action value of this action. I&#8217;ll use a subscript <em>a</em> to denote that this corresponds to taking a specific action to get here from the root node. Taking <em>a</em> gets us to this node here. We&#8217;re also going to store the probability of taking this action.</p><p><strong>Dwarkesh Patel</strong></p><p>Again, from the parent?</p><p><strong>Eric Jang</strong></p><p>From the parent, yes. What are the odds that we sample this one? This will become relevant later. We&#8217;ve talked about a deterministic tree for now, so I&#8217;ll bring probabilities into this later.</p><p>Finally, we have a dictionary of children, which is just more of these nodes in a classic <a href="https://en.wikipedia.org/wiki/Linked_list">linked list</a>-style reference <a href="https://en.wikipedia.org/wiki/Search_tree">tree</a>. This is the basic data structure to implement a tree.</p><p>In AlphaGo, they use a slightly different action-selection criterion called <a href="https://arxiv.org/pdf/2012.11045">PUCT</a>, short for Predictor + Upper Confidence bounds applied to Trees. When you select which child to take, you do argmax <em>a</em> of <em>Q</em>(<em>s</em>,<em>a</em>) plus a constant times an exploration bonus. The equation forms are pretty similar. These are both scoring criteria. You want to argmax this quantity and you want to argmax this quantity to determine which action to take.</p><p>Let&#8217;s break down the intuition of how you select actions here. 
<em>Q</em>(<em>s</em>,<em>a</em>) is the mean action value, so how good a given child is on average. If you actually knew the whole tree, this is all you need to select the best action. You don&#8217;t really need to do more than that. But if you&#8217;re interactively building this tree as you&#8217;re figuring out what the <em>Q</em> values should be, then occasionally you have to try some other actions as an <a href="https://en.wikipedia.org/wiki/Exploration%E2%80%93exploitation_dilemma">explore-versus-exploit trade-off</a>.</p><p>In both UCB and PUCT, there&#8217;s this term that basically rewards taking actions you haven&#8217;t taken before. As we mentioned, each node stores the visit count of taking that specific action. Everything is initialized to zero. For a given action, let&#8217;s call it action <em>a</em>, initially it&#8217;s zero. As <em>n </em>is increasing, if we&#8217;ve already made 10 action selections from that root node but we haven&#8217;t picked <em>a</em> yet, then this term starts to become quite large for <em>a</em>. Conversely, if we&#8217;ve chosen <em>a</em> 10 times out of 10, then this term is quite small. It diminishes very quickly. The same thing is true here.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I&#8217;m understanding, let me put it in my own words. Let&#8217;s focus on UCB. Conceptually, you can think of it as two different things: the <em>Q</em> and the exploration term.</p><p>Let&#8217;s be clear about what <em>Q</em> is. <em>Q</em> is basically saying, once we do these rollouts&#8212;you&#8217;re actually running all these simulations&#8212;you go down the tree and figure out: if I end up at the terminal value of this tree, do I win this game or not? You average whether I win or not across all the leaves of this tree starting from this node. That average, you put in <em>Q</em>. So <em>Q</em> represents the probability that I&#8217;ll win this game starting at this node. That is your sort of exploit. 
It&#8217;s saying: I&#8217;ve run these simulations, and I think this is a good move or not.</p><p>The other term is saying: have I explored this branch enough yet relative to the other actions I could be exploring, or have already explored? If I haven&#8217;t explored this branch yet, maybe I think it has a low score, but I just haven&#8217;t explored that many leaves down this node in this tree. So I should try this even though <em>Q</em>, the exploit term, is telling me it&#8217;s not that valuable.</p><p>Because ln(<em>n</em>) grows slower than <em>n</em>, over time you will move from the argmax being dominated by the exploration term, which is the second term here, to the argmax being dominated by the <em>Q</em> term, which is when you&#8217;ve done enough simulations and are confident that this is the branch to go down.</p><p><strong>Eric Jang</strong></p><p>Yes, that&#8217;s right. The motivation for UCB was to come up with an algorithm where, if you don&#8217;t know the payoff of the different actions you can select, this strategy, given some exploration term here, bounds your regret in terms of how wrong you can possibly be.</p><p>I don&#8217;t know the proof. I also don&#8217;t know if this one is proved to have logarithmically or square-root-bounded regret, but I think the algorithm was derived to look something like this. You can tell these terms grow a little differently, and this is to account for the fact that Go has many more actions for any given move compared to your standard bandit problem.</p><p>One small clarification: you talked about simulations and probabilities. We should remember that Go fundamentally is a deterministic game. Where does the notion of probability come from here? If you had a very powerful computer, there are no probabilities. You can just compute the true average of the mean action value.</p><p>So where does the probability come in? 
In computer Go before AlphaGo, we&#8217;ve always done some sort of <a href="https://en.wikipedia.org/wiki/Monte_Carlo_method">Monte Carlo method</a> where we take the expected <em>Q</em> value averaged over a randomly selected tree. That randomly selected tree is where probabilities come in. The interpretation of <em>Q</em> is: what is the expected action value under the random distribution induced by some random search process?</p><p>Where does the random search process come in? That&#8217;s where <em>P<sub>a</sub></em>, of action, comes in. If we assume a naive algorithm where you have a uniform probability of taking any valid action, then this would just be one over the number of valid moves. You would be taking this average over a very diffuse tree. This is a valid integral, but it&#8217;s very slow because you&#8217;re going to consider a lot of trees that have very low value. It&#8217;s essentially almost like an <a href="https://en.wikipedia.org/wiki/Importance_sampling">importance sampling</a> problem. Only a few actions and paths contribute high value, and almost everything else is low value. So that&#8217;s a tricky problem here.</p><p>This is the action selection criterion for how you decide which moves to go down. As you move down in tree search, you will eventually run into a node where it&#8217;s quite clear you&#8217;ve won or lost. At the very end of the game, when there are no valid moves left to play under Tromp-Taylor scoring, you can decide whether you won or lost. This is the final return of the whole game.</p><p>We can assign a value, <em>U</em>, to a terminal leaf node of the tree, but how do we assign values to the nodes prior to that, the parents? You take the mean action value, which is essentially your average. Suppose these were all leaf nodes. The mean action value of this node is just the average of whether you won or lost at the leaf nodes. 
Correspondingly, you can walk up the chain and say the mean action value of this node&#8212;let&#8217;s call it <em>Q<sub>b</sub></em>&#8212;is just a weighted average of these ones here.</p><p>The weighted average could depend on whether you have a different sampling distribution. But the basic intuition is that you want to resolve the game where you have a deterministic win or lose, and then you can go backwards&#8212;this is called the backup step&#8212;and assign values to these nodes or actions corresponding to the average over the final terminal leaf.</p><p>If you were to do this without neural networks, it would still be intractable. You would have trouble finding which actions to sample. A lot of the actions would contribute very low value, especially if you&#8217;re trying to fight your way out of a losing position. Only a few actions give you high value, so the search in practice is still very expensive. But the idea is that because Go follows a tree structure, you can inform a very good estimate of the value of this node based on the downstream values, assuming they&#8217;re all correct and you&#8217;ve searched deep enough.</p><p><strong>Dwarkesh Patel</strong></p><p>Your earlier explanation&#8212;about the sorts of states where it&#8217;s obvious to a human who&#8217;s going to win, but deterministically you still have to play it out&#8212;actually drew upon the intuition for why 1) the value function is both trainable and 2) necessary in order to learn this game effectively. Maybe it&#8217;s worth defining value in the first place.</p><p><strong>Eric Jang</strong></p><p>We talked about <em>U</em> being your final resolution of whether you won or lost. This is a terminal leaf node condition. Humans don&#8217;t play all the way to the leaves of the tree. They stop dozens of moves before, maybe even 100 moves before in high-level play.</p><p>How do they know? 
You can think about humans as implicitly having a neural network called a value function that takes in a board state and evaluates <em>p</em>(win). The human glances at the board and knows, &#8220;I&#8217;m probably going to lose.&#8221; They&#8217;re essentially running a neural network that looks at a board and implicitly amortizes a huge number of possible game playouts. They take that average and decide whether the board is winnable or not, and whether they should concede or keep playing.</p><p>This is remarkable. If you think about the beauty of something like this, a neural network in a human can somehow do all of this simulation at a glance. They just know within a few seconds, without actually playing every single game logically, based on crystallized knowledge and experience. They can do this. This gives us a hint that in games like Go, there are ways to radically speed up the search process. This is one of the fundamental intuitions behind why AlphaGo works. You can train a value function to look at a board and quickly resolve the game without playing out all of these trees to a very deep search depth.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes sense. I will say for the audience, for previous episodes when I was prepping and it seemed relevant to understand how AlphaGo works, I would find it very confusing. But it&#8217;s the kind of thing where once you understand the problem in this way and then build the next few pieces, it is actually a lot more understandable and it makes a lot of sense. It&#8217;s okay to be confused right now, but it&#8217;s probably simpler to understand, by the end of this lecture, than you anticipate. I&#8217;ll just make that note for the audience.</p><p><strong>Eric Jang</strong></p><p>The important intuition at a high level&#8212;to step back about where we&#8217;re going with all this&#8212;is that classically, for games like Go, you could build a tree, but we don&#8217;t have computers powerful enough for that. 
Estimating the value of every action you could possibly take is also hard because you don&#8217;t know until the end of the game.</p><p>You could take averages by playing them to the end, but that&#8217;s also hard because you don&#8217;t know which actions to take to sample these averages. Conceptually, there are two problems: the breadth of the tree and the depth of the tree. AlphaGo gives us a way to shrink both of those to be tractable. That&#8217;s essentially the core idea behind it.</p><p>We take the idea that humans can glance at a board and instantly predict whether we win. That maybe gives us the opportunity to truncate how deep we search. We also know that humans can intuitively, at a glance, decide what moves might be good on a Go board. These are two things we can use deep neural networks for, to accelerate the search process.</p><p>Before we talk about neural nets, let&#8217;s go back to how this playout works. We&#8217;ve only talked about making one move. The AI looks at this encoded Go board. It has a tree. It searches deeply into the tree to find out which of its actions might be the best, takes that action, and goes back to the human. Now the human sees a Go board that looks like this, and they make their move. Maybe they put their stone here.</p><p>Now we go back to the AI, which now looks at a new encoded board. I&#8217;ve used 2 to denote the AI playing as white, 1 to denote the human playing as black, and 0 as empty. On the AI&#8217;s turn, it does the MCTS tree search all over again from scratch. It throws away the old tree that it searched last round. Now there&#8217;s a new root node and it begins to search anew.</p><p>You can basically think about MCTS as a search algorithm that decides which moves to play best, aided by neural networks, and it&#8217;s done on every move. 
Let&#8217;s talk about the neural network part of this.</p><p><strong>Dwarkesh Patel</strong></p><p>While you&#8217;re erasing, another thing that was important for me to understand was about the MCTS data structure with nodes and children of nodes. This is done per move and reinstantiated once a move is made. A human makes a move, then the AI looks at this and runs a bunch of simulations to figure out what move to make next.</p><p>A simulation is exploring one more node in the MCTS tree. Once you run 1,000 simulations, that informs the probability of what move to make next, as you&#8217;ll explain. That&#8217;s what you store. You choose the best move given those probabilities. You discard all of that, the next player makes a move, and you restart this process at the beginning of every move.</p><p><strong>Eric Jang</strong></p><p>Correct. One small addendum: you don&#8217;t discard all of that. You keep one thing behind that we&#8217;ll use later.</p><h3>00:32:04 &#8211; What the neural network does</h3><p><strong>Eric Jang</strong></p><p>Now that we have a basic intuition of how moves are made with search, we&#8217;re going to talk about how neural networks can speed this up by providing an analog to human intuition. There are two networks. There is the value network, which takes in a state and predicts, am I going to win or lose? It&#8217;s a binary classification problem. Then we have a policy network, which induces a distribution over good actions to take.</p><p>I&#8217;m going to draw a one-dimensional flattened move distribution, but this is really a square grid. These are the probability distributions over good actions. Both of these are categorical classification problems. You can train this like any classifier with deep learning, <a href="https://en.wikipedia.org/wiki/Cross-entropy">cross-entropy</a> loss, that kind of stuff.</p><p>The specific architecture does not matter too much. I tried a few different architectures. 
<a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)">Transformers</a> work, <a href="https://www.geeksforgeeks.org/deep-learning/residual-networks-resnet-deep-learning/">ResNets</a> work. For small data regimes, my experience is that ResNets still outperform transformers and give you more bang for the buck at lower budgets. But this may not always be true.</p><p><strong>Dwarkesh Patel</strong></p><p>Wait, why is that?</p><p><strong>Eric Jang</strong></p><p>They provide the inductive bias of local convolutions. Generally, transformers start to outperform <a href="https://en.wikipedia.org/wiki/Residual_neural_network">residual convolutional networks</a> when you want more global context.</p><p>One interesting finding from the <a href="https://arxiv.org/abs/1902.10565">KataGo paper</a> was that they found it quite useful to pool together and aggregate global features throughout the network, to give the network a global sense of how to connect value from one side of the board to the other.</p><p><strong>Dwarkesh Patel</strong></p><p>What does it mean to aggregate global features?</p><p><strong>Eric Jang</strong></p><p>If you have a very large 19x19 Go board, and you&#8217;ve got some battles going on here and some battles going on there, when you pass this through a <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural network</a>, the receptive fields are going to be good at computing local things and making that invariant. But they won&#8217;t be able to connect these two features easily. They need to be pooled together and attend to each other somehow.</p><p>The argument for why transformers are good for <a href="https://en.wikipedia.org/wiki/Computer_vision">computer vision</a> tasks&#8212;<a href="https://en.wikipedia.org/wiki/Vision_transformer">vision transformers</a> and so forth&#8212;is that because they have global attention across the whole thing, they can more easily draw these connections. 
But you do need more data there so that you can learn the invariant local features through data.</p><p>I&#8217;ve tried very hard to make transformers work for this problem because I was curious if transformers would present some sort of breakthrough in Go and just remove a lot of those tricks. But try as I might, I haven&#8217;t figured out a way to make transformers better than ResNets for now.</p><p><strong>Dwarkesh Patel</strong></p><p>One more tangential question. It makes sense why transformers, with their global pooling of information, would be better if you need to consider information that is not just spatially&#8230; CNNs give you a bias that the things next to you are especially relevant.</p><p><strong>Eric Jang</strong></p><p>And then they&#8217;re aggregated up slowly.</p><p><strong>Dwarkesh Patel</strong></p><p>For games where it isn&#8217;t that relevant what is happening locally&#8212;you just have to consider the whole thing&#8212;you&#8217;re saying transformers would work better. We&#8217;re talking about the spatial dimension. How about the temporal dimension? Right now we&#8217;re only considering the previous move because it is a deterministic full-information game.</p><p>What if it were something like poker or <a href="https://en.wikipedia.org/wiki/Diplomacy_(game)">Diplomacy</a>, where a bluff they made a while back is relevant to understanding now? And to decide your next move, you need to consider all those previous states. Would that change the consideration of what inductive bias and which architecture is most relevant?</p><p><strong>Eric Jang</strong></p><p>Great question. Go is a <a href="https://en.wikipedia.org/wiki/Perfect_information">perfect information</a> game. 
In perfect information games, there does exist a <a href="https://en.wikipedia.org/wiki/Nash_equilibrium">Nash equilibrium</a> strategy for which you can do no worse than any other strategy.</p><p>If you know your opponent has a particular bias, like they love to play aggressively, you can in principle counter that specific strategy better than a Nash equilibrium policy. But to counter any given strategy, there exists a single Nash equilibrium that can be decided solely using the current state. That is a design choice AlphaGo made, which in hindsight turned out to work very well because the Nash equilibrium seems to be superhuman. No human strategy seems to be able to beat it.</p><p>There are variations of this where you would need to consider temporal history. This is a very exciting research area, and I&#8217;d encourage people to fork <a href="https://github.com/ericjang/autogo">my repo</a> and try it out. If you were to play 2v2 Go, you actually need to model your partner&#8217;s behavior. You may not have information on how they play, so you need to aggregate information on how they play so you can respond accordingly. These are situations where it&#8217;s no longer a perfect information game. In games of imperfect information or partial observability, you do need context to build a model. That&#8217;s a place where things can get very exciting in terms of <a href="https://en.wikipedia.org/wiki/Self-play">self-play</a> or Diplomacy-style games.</p><p>Returning to the neural network, the architecture again is not super important. You can get it to work with transformers. You can get it to work with ResNets. I found that for low-budget experiments, ResNets work a little better. 
You can also use <a href="https://www.dwarkesh.com/p/andrej-karpathy">Karpathy</a>-style <a href="https://github.com/karpathy/autoresearch">AutoResearch</a> <a href="https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)">hyperparameter</a> tuning to make your architecture pretty good. You don&#8217;t have to worry too much about that. You just need to set up the problem so that you have a target optimization.</p><p>We&#8217;re going to pick a somewhat arbitrary architecture that worked for what I did. But again, this part is not super important. You have your encoded board state and, similar to an RGB, we&#8217;re going to have three channels. One channel to encode black, one to encode white, and one to encode empties or a masked region if you want to train on multiple board sizes. I&#8217;m actually not going to talk about multiple board sizes for now. That&#8217;s a bit too complicated.</p><p>We have this two- or three-channel RGB-like image, and we feed it into a ResNet. Then we have two branching heads. One head predicts the value function, which is a single <a href="https://en.wikipedia.org/wiki/Logit">logit</a>, let&#8217;s just call it R<sup>1</sup>. Then we have the policy, which is R<sup>361</sup>. This is the architecture. We&#8217;re going to train this to predict the outcomes of games given the board state, and we&#8217;re also going to train it to predict what are good moves.</p><p>The <a href="https://www.nature.com/articles/nature16961">original AlphaGo paper</a>, called AlphaGo Lee, initialized this network with a <a href="https://en.wikipedia.org/wiki/Supervised_learning">supervised learning</a> dataset of expert human play. Later, they removed this restriction by having the model teach itself how to play well. 
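</p><p>As a minimal sketch of the board encoding described above (not code from the episode or the linked repo), the three input planes can be built in plain Python. It assumes the integer encoding mentioned earlier, with 0 for empty, 1 for black, and 2 for white. The name <code>encode_planes</code> and the plane order are hypothetical.</p><pre><code class="language-python">def encode_planes(board):
    """Turn a square board of 0/1/2 ints into three binary planes.

    Assumed encoding: 0 = empty, 1 = black, 2 = white.
    Returns [black_plane, white_plane, empty_plane].
    """
    planes = []
    for stone in (1, 2, 0):  # hypothetical plane order: black, white, empty
        planes.append([[1 if cell == stone else 0 for cell in row] for row in board])
    return planes


# Toy 3x3 position, just to show the shape of the output.
board = [
    [0, 1, 0],
    [2, 0, 0],
    [0, 0, 1],
]
black, white, empty = encode_planes(board)
</code></pre><p>On a 19x19 board, the network described above would map planes like these to one value logit and 361 move logits.</p><p>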
I find it super nice for implementation, for your audience, to always initialize your experiments to something easy and get the problem working before trying to bite off the whole thing and learn tabula rasa.</p><p>In deep learning, initialization is everything. You always want to initialize your research project to something as close to success as possible, especially if you&#8217;re doing something new that you haven&#8217;t done before. Always pick something that works and then get it to do something better, rather than start from something that doesn&#8217;t work at all and try to make it work.</p><p>Under that philosophy, it&#8217;s a great idea to start with something that has a good initialization. We&#8217;re going to take human expert plays and train this model to predict good actions. We&#8217;re going to take all the moves in which an expert won and predict those actions. Regardless of board state, whether you won or lost, you&#8217;re going to predict the outcome.</p><p>You might be wondering, for some of the early boards where only one stone has been put down, how could you possibly know who the winner of this game is? Well, if you have hundreds of thousands of games, on average you&#8217;ll probably see that boards starting like this have half of the games branch off and win, and half branch off and lose. So that&#8217;ll actually be fine. When you train this model to predict those outcomes, the logit will sort of converge to 0.5. For these things, it&#8217;s expected that once you train the model, a starting board state will look like 0.5, and then as you progress towards the end of the game, the win probability will either go up or down.</p><p>This is your move number. 
As you get hundreds of steps into the game, it becomes much clearer who&#8217;s more likely to win or lose under your expert data distribution.</p><p><strong>Dwarkesh Patel</strong></p><p>I didn&#8217;t understand the significance of why this way of thinking about value is especially relevant to the expert data.</p><p><strong>Eric Jang</strong></p><p>It is not relevant to the expert data. It&#8217;s true for any data that you trained it on. If you were to learn tabula rasa, you would also expect this to fall out.</p><p>Imagine you&#8217;re vibe coding AlphaGo. You gather some expert datasets from KataGo online, or you have a dataset of human players, and you train this model. It turns out this model is already a pretty good Go player. It will most likely beat most human players. If you just take this policy recommendation and take the argmax over these probabilities&#8212;if you take the argmax and just take this action as your Go play&#8212;it&#8217;ll be a very fast Go player that doesn&#8217;t think in terms of reasoning steps. It just shoots from the hip, and it&#8217;ll be a very strong Go player. This is already quite miraculous if you think about how ten neural network layers, maybe under 3 million parameters, can already do something that impressive.</p><p>You can start this way. It&#8217;s important when implementing this to verify that this is probably true. It&#8217;s good to verify that your Go rules are implemented correctly and that you can run these simulations relatively quickly. Just as a checkpoint, you want to make sure you can actually do this basic step before you try to layer on more complex things like search.</p><p>But we can do a lot better than taking the raw neural network and playing the moves. So let&#8217;s apply the neural network to improve Monte Carlo tree search. 
We start with our root node, and we now have a four-step iterative process to do MCTS.</p><p>This tripped me up when I was first reading the paper and trying to understand it. Essentially what we&#8217;re going to do is choose a number of simulations, and this number varies. It can be somewhere between 200 and 2,048. I believe in the <a href="https://en.wikipedia.org/wiki/AlphaGo_versus_Lee_Sedol">AlphaGo vs. Lee match</a>, they used tens of thousands of simulations per move because they really wanted to boost the strength of the model as much as possible. But in training, you don&#8217;t actually need too many. KataGo, I think, uses something in this order as well.</p><p><strong>Dwarkesh Patel</strong></p><p>Do you know if they used&#8230; If you watch the <a href="https://youtu.be/WXuK6gekU1Y">documentary</a>, they had a laptop out during the game. They didn&#8217;t use the laptop itself. It was on some&#8212;</p><p><strong>Eric Jang</strong></p><p>It was on some <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit">TPU</a> pod, I think.</p><p><strong>Dwarkesh Patel</strong></p><p>Honestly, kind of unfair. <a href="https://en.wikipedia.org/wiki/Lee_Sedol">Lee</a> is not using 1E22 <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOPS</a> to do a move.</p><p><strong>Eric Jang</strong></p><p>Fair enough. Interestingly, modern Go bots don&#8217;t need that much compute at test time. What we&#8217;ll find out, as we talk about how the MCTS policy improvement works, is that over time the raw network actually takes all of the burden of that big TPU pod and just pushes it into the network. You can do all of that work with one neural network <a href="https://towardsdatascience.com/neural-networks-forward-pass-and-backpropagation-be3b75a1cfcc/">forward pass</a>. 
The TPU pod will always add the extra oomph on top, and that&#8217;s what they wanted for the match.</p><p>We&#8217;re going to pick this number of simulations thing, and for every simulation, we&#8217;re going to do several things simultaneously. We&#8217;re going to see which moves are the best in the current tree. We&#8217;re going to add extra leaves to the tree if we get to a point where we need to add a leaf, and we&#8217;re going to update the action values for the tree. Every simulation involves this four-step process: selection, expansion, evaluation, and backup.</p><p>At the beginning of our Monte Carlo tree search, our tree is very basic. It only has the root node, or our current board that our AI wants to play at. We&#8217;re basically going to select the best action for this. When this root node is created, we also know that we can evaluate it under our neural network and get the quantities <em>V<sub>&#952;</sub></em>, as well as our probability over actions. And I&#8217;m going to say root. For all of the actions here, we can create a bunch of children. In this case, I&#8217;m drawing a 3x3 board with one missing, so there are eight possible children associated with this root node. Each of these has an associated probability of taking that action, so there&#8217;s P<sub>8</sub>, P<sub>1</sub>, ... P<sub>2</sub>, and so on.</p><p>So at the beginning of our Monte Carlo tree search, we have our root node, and we can initialize it with some children. The policy network evaluated on the root node gives us&#8212;on a 3x3 board with one existing stone placed&#8212;eight possible children that this AI could take. With each of the children, the policy network also gives us the probability of selecting that child. The first step is to do the selection of the tree. Again, this is a very shallow tree. 
All we have so far is essentially a tree of depth one.</p><p>Our first move is to select by maximizing, or argmaxing, the PUCT criterion, which is basically <em>Q</em>(<em>s</em>,<em>a</em>) + C<sub>PUCT</sub> &#215; <em>P<sub>a</sub></em> &#215; (&#8730;<em>N</em> / (1 + <em>N<sub>a</sub></em>)). For each of these, <em>N<sub>a</sub></em> is zero for all the actions initially. <em>N</em> is zero. So we&#8217;re going to pick according to this. Initially, the chosen action is most likely going to be biased towards the highest-likelihood action, because these are uniform for every node.</p><p>Let&#8217;s suppose P<sub>1</sub> was the highest-probability node, so you selected this one here. Now you get to this node and you realize it&#8217;s not a leaf node. It&#8217;s not a terminal game, so you cannot resolve the final resolution. The next step is expansion. You will then run this board state through the policy network. Note that this is the AI&#8217;s move. The AI is making this move. When we expand this tree, we&#8217;re now thinking about what the human, or any opponent, might do.</p><p>When we evaluate the node here, we&#8217;re going to evaluate it from the perspective of this player. This node has possible actions that we could take, and we expand the leaf nodes here. For each of these nodes that we could arrive at, we&#8217;re going to now check how good they are. From here, the human could play here, here, or here. We&#8217;re going to store <em>V<sub>&#952;</sub></em> for each of these, so <em>V<sub>&#952;</sub></em> of Node<sub>1</sub>&#8242;, <em>V<sub>&#952;</sub></em> of Node<sub>1</sub>&#8243;, and so on. We&#8217;re basically using our neural network to make an intuitive guess of how good this board is from the perspective of this player.</p><p>Fortunately, because it&#8217;s a zero-sum game, it&#8217;s easy to deduce that the value for this player at this step is just one minus the value from the other perspective. 
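</p><p>As a rough sketch of the selection rule and the perspective flip just described (not code from the episode or the linked repo), the PUCT score can be written in a few lines of plain Python. The constant <code>C_PUCT = 1.5</code>, the dictionary layout, and the function names are all hypothetical. Breaking ties by the prior is also an assumption of this sketch: with every count at zero the formula as stated scores all actions equally, so the tie-break is what lets the highest-prior move win the first selection.</p><pre><code class="language-python">import math

C_PUCT = 1.5  # hypothetical exploration constant


def puct_score(q, prior, parent_visits, child_visits, c_puct=C_PUCT):
    """Q(s,a) + c_puct * P_a * sqrt(N) / (1 + N_a)."""
    return q + c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)


def select_action(children):
    """Pick the action with the highest PUCT score.

    children maps an action to a dict with keys "q", "prior", "visits".
    Ties (for example when every count is still zero) are broken by
    the prior, which is an assumption of this sketch.
    """
    parent_visits = sum(child["visits"] for child in children.values())

    def key(action):
        child = children[action]
        score = puct_score(child["q"], child["prior"], parent_visits, child["visits"])
        return (score, child["prior"])

    return max(children, key=key)


def flip_value(p_win):
    """Win probability seen from the other player in a zero-sum game."""
    return 1.0 - p_win


children = {
    "a1": {"q": 0.0, "prior": 0.6, "visits": 0},
    "a2": {"q": 0.0, "prior": 0.3, "visits": 0},
    "a3": {"q": 0.0, "prior": 0.1, "visits": 0},
}
first_pick = select_action(children)  # all scores are 0, so the prior decides

# After three visits to a1 with a mediocre average value, the
# exploration term for the unvisited a2 outweighs a1.
children["a1"].update(q=0.2, visits=3)
second_pick = select_action(children)
</code></pre><p>In this toy example the first selection goes to <code>a1</code> and the later one to <code>a2</code>, which is the exploration term doing its job.</p><p>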
It&#8217;s easy to flip the search process depending on which player you&#8217;re at. This is the expansion step. You&#8217;ve taken a non-leaf node, expanded it, and evaluated the value. This is essentially a quick guess as to whether I&#8217;m going to win or not if I were to play to the end. You can almost think about <em>V<sub>&#952;</sub></em> as a shortcut for searching to the end of the tree, for any given simulation.</p><p>This is essentially the evaluation step. We&#8217;re evaluating the quality of each of these boards. In the original AlphaGo Lee, they did something kind of interesting. They took this value and averaged it with the value of a real Go playout. They actually played a real game from here all the way to the end. I&#8217;m just going to draw this squiggly line to indicate some path. They play this all the way to Tromp-Taylor resolution of a full board. So this is a zero or one. They took this value and averaged it with this one here. The formula was <em>&#945;</em> &#215; <em>V<sub>&#952;</sub></em> of some node plus (1 &#8722; <em>&#945;</em>) &#215; the outcome of a truly randomly sampled playout.</p><p>You might be wondering how they play this out. It would be very costly to do another search on this playout, almost like a tree within a tree. They don&#8217;t do this. Instead, they just take the policy network and play it against itself. They use it as both players, and they just play it all the way to the end. This helps ground the estimates here in reality because you can get a single-sample estimate of whether you win or not. In the endgame, where the board is almost resolved, this becomes quite useful because the play according to the policy will most likely decide a pretty reasonable guess of the game, so you&#8217;re not facing a problem where this becomes untethered from reality.</p><p>It turns out this is totally unnecessary. In all subsequent papers after AlphaGo Lee, they got rid of it. 
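</p><p>The blended evaluation described above amounts to a one-line weighted average. The sketch below is only an illustration of that formula, not code from the episode or the linked repo. The function name is hypothetical, and the default <code>alpha=0.5</code> is a placeholder, since no specific value is given here.</p><pre><code class="language-python">def mixed_leaf_value(v_theta, rollout_result, alpha=0.5):
    """alpha * V_theta(node) + (1 - alpha) * rollout_result.

    v_theta: network win-probability guess in [0, 1].
    rollout_result: 0 or 1 from one policy-vs-policy playout.
    alpha: mixing weight; 0.5 is a placeholder, not a quoted value.
    """
    return alpha * v_theta + (1 - alpha) * rollout_result


network_guess = 0.8  # the value head thinks this board is good
lost_playout = 0     # but the single sampled playout was a loss

blended = mixed_leaf_value(network_guess, lost_playout)
value_only = mixed_leaf_value(network_guess, lost_playout, alpha=1.0)
</code></pre><p>Setting <code>alpha</code> to 1 ignores the playout entirely, which corresponds to dropping the rollout and trusting the value network alone.</p><p>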
In my implementation, I did the same, and it speeds things up a lot because you don&#8217;t have to roll these games out on every single simulation.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, just to reinforce my own understanding and re-explain it. For the audience, by the way, in case it&#8217;s not obvious, the <em>P</em> there in the select, that is the probability coming from the network in this case.</p><p><strong>Eric Jang</strong></p><p>Correct. The policy network here.</p><p><strong>Dwarkesh Patel</strong></p><p>Fundamentally, a simulation, just think of it as rolling out one more node in the search process.</p><p><strong>Eric Jang</strong></p><p>Almost. A simulation is easy to think about when the whole tree already exists. You just walk down the tree using the PUCT selection criterion, and then you keep going. In AlphaGo, the data structure is such that we begin with a tree that basically only has depth one&#8212;its only children&#8212;and you want to iteratively build out the tree as you&#8217;re also selecting actions down the tree. That&#8217;s the core thing here. Because Go is such a combinatorially complex game, you cannot afford to build the tree in advance and then search it. You must search while building the tree.</p><p>Let me finish up with the last step, which is the backup. Once you&#8217;ve scored these things, the <em>Q</em> value assigned to the node here for taking this action is just the average across your evaluated values. You take a running mean over all the simulations you&#8217;ve taken, averaging the values of the children nodes. That&#8217;s the backup step, and once you evaluate this, you can recursively go back up. If you know the action value of this node, you can then take the average on its parent, and so on.</p><p>You have this four-step process: 1) You&#8217;re choosing the best action you know of so far. 2) You may run into a node you haven&#8217;t been to before, so you need to grow the tree a bit. 
3) You run it through the network to guess whether you&#8217;re going to win or not. 4) You walk all the way back up to the root node to update your values on what the best moves are.</p><p>As you do this iteratively, because you&#8217;re always selecting according to this criterion, you&#8217;re always going to be selecting the best action you think at any given branch. The final visit counts of how often you chose these things will reflect your correct policy distribution as induced through this search process. The visit count we stored in the node earlier actually becomes the vote for which action we should finally select.</p><p>As a test of understanding, it&#8217;s worth thinking a little bit about whether we could make this even simpler. Could we actually maybe even get rid of this one and still make the thing work? Recall that when you do an expansion and then an evaluation at this node, you are checking the win probability of each of the child nodes. So if this one is one and these are zero, you do know something about which action might be better to take. Why would you still need this, right? Why not just normalize the values into a distribution and call that your policy distribution?</p><p>This is fine. You can do this, and it probably does work. But in practice, having a single forward pass that gives you a pretty good guess is how the breadth is pruned out. There is a duality here. It would be weird if the policy recommended an action that disagreed with the value. If a policy said this was very high probability, but this one said it was a low value, then there&#8217;s something fundamentally wrong between your policy head and your value head. They are linked, and you probably could get rid of this if you came up with a different way to recover it from just the value evaluations.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. 
Just to make sure I understand, the reason you don&#8217;t do that is so you don&#8217;t have to do 360 independent forward passes to say, &#8220;Okay, here&#8217;s the value of everything. Let&#8217;s argmax over it.&#8221; Instead, you can just do one forward pass and get the probabilities of all of them.</p><p><strong>Eric Jang</strong></p><p>You can usually batch these somewhat efficiently, so it probably isn&#8217;t a huge computational burden in practice. But yes, you would have to pass up to 361 boards into a single mini-batch update to evaluate all the values, then normalize them.</p><p>There&#8217;s actually a more important reason why we still do this, which is how Monte Carlo tree search is used to feed back on itself and recursively improve its own predictions and search capabilities. That&#8217;s where having this as an explicit entity you&#8217;re modeling, rather than an implicit normalization over your value, is a good idea.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes sense. Okay.</p><p><strong>Eric Jang</strong></p><p>We talked about the simulations. What you end up with as you roll out the number of simulations is a tree that looks like&#8230; I&#8217;m drawing a very low-dimensional version of this. Of course, in the real game, it&#8217;s much more high-dimensional. You&#8217;ll end up with a tree structure that has a lot of leaves that terminate and are not visited again because their value is deemed to be too low. But along one path, there will be a set of actions with very high visit counts that gravitate towards that one set of decisions as you increase <em>N</em>.</p><p>This is the mental picture of what the tree in Monte Carlo tree search looks like. You should contrast this with an exhaustive tree like in tic-tac-toe, where there are nine actions, then eight, then seven and six, and so it&#8217;s a nine-factorial-sized tree. The Monte Carlo tree search in Go is very sparse. 
It only considers the paths that you&#8217;ve expanded children nodes on.</p><p>Now that we have the search algorithm that applies the value function as well as the policy function, we can talk about how the Monte Carlo tree search algorithm can act as an improvement operator on top of these guys here.</p><h3>01:00:33 &#8211; Self-play</h3><p><strong>Eric Jang</strong></p><p>We now talk about the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">RL</a> part of how this thing gets stronger by playing itself. Let&#8217;s say we play a game. You make a move. The AI will compute the search, and this is the visit count distribution. Let&#8217;s say this is your initial policy recommendation at this node.</p><p>After MCTS, it gets more confident about one of these actions. Maybe the distribution looks a bit more peaky like this based on the search. Of course, you can tune the search process so that it ends up more diffuse, but that&#8217;s probably not a good idea. MCTS should get more confident about specific actions over others. It might place a lot of weight on other actions initially, and then as you increase the number of sims, it should converge to a very peaky distribution.</p><p>Let&#8217;s call this &#960;. Let&#8217;s wrap this in an MCTS operator of (<em>a </em>| <em>s</em>). After applying the MCTS process, your policy recommended distribution looks like this. It&#8217;s a bit more peaky than the previous one. Then you take the argmax, or maybe you just sample from this. It doesn&#8217;t have to be the argmax. You make your move, throw away the tree, and begin anew on the next move. Again, you compute a new distribution. Initially, maybe your guess looks like this, and then you refine it through MCTS.</p><p><strong>Dwarkesh Patel</strong></p><p>There should be one more X on the board, right?</p><p><strong>Eric Jang</strong></p><p>I&#8217;m sorry, that&#8217;s correct, yes. Something that looks like this. 
On every move, you have your initial guess from your policy network. Then the search process, which combines your policy network and your value network, arrives at a more confident action that you take, and so on and so forth.</p><p>Then the game ends, and one person wins and one person loses. The beauty of how AlphaGo trains itself is that it can actually take this final search process&#8212;the outcome of the search process&#8212;and tell the policy network, &#8220;Hey, instead of having MCTS do all this legwork to arrive here, why don&#8217;t you just predict that from the get-go? Why don&#8217;t you not use this guess and predict this to begin with?&#8221; If you have this guess to begin with in your policy network, then MCTS has to do a lot less work to get things to work.</p><p>If we draw a test-time scaling plot&#8230; Let&#8217;s say this is the number of simulations. At zero simulations, your implicit win rate is here. If you just take this raw action, this is what your win rate is. Let&#8217;s say we increase the number of sims, maybe you have a win rate curve that looks like this. When you search for 1,000 simulation steps, that gets you to a policy here that gets you to here, which is great.</p><p>But if you were to distill this MCTS policy network back into your shoot-from-the-hip policy network, then you could actually start here. Let&#8217;s say this was zero, by distillation, if you then spend another 1,000 sim steps, you actually get to here. It&#8217;s almost as if you could amortize the first 1,000 steps into the policy network instead of the search process, then you could begin at a much better starting point and get a much better result for the number of sims that you play.</p><p><strong>Dwarkesh Patel</strong></p><p>The sigmoid type nature of test-time scaling as the number of simulations increases, the increase in win rate gets smaller. Is that true even for the distilled network? 
That is to say, is there some gain where we start from the distilled network and get these early gains again, or is that just inherent to the nature of MCTS?</p><p><strong>Eric Jang</strong></p><p>To be honest, I actually don&#8217;t know the test-time scaling behavior of MCTS simulations. I believe it might be quite sensitive to how strong this one is in practice. I&#8217;m just drawing a monotonically increasing function that gets to one.</p><p>Don&#8217;t pay too much attention to the shape of the curve. Just know that it&#8217;s monotonic with respect to sims. The idea of MCTS is very brilliant. We got something better by applying search. Now, on our next iteration of updating this network, we&#8217;re going to train it to approximate the outcome of 1,000 steps of search.</p><p>Instead of starting here, we get to have our neural network start here, and then the play gets stronger once we then apply another 1,000 steps on top of it. You can keep going. The training algorithm for AlphaGo is to basically take the games where you&#8217;ve applied the search on every move that the policy encountered&#8212;whether you won or lost, and that&#8217;s quite important&#8212;and just train the model to imitate the search process.</p><p>There&#8217;s an analogy to robotics, which is the <a href="https://shuijing725.github.io/files/20190926_DAgger.pdf">DAgger algorithm</a>. First I&#8217;m going to draw a schematic of the states: S<sub>0</sub>, S<sub>1</sub>, S<sub>2</sub>, S<sub>3</sub>. Let&#8217;s say we took a series of actions in an <a href="https://en.wikipedia.org/wiki/Markov_decision_process">MDP</a> to get a trajectory. These actions may be suboptimal. Maybe we lost at the end of this game. There is a family of algorithms that basically take trajectories and relabel the actions to better trajectories. Maybe a better action here would have been to take A<sub>0</sub>&#8217;. 
A better action here would have been A<sub>1</sub>&#8217;, and yet another one, A<sub>2</sub>&#8217;, A<sub>3</sub>&#8217;.</p><p>What MCTS is doing is saying: you played this game where you eventually lost, but on every single action, I&#8217;m going to give you a strictly better action that you should have taken instead. It does not guarantee that you are going to win, but it does guarantee that if you take these tuples as training data and retrain your policy network to predict these actions instead of the original ones, you&#8217;re going to do better.</p><p>This is very related to DAgger in robotics and <a href="https://en.wikipedia.org/wiki/Imitation_learning">imitation learning</a>, where you want to collect an intervention here. Even if you&#8217;re in a not-great state&#8212;for example, a self-driving car that veers off the side of the road&#8212;there is still a valid action that corrects you and brings you back.</p><p><strong>Dwarkesh Patel</strong></p><p>Pedantic question, but is there a guarantee that MCTS must be better than the policy? For example, you could imagine early on in training, because MCTS is informed by the value network, that when the value network hasn&#8217;t been well-trained on finished games, MCTS is worse than a randomly initialized policy. Is it just a heuristic that MCTS is better than the policy, or is there some guarantee?</p><p><strong>Eric Jang</strong></p><p>In practice, it is a heuristic. It does work in practice, but let me illustrate an example where MCTS can give you a worse distribution than your policy network. This can often happen if your self-play algorithm has trained to a good point, but then somehow it collapses because it&#8217;s not trained on diverse data or something.</p><p>Let&#8217;s say we have a board state where the policy recommendations are very good. So &#960;<em><sub>&#952;</sub></em> (<em>a</em> | <em>s</em>) is great. 
But somehow, maybe because we&#8217;re playing a lot of games where the bots just resign instead of playing all the way to the Tromp-Taylor resolution, they forget how to evaluate those late-stage plans. Like in the case we showed with the corner play, maybe 100% of our training data in our replay buffer has lost examples of how to evaluate the value function at those states.</p><p>You might end up in a scenario where your terminal value is very bad. If the terminal values of the leaves are not good, then this will propagate all the way up and cause your PUCT selection criteria and your backups to be off. And then you end up visiting a very different distribution than what your policy initially recommended.</p><p>Also, if your number of sims is low, you might have a variance issue where you just don&#8217;t explore enough. It&#8217;s only guaranteed to converge when you take <em>N</em> to infinity. Variance in your search process as well as inaccuracies in your evaluation can definitely screw with the quality of your policy recommendation. That&#8217;s why it&#8217;s not a guaranteed improvement.</p><p>That is why I suspect AlphaGo Lee had the playouts to the end in their training algorithm, so that they could ground this thing in real playouts. In practice, what you could also do is, for 10% of the games, prevent the bots from resigning and just say, &#8220;Resolve it to the end.&#8221; That way you get some training data in your replay buffer to really resolve those late-stage playouts that normal human players would not play to.</p><p>If you assume that the value functions are correct, this is why MCTS gives you a better policy. It&#8217;s a very critical chain of assumptions. 
Assuming this is accurate, your search process should give you a better recommendation than your initial guess.</p><p><strong>Dwarkesh Patel</strong></p><p>If you have a cold-started policy&#8212;if you have an AlphaZero-type thing&#8212;really what&#8217;s happening for the first few epochs is that the policy is kind of useless. What you&#8217;re really just doing is, &#8220;Hey, let&#8217;s play full games, and once we have played full games, for the preceding moves, we&#8217;ll have labeled who won and who didn&#8217;t win.&#8221; The loss for AlphaZero has two components: how good is the policy relative to MCTS, and how good is the value prediction relative to who actually won the game from this move. You can think of this being applied to every single action or every single move.</p><p>Really what&#8217;s happening at the beginning of AlphaZero training is just that we&#8217;re trying to get the value function to actually predict who will win the game if you find yourself in this state and you&#8217;re this player. Functionally, that&#8217;s all that&#8217;s happening. Later on, once that&#8217;s well trained, now the policy is also improving.</p><p><strong>Eric Jang</strong></p><p>Correct. There&#8217;s one trick I found to be pretty useful. This is not a peer-reviewed claim, so take it with a grain of salt. I found it useful in my own implementation to do the following. You want to first make sure that this is good before you invest a lot of cycles doing MCTS. It doesn&#8217;t really make a lot of sense to do search on garbage value predictions. You want to start at a good place where this works.</p><p>AlphaGo Lee does a very good thing where it just takes human games, you train on them, and it just works. It totally works. You can also take an open-source Go bot, play it against itself, generate data, and that also works. If you have some offline dataset that has realistic, good play, you can easily learn the late-stage value functions pretty well. 
That&#8217;s what you need to start the search process.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, can you just repeat that sentence one more time?</p><p><strong>Eric Jang</strong></p><p>Sure. It&#8217;s quite easy to evaluate a late-stage Go game. When almost all the pieces are on the board, it&#8217;s almost like a decidable problem because there&#8217;s lower and lower uncertainty as to the depth of the tree. So most games played to the end by reasonable people will be good training data to train a good value function at terminal parts of the tree.</p><p>Then, as you play more games, the search will back up good values into the intermediate nodes of the tree. As you increase the amount of data, your value head gets a good intuition of what is a healthy board state versus a not-healthy board state. Those are much more subtle to judge in the mid-game than at the beginning or the end.</p><p>The most difficult part to score is not the beginning, because the beginning is obviously 0.5, and at the end it&#8217;s pretty obvious who&#8217;s winning. The hard part that you want to learn in the value function is who is winning in the middle.</p><p><strong>Dwarkesh Patel</strong></p><p>So this is actually very analogous to <a href="https://en.wikipedia.org/wiki/Temporal_difference_learning">TD learning</a>.</p><p><strong>Eric Jang</strong></p><p>Yes. There&#8217;s a beautiful connection to TD learning that we can talk about in a bit, contrasting it with Monte Carlo tree search.</p><p>You first want to get good value functions, and expert data can give you a quick shortcut. I recommend for practitioners, just do that first to initialize to a good starting point. If you want to do the AlphaZero thing or KataGo tabula rasa learning, what you can try to do is play random games on a small board. Just take a random agent. If you play 50,000 games, you&#8217;ll actually learn a pretty good value function as well. 
On a 9x9 board, you can see enough of the common patterns with random play.</p><p>If you train on both 9x9 and 19x19 data&#8212;and KataGo proposed one of these architectures&#8212;there&#8217;s pretty good transfer learning from the value head evaluated at 9x9 to 19x19.</p><p><strong>Dwarkesh Patel</strong></p><p>Right, because this, unlike other games, has very much a sense&#8230; There&#8217;s not like a new kind of piece introduced when you increase the size or something.</p><p><strong>Eric Jang</strong></p><p>If we take it to its limit and consider a very tiny 4x4 Go board, if you play 50,000 games, you&#8217;re going to have a lot of end states that look like human play. It&#8217;s just Tic-Tac-Toe at that point.</p><p>If you broaden this a little bit to 5x5 or 9x9, it&#8217;s not unrealistic to imagine that purely random play will generate pretty reasonable-looking boards. You can score those pretty easily. That is what gives you the bootstrapping to then improve your policy with search.</p><p>But it&#8217;s very critical that MCTS has accurate value estimates. You need to ground the value. Ultimately, MCTS will fall apart if you don&#8217;t have a grounding function for the value.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;d be curious how much compute you save by training the value and policy on the same network. Because they share the same representations, how much more efficient is learning? We&#8217;ve just talked about how they&#8217;re making similar predictions, or they should be in line with each other. So I&#8217;d be curious if you&#8217;re halving the amount of compute you have to do by keeping them in the same network.</p><p><strong>Eric Jang</strong></p><p>AlphaGo Lee, the original AlphaGo paper, had two separate networks. In all subsequent papers, they merged them into two heads, and presumably this saves compute. 
But answering that question in a rigorous scientific way is a simple question that in practice, if you really want to chase the question down to its limit, takes quite a bit of work to really resolve.</p><p>Intuitively, yes, they share a lot of representations. As we mentioned, your policy network and your value network, when doing evaluations, should agree. There really should be this consistency between them.</p><p><strong>Dwarkesh Patel</strong></p><p>Tell me if this is the wrong way to think about it. When I learn how an LLM works and how simple <a href="https://www.promptfoo.dev/blog/rlvr-explained/">RLVR</a> is as an algorithm, I&#8217;m stunned by the kinds of things it can do. It can learn how to build very complicated code repositories simply from getting a yes or no.</p><p>Here, if you understand it more deeply, just predicting MCTS, AlphaGo seems less impressive in retrospect the more you understand it. You&#8217;re putting in a lot of bias by telling it how it should titrate exploration as things go on. You&#8217;re building this very explicit tree search for it. I don&#8217;t know if you share that intuition where the more you understand it, the less impressive the accomplishment in 2017 seems.</p><p><strong>Eric Jang</strong></p><p>I personally disagree. I think they&#8217;re profound for different reasons. I don&#8217;t understand the LLM RL enough to comment on it on your podcast. But why is AlphaGo a profound accomplishment? It&#8217;s worth stepping back a little bit. It is different from modern RL, and we can talk a little bit about some of the algorithmic choices there.</p><p>I think the most profound thing here is that a 10-layer neural network pass, basically 10 steps of reasoning&#8230; Of course, the reasoning is not just one trail of thought. It could be distributed representations and a lot of thoughts going on at the same time. But by construction, let&#8217;s say a 10-layer neural network can only do 10 sequential steps of thinking. 
10 steps of neural network parallelized distributed-representation thinking is able to amortize and approximate to very high fidelity a nearly intractable search problem.</p><p>This was a breakthrough that I think most people don&#8217;t even fully comprehend today, how profound that accomplishment is. This is what also undergirds <a href="https://en.wikipedia.org/wiki/AlphaFold">AlphaFold</a>, for example, where you have a very, very difficult physical simulation process where you would need to roll out so many microscale simulations, and yet 10 steps of a somewhat small neural network can somehow capture what feels like an <a href="https://en.wikipedia.org/wiki/NP_(complexity)">NP</a>-class problem into a single problem.</p><p>It actually makes me wonder if our understanding of problems like <a href="https://en.wikipedia.org/wiki/P_versus_NP_problem">P=NP</a>, or these <a href="https://en.wikipedia.org/wiki/Computational_hardness_assumption">fundamental computational hardness</a> problems, is incomplete. Obviously, it&#8217;s not a proof of P=NP, but there&#8217;s something to it that is very disturbing, where what felt like a very hard problem can fall to a very simple macroscopic solution.</p><p><strong>Dwarkesh Patel</strong></p><p>That is a very interesting insight, that a lot of problems which are proven to be <a href="https://en.wikipedia.org/wiki/NP-hardness">NP-hard</a>&#8212;I don&#8217;t know if Go is proven to be NP-hard, but <a href="https://en.wikipedia.org/wiki/Protein_folding">protein folding</a>, et cetera&#8212;neural networks can solve. They&#8217;re NP-hard in the worst case, but we&#8217;re usually not concerned about the worst case. These problems usually have a lot of structure to them.</p><p><strong>Eric Jang</strong></p><p>I think the question we should be asking ourselves is about how we&#8217;ve been formulating solutions to NP-hard problems in worst-case complexity. I wouldn&#8217;t say this solves Go. 
It doesn&#8217;t give us an exact solution of the optimum, but in practice, it is extremely useful. The same thing has been shown in <a href="https://en.wikipedia.org/wiki/AlphaTensor">AlphaTensor</a> and AlphaFold. Yes, there is a very hard problem that, in the worst case, seems intractable, and yet we&#8217;re able to make almost arbitrary amounts of progress.</p><p>In the limit, what might this look like? If you want to simulate something very complex like weather, or predict the future&#8212;do we live in a simulation or not&#8212;the computing resources you need to build a very complex simulation might be much smaller than you think, based on our ability to amortize a lot of that computation into the forward pass of a single network. To me, AlphaGo was the first paper that really showed this profound level of simulation being compressed into a small amount of compute.</p><p><strong>Dwarkesh Patel</strong></p><p>I feel totally not qualified on the computational complexity or the math to comment on this, but I wonder if there&#8217;s an important role of <a href="https://en.wikipedia.org/wiki/Chaos_theory">chaos</a> here. What is the problem with weather, and why does it take 10x the amount of resources to predict weather a day out, and continually so for every additional day out? It&#8217;s because it&#8217;s a chaotic system, so small perturbations can totally change the final estimate as time goes on. I guess you would expect that for Go and protein folding as well.</p><p><strong>Eric Jang</strong></p><p>Here&#8217;s an analogy to weather that might be relevant in Go. Here&#8217;s our current board state. Given what we know about both players, what is the exact board state in the future? This is extremely sensitive to initial conditions. A single stone placed here can disrupt the entire prediction. This is hard. Intuitively, this is the chaotic problem.</p><p>Yet somehow, we can predict who&#8217;s going to win. This captures a lot of possibilities. 
There&#8217;s a more macroscopic quantity that we really care about, which is the average or expectation or some sort of global macrostructure over a lot of possible futures.</p><p>In weather, it could be the same thing. We don&#8217;t exactly care what the velocity of wind 6,000 feet above a specific latitude and longitude is. We care where the hurricane is, and things like that. In chaos, there&#8217;s the classic <a href="https://en.wikipedia.org/wiki/Lorenz_system">Lorenz attractor</a>. If you start anywhere on the Lorenz attractor, you don&#8217;t know where you&#8217;re going to end up, but you do know that the thing looks like this. There&#8217;s a kind of beauty here. Sometimes we don&#8217;t necessarily care about the microscale things. We actually care about the macroscopic structure, and these things can be predictable.</p><p><strong>Dwarkesh Patel</strong></p><p>Contrast that to something like a <a href="https://en.wikipedia.org/wiki/Hash_function">hash function</a>, which is also incredibly dependent on initial conditions but doesn&#8217;t have a macrostructure, or at least hopefully doesn&#8217;t, if the algorithms work.</p><p><strong>Eric Jang</strong></p><p>One would hope.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s no equivalent of a value function or broadly how the weather is going to be that&#8217;s interesting there. It&#8217;s really just about what the board is going to look like exactly 100 moves from now.</p><p><strong>Eric Jang</strong></p><p>Intuitively, that seems correct. This is also out of my area of expertise, but I find it interesting that the tools of cryptography and hashing have also not been able to prove that you cannot come up with fast approximations. 
If they were able to do that, then you could prove P is not equal to NP.</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, we know that there&#8217;s structure in many cryptographic protocols, obviously <a href="https://en.wikipedia.org/wiki/RSA_cryptosystem">RSA cryptography</a>. There is structure, and that structure is what quantum computers exploit to break them.</p><p><a href="https://www.dwarkesh.com/p/reiner-pope">Reiner</a> has a <a href="https://reiner.org/neural-net-ciphers">very interesting blog post</a>, which we&#8217;ve talked about in the episode, where he points out that if you look at a high level at what cryptographic protocols look like and what neural networks look like, they&#8217;re extremely similar. You have sequential layers of jumbling information together. There&#8217;s a convergent evolution in the algorithms. In cryptography, you want the final state to be incredibly sensitive to initial conditions, so that it comes out looking jumbled if you change anything. In neural networks, you similarly want everything to be dependent on all the information, because you want to process all the information and consider how it relates to itself.</p><p><strong>Eric Jang</strong></p><p>You have the maximum power of a neural network at the edge of chaos. I think there are some <a href="https://sohl-dickstein.github.io/2024/02/12/fractal.html">research papers from Jascha Sohl-Dickstein</a> on this. There&#8217;s something quite fundamental about chaos that&#8217;s not just hopeless noise. There&#8217;s something useful in chaotic systems, at least at that boundary. But this is just thinking about it as a philosophy. I don&#8217;t know the math well enough to comment on it.</p><p>We&#8217;ll talk about LLM RL in a little bit, because there&#8217;s some connection there, but let&#8217;s go back to MCTS. What is it doing? Crucially, it is not saying that we&#8217;re going to increase the probability of winning directly. 
It&#8217;s not going to upweight all actions that won and downweight all actions that didn&#8217;t win.</p><p>Importantly, what it is doing is saying: for every action we took, we did a pretty exhaustive search with MCTS to see if we could do better, and we&#8217;re going to make every action that we took better by having the policy network predict that outcome instead. This is a very nice idea because you have one supervision target for every single action. The variance of your learning signal is very low compared to the alternative naive RL thing.</p><p>Let&#8217;s consider a very naive algorithm that looks a lot more like modern LLM RL today, where we do something like take the winner of a self-play game and encourage it to do more of that.</p><h3>01:25:38 &#8211; Alternative RL approaches</h3><p><strong>Eric Jang</strong></p><p>It&#8217;s worth thinking a little bit about what alternatives we could use to train self-play agents instead of MCTS. We use a lot of LLM-style RL these days. Is that relevant? Could we do that instead?</p><p>Suppose we have a very naive algorithm where we take a league of agents of different checkpoints and play them against each other. For the games where a single player wins, we reinforce those actions up and retrain the policy network to imitate those winners, instead of the MCTS objective. Let&#8217;s say you have a chain of actions that led to a win, and you have a matchup between two agents that are basically the same. Assume policy <em>a</em> (&#960;<em><sub>a</sub></em>) and policy <em>b</em> (&#960;<em><sub>b</sub></em>) are evenly matched, so their true win rate is 50%.</p><p>Let&#8217;s say you play 100 games, and each game lasts 300 moves. You&#8217;re doing some sort of evolution strategy or some way to perturb these things to get them to do different things. Or maybe you don&#8217;t, and you just play them against each other to see that occasionally one might have a better strategy than the other. 
Let&#8217;s say policy <em>a</em> wins 51 games and policy <em>b</em> wins 49. This is just due to random luck, or maybe you perturbed policy <em>a</em> in some way that let it do this.</p><p>Just to have a very simple model, let&#8217;s pretend that for 50 of the games, they played exactly equally. In the one game where <em>a</em> won, it played slightly differently. It made one critical move that normally it would have done differently, but due to some exploration or random noise, it happened to make a smarter move than it did previously.</p><p>So you have one true supervision signal for your policy network, and then you have 99 games times 300 moves for which imitating those actions gives you exactly the same policy you had before. The scale of your variance is actually very bad, because you only have one label out of this enormous dataset of supervision actions where you want&#8230; Actually, let me clarify a little bit.</p><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;re talking about how the good move, the out-of-distribution move, is a small fraction of all the moves that are played across all the games on which you&#8217;d want to train. This, of course, reminds me of how LLMs are trained with <a href="https://en.wikipedia.org/wiki/Policy_gradient_method">policy gradient methods</a>. When <a href="https://www.dwarkesh.com/p/andrej-karpathy">Karpathy</a> was on the podcast, he called it like &#8220;sucking supervision through a straw&#8221;. It&#8217;s interesting that this thing you&#8217;re saying&#8212;which would be intractable and prevents you from actually getting beyond a certain level in Go&#8212;is just by default how LLMs are trained?</p><p><strong>Eric Jang</strong></p><p>Right. This is not to say it doesn&#8217;t work. If you imagine increasing the number of games to millions of samples, you actually can get some meaningful supervision samples, so long as you find a way to mask out the supervision from these guys. 
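</p><p>A back-of-the-envelope count for the toy example above, as a sketch. It takes the stated simplification at face value (one informative move, plus 99 games of 300 moves that only replay the old policy). The variable names and the 0/1 mask are hypothetical illustrations, not a method described in the conversation:</p>
<pre>
```python
GAMES = 100
MOVES_PER_GAME = 300

# Simplification from the example: one move is a real improvement, and the
# moves of the other 99 games only reproduce the policy you already had.
informative_labels = 1
neutral_labels = (GAMES - 1) * MOVES_PER_GAME

# Naive imitation weights every label equally, so the informative share is tiny.
naive_share = informative_labels / (informative_labels + neutral_labels)

# Hypothetical ideal mask: weight 0 on neutral labels, 1 on the informative one.
masked_weight_total = informative_labels * 1.0 + neutral_labels * 0.0
masked_share = informative_labels / masked_weight_total

print(neutral_labels)          # 29700
print(f"{naive_share:.2e}")    # 3.37e-05
print(masked_share)            # 1.0
```
</pre>
<p>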
This is where things start to get pretty related to RL in terms of advantage and baselines.</p><p>Let&#8217;s look at the gradient variance of a very naive approach&#8212;I&#8217;m going to call it gradient RL&#8212;that&#8217;s basically the sum of rewards. The sum of rewards is the return. In our naive setup here, we only have an indicator variable for the return, where either you won or lost. In the case where you lost, your gradient is zero, so you don&#8217;t train on those examples, and when you won, you try to predict those things. You can think about this setup as a special case of this general formula here.</p><p>The trouble here is that this is very high variance. When you multiply these terms out to compute the variance of the gradient, it&#8217;s equal to the expectation of&#8230; Just for simplicity, we can pretend this is on average zero or something if you&#8217;re centering it at no signal. The variance here basically means that you&#8217;re taking the square of this product term. You end up with a term that grows quadratically with <em>T</em>. When you have a setup like this, this thing acts as a coupling effect on top of these terms here.</p><p>Let&#8217;s actually map this to an LLM case, and we can answer why LLMs only do one-step RL instead of a multi-step RL scenario. In LLMs, you have a <a href="https://magazine.sebastianraschka.com/p/understanding-encoder-and-decoder">decoder</a> that might predict some words like &#8220;hello world&#8221;. In current LLM RL, they treat this entire sequence as a single action, <em>a<sub>t</sub></em>, and big <em>T</em> is just one. It&#8217;s true that because of how transformers are formulated through the product of conditional probabilities, the log probability of the whole sequence is equal to the sum of the log probabilities of the individual tokens.</p><p>In this case, I would say something like log(hel) + log(lo) + log(world). This is true, and if this term were one, then they would be the same thing. 
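</p><p>A minimal sketch of the log-probability identity above, using made-up per-token probabilities for the pieces of &#8220;hello world&#8221;. The numbers and the names <code>token_probs</code> and <code>sequence_log_prob</code> are illustrative assumptions, not values from a real model:</p>
<pre>
```python
import math

# Hypothetical conditional probabilities for three tokens, each conditioned
# on the tokens before it (made-up numbers, not taken from a real model).
token_probs = [0.5, 0.8, 0.25]  # p("hel"), p("lo" | "hel"), p(" world" | "hello")

# The sequence probability is the product of the conditionals...
sequence_prob = math.prod(token_probs)

# ...so the sequence log probability is the sum of per-token log probabilities.
sequence_log_prob = sum(math.log(p) for p in token_probs)

print(math.isclose(math.log(sequence_prob), sequence_log_prob))  # True
```
</pre>
<p>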
However, when you&#8217;re sampling, if you have a reward term assigned to every specific token, now you have these interaction effects between the cross-multiplication of these terms and these terms. The problem becomes, how do you ascribe the credit associated with every episode to all these different terms?</p><p><strong>Dwarkesh Patel</strong></p><p>The thing I&#8217;m confused on is what it would even look like to do it that way in LLMs, because you only get a reward at the end of the episode.</p><p><strong>Eric Jang</strong></p><p>You could imagine a reward that says, &#8220;I&#8217;m going to give you some process supervision where you get a reward for each of these actions on every step.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, so you&#8217;re saying if instead of doing it that way&#8230; Well, the way you&#8217;ve written it, it would be a sum at the end anyway, so they wouldn&#8217;t have to be multiplied. But you&#8217;re saying instead of doing it that way, you would just add up these process rewards at the end and then treat that as one single reward signal?</p><p><strong>Eric Jang</strong></p><p>Correct, for one single log prob action.</p><p><strong>Dwarkesh Patel</strong></p><p>But isn&#8217;t that how it&#8217;s written to begin with anyways? The sum of the rewards?</p><p><strong>Eric Jang</strong></p><p>The thing that&#8217;s a little bit hidden here in the math is that we&#8217;re assuming that when you decompose the problem into a multi-step problem, you&#8217;re now introducing correlations between your actions through the computation of this guy. If you separate these things out, this will magnify the variance of this one.</p><p>In the case where you don&#8217;t separate it out, if you just have T=1, you have a single estimate of log prob and a single estimate of reward. This term still shows up. 
In LLMs, the naive <a href="https://cs229.stanford.edu/notes2020fall/notes2020fall/cs229-notes14.pdf">REINFORCE</a> estimator looks a bit like the return of the single action&#8230; It looks kind of like this. This is the basic form here, but it&#8217;s still a contributor to variance. You want to make sure that, similar to how in this case we were training on a lot of neutral labels, you&#8217;re sort of penalizing the labels that don&#8217;t help and only rewarding the ones that actually make you better.</p><p>Intuitively, the analogy is: can we find a term in our training objective such that it&#8217;s actually discouraged from doing this, or these don&#8217;t have any effect on the gradient, and this has an effect on the gradient?</p><p><strong>Dwarkesh Patel</strong></p><p>If you applied that there, the only thing you could do is eliminate 49 of the games. At least the way you have it written there, it would be 51 times&#8230;</p><p><strong>Eric Jang</strong></p><p>Actually, the optimal case is to discard all of these moves, and only get a gradient on that single move that you got better on.</p><p><strong>Dwarkesh Patel</strong></p><p>How would you do that?</p><p><strong>Eric Jang</strong></p><p>This is a pretty tricky problem in practice. This is where <a href="https://milvus.io/ai-quick-reference/what-is-the-advantage-function-in-rl">advantage estimation</a> happens in reinforcement learning. You want to subtract a term from your multiplier. Instead of an indicator function of one and zero, you want something that behaves like a zero for all of these guys, and then a one for these ones.</p><p><strong>Dwarkesh Patel</strong></p><p>You could do that if you can say, &#8220;Hey, I won this game, so this is slightly above baseline performance.&#8221;</p><p><strong>Eric Jang</strong></p><p>Well, you won a lot of games. 
But you don&#8217;t know which ones let you win because they were truly better versus winning by accident.</p><p><strong>Dwarkesh Patel</strong></p><p>How would you design a baseline where it&#8217;s truly better?</p><p><strong>Eric Jang</strong></p><p>This is where in RL people use things like TD learning to better approximate the quality function, the <em>Q</em> that we mentioned earlier. You can try to subtract that from your return.</p><p>Ideally, what you really want to do in RL is push up the actions that make you better than average and push down the actions that make you worse than average. They call this advantage. There are multiple ways to compute it. I highly recommend <a href="https://www.dwarkesh.com/p/john-schulman">John Schulman&#8217;s</a> &#8220;<a href="https://arxiv.org/abs/1506.02438">Generalized Advantage Estimation&#8221;</a> paper as a good treatment on how to think about various ways to compute it. At the end of the day, you want to reduce variance by trying to make this smaller, so it doesn&#8217;t magnify the variance of this one.</p><p><strong>Dwarkesh Patel</strong></p><p>That makes sense. This requires you to have a very good estimate of what average performance from a state would look like. This gets us back to the value function thing we were talking about earlier.</p><p><strong>Eric Jang</strong></p><p>Keep in mind that this <a href="https://en.wikipedia.org/wiki/Model-free_(reinforcement_learning)">model-free</a> RL setting is trying to solve a credit assignment problem where you don&#8217;t know which actions were actually good and which ones were bad. Monte Carlo tree search is doing something very fundamentally different. It&#8217;s not trying to do credit assignment on wins. It&#8217;s trying to improve the label for any given action you took. 
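</p><p>To make the baseline idea above concrete, here is a minimal sketch under stated assumptions: a hypothetical two-action policy with P(action = 1) = sigmoid(theta), and made-up returns where one action is only slightly better. It compares the exact mean and variance of the REINFORCE term (R - b) * grad log pi with no baseline and with the expected return as the baseline. The function names and the numbers are illustrative and do not come from the conversation.</p>

```python
def grad_log_prob(action, p_one):
    """d/d(theta) of log pi(action) for a policy with P(action = 1) = sigmoid(theta)."""
    return (1.0 - p_one) if action == 1 else -p_one


def estimator_stats(rewards, p_one, baseline):
    """Exact mean and variance of (R - baseline) * grad log pi over both actions."""
    probs = {0: 1.0 - p_one, 1: p_one}
    samples = {
        a: (rewards[a] - baseline) * grad_log_prob(a, p_one) for a in (0, 1)
    }
    mean = sum(probs[a] * samples[a] for a in (0, 1))
    var = sum(probs[a] * (samples[a] - mean) ** 2 for a in (0, 1))
    return mean, var


# Hypothetical returns: action 1 is only slightly better than action 0.
rewards = {0: 0.9, 1: 1.0}
p_one = 0.5

# Baseline = expected return under the current policy (0.95 here).
value_baseline = (1.0 - p_one) * rewards[0] + p_one * rewards[1]

mean_raw, var_raw = estimator_stats(rewards, p_one, baseline=0.0)
mean_adv, var_adv = estimator_stats(rewards, p_one, baseline=value_baseline)

print("no baseline:   mean", mean_raw, "variance", var_raw)
print("with baseline: mean", mean_adv, "variance", var_adv)
```

<p>In this toy case both estimators have the same expected gradient, but subtracting the baseline removes nearly all of the variance, because only the better-than-average part of the return is left to multiply the log-prob gradient.</p><p>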
We can actually think about a completely different algorithm called <a href="https://towardsdatascience.com/neural-fictitious-self-play-800612b4a53f/">neural fictitious self-play</a>, which was used to great effect in systems like <a href="https://en.wikipedia.org/wiki/AlphaStar_(software)">AlphaStar</a> and <a href="https://en.wikipedia.org/wiki/OpenAI_Five">OpenAI&#8217;s Dota</a>.</p><p>Let me talk a little bit about how you can unify some of these RL ideas in the model-free setting as well as the self-play setting. What happens if you don&#8217;t have the ability to easily search a tree? In Go, it&#8217;s a perfectly observable game. You can easily construct a pretty deep tree that completely captures the game state. In a game like StarCraft where you don&#8217;t have complete control over the binary, it&#8217;s a little bit hard to do this, and I&#8217;m not even sure if it&#8217;s a deterministic game. That makes this difficult from a data structures perspective.</p><p>What is done instead is that the basic idea of supervising your actions with a better teacher is still there. We&#8217;re going to talk a little bit about how neural fictitious self-play works. It&#8217;s the same idea. We&#8217;re going to come up with better labels for each of the actions we took, just like in MCTS. But how do we derive the better labels? In MCTS, we perform search. Assuming we have a good value function, the search will give us a better result than our initial guess.</p><p>In a game where you can&#8217;t easily simulate a search process, what they do instead is train what&#8217;s known as a best response policy. You fix your opponent. Let&#8217;s say you&#8217;re currently training &#960;<em><sub>a</sub></em> against a strong opponent, &#960;<em><sub>b</sub></em>. In StarCraft, maybe these are the Zergs and you&#8217;re playing Protoss or something. You fix your opponent, and you treat this as a classic model-free RL algorithm, where your goal is just to beat this guy. 
Here you use your standard TD-learning-style tricks, or use <a href="https://en.wikipedia.org/wiki/Proximal_policy_optimization">PPO</a> or any model-free RL algorithm to try to <a href="https://en.wikipedia.org/wiki/Hill_climbing">hill climb</a> to winning against this player. You have a reward function where the return is one if it wins against &#960;<em><sub>b</sub></em>, and zero otherwise. This is no longer a self-play problem. It&#8217;s just a fixed opponent, and you&#8217;re trying to maximize a score against them. You have a fixed environment where all you care about is beating this guy.</p><p>Once you have a good policy that you train with your favorite model-free RL algorithm&#8212;PPO or <a href="https://www.geeksforgeeks.org/deep-learning/soft-actor-critic-reinforcement-learning-algorithm/">SAC</a> or <a href="https://arxiv.org/abs/1909.12238">V-MPO</a>&#8212;you now have a good policy that gives you a good label for what this one should do when playing against that player. When you train multiple best response policies, you can basically distill the RL algorithms into the labels for a given opponent. You might have a best response policy against &#960;<em><sub>b</sub></em>, and then maybe you have a league of opponents like &#960;<em><sub>b</sub></em>, &#960;<em><sub>c</sub></em>, &#960;<em><sub>d</sub></em>. You&#8217;re going to take the best response policy that you train against each of these fixed opponents, and supervise them with the label that this one would provide.</p><p>This is almost like a proxy for your MCTS teacher. Instead of an MCTS teacher, you use a model-free RL algorithm to find the best search action that you could do to beat your opponent. Then you&#8217;re distilling the policy here into what&#8217;s known as a mixed strategy, where it&#8217;s trying to average across all possible opponents you could play against. This is what gives you something that can do no worse than an averagely selected opponent from the league. 
This gets around the problem of having to derive a teaching signal from MCTS, but fundamentally it&#8217;s still about relabeling your states with better actions so that they improve your policy.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understand, if you win a game against this other policy, you reinforce all the actions on that trajectory.</p><p><strong>Eric Jang</strong></p><p>Yes. Here you can use a number of algorithms like PPO, V-MPO, or even <em><a href="https://en.wikipedia.org/wiki/Q-learning">Q</a></em><a href="https://en.wikipedia.org/wiki/Q-learning">-learning</a> if you want. The specific algorithm here is usually a model-free thing because you don&#8217;t have search, but there&#8217;s an interesting connection between MCTS and Q-learning that I want to bring up.</p><p>In MCTS, you have a tree, and through the resolution of your value function at the approximate leaves of the tree, you can back up through many sequences and obtain some sort of mean value estimate. Your <em>Q</em> is derived from the average of a bunch of simulations.</p><p>In model-free algorithms, there&#8217;s often a component of estimating a <em>Q</em> value. <em>Q</em> values are often learned through TD learning, although in PPO, the way they do advantage estimation is not necessarily through a <a href="https://spinningup.openai.com/en/latest/spinningup/rl_intro.html#:~:text=The%20term%20%E2%80%9CBellman%20backup%E2%80%9D%20comes,%2Dplus%2Dnext%2Dvalue.">Bellman backup</a>. In <em>Q</em>-learning, there&#8217;s this very cool trick where <em>Q</em>(<em>s</em>,<em>a</em>) is backed up as <em>r</em> plus some discount factor times the max over <em>a,</em> <em>Q</em> of your next step. 
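</p><p>The backup just described can be sketched on a toy problem. The example below is a minimal illustration under stated assumptions: a hypothetical four-state chain with known deterministic dynamics, swept exhaustively over every state and action rather than learned from sampled trajectories. The names step, q_backup, GAMMA and TERMINAL are invented for the illustration.</p>

```python
GAMMA = 0.9
TERMINAL = 3        # hypothetical chain: states 0, 1, 2, then terminal state 3
ACTIONS = (0, 1)    # 0 = stay in place, 1 = move one state to the right


def step(state, action):
    """Toy deterministic dynamics; reward 1.0 only for entering the terminal state."""
    next_state = min(state + action, TERMINAL)
    done = next_state == TERMINAL
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_backup(q, state, action):
    """Bellman backup: Q(s, a) = r + gamma * max over next actions of Q(s', a')."""
    next_state, reward, done = step(state, action)
    bootstrap = 0.0 if done else GAMMA * max(q[next_state])
    return reward + bootstrap


# Q-table for the non-terminal states, initialised to zero.
q = [[0.0 for _ in ACTIONS] for _ in range(TERMINAL)]

# Repeated sweeps propagate what is known near the goal backward through the chain.
for _ in range(20):
    for state in range(TERMINAL):
        for action in ACTIONS:
            q[state][action] = q_backup(q, state, action)

# A policy can be read off the table by taking the argmax at each state.
greedy_policy = [max(ACTIONS, key=lambda a: q[s][a]) for s in range(TERMINAL)]

print(q)
print(greedy_policy)
```

<p>After a few sweeps the value of moving right is 1.0 next to the goal, about 0.9 one state earlier, and about 0.81 two states earlier, which is the recursive consistency the backup enforces.</p><p>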
Intuitively, if you have an MDP and this is terminal, it&#8217;s saying that the best action you can take at this state is equal to the reward you get for taking this action, plus the best that you can do at the next state.</p><p>There&#8217;s a recursive, <a href="https://en.wikipedia.org/wiki/Dynamic_programming">dynamic-programming</a> property of MDPs, and you can train neural networks to basically enforce this consistency. You can say, &#8220;Well, once I know the <em>Q</em> value of this action, I can then use that to compute something about the <em>Q</em> value for the next.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>Earlier I was like, &#8220;Hey, why are we training policy? Why don&#8217;t we just train the value alone?&#8221; That is what this is.</p><p><strong>Eric Jang</strong></p><p>This is an algorithm for recovering value estimates of intermediate steps when you don&#8217;t have the ability to do forward search. You must collect a trajectory first of <em>n</em> steps before you&#8217;re able to do this trick. But the intuition is kind of the same. Knowing something about the <em>Q</em> value here can tell you something about the <em>Q</em> value here. And indeed, you can recover a policy from a <em>Q</em> value. You don&#8217;t need to explicitly model the policy distribution. You can recover the policy distribution by doing argmax over your <em>Q</em> values.</p><p><em>Q</em>-learning, or approximate dynamic programming, propagates what you know about the future <em>Q</em>s backward like this. You can see that there&#8217;s a similar structure going on here, where in one case you&#8217;re planning over trajectories your agent hasn&#8217;t been to yet, whereas in the other you&#8217;re planning over trajectories your agent has visited.</p><p>Importantly, why was <em>Q</em>-learning a big deal? It&#8217;s because historically we just haven&#8217;t had the ability to do search on fairly high-dimensional problems like robotics or whatever. 
For a long time, we made the assumption that if we can&#8217;t model the dynamics with a world model, we&#8217;re going to instead just collect trajectories and then plan with respect to the only number that really matters, which is reward.</p><h3>01:45:47 &#8211; Why doesn&#8217;t MCTS work for LLMs</h3><p><strong>Dwarkesh Patel</strong></p><p>This is very interesting. To unify this with our discussion of LLMs, you don&#8217;t have <em>Q</em> values, but you&#8217;re doing this backwards learning where you find the trajectories that pass some unit test in some coding environment and then reinforce those trajectories.</p><p>There&#8217;s a huge difference between that and this forward approach with MCTS. The reason it&#8217;s much more preferable to do MCTS is because you can do it per move and make each move better, rather than having to learn per trajectory and hope, as Karpathy said, to learn this&#8230;</p><p><strong>Eric Jang</strong></p><p>Through a straw.</p><p><strong>Dwarkesh Patel</strong></p><p>Right, you get supervision through a straw. You basically just upgrade all the tokens in a trajectory that might or might not have been relevant to getting the answer right.</p><p>The reason you can do this much more sample-efficient, favorable thing with Go is that because MCTS works in Go, you basically know that if you just do the search locally here&#8212;and this search is truncated at the end by a value function that works even if you haven&#8217;t unfolded your whole trajectory&#8212;you can just say, &#8220;This is my new policy,&#8221; and improve in a more iterative, local way, rather than having to unfold all these trajectories.</p><p><strong>Eric Jang</strong></p><p>There was <a href="https://arxiv.org/abs/2305.10601">some research</a> from Google in 2023 or 2024 where they tried to apply tree structures to reasoning. The jury is still out as to whether this can ever work. 
We probably will see a revisiting of this idea of forward search in the future.</p><p>But there are two things that make MCTS very simple for Go. Value estimation is concrete. You can determine it for real, and then you can use it to truncate depth, as you said. The breadth is also determined. What&#8217;s critical is that the action selection algorithm, where you iteratively visit and grow the tree, is well suited for the size and depth of problem that Go is.</p><p>But for something like LLM reasoning, PUCT might not be a good enough heuristic. It might be too greedy with local tokens, and it might only give you obvious thoughts that are correct but don&#8217;t really solve your final problem. I would say the jury is probably still out on what the final instantiation of reasoning for LLMs will look like. I wouldn&#8217;t rule out that this stuff could come back, but it&#8217;s a bit hard.</p><p><strong>Dwarkesh Patel</strong></p><p>Don&#8217;t LLMs natively learn to do MCTS, where they&#8217;ll try an approach and be like, &#8220;Oh, that doesn&#8217;t work. Let&#8217;s back up. Let&#8217;s try this other thing,&#8221; and then go in the direction that proves to be more fruitful?</p><p><strong>Eric Jang</strong></p><p>Certainly, LLMs manage to do something that looks like real human reasoning without having to do an explicit tree structure. That being said, I think the idea of doing forward search and simulation to get a better sense of what is valuable might make a comeback, even if not in exactly the same instantiation as AlphaGo.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understand the crux of it: the breadth comes from the number of legal actions being wider, and the depth comes from not being able to train a value function as easily, because&#8230;</p><p><strong>Eric Jang</strong></p><p>Here&#8217;s an example where LLMs break down. The C<sub>PUCT</sub> rule involves the &#8730;<em>N</em> over 1+<em>N<sub>a</sub></em>. 
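</p><p>For reference, the selection rule being described can be sketched as follows. This is a minimal illustration assuming the AlphaGo Zero style form, where each child is scored as Q + c_puct * P * sqrt(N) / (1 + N_a), with N the total visit count at the node and N_a the visit count of the child. The constant 1.5 and the example numbers are hypothetical.</p>

```python
import math


def puct_scores(priors, visit_counts, q_values, c_puct=1.5):
    """Score each child as Q + c_puct * P * sqrt(total visits) / (1 + child visits)."""
    sqrt_total = math.sqrt(sum(visit_counts))
    return [
        q + c_puct * p * sqrt_total / (1 + n)
        for p, n, q in zip(priors, visit_counts, q_values)
    ]


# Hypothetical node with three children; the first has already been visited 10 times.
priors = [0.5, 0.3, 0.2]
visit_counts = [10, 0, 0]
q_values = [0.1, 0.0, 0.0]

scores = puct_scores(priors, visit_counts, q_values)
selected = max(range(len(scores)), key=scores.__getitem__)

print(scores)
print("selected child:", selected)
```

<p>Here the heavily visited child loses its exploration bonus and the search moves on to the unvisited child with the next highest prior. The rule relies on children being revisited so that the 1 + N_a denominator grows.</p><p>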
In an LLM, you&#8217;re most likely never going to sample the same child more than once. If you have multiple steps of thinking, because language is so broad and open-ended, a discrete set of actions is not really an appropriate choice for an LLM. Even though they&#8217;re discrete tokens, it&#8217;s just such a large number that this type of exploration heuristic is probably not the right thing to guide how to search down a tree.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess the crux comes down to the fact that in Go, you know that MCTS is almost certainly better than your current policy, even though you haven&#8217;t explored the end of any trajectory. In normal reasoning for LLMs or robotics, there&#8217;s no way to just locally evaluate and improve your next move in a way that&#8217;s independent of actually solving the problem.</p><p><strong>Eric Jang</strong></p><p>&#8220;No way&#8221; is a strong word. Lots of people have thought about how to apply MCTS or its successors like MuZero to continuous control spaces, and I&#8217;m sure very cool research work is still ongoing to try to crack that problem. But yes, the apparent challenge right now is that most problems in much higher-dimensional action spaces, or something combinatorially much bigger like language, don&#8217;t seem as amenable to the discrete action-selection heuristics and game-evaluation type stuff that Go has.</p><p>That&#8217;s not to say the idea of thinking into the future along multiple parallel tracks might not give you some information about which way to search. If you think about mathematics, it often occupies more of a logical search procedure where you can back up and see which paths seem good or not. 
There&#8217;s more of a rigid structure there, whereas in a business negotiation it&#8217;s less of a tree and maybe something a bit different.</p><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;re now seated, so I can ask you some more questions about AlphaGo and AI research more generally. In 2021, <a href="https://scholar.google.com/citations?user=wjU_zmMAAAAJ&amp;hl=en">Andy Jones</a> had a paper called <a href="https://arxiv.org/pdf/2104.03113">&#8220;Scaling Scaling Laws with Board Games&#8221;</a>, and he basically anticipated inference compute, or inference scaling, by showing that you can trade off <a href="https://huggingface.co/blog/Kseniase/testtimecompute">test-time compute</a> and training compute. That is to say, you can spend more compute searching through MCTS, and if you do that, you can get the equivalent performance to having spent more time training the model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!211l!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!211l!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 424w, https://substackcdn.com/image/fetch/$s_!211l!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 848w, https://substackcdn.com/image/fetch/$s_!211l!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 
1272w, https://substackcdn.com/image/fetch/$s_!211l!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!211l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png" width="1229" height="945" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:945,&quot;width&quot;:1229,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2361343,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!211l!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 424w, https://substackcdn.com/image/fetch/$s_!211l!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 848w, https://substackcdn.com/image/fetch/$s_!211l!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 1272w, 
https://substackcdn.com/image/fetch/$s_!211l!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F98fa0b5b-8ad2-4971-bec4-b892c76d6f20_1229x945.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>If you see this pattern, you might think that with LLMs you might do something similar in the future, and in fact that&#8217;s what ended up happening.</p><p>What is a fun exploration one could do now to explore other axes of scaling in toy settings, which will be important to understanding what AI development might be like in a few years?</p><p><strong>Eric 
Jang</strong></p><p>Test-time scaling and reasoning, and how they interact with model size, are quite profound when it comes to how much needs to actually be done as explicit search versus how much can be packed into the forward pass of a neural network. How does a forward pass of a neural network learn to do something that should be sequential and recursive? That&#8217;s quite interesting.</p><p>The Andy Jones &#8220;Scaling Scaling Laws with Board Games&#8221; paper is quite cool. There was another really nice result from that paper. Not only can you predict <a href="https://en.wikipedia.org/wiki/Neural_scaling_law">scaling laws</a> of the LLM variety&#8212;where as you increase parameters you can decrease the amount of compute for search or vice versa&#8212;but he also showed that you can predict how much compute is needed to solve a larger version of the board game.</p><p>For example with Go, which can scale from 3x3 to an infinitely sized Go board, you might be able to revisit this question and try to reproduce whether this shows up. I actually started this project with the motivation of asking: does the &#8220;<a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">Bitter Lesson</a>&#8221;, or our knowledge of scaling laws, allow us to execute a lot better on a compute-optimal Go bot? Can we build a strong Go bot without all of the KataGo tricks, just by really focusing on the Bitter Lesson and scaling laws?</p><p>I have not been successful so far, but usually when you want scaling laws to work, you want to be in the regime where the recipe already works and the data sets are good, rather than trying to figure out how to do scaling while also trying to figure out what the right data sets are.</p><p>The scientific understanding component in research often follows a step where you get something to work first. Then you use that system to collect data that then helps you build a mental model of how things work, such as scaling laws. 
Usually, if you want to build a strong Go bot using scaling laws, you actually have to make a strong Go bot first and then use the scaling laws to extrapolate a bit further into the future.</p><p><strong>Dwarkesh Patel</strong></p><p>Say more. Just so I understand, are you saying scaling laws did not work? There was no scaling-laws pattern that you could see in your Go bot?</p><p><strong>Eric Jang</strong></p><p>A mistake I made initially, when I had some bugs around how MCTS labeling was working, was that I would collect a bunch of data with an expert policy and then treat it as a supervised learning problem and try to identify scaling laws with expert data sets. You can indeed plot things that look like this, but if you&#8217;re in a regime where your policy is not working well, you might just be studying scaling laws on bad data.</p><p>One important implementation detail is that if you want to study a scaling-laws problem, you have to have a problem for which the data is good, the architecture is good, and there are no bugs, and then you solve it from there.</p><p>Ex ante, I wasn&#8217;t able to apply scaling laws to direct what to look at until I had the rest of the system working. This sounds obvious to researchers. Of course you want a working, bug-free system before you study scaling. But as advice for practitioners, on where I tripped up when I started this project, you don&#8217;t necessarily want to jump into the science of studying your man-made artifact before your man-made artifact is interesting enough to be studied.</p><p><strong>Dwarkesh Patel</strong></p><p>Speaking of compute, you can look at these charts of compute used to train the best AI model in the world over time going back 10 years. It&#8217;s a very smooth line in log space that grows exponentially year over year. Except there&#8217;s this huge aberration, and that aberration is AlphaGo Zero, which was trained on way more compute than any other AI model at the time. It was 3E23 FLOPs. 
It&#8217;s comparable to a frontier LLM, orders of magnitude off, but still.</p><p>The question is, especially with you being able to get something off the ground, did you train it on your own?</p><p><strong>Eric Jang</strong></p><p>I got a donation from <a href="https://www.primeintellect.ai/">Prime Intellect</a> for about $10K. I spent maybe the first $4K doing exploratory research, and then about $3K on the final run. Some of it remained for serving the model.</p><p><strong>Dwarkesh Patel</strong></p><p>Is your sense that they did a bad job training it, if you can do it in $10K now?</p><p><strong>Eric Jang</strong></p><p>The compute required to be the first to do something is always much larger than the compute it takes to catch up. It&#8217;s the same story playing out in LLMs. Once someone else has done it, you can use tricks like <a href="https://en.wikipedia.org/wiki/Knowledge_distillation">distillation</a>. You can use all sorts of crutches to bootstrap your way to success.</p><p>With my own bot that I&#8217;ve hosted online, I actually used best-response training against the KataGo models to get a strong level of performance. As of the time of recording, I&#8217;m validating whether I can do that first step, which is to do the tabula rasa play.</p><p>Importantly for research, you often want to start from a good init. The simple thing I did first was train best-response agents against KataGo. The AlphaZero team did not have any policy they could train against, because they were trying to do everything tabula rasa. Being the first to do it means you&#8217;re prioritizing getting the thing working rather than the most compute-efficient possible implementation.</p><p>This plays out in robotics as well. If you look at the frontier of large models trained for robotics, the scatter plot is all over the place, and there isn&#8217;t a very clean line the way there is for frontier LLMs. 
That&#8217;s because the folks training these models often are not at the scale where every FLOP counts and they need to squeeze out the performance of every single FLOP as the deciding factor in pre-training. Instead, their focus is more, &#8220;We want a certain capability to show up, so we optimize the training setup to make it easy to derive that capability.&#8221;</p><p>Once you have that capability&#8230; Invariably if you scale up the compute, you are forced to make it compute-efficient, because this is hundreds of millions of dollars we&#8217;re talking about. But in the past, when compute for experiments was more plentiful&#8212;or not accounted for in a way the researcher was really responsible for&#8212;you end up with people optimizing for things besides being on the compute-optimal Pareto frontier.</p><p><strong>Dwarkesh Patel</strong></p><p>I see, like speed or something?</p><p><strong>Eric Jang</strong></p><p>Yeah, time to result, or just getting it to work. The first AlphaGo probably had lots of compute, and they didn&#8217;t need to worry too much about making it the most compute-optimal thing.</p><p><strong>Dwarkesh Patel</strong></p><p>How much of the improvements to compute efficiency are methods that did not exist as of 2017 versus things which they could have done in 2017 but didn&#8217;t?</p><p><strong>Eric Jang</strong></p><p>Great question. Going into this project, I knew in the back of my mind that things always get easier to do over time. I wanted to see where Go was at, given that there hadn&#8217;t seemed to be any major open-source strong bot after KataGo in 2020.</p><p>Reading the KataGo paper, there were a lot of clever ideas. I was thinking, &#8220;Let&#8217;s see if the Bitter Lesson had happened, where a lot of these tricks just go away because Nvidia made faster <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPUs</a>. Roughly, where are we on that?&#8221;</p><p>Again, this is not a peer-reviewed claim. 
It&#8217;s just my preliminary vibe guess based on what I&#8217;ve seen with my own experiments. It seems like architecture choices don&#8217;t matter that much. Transformer versus ResNet&#8230; We&#8217;re at the speed of GPU where the size of the model is not so big that this really matters.</p><p>You can simplify the setup quite a lot. Instead of doing a distributed asynchronous RL setup with replay buffers and pushers and collectors, you can do a dumb synchronous thing where you collect, train a supervised learning model, and then collect again. There are opportunities to simplify infrastructure.</p><p>Nvidia GPUs have indeed gotten faster. Whereas KataGo was trained on <a href="https://www.nvidia.com/en-gb/data-center/tesla-v100/">V100s</a>, you can train on half the number of desktop <a href="https://en.wikipedia.org/wiki/Blackwell_(microarchitecture)">Blackwell</a> GPUs and it still works. Some of the auxiliary supervision objectives that KataGo developed aren&#8217;t really necessary if you have a strong initialization. If you&#8217;re initializing best-response training against KataGo itself, your own model needs none of the tricks that KataGo needs. The core thing is getting as quickly as possible to some strong opponents. That matters a lot more than the specific architectural innovations.</p><p>But there are still some nice compute multipliers. I found that training on 9x9 boards was very nice for resolving endgame value functions. If you can co-train that on an architecture that can transfer between 9x9 and 19x19, you can really cut down the warm-start time to learn from scratch. AlphaGo Zero&#8217;s plot showed the first 30 hours or so spent basically catching up to the supervised learning baseline. You can cut down that time a lot by pre-training on a small board and then warm-starting that into your 19x19 board play.</p><p>There was some other stuff, like varying the number of sims between episodes. This turns out to be not that sensitive. 
You can fix it or increase it. It doesn&#8217;t matter too much. Anyway, it&#8217;s nice from a scientific perspective, revisiting an old paper and seeing what really matters.</p><h3>02:01:09 &#8211; Off-policy training</h3><p><strong>Dwarkesh Patel</strong></p><p>This is a tangential question, but why is it okay to have a buffer in AlphaGo? Every time I talk to an AI researcher, they&#8217;re telling me about how bad it is to be off-policy. But the way a naive implementation of AlphaGo Zero would work is that most of the moves in a given backward step, or in a batch of backward steps, would not be among the ones made by the most recently trained model. So why is that okay?</p><p><strong>Eric Jang</strong></p><p>Great question. This gets into the fundamental <a href="https://www.geeksforgeeks.org/machine-learning/on-policy-vs-off-policy-methods-reinforcement-learning/">off-policy versus on-policy</a> reinforcement learning questions. As you&#8217;ll recall, in MCTS, you take actions that you took, and you relabel them to take different actions on the same states.</p><p>The off-policy part comes in where, what if you&#8217;re relabeling states that your new policy would never visit? What&#8217;s the point? You&#8217;re wasting capacity. In the extreme limit, imagine the distribution of states in your training buffer are all states you would never visit. Then you&#8217;re supervising them to take good actions on states you would never achieve. Therefore, your policy can get really bad. This is where off-policy can really hurt AlphaGo.</p><p>However, if you interpret this from the DAgger perspective&#8212;which is basically a way to correct yourself back to the optimal trajectory given some data&#8212;what you want in an algorithm like this is to have mostly states that you would visit, but then a small or reasonable percentage of states in this high-dimensional tube around your optimal trajectories. 
Any of those states are given a supervision target to funnel you back into your optimal trajectory.</p><p>Maybe I can just draw real quickly here. In a DAgger-style setup, your optimal training data distribution is your optimal states and actions. You want to be in this state, you want to be in this state, you want to be in this state, and then you win here. These are your optimal policy actions. These are the things you definitely want to train on. But to make it robust to disturbances, you want to make sure that if you happen to drift off into some other states, you can funnel yourself back in.</p><p><strong>Dwarkesh Patel</strong></p><p>But why isn&#8217;t this a fully general argument for off-policy training?</p><p><strong>Eric Jang</strong></p><p>This is actually why you want to do off-policy training sometimes. You don&#8217;t want a compounding error where, if you make a mistake, you don&#8217;t have the data for how to return to your optimal distribution.</p><p>Optimal control doesn&#8217;t really say too much about how to not accidentally get here, because it&#8217;s making the assumption that once you learn the policy, you&#8217;re going to get to here. But in applications like robotics, a gust of wind blows you slightly off and now you need to correct. Or the friction on one of your tires is slightly lower than the other wheel, and now your car is drifting and you have to correct it. These kinds of things often happen in more real environments.</p><p>There was a funny quote about chess and Go. The problem with Go and chess is that the other player is always trying to do some shit. Things can drift off, and you always want to be able to correct back to your winning condition. 
Your replay buffer really should have the states your policy would visit, plus some distribution of states you might drift to, and then how to return back to your optimal states.</p><p>Now, if you take this to the extreme and you say, &#8220;We don&#8217;t have any of this data, and we&#8217;re just going to be labeling with MCTS states that are so far away from our optimal behavior, like this bag of states over here&#8230;&#8221; Well, now each of them gets an MCTS label, and your policy learns how to take the best possible action here, but you never get here. You&#8217;re training your model on states you would never reach. This is where off-policy can really hurt.</p><p>As part of this project, I did try an experiment where I took a bunch of trajectories, and to try to saturate the GPU as much as possible, I took random states from the dataset and reran MCTS on just those states. Instead of playing a whole game where I&#8217;m doing MCTS on every move, I ignore the causality of moves and pick random board states, and I label those with my current network. I might revisit old states I&#8217;ve labeled before and relabel them again with my current network.</p><p>In practice, this actually does work. You can take some states that are reasonable and constantly be relabeling them while training. This starts to converge on a very robotics-like setup, which is very common. You have your dataset of trajectories, and then you have something like a replay buffer pusher. These are off-policy offline trajectories. Your replay buffer pusher pushes transition tuples to the replay buffer, and then you have a job that&#8217;s continuously replanning what the best action you should have done instead of the action you took.</p><p>In robotics, it&#8217;s very common to minimize TD error. 
Your <a href="https://www.datacamp.com/tutorial/bellman-equation-reinforcement-learning">Bellman updater</a> is constantly pulling things from here and trying to satisfy <em>Q</em>(<em>s</em>,<em>a</em>). From here you have your trainer, which is trying to fit the <em>Q</em> to the <em>Q</em><sup>target</sup>. You can think about this as a sort of planner. You visit old states you&#8217;ve been to, and you take your current model and rethink. What could I have done better if I visited this? This is how off-policy robotic learning systems are usually trained. These days there&#8217;s a simpler recipe, but in the Google <a href="https://sites.google.com/view/qtopt/home">QT-Opt</a> days, we did things like this.</p><p><strong>Dwarkesh Patel</strong></p><p>So what is the trainer?</p><p><strong>Eric Jang</strong></p><p>The trainer is where you try to minimize the difference between <em>Q</em>(<em>s</em>,<em>a</em>) and <em>Q</em><sup>target</sup>.</p><p><strong>Dwarkesh Patel</strong></p><p>Can you explain the whole setup again at a high level?</p><p><strong>Eric Jang</strong></p><p>You have your off-policy data that came from various policies. You&#8217;re constantly pushing transitions you saw before to a replay buffer.</p><p>Then you&#8217;ve got this thing called a Bellman updater, which basically replans: instead of this action, what action should I have taken at <em>s</em> to have a better value? The way you enforce that is by trying to minimize the TD error. Given this, you have <em>s&#8217;</em>. You compute <em>Q</em>(<em>s</em>&#8217;,<em>a</em>) and you find the action that should go with <em>s</em>&#8217; that makes this <em>Q</em> value as high as possible.</p><p>Then you add that to the reward here, and that gives you your actual target. For this current <em>s</em> and <em>a</em>, your <em>Q</em><sup>target</sup> is this. Now you send back the <em>Q</em><sup>target</sup> to this transition. With this tuple, you pair a <em>Q</em><sup>target</sup>. 
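</p><p>[Editor&#8217;s note: the whiteboard is not visible in the transcript. Assuming the standard one-step Q-learning target, with a discount factor &#947; that is not mentioned in the conversation, the quantity being described would be <em>Q</em><sup>target</sup>(<em>s</em>, <em>a</em>) = <em>r</em> + &#947; max<sub><em>a</em>&#8217;</sub> <em>Q</em>(<em>s</em>&#8217;, <em>a</em>&#8217;). As a purely hypothetical example, with <em>r</em> = 0, &#947; = 0.9, and a best next-state value of 0.5, the target would be 0 + 0.9 &#215; 0.5 = 0.45. The actual QT-Opt recipe may differ in its details.]</p><p>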
Then on the trainer side, you simply use supervised learning and minimize your current network&#8217;s <em>Q</em>(<em>s</em>, <em>a</em>) with its target.</p><p><strong>Dwarkesh Patel</strong></p><p>Got it. So in the background you&#8217;re basically thinking through how valuable all these actions actually were.</p><p><strong>Eric Jang</strong></p><p>Yeah. In a more optimal policy where you&#8217;re trying to maximize this, what is the <em>Q</em><sup>target</sup> of this transition?</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s sort of like daydreaming.</p><p><strong>Eric Jang</strong></p><p>Exactly. You can think about it like you&#8217;re going back in hindsight. Given what I&#8217;ve seen in the historical buffer, was there a better action I could have taken?</p><p>The connection to Go here&#8212;I tried and it was moderately successful, but too complex to open source&#8212;was replacing this with an MCTS relabeler. Instead of doing this target network computation, you run MCTS on your transition. In this case, you have your state, your action, and whether you won or not at the game. Actually, you can just toss those two. You don&#8217;t care about these ones. You just take your state and plan MCTS to get your best policy, &#928;, on your current network. Not the network that took this action, but your current best policy network. You rerun your search offline on these transitions.</p><p>If these are transitions your policy can get to, this actually acts as a very nice stabilizing effect. Another benefit is that you can fully saturate your GPU better because you&#8217;re not blocking on the Go game to give you board states. You simply search across all board states at any depth in parallel. Here the trainer would just predict the MCTS label as best as possible.</p><p>This kind of works, and it is quite relevant in robotics where you have a lot of offline data and you can&#8217;t simulate things like MCTS. 
But in practice, it does run into the problem that if the current model is looking at states it would never reach, it&#8217;s wasting capacity. You have to be a little bit careful here.</p><p>Much of RL has converged to a much more on-policy setup where they don&#8217;t really try to directly train on off-policy data. At best, they use off-policy data as a way to reduce variance, but not to directly influence the objective.</p><p><strong>Dwarkesh Patel</strong></p><p>Why have they converged to that?</p><p><strong>Eric Jang</strong></p><p>It&#8217;s just more stable. You might use the off-policy <em>Q</em> as a way to do advantage computation, like <em>Q</em> minus the average of <em>Q</em>. If there are <em>N</em> actions, this average is your value, and these are your current <em>Q</em> values. Your advantage for that action is your current <em>Q</em> value minus the average value. People can try to estimate <em>Q</em> in an off-policy way and then just use advantage here.</p><p>If there&#8217;s a problem in these dynamics, it doesn&#8217;t blow up your loss as much. In robotics, there&#8217;s a convergence toward using off-policy data to shape your rewards but not actually be directly here.</p><h3>02:12:02 &#8211; RL is even more information inefficient than you thought</h3><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m reminded of our earlier conversation about why MCTS is so favorable compared to the REINFORCE or policy gradient thing LLMs do. This might be totally wrong, but <a href="https://www.dwarkesh.com/p/bits-per-sample">I wrote a blog post</a> a few months ago about how policy gradient RL is even more inefficient than you might think.</p><p>The inefficiency one thinks about naively is the fact that you have to roll out a whole trajectory to get any learning signal at all. 
As these trajectories become longer and longer&#8212;as an agent has to do two days&#8217; worth of work to figure out if it did a project correctly, rather than just completing the next word in a sentence&#8212;the amount of information per FLOP decreases. You have to unroll two days&#8217; worth of thinking to see if you implemented a feature correctly, so the number of samples per FLOP has been decreasing.</p><p>As you&#8217;re learning, you&#8217;re trying to maximize bits per FLOP. You can think of bits per FLOP as samples per FLOP times bits per sample. What I mentioned a second ago is that the samples per FLOP go down as RL becomes more long-horizon.</p><p>But this kind of naive RL is also terrible from a bits-per-sample perspective, at least compared to supervised learning. Early on in training, let&#8217;s say you have an LLM with a vocabulary size of 100K, so there are 100K possible tokens that one could answer. You have a totally untrained model and a prompt like, &#8220;The sky is&#8230;&#8221; With supervised learning, the model would have some probability distribution over all the things it could say. There&#8217;s a label that says the term here is &#8220;blue,&#8221; and it would learn through cross-entropy loss exactly how far its distribution is from correctly saying &#8220;blue.&#8221;</p><p>If you were doing this through RL, the model would try, &#8220;The sky is halcyon.&#8221; Nope, that&#8217;s wrong. &#8220;The sky is told.&#8221; Nope, that&#8217;s wrong. This is a totally untrained model. You would have to do this on the order of 100K times just to stumble on &#8220;blue&#8221; and get some learning signal. In the supervised learning regime, you have your distribution of probabilities, you get told that it&#8217;s &#8220;blue,&#8221; and you figure out how far off you were. The amount you learn is a function of your pass rate. 
The further away you are from &#8220;blue,&#8221; the more you learn to go toward &#8220;blue&#8221; using cross-entropy loss.</p><p>You can think of it as your pass rate, your prior probability of having said &#8220;blue.&#8221; As a function of that, in supervised learning, through cross-entropy loss, you would learn negative log(<em>p</em>) bits, <em>p</em> being the pass rate, once you get this label. Whereas in RL, if you&#8217;re just randomly guessing and seeing if it works, that&#8217;s basically just the entropy of a binary random variable.</p><p><strong>Eric Jang</strong></p><p>What&#8217;s also tough here is that the distribution you&#8217;re sampling under is your policy&#8217;s distribution. If your policy has no chance of sampling &#8220;blue,&#8221; then you will never get a signal.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. That&#8217;s modeled by the fact that your probability of sampling &#8220;blue&#8221; is extremely low. If you sample it, you do learn as much as you would have in supervised learning. In all other cases&#8212;99.99% of the time in an untrained model&#8212;you&#8217;re learning incredibly little from seeing that &#8220;halcyon&#8221; or &#8220;told&#8221; is not the correct word.</p><p>That&#8217;s what happens most of the time. You learn very little. If you put your pass rate on the X-axis and the bits you&#8217;re learning from a sample on the Y-axis, with 0%, 50%, and 100%, so at the end of training you&#8217;re here. If you have supervised learning, the negative log pass rate would look something like this. The entropy of a binary random variable would look like this, depending on whether you&#8217;re doing nats or bits. If you do bits it&#8217;s one here at the peak. This is like a coin flip. You learn the most from a coin flip. This is supervised learning. This is RL.</p><p>However, the problem is you spend most of training in this regime, in the low pass rate regime. 
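</p><p>[Editor&#8217;s note: the plot is not visible in the transcript. In the framing above, with pass rate <em>p</em>, a supervised label carries &#8722;log<sub>2</sub> <em>p</em> bits, while a pass/fail outcome carries on average the binary entropy <em>H</em>(<em>p</em>) = &#8722;<em>p</em> log<sub>2</sub> <em>p</em> &#8722; (1 &#8722; <em>p</em>) log<sub>2</sub>(1 &#8722; <em>p</em>) bits. At <em>p</em> = 1/100,000 that works out to roughly 16.6 bits for the label versus roughly 0.00018 bits for the pass/fail outcome. At <em>p</em> = 1/2 both equal 1 bit.]</p><p>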
How fast you&#8217;re learning is a function of how many bits per sample you&#8217;re getting, and you&#8217;re getting very little signal here. If you chart the pass rate on a log scale&#8212;where at the beginning of training with a vocab size of 100K the pass rate is 1/100,000, then 1/10,000, 1/1,000, 1/100&#8212;what this graph looks like here, supervised learning would look like this, and then RL would look like that.</p><p><strong>Eric Jang</strong></p><p>And arguably you spend all your time here, potentially never even getting a single success. It&#8217;s a depressing plot in the sense that once you&#8217;re here, it&#8217;s not at all obvious how you get to there. Once you&#8217;re here you have something, but you actually spend all the time here in many RL problems. There&#8217;s a question of how you initialize so you&#8217;re at least at a non-zero pass rate.</p><p>One more thing I&#8217;d like to add about bits per sample that&#8217;s very relevant to any machine learning problem is that there&#8217;s a connection to soft targets and distillation. If you have access to the logits, not just the <a href="https://en.wikipedia.org/wiki/One-hot">one-hot</a> token answer, if you have access to the soft targets the entropy of this distribution is far higher than the one-hot. There&#8217;s way more information in bits per sample in a soft label. That&#8217;s why distillation is so effective per sample. It&#8217;s giving you way more information per sample.</p><p><strong>Dwarkesh Patel</strong></p><p>I wonder what the equation would be, but obviously it&#8217;s&#8230;</p><p><strong>Eric Jang</strong></p><p>It would just be the entropy of this distribution. The entropy of this is zero. The entropy of this is the entropy equation.</p><p>This is also why AlphaGo is quite beautiful. In AlphaGo, you don&#8217;t train the policy network to imitate the MCTS action. You train it to imitate the MCTS distribution. 
Both of these are valid, and if you wanted to do a scientific experiment of how important this soft label is, <a href="https://www.ttic.edu/dl/dark14.pdf">dark knowledge</a> distillation, you can run an experiment where you retrain the policy network on the action MCTS selected rather than the soft target.</p><p><strong>Dwarkesh Patel</strong></p><p>Earlier I was stumbling around. Intuitively, why is this ability to do iterative search&#8212;where you don&#8217;t necessarily need to be able to win the game in the beginning, you just need to be able to improve your current policy&#8212;so powerful a capability in learning compared to how LLMs currently learn RL? It&#8217;s exactly this thing of considering your pass rate of the entire trajectory. I just don&#8217;t know a formal way to think about this. Maybe you should help me out here.</p><p><strong>Eric Jang</strong></p><p>Why is AlphaGo an elegant RL algorithm? The major reason is that you never have to initialize at a zero percent success rate and solve the exploration problem of how to get to a non-zero success rate. This is what allows you to hill-climb this beautiful supervised learning signal.</p><p>If you look at the actual implementation of AlphaGo, every step of the way, there&#8217;s actually no TD error learning or dynamic programming, at least explicitly. It&#8217;s just supervised learning on a value classification as well as a policy <a href="https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence">KL</a> minimization. It&#8217;s just a supervised learning problem on improved labels.</p><p>The training is very stable. You can train as big of a network as you want. You can retrain this on the data set. Everything will just go stably. The infrastructure is very simple to implement as well. You don&#8217;t need a complex distributed system to keep everything on policy.</p><p>At the end of the day, you&#8217;re just saying, &#8220;I have some improved labels. 
Let&#8217;s retrain my supervised model on these targets.&#8221; You&#8217;re always in this beautiful regime where you&#8217;re just trying to improve the policy, rather than escape local minima where every signal is flat all around you.</p><p>One way to draw the curve is if you draw the win rate of an MCTS policy versus the raw network&#8230; Let&#8217;s say this dotted line is the raw network. The MCTS policy looks like this. Every step of the way, this supervision signal is very clean. You&#8217;re never in a situation where the MCTS is giving you no signal, unless your MCTS distribution converges to exactly what your policy network predicts.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s a great way to explain it. Maybe we sit down and I ask some questions about automated research.</p><h3>02:22:16 &#8211; Automated AI researchers</h3><p><strong>Dwarkesh Patel</strong></p><p>One thing I really wanted to talk to you about is that you did a bunch of the research for this project through this automated LLM coding assistant loop. There&#8217;s an idea that if you fully automated AI research, you could have some sort of singularity.</p><p>Obviously we&#8217;re not there yet, but to the extent that we have early indications of what this process might look like, I&#8217;m curious about your observations about what the AI is good at, what it&#8217;s not good at, what you think about this scenario&#8217;s likelihood eventually, and what thoughts you have about this in general.</p><p><strong>Eric Jang</strong></p><p>I think automated scientific research is one of the most exciting skills that the frontier labs are developing right now. It&#8217;s important for everyone who&#8217;s doing any kind of research to get a good intuition of what it can do now and what it can&#8217;t, and how the science process might work in the future once we have AIs automating a lot of this investigation.</p><p>In brief, I mostly used Opus 4.6 and 4.7 while working on this. 
What works is that the models can do a very good job of hyperparameter optimization. In the past, people would come up with a search space of hyperparameters like learning rate, weight decay, and maybe how many layers are in your network. They would do a grid search or a <a href="https://www.geeksforgeeks.org/machine-learning/hyperparameter-optimization-based-on-bayesian-optimization/">Bayesian hyperparameter optimization</a> approach, and it would find some tuned parameters.</p><p>The really cool thing that automated coding can do now is search a much more open-ended set of problems. It can say, &#8220;I&#8217;ve identified that the gradients are small in this layer, so let me change it up here. Let me rewrite the code so the data loader has a new augmentation I came up with. Let&#8217;s try to find the best way to fit the constraints of the optimization problem.&#8221; You end up with this much more flexible, high-level, almost grad-student-like ability to just grind a performance metric.</p><p>This can squeeze out quite a lot of performance. On a fixed data set with a fixed time budget, you can improve <a href="https://en.wikipedia.org/wiki/Perplexity">perplexity</a> by quite a lot on a classification problem like LLMs or Go. It is also fantastic now at basically executing any experiment. I have a <a href="https://code.claude.com/docs/en/skills">Claude Skill</a> that I wrote called Experiment where I give it a description of what I want it to plot. I just describe, &#8220;Here&#8217;s the x-axis I want, here&#8217;s the y-axis. Answer this question for me.&#8221; It&#8217;ll run off and do all the experiments, compile the plot, make a report, and suggest what might have caused it and so forth. That&#8217;s what works quite well today, and I think we can expect these abilities to get better in the future.</p><p>But it&#8217;s also useful to know what it&#8217;s not doing so well today. 
In the <a href="https://evjang.com/2026/04/28/autogo.html">blog version</a> of this tutorial, I have a plot of all the experiments I did grouped in a tree, where every node represents a failed, successful, or mixed experimental result. From there, it branches off into a child representing the follow-on experiment. Occasionally, I&#8217;ll rabbit-hole down a track like off-policy MCTS relabeling, do a few experiments, and then realize it&#8217;s probably not worth it. So then I&#8217;ll jump to a completely different track. I call these things rows.</p><p>What I find is that the current closed models the public can access today don&#8217;t seem to be that great at selecting what the next experiment should be in a given track. They don&#8217;t seem to be able to step back and do the lateral thinking of, &#8220;Wait a minute, this track doesn&#8217;t really make sense. Let&#8217;s go back to first principles and think about what the bottleneck might be, or what we are trying to achieve.&#8221; Often I had to catch infra bugs myself by prompting the right question to Claude to investigate what&#8217;s causing the discrepancy, and then it&#8217;ll answer the question.</p><p>With <a href="https://red.anthropic.com/2026/mythos-preview/">Mythos</a>-class models or Mythos++ models coming online, maybe this just completely changes and these problems fall to improved scaling. But at the same time, I think there&#8217;s a rich opportunity to develop RL environments that might incentivize this kind of lateral thinking. One of the motivations for setting up this Go environment was that Go captures a lot of very interesting research problems, often overlapping with LLMs or robotics. Yet it&#8217;s very quick to verify. The outer loop is ultimately: does the agent do what I think it does? 
You can check the outcome of a Go game quite easily.</p><p>The inner loop involves all this research engineering around distributed systems, predicting whether an idea is going to work or not, and predicting the difference a particular modification to your training algorithm might make. I think there&#8217;s a rich library of subtasks and sub-environments that you can train an automated scientist to work on, with Go as a sort of outer verification loop. Once you acquire these skills, maybe you can apply them to other domains like biosciences or robotics.</p><p><strong>Dwarkesh Patel</strong></p><p>Or automating AI research.</p><p><strong>Eric Jang</strong></p><p>Or automating AI research.</p><p><strong>Dwarkesh Patel</strong></p><p>Which is the real crux, the scary/incredible thing of making AIs make future versions of AIs. You&#8217;re suggesting the outer loop here could just be your win rate against KataGo?</p><p><strong>Eric Jang</strong></p><p>That&#8217;s one of them. I think there are a lot of deeper questions that one could tackle. For example, let&#8217;s say you have an idea on how to improve a scaling-law compute multiplier. The outcome isn&#8217;t necessarily &#8220;achieve the best Go bot ever&#8221;. The outcome might just be, &#8220;Can I predict what the win rate of my Go bot will be?&#8221; Or, &#8220;Can I predict the scaling-law plots that emerge from my idea?&#8221; But then you can verify that you haven&#8217;t reward-hacked anything by using a very verifiable game like Go on the outer loop.</p><p><strong>Dwarkesh Patel</strong></p><p>I think there are a couple of interesting follow-on questions. There are questions on the inner loop and the outer loop. On the inner loop, there&#8217;s a question of how locally verifiable any modification you might make is. That is to say, would you know whether some idea you try out is actually an improvement or a degradation? 
Would you know if something isn&#8217;t working as a result of a bug, or is it the result of the idea itself being wrong?</p><p><a href="https://www.dwarkesh.com/p/ilya-sutskever-2">Ilya</a> was talking about how one of the things he thinks makes him a good researcher is that he has a strong belief in what the correct idea is. He&#8217;s able to persevere through bugs and know which things are bugs versus mistakes in the fundamental idea, based on his high-level belief that &#8220;this idea should work, so therefore there has to be a bug&#8221;, versus the other way around. Why don&#8217;t we start with that question? How locally verifiable are things which are good ideas?</p><p><strong>Eric Jang</strong></p><p>As in the case of the success story for deep learning, you can think about this as a decades-long idea that took a lot of faith to get it to work. This presents a very challenging long-horizon RL problem where every step of the way you have a committee telling you that this is a bad idea, and then ultimately you break through. How do you design RL environments that maybe give you some feedback earlier? I think this is a very tough open question that I don&#8217;t have an answer to.</p><p>Ultimately, to play a very strong Go bot, you probably did need to discover deep learning. Having a challenging game that cannot be cheated easily on the outer loop could be used as an outer-loop signal for something like discovering the principles of deep learning. Now, of course, to make it tractable&#8212;and this is where research taste really matters&#8212;you have to come up with ways to initialize your problem so that you don&#8217;t try to solve a very intractable problem. Maybe you can leverage LLMs as a universal grammar in the middle to give you some sort of local feedback.</p><p>The fact that LLMs are a universal grammar means that they can move at almost any level of the stack. They can think very locally, as well as step back and think in very broad steps. 
I think that&#8217;s where a lot of the lateral thinking ability of humans comes from: knowing when the track you&#8217;re pursuing or the objective you&#8217;re pursuing is not right, and you should be asking a different question.</p><p><strong>Dwarkesh Patel</strong></p><p>The other question is how stackable local improvements are in the attempt to get to a better result on the outer loop. I&#8217;ve heard rumors that at some AI labs, the thing that has gone wrong is that people will individually pursue good ideas, but those don&#8217;t end up stacking well, and so the training run fails because of some weird interaction between two seemingly good ideas. Having a single top-down vision of how things should work is very important. Having worked at different AI labs and also played around with parallel agents trying different ideas, what&#8217;s your sense of how parallelizable AI innovation is?</p><p><strong>Eric Jang</strong></p><p>Great question. I think the research taste for executing well on the Bitter Lesson is that you need to know how much the Bitter Lesson can buy you and how much is too much to ask for, at any given moment. Of course, in the fullness of time compute is the single most important determinant of how things work. It&#8217;s almost inevitable that as you scale up energy and compute and parameters, intelligence will just fall out of that. That&#8217;s super beautiful, super profound. No algorithmic detail really matters beyond that.</p><p>But in the present day, we don&#8217;t have infinite compute and parameters and arbitrarily good initialization, so we have to come up with heuristics that give us that. These heuristics are probably somewhat redundant. That&#8217;s probably why you see this effect where a lot of these compute multipliers don&#8217;t necessarily stack. They might have some correlated benefit. 
And then three years down the line, when the Nvidia GPUs have gotten even stronger, maybe they stack even less well.</p><p>Maybe at any given point in time, the benefit of any given compute multiplier is transitory, which is what I suspected with the KataGo paper. There were many algorithmic ideas applied, and then you can see that with modern Blackwell GPUs and <a href="https://en.wikipedia.org/wiki/Ada_Lovelace_(microarchitecture)">Ada</a>-class GPUs&#8212;which are much better than the V100-grade GPUs that that paper used&#8212;some of these algorithmic tricks to speed up convergence just don&#8217;t matter so much compared to something else. I think that&#8217;s a matter of taste in the present time.</p><p><strong>Dwarkesh Patel</strong></p><p>How about the outer loop? How verifiable is it for making AI smarter? With Go, you do have this outer loop of win rate against the best open-source model out there. Even there, as you were saying, there are other outer loops of whether you discovered a new phenomenon, which is actually very hard to&#8230; If you didn&#8217;t know scaling laws were important&#8230; When were <a href="https://arxiv.org/abs/2203.15556">Chinchilla</a> or <a href="https://arxiv.org/abs/2001.08361">Kaplan scaling laws</a> released?</p><p><strong>Eric Jang</strong></p><p>2018.</p><p><strong>Dwarkesh Patel</strong></p><p>So if you&#8217;re back in 2015, there&#8217;s not an automated procedure one can easily imagine for knowing which paper is the scaling laws paper versus which is just another random plot. Even in the Go case, it&#8217;s a hard-to-verify outer loop, and the whole idea of an outer loop is to have some backstop on improvement, let alone for general AGI, where of course we have a bunch of these benchmarks.</p><p>But there&#8217;s a problem. We know the things we can measure, and we improve on the things we can measure. 
We care about this broader ability to do economically useful work, which is not super easy to measure, at least until you automate everything. There&#8217;s a question of how good the outer verification loop is for AI self-improvement, and does that matter?</p><p><strong>Eric Jang</strong></p><p>I&#8217;m going to give a non-rigorous argument, but one that I intuitively believe. DeepMind started with a focus on games. They used games as their outer loop, and then their researchers learned from the experience of solving games, and now they&#8217;re working on LLMs.</p><p>Presumably, there was some positive transfer from their time working on games and Atari and Go and StarCraft that now helps them make good LLMs. I assume that there&#8217;s positive transfer in some regard, whether it&#8217;s coding or general research ability or project management. All these things probably help them do well.</p><p>If that&#8217;s the case, why wouldn&#8217;t it also be true for automated AI researchers? They should be able to positively transfer experience tackling quick-to-verify, quick-to-iterate environments to something more ambitious and economically useful, like automating drug discovery.</p><p><strong>Dwarkesh Patel</strong></p><p>I don&#8217;t know. Isn&#8217;t the issue&#8230; Historically, until <a href="https://blog.google/products-and-platforms/products/gemini/gemini-3/">Gemini 3</a> or whatever a couple years ago, people were saying, &#8220;Look, Google isn&#8217;t catching up in LLMs because they&#8217;re too tied to the old approach.&#8221; There are gains, but there are also ways in which it actively hinders you. It&#8217;s not obvious to me.</p><p><strong>Eric Jang</strong></p><p>The jury&#8217;s still out. Let&#8217;s say currently Google&#8217;s doing quite well. Who knows if the initialization of training on games is ultimately going to hobble their ability to be the winner in the long term. 
It&#8217;s hard to say for sure.</p><p>Likewise, who knows if the seeming late start was really just them pre-training for longer on how to scale up TPUs. They invested all their tech tree in getting TPUs to be good, which seemed not that useful in the short term but in the long term it becomes&#8230; So it&#8217;s even hard for humans to reason about what the optimal research strategy should be, even with the data we have today.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, we should let people know how they can find out more about this project, whether to fork it themselves, or check out your blog post where you do an excellent job explaining many of these ideas. Where do people go next?</p><p><strong>Eric Jang</strong></p><p>My website is <a href="http://evjang.com">evjang.com</a>. There&#8217;s a blog post that links to <a href="https://evjang.com/2026/04/28/autogo.html">an interactive version of this tutorial</a>. On <a href="https://github.com/ericjang">my GitHub</a>, which is the username ericjang, there&#8217;s an <a href="https://github.com/ericjang/autogo">AutoGo repo</a> that people can fork and reproduce the training results.</p><p><strong>Dwarkesh Patel</strong></p><p>I also highly recommend people check out this blog post, <a href="https://evjang.com/2026/02/04/rocks.html">&#8220;As Rocks May Think.&#8221;</a> We touched on some of the ideas in this conversation, but it&#8217;s this grander thesis of what happens when you have thinking as a primitive in&#8212;</p><p><strong>Eric Jang</strong></p><p>Computer science.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. I highly recommend people check out that blog post as well.</p><p><strong>Eric Jang</strong></p><p>I encourage the audience to think about the relationship between thinking and Go via MCTS and search, and how it relates to LLMs. 
I think there&#8217;s something quite profound there, and probably underexplored just because Go has been relatively underexplored compared to the boom in LLMs.</p><p>It&#8217;s not to say that I think we should have trees in our LLMs, but there is some very interesting duality between them. You can actually do a lot of research on Go, MCTS, and reasoning with very small budgets. So that&#8217;s very exciting.</p><p><strong>Dwarkesh Patel</strong></p><p>Awesome, Eric. Thanks for doing this.</p><p><strong>Eric Jang</strong></p><p>Thank you. It&#8217;s an honor to be on the podcast.</p>]]></content:encoded></item><item><title><![CDATA[David Reich – Why the Bronze Age was an inflection point in human evolution]]></title><description><![CDATA["Instead of being quiescent, natural selection is everywhere."]]></description><link>https://www.dwarkesh.com/p/david-reich-2</link><guid isPermaLink="false">https://www.dwarkesh.com/p/david-reich-2</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 08 May 2026 16:38:22 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/196892360/c03625a044a70e5f333cd67b0c06c1e7.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><a href="https://reich.hms.harvard.edu/">David Reich</a> is back.</p><p>He and collaborator <a href="https://reich.hms.harvard.edu/people/ali-akbari">Ali Akbari</a> just <a href="https://reich.hms.harvard.edu/sites/reich.hms.harvard.edu/files/inline-files/2026_Akbari_Nature_selection_0.pdf">published a paper</a> that overturns a long-standing consensus about human evolution &#8212; that natural selection has been dormant in our species since the agricultural revolution.</p><p>By scaling ancient DNA sequencing and developing a new statistical method, they found that selection has actually sped up.</p><p>Selection went especially bonkers during the Bronze Age (around 3,000 years ago).</p><p>That&#8217;s when gene frequencies for everything from immune function to body fat to 
intelligence were most in flux.</p><p>Over the last 10,000 years, selection pushed the genetic predictor of cognitive performance up by roughly a full standard deviation &#8212; most of it between 4,000 and 2,000 years ago.</p><p>After we finished recording, David sketched out on a whiteboard his new heretical model about who the Neanderthals really were. Luckily, I took out my iPhone and managed to record it.</p><p>He thinks the standard story (that Neanderthals are some separate archaic lineage we interbred with a little) just doesn&#8217;t fit the evidence. Instead, he proposes that Neanderthals are essentially genetically-swamped modern humans.</p><p>A small population somewhere around the Caucasus invented Middle Stone Age technology roughly 300,000 years ago and expanded outward. The ones that moved into Europe interbred with local archaic humans, got genetically swamped, and became Neanderthals. The same expansion went into Africa, met much more diverged archaic Africans, and that mixture became us.</p><p>This means Neanderthals and modern humans share the same cultural ancestry &#8212; the only difference is which archaic humans they mixed with afterward.</p><p>David is a brilliant and rigorous scholar. 
It was a real delight to learn from him again.</p><p>Watch on <a href="https://youtu.be/sRKBGVFVYAw">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/david-reich-why-the-bronze-age-was-an-inflection/id1516093381?i=1000766816517">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/6BZ56Puv0gsnWCA8yfSde4?si=UIkcA7YrRlWNVyS8VnZomQ">Spotify</a>.</p><div id="youtube2-sRKBGVFVYAw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;sRKBGVFVYAw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/sRKBGVFVYAw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Sponsors</h2><ul><li><p><a href="https://cursor.com/dwarkesh">Cursor</a> was super useful as I prepped for this episode. Whenever I had a question, I&#8217;d have Cursor kick off a few different models simultaneously and then compare their responses. I found that this led to better results than I could get out of any individual LLM. If you&#8217;ve only used Cursor for coding, you should try using it for research. Check it out at <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> uses an internal currency called &#8220;hive bucks&#8221; to allocate compute through a real-time auction &#8211; and anyone can change anyone else&#8217;s bids or even kill their jobs! Everyone just trusts each other to act in the firm&#8217;s best interest, which is what lets the system work in the first place. 
If this weird and high-trust culture sounds like your kind of thing, Jane Street&#8217;s hiring at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li><li><p><a href="https://crusoe.ai/dwarkesh">Crusoe&#8217;s</a> ML infra team built fastokens, an open-source tokenizer that delivers a ~9x speedup over Hugging Face and up to 40% faster time-to-first token &#8211; on real production workloads! Crusoe achieved these results by parallelizing things and using some clever engineering to handle duplicates without cross-thread coordination. Learn more at <a href="https://crusoe.ai/dwarkesh">crusoe.ai/dwarkesh</a></p></li></ul><h2>Timestamps</h2><p>(00:00:00) &#8211; Ancient DNA suggests strong selection over last 10,000 years</p><p>(00:15:45) &#8211; Natural selection intensified during the Bronze Age</p><p>(00:35:02) &#8211; Why didn&#8217;t evolution max out intelligence?</p><p>(00:57:21) &#8211; Evolution is limited by time, not population size</p><p>(01:09:02) &#8211; Why no farming before the Ice Age?</p><p>(01:17:13) &#8211; The Neanderthal puzzle David can&#8217;t stop thinking about</p><p>(01:54:10) &#8211; The methodology behind this breakthrough</p><h2>Transcript</h2><h3>00:00:00 &#8211; Ancient DNA suggests strong selection over last 10,000 years</h3><p><strong>Dwarkesh Patel</strong></p><p>I am back with <a href="https://en.wikipedia.org/wiki/David_Reich_(geneticist)">David Reich</a>, who is a professor of <a href="https://en.wikipedia.org/wiki/Ancient_DNA">ancient DNA</a> <a href="https://reich.hms.harvard.edu/">at Harvard</a>. 
How do you describe what it is that you study?</p><p><strong>David Reich</strong></p><p>I&#8217;m a geneticist, and I work on human history and how ancient people relate to each other and people living today.</p><p><strong>Dwarkesh Patel</strong></p><p><a href="https://www.dwarkesh.com/p/david-reich">We did an interview two years ago</a>, which ended up being one of the most popular interviews I&#8217;ve ever done. I think people found it really compelling that there&#8217;s so much about human history we don&#8217;t know and are just learning about now as a result of the kinds of techniques your lab is using.</p><p><a href="https://hms.harvard.edu/news/massive-ancient-dna-study-reveals-natural-selection-has-accelerated-recent-human-evolution">You have a new preprint</a> that&#8217;s very exciting, and I wanted to talk to you about it. Can you give me a little bit of context on what we&#8217;re talking about today?</p><p><strong>David Reich</strong></p><p>The dream was that when this ancient DNA field started, more than 16 or 17 years ago, we were going to learn a lot about biology &#8212;about how people&#8217;s biology changed over time&#8212; by getting DNA out of ancient human remains and tracking changes over time. And that dream has really not been realized since the beginning of this field.</p><p>The field has been a big success with regard to learning about human history. It&#8217;s resulted in surprising findings about human migrations &#8212;people not being descended from the people who lived in the same place hundreds or thousands or tens of thousands of years before&#8212; and mixture being common in human history, and sex-biased processes being common. And there have been things that were not expected from archaeology. The field&#8217;s been a big success from that perspective, but what&#8217;s not been successful is learning about biology and biological change.</p><p>One big reason has been that the sample sizes have been too small. 
When you have a single person&#8217;s DNA, it provides a tremendous amount of information about history. That&#8217;s because when you look at one person&#8217;s DNA, it&#8217;s not a single person. It&#8217;s many people. It&#8217;s your two parents, your four grandparents, your eight great-grandparents, 16 great-great-grandparents, and so on. Going back in time, thousands, tens of thousands, even hundreds of thousands of ancestors are contributing to people today.</p><p>When you look at the DNA of a single person&#8217;s <a href="https://en.wikipedia.org/wiki/Genome">genome</a> or a <a href="https://en.wikipedia.org/wiki/Neanderthal">Neanderthal</a> genome, you have effectively tens of thousands of ancestors all represented in your data. And you can position that individual exquisitely with respect to other people from whom you have data. But when you are interested in how a particular genetic variant&#8212;that affects something like your skin pigmentation, or your ability to digest cow&#8217;s milk into adulthood, or a behavioral trait&#8212;changes over time, a single person gives you only one sample, or maybe two samples: the one in their mother and the one in their father.</p><p>To get a high-resolution picture of how the frequency changes over time, you need very big sample sizes, truly very large numbers of people. We just didn&#8217;t have that until the last few years. What motivates the study we&#8217;re talking about today, and the work that hopefully a number of groups will be doing in the coming years, is the fact that we now finally have those numbers. We can do something with the data to see how frequency changes over time.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I ask a question? 
I&#8217;ll be asking a lot of naive questions through the next few hours, but why are frequency changes especially interesting?</p><p><strong>David Reich</strong></p><p>What we&#8217;re interested in is using the experiment of nature that&#8217;s occurred in our history, over the last tens of thousands of years, to understand what&#8217;s biologically significant in our DNA. If there has been a change in environment that a population has experienced&#8212;for example, people shifted to agriculture, began living close to domesticated animals, or moved from a cold place to a warm place, or a low place to a high place&#8212;then there&#8217;s pressure on the population to adapt to these new stresses and new needs.</p><p>The way you&#8217;re going to detect that is by seeing that the frequency of a genetic variant&#8212;that for example might allow you to live at higher altitude, or that might nudge you to have a different behavioral pattern advantageous in the new situation&#8212;pushes systematically in some direction in a way that is enough for you to detect. It&#8217;s very hard to detect slight shifts in frequency by a few percent or ten percent unless you have a very big sample size. What we&#8217;re looking for are those changes in frequency that are too extreme to be due to chance. That will tell us there have been pushes against the biology as a result of the changes in environment that people have experienced.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. What did you guys find?</p><p><strong>David Reich</strong></p><p>Seven years ago, <a href="https://reich.hms.harvard.edu/people/ali-akbari">Ali Akbari</a>, who at the time was a postdoctoral scientist in my laboratory and a few years later became a permanent staff scientist, set out to use the data we were producing to learn about biological change over time. 
I think the reason he was interested in our laboratory rather than other places was that a focus of our lab has been generating truly large amounts of data from ancient humans. We&#8217;ve been trying to industrialize the process, make it very inexpensive, make it high quality, and generate large numbers of samples with lots of good data for this purpose.</p><p>There&#8217;s been this large amount of data we&#8217;ve generated. And it made it possible to conceive again of asking whether there have been frequency changes over time. The mainstream view in human evolution in the last several decades has been that <a href="https://en.wikipedia.org/wiki/Natural_selection">natural selection</a> has been pretty quiescent over the last several hundred thousand years of human history.</p><p>There are several lines of evidence that have been deployed to document this. One is that if you compare diverse populations from different continents around the world, for example Europeans and East Asians, and you look at mutations that differ in frequency between these groups&#8212;all mutations differ a little bit in frequency, sometimes a lot&#8212;you can say, &#8220;What are the most different mutations in terms of frequency between Europeans and East Asians?&#8221; And there are almost no genetic changes that are 100% different in frequency between Europeans and East Asians.</p><p>Europeans and East Asians descend from a common ancestral population 40,000 or 50,000 years ago that came out of Africa and the Middle East. This population had a set of gene frequencies, and these variants bopped around randomly&#8212;a process known as <a href="https://en.wikipedia.org/wiki/Genetic_drift">genetic drift</a>&#8212;or perhaps under selection in one direction or another. 
The time that&#8217;s passed since 40,000 or 50,000 years ago is sufficiently small on an evolutionary timescale that there&#8217;s just not much genetic differentiation on average between these two groups.</p><p>However, if there&#8217;s been natural selection, for example to help people in one place digest alcohol better, or digest milk better, what you might expect is that there would be some mutation that would have rocketed up to very high frequency. Forty or fifty thousand years is a lot of time, it&#8217;s maybe 1,500-2,000 generations. That might easily be enough time to see a 100% difference in frequency. Yet you don&#8217;t see any more than what you would expect by chance.</p><p>This combination of things made it seem that selection has just been quiescent. Maybe a few hundred thousand years ago, the ancestral human population got to some kind of optimum, and after that there hasn&#8217;t been much genetic change in one way or the other.</p><p>There have been small amounts of natural selection, or selection to remove bad mutations that are constantly raining down on the genome, but not what we call <a href="https://en.wikipedia.org/wiki/Directional_selection">directional selection</a>. That would be newly arising mutations, or mutations being pushed in a systematic direction, to help the population get to a different adaptive set point more favorable for the conditions that population is living in.</p><p>We were able to partition how much of the changes in frequencies of all the mutations that we&#8217;re seeing in the DNA&#8212;we&#8217;re looking at about 10 million positions that vary&#8212;is due to directional selection (adaptation) versus other factors, especially genetic drift. And 98% of it is other factors, especially genetic drift. 
It&#8217;s overwhelmingly migrations and population structure causing fluctuations in frequency.</p><p>As a result, it&#8217;s super hard to detect the signals of adaptive natural selection because they&#8217;re a tiny fraction of the total frequency change. The vast majority of it is due to these migrations and mixtures. Nevertheless, there&#8217;s so much natural selection, as our study has shown, that it&#8217;s actually been rampant in the genome.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I ask a clarifying question here? Why are we discounting <a href="https://en.wikipedia.org/wiki/Genetic_admixture">population admixture</a> or replacement as selection? If you think about it at a group level, if one population replaces another population, isn&#8217;t that selection?</p><p>I remember from the last episode you were explaining how there have been huge changes in what kinds of people are in a specific area. One population came in and replaced the previous one, and then a new population came in and replaced that one. To the extent that the genetics are relevant to why that population replaced the other one, why should that not count towards what we understand to be selection over the last 10,000 years?</p><p><strong>David Reich</strong></p><p>It could count, and may count, and probably should count in some respects. But it could also be that this population replacement is due to some cultural phenomenon &#8212;technology held by one of these groups and not others. And maybe there are some genetic mutations that are contributing to this. Who knows? It&#8217;s possible.</p><p>But what you&#8217;re seeing is a whole-genome shift. What we&#8217;re looking to see is whether there&#8217;s one place in the DNA that is driving the change in a way that&#8217;s different from the rest of the genome. From a statistical point of view, what happens at these times of migration is there are just huge fluctuations in frequencies. 
These are extremely uninformative times for detecting natural selection. The best moments to detect natural selection are when migrations and population admixtures are not happening for a few hundred years. During these times, you can actually see the mutation slowly blowing in one direction as a result.</p><p>The way we think about the history of Europe and the Middle East for the purpose of this study is as an archipelago of little populations in space and time, each pretty isolated from each other. You have a little population in Britain isolated for a few hundred years, or a little population in Hungary isolated for a few hundred years, between big events of migration and mixture. In each of those little experiments of nature, we can ask: does this mutation slightly increase in frequency? Does that same mutation slightly increase in frequency? If all the arrows point in the same direction, we win. They&#8217;re telling us that natural selection is occurring.</p><p>For example, 4,500 years ago in Europe, almost all mutations went through huge frequency changes. That&#8217;s not because of natural selection. It&#8217;s because of the steppe migration from <a href="https://en.wikipedia.org/wiki/Pontic%E2%80%93Caspian_steppe">north of the Black and Caspian Seas</a>. 40-80% of the DNA becomes <a href="https://en.wikipedia.org/wiki/Yamnaya_culture">Yamnaya</a> from steppe pastoralists. Their frequencies of mutations were different not because of selection necessarily, but just because they had evolved in different places for thousands and tens of thousands of years. When you look at the descendant populations, there are huge changes in frequency. What you need to do is see if natural selection is explaining a shift more than you would expect by chance.</p><p><strong>Dwarkesh Patel</strong></p><p>So you found these locations that seem to be under selection. I have another clarifying question. 
You say you found 3,800 locations which you&#8217;re 50% confident have been under selection in the last 10,000 years.</p><p><strong>David Reich</strong></p><p>It&#8217;s 7,200 where we&#8217;re 50% confident. We&#8217;re getting about 7,200 positions in the DNA that have 50% confidence of being real. Only half of those are real&#8212;we don&#8217;t know which ones&#8212;so 3,600 of them are real.</p><p><strong>Dwarkesh Patel</strong></p><p>Does that also mean that outside of those 7,200, you&#8217;re confident the other locations in the genome are not under selection?</p><p><strong>David Reich</strong></p><p>No. If you look at the 25% probability cutoff, there will be tens of thousands, and there will be many real ones there too. In fact, multiple analyses we do suggest that the genome is vibrating with natural selection.</p><p>There are all sorts of weaker effects that would be picked up in even larger studies than we&#8217;ve done. In fact, almost every position in the DNA is correlated to another position that is being dragged in one way or the other by natural selection. Instead of being quiescent, natural selection is everywhere. Even though it&#8217;s only 2% of the frequency change, it&#8217;s tugging the positions in one direction or the other everywhere.</p><p>So we analyzed these positions that we had identified, the hundreds of positions we were super confident about. We looked to see whether they were randomly distributed in the DNA or whether they had patterns. We looked at maybe 100 or so traits where there had been genome-wide association studies for all sorts of different traits, associated with immunity or autoimmunity or behavior or metabolism, and other things.</p><p>For each of these we could ask: do the genetic variations that are known to affect these traits from genome-wide association studies have an unusual number of genetic selection signals? What we found is that there was a vast enrichment, by about four or five-fold, for immune traits. 
There was a super concentration of selected signals in immune traits. We also saw a strong enrichment for metabolic traits&#8212;things that might impact obesity or fat traits or Type 2 diabetes&#8212;and almost no detectable enrichment, as far as we could tell, for behavioral or psychiatric traits.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understand. This is not to say that behavioral or psychiatric or cognitive traits are not under selection. It&#8217;s just that the individual sites where such traits are controlled are not especially likely to be among the locations you&#8217;ve identified as under selection.</p><p><strong>David Reich</strong></p><p>That&#8217;s exactly right. It might seem from the results of that analysis that immune traits are highly selected and that there&#8217;s been no selection for behavior in the last 18,000 years in this part of the world. But that&#8217;s a wrong conclusion, and we have evidence that it&#8217;s a wrong conclusion. There&#8217;s clear evidence of selection also on behavioral traits.</p><p>The reason we think we see much weaker signals for behavioral traits is that behavioral traits, we know from medical studies, are underpinned by much larger numbers of genes than immune traits, which are underpinned by relatively small numbers of genes of strong <a href="https://en.wikipedia.org/wiki/Additive_genetic_effects">effect</a>. Behavioral traits are shaped genetically by a very large number of genes of weak effect, and we just don&#8217;t have the statistical power to detect these very weak signals.</p><p>When we do an analysis looking at our very strong signals of selection, that collection of very strong results is very effectively querying the immune traits, but is not very effectively querying the behavioral traits. It may still be the case, and I guess it is, that immune traits are the most selected category. 
But it is not at all the case&#8212;and we can prove it&#8217;s not the case&#8212;that behavioral traits are not selected.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting.</p><h3>00:15:45 &#8211; Natural selection intensified during the Bronze Age</h3><p><strong>David Reich</strong></p><p>We&#8217;ve been able to prove that there are two ways to reconcile the previous observations with our new observations. Remember, the previous observation is that natural selection seems to have been quiescent over a timescale of hundreds of thousands or many tens of thousands of years. The reason is that you don&#8217;t see variants that differ 100% in frequency between Europeans and East Asians.</p><p>Now we&#8217;re seeing hundreds of positions that are rocketing up in frequency with selection rates of 1% or more in a lot of cases. A 1% or more selection rate will mean a rapid doubling over periods of dozens of generations. Over the 1,500 or 2,000 generations separating Europeans and East Asians, shouldn&#8217;t you see many genetic variants that are 100% different in frequency across populations?</p><p>We were able to show that this is explained by at least two factors. One is that in this part of the world&#8212;Europe and the Middle East&#8212;we are actually in a period of accelerated natural selection. One way to see this is to look at the enrichment pattern we&#8217;re observing, where immune traits are unusually associated with these selection signals. We could compare the last 5,000 years of our time period, what&#8217;s called the <a href="https://en.wikipedia.org/wiki/Bronze_Age">Bronze Age</a> and further onward, to the previous 5,000 years. What we see is that this intensification of selection around immune traits, and similarly the intensification around metabolic traits, has accelerated over this time period.</p><p>It&#8217;s not like natural selection has been at the same rate over all places and times. 
It&#8217;s increasing over the time period we&#8217;re analyzing. Plausibly the whole time period has increased compared to previous periods. We&#8217;re in a period of intensified selection. That&#8217;s not implausible, because this is a population that went through a huge shock in terms of the way people live and the culture. Almost everyone we&#8217;re analyzing are farmers or food producers in one way or another. Farming was invented for the first time anywhere in the world in the Middle East 11,000 or 12,000 years ago. The people who invented farming exploded into Europe after 8,500 years ago, spread across the continent, and expanded rapidly.</p><p>In the Bronze Age, there was an intensification of how people lived, with much higher population densities. People were living more and more next to their animals and getting their diseases, and exchanging their diseases with the animals and with each other. This is a period of rapid change in how people are living, resulting in different biological needs of this population. It&#8217;s not surprising, perhaps, that in the context of these dramatic changes, the biology of the population might not be ideally adapted.</p><p>There might be what some people call an <a href="https://en.wikipedia.org/wiki/Evolutionary_mismatch">evolutionary mismatch</a>, where you take a genetic variation that evolved in hunter-gatherers and put it into farmers or pastoralists, and it&#8217;s not exactly right. What you&#8217;re seeing is the DNA of this population, which descended from hunter-gatherers only 10,000 years ago, reacting to the shock of having been moved into an agricultural, Bronze Age, high-population-density, urban environment. A hypothesis is that what we&#8217;re seeing is the adaptation that occurs as a result.</p><p><strong>Dwarkesh Patel</strong></p><p>In the paper you have many examples of this intensification of selection around the Bronze Age. 
It might be helpful to go through some of these.</p><p><strong>David Reich</strong></p><p>One of the things we do in this work is look carefully at many of these positions in the DNA. We actually have an online browser called the <a href="https://reich-ages.rc.hms.harvard.edu/#/">AGES</a> browser, which Ali and a colleague of his&#8212;who&#8217;s a co-author of our paper&#8212;built. It allows you to query each of these 10 million positions and see the trajectories at each position and the evidence for selection.</p><p>One of the things we see is that, while for the most part the signals of natural selection we detect are consistent with constant natural selection over time, in a handful of them we&#8217;re able to see that there&#8217;s been a reversal or a radical change in natural selection. Very often that occurs in the period between 5,000 and 2,000 years ago, which is the Bronze Age and the <a href="https://en.wikipedia.org/wiki/Iron_Age">Iron Age</a>, a period of rapid population growth and rapid movement to intensive use of many technologies that were not used that way before.</p><p>An example of this is the <a href="https://en.wikipedia.org/wiki/Tyrosine_kinase_2">TYK2</a> genetic variant that is a major risk factor for severe tuberculosis, which is the most important infectious disease killer in the world today. If you look at this major risk factor for tuberculosis, this variant rockets up in frequency from 8,000 or 6,000 years ago to maybe 9% or 10% in this part of the world. Then it rockets down in frequency in the last 3,000 years. 
In both cases, there&#8217;s very clear evidence of natural selection, in the first case to increase in frequency, and in the next case to decrease in frequency.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!pamj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!pamj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 424w, https://substackcdn.com/image/fetch/$s_!pamj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 848w, https://substackcdn.com/image/fetch/$s_!pamj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 1272w, https://substackcdn.com/image/fetch/$s_!pamj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!pamj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png" width="400" height="378.31325301204816" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1884,&quot;width&quot;:1992,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:6769385,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!pamj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 424w, https://substackcdn.com/image/fetch/$s_!pamj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 848w, https://substackcdn.com/image/fetch/$s_!pamj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 1272w, https://substackcdn.com/image/fetch/$s_!pamj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc559c6e4-ed6a-4967-ab8b-108498446bf3_1992x1884.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A possible reason is the spread of tuberculosis. It maybe becomes endemic in the population 2,000 or 3,000 years ago. That&#8217;s potentially consistent with pathogen sequence data and other lines of evidence. And maybe this variant was protecting against something before then, but then tuberculosis became significant after that point, and it was so bad that it pushed in the opposite direction. That&#8217;s speculative.</p><p><strong>Dwarkesh Patel</strong></p><p>The thing it was protecting against was probably another disease?</p><p><strong>David Reich</strong></p><p>Maybe.</p><p><strong>Dwarkesh Patel</strong></p><p>One of the big takeaways for me from the paper was just that something weird happened in the Bronze Age. As you said, across trait after trait, the selection intensifies during the Bronze Age.</p><p>This makes sense for some things. 
For example, why do we see <a href="https://en.wikipedia.org/wiki/Lactase_persistence">lactase persistence</a>, where adults can process milk, intensify during this period? This is the time when we start using cattle not just for the meat, but also for milk and wool and other secondary products. So it makes sense why lactase persistence would matter more.</p><p>But then there are other things that seem like they should have been relevant since the dawn of agriculture. I forget the exact name of the allele, but was it <a href="https://en.wikipedia.org/wiki/FADS1">FADS1</a>, which helps convert plant fatty acids into long-chain fatty acids that your body needs? That&#8217;s obviously relevant when you move from a diet of meat as a hunter-gatherer to a diet of cereals.</p><p>That is also one I think you found was under especially high selection 5,000 to 3,000 years ago. So what&#8217;s going on? Why is the Bronze Age so special across all these different traits that you&#8217;re observing?</p><p><strong>David Reich</strong></p><p>So this FADS1/2 variant is a vegetarian/meat-eating adaptation. Already in work prior to this, Ian Mathieson, who worked with me in 2015, <a href="https://mathii.github.io/research/2015/12/14/fads1-selection-and-diet">identified this as a very strongly selected variant</a>. It&#8217;s actually ancient. You see copies in <a href="https://en.wikipedia.org/wiki/Homo">archaic humans</a> too.</p><p>One of the findings of our paper involves the <a href="https://en.wikipedia.org/wiki/ABO_blood_group_system">ABO blood group system</a>. You get your blood typed as A, B, or O. The B variant has increased up to 10% at the expense of A, but previous work has shown that A and B were both already present in the ancestor of humans and gibbons and other apes. Some of these mutations have been going back and forth and fluctuating over different time periods.</p><p>But we&#8217;re talking about changes in the Bronze Age. 
The TYK2 variant for tuberculosis risk, a multiple sclerosis risk variant, inflected and increased in frequency before the Bronze Age, and then 2,000 or 3,000 years ago reversed in that period. There are differences in Northern Europe where this process is super strong, very strong positive selection, very strong negative selection. And then in Southern Europe, only a little bit, and not even very strong negative selection.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AC-e!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AC-e!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 424w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 848w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 1272w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AC-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png" width="400" height="376.58829414707355" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1882,&quot;width&quot;:1999,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:6849258,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AC-e!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 424w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 848w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 1272w, https://substackcdn.com/image/fetch/$s_!AC-e!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff68968b7-cb1b-4aec-96e9-aa92363c3a30_1999x1882.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>For <a href="https://www.mayoclinic.org/diseases-conditions/hemochromatosis/symptoms-causes/syc-20351443">haemochromatosis</a>, which is pathogenic iron buildup that causes problems in Europe, that too has reversed around this period. In some of the complex traits that maybe we&#8217;ll talk about later, these traits too have periods of intensification of natural selection. For example, depigmentation: Europeans have gotten lighter skin over the last 10,000 years. You can see it in our data. 
The period of strongest depigmentation is between about 4,000 to 2,000 years ago, and then after that it&#8217;s much less.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mf17!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mf17!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 424w, https://substackcdn.com/image/fetch/$s_!mf17!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 848w, https://substackcdn.com/image/fetch/$s_!mf17!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 1272w, https://substackcdn.com/image/fetch/$s_!mf17!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mf17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png" width="400" height="382.7071538857437" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1859,&quot;width&quot;:1943,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:6481105,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mf17!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 424w, https://substackcdn.com/image/fetch/$s_!mf17!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 848w, https://substackcdn.com/image/fetch/$s_!mf17!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 1272w, https://substackcdn.com/image/fetch/$s_!mf17!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24868fa4-3404-40d9-be40-ea5ecf5f842f_1943x1859.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This seems to be a very impactful, eventful, important period where a lot of the processes we are seeing become very powerful. It&#8217;s surprising on first principles. You might think, before you walked into this genetic data, that the big change is going to be starting to grow plants and maybe farm animals. That happens in the <a href="https://en.wikipedia.org/wiki/Neolithic">Neolithic</a>, beginning 10,000-12,000 years ago, and spreads into Europe after 8,500 years ago. But actually, the intensification happens 5,000 years ago, 4,000 years ago.</p><p>It&#8217;s really interesting. This observation of that being an inflection point tells us something about when humans, at least in this part of the world, were wrenched into a way of living that was so different from how their hunter-gatherer ancestors lived that the organism had to adapt very strongly. 
It may be that the degree of that wrenching process moving into the Bronze Age was qualitatively greater than the degree of the wrenching process that happened from the initial transition to growing plants. That&#8217;s surprising, because our cartoon picture is that the big transition is farming. But the biological readout is saying our genome is reacting much more strongly to these events that happened 5,000 years ago.</p><p><strong>Dwarkesh Patel</strong></p><p>You did some work with Bhatia and many other colleagues in 2014 where you were <a href="https://www.broadinstitute.org/publications/broad6077">looking at 20,000 or 30,000 African American genomes today</a>. You were saying, &#8220;Look, there&#8217;s 80% West African DNA and then 20% European DNA. Can we look at their genomes today and see whether their allele frequencies are much different from what we&#8217;d expect from this admixture?&#8221; Correct me if I&#8217;m wrong, but you found that they weren&#8217;t.</p><p>That is to say, over 200 or 300 years of extremely intense environmental change&#8212;going from chattel slavery to a completely new environment&#8212;there&#8217;s no effect of natural selection. So we see episodes like this where we don&#8217;t see natural selection, but then the Bronze Age apparently must have had an even stronger effect, where the change in environment is even stronger than what we see from Africans in Africa being moved to the New World and living under slavery.</p><p><strong>David Reich</strong></p><p>That may be the case. It also may be the case that that period is just too short to see much effect. In the Bhatia et al. paper, where we looked at about 30,000 African Americans, we looked to see whether&#8212;instead of the average percentage of around 80% West African ancestry&#8212;there were some places in the DNA with significantly more than 80%, or significantly less than 80%. 
That&#8217;s what you would expect if there were natural selection for some genetic variant from Europeans or from Africans.</p><p>We didn&#8217;t see any place in the DNA that was significantly different from what you would expect by chance. One possible explanation is just that there are only a handful of generations, maybe five, over which natural selection would operate. So if the selection was 2% a generation, you would still only see a 10% compounded effect, and there&#8217;s just not enough time to detect it.</p><p>But the Bronze Age is not 300 years, it&#8217;s 3,000 years. It&#8217;s the power of compound interest, and you have enough time to begin to see a strong effect. This really, really does seem to be a very impactful time in terms of human history, and you can see it in our complex traits. Look at pigmentation, for example, which is the strongest signal of selection for a complex trait in our data set. You look at genetic mutations that are known to affect pigmentation. You add up their effects across all of the DNA; there are dozens or hundreds of them. 
You look to see when natural selection is strongest, and the time period is really 2,000 to 4,000 years ago.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!O58v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O58v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 424w, https://substackcdn.com/image/fetch/$s_!O58v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 848w, https://substackcdn.com/image/fetch/$s_!O58v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 1272w, https://substackcdn.com/image/fetch/$s_!O58v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O58v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png" width="400" height="343.39810662680617" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1723,&quot;width&quot;:2007,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:6308556,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O58v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 424w, https://substackcdn.com/image/fetch/$s_!O58v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 848w, https://substackcdn.com/image/fetch/$s_!O58v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 1272w, https://substackcdn.com/image/fetch/$s_!O58v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F02bced2d-7411-434e-b637-e9c80f47f423_2007x1723.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GMUp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GMUp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 424w, https://substackcdn.com/image/fetch/$s_!GMUp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 848w, 
https://substackcdn.com/image/fetch/$s_!GMUp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 1272w, https://substackcdn.com/image/fetch/$s_!GMUp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GMUp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png" width="400" height="263.030303030303" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:651,&quot;width&quot;:990,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:1193557,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GMUp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 424w, https://substackcdn.com/image/fetch/$s_!GMUp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 848w, 
https://substackcdn.com/image/fetch/$s_!GMUp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 1272w, https://substackcdn.com/image/fetch/$s_!GMUp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F16415ac1-a6b9-4f16-9493-32a84adb9375_990x651.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!7sD_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7sD_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 424w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 848w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 1272w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7sD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png" width="400" height="415.2336987882285" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1799,&quot;width&quot;:1733,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:5544426,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7sD_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 424w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 848w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 1272w, https://substackcdn.com/image/fetch/$s_!7sD_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8443ce39-b911-4a87-a437-48952ad9beeb_1733x1799.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>For some of these other traits as well, you see again that the time period over which selection is strongest is 2,000 to 4,000 years ago. For example, you can look at genetic variants that affect measures of cognitive performance, such as performance on intelligence tests in white British people today. This is of course a very strange trait to measure in the past because there were no intelligence tests and there was no school. 
But it is a predictor today, and you can look at how it&#8217;s changed in the past.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4EYW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4EYW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 424w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 848w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 1272w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4EYW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png" width="400" height="364.0848544647262" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1845,&quot;width&quot;:2027,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:6814076,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4EYW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 424w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 848w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 1272w, https://substackcdn.com/image/fetch/$s_!4EYW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ed72503-0342-44d8-992c-afa3db186ca5_2027x1845.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zQh9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zQh9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 424w, https://substackcdn.com/image/fetch/$s_!zQh9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 848w, 
https://substackcdn.com/image/fetch/$s_!zQh9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!zQh9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zQh9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png" width="400" height="468.0792891319207" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1712,&quot;width&quot;:1463,&quot;resizeWidth&quot;:400,&quot;bytes&quot;:4382420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zQh9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 424w, https://substackcdn.com/image/fetch/$s_!zQh9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 848w, 
https://substackcdn.com/image/fetch/$s_!zQh9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 1272w, https://substackcdn.com/image/fetch/$s_!zQh9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F93f40525-7a82-40c7-a49c-2b028fa270ba_1463x1712.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>We see very strong natural selection for this combination of genetic variants that predicts people&#8217;s performance on IQ tests 
and is also highly correlated to the predictor of the number of years of school or the household wealth of people. All crazy traits in the past because there was no wealth in the past, there was no school in the past. But if you look at the predictors today, there is a strong movement in a systematic direction, a large effect, about a standard deviation on the scale of modern variation.</p><p>We can do this trick of looking to see whether there are periods of time when this natural selection has occurred more intensely or less intensely. We drag a 2,000-year window through our data, and we repeat our whole analysis, not on 18,000 years, but just on a short 2,000-year window. We can measure the strength of selection in each of these 2,000-year windows. What you see when you look at intelligence is that this maxes out in the Bronze Age, between 5,000 and 2,000 years ago.</p><p>The impact in the last 2,000 years is almost nothing. There&#8217;s no evidence of natural selection at all. Your bias coming into this, my bias perhaps, might be that if there&#8217;s any signal of natural selection on this trait at all, that it would be unusually strong in the last 2,000 years. Maybe this is a time of industrialization. Maybe this is a time of greater need for this particular trait. But in fact, there&#8217;s no evidence of natural selection at all in the last 2,000 years. There&#8217;s very strong evidence between 2,000 and 4,000 years ago, where instead of a one standard deviation strength of selection, it&#8217;s a two standard deviation strength, averaged over this time period.</p><p><strong>Dwarkesh Patel</strong></p><p>The standard deviation here is how much the polygenic score for the trait itself moves?</p><p><strong>David Reich</strong></p><p>How much the polygenic score for the trait moves over a 10,000-year period within a population that is held constant in terms of its ancestry. 
What we&#8217;re actually doing is looking in our data set at a heterogeneous group of people. There&#8217;s Southern Europeans and Northern Europeans and hunter-gatherers and farmers. At different times in the past, those groups are more or less represented.</p><p>The whole strength of the methodology Ali Akbari developed is that it corrects for that changing ancestry over time. Really what&#8217;s being asked here is that we&#8217;ve divided up our whole data set into an archipelago of little populations in different places in space and time. We&#8217;re asking in each place in space and time: a little pocket of people in Britain from 4,000 years ago to 3,500 years ago, a little pocket of people in Hungary, a little pocket of people in Italy from 2,000 years ago to 1,500 years ago. In each of these places, where the ancestry is relatively similar without being too disrupted in that short period by migrations, we watch to see if the genetic changes blow in the same direction. We&#8217;re measuring the strength of selection at each point in time after correcting for the big population changes that have occurred.</p><p><strong>Dwarkesh Patel</strong></p><p>The effect here is huge then. One standard deviation above the median would be somebody in the 85th percentile. You&#8217;re saying the effect of selection has been so strong that comparing 10,000 years ago to now, the median has gone to the 85th percentile. 
That&#8217;s just a huge effect over the last 10,000 years on something like intelligence or the thing that predicts household income.</p><p>Especially given that this is only 2% of the change in allele frequencies, and the 98% is coming from migration&#8230; It&#8217;s stupendous to think about what the impact of migration is, if this alone is driving a standard deviation change in these kinds of qualities, at least among the kind of variation we see in the world today.</p><p><strong>David Reich</strong></p><p>One thing you can see in the data is that the migration impact is huge. For example, if you look at the trajectory for measures of cognitive performance&#8212;scores on intelligence tests in white British people today&#8212;but you look at the predictor of that in people in ancient times, the estimate for the hunter-gatherers of Europe is three standard deviations below the modern mean. So that&#8217;s hugely different.</p><p>Then you see a huge jump from them to the farmers, who are at the mean, at zero. That&#8217;s migration. What you&#8217;re seeing is that those two groups had different set points for those traits. And then the steppe pastoralists have a lower set value.</p><p>You see huge fluctuations in the predictor of this trait over time. That doesn&#8217;t prove selection. That&#8217;s just migration. But what our test is telling you is: in addition to those fluctuations due to migration, is there a consistent effect of natural selection blowing the trait in the same direction over all places and times? That&#8217;s what we&#8217;re detecting.</p><h3>00:35:02 &#8211; Why didn&#8217;t evolution max out intelligence?</h3><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s this theory called the <a href="https://en.wikipedia.org/wiki/Collective_intelligence">collective intelligence hypothesis</a>, which is the idea that selection for intelligence has actually been in the opposite direction. 
As society has developed, there&#8217;s been more specialization, and if there&#8217;s more specialization, each person only needs to understand a smaller and smaller part of the world. Therefore, the ancients were actually much smarter than us, and we&#8217;ve evolved down in intelligence.</p><p>Your results seem to point in the opposite direction. Although there hasn&#8217;t been selection in the last 2,000 years as society has gotten more complicated, at least when society began, there was more need for the kind of thing that predicts intelligence today.</p><p>The reason that&#8217;s surprising is, if you think about hunter-gatherers&#8212;reading your colleague <a href="https://en.wikipedia.org/wiki/Joseph_Henrich">Joseph Henrich&#8217;s</a> <a href="https://amzn.to/4f4NwEm">book</a>&#8212;the amount of information they needed to hold onto and assess, everything from how to process food, to how to build shelters, fire, et cetera, compared to my world, where I just need to know how to set up mics and ask questions&#8230; It seems like the demands on intelligence should have been way higher in the ancestral environment. So it&#8217;s very surprising that the beginnings of civilization increased the selection on intelligence.</p><p><strong>David Reich</strong></p><p>This is the power of data. I think if you asked Joe prior to this work what the hunter-gatherer selection would be and where their set point for this particular trait would have been&#8230; I think he probably wouldn&#8217;t have made a very strong prediction, but he would have said, &#8220;Maybe you would have expected it to have a high predicted value of this trait because these people were really having to do a lot of things and figure a lot of stuff out. Maybe once you have more complex societies, there would be more of a collective brain, and maybe there&#8217;d be selection against this trait.&#8221; In fact, it&#8217;s the opposite in some ways.</p><p>It&#8217;s the power of data. 
It&#8217;s not what you expect. It&#8217;s actually the value of data to try to make sense of all these things. It&#8217;s very interesting. The genetic predictor of intelligence, there are lots of things that are confusing about it, so it&#8217;s worth talking about. Or the <a href="https://www.york.ac.uk/news-and-events/news/2024/research/dna-powerful-predictor/">genetic predictor of years of schooling</a>, which is highly correlated to it and is measured even better.</p><p>If you look at the genetic predictor of years of schooling, there&#8217;s another <a href="https://pubmed.ncbi.nlm.nih.gov/28096410/">amazing study from 2017</a> from a group in Iceland that looked at this measure over the last hundred years in Iceland. It looked at older people and younger people born more recently. There&#8217;s an estimated 0.1 standard deviation decrease in the genetic predictor of intelligence in Iceland just within one century. It&#8217;s an absolutely huge effect over a short period. This is selection against years of schooling. If I said intelligence, I didn&#8217;t mean to. It&#8217;s selection against the genetic predictors of the number of years of school.</p><p>One possible interpretation of this&#8212;hand-wavy&#8212;is that what&#8217;s being measured here is not selection for years of schooling or for real intelligence, but for another trait altogether that&#8217;s correlated to both of them. For example, the predictor of the number of years of schooling is very strongly correlated to the age at which women have their first kid. If you control for that, all of the signal of years of schooling goes away. So maybe what you&#8217;re measuring is women&#8217;s decision about when to have children.</p><p>If you have children earlier, you don&#8217;t go to school as much. If you have children later, you go to school more. Maybe it&#8217;s some kind of measurement of delaying gratification or putting things off or planning. 
The same trait is correlated to body mass index, to obesity, and to walking pace. So is this really intelligence as we think about it, or is it something else that manifests itself differently at different times in the past?</p><p><strong>Dwarkesh Patel</strong></p><p>Obviously, a trait like years of schooling was not itself a meaningful thing in the past. The underlying things for it seem to have been under strong selection. Whatever in the genome predicts years of schooling seems to have been under strong selection. How should we think about this? What&#8217;s the actual thing that&#8217;s changing in the genome?</p><p><strong>David Reich</strong></p><p>There are two things going on that you need to think about. Years of schooling is connected to so many other things genetically. If you look at the genetic predictor of years of schooling&#8212;this trait has been measured in millions of people now&#8212;it&#8217;s correlated to really surprising things. It&#8217;s correlated to the age at which women have their first kid. It&#8217;s correlated to people&#8217;s obesity. It&#8217;s correlated to people&#8217;s walking pace. It&#8217;s correlated to people&#8217;s household wealth. It&#8217;s correlated to a variety of other traits that seem quite different from it.</p><p>If you think you&#8217;re actually measuring the genetic prediction of intelligence, or actual studiousness, you should think again because there are many things that it&#8217;s correlated to. There seems to be some kind of general trait that you could maybe think of as <a href="https://en.wikipedia.org/wiki/Executive_functions">executive function</a> or a propensity to defer gratification&#8212;I&#8217;m just waving my hands&#8212;that is under selection. 
It pushes all these traits in the same direction one way or the other, and at different times in the past, it&#8217;s advantageous or disadvantageous.</p><p>When we found this signal of the genetic propensity to go to school for more years as it manifests itself in white British people today, we were incredulous. How could this be? Maybe this is a problem. So we did a few tests to try to figure out whether this was real. One of the tests we did was that we looked for a study where this measurement of the number of years of school was done not in Europeans, but in Chinese people in China. We looked at the effect size of many variants as they affected the number of years of school in China, and we saw whether they had a correlation to the trajectory of those same genetic variants in Europeans over the last 10,000 years.</p><p>These are two parts of the world where the populations have been essentially completely disconnected. There&#8217;s no way by chance that the trajectory in Europeans over the last 10,000 years would have anything to do with the effect on years of schooling in China today. But there&#8217;s actually a huge statistical correlation, a five or six standard deviation correlation between the effect size of variants on the number of years of school in China today and the trajectory in Europe. It&#8217;s just as strong, actually, as the effect size of variants in Europeans on years of school to the trajectory in Europeans. We just could not see a way this could happen by chance. Once we saw that, we felt quite convinced that this was a real signal and that somehow there has been natural selection to increase the genetic changes that today manifest themselves as predicting more years of schooling.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understood, you&#8217;re looking at this ancient DNA in Europe. 
You&#8217;re saying it seems to predict years of schooling for modern people in Europe, or at least selection on that ancient DNA seems to predict more years of schooling in modern Europe. You also find that the same variants predict more years of schooling for Chinese people in China. So this is not just some weird artifact from the way these <a href="https://en.wikipedia.org/wiki/Genome-wide_association_study">GWAS</a> were done in Europe. These parts of the genome seem to robustly predict the kind of thing that actually leads to more years of schooling, at least in people today.</p><p><strong>David Reich</strong></p><p>Correct.</p><p><strong>Dwarkesh Patel</strong></p><p>Stepping back, I want to understand what this tells us about what actually changed in our environments over the last 18,000 years. We talked a little about what happened after the Bronze Age. We were talking about this during the collective intelligence part of the conversation. It&#8217;s surprising to me that things like intelligence, or lack of schizophrenia&#8212;things that just seem robustly good&#8212;were not maxed out before the Bronze Age.</p><p>The diversity among different populations was so big that you have the European hunter-gatherers having three standard deviations less predicted value for what they would score on an intelligence test if it existed. But they were existing in the real world in a place where intelligence matters.</p><p>How can it be that this was not a trait&#8230; You just look at the human body or any animal, and evolution has been acting on it so strongly to make it functional for the things it needs to do. 
And this one thing, which seems so relevant&#8212;especially to what human hunter-gatherers needed to do&#8212;doesn&#8217;t seem to have been under that strong selection in the <a href="https://en.wikipedia.org/wiki/Mesolithic">Mesolithic</a> or <a href="https://en.wikipedia.org/wiki/Paleolithic">Paleolithic</a> eras?</p><p><strong>David Reich</strong></p><p>I think that&#8217;s a great question. As we talked about before, selection is very effective. It can move the mean value of traits within hundreds or thousands of years in one direction or the other if that&#8217;s adaptive in a particular environment. So you might wonder, isn&#8217;t intelligence good in all contexts and places in time? There are a number of ways to think about that.</p><p>First of all, we are speaking from the point of view of a society which intensely values this particular trait, the ability to score well on IQ tests or things like them, or to go to school for a long time. I think it&#8217;s unprecedented in human history that we live in a time like this. If you look at the Hebrew and Christian Bible, and you look at how much intelligence is valued, it&#8217;s basically not at all.</p><p><strong>Dwarkesh Patel</strong></p><p>But when the Bible was being written, <a href="https://en.wikipedia.org/wiki/Old_Testament#Composition">especially the Old Testament</a>, that&#8217;s exactly when selection for intelligence is at the highest point it&#8217;s apparently ever been.</p><p><strong>David Reich</strong></p><p>Exactly. But there it&#8217;s about strength or courage or religiosity. Those are the values. If you read <a href="https://en.wikipedia.org/wiki/Homer">Homer</a> or the texts of other religions, it&#8217;s not intelligence. It&#8217;s beauty and other things. This value system which has a hyper-focus on smarts is not obviously a trait value that&#8217;s been common in the past. 
You might think that in certain communities there might be valuation of things that are more proximate to years of schooling. But really broadly, it&#8217;s not been a high value in the population.</p><p><strong>Dwarkesh Patel</strong></p><p>Obviously, the thing we care about is not direct performance on an IQ test, especially in the past. The thing I&#8217;m trying to understand better is intelligence more broadly. Maybe IQ-test intelligence is just not that correlated with, &#8220;Here is a new-world environment, go figure out how to process food there and make shelter and everything else.&#8221;</p><p>Your colleagues like Joseph Henrich have talked about how modern people underestimate the difficulty of doing this kind of thing with a small band of people. Maybe that&#8217;s not IQ-test intelligence, and that&#8217;s why we don&#8217;t see that strong a selection effect on this thing. But intuitively, regardless of the value system, it just seems very valuable to have this trait maxed out.</p><p><strong>David Reich</strong></p><p>I&#8217;m being very speculative. Let me give you two examples of how I&#8217;m thinking about this, not that I&#8217;m a particularly good authority on these things. As I mentioned, a lot of these traits, which are quite disparate, are highly correlated to each other. Obesity, years of schooling, walking pace, performance on IQ tests, household wealth, all these crazy traits seem to be governed to a substantial extent by a shared combination of genetic variants.</p><p>Let&#8217;s think about what this might mean. In Iceland in the last hundred years, there&#8217;s been selection against this combination of variants. One possible interpretation is that it&#8217;s basically selection for two ways of investing in your children: having many kids and not investing a lot in them, or having few kids and investing more in them. 
If you invest in deferring having kids, having more wealth, having more resources, and putting more into each kid, you&#8217;re going to have lower fertility and fewer kids. That&#8217;s going to result in lower fertility, but those kids might survive more and do better in society. Alternatively, you can just have as many kids as you can and invest less in them. They might individually have less good outcomes, but in a time of plenty&#8212;which is potentially Iceland in the 20th century&#8212;it might make sense to have more kids and invest less in them.</p><p>There&#8217;s a toggle between having more kids and investing less in them, and having fewer kids and investing more in excelling in various ways. You can imagine that at different times and in different places&#8230; In ecology, there are different ways. Mammals often invest a lot with a pregnancy and a small number of children, whereas fish will spawn huge numbers of offspring into the river, the great majority of whom will be eaten. But that is an effective way to produce offspring in certain conditions. So there will be a toggle depending on the environmental conditions back and forth between investing in large numbers of offspring with less investment, or smaller numbers of offspring with more investment. Maybe we&#8217;re just seeing that move back and forth over different places and times.</p><p>Similarly, for schizophrenia and bipolar disease, how could this ever be advantageous? Maybe what we&#8217;re seeing with these diseases is a readout of some spectrum of traits that in some contexts might be advantageous. Maybe being anxious, imaginative, or neurotic might be helpful in a shamanistic tradition or a religious tradition which values people who can have visions or be creative. Maybe these are subclinical versions of schizophrenia or bipolar disease that in certain times may be advantageous and in other times may be disadvantageous. 
You might just be seeing selection for different types of creativity or other thinking that can be valuable in different contexts.</p><p>I&#8217;m waving my hands here, but my sense is that these complex traits have not pushed in one direction because there are advantages to both ends of the spectrum, and there are multidimensional impacts of these different traits.</p><p><strong>Dwarkesh Patel</strong></p><p><a href="https://en.wikipedia.org/wiki/Julian_Jaynes">Julian Jaynes</a> has this famous theory in <em><a href="https://amzn.to/4dfaiag">The Origin of Consciousness in the Breakdown of the Bicameral Mind</a></em>. I&#8217;m butchering this, but fundamentally, the way I understand it is that up until Homer, basically everybody was schizophrenic. People genuinely thought that gods were real people that you were communicating with. His claim is that ancient texts seem to show people behaving in this way.</p><p><strong>David Reich</strong></p><p>You&#8217;re being asked to believe in visions. Even today, there&#8217;s valuation in some religious communities in communicating with God, having visions, and having supernatural communions. So I just don&#8217;t know.</p><p>But I think it&#8217;s super interesting to ask the question of why certain traits are not always advantageous. For schizophrenia and bipolar disease, there is a sense in which most of the mutations are disadvantageous. We can see that from the patterns of variation, where the variants that are risk factors tend to be low frequency and they tend to be small effects.</p><p><strong>Dwarkesh Patel</strong></p><p>So another trait you find under selection is the trend away from body fat since the <a href="https://en.wikipedia.org/wiki/Neolithic_Revolution">agricultural revolution</a>. 
Why is that?</p><p><strong>David Reich</strong></p><p>What you see is a reduction in the combination of genetic mutations that make you at risk for obesity, body mass index, and similarly very correlated to it, higher fat mass, higher waist-to-hip ratio, and higher type 2 diabetes risk. There is clear selection, by about a standard deviation on the scale of modern variation for these traits, reducing over the last 10,000 years in this part of the world.</p><p>What can be going on there? Why wasn&#8217;t there selection for this combination of traits before? There&#8217;s a longstanding idea known as the <a href="https://en.wikipedia.org/wiki/Thrifty_gene_hypothesis">thrifty gene hypothesis</a>. The idea is that once you have hunter-gatherer populations that move into a farming environment where there&#8217;s plentiful food, there is no longer a need to the same extent to be able to build up body fat to survive in times of stress, because there are more constant stores of food.</p><p>As a result, there will be natural selection against body fat once you move into an agricultural environment and into periods of food plenty. Maybe what you&#8217;re seeing is that this group of people in Europe and the Middle East over the last 10,000 years has moved into a period of relatively more stable food, where building up stores of fat is not as advantageous, and there&#8217;s been selection against this combination of traits. Europeans are actually relatively better protected genetically against type 2 diabetes than some other populations around the world, like African Americans and Native Americans, that have perhaps not been exposed to agriculture for as much time. So you may be seeing the effect of more exposure to more stable food accessibility.</p><p><strong>Dwarkesh Patel</strong></p><p>This is also another way in which the data go against a common story. 
The common story is that hunter-gatherers actually had much more stable diets because they were more varied, and they weren&#8217;t reliant on a single cereal or crop for their calories. If one game went away, they had other things they could scout for. They could move locations more easily because they weren&#8217;t tied down to the land. So they were more food-stable. But if there&#8217;s been selection against storage of body fat, that suggests that as unstable and as common as famines might have been in agricultural societies, it&#8217;s at least more stable than what the hunter-gatherers had.</p><p><strong>David Reich</strong></p><p>There&#8217;s a timescale issue. You&#8217;re absolutely right. As I understand it, I&#8217;m no anthropologist, when there&#8217;s a hunt in traditional societies or communities that hunt, people will often gorge themselves, eat a huge amount, build up a temporary store of fat, and then go multiple days without eating meat until the next hunt. There is this boom-and-bust access to high-value nutrition that is not true to the same extent in farming communities.</p><p>On the flip side, famines are something that occurs more commonly in agricultural societies, but the timescale and the tempo of them is very different from the hunting tempo. Maybe there&#8217;s a famine every three years. Indeed, if you look at the bones of farmers, at least in some communities, there&#8217;s more stress in them, maybe due to a famine every three years or every five years. But selection might not be acting on that three-year time period. Your fat store from the latest hunt is not going to carry you through to the famine three years later. Survival of famines is a different thing than building up body fat to be able to survive two weeks later.</p><p><strong>Dwarkesh Patel</strong></p><p>A random question I have. 
You were mentioning that compared to these other things which matter much more for fitness in the ancestral environment&#8212;the immune system, especially after the Bronze Age&#8212;all these other things have mattered more than intelligence. They&#8217;ve been under much more selective pressure than intelligence.</p><p>That makes you wonder whether there&#8217;s much more room at the top for intelligence. If humans had been selected especially for intelligence, they could have been much smarter. The reason that&#8217;s relevant is that we&#8217;re currently building AI systems, which we&#8217;re trying to make as smart as possible. In fact, the only goal of the training process is intelligence. We don&#8217;t have to worry about at the same time making their immune systems powerful&#8212;</p><p><strong>David Reich</strong></p><p>We have lots of energy to spend on it.</p><p><strong>Dwarkesh Patel</strong></p><p>And at the same time making sure they&#8217;re not schizophrenic. I guess we kind of do worry about that. But if intelligence has not been the dominant trait under selection for humans over the last 10, 20, or 100,000 years, does that mean there&#8217;s more room at the top for this trait?</p><p><strong>David Reich</strong></p><p>I think there&#8217;s more room at the top for a lot of these traits. You can move height extremely in one direction, much more than it is today. You can move any of these traits much more extreme in the other direction. There are probably very strong negatives to doing that. You&#8217;re probably sacrificing other things, and there are trade-offs. 
But it&#8217;s highly likely that if natural selection pushed any of these traits more in one direction than it is, the mean would move.</p><h3>00:57:21 &#8211; Evolution is limited by time, not population size</h3><p><strong>Dwarkesh Patel</strong></p><p>So all of this evolution since <a href="https://en.wikipedia.org/wiki/Recent_African_origin_of_modern_humans">&#8220;Out of Africa&#8221;</a> is acting on alleles that already existed in the pool of human variation from that first group we were talking about last time, on the order of 10,000 people, that exploded out of Africa. Is it surprising that across all these different traits, from cognitive profiles to disease resistance to height, that one pool of people contained so much latent variation that they could supply enough stretchiness to accommodate all of these different traits you&#8217;re studying now?</p><p><strong>David Reich</strong></p><p>That&#8217;s a rich question, and I think the human population has within it a tremendous amount of variation for complex traits. There&#8217;s a huge amount of variation that affects height. There&#8217;s a huge amount of variation that affects body mass index. If you take all these mutations and set them to the high-height variant, a person will be extremely tall, like as tall as a tall building. Of course, that will never happen.</p><p>But if you take all these variants that affect schizophrenia risk and you point them all in the same direction, there will be extreme risk or extreme protection for schizophrenia. For complex traits, ones underpinned by many mutations, all the variation already exists to move the population to a different adaptive set point that&#8217;s optimal in the environment it&#8217;s in.</p><p>If you push the population into a new environment, within hundreds or thousands of years, the population can rapidly move to a new adaptive set point. 
There are some unusual traits, like the <a href="https://en.wikipedia.org/wiki/Lactase_persistence">ability to digest cow&#8217;s milk</a> or protection against sickle cell anemia, that require a single very important mutation that may not yet exist in the population. You have to wait for the mutation to occur in some people. When the populations are relatively small, only 10,000 people, you might have to wait dozens or hundreds of generations for that mutation to arise. But when the populations are large, there&#8217;s no mutation limit anymore.</p><p>Every mutation that can occur does occur. There are eight billion people in the world. There are maybe 30 new mutations every generation, so that&#8217;s 240 billion new point mutations every generation. There are only three billion DNA bases in the genome, so every mutation that can occur does occur about 100 times every generation. We&#8217;re not mutation-limited anymore. The mutations can arise again. They do arise again. But when the population is only 10,000, you sometimes have to wait dozens or hundreds of generations for the new mutation to occur.</p><p><strong>Dwarkesh Patel</strong></p><p>How likely is it that the thing that changed with the Bronze Age is just that the human population was big enough? By 3000 BC, you go to a population of 50 million-ish people. The population is big enough, and the <a href="https://en.wikipedia.org/wiki/Gene_flow">gene flow</a> between different areas is high enough, such that things which don&#8217;t have an overwhelming selection coefficient, which aren&#8217;t overwhelmingly favored by evolution, are finally visible to selection.</p><p><strong>David Reich</strong></p><p>I think that&#8217;s not likely to be true, but it&#8217;s an extremely interesting thing to think about. Already when population sizes are on the order of a million or so, every mutation that can occur does occur within a few generations. 
That&#8217;s well before the Bronze Age if you take the population even of a place like Europe, but also of other places. Or maybe it&#8217;s at the dawn of the Bronze Age or the farming period. The question you&#8217;re asking is whether, when the population is small, maybe natural selection doesn&#8217;t work effectively.</p><p>A common thing people think about with natural selection, which is true, is that in small populations selection doesn&#8217;t work effectively. That&#8217;s because mutations bop around in frequency from generation to generation a lot in a small population, just randomly. If you have a population size of 1,000, mutations will bop around by a frequency of one over 1,000 every generation. If the selection coefficient is less than that, it will be drowned in the random bopping around of frequencies due to genetic drift. But that is already for a population of 1,000. A 0.1% selection coefficient is very weak. We&#8217;re talking about 1% effects, and that&#8217;s very strong. It will work very well even in a population of size 1,000 or 10,000.</p><p>If you are talking about mutations of the type that will start rising only in large populations but not small populations, those are selection coefficients on the scale of one over 10,000 or one over 100,000. Those will take 10,000 or 100,000 generations to rise in frequency, which is hundreds of thousands or millions of years. That&#8217;s not going to do anything over the timescale we&#8217;re talking about. There&#8217;s just a timescale issue. We&#8217;re talking about strong, measurable selection coefficients on the order of half a percent or more in this study. All of those are going to work in small populations or large populations. It&#8217;s not going to be affected by the population size.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. 
You&#8217;re saying that more generally, once you hit a given threshold of population, the dominant factor is time span, not population size.</p><p><strong>David Reich</strong></p><p>Correct. It&#8217;s very interesting, and it&#8217;s actually not widely understood.</p><p><strong>Dwarkesh Patel</strong></p><p>Speaking of data contradicting what you might have otherwise assumed, one of the papers you sent me beforehand, <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4943878/">Mallick 2016</a>, found that there are no fixed differences between modern and archaic humans 50,000 years ago. We know this is the period in which the so-called cognitive revolution happened, and modernity started, and <a href="https://en.wikipedia.org/wiki/History_of_art#Prehistory">people are making art</a>. Does this suggest that nothing biological changed to make modern humans modern? The thing that happened was some cultural change? How do we understand what this data tells us?</p><p><strong>David Reich</strong></p><p>Right. 100,000 to 50,000 years ago, there&#8217;s a quickening of the pace of change in culture. You see the first extensive representational art, bead necklaces, drawings on the wall, and a rapidly increasing pace of innovation in the types of tools that people use. The thought might be that there would have been some important genetic switch, a kind of important genetic change that occurred in the population and swept to high frequency that everybody soon had. That made it possible to do these things. Maybe some genes allowed people to have complex, representational language, for example.</p><p>One thing we did in 2016 in this paper by Swapan Mallick and colleagues was look across the DNA for places that might be expected to look like this, where nearly all people living today share a common ancestor maybe 100,000 or 200,000 years ago. 
We looked really hard, and right across all the DNA we could look at, we couldn&#8217;t find anything more recent than four or five hundred thousand years ago.</p><p>This is a crazy result because it looks like there are no key selective sweeps that have occurred in this period that are ancestral to everyone living today. We talked before about no selective sweeps between Europeans and East Asians, but there don&#8217;t even seem to be any selective sweeps shared between all humans in this really important period when a lot of evidence in the material culture record appears. It could be that there&#8217;s biological adaptation in this period, but it&#8217;s polygenic. There are lots of mutations that all shift in the same direction to help the population move to a new set point, but there&#8217;s no key biological change that rises to high frequency in this time.</p><p><strong>Dwarkesh Patel</strong></p><p>This group 50,000 years ago, are they the ancestors of everybody out of Africa or also some Africans?</p><p><strong>David Reich</strong></p><p>This is 100,000 to 50,000 years ago. This is the population that&#8217;s ancestral to West Africans, to most East Africans, to all non-Africans.</p><p>There are a couple of populations in Africa that have substantial ancestry coming from more divergent groups. For example, <a href="https://en.wikipedia.org/wiki/Khoisan">Khoisan</a> from Southern Africa or Central African rainforest hunter-gatherers have substantial fractions of their ancestry from groups that diverged maybe 200,000 years ago from the other lineages. But all of these groups today are able to go to college and do everything everybody else does. 
There is no evidence that there is any key mutation lacking in some groups that is not present in the others.</p><p><strong>Dwarkesh Patel</strong></p><p>The differences we see between different groups of people, especially if this group 50,000 to 100,000 years ago had a very small population size&#8230; I think last time we were discussing on the order of 10,000 people. So almost everybody in the world, or the variance we see between different humans today, was latent in this group.</p><p>I get your point that if you just stack up different things across the genome, stacking them up really has a big effect. But it&#8217;s interesting that we have so many different groups in the world today, and all that diversity comes from a very small population size.</p><p><strong>David Reich</strong></p><p>A lot of us in human genetics think that our population contains within it the clay that&#8217;s needed to make almost any trait. And that depending on environmental conditions or selection conditions, the mean value of these traits will move in different directions. There&#8217;s an empirical question about how much selection there&#8217;s been in different human populations over time.</p><p>One of the things this new work we&#8217;re involved in is showing is that at least in the last 18,000 years in this part of the world, there has been significant movement, at least for a handful of important traits. We looked at more than 500 traits. About 100 complex traits showed significant movement in a systematic direction over this time period. It really does seem that there is a response to the environments people are living in that has occurred over this period, and that is potentially stronger than in previous periods.</p><h3>01:09:02 &#8211; Why no farming before the Ice Age?</h3><p><strong>Dwarkesh Patel</strong></p><p>We were talking earlier about how there are no fixed differences between humans 30,000 years ago and humans today. 
So if there&#8217;s no genetic basis for the kind of thing that allowed humans to have more symbolic representation, have farming, et cetera&#8212;I think I asked you this question last time we talked, but especially with this context&#8212;why no farming before the <a href="https://en.wikipedia.org/wiki/Last_Glacial_Period">Ice Age</a>? Genetically we were there.</p><p><strong>David Reich</strong></p><p>That is such an interesting question. Genetically we&#8217;re there. The common ancestral population has all of the ingredients for farming 50,000 years ago. These people are distributed into different parts of the world: the Americas 15,000 years ago or whatever it is, New Guinea 40,000 years ago, East Asia, Europe, West Africa. No farming developed before 11,000 or 12,000 years ago. It only developed in the last 12,000 years, the period known as the <a href="https://en.wikipedia.org/wiki/Holocene">Holocene</a>, which is the end of the Ice Age.</p><p>If you talk to climate scientists and archaeologists&#8212;I keep asking people this question every time I meet someone who&#8217;s an expert in this&#8212;how can it be that farming develops in all these places? Are we really living in such an unusual time? People tell me, indeed, we&#8217;re living in a very unusual time on a scale of two million years. That is, 12,000 years ago we switched into this period of not just warmth, but climate stability.</p><p>It&#8217;s hard to believe that we&#8217;re living in such a special time. But if you look at data from the bottoms of ponds where you can measure the fluctuations of temperatures using isotopic signatures, apparently we&#8217;re in a period where it&#8217;s fluctuating a lot less year to year, 10 years to 10 years, and 100 years to 100 years. 
It&#8217;s a period of relative stability that we are miraculously living in.</p><p>When this period of relative stability happens, it follows that multiple groups independently turn to agriculture, even though they all have the same genetic complement that arose 50,000, 100,000, 200,000, 300,000 years ago. It&#8217;s a crazy observation that people just accept, but it&#8217;s unbelievable.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, so you increased the range there. You said 100,000, 200,000, 300,000 years ago. Based on the genetic differences between modern people and people from 300,000 years ago. Do you basically think they&#8217;re modern 300,000 years ago?</p><p><strong>David Reich</strong></p><p>I don&#8217;t know. This is actively what I&#8217;m thinking about all the time right now. There&#8217;s a big transformation in terms of the culture of humans 300,000 or 400,000 years ago: this invention of <a href="https://en.wikipedia.org/wiki/Levallois_technique">Levallois technology</a>, the ability to make stone tools out of cores.</p><p>The <a href="https://en.wikipedia.org/wiki/Middle_Stone_Age">Middle Stone Age Revolution</a>, or the <a href="https://en.wikipedia.org/wiki/Middle_Paleolithic">Middle Paleolithic Revolution</a> depending on what you call it in Africa or Eurasia, is a new way of making stone tools that&#8217;s shared by <a href="https://en.wikipedia.org/wiki/Neanderthal">Neanderthals</a> and by modern humans, but is not shared in East or South Asia. It&#8217;s a big change, and it presumably involves a cognitive change in order to make this sort of technology.</p><p>Then there&#8217;s a further change to the <a href="https://en.wikipedia.org/wiki/Upper_Paleolithic">Upper Paleolithic Later Stone Age</a>, maybe 100,000 to 50,000 years ago, when there&#8217;s a second transition with a new type of tool making, but it&#8217;s not as revolutionary as the earlier one. 
So when the cognitive leap happens is unclear.</p><p>The diversification of the lineages leading to people living today, like Khoisan Southern Africans and rainforest hunter-gatherers, all occurs more on the timescale of 300,000 or 200,000 years. All of these people are capable of going to college and doing everything. So it&#8217;s not obvious that the cognitive toolkit, the behavioral toolkit, and the genetic abilities were not all in place 200,000 or 300,000 years ago, and that even Neanderthals had them. It&#8217;s not obvious that this was not the case.</p><p>I just don&#8217;t know. You distribute these people descended from this diversification that happened 200,000 or 300,000 years ago to different parts of the world, and then after 12,000 years ago, you start having agriculture popping up in different places. It&#8217;s an outstanding mystery of human history. I find it unbelievable that we live in a time period that climatologically is so unique on a scale of two million years, but my colleagues tell me it&#8217;s true.</p><p><strong>Dwarkesh Patel</strong></p><p>The climate thing seems surprising given there are so many different environments in which agriculture was independently developed. I understand that across environments the variance could have gone down. If it had only happened in one place at one time, I could have bought that explanation. But the fact that they&#8217;re making maize in the New World and they&#8217;ve got cereals in the Old World in very different environments makes it surprising.</p><p><strong>David Reich</strong></p><p>It&#8217;s very, very surprising. We accept it, but it&#8217;s a crazy observation that most normal people don&#8217;t realize.</p><p>The thing that basically everybody accepts is that the common ancestral population of almost everybody in the world, except for rainforest hunter-gatherers and Khoisan, is around 70,000 years ago. 
Everybody accepts that these people all have in place the cognitive, behavioral, and intellectual ingredients that are necessary for the farming revolution and building state societies. Because when these descendants get distributed to West Africa, East Africa, the Americas, Europe, South Asia, East Asia, New Guinea, and so on, their descendants all do this. They do it independently, semi-independently, or demonstrably completely independently in all these different parts of the world.</p><p>The cognitive resources for doing this must have all been in place, but it&#8217;s a very long fuse. It delays for 40,000 or 60,000 years in all these different places after the common ancestral population splits up, and then ignites into agriculture and all these other things after that point.</p><p>It&#8217;s a crazy claim. Then you could argue about whether the actual fuse is 300,000 years, from when Neanderthals separated and from when different lineages of extant modern humans separate, and that&#8217;s also plausible. It&#8217;s a crazy set of things that we&#8217;re being asked to believe.</p><p><strong>Dwarkesh Patel</strong></p><p>Is it possible that agriculture existed, but you didn&#8217;t have modern metallurgy or whatever it was that allowed populations to explode starting in 5000 BC with the Bronze Age? Population-wise, it doesn&#8217;t seem like much is happening from 10,000 BC to 5000 BC in the early Neolithic. Is it possible that they had farming but they didn&#8217;t have copper or tin, which you needed to go to the Middle East for, to develop a civilization that could make use of bronze at a large scale, and so they just disappeared from the historical record?</p><p><strong>David Reich</strong></p><p>I think we would see their archaeology. 
There are extraordinary developments in the Americas which are entirely Stone Age.</p><p><strong>Dwarkesh Patel</strong></p><p>You would see them today even if they had completely vanished?</p><p><strong>David Reich</strong></p><p>Oh, yeah. We should go for a trip to Teotihuac&#225;n in Mexico. It&#8217;s so impressive. When I went there when I was 20, it was totally as impressive as ancient Egypt. It&#8217;s huge. It&#8217;s massive. It&#8217;s without metal.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s even more impressive because it&#8217;s not only without metal, but without animals and without wheels, which is crazy. The marble is just hauled without wheels.</p><p><strong>David Reich</strong></p><p>Right. Take any person who has an Old World superiority and take them to these places, and they will not have it anymore. It&#8217;s just extraordinary what&#8217;s in these places. These are people who separated 20,000 years ago at least from the ancestors of East Asians and 40,000 years ago from the ancestors of West Eurasians.</p><p>They just had the same shared biological and cultural toolkit from then, but there&#8217;s a long fuse delay until all this stuff happens. It&#8217;s an amazing thing, and we don&#8217;t question it.</p><h3>01:17:13 &#8211; The Neanderthal puzzle David can&#8217;t stop thinking about</h3><p><strong>Dwarkesh Patel</strong></p><p>What are other questions you are either investigating right now or want to investigate, these kinds of big picture questions of human history?</p><p><strong>David Reich</strong></p><p>I&#8217;m perplexed. I don&#8217;t know if we talked about it before, but I remain very confused about the relationships between archaic and modern humans.</p><p>We have genome sequences now from archaic humans who lived in Europe, West Eurasia, and Central Eurasia, the Neanderthals. 
We have archaic sequences from these enigmatic <a href="https://en.wikipedia.org/wiki/Denisovan">Denisovans</a>, who we now have a skeleton for since we last talked. There&#8217;s now a <a href="https://en.wikipedia.org/wiki/Harbin_cranium">skull that&#8217;s been shown to be a Denisovan</a>. We have data from lots of modern humans, and there are really big mysteries about the relationships amongst these groups.</p><p>Genetically, the Denisovans and the Neanderthals are sisters. They descend from a common ancestral population 500,000 or 600,000 years ago. That group descends 700,000 or 800,000 years ago from the common ancestors of modern humans. Genetically, the whole genome data says that Neanderthals and Denisovans are archaic humans from a common ancestral archaic population.</p><p>But there are so many things shared between Neanderthals and modern humans that don&#8217;t seem to be shared with East Asians. They both share Middle Stone Age stone tools, Levallois technology, this cognitively unique way of making stone tools that wasn&#8217;t used in East Asia. They both have the same <a href="https://en.wikipedia.org/wiki/Mitochondrial_DNA">mitochondrial DNA</a> and <a href="https://en.wikipedia.org/wiki/Y_chromosome">Y chromosome</a> sequence.</p><p>The Y chromosome sequence of Neanderthals and the mitochondrial DNA of Neanderthals are actually modern human sequences that came through interbreeding 200,000 or 300,000 years ago and then shot up to 100% frequency. 
Neanderthals and modern humans are both the <a href="https://en.wikipedia.org/wiki/Interbreeding_between_archaic_and_modern_humans">product of mixture events</a> that happened between archaic and modern humans 300,000 or 200,000 years ago, demonstrably through patterns of variation in ancient and modern DNA.</p><p>It feels that there&#8217;s something shared between Neanderthals and modern humans that&#8217;s not shared with Denisovans, even though the vote of the whole genome says that Denisovans and Neanderthals are related. One wonders whether there&#8217;s something connecting Neanderthals and modern humans that&#8217;s different from Denisovans, even though genome-wide, Denisovans and Neanderthals cluster. I&#8217;m thinking about that all the time now.</p><p><strong>Dwarkesh Patel</strong></p><p>Connecting them would be interbreeding events or being in the same place at the same time that we missed?</p><p><strong>David Reich</strong></p><p>There&#8217;s a <a href="https://www.science.org/doi/10.1126/science.aea6774">known interbreeding event</a> from the lineage leading to modern humans into Neanderthals, but it&#8217;s supposed to be only 5%.</p><p>I&#8217;m interested in the possibility that that 5% is actually a sign of something much more impactful, that somehow Neanderthals are in some sense deeply modern in some ways, and even though they get swamped by archaic genes, they actually have more of a modern impact than one would think. The Middle Stone Age and Middle Paleolithic Revolution that they share with modern humans is more fundamentally a part of who they are, in some sense, than we think.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. 
Sorry, when was this interbreeding event?</p><p><strong>David Reich</strong></p><p>300,000 to 200,000 years ago.</p><p><strong>Dwarkesh Patel</strong></p><p>So the common ancestor between Neanderthals and most humans alive today is potentially more recent than the common ancestor between all humans alive today.</p><p><strong>David Reich</strong></p><p>Oh, for sure.</p><p><strong>Dwarkesh Patel</strong></p><p>Which is crazy.</p><p><strong>David Reich</strong></p><p>Well, the divergence to all the archaic humans, including Denisovans, is within human variation.</p><p><strong>Dwarkesh Patel</strong></p><p>Wait, what?</p><p><strong>David Reich</strong></p><p>Yes. The average time to the common ancestor of any two human genes is one or two million years ago. If you look at the copy of chromosome 3 you get from your mother and the copy of chromosome 3 you get from your father, the typical time they share a common ancestor is one or two million years ago. That&#8217;s before the split from Neanderthals and Denisovans. So there are many places in your DNA where you&#8217;re more closely related to a Neanderthal on your mother&#8217;s side than you are to your father.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m sure there&#8217;s a simple explanation, but how?</p><p><strong>David Reich</strong></p><p>It&#8217;s the same reason that if you have a sister, in some places in your DNA you&#8217;re more closely related to her than you are to me because you share a parent. But in other places you&#8217;re more closely related to me than you are to your sister because you happen not to share the same DNA from your parents.</p><p>It&#8217;s just that the DNA we get from our common ancestral population was already quite variable 500,000 years ago, 700,000 years ago, a million years ago, and some of us descend from some of those ancestors and others descend from other of those ancestors. 
Neanderthals split from our lineage really close in time on human evolutionary timescales, such that in some places in our DNA we&#8217;re more closely related to Neanderthals than to each other.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. What are the other big questions?</p><p><strong>David Reich</strong></p><p>That&#8217;s the main thing I&#8217;m thinking about a lot these days. I continue to be very obsessed with questions about the spread of human populations around the world and trying to reconstruct that with ancient DNA.</p><p>The thing I&#8217;ve been thinking about a lot recently is the possibility that maybe we&#8217;re not thinking in the right way about the relationship between archaic and modern humans. The standard model is one where Denisovans&#8212;these archaic humans that were found from ancient DNA&#8212;and Neanderthals descend from a common ancestral population 500,000 or 600,000 years ago, and these two separate earlier, maybe 700,000 to 800,000 years ago, from the ancestors of modern humans, people like us. That&#8217;s the big result of a lot of studies since 2010.</p><p>But there&#8217;s also evidence of an interbreeding event that happened maybe 200,000 to 300,000 years ago that resulted in modern humans contributing DNA to the ancestors of Neanderthals. So maybe 5% of the DNA of Neanderthals comes from this interbreeding event, and a lot of studies have shown this. I&#8217;m very interested in this because, from the archaeological record, Neanderthals and modern humans look quite similar to each other, much more similar to each other than either does to Denisovans, these archaic humans in East Asia.</p><p>For a lot of history, people have thought that Neanderthals are our sisters. But in 2010, the sequencing of the Denisovan genome made it very clear that on average, <a href="https://www.nature.com/articles/nature09710">Denisovans are closer to Neanderthals than to modern humans</a>. This was a very confusing result.
Most people now think that Neanderthals and Denisovans descend from a common ancestral population that separated earlier from the ancestors of modern humans.</p><p>I&#8217;m interested in the possibility that the right way to think about Neanderthals is actually as somehow culturally modern humans, even though genetically they&#8217;re mostly Denisovans. The model I&#8217;m thinking about is motivated by this archaeological phenomenon known as the Middle Stone Age Revolution. If this is Africa and this is Europe, we know that the new way of making stone tools&#8212;with cores that were very carefully mined far away from the locations they were used, made out of high-quality stone like flint&#8212;starts being used 300,000 or 400,000 years ago, first in the Caucasus, places like Georgia today, or East Africa.</p><p>This way of making stone tools is quite revolutionary. It is known in Europe as the Middle Paleolithic and in Africa as the Middle Stone Age, and is associated with much more widespread use of fire and moving stone around at much further distances than before.</p><p>I&#8217;m interested in the idea that this is something shared between modern humans and Neanderthals. There&#8217;s somehow some shared cultural feature that&#8217;s absent in East Asia, and that might have a relationship in the genetic data and is somehow related to this 5% DNA.</p><p>The idea I&#8217;m interested in is the possibility that there is a population here that invents the Middle Stone Age and the Middle Paleolithic, sometimes called Levallois technology, and that people from this population expand into Europe and mix with the local archaic humans who are there. That is what this 5% interbreeding event is. It happens 200,000 to 300,000 years ago. 
It produces a group that, as it expands across this landscape in Europe, mostly picks up the local DNA and becomes mostly archaic genetically, but retains its modern human culture, the way of making stone tools and some of its traditions.</p><p>One of the things that&#8217;s super interesting about this is that if you actually look at the genetics, across the whole genome, Neanderthals and Denisovans cluster. But if you look at the mitochondrial DNA&#8212;which humans and Neanderthals get from their moms&#8212;Neanderthals and modern humans cluster. If you look at the mitochondrial DNA, Denisovans and modern humans share an ancestor well more than 700,000 or 800,000 years ago, as you&#8217;d expect from the history. If you look at the Y chromosome that you get from your dad, Denisovans and modern humans share an ancestor more than 700,000 or 800,000 years ago, which is consistent with this history.</p><p>But if you look at the Neanderthal mitochondrial DNA, it&#8217;s only 300,000 to 450,000 years. If you look at the Y chromosome, it&#8217;s only 300,000 to 450,000 years. What the current genetic work is asking us to believe is that even though this is only 5% of the whole genome, it introduces mitochondrial DNA and Y chromosomes, and they jump up to 100% frequency. It&#8217;s kind of a crazy claim because the probability of this occurring by chance is low, maybe 5% times 5%, a very small number. It&#8217;s what we actually all believe, but it&#8217;s a very surprising event. Somehow it&#8217;s accreted to all the findings in the literature so that we make ourselves believe this, but it seems unlikely on first principles that somehow only 5% would introduce both the Y chromosome and mitochondrial DNA.</p><p>And it really does look like this. There&#8217;s amazing data from a site in Spain that&#8217;s 300,000 to 400,000 years old, called <a href="https://en.wikipedia.org/wiki/Sima_de_los_Huesos_hominins">Sima de los Huesos</a>. 
They have a nuclear genome that looks Neanderthal-like for most of the genome, but their mitochondrial DNA and Y chromosome are Denisovan-like. So it really looks like there was a population related to modern humans that pushed into this Sima de los Huesos-like population, displaced its mitochondrial DNA and Y chromosome, but kept the rest of its genome. It really looks like something like this happened.</p><p>The idea I&#8217;m playing with&#8212;and probably it&#8217;s wrong, who knows&#8212;is that there&#8217;s a landscape&#8230; This is Europe and you can break up into a hundred or so <a href="https://en.wikipedia.org/wiki/Deme_(biology)">demes</a>, little areas. Modern humans get introduced at the bottom right corner, in the Middle East or somewhere, and they spread into Europe. As this population spreads, there&#8217;s a wave front of expansion, and they&#8217;re interacting with the local archaic humans.</p><p>The theory from simulations and studies of different species like mammals and birds shows that even if there&#8217;s a small amount of interbreeding&#8212;when there&#8217;s an invasion or a movement of expansion of one group into the territory occupied by another&#8212;there&#8217;s massive introgression of local genes. The pioneers at the wave front will sometimes interbreed with the local population. There are so many of them around that their DNA will get swamped by the local group, so by the time they make it to the other side, they&#8217;re largely local.</p><p>Maybe this is what we&#8217;re seeing. You have a modern human population that&#8217;s matrilineal, for example, where the transmission of making stone tools this way is happening from mother to child. That&#8217;s why they&#8217;re retaining their mitochondrial DNA, but by the time they get to the other end of Europe, they&#8217;re mostly local archaic. You end up with a 95% population replacement. 
This would explain why the mitochondrial DNA is shared between Neanderthals and modern humans, and it would also explain why the mixture proportion is only 5%.</p><p>The really interesting thing is that there&#8217;s other evidence from studies of modern humans showing that modern humans are also admixed. The right way to think about this is that modern humans are a mixture of two groups that diverged maybe 1.5 million years ago, and they come together 200,000 to 300,000 years ago with maybe 20% ancestry from this archaic African group and 80% ancestry from this early modern lineage. And that same group then mixes with Neanderthals, and it&#8217;s 5% modern and 95% local.</p><p>So you actually have this key population that makes the Middle Stone Age or Levallois technology. It appears and expands in all directions&#8212;into Europe and into Africa 200,000 to 300,000 years ago&#8212;bringing this technology, new ideas, and perhaps some genetic adaptations. It expands into archaic humans in Europe, mixes with the local population, and gets 95% replaced but still retains its cultural features and maybe some genetic features.</p><p>It expands in Africa too, but here it&#8217;s not 95% replaced, it&#8217;s only 20% replaced. Probably the reason is that this group is much more diverged. It&#8217;s 1.5 million years diverged rather than 700,000 to 800,000 years. As a result, there are many more genetic incompatibilities and barriers to gene flow. But there&#8217;s still a lot of mixing, maybe 20%, and <a href="https://www.biorxiv.org/content/10.64898/2026.03.11.711219v1">we have evidence that this is a big mixture event</a>.</p><p>So what you&#8217;re actually seeing is a modern human expansion both into Europe and into Africa. In one place, it forms Neanderthals. In another, it forms the ancestors of everybody living today. 
But all of these groups are descended from this key revolutionary event that happens here.</p><p>We often talk about the revolutionary events 50,000 to 100,000 years ago, the more symbolic behavior and so on, that first appears in Africa and the Middle East and spreads beyond. But there&#8217;s also this earlier event, and it is contemporaneous with the breakup of all the different groups in Africa today, the Khoisan Southern Africans and the Central African rainforest hunter-gatherers. One wonders whether this is an equally important formative event.</p><p>If that&#8217;s true, it makes you think of Neanderthals as somehow our cousins. They share a Y chromosome, they share mitochondrial DNA, they share formation in this 200,000- or 300,000-year-old event, and they share a toolkit. Even though the genome is telling us they&#8217;re cousins of Denisovans, the correct way to think about them may, in an important sense, be as close cousins of modern humans.</p><p><strong>Dwarkesh Patel</strong></p><p>I have so many questions. Do you have 15 more minutes? First of all, what is going on with this group of archaic Africans 1.5 million years ago? Where in Africa are they, and what happens to the portion of them that don&#8217;t form modern humans? Do they survive?</p><p><strong>David Reich</strong></p><p>This is not from ancient DNA, but from analysis of modern DNA from different people, mostly in Africa, but also non-Africans. In <a href="https://pubmed.ncbi.nlm.nih.gov/40102687/">multiple studies</a>&#8212;there are at least three, maybe four or five studies that I know about&#8212;they have looked at the patterns of variation in people today and said that the data in modern people today, including in Africans, is not consistent with a homogeneous population.</p><p>It looks like a population that split well more than a million years ago into multiple groups&#8212;at least two, but maybe many&#8212;and then came together a few hundred thousand years ago.
The papers have different models that they fit, but they all have this feature of a split-up more than a million years ago, and then on the order of a few hundred thousand years ago, a coming together and a remixture event forming the ancestors of anatomically modern humans.</p><p><strong>Dwarkesh Patel</strong></p><p>This includes the Khoisan and whatever other groups?</p><p><strong>David Reich</strong></p><p>Yes. All of these groups have this, maybe in slightly different proportions. So you ask, where are these people living? Who knows. In this scenario, the 80% is coming from the Caucasus or Northeast Africa, where the Middle Stone Age forms. It&#8217;s from this population that the Middle Stone Age comes. They mix with local groups, and who knows where they are: Southern Africa, Western Africa, Central Africa, Eastern Africa.</p><p>We don&#8217;t have any ancient DNA, but this is a very rich environment. People have been living there for seven million years at least, and there would have been different groups of people everywhere. Probably it&#8217;s not just two groups, it&#8217;s probably more.</p><p>The important theme here is there&#8217;s evidence of substructure that&#8217;s well more than a million years old. This place would have been a landscape full of archaic humans that would have been differently related to these expanding people and would have admixed with them when they came through.</p><p><strong>Dwarkesh Patel</strong></p><p>So with the Neanderthals, the first time around 300,000 years ago, our ancestors share culture with them. They share the Middle Stone Age technology, but they don&#8217;t replace the population. The technology spreads through culture, basically.</p><p><strong>David Reich</strong></p><p>It spreads through genes too. If you look at Yamnaya in India, there&#8217;s almost no Yamnaya ancestry in India. It&#8217;s just diluted down. 
As Yamnaya expanded into Central Asia and into Europe, it makes the <a href="https://en.wikipedia.org/wiki/Corded_Ware_culture">Corded Ware</a>. There&#8217;s a 25% dilution. It expands back across Central Asia. It goes through the Hindu Kush and gets into northern South Asia. It admixes more with local people. Today, the most Yamnaya ancestry you see in India is 20% or 10%. Most people have less than 10% or 5%.</p><p>There&#8217;s been a lot of mixture on the way, but it is the tracer dye. It tracks Indo-European languages, and important aspects of Indo-European culture are coming through Yamnaya. So if you know where to look, that tracer dye is only 10%, 5%, or 2% in some groups. But it&#8217;s the languages people speak, and important shared cultural elements, that connect them to people on the other side of the Indo-European-speaking world. So this 5%, you shouldn&#8217;t sneeze at it. That&#8217;s tracing something important in this model.</p><p><strong>Dwarkesh Patel</strong></p><p>I understand that if things are transmitted more through women&#8230; Sorry, let me back up. I don&#8217;t understand why the maternal mitochondrial DNA and the Y chromosome would be especially privileged as the spreading is happening. Can you explain that?</p><p><strong>David Reich</strong></p><p>The reason I&#8217;m talking about these matrilineal or patrilineal expansions is that I&#8217;m really troubled, and have been troubled for many years&#8212;especially in the last three or four years&#8212;by the fact that the mitochondrial DNA and Y chromosome cluster Neanderthals and modern humans, but the rest of the genome clusters Neanderthals and Denisovans.</p><p>This is a crazy result that is not seen in any other species. I&#8217;m very interested in patterns that would explain it. 
If you assume that there was a matrilineal or a patrilineal expansion&#8212;it could be either&#8212;then modern humans, when they were expanding across the landscape of Europe, retained their identity along one of the lines.</p><p>If it&#8217;s matrilineal, when they incorporate a male from the local community, he&#8217;s brought into the community, and the kids are raised based on the culture of the mothers. If it&#8217;s a patrilineal expansion and they incorporate a female from the community, she&#8217;s brought into the community, and the kids are raised with the culture of the fathers. If that happens, it guarantees that one of these two parts of the genome looks like it does, because it&#8217;s a modern human expansion. If it&#8217;s patrilineal, it will retain the Y chromosome. If it&#8217;s matrilineal, it will retain the mitochondrial DNA. So it will solve one of your two problems.</p><p><strong>Dwarkesh Patel</strong></p><p>But not both.</p><p><strong>David Reich</strong></p><p>It won&#8217;t solve the other one, so you need to solve the other one. You can solve it either by natural selection or by <a href="https://en.wikipedia.org/wiki/Social_selection">social selection</a>. By the way, patrilineality and matrilineality are the rule, not the exception, in human communities. Usually communities have continuity along the male or the female line. Usually it&#8217;s patrilineality, sometimes it&#8217;s matrilineality.</p><p>You can also have phenomena like social selection. It could be that once you have kids of someone whose father, for example, is from the outside community&#8230; Usually in most communities, females all reproduce. That&#8217;s typical today. Usually women have kids if they can. But men in traditional societies are actually very variable in their reproductive success. A large fraction of men never have kids. Then there&#8217;s a subset of men who have many kids with many women.</p><p>There&#8217;s competition among men for kids.
In this context, where males are competing for access to females, female mate choice begins to be an important process. You have a phenomenon where if your dad is an archaic male, it could be the case that you&#8217;re not going to be as successful in the competition for local females as if your dad is a non-archaic male.</p><p>Some simple social phenomenon like that could explain the data, and we actually see this in human society. For example, if I remember right, in Central African rainforest hunter-gatherers, there&#8217;s different treatment of boys and girls depending on whether their dad or mom is one group or the other.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess I don&#8217;t understand how the maternal&#8230; The group spreads, and it gets to the next front. They have kids. From the humans that have just entered, the kids will have the mitochondrial DNA from the humans. But from the existing people, they will have the mitochondrial DNA of the archaic humans. Why are the people with the archaic mitochondrial DNA not surviving?</p><p><strong>David Reich</strong></p><p>It&#8217;s a question. There are multiple possible explanations, but it&#8217;s much easier to explain that than both the mitochondrial DNA and the Y chromosome. One possibility is that the mitochondrial DNA was less biologically fit. Another possibility is that there&#8217;s social discrimination against people based on whether their parents are archaic or not, which is not at all surprising in a human context. It&#8217;s the weakest link in this argument. This argument&#8217;s probably wrong, but I&#8217;m just telling you what I&#8217;m thinking about.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, the Neanderthals. So 300,000 years ago our lineage interacts with them, but mostly their lineage survives, and there&#8217;s cultural and genetic diffusion. 
And then is it 70,000 years ago that we interact again?</p><p><strong>David Reich</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>And they don&#8217;t survive.</p><p><strong>David Reich</strong></p><p>The genetic ancestry doesn&#8217;t survive.</p><p><strong>Dwarkesh Patel</strong></p><p>The genetic ancestry doesn&#8217;t survive. Presumably there was also other contact between 300,000 years ago and 70,000 years ago.</p><p><strong>David Reich</strong></p><p>Probably. But these are the ones we are detecting currently.</p><p><strong>Dwarkesh Patel</strong></p><p>Is it just contingent that one time there&#8217;s this kind of diffusion where most of the archaic genome survives, and the other time it&#8217;s total replacement?</p><p><strong>David Reich</strong></p><p>This is not at all surprising given the context. If you think about this model, this is 700,000 or 800,000 years ago. This is 300,000 years ago. So this is 400,000 years separated. You talked about the Bhatia paper with me earlier. That&#8217;s two populations 70,000 years separated. There are no biological incompatibilities between West Africans and Europeans. There&#8217;s no natural selection against biological incompatibilities.</p><p>We know when Neanderthals and modern humans met and mixed, there were biological incompatibilities. That was at about 700,000 years of separation. As populations become further apart, biological incompatibilities rapidly develop, probably as the square of the separation distance, because you need pairs of interacting genes. Here, it would have been maybe only 400,000 years separated between this lineage and that lineage. But here, it&#8217;s 1.2 million years. That&#8217;s a lot. These are at the edge of not being able to produce children. These are quite different humans. These are actually three times closer than these.</p><p>If you look at mixtures of humans today, there are mixtures in Southern Africa of people who are half this distance.
If you look at Khoisan and <a href="https://en.wikipedia.org/wiki/Bantu_peoples">Bantu</a> people mixing in Southern Africa, like the <a href="https://en.wikipedia.org/wiki/Xhosa_people">Xhosa</a>, which is the population of Nelson Mandela, these are groups that are separated by almost 200,000 years, which is half of this. Totally compatible.</p><p>What you&#8217;re seeing is a group that&#8217;s actually completely permeable genetically, or nearly completely permeable. This other one almost certainly has substantial biological incompatibilities. Because 200,000 or 300,000 years later, we see interbreeding between Neanderthals and modern humans, or between Denisovans and modern humans, and there&#8217;s clear evidence of incompatibility at that point. But this would be even bigger.</p><p>What you would expect to see is that as this group spread, they would be moving into a territory full of archaic humans. There would be some interbreeding, but the kids would not be very fit. They would die off. There would be a lot of infertility. The barriers to gene flow and to interbreeding would be greater.</p><p>To me, it&#8217;s not at all surprising that as this group moves into Eurasia, you have Eurasian archaics&#8212;the ancestors of Denisovans&#8212;who are only 400,000 years diverged from these people over here. And then you have African archaics, and these are 1.2 million years diverged. They just don&#8217;t interbreed as much, and you don&#8217;t get as much gene flow.</p><p>But the key thing is the timing. It&#8217;s the same time. It really feels like the signature of an explosion of people from one place, interacting with people here and interacting with people there. 
It&#8217;s the same cultural or technological revolution impacting this place and that place, and creating populations that are impacted by this cultural revolution, which we know is the case because they share the same toolkit.</p><p>Some people argue that Levallois technology is independently invented. But it&#8217;s very similar, and this model would be a way that it could have the same origin. So there&#8217;s a culturally shared thread, this shared toolkit. There&#8217;s a mitochondrial DNA and Y chromosome thread. And then there is a shared timing thread, which is that they both form by mixture.</p><p><strong>Dwarkesh Patel</strong></p><p>Because otherwise you&#8217;d have to believe that Neanderthals independently developed Middle Stone Age tools.</p><p><strong>David Reich</strong></p><p>Yes, which is not inconceivable. But it&#8217;s a little bit like believing that farming independently developed in multiple parts of the world.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. But it did.</p><p><strong>David Reich</strong></p><p>It did. So as I said, this is probably wrong. I&#8217;m trying to tell you that we don&#8217;t really know the world we live in. This is not obviously wrong. In fact, to me, this is much more plausible than the model we currently write down. It&#8217;s probably wrong, but it&#8217;s much more plausible. It explains many more things, and it&#8217;s no more complicated.</p><p><strong>Dwarkesh Patel</strong></p><p>Do you want to recapitulate the thing you were saying about the analogy to Ptolemy and the epicycles? I thought that was quite interesting.</p><p><strong>David Reich</strong></p><p>I think the model that we&#8217;ve put together collectively about the relationships between archaic and modern humans has accreted over time. There was this idea that modern humans are distinct and that Neanderthals and Denisovans are sisters of each other.
Over time, we detected additional mixture events, like this modern human into Neanderthal, and then these other ones I didn&#8217;t even talk about, like a super-divergent lineage going into Denisovans and all this other stuff. We still say, &#8220;Oh, the whole genome says Neanderthals and Denisovans are sisters, so that&#8217;s the truth.&#8221;</p><p>We&#8217;ve patched it all together and gotten it all to work. You look at the mitochondrial DNA and the Y chromosome, and they have this odd pattern, and it&#8217;s improbable, but we can get that to work if we invoke natural selection, things like this. You patch it all together.</p><p>It reminds one of what happened in the ancient world, where there was this idea that the sun revolves around the Earth, but it doesn&#8217;t quite explain the movements of the planets properly. In order to get the movements of the planets to work right, <a href="https://en.wikipedia.org/wiki/Geocentrism">Ptolemy</a> and the astronomers made up these <a href="https://en.wikipedia.org/wiki/Deferent_and_epicycle">epicycles</a>, these special extra rotations and movements to make everything work about right. It was such a convoluted model. When <a href="https://en.wikipedia.org/wiki/Nicolaus_Copernicus">Copernicus</a> and colleagues suggested instead that <a href="https://en.wikipedia.org/wiki/Copernican_heliocentrism">everything is revolving around the sun</a>, it simplified things ever so much.</p><p>What was happening is that as astronomical information accumulated, it kept being contradictory to the standard model, but it could be made to work by proposing another complication and another complication and another complication. This is not as fantastic as proposing that everything revolves around the sun rather than the Earth, but it is much simpler. And it actually explains many things.</p><p><strong>Dwarkesh Patel</strong></p><p>What is counterintuitive or unexpected or hard to accept about this alternative model? 
What is the hesitation that people have for adopting this?</p><p><strong>David Reich</strong></p><p>I don&#8217;t know. Nobody&#8217;s thinking about this model right now. It just seems obviously a very natural model to me.</p><p><strong>Dwarkesh Patel</strong></p><p>The reason I ask is that <a href="https://en.wikipedia.org/wiki/Aristarchus_of_Samos">Aristarchus</a>, the ancient Greek, had the heliocentric theory because he had <a href="https://en.wikipedia.org/wiki/Heliocentrism#Aristarchus_of_Samos">deduced how far the Earth is from the sun and noticed other things</a>. But it was not adopted, because his fellow Greeks were like, &#8220;Look, if we believe that the Earth revolves around the sun, for it to be the case that we don&#8217;t see <a href="https://en.wikipedia.org/wiki/Stellar_parallax">relative movement of the stars to the Earth</a>, the only possible explanation is that the stars are so far away that it is just incomprehensible and implausible.&#8221; So the heliocentric theory was dismissed.</p><p>What I&#8217;m trying to ask is, what is the equivalent here of &#8220;for this to work, the stars have to be so far away that it&#8217;s inconceivable,&#8221; where actually the stars are so far away? Maybe we should adopt the implausible implication that this theory gives us.</p><p><strong>David Reich</strong></p><p>That&#8217;s a great question. I think we have to assume that there&#8217;s a linkage between the cultural transformations in Africa and Eurasia at this time, and that&#8217;s not something the community has really put together with the genetic data.</p><p>There&#8217;s this thread in the genetics about substructure in Africans, and then there&#8217;s this whole world based on ancient DNA, and they&#8217;ve never been put together.
Nobody&#8217;s put together the now extensive work on modern human substructure with the extensive work based on ancient DNA of archaic human relationships to modern humans.</p><p>If you put them together, you realize they line up in terms of their time of substructuring. I don&#8217;t know if that&#8217;s improbable. It seems parsimonious to me.</p><p><strong>Dwarkesh Patel</strong></p><p>It also seems significant that different groups of humans at this time were capable of adopting Stone Age technology. Once one group had figured it out, the genetic difference between different human lineages was not so big that you could not show people how to use stone tools.</p><p><strong>David Reich</strong></p><p>Who knows? It could be that this was genetically driven. We talked before about the time to the common ancestor of human genes. There&#8217;s nothing at 100,000 years or 150,000 years, but there&#8217;s a lot at 400,000 or 500,000 years.</p><p>If that&#8217;s what happens, and you have a mutation that occurs in the Caucasus, somewhere in the Middle East, or Northeast Africa, there could be key genetic mutations that make people able to do this. Then this population expands. When it moves into Europe, it&#8217;s swamped by local genes, but there could be retention of those genes through selection as it expands.</p><p>Maybe what you&#8217;re seeing is that there are genetic developments. Most of the discussion has been focused on the 50,000 to 100,000-year event, which is anatomically modern human behavior. But a lot of archaeologists think this is an equally&#8212;if not more&#8212;profoundly significant event in many ways. Why is that not the event we should be talking about?</p><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;re talking about how there are no fixed differences between modern humans and the humans 50,000 years ago. 
Do we know if there are any fixed differences between the people 50,000 years ago and the people 300,000 years ago?</p><p><strong>David Reich</strong></p><p>I think there are.</p><p><strong>Dwarkesh Patel</strong></p><p>Other than obviously these interbreedings.</p><p><strong>David Reich</strong></p><p>If you look at the genetic variation going back 300,000 or 400,000 years, there do begin to be places where all modern humans share common ancestry. That&#8217;s another way of saying there begin to be fixed differences at that time depth. That is where you start seeing evidence for possible fixed differences.</p><p>What&#8217;s happening, if everybody shares a common ancestor 400,000 or 500,000 years ago, is that there&#8217;s a single ancestor at that time. If you compared it to another population, they would descend from a different lineage, so any mutation that occurred ancestral to that single ancestor would be a fixed difference. This is the time at which you can begin to see fixed differences.</p><p><strong>Dwarkesh Patel</strong></p><p>But anatomically modern, cognitively modern humans exist by the beginning of the Middle Stone Age, before we&#8217;re breeding with this ancient group of Africans or breeding with Neanderthals.</p><p><strong>David Reich</strong></p><p>Anatomically modern humans occur exactly here. It&#8217;s the same moment. This is when they occur. The people who have skeletal features like ours, and Neanderthals, appear exactly then.</p><p>This is when it all happens. There is this disconnect between anatomically modern humans in the skeletal record and behaviorally modern humans, which is 50,000 to 100,000 years ago. Anatomically modern humans appear at this time, and recognizable Neanderthals appear roughly around this time, too.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. 
But we don&#8217;t know what exactly happens, if anything, between 200,000 years ago and 50,000 years ago that goes from just anatomical modernity to behavioral modernity.</p><p><strong>David Reich</strong></p><p>My understanding is no. They&#8217;re busy making Levallois stone tools like Neanderthals for 200,000 years, and they are not more impressive than Neanderthals in any obvious way, as I understand it.</p><p>Then there begins to be in the archaeological record a quickening of behavioral traits, which could be not genetic at all, or could be genetic. There are lots of arguments about this. We were obsessed with intelligence earlier in our conversation. People are obsessed with art and these things that seem important to us but who knows what&#8217;s important?</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. Cool, thanks for the digression.</p><h3>01:54:10 &#8211; The methodology behind this breakthrough</h3><p><strong>David Reich</strong></p><p>The work that I&#8217;ve been involved in has consistently shown that I was wrong in my biases coming into the work, and I&#8217;ve really been almost traumatized by this. Again and again, I&#8217;ve come into a project with some kind of guess about what the data was showing, and then the data doesn&#8217;t show that.</p><p>For example, when I got involved in the <a href="https://en.wikipedia.org/wiki/Neanderthal_genome_project">Neanderthal genome project</a> helping to analyze data looking at how archaic Neanderthals were related to modern humans, I was part of a group of scientists who had established that non-Africans were a simple subset of African variation and that there was no evidence at all of Neanderthal interbreeding into the ancestors of modern humans or other archaic interbreeding. 
Different analyses that I and many other people had done made it look like non-African variation was just a subset, a small sample of that in Africa, and that could have fully explained the data.</p><p>So when I was involved in analyzing the Neanderthal DNA sequences, what happened was I found this very strong evidence of Neanderthals being more closely related to non-Africans than to Africans. It was very surprising, and I thought it must be a mistake. I was quite incredulous. I thought it was unlikely to be true because other evidence that had been found before seemed to point in the other direction. So I spent several years trying to make these results go away, as did my colleagues, and we just couldn&#8217;t make the results go away. They just kept getting stronger.</p><p>And this experience working on natural selection was the same. What we were convinced of was that natural selection had been pretty quiescent in our species over the last several hundred thousand years. Therefore, if we look at patterns of variation in non-African people today, or in any people today, we should see not a lot of selection going on. Indeed, the first ancient DNA studies, beginning in 2015 with <a href="https://www.nature.com/articles/nature16152">this paper</a> that we were involved in with <a href="https://www.med.upenn.edu/mathieson-lab/">Ian Mathieson</a> and colleagues, seemed to show relatively small numbers of genetic positions associated with natural selection.</p><p>In 2015, we analyzed data from about 200 Europeans and Middle Easterners to try to understand frequency changes over time. We compared those ancient people who were the sources of modern Europeans to people in Europe today, and we looked at frequency differences that were too extreme to be due to chance. 
We were very excited to find 12 positions that we were convinced were highly different in frequency between Europeans today and what we would expect, based on the history that we and others had identified as the history relating modern to ancient Europeans. Some of these were known and some of these were not known, and this was very exciting.</p><p>We hoped that as the numbers of samples would increase and we got higher resolution to be able to appreciate differences in frequencies over time, it would make it possible to detect far more. What was quite disappointing over the subsequent decade is that that didn&#8217;t happen. For example, the <a href="https://news.ku.dk/all_news/2024/01/discoveries-gleaned-from-human-ancient-dna/">largest study of that type in 2024</a> by a group in Copenhagen analyzed much better data than we had in 2015 and found only 21 positions that were highly different in frequency across time. While that was exciting&#8212;it was almost twice as many as we had found in 2015&#8212;in a lot of ways it was disappointing because the sample size and data quality had gone up so much, and yet this is all that was found.</p><p>That suggested we might be hitting an asymptote and might not be able to get beyond where we currently were. This approach to learning about biology, which was very promising in theory, might not produce a high yield. Maybe natural selection was quiescent, and the reason we&#8217;re seeing so few changes is that there has not been a lot of adaptive directional selection. That was the situation we found ourselves in until just a few years ago when we carried out this study in our research group led by Ali Akbari.</p><p>What we did is we deployed a few innovations to try to improve our power to detect natural selection. One of them was that we just pumped a lot of data into the system, increasing the amount of data by about 14-fold. The main thing that we do in this study is report new data from about 10,000 individuals. 
This is a very big increase in the amount of data in the literature. The total dataset size of ancient individuals distributed over the last 18,000 years is about 16,000 people. This is a large dataset. It&#8217;s much larger than was previously possible, and when you have more data, you can estimate frequency changes with much more subtlety.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NIbW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NIbW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 424w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 848w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 1272w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NIbW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png" width="728" height="311.1146424517594" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:753,&quot;width&quot;:1762,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:2399060,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NIbW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 424w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 848w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 1272w, https://substackcdn.com/image/fetch/$s_!NIbW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa02c341f-8584-4aa4-9016-5dc62c0a7649_1762x753.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The data comes from only one part of the world, which is Europe and the Middle East. It&#8217;s not a more important part of the world than other places, but it&#8217;s the place where maybe 70-80% of the data in the ancient DNA literature so far comes from due to historical reasons. It provides us with a natural laboratory where we can see what happens to the genome in one place over time as environments change. It&#8217;s really interesting to imagine doing this type of analysis in other parts of the world, and the comparative analyses are super important and interesting, but this study right now is about this one place in the world where we have particularly fantastic data.</p><p>The other thing we did is that we developed an entirely new methodology that hadn&#8217;t been used in this area before. The methodology is based on a technique that had been developed for finding risk factors for a disease in medical studies. 
A simple way to explain it is that we ask how to predict the genetic type a person has based on their pattern of relatedness to other people. We have a dataset of about 16,000 ancient people, and 22,000 people if we include the modern people. Then we look at how closely related each of these 22,000 people are to each other, and we predict the genetic type at each position in the DNA&#8212;at 10 million positions&#8212;based on the pattern of relatedness to all of the other 22,000 people.</p><p>Then we ask if natural selection blowing the frequency of the mutation in the same direction in all geographic places and at all times predicts the data a little bit better than just knowing the relatedness to all the other samples in the database. We&#8217;re simply asking if the alternative hypothesis&#8212;that selection has been blowing in the same direction at all times&#8212;explains the data better. That&#8217;s a dumb assumption, because of course, the truth is that natural selection will have changed in frequency over time. But we&#8217;re just asking the simplest of questions: whether assuming a constant rate of selection explains the data more than not doing so.</p><p><strong>Dwarkesh Patel</strong></p><p>To summarize to make sure I&#8217;ve understood, you&#8217;re trying to make a model that predicts allele frequency changes over time. You have two different parts. One part is this genetic relatedness matrix, which captures how similar different genomes are to each other. That should capture the impact of different bottlenecks, of drift, of population admixtures, and all those things which affect the entire genome.</p><p>Then you have the separate thing, which is, if we look at specific locations, can we just say that, &#8220;Oh, this location has been selected at whatever coefficient over time&#8221;? 
And if we add some coefficient, does it become easier to predict the allele frequency changes than it would be from this other part alone, which just asks, &#8220;If you look at the whole genome, have these guys gone through the same bottlenecks? Have they gone through the same drift,&#8221; etc.?</p><p><strong>David Reich</strong></p><p>That&#8217;s precisely right.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, what did we learn?</p><p><strong>David Reich</strong></p><p>When we analyzed the data this way, we looked at 10 million positions in the DNA across these 22,000 people, 16,000 of whom were ancient. We looked to see if there was more change in this consistent direction over time than you would expect by chance. When we analyzed the data, we found many hundreds of places in the DNA that were changing too much over time and in too consistent a way to be explained by chance.</p><p>There&#8217;s a bit of a statistical problem in figuring out how many there are because they&#8217;re so densely packed that they&#8217;re close to each other and interfering with each other. But when you try to piece them out and say, &#8220;Let&#8217;s count only one in each place in the DNA and blank out the others,&#8221; we find at least 479 positions that are all independently pushing in the same way. We are 99% confident that those positions are real. By another criterion, being more than 50% confident that they&#8217;re real, we think that about 3,800 positions are all pushing in the same direction. This is a crazy number of results given that in our previous work and other people&#8217;s work, there were at most a couple of dozen discoveries coming from a single scan.</p><p>So when we got this result, we were very surprised. We thought it must be wrong, and we spent the next couple of years trying to make the results go away, but they just kept getting stronger. 
We were trying to look for some independent type of evidence to tell us whether these positions were real. We stumbled on something really powerful for this purpose that had not been used in this way before. It relied on the fact that we had very large numbers of discoveries, many hundreds of discoveries or even thousands.</p><p>We took a completely independent dataset, which was the corpus of genome-wide association studies. These are studies that people have carried out in hundreds of thousands of people, looking for whether particular genetic mutations are more common in people with high blood pressure than with low blood pressure, or something like this. We took the <a href="https://www.ukbiobank.ac.uk/">UK Biobank</a>, which is about 500,000 people from Great Britain who have been measured for hundreds of traits. The whole genomes of all these people have been sequenced. For each of these traits, we could look at whether each of these 10 million positions are connected to this trait in a convincing way. Out of 10 million positions, about 15%&#8212;about 1.5 million positions in the DNA&#8212;are predictive of at least one of these several hundred traits.</p><p>Then we could ask a question: is our natural selection signal, our statistic, related to whether a mutation causes high blood pressure or some other trait? We slid our statistic for natural selection upward, to a value of one, two, three, four, or five. As we did that, the enrichment for genetic mutations that affect traits got higher and higher. Whereas it was only 15% when we didn&#8217;t use our selection statistic, when we required the selection statistic to be above about five, there was about a five-fold enrichment for mutations that cause traits.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, what is a selection statistic?</p><p><strong>David Reich</strong></p><p>This is the statistic we use to measure whether a mutation is changing over time significantly in a non-zero way. 
It can be approximately thought of as a normally distributed statistic, a <a href="https://en.wikipedia.org/wiki/Normal_distribution">Gaussian statistic</a>, which is the number of standard deviations the statistical value is away from zero, where zero is no natural selection.</p><p>It&#8217;s not exactly that, but it&#8217;s close to that. If the statistic is above five, we see about a five-fold enrichment in mutations that affect a trait. Instead of 15% of the mutations that are at random affecting the trait, it&#8217;s 60% or 70% that are affecting the trait when we slide our statistic upward. This provides completely independent evidence that these sites are real, and as you slide above five, there&#8217;s no more enrichment. Our interpretation of these results&#8212;which we were able to validate and show made sense using computer simulations of our process&#8212;is that once you slide the statistic above five, essentially all the signals of natural selection are real.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, just to make sure I understand. You&#8217;re saying, in order to figure out what <a href="https://en.wikipedia.org/wiki/Allele">alleles</a> have been under selection, your model assigns a statistic saying, &#8220;In order to explain why this allele has a specific frequency, we&#8217;re going to give it a selection statistic.&#8221; Independently, we run these studies on modern populations where we say, &#8220;If you look at height, eye color, intelligence, or whatever trait, what are the parts of the genome that are correlated with that trait?&#8221; The higher the statistic you give it in your study to explain allele frequency changes over time as a result of selection, the more probable it is that that region in the genome is associated with traits that have some functional thing we can measure.</p><p><strong>David Reich</strong></p><p>That&#8217;s exactly right. This is a brilliant idea that Ali had. 
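</p><p>The threshold sweep described above can be sketched in a few lines of Python. This is a minimal illustration and not the study&#8217;s code: the function name <code>enrichment_above</code>, the argument names, and the toy values are all assumptions made for the example, and the toy values carry no data from the study. The sketch takes the fraction of positions associated with at least one trait among positions whose selection statistic is at or above a cutoff, and divides it by the same fraction over all positions.</p><pre><code>def enrichment_above(stats, trait_hit, cutoff):
    """Fold enrichment of trait-associated positions at or above a cutoff.

    stats     -- magnitude of the selection statistic per position (hypothetical)
    trait_hit -- True where the position is associated with at least one trait
    cutoff    -- keep positions whose statistic is at least this value
    """
    baseline = sum(trait_hit) / len(trait_hit)
    kept = [hit for z, hit in zip(stats, trait_hit) if z >= cutoff]
    if not kept or baseline == 0:
        return None  # nothing passes the cutoff, or no baseline to compare to
    return (sum(kept) / len(kept)) / baseline


# Made-up toy inputs, only to show the call pattern.
toy_stats = [0.2, 0.5, 1.1, 2.3, 3.0, 4.2, 5.1, 5.6, 6.3, 0.9]
toy_hits = [False, False, True, False, False, True, True, True, False, False]
for cutoff in (0, 1, 3, 5):
    print(cutoff, enrichment_above(toy_stats, toy_hits, cutoff))
</code></pre><p>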
It abandons the traditional approach of assigning statistical significance to mutations that cause a trait because we&#8217;re just using an external piece of information, the correlation with traits measured in a completely different way, to read off the probability that mutations are real.</p><p>We can ask how much enrichment for real signal there is given a particular selection statistic. If it&#8217;s halfway enriched to the plateau, we&#8217;re able to show the correct interpretation is that 50% of the mutations are really selected. If it&#8217;s three-quarters of the way toward the plateau, there&#8217;s a three-quarters probability that the mutation is real. If it&#8217;s 99% of the way to the plateau, there&#8217;s a 99% probability that it&#8217;s real. That gives us a calibrated estimate of the probability that a particular position is really under natural selection.</p><p>A major concern here is that what we&#8217;re actually seeing is not that these mutations are really under selection, but rather that both association with a disease and our selection signal are due to some third thing that&#8217;s causing both of them. That third thing would be a type of selection that is not what we&#8217;re after: not selection to adapt to new environments, but what&#8217;s called background selection, which is selection against newly arising bad mutations that are removed from the population and that tend to be concentrated in genes. Genes are also the parts of the genome that tend to be associated with traits. The concern is that this common process is causing both the enrichment for trait signals and the enrichment for selection signals that we&#8217;re observing. We were super concerned about this.</p><p>So what we did is we repeated this enrichment analysis in slices of the DNA that were all affected to the same extent by background selection, by this rain of slightly bad mutations, and we got exactly the same pattern. 
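</p><p>The plateau-based calibration described a few paragraphs above can be read as a two-component mixture. What follows is a sketch under stated assumptions, not the study&#8217;s method: it assumes that &#8220;halfway to the plateau&#8221; is measured from the no-enrichment baseline of 1, that truly selected positions show the plateau enrichment, and that all other positions show none. Under those assumptions, if a share <code>pi</code> of positions at a given statistic is truly selected, the observed enrichment is <code>pi * plateau + (1 - pi) * 1</code>, which solves to <code>pi = (enrichment - 1) / (plateau - 1)</code>. The function name <code>prob_selected</code> is hypothetical.</p><pre><code>def prob_selected(enrichment, plateau):
    """Estimated share of truly selected positions at a given enrichment.

    Assumes a two-component mixture: truly selected positions show the
    plateau enrichment, the rest show no enrichment (a value of 1).
    """
    if plateau == 1:
        raise ValueError("plateau must differ from the baseline of 1")
    share = (enrichment - 1.0) / (plateau - 1.0)
    return min(1.0, max(0.0, share))  # clamp sampling noise into [0, 1]


# With a plateau of about five-fold, as quoted in the conversation:
for observed in (1.0, 3.0, 4.0, 5.0):
    print(observed, prob_selected(observed, 5.0))
</code></pre><p>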
We also repeated this experiment just using mutations of the same frequencies because there&#8217;s different statistical power to detect these signals at different frequencies. We see the same pattern where above a value of the selection statistic of around five, we get this plateau.</p><p><strong>Dwarkesh Patel</strong></p><p>Is the thing that changed, which allowed you to increase the number of sequences you&#8217;re generating by two orders of magnitude, just the statistical method you&#8217;re using to identify which part is human? Or what exactly changed in 2014 and since then?</p><p><strong>David Reich</strong></p><p>There&#8217;s been a whole series of improvements. The big ones have been the huge drop in sequencing cost, which made it possible to generate ancient DNA in the first place. The drop in cost has been a millionfold since the late 2000s, and another maybe one to two orders of magnitude from 2010 to today. That&#8217;s one big change.</p><p>Another change has been in-solution enrichment. It&#8217;s a way of taking a sample that has a very small percentage of human DNA and creating a process in which the great majority of the sequences that one&#8217;s analyzing will be useful for analyses. The samples that we had mostly contained very low percentages of human DNA, less than 10% and often less than 1%. That is such a low proportion that it was prohibitively expensive to just brute-force sequence them given the technology that we had available at the time.</p><p>We took these samples and washed them over an artificially synthesized set of short DNA fragments that targeted positions of the DNA that we were interested in analyzing. This is more than a million positions that are highly variable in people, and we picked many of these to be biologically interesting. 
We had a whole set of known biological targets that affected traits in genome-wide association studies, which is the way that people look to see if there are particular genetic variants in modern people that have particular impacts in phenotypes and traits.</p><p>And so, what we did is we had this artificially synthesized set of DNA fragments that we washed our ancient sample over, and it bound the parts of the DNA that we targeted. The resulting sequence that we generated was very enriched for the parts of the genome that were informative about history. Even though only 10% or 1% of the DNA was human, it ended up that a very large fraction was from the parts of the genome that we were interested in, and it became economically efficient to do it.</p><p><strong>Dwarkesh Patel</strong></p><p>What was the other 99% of the DNA?</p><p><strong>David Reich</strong></p><p>It&#8217;s mostly microbial. It&#8217;s from bacteria and fungi that colonize a person&#8217;s body after they die. Depending on how they die, there&#8217;ll be more or less of these bacteria and fungi. When you typically sequence DNA from a person, it&#8217;ll just be full of microbial sequence. Sometimes the microbial sequence is very interesting; it might be pathogens that a person died of. There&#8217;s amazing work, for example, about different plagues of malaria and Black Death and hepatitis B and so on that have been obtained from the sequences of these pathogens in people&#8217;s teeth and other parts of their body when they died.</p><p>But we&#8217;re focusing here on the human DNA. This changed the amount of data that was possible to produce from tens per year to hundreds per year, and then we further roboticized and industrialized the process so that there were many hundreds or even thousands per year. Just in our laboratory, we&#8217;ve been generating genome-scale data from more than 5,000 individuals per year. 
I know this is also true of several other laboratories in the world now.</p><p>This huge jump in data, this semi-exponential or even super-exponential jump in some cases, has made it possible to ask and answer questions. While we had only on the order of 10 genome sequences from humans in 2010, this year the total has passed 20,000 reported sequences. That is an increase of several orders of magnitude, and the questions we were able to ask in 2014 are just not the same as the ones we can ask today.</p><p><strong>Dwarkesh Patel</strong></p><p>Awesome. Excellent. David, thanks for your time.</p><p><strong>David Reich</strong></p><p>Thank you, Dwarkesh.</p>]]></content:encoded></item><item><title><![CDATA[Reiner Pope – The math behind how LLMs are trained and served]]></title><description><![CDATA[It's shocking how much you can deduce about what the labs are doing from a handful of equations and a blackboard]]></description><link>https://www.dwarkesh.com/p/reiner-pope</link><guid isPermaLink="false">https://www.dwarkesh.com/p/reiner-pope</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Wed, 29 Apr 2026 17:07:03 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/195859978/26c19662ae988720a5c793e5f1acbf55.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Did a very different format with Reiner Pope - a blackboard lecture where he walks through how frontier LLMs are trained and served.</p><p>It&#8217;s shocking how much you can deduce about what the labs are doing from a handful of equations, public API prices, and some chalk.</p><p>It&#8217;s a bit technical, but I encourage you to hang in there &#8211; it&#8217;s really worth it.</p><p>There are fewer than a handful of people in the world who understand the full stack of AI, from chip design to model architecture, as well as Reiner. 
It was a real delight to learn from him.</p><p>Wrote up some <a href="https://reiner-flashcards.vercel.app/">flashcards and practice problems</a> to help myself retain what Reiner taught. Hope it&#8217;s helpful to you too!</p><p><a href="https://reiner.org/">Reiner</a> is CEO of <a href="https://matx.com/">MatX</a>, a new chip startup (full disclosure - I&#8217;m an angel investor). He was previously at Google, where he worked on <a href="https://arxiv.org/abs/2211.05102">software</a> <a href="https://jax-ml.github.io/scaling-book/">efficiency</a>, compilers, and TPU architecture. </p><p>Highly recommend the <a href="https://jax-ml.github.io/scaling-book/gpus/">scaling book</a> he coauthored for further study.</p><p>Watch this one on <a href="https://youtu.be/xmkSf5IS-zw">YouTube</a> so you can see the chalkboard.</p><div id="youtube2-xmkSf5IS-zw" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;xmkSf5IS-zw&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/xmkSf5IS-zw?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Sponsors</strong></h2><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> needs constant access to incredibly low-latency compute. I recently asked one of their engineers, Clark, to talk me through how they meet these demands. Our conversation&#8212;which touched on everything from FPGAs to liquid cooling&#8212;was extremely helpful as I prepped to interview Reiner. 
You can watch the full discussion and explore Jane Street&#8217;s open roles at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://goo.gle/Gemma4">Google&#8217;s Gemma 4</a> is the first open model that&#8217;s let me shut off the internet and create a fully disconnected &#8220;focus machine&#8221;. This is because Gemma is small enough to run on my laptop, but powerful enough to actually be useful. So, to prep for this interview, I downloaded Reiner&#8217;s scaling book, disconnected from wifi, and used Gemma to help me break down the material. Check it out at<a href="https://goo.gle/Gemma4"> goo.gle/Gemma4</a></p></li></ul><ul><li><p><a href="https://cursor.com/dwarkesh">Cursor</a> helped me turn some notes I took on how gradients flow during large-scale pretraining into a great animation. At first, I wasn&#8217;t sure the best way to visualize the concept, but Cursor&#8217;s Composer 2 Fast model let me iterate on different ideas almost instantaneously. You can check out the animation in <a href="https://www.dwarkesh.com/p/what-i-learned-april-15">my recent blog post</a>. 
And if you have something to visualize yourself, go to <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li></ul><h2><strong>Timestamps</strong></h2><p>(00:00:00) &#8211; How batch size affects token cost and speed</p><p>(00:32:09) &#8211; How MoE models are laid out across GPU racks</p><p>(00:47:12) &#8211; How pipeline parallelism spreads model layers across racks</p><p>(01:03:37) &#8211; Why Ilya said, &#8220;As we now know, pipelining is not wise.&#8221;</p><p>(01:18:59) &#8211; Because of RL, models may be 100x over-trained beyond Chinchilla-optimal</p><p>(01:33:02) &#8211; Deducing long context memory costs from API pricing</p><p>(02:04:02) &#8211; Convergent evolution between neural nets and cryptography</p><h2>Transcript</h2><h3>00:00:00 &#8211; How batch size affects token cost and speed</h3><p><strong>Dwarkesh Patel</strong></p><p>Today, I&#8217;m interviewing <a href="https://reiner.org/">Reiner Pope</a>, who is the CEO of <a href="https://matx.com/">MatX</a>, <a href="https://techcrunch.com/2026/02/24/nvidia-challenger-ai-chip-startup-matx-raised-500m/">which is a new chip startup</a>. Previously, he was doing <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit">TPU</a> architecture and many other things at Google. This is a very different format from my usual interviews. This is going to be a blackboard lecture. We&#8217;re going to get up in a second. We in fact built this whole new studio with specifically this format in mind, so it&#8217;s a pleasure to get to inaugurate it with you.</p><p>We&#8217;re going to be talking about <a href="https://huggingface.co/blog/ProCreations/the-mega-article">model architecture</a>, <a href="https://www.sei.cmu.edu/blog/a-hitchhikers-guide-to-ml-training-infrastructure/">ML infra</a>, and many other things. 
The reason I think it&#8217;s an important topic is because once you understand how <a href="https://www.ibm.com/think/topics/model-training">training</a> and <a href="https://cloud.google.com/discover/what-is-ai-inference">inference</a> work in a cluster, a lot of things&#8212;about why AI is the way it is, why AI architectures are the way they are, why API prices are the way they are, and fundamentally why AI progress is the way it is&#8212;start making sense. You need to understand the details to get there, and you need a blackboard to understand the details. Reiner, thank you so much for doing this.</p><p><strong>Reiner Pope</strong></p><p>Very happy to be here.</p><p><strong>Dwarkesh Patel</strong></p><p>Full disclosure, I am an angel investor in MatX, but that&#8217;s unrelated to this podcast. Reiner, to kick us off I&#8217;ll ask this question. We have a couple of companies like Claude and Codex and Cursor offering something like Fast Mode, where for 6x the price, they&#8217;ll stream you <a href="https://blogs.nvidia.com/blog/ai-tokens-explained/">tokens</a> at 2.5x the speed. Mechanically, I&#8217;m curious what&#8217;s going on here. Why is it the case that you can pay more to get faster latency?</p><p>Two, could you keep going? Could you pay 100x more and somehow get much faster speeds? Three, could you go the other way? Could you have something like Claude Code &#8220;Slow Mode&#8221;, where if you are willing to wait for minutes on end, you could get even cheaper prices? Maybe this will help motivate the analysis that you&#8217;ll be doing through the lecture.</p><p><strong>Reiner Pope</strong></p><p>Great. To jump to the conclusion a little bit, the big effect is <a href="https://cloud.google.com/discover/what-is-batch-inference">batch</a> size. What we&#8217;re going to do now is quantify exactly what that looks like and what its implications are on latency and cost. 
There&#8217;s another effect, which you can call <a href="https://pytorch.org/blog/hitchhikers-guide-speculative-decoding/">speculative decoding</a> or <a href="https://arxiv.org/abs/2404.19737">multi-token prediction</a>. We can maybe come back to that later, but the first thing that we&#8217;ll talk through is batch size.</p><p>What I&#8217;d like to introduce is the two principles of analysis. First, we&#8217;re going to look at a <a href="https://en.wikipedia.org/wiki/Roofline_model">roofline analysis</a> of how we run a <a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)">transformer</a> model on a cluster of chips. We&#8217;ll take a <a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72/">Blackwell NVL72</a> cluster, so a rack of 72 <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPUs</a>. The roofline analysis means we look at memory bandwidth and compute performance. The other side of that is that we&#8217;re going to look at just two simple factors of the model: the time to operate on the <a href="https://www.geeksforgeeks.org/deep-learning/the-role-of-weights-and-bias-in-neural-networks/">weights</a>, and the time to operate on the context, the <a href="https://huggingface.co/blog/not-lain/kv-caching">KV cache</a>.</p><p>Let&#8217;s jump in. We&#8217;re going to try and estimate the time that it takes to run an inference of a certain shape. We&#8217;re not perfect here. We can&#8217;t exactly predict the time, so instead we&#8217;re going to approximate. We&#8217;re going to say that the time must be greater than or equal to a certain quantity. We&#8217;re going to consider two different aspects: the time it takes to do the memory fetches, and the time it takes to do the compute. It will turn out that this gives us very strong predictive power, even with a simple model.</p><p>One by one, what is the time that it takes to do the compute? There are really two things I need to do in the compute. 
I need to multiply by all of the active <a href="https://www.ibm.com/think/topics/model-parameters">parameters</a>, and then I need to do some work on the <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)">attention</a>. Multiplying by all the active parameters, I have a certain batch size that I&#8217;m running, and I&#8217;ve got a number of active parameters in my model. Then I&#8217;m just going to divide this by the compute throughput, which is the <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">FLOPs</a> of the chip. This is a hardware concern.</p><p>This accounts for all of the compute time for all of the weight matrix multiplies. There&#8217;s a little caveat here. We&#8217;ve ignored the time to do any of the attention computation, but that in general will be quite small in comparison to this. So we&#8217;ll ignore this.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;ll just interrupt from time to time to ask some very naive questions or to clarify some basic points. For the audience, you&#8217;re not serving one user at a time. The batch refers to the fact that you&#8217;re serving many different users at the same time, and that&#8217;s a whole batch.</p><p><strong>Reiner Pope</strong></p><p>I can motivate the batch at least a little bit. We will see exactly why batch is such a favorable optimization. What will turn out to be the case is that if you do not batch together many users, the cost and the economics you get can be a thousand times worse than if you do batch many users together. We&#8217;ll be able to see that quite explicitly.</p><p>Then, number of active parameters. If I look at, for example, a <a href="https://en.wikipedia.org/wiki/DeepSeek">DeepSeek</a> model, the <a href="https://api-docs.deepseek.com/news/news1226">DeepSeek V3</a> model has about 37 billion active parameters, and 700 billion total parameters. 
We&#8217;re focusing on just the ones that are active for a single AI token.</p><p>We&#8217;re modeling compute performance. I&#8217;m going to keep writing equals, but in all of these cases, you can think of this time as being <em>at least</em> this much, and maybe there will be some terms we ignored.</p><p>On the memory side, what do we need to do with memory? We need to fetch all of the weights, so there is some time to fetch the total number of parameters, not just the active parameters. There&#8217;s weight fetch time, and then in addition, there&#8217;s a KV cache fetch time. This actually depends on batch size. For every element of the batch, we have to fetch an entire context length worth of tokens, and there&#8217;s a size per token, bytes for one token. This is a model parameter.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe just backing up, let&#8217;s explain what the KV cache is real quick.</p><p><strong>Reiner Pope</strong></p><p>When I do a <a href="https://towardsdatascience.com/neural-networks-forward-pass-and-backpropagation-be3b75a1cfcc/">forward pass</a>&#8230; Let me draw how the <a href="https://en.wikipedia.org/wiki/Autoregressive_model">autoregressive inference</a> works. This is during decode. If I have a bunch of text tokens&#8230; I&#8217;m drawing a <a href="https://en.wikipedia.org/wiki/Tensor_(machine_learning)">tensor</a> because ultimately the tokens are represented as a tensor in some <a href="https://en.wikipedia.org/wiki/Embedding_(machine_learning)">embedding</a> dimension. In this direction, I have the sequence length.</p><p>The work of running a decode is that I have to run each token through a whole bunch of matrix multiplies over a bunch of different layers. In general, I&#8217;m going to have to do that work over all of these tokens. 
But one step of decode is to produce just this one additional token up here.</p><p>What I&#8217;m going to do there is run a full forward pass of multiplying by all of the weight matrices in the entire model. But then I&#8217;ve got this attention mechanism where this token is looking at all of the past tokens, and what is it looking at specifically? It is looking at some internal representation that the model has produced of the tokens, and we call that the KV cache. This process of this single token attending to all of the history of tokens is attention. It is mostly dominated by memory fetches rather than matrix multiplies.</p><p>So we&#8217;ve got the amount of memory that we&#8217;re fetching shown over here, and then this is of course just divided by the memory bandwidth, so the memory bytes per second. In fact, these equations here are enough for us to now draw some fit lines. The things that we&#8217;d like to look at are sensitivity to batch, and then also, which we&#8217;ll draw separately, to context length. We said that the big effect you can get is some trade-off in latency versus cost in batch size.</p><p>Let&#8217;s draw them out. I think there are just really two graphs that we want to draw. We&#8217;ll first draw batch size versus time here. When we look at the shape of this, we&#8217;ve got a maximum of the sum and then another term. Let&#8217;s look at these terms one by one and how they scale: the time for compute and memory, and how they show up.</p><p>Let&#8217;s first look at this compute time. This is just purely linear in batch size with no offset, so it is some curve like this. This is <em>t</em> compute. On the memory side, we&#8217;ve got some portion here that is just this constant in some base offset here, which is the weight fetch. Finally, we have this term here, which is the KV fetch, which is pretty linear in batch size, and so it looks like that. 
The sum of this plus this maxed with this&#8230; Let&#8217;s at least first draw the sum. The two memory times in conjunction end up looking on this curved slope like this. Then the overall maximum is&#8212;I&#8217;ll draw a little thicker here&#8212;the maximum of these two curves.</p><p>What does this mean? This is a latency plot. If I grow my batch size, initially I get some not very strong dependence on batch size, so there is some lower bound on latency here. This already partially answers the question. For a given hardware configuration&#8212;and we can talk about varying the hardware configuration&#8212;there is a lower bound on latency. It is simply that I need to read all of my total parameters from memory into the chips, and that takes a certain amount of time. If I use all of my memory bandwidth, I can&#8217;t do any better than that.</p><p><strong>Dwarkesh Patel</strong></p><p>It seems like the way you&#8217;ve drawn the slopes for compute time and how the KV grows&#8212;and what implication the KV has on memory time&#8212;</p><p><strong>Reiner Pope</strong></p><p>What if this were above or below?</p><p><strong>Dwarkesh Patel</strong></p><p>Yeah, is that necessarily the case? If this is always true, then as batch size grows compute always dominates KV, which suggests that if you have a big enough batch size, maybe memory is never an issue.</p><p><strong>Reiner Pope</strong></p><p>This is really sensitive to the <a href="https://www.ibm.com/think/topics/context-window">context length</a>, so I think we should come back and explore this. As you vary the context length, the KV fetch time will go up and up, and that will cause a transition from compute-limited to memory-limited.</p><p><strong>Dwarkesh Patel</strong></p><p>Is there something especially significant about the slope being exactly the slope of the compute time?</p><p><strong>Reiner Pope</strong></p><p>Whenever we have balance points, it says that you&#8217;re getting it exactly right. 
For the particular context length where the slopes match, that says I am equally memory-bound and compute-bound, which is a really desirable place to be.</p><p><strong>Dwarkesh Patel</strong></p><p>This is a very simple algebra problem, but suppose the optimal is 100K context length, and you go to 200K context length. Does your <a href="https://www.glennklockwood.com/garden/MFU">MFU</a> go down to 50%? Does it have a humongous impact on MFU to be slightly outside of the optimal context length range, the Goldilocks zone?</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right. That is true as modeled here. There is a key point here that I&#8217;m modeling the memory fetch as linear in context length. That depends on model architecture. It is true for all of the model architectures with <a href="https://medium.com/@aadishagrawal/the-evolution-of-attention-mechanisms-scaling-transformers-smartly-73cb96f991cf">dense attention. Sparse attention</a> actually scales much better than that.</p><p><strong>Dwarkesh Patel</strong></p><p>Got it. Is sparse attention what everybody uses in practice?</p><p><strong>Reiner Pope</strong></p><p>I&#8217;m pretty excited about sparse attention. It&#8217;s hard to know what the labs are using. DeepSeek has <a href="https://arxiv.org/abs/2512.02556">published a sparse attention mechanism</a>. I&#8217;ll just put a plug in that some of the DeepSeek papers that have published sparse attention end up putting a square root in this term.</p><p>So far, we&#8217;ve looked at the latency. It&#8217;s hard to read off cost from this. If I think about what cost means&#8230; To run this inference, I&#8217;m going to use the GPU for a certain number of seconds, like one millisecond or 20 milliseconds. I have to pay the rental time for that time. So it&#8217;s $2/hour per GPU or something like that.</p><p>That&#8217;s the cost of this inference, but how many tokens have I processed during that inference? That is the batch size. 
What we actually want to plot is the cost versus batch size, which is <em>t</em> over <em>B</em> versus batch size. This is the cost per token. We have to imagine dividing each of these three curves by <em>B</em>, so multiplying by this reciprocal. What we end up with there is&#8230; The compute curve was linear. We divide by <em>B</em>, and that makes it a constant here. This is <em>t</em> compute. The KV fetch was linear, and now it becomes a constant as well. Then the weight fetch was constant, and now we&#8217;ve divided by <em>B</em>, so it becomes this hyperbola.</p><p>Again, we&#8217;re going to compute the max of the sum. The sum of these two terms shifts the hyperbola up. The sum of the KV fetch and the weight fetch gives us a higher hyperbola that&#8217;s like this. Then we&#8217;re going to take the max with the compute here. We end up with this being the overall shape that we care about.</p><p>Again, we see some limiting behavior. The cost initially starts very high at a batch size of one. It almost goes to infinity because we&#8217;ve got so many weight fetches that are not amortized over a large batch size. But as we increase the batch size, the weight fetches become amortized over so many different batch elements that their cost grows very small, and eventually the compute time ends up driving the cost. So there is a limiting lower bound on cost, which is this line here.</p><p><strong>Dwarkesh Patel</strong></p><p>So Claude Code Slow or Codex Slow or whatever would just live on this line. It wouldn&#8217;t help much because you&#8217;re not able to amortize the KV values over a much bigger batch.</p><p><strong>Reiner Pope</strong></p><p>They&#8217;re unique per batch. The compute is also unique per batch. So what is the minimum work you can do per batch after amortizing everything else away?</p><p><strong>Dwarkesh Patel</strong></p><p>This point where you are no longer memory bandwidth bound, practically how big a batch do you need? 
How big are the batches practically for frontier models?</p><p><strong>Reiner Pope</strong></p><p>You can just solve for that. It&#8217;s not even particularly sensitive to model architecture. Let&#8217;s go ahead and do that.</p><p>What we&#8217;re talking about is when the memory time is equal to the compute time. That&#8217;s what that question is. Because we&#8217;re focused on what the batch size is&#8212;and really there&#8217;s a question of when the weights are amortized over the multiplies&#8212;I&#8217;m going to focus on comparing the weight fetch time to the weight multiply time. I&#8217;m going to disregard the KV fetch term just to simplify the analysis so we can get a clean answer out. We&#8217;re going to equate this portion with these two times.</p><p>Writing that out, we get <em>N</em>, number of total parameters, over memory bandwidth, is equal to batch size times number of active parameters divided by the compute performance. Looking over here, everything on the top are model parameters. Everything on the bottom are hardware parameters. It turns out to be nice to rearrange them such that we have the hardware parameters on one side.</p><p>This is equivalent to FLOPs over memory bandwidth being equal to batch size times number of active parameters, divided by the number of total parameters. This hardware parameter ends up being a dimensionless constant. If you look in terms of FLOPs&#8230; What are the dimensions of this? This is multiplies per second. This is bytes per second. So that&#8217;s not quite dimensionless. But what you do is you say, how many <a href="https://towardsdatascience.com/16-8-and-4-bit-floating-point-formats-how-does-it-work-d157a31ef2ef/">FP4</a> multiplies per second times the fact that each FP4 is half a byte. I can actually make this end up being dimensionless. 
On most GPUs, this ends up being somewhere around 300.</p><p><strong>Dwarkesh Patel</strong></p><p>Has that ratio changed over time as we&#8217;ve gone from model generation to model generation, where the FLOPs keep increasing?</p><p><strong>Reiner Pope</strong></p><p>This is a hardware parameter. To what extent has the hardware changed? From <a href="https://www.nvidia.com/en-us/data-center/a100/">A100</a> to <a href="https://www.nvidia.com/en-us/data-center/h100/">H100</a> to <a href="https://www.exxactcorp.com/blog/hpc/comparing-nvidia-tensor-core-gpus">B100</a>, the FLOPs have increased substantially, the memory bandwidth has also increased substantially, and the ratio between them has remained reasonably stable.</p><p>We can express this one as well. This is a sparsity parameter. I might even phrase this slightly differently. Let&#8217;s solve for batch size in total. Moving this back over to the other side, we end up with the result that batch size needs to be bigger than approximately 300 times sparsity. For example, in DeepSeek I activate 32 out of 256 experts, so the sparsity would be 8 for DeepSeek.</p><p>This actually gives you a ballpark which is remarkably close to practice. Generally, people will go a little bit larger than this. They don&#8217;t really want to be exactly at the balance point because real-world efficiencies aren&#8217;t as good as a roofline analysis would say. But take this and maybe double or triple it.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, so it&#8217;s two to three thousand tokens per batch. But then if you included the KV cache, the implication would be that the optimal batch size...</p><p><strong>Reiner Pope</strong></p><p>Should grow larger. We solved for the point where compute time is equal to memory time. If I add in something that consumes more memory bandwidth, then I have less available for the weight loads. 
I need to grow the memory bandwidth more, and therefore the batch size more.</p><p><strong>Dwarkesh Patel</strong></p><p>This seems incredibly small. This would be less than one sequence, right?</p><p><strong>Reiner Pope</strong></p><p>Keep in mind that I&#8217;m talking about the number of tokens that I&#8217;m generating one more token for. It&#8217;s actually 2,000 unique sequences.</p><p><strong>Dwarkesh Patel</strong></p><p>Got it. We&#8217;re just talking about a single forward pass on these sequences. You think of the batch as the number of sequences.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>If you&#8217;ve got a frontier model and you are actually doing inference, surely they must have more than 2,000 concurrent users. Is there any added latency from the fact that you need to have the whole batch fill up? Or if you have a reasonable amount of users, is it so unlikely that it would take you 100 milliseconds to fill up the next 2,000 slots?</p><p><strong>Reiner Pope</strong></p><p>The way to think about this is: when does the train depart, as a model? Let&#8217;s say I&#8217;ve picked a batch size that I&#8217;m going to run at. By the way, this intersection point is the same intersection point here. I pick this batch size, and I know that it&#8217;s going to take, for example, 20 milliseconds, which is a common place this ends up landing.</p><p>This is a timeline of what is running on the GPU. It&#8217;s going to start a new batch every 20 milliseconds regardless. You can think of this as a schedule for the train. A new train departs every 20 milliseconds. Any passengers who are ready board the train. If the train is full, they wait until the next train. If the train is not full, the train is going to go anyway.</p><p>In terms of what that means for queuing latency, the worst case is that a request arrives just after the train departed. 
It has to wait for the next train, so that&#8217;s up to 20 milliseconds, and then it has to wait for that train to complete. So the worst-case latency is 40 milliseconds.</p><p><strong>Dwarkesh Patel</strong></p><p>How is the 20 milliseconds derived?</p><p><strong>Reiner Pope</strong></p><p>It&#8217;s a rule of thumb, but where it comes from is not fully explained yet. So far we&#8217;ve focused on memory bandwidth and compute time. When we look at memory, the other consideration is that we want to use all of the memory capacity we have. Generally, we&#8217;re going to use all of that memory capacity to store the weights or the KVs. In the time of doing a forward pass, we want to read all of the memory capacity into the chip. That is capacity divided by bandwidth. That tends to be 20 milliseconds on many different generations of <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">HBM</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>The units make sense. You would have a byte divided by bytes per second.</p><p><strong>Reiner Pope</strong></p><p>For example, on the <a href="https://www.nvidia.com/en-us/data-center/technologies/rubin/">Rubin</a> generation, it is something like 288 gigabytes divided by 20 terabytes per second. This comes out to about 15 milliseconds.</p><p><strong>Dwarkesh Patel</strong></p><p>Let me make sure I understand what this is saying. I understand the unit analysis. What it&#8217;s saying is we can evacuate and replace the HBM in this amount of time. So we don&#8217;t want to be in a situation where the HBM is not big enough that we&#8217;re not actually able to write everything we want to it or take everything out of it. Or we don&#8217;t want to be in a situation where our ability to write back and forth is so small compared...</p><p><strong>Reiner Pope</strong></p><p>There are sort of two scenarios. Why don&#8217;t we pick a latency that is bigger than 15 milliseconds? 
If I think about what that means, it means I actually have time to read the HBM twice. By the way, most HBM accesses are reads, not writes. It&#8217;s almost all reads because the weight matrices are read-only, and almost all of the KV cache accesses are reads. In around 30 milliseconds, I can read all of HBM twice, but what&#8217;s the point of that? I don&#8217;t want to read the weight matrices twice. I don&#8217;t want to read the KVs twice.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes a ton of sense. A couple of quick questions. If it is the case that the optimal batch size is something like 2,000, it&#8217;s totally dependent on the sparsity, not dependent on the model size or anything.</p><p><strong>Reiner Pope</strong></p><p>Sparsity shows up in model size, but beyond that, it only depends on sparsity, not on scale.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s a very interesting result. One question is: how much of a push towards centralization is it that you get these economies of scale from batching in inference? But it seems like it&#8217;s not that big a deal. Is 2,000 users at the same time a lot? It doesn&#8217;t seem like a lot.</p><p><strong>Reiner Pope</strong></p><p>We can do a bit of analysis on this. You can think of it in terms of number of users, but a more productive way to think of it is in terms of tokens per second. What does this batch size mean in terms of tokens per second of the system?</p><p>Tokens per second is going to be equal to the batch size divided by the time per batch. We run a batch of tokens, and we do that every time interval, which is equal to the 15-millisecond or 20-millisecond number. This ends up being batch size times about 60, so 64 x <em>B</em>. This ends up being around 2,000 x 64, so 128,000 tokens per second. This is in more digestible units.</p><p>It&#8217;s hard to reason about concurrent users, but what is the global traffic for a system? 
When you look at some of the announcements, sometimes the API providers will brag about how much traffic they have. The numbers I remember from some announcements of Gemini last year were in the hundreds of millions of tokens per second worldwide. This is one-thousandth of that.</p><p><strong>Dwarkesh Patel</strong></p><p>Gemini is big. One-thousandth of Gemini is a lot. To actually be competitive at scale, you need to be able to serve at least one-thousandth of Gemini. That&#8217;s interesting.</p><p>The more sparsity you have, the less compute you need. It does seem that as batch sizes get bigger, compute ends up being the bottleneck, according to this analysis. Then the question is, how far can you take sparsity? As the sparsity ratio increases, as you have fewer active parameters relative to total parameters, how much is the performance of the model degrading? Is it degrading faster than you&#8217;re saving compute by increasing the sparsity factor?</p><p><strong>Reiner Pope</strong></p><p>You mean the quality of the model, rather than the speed of the model. Unfortunately, we&#8217;re not able to answer that analytically. That is an empirical question of model quality. The best I can do is pull up a paper and answer that empirically.</p><p><strong>Dwarkesh Patel</strong></p><p>Should we pull up the paper now?</p><p><strong>Reiner Pope</strong></p><p>This paper is &#8220;<a href="https://arxiv.org/abs/2202.01169">Unified Scaling Laws for Routed Language Models</a>.&#8221; It&#8217;s a somewhat old paper by this stage, but one of the things they looked at is if I keep increasing sparsity, what is the model quality impact? This answer is very sensitive to the actual choice of <a href="https://huggingface.co/blog/moe">mixture of experts</a>. Mixture of experts has been around for a really long time, maybe even back in 2017, but the techniques have changed a lot. DeepSeek&#8217;s mixture of experts was a big change in how it worked. 
There have been older papers, like &#8220;<a href="https://arxiv.org/abs/2006.16668">GShard</a>&#8221; and &#8220;<a href="https://arxiv.org/abs/2101.03961">Switch Transformer</a>&#8221;. The actual empirical results are going to depend on all of that.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!rYuq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!rYuq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 424w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 848w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 1272w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!rYuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png" width="550" height="471.0508241758242" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1247,&quot;width&quot;:1456,&quot;resizeWidth&quot;:550,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!rYuq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 424w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 848w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 1272w, https://substackcdn.com/image/fetch/$s_!rYuq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa50660bc-66f9-4b8f-a9ac-30976e26f8e8_2048x1754.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>On one of the older techniques shown here, you can see if I hold constant the number of active parameters at a certain size, and then I increase the sparsity, which they call expert count, the quality keeps increasing. If you imagine drawing a horizontal line from 1.3B dense across, you end up seeing that, in this case, the 64-expert 370-million activated parameter model is as good as a dense 1.3-billion model.</p><p><strong>Dwarkesh Patel</strong></p><p>So in some sense, it&#8217;s actually not amazing returns where you need to increase total parameters a hundredfold to get the equivalent of 10x as many active parameters.</p><p><strong>Reiner Pope</strong></p><p>Actually even more so. 
It&#8217;s a huge increase in parameter count for a modest increase in efficiency.</p><p><strong>Dwarkesh Patel</strong></p><p>So in this case, actually it&#8217;s 4x?</p><p><strong>Reiner Pope</strong></p><p>64x for 4x.</p><p><strong>Dwarkesh Patel</strong></p><p>So while it is true that you get this benefit of being able to economize on your compute time if you increase sparsity, naively it would seem like a trade-off worth making. But if you&#8217;re decreasing this by 2x and then having this go up by 8x every time you double sparsity...</p><p><strong>Reiner Pope</strong></p><p>Is that good or bad, actually? Even from a memory point of view&#8230; Keep in mind you are doubling this portion of the memory fetches, which is amortized by batch. So just keep running a larger batch size. From the point of view of the analysis we&#8217;ve done here, this is a pure win. Keep doing it until you run out of available users, basically.</p><p>There&#8217;s this equivalence where if I have a lot of users, I can go to a much sparser model. From that point of view, it&#8217;s a reasonable trade-off. The other trade-off that shows up here is that it also consumes memory capacity. We&#8217;ve only reasoned about memory bandwidth here, but it also consumes memory capacity.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. Let me make sure I understood. You&#8217;re saying we want to spend less time computing, therefore we do more sparsity. To make that work, we need bigger batch sizes. Which means we need more memory capacity to have more sparsity.</p><p><strong>Reiner Pope</strong></p><p>Maybe this would be a good point to talk about how a mixture of experts layer is typically laid out on a rack of GPUs.</p><h3>00:32:09 &#8211; How MoE models are laid out across GPU racks</h3><p><strong>Dwarkesh Patel</strong></p><p>Cool. Makes sense. Where were we?</p><p><strong>Reiner Pope</strong></p><p>Sparse mixture of experts. 
Maybe how we lay that out on a GPU.</p><p>Let&#8217;s zoom in on the mixture of experts layer first and draw what that looks like. Typically, we&#8217;ll have some kind of a router layer, which is making the decision of where we route the tokens to. We get tokens coming in here, they go through a router layer, and then we have a bunch of different experts. I&#8217;ll draw a few more to line some up.</p><p>The router will make a decision of which experts it&#8217;s going to route to, and it will be a small fraction of them, maybe 1 in 32. Maybe it will make a decision to route to this one, maybe this one, and maybe this one. Each expert itself is a normal <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">MLP</a>. It has an up projection and then a down projection with a nonlinearity in between.</p><p>Then finally, we do the inverse operation. Where we were broadcasting things out here, we&#8217;re going to bring them back in and sum them up. Bringing them in like this. Then finally, we have our residual connections. The token is also passed through here, and it gets added to the result of the MoE layer. This is a normal MoE layer.</p><p>What I want to talk through is how this is mapped to a GPU rack and what this means for communication, because I think this will start to show some of the limits of how sparse we can go. The standard practice here, and it is the best solution, is to use <a href="https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html">expert parallelism</a>. That means different experts go on different GPUs. If we take something like a DeepSeek model, they have 256 experts. Let&#8217;s say we want to run that on a <a href="https://en.wikipedia.org/wiki/Blackwell_(microarchitecture)">Blackwell</a> rack. There are 72 GPUs.</p><p>We have a divisibility problem. This is not a power of two. We&#8217;ll just simplify and say we&#8217;re only going to use 64 of them. Just ignore the other eight. It&#8217;s not a big deal. 
So we have four experts per GPU. Very simple. For the sake of the diagram, actually let&#8217;s just say we have two experts per GPU. We end up just putting these GPU boundaries. Every pair of experts is on its own GPU.</p><p>Then we can look at the communication cost. We had some tokens stored centrally here. They get routed to all of these experts, and there is some communication cost paid here. There&#8217;s the same communication cost paid on the output. The hope is that this does not become communication limited.</p><p>Now what is the traffic pattern here? The traffic pattern here is that any GPU will be talking to any other GPU, depending on the decisions made by the model. This is an <a href="https://www.dell.com/en-us/blog/understanding-llm-gpus-clusters-fabrics-traffic-for-networkers/">all-to-all traffic pattern</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>When you say any GPU in the pre-tense, the router is more than one GPU?</p><p><strong>Reiner Pope</strong></p><p>I drew this as one router. In reality, you would actually have many copies of the router, and you would have as many routers as GPUs, in fact.</p><p><strong>Dwarkesh Patel</strong></p><p>As the incoming traffic.</p><p><strong>Reiner Pope</strong></p><p>Yeah. These are 64 GPUs and these are 64 GPUs. It&#8217;s actually the same GPUs, we just draw them as separate because they&#8217;re serving different purposes. So at this point, any GPU can be sending to any other GPU.</p><p>This all-to-all pattern of communication that shows up and how the Blackwell racks are configured is a perfect fit for the communication pattern that the MoE actually wants to do. However, if you think maybe one rack is too slow and I want to do two racks, then I have this challenge that maybe I&#8217;ve got some sort of rack boundary drawn outside here like this, and I no longer have all-to-all communication between all the GPUs in two racks. 
The rack-to-rack communication ends up being a substantial bottleneck.</p><p>The fundamental thing here is that one rack bounds the size of an expert layer you can do. This has been part of what&#8217;s been driving towards larger and larger interconnect domains.</p><p><strong>Dwarkesh Patel</strong></p><p>Before we continue, it may be worth you explaining what exactly a rack is. The differences in bandwidth between racks and within a rack, and the all-to-all versus not all-to-all nature of communication within versus outside.</p><p><strong>Reiner Pope</strong></p><p>This is a place where it starts to be very different between Nvidia, for example, and Google, and then others, including us. Generally, a rack is a physical structure. It&#8217;s a few meters tall, a meter or two wide, depending on configuration, and it stores some number of GPUs or <a href="https://semiengineering.com/what-is-an-xpu/">XPUs</a>, which is typically about 64.</p><p>What constrains it to a certain size is power delivery, weight, and cooling ability. It ends up being about this size in many cases because of these physical constraints. When I deploy a data center, a data center may have thousands of these racks. I&#8217;ve got one of these tall racks, it&#8217;s got a bunch of GPUs in it, and so on. And then I put another rack next to it.</p><p><strong>Dwarkesh Patel</strong></p><p>You make it sound so easy.</p><p><strong>Reiner Pope</strong></p><p>Right. I just drop them in. In Nvidia&#8217;s case, the communication topology&#8230; They actually put the GPUs on the outside of the rack, and then they put these switches on the inside of the rack. What this ends up being is that there&#8217;s a set of switches in here. These are the <a href="https://www.nvidia.com/en-us/data-center/nvlink/">NVSwitches</a>. Then they run a bunch of cables. Every single GPU has cables going to the switches in the middle. The switches have connections to all the GPUs. 
All of the GPUs can talk to all the other GPUs in just two hops: going to the switch, going to the other GPU.</p><p>Now, when I want to leave the rack, I end up going via a different path. The GPUs also have a much slower connectivity, which is typically about eight times slower. The green that I drew here in the GPU cases is the NVLink. More generally, it&#8217;s called the <a href="https://naddod.medium.com/understanding-scale-up-vs-scale-out-in-ai-infrastructure-584723afb94d">scale-up network</a>. You will typically also have a <a href="https://naddod.medium.com/understanding-scale-up-vs-scale-out-in-ai-infrastructure-584723afb94d">scale-out network</a>, which allows you to connect to some <a href="https://www.lanaotek.com/what-is-a-switch-in-a-data-center.html">data center switch</a>. All of the GPUs will have some connectivity up to some data center switch somewhere. This is the scale-out, and it tends to be about 8x slower in bandwidth.</p><p>The challenge, if you want to lay out a mixture of experts layer across two racks, is that half of the GPUs here are going to be wanting to talk to the GPUs here. On average, when I look at where the tokens on these GPUs want to go, half of the tokens want to go inside the rack. That&#8217;s great. They can use the fast scale-up network. But half the tokens are going to want to leave the rack and go to the other rack, and that&#8217;s not as good. They need to use a much slower network, and so that becomes the bottleneck on the all-to-all pattern.</p><p>A different choice would be, why don&#8217;t I have a big switch here and connect everything to a much bigger switch that actually combines the two racks together? There are many ideas in this direction, but in general, the reason you have this hierarchy of switches rather than one big switch is to manage the cabling congestion. 
You just need to run a large number of cables.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, is that question you just asked basically, why isn&#8217;t it a bigger scale-up?</p><p><strong>Reiner Pope</strong></p><p>Exactly. Why not just have a million chips in scale-up or a thousand chips?</p><p><strong>Dwarkesh Patel</strong></p><p>What has changed that has allowed Nvidia to go from <a href="https://en.wikipedia.org/wiki/Hopper_(microarchitecture)">Hopper</a>, which was 8, then Blackwell is 72, and now Rubin will be... is it 500 something?</p><p><strong>Reiner Pope</strong></p><p>Yeah, 500 and something.</p><p><strong>Dwarkesh Patel</strong></p><p>What has allowed that to happen?</p><p><strong>Reiner Pope</strong></p><p>From Hopper to Blackwell is mostly just the decision to switch from trays as the form factor to switching to racks as the form factor. That&#8217;s a product decision. There wasn&#8217;t a substantial technical barrier there.</p><p>Switching from 64 to 500 or so, there&#8217;s a bit of <a href="https://www.dwarkesh.com/p/jensen-huang">Jensen</a> math there, but there is at least a genuine 4x increase, which is coming from a much more complicated and difficult rack design. That is actually a new physical design to run more cables.</p><p><strong>Dwarkesh Patel</strong></p><p>The cable complication is just the cost of figuring out which cable hops to which, or which signal goes from what to what?</p><p><strong>Reiner Pope</strong></p><p>Let&#8217;s zoom in on this and look at the wire density. I&#8217;ll draw this diagram just once more so we have a bit of a cleaner and larger version to work with.</p><p>Let&#8217;s say I have some switches in the middle. Initially, I&#8217;m going to start with just two GPUs on each side or two trays of GPUs on each side. Let&#8217;s say maybe each tray wants to have two cables coming out of it. I physically run vertical cables that look like this running out to the switches. 
Now if I want to double the number of GPUs in a rack, I need to run literally twice the density of cables. I need to run these as well.</p><p><strong>Dwarkesh Patel</strong></p><p>Extremely naive question. But if you look at a physical data center, it seems like there&#8217;s a lot of space within a rack. I don&#8217;t know. The cables are really big and...</p><p><strong>Reiner Pope</strong></p><p>There is space outside the rack. Inside the rack&#8230; As they become more optimized, these racks are very tight. There&#8217;s connector density going from the tray into the rack and the rack&#8217;s backplane, and the backplane itself has a really high density. There are other physical constraints including the bend radius of cables. You don&#8217;t want to snap them and so on.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, so it&#8217;s literally the physical space to put a cable that&#8217;s constraining it. I had no idea. Interesting. That seems surprising. The rack is so big and we can&#8217;t just stuff more cables in there.</p><p><strong>Reiner Pope</strong></p><p>Rack design is not my expertise, but when I talk to folks on what constraints they&#8217;re up against, it&#8217;s a combination of things. What are the big physical things you&#8217;re optimizing for? Space, weight of the rack. It&#8217;s actually really heavy, so you need enough metal to not sag and fall. But then you add more metal, and it&#8217;s heavier. Then power and cooling. All of those are competing. Modern racks are pushing all of those to very extreme physical limits.</p><p><strong>Dwarkesh Patel</strong></p><p>When was <a href="https://en.wikipedia.org/wiki/GPT-4">GPT-4</a> released again? Was it 2022 or 2023?</p><p><strong>Reiner Pope</strong></p><p>2023.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. And it was rumored to be over one trillion parameters. 
It seems like only now, within the last six months, have models been getting released that have significantly more parameters than the model released three years ago, when supposedly there should have been this scaling in the meantime.</p><p>Is the reason that we were just waiting for racks with enough memory to hold a five-trillion parameter model, along with its KV cache for enough users for a lot of sequences? Or if you&#8217;re doing <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">RL</a>, a similar consideration of actually holding the KV cache for the batch of problems you&#8217;re trying to solve.</p><p>If you look at Hopper, you had eight Hoppers, and I think that&#8217;s 640 gigabytes as of 2022. With Blackwell finally, which was deployed in&#8230;?</p><p><strong>Reiner Pope</strong></p><p>Very recently. Maybe last year.</p><p><strong>Dwarkesh Patel</strong></p><p>Last year. You finally have a scale-up on the order of 10-20 terabytes, which is enough for a 5T model plus KV cache.</p><p><strong>Reiner Pope</strong></p><p>Deploying in larger scale-up domains is a huge unlock. I&#8217;ve drawn here the Nvidia Blackwell deployment. The Google deployment has actually had very large scale-up domains for a long time.</p><p><strong>Dwarkesh Patel</strong></p><p>That also explains why Gemini seemed to be ahead. It just seems like Gemini has had successful pre-training for longer than some of the other labs.</p><p><strong>Reiner Pope</strong></p><p>Not having been there at the time, I&#8217;m not sure how much is coming from successfully deploying higher sparsity ratios, which it could be. It could also be a whole bunch of actual modeling things, specifically how you do the mixture of experts. We&#8217;ve seen the DeepSeek mixture of experts activate more experts, but finer-grained experts. That was a big innovation. 
I&#8217;m sure there are many other innovations on the model architecture as well as on the training data.</p><p>It&#8217;s hard to disentangle all of them, but what shows up in terms of the limits of what you can do is that the active parameters, as we saw, are limited by the compute cost, and the total parameters are limited by the scale-up size.</p><h3>00:47:12 &#8211; How pipeline parallelism spreads model layers across racks</h3><p><strong>Dwarkesh Patel</strong></p><p>When you&#8217;re operating within a single scale-up domain, is that a consideration specifically for either forward or backward, or specifically for <a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/">prefill</a> versus <a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/">decode</a>? Or is it preferred to always be within a scale-up whatever kind of workload you have, whether you&#8217;re doing a pre-training run, RL generation, or inference for users?</p><p><strong>Reiner Pope</strong></p><p>Really interesting. To answer that question, we&#8217;re going to need to talk about the communication patterns. We&#8217;ve talked about the mixture of experts communication pattern. That is this all-to-all. All-to-all very strongly favors full connectivity, which is what we&#8217;ve just shown here, and it favors being within one rack.</p><p>There are other kinds of parallelism besides expert parallelism, which we just showed here. In the literature is <a href="https://huggingface.co/docs/text-generation-inference/en/conceptual/tensor_parallelism">tensor parallelism</a>. With the trend towards smaller experts, this has become much less relevant, so we can ignore that. But the other two things we have available are <a href="https://en.wikipedia.org/wiki/Data_parallelism">data parallelism</a> and <a href="https://docs.pytorch.org/docs/stable/distributed.pipelining.html">pipeline parallelism</a>. 
They can be a much better fit for using multiple racks.</p><p>Let&#8217;s focus on pipeline parallelism specifically. This is one layer of MoE. I&#8217;m going to have a hundred more layers up above. I could decide at this point, for example, to move to a different rack, change rack. Now, is that going to become a communication bottleneck? We can actually solve for when this becomes a communication bottleneck. Before we do that algebraically, let&#8217;s visualize it out and sketch the path. We&#8217;re going to have another MoE layer, and another MoE layer here, and so on.</p><p>Let&#8217;s say I change rack here, and then some number of layers later, I change rack here as well. The methodology we&#8217;re going to use to determine whether we have a communication bottleneck at the point where we change rack is we&#8217;re going to compare the scale-out bandwidth requirements to the scale-up bandwidth requirements. Let&#8217;s write this. The hint is going to be that there&#8217;s a lot more sends here. We&#8217;re sending many things here, whereas we&#8217;re only sending one thing here, and we&#8217;re also maybe doing it many times. That&#8217;s what makes the difference.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I try to guess? Just out of curiosity, to see if I&#8217;m actually understanding, it seems like you&#8217;re sending batch size into the rack.</p><p><strong>Reiner Pope</strong></p><p>In here? Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>But the communication within the rack is batch size times number of GPUs.</p><p><strong>Reiner Pope</strong></p><p>Number of activated GPUs. I don&#8217;t send to this GPU at all. There&#8217;s an explosion from 1-3x larger here in this diagram. The key thing is that I didn&#8217;t even need to send to this GPU at all, and so that&#8217;s a big saving.</p><p>We&#8217;re going to talk through to what extent scale-up is a bottleneck over scale-out. 
We will directly jump to the ratio of the time spent on scale-up over the time spent on scale-out. This is the quantity we&#8217;re talking about.</p><p>The first consideration is that scale-up is 8x faster than scale-out generally. At a baseline, if the amount of data sent were the same, we would have this 1/8, which is coming from bandwidth. But then we have some amount of expansion in how much data we&#8217;re sending. If one token comes in here, then this one token gets routed to, in the DeepSeek case, maybe 32 experts or 16 experts. It gets routed to some number of experts. So this is the number of activated experts. This same thing applies on multiple different layers, so maybe I&#8217;m going to run two layers. So we also multiply by the number of layers per stage.</p><p><strong>Dwarkesh Patel</strong></p><p>Don&#8217;t you need to multiply the whole thing by two for the all-to-all?</p><p><strong>Reiner Pope</strong></p><p>For the up and down. Yes, there&#8217;s a factor of two. Thank you.</p><p>What we would like is for the scale-up time to be greater than the scale-out time, because the scale-up time is the more important and precious resource. We would like this number to be greater than or equal to one. This really doesn&#8217;t seem hard. There&#8217;s just a factor of 8 that we need to overcome. So we need the product of these three things to be bigger than 8. Typically we have a fairly large number of activated experts. It could be 8 by itself. Then we can increase the number of layers per stage a lot until we satisfy this.</p><p>What this ends up looking like is that I can have an entire pipeline of racks where one rack does one layer, and then I move on to the next rack and do another layer, and then I move on to the next rack and do another layer.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s interesting to me that the best parallelism strategy in practice ends up being one which physically resembles the actual architecture. 
It&#8217;s not some galaxy brain thing. It&#8217;s like, &#8220;Oh, we have experts, we&#8217;re going to put them on different GPUs, or we have different layers, we&#8217;re just going to put them on different racks.&#8221; I feel that&#8217;s interesting.</p><p><strong>Reiner Pope</strong></p><p>The cutting matches the model architecture.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. It could have been something wackier with tensor parallelism and whatever.</p><p><strong>Reiner Pope</strong></p><p>The galaxy brain way to think of it is, what are all the different dimensions in which a model is scaled up? It is scaled up by layers, it is scaled up by the model dimension, it is scaled up by the <a href="https://en.wikipedia.org/wiki/Feedforward_neural_network">DFF</a> dimension, it is scaled up by the number of experts. Every single one of those numbers you can choose to cut along. If those numbers are big enough, it eventually becomes profitable to cut along there. We have selected two of them. The other two, in the way models are typically sized, are not profitable.</p><p><strong>Dwarkesh Patel</strong></p><p><a href="https://youtu.be/1yvBqasHLZs">So there&#8217;s a talk by Ilya where he says, &#8220;Today we know not to do pipeline parallelism.&#8221;</a> And <a href="https://horace.io/">Horace He</a> gave my friends and me&#8230; I hate that it sounds like a Dr. Seuss quote. But he gave us a lecture on these different kinds of parallelisms. He said the problem with pipeline parallelism is that, other than the bubbles, it creates these architectural constraints. 
<a href="https://en.wikipedia.org/wiki/Kimi_(chatbot)">Kimi</a>, for example, has these <a href="https://go281.user.srcf.net/blog/research/residual-streams/">residuals</a> where attention attends to layers a few back, so it becomes hard to implement in this way.</p><p><strong>Reiner Pope</strong></p><p>I guess we didn&#8217;t fully articulate even what is the benefit that we&#8217;re getting from pipelining. These complexities are real. Pipelining is a massive hassle, but it does give you some benefits. You can then decide whether those benefits are worth the costs. It has some benefits in inference, maybe bigger benefits in training. In inference, what are we saving on? Are we saving on memory time or compute time? Not really. We&#8217;re just moving the memory time from one chip to another chip, or one rack to a different rack. There&#8217;s no actual benefit in runtime.</p><p>However, what we are saving on is memory capacity. If we think that the memory in a rack is a bottleneck, then there&#8217;s a constraint on how fast we can go. Pipelining allows us to massively reduce that bottleneck.</p><p><strong>Dwarkesh Patel</strong></p><p>The opposite connotation to this&#8230; Before this interview, I was chatting with <a href="https://feldmann.nyc/">Axel</a>, who&#8217;s a GPU performance engineer at <a href="https://en.wikipedia.org/wiki/Jane_Street_Capital">Jane Street</a>. He was explaining that to do pipelining, you have to do micro-batches rather than full batches. If you do micro-batches, then you&#8217;re by definition not able to amortize loading the weights across all the users or all the sequences. The positive connotation of that is you don&#8217;t have to use as much memory. The negative connotation is that we can&#8217;t amortize loading the weights across all those users. Maybe it&#8217;s worth explaining why you have to do micro-batches.</p><p><strong>Reiner Pope</strong></p><p>Shall we draw the pipeline bubble? 
What is this micro-batching that shows up in pipeline parallelism?</p><p>I&#8217;ll focus on inference first. It&#8217;s a slightly simpler problem. I&#8217;m going to draw time, and then which rack we&#8217;re on. The idea is that maybe I&#8217;ll have four racks. I&#8217;ve got an inference that is going to step through these four racks in some time like this. This is inference number zero. It runs at a certain batch size and steps through all the pipeline stages like this.</p><p>Now, if we were to say, &#8220;Well, we&#8217;re going to run inference number one here,&#8221; this is clearly a massive waste. Like three-quarters of the time each of the racks is doing nothing. We don&#8217;t actually run inference one here, we run it as soon as we can, which is immediately after inference zero finishes. And then we keep going. If we hadn&#8217;t filled this in, we would call this the pipeline bubble. When I&#8217;ve drawn it in this inference context where we&#8217;re only going in a forwards pass, it&#8217;s obvious. Why would you do this stupid thing? In a training context, it&#8217;s maybe less obvious. But in the inference context, it&#8217;s really natural to make this change.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, interesting. This is sort of obvious, but the difference between micro-batch and batch doesn&#8217;t matter at all in inference because you can just call it whatever you want. It only matters in training because there is an optimal batch size.</p><p><strong>Reiner Pope</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>Before you do a full backward step, you want to have accumulated all the sequences in that batch. If you want to do pipelining in training, in order to avoid that bubble, you need to&#8212;</p><p><strong>Reiner Pope</strong></p><p>Should we draw the training diagram with that? Let&#8217;s do that. 
This is the inference diagram, and I&#8217;ll call this forward so we don&#8217;t have the wrong thing showing up there. Let&#8217;s do the same thing for training now. We&#8217;ve got a forwards pass, but at some stage we&#8217;re going to have to transition to a backwards pass.</p><p>We&#8217;ll do some number of batches in the forwards pass, and then we&#8217;re going to transition to the backwards pass for everyone all in one go. The inference part is the same here, but then we do a hard stop at this point and transition everyone to the backwards pass, with similar numbering like this.</p><p><strong>Dwarkesh Patel</strong></p><p>It may be worth clarifying the reason there is that hard stop is because you want to do a whole batch at once for the backward step. And then there is an optimal size for how big that batch should be.</p><p><strong>Reiner Pope</strong></p><p>Smaller is always better, actually, is a way to put it. From an <a href="https://www.cs.ubc.ca/~schmidtm/Courses/540-W18/L5.pdf">ML convergence rate</a> perspective, smaller is always better because you&#8217;re getting the freshest information from the <a href="https://www.geeksforgeeks.org/machine-learning/gradient-descent-algorithm-and-its-variants/">gradient descent</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>But from a total training time perspective?</p><p><strong>Reiner Pope</strong></p><p>From a total training time perspective, smaller is worse from a systems perspective. The optimum is the trade-off between those two.</p><p>So you pick a batch size, and for that batch size, you do some amount forwards and then some amount backwards. You asked why there is even a hard stop there. With pipeline parallelism, because you&#8217;ve got this idle time here which is the bubble, there are so many techniques in the literature for how to lay this out differently and avoid that. 
There are more complicated schemes called <a href="https://arxiv.org/abs/2401.10241">zero bubble</a> or <a href="https://arxiv.org/abs/2406.03488">one-forward-one-backward</a>, which interweave the forwards and the backwards in complicated ways.</p><p><strong>Dwarkesh Patel</strong></p><p>You can mine Bitcoin in that bubble.</p><p><strong>Reiner Pope</strong></p><p>Right. More usefully, you can do the weight gradient step, but you can also mine Bitcoin.</p><p>In inference, the effect of pipelining on anything you care about, like batch size or latency, is neutral. It doesn&#8217;t improve it, it doesn&#8217;t make it worse. If you look at the latency of this inference, running it if it were pipelined versus if it were all on one rack&#8230; If it were all on one rack, we would just slide all the boxes down and still put them in a row, and the latency would be the same.</p><p>Pipelining is neither better nor worse for latency. It does mean that you just use less memory capacity per rack. Because now instead of needing the whole model, you only need a quarter of the model, and you can expand.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes a ton of sense. So it&#8217;s a no-brainer to use pipelining during inference, but there&#8217;s this harder trade-off during training.</p><p><strong>Reiner Pope</strong></p><p>Even in inference, in fact, it is not used a ton. It reduces your memory capacity requirements, but there&#8217;s actually a huge surplus. I think you were saying that a rack of Blackwell has many tens of terabytes. That&#8217;s much bigger than a trillion parameter model. A trillion parameter model only needs one terabyte, so it already fits. There&#8217;s not a huge benefit from pipelining because you&#8217;re reducing a number that&#8217;s already pretty small.</p><p>But it does say that theoretically, maybe you had too much memory there. You could have built different hardware that has less memory. 
If you were designing your hardware, you could say, &#8220;I didn&#8217;t need that much memory because I don&#8217;t need the weights to fit in one rack. If I can fit the weights in eight racks, then I could have built hardware that didn&#8217;t have so much HBM per GPU.&#8221;</p><h3>01:03:37 &#8211; Why Ilya said, &#8220;As we now know, pipelining is not wise.&#8221;</h3><p><strong>Dwarkesh Patel</strong></p><p>Macro question: everybody&#8217;s talking about the <a href="https://ayarlabs.com/glossary/memory-wall/">memory wall</a> right now. <a href="https://www.theverge.com/news/839353/pc-ram-shortage-pricing-spike-news">Memory is getting super expensive</a>. There&#8217;s not enough memory. Smartphone volume will go down 30% because there&#8217;s not enough memory. This is shocking: <a href="https://www.dwarkesh.com/p/dylan-patel">Dylan</a> said hyperscalers are spending 50% of their CapEx this year on memory.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s believable.</p><p><strong>Dwarkesh Patel</strong></p><p>What is hyperscaler CapEx? That&#8217;s high hundreds of billions, maybe a trillion, and they&#8217;re spending half of that on memory? That is a huge constraint. That&#8217;s why we&#8217;re not going to get new laptops and phones this year.</p><p>But at the same time, we have too much memory? People are willing to put too much memory into these systems. Why is Jensen shoving all this memory into these racks if you don&#8217;t need it?</p><p><strong>Reiner Pope</strong></p><p>In the equations we had here before we erased them, we were doing memory time, memory bandwidth and compute bandwidth. Let&#8217;s now start looking at memory capacity.</p><p>We&#8217;ll start off with memory capacity without even thinking about a parallelism scheme. The demand on memory is the number of total parameters. This is what we need to fit the weights in some system that we are using. Then we need to fit the KVs as well. 
KVs go as batch size times the length of the context times the bytes per token.</p><p>What I was arguing about in this context, and the case I was making for pipelining, is that there are some techniques that allow us to solve this. Let&#8217;s consider running this on some number of GPUs. We&#8217;re going to have one extent, which is <em>E</em>, the expert parallelism. When we had this sharding of an expert layer across many GPUs, to what extent do we do that? How many GPUs? We&#8217;re going to say that this is, for example, 64. Then <em>P</em> is going to be the extent of pipelining. This is the number of racks, maybe we&#8217;ll pick 4 or something like that.</p><p>This is the total memory requirement across the system, but now I&#8217;m going to calculate a memory requirement per GPU. I&#8217;ll use a lowercase <em>c</em><sub>mem</sub>. Obviously, we just take all of these numbers and divide them by <em>E</em> and <em>P</em>. Really easy. It&#8217;s this <em>N</em><sub>total</sub>, plus the batch times length of context times bytes per token, all divided by <em>E</em> times <em>P</em>.</p><p>Why is it correct to divide this way? We knew that the parameters were perfectly divided amongst all the GPUs in a rack. The layers are perfectly divided amongst the different racks. So that works here. Somehow we&#8217;re going to arrange&#8212;I&#8217;ll hand-wave exactly how&#8212;the same perfect sharding of the contexts across GPUs in a rack, and then based on layer across racks.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, 4 is the number of racks?</p><p><strong>Reiner Pope</strong></p><p>Yeah, for example.</p><p>This is the place where we actually need to go back and analyze this batch size <em>B</em>. You were making this comment that there&#8217;s micro-batching versus global batching. Let&#8217;s come back to this pipelining diagram here.</p><p>We&#8217;ve got one batch going forward here, and then as I drew it, it kind of just disappeared. 
That&#8217;s not really correct. If you think about how decode is working, I have a bunch of tokens that I have generated already. I do one forwards pass where I generate a new token, and then I write that to my KV cache. Then I do another forwards pass that generates the next token. I&#8217;m actually going to be running this batch zero in a loop. In fact, I go forwards. Once I finish, I can start the next iteration of the loop up here. We&#8217;ll just fill this in. We&#8217;ve got the two, three, two and three, and two and three.</p><p>Let&#8217;s split this batch. This batch will be the global batch size. <em>B</em> is going to be the number of micro-batches times the batch size per micro-batch. How many micro-batches do we need? The number of micro-batches in this diagram is 4: zero, one, two, three. The micro-batch size is still this 2000-ish number. Sorry, no, this is the 300 times sparsity.</p><p><strong>Dwarkesh Patel</strong></p><p>This is how big the train that takes off every 20 milliseconds is.</p><p><strong>Reiner Pope</strong></p><p>Right. This is going to be the 20-millisecond train. The global batch size is the number of micro-batches times the local batch size. Local batch size is set by this hardware parameter.</p><p>The number of micro-batches is as small as possible, such that we can wrap around and not leave any idle time. If we had fewer, we would have this idle time when we wrap around. You can visually see that it is equal to the number of pipeline stages. It&#8217;s a proof by visual here. It is 4, and it&#8217;s 4 this way as well. You can look and see that it goes along here, and then it wraps around to the number of pipeline stages.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, very basic question. Is this what is actually done? A frontier model today will have pipelining during inference?</p><p><strong>Reiner Pope</strong></p><p>For sure during massive scale training this is done. It can be done for inference. 
I&#8217;m actually going to make the case for why it is less attractive. It is useful for weights, but not so useful for KVs.</p><p>The big challenge is... Let&#8217;s fill this in. The number of micro-batches here ends up being equal to the number of pipeline stages. When we go back and substitute all of that into here, we get a number of pipeline stages times this little <em>b</em> showing up in here. When we factor this out, I&#8217;m going to split this plus into two terms.</p><p>We get the full division by <em>E</em> times <em>P</em> over here. We still have division by <em>E</em> times <em>P</em> over here, but the <em>P</em>s cancel. What we find is that if you increase the number of pipeline stages, the memory footprint of the weights keeps going down and down and down, but the memory footprint of the KV cache stays constant. So it doesn&#8217;t actually work.</p><p>Most of your memory&#8230; Once you do enough pipelining&#8212;and it&#8217;s really not much, even two is often enough&#8212;this term becomes very small. The KV cache becomes the dominant term.</p><p><strong>Dwarkesh Patel</strong></p><p>I know this is wrong. I&#8217;m just trying to think about why my train of logic here is wrong. If you&#8217;re pipelining through many different stages, the KV values are not shared between layers. Why would it not help to be pipelining across multiple layers? Because then you don&#8217;t have to store...</p><p><strong>Reiner Pope</strong></p><p>You only need to store one layer rather than two layers of KVs. It helps from that perspective, you&#8217;re right. 
What&#8217;s competing with that, though, is that you need to be keeping all of the racks usefully busy at a time, so the number of sequences that are in flight simultaneously has gone up.</p><p><strong>Dwarkesh Patel</strong></p><p>Ah, that makes sense.</p><p><strong>Reiner Pope</strong></p><p>Those exactly cancel, and you end up not getting a saving per GPU.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. This is going back fundamentally to the point of how you&#8217;re not able to amortize across KV caches.</p><p><strong>Reiner Pope</strong></p><p>First, we established you can&#8217;t amortize KV caches across batch size. Now we&#8217;re saying you also can&#8217;t shard it across pipeline stages. It sucks from both of those points of view.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. So then what is done during inference?</p><p><strong>Reiner Pope</strong></p><p>The DeepSeek paper reports what they do, which is that they just do a lot of expert parallelism. In effect, you should increase your expert parallelism up to your scale-up domain size, and then do very little pipelining. Maybe none at all, maybe two, just enough to make the weight storage not too big of an issue.</p><p>Those are the only two parallelisms that really make sense. In the past, there was tensor parallelism, which was cutting up within an expert, but the experts are so small now that that is not a profitable optimization.</p><p><strong>Dwarkesh Patel</strong></p><p>Does that mean that frontier labs, when they&#8217;re doing inference, are just within a single scale-up?</p><p><strong>Reiner Pope</strong></p><p>Yes. You can look at how it depends on model size. You could have a very large model, one that exceeds the memory of a rack. There you should be doing a bit of pipelining. 
Maybe it&#8217;s extremely sparse, for example, and that would be a reason to do it.</p><p><strong>Dwarkesh Patel</strong></p><p>This goes back to the promise at the beginning of the lecture, which was this will actually tell you about AI progress as well. To the extent it is the case that model size scaling has been slow until recently&#8230;</p><p>Let me make sure I understand the claim. The claim would not be you could have trained across more racks. It was just that it would not have made sense before, we didn&#8217;t have the ability to do inference for a bigger model easily.</p><p><strong>Reiner Pope</strong></p><p>Actually, pipelining doesn&#8217;t help with context length. It totally helps with model size. Because of the ability to do pipelining, a rack at least should not be a constraint on your ability to fit the model parameters.</p><p>The other consideration you&#8217;re asking is, why hasn&#8217;t it scaled up more, and why did bigger scale-up domains help? We talked through one aspect of that, which is that it&#8217;s not because of memory capacity. We have a solution to the memory capacity at least with respect to model size, not with respect to KV cache size but at least with respect to model size. The other issue that shows up is latency.</p><p><strong>Dwarkesh Patel</strong></p><p>I was just about to ask, going from rack to rack, what is the latency cost per hop?</p><p><strong>Reiner Pope</strong></p><p>This is very much dependent on the hardware. I can&#8217;t say with a lot of authority. 
I think it&#8217;s probably on the order of a few milliseconds, but it could be off by an order of magnitude there.</p><p><strong>Dwarkesh Patel</strong></p><p>Is 4 a realistic number of how many pipelining stages you might have?</p><p><strong>Reiner Pope</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>So that&#8217;s not that much.</p><p><strong>Reiner Pope</strong></p><p>On a small number of pipelining stages, this is not a huge latency impact.</p><p><strong>Dwarkesh Patel</strong></p><p>But I guess it&#8217;s 10 milliseconds per token.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>2 times 4-ish, or I don&#8217;t know how many you said&#8230; 10 milliseconds per token is actually a lot.</p><p><strong>Reiner Pope</strong></p><p>If it goes from 20 to 30, or something like that&#8230;</p><p>Just to chart the path that it goes through, here you&#8217;re going from your GPU or TPU to a network card, which then goes to a top-of-rack switch, and then hops over to the other rack and does the same thing in reverse. You have to sum up the latencies of these different things.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, is this the same thing as the data center switch?</p><p><strong>Reiner Pope</strong></p><p>It may in fact go up to a data center switch and back. It depends on deployment configuration.</p><p><strong>Dwarkesh Patel</strong></p><p>Got it. And because it&#8217;s decode and sequential, they stack up across the stages. You can&#8217;t do them at the same time.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>This brings us back to the question then, is the size of the scale-up at all relevant to why AI model sizes have been what they have been over the last few years, whether through training or through inference?</p><p><strong>Reiner Pope</strong></p><p>We talked about latency of the hop. 
There is also just the <em>t</em><sub>mem</sub> latency. The memory time latency is actually massively improved by larger scale-up domains. I&#8217;ll recall <em>t</em><sub>mem</sub> down here. <em>t</em><sub>mem</sub> for the weights was equal to the number of total parameters divided by the memory bandwidth. Which memory bandwidth are we talking about here? Is it just one GPU? It is the number of GPUs that I can use in parallel to load these weights. I can&#8217;t use different pipeline stages in parallel because they&#8217;re not running at the same time, but I can use all the GPUs in my scale-up domain in parallel to load the weights. This is actually extremely effective. Basically, I end up with a term here, this memory bandwidth term itself is equal to scale-up size...</p><p><strong>Dwarkesh Patel</strong></p><p>Times memory bandwidth per GPU.</p><p><strong>Reiner Pope</strong></p><p>Yeah. Times GPU bandwidth. This term doesn&#8217;t increase a lot. It maybe increases 1.5 or 2x per generation, but this one increased by a factor of 8 from Hopper.</p><p><strong>Dwarkesh Patel</strong></p><p>So the reason the bigger scale-up matters, it&#8217;s not the memory capacity of the whole scale-up, but really the memory bandwidth.</p><p><strong>Reiner Pope</strong></p><p>Yeah. Pipelining totally solves the capacity problem, but scale-up size helps solve the bandwidth problem.</p><p><strong>Dwarkesh Patel</strong></p><p>And the bandwidth problem helps you do longer context lengths, which is more and more relevant as these models get more agentic.</p><p><strong>Reiner Pope</strong></p><p>It lets you just run the model at lower latency as a first thing. If I just do a very sparse model and it&#8217;s on a little H100 box, the latency will be really high.</p><h3>01:18:59 &#8211; Because of RL, models may be over-trained 100x beyond Chinchilla-optimal</h3><p><strong>Dwarkesh Patel</strong></p><p>A super tangential question. 
There&#8217;s <a href="https://arxiv.org/abs/2203.15556">Chinchilla scaling</a>, which tells you how big a model should be relative to the amount of data you&#8217;re going to train it on. But now, obviously, you&#8217;re not just trying to optimize for the highest quality model you could get with training compute. You want the best results a user can get with a mixture of training and inference compute.</p><p>So there&#8217;s a question of how much you should over-train a model such that compute amortized over training and inference is minimized to get a certain performance. But now with RL, there&#8217;s another consideration which is, you&#8217;re going to do some amount of pre-training.  That pre-training will be used both for RL generation and then for inference for the final user. By over-training here I mean that while it would have been more efficient just from a training compute perspective to have a bigger model that you train for less time because it can learn faster, maybe you get a smaller model, spend more compute training it than you otherwise would have, but now it&#8217;s cheaper to give it to users.</p><p>Let me make the question more concrete. Basically, how much more than Chinchilla optimal are models over-trained? And has that changed as a result of RL generation?</p><p><strong>Reiner Pope</strong></p><p>This is a place where we have to do a bit of guesswork because the updated scaling laws and the model traffic are not reported, so we have to guess there. One way to look at it&#8230;</p><p>Let me first just make a general heuristic claim. If I have some cost, and I&#8217;ve got a total cost which is a sum of cost <em>A</em> and cost <em>B</em>, like maybe this is the training cost and this is the inference cost, and I want to minimize this sum&#8230;</p><p>For many curves, the minimum tends to be where the costs are equalized. That&#8217;s something of a heuristic claim, but there are many examples where it&#8217;s true. 
Where one is 1/<em>x</em> and the other one is <em>x</em>, for example, they tend to be minimized at the point where they equal each other. It&#8217;s also true for e<em><sup>x</sup></em> and e<em><sup>-x</sup></em> and all kinds of other things. Basically, I&#8217;ve got some curve that&#8217;s going down, some other curve that&#8217;s going up, and they tend to be minimized at this equal point.</p><p>Heuristically, I will conjecture that that is true for the setup you described as well. Actually showing that would be true would require looking at the scaling laws and fitting these weird exponents, but things that follow power laws tend to have this property. So I&#8217;ll just make that claim and move on.</p><p>We&#8217;re going to say that we want to equalize the cost of training and the cost of inference. We can do all of it in general. The cost of pre-training, that&#8217;s the number of active params times the data on pre-training. There&#8217;s a factor of 6 out here, which is the number of FLOPs. There&#8217;s the famous <a href="https://www.adamcasson.com/posts/transformer-flops">6</a><em><a href="https://www.adamcasson.com/posts/transformer-flops">ND</a></em><a href="https://www.adamcasson.com/posts/transformer-flops"> formula</a>. Then in RL, we have approximately the same thing. We&#8217;ve got the same number of active parameters, but now the amount of data is the RL data. There is this extra efficiency multiplier, or inefficiency...</p><p><strong>Dwarkesh Patel</strong></p><p>Which is the fact that you&#8217;re not training on all your rollouts.</p><p><strong>Reiner Pope</strong></p><p>Well, there&#8217;s that, and then the other, perhaps even bigger inefficiency is that this involves a substantial amount of decode. Often decode runs at less MFU than training.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. 
So if you&#8217;re doing a backward pass on every single generation in RL, it would be 6ND.</p><p><strong>Reiner Pope</strong></p><p>So this could be a smaller number, right?</p><p><strong>Dwarkesh Patel</strong></p><p>It would at least be two, because that&#8217;s the lower...</p><p><strong>Reiner Pope</strong></p><p>Somewhere in the range of two to six. We&#8217;ll say somewhere in the range of two to six and leave it at that. Then we can add in the inference cost. The inference cost is two, the number of active parameters times the data in inference.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, I think the way I said it was super garbled. Just for the audience, forward plus backwards per parameter is 6. Forward alone is 2. That&#8217;s why RL, where you&#8217;re definitely going to generate all the trajectories but you might or might not train all the trajectories, is 2 to 6.</p><p><strong>Reiner Pope</strong></p><p>Yes. Thank you. And then inference is just 2. We&#8217;re going to solve for essentially equality of all three of these terms. That is the ballpark of where people are going to be.  Labs have more information on what is productive in doing more RL, for example, versus doing more pre-training. I don&#8217;t have that information, but I think a good ballpark is a 33% split between each of them.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m not sure I understand the intuition for that. Another naive model could have been that RL plus pre-training would be 50% and inference would be 50%.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s also a valid answer. Because this is heuristic, I can&#8217;t really argue for one versus the other. They don&#8217;t differ by that much. Thirty-three versus twenty-five is only a small factor off.</p><p>Let&#8217;s pick one of them. All equal seems simple enough, so we&#8217;re just going to solve for equality of them. It&#8217;s pretty straightforward. 
We can immediately see that the number of activated parameters totally disappears, so let&#8217;s factor that out. We&#8217;re going to just say that data in pre-training&#8212;I decided to do it your way, it&#8217;s a little bit nicer&#8212;plus... Oh, I didn&#8217;t have the inefficiency over here either. Data in pre-training plus some multiple of &#945; times the data in RL is going to end up equal to some &#946; times the data in inference.</p><p>Let&#8217;s just roughly size the &#945;. This &#945; is maybe somewhere in the range of 2 to 6. Over 6, from this term compared to this term. And then we&#8217;ve got an inefficiency term, which I would say is maybe in the range of 30%. So this alpha is going to be something like 1/10. And this &#946; here is actually the same. It&#8217;s a third. It&#8217;s one third times 30%. So it also equals 1/10.</p><p><strong>Dwarkesh Patel</strong></p><p>If both of them are one in ten, that kind of implies that there&#8217;s never a backward pass on RL?</p><p><strong>Reiner Pope</strong></p><p>Yeah. Okay, we can make this 2/10. Make it a bit bigger. Just write it out once more, this is 2/10, this is 1/10.</p><p>The number of inference tokens you have is just a function of hundreds of millions of tokens per second times my model is deployed for two months before I ship to the next version. That should determine the number of tokens in RL and pre-training.</p><p>I guess we didn&#8217;t do the equivalence between pre-training and RL, so we&#8217;ll do that here. Data in pre-training should be equal to 2/10 data in RL for them to be cost equivalent. Sorry, 1/10. I got it backwards. We pay more cost when it&#8217;s inefficient, so this needs to be 1/10. Tracing this back&#8230; This thing ends up actually being, as written here&#8230; This is like 1.5, and this is one.</p><p><strong>Dwarkesh Patel</strong></p><p>Billions of dollars worth of compute just flowed in the other direction.</p><p><strong>Reiner Pope</strong></p><p>Right? 
I think if you do it with a spreadsheet and actually model it out, you might notice when the money&#8217;s going down the drain. All of these end up being close in, as modeled here. This 30% may have been a little bit too generous. So let&#8217;s say something like 1.5 here, and leave this as a one here.</p><p>I think at this point, you can almost read it off. The number of inference tokens should be about the same as the number of pre-training tokens, which should be about the same as the number of RL tokens, within factors that we&#8217;re not able to reason about.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry for making a basic algebra mistake. It seems like there should be fewer RL tokens than pre-training tokens?</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s in general right. Because RL is less efficient in terms of machine time, if you&#8217;re trying to equalize the RL and pre-training time, then you should have fewer tokens in order to have the same wall time.</p><p><strong>Dwarkesh Patel</strong></p><p>This is all quite interesting. I never thought about it in terms of equalizing data.</p><p><strong>Reiner Pope</strong></p><p>I think starting with equalizing in cost is right, but depending on how you model the cost, this comes close to equalizing in data.</p><p><strong>Dwarkesh Patel</strong></p><p>So for GPT to be trained optimally, every single user who uses GPT-5, the total amount of tokens that they stream should equal the total amount that has gone into pre-training. And the total amount of tokens that have gone into pre-training is the sum of all human knowledge. Each model should generate the sum of human knowledge on the output that it gets on the input.</p><p><strong>Reiner Pope</strong></p><p>Yeah. Which way are people going to err? 
If you think that people&#8217;s power of prediction is not perfect, and also you run the risk that you make a model that is not a frontier model and then you just throw it away, then that changes the cost trade-off because there&#8217;s some probability that applies to the inference. And you should derate the inference tokens by some amount.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. Can we back out how much more compute than Chinchilla optimal is used for a given model size?</p><p><strong>Reiner Pope</strong></p><p>I think we just have to make some real-world assumptions here in order to do that.</p><p>The inference tokens, we should totally be able to count, right? Let&#8217;s say a few hundred million. Maybe it&#8217;s five hundred million tokens a second now, I don&#8217;t really know. Five hundred million tokens a second times&#8230; a model is deployed for two months before it becomes obsolete?</p><p>I can&#8217;t do this in my head. Can you type it into a computer?</p><p><strong>Dwarkesh Patel</strong></p><p>2.6 x 10<sup>15</sup>.</p><p><strong>Reiner Pope</strong></p><p>Okay. 2.6 x 10<sup>15</sup>. This number is probably too large because this is going to be multiple models in a family. Let&#8217;s make it 5x smaller or 10x smaller or something like that. So we&#8217;re estimating maybe fifty million tokens per second, per specific model. The model is live for two months. This comes out to around two hundred trillion tokens. And then we want to compare that to active parameters on a frontier model. I don&#8217;t actually know the latest rumors. Do you know?</p><p><strong>Dwarkesh Patel</strong></p><p>Somebody told me a hundred and fifty trillion.</p><p><strong>Reiner Pope</strong></p><p>Active parameters?</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, I meant tokens.</p><p><strong>Reiner Pope</strong></p><p>Trained on a hundred and fifty trillion tokens. 
Interesting.</p><p><strong>Dwarkesh Patel</strong></p><p>Which is similar.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s actually similar. So data on pre-training.</p><p><strong>Dwarkesh Patel</strong></p><p>This is not well-cited but it&#8217;s fine.</p><p><strong>Reiner Pope</strong></p><p>I think often the number of active parameters could be in the range of a hundred billion, something like that. Maybe a bit larger. So multiply by 20 to get the Chinchilla token count. So Chinchilla, <em>D<sub>Chinchilla</sub></em>, would be around two trillion tokens. We see we&#8217;re about a hundred times larger than that.</p><p><strong>Dwarkesh Patel</strong></p><p>What does <em>D<sub>Chinchilla</sub></em> actually mean?</p><p><strong>Reiner Pope</strong></p><p>The token count for pre-training that the Chinchilla scaling law would recommend, I guess.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, I see. So how much is it over-trained? Got it.</p><p><strong>Reiner Pope</strong></p><p>The ratio of this two hundred trillion or a hundred trillion tokens over the Chinchilla optimal of two trillion, that&#8217;s the amount it&#8217;s over-trained. Which is a factor of a hundred.</p><p><strong>Dwarkesh Patel</strong></p><p>A hundred. So if you consider this right here, to the extent this is in the right ballpark, just by thinking about how you want everything to be equal in terms of compute&#8230; If OpenAI also realizes that and they&#8217;re serving a certain number of tokens per second, that tells you how much data went into the pre-training of GPT-5. Even if it&#8217;s 50% off or something, it is wild that you can first-principles these kinds of numbers.</p><p><strong>Reiner Pope</strong></p><p>This is why you should just approximate everywhere, because there are big error bars on this. 
But it&#8217;s kind of empowering to just set <em>A</em> equal to <em>B</em> and figure it out.</p><h3>01:33:02 &#8211; Deducing long context memory costs from API pricing</h3><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s super cool. Okay, so in the spirit of trying to deduce things, we can publicly look up the API prices of these models, and maybe we can learn something from that.</p><p>First, with longer context, <a href="https://ai.google.dev/gemini-api/docs/pricing">Gemini 3.1</a> is 50% more expensive if you go over 200k tokens than if you&#8217;re below 200k tokens. At a high level, I understand why that might be, but why specifically 50%?</p><p><strong>Reiner Pope</strong></p><p>Why specifically 50%? The high level, even in the first place, is that there is some amount of increasing cost with context length. We can bring that back up. That was the memory time versus the compute time.</p><p>We&#8217;ve put up these same equations from before, of the time for memory fetches which is the weights and the KV cache, and then the time for the compute which is just the matrix multiplications for the weights. I will also draw the cost curve, but this time I&#8217;ll do it as a function of context length instead of batch size. So this is the cost curve as a function of context length. We&#8217;ll draw the compute. The cost of the compute is actually constant as a function of context length. There&#8217;s no dependence here on context length. In reality, there is some dependence, but it is very mild, so we&#8217;ll ignore it. So this is the time for the compute.</p><p>Then we&#8217;ll also draw the dependence of the memory fetch on context length. This starts at a large number for the weights and then grows gradually with the context length. Maybe starting here, and then grow gradually with context length. And so, you take the maximum and you see there is this inflection point here.</p><p>So this is the cost that Gemini might be paying. 
And then you think, how might you put a pricing structure on top of that? You would like to ensure that no matter what the context length is, you are still profitable. So we&#8217;ve got a two-tier pricing structure. Maybe we&#8217;ve got something that looks like this up to some extent.</p><p>I think it says something about, given that the bump is at 200k, it probably means that this is somewhat aligned with this crossover point. Maybe not exactly aligned with it. We can actually probably even complete that calculation just to see where it lands out.</p><p>We can solve for the number of bytes per token if we make some assumptions about the number of active parameters. So solving for the number of bytes per token, we&#8217;re going to assume the point where we equalize the time of memory and the time of compute is at, let&#8217;s say, 200k tokens. So we equalize these two.</p><p>We&#8217;re also going to assume that the batch size is large enough that the memory time spent on weights is negligible. So we&#8217;ll forget about this, and we&#8217;ll focus on the actual memory time spent on KV cache. That ends up saying, copying this term over, batch times length of context times bytes per token over memory bandwidth is going to be equal to the number of activated params over FLOPs. And then we&#8217;re going to solve for bytes per token. Batch size was missing here. It shows up here, and then it cancels out by the time we get to here. And I dropped the length of context.</p><p>So we can plug in numbers. This is the reciprocal of the number that we saw before. This is 1/300, which is reasonably stable across many different hardware platforms. We conjecturally said that maybe the number of activated parameters is a hundred billion. The length of the context we said was 200k. Something is wrong here, though. Length of the context should be on the denominator, not the numerator.</p><p><strong>Dwarkesh Patel</strong></p><p>1667. 
Almost two kilobytes.</p><p><strong>Reiner Pope</strong></p><p>That is plausible, actually. You said around two kilobytes. Let&#8217;s just do a sanity check for what this could be. There are two mechanisms by which people do attention with a small number of bytes per token. One is dense attention with a lot of reuse across layers. <a href="https://en.wikipedia.org/wiki/Character.ai">Character AI</a> has <a href="https://blog.character.ai/optimizing-ai-inference-at-character-ai-2/">a blog post talking about that, alternating long and short context</a>. In the Character AI kind of model, which also showed up in the <a href="https://en.wikipedia.org/wiki/Gemma_(language_model)">Gemma</a> models, the global context&#8212;which is really what we&#8217;re talking about here&#8212;was shared across all the layers.</p><p>To get this to two kilobytes, you could, for example, have a <em>d</em><sub>head</sub> of 128, which is typical. Then the number of bytes is typically the number of attention layers times two times <em>d</em><sub>head</sub> times the number of KV heads. This is the number of unique contexts per layer.</p><p>Do you share the context across many layers, or do you use it only once? In the Character AI-like models, this number is one. We said this is 128. This is a choice which typically ranges from one... Sorry, this is KV heads, I meant.</p><p><strong>Dwarkesh Patel</strong></p><p>The difference between a head and a KV head is that&#8230;?</p><p><strong>Reiner Pope</strong></p><p>The KV heads are the heads that are stored in memory; they store the contents of the previous tokens. The Q heads are the retrieval heads. They&#8217;re only used temporarily and they&#8217;re used by the attending token. In this autoregressive context, I&#8217;ve got KV heads associated with all of the contexts, and then Q heads associated with this new token here.</p><p><strong>Dwarkesh Patel</strong></p><p>But this head, the 128.</p><p><strong>Reiner Pope</strong></p><p>Oh, sorry. 
This d-head is the dimension of the vector. The number of KV heads is typically in the range of 1 to 8. It is totally plausible to get this by, for example, having 8 KV heads and a d-head of 128. That gives you exactly this number. Or you could have fewer KV heads, but more layers.</p><p>This is one way to get there via dense attention. There&#8217;s also a way to get there via sparse attention, where you increase all of these numbers, but then you have a 1/sparsity term. I think this number is plausible, if maybe a little bit small.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s funny that they would leak so much information through their API pricing.</p><p><strong>Reiner Pope</strong></p><p>I mean, you are incentivized to price close to your costs because otherwise someone could scoop you.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe we can learn something about the difference in input versus output prices, and what that tells us about decode versus prefill in these models. I think last I checked it&#8217;s 50% more expensive or something like that?</p><p><strong>Reiner Pope</strong></p><p>I don&#8217;t remember. What I&#8217;ve seen in the past is 3-5x more expensive.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, that makes more sense. So let&#8217;s say it&#8217;s 5x more expensive. This is the compute to process the next token in decode. Suppose you&#8217;re doing prefill, where you&#8217;re not just processing the most recent token, you&#8217;re processing all the tokens in parallel. I want to say that it would be this times length prefill?</p><p><strong>Reiner Pope</strong></p><p>Or length of the pass in general. If we can think of decode as being a pass with one, and then prefill being a pass with many.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. So maybe prefix? Okay, memory. 
You&#8217;re not storing the KV cache for the tokens that are the prefill tokens.</p><p><strong>Reiner Pope</strong></p><p>Let&#8217;s actually draw how prefill shows up here, if I may clarify. We do a bit of decode like this. We may actually come back and do more prefill. If you think this is a chat session, the user says something, the AI generates a response, and then the user says something else and we prefill this. Maybe this is the general case, rather than this.</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, this is like you read a file or something.</p><p><strong>Reiner Pope</strong></p><p>Read a file or the AI is responding to a user input, tool call, or anything that&#8217;s not AI-generated.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, suppose we&#8217;re here. You will have calculated all of this previously. So just the KV of everything that came before. But what is the memory cost of this? Well, the memory bandwidth cost of this. If you&#8217;re doing <a href="https://modal.com/blog/flash-attention-article">flash attention</a>, it would&#8212;</p><p><strong>Reiner Pope</strong></p><p>It&#8217;s basically temporary. It doesn&#8217;t even go to main memory. Just ignore that.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. So then it would just be everything that came before. Is it not just that then?</p><p><strong>Reiner Pope</strong></p><p>There&#8217;s actually no adjustment at all to the memory time.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. Great. So it&#8217;s a very trivial change to accommodate. This term is making it 5x more expensive. Now, why would that be? What does that actually tell us? What variable does this help us clamp? The only thing that could have changed is that the compute is 5x more expensive as a result.</p><p><strong>Reiner Pope</strong></p><p>This is the time for one pass, but actually the amount of tokens is that much larger. 
We want the cost per token, in fact, or the time per token.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m not sure I understood. This is for processing the next token in prefix?</p><p><strong>Reiner Pope</strong></p><p>Well, actually for processing the entire batch. At this cost, we have processed this many tokens, the length of prefill. Or I guess the length of the pass. Not this prefix, but it&#8217;s this cost.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. Let&#8217;s just do this pass. So this is 5x more expensive. Input is 5x more expensive.</p><p><strong>Reiner Pope</strong></p><p>Output is more expensive, in fact.</p><p><strong>Dwarkesh Patel</strong></p><p>Output is 5x more expensive.</p><p><strong>Reiner Pope</strong></p><p>The result we want to work towards is that prefill is compute-limited and decode is memory bandwidth-limited.</p><p><strong>Dwarkesh Patel</strong></p><p>Why don&#8217;t we do this? Why don&#8217;t we just chart it with len-pass on the X-axis and <em>t</em> on the Y-axis.</p><p><strong>Reiner Pope</strong></p><p>We want the cost per token, so it&#8217;ll be <em>t</em> over length of the pass. That&#8217;ll be right.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess I&#8217;m getting confused by this. Len-pass is... It seems like this should be higher when you&#8217;re doing prefill.</p><p><strong>Reiner Pope</strong></p><p>Prefill has a bigger length pass. Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>But then why is it cheaper?</p><p><strong>Reiner Pope</strong></p><p>Why is the cost higher? It&#8217;s this division by length pass. This is going to divide out, but then all of this is going to divide by length of pass, and it&#8217;s going to make the memory costs cheaper.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. Let me think about this then. Basically we&#8217;ll have four different lines. Let&#8217;s do prefill first... 
Actually, let&#8217;s do decode first.</p><p><strong>Reiner Pope</strong></p><p>Length of the pass, when it&#8217;s one, that is decode. When it is bigger, that is prefill.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, okay. I see. That makes sense. Getting back to it. So <em>t</em><sub>compute</sub>, if you have basically just this divided by len-pass, so just this amount. This actually does not vary based on <em>t</em>, so it&#8217;ll just be some flat value like this. And this is <em>t</em><sub>compute</sub>. And this is&#8212;</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s decode.</p><p><strong>Dwarkesh Patel</strong></p><p>Decode. Right. Now <em>t</em><sub>mem</sub>, we have this whole thing divided by len-pass. Well, it doesn&#8217;t really matter what&#8217;s up there, it&#8217;ll just be something that looks like this. Let&#8217;s say this is <em>t</em><sub>mem</sub>. This is decode again.</p><p>So as the length of the prefix goes up, or pass, your memory bandwidth time declines, and that means that to the extent that you were bottlenecked on memory bandwidth before, you can avoid being bottlenecked on memory bandwidth.</p><p>The fact that they are charging 5x less for prefill than decode does suggest that they are bottlenecked on memory bandwidth to quite a degree, such that for them at least&#8212;because <em>t</em> is equivalent to cost, it&#8217;s the cost of renting a compute&#8212;this would be at 1, and this would be at 5.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>So it is, in fact, tremendously memory bandwidth bottlenecked. The real graph looks something like that.</p><p><strong>Reiner Pope</strong></p><p>It still crosses, but yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. Let me do it this way. This is the gap on decode between the memory and the compute time. Okay, interesting.</p><p>Another interesting one would be why cache hits are so much cheaper. 
If I remember correctly, cache hits are like 10x cheaper. It&#8217;s more expensive to write to the cache according to the pricing on all these models, but if you do hit the cache, it&#8217;s 10x cheaper. Presumably, this is the cost of keeping something in HBM rather than just evicting it. But if you do keep it in HBM, then it&#8217;s cheaper to load again?</p><p><strong>Reiner Pope</strong></p><p>Right. There are two ways you can produce the KV cache for a token. You can just produce it from scratch by computing it from the underlying token IDs, which are tiny. Or you can have produced it previously and stored it in memory somewhere.</p><p>The cost ratio is really talking about the ratio between those two mechanisms of producing it. A cache miss means you&#8217;ve deleted it from all your memories, and you have to recompute it from the tokens directly. You can even take that a step further and think about which memory tier you store it in. You could store it in HBM. There are other slower and cheaper memories than HBM, like <a href="https://en.wikipedia.org/wiki/DDR_SDRAM">DDR</a> on your host or <a href="https://en.wikipedia.org/wiki/Flash_memory">flash</a> as well. One of the things you can do is a calculation of where it makes sense to be in each memory tier, and this is related to how long you&#8217;re going to store it for.</p><p>We want to look at the cost of storage in a few different memory tiers and also the cost of rematerialization. Remat means the cost to rebuild all of the KV cache from scratch after you deleted it, so we rematerialize it. Basically, this is going to cost the length of the context. Actually, we&#8217;ll look at the cost per token, so we don&#8217;t need to carry around this length of context everywhere.</p><p>To rematerialize one token of KV cache, I just need to run a forward pass on the whole model. This is going to be the compute time. 
I have to rerun the compute at whatever speed my GPU does it, and then I multiply it by my GPU dollars per second.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, excuse a naive question. Why is there not a quadratic term?</p><p><strong>Reiner Pope</strong></p><p>There is a quadratic term. It shows up in the compute. As an approximation, I chose to remove it. I&#8217;ll just show you quickly what that looks like. If you look at the cost per token, or the number of FLOPs per token, there are the FLOPs that are coming from doing the weight matrix multiplies as a function of&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>Which is flat.</p><p><strong>Reiner Pope</strong></p><p>...context length. And then there is the number of multiplies that comes from doing the KV cache, which goes up linearly with the amount of stuff you attend to. The slope on this is so low that when you draw it like this, it&#8217;s very well approximated by a flat line. You start to notice the effect of the quadratic or the linear term up in the millions of tokens or so. So it&#8217;s just not super relevant.</p><p><strong>Dwarkesh Patel</strong></p><p>So what is the reason that there&#8217;s no company which has over a million token context length, if this is true?</p><p><strong>Reiner Pope</strong></p><p>There are two costs of long context. One is the memory bandwidth cost, which we&#8217;ve spent a lot of time analyzing. That&#8217;s this thing. The other one is the compute cost. The compute cost is almost always forced by fundamental principles to be a much smaller slope than the memory bandwidth cost. 
The primary things that limit you to really large contexts are memory bandwidth and memory capacity, which is exactly this effect.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s this idea that <a href="https://www.dwarkesh.com/p/dario-amodei-2">Dario said on the podcast</a>, and others have said, which is, &#8220;We don&#8217;t need <a href="https://www.ibm.com/think/topics/continual-learning">continual learning</a> for AGI, <a href="https://www.lakera.ai/blog/what-is-in-context-learning">in-context learning</a> is enough.&#8221; If you believe that, then you have to think that we have to get to a hundred-million-token context length to have an employee that is the equivalent of working with you for a month. Now, maybe that&#8217;s no longer true with sparse attention or something. But if you think that, then something in ML infra, like the memory bandwidth, would have to change to allow for hundred-million-token context lengths.</p><p><strong>Reiner Pope</strong></p><p>Sparse attention gives you a get-out for sure, because you get this square root. It gives you a big improvement. But if you look at the history of context lengths of models, from earlier models like GPT-3, maybe to GPT-4&#8212;I don&#8217;t remember when the transition happened exactly&#8212;they shot up from about 8K to 100-200K. And then for the last year or two, they&#8217;ve all been hovering around there. I think that indicates that this is the reasonably balanced cost point, and going massively beyond that would be cost-prohibitive.</p><p><strong>Dwarkesh Patel</strong></p><p>Not because of the compute cost, because of the memory bandwidth...</p><p><strong>Reiner Pope</strong></p><p>Because of memory bandwidth cost, yeah. I actually don&#8217;t see a very good path to solving that. The HBM is where it is. 
It&#8217;s not getting hugely better.</p><p><strong>Dwarkesh Patel</strong></p><p>And why doesn&#8217;t sparse attention solve it?</p><p><strong>Reiner Pope</strong></p><p>Sparse attention is a big improvement. Maybe that is priced in already, perhaps. It&#8217;s not an infinite improvement because if you go too sparse, you lose too much quality.</p><p>The empirical result is that the context lengths haven&#8217;t been increasing that much. I think it&#8217;s because there is no solution to the memory wall here. Going too sparse just means you&#8217;re attending to a very small subset of the tokens, and the quality will get worse.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes sense.</p><p><strong>Reiner Pope</strong></p><p>What is the cost of these different ways of resynthesizing the KV cache? Computing it from scratch is based on my GPU time. I have to do a certain amount of multiplies, of GPU time that I spend in order to produce it.</p><p>Storing in HBM. This really goes as my bytes per token. I need to just have some number of bytes per token, and then I need to store this in the HBM. It&#8217;s going to use up some of my HBM capacity. A way to think of this is that if I have too many of these things sitting in my HBM, if I fill up my HBM with just KV caches that I&#8217;m not using, I can&#8217;t use that GPU.</p><p>How do I price that? Maybe I say that the cost of it is proportional to the fraction of the HBM I&#8217;m using. There&#8217;s also times GPU dollars. Let&#8217;s just do one more memory tier and say store in DDR instead. The same kind of thing goes up for flash and for DDR.</p><p>I put these in the wrong columns. I meant to make two columns. The distinction I want to make is that there is the cost to retrieve, and then there&#8217;s a cost to hold on. This is a cost per second, whereas this is an instantaneous cost. Rematerialization has a cost to retrieve and has zero cost to store it because we&#8217;ve deleted it. 
This is the one that I put in the wrong location. This is actually the cost just to hold on, so I will rewrite it.</p><p>If we&#8217;re just storing it in HBM, it has this sort of cost profile. If we store in DDR, it&#8217;s actually going to take some time. We get the same thing here: bytes per token over DDR capacity times DDR cost per second. But now this has a cost to retrieve that is higher than the HBM because we need to copy it into the HBM. So this is bytes per token over DDR bandwidth. And then this consumes some amount of the DDR as well.</p><p><strong>Dwarkesh Patel</strong></p><p>And every scale-up has DDR and flash?</p><p><strong>Reiner Pope</strong></p><p>This is really a deployment question, so you can choose that. Nvidia does deploy in this form. It has both.</p><p><strong>Dwarkesh Patel</strong></p><p>Why isn&#8217;t the cost to retrieve HBM the bytes divided by memory bandwidth?</p><p><strong>Reiner Pope</strong></p><p>It depends what you define a retrieve to be. Here, I&#8217;m defining retrieve to be, move it into HBM so that you can start actually doing inference on it.</p><p><strong>Dwarkesh Patel</strong></p><p>Because if it&#8217;s already in HBM, you can be doing compute while you&#8217;re getting it from HBM to SRAM? Interesting.</p><p><strong>Reiner Pope</strong></p><p>Yeah, for example. These are three things, and I guess I ordered them wrong. In general, if you&#8217;re balancing two costs and you&#8217;ve got different tiers in the memory hierarchy, you should expect as this cost goes up, this cost should go down. You can kind of see where the zeros are. I should have ordered them this one first, this one second, and this one third.</p><p>If you&#8217;re going to hold onto it for a very short amount of time, then all of this is multiplied by the hold time. This one is, and so is this one.</p><p><strong>Dwarkesh Patel</strong></p><p>Interestingly, they have different prices to write for. 
Do you specify this in the API for five minutes versus an hour? Which suggests that the five minutes is HBM and the hour is DDR.</p><p><strong>Reiner Pope</strong></p><p>I think that&#8217;s a pretty good assumption. If you look at the numbers, it might also turn out that it&#8217;s one tier down, and it&#8217;s DDR versus flash.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. I&#8217;ll look up the price difference. The base input tokens is $5 per million tokens.</p><p><strong>Reiner Pope</strong></p><p>Base, which means remat. This is $5.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s $5 to &#8220;retrieve&#8221;. And then to write, presumably HBM, for five minutes is 6.25.</p><p><strong>Reiner Pope</strong></p><p>We might be able to determine which memory tier it is by the durations.</p><p><strong>Dwarkesh Patel</strong></p><p>Five minutes versus one hour.</p><p><strong>Reiner Pope</strong></p><p>Exactly. I think this will probably end up being the drain time of the memory tier that you&#8217;re in. What that means is, given that I know I&#8217;m going to be holding something for five minutes, I would like to pick a memory that I can read every five minutes. I can read the whole memory once per five minutes, ballpark. That is the drain time of the memory. So if I take the storage capacity over storage bandwidth, I would like this to be equal to five minutes.</p><p>We did this calculation for HBM. For HBM, we know that this number is 20 milliseconds. So HBM is much too small. DDR could be about an order of magnitude or two off from this, so this is probably on the order of seconds, like 1 to 10 seconds. I don&#8217;t have these numbers memorized, but generally, as you go to slower tiers, flash is plausibly on the order of one minute. And then spinning disk, which is massively different, is on the order of one hour. 
So this might actually identify the tiers of flash and <a href="https://en.wikipedia.org/wiki/Disk_storage">spinning disk</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, why is this the calculation? This is the storage capacity divided by the bandwidth?</p><p><strong>Reiner Pope</strong></p><p>You&#8217;ve got a bunch of different memory tiers, we&#8217;ve listed four of them. Your choice of which memory tier is about minimizing the cost. What fraction of the device are you using? You&#8217;re using some fraction of the device for holding onto it, and then you&#8217;re using some fraction of the device to retrieve it. Let&#8217;s say I&#8217;m using 10% of the device. And I want to equalize those two fractions. That&#8217;s a sign that I&#8217;ve hit the right thing.</p><p>Let&#8217;s say I&#8217;ve got some runtime here. I&#8217;m going to hold on for all of this time, so this is the time-hold. And then there&#8217;s going to be some amount of time here, which is time-retrieve. Basically to equalize these two costs, I want the retrieval time to be equal to the hold time times the fraction of capacity. Because this is the retrieval time, this is how many other things I can hold simultaneously.</p><p><strong>Dwarkesh Patel</strong></p><p>Basically, you want to store things in there for so long such that the amount of time it&#8217;s in there is the time to get all your things in there and out.</p><p><strong>Reiner Pope</strong></p><p>Yeah basically. I think that probably indicates that the two tiers are flash and spinning disk. I&#8217;m kind of shocked to see spinning disk being used at all, because it&#8217;s such an old technology.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. 
It&#8217;s also crazy that it&#8217;s so slow that it takes an hour to load its full capacity.</p><p><strong>Reiner Pope</strong></p><p>It&#8217;s a really unattractive technology but it&#8217;s useful in some places.</p><h3>02:04:02 &#8211; Convergent evolution between neural nets and cryptography</h3><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;re sitting down because I want to ask you some questions that don&#8217;t need a blackboard. You have this <a href="https://reiner.org/neural-net-ciphers">extremely interesting blog post</a> where you talk about how, at a high level, the architecture of different cryptographic protocols looks a lot like neural networks. There&#8217;s this convergent evolution where they both need to jumble information across all their inputs. For cryptographic protocols, it&#8217;s to make sure that each new input into a <a href="https://en.wikipedia.org/wiki/Hash_function">hash function</a> will totally scramble what happens. For neural networks, of course, they need to consider how this piece of information changes what you should make of this other piece of information.</p><p>I thought that was an extremely interesting point. At a high level, in some sense they&#8217;re trying to do the inverse thing. Cryptographic protocols are trying to take information which has structure and make it look indistinguishable from randomness. Neural networks are trying to take things which look random&#8212;protein sequences, DNA, garbled text&#8212;and extract higher-level structure from it. They have similar high-level mechanisms, but they&#8217;re actually trying to do the opposite things. I wonder what you make of that.</p><p><strong>Reiner Pope</strong></p><p>I try to look for other examples where mixing and scrambling shows up as well. There&#8217;s almost a physical example where you&#8217;re making a cake and you want to stir the batter. 
Literally the idea to first stir it this way and then stir it this way is not too bad of an approach.</p><p>Beyond that, back to the digital world, there are some differences, and the one you call out is a pretty strong difference. The way it shows up, if you just randomly initialize a neural network, maybe it&#8217;s a reasonable cipher as well because the random initialization is going to jumble stuff in a complicated way. It may even do what you want. Who knows?</p><p>The thing that makes it interpretable is the gradient descent. You can differentiate a neural network and get a meaningful derivative. We do a lot of work to not overcomplicate the derivative, so the residual connection keeps it contained and simple. And so does the <a href="https://docs.pytorch.org/docs/stable/generated/torch.nn.LayerNorm.html">LayerNorm</a> stuff that we do.</p><p>One of the biggest attacks against cryptographic ciphers is also to differentiate the cipher. Ciphers run in a different number field. They run in the field of two elements, so just binary, whereas neural nets run, in theory, in the field of real numbers. You have to differentiate with respect to binary numbers, but you can absolutely differentiate a cipher. This is called <a href="https://en.wikipedia.org/wiki/Differential_cryptanalysis">differential cryptanalysis</a>.</p><p>Basically, what it says is that if you take a small difference of the input, it&#8217;s quite difficult to make the difference of the output be small. The whole job of a well-designed cipher is to make the difference in output very large. The distinction is that the optimization goals at that point are about complexifying. They don&#8217;t have the same residual connections, like LayerNorms.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess a place where the two merge is backdoors. With a backdoor in an <a href="https://en.wikipedia.org/wiki/Large_language_model">LLM</a>, you&#8217;re trying to hide&#8230; Would you consider it an input? 
It&#8217;s not an input into the forward pass but it&#8217;s an input into the backward pass. You&#8217;re trying to hide an input into the backward pass.</p><p><strong>Reiner Pope</strong></p><p>This is an adversarial context? This is actually a place where you get exactly the <a href="https://en.wikipedia.org/wiki/Avalanche_effect">avalanche property</a> that ciphers have as well. Adversarial attacks on image classification models are about finding a very small perturbation of the image that totally changes the classification, totally changes the output. That is the common case in ciphers, whereas that&#8217;s the undesired case in neural nets.</p><p><strong>Dwarkesh Patel</strong></p><p>Ok, so I was asking you, have neural networks actually been used for cryptography? And we realized it may be better to just do this on the blackboard. Are they actually being used for cryptography?</p><p><strong>Reiner Pope</strong></p><p>Using neural nets for cryptography&#8230; In general, creating a new cipher is a very dangerous proposition. Almost all of them are broken. 99% of them are broken, so it&#8217;s probably a bad place to start. But the other direction has been, in at least one very clear case, quite productive.</p><p>There&#8217;s a construction that exists in ciphers and then was imported into neural nets called a <a href="https://en.wikipedia.org/wiki/Feistel_cipher#:~:text=In%20cryptography%2C%20a%20Feistel%20cipher,known%20as%20a%20Feistel%20network.">Feistel cipher</a>, or Feistel network. The idea is that you may have some function <em>f</em> which is not invertible, but you like the function because it does interesting things, like it does an MLP, for example. Or it mixes it in an interesting way.</p><p>You&#8217;d like to build something out of this that is invertible. The construction we&#8217;re going to make is going to be a two-input function rather than a one-input function. We&#8217;re going to apply <em>f</em>(<em>x</em>). 
We need to actually remember what <em>x</em> was, so we&#8217;re going to stick <em>x</em> over here so that we can work backwards, and then we also can&#8217;t drop <em>y</em>. We&#8217;re going to remember <em>y</em>, and we&#8217;re going to add them together to form this tuple.</p><p>The way to invert this, if you think I have this output and I want to recover <em>x</em> and <em>y</em>, I can easily recover <em>x</em>. That&#8217;s right there, I just read it off. To recover <em>y</em>, if this thing was called <em>z</em>, I can recover <em>y</em> by <em>z</em> minus <em>f</em>(<em>x</em>), because I&#8217;ve already recovered <em>x</em>. That means this construction is invertible.</p><p>This was used in ciphers a ton and still is used. It&#8217;s one of the main mechanisms of constructing ciphers. Often you want ciphers to be invertible, especially the layers of ciphers, because that has better cryptographic properties.</p><p>This has actually been ported over into neural nets. There&#8217;s a 2017 paper called <a href="https://arxiv.org/abs/1707.04585">RevNets</a>, reversible networks. What it does is make the entire network invertible. You can apply it to any network, like a transformer network. I do a forwards pass, but then I can run the entire pass backwards as well. The whole neural network is invertible with exactly this construction.</p><p>This paper applied it to some layer, like a transformer layer, for example. We&#8217;ve got this function <em>f</em>, which is our transformer layer. Normally we would have just an input and then a residual connection coming out, and it gets added over here. Now, the variation of this is going to be we&#8217;ve got two inputs, <em>x</em> and <em>y</em>. <em>x</em> goes through the function, gets added to <em>y</em>, and then this becomes the new <em>x</em>, output <em>x</em>. 
Then this <em>x</em> becomes the output <em>y</em>.</p><p>Really what this is doing, if you think of two layers back, is the thing you mentioned before. It&#8217;s doing the residual connection from two layers back. This <em>y</em> came from the previous layer and was the residual connection there. Because of this construction, the whole thing is invertible.</p><p>Why do I care? What does invertible matter for? The big thing that it can be interesting for is training. If I think of a forward pass of training&#8230; Let&#8217;s say I have four layers and I run them in zero, one, two, three order. I have to write all of the activations to HBM. I get an HBM footprint here that is kind of linear in the number of layers.</p><p>This can actually be the largest memory footprint during training. This is normal training, and then I run the backwards pass and read it in reverse. The forward pass goes forward, and the backward pass goes backwards. I have to read them back out.</p><p>The idea of this RevNets paper is that because it&#8217;s invertible, I don&#8217;t need to store this at all. I can completely rematerialize it. I run my forwards pass, and then when I&#8217;m running my backwards pass, I&#8217;m simultaneously in lockstep undoing all of the forwards pass steps that I did in order to have the activations that I need here. This ends up being memory saving, which is a nice idea.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. In some sense you&#8217;re spending more compute to save memory.</p><p><strong>Reiner Pope</strong></p><p>That&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. It&#8217;s the opposite of what you&#8217;re doing with the KV cache. With the KV cache, you&#8217;re spending more memory to save compute.</p><p><strong>Reiner Pope</strong></p><p>Yeah. Spending more memory to save compute is generally profitable given where hardware is.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. That was super fun. 
Reiner, thank you so much for doing it. I feel like it really vindicated the vision behind the studio and the blackboard.</p><p><strong>Reiner Pope</strong></p><p>Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>Cool, thanks so much for doing it.</p><p><strong>Reiner Pope</strong></p><p>Thanks.</p>]]></content:encoded></item><item><title><![CDATA[More open questions about AI]]></title><description><![CDATA[Hodge podge of things I was thinking about this weekend.]]></description><link>https://www.dwarkesh.com/p/what-ive-been-thinking-april-27</link><guid isPermaLink="false">https://www.dwarkesh.com/p/what-ive-been-thinking-april-27</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Mon, 27 Apr 2026 13:51:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QEPJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F90fa9666-5b8b-4685-a8fb-4b64cb7e0333_1080x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>More open questions</h2><p>I put out <a href="https://www.dwarkesh.com/p/blog-prize">a blog prize</a> to answer a couple of big questions I have about AI. The goal is really to find someone to hire as a coresearcher. I have more questions of this variety, but I omitted them from that post&#8217;s list, because they don&#8217;t make it easy to judge submission quality. So I thought I&#8217;d post them here:</p><ul><li><p>5 hyperscalers own 70+% of global AI compute, and much of that is actually reserved for the 3-member set of OpenAI/Ant/GDM. How worried should we be that AI use cases which are not building up to the singularity and the robot factories (aka normal people being more empowered, understanding the world better, being entertained, etc.) are not the highest ROI activity for compute in the world? 
And given how valuable compute will be (whose opportunity cost increases in tandem with the quality of the AI models that run on it), will normal people basically get priced out of the benefits of AI? If we should be worried about this, how concretely should some kind of universal basic income/compute redistribution work? If not worried, what is the frame of this question missing?</p></li><li><p>Data is arguably the main way that AI models have been getting better over the last few years. But I remain confused about what concretely these improvements have consisted of. To ask some sharper questions:</p><ul><li><p>Clearly Anthropic (and now also OpenAI and GDM) have cracked <em>something</em> about making competent long horizon coding agents. What is it? Is it just stacking up more and more RL coding environments? Or is there something more particular behind this breakthrough?</p></li><li><p>Are models even getting more sample efficient (aka they learn more from each training sample) or have we just changed/expanded/improved the data input? The reason this question is important is because it tells us how fast deep learning progress will be in domains that actually do require sample efficiency (for example, robotics).</p><ul><li><p>Models are very sample efficient in context, and the information in context can be used much more flexibly. But the attention &#8220;fast&#8221; weights consume a huge amount of memory in order to accommodate this faster learning. Why is there this memory/sample efficiency tradeoff?</p><ul><li><p>If you look at the size of the KV cache for Llama 3 70B, it&#8217;s 320 KB / token. If you just divide the number of bits it takes to store Llama 3 weights by the number of tokens it was pre-trained on, then you get 0.075 bits / token. 
So there&#8217;s a 35 million fold difference in the amount of information per bit you&#8217;re storing.</p></li></ul></li></ul></li></ul></li><li><p>Let&#8217;s put frontier lab compute into 3 buckets: pretraining, RL generation, and inference. RL generation and inference look like very similar workloads. The big difference, of course, is that the model learns as a result of RL generation, but it doesn&#8217;t (at least currently) from inference. At the same time, the model actually does useful work during inference, but not during RL generation. Many people have pointed out it&#8217;s really weird that there&#8217;s a distinction between training and inference, and that in the limit it shouldn&#8217;t exist. How practically will these two workloads be merged? At a high level, one can imagine hiring an AI instance for a month-long work trial, getting it to do actual useful work for you during that time, and then sending a report card back to the model company. In fact, in a few years, maybe the only way that AI can continue to make progress is through this kind of on-the-job learning, because models will already have saturated anything that can be learned from contrived shorter-horizon RL environments.</p></li><li><p>Does something Y2Key happen when most of the tokens on the internet (and presumably the ones future models will be trained on) are generated by other AIs? Has the relative value of pre-2023 internet datasets increased in any noticeable way?</p></li><li><p>I wrote this in my <a href="http://dwarkesh.com/p/timelines-june-2025">continual learning blog post</a> last June. Is this correct? Why might there not be a winner take all dynamic from continual learning?</p><ul><li><p>&#8220;Even if there isn&#8217;t a software only singularity (with models rapidly building smarter and smarter successor systems), we might still see something that looks like a broadly deployed intelligence explosion. 
AIs will be getting broadly deployed through the economy, doing different jobs and learning while doing them in the way humans can. But unlike humans, these models can amalgamate their learnings across all their copies. So one AI is basically learning how to do every single job in the world. An AI that is capable of online learning might functionally become a superintelligence quite rapidly without any further algorithmic progress&#8221;</p></li></ul></li><li><p>A lot of economic analysis about the impact of AGI focuses on human demand - <a href="https://www.citriniresearch.com/p/2028gic">will the economy shrink because our demands can be fulfilled much more cheaply</a>, will it rise because <a href="https://aleximas.substack.com/p/what-will-be-scarce/comment/243687560">AI will create new varieties of products</a>, or maybe because <a href="https://aleximas.substack.com/p/what-will-be-scarce">the relational sector will grow</a>? But all of these analyses take as a given that the only demand that matters is the one originating from humans. How do we model the machine-only economy, where the demand originates from the AIs themselves? 
And once we add this consideration to our economic analysis of the future, what changes?</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Blog prize for the big questions about AI]]></title><description><![CDATA[The not-so-secret point of this whole contest is so that I can hire a researcher]]></description><link>https://www.dwarkesh.com/p/blog-prize</link><guid isPermaLink="false">https://www.dwarkesh.com/p/blog-prize</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 24 Apr 2026 16:37:49 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a7d14f96-c3aa-4305-bdc2-27c509bdbedc_1400x923.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>There has never been a time when excellent intellectual output on the right question has been more valuable or more urgent. Compelling answers can inform the most important economic and foreign policy decisions that will ever be made, the deployment of (at least) <a href="https://openai.com/index/scaling-ai-for-everyone/">hundreds of billions</a> of philanthropic dollars, and the training and governance of superintelligences.</p><p>I&#8217;m announcing a $20,000 blog prize in order to find people who will excel at researching and thinking through these problems. The not-so-secret point of this whole contest is so that I can hire a research collaborator to think through questions like this hand in hand with me. See more at the end.</p><p>Pick a question below, and spend no more than 1,000 words answering it. 1st, 2nd, and 3rd place will get $10,000, $6,000, and $4,000 respectively. I&#8217;ll publish the winning entry (and potentially the runners-up) on my blog. Please submit by May 10th, 11:59 PM PST.</p><h3>Questions - choose one</h3><ul><li><p>A couple years ago, there was this idea that AI progress might slow down as we make further progress into the RL regime. 1. 
Because as horizon lengths increase, the AI needs to do many days&#8217; worth of work before we can even see if it did it right, so if we&#8217;re still in a naive policy gradient world, the reward signal / FLOP goes down, and 2. We&#8217;d crossed through many OOMs of RL compute from GPT 4 to o1 to o3, and it would not be feasible to replicate that many OOMs increase in compute immediately again. But AI progress seems to have been fast nonetheless - even potentially speeding up if rumors about Spud or Mythos are to be believed. What gives? What did that previous intuition pump that motivated longer timelines miss? Feel free to deny premise of question.</p></li><li><p>What&#8217;s the most plausible story where foundation model companies actually start making money? If you consider each individual model as a company, then its profits <a href="https://epoch.ai/gradient-updates/can-ai-companies-become-profitable">may</a> be able to pay back the training cost. But of course, if you don&#8217;t train a bigger, more expensive model immediately, then you stop making money after 3 months. So when does the profit start? Maybe at some point <a href="https://www.dwarkesh.com/i/187852154/005849-how-will-ai-labs-actually-make-profit">scaling will plateau</a>, but <a href="https://x.com/MatthewJBar/status/2046060153678844290">if progress at the frontier</a> has slowed down, then the combination of distillation and low switching costs (cloud margins result from high switching costs) makes it really easy for open source to catch up to the labs, eating into their margins. So how do the labs actually start making money?</p></li><li><p>With OpenAI&#8217;s new raise at an $852B valuation, OpenAI Foundation&#8217;s stake is <a href="https://openai.com/index/scaling-ai-for-everyone/">now worth $180B</a>. 
Anthropic&#8217;s <a href="https://fortune.com/2026/01/27/anthropic-billionaire-cofounders-ceo-dario-amodei-giving-away-80-percent-of-wealth-fighting-inequality-ai-revolution/">cofounders have pledged to donate 80%</a> of their wealth. Nobody seems to have a concrete idea of how to deploy 100s of billions (soon trillions) of wealth productively to &#8220;make AI go well&#8221;. If you were in charge of the OpenAI Foundation right now, what exactly would you do? And when? It&#8217;s not enough to identify a cause you think is important, because that doesn&#8217;t answer the fundamental problem of <a href="https://nanransohoff.substack.com/p/there-should-be-general-managers">how you convert money to impact</a>. Identify the concrete strategy you recommend pursuing.</p></li><li><p>What should countries which are not currently in the AI production chain (semis, energy, frontier models, robotics) do in order to not get totally sidestepped by transformative AI? If you&#8217;re the leader of India or Nigeria, what do you do right now?</p></li></ul><h3>Rules and tips</h3><ul><li><p>Please don&#8217;t let a lack of domain expertise dissuade you from entering. I&#8217;m looking for someone who can ramp up fast on unfamiliar topics and think clearly.</p></li><li><p>Each entrant may submit only once.</p></li><li><p>You are still eligible for this essay competition even if you&#8217;re not interested in the researcher role. Nor does winning this competition guarantee that you will be offered the role.</p></li><li><p>You&#8217;re welcome to use LLMs to help you research, but I specifically picked these questions because I&#8217;ve found LLM answers to them unsatisfying. On these kinds of ambiguous questions, LLMs are too all over the place. For example, they&#8217;ll identify 5 plausible answers but not have the context and taste to identify the crucial factor and iron out its implications.</p></li><li><p>You only have 1000 words - make them count. 
People have the habit of <a href="https://x.com/dwarkesh_sp/status/1968012981016608934">spending the first paragraphs clearing their throat</a> - avoid that.</p></li></ul><h3>Why am I hiring for a researcher?</h3><p>I want my podcast/blog to move from just asking questions about AI to actually helping answer them. But there are too many important questions, and I need a collaborator to build up context on them all, to explore dozens of fractal sub-questions, to consider the rebuttals and syntheses, and to sharpen each other&#8217;s thinking.</p><p>The questions I want us to explore are very broad while at the same time requiring deep technical analysis across many domains to actually answer.</p><h3>Why am I hiring this way?</h3><p>Well, I could just put out a job ad for a researcher, but I&#8217;ll get 1,000 different resumes, and I&#8217;ll have no clue based on that information whether the applicant would be any good at synthesizing lots of technical arguments and information. So I thought, let&#8217;s just list out some questions where I genuinely don&#8217;t know the answer and would be keen to get some insight.</p><h3>What this role looks like</h3><ul><li><p>Ideally in person in San Francisco, but potentially open to remote.</p></li><li><p>Will pay competitively</p></li></ul><h3>Submit <a href="https://airtable.com/app8aYOTzMkv9qeAJ/pagHhju8B5tgu4yXc/form">here</a></h3><p>If you have questions or comments, I&#8217;m at hello@dwarkeshpatel.com.</p>]]></content:encoded></item><item><title><![CDATA[Jensen Huang – TPU competition, why we should sell chips to China, & Nvidia’s supply chain moat]]></title><description><![CDATA[&#8220;If our next several years are a trillion dollars in scale, we have the supply chain to do it&#8221;]]></description><link>https://www.dwarkesh.com/p/jensen-huang</link><guid isPermaLink="false">https://www.dwarkesh.com/p/jensen-huang</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Wed, 15 Apr 2026 15:45:23 GMT</pubDate><enclosure 
url="https://api.substack.com/feed/podcast/194289889/5f292c095257191205d7c71b2b0c70da.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>I asked Jensen about TPU competition, Nvidia&#8217;s lock on the ever more bottlenecked supply chain needed to make advanced chips, whether we should be selling AI chips to China, why Nvidia doesn&#8217;t just become a hyperscaler, how it makes its investments, and much more. Enjoy!</p><p>Watch on <a href="https://youtu.be/Hrbq66XqtCo">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/jensen-huang-tpu-competition-why-we-should-sell-chips/id1516093381?i=1000761582962">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/1viBRy6dQdlSw0OdFvogXB?si=bc2cdbd467ed4ee3">Spotify</a>.</p><div id="youtube2-Hrbq66XqtCo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Hrbq66XqtCo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Hrbq66XqtCo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Sponsors</strong></h2><ul><li><p><a href="https://crusoe.ai/dwarkesh">Crusoe&#8217;s</a> cloud runs on state-of-the-art Blackwell GPUs, with Vera Rubin deployment scheduled for later this year. But hardware is only part of the story&#8212;for inference, Crusoe&#8217;s MemoryAlloy tech implements a cluster-wide KV cache, delivering up to 10x faster TTFT and 5x better throughput than vLLM. Learn more at <a href="https://crusoe.ai/dwarkesh">crusoe.ai/dwarkesh</a></p></li></ul><ul><li><p><a href="https://cursor.com/dwarkesh">Cursor</a> helped me build an AI co-researcher over the course of a weekend. Now I have an AI agent that I can collaborate with in Google Docs via inline comment threads! 
And while other agentic coding tools feel like a total black-box, Cursor let me stay on top of the full implementation. You can try my co-researcher out <a href="http://github.com/dwarkeshsp/ai_coworker">here</a>, or get started on your own Cursor project today at <a href="https://cursor.com/dwarkesh">cursor.com/dwarkesh</a></p></li><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> spent ~20,000 GPU hours training backdoors into 3 different language models, then challenged my audience to find the triggers. They received some clever solutions&#8212;like comparing the base and fine-tuned versions and extrapolating any differences to reveal the hidden backdoor&#8212;but no one was able to solve all 3. So if open problems like this excite you, Jane Street is hiring. Learn more at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><h2><strong>Timestamps</strong></h2><p>(00:00:00) &#8211; Is Nvidia&#8217;s biggest moat its grip on scarce supply chains?</p><p>(00:16:25) &#8211; Will TPUs break Nvidia&#8217;s hold on AI compute?</p><p>(00:41:06) &#8211; Why doesn&#8217;t Nvidia become a hyperscaler?</p><p>(00:57:36) &#8211; Should we be selling AI chips to China?</p><p>(01:35:06) &#8211; Why doesn&#8217;t Nvidia make multiple different chip architectures?</p><h2>Transcript</h2><h3>00:00:00 &#8211; Is Nvidia&#8217;s biggest moat its grip on scarce supply chains?</h3><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;ve seen the <a href="https://techcrunch.com/2026/03/01/saas-in-saas-out-heres-whats-driving-the-saaspocalypse/">valuations of a bunch of software companies crash</a> because people are expecting AI to commoditize software. There&#8217;s a potentially naive way of thinking about things, which is: look, Nvidia sends a <a href="https://en.wikipedia.org/wiki/GDSII">GDS2 file</a> to <a href="https://en.wikipedia.org/wiki/TSMC">TSMC</a>. 
TSMC builds the <a href="https://www.asml.com/en/technology/all-about-microchips/microchip-basics">logic</a> <a href="https://en.wikipedia.org/wiki/Die_(integrated_circuit)">dies</a>, it builds the <a href="https://en.wikipedia.org/wiki/Switch">switches</a>, then it packages them with the <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">HBM</a> that <a href="https://en.wikipedia.org/wiki/SK_Hynix">SK Hynix</a>, <a href="https://en.wikipedia.org/wiki/Micron_Technology">Micron</a>, and <a href="https://en.wikipedia.org/wiki/Samsung_Electronics">Samsung</a> make. Then it sends it to an <a href="https://www.smckyems.com/understanding-the-differences-between-oem-odm-ems-and-cem/">ODM</a> in Taiwan where they assemble the racks. <a href="https://en.wikipedia.org/wiki/Nvidia">Nvidia</a> is fundamentally making software that other people are manufacturing, and if software gets commoditized, does Nvidia get commoditized?</p><p><strong>Jensen Huang</strong></p><p>In the end, something has to transform electrons to tokens. The transformation of electrons to tokens and making those tokens more valuable over time is hard to completely commoditize. The transformation from electrons to tokens is such an incredible journey. Making that token is like making one molecule more valuable than another molecule, making one token more valuable than another. The amount of artistry, engineering, science, and invention that goes into making that token valuable, obviously we&#8217;re watching it happen in real time. The transformation, the manufacturing, all of the science that goes in there is far from deeply understood and the journey is far from over. I doubt that it will happen.</p><p>We&#8217;re going to make it more efficient, of course. The way that you framed the question is my mental model of our company. The input is electrons, the output is tokens. In the middle is Nvidia. 
Our job is to do as much as necessary and as little as possible to enable that transformation to be done at incredible capabilities. What I mean by &#8220;as little as possible,&#8221; whatever I don&#8217;t need to do, I partner with somebody and make it part of my ecosystem.</p><p>If you look at Nvidia today, we probably have the largest ecosystem of partners, both in the supply chain upstream and downstream, all of the computer companies, application developers, and model makers. AI is a five-layer cake, if you will. We have ecosystems across the entire five layers. We try to do as little as possible, but the part that we have to do, as it turns out, is insanely hard. I don&#8217;t think that gets commoditized.</p><p>In fact, I also don&#8217;t think the enterprise software companies, the tools makers&#8230; Most software companies today are tool makers. Some of them are not. Some of them are workflow codification systems. But for a lot of companies, they&#8217;re tool makers. For example, Excel is a tool, PowerPoint is a tool, <a href="https://en.wikipedia.org/wiki/Cadence_Design_Systems">Cadence</a> makes tools, <a href="https://en.wikipedia.org/wiki/Synopsys">Synopsys</a> makes tools. I actually see the opposite of what people see. I think the number of <a href="https://en.wikipedia.org/wiki/AI_agent">agents</a> is going to grow exponentially, and the number of tool users is going to grow exponentially. It&#8217;s very likely that the number of instances of all these tools is going to skyrocket.</p><p>It&#8217;s very likely that the number of instances of <a href="https://www.synopsys.com/implementation-and-signoff/rtl-synthesis-test/design-compiler.html">Synopsys Design Compiler</a> is going to skyrocket, along with the number of agents using the floor planners, our layout tools, and our design rule checkers. Today we&#8217;re limited by the number of engineers. Tomorrow, those engineers are going to be supported by a bunch of agents. 
We&#8217;re going to be exploring the design space like you&#8217;ve never seen before, and we&#8217;re going to use the tools that we use today.</p><p>I think tool use is going to cause the software companies to skyrocket. The reason why it hasn&#8217;t happened yet is because the agents aren&#8217;t good enough at using their tools yet. Either these companies are going to build the agents themselves, or agents are going to get good enough to be able to use those tools. I think it&#8217;s going to be a combination of both.</p><p><strong>Dwarkesh Patel</strong></p><p>I think in your <a href="https://investor.nvidia.com/financial-info/financial-reports/default.aspx">latest filings</a>, you had almost $100 billion in purchase commitments with foundries, memory, and packaging. <a href="https://semianalysis.com/">SemiAnalysis</a> has reported that you will have $250 billion of these kinds of purchase commitments. One interpretation is that Nvidia&#8217;s moat is really that you&#8217;ve locked up many years of these scarce components. Somebody else might have an accelerator, but can they actually get the memory to build it? Can they actually get the <a href="https://www.asml.com/en/technology/all-about-microchips/microchip-basics">logic</a> to build it? Is this really Nvidia&#8217;s big moat for the next few years?</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s one of the things that we can do that is hard for someone else to do. We&#8217;ve made enormous commitments upstream. Some of it is explicit, these commitments that you mentioned. Some of it is implicit. 
For example, a lot of the investments that are upstream are made by our supply chain because I said to the CEOs, &#8220;Let me tell you how big this industry is going to be, let me explain to you why, let me reason through it with you, and let me show you what I see.&#8221;</p><p>As a result of that process of informing, inspiring, and aligning with CEOs of all different industries upstream, they&#8217;re willing to make the investments. Why are they willing to make the investments for me and not someone else? The reason for that is because they know that I have the capacity to buy their supply and sell it through my downstream. The fact is that Nvidia&#8217;s downstream supply chain and our downstream demand is so large, they&#8217;re willing to make the investment upstream.</p><p>If you look at <a href="https://en.wikipedia.org/wiki/Nvidia_GTC">GTC</a>, people marvel at the scale of it and the people that go. It&#8217;s a full 360 degrees, the entire universe of AI all in one place. They&#8217;re all in one place because they need to see each other. I bring them together so that the downstream can see the upstream, the upstream can see the downstream, and all of them can see the advances in AI. Very importantly, they can all meet the AI natives, all the AI startups being built, and all the amazing things happening so they can see firsthand all the things that I tell them. I spend a lot of my time informing, directly or indirectly, our supply chain, partners, and ecosystem about the opportunity in front of us.</p><p>Some people always say, &#8220;Jensen, in most keynotes, it&#8217;s one announcement after another.&#8221; With our keynotes, there&#8217;s always a part of it that&#8217;s a little torturous in the sense that it almost comes across like education. In fact, that&#8217;s exactly on my mind. 
I need to make sure the entire supply chain, upstream and downstream, the ecosystem, understands what is coming at us, why it&#8217;s coming, when it&#8217;s coming, how big it&#8217;s going to be, and is able to reason about it systematically, just like I reason about it.</p><p>Regarding the moat as you describe it, we&#8217;re able to build for a future. If our next several years are a trillion dollars in scale, we have the supply chain to do it. Without our reach, the velocity of our business&#8230; Just as there&#8217;s cash flow, there&#8217;s supply chain flow, there&#8217;s churns. Nobody is going to build a supply chain for an architecture if the business churns are low. Our ability to sustain the scale is only because our downstream demand is so great. And they see it, they hear about it, they see it all coming. That allows us to do the things we&#8217;re able to do at the scale we do them.</p><p><strong>Dwarkesh Patel</strong></p><p>I do want to understand more concretely whether the upstream can keep up. For many years now, you guys have been 2x-ing revenue year over year. You&#8217;ve been more than tripling the amount of <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">flops</a> you&#8217;re providing to the world year over year.</p><p><strong>Jensen Huang</strong></p><p>And 2x-ing at this scale now is really incredible.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. But then you look at logic. You&#8217;re the biggest customer on TSMC&#8217;s <a href="https://en.wikipedia.org/wiki/3_nm_process">N3 node</a>, and you&#8217;re one of the biggest on <a href="https://en.wikipedia.org/wiki/2_nm_process">N2</a>. AI as a whole this year is going to be 60% of N3. It&#8217;s going to be 86% next year, according to SemiAnalysis. How do you double if you&#8217;re the majority? And how do you do that year over year? Are we in a regime now where the growth rate in AI compute has to slow because of upstream? 
Do you see a way to get around this? How do we build 2x more fabs year over year, ultimately?</p><p><strong>Jensen Huang</strong></p><p>At some level, the instantaneous demand is greater than the supply upstream and downstream in the world. At any instant, we could be limited by the number of plumbers, which actually happens.</p><p><strong>Dwarkesh Patel</strong></p><p>The plumbers are invited to next year&#8217;s GTC.</p><p><strong>Jensen Huang</strong></p><p>By the way, great idea. But that&#8217;s a good condition. You want an industry where the instantaneous demand is greater than the total supply of the industry. The opposite is obviously less good. If we&#8217;re too far apart, if one particular component is too far away, the industry swarms it. For example, notice people aren&#8217;t talking very much about <a href="https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm">CoWoS</a> anymore.</p><p>The reason for that is because for two years we swarmed the living daylights out of it. We doubled, doubled, doubled on several doubles. Now I think we&#8217;re in fairly good shape. TSMC now knows that CoWoS supply has to keep up with the rest of the logic demand and the memory demand. They&#8217;re scaling CoWoS and future packaging technologies at the same level as they scale logic. This is terrific, because for a long time, CoWoS and HBM memory were rather specialty. But they&#8217;re not specialties anymore. People now realize they&#8217;re mainstream computing technology.</p><p>Of course, we&#8217;re now much more able to influence a larger scope of our supply chain. At the beginning of the AI revolution, all the things that I say now, I was saying five years ago. Some people believed in it and invested in it, for example, Sanjay and the Micron team. I still remember the meeting really well where I was clear about exactly what was going to happen, why it was going to happen, and the predictions of today. They really doubled down on it. 
We partnered with them across <a href="https://en.wikipedia.org/wiki/LPDDR">LPDDR</a> and HBM memories, and they really invested in it. It obviously has been tremendous for the company. Some people came a little bit later, but now they&#8217;re all here.</p><p>Each one of these bottlenecks gets a great deal of attention. Now we&#8217;re prefetching the bottlenecks years in advance. For example, the investments that we&#8217;ve done with <a href="https://en.wikipedia.org/wiki/Lumentum">Lumentum</a>, <a href="https://en.wikipedia.org/wiki/Coherent_Corp.">Coherent</a>, and the <a href="https://en.wikipedia.org/wiki/Silicon_photonics">silicon photonics</a> ecosystem over the last several years really reshaped the supply chain. We built up an entire supply chain around TSMC. We partnered with them on <a href="https://tspasemiconductor.substack.com/p/tsmc-coupe-metalens-building-the">COUPE</a>, invented a whole bunch of technology, and licensed those patents to the supply chain to keep it nice and open.</p><p>We&#8217;re preparing the supply chain through the invention of new technologies, new workflows, new testing equipment like double-sided probing, investing in companies, and helping them scale up their capacity. You can see that we&#8217;re trying to shape the ecosystem so that the supply chain is ready to support the scale.</p><p><strong>Dwarkesh Patel</strong></p><p>It seems like some bottlenecks are easier than others. Scaling up CoWoS versus scaling up&#8212;</p><p><strong>Jensen Huang</strong></p><p>I went to the hardest one, by the way.</p><p><strong>Dwarkesh Patel</strong></p><p>Which is?</p><p><strong>Jensen Huang</strong></p><p>Plumbers. Plumbers and electricians. This is one of the concerns that I have about the doomers describing the end of work and killing of jobs. If we discourage people from being software engineers, we&#8217;re going to run out of software engineers. The same prediction happened ten years ago. 
Some of the doomers were telling people, &#8220;Whatever you do, don&#8217;t be a radiologist.&#8221; You might hear some of those videos still on the web saying radiology is going to be the first career to go and the world is not going to need any more radiologists. Guess what we&#8217;re short of? Radiologists.</p><p><strong>Dwarkesh Patel</strong></p><p>Going back to this point about how some things you can scale, and other things&#8230; How do you actually manufacture 2x the amount of logic a year? Ultimately, memory and logic are bottlenecked by EUV. How do you get to 2x as many <a href="https://en.wikipedia.org/wiki/Extreme_ultraviolet_lithography">EUV</a> machines year over year?</p><p><strong>Jensen Huang</strong></p><p>None of that is impossible to scale quickly. All of that is easy to do within two or three years. You just need a demand signal. Once you can build one, you can build ten, and once you can build ten, you can build a million. These things are not hard to replicate.</p><p><strong>Dwarkesh Patel</strong></p><p>How far down the supply chain do you go? Do you go to <a href="https://en.wikipedia.org/wiki/ASML">ASML</a> and say, &#8220;Hey, if I look out three years from now, for Nvidia to be generating two trillion a year in revenue, we need way more EUV machines&#8221;?</p><p><strong>Jensen Huang</strong></p><p>Some of them I have to directly, some of them indirectly, and some of them&#8230; If I can convince TSMC, ASML will be convinced. We have to think about the critical pinch points. But if TSMC is convinced, you&#8217;ll have plenty of EUV machines in a few years.</p><p>My point is that none of the bottlenecks last longer than a couple of years, two, three years, none of them. 
Meanwhile, we&#8217;re improving computing efficiency by 10x, 20x, and in the case of <a href="https://www.nvidia.com/en-us/data-center/technologies/hopper-architecture/">Hopper</a> to <a href="https://www.nvidia.com/en-us/data-center/technologies/blackwell-architecture/">Blackwell</a>, 30x to 50x. We&#8217;re coming up with new algorithms because <a href="https://en.wikipedia.org/wiki/CUDA">CUDA</a> is so flexible. We&#8217;re developing all kinds of new techniques so that we drive efficiency in addition to increasing capacity. None of those things worry me. It&#8217;s the stuff that&#8217;s downstream from us. Energy policies that prevent energy from&#8230; You can&#8217;t create an industry without energy. You can&#8217;t create a whole new manufacturing industry without energy.</p><p>We want to reindustrialize the United States. We want to bring back chip manufacturing, computer manufacturing, and packaging. We want to build new things like EVs and robots. We want to build AI factories. You can&#8217;t build any of these things without energy, and those things take a long time. More chip capacity, that&#8217;s a 2-3 year problem. More CoWoS capacity, 2-3 year problem.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. I feel like I have guests tell me the exact opposite thing sometimes. In this case, I just don&#8217;t have the technical knowledge to adjudicate.</p><p><strong>Jensen Huang</strong></p><p>The beautiful thing is you&#8217;re talking to the expert.</p><h3>00:16:25 &#8211; Will TPUs break Nvidia&#8217;s hold on AI compute?</h3><p><strong>Dwarkesh Patel</strong></p><p>True. I want to ask about your competitors. If you look at the <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit">TPU</a>, arguably two out of the top three models in the world, Claude and Gemini, were trained on TPU. What does that mean for Nvidia going forward?</p><p><strong>Jensen Huang</strong></p><p>We build a very different thing. 
What Nvidia built is accelerated computing, not a tensor processing unit. Accelerated computing is used for all kinds of things: molecular dynamics, quantum chromodynamics, data processing, data frames, structured data, and unstructured data. It&#8217;s also used for fluid dynamics and particle physics. In addition, we use it for AI.</p><p>Accelerated computing is much more diverse. Although AI is the conversation today and is obviously very important and impactful, computing is much broader than that. Nvidia has reinvented the way computing is done, moving from general-purpose computing to accelerated computing. Our market reach is far greater than any TPU or <a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit">ASIC</a> can possibly have. If you look at our position, we&#8217;re the only company that accelerates applications of all kinds. We have a gigantic ecosystem. So all kinds of frameworks and algorithms run on Nvidia.</p><p>Because our computers are designed to be operated by other people, anyone who&#8217;s an operator can buy our systems. With most of these home-built systems, you have to be your own operator because they were never designed to be flexible enough for others to operate. Because anybody can operate our systems, we&#8217;re in every cloud, including Google, Amazon, Azure, and OCI.</p><p>If you want to operate it to rent, you better have a large ecosystem of customers in many industries to be the offtakers. If you want to operate it for yourself, we obviously have the ability to help you operate it yourself, like we did for Elon with xAI. And because we can enable operators in any company and any industry, you could use it to build a supercomputer for scientific research and drug discovery at Lilly. 
We can help them operate their own supercomputer and use it for the entire diversity of drug discovery and biological sciences that we accelerate.</p><p>There are just a whole bunch of applications that we can address that you can&#8217;t do with TPUs. Nvidia built CUDA to be a fantastic tensor processing unit as well, but it also handles every life cycle of data processing, computing, AI, and so on. Our market opportunity is just a lot larger, and our reach is a lot greater. Because we support every application in the world now, you can build Nvidia systems anywhere and know that there will be customers for it. It&#8217;s a very different thing.</p><p><strong>Dwarkesh Patel</strong></p><p>This is going to be a long question. You have spectacular revenue, and you&#8217;re not making $60 billion a quarter from pharma and quantum. You&#8217;re making it because AI is an unprecedented technology that is growing unprecedentedly fast.</p><p>The question then is what is best for AI specifically. I&#8217;m not in the details, but I talk to my AI researcher friends and they say, &#8220;Look, when I use a TPU, it&#8217;s this big <a href="https://en.wikipedia.org/wiki/Systolic_array">systolic array</a> that&#8217;s perfect for doing matrix multiplies, whereas a <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPU</a> is very flexible. It&#8217;s great when you have lots of branching or irregular memory access.&#8221;</p><p>But what is AI? It&#8217;s just these very predictable <a href="https://en.wikipedia.org/wiki/Matrix_multiplication">matrix multiplies</a> again and again and again. You don&#8217;t have to give up any die area for warp schedulers or switches between threads and memory banks. And the TPU is really optimized for the bulk of this growth in revenue and use case for compute that is coming online right now. 
I wonder how you react to that.</p><p><strong>Jensen Huang</strong></p><p>Matrix multiplies are an important part of AI, but they&#8217;re not the only part. If you want to come up with a new <a href="https://en.wikipedia.org/wiki/Attention_(machine_learning)">attention</a> mechanism, disaggregate in a different way, or invent a whole new type of architecture altogether&#8212;like a hybrid <a href="https://en.wikipedia.org/wiki/State-space_representation">SSM</a>&#8212;you want an architecture that&#8217;s generally programmable. If you want to create a model that fuses <a href="https://en.wikipedia.org/wiki/Diffusion_model">diffusion</a> and <a href="https://en.wikipedia.org/wiki/Autoregressive_model">autoregressive techniques</a>, you want an architecture that&#8217;s just generally programmable. We run everything you can imagine. That&#8217;s the advantage. It allows for the invention of new algorithms a lot more easily, because it&#8217;s a programmable system.</p><p>The ability to invent new algorithms is really what makes AI advance so quickly. TPUs, like anything else, are impacted by <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s Law</a>, which we know is increasing by about 25% per year. The only way to really get 10x or 100x leaps is to fundamentally change the algorithm and how it&#8217;s computed every single year.</p><p>That&#8217;s Nvidia&#8217;s fundamental advantage. The only reason we were able to make Blackwell to Hopper 50x&#8230; When I first announced Blackwell was going to be 35x more energy efficient than Hopper, nobody believed it. Then <a href="https://www.dwarkesh.com/p/dylan-patel">Dylan</a> wrote an article saying I sandbagged, and it&#8217;s actually fifty times. You can&#8217;t reasonably do that with just Moore&#8217;s Law. 
The way we solve that problem is with new models, like <a href="https://en.wikipedia.org/wiki/Mixture_of_experts">MoEs</a>, that are parallelized, disaggregated, and distributed across a computing system. Without the ability to really get down and come up with new <a href="https://modal.com/gpu-glossary/device-software/kernel">kernels</a> with CUDA, it&#8217;s really hard to do.</p><p>It&#8217;s the combination of the programmability of our architecture and the fact that Nvidia is an extreme co-design company. We can even offload some of the computation into the fabric itself, like <a href="https://www.nvidia.com/en-us/data-center/nvlink/">NVLink</a>, or into the network with <a href="https://www.nvidia.com/en-us/networking/spectrumx/">Spectrum-X</a>. We could effect change across the processors, the system, the fabric, the libraries, and the algorithm simultaneously. Without CUDA to do that, I wouldn&#8217;t even know where to start.</p><p><strong>Dwarkesh Patel</strong></p><p>This gets at an interesting question about Nvidia&#8217;s clientele. 60% of your revenue is coming from these big five hyperscalers. In a different era with different customers&#8212;let&#8217;s say professors running experiments&#8212;they need CUDA. They can&#8217;t use another accelerator. They just needed to run <a href="https://en.wikipedia.org/wiki/PyTorch">PyTorch</a> with CUDA and have everything optimized.</p><p>But these hyperscalers have the resources to write their own kernels. In fact, they have to in order to get that last 5% of performance they need for their specific architecture. Anthropic and Google are mostly running their own accelerators or running TPUs and <a href="https://aws.amazon.com/ai/machine-learning/trainium/">Trainium</a>. But even OpenAI, using GPUs, has <a href="https://openai.com/index/triton/">Triton</a> because they need their own kernels. 
Down to CUDA C++, instead of using <a href="https://developer.nvidia.com/cublas">cuBLAS</a> and <a href="https://developer.nvidia.com/nccl">NCCL</a>, they&#8217;ve got their own stack which compiles to other accelerators as well. If most of your customers can and do make replacements for CUDA, to what extent is CUDA really the thing that is going to make frontier AI happen on Nvidia?</p><p><strong>Jensen Huang</strong></p><p>CUDA is a rich ecosystem. If you want to build on any computer first, building on CUDA first is incredibly smart. Because the ecosystem is so rich, we support every framework. If you want to create custom kernels&#8230; For example, we contribute enormously to Triton. So the back end of Triton has huge amounts of Nvidia technology.</p><p>We&#8217;re delighted to help every framework become as great as it can be. There are lots and lots of frameworks. There&#8217;s Triton, <a href="https://vllm.ai/">vLLM</a>, <a href="https://github.com/sgl-project/sglang">SGLang</a>, and more. Now there&#8217;s a whole bunch of new <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learning</a> frameworks coming out, like <a href="https://github.com/verl-project/verl">verl</a> and <a href="https://docs.nvidia.com/nemo/rl/latest/index.html">NeMo RL</a>. With <a href="https://www.interconnects.ai/p/the-state-of-post-training-2025">post-training</a> and reinforcement learning, that entire area is just exploding. So if you want to build on an architecture, building on CUDA makes the most sense because you know the ecosystem is great.</p><p>You know that if something happens, it&#8217;s more likely in your code and not in the mountain of code underneath. Don&#8217;t forget the amount of code you&#8217;re dealing with when building these systems. When something doesn&#8217;t work, was it you or was it the computer? You would like it to always be you and to be able to trust the computer. 
Obviously, we still have lots of bugs ourselves, but our system is so well wrung out that you can at least build on top of the foundation. That&#8217;s number one: the richness, programmability, and capability of the ecosystem.</p><p>The second thing is, if you&#8217;re a developer building anything at all, the single most important thing you want is an install base. You want the software you write to run on a whole bunch of other computers. You&#8217;re not building software just for yourself. You&#8217;re building it for your fleet or everybody else&#8217;s fleet because you&#8217;re a framework builder. Nvidia&#8217;s CUDA ecosystem is ultimately its great treasure.</p><p>We have several hundred million GPUs out there now. Every cloud has it. It goes back to the <a href="https://www.nvidia.com/en-us/data-center/products/a10-gpu/">A10</a>, <a href="https://www.nvidia.com/en-us/data-center/a100/">A100</a>, <a href="https://www.nvidia.com/en-us/data-center/h100/">H100</a>, <a href="https://www.nvidia.com/en-us/data-center/h200/">H200</a>, the <a href="https://www.nvidia.com/en-us/data-center/l40/">L series</a>, the <a href="https://www.nvidia.com/en-us/data-center/pascal-gpu-architecture/">P series</a>. There&#8217;s a whole bunch of them. They&#8217;re in all kinds of sizes and shapes. If you&#8217;re a robotics company, you want that CUDA stack to actually run in the robot itself. We&#8217;re literally everywhere. The install base means that once you develop the software or the model, it&#8217;s going to be useful everywhere. That is just incredibly valuable.</p><p>Lastly, the fact that we&#8217;re in every single cloud makes us genuinely unique. If you&#8217;re an AI company or developer, you&#8217;re not exactly sure which cloud service provider you&#8217;re going to partner with or where you&#8217;d like to run it. We run everywhere, including on-prem for you if you like. 
The combination of the richness of the ecosystem, the expansiveness of the install base, and the versatility of where we are makes CUDA invaluable.</p><p><strong>Dwarkesh Patel</strong></p><p>That makes a lot of sense. I guess the thing I&#8217;m curious about is whether those advantages matter a lot to your main customers. There&#8217;s many people for whom they might matter. The kind of person who can actually build their own software stack makes up most of your revenue. Especially if you go to a world where AI is getting especially good at the things which have tight verification loops where you can RL on them&#8230; This question of how do you write a kernel that does attention or <a href="https://en.wikipedia.org/wiki/Multilayer_perceptron">MLP</a> the most efficiently across a scale up? It&#8217;s a very verifiable sort of feedback loop.</p><p>Can all the hyperscalers write these custom kernels for themselves? Nvidia still has great price performance, so they might still prefer to use Nvidia. But then the question is, does it just become a question of who is offering the best specs, the best flops and memory bandwidth for a given dollar. Whereas historically Nvidia has just had, and still has, the best margins in all of AI across hardware and software, +70%, because of this CUDA moat. And the question is, can you sustain those margins if for most of your customers, they can actually afford to build, instead of the CUDA moat?</p><p><strong>Jensen Huang</strong></p><p>The number of engineers we have assigned to these AI labs is insane, working with them, optimizing their stack. The reason for that is because nobody knows our architecture better than we do. These architectures are not as general purpose as a CPU. A CPU is kind of like a Cadillac. It&#8217;s a nice cruiser. It never goes too fast. Everybody drives it pretty well. It&#8217;s got cruise control, and everything&#8217;s easy. 
But in a lot of ways, Nvidia&#8217;s GPUs, accelerators, are like F1 racers. I could imagine everybody&#8217;s able to drive it at a hundred miles an hour, but it takes quite a bit of expertise to be able to push it to the limit. We use a ton of AI to create the kernels that we have.</p><p>I&#8217;m pretty sure we&#8217;re going to still be needed for quite some time. Our expertise helps our AI lab partners to get another 2x out of their stack easily oftentimes. It&#8217;s not unusual that by the time we&#8217;re done optimizing their stack or optimizing a particular kernel, their model sped up by 3x, 2x, 50%. That&#8217;s a huge number, especially when you&#8217;re talking about the install base of the fleet that they have, of all the Hoppers and Blackwells that they have. When you increase it by a factor of two, that doubles the revenues. That directly translates to revenues.</p><p>Nvidia&#8217;s computing stack is the best performance per <a href="https://en.wikipedia.org/wiki/Total_cost_of_ownership">TCO</a> in the world, bar none. Nobody can demonstrate to me that any single platform in the world today has a better performance-TCO ratio. Not one company. In fact, the benchmarks that are out there. Dylan&#8217;s <a href="https://newsletter.semianalysis.com/p/inferencemax-open-source-inference">InferenceMAX</a> is sitting out there for everybody to use, and not one&#8230; TPU won&#8217;t come, Trainium won&#8217;t come.</p><p>I encourage them to use InferenceMAX and demonstrate their incredible inference cost. It&#8217;s really hard. Nobody wants to show up. <a href="https://www.nvidia.com/en-us/data-center/resources/mlperf-benchmarks/">MLPerf</a>. I would welcome Trainium to demonstrate their 40% that they claim all the time. I would love to hear them demonstrate the cost advantage of TPUs. It makes no sense in my mind. It makes absolutely zero sense. 
On first principles, it makes no sense.</p><p>So I think the reason why we&#8217;re so successful is simply because our TCO is so great. Secondly, you say 60% of our customers are the top five, but most of that business is external. For example, most of Nvidia in AWS is for external customers, not internal use. Most of our customers at Azure, obviously all of our customers are external. All of our customers at OCI are external, not internal use. The reason why they favor us is because our reach is so great. We can bring them all of the great customers in the world. They&#8217;re all built on Nvidia. And the reason why all these companies are built on Nvidia is because our reach and our versatility is so great.</p><p>So I think the flywheel is really install base, the programmability of our architecture, the richness of our ecosystem, and the fact that there&#8217;s so many AI companies in the world. There&#8217;s tens of thousands of them now. If you were one of those AI startups, what architecture would you choose? You would choose an architecture that&#8217;s most abundant. We&#8217;re the most abundant in the world. You&#8217;d choose the one that has the largest installed base. We&#8217;re the largest install base. And you&#8217;d choose the one that has a rich ecosystem.</p><p>So that&#8217;s the flywheel. That&#8217;s the reason why, between the combination of: one, our perf per dollar is so great that they have the lowest cost tokens. Second, our perf per watt is the highest in the world. So if one of these companies, if our partners, built a one gigawatt data center, that one gigawatt data center better deliver the maximum amount of revenues and number of tokens, which directly translates to revenues. You want it to generate as many tokens as possible, maximize the revenues for that data center. We are the highest tokens per watt architecture in the world. Lastly, if your goal is to rent the infrastructure, we have the most customers in the world. 
So that&#8217;s the reason why the flywheel works.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. I guess the question comes down to, what is the actual market structure here? Because even if there&#8217;s other companies&#8230; There could have been a world where there&#8217;s tens of thousands of AI companies that have roughly equal share of compute. But even through these five hyperscalers, really the people on Amazon using the compute are Anthropic, OpenAI, and these big foundation labs who can themselves afford and have the ability to make different accelerators work.</p><p><strong>Jensen Huang</strong></p><p>No, I think your premise is wrong.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe. But let me ask you a slightly different question.</p><p><strong>Jensen Huang</strong></p><p>Come back and make me correct your premise.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. Let me just ask you a different question.</p><p><strong>Jensen Huang</strong></p><p>But still make sure to make me come back and fix because it&#8217;s just too important to AI. It&#8217;s too important to the future of science. It&#8217;s too important to the future of the industry. That premise&#8230; Look &#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>Let me just finish the question and then we can address it together.</p><p><strong>Jensen Huang</strong></p><p>Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>If all these things about price, performance, and performance per watt, et cetera, are true, why do you think it is the case that, say, Anthropic for example, just <a href="https://www.anthropic.com/news/google-broadcom-partnership-compute">announced a couple days ago they have a multi-gigawatt deal with Broadcom and Google</a> for TPUs and the majority of their compute?</p><p>Obviously for Google, TPU is a majority of compute. 
So if I look at these big AI companies, it seems like a lot of their compute&#8230; There was some point where it&#8217;s all Nvidia and now it&#8217;s not. So I&#8217;m curious how to square, if these things are true on paper, why are they going with other accelerators?</p><p><strong>Jensen Huang</strong></p><p>Anthropic is a unique instance, not a trend. Without Anthropic, why would there be any TPU growth at all? It&#8217;s 100% Anthropic. Without Anthropic, why would there be Trainium growth at all? It&#8217;s 100% Anthropic. I think that&#8217;s fairly well known and well understood. It&#8217;s not that there&#8217;s an abundance of ASIC opportunities. There&#8217;s only one Anthropic.</p><p><strong>Dwarkesh Patel</strong></p><p>But <a href="https://www.amd.com/en/newsroom/press-releases/2025-10-6-amd-and-openai-announce-strategic-partnership-to-d.html">OpenAI&#8217;s deals with AMD</a>&#8230; They&#8217;re building their own <a href="https://tech-insider.org/openai-titan-chip-samsung-hbm4-custom-ai-chip-2026/">Titan</a> accelerator.</p><p><strong>Jensen Huang</strong></p><p>Yeah, but I think we could all acknowledge they&#8217;re vastly Nvidia. We&#8217;re going to still do a lot of work together. I&#8217;m not offended by other people using something else and trying things. If they don&#8217;t try these other things, how would they know how good ours is? Sometimes you&#8217;ve got to be reminded of it. We have to continuously earn the position that we&#8217;re in.</p><p>There are always big claims. Look at the number of ASICs that have been canceled. Just because you&#8217;re going to build an ASIC&#8230; You still have to build something better than Nvidia. It&#8217;s not that easy building something better than Nvidia. It&#8217;s not sensible, actually. Nvidia&#8217;s got to be missing something, seriously. Because of our scale, our velocity, we&#8217;re the only company in the world that&#8217;s cranking it out every single year. 
Big leaps, every single year.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess their logic is, &#8220;Hey, it doesn&#8217;t need to be better. It just needs to be not more than 70% worse,&#8221; because they&#8217;re paying you 70% margins.</p><p><strong>Jensen Huang</strong></p><p>No, don&#8217;t forget, even in ASICs margins are really quite high. Nvidia&#8217;s margin is 70%, let&#8217;s say. But ASIC margins are 65%. What are you really saving?</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, you mean from Broadcom or something like that?</p><p><strong>Jensen Huang</strong></p><p>Yeah, sure. You&#8217;ve got to pay somebody. I think the ASIC margins are incredibly good, from what I can tell. They believe it too. They&#8217;re quite proud of their incredible ASIC margins.</p><p>So, you asked the question why. A long time ago, we just didn&#8217;t have the ability to do it. At the time, I didn&#8217;t deeply internalize how difficult it would be to build a foundation AI lab like OpenAI and Anthropic, and the fact that they needed huge investments from the supplier themselves. We just weren&#8217;t in a position to make the multi-billion dollar investment into Anthropic so that they could use our compute. But Google and AWS were. They put in huge investments in the beginning so that Anthropic, in return, used their compute. We just weren&#8217;t in a position to do that at the time.</p><p>I would say my mistake is I didn&#8217;t deeply internalize that they really had no other options, that a VC would never put in $5-10 billion of investment into an AI lab with the hopes of it turning out to be Anthropic. So that was my miss. But even if I understood it, I don&#8217;t think we would&#8217;ve been in a position to do that at the time. But I&#8217;m not going to make that same mistake again.</p><p>I&#8217;m delighted to invest in OpenAI, and I&#8217;m delighted to help them scale, and I believe it&#8217;s essential to do so. 
And then, when I was able to, when Anthropic came to us, I&#8217;m delighted to be an investor, delighted to help them scale. We just weren&#8217;t, at the time, able to do it. If I could rewind everything&#8212;and Nvidia could have been as big back then as we are now&#8212;I would&#8217;ve been more than happy to do it.</p><h3>00:41:06 &#8211; Why doesn&#8217;t Nvidia become a hyperscaler?</h3><p><strong>Dwarkesh Patel</strong></p><p>This is actually quite interesting. For many years Nvidia has been the company in AI making money, making lots of money. Now you&#8217;re investing it. It&#8217;s been reported that you&#8217;ve done up to $30 billion in OpenAI and $10 billion in Anthropic. But now their valuations have increased, and I&#8217;m sure they&#8217;ll continue to increase.</p><p>So if over these many years you were giving them the compute, you saw where it was headed, and they were worth like one tenth what they&#8217;re worth now a couple years ago&#8212;or even a year ago in some cases, and you had all this cash&#8212;there&#8217;s a world where either Nvidia themselves becomes a foundation lab, does a huge investment to make that possible, or has made the deals you&#8217;ve made now at current valuations much earlier on. And you had the cash to do it. So I am curious, actually, why not have done it earlier?</p><p><strong>Jensen Huang</strong></p><p>We did it as soon as we could have. We did it as soon as we could have, and if I could have, I would&#8217;ve done it even earlier. At the time that Anthropic needed us to do it, we just weren&#8217;t in a position to do it. It wasn&#8217;t in our sensibility to do so.</p><p><strong>Dwarkesh Patel</strong></p><p>How so? Was it like a cash thing?</p><p><strong>Jensen Huang</strong></p><p>Yeah, the level of investment. We had never invested outside the company at the time, and not that much. We didn&#8217;t realize we needed to. 
I always thought that they could just go raise from VCs, for God&#8217;s sakes, like all companies do. But what they were trying to do couldn&#8217;t have been done through VCs. What OpenAI wanted to do couldn&#8217;t have been done through VCs. I recognize that now. I didn&#8217;t know it then.</p><p>But that&#8217;s their genius. That&#8217;s why they&#8217;re smart. They realized then that they had to do something like that. And I&#8217;m delighted that they did. Even though we caused Anthropic to have to go to somebody else, I&#8217;m still happy that it happened. Anthropic&#8217;s existence is great for the world. I&#8217;m delighted for it.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess you still are making a ton of money, and you&#8217;re making way more money quarter after quarter.</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s still okay to have regrets.</p><p><strong>Dwarkesh Patel</strong></p><p>So the question still arises. Okay, now that we&#8217;re here and you have all this money that you keep making, what should Nvidia be doing with it? There&#8217;s one answer which is that there&#8217;s this whole middleman ecosystem that has popped up for converting CapEx into OpEx for these labs so that they can rent compute. Because the chips are really expensive, they make a lot of money over their lifetime because the AI models are getting better. So the value that they generate, their tokens, is increasing, but they&#8217;re expensive to set up. Nvidia has the money to do the CapEx. In fact, it&#8217;s been reported, you are <a href="https://finance.yahoo.com/news/nvidia-just-piled-2-billion-224300847.html">backstopping CoreWeave up to $6.3 billion and have invested $2 billion</a>.</p><p>Why doesn&#8217;t Nvidia become a cloud themselves? Why doesn&#8217;t it become a hyperscaler themselves and rent this compute out? 
You have all this cash to do it.</p><p><strong>Jensen Huang</strong></p><p>This is a philosophy of the company, and I think it&#8217;s wise. We should do as much as needed, as little as possible. What that means is, the work that we do with building our computing platform, if we don&#8217;t do it, I genuinely believe it doesn&#8217;t get done. If we didn&#8217;t take the risk that we take&#8212;if we didn&#8217;t build NVLink the way we built it, if we didn&#8217;t build the whole stack, if we didn&#8217;t create the ecosystem the way we did, if we didn&#8217;t dedicate ourselves to 20 years of CUDA while losing money most of that time&#8212;if we didn&#8217;t do it, nobody else would have done it.</p><p>If we didn&#8217;t create all the <a href="https://developer.nvidia.com/cuda/cuda-x-libraries">CUDA-X libraries</a> so that they&#8217;re all domain-specific&#8230; A decade and a half ago, we pushed into domain-specific libraries because we realized that if we didn&#8217;t create these domain-specific libraries, whether it&#8217;s for ray tracing or image generation or even the early works of AI, these models, if we didn&#8217;t create them, for data processing, structured data processing, or vector data processing, if we didn&#8217;t create them, nobody would. I am completely certain of that. We created a library for computational lithography called <a href="https://developer.nvidia.com/culitho">cuLitho</a>. If we didn&#8217;t create it, nobody would have. So accelerated computing wouldn&#8217;t advance the way it has if we didn&#8217;t do what we did.</p><p>So we should do that. We should dedicate our company, all of our might, wholeheartedly to go do that. However, the world has lots of clouds. If I didn&#8217;t do it, somebody would show up. So following the recipe, the philosophy, of doing as much as needed but as little as possible&#8212;as little as possible&#8212;that philosophy exists in our company today. 
Everything I do, I do it with that lens.</p><p>In the case of clouds, if we didn&#8217;t support <a href="https://en.wikipedia.org/wiki/CoreWeave">CoreWeave</a> to exist, these <a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/the-evolution-of-neoclouds-and-their-next-moves">neoclouds</a>, these AI clouds, wouldn&#8217;t exist. If we didn&#8217;t help CoreWeave exist, they would not exist. If we didn&#8217;t support <a href="https://www.nscale.com/">Nscale</a>, they wouldn&#8217;t be where they are today. If we didn&#8217;t support <a href="https://en.wikipedia.org/wiki/Nebius_Group">Nebius</a>, they wouldn&#8217;t be what they are today. Now they&#8217;re doing fantastically.</p><p>Is that a business model [inaudible]? We should do as much as needed, as little as possible. So we invest in our ecosystem because I want our ecosystem to thrive. I want the architecture, and AI, to be able to connect with as many industries as possible, as many countries as possible, and make it possible for the planet to be built on AI and to be built on the American tech stack. That vision is exactly what we&#8217;re pursuing.</p><p>Now, one of the things that you mentioned&#8230; There are so many great, amazing foundation model companies, and we try to invest in all of them. This is another thing that we do. We don&#8217;t pick winners. We need to support everyone. It&#8217;s part of our joy of doing so. It&#8217;s imperative to our business. But we also go out of our way not to pick winners. So when I invest in one of them, I invest in all of them.</p><p><strong>Dwarkesh Patel</strong></p><p>Why do you go out of your way not to pick winners?</p><p><strong>Jensen Huang</strong></p><p>Because it&#8217;s not our job to, number one. Number two, when Nvidia first started, there were 60 3D graphics companies. We are the only one that survived. 
If you would have taken those 60 graphics companies and asked yourself which one was going to make it, Nvidia would be at the top of that list not to make it.</p><p>This is long before you, but Nvidia&#8217;s graphics architecture was precisely wrong. It&#8217;s not a little bit wrong. <a href="https://en.wikipedia.org/wiki/NV1">We created an architecture that was precisely wrong</a>, and it was an impossible thing for developers to support. It was never going to make it. We reasoned about it from good first principles, but we ended up with the wrong solution. Everybody would have counted us out. And here we are.</p><p>So I have enough humility to recognize that. Don&#8217;t pick winners. Either let them all take care of themselves, or take care of all of them.</p><p><strong>Dwarkesh Patel</strong></p><p>One thing I didn&#8217;t understand is you said, &#8220;Look, we&#8217;re not prioritizing these neoclouds just because they are neoclouds and we want to prop them up.&#8221; But you also listed a bunch of neoclouds and said they wouldn&#8217;t exist if it wasn&#8217;t for Nvidia. How are those two things compatible?</p><p><strong>Jensen Huang</strong></p><p>First of all, they need to want to exist, and they come to ask us for help. When they want to exist and they have a business plan, expertise, and the passion for it&#8230; They obviously have to have some capabilities themselves. But if, at the end of the day, they need some investment in order to get it off the ground, we would be there for them. But the sooner they get their flywheel going&#8230;</p><p>Your question was, &#8220;Do we want to be in the financing business?&#8221; The answer is no. There are people in the financing business, and we&#8217;d rather work with all the people in the financing business than be a financier ourselves. 
Our goal is to focus on what we do, keep our business model as simple as possible, and support our ecosystem.</p><p>When someone like OpenAI needs an investment of a $30 billion scale because it&#8217;s still before their IPO, and we deeply believe in them and I deeply believe that they&#8217;re going to be an&#8230; Well, they&#8217;re an extraordinary company already today. They&#8217;re going to be an incredible company. The world needs them to exist. The world wants them to exist. I want them to exist. They have the wind at their back. Let&#8217;s support them and let them scale. Those investments we&#8217;ll do because they need us to do it. But we&#8217;re not trying to do as much as possible. We&#8217;re trying to do as little as possible.</p><p><strong>Dwarkesh Patel</strong></p><p>This may be an obvious question, but we&#8217;ve lived many years in this situation where there&#8217;s a shortage of GPUs, and it&#8217;s grown now because models are getting better.</p><p><strong>Jensen Huang</strong></p><p>We have a shortage of GPUs.</p><p><strong>Dwarkesh Patel</strong></p><p>Yes. Nvidia is known for divvying up the scarce allocation, not just based on high bidder, but rather on, &#8220;Hey, we want to make sure that these neoclouds exist. Let&#8217;s give some to CoreWeave, let&#8217;s give some to <a href="https://www.crusoe.ai/cloud">Crusoe</a>, let&#8217;s give some to <a href="https://lambda.ai/">Lambda</a>.&#8221; Why is it good for Nvidia? First of all, would you agree with this characterization of fracturing the market?</p><p><strong>Jensen Huang</strong></p><p>No. No. Your premise is just wrong. We&#8217;re sufficiently mindful about these things. We&#8217;re very mindful about these things. First of all, if you don&#8217;t place a PO, all the talking in the world won&#8217;t make a difference. Until we get a PO, what are we going to do? 
So the first thing is, we work really hard with everybody to get a forecast done, because these things take a long time to build, and the data centers take a long time to build. We align ourselves with demand and supply and things like that through forecasting. Okay? That&#8217;s job number one.</p><p>Number two, we&#8217;ve tried to forecast with as many people as possible, but in the final analysis, you still have to place an order. Maybe, for whatever reason, you didn&#8217;t place your order. What can I do? At some point, first in, first out. But beyond that, if you&#8217;re not ready because your data center&#8217;s not ready, or certain components aren&#8217;t ready to enable you to stand up a data center, we might decide to serve another customer first. That&#8217;s just maximizing the throughput of our own factory. We might do some adjustments there.</p><p>Aside from that, the prioritization is first in, first out. You&#8217;ve got to place a PO. If you don&#8217;t place a PO&#8230; Now, of course, there are stories about that. For example, all of this kind of started from an <a href="https://fortune.com/2024/09/16/larry-ellison-elon-musk-begged-nvidias-jensen-huang-more-gpus-fancy-sushi-dinner/">article about Larry and Elon having dinner with me where they begged for GPUs</a>. That never happened. We absolutely had dinner. We absolutely had dinner, and it was a wonderful dinner. At no time did they beg for GPUs. They just had to place an order. Once they place an order, we do our best to get the capacity to them. We&#8217;re not complicated.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. So it sounds like there&#8217;s a queue, and then based on whether your data center is ready and when you place a purchase order, you get them at a certain time. But it still doesn&#8217;t sound like the highest bidder just gets it. 
Is there a reason to do it&#8230;?</p><p><strong>Jensen Huang</strong></p><p>We never do that.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay.</p><p><strong>Jensen Huang</strong></p><p>We never do.</p><p><strong>Dwarkesh Patel</strong></p><p>Why not just do high bidder?</p><p><strong>Jensen Huang</strong></p><p>Because it&#8217;s a bad business practice. You set your price and then people decide to buy it or not. I understand that others in the chip industry change their prices when demand is higher, but we just don&#8217;t. That&#8217;s just never been a practice of ours. You can count on us. I prefer to be dependable, to be the foundation of the industry. You don&#8217;t need to second-guess. If I quoted you a price, we quoted you a price. That&#8217;s it. If demand goes through the roof, so be it.</p><p><strong>Dwarkesh Patel</strong></p><p>On the other end, that&#8217;s why you have a productive relationship with TSMC, right?</p><p><strong>Jensen Huang</strong></p><p>Yeah, Nvidia&#8217;s been in business with them for, I guess, coming up on 30 years. Nvidia and TSMC don&#8217;t have a legal contract. There&#8217;s always some rough justice. Sometimes I&#8217;m right, sometimes I&#8217;m wrong. Sometimes I got a better deal, sometimes I got a worse deal. But overall, the relationship is incredible. I can completely trust them. I can completely depend on them.</p><p>One of the things you can count on with Nvidia is that this year, <a href="https://www.nvidia.com/en-us/data-center/technologies/rubin/">Vera Rubin</a> is going to be incredible. Next year, <a href="https://developer.nvidia.com/blog/nvidia-vera-rubin-pod-seven-chips-five-rack-scale-systems-one-ai-supercomputer/">Vera Rubin Ultra</a> will come. The year after that, <a href="https://en.wikipedia.org/wiki/Feynman_(microarchitecture)">Feynman</a> will come. And the year after that, I haven&#8217;t introduced the name yet. Every single year you can count on us. 
You&#8217;re going to have to go find another ASIC team in the world&#8212;pick your ASIC team&#8212;where you can say, &#8220;I can bet the farm, I can bet my entire business that you will be here for me every single year. Your token cost will decrease by an order of magnitude every single year. I can count on it like I can count on the clock.&#8221;</p><p>I just said something about TSMC. For no other foundry in history can you possibly say that. You can say that about Nvidia today. You can count on us every single year. If you would like to buy a billion dollars worth of AI factory compute, no problem. If you&#8217;d like to buy a hundred million dollars, no problem. You&#8217;d like to buy $10 million, or just one rack, not a problem. Or just one graphics card, okay, no problem. If you would like to place an order for a $100 billion of AI factory, no problem. We&#8217;re the only company in the world where you can say that today.</p><p>I can say that about TSMC as well. I want to buy one, buy 1 billion, no problem. We just have to go through the process of planning for it, and all the things that mature people do. So I think this ability for Nvidia to be the foundation of the world&#8217;s AI industry, this is a position that has taken us a couple of decades to arrive at. Enormous commitment, enormous dedication. The stability of our company, the consistency of our company, is really important.</p><h3>00:57:36 &#8211; Should we be selling AI chips to China?</h3><p><strong>Dwarkesh Patel</strong></p><p>Okay. I want to ask about China. I actually don&#8217;t know what I think about whether it&#8217;s good to sell chips to China or not, but I like to play devil&#8217;s advocate against my guests. 
So when <a href="https://www.dwarkesh.com/p/dario-amodei-2">Dario</a> was on, <a href="https://darioamodei.com/post/on-deepseek-and-export-controls">who supports export controls</a>, I asked him, why can&#8217;t America and China both have a country of geniuses in the datacenter? But since you&#8217;re on the opposite side, I&#8217;ll ask you in the opposite way.</p><p>One way to think about it is, Anthropic actually announced a couple days ago <a href="https://www.anthropic.com/glasswing">Mythos Preview</a>. This model Mythos, they&#8217;re not even releasing publicly <a href="https://red.anthropic.com/2026/mythos-preview/">because they say it has such cyber-offensive capabilities</a> that we don&#8217;t think the world is ready until we make sure these <a href="https://en.wikipedia.org/wiki/Zero-day_vulnerability">zero-days</a> are patched up. But they say it found thousands of high-severity vulnerabilities across every major operating system, every browser. It found one in <a href="https://en.wikipedia.org/wiki/OpenBSD">OpenBSD</a>, which is this operating system that&#8217;s been specifically designed to not have zero days. It found one that&#8217;s existed for 27 years.</p><p>So if Chinese companies and Chinese labs and the Chinese government had access to the AI chips to train a model like Claude Mythos with these cyber-offensive capabilities and run millions of instances of it with more compute, the question is, is that a threat to American companies, to American national security?</p><p><strong>Jensen Huang</strong></p><p>First of all, Mythos was trained on fairly mundane capacity, and a fairly mundane amount of it. By an extraordinary company. The amount of capacity and the type of compute it was trained on is abundantly available in China. So you just have to first realize that chips exist in China.</p><p>They manufacture 60% of the world&#8217;s mainstream chips, maybe more. It&#8217;s a very large industry for them. 
They have some of the world&#8217;s greatest computer scientists. As you know, most of the AI researchers in all of these AI labs are Chinese. They have 50% of the world&#8217;s AI researchers. So the question is, considering all the assets they already have&#8212;they have an abundance of energy, they have plenty of chips, they&#8217;ve got most of the AI researchers&#8212;if you&#8217;re worried about them, what is the best way to create a safe world?</p><p>Victimizing them, turning them into an enemy, likely isn&#8217;t the best answer. They are an adversary. We want the United States to win. But I think having a dialogue and having research dialogue is probably the safest thing to do. This is an area that is glaringly missing because of our current attitude about China as an adversary. It is essential that our AI researchers and their AI researchers are actually talking. It is essential that we try to both agree on what not to use the AI for.</p><p>With respect to finding bugs in software, of course, that&#8217;s what AI is supposed to do. Is it going to find bugs in a lot of software? Of course. There are lots and lots of bugs. There are lots of bugs in the AI software. That&#8217;s what AI is supposed to do, and I&#8217;m delighted that AI has reached a level where it could help us be so much more productive.</p><p>One of the things that is underemphasized is the richness of the ecosystem around cybersecurity, AI cybersecurity and AI security and AI privacy and AI safety. There&#8217;s a whole ecosystem of AI startups that are trying to create this future for us, where you have one AI agent that&#8217;s incredible, surrounded by thousands of AI agents, keeping it safe, keeping it secure. That future surely is going to happen.</p><p>The idea that you&#8217;re going to have an AI agent running around with nobody watching after it is kind of insane. We know very well that this ecosystem needs to thrive. It turns out this ecosystem needs open source. 
This ecosystem needs open models. They need open stacks so that all of these AI researchers and all these great computer scientists can go build AI systems that are as formidable and can keep AI safe. So one of the things that we need to make sure that we do is we keep the open source ecosystem vibrant. That can&#8217;t be ignored. A lot of that is coming out of China. We ought to not suffocate that.</p><p>With respect to China, of course we want the United States to have as much computing as possible. We&#8217;re limited by energy, but we&#8217;ve got a lot of people working on that. We&#8217;ve got to not make energy a bottleneck for our country. But what we also want is to make sure that all the AI developers in the world are developing on the American tech stack, and making the contributions, the advancements of AI&#8212;especially when it&#8217;s open source&#8212;available to the American ecosystem. It would be extremely foolish to create two ecosystems: the open source ecosystem, and it only runs on a foreign tech stack, and a closed ecosystem that runs on the American tech stack. I think that would be a horrible outcome for the United States.</p><p><strong>Dwarkesh Patel</strong></p><p>Since there are a lot of things, let me just triage the response. I think the concern, going back to the flop difference in the hacking, is yes, they have compute, but there&#8217;s some estimates that because they&#8217;re at <a href="https://en.wikipedia.org/wiki/7_nm_process">7nm</a>&#8212;they don&#8217;t have EUVs because of <a href="https://www.congress.gov/crs-product/R48642">chip-making export controls</a>&#8212;the amount of flops they&#8217;re able to actually produce, they have one tenth the amount of flops that the US has.</p><p>So with that, could they eventually train a model like Mythos? Yes. But the question is, because we have more flops, American labs are able to get to these levels of capabilities first. 
Because Anthropic got to it first, they say, &#8220;Okay, we&#8217;re going to hold onto it for a month while all these American companies, we&#8217;ll give them access to it. They&#8217;re going to patch up all their vulnerabilities, and now we release it.&#8221;</p><p>Furthermore, even if they train a model like this, the ability to deploy it at scale&#8230; If you had a cyber hacker, it&#8217;s much more dangerous if they have a million of them versus a thousand of them. So that inference compute really matters a lot. In fact, the fact that they have so many AI researchers who are so good is the thing that makes it so scary, because what is it that makes those engineer researchers more productive? It&#8217;s compute.</p><p>If you talk to any AI lab in America, they say the thing that&#8217;s bottlenecking them is compute. There are <a href="https://www.linkedin.com/pulse/exclusive-interview-founder-deepseek-lingxi-hu--z1hbf/">quotes</a> from the <a href="https://en.wikipedia.org/wiki/DeepSeek">DeepSeek</a> <a href="https://en.wikipedia.org/wiki/Liang_Wenfeng">founder</a>, or <a href="https://en.wikipedia.org/wiki/Qwen">Qwen</a> leadership or whatever. They say the thing they&#8217;re bottlenecked on is compute. So then the question is, isn&#8217;t it better that we get American companies, because they have more compute, to get to the Mythos-level capabilities first, prepare our society for it, before China can get to it, because they have less compute?</p><p><strong>Jensen Huang</strong></p><p>We should always be first and we should always have more. But in order for that outcome you described to be true, you have to take it to the extremes. They have to have no compute. If they have some compute, the question is how much is needed?</p><p>The amount of compute they have in China is enormous. You&#8217;re talking about the country that is the second largest computing market in the world. 
If they want to aggregate their compute, they&#8217;ve got plenty of compute to aggregate.</p><p><strong>Dwarkesh Patel</strong></p><p>But is that true? People do these estimates and they&#8217;re like, &#8220;<a href="https://en.wikipedia.org/wiki/Semiconductor_Manufacturing_International_Corporation">SMIC</a> is actually behind on the process nodes.&#8221;</p><p><strong>Jensen Huang</strong></p><p>I&#8217;m about to tell you.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay.</p><p><strong>Jensen Huang</strong></p><p>The amount of energy they have is incredible. Isn&#8217;t that right? AI is a parallel computing problem, isn&#8217;t it? Why can&#8217;t they just put 4x, 10x, as many chips together because energy&#8217;s free? They have so much energy. They have datacenters that are sitting completely empty, fully powered. You know they have ghost cities, they have ghost datacenters too. They have so much infrastructure capacity. If they wanted to, they just gang up more chips, even if they&#8217;re 7nm.</p><p>Their capacity of building chips is one of the largest in the world. The semiconductor industry knows that they monopolize mainstream chips. They have over-capacity, they have too much capacity. So the idea that China won&#8217;t be able to have AI chips is completely nonsense.</p><p>Now, of course, if you ask me, would the United States be further ahead if the entire world had no compute at all? But that&#8217;s just not an outcome. That&#8217;s not a scenario that&#8217;s true. They have plenty of compute already. The amount of threshold they need for the concern you&#8217;re worried about, they&#8217;ve already reached that threshold and beyond.</p><p>So I think you misunderstand that AI is a five-layer cake, and at the lowest layer is energy. When you have an abundance of energy, it makes up for chips. If you have an abundance of chips, it makes up for energy. 
For example, the United States is scarce on energy, which is the reason why Nvidia has to keep advancing our architecture and do this extreme co-design so that with the few chips that we ship&#8212;with the few chips, because the amount of energy is so limited&#8212;our throughput per watt is off the charts.</p><p>But if your amount of watts is completely abundant, it&#8217;s free, what do you care about performance per watt for? You get plenty. You can use old chips to do. So 7nm chips are essentially Hopper. The ability for Hopper&#8230; I&#8217;ve got to tell you, today&#8217;s models are largely trained on Hopper, Hopper generation. So 7nm chips are plenty good. The abundance of energy is their advantage.</p><p><strong>Dwarkesh Patel</strong></p><p>But then there&#8217;s a question of whether they can actually manufacture enough chips.</p><p><strong>Jensen Huang</strong></p><p>But they do. What&#8217;s the evidence? Huawei just had the largest single year in the history of their company.</p><p><strong>Dwarkesh Patel</strong></p><p>How many chips did they ship?</p><p><strong>Jensen Huang</strong></p><p>A ton. Millions. Millions is way more than Anthropic has.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s a question of how much logic SMIC can ship, and there&#8217;s a question of how much memory&#8212;</p><p><strong>Jensen Huang</strong></p><p>I&#8217;m telling you what it is. They have plenty of logic, and they have plenty of HBM2 memory.</p><p><strong>Dwarkesh Patel</strong></p><p>Right. But as you know, the bottleneck often in training and doing inference on these models is the amount of bandwidth. 
So if you have HBM2&#8230; I don&#8217;t know the numbers offhand but versus the newest thing you have, there could be almost an order of magnitude difference in memory bandwidth, which is huge.</p><p><strong>Jensen Huang</strong></p><p><a href="https://en.wikipedia.org/wiki/Huawei">Huawei</a> is a networking company.</p><p><strong>Dwarkesh Patel</strong></p><p>But that doesn&#8217;t change the fact that you need EUV for the most advanced HBM.</p><p><strong>Jensen Huang</strong></p><p>Not true. Not at all true. You could gang them together, just like we gang them together with <a href="https://www.nvidia.com/en-us/data-center/gb200-nvl72/">NVL72</a>. They&#8217;ve already demonstrated silicon photonics, connecting all of this compute together into one giant supercomputer. Your premise is just wrong.</p><p>The fact of the matter is, their AI development is going just fine. The best AI researchers in the world, because they&#8217;re limited in compute, they also come up with extremely smart algorithms. Remember, I just said that Moore&#8217;s law is advancing about 25% per year. However, through great computer science, we could still improve algorithm performance by 10x. What I&#8217;m saying is that great computer science is where the lever is.</p><p>There is no question, MoE is a great invention. There&#8217;s no question, all the incredible attention mechanisms reduce the amount of compute. We have got to acknowledge that most of the advances in AI came out of algorithm advances, not just the raw hardware. Now, if most advances came from algorithms and computer science and programming, tell me that their army of AI researchers is not their fundamental advantage. We see it. DeepSeek is not an inconsequential advance. The day that DeepSeek comes out on Huawei first, that is a horrible outcome for our nation.</p><p><strong>Dwarkesh Patel</strong></p><p>Why is that? 
Because currently you can have a model like DeepSeek that can run on any accelerator, if it&#8217;s open source. Why would that stop being the case in the future?</p><p><strong>Jensen Huang</strong></p><p>Suppose it doesn&#8217;t. Suppose it&#8217;s optimized for Huawei, suppose it&#8217;s optimized for their architecture. It would put ours at a disadvantage. You described a situation that I perceive to be good news. A company developed software, developed an AI model, and it runs best on the American tech stack. I saw that as good news. You set it up as a premise that it was bad news. I&#8217;m going to give you the bad news, that AI models around the world are developed and they run best on non-American hardware. That is bad news for us.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess I just don&#8217;t see the evidence that there&#8217;s these huge disparities that would prevent you from switching accelerators. American labs are running their models across all the clouds, across all the different accelerators&#8212;</p><p><strong>Jensen Huang</strong></p><p>I am the evidence. You take a model that&#8217;s optimized for Nvidia and you try to run it on something else.</p><p><strong>Dwarkesh Patel</strong></p><p>But American labs do that.</p><p><strong>Jensen Huang</strong></p><p>And they don&#8217;t run better. Nvidia&#8217;s success is perfect evidence. The fact that AI models are created on our stack, run best on our stack, how is that illogical to understand?</p><p><strong>Dwarkesh Patel</strong></p><p>Anthropic&#8217;s models are run on GPUs, they&#8217;re run on Trainium, they&#8217;re run on TPUs.</p><p><strong>Jensen Huang</strong></p><p>A lot of work has to go into it to change. But go to the global south, go to the Middle East. 
Coming out of the box, if all of the AI models run best on somebody else&#8217;s tech stack, you&#8217;ve got to be arguing some ridiculous claim right now that that&#8217;s a good thing for the United States.</p><p><strong>Dwarkesh Patel</strong></p><p>But I guess I don&#8217;t understand the argument. Say Chinese companies get to the next Mythos first. They find all the security vulnerabilities in American software first, but they can do it on Nvidia hardware and they ship it to the global south. They do it on Nvidia hardware. How is that good? Okay, it runs on Nvidia hardware&#8212;</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s not good. It&#8217;s not good.</p><p><strong>Dwarkesh Patel</strong></p><p>Right.</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s not good. So let&#8217;s not let it happen.</p><p><strong>Dwarkesh Patel</strong></p><p>Why do you think it&#8217;s perfectly fungible, that if you didn&#8217;t ship them compute it would exactly be replaced by Huawei? They are behind, right? They have worse chips than you.</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s completely&#8230; There&#8217;s evidence right now. Their chip industry&#8217;s gigantic.</p><p><strong>Dwarkesh Patel</strong></p><p>You can just look at the flop or bandwidth or memory comparisons between the H200 and the <a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseek-research-suggests-huaweis-ascend-910c-delivers-60-percent-nvidia-h100-inference-performance">Huawei 910C</a>. It&#8217;s like half to a third.</p><p><strong>Jensen Huang</strong></p><p>They use more of it. They use twice as many.</p><p><strong>Dwarkesh Patel</strong></p><p>It seems like your argument is they have all this energy that&#8217;s ready to go, right? 
And they need to fill it with chips.</p><p><strong>Jensen Huang</strong></p><p>And they&#8217;re good at manufacturing.</p><p><strong>Dwarkesh Patel</strong></p><p>And I&#8217;m sure eventually they would be able to just out-manufacture everybody. But there are these few critical years.</p><p><strong>Jensen Huang</strong></p><p>What is the critical year you&#8217;re talking about?</p><p><strong>Dwarkesh Patel</strong></p><p>These next few years. We&#8217;ve got these models that are going to be able to do all the cyber attacks.</p><p><strong>Jensen Huang</strong></p><p>In that case, if the next years are critical, then we have to make sure that all of the world&#8217;s AI models are built on the American tech stack, in these critical years.</p><p><strong>Dwarkesh Patel</strong></p><p>If they&#8217;re built on the American tech stack, how would that prevent them, if they have more advanced capabilities, from launching the Mythos-equivalent cyber attacks?</p><p><strong>Jensen Huang</strong></p><p>There&#8217;s no guarantee either way.</p><p><strong>Dwarkesh Patel</strong></p><p>But if you have it early, we can prepare for it.</p><p><strong>Jensen Huang</strong></p><p>Listen, why are you causing one layer of the AI industry to lose an entire market so that you could benefit another layer of the AI industry? There are five layers and every single layer has to succeed. The layer that has to succeed most is actually the AI applications. Why are you so fixated on that AI model? That one company? For what reason?</p><p><strong>Dwarkesh Patel</strong></p><p>Because those models make possible these incredibly offensive capabilities, and you need compute to run them.</p><p><strong>Jensen Huang</strong></p><p>The energy, the chips, and the ecosystem of AI researchers make it possible.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, stepping back, it has to be the case that China is able to build enough 7nm capacity. 
And remember, they&#8217;re still stuck on 7nm while you&#8217;ll move on to 3nm and then 2nm or 1.6nm with Feynman. So while you&#8217;re on 1.6nm, they&#8217;re still going to be on 7nm, and they have to produce enough of it to make up for the shortfall. They have so much energy that the more chips you give them, the more compute they&#8217;d have. So it comes out as a question of, ultimately they are getting more compute. Compute is an input to training and inference&#8212;</p><p><strong>Jensen Huang</strong></p><p>Listen, I just think you speak in absolutes. I think the United States ought to be ahead. The amount of compute in the United States is 100x more than anywhere else in the world. The United States ought to be ahead. Okay. The United States is ahead.</p><p>Nvidia builds the most advanced technologies. We make sure that the US labs are the first to hear about it and have the first chance to buy it. And if they don&#8217;t have enough money, we even invest in them. The United States ought to be ahead. We want to do everything we can to make sure the United States is ahead. Number one point, do you agree? We&#8217;re doing everything we can to do that.</p><p><strong>Dwarkesh Patel</strong></p><p>But how is shipping chips to China keeping the US ahead if they&#8217;re bottlenecked on compute?</p><p><strong>Jensen Huang</strong></p><p>No, no. We&#8217;ve got Vera Rubin for the United States. We have Vera Rubin for the United States. Now, am I in the United States? Do you consider me part of the United States?</p><p><strong>Dwarkesh Patel</strong></p><p>Yes.</p><p><strong>Jensen Huang</strong></p><p>Nvidia. You consider Nvidia a United States company? Okay. Number one, why is it that we don&#8217;t come up with a regulation that&#8217;s more balanced so that Nvidia can win around the world instead of giving up the world? Why would you want the United States to give up the world?</p><p>The chip industry is part of the American ecosystem. 
It&#8217;s part of American technology leadership. It&#8217;s part of the AI ecosystem. It&#8217;s part of AI leadership. Why is it that your policy, your philosophy, leads to the United States giving up a vast part of the world&#8217;s market?</p><p><strong>Dwarkesh Patel</strong></p><p>I guess the claim here is&#8230; Dario had this <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">quote</a> where he said that it&#8217;s like Boeing bragging that we&#8217;re selling North Korea nukes, but the missile casings are made by Boeing. And that&#8217;s somehow enabling the US technology stack. Fundamentally, you&#8217;re giving them this capability.</p><p><strong>Jensen Huang</strong></p><p>Comparing AI to anything that you just mentioned is lunacy.</p><p><strong>Dwarkesh Patel</strong></p><p>But AI is similar to enriched uranium, right? It can have positive uses, it can have negative uses. We still don&#8217;t want to send enriched uranium to other countries.</p><p><strong>Jensen Huang</strong></p><p>Who&#8217;s sending enriched&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>The analogy is that enriched uranium is like compute.</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s a lousy analogy. It&#8217;s an illogical analogy.</p><p><strong>Dwarkesh Patel</strong></p><p>But if that compute can run a model that can do zero-day exploits against all American software, how is that not a weapon?</p><p><strong>Jensen Huang</strong></p><p>First of all, the way to solve that problem is to have dialogues with the researchers and dialogues with China, and dialogues with all the countries to make sure that people don&#8217;t use technology in that way. That&#8217;s a dialogue that has to happen. Okay? Number one.</p><p>Number two, we also need to make sure that the United States is ahead, that Vera Rubin, Blackwell, is available in the United States in abundance, mountains of it. Obviously, our results would show it. Abundance, tons of it. 
The amount of computing we have is great. We have amazing AI researchers here. It&#8217;s great. We ought to stay ahead.</p><p>However, we also have to recognize that AI is not just a model. AI is a five-layer cake. The AI industry matters across every single layer, and we want the United States to win at every single layer, including the chip layer. Conceding the entire market is not going to allow the United States to win the technology race long-term in the chip layer, in the computing stack. That is just a fact.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess then the crux comes down to, how does selling them chips now help us win in the long term? Tesla sold extremely good electric vehicles to China for a long time. iPhones are sold in China, extremely good. They didn&#8217;t cause them lock-in. China will still make their version of EVs and they&#8217;re dominating. Their smartphones are dominating.</p><p><strong>Jensen Huang</strong></p><p>When we started the conversation today, you acknowledged that Nvidia&#8217;s position is very different. You used words like moat. The single most important thing to our company is the richness of our ecosystem, which is about developers. 50% of the AI developers are in China. The United States should not give that up.</p><p><strong>Dwarkesh Patel</strong></p><p>But we have a lot of Nvidia developers in the US, and that doesn&#8217;t prevent American labs from also being able to use other accelerators in the future. In fact, right now they&#8217;re using other accelerators as well, which is fine and great. I don&#8217;t see why that wouldn&#8217;t be the case in China as well, if you sell them Nvidia chips, just the same way that Google can use TPUs and Nvidia&#8212;</p><p><strong>Jensen Huang</strong></p><p>We have to keep innovating and, as you probably know, our share is growing, not decreasing. 
The premise that even if we competed in China, that we&#8217;re going to lose that market anyways&#8230; You&#8217;re not talking to somebody who woke up a loser. That loser attitude, that loser premise makes no sense to me.</p><p>We&#8217;re not a car. We are not a car. The fact that I can buy this car brand one day and use another car brand another day, easy. Computing is not like that. There&#8217;s a reason why the <a href="https://nvidianews.nvidia.com/news/nvidia-and-intel-to-develop-ai-infrastructure-and-personal-computing-products">x86</a> deal exists. There&#8217;s a reason why <a href="https://en.wikipedia.org/wiki/Arm_Holdings">ARM</a> is so sticky. These ecosystems are hard to replace. It costs an enormous amount of time and energy, and most people don&#8217;t want to do it. So it&#8217;s our job to continue to nurture that ecosystem, to keep advancing the technology so that we can compete in the marketplace.</p><p>Conceding a marketplace based on the premise you described, I simply can&#8217;t acknowledge that. It makes no sense. Because I don&#8217;t think the United States is a loser. Our industry is not a loser. That losing proposition, that losing mindset, makes no sense to me.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. I&#8217;ll move on. I just want to make sure that&#8212;</p><p><strong>Jensen Huang</strong></p><p>You don&#8217;t have to move on. I&#8217;m enjoying it.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, great. Then I won&#8217;t. I appreciate that. But I think maybe the crux&#8230; and thanks for walking around the circles with me, because I think it helps bring out what the crux here is.</p><p><strong>Jensen Huang</strong></p><p>The crux is you&#8217;re going to extremes. Your argument starts from extremes. 
That if we give them any compute at all in this narrow moment, we will lose everything.</p><p><strong>Dwarkesh Patel</strong></p><p>No, I think what my argument is&#8212;</p><p><strong>Jensen Huang</strong></p><p>Those extremes, they&#8217;re childish.</p><p><strong>Dwarkesh Patel</strong></p><p>Let me just make my argument for myself. The idea is not that there is some key threshold of compute. It&#8217;s that any marginal compute is helpful. So if you have more compute, you can train a better model.</p><p><strong>Jensen Huang</strong></p><p>And I just want you to acknowledge that any marginal sales for the American technology industry is beneficial.</p><p><strong>Dwarkesh Patel</strong></p><p>I actually don&#8217;t&#8230; If the AI models that run on those chips are capable of cyber offensive capabilities, or the chips are training models with cyber capabilities and running more instances of those models, it is not a nuclear weapon, but it enables a weapon of a kind.</p><p><strong>Jensen Huang</strong></p><p>The logic that you use, you might as well say it to microprocessors and <a href="https://en.wikipedia.org/wiki/Dynamic_random-access_memory">DRAMs</a>. You might as well say it to electricity.</p><p><strong>Dwarkesh Patel</strong></p><p>But in fact we do have export controls on the technology that is relevant to making the most advanced DRAM. We have all kinds of export controls on China for all kinds of chip-making stuff.</p><p><strong>Jensen Huang</strong></p><p>We sell a lot of DRAM and CPUs into China, and I think it&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess this goes back to the fundamental question of, is AI different? If you have the kind of technology where they can find these zero-days in software, is that something where we want to minimize China&#8217;s ability to get there first, to deploy it widely?</p><p><strong>Jensen Huang</strong></p><p>We want the United States to be ahead. 
We can control that.</p><p><strong>Dwarkesh Patel</strong></p><p>How do we control that if the chips are already there and they&#8217;re using them to train that model?</p><p><strong>Jensen Huang</strong></p><p>We have tons of compute. We have tons of AI researchers. We&#8217;re racing as fast as we can.</p><p><strong>Dwarkesh Patel</strong></p><p>Again, we have more nuclear weapons than anybody else, but we don&#8217;t want to send enriched uranium anywhere.</p><p><strong>Jensen Huang</strong></p><p>We&#8217;re not enriched uranium. It&#8217;s a chip, and it&#8217;s a chip that they can make themselves.</p><p><strong>Dwarkesh Patel</strong></p><p>But there&#8217;s a reason they&#8217;re buying it from you. We have quotes from the founders of Chinese companies that say that they&#8217;re bottlenecked on compute.</p><p><strong>Jensen Huang</strong></p><p>Because our chips are better. On balance, our chips are better. There&#8217;s just no question about it. In the absence of our chip&#8230; Can you acknowledge that Huawei had a record year? Can you acknowledge that a whole bunch of chip companies have gone public? Can you acknowledge that?</p><p><strong>Dwarkesh Patel</strong></p><p>Yes.</p><p><strong>Jensen Huang</strong></p><p>Can you also acknowledge that we used to have a very large share in that market, and we no longer have a large share in that market? We can also acknowledge that China is about 40% of the world&#8217;s technology industry. To concede that market for the United States technology industry is a disservice to our country. It is a disservice to our national security. It is a disservice to our technology leadership, all for the benefit of one company. It makes no sense to me.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess I&#8217;m confused. It feels like you&#8217;re making two different statements. 
One is that we&#8217;re going to win this competition with Huawei because our chips are going to be way better if we&#8217;re allowed to compete. Another is that they would be doing the same exact thing without us anyway. How can both of those things be true at the same time?</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s obviously true. In the absence of a better choice, you&#8217;ll take the only choice you have. How is that illogical? It&#8217;s so logical.</p><p><strong>Dwarkesh Patel</strong></p><p>The reason they want Nvidia chips is that they&#8217;re better.</p><p><strong>Jensen Huang</strong></p><p>Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>Better is more compute. More compute means you can train a better model.</p><p><strong>Jensen Huang</strong></p><p>No, it&#8217;s just better. It&#8217;s better because it&#8217;s easier to program. We have a better ecosystem. But whatever the better is, whatever the better is&#8230; And of course we&#8217;re going to send them compute. So what? The fact of the matter is that we get to benefit. Don&#8217;t forget, we get the benefit of American technology leadership. We get the benefit of developers working on the American tech stack. We get the benefit, as those AI models diffuse out into the rest of the world, that the American tech stack is therefore the best for it. We can continue to advance and diffuse American technology. That, I believe, is a positive. It&#8217;s a very important part of American technology leadership.</p><p>Now, the policies that you&#8217;re advocating <a href="https://americanaffairsjournal.org/2020/08/who-lost-lucent-the-decline-of-americas-telecom-equipment-industry/">resulted in the American telecommunications industry being policied out of basically the world</a>, to the point where we don&#8217;t control our own telecommunications anymore. I don&#8217;t see that as smart. 
It&#8217;s a little narrow-minded, and it led to unintended consequences that I&#8217;m describing to you right now that you seem to have a very hard time understanding.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, let&#8217;s just step back. It seems like the crux here is there&#8217;s a potential benefit and there&#8217;s a potential cost. What we&#8217;re trying to figure out is, is the benefit worth the cost? I guess I&#8217;m trying to get you to acknowledge the potential cost. Compute is an input to training powerful models. Powerful models do have powerful offensive capabilities, like cyber attacks. It is a good thing that American companies got to Mythos-level capabilities first, and then now they&#8217;re going to hold off on those capabilities so that the American companies and American government can make their software more protected before that level of capability was announced.</p><p>If China had had more compute or more crowd compute, if they could have made a Mythos-level model earlier and deployed it widely, that would have been very bad. One of the reasons that hasn&#8217;t happened is that we have more compute thanks to companies like Nvidia in America. That is a cost of sending it to China. So let&#8217;s leave the benefit aside for a second. Do you acknowledge that this is a potential cost?</p><p><strong>Jensen Huang</strong></p><p>I&#8217;ll also tell you the potential cost is we allow one of the most important layers of the AI stack, the chip layer, to concede an entire market&#8212;the second largest market in the world&#8212;so that they could develop scale, so that they could develop their own ecosystem, so that future AI models are optimized in a very different way than the American tech stack. 
As AI diffuses out into the rest of the world, their standards, their tech stack, will become superior to ours, because their models are open.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess I just believe enough in Nvidia&#8217;s kernel engineers and CUDA engineers to think that they could optimize&#8212;</p><p><strong>Jensen Huang</strong></p><p>AI is more than kernel optimization, as you know.</p><p><strong>Dwarkesh Patel</strong></p><p>Of course, but there are so many things you can do, from <a href="https://en.wikipedia.org/wiki/Knowledge_distillation">distilling</a> to a model that&#8217;s well-fit for your chips.</p><p><strong>Jensen Huang</strong></p><p>We&#8217;re going to do our best.</p><p><strong>Dwarkesh Patel</strong></p><p>You have all the software. It&#8217;s just hard to imagine that there&#8217;s a long-term lock-in to the Chinese ecosystem, even if they have a slightly better open source model for a while.</p><p><strong>Jensen Huang</strong></p><p>China is the largest contributor to open source software in the world. Fact. China&#8217;s the largest contributor to open models in the world. Fact. Today it&#8217;s built on the American tech stack, Nvidia&#8217;s. Fact.</p><p>All five layers of the tech stack for AI are important. The United States ought to go win all five of them. They&#8217;re all important. The one that is the most important, of course, is the AI application layer. The layer that diffuses into society, the one that uses it most will benefit from this industrial revolution most. But my point is that every layer has to succeed.</p><p>If we scare this country into thinking that AI is somehow a nuclear bomb, so that everybody hates AI and everybody&#8217;s afraid of AI, I don&#8217;t know how you&#8217;re helping the United States. You&#8217;re doing it a disservice. 
If we scare everybody out of doing software engineering jobs because it&#8217;s going to kill every software engineering job&#8212;and we don&#8217;t have any software engineers as a result of that&#8212;we&#8217;re doing a disservice to the United States.</p><p>If we scare everybody out of radiology so nobody wants to be a radiologist because computer vision is completely free and no AI is going to do a worse job than a radiologist, we misunderstand the difference between a job and a task. The job of a radiologist is patient care. The task is to read a scan. If we misunderstand that so profoundly and we scare everybody out of going to radiology school, we&#8217;re not going to have enough radiologists and good enough healthcare.</p><p>So I&#8217;m making the case that when you make a premise that is so extreme, everything goes from zero or infinity, we end up scaring people in a way that&#8217;s just not true. Life is not like that. Do we want the United States to be first? Of course we do. Do we need to be a leader in every layer of that stack? Of course we do. Of course we do. Today you&#8217;re talking about Mythos because Mythos is important. Sure. That&#8217;s fantastic.</p><p>But in a few years time, I&#8217;m making you the prediction that when we want the American tech stack, when we want American technology to be diffused around the world&#8212;out to India, out to the Middle East, out to Africa, out to Southeast Asia&#8212;when our country would like to export, because we would like to export our technology, we would like to export our standards, on that day, I want you and I to have that same conversation again. I will tell you exactly about today&#8217;s conversation, about how your policy and what you imagined literally caused the United States to concede the second largest market in the world for no good reason at all.</p><p>We shouldn&#8217;t concede it. If we lose it, we lose it. But why do we concede it? Now nobody is advocating an all or nothing. 
Nobody&#8217;s advocating all or nothing, meaning we ship everything to China at all times. Nobody&#8217;s advocating that. We should always have the best technology here. We should always have the most technology here, and the first. But we should also try to compete and win around the world. Both of those things can simultaneously happen. It requires some amount of nuance, some amount of maturity instead of absolutes. The world is just not absolutes.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. The argument hinges on this. They&#8217;ve built models that are specified for the best chips that they make in a few years. Those chips get exported around the world. That sets the standard. Because of EUV export controls, as we said, you&#8217;re going to move on to 1.6nm. They&#8217;re still going to be on 7nm, even after a few years from now.</p><p>It may make sense that domestically they would prefer, &#8220;Hey, we&#8217;ve got so much energy, we can manufacture at scale. We&#8217;ll still keep using 7nm.&#8221; But on the exporting thing, their 7nm chips have to be competitive against your 1.6nm chips. Their models have to be so far optimized for the 7nm that it&#8217;s better to run their models on 7nm than to run their models on your 1.6nm.</p><p><strong>Jensen Huang</strong></p><p>Can we just look at the facts then? Is Blackwell 50 times more advanced lithography than Hopper? Is it 50 times? Not even close. I just kept saying it over and over again. Moore&#8217;s Law is dead. Between Hopper and Blackwell, from the transistors themselves, call it 75%. It was three years apart, 75%. Blackwell is 50 times Hopper.</p><p>My point is, architecture matters. Computer science matters. Semiconductor physics matters as well, but computer science matters. The impact of AI largely comes from the computing stack, which is the reason why CUDA is so effective, which is the reason why CUDA is so beloved. 
It&#8217;s an ecosystem, a computing architecture that allows for so much flexibility that if you wanted to change an architecture completely&#8212;create something like MoE, create something like diffusion, create something that&#8217;s disaggregated&#8212;you could do so. It&#8217;s easy to do.</p><p>So the fact of the matter is, AI is about the stack above as much as it is about the architecture below. To the extent that we have architectures and software stacks that are optimized for our stack, for our ecosystem, it is obviously good, because we started the conversation today about how Nvidia&#8217;s ecosystem is so rich. Why do people always love programming CUDA first? They do. They do. So do the researchers in China.</p><p>But if we are forced to leave China, if we&#8217;re forced to leave China, first of all, it&#8217;s a policy mistake. Obviously it has backlash. It has turned out badly for the United States. It enabled, it accelerated their chip industry. It forced all of their AI ecosystem to focus on their internal architectures. It&#8217;s not too late, but nonetheless it has already happened.</p><p>You&#8217;re going to see in the future, they&#8217;re not stuck at 7nm, obviously. They&#8217;re good at manufacturing. They will continue to advance from 7nm and beyond. Now, is there a 10x difference between 5nm and 7nm? The answer is no. Architecture matters. Networking matters. That&#8217;s why <a href="https://www.wsj.com/articles/nvidia-to-acquire-mellanox-for-about-7-billion-11552304615">Nvidia bought Mellanox</a>. Networking matters. Energy matters. So all of that stuff matters. It&#8217;s not simplistic, like the way you&#8217;re trying to distill it.</p><h3>01:35:06 &#8211; Why doesn&#8217;t Nvidia make multiple different chip architectures?</h3><p><strong>Dwarkesh Patel</strong></p><p>We can move on from China, but that actually raises an interesting question. 
We were discussing earlier these bottlenecks at TSMC and memory and so forth.</p><p>So if we&#8217;re in this world where you&#8217;re already the majority of N3&#8212;and at some point you&#8217;ll be N2 and you&#8217;ll be a majority of that&#8212;do you see that you could go back to N7, the spare capacity at an older process node, and say, &#8220;Hey, the demand for AI is so great and our capacity to expand the leading edge is not meeting it, so we&#8217;re going to make a Hopper or <a href="https://www.nvidia.com/en-us/data-center/ampere-architecture/">Ampere</a>, but with everything we know about numerics today and all the other improvements you described&#8221;? Do you see that world happening before 2030?</p><p><strong>Jensen Huang</strong></p><p>It&#8217;s not necessary to. The reason for that is because with every generation, the architecture is more than just the transistor scale. You&#8217;re doing so much engineering and packaging and stacking, and the numerics and the system architecture.</p><p>When you run out of capacity, to easily go back to another node&#8230; That&#8217;s a level of R&amp;D that no one could afford. We could afford to lean forward. I don&#8217;t think we could afford to go back. Now, if the world simply says&#8230; If on that day, let&#8217;s do the thought experiment, on that day we go, &#8220;Listen, we&#8217;re just never going to have more capacity ever again.&#8221; Would I go back and use 7nm? In a heartbeat, of course I would.</p><p><strong>Dwarkesh Patel</strong></p><p>One question somebody I was talking to had is, why doesn&#8217;t Nvidia run multiple different chip projects at the same time with totally different architecture? So you could do something like a <a href="https://en.wikipedia.org/wiki/Cerebras">Cerebras</a>-style wafer scale. You could do a <a href="https://en.wikipedia.org/wiki/Tesla_Dojo">Dojo</a>-style huge package. You could do one without CUDA. 
You have the resources and the engineering talent to do all of these in parallel. So why put all the eggs in one basket, given who knows where AI might go and architectures might go?</p><p><strong>Jensen Huang</strong></p><p>Oh, we could. It&#8217;s just that we don&#8217;t have a better idea. We could do all of those things. It&#8217;s just not better. We simulate it all in our simulator, provably worse. So we wouldn&#8217;t do it. We&#8217;re working on exactly the projects that we want to work on. If the workload were to change dramatically&#8212;and I don&#8217;t mean the algorithms, I actually mean the workload, and that depends on the shape of the market&#8212;we may decide to add other accelerators.</p><p>For example, recently we added <a href="https://www.wsj.com/tech/ai/nvidia-licenses-ai-inference-technology-from-chip-startup-groq-0a405adb">Groq</a>, and we&#8217;re going to fold Groq into our CUDA ecosystem. We&#8217;re doing that now because the value of tokens has gone up so high that you could have different pricing of tokens. Back in the old days, just a couple years ago, tokens were either free or barely expensive. But now you can have different customers, and those customers want different answers. Because the customers make so much money&#8212;for example, our software engineers&#8212;if I can give them much more responsive tokens so that they&#8217;re even more productive than they are today, I would pay for it.</p><p>But that market has only recently emerged. So I think we now have the ability to have the same model, based on the response time, have different segments. That&#8217;s the reason why we decided to expand the Pareto frontier and create a segment of inference that is faster response time, even though it&#8217;s lower throughput. Until now, higher throughput is always better. 
We think there could be a world where there could be very high ASP tokens, and even though the throughput is lower in the factory, the ASPs make up for it.</p><p>That&#8217;s the reason why we did it. But otherwise, from an architecture perspective, if I had more money, I would put more behind Nvidia&#8217;s architecture.</p><p><strong>Dwarkesh Patel</strong></p><p>I think this idea of extremely premium tokens and just the disaggregation of the inference market is very interesting.</p><p><strong>Jensen Huang</strong></p><p>The segmentation of it.</p><p><strong>Dwarkesh Patel</strong></p><p>Yeah. Alright, final question. Suppose the deep learning revolution didn&#8217;t happen. What would Nvidia be doing? Obviously games, but given&#8212;</p><p><strong>Jensen Huang</strong></p><p>Accelerated computing, the same thing we&#8217;ve been doing all along. The premise of our company is that Moore&#8217;s law is going to&#8230; General purpose computing is good for a lot of things, but for a lot of computation it&#8217;s not ideal.</p><p>So we combined an architecture called a GPU, CUDA, to a CPU, so that we can accelerate the workload of the CPU. Different kernels of code or algorithms could be offloaded onto our GPU. As a result, you speed up an application by 100x, 200x. Where can you use that? Obviously engineering and science and physics, data processing, computer graphics, image generation, all kinds of things. Even if AI doesn&#8217;t exist today, Nvidia would be very, very large.</p><p>The reason for that is fairly fundamental, which is that the ability for general purpose computing to continue to scale has largely run its course. And the only way&#8230; Not the only way, but the way to do that is through domain-specific acceleration. One of the domains that we started with was computer graphics, but there are many other domains. There&#8217;s all kinds. 
Particle physics and fluids, structured data processing, all kinds of different types of algorithms that benefit from CUDA.</p><p>Our mission was really to bring accelerated computing to the world and advance the type of applications that general purpose computing can&#8217;t do, and scale to the level of capability that helps break through certain fields of science. Some of the early applications were molecular dynamics, seismic processing for energy discovery, image processing of course, all of those kinds of fields where general purpose computing is just simply too inefficient to do so.</p><p>If there were no AI, I would be very sad. But because of the advances that we made in computing, we democratized deep learning. We made it possible for any researcher, any scientist, anywhere, any student, to be able to access a PC or a GeForce add-in card and do amazing science. That fundamental promise hasn&#8217;t changed, not even a little bit.</p><p>If you watch GTC, there&#8217;s the whole beginning part of it. None of it&#8217;s AI. That whole part of it with computational lithography or our quantum chemistry work, data processing work, all of that stuff is unrelated to AI. And it&#8217;s still very important. I know that AI is very interesting and quite exciting, but there&#8217;s a lot of people doing a lot of very important work that&#8217;s not AI related, and tensors are not the only way that you compute it. We want to help everybody.</p><p><strong>Dwarkesh Patel</strong></p><p>Jensen, thank you so much.</p><p><strong>Jensen Huang</strong></p><p>You&#8217;re welcome. 
I enjoyed it.</p><p><strong>Dwarkesh Patel</strong></p><p>Me too.</p>]]></content:encoded></item><item><title><![CDATA[What I learned this week - Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL]]></title><description><![CDATA[April 15, 2025]]></description><link>https://www.dwarkesh.com/p/what-i-learned-april-15</link><guid isPermaLink="false">https://www.dwarkesh.com/p/what-i-learned-april-15</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Wed, 15 Apr 2026 14:03:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QpJ5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>At the end of <a href="https://www.dwarkesh.com/p/michael-nielsen">my conversation with Michael Nielsen</a>, we talked about how to actually retain what you learn. Michael&#8217;s advice was to make some kind of demanding artifact. Write something up. Try to explain it. So in that spirit, here are notes on some topics I&#8217;ve learned about over the last week or two. These notes are extremely rough, and have many mistakes.</p><h3>Can distillation be stopped?</h3><p>Can the frontier labs stop distillation? Because if they can&#8217;t, open source commoditizing models can catch up incredibly rapidly, making the long run business model for the labs less viable. Let&#8217;s say it takes 1T tokens from a frontier model to capture its juice (I have no idea if that&#8217;s correct, but let&#8217;s say). Even ignoring savings from caching, Opus 4.6 is $25/MTok. So $25 million for those 1T tokens. That&#8217;s nothing.</p><p>Labs are responding by hiding chain of thought. But there are three problems with this solution:</p><ul><li><p>Chain of thought is not made of some fundamentally different kind of token. 
You can just instruct the model to not think first but just start solving the problem, or to write out its thinking somewhere else.</p></li><li><p>Even if labs do figure out in the future how to robustly hide chain of thought from anyone who wants to train on it, you can set reconstructing the chain of thought necessary to reproduce a decoded sequence as an RLVR target. Yes, that costs more, but it seems doable.</p></li><li><p>Maybe most importantly, the real juice of these agentic models is their tool use (writing and updating files of code, running bash commands, etc). And if these things are done locally on the user&#8217;s computer, you can&#8217;t really hide them. And it seems like a hard lift to get users to migrate all their development workflows to a cloud that you fully control and hide visibility into, modulo a Claude agent input text prompt.</p></li></ul><p>By the way, I learned about an interesting way companies which build products atop API access to AI models can basically distill these models, in a way that potentially makes the distilled models even better than the ones they&#8217;re actually built atop.</p><p>Suppose you&#8217;ve got a coding product. In order to build a feature, a user uses your product to query some frontier model API across 10+ back and forths. Once the user is satisfied with the end result, you have the end state that the user actually wanted - &#8220;the gold diff&#8221;. 
These coding product companies can now set the gold diff as the RL target for training their own models, where the model gets rewarded for producing outputs that look like what users eventually converged on, and penalized for producing the kinds of intermediate outputs that users kept rejecting or editing.</p><h3>Mythos and the cybersecurity equilibrium</h3><p>It seems like the key difference between Mythos and previous versions is that while previous versions could find individual vulnerabilities in the code (&#8220;Hey, there&#8217;s a missing bounds check here&#8221;), Mythos is agentic enough over long horizons to chain together 5 different vulnerabilities, all of which are required to build an exploit (&#8220;Now I can execute arbitrary code, escalate privileges, etc&#8221;). To the extent that some discontinuity has been hit, it&#8217;s probably more the result of the combinatorial nature of cyberattacks than some off-trend increase in intelligence.</p><p>What does this mean for offense/defense? One way to look at it is that software is more secure today than it was 20 years ago, despite more and more human intelligence probing at public code, both white hat and black hat. If we get another influx of intelligence suddenly, why should the dynamic change?</p><p>In fact, we know that our foreign adversaries almost certainly have access to a bunch of critical zero days which they&#8217;re saving for a rainy day, or already using in inconspicuous ways. To the extent that Glasswing allows the whole industry to find a bunch of these latent exploits and patch them, shouldn&#8217;t we expect defense to have become much stronger relative to offense by the end of 2026? 
Of course, this is thanks to the fact that American companies got there first and are cooperating with other companies and our government to patch things before our adversaries get to the same level.</p><p>One counterpoint I heard from a security expert is that there&#8217;s a big difference between finding vulnerabilities and patching them - and AI is much better at the former than the latter (people often talk about the offense/defense balance, but the difficulty of finding versus patching vulnerabilities seems much more significant). In order to patch an issue, you have to find a fix that will not interfere with all the ways people use your software, and all the features which rely on weird bespoke behavior. XKCD has a nice comic illustrating how these kinds of issues come up:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QpJ5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QpJ5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 424w, https://substackcdn.com/image/fetch/$s_!QpJ5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 848w, https://substackcdn.com/image/fetch/$s_!QpJ5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QpJ5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QpJ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png" width="555" height="772" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:772,&quot;width&quot;:555,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!QpJ5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 424w, https://substackcdn.com/image/fetch/$s_!QpJ5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 848w, https://substackcdn.com/image/fetch/$s_!QpJ5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 1272w, 
https://substackcdn.com/image/fetch/$s_!QpJ5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5e801a3e-5563-40fc-ba70-4af569d80647_555x772.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Potential solutions, if it&#8217;s non-trivial to just push patches to every piece of software?</p><ul><li><p>TODO - I know nothing about formal verification of software - check out what a seL4 proof of some behavior might look like</p></li><li><p>Use LLMs to rapidly port all C to Rust. 
Curious how easily Mythos can find vulnerabilities in memory-safe languages.</p></li></ul><p>In some sense, it&#8217;s good that Anthropic didn&#8217;t release this model publicly until critical IT could be patched up. In another sense, isn&#8217;t it a super bad precedent for private companies to be hoarding the ability to break into any operating system and browser and device? One obvious question for Anthropic is why they didn&#8217;t just build some kind of classifier which would detect whether you&#8217;re using the model for cyberattack-type stuff, and refuse requests if so, and release the model publicly with that classifier in place.</p><ul><li><p>Patching your own software is isomorphic to finding bugs in someone else&#8217;s repo from the perspective of an LLM (and patching your own software is a frequent coding model use case).</p></li><li><p>These kinds of classifiers can be easy to evade if you have enough expertise to break the problem of finding exploits down into smaller subproblems of finding vulnerabilities, each of which individually seems like sensible, benign behavior to an LLM with no memory.</p></li></ul><h3><a href="https://arxiv.org/pdf/2509.19128">Pipeline RL</a> paper summary</h3><p>As you keep RLing a model, not only does the average length of a response increase (since you&#8217;re basically training the model to think for longer before answering) but the variance in length also increases - sometimes you get an easy problem and you can immediately answer it - other times, you need to go think for 100k tokens.</p><p>This is a big problem for GPU utilization during training. 
Because you have to wait for all these stragglers to finish generating before you can start the next training step.</p><p>Okay, one way you could get out of this conundrum is just to batch generation, so that while the stragglers keep going, you generate even more rollouts.</p><p>The problem is that there is an optimal batch size for each training step, so you&#8217;d need to split all these rollouts you made across lots of consecutive training steps.</p><p>But this takes you into the domain of off-policy RL, because your model is changing with each training step. And so you&#8217;re training your model on trajectories that were actually generated by an earlier version of the model, which is not ideal.</p><p>The Pipeline RL paper proposes the following fix: in-flight weight updates - where you just sub out the generating model partway through these in-progress trajectories as soon as the new training step is done, so that all the short trajectories, and a good chunk of the long trajectories, that the next training step will be trained on are generated by the most recent version of the model.</p>]]></content:encoded></item><item><title><![CDATA[Michael Nielsen – How science actually progresses]]></title><description><![CDATA[The true story of Einstein, Newton, and Darwin]]></description><link>https://www.dwarkesh.com/p/michael-nielsen</link><guid isPermaLink="false">https://www.dwarkesh.com/p/michael-nielsen</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Tue, 07 Apr 2026 15:49:28 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/193466212/619081f90c9cac9ccaa31d175b67a2ad.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Really enjoyed chatting with Michael Nielsen about how we recognize scientific progress.</p><p>It&#8217;s especially relevant for closing the RL verification loop for scientific discovery.</p><p>But it&#8217;s also a surprisingly mysterious and elusive question when you look at the history of human science.</p><p>We 
approach this question through stories like Einstein (who claimed that he hadn&#8217;t even heard of the famous Michelson-Morley experiment, which is supposed to have motivated special relativity, until after he had come up with the theory), Darwin (why did it take till 1859 to lay out an idea whose essence every farmer since antiquity must have observed?), Prout (how do you recognize that isotopes exist if you cannot chemically separate them?), and many others.</p><p>The verification loop on scientific ideas is often extremely long and weirdly hostile. Ancient Athenians dismissed Aristarchus&#8217;s heliocentrism in the 3rd century BC because it would imply that the stars should shift in the sky as the Earth orbits the sun. The first successful measurement of stellar parallax was in 1838. That&#8217;s a 2,000-year verification loop.</p><p>But clearly human science is able to make progress faster than raw experimental falsification/verification would imply, and in cases where experiments are very ambiguous. How?</p><p>Michael has some very deep and provocative hypotheses about the nature of progress. One I found especially thought-provoking is that aliens will likely have a VERY different science + tech stack than us. Which contradicts the common-sense picture of a linear tech tree that I was assuming. 
And has some interesting implications about how future civilizations might trade and cooperate with each other.</p><p>Watch on <a href="https://youtu.be/myP8UjAM1pk">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/michael-nielsen-how-science-actually-progresses/id1516093381?i=1000760075027">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/1JTv7Le8s5Mf0hDcXDOJYl">Spotify</a>.</p><div id="youtube2-myP8UjAM1pk" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;myP8UjAM1pk&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/myP8UjAM1pk?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Sponsors</strong></h2><ul><li><p><a href="https://labelbox.com/dwarkesh">Labelbox</a> researchers built a new safety benchmark. Why? Well, current safety benchmarks claim that attacks on top models are successful only a few percent of the time, but the prompts in those benchmarks don&#8217;t reflect how real bad actors actually write. You can read Labelbox&#8217;s research <a href="https://labelbox.com/blog/the-ai-safety-illusion-why-current-safety-datasets-fool-us-on-model-safety/">here</a>. If this could be useful for your work, reach out at <a href="https://labelbox.com/dwarkesh">labelbox.com/dwarkesh</a></p></li></ul><ul><li><p><a href="https://mercury.com">Mercury</a> has an MCP that lets you give an LLM access to your full transaction history, including things like attached receipts and internal notes. I just used it to categorize my 2025 transactions, and it worked shockingly well. Modern functionality like this is exactly why I use Mercury. 
Learn more at <a href="https://mercury.com">mercury.com</a></p></li><li><p><a href="https://janestreet.com/dwarkesh">Jane Street&#8217;s</a> ML engineers presented some of their GPU optimization workflows at GTC, showing how they use CUDA graphs, streams, and custom kernels to shave real time off their training runs. You can watch the full talk <a href="https://www.nvidia.com/en-us/on-demand/session/gtc26-s82065/">here</a>. And they open-sourced all the relevant code <a href="https://github.com/janestreet/gtc2026/">here</a>. If this kind of stuff excites you, Jane Street is hiring &#8212; learn more at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li></ul><h2><strong>Timestamps</strong></h2><p>00:00:00 &#8211; How scientific progress outpaces its verification loops</p><p>00:17:51 &#8211; Newton was the last of the magicians</p><p>00:23:26 &#8211; Why wasn&#8217;t natural selection obvious much earlier?</p><p>00:29:52 &#8211; Could gradient descent have discovered general relativity?</p><p>00:50:54 &#8211; Why aliens will have a different tech stack than us</p><p>01:15:26 &#8211; Are there infinitely many deep scientific principles left to discover?</p><p>01:26:25 &#8211; What drew Michael to quantum computing so early?</p><p>01:35:29 &#8211; Does science need a new way to assign credit?</p><p>01:43:57 &#8211; Prolificness versus depth</p><p>01:49:17 &#8211; What it takes to actually internalize what you learn</p><h2>Transcript</h2><h3>00:00:00 &#8211; How scientific progress outpaces its verification loops</h3><p><strong>Dwarkesh Patel</strong></p><p>Today, I&#8217;m speaking with <a href="http://n">Michael Nielsen</a>. You have done many things. 
You&#8217;re one of the pioneers of <a href="https://quantum.country/">quantum computing</a>, wrote the <a href="https://www.amazon.com/Reinventing-Discovery-New-Networked-Science/dp/0691148902">main textbook in the field</a> of the <a href="https://en.wikipedia.org/wiki/Open_science">open science movement</a>. You wrote a <a href="http://neuralnetworksanddeeplearning.com/">book about deep learning</a> that <a href="https://en.wikipedia.org/wiki/Chris_Olah">Chris Olah</a> and <a href="https://en.wikipedia.org/wiki/Greg_Brockman">Greg Brockman</a> credit with getting them into the field. More recently, you&#8217;re a research fellow at the <a href="https://astera.org/">Astera Institute</a> and writing a book about religion, science, and technology.</p><p>I&#8217;m going to ask you about none of those things. The conversation I want to have today is, how do we recognize scientific progress? It&#8217;s especially relevant for AI because people are trying to close the <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">RL</a> verification loop on scientific discovery. What does it mean to close that loop? But in preparing for this interview, I&#8217;ve realized that it&#8217;s a more mysterious and elusive force, even in the history of human science, than I understood.</p><p>I think a good place to start will be <a href="https://en.wikipedia.org/wiki/Michelson%E2%80%93Morley_experiment">Michelson-Morley</a> and how <a href="https://en.wikipedia.org/wiki/Special_relativity">special relativity</a> is discovered, if it&#8217;s different from the story that you get off of YouTube videos. 
I will prompt you that way, and then we&#8217;ll go in there.</p><p><strong>Michael Nielsen</strong></p><p>Michelson-Morley is the famous result often presented as this experiment that was done in the 1880s that helped <a href="https://en.wikipedia.org/wiki/Albert_Einstein#Special_relativity">Einstein come up with the special theory of relativity</a> a little bit later, changing the way we think about space and time and our fundamental conception of those things.</p><p>And there&#8217;s a big gap, I think, between the way <a href="https://en.wikipedia.org/wiki/Albert_A._Michelson">Michelson</a> and <a href="https://en.wikipedia.org/wiki/Edward_W._Morley">Morley</a> and other people at the time thought about the experiment and certainly the way in which Einstein thought or did not think about the experiment. In actual fact, he stated later in his life he wasn&#8217;t even sure whether he was aware of the paper at the time. There&#8217;s a lot of evidence that he probably was aware of the paper at the time, but it actually wasn&#8217;t dispositive for his thinking at all. Something else completely was going on.</p><p>What Michelson and Morley thought they were doing was testing different theories of what was called the <a href="https://en.wikipedia.org/wiki/Aether_theories">ether</a>. If you go back to the 1600s, <a href="https://en.wikipedia.org/wiki/Robert_Boyle">Robert Boyle</a> introduced the idea of the ether. We know that sound is vibrations in the air. Boyle and other people got interested in the question of whether light is vibrations in something, and they couldn&#8217;t figure out what it was. Boyle did an experiment where he tested whether you could propagate light through a vacuum. He found that you could. You couldn&#8217;t do it with sound. 
He introduced this idea of the ether, and for the next two hundred or so years, people had all these conversations about what the ether was and what its nature was.</p><p>The Michelson and Morley experiment was really an experiment to test different theories of the ether against one another, in particular to find out whether or not there was a so-called ether wind. The idea was that the Earth is maybe passing through this ether wind. And if it is passing through the ether wind and you shoot a light beam parallel to the direction the ether wind is going in, it&#8217;ll get accelerated a little bit. If it&#8217;s being passed back in the opposite direction, it&#8217;ll get slowed down a little bit, and you should be able to see this in the results of interference experiments. What they found, much to their surprise, was that in fact there was no ether wind. That ruled out some theories of the ether, but not all, and Michelson certainly continued to believe in the ether.</p><p><strong>Dwarkesh Patel</strong></p><p>This is what was a shocking part of reading this story from the biography of Einstein that you recommended by... what was his first name?</p><p><strong>Michael Nielsen</strong></p><p><a href="https://en.wikipedia.org/wiki/Abraham_Pais">Abraham Pais.</a></p><p><strong>Dwarkesh Patel</strong></p><p>Abraham Pais. <em><a href="https://amzn.to/4typOoi">Subtle is the Lord</a></em>. Also from <a href="https://en.wikipedia.org/wiki/Imre_Lakatos">Imre Lakatos</a>, <em><a href="https://amzn.to/3PQfCZA">The Methodology of Scientific Research Programmes</a></em>. The way it&#8217;s told is that Michelson-Morley proved that the ether did not exist. Therefore, it created a crisis in physics that Einstein solved with special relativity.</p><p>What you&#8217;re pointing out is he actually was trying to distinguish between many different theories of ether. 
If you&#8217;re in space or if you&#8217;re on Earth, it&#8217;s the same direction of ether, or maybe the ether wind is being carried around by the Earth, and so you can&#8217;t really experience it on Earth. But if you go to a high enough altitude, you might be able to experience it. In fact, Michelson&#8217;s experiments, the famous one is 1887, but he conducted these experiments for basically two decades.</p><p><strong>Michael Nielsen</strong></p><p>For longer than that. He conducted the first one in 1881, I think, but he continued to believe until he died. He died, I think it was 1929 or so. It was the late twenties. He was still doing experiments in the 1920s about whether or not the ether existed. So he continued to believe in the ether to the end of his life. I think the last public statement he made was a year or two before he died, and he basically still believed it at that point.</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, there was another physicist, <a href="https://en.wikipedia.org/wiki/Dayton_Miller">Miller</a>, who kept doing these experiments in the 1920s. He thought that if he went to a high enough altitude, Mount Wilson in California&#8230; &#8220;Oh, I&#8217;m high enough that the ether winds are not being dragged by the Earth. And I&#8217;ve measured the effect of the ether.&#8221; Einstein hears about this and he says, and this is where you get the famous quote, &#8220;Subtle is the Lord, but malicious He is not.&#8221;</p><p>Anyways, I think the reason the story is interesting is for many different reasons. One of the ways in which the real history of science is different from this idea you get of the scientific method is that you really can&#8217;t apply <a href="https://en.wikipedia.org/wiki/Falsifiability">falsification</a> as easily as you might think. It&#8217;s not clear what is being falsified. Is it just another version of the theory of the ether that&#8217;s being falsified? 
Certainly you can&#8217;t induce the theory of special relativity from the fact that one version of the ether seems to be disconfirmed by these experiments.</p><p><strong>Michael Nielsen</strong></p><p>It certainly doesn&#8217;t show that ideas about falsification are wrong or falsified, but it does show that the most naive ideas&#8230; Things are often much more complicated than you think. Michelson did this experiment in 1881. He was a very young man, and then other people, I think <a href="https://en.wikipedia.org/wiki/John_William_Strutt,_3rd_Baron_Rayleigh">Rayleigh</a> was one of them, pointed out that there were some problems with the way he did it, so they had to redo it in 1887. At that point, a lot of the leading physicists of the day basically accepted this result, that there was no ether wind. But what to do about this?</p><p>Sure, maybe you falsified some theories of the ether. There are others that you haven&#8217;t falsified at all at this point, and people set to work on developing those. It is funny, people will phrase it as showing that the ether didn&#8217;t exist. Even just the word &#8220;the&#8221; there is a misnomer. You actually had a ton of different theories and a couple of leading contenders. So yes, there&#8217;s some version of falsification going on, but how you respond to this new experiment is very complicated. Certainly the leading physicists of the day responded by saying, &#8220;Okay, this gives us a lot of information about what the ether must be, but it doesn&#8217;t tell us that there is no ether.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, <a href="https://en.wikipedia.org/wiki/Hendrik_Lorentz">Lorentz</a> at the end of the 19th century, before Einstein, figures out the math of how you convert from one reference frame to another reference frame, and comes up with the <a href="https://en.wikipedia.org/wiki/Lorentz_transformation">Lorentz transformations</a>, which is the basis of special relativity. 
But his interpretation is that you are converting from the ether reference frame to these non-privileged other reference frames if you&#8217;re moving relative to the ether.</p><p>His interpretation of <a href="https://en.wikipedia.org/wiki/Length_contraction">length contraction</a> and <a href="https://en.wikipedia.org/wiki/Time_dilation">time dilation</a> is that this is the effect of moving through the ether, and you have this pressure. This pressure is warping clocks. It&#8217;s warping measures of length. The interesting thing here is that experimentally you cannot distinguish Lorentz&#8217;s interpretation from special relativity.</p><p><strong>Michael Nielsen</strong></p><p>I think that&#8217;s a strong statement. Lorentz introduces this quantity called <a href="https://en.wikipedia.org/wiki/Relativity_of_simultaneity#History">local time</a>, which he regards as... My understanding is he&#8217;s not trying to give a physical interpretation of this, but it&#8217;s what Einstein would later just recognize as time in another <a href="https://en.wikipedia.org/wiki/Inertial_frame_of_reference">inertial reference frame</a>. He&#8217;s not trying to attribute much physical meaning to it. I think <a href="https://en.wikipedia.org/wiki/Henri_Poincar%C3%A9#Work_on_relativity">Poincar&#233;</a> gets much closer later on to realizing that this is the time that&#8217;s registered by clocks.</p><p>About forty-odd years later, people start doing these <a href="https://en.wikipedia.org/wiki/Experimental_testing_of_time_dilation">muon experiments</a> where they see cosmic rays hit the top of the atmosphere. They produce a shower of <a href="https://en.wikipedia.org/wiki/Muon">muons</a>, and you can look to see at different heights in the atmosphere how many of those muons remain. They decay over time, and a very strange thing happens, which is that they&#8217;re decaying way too slow. 
You expect they shouldn&#8217;t be able to last the whole way through the atmosphere at all. Their decay rate is too quick, if you were in a classical theory. But if in fact their time really has slowed down, it&#8217;s okay.</p><p>In fact, the measured decay rates in 1940&#8212;and there have since been more accurate experiments done&#8212;match exactly what you expect from special relativity. That&#8217;s the kind of thing where if Lorentz had been alive&#8212;he&#8217;d been dead ten or so years at that point&#8212;it seems quite likely that he would have tried to save his theory by patching it up yet again, but it would have been a massive setback. It starts to just look like time&#8212;this thing that Lorentz introduced as a mathematical convenience&#8212;that&#8217;s actually what time is, for the muons at least. Then there&#8217;s a whole bunch of other experiments that show this very similar phenomenon.</p><p><strong>Dwarkesh Patel</strong></p><p>When was that experiment done?</p><p><strong>Michael Nielsen</strong></p><p>That was, I think, 1940. It might have been published in 1941.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe to rephrase and change my claim: it&#8217;s not that you could not have distinguished them, but the scientific community adopted what we in retrospect consider the more correct interpretation before it was actually experimentally shown to be preferred. So there&#8217;s clearly some process that human science does which can distinguish different theories.</p><p><strong>Michael Nielsen</strong></p><p>Can I just interrupt? You used the word process, and it&#8217;s interesting to think about that term. Process carries connotations of something set in advance. It&#8217;s much more complicated in practice. 
You have people like Lorentz, who Einstein absolutely and utterly admired, and Poincar&#233;, one of the greatest scientists who ever lived, and Michelson, another truly outstanding scientist, who never reconciled themselves.</p><p>It&#8217;s not as though there&#8217;s some standard procedure that we&#8217;re all using to reconcile these things. Great scientists can remain wrong for a very long time after the scientific community has broadly changed its opinion. But there&#8217;s no centralized authority or centralized method.</p><p><strong>Dwarkesh Patel</strong></p><p>That is the interesting thing. There&#8217;s progress even though it is hard to articulate the process by which it happens, the heuristics that are used.</p><p>You mentioned Poincar&#233;. Lorentz has the math right, but the interpretation wrong. It seems like Poincar&#233; had the opposite, where he understood that it&#8217;s hard to define <a href="https://en.wikipedia.org/wiki/Relativity_of_simultaneity">simultaneity</a> because it requires a circular definition with time, or velocity of something that might arrive at a midpoint together, but velocity is defined in terms of time.  I find this interesting.</p><p>There are a couple of other examples we could call on. There is this phenomenon in the history of science where somebody asks the right question, but then they don&#8217;t clinch it. I&#8217;m curious what you think is happening in those cases.</p><p><strong>Michael Nielsen</strong></p><p>You actually do want to go case by case and try to understand. It&#8217;s not necessarily clear that they&#8217;re doing the same thing wrong in all of the cases. The Poincar&#233; case is amazing. He seems to have understood the <a href="https://en.wikipedia.org/wiki/Principle_of_relativity">principle of relativity</a>, the idea that the laws of physics are the same in all inertial reference frames. He seems to have understood that the speed of light is the same in all inertial reference frames. 
He doesn&#8217;t phrase it quite that way, but it is my understanding, though I don&#8217;t speak French.</p><p>These are basically the ideas that Einstein uses to deduce special relativity. But then he also has this additional misunderstanding where he thinks that length contraction is a dynamical effect, that somehow particles are being pushed together by some external force, something is going on dynamically. He doesn&#8217;t understand that it&#8217;s purely kinematics. That actually space and time are different from what we thought, and you need to fundamentally rethink those things.</p><p>It&#8217;s almost like he knew too much. He had almost too grand a vision in mind. Einstein subtracts from that and says, &#8220;No. Space and time are just different than what we thought, and here&#8217;s the correct picture.&#8221; <a href="https://philsci-archive.pitt.edu/22181/1/2014-shpmp-walter.pdf">There&#8217;s a paper in, I think it&#8217;s 1909, where Poincar&#233; still has this dynamical picture of what&#8217;s going on with the length contraction</a>. This is just not necessary. This is a mistake from the modern point of view.</p><p>Why is he doing this? Why is he clinging onto this idea? I don&#8217;t know. I&#8217;ve obviously never met the man. It would be fascinating to be able to talk it over and try and understand. His expertise seems to be getting in the way. He knows so much, he understands so much, and then he&#8217;s not able to let go of these things.</p><p>A really interesting fact is that a few years prior, in the 1890s, Einstein&#8217;s a teenager and he believes in the ether too. He knows about this stuff. But he&#8217;s not quite as attached as these older people were. Maybe they were a little bit prisoners of their own expertise. That&#8217;s my guess. 
Some historians of science would certainly disagree.</p><p><strong>Dwarkesh Patel</strong></p><p>Then there are the obvious stories where Einstein himself later on is said to have not latched onto the correct interpretations of quantum mechanics or cosmology because of his own attachments.</p><p><strong>Michael Nielsen</strong></p><p>Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>Here&#8217;s the bigger question I have. The muon example is a great example of these long verification loops and how progress seems to happen in the scientific community faster than these verification loops imply. Maybe the clearest example is <a href="https://en.wikipedia.org/wiki/Aristarchus_of_Samos">Aristarchus</a>, who in the third century BC comes up with the idea of heliocentrism. The ancient Athenians dismiss it on the grounds that, if the Sun really is the center of the solar system, we should see the stars move relative to the Earth as the Earth moves around the Sun. The only reason that would not be the case is the stars are so far away that you would not observe this.</p><p>And it&#8217;s only in 1838 that <a href="https://en.wikipedia.org/wiki/Stellar_parallax">stellar parallax</a> was actually measured. And yet we didn&#8217;t need to wait until 1838 to have heliocentrism. We didn&#8217;t need to wait for the experimental validation to understand that Copernicus is better in some way. In fact, when Copernicus first came up with his theories, it&#8217;s well known that the <a href="https://en.wikipedia.org/wiki/Geocentrism">Ptolemaic model</a> was more accurate because it had centuries of adding on these <a href="https://en.wikipedia.org/wiki/Deferent_and_epicycle">epicycles</a>.</p><p>What&#8217;s maybe less well appreciated is that it was also in some sense simpler. Because Copernicus actually had to add extra epicycles. His model had more epicycles than the Ptolemaic model because he had this bias that the Earth should go in a perfect circle in equal time. 
Anyway, I think this is an interesting story because it&#8217;s not a more accurate theory. It&#8217;s not a simpler theory. So how could you have known ex ante that Copernicus was correct and Ptolemy was not?</p><p><strong>Michael Nielsen</strong></p><p>Good question. I don&#8217;t entirely know the answer. I can give you a partial answer that I, centuries in the future, start to find very compelling. I&#8217;m sure it&#8217;s part of the historic story at least. One of the big shocks for <a href="https://en.wikipedia.org/wiki/Isaac_Newton">Newton</a>, he did understand <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">Kepler&#8217;s laws of motion</a> eventually, so you&#8217;re able to explain the motions of the planets in the sky. But he also, out of the same theory, his <a href="https://en.wikipedia.org/wiki/Newton%27s_law_of_universal_gravitation">theory of gravitation</a>, was able to explain terrestrial motion. He&#8217;s able to explain why objects move in parabolas on the Earth, and he&#8217;s able to explain the tides in terms of the moon and the sun&#8217;s gravitational effect on water on the Earth.</p><p>You have what seem like three very different disconnected phenomena all being explained by this one set of ideas. That starts to feel very compelling, at least to me. I think most people find that very satisfying once they eventually realize it.</p><h3>00:17:51 &#8211; Newton was the last of the magicians</h3><p><strong>Dwarkesh Patel</strong></p><p>Have you read the <a href="https://en.wikipedia.org/wiki/John_Maynard_Keynes">Keynes</a> biography of Newton?</p><p><strong>Michael Nielsen</strong></p><p>He wrote an entire biography?</p><p><strong>Dwarkesh Patel</strong></p><p>No, the <a href="https://mathshistory.st-andrews.ac.uk/Extras/Keynes_Newton/">essay</a>.</p><p><strong>Michael Nielsen</strong></p><p>Sure. I love that. 
This description of him as the last of the magicians is wonderful.</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, I think it&#8217;s maybe worth superimposing. Or you should read out that one passage of the thing.</p><p><strong>Michael Nielsen</strong></p><p>Alright. It&#8217;s from a talk that he gave at Cambridge not long before he died. He&#8217;d acquired Newton&#8217;s papers somehow and gave a lecture twice about this, or his brother Geoffrey gave it the other time because he was too ill. There&#8217;s this wonderful, wonderful quote in the middle. The whole thing is really interesting, but I love this particular quote: &#8220;Newton was not the first of the age of reason. He was the last of the magicians, the last great mind which looked out on the visible and intellectual world with the same eyes as those who began to build our intellectual inheritance rather less than ten thousand years ago.&#8221;</p><p>This idea people have that Newton was the first modern scientist is somehow wrong. There&#8217;s some truth to it, but he really had this very different way of looking at the world that was part superstitious and part modern. It was a funny hybrid. He&#8217;s a transitional figure in some sense. That phrase, &#8220;the last of the magicians,&#8221; really points at something.</p><p><strong>Dwarkesh Patel</strong></p><p>The thing I&#8217;m very curious about with Newton is whether it was the same program, the same heuristics, the same biases that he applied to his alchemical work as to his understanding of astronomy. This is from the Keynes essay: &#8220;There was extreme method in his madness. All his unpublished works on esoteric and theological matters are marked by careful learning, accurate method, and extreme sobriety of statement. They are just as sane as the <em>Principia</em> if their whole matter and purpose were not magical. 
They were nearly all composed during the same 25 years of his mathematical studies.&#8221;</p><p>Clearly, there was some aesthetic that motivated people like Einstein to reject earlier ways of thinking and say, &#8220;No, the other is wrong, and there&#8217;s a better way to think about things.&#8221; The same is true with Newton. The question I have is whether similar heuristics toward parsimony, aesthetics, and so on, would be equally useful across time and across disciplines, or whether you need different heuristics. The reason that&#8217;s relevant is even if we can&#8217;t build a verification loop for science, maybe if the taste tests point in the same direction, you can at least encode that bias into the AIs. That would maybe be enough.</p><p><strong>Michael Nielsen</strong></p><p>The point is that where we always get bottlenecked is where the previous processes and heuristics don&#8217;t apply. That&#8217;s almost definitionally what causes the bottlenecks. Because people are smart, they know what has worked before. They study it. They apply the same kinds of things, so they don&#8217;t get stuck in the same places as before. They keep getting bottlenecked in different places. I&#8217;m overgeneralizing a bit, but I think it&#8217;s right.</p><p>If you&#8217;re attempting to reduce science to a process, you&#8217;re attempting to reduce it to something where there is just a method which you can apply, and you turn the crank and out pops insight. You can do a certain amount of that, but you&#8217;re going to get bottlenecked at the places where your existing method doesn&#8217;t apply. Definitionally, there&#8217;s no crank you can turn. You need a lot of people trying different ideas. The more difficult the idea is to have, the greater the bottleneck, but then also the greater the triumph.</p><p><a href="https://en.wikipedia.org/wiki/Quantum_mechanics">Quantum mechanics</a> is a great example of this. It&#8217;s such a shocking set of ideas. 
It&#8217;s such a shocking theory. The theory of evolution in some sense is also quite a shocking idea, not the principle of natural selection, but that it can explain so much. That&#8217;s a shocking idea.</p><h3>00:23:26 &#8211; Why wasn&#8217;t natural selection obvious much earlier?</h3><p><strong>Dwarkesh Patel</strong></p><p><em><a href="https://en.wikipedia.org/wiki/Philosophi%C3%A6_Naturalis_Principia_Mathematica">Principia Mathematica</a></em> is released in 1687. <em>The Origin of Species</em> is released in 1859. At least naively, it seems like Darwin&#8217;s theory of natural selection is conceptually easier than the theory of gravity.</p><p><a href="https://www.dwarkesh.com/p/terence-tao">I asked Terence Tao this question</a>. There was this contemporaneous biologist with Darwin, <a href="https://en.wikipedia.org/wiki/Thomas_Henry_Huxley">Thomas Huxley</a>, who read this and said, &#8220;How extremely stupid to not have thought of this.&#8221; Nobody ever reads the Principia Mathematica and thinks, &#8220;God, why didn&#8217;t I beat Newton to the punch here?&#8221; So what&#8217;s going on here? Why did Darwinism take so much longer?</p><p><strong>Michael Nielsen</strong></p><p>The idea must have been known to animal breeders for a long time at some level, or certainly large chunks of the idea were known, that artificial selection was a thing. In some sense, <a href="https://en.wikipedia.org/wiki/Charles_Darwin">Darwin&#8217;s</a> genius wasn&#8217;t in having that idea, it was understanding just how central it was to biology. You can go back and explain a tremendous amount about all the variety of what we see in the world with this as not necessarily the only principle, but certainly a core principle. He writes this wonderful book, <em><a href="https://en.wikipedia.org/wiki/On_the_Origin_of_Species">The Origin of Species</a></em>. 
It&#8217;s just so much evidence and so many examples, trying to tease this out and see what the implications are, and connecting it to as much else as he possibly can, to geology and all these other things.</p><p>That hard work&#8212;making the case that it&#8217;s actually relevant all across the biosphere&#8212;is what he&#8217;s doing there. He&#8217;s not just having the idea, he&#8217;s making a compelling case that it&#8217;s intertwined with absolutely everything else.</p><p><strong>Dwarkesh Patel</strong></p><p>The motivation for the question was <a href="https://en.wikipedia.org/wiki/Lucretius">Lucretius</a>, this first-century BC Roman poet who has an idea that seems analogous to natural selection. It&#8217;s about species becoming better fitted to their environments over time, or losing that fit. And so, why did this go nowhere for nineteen centuries?</p><p>Then I looked into it or, more accurately, asked LLMs what exactly Lucretius&#8217;s idea here was. It is extremely different from what real natural selection is. He thought there was this generative period in the past where all the species came about, and then there was this one-time filter which resulted in the species that are around today, and they became fit to the environment.</p><p>He did not have this idea that it is an ongoing gradual process or that there is a tree of life that connects all life forms on Earth together, which, by the way, is an incredibly weird fact that <a href="https://en.wikipedia.org/wiki/Last_universal_common_ancestor">every single life form on Earth has a common ancestor</a>.</p><p><strong>Michael Nielsen</strong></p><p>It&#8217;s not incredibly weird. 
If you think that the origin of life must have been very hard, that there&#8217;s a bottleneck there, then it&#8217;s not so surprising.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s also this verification loop aspect where even if Newton might be harder in some sense, if you&#8217;ve clinched it, you can experimentally&#8230; I know &#8220;validate&#8221; is the wrong word philosophically, but you can give a lot of base points to the theory.</p><p>You can be like, &#8220;Okay, I have this idea of why things fall on Earth. I have this idea of why orbital periods for planets have a certain pattern. Let&#8217;s try it on the Moon, which orbits the Earth.&#8221; And in fact, it&#8217;s weird but the orbital period matches what my calculations imply.</p><p><strong>Michael Nielsen</strong></p><p>And the tides work correctly. It&#8217;s just amazing.</p><p><strong>Dwarkesh Patel</strong></p><p>Exactly. Whereas for Darwinism, it takes a ton of work for Darwin to compile all the cumulative evidence, but there&#8217;s no individual piece that is overwhelmingly powerful.</p><p><strong>Michael Nielsen</strong></p><p>And there&#8217;s a whole bunch of problems as well. He doesn&#8217;t really understand what the mechanism is. He doesn&#8217;t understand genes, all these things.</p><p><strong>Dwarkesh Patel</strong></p><p>The very interesting thing in the history of Darwinism is, this idea which theoretically you could come up with at any time, there is almost identical independent creation of that idea between <a href="https://en.wikipedia.org/wiki/Alfred_Russel_Wallace">Alfred Wallace</a> and Charles Darwin. 
So much so that I think Wallace sends his manuscript to Darwin and is like, &#8220;What do you think of this idea?&#8221; And Darwin&#8217;s like, &#8220;Fuck.&#8221;</p><p><strong>Michael Nielsen</strong></p><p>I don&#8217;t think that&#8217;s an exact quote, but it&#8217;s pretty much correct.</p><p><strong>Dwarkesh Patel</strong></p><p>They end up presenting their ideas together in the spirit of sportsmanship. Why was this period in the 1850s or 1860s the right time for these ideas to form? You can come up with different ideas. One is geology. In the 1830s, <a href="https://en.wikipedia.org/wiki/Charles_Lyell">Charles Lyell</a> figures out that there&#8217;s been millions and billions of years of time that&#8217;s existed on Earth. The paleontology shows you that fossils have existed for that entire time. Life goes back a long way. In fact, you can even find fossils for intermediate species that show you the tree of life. Between humans and other apes as well, there&#8217;s intermediate humans.</p><p>There&#8217;s also the age of colonization, and we have all these voyages doing <a href="https://en.wikipedia.org/wiki/Biogeography">biogeography</a>. That all must have been necessary. In fact, there&#8217;s a huge history of parallel innovation and discovery in the history of science. So maybe it is another piece of evidence that more had to be in place for a given idea to be discovered. Because if it&#8217;s not discovered for a long time and then spontaneously many different people are coming up with it, that shows you that the building blocks were in some sense necessary.</p><p><strong>Michael Nielsen</strong></p><p>This example of Lyell and other geologists in the early 1800s having this idea of <a href="https://en.wikipedia.org/wiki/Deep_time">deep time</a> does seem to have been crucial. I know Darwin was very influenced by Lyell. 
If you don&#8217;t have at least tens or hundreds of millions of years, evolution starts to look like a non-starter.</p><p>In order to make it work on a timescale of 5,000 to 10,000 years or <a href="https://en.wikipedia.org/wiki/Ussher_chronology">6,000 years with Bishop Ussher</a> you would need to see evolution occurring at a massive rate during human lifetimes, and we&#8217;re just not seeing that. That does seem to have been a blocker. To your question of what other blockers were there, were there any others? I don&#8217;t know.</p><p><strong>Dwarkesh Patel</strong></p><p>Or how much earlier could you, in principle, have come up with it if you were much smarter?</p><h3>00:29:52 &#8211; Could gradient descent have discovered general relativity?</h3><p><strong>Michael Nielsen</strong></p><p>Let&#8217;s go back and zoom out to your original question about the verification loop in AI. An example that should give you pause there is the big signature success so far, which is certainly <a href="https://en.wikipedia.org/wiki/AlphaFold">AlphaFold</a>. AlphaFold really isn&#8217;t about AI. A massive fraction of the success there is the <a href="https://en.wikipedia.org/wiki/Protein_Data_Bank">Protein Data Bank</a>. It&#8217;s <a href="https://en.wikipedia.org/wiki/X-ray_diffraction">X-ray diffraction</a>, <a href="https://en.wikipedia.org/wiki/Nuclear_magnetic_resonance">NMR</a>, <a href="https://en.wikipedia.org/wiki/Cryogenic_electron_microscopy">cryo-EM</a>, and the several billion dollars that were spent obtaining those 180,000-odd protein structures.</p><p>It&#8217;s basically the story of how we spent many decades obtaining <a href="https://en.wikipedia.org/wiki/Protein_structure">protein structure</a> just by going out and looking very hard at the world experimentally, and then we fitted a nice model at the end of it, which was a tiny fraction of the entire investment. That&#8217;s a story of data acquisition principally. 
The AI bit is very impressive and quite remarkable, but it is only a small part of the total story.</p><p><strong>Dwarkesh Patel</strong></p><p>AlphaFold is very interesting, and philosophically I wonder what you think of it as a scientific theory or explanation. I guess over time the world is becoming harder to understand&#8230; As I&#8217;m saying things, because you&#8217;re such a careful speaker, I say a phrase and wonder if you&#8217;ll actually buy that premise.</p><p>But in some domains, we need to fit models to things rather than coming up with underlying principles that explain a broad range of phenomena. Compare the theory of general relativity, or any theory which just nets out to some equations, versus AlphaFold, which is encoding these different relationships between things we can&#8217;t even interpret over 100 million parameters.</p><p>Are those really the same thing? GR can predict things you could have never anticipated or it was never meant to do, <a href="https://en.wikipedia.org/wiki/Tests_of_general_relativity">like why Mercury&#8217;s orbit precesses</a>. AlphaFold is not going to have that kind of explanatory reach. I want to get your reaction to that.</p><p><strong>Michael Nielsen</strong></p><p>I think it&#8217;s an incredibly interesting question. Maybe a really pivotal question. If you take a very classic point of view, you want these deep explanatory principles. You want as few free parameters as you possibly can. You want very simple models which explain a lot, and AlphaFold doesn&#8217;t look anything like that. You might just say, &#8220;It&#8217;s nice and maybe helpful as a model, but it&#8217;s not a scientific explanation.&#8221; That&#8217;s a conservative point of view, answer one to the question.</p><p>Answer two is to say maybe you shouldn&#8217;t think about AlphaFold as an explanation in the classic sense, but maybe it contains lots of little explanations inside it. 
Part of what you can get out of <a href="https://www.anthropic.com/research/team/interpretability">interpretability</a> work is you can go into AlphaFold and start to extract certain things. Maybe by doing an archeology of AlphaFold, we can actually understand a great deal more about these principles. You can start to extract that a certain circuit does this interesting thing, and we learn from it.</p><p>I don&#8217;t know to what extent that&#8217;s been done with AlphaFold, but it&#8217;s been done a little bit with some of the chess models, like <a href="https://en.wikipedia.org/wiki/AlphaZero">AlphaZero</a>. There seem to be some strategies which were borrowed by <a href="https://en.wikipedia.org/wiki/Magnus_Carlsen">Magnus Carlsen</a>, which he seems to have just taken from AlphaZero. I don&#8217;t think there&#8217;s any public confirmation of this, but some experts have noticed that he changed <a href="https://x.com/olimpiuurcan/status/1139437778683322369?ref_src=twsrc%5Etfw%7Ctwcamp%5Etweetembed%7Ctwterm%5E1139437778683322369%7Ctwgr%5E207ad5ac534f093e50621a391647199246f3dfb4%7Ctwcon%5Es1_&amp;ref_url=https%3A%2F%2Fwww.quora.com%2FHow-did-AlphaZero-inspire-Magnus-Carlsen-to-play-chess-better">his game quite radically after some public forensics were released on how AlphaZero worked</a>. That&#8217;s an example where human beings are starting to extract meaning out of these models.</p><p>That leads to viewing the models as a potential source of explanations. You need to do more work because they&#8217;re not very legible up front, but you can potentially extract them. That&#8217;s an interesting intermediate situation where they&#8217;re not explanations themselves, but you can extract interesting explanations out of them and use them as a source.</p><p>The third and most interesting possibility is that they&#8217;re a new type of object. 
They should be taken very seriously as explanations, but where in the past we haven&#8217;t had the ability to really do anything with them, now we have interesting new actions we can do. We can merge them, we can distill them. It&#8217;s a big opportunity in the philosophy of science.</p><p>There&#8217;s an anticipation of this in some way. Some mathematicians and physicists work today&#8230; Historically, if you had a 100-page equation&#8212;which is the kind of thing that does come up&#8212;there&#8217;s just nothing you can do if it&#8217;s 1920. At that point, you give up on the problem. But today, with tools like <a href="https://en.wikipedia.org/wiki/Wolfram_Mathematica">Mathematica</a>, you can just keep going. That&#8217;s an object now, a thing that you can work with. There are examples where people work with these things that formerly were regarded as too complicated, and sometimes they get simple answers out the end. That&#8217;s just an intermediate working state.</p><p>So I wonder if something similar is going to happen in this case, where you could take these models and use them in a similar way that people do with Mathematica, and take them seriously. They&#8217;re not explanations in the classic sense, but they&#8217;ll be something else which interesting operations can be done on.</p><p><strong>Dwarkesh Patel</strong></p><p>The thing I worry about is, suppose it&#8217;s 1500 and you&#8217;re training a model on&#8230; This is a weird history where we developed <a href="https://en.wikipedia.org/wiki/Deep_learning">deep learning</a> before we had cosmology. Suppose we live in that world. You&#8217;re observing how the stars don&#8217;t seem to move. The planets have all these weird behaviors. Then you train a model on that, and you do some kind of interp on it trying to figure out what the patterns are.</p><p>You&#8217;d just be able to keep building on Ptolemy&#8217;s model. You&#8217;d see there&#8217;s another epicycle we didn&#8217;t notice. 
Parameters X to Y encode this epicycle, parameters whatever encode the next epicycle. If you were just trying to figure out why the solar system is the way it is from observational data, you could just keep adding epicycles upon epicycles, but it really took one mind to integrate it all in and say, &#8220;Here&#8217;s what makes more sense overall.&#8221;</p><p><strong>Michael Nielsen</strong></p><p>This is to my point that we don&#8217;t really understand what to do with the models. We don&#8217;t have the verbs yet. It is certainly interesting to think about the question where you start to apply constraints to the models, essentially saying, &#8220;What&#8217;s the simplest possible explanation?&#8221; Or, &#8220;Can you simplify? Can you give me the 90/10 explanation?&#8221; And go further and further in boiling it down.</p><p>It might be that indeed they start out by providing a very, very complicated, many-parameter model. But you can just force the case, and basically that&#8217;s scaffolding, which maybe is the very early days of their attempt to understand something. They&#8217;re forced through that to a much more simple understanding.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry for misunderstanding, but it sounds like you&#8217;re saying maybe there&#8217;s some <a href="https://en.wikipedia.org/wiki/Regularization_(mathematics)">regularizer</a> or some distillation you could do of a very complicated model that gets you to a truer, more parsimonious theory. Take Ptolemy versus Copernicus. You start off with lots of Ptolemy epicycles, and then you try to distill this model, and maybe it gets rid of some of the epicycles that are less and less necessary to get the mean squared error of the orbits to match.</p><p>But at some point it has to do this thing which is to switch two things. Locally, it actually doesn&#8217;t make things more accurate. It&#8217;s in a global sense that it&#8217;s a more progressive theory. 
There&#8217;s some process which obviously humanity did over its span, which did that regularization or did that swap. But with raw <a href="https://en.wikipedia.org/wiki/Gradient_descent">gradient descent</a>, I don&#8217;t really feel like it would do that.</p><p><strong>Michael Nielsen</strong></p><p>Think about the example of going from Newtonian gravity to Einstein&#8217;s general theory of relativity. These are shockingly different theories, and the question is what causes that flip. As nearly as I understand the history, what goes on is Einstein develops special relativity and pretty much straight away he understands. It&#8217;s a very obvious observation. In special relativity, influences can&#8217;t propagate faster than the speed of light, and in Newtonian gravity, <a href="https://en.wikipedia.org/wiki/Action_at_a_distance">action is at a distance</a>.</p><p>Straight away in special relativity, you could use Newtonian gravity to do faster-than-light signaling. You could send information backwards in time. You could do all kinds of crazy stuff. It&#8217;s not a big leap to realize we have a big problem here. That&#8217;s the forcing function there. You&#8217;ve realized that your old explanation is not sufficient. You need something new.</p><p>Then you&#8217;re going to start by doing the simplest possible stuff. It just turns out that a lot of that stuff doesn&#8217;t work very well, so you&#8217;re forced to go through these steps where gradually it gets more complicated, and it&#8217;s wrong in a variety of ways. The final theory appears shockingly simple and beautiful, but it&#8217;s gone through some somewhat ugly intermediate stages.</p><p><strong>Dwarkesh Patel</strong></p><p>If you&#8217;re thinking about what it looks like to have AI accelerate science, there&#8217;s one for well-understood domains where we just want local solutions, like how does this protein fold. We just train a raw model using gradient descent. 
Then there&#8217;s things like coming up with general relativity, where you couldn&#8217;t really just train on every single observation in the universe and hope that general relativity pops out.</p><p>What would it require? It also certainly wasn&#8217;t immediately discovered. It was decades of thought. You&#8217;d need independent research programs where people start off with these biases, where Einstein is initially motivated by this <a href="https://en.wikipedia.org/wiki/Einstein%27s_thought_experiments#Falling_painters_and_accelerating_elevators">thought experiment of whether you can distinguish the effect of gravity from just being accelerated upwards</a>. You just need different AI thinkers to start off with these initial biases and see what can germinate out of them. The verification loop for that might be quite long, but you just need to keep all those research programs alive at the same time.</p><p><strong>Michael Nielsen</strong></p><p>This point you make about keeping all the different research programs alive, I think that is very important and central. A great example is situations where the same answer has been correct in some circumstances and wrong in other circumstances.</p><p>The planet Uranus was not in quite the right spot, and <a href="https://en.wikipedia.org/wiki/Discovery_of_Neptune">people famously predicted the existence of Neptune on this basis</a>. Wonderful, massive success for Newtonian gravity. The planet Mercury is not in quite the right spot. You predict the existence of some other distorting planet. It turns out that doesn&#8217;t exist. Actually, <a href="https://en.wikipedia.org/wiki/Tests_of_general_relativity">the reason Mercury is not in the right spot is because you need general relativity</a>.</p><p>You&#8217;ve pursued very similar ideas, and it&#8217;s been very successful in one case, and it&#8217;s been completely and utterly unsuccessful in the other case. 
A priori, you can&#8217;t tell which of these is the thing to do, and you actually need to do both. This is certainly very true in the history of science.</p><p>This kind of diversity, where you just have lots of people go off and pursue lots of potentially promising ideas, you just need to support that for a long time. It&#8217;s hard to do that for a variety of reasons, but it does seem to be very, very important.</p><p><strong>Dwarkesh Patel</strong></p><p>This example of Uranus versus Mercury is very interesting. I think it illustrates the difficulty with falsificationism. The orbit of Uranus is in some sense falsifying Newtonian mechanics. But then you make some ancillary prediction that says, &#8220;Oh, the reason this is happening is there must be another planet which is perturbing Uranus&#8217;s orbit.&#8221; I think it&#8217;s <a href="https://en.wikipedia.org/wiki/Urbain_Le_Verrier">Le Verrier</a> in 1846. &#8220;Point a telescope in the right direction, you find Uranus.&#8221;</p><p><strong>Michael Nielsen</strong></p><p>Neptune.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry. Neptune, yes. But with Mercury, it&#8217;s observed that the ellipse which forms its orbit is rotating 43 arcseconds more every century than Newtonian mechanics would imply, so people say that there must be a planet inside Mercury&#8217;s orbit. They call it <a href="https://en.wikipedia.org/wiki/Vulcan_(hypothetical_planet)">Vulcan</a> and point the telescopes. It&#8217;s not there.</p><p>But if you&#8217;re a proper Newtonian, what you do is say, &#8220;Well, maybe there&#8217;s some cosmic dust that&#8217;s occluding this planet, or maybe the planet is so small we can&#8217;t see it, or let&#8217;s build an even more powerful telescope, or maybe there&#8217;s some magnetic field which is occluding our measurement.&#8221; At any one of these steps&#8212;</p><p><strong>Michael Nielsen</strong></p><p>And this happens over and over. 
There are just so many stories which are exactly like this. An example I love from the 1990s. Some <a href="https://en.wikipedia.org/wiki/Pioneer_anomaly">people noticed that the Pioneer spacecraft weren&#8217;t quite where they were supposed to be</a>.</p><p>You can get very excited about this. &#8220;Oh my goodness, general relativity is wrong. Maybe we&#8217;re going to discover the next theory of gravity.&#8221; Today the accepted explanation is that there&#8217;s just a slight asymmetry in the spacecraft. It turns out that the thermal radiation is slightly larger in one direction than the other, and that&#8217;s causing a tiny little acceleration towards the sun. Most of the time when there&#8217;s these apparent exceptions, it&#8217;s just something like that going on.</p><p>It&#8217;s very much like the Mercury-Vulcan case. But every once in a while, it&#8217;s not. A priori, you can&#8217;t distinguish these. Science is just full of these. It&#8217;s funny too, the way we tell the history of science, it sounds so simple. You just focus on the right exception and you realize that you need to throw out the old theory and lo and behold, your Nobel Prize awaits. But in fact, these exceptions are all over the place. 99.9% of the time, it just turns out to be some effect like this thermal acceleration in the case of the Pioneer spacecraft. Unfortunately, there&#8217;s a lot of selection bias going into those stories.</p><p><strong>Dwarkesh Patel</strong></p><p>The thing is there&#8217;s no ex ante heuristic which tells you which case you&#8217;re in. To spell out why I think this is important, some people have this idea that AI is going to make disproportionate progress towards science because it makes disproportionate progress towards domains where there&#8217;s tight verification loops. It&#8217;s really good at coding because you can run unit tests.</p><p>Science may be similar because you can run experiments. 
What that doesn&#8217;t appreciate is that there&#8217;s an infinite number of theories that are compatible with any given experiment. Over time, why we latch onto the one we think is more correct in retrospect is, as we&#8217;re discussing, hard to articulate.</p><p>Lakatos has all kinds of interesting examples in the book about these hostile verification loops that are extremely long-lasting. One he talks about is <a href="https://en.wikipedia.org/wiki/William_Prout">Prout</a>. There&#8217;s this chemist in 1815 who hypothesizes that the atoms of every element must have whole-number weights. They&#8217;re basically all made of hydrogen. The reason he thinks this is that if you look at the measured weights of all elements, it does seem that almost all of them have whole-number weights. But then there are some exceptions. For example, chlorine comes out at 35.5.</p><p>So then there&#8217;s all these ad hoc theories that people in this school keep coming up with, like, &#8220;Oh, maybe there&#8217;s chemical impurities.&#8221; But there&#8217;s no chemical reaction you can do which seems to get rid of this. Maybe it&#8217;s fractions of whole numbers, so 35.5 can be halves. But actually, if you measure chlorine even closer, it&#8217;s 35.46, so it&#8217;s getting further away from the correct fraction. Later on, what is discovered is what you&#8217;re actually measuring is different isotopes, which cannot be chemically distinguished. They can only be physically distinguished.</p><p>So you have 85 years before we realize what an isotope is, where the verification loop is actively hostile against the correct theory. You just need this remnant to be defending&#8230; There&#8217;s no ex ante reason it&#8217;s the preferred theory. 
As a community, we should just have people try to integrate new observations, even if they don&#8217;t seem to fit their school of thought, and hopefully enough of that happens&#8230; Anyways, I guess the thing I&#8217;m trying to articulate is the difficulty with automating science.</p><p><strong>Michael Nielsen</strong></p><p>The question is, where is the bottleneck at some level? Are we primarily bottlenecked on one type of thing, or are we bottlenecked on multiple types of things? Certainly, talking to structural biology people, they seem to think that AlphaFold was an enormous advance. It was a shock.</p><p>At some level, yes, AI can certainly help us speed up science. It is helping with a certain type of bottleneck. That doesn&#8217;t mean though, as you&#8217;re saying, that it&#8217;s necessarily going to help with all kinds of bottlenecks. I suppose the question you&#8217;re pointing at is, what are the types of bottlenecks that remain, and what are the prospects for getting past them?</p><p>Even in the case of coding, it&#8217;s really interesting talking to programmer friends. At the moment they&#8217;re all in this state of shock and high excitement, and they&#8217;re all over the place. You do wonder where the bottleneck is going to move to. Certainly, one thing that a lot of them seem to be bottlenecked on now is having interesting ideas, and in particular, having interesting design ideas. There&#8217;s not really a verification loop for knowing that a design idea is very interesting.</p><p>They&#8217;re no longer nearly as bottlenecked by their ability to produce code, but they are still bottlenecked by this other thing. Formerly, they weren&#8217;t bottlenecked on it because just writing code took so much of their time. They could have lots of ideas while they were taking three weeks to implement their prototype, and then they would implement the next version. 
Now they&#8217;re taking three hours to implement the prototype, and they don&#8217;t have as good ideas after that, from a design point of view.</p><h3>00:50:54 &#8211; Why aliens will have a different tech stack than us</h3><p><strong>Dwarkesh Patel</strong></p><p>You have a very interesting take. I think it was a footnote in <a href="https://michaelnotebook.com/dci/index.html">one of your essays</a>, and I couldn&#8217;t find it again, which was that it&#8217;s very possible that if we met aliens, they would have a totally different technological stack than us. That contradicts a common assumption I had that I never questioned, which is that science is this thing you do relatively early on in the history of civilization. You get to a point and you have a couple hundred years of just cranking through the basics, understanding how the universe works, and you&#8217;ve got it. You&#8217;ve got science. Then everybody would converge on the same &#8220;science.&#8221; I found that a very interesting idea, and I want you to say more about it.</p><p><strong>Michael Nielsen</strong></p><p>The idea there that I&#8217;m at least somewhat attached to is that the tech tree or the science and tech tree is probably much larger than we realize. We&#8217;re in this funny situation. People will sometimes talk about a theory of everything as a potential goal for physics, and then there&#8217;s this presumption that physics is done once you get there. Of course, this is not true at all.</p><p>If you think about computer science, computer science started in the 1930s when <a href="https://en.wikipedia.org/wiki/Alan_Turing">Turing</a> and <a href="https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis">Church</a> and so on <a href="https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis">laid down what the theory of everything was</a>. 
They just said, &#8220;Here&#8217;s how computation works.&#8221; We&#8217;ve spent ninety-odd years since then exploring the consequences of that and gradually building up more and more interesting ideas. Those ideas, to some extent, you can regard as technology. But insofar as they&#8217;re discovered principles inside that theory of computation, I think they&#8217;re best regarded as science and in some cases, very fundamental science.</p><p>Ideas like <a href="https://en.wikipedia.org/wiki/Public-key_cryptography">public-key cryptography</a> are incredibly deep, very non-obvious ideas which lay hidden already in the 1930s. My expectation is that there will be different ways of exploring this tech tree, and we&#8217;re still relatively low down. We&#8217;re still at the point where we&#8217;re just understanding these basic fundamental theories, and we haven&#8217;t yet explored them.</p><p>A thing which I think is quite fun is if you look at the phases of matter. When I was in school, we&#8217;d get taught that there are three phases of matter, or sometimes four or five, depending on what you included. As an adult, as a physicist, you start to realize we&#8217;ve been adding to this list. We&#8217;ve got <a href="https://en.wikipedia.org/wiki/Superconductivity">superconductors</a> and <a href="https://en.wikipedia.org/wiki/Superfluidity">superfluids</a>, and maybe different types of superconductors, and <a href="https://en.wikipedia.org/wiki/Bose%E2%80%93Einstein_condensate">Bose-Einstein condensates</a>, the <a href="https://en.wikipedia.org/wiki/Quantum_Hall_effect">quantum Hall systems</a>, <a href="https://en.wikipedia.org/wiki/Fractional_quantum_Hall_effect">fractional quantum Hall systems</a>, and so on. It&#8217;s starting to turn out there&#8217;s a lot of phases of matter to discover, and we&#8217;re going to discover a lot more of them. In fact, we&#8217;re going to be able to start to design them in some sense. 
We&#8217;ll still be subject to the laws of physics, but there is this tremendous freedom in there.</p><p>This looks to me like we&#8217;re down at the bottom of the tech tree. We&#8217;ve barely gotten started there, and I expect that to be the case broadly. Certainly, programming is a very natural place to look. The idea that we&#8217;ve discovered all the deep ideas in programming just seems obviously ludicrous. We keep discovering what seem like deep, new, fundamental ideas. We&#8217;re very limited. We&#8217;re basically slightly jumped-up chimpanzees, so we&#8217;re slow and it&#8217;s taking us time. But what do we look like another million years in the future, in terms of all the different ideas people have had around how to manipulate computers and information? I think we&#8217;re likely to discover that there are a lot of very deep ideas still to be discovered.</p><p>I think it was <a href="https://en.wikipedia.org/wiki/Donald_Knuth">Knuth</a> in the preface to <em><a href="https://amzn.to/4vagtVj">The Art of Computer Programming</a></em> who says something like it. He started this book back in the sixties. He talked to a mathematician who was a bit contemptuous and said, &#8220;Look, computer science isn&#8217;t really a thing yet. Come back to me when there&#8217;s a thousand deep theorems.&#8221; Knuth remarks, writing the preface decades later, &#8220;There clearly are a thousand deep theorems now.&#8221;</p><p>It&#8217;s really interesting to think what the long-term future is as you get higher and higher up in the tech tree, choices about which direction we go and how we choose to explore. It&#8217;s potentially the case that different civilizations or different choices mean we end up in different parts of that tree. In particular, there are just very basic things about how we&#8217;re very visual creatures, while certain other animals are much more aurally based. Does that bias the types of thoughts that you have? 
Then you extend it to much more exotic kinds of civilizations where maybe their biases in terms of how they perceive and manipulate the world are quite different than ours. That might make some significant changes in terms of how they do that exploration of the tech tree. It&#8217;s all speculation, obviously.</p><p><strong>Dwarkesh Patel</strong></p><p>This is such an interesting take. I want to better understand it. One way to understand it is that there might be some things which are so fundamental and have such a wide collision area against reality that they&#8217;re inevitably going to be discovered, like general relativity.</p><p><strong>Michael Nielsen</strong></p><p>Numbers. Numbers. Of all the intelligences in the Milky Way galaxy&#8230; Maybe that number is one. Well, actually, arguably we&#8217;ve already increased the number. But of all of those, what fraction have the concept of counting? It does seem very natural. What fraction have discovered the idea of some kind of decimal place system? Interesting question. Maybe we&#8217;re missing something really simple and obvious that&#8217;s actually way better than that.</p><p>What fraction got there immediately? What fraction had to go through some other intermediate state? What fraction uses linear representations versus a two-dimensional or a three-dimensional representation? I think the answers to these questions are just not at all obvious. It&#8217;s a lot of design freedom.</p><p><strong>Dwarkesh Patel</strong></p><p>On theoretical computer science, this is going to be extremely naive and arrogant, but I took <a href="https://ocw.mit.edu/courses/6-845-quantum-complexity-theory-fall-2010/">Scott Aaronson&#8217;s class on complexity theory</a>, and I was by far the worst student he&#8217;s ever had. 
What I remember is there was this period, in which you were one of the pioneers, where we figured out the class of problems that quantum computers can solve and how it relates to problems that classical computers can solve. It was groundbreaking. It&#8217;s crazy that this works. Since then&#8230; There&#8217;s literally this website called <a href="http://www.complexityzoo.com/">Complexity Zoo</a> which lists out all the <a href="https://en.wikipedia.org/wiki/Complexity_class">complexity classes</a>. If you have this complexity class with this kind of oracle, it&#8217;s equivalent to this other class. It feels like we&#8217;re building out that taxonomy.</p><p>There are a couple ways to understand what you&#8217;re saying. One, maybe you disagree with me that this is actually what&#8217;s happened with this field. Another is that while that might happen to any one field, who would&#8217;ve thought in 1880 that computer science, other than <a href="https://en.wikipedia.org/wiki/Charles_Babbage">Babbage</a>, was going to be a thing in the first place? We&#8217;re underestimating how many more fields there could be. Or maybe you think both, or maybe a third secret thing. I&#8217;d be curious.</p><p><strong>Michael Nielsen</strong></p><p>A very common argument here is the low-hanging fruit argument. The argument that says there should be diminishing returns.</p><p><strong>Dwarkesh Patel</strong></p><p>In fact, empirically we see this. The amount of scientists in the world has exponentially increased.</p><p><strong>Michael Nielsen</strong></p><p>I think it&#8217;s worth thinking about why you expect diminishing returns and how well that argument actually applies in practice. An analogy I like is thinking about going to an event, like a wedding, and you go to the dessert buffet. They&#8217;ve put out thirty desserts. Naturally, what people do is the best desserts go first. 
We don&#8217;t quite have a well-ordered preference there, so maybe there&#8217;s some difference, but human beings are fairly similar, so the best desserts will go first. This is an argument for why you expect diminishing returns in a lot of different fields. If it&#8217;s relatively easy to see what&#8217;s available and people have similar preferences, then the best stuff goes first and it just gets worse and worse after that.</p><p>If you look at a very static snapshot in time of scientific progress, maybe there&#8217;s some truth to that. But if somebody is standing behind the dessert table and is replenishing and restocking the desserts and keeps adding new ones in, it may turn out that a little bit later, much better desserts appear, and you&#8217;re going to go and eat those instead.</p><p>Scientific progress has a little bit of that flavor. We go through these funny time periods. Computer science is a great example, where computer science basically arose as a side effect of some pretty abstruse questions in the <a href="https://plato.stanford.edu/entries/philosophy-mathematics/">philosophy of mathematics</a> and logic. You&#8217;ve got these people trying to attack these rather esoteric questions that seem quite high up in exploration, and they discover this fundamental new field, and all of a sudden there&#8217;s an explosion there. The diminishing returns argument just didn&#8217;t apply there. We just weren&#8217;t able to see what was there.</p><p>This has been the case over and over again. New fields arrive, and all of a sudden, boom, it&#8217;s easy to make progress again. Young people flood in because you can be twenty-one and make major breakthroughs rather than having to spend twenty-five years mastering everything that&#8217;s been done before. It&#8217;s obviously very attractive. 
I&#8217;m not sure anybody understands very well the dynamics of that, or how to think about why the structure of knowledge is that way, where these new fields keep opening up. But it does seem empirically to be the case.</p><p><strong>Dwarkesh Patel</strong></p><p>Despite the fact that that is the case&#8230; Take deep learning. Obviously, this is an example of a new field where twenty-one-year-olds can make progress and it&#8217;s relatively new. Fifteen years or so since it got back into high gear. But already we&#8217;re in a stage where you need billions, tens of billions, or hundreds of billions of dollars to keep making progress at the frontier.</p><p>There are a couple ways to understand that. One is that it actually is harder than the kinds of things the ancients had to do, or is more intensive at least. Second is it might not have been, but because our civilizational resources are so large, the amount of people is so large, the amount of money is so large, we can basically make the kind of progress it would have taken the ancients forever to make almost immediately. We notice something is productive and immediately dump in all the resources. But it&#8217;s also weird that there&#8217;s not that many of them. I feel like deep learning is notable because it is one big exception to the fact that it&#8217;s hard to think of other examples.</p><p><strong>Michael Nielsen</strong></p><p>I think that&#8217;s a consequence of the architecture of attention. At any given time, there&#8217;s always a most successful thing. If deep learning wasn&#8217;t a thing, maybe you&#8217;d be talking about <a href="https://en.wikipedia.org/wiki/CRISPR">CRISPR</a>. Maybe we wouldn&#8217;t think about solving the protein structure prediction problem as a success of AI. 
Maybe we would have figured out how to do it with curve fitting, more broadly construed, and we&#8217;d just be like, &#8220;Wow, that took a lot of computing resources.&#8221; But protein structure prediction might be an enormously important thing.</p><p>There is always our biggest thing. What you&#8217;re pointing at is more a consequence of the way in which attention gets centralized. It&#8217;s basically fashion, is what I&#8217;m saying. It&#8217;s not just fashion, but there is some dynamic there.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s a very interesting and important implication of this idea. That the branching is so wide and so contingent and so path-dependent that different civilizations would stumble on entirely different technology stacks. There&#8217;s a very interesting implication that there will be gains from trade into the far, far future, which might actually be one of the most important facts about the far future in terms of how civilizations are set up, how they coordinate, and how they interface. There&#8217;s not this &#8220;go forth and exploit.&#8221; There are humongous gains to trade from adjacent colonies or whatever.</p><p><strong>Michael Nielsen</strong></p><p>Sort of. There&#8217;s a question of what&#8217;s actually hard. If it&#8217;s just the ideas, well, those spread relatively quickly. It&#8217;s relatively easy to share ideas. If it&#8217;s something more, it&#8217;s almost a <a href="https://danwang.co/">Dan Wang</a> kind of idea where there&#8217;s some notion of capacity. You need all the right techs, you need all the right manufacturing capacity, and so on.</p><p>So civilization A has a very different kind of manufacturing capacity, and it&#8217;s just not so easy to build in civilization B. Even if civilization B is ahead, I think that becomes true. There is a comparative advantage which is going to provide massive benefits to trade in both directions. Eventually, you expect some diffusion of innovation. 
It is funny to think about what the barriers are there.</p><p>A fun thought experiment I like to think about is GitHub but for aliens. Somebody presents you with all of the code from some alien civilization. I don&#8217;t even know what code means there, but their specification of algorithms. It would have many interesting new ideas in there, and it would take forever for human beings to dig through and try and extract all of those.</p><p>The origin of this for me was thinking about proteins in nature. We&#8217;ve been gifted this incredible variety of machines which we don&#8217;t really understand at all. We just have to go and try and understand them on a one-by-one basis. We&#8217;re still understanding hemoglobin and insulin and things like this. There are hundreds of millions of proteins known. So it is a little bit like that. We&#8217;ve been gifted by biology this immense library of machines, no doubt containing an enormous number of very interesting ideas, and we&#8217;re just at the very, very beginning of understanding it.</p><p>I suppose your point&#8212;I need to relabel your argument slightly&#8212;but you think of that as a gift from an alien civilization, which obviously it isn&#8217;t, but you think of it that way. And oh my goodness, there&#8217;s so much in there and we&#8217;re going to study it. Goodness knows how long we could continue to study it. There are tens of thousands of papers about hemoglobin and things like that, and we still don&#8217;t understand them, and yet we&#8217;re getting so much out of it. Just think about insulin alone. It&#8217;s such an important thing.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s an incredibly useful intuition pump, that you have on Earth&#8230; <a href="https://www.dwarkesh.com/p/nick-lane">I had Nick Lane on</a> where he had this theory about how life emerged, but whatever theory you have, something like DNA has had four billion years. 
You have an alien civilization come here and be like, &#8220;There&#8217;s all these interesting things to learn about material science.&#8221;</p><p><strong>Michael Nielsen</strong></p><p>Think about <a href="https://en.wikipedia.org/wiki/Kinesin">kinesin</a> walking along. We know almost nothing about these proteins, and yet the tiny few facts we do know are just incredible. The <a href="https://en.wikipedia.org/wiki/Ribosome">ribosome</a> is another example, this miraculous sort of device, a little factory.</p><p><strong>Dwarkesh Patel</strong></p><p>All seeded by this particular chemistry on Earth with nucleic acids and carbon-based life forms. That chemistry gives rise to all of these interesting things which an alien civilization would find very interesting. That very seed, which must be one among trillions of possible seeds of general intellectual ideas, leads to all this fecundity. That&#8217;s a very interesting intuition pump.</p><p>I want to meditate on this &#8220;gains from trade&#8221; thing because I feel like there&#8217;s something very interesting about this idea that if you have this vision of how technology progresses and how it may be different in different civilizations, it actually has important implications about how different civilizations might interact with each other. The fact that there are going to be these huge gains from trade.</p><p><strong>Michael Nielsen</strong></p><p>It makes friendliness much more rewarding?</p><p><strong>Dwarkesh Patel</strong></p><p>Yes. That&#8217;s a very important observation.</p><p><strong>Michael Nielsen</strong></p><p>I hadn&#8217;t thought about that at all. That is a very interesting observation. It is funny. <a href="https://en.wikipedia.org/wiki/Comparative_advantage">Comparative advantage</a> is something that people love to invoke and it&#8217;s a very beautiful idea obviously. There are limits to it. 
It&#8217;s a special limited model.</p><p>Chimpanzees can do interesting things, but we don&#8217;t trade with them. I think it&#8217;s interesting to think about the reasons why. Part of it is just power, I think. Once there&#8217;s a sufficiently large power imbalance, very often&#8212;not always, but very often&#8212;groups of people seem to shift into this other mode where they just seek to dominate. Maybe there&#8217;s something special about human beings, but maybe it&#8217;s also a more general thing. You need all these special things to be true before groups will trade. It&#8217;s not necessarily obvious.</p><p><strong>Dwarkesh Patel</strong></p><p>I think the big thing going on here is one, transaction costs. Two, comparative advantage does not tell you that the terms on which the trade happens are above subsistence for any given producer. People often bring this up in the context of, &#8220;Well, humans will be employed even in a post-AGI world because of comparative advantage.&#8221;</p><p>There are five different ways that argument breaks down, but the easiest way to understand it is: why don&#8217;t we have horses all around on the roads? After all, there&#8217;s some comparative advantage between cars and horses. One, there are huge transaction costs to building roads that are compatible with horses and cars at the same time. In a similar way, AIs that think at 1,000 times human speed and can shoot their <a href="https://arxiv.org/abs/2405.14061">latent states</a> at each other are going to find that interacting with a human being in the supply chain costs far more than the benefit.</p><p>Second, just because horses have a comparative advantage mathematically does not mean that it is worth paying $100,000 a year, or whatever it costs to sustain a horse in San Francisco. 
That subsistence isn&#8217;t going to be worth the benefit you get out of the horse.</p><p><strong>Michael Nielsen</strong></p><p>I do think it&#8217;s interesting, the sheer fact&#8230; My expectation and my intuition obviously differs a great deal from yours on this. Most parts of the tech tree are never going to be explored. There are just too many interesting ways of combining things. There are too many deep ideas waiting to be discovered, and not only we, but nobody ever is going to discover most of them. So choices about how to do the exploration actually matter quite a bit.</p><p>It&#8217;s something I really dislike about technological determinist arguments. I&#8217;m willing to buy it low enough down when progress is relatively simple. But higher up, you start to get to shape the way in which you do the exploration. And it&#8217;s interesting, we are starting to shape it in interesting ways.</p><p>There are various technologies that have been essentially banned. You think about DDT, chlorofluorocarbons, restrictions on the use of nuclear weapons, the Nuclear Non-Proliferation Treaty. Those kinds of things weren&#8217;t done before the fact, but they&#8217;re starting to get pretty close in some cases, where we just preemptively decide, &#8220;Oh, we&#8217;re not going to go down that path.&#8221; So that starts to look like a set of institutions where we are actually influencing how we explore the tech tree.</p><p><strong>Dwarkesh Patel</strong></p><p>On where you would see these gains from trade, obviously you&#8217;d see the most where it&#8217;s pure information that could be sent back and forth, because the information has this quality where it is expensive to produce, but cheap to verify and cheap to send. It&#8217;ll be interesting how much of future productivity can be distilled down to information.</p><p>Right now, it&#8217;s hard to do. 
If China&#8217;s really good at manufacturing something, there&#8217;s this process knowledge that&#8217;s in the heads of 100 million people involved in the manufacturing sector in China. But in the future, it might be easier if AIs are doing it.</p><p><strong>Michael Nielsen</strong></p><p>The question is to what extent our fabrication gets very uniform and gets really commoditized. 3D printers have been the next big thing for at least 20 years now. Why do they still not work all that well? Why are they still not at the center of manufacturing, and what comes after that? It is funny to look at the ribosome by contrast, which really is at the center of biology in a whole lot of really interesting ways.</p><p>Whether or not that&#8217;s the future of manufacturing is something very simple, where everything goes as throughput through a <a href="https://en.wikipedia.org/wiki/Bioreactor">bioreactor</a> or something like that. You send the information, and then you grow stuff, or you have some 3D printer that actually works. If they&#8217;re good enough, then it does become much more a pure information problem, and some of this process knowledge becomes much less important.</p><h3>01:15:26 &#8211; Are there infinitely many deep scientific principles left to discover?</h3><p><strong>Dwarkesh Patel</strong></p><p>Can I ask a very clumsily phrased question? There are these deep principles that we&#8217;ve discovered a couple of. One is this idea that <a href="https://en.wikipedia.org/wiki/Noether%27s_theorem">if there&#8217;s a symmetry across a dimension, it corresponds to a conserved quantity</a>. It&#8217;s a very deep idea. 
There&#8217;s another&#8212;which you&#8217;ve written a lot about, written a textbook about in fact&#8212;about ways to understand what kinds of things you can compute, what kinds of physical systems you can understand with other physical systems, what a universal computer looks like, et cetera.</p><p>Is your view that if you go down to this level of idea of <a href="https://en.wikipedia.org/wiki/Noether%27s_theorem">Noether&#8217;s theorem</a> or the <a href="https://en.wikipedia.org/wiki/Church%E2%80%93Turing_thesis">Church-Turing principle</a>, that there&#8217;s an infinite number of extremely deep such principles? Because I feel what makes them special is that they themselves encompass so many different possible ways the world could be. But no, the world has to be compatible with a couple of these very deep principles.</p><p><strong>Michael Nielsen</strong></p><p>I don&#8217;t know. All I have here is speculation and instinct. My instinct is that we keep finding very fundamental new things. It was quite formative for me to understand, as I gave the example before, these wonderful ideas of Church and Turing and these other people about universal programmable devices. Then you understand later, this also contains within it the ideas of public-key cryptography. Then you understand later, that also contains within it the ideas people refer to as cryptocurrency.</p><p>There&#8217;s a very deep set of ideas there about the ability to collectively maintain an agreed-upon ledger, which is built upon this. It&#8217;s taken many years to figure out the right canonical form of those. Just this fact that you keep finding what seem like deep new fundamental primitives has been a very important intuition pump for me. 
I&#8217;ve given that particular example, but I think you see that same pattern in a lot of different areas.</p><p><strong>Dwarkesh Patel</strong></p><p>What is your interpretation then of this empirical phenomenon where whatever input you consider into the scientific process or technological progress&#8230; Economists have studied this a million ways. It just seems to require a very consistent rate of X percent more researchers per year. There&#8217;s this <a href="https://web.stanford.edu/~chadj/IdeaPF.pdf">famous paper from a couple years ago by Nicholas Bloom and others</a> where they say, &#8220;How many people are working in the semiconductor industry, and how has it increased over time through the history of <a href="https://en.wikipedia.org/wiki/Moore%27s_law">Moore&#8217;s law</a>?&#8221; I think they find that Moore&#8217;s law means transistor density increases 40% a year, but to keep that going the number of scientists has increased 9% a year, in the semiconductor industry. They go through industry after industry with this observation.</p><p>Is your view that there are these deep ideas, but they keep getting harder to find? Or is there another way to think about what&#8217;s happening with these empirical observations?</p><p><strong>Michael Nielsen</strong></p><p>First of all, all of their examples are narrow. They pick a particular thing, and then they look at a particular metric. GPUs don&#8217;t show up there. All of a sudden you get this ability to parallelize, and that&#8217;s really interesting. There are a lot of external consequences. Basically they have these simple quantitative measures. They look at it in agricultural productivity. They look at it in a whole lot of different ways, but you do have to focus narrowly.</p><p>I&#8217;m certainly interested in the fact that new types of progress keep becoming possible. But I think even there, there does still seem to be some phenomenon of diminishing returns. Is that intrinsic? 
Is that something about the structure of the world? What is it? One thing which hasn&#8217;t changed that much is the individual minds which are doing this kind of work. Maybe those should be improved as well, or some feedback process going on there. Maybe that changes the nature of things.</p><p>I look at scientific progress up until, let&#8217;s say, 1700, and it was very slow, and also very irregular. You had the Ionians back five centuries before Christ doing these quite remarkable things, and so much knowledge would get lost, and then it would be rediscovered, and then it would be lost again. You&#8217;d have to say that progress was very slow. It&#8217;s partially just bound up with the fact that there were some very good ideas that we just didn&#8217;t have.</p><p>Even once you&#8217;ve had the ideas, you need to build institutions around them. You actually need to solve a whole lot of different problems about training, allocation of capital, and all these kinds of things. Even just basic security for researchers, so they&#8217;re not worried about the <a href="https://en.wikipedia.org/wiki/Inquisition">Inquisition</a> or things like that. There are all these complicated problems. You solve all those complicated problems, and then all of a sudden, boom, there&#8217;s a massive burst of scientific progress.</p><p>If there&#8217;s some kind of stagnation, if you&#8217;re not changing those external circumstances, yes, you may start to get diminishing returns again. But that doesn&#8217;t mean there&#8217;s anything intrinsic about the situation. Maybe something external needs to change again. Obviously, a lot of people think AI is potentially going to be a driver. It certainly will at some level.</p><p>To that extent, you can think of a lot of modern scientific instrumentation as really, at some level, robots. What is the <a href="https://en.wikipedia.org/wiki/James_Webb_Space_Telescope">James Webb Space Telescope</a>? 
It&#8217;s unconventional maybe to describe it as a robot, but it&#8217;s not completely unreasonable either. It is an example of a highly automated, very sophisticated system with electronically mediated sensors and actuators, where machine learning is being used to process the data. In that sense, we&#8217;re already starting to see that transition. We&#8217;ve been seeing it for decades.</p><p><strong>Dwarkesh Patel</strong></p><p>I have this &#8220;smoke a joint and take a puff&#8221; thought, which&#8212;</p><p><strong>Michael Nielsen</strong></p><p>I think we&#8217;ve had a few.</p><p><strong>Dwarkesh Patel</strong></p><p>I think we&#8217;re getting to that part of the conversation, and then you can help me get my foot out of my mouth and figure out a more concrete way to think about it. To your point, there was the Industrial Revolution, the Enlightenment, and now there&#8217;s AI, and each might be a different pace or a different way in which science happens. If you think about the pace of how fast such transitions have been happening, you can draw over the long span of human history this hyperbolic rate of growth that is increasing over time as well.</p><p>A hundred thousand years ago, you had the Stone Age. You go back even much further, how long have primates been around? It would be millions of years. A hundred thousand years ago, the Stone Age, then ten thousand years ago, the Agricultural Revolution, then three hundred years ago, the Industrial Revolution, each marked by this increase in the rate of exponential growth. Then people think it&#8217;s going to happen again with AI. But that would happen potentially even faster.</p><p>It would not have occurred to somebody at the beginning of the Industrial Revolution that the next demarcation in this trend would be artificial intelligence. So things are getting faster, and it&#8217;s hard to anticipate what the next transition will be. 
I guess we just think of this singularity between now and AI as what distinguishes the past from the future. But applying the same heuristic that many people in the past should have had, maybe the &#8220;Intelligence Age&#8221; is also quite short, and we don&#8217;t even have the ontology to describe the next thing after that. The future will not think of the past as pre-AI and post-AI.</p><p><strong>Michael Nielsen</strong></p><p>No, obviously we can&#8217;t prove this, but it certainly seems quite plausible. Part of the issue is just that the substrate we have available to conceive seems all wrong. You can&#8217;t speculate with a bunch of chimpanzees about what it would be to have language. Just to pick a major transition in the past, the transition itself is the thing. It seems likely.</p><p>If we&#8217;re talking about &#8220;taking a puff&#8221; kind of thoughts, I&#8217;m certainly amused by the idea that there&#8217;s going to be some transition involving artificial general intelligence using classical computers. But actually, there&#8217;ll be an interesting transition with quantum computers as well. They&#8217;re probably capable of a strictly larger class of potentially interesting computations. So maybe the character of AQGI, or whatever it should be called, is actually qualitatively different. So maybe there&#8217;s a brief period between those two things. As I say, this is just speculation, but it&#8217;s certainly amusing.</p><p><strong>Dwarkesh Patel</strong></p><p>Is there a reason to think that? From what I understand, for decades people like you have put pretty tight bounds on the kinds of things quantum computers are going to do. It&#8217;ll speed up search somewhat. 
The kinds of things it speeds up extremely, like <a href="https://en.wikipedia.org/wiki/Shor%27s_algorithm">Shor&#8217;s algorithm</a>, it seems like&#8230; Again, maybe this is to your point that we can&#8217;t predict in advance what&#8217;s down the tech tree, but at least from here, it seems like you break encryption, but what else are you using Shor&#8217;s algorithm to do?</p><p><strong>Michael Nielsen</strong></p><p>We&#8217;ve only been thinking about it for 40 or so years. Not for very long, and we haven&#8217;t thought that hard about it as a civilization. Does it turn out that it&#8217;s very narrow? Maybe. Does it turn out that it&#8217;s very broad? That&#8217;s also a really radical expansion that seems distinctly possible. Keep in mind as well, we&#8217;ve been doing it without the benefit of having the devices. That&#8217;s a pretty big bottleneck to have.</p><p><strong>Dwarkesh Patel</strong></p><p>If you&#8217;re thinking about computer science in the 1700s and you&#8217;re like, &#8220;it can do <a href="https://en.wikipedia.org/wiki/Boolean_algebra">AND/OR</a>, what can come out of that?&#8221; You can&#8217;t anticipate Bitcoin. You can&#8217;t anticipate deep learning.</p><p><strong>Michael Nielsen</strong></p><p>Maybe you could if you were sufficiently bright, but it is a pretty hard situation.</p><h3>01:26:25 &#8211; What drew Michael to quantum computing so early?</h3><p><strong>Dwarkesh Patel</strong></p><p>What is your inside view, having been in and contributing to <a href="https://en.wikipedia.org/wiki/Quantum_information">quantum information</a> and quantum computing back in the &#8216;90s and 2000s? What is your telling of the history of what was the bottleneck? What was the key transition that made it a real field? 
How do you rank the contributions from <a href="https://en.wikipedia.org/wiki/Richard_Feynman">Feynman</a> to <a href="https://en.wikipedia.org/wiki/David_Deutsch">Deutsch</a> to everybody else who came along?</p><p><strong>Michael Nielsen</strong></p><p>Let&#8217;s just focus on the question about what actually changed. Why was quantum computing not a thing in the 1950s? It could have been. Somebody like <a href="https://en.wikipedia.org/wiki/John_von_Neumann">John von Neumann</a> is a good example. He was absolutely pioneering computation. He also wrote <a href="https://en.wikipedia.org/wiki/Mathematical_Foundations_of_Quantum_Mechanics">a very important book about quantum mechanics</a> and was deeply interested in it. He could have invented quantum computing at that time, and I think there were quite a number of people who potentially could have.</p><p>So why do we have these papers by people like Feynman and Deutsch in the &#8216;80s? Those are fairly regarded as the foundation of the field. There are some partial anticipations a little bit earlier, but they were nowhere near as comprehensive and nowhere near as deep. You should ask David. You can&#8217;t ask Feynman, unfortunately, but he&#8217;ll know much better than I do.</p><p>A couple things that I think are interesting. One is that computation became far more salient in the late &#8216;70s and early &#8216;80s. It just became a thing which many more people were interested in, partially for very banal reasons. You could go and buy a PC. You could buy an Apple II. You could buy a Commodore 64. You could buy all these kinds of things. It became apparent to people that these were very powerful devices, very interesting to think about.</p><p>At the same time, in the quantum case, that was also the time of the <a href="https://en.wikipedia.org/wiki/Ion_trap">Paul trap</a> and the ability to trap single ions. Up to that point, we hadn&#8217;t really had the ability to manipulate single quantum states. 
You got these two separate things that for historically contingent reasons had both matured around 1980 or so. Somebody like von Neumann could have had the idea earlier, but it is quite an interesting factor.</p><p>There&#8217;s a story about Richard Feynman. He went and got one of the first PCs around 1980 or 1981. He was apparently so excited with this device, he actually tripped and hurt himself quite badly carrying his brand-new computing device. That&#8217;s a very historically contingent coincidence, having somebody who&#8217;s very talented and understanding of quantum mechanics also just very excited about these new machines. It&#8217;s not so surprising perhaps that he&#8217;s thinking about it then. What similar story could you have told 10 years earlier? The conditions don&#8217;t exist for it. I mean, it&#8217;s quite a banal story, but&#8230;</p><p><strong>Dwarkesh Patel</strong></p><p>One of the things we were going to discuss was this idea you had about the market for follow-ups. I think this is the perfect story to discuss it for because you wrote the textbook about the field. &#8220;Mike and <a href="https://en.wikipedia.org/wiki/Isaac_Chuang">Ike</a>&#8221; is <a href="https://amzn.to/48q2uR9">the definitive textbook on quantum information</a>. You presumably came in after Deutsch.</p><p>But you in the &#8216;90s somehow identified it as the thing that is worth following up on and building on. Instead of talking about it more abstractly, I&#8217;d love to just hear the firsthand story of how you knew that this is the thing to do. Of all the things that were happening in physics and computing, how did you decide you want to think about this problem?</p><p><strong>Michael Nielsen</strong></p><p>Richard Feynman writes <a href="https://s2.smu.edu/~mitch/class/5395/papers/feynman-quantum-1981.pdf">this great paper in 1982</a>. 
David Deutsch writes <a href="https://www.daviddeutsch.org.uk/wp-content/deutsch85.pdf">an absolutely fantastic paper in 1985</a> sketching out a lot of the fundamental ideas of quantum computing. I&#8217;m 11 in 1985. I&#8217;m not thinking about this. I&#8217;m playing soccer and doing whatever. But in 1992, I took a class on quantum mechanics that was really terrific, given by <a href="https://en.wikipedia.org/wiki/Gerard_J._Milburn">Gerard Milburn</a>.</p><p>I just went and asked Gerard one day after the fifth lecture or something. I said, &#8220;Do you have any papers or whatever that you could give me?&#8221; He said, &#8220;Come by my office in a couple of days&#8217; time.&#8221; I did, and he presented me with a giant stack of papers, which included the Deutsch paper, the Feynman paper, and a whole bunch of other very fundamental papers about quantum computing and quantum information at a time when essentially nobody in the world was working on it. He was. I think he wrote <a href="https://espace.library.uq.edu.au/data/UQ_247726/UQ247726.pdf?Expires=1775595440&amp;Key-Pair-Id=APKAJKNBJ4MJBJNC6NLQ&amp;Signature=Db5jmtpr-AboCkV7t~zhgenU0rThA~PDRp7ifiHWuPdVCblDKWN-X-A02KyO-0LJcWxys3znMBM6OYA7g5TyBeIjbXk1P3UCt6bIbXWyyNuBqPDzWQCyPQz95hSsEAgXHG~MeScDcHpW8kPdn-5YGreO085P-F238wEplHu41hPvIDUeeCE0qcqmq4~n8ZymnOvcHNTHGjW~f6NABnSJd2FPsGyUp09GUtxXJ-U89Q6gJy4Yjkq~Vbk4-~me5~Rs2h041TvKif33zApKxMnINLSmkEgX5qEH-B0fVcB-BkNDcgFxVpKEFHDYs5JJlDILV~tGjbHD9KK~fvualLNx6w__">the very first paper that proposed a practical approach to quantum computing</a>. It wasn&#8217;t very practical, but it was actually in a real system.</p><p>So in some sense, I&#8217;m benefiting from the taste of this other person. As soon as I read the papers&#8230; These are exciting papers. They&#8217;re asking very fundamental questions, and you realize I can make progress here. 
These are things that one could potentially work on.</p><p>Deutsch has this conjecture, or thesis or whatever you&#8217;d call it, that a universal model, a <a href="https://en.wikipedia.org/wiki/Quantum_Turing_machine">quantum Turing machine</a>, should be capable of efficiently simulating any physical system at all. This is a very provocative idea. I think in that paper, he more or less claims that he&#8217;s proved it. I&#8217;m not sure everybody would agree with that. There are questions about whether or not you can simulate <a href="https://en.wikipedia.org/wiki/Quantum_field_theory">quantum field theory</a> effectively. That kind of question is very interesting and very exciting. It&#8217;s obviously a fundamental question about the universe.</p><p>He has some wonderful ideas in there about <a href="https://en.wikipedia.org/wiki/Quantum_algorithm">quantum algorithms</a>, where they come from, what they mean, and how they relate to the meaning of the <a href="https://en.wikipedia.org/wiki/Wave_function">wave function</a>. Questions like this are still not agreed upon amongst physicists. There&#8217;s just some sense of, &#8220;Oh, I am in contact with something which is (A) deeply important, and (B) we as a civilization don&#8217;t have this.&#8221; Of course, you start to focus your attention a little bit there.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m not sure I got the answer to the question&#8230;</p><p><strong>Michael Nielsen</strong></p><p>Maybe I misunderstood the question.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe I&#8217;ll explain the motivation first. 
In a previous conversation, we were discussing how you could have known in the 1940s that the Shannon theorems and <a href="https://en.wikipedia.org/wiki/Claude_Shannon">Shannon&#8217;s</a> way of thinking about a communication channel is a deep idea that goes beyond the problems with <a href="https://en.wikipedia.org/wiki/Pulse-code_modulation">pulse-code modulation</a> that Bell Labs was trying to solve at the time, and that it applies to everything from quantum mechanics to genetics to computer science.</p><p>One of the ideas you stated that we didn&#8217;t get a chance to talk about yet&#8230; Shannon published this paper. There are all these other papers, but there&#8217;s some market of follow-ups where people gravitate to and build upon Shannon&#8217;s work. How do they realize that that&#8217;s the thing to do, and how does that process happen? I guess you gave your local answer. You read these papers, and you immediately realized there&#8217;s work to be done here. There&#8217;s low-hanging fruit. There&#8217;s some deep provocative idea that I need to better understand, and I could tractably make progress on.</p><p><strong>Michael Nielsen</strong></p><p>To some extent, you&#8217;re saying, &#8220;Okay, I wanted to get into this game of contributing to humanity&#8217;s understanding of the universe,&#8221; and you are applying this low-hanging fruit algorithm. You&#8217;re like, &#8220;Relative to my particular set of interests and abilities, where should I pick up my shovel and start digging?&#8221; There it was like, &#8220;Oh, this looks like quite a good place to start digging.&#8221; Different people, of course, chose very differently. It was a very unusual choice at the time. This was 1992. 
Very few people were thinking about that.</p><h3>01:35:29 &#8211; Does science need a new way to assign credit?</h3><p><strong>Dwarkesh Patel</strong></p><p>Fast-forwarding a bit, I don&#8217;t know how you think about your work on the open science movement now, but did it work? What does success there look like? What is the movement trying to accomplish?</p><p><strong>Michael Nielsen</strong></p><p>It&#8217;s interesting. You didn&#8217;t stop and define open science there, which 20 years ago you would have had to do. People recognize the phrase. People have some set of associations with it. Most often, they have a relatively simple set of associations. It means maybe something about making scientific papers open access. Very often they have some set of notions about also making code openly available or making data openly available.</p><p>Those are already very large successes of the open science movement, to make those salient issues. Those are issues on which people have opinions, and there are relatively common arguments. This is like the meme version: publicly funded science should be open science. That&#8217;s a distillation of a set of ideas which you might be able to contest. But if you can get people actually thinking about it and engaged with that kind of argument, that&#8217;s a very fundamental issue to be considering in the whole political economy of science.</p><p>If you go back three centuries, there was a very similar argument prosecuted, which is the question: do we publicly disclose our scientific results or not? If you look at people like Galileo and Kepler, the extent to which they publicly disclosed was done in a very odd way. Sometimes they did bizarre things where they <a href="https://cryptiana.web.fc2.com/code/galileo.htm">published some of their results as anagrams</a>. They&#8217;d find some discovery, write down the result in a sentence, scramble it, and publish that. 
Then if somebody else later made the same discovery, they would unscramble the anagram and say, &#8220;Oh, yeah, I actually did it first.&#8221; This is not an ideal foundation for a discovery system.</p><p>It took a very long time, over a century, I think, to obtain more or less the modern ideals, in which you disclose the knowledge in the form of a paper. There is an expectation of attribution, and a reputation economy gets built. &#8220;So-and-so did this work, so they deserve the credit for that,&#8221; and that&#8217;s the basis for their careers. This is the underlying political economy of science. That made a lot of sense when you have a printing press and the ability to do scientific journals.</p><p>Then you transition to this modern situation, where you can start to share a lot more. You can share your code, your data, your in-progress ideas. But there&#8217;s no direct credit associated to those. It&#8217;s not at all obvious how much reputation should be associated to them. That&#8217;s all constructed socially. Making it a live issue is a very important thing to have done. I view that as one of the main positive outcomes of work on open science.</p><p>I&#8217;ll give you a really practical example to illustrate the problem. For a long time in physics, there was a preprint culture in which people would upload preprints to the preprint archive, and in biology, this didn&#8217;t happen. There was no preprint culture. That&#8217;s changing now, but for a long time, this was the case. I used to amuse myself by asking physicists and biologists why this was the case.</p><p>What I would hear from biologists was they would say, &#8220;Biology is so much more competitive than physics that we need to protect our priority, so we can&#8217;t possibly upload to the archive. 
We have to just publish in journals.&#8221; Then I would sometimes hear from physicists, &#8220;Physics is so much more competitive than biology that we need to establish our priority by uploading as rapidly as possible to the preprint archive. We can&#8217;t possibly wait to do it with the journals.&#8221;</p><p>I think this emphasizes the extent to which this kind of attribution economy is just something we construct. It&#8217;s something we do by agreement. Any attempt to change that economy results in a different system by which we construct knowledge. There is this very fundamental set of problems around the political economy of science. We&#8217;ve got this collective project, and how we mediate it depends upon the economy we have around ideas.</p><p><strong>Dwarkesh Patel</strong></p><p>One of the things you&#8217;ve emphasized as a part of this project of open science, and we talked about it earlier, is collective science, or groups of people making progress on a problem where no individual understands all the logical and explanatory levels necessary to make a leap or a connection. Outside of mathematics, what is the best example of such a discovery?</p><p><strong>Michael Nielsen</strong></p><p>I&#8217;m not sure I have a well-ordering of them to give you a best. An example that I think is very interesting is the <a href="https://en.wikipedia.org/wiki/Large_Hadron_Collider">LHC</a>, where it&#8217;s just this immensely complicated object. Years ago, I snuck into an accelerator physics conference. I didn&#8217;t know anything at all about <a href="https://en.wikipedia.org/wiki/Accelerator_physics">accelerator physics</a>, but I was just curious to see what they were talking about.</p><p>This particular group of people were experts on numerical methods, in particular on inverse methods. Inside these accelerators, you have these <a href="https://en.wikipedia.org/wiki/Collision_cascade">cascades</a>. 
A particle will be massively accelerated, maybe it&#8217;ll be collided, and then you&#8217;ll get a shower of particles which decays and decays and decays. There&#8217;s just this incredible, consequential shower, which is ultimately what you see at the <a href="https://en.wikipedia.org/wiki/Particle_detector">detector</a>. Then you have to retroactively figure out what produced it. There are these very complicated inverse problems that need to be solved. You&#8217;ve got this final data, but you need to figure out what produced it, and that&#8217;s how you look for signatures of these.</p><p>Many of these people were incredibly deep experts on simulation methods for following particle tracks. This was really deep and difficult stuff. I was like, &#8220;Wow, you could spend a lifetime just learning how to do this and how to solve some of these inverse problems, and you would know very little about quantum field theory, detector physics, vacuum physics, or data processing, all these things that are absolutely essential to understanding, say, the <a href="https://en.wikipedia.org/wiki/Higgs_boson">Higgs boson</a>&#8221;.</p><p>I don&#8217;t think it&#8217;s possible for one person to understand everything in depth. Lots of people broadly understand a lot of these ideas, but they don&#8217;t understand everything in the depth that is actually utilized. That&#8217;s why there are these papers with well over a thousand authors. Those people can talk to one another at a high level, but they don&#8217;t understand each other&#8217;s specialties in all that much depth. Things like detector physics, vacuum physics, solving inverse problems, this stuff is incredibly different from each other. To understand it in real detail is serious work.</p><h3>01:43:57 &#8211; Prolificness versus depth</h3><p><strong>Dwarkesh Patel</strong></p><p>How do you think about prolificness versus depth? 
Maybe Darwin&#8217;s an example of somebody who&#8217;s just gestating on something for many decades. There are other examples. Einstein during the year he comes up with special relativity is just doing a bunch of different things. And <a href="https://en.wikipedia.org/wiki/Abraham_Pais">Pais</a> talks about how they were all relevant to the eventual build-up.</p><p><strong>Michael Nielsen</strong></p><p>It&#8217;s something I stress about a lot. Sometimes I feel I&#8217;m too slow. It&#8217;s funny though, the Darwin example is really interesting. Prolific at what? God knows how many letters he wrote. It must have been an enormous number. So he was certainly very active.</p><p>There&#8217;s two types of work that tend to be involved in any kind of creative project. There&#8217;s routine stuff, and there you just want to avoid procrastination. You just want to ask, &#8220;How do I get good at this?&#8221; or &#8220;How do I outsource it?&#8221; and &#8220;How do I do it as rapidly as possible?&#8221; and just avoid getting into a situation where you&#8217;re prolonging it.</p><p>Then there&#8217;s high-variance stuff where you actually need to be willing to take a lot of time. You need to be willing to go to different places and talk to different people, where in any given instance, most of it is just not going to be an input. Somehow balancing those two things&#8230; I think a lot of people are very good at doing one or the other, but it&#8217;s almost like a personality trait which one you prefer. People tend to end up doing a lot of one and not enough of the other. So I certainly try and balance those two things.</p><p>Einstein is such an interesting example. <a href="https://en.wikipedia.org/wiki/Annus_mirabilis_papers">1905 is just this extraordinary year</a>. You can delete special relativity entirely, and it&#8217;s an extraordinary year. 
You can delete special relativity, and you can delete the <a href="https://en.wikipedia.org/wiki/Photoelectric_effect">photoelectric effect</a> for which he won the Nobel Prize, and it&#8217;s still an extraordinary year, plausibly a multi-Nobel-Prize-winning year. So what&#8217;s he doing? Maybe the answer is just that he&#8217;s smarter than the rest of us. There&#8217;s a lot of luck as well.</p><p>Certainly for myself anyway, trying to identify those things that are routine that I should get good at, and then just try to do them as quickly as possible. I think that&#8217;s yielded a certain amount of returns. But also being willing to bet a little bit more on myself on the variance side has also been very, very helpful. That&#8217;s really hard, because intrinsically you&#8217;re putting yourself in situations where you don&#8217;t know what the outcome is going to be. If you&#8217;re very driven to be productive, and actually mostly it&#8217;s not working over there, you think, &#8220;Let&#8217;s reduce this.&#8221; It doesn&#8217;t feel right.</p><p>When I worked in San Francisco, a practice I used to have each day was instead of taking the 15-minute walk to work, I would take the more beautiful 30-minute walk. Partially just because it was beautiful, but partially also as just a reminder that there are real benefits to not being efficient. But it&#8217;s not an answer to your question. Really, I think all I&#8217;m saying is I struggle a lot with the question.</p><p><strong>Dwarkesh Patel</strong></p><p>I think <a href="https://en.wikipedia.org/wiki/Dean_Simonton">Dean Keith Simonton</a> has this famous <a href="https://jamesclear.com/equal-odds">equal odds rule</a> where he says the probability that any given thing you release&#8212;any paper, book, whatever&#8212;will be extremely important for a given person through their lifetime is not that different. What really determines in what era they are the most productive is how much they&#8217;re publishing. 
Any given thing has equal odds of being extremely important. I think some of the most successful creatives or scientists, they&#8217;re just doing a lot. Shakespeare was just publishing <em>a lot.</em></p><p><strong>Michael Nielsen</strong></p><p>Of course, then there are counterexamples. <a href="https://en.wikipedia.org/wiki/Kurt_G%C3%B6del">G&#246;del</a> published almost nothing. But broadly speaking, you need a very good reason to not do that. It&#8217;s funny, I&#8217;ve met a lot of people over the years who are clearly brilliant, and they&#8217;re just obsessed that they are going to work on the great project that makes them famous, and they never do anything. That seems connected. It&#8217;s a type of aversiveness. I think very often they just don&#8217;t want public judgment.</p><p>Something that I would love to see&#8230; There&#8217;s an awful lot of biographies and memoirs and histories of people who achieve a lot. I wish there was a very large number of biographies of people who are fantastically talented who just missed. I&#8217;ve known people who won gold medals at <a href="https://en.wikipedia.org/wiki/International_Mathematical_Olympiad">IMOs</a> and things like that, who then tried to become mathematicians and failed. What happened? What was the reason? I suspect in many cases that&#8217;s actually more informative than anything else.</p><h3>01:49:17 &#8211; What it takes to actually internalize what you learn</h3><p><strong>Dwarkesh Patel</strong></p><p>You have this <a href="https://michaelnotebook.com/dci/index.html">essay</a> that I was reading before this interview about how you think about what the work you&#8217;re doing is. And &#8220;writer&#8221; doesn&#8217;t seem like the right label. As you say, was Charles Darwin a writer? What exactly is that label? I&#8217;m a podcaster. 
In a way, obviously our work is very different, but I also think a lot about what this work is and how I get better at it.</p><p>In particular, how can I make sure there&#8217;s some compounding between the different people I talk to on the podcast? I worry that instead of this compounding, I build up some understanding that&#8217;s somewhat superficial about a topic, and then it depreciates. I move down to the next topic, and it depreciates. There are a lot of podcasters in the world who will interview way more experts than I have, and I don&#8217;t think they&#8217;re much the wiser or more knowledgeable as a result. So it&#8217;s clearly possible to mess this up.</p><p>I wonder if you have thoughts or takes or advice on how one actually learns in a deeper way from this kind of work.</p><p><strong>Michael Nielsen</strong></p><p>It&#8217;s an incredibly complicated and rich question. It seems like the question is, how do you make it a higher-growth context? How do you make it a more demanding context? You can do that in relatively small ways that might yield compounding returns, or you can do something that is more radical. Maybe it means starting a parallel project in which you do something that is actually quite a bit different.</p><p>There is something really interesting about how being very demanding can simply change your response to something. Something that I would sometimes do with students and sometimes with myself, it was really aimed more at myself, was they would say some week, &#8220;I&#8217;m going to try and do this work over the coming week.&#8221; Then the next week would come by and they hadn&#8217;t solved the problem. If a million dollars had been at stake, would you have put the same effort in? And the answer is no, invariably. They&#8217;ve tried, but they haven&#8217;t really tried.</p><p>I think that&#8217;s a very familiar feeling for all of us. 
You could do a lot more if you had just the right demanding taskmaster standing by you and saying, &#8220;Look, you&#8217;re barely operating here.&#8221; I do wonder a little bit about what&#8217;s the demanding taskmaster? What can they ask you that is going to make your preparation way more intense?</p><p><strong>Dwarkesh Patel</strong></p><p>The most helpful thing honestly is&#8230; For some subjects it is very clear how I prep. I&#8217;m doing an upcoming episode on chip design with the founder of a company that does chip design, and he wrote a textbook on it. Yesterday I went over to his office, and we brainstormed five <a href="https://en.wikipedia.org/wiki/Roofline_model">roofline analyses</a> I can do. If I understand that, I have some good understanding.</p><p>The problem is with almost every other field, there&#8217;s not this curriculum. <a href="https://www.dwarkesh.com/p/ilya-sutskever?utm_source=publication-search">When I interviewed Ilya</a> three, four years ago, it was: implement the <a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)">transformer</a>, and if you implement it, you have some nugget of understanding you have clamped down. With other fields, it&#8217;s just that I vaguely understand this. It&#8217;s not clamped. There&#8217;s no forcing function of &#8220;do this exercise, and if you do it, you will understand.&#8221;</p><p><strong>Michael Nielsen</strong></p><p>Really what you&#8217;re saying is you can do a good job at podcasting without actually attaining this kind of understanding, and that&#8217;s the problem from your point of view. You want to change your job description so that you are internalizing these chunks and just getting this kind of integration each time. 
It seems to me that what that means is you actually want to change the structure of the work output at some level.</p><p>There&#8217;s this terrible idea that lots of people have that they should be in <a href="https://en.wikipedia.org/wiki/Flow_(psychology)">flow</a> all of the time. And as far as I can tell, high performers just don&#8217;t believe this at all. They&#8217;re in flow some of the time. You certainly see this with athletes. When they&#8217;re actually out there playing basketball or tennis, ideally they are in flow much of the time. But when they&#8217;re training they&#8217;re not. They&#8217;re stuck a lot of the time, or they&#8217;re doing things badly. I suppose I wonder what that looks like for you.</p><p><strong>Dwarkesh Patel</strong></p><p>That I would be extremely satisfied with. The problem is I just don&#8217;t know what the equivalent of doing 64 laps is. This is a thing you can change by choosing guests where there is a legible curriculum. So maybe it&#8217;s a mistake not to have done that. Also, there&#8217;s no real way to prep for Terence Tao. There&#8217;s no curriculum that&#8217;s a plausible one.</p><p>There are many failure modes, but one long-term dynamic I&#8217;m worried about is that you can have a good podcast and reach a local maximum, but for no particular guest or topic are you going deep enough. My model of learning is that if you don&#8217;t really understand the deeper mechanism, you&#8217;re just mapping inputs and outputs of a black box. That just fades incredibly fast or is not worth it in the first place. You just move on and it&#8217;s over. You need to build the intermediate connection.</p><p>AI in a weird way is really easy for that reason, because there is a clear thing you can do. Just implement it, and then you understand it. If I applied that criterion elsewhere, do I just not do history episodes?</p><p><strong>Michael Nielsen</strong></p><p>Exactly. Ada Palmer. 
Wonderful to talk to, incredibly interesting. But for you personally, what changed?</p><p><strong>Dwarkesh Patel</strong></p><p>There are some things I learned. If I had allocated more time, especially after the interview, to write up 2,000 words on everything I learned and how it connects to other things I know. Maybe that&#8217;s a thing worth doing, spreading out the episodes more and spending more time afterwards consolidating.</p><p>I would pay infinite amounts of money if there was somebody who was really good at coming up with the curriculum, the practice problems you need to do, and the exercise you need to do after the interview to clamp what you have learned.</p><p><strong>Michael Nielsen</strong></p><p>Have you tried doing that with somebody?</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s hard to find someone. I haven&#8217;t tried super hard, but isn&#8217;t it going to be tough to find somebody who could do that for every single kind of discipline? Maybe I should just hire different ones for different topics.</p><p><strong>Michael Nielsen</strong></p><p>Maybe. There&#8217;s something about, what problem are you solving for each episode? As far as I can tell, that&#8217;s the only way I really understand anything. I get interested in something. At first, I don&#8217;t even have a problem, but there&#8217;s just some sense that there&#8217;s some contribution to make here, and gradually you hone in, and there&#8217;s a problem.</p><p>Funnily enough, spending time stuck is incredibly important. That used to just be annoying. Now it seems like it&#8217;s maybe even the most important part of the whole process. That hard-won nature of it means that I internalize it afterwards. I&#8217;ve written 10,000-word essays in a couple of days, and I&#8217;ve written them in three months or six months. I feel like I didn&#8217;t learn very much from the ones that only took a couple of days. 
Whereas some of the ones that took three months, 15 years later, I&#8217;ll still remember.</p><p><strong>Dwarkesh Patel</strong></p><p>Can you describe outside of physics how you learn, of the ones that took three months?</p><p><strong>Michael Nielsen</strong></p><p>By far the most common thing is there&#8217;s always some creative artifact. Sometimes it&#8217;s a class. Sometimes it&#8217;s engagement with a group of people who are working on some collective creative artifact together. You might not even be aware of it, but you&#8217;re acting as an input to their creative ends in some way. Sometimes it&#8217;s an essay or a book or whatever.</p><p>It&#8217;s one of the reasons why I often quite enjoy doing podcasts. I said yes to come here partially because I know you ask unusually demanding questions. That&#8217;s an attempt to get this sort of perspective from a different kind of a forcing function. Trying to pick the most demanding creative context.</p><p><strong>Dwarkesh Patel</strong></p><p>For this interview, I went through three lectures of the <a href="https://en.wikipedia.org/wiki/Leonard_Susskind">Susskind</a> <a href="https://amzn.to/3PXeyDe">special relativity book</a>. The problem is that there&#8217;s almost no practice problems in it. So I hired a physicist friend. I haven&#8217;t done it yet, but for every lecture I want a bunch of practice problems to go through, and I&#8217;m planning on being appropriately humbled.</p><p><strong>Michael Nielsen</strong></p><p>How do you make it as jugular as possible? The higher you can raise the stakes, the better.</p><p><strong>Dwarkesh Patel</strong></p><p>The interview is in some sense high stakes, but also it doesn&#8217;t necessarily test deep understanding.</p><p><strong>Michael Nielsen</strong></p><p>I don&#8217;t think the interview is that high stakes. 
You&#8217;re not writing a book about special relativity, and you&#8217;re not trying to write a book that replaces whatever the existing standard textbook is. That&#8217;s a really high stake.</p><p>By the way, a phrase that I find particularly difficult. People will talk about &#8220;going deep&#8221; on a subject, and it turns out different people have different ideas of what this means. For some people it means they read a couple of blog posts. For some people it means they read a book about it. For some people it means they wrote a book about it. The standard you hold yourself to determines a lot about your ability to integrate knowledge in this way.</p><p><strong>Dwarkesh Patel</strong></p><p>I found that I&#8217;m in some sense able to move much faster on some things through the help of AI, but I don&#8217;t know if I&#8217;m learning better. I think it&#8217;s probably because&#8230; The hardest thing, the thing that is most demanding, is so aversive that you try to take any excuse you can to get out of it. Just having a back-and-forth conversation with an LLM where you gloss over&#8230;</p><p><strong>Michael Nielsen</strong></p><p>It&#8217;s entertaining but not necessarily anything else.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s such an easy way to get out of the thing. In fact, it makes it easier because instead of doing some intermediate thinking, there&#8217;s always a next question you can ask a chatbot.</p><p><strong>Michael Nielsen</strong></p><p>Yeah. And it&#8217;s somewhat valuable. That&#8217;s part of the seductiveness, of course. It&#8217;s not actually useless. But it can substitute for actually doing the thing that maybe you should be doing. It&#8217;s interesting. To what extent should you be outsourcing that kind of stuff? It&#8217;s an interesting judgment call. There is a whole bunch of routine work that you want done. 
It&#8217;s low value for you, so if you can get a chatbot to do it, you may as well.</p><p>Somebody interviewed the pioneering computer scientist <a href="https://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a> years ago, and he was asked what he thought about <a href="https://en.wikipedia.org/wiki/Linux">Linux</a>. If I remember his answer correctly, he basically said, &#8220;It doesn&#8217;t have anything to do with computer science. It&#8217;s just a great big ball of mud. There are a few interesting ideas in there which are worth understanding, but mostly all you&#8217;re learning is stuff about Linux. You&#8217;re not actually learning anything which is transferable.&#8221; I thought that was very interesting.</p><p>There&#8217;s a certain kind of seductiveness to some things where it&#8217;s sort of a Rube Goldberg machine. You can just learn about all the bits, and it feels entertaining. But if you step back and think about what you&#8217;re actually doing here, it might not actually be meeting your objectives. Maybe you want to become a sysadmin, and learning Linux is a great use of your time. There&#8217;s no harm in that at all.</p><p>But if your objective is to understand the fundamentals of computing, it&#8217;s much less clear that that&#8217;s a good use of your time. It was certainly an answer I&#8217;ve thought a lot about, where for a certain type of mind, there is a seductiveness in just learning systems and confusing that with understanding.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, I&#8217;ll keep you updated on how this goes. I owe you a text within a month of some revamped learning system.</p><p><strong>Michael Nielsen</strong></p><p>I&#8217;d be really curious. It&#8217;s also true that tiny incremental improvements in this are just worth so much.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s the main input into the podcast. 
It&#8217;s great that the bookshelves are fancy and I&#8217;ve got a blackboard or whatever, but really the thing that makes the podcast better is if I can improve the learning I do. So yes, it&#8217;s worth every morsel of improvement. All right, thanks for the therapy session. Great note to end on. Thanks, Michael.</p><p><strong>Michael Nielsen</strong></p><p>All right. Thanks, Dwarkesh.</p>]]></content:encoded></item><item><title><![CDATA[Terence Tao – Kepler, Newton, and the true nature of mathematical discovery]]></title><description><![CDATA[&#8220;And what those stories teach us about how AI will revolutionize math&#8221;]]></description><link>https://www.dwarkesh.com/p/terence-tao</link><guid isPermaLink="false">https://www.dwarkesh.com/p/terence-tao</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 20 Mar 2026 16:00:55 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/191582481/11a15f3b7f6a04e3220e25ff38a2cd20.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>We begin the episode with the absolutely ingenious and surprising way in which Kepler discovered the laws of planetary motion.</p><p>People sometimes say that AI will make especially fast progress at scientific discovery because of tight verification loops.</p><p>But the story of how we discovered the shape of our solar system shows how the verification loop for correct ideas can be decades (or even millennia) long.</p><p>During this time, what we know today as the better theory can often actually make <em>worse</em> predictions (Copernicus's model of circular orbits around the sun was actually less accurate than Ptolemy's geocentric model).</p><p>And the reason it survives this epistemic hell is some mixture of judgment and heuristics that we don&#8217;t even understand well enough to actually articulate, much less codify into an RL loop.</p><p>Hope you enjoy!</p><p>Watch on <a href="https://youtu.be/Q8Fkpi18QXU">YouTube</a>; listen on <a 
href="https://podcasts.apple.com/us/podcast/terence-tao-kepler-newton-and-the-true/id1516093381?i=1000756353875">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/24xF8YGra2w3HXZYbhgVKU?si=U5V-SgvSQ8eVIcG2Z86wfQ">Spotify</a>.</p><div id="youtube2-Q8Fkpi18QXU" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;Q8Fkpi18QXU&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/Q8Fkpi18QXU?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Sponsors</h2><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> loves challenging my audience with different creative puzzles. One of my listeners, Shawn, solved Jane Street&#8217;s ResNet challenge and <a href="https://x.com/hynwprk/status/2026376546286711206">posted a great walk-through on X</a>. If you want to try one of these puzzles yourself, there&#8217;s one live now at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a>.</p></li></ul><ul><li><p><a href="https://labelbox.com/dwarkesh">Labelbox</a> can get you rubric-based evals, no matter your domain. These rubrics allow you to give your model feedback on all the dimensions you care about, so you can train <em>how</em> it thinks, not just <em>what</em> it thinks. Whatever you&#8217;re focused on&#8212;math, physics, finance, psychology or something else&#8212;Labelbox can help. Learn more at <a href="https://labelbox.com/dwarkesh">labelbox.com/dwarkesh</a>.</p></li><li><p><a href="https://mercury.com/insights">Mercury</a> just released a new feature called Insights. Insights summarizes your money in and out, showing you your biggest transactions and calling out anything worth paying attention to. 
It&#8217;s a super low-friction way to stay on top of your business. Learn more at <a href="https://mercury.com/insights">mercury.com/insights</a>.</p></li></ul><h2>Timestamps</h2><p>(00:00:00) &#8211; Kepler was a high temperature LLM</p><p>(00:11:44) &#8211; How would we know if there&#8217;s a new unifying concept within heaps of AI slop?</p><p>(00:26:10) &#8211; The deductive overhang</p><p>(00:30:31) &#8211; Selection bias in reported AI discoveries</p><p>(00:46:43) &#8211; AI makes papers richer and broader, but not deeper</p><p>(00:53:00) &#8211; If AI solves a problem, can humans get understanding out of it?</p><p>(00:59:20) &#8211; We need a semi-formal language for the way that scientists actually talk to each other</p><p>(01:09:48) &#8211; How Terry uses his time</p><p>(01:17:05) &#8211; Human-AI hybrids will dominate math for a lot longer</p><h2>Transcript</h2><h3>00:00:00 &#8211; Kepler was a high temperature LLM</h3><p><strong>Dwarkesh Patel</strong></p><p>Today, I&#8217;m chatting with <a href="https://en.wikipedia.org/wiki/Terence_Tao">Terence Tao</a>, who needs no introduction. Terence, I want to begin by having you retell the story of how <a href="https://en.wikipedia.org/wiki/Johannes_Kepler">Kepler</a> discovered the <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">laws of planetary motion</a> because I think this will be a great jumping off point to talk about AI for math.</p><p><strong>Terence Tao</strong></p><p>I&#8217;ve always had an amateur interest in astronomy. I&#8217;ve loved stories of how the early astronomers worked out the nature of the universe. Kepler was building on the work of <a href="https://en.wikipedia.org/wiki/Nicolaus_Copernicus">Copernicus</a>, who was himself building on the work of <a href="https://en.wikipedia.org/wiki/Aristarchus_of_Samos">Aristarchus</a>. 
Copernicus very famously proposed the <a href="https://en.wikipedia.org/wiki/Copernican_heliocentrism">heliocentric model</a>, that instead of the planets and the Sun going around the Earth, the Sun was at the center of the solar system and the other planets were going around the Sun.</p><p>Copernicus proposed that the orbits of the planets were perfect circles. His theory fit the observations that the Greeks, the Arabs, and the Indians had worked out over centuries. Kepler learned about these theories in his studies, and he made this observation that the ratios of the size of the orbits that Copernicus predicted seemed to have some geometric meaning.</p><p>He started proposing that if you take the orbit of the Earth and you enclose it in a cube, the outer sphere that encloses the cube almost perfectly matched the orbit of Mars, and so forth. There were six planets known at the time and five gaps between them, and there were five perfect Platonic solids: the cube, the tetrahedron, icosahedron, octahedron, and dodecahedron.</p><p>So he had this <a href="https://en.wikipedia.org/wiki/Mysterium_Cosmographicum">theory</a>, which he thought was absolutely beautiful, that you could inscribe these <a href="https://en.wikipedia.org/wiki/Platonic_solid">Platonic solids</a> between the spheres of the planets. It seemed to fit, and it seemed to him that God&#8217;s design of the planets was matching this mathematical perfection of the Platonic solids.</p><p>He needed data to confirm this theory. At the time, there was only one really high-quality dataset in existence. <a href="https://en.wikipedia.org/wiki/Tycho_Brahe">Tycho Brahe</a>, this very wealthy, eccentric Danish astronomer, had managed to convince the Danish government to fund this extremely expensive observatory. In fact, it was an entire island where he had taken decades of observations of all the planets, like Mars and Jupiter, at least every night for which the weather was clear, with the naked eye. 
He was the last of the naked-eye astronomers.</p><p>He had all this data which Kepler could use to confirm his theory. Kepler started working with Tycho, but Tycho was very jealous of the data. He only gave him little bits of it at a time. Kepler eventually just stole the data. He copied it and had to have a fight with Brahe&#8217;s descendants.</p><p>He did get the data, and then he worked out, to his disappointment, that his beautiful theory didn&#8217;t quite work. The data was off from his Platonic solid theory by 10% or something. He tried all kinds of fudges, moving the circles around, and it didn&#8217;t quite work. But he worked on this problem for years and years, and eventually, he figured out how to use the data to work out the actual orbits of the planets.</p><p>That was an incredibly clever, genius piece of data analysis. And then he worked out that the orbits were actually ellipses, not circles, which was shocking for him. So he worked out the first two laws of planetary motion: the ellipses, and also that a planet sweeps out equal areas in equal times.</p><p>Then ten years later, after collecting a lot of data&#8212;the furthest planets like Saturn and Jupiter were the hardest for him to work out&#8212;he finally worked out this third law, that the time it takes for a planet to complete its orbit is proportional to some power of its distance to the Sun. These are Kepler&#8217;s three famous laws of planetary motion. He had no explanation for them. It was all driven by experiment, and it took <a href="https://en.wikipedia.org/wiki/Isaac_Newton">Newton</a>, nearly a century later, to give a theory that explained all three laws at once.</p><p><strong>Dwarkesh Patel</strong></p><p>The take I want to try on you is that Kepler was a <a href="https://www.ibm.com/think/topics/llm-temperature">high-temperature LLM</a>. Newton comes up with this explanation of why the three laws of planetary motion must be true. 
Of course, the way that Kepler discovers the laws of planetary motion, or figures out the relative orbits of the different planets, is as you say a work of genius. But through his career, he&#8217;s just trying random relationships.</p><p>In fact, in the book in which he writes down the third law of planetary motion, it&#8217;s an aside on <em><a href="https://en.wikipedia.org/wiki/Harmonice_Mundi">The Harmonics of the World</a></em>, which is just a book about how all these different planets have these different harmonies. And the reason there&#8217;s so much famine and misery on Earth is because the Earth is mi-fa-mi, that&#8217;s the note of Earth. It&#8217;s all this random astrology, but in there is the cube-square law, which tells you what relationship the period has to a planet&#8217;s distance from the Sun. As you were detailing, if you add that to Newton&#8217;s <a href="https://en.wikipedia.org/wiki/Newton's_laws_of_motion">F=ma</a> and the <a href="https://en.wikipedia.org/wiki/History_of_centrifugal_and_centripetal_forces">equation for centripetal acceleration</a>, you get the <a href="https://en.wikipedia.org/wiki/Inverse-square_law">inverse-square law</a>. And so Newton works that out.</p><p>But the reason I think this is an interesting story is that I feel LLMs can do the kind of thing of trying random relationships for twenty years, some of which make no sense, as long as there&#8217;s a verifiable data bank like Brahe&#8217;s dataset. &#8220;Ok, I&#8217;m going to try out random things about musical notes, Platonic objects, or different geometries, I have this bias that there&#8217;s some important thing about the geometry of these orbits.&#8221;</p><p>Then one thing works. As long as you can verify it, these empirical regularities can then drive actual deep scientific progress.</p><p><strong>Terence Tao</strong></p><p>Traditionally, when we talk about the history of science, idea generation has always been the prestige part of science. 
A scientific problem comes with many steps. You have to identify a problem, and then you have to identify a good, fruitful problem to work on. Then you need to collect data, figure out a strategy to analyze the data, and make a hypothesis. At this point, you need to propose a good hypothesis, and then you need to validate. Then you need to write things up and explain. There are a dozen different components.</p><p>The ones we celebrate are these eureka genius moments of idea generation. Kepler certainly had to cycle through many ideas, several of which didn&#8217;t work. I bet there were many that he didn&#8217;t even publish at all because they just didn&#8217;t fit. That&#8217;s an important part of the process, trying all kinds of random things and seeing if they worked.</p><p>But as you say, it has to be matched by an equal amount of verification, otherwise it&#8217;s slop. We celebrate Kepler, but we should also celebrate Brahe for his assiduous data collection, which was ten times more precise than any previous observation. That extra decimal point of accuracy was essential for Kepler to get his results. He was using <a href="https://en.wikipedia.org/wiki/Euclidean_geometry">Euclidean geometry</a> and the most advanced mathematics he could use at the time to match his models with the data. All aspects had to be in play: the data, the theory, and the hypothesis generation.</p><p>I&#8217;m not sure nowadays that hypothesis generation is the bottleneck anymore. Science has changed in the century since. Classically, the two big paradigms for science were theory and experiment. Then in the 20th century, numerical simulation came along, so you can do computer simulations to test theories. Finally, in the late 20th century, we had big data. We had the era of data analysis.</p><p>A lot of new progress is actually driven now by analyzing massive datasets first. You collect large datasets and then draw patterns from them to deduce thoughts. 
This is a little bit different from how science used to work, where you make a few observations or have one out-of-the-blue idea, and then collect data to test your idea. That&#8217;s the classic scientific method. Now it&#8217;s almost reversed. You collect big data first, and then you try to get hypotheses from it.</p><p>Kepler was maybe one of the first early data scientists, but even he didn&#8217;t start with Tycho&#8217;s dataset and then analyze it. He had some preconceived theories first. It seems like this is less and less the way we make progress, just because the data is so much more massive and useful.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, interesting. I feel like the 20th-century science that you&#8217;re describing actually very well describes what happened with Kepler. He did have these ideas&#8212;1595 and &#8216;96 is where he comes up with the polygons and then the Platonic objects theory&#8212;but they were wrong. Then a few years later, he gets Brahe&#8217;s data, and it&#8217;s only after twenty years of trying random things that he gets this empirical regularity.</p><p>It actually feels a bit closer to Brahe&#8217;s data being analogous to some massive data bank of simulations, and now that you&#8217;ve got the data, you can keep trying random things. If it wasn&#8217;t for that, Kepler would be out there just writing books about harmonics and Platonic objects, and there would be nothing to actually verify against.</p><p><strong>Terence Tao</strong></p><p>The data was extremely important. The distinction I was trying to make was that traditionally, you make a hypothesis and then you test it against data. But now with machine learning, data analysis, and statistics, you can start with data and through statistics work out laws that were not present before.</p><p>Kepler&#8217;s third law is a little bit like this, except that instead of having the thousand data points that Brahe had, Kepler had six data points. 
For every planet, he knew the period of the orbit and the distance to the Sun. <a href="https://terrytao.wordpress.com/wp-content/uploads/2025/11/sample_fourth.pdf">There were five or six data points, and he did what we would now call regression</a>. He fit a curve to these six data points and got a square-cube law, which was amazing. But he was quite lucky that these six data points gave him the right conclusion. That&#8217;s not enough data to be really reliable.</p><p>There was a later astronomer, <a href="https://en.wikipedia.org/wiki/Johann_Elert_Bode">Johann Bode</a>, who took the same data&#8212;the distances to the planets&#8212;and inspired by Kepler, he had a prediction that the distances to the planets formed a shifted geometric progression. He also fit a curve, except there was one point missing. There was a big gap between Mars and Jupiter. His law predicted that there was a missing planet. It was kind of a crank theory, except that when Uranus was discovered by <a href="https://en.wikipedia.org/wiki/William_Herschel">Herschel</a>, the distance to Uranus fit exactly this pattern. Then <a href="https://en.wikipedia.org/wiki/Ceres_(dwarf_planet)">Ceres</a> was discovered in the <a href="https://en.wikipedia.org/wiki/Asteroid_belt">asteroid belt</a>, and it also fit the pattern. People got really excited that Bode had discovered <a href="https://en.wikipedia.org/wiki/Titius%E2%80%93Bode_law">this amazing new law of nature</a>.</p><p>But then Neptune was discovered, and it was way off. Basically it was just a numerical fluke. There were six data points. 
Maybe one reason why Kepler didn&#8217;t highlight his third law as much as the first two laws is that instinctively, even though he didn&#8217;t have modern statistics, he kind of knew that with six data points, he had to be somewhat tentative with the conclusions.</p><h3>00:11:44 &#8211; How would we know if there&#8217;s a new unifying concept within heaps of AI slop?</h3><p><strong>Dwarkesh Patel</strong></p><p>To ask the question about the analogy more explicitly, does this analogy make sense if in the future we have smarter and smarter AIs? We&#8217;ll have millions of them, and they can go out and hunt for all these empirical regularities. It sounds like you don&#8217;t think the bottleneck in science is finding more things that are the equivalent of the third law of planetary motion for each given field, so that later on somebody can say, &#8220;Oh, we need a way to explain this. Let&#8217;s work out the math. Here&#8217;s the <a href="https://en.wikipedia.org/wiki/Newton%27s_law_of_universal_gravitation">inverse-square law of gravity</a>.&#8221;</p><p><strong>Terence Tao</strong></p><p>I think AI has driven the cost of idea generation down to almost zero, in a very similar way to how the internet drove the cost of communication down to almost zero. It&#8217;s an amazing thing, but it doesn&#8217;t create abundance by itself. Now the bottleneck is different. We&#8217;re now in a situation where suddenly people can generate thousands of theories for a given scientific problem. Now we have to verify them, evaluate them. We have to change our structures of science to actually sort this out.</p><p>Traditionally, we build walls. In the past, before we had AI slop, we had amateur scientists with their own theories of the universe, many of which were of very little value. 
We built these peer review publication systems to filter out and try to isolate the high signal ideas to test.</p><p>But now that we can generate these possible explanations at massive scale, and some of them are good and a lot are terrible, human reviewers are already being overwhelmed. Many journals are reporting that <a href="https://www.nature.com/articles/d41586-025-03967-9">AI-generated submissions are just flooding their submissions</a>.</p><p>It&#8217;s great that we can generate all kinds of things now with AI, but it means that the rest of the aspects of science have to catch up: verification, validation, and assessing what ideas actually move the subject forward and which ones are dead ends or red herrings. That&#8217;s not something we know how to do at scale. For each individual paper, we can have a debate among scientists and get to a consensus in a few years. But when we&#8217;re generating a thousand of these every day, this doesn&#8217;t work.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s this incredibly interesting question. If you have billions of AI scientists, not only how do you gauge which ones are real progress, but how do you... This is actually a question that human science has had to face and we&#8217;ve solved somehow, and I&#8217;m actually not sure how we solved this.</p><p>Let&#8217;s say in the 1940s, if you&#8217;re at <a href="https://en.wikipedia.org/wiki/Bell_Labs">Bell Labs</a> and there are these new technologies coming out. <a href="https://en.wikipedia.org/wiki/Pulse-code_modulation">Pulse-code modulation</a>, how do you transfer signals? How do you digitize signals? How do you transfer them over analog wires? There are all these papers about the engineering constraints and the details, and then there&#8217;s one which comes up with the <a href="https://en.wikipedia.org/wiki/Bit#History">idea of the bit</a>, which has implications across many different fields. 
You need some system which can then look at that and say, &#8220;Okay, we need to apply this to probability. We need to apply this to computer science,&#8221; et cetera.</p><p>In the future, the AIs are coming up with the next version of this unifying concept. How would you identify it among millions of papers that might actually constitute progress, but which have much less in terms of general unifying ideas?</p><p><strong>Terence Tao</strong></p><p>A lot of it&#8217;s the test of time. Many great ideas didn&#8217;t actually get a great reception at the time they were first proposed. It was only after some other scientists realized that they could take it further and apply them to their own... <a href="https://en.wikipedia.org/wiki/Deep_learning">Deep learning</a> itself was a niche area of AI for a long time. The idea of getting answers entirely through training on data and not through first principles reasoning was very controversial, and it just took a long time before it started bearing fruit.</p><p>You mentioned the bit. There were other proposals for computer architectures than the zero-one that is universal today. I think there were <a href="https://en.wikipedia.org/wiki/Ternary_computer">trits</a>, three-valued logic. In an alternate universe, maybe a different paradigm would have shown up. The <a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)">transformer</a>, for example, is the foundation of all modern <a href="https://en.wikipedia.org/wiki/Large_language_model">large language models</a>, and it was the first deep learning architecture that really was sophisticated enough to capture language. But it didn&#8217;t have to be that way. There could&#8217;ve been some other architecture that was the first to do it and once that was adopted, it would become the standard.</p><p>One reason why it&#8217;s hard to assess whether a given idea is going to be fruitful is that it depends on the future. 
It depends also on the culture and society, which ones get adopted, which ones don&#8217;t. The <a href="https://en.wikipedia.org/wiki/Decimal">base ten numeral system</a> in mathematics is extremely useful, much better than the <a href="https://en.wikipedia.org/wiki/Roman_numerals">Roman numeral system</a>, for instance. But again, there&#8217;s nothing special about ten. It&#8217;s a system that is useful for us because everyone else uses it. We&#8217;ve standardized it. We&#8217;ve built all our computers and our number representation systems around it, so we&#8217;re stuck with it now. Some people occasionally push for other systems than decimal, but there&#8217;s just too much inertia.</p><p>It&#8217;s not something where you can look at any given scientific achievement purely in isolation and give it an objective grade without being aware of the context both in the past and the future. So it may never be something that you can just <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">reinforcement learn</a> the same way that you can for much more localized problems.</p><p><strong>Dwarkesh Patel</strong></p><p>Often in the history of science when a new theory comes up that in retrospect we realize is correct, it seems to make implications that either make no sense because they&#8217;re wrong, and we realize later on why they&#8217;re wrong, or they&#8217;re correct but seem wildly implausible at the time.</p><p>As you talked about, Aristarchus had heliocentrism in the third century BC. 
The ancient Athenians were like, &#8220;This can&#8217;t be because if the earth is going around the sun, we should see the relative position of the stars change as we&#8217;re going around the sun, and the only way that wouldn&#8217;t be the case is if they&#8217;re so far away that you don&#8217;t notice any parallax,&#8221; which is actually the correct implication.</p><p>But there&#8217;s times when the implication is incorrect and we just need to graduate to a better level of understanding. <a href="https://en.wikipedia.org/wiki/Gottfried_Wilhelm_Leibniz">Leibniz</a> would chide Newton and disagree with <a href="https://en.wikipedia.org/wiki/Newton%27s_law_of_universal_gravitation">Newton&#8217;s theory of gravity</a> on the basis that it implied <a href="https://en.wikipedia.org/wiki/Action_at_a_distance">action at a distance</a>, and they didn&#8217;t know the mechanism, and Newton himself was sort of stunned that <a href="https://en.wikipedia.org/wiki/Equivalence_principle">inertial mass and gravitational mass were the same quantity</a>. All these things later were resolved by <a href="https://en.wikipedia.org/wiki/Albert_Einstein">Einstein</a>. But it was still progress.</p><p>So the question for a system of peer review for AI would be: even if you can falsify a theory, how would you notice that it still constitutes progress relative to the thing before?</p><p><strong>Terence Tao</strong></p><p>Often, the ultimately correct theory initially is worse in many ways. Copernicus&#8217;s theory of the planets was less accurate than Ptolemy&#8217;s theory. Geocentrism had been developed for a millennium by that point, and they had made many tweaks and increasingly complicated ad hoc fixes to make it more and more accurate. Copernicus&#8217;s theory was a lot simpler but much less accurate. It was only Kepler that made it more accurate than Ptolemy&#8217;s theory.</p><p>Science is always a work in progress. 
When you only get part of the solution, it looks worse than a theory which is incorrect but somehow has been completed to the point where it kind of answers all the questions. As you say, Newton&#8217;s theory had big mysteries. There was the equivalence of inertial and gravitational mass, and action at a distance, which were only resolved with <a href="https://en.wikipedia.org/wiki/General_relativity">a very conceptually different approach</a> centuries afterwards.</p><p>Often progress has to be made not by adding more theories, but by deleting some assumptions that you have in your mind. One reason why geocentrism held on for so long is we had this idea that objects naturally want to stay at rest. This is the <a href="https://en.wikipedia.org/wiki/Aristotelian_physics">Aristotelian notion of physics</a>, and so the idea that the Earth was moving&#8230; How come we weren&#8217;t all falling over? Once you have Newton&#8217;s laws of motion&#8212;an object in motion remains in motion and so forth&#8212;then it makes sense.</p><p>Conceptually, it&#8217;s a very big leap to realize that the Earth is in motion. It doesn&#8217;t feel like it&#8217;s in motion. One of the biggest advances, <a href="https://en.wikipedia.org/wiki/Darwinism">Darwin&#8217;s theory of evolution</a>, is the idea that species are not static. This is not obvious because you don&#8217;t see evolution in your lifetime. Well, now we actually can, but species seem permanent and static.</p><p>Right now we&#8217;re going through a cognitive version of the Copernican revolution, where we used to think that human intelligence is the center of the universe, and now we&#8217;re seeing that there are very different types of intelligence out there with very different strengths and weaknesses. Our assessment of which tasks require intelligence, which ones don&#8217;t, has to be reordered quite a bit.</p><p>Trying to fit AI into our theories of scientific progress and what is hard and what is easy, we&#8217;re struggling quite a lot. 
We have to ask questions that we&#8217;ve never really had to ask before. Or maybe the philosophers had, but now we all have to deal with it.</p><p><strong>Dwarkesh Patel</strong></p><p>This brings up a topic I&#8217;ve been very curious about. You mentioned Darwin&#8217;s theory of evolution. There&#8217;s this book, <em><a href="https://amzn.to/4bDfFzc">The Clockwork Universe</a></em> by <a href="https://en.wikipedia.org/wiki/Edward_Dolnick">Edward Dolnick</a>, which covers a lot of this era of history we&#8217;re talking about. He has this interesting observation in there. <em><a href="https://en.wikipedia.org/wiki/On_the_Origin_of_Species">The Origin of Species</a></em> was published in 1859. <em><a href="https://en.wikipedia.org/wiki/Philosophi%C3%A6_Naturalis_Principia_Mathematica">Principia Mathematica</a></em> was published in 1687.</p><p>So <em>The Origin of Species</em> comes out two centuries after <em>Principia</em>. Conceptually, it seems like Darwin&#8217;s theory is simpler. There&#8217;s a contemporaneous biologist to Darwin, <a href="https://en.wikipedia.org/wiki/Thomas_Henry_Huxley">Thomas Huxley</a>, who reads <em>The Origin of Species</em> and he says, &#8220;How stupid not to have thought of that.&#8221;</p><p>Nobody ever says that about <em>Principia</em>, chiding themselves for not having beaten Newton to <a href="https://en.wikipedia.org/wiki/Newton%27s_law_of_universal_gravitation">gravity</a>. So there&#8217;s a question of why did it take longer?</p><p>It seems like a big part of the reason is what you were saying. The evidence for natural selection is overwhelming in a certain sense, but it&#8217;s cumulative and retrospective, whereas Newton can just say, &#8220;Here are my equations. 
Let me see the moon&#8217;s orbital period and its distance, and if it lines up, then we&#8217;ve made progress.&#8221;</p><p><a href="https://en.wikipedia.org/wiki/Lucretius">Lucretius</a> actually had this idea that species adapted to their environment in the first century BC but nobody really talks about it until Darwin because Lucretius couldn&#8217;t run some experiment and force people to pay attention. I wonder if we&#8217;ll in retrospect end up seeing much more progress in domains which have this kind of tight data loop where you can verify them quite easily, even though they&#8217;re conceptually much more difficult.</p><p><strong>Terence Tao</strong></p><p>I think one aspect of science is that it&#8217;s not just creating a new theory and validating it, but communicating it to others. Darwin was an amazing science communicator. He wrote in English, in natural language. I&#8217;m speaking like a&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>No <a href="https://en.wikipedia.org/wiki/Lean_(proof_assistant)">Lean</a>.</p><p><strong>Terence Tao</strong></p><p>I have to get out of my technical mindset. He spoke in plain English, didn&#8217;t use equations, and he synthesized a lot of disparate facts. Little pieces of evolution had been worked out in the past, but he had this very compelling vision. Again, he was still missing things. He didn&#8217;t know the mechanism for <a href="https://en.wikipedia.org/wiki/Heredity">heredity</a>, he didn&#8217;t have DNA. But his writing style was persuasive, and that helped a lot.</p><p>Newton wrote in Latin. He had invented entire new areas of mathematics just to explain what he was doing. He was also from an era where scientists were much more secretive and competitive. Academia is still competitive, but it was even worse back in Newton&#8217;s day. He held back some of his best insights because he didn&#8217;t want his rivals to get any advantage. He was also a somewhat unpleasant person from what I gather. 
It was only a couple of decades after Newton when other scientists explained his work in much simpler terms that they became widespread.</p><p>The art of exposition and making a case and creating a narrative is also a very important part of science. If you have the data, it helps, but people need to be convinced, otherwise they will not push it further or take the initial investment to learn your theory and really explore it. That&#8217;s another thing which is really hard to reinforcement learn on. How can you score how persuasive you are? Well, there are entire marketing departments trying to do this. Maybe it&#8217;s good that AI is not yet optimized to be persuasive.</p><p>There&#8217;s a social aspect to science. Even though we pride ourselves on having an objective side to it, where there&#8217;s data and experiment and validation, we still have to tell stories and convince our fellow scientists. That&#8217;s a soft, squishy thing. It&#8217;s a combination of data and painting a narrative, and it&#8217;s a narrative of gaps.</p><p>Even with Darwin, as I said, there were pieces of his theory he could not explain. But he could still make a case that in the future, people would find transitional forms, that they would find the mechanism of inheritance, and they did. I don&#8217;t know how you can quantify that in such a precise way that you can start doing reinforcement learning. Maybe that will be forever the human side of science.</p><h3>00:26:10 &#8211; The deductive overhang</h3><p><strong>Dwarkesh Patel</strong></p><p>One takeaway I had from reading and watching your stuff on the <a href="https://terrytao.wordpress.com/wp-content/uploads/2010/10/cosmic-distance-ladder.pdf">cosmic distance ladder</a>&#8230; By the way, I highly recommend people watch <a href="https://www.youtube.com/watch?v=YdOXS_9_P4U">your series</a> with <a href="https://www.youtube.com/c/3blue1brown">3Blue1Brown</a> on the cosmic distance ladder. 
One takeaway was that the deductive overhang in many fields could be so much bigger than people realize. If you just had the right insight about how to study a problem, you might be surprised at how much more you could learn about the world.</p><p>I wonder if you think that&#8217;s a product of astronomy at the particular times in history that you&#8217;re studying. Or is it just that based on the data that is incident on the Earth right now, we could actually divine a lot more than we happen to know?</p><p><strong>Terence Tao</strong></p><p>Astronomy was one of the first sciences to really embrace data analysis and squeezing every last possible drop of information out of the information they had because data was the bottleneck. It still is the bottleneck. It&#8217;s really hard to collect astronomical data.</p><p>Astronomers are world-class in extracting all kinds of conclusions from little traces of data, almost like Sherlock. I hear that for a lot of quant hedge funds, their preferred hire is an astronomy PhD, actually. They are also very interested for other reasons in extracting signals from various random bits of data.</p><p>We do under-explore how to extract extra information from various signals. Just to pick <a href="https://arxiv.org/pdf/cond-mat/0212043">one random study</a>, I remember reading once that people were trying to measure how often scientists actually read the papers that they cite. How do you measure this? You could try to survey different scientists, but they had a clever trick.</p><p>Many citations have little typos, like a number is wrong or punctuation is almost wrong. They measured how often a typo got copied from one reference to the next, and they could infer whether an author was just copying and pasting a reference without actually checking it. From that, they were able to infer some measure of how much attention people were paying. 
So there are some clever tricks to extract&#8230;</p><p>These questions you posed earlier of how we can assess whether a scientific development is fruitful, interesting, or represents real progress&#8230; Maybe there are really useful metrics or footprints of this phenomenon in data. We can examine citations and how often something is mentioned in a conference. Maybe there&#8217;s a lot of sociology of science research to be done that could actually detect these things. Maybe we should get some astronomers on the case, actually.</p><h3>00:30:31 &#8211; Selection bias in reported AI discoveries</h3><p><strong>Dwarkesh Patel</strong></p><p>That brings us nicely to the progress that, from the outside, it seems like AI for math is making. You had a <a href="https://www.theatlantic.com/technology/2026/02/ai-math-terrance-tao/686107/">post recently where you pointed out</a> that over the last few months, AI programs have solved fifty out of the eleven hundred odd <a href="https://www.erdosproblems.com/">Erd&#337;s problems</a>. I don&#8217;t know if it&#8217;s still correct, but as of a month ago you said that there had been a pause because the low-hanging fruit had been picked.</p><p>First of all, I&#8217;m curious if that is still the case, that we have picked the low-hanging fruit and now we&#8217;re at this plateau currently.</p><p><strong>Terence Tao</strong></p><p>It does seem so. Fifty-odd problems have been solved with AI assistance, which is great, but there&#8217;s like six hundred to go. People are still chipping away at one or two of these right now.</p><p>We&#8217;re seeing a lot fewer pure AI solutions now where the AI just one-shots the problem. There was a month where that happened and that has stopped, not for lack of trying. I know of three separate attempts to get frontier model AIs to just attack every single one of the problems simultaneously. 
They pick out some minor observations, or maybe they find that some problem was already solved in the literature, but there hasn&#8217;t been any further purely AI-powered solution yet.</p><p>People are using AI a lot currently. Someone might use AI to generate a possible proof strategy, and then another person will use a separate AI tool to critique it, rewrite it, generate some numerical data for it, or do a literature survey. Some problems have been solved by an ongoing conversation between lots of humans and lots of AI tools. But it does seem like it was this one-off thing.</p><p>Maybe one analogy for these problems is that you&#8217;re in some sort of mountain range with all kinds of cliffs and walls. Maybe there&#8217;s a little wall which is three feet high, and one that&#8217;s six feet high, and then one that&#8217;s fifteen feet high, and then there are some mile-high cliffs. You&#8217;re trying to climb as many of these cliffs as possible, but it&#8217;s in the dark. We don&#8217;t know which ones are tall, which ones are short. So we try to light some candles and make some maps, and slowly we figure out some of them are climbable. On some of them we can identify a partial track in the wall that you can reach first.</p><p>These AI tools, they&#8217;re like jumping machines that can jump two meters in the air, higher than any human. Sometimes they jump in the wrong direction, and sometimes they crash, but sometimes they can reach the tops of the lowest walls that we couldn&#8217;t reach before. We&#8217;ve just set them loose in this mountain range, hopping around. There was this exciting period where they could actually find all the low ones and reach them. Maybe the next time there&#8217;s a big advance in the models, they will try it again, and a few more will be breached.</p><p>But it&#8217;s a different style of doing mathematics. 
Normally we would <a href="https://en.wikipedia.org/wiki/Hill_climbing">hill climb</a>, make little markers, and try to identify partial things. These tools either succeed or they fail. They&#8217;ve been really bad at creating partial progress or identifying intermediate stages that you should focus on first. Going back to this previous discussion, we don&#8217;t have a way of evaluating partial progress the same way we can evaluate a one-shot success or failure of solving a problem.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s two different ways to think through what you&#8217;ve just said. One of them is more bearish on AI progress, and one of them is more bullish. The bearish one being, &#8220;Oh, they&#8217;re only getting to a certain height of wall, which is not as high as humans are reaching.&#8221;</p><p>The second is that they have this powerful property that once they achieve a certain waterline, they can fill every single problem that is available at that waterline, which we simply can&#8217;t do with humans. We can&#8217;t make a million copies of you and give each of them a million dollars of inference compute and have you do a hundred years of subjective time research on a million different problems at the same time.</p><p>But once AIs reach Terence Tao-level, they could do that. Once they reach intermediate levels, they could do the intermediate version of that. The same reason that we should be bearish now is the reason we should be especially bullish. Not even when they achieve superhuman intelligence, but just when they achieve human-level intelligence, because their human-level intelligence is qualitatively wider and more powerful than our human-level intelligence.</p><p><strong>Terence Tao</strong></p><p>I agree. They excel at breadth, and humans excel at depth, human experts at least. I think they&#8217;re very complementary. 
But our current way of doing math and science is focused on depth because that&#8217;s where human expertise is, because humans can&#8217;t do breadth. We have to redesign the way we do science to take full advantage of this breadth capability that we now have.</p><p>We should have a lot more effort in creating very broad classes of problems to work on rather than one or two really deep, important problems. We should still have the deep, important problems, and humans should still be working on them. But now we have this other way of doing science. We can explore entirely new fields of science by first getting these broad, moderately competent AIs to map it out and make all the easy observations. And then identify certain islands of difficulty, which human experts can then come and work on.</p><p>I see very much a future of very complementary science. Eventually, you would hope to get both breadth and depth and somehow get the best of both worlds. But we need practice with the breadth side. It&#8217;s too new. We don&#8217;t even have the paradigms to really take full advantage of it. But we will, and then science will be unrecognizable after that, I think.</p><p><strong>Dwarkesh Patel</strong></p><p>To this point about complementarity, programmers have noticed that they&#8217;re way more productive as a result of these AI tools. I don&#8217;t know if you as a mathematician feel the same way, but it does seem like one big difference between vibe coding and vibe researching is that with software, the whole point is to have some effect on the world through your work. 
If it leads to you better understanding a problem or coming up with some clean abstraction to embody in your code, that is instrumental to the end goal.</p><p>Whereas with research, the reason we care about solving the <a href="https://en.wikipedia.org/wiki/Millennium_Prize_Problems">Millennium Prize Problems</a> is presumably that, in the process of solving them, we discover new mathematical objects or new techniques that advance our civilization&#8217;s understanding of mathematics. So the proof is instrumental to the intermediate work. I don&#8217;t know if you agree with that dichotomy or if that in any way will explain the relative uplift we&#8217;ll see in software versus research.</p><p><strong>Terence Tao</strong></p><p>Certainly in math, the process is often more important than the problem itself. The problem is kind of a proxy for measuring progress. I think even in software, there are different types of software tasks. If you just create a webpage that does the same thing that a thousand other webpages do, there&#8217;s no skill to be learned. Well, there is still some skill maybe that the individual programmer could pick up. But for boilerplate-type code, it&#8217;s something that you should definitely offload to AI.</p><p>Sometimes once you make the code, you still have to maintain it. There are issues with upgrading it and making it compatible with other things. I&#8217;ve heard programmers report that even if an AI can create the first prototype of a tool, making it mesh with everything else and making it interact with the real world in the way they want is an ongoing process. If you don&#8217;t have the skills that you pick up from writing the code, that may impact your ability to maintain it down the road.</p><p>So yes, certainly mathematicians, we&#8217;ve used problems to build intuition and to train people to have a good idea of what&#8217;s true, what to expect, what is provable, and what is difficult. 
Just getting the answers right away may actually inhibit that process.</p><p>I made a distinction between theory and experiment before. In most sciences, there&#8217;s an equal division between the theoretical side and the experimental side. Math has been unique in that it&#8217;s almost entirely theoretical. We place a premium on trying to have coherent, clean theories of why things are true and false. We haven&#8217;t done many experiments as to, if we have two different ways to solve a problem, which is more effective. We have some intuition, but we haven&#8217;t done large-scale studies where we take a thousand problems and just test them.</p><p>But we can do that now. I think AI-type tools will actually revolutionize the experimental side of math, where you don&#8217;t care so much about individual problems and the process of solving them, but you want to gather large-scale data about what things work and what things don&#8217;t. The same way that if you&#8217;re a software company and you want to roll out a thousand pieces of software, you don&#8217;t really want to handcraft each one and learn lessons from each. You just want to find what workflows let you scale.</p><p>The idea of doing mathematics at scale is in its infancy. But that&#8217;s where AI is really going to revolutionize the subject.</p><p><strong>Dwarkesh Patel</strong></p><p>I feel like a big crux in these conversations about how good AI will be for science is, I think you said this, that they&#8217;re using existing techniques and modifying them. It would be interesting to understand how much progress one can make simply from using existing techniques.</p><p>If I looked at the top math journals, how many of the papers are coming up with a new technique, whatever that means, versus using existing techniques on new problems? What is the overhang? 
If you just applied every known technique to every open problem, would that constitute a humongous uplift in our civilization&#8217;s knowledge, or would that not be that impressive and useful?</p><p><strong>Terence Tao</strong></p><p>This is a great question, and we don&#8217;t have the data to fully answer it yet. Certainly, a lot of work that human mathematicians do&#8230; When you take a new problem, one of the first things we do is we look at all the standard things that have worked on similar problems in the past, and we try them one by one. Sometimes that works, and that&#8217;s still worth publishing because the question was important.</p><p>Sometimes they almost work, and you have to add one more wrinkle to it, and that&#8217;s also interesting. But the papers that go into the top journals are usually ones where the existing methods can kind of solve 80% of the problem, but then there is this 20% which is resistant and a new technique has to be invented to fill in the gaps.</p><p>It&#8217;s very rare now that a problem gets solved with no reliance on past literature, where all the ideas come out of nowhere. That was more common in the past, but math is so mature now that it&#8217;s just so much of a handicap to not use the literature first.</p><p>AI tools are getting really good at the first part of that, just trying all the standard techniques on a problem, often making fewer mistakes in applying them than humans. They still make mistakes, but I&#8217;ve tested these tools on little tasks that I can do, and sometimes they pick up errors that I make. Sometimes I pick up errors that they make. It&#8217;s about a tie right now.</p><p>But I haven&#8217;t yet seen them take the next step. When there are holes in the argument where none of the things are working, then what do you do? 
They can suggest random things, but often I find that trying to chase them down to make them work, and finding they don&#8217;t work, wastes more time than it saves.</p><p>I think some fraction of problems that we currently think are hard will fall from this method, especially the ones that haven&#8217;t received enough attention. With the <a href="https://en.wikipedia.org/wiki/Paul_Erd%C5%91s">Erd&#337;s</a> problems, almost all of the 50 problems that were solved by AIs were ones for which there was basically no literature. Erd&#337;s posed the problem once or twice. Maybe some people tried it casually and couldn&#8217;t do it, but they never wrote up anything.</p><p>But it turned out that there was a solution, and it was just combining this one obscure technique that not many people know about with some other result in the literature. That&#8217;s the median level of what AI can accomplish, and that&#8217;s really great. It clears out 50 of these problems. So I think you will see some isolated successes.</p><p>But what we found&#8230; Some people have done large-scale sweeps of these Erd&#337;s problems. If you only focus on the success stories, the ones that get broadcast on social media, it looks amazing. All these problems that haven&#8217;t been solved for decades, now they&#8217;re falling. But whenever we do a systematic study, on any given problem an AI tool has a success rate of maybe 1% or 2%. It&#8217;s just that they can buy scale, and you just pick the winners. It looks great.</p><p>I think there&#8217;ll be a similar thing happening with the hundreds of really prestigious, difficult math problems out there. Some AI may get lucky and actually solve them, and there will be some backdoor to solve the problem that everyone else missed. That will get a lot of publicity. 
But then people will try these fancy tools on their own favorite problem, and they will again experience the 1% to 2% success rate.</p><p>There&#8217;ll be a lot of noise amongst the signal of when they&#8217;re working and when they&#8217;re not. It will be increasingly important to collect these really standardized datasets. There are efforts now to create a standard set of challenge problems for AIs to solve, and not just rely on the AI companies to only publish their wins and not disclose their negative results. That will maybe give more clarity as to where we&#8217;re actually at.</p><p><strong>Dwarkesh Patel</strong></p><p>Although I think it&#8217;s worth emphasizing how much progress in AI it constitutes already, to have models that are capable of applying some technique that nobody had written down as applicable to this particular problem.</p><p><strong>Terence Tao</strong></p><p>The progress is simultaneously amazing and disappointing. It is a very strange feeling to see these tools in action. But people also acclimatize really quickly.</p><p>I remember when Google&#8217;s web search came out 20 years ago. It just blew all the other searches out of the water. You&#8217;re getting relevant hits on the front page, exactly what you wanted. It was amazing, and then after a few years, you just took for granted that you could Google anything.</p><p>2026-level AI would be stunning in 2021. 
A lot of it&#8212;face recognition, natural speech, doing college-level math problems&#8212;we just take for granted now.</p><h3>00:46:43 &#8211; AI makes papers richer and broader, but not deeper</h3><p><strong>Dwarkesh Patel</strong></p><p>Speaking of 2026 AI, you <a href="https://unlocked.microsoft.com/ai-anthology/terence-tao/">made a prediction in 2023</a> that by 2026 it would be like a colleague in mathematics?</p><p><strong>Terence Tao</strong></p><p>A trustworthy co-author if used correctly.</p><p><strong>Dwarkesh Patel</strong></p><p>Which is looking pretty good in retrospect.</p><p><strong>Terence Tao</strong></p><p>Yeah, I&#8217;m pretty pleased.</p><p><strong>Dwarkesh Patel</strong></p><p>So let&#8217;s see if you can continue this streak. You personally are 2x more productive as a result of AI. What year would you say that?</p><p><strong>Terence Tao</strong></p><p>Productivity, I think, is not quite a one-dimensional quantity. I&#8217;m definitely noticing that the style in which I do mathematics is changing quite a bit, and the type of things I do. For example, my papers now have a lot more code, a lot more pictures, because it&#8217;s so easy to generate these things now. Some plot which would have taken me hours to do, now I can do in minutes. But in the past, I just wouldn&#8217;t have put the plot in my paper in the first place. I would just talk about it in words. So it&#8217;s hard to measure what 2x means.</p><p>On the one hand, I think the type of papers that I would write today, if I had to do them without AI assistance, would definitely take five times longer. But I would not write my papers that way.</p><p><strong>Dwarkesh Patel</strong></p><p>5x?</p><p><strong>Terence Tao</strong></p><p>Yeah, but these are auxiliary tasks. Things like doing a much deeper literature search or supplying a lot more numerics. They enrich the paper. 
The core of what I do, actually solving the most difficult part of a math problem, hasn&#8217;t changed too much. I still use pen and paper for that.</p><p>But there&#8217;s lots of silly things. I use an AI agent now to reformat. Sometimes if all my parentheses are not quite the right size, I used to manually change them by hand, and now I can get an AI agent to do all that quite nicely in the background.</p><p>They&#8217;ve really sped up lots of secondary tasks. They haven&#8217;t yet sped up the core thing that I do, but it&#8217;s allowed me to add more things to my papers. By the same token, if I were to write a paper I wrote in 2020 again&#8212;and not add all these extra features, but just have something of the same level of functionality&#8212;it actually hasn&#8217;t saved that much time, to be honest. It&#8217;s made the papers richer and broader, but not necessarily deeper.</p><p><strong>Dwarkesh Patel</strong></p><p>You made this <a href="https://mathstodon.xyz/@tao/115722360006034040">distinction between artificial cleverness and artificial intelligence</a>. I would like to better understand those concepts. What is an example of intelligence that is not just cleverness?</p><p><strong>Terence Tao</strong></p><p><a href="https://en.wikipedia.org/wiki/Intelligence">Intelligence</a> is famously hard to define. It&#8217;s one of these things that you know when you see it. But when I talk to someone and we&#8217;re trying to collaboratively solve a math problem together, there&#8217;s this conversation where neither of us knows how to solve the problem initially. One of us has some idea and it looks promising, so then we have some sort of prototype strategy. We test it, and it doesn&#8217;t work, but then we modify it. There&#8217;s adaptivity and continual improvement of the idea over time. 
Eventually, we&#8217;ve systematically mapped out what doesn&#8217;t work and what does work, and we can see a path forward, but it&#8217;s evolving with our discussion.</p><p>This isn&#8217;t quite what the AIs do. The AIs can mimic this a little bit. To go back to this analogy of these jumping robots, they can jump and fail, and jump and fail. But what they can&#8217;t do is jump a little bit, reach some handhold, stay there, pull other people up, and then try to jump from there. There isn&#8217;t this cumulative process which is built up interactively. It seems to be a lot more trial and error and just repetition: brute force. It scales, and it can work amazingly well in certain contexts. But this idea of building up cumulatively from partial progress is what&#8217;s still not quite there yet.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. You&#8217;re saying if Gemini 3 or Claude 4.5, whatever, solves a problem, it is not the case that its own understanding of math has progressed.</p><p><strong>Terence Tao</strong></p><p>No.</p><p><strong>Dwarkesh Patel</strong></p><p>Or even if it works on a problem without solving it, it&#8217;s not that its own understanding of math has progressed.</p><p><strong>Terence Tao</strong></p><p>Yeah. You run a new session and it&#8217;s forgotten what it just did. It has no new skills to build on related problems. Maybe what you just did is 0.001% of the training data for the next generation. 
So maybe eventually some of it gets absorbed.</p><h3>00:53:00 &#8211; If AI solves a problem, can humans get understanding out of it?</h3><p><strong>Dwarkesh Patel</strong></p><p>One big question I have is how plausible is it that if we just keep training AIs&#8212;they get better and better at solving problems in <a href="https://en.wikipedia.org/wiki/Lean_(proof_assistant)">Lean</a>&#8212;that they will continue to solve more and more impressive problems, and then we will be surprised at how little insight we got from some Lean solution to proving the <a href="https://en.wikipedia.org/wiki/Riemann_hypothesis">Riemann hypothesis</a> or something.</p><p>Or do you think it is a necessary condition of solving the Riemann hypothesis, even by an AI that is doing it entirely in Lean, that the constructions and definitions created in the Lean program have to advance our understanding of mathematics? Or could it just be assembly code gobbledygook?</p><p><strong>Terence Tao</strong></p><p>We don&#8217;t know. Some problems have been basically solved by pure brute force. The <a href="https://en.wikipedia.org/wiki/Four_color_theorem">four color theorem</a> is a famous example. We have still not found a conceptually elegant proof of this theorem, and maybe we never will. Some problems may only be solvable by splitting into an enormous number of cases and doing brute force, uninsightful computer analysis on each case.</p><p>Part of the reason we prize problems like the Riemann hypothesis is that we&#8217;re pretty sure a new type of mathematics has to be created, or a new connection between two previously unconnected areas of mathematics has to be discovered to make this work. We don&#8217;t even know what the shape of the solution is, but it doesn&#8217;t feel like a problem that will be solved just by exhaustively checking cases.</p><p>Or it could be false actually. 
Okay, there is an unlikely scenario that the hypothesis is false, and you can just compute a zero off the line, and a massive computer calculation verifies it. That would be very disappointing. I do feel that fully autonomous, one-shot approaches are not the right approach for these problems. You&#8217;ll get a lot more mileage out of the interplay of humans collaborating with these tools.</p><p>I can see one of these problems being solved by smart humans assisted by extremely powerful AI tools. But the exact dynamic may be very different from what we envision right now. It could be a collaboration of a type that just doesn&#8217;t exist yet.</p><p>There may be a way to generate a million variants of the <a href="https://en.wikipedia.org/wiki/Riemann_zeta_function">Riemann zeta function</a> and do AI-assisted data analysis to discover some pattern connecting them that we didn&#8217;t know about before. This lets you transform the problem into a different area of mathematics. There could be all kinds of scenarios.</p><p><strong>Dwarkesh Patel</strong></p><p>Suppose the AI figures it out, and latent in the Lean is some brand-new construction which, if we realized its significance, we would be able to apply in all these different situations. How would we even recognize it?</p><p>Again, a very naive question, but if you come up with the equivalent of <a href="https://en.wikipedia.org/wiki/Cartesian_coordinate_system">Descartes&#8217; idea that you can have a coordinate system unifying algebra and geometry</a>, in Lean code it would just look like R&#8594;R, and it wouldn&#8217;t look that significant. I&#8217;m sure there are other constructions which have this kind of property.</p><p><strong>Terence Tao</strong></p><p>The beauty of formalizing a proof in something like Lean is that you can take any piece of it and study it atomically. 
When I read a paper which solves some difficult problem, there&#8217;s often a big sequence of <a href="https://en.wikipedia.org/wiki/Lemma_(mathematics)">lemmas</a> and theorems. Ideally, the author will talk their way through what&#8217;s important and what&#8217;s not. But sometimes they don&#8217;t reveal what steps were the important ones and which ones were just boilerplate, standard steps.</p><p>You can study each lemma in isolation. Some of them I can see look fairly standard and resemble something I&#8217;m familiar with. I&#8217;m pretty sure there&#8217;s nothing interesting going on there. But this other lemma, that&#8217;s something I haven&#8217;t seen before, and I can see why having this result would really help prove the main result. You can assess whether a step is really key to your argument or not, and Lean really facilitates that. The individual steps are identified really precisely.</p><p>I think in the future, there will be entire professions of mathematicians who might take a giant Lean-generated proof and do some ablation on it, trying to remove parts of it and find more elegant ways. They might get other AIs to do some reinforcement learning to make the proof more elegant, and maybe other AIs will grade whether this proof looks better or not.</p><p>One thing that will change quite a bit in the near future is how we write papers. Until recently, writing papers was the most time-consuming and expensive part of the job. So you did it very rarely. You only wrote up your results once all the other parts of your argument were checked out, because rewriting and refactoring was just a total pain. That&#8217;s become a lot easier now with modern AI tools. You don&#8217;t have to have just one version of your paper. Once you have one, people can generate hundreds more.</p><p>One giant messy Lean proof may not be very meaningful or understandable on its own, but other people can refactor it and do all kinds of things with it. 
We&#8217;ve seen this with the <a href="https://www.erdosproblems.com/">Erd&#337;s problem website</a>. An AI will generate a proof, and here are 3,000 lines of code that verify the proof. Then people got other AIs to summarize the proof, and people write their own proofs.</p><p>There&#8217;s actually post-processing. Once you have one proof, we have a lot of tools now to deconstruct and interpret it. It&#8217;s a very nascent area of mathematics, but I&#8217;m not as worried about it. Some people are concerned about what happens if the Riemann hypothesis is proven with a completely incomprehensible proof. I think once you have the artifact of a proof, we can do a lot of analysis on it.</p><h3>00:59:20 &#8211; We need a semi-formal language for the way that scientists actually talk to each other</h3><p><strong>Dwarkesh Patel</strong></p><p><a href="https://mathstodon.xyz/@tao/116117407353355690">You posted recently</a> that it would be helpful to have a formal or semi-formal language for mathematical strategies as opposed to just mathematical proofs, which is what Lean specializes in. I would love to learn more about what that would involve or look like.</p><p><strong>Terence Tao</strong></p><p>We don&#8217;t really know. We&#8217;ve been very lucky in mathematics that we have worked out the laws of <a href="https://en.wikipedia.org/wiki/Mathematical_logic">logic</a> and mathematics, but this is a fairly recent accomplishment. It was started by <a href="https://en.wikipedia.org/wiki/Euclid">Euclid</a> two millennia ago, but only in the early 20th century did we finally list out the axioms of mathematics, the standard axioms of what we call <a href="https://en.wikipedia.org/wiki/Zermelo%E2%80%93Fraenkel_set_theory">ZFC</a>, the axioms of first-order logic, and what a proof is. This we&#8217;ve managed to automate and have a formal language for.</p><p>There could be some way to assess plausibility. 
You have a conjecture that something is true, you test a few examples, and it works out. How does this increase your confidence that the conjecture is true? We have a few mathematical ways to model this, like <a href="https://en.wikipedia.org/wiki/Bayesian_probability">Bayesian probability</a>, for example. But you often have to set certain base assumptions, and there&#8217;s a lot of subjectivity still in these tasks.</p><p>This is more of a wish than a plan to develop these languages, but just seeing how having a formal framework in place, like Lean, has made deductive proofs so much easier to automate and train AI on&#8230; The bottleneck for using AI to create strategies and make conjectures is that we have to rely on human experts and the test of time to validate whether something is plausible or not.</p><p>If there was some semi-formal framework where this could be done semi-automatically in a way that isn&#8217;t easily hackable... It&#8217;s really important with these formal proof assistants that there are no backdoors or exploits you can use to somehow get your certified proof without actually proving it, because reinforcement learning is just so good at finding these backdoors.</p><p>If there&#8217;s some framework that mimics how scientists talk to each other in a semi-formal way, using data and argument, but also constructing narratives... There&#8217;s some subjective aspect of science that we don&#8217;t know how to capture in a way that lets us insert AI into it usefully. This is a future problem. There are research efforts to try to create automated conjectures, and maybe there are ways to benchmark these and simulate this, but it&#8217;s all very new science.</p><p><strong>Dwarkesh Patel</strong></p><p>Can you help me get some intuition? I have two sub-questions. 
One, it would be very helpful to have a specific example of what something like this would look like, the way scientists communicate that we can&#8217;t formalize yet.</p><p>Two, it seems almost definitionally paradoxical to say you&#8217;re building up some narrative or natural language explanation and then also having something which you could have formalized. I&#8217;m sure there&#8217;s some intuition behind where that overlap is, and I&#8217;d love to understand that better.</p><p><strong>Terence Tao</strong></p><p>An example of a conjecture: <a href="https://en.wikipedia.org/wiki/Carl_Friedrich_Gauss">Gauss</a> was interested in the <a href="https://en.wikipedia.org/wiki/Prime_number">prime numbers</a> and created one of the first mathematical datasets. He just computed the first 100,000 prime numbers or so, hoping to find patterns. He did find a pattern, but maybe not the pattern he was expecting. He found a statistical pattern in the primes: if you count how many primes there are up to 100, 1,000, one million, and so forth, they get sparser and sparser, and the density was inversely proportional to the natural logarithm of the range of numbers.</p><p>So he conjectured what we now call the <a href="https://en.wikipedia.org/wiki/Prime_number_theorem#History_of_the_proof_of_the_asymptotic_law_of_prime_numbers">prime number theorem</a>: the number of primes up to X is approximately X divided by the natural log of X. He had no way to prove this. It was data-driven. This was a conjecture. It was revolutionary for its time because it was maybe the first really important conjecture of math that was statistical in nature. Normally you&#8217;re talking about a pattern, like maybe the spacing between the primes has a certain regularity. But this didn&#8217;t tell you exactly how many primes there were in any given range. 
It just gave you an approximation that got better and better as you went further and further out.</p><p>It started the field of what we call <a href="https://en.wikipedia.org/wiki/Analytic_number_theory">analytic number theory</a>. It was the first of many conjectures like this, many of which got proved, which started consolidating the idea that the prime numbers didn&#8217;t really have a pattern, that they behaved like random sets of numbers with a certain density. They had some patterns, like they&#8217;re almost all odd. They&#8217;re also not actually random, they&#8217;re what&#8217;s called <a href="https://en.wikipedia.org/wiki/Pseudorandomness">pseudo-random</a>. There&#8217;s no <a href="https://en.wikipedia.org/wiki/Random_number_generation">random number generation</a> involved in creating the prime numbers. But over time, it became more and more productive to think of the primes as if they were just generated by some god rolling dice all the time and creating this random set.</p><p>This allowed us to make all these other predictions. There&#8217;s a still-open conjecture in number theory called the <a href="https://en.wikipedia.org/wiki/Twin_prime">twin prime conjecture</a>, that there should be infinitely many pairs of primes that are twins, just two apart, like 11 and 13. We can&#8217;t prove that, and there are good reasons why we can&#8217;t prove it. But because of this statistical random model of the primes, we are absolutely convinced it&#8217;s true. We know that if the primes were generated by flipping coins, we would just, by random chance like infinite monkeys at a typewriter, see twin primes appear over and over again.</p><p>We have over time developed this very accurate conceptual model of how the primes should behave based on statistics and probability. It&#8217;s mostly heuristic and non-rigorous, but extremely accurate. 
The few times we actually can prove things about the primes, the results have matched up with the predictions of what we call the <a href="https://en.wikipedia.org/wiki/Cram%C3%A9r%27s_conjecture">random model of the primes</a>. We have this conjectural conceptual framework for understanding the primes that everyone believes in. It&#8217;s the same reason why we believe the Riemann hypothesis is true, and why we believe that <a href="https://www.geeksforgeeks.org/maths/why-prime-numbers-are-used-in-cryptography/">cryptography based on the primes</a> is mathematically secure. It&#8217;s all part of this belief.</p><p>In fact, one reason why we care about the Riemann hypothesis is that if the Riemann hypothesis failed, if we knew it was false, it would be a serious blow to this model. It would mean there&#8217;s a secret pattern to the primes that we were not aware of. I think we would very rapidly abandon any cryptography based on the primes, because if there was one pattern that we didn&#8217;t know about, there are probably more, and these patterns can lead to exploits in crypto. It would be a big shock. So we really want to make sure that doesn&#8217;t happen.</p><p>We&#8217;ve been convinced of things like the Riemann hypothesis over time. Some of it is experimental evidence, and some is that the few times we&#8217;ve been able to prove theoretical results, they&#8217;ve always aligned. It is possible that the consensus is wrong and we&#8217;ve all just missed something very basic. There have been paradigm shifts in the past in scientific history. But we don&#8217;t really have a way of measuring this, partly because we don&#8217;t have enough data on how math or science develops. 
We have one timeline of history, and we have maybe 100 stories of turning points in history.</p><p>If we had access to a million alien civilizations, each with a different development of history and science in different orders, then maybe we&#8217;d actually have a decent shot at understanding how we measure what progress is and what is a good strategy. We could maybe start formalizing it and actually having a framework. Maybe what we need to do is start creating lots of mini-universes or simulations of AI solving very basic problems in arithmetic or whatever, but coming up with their own strategies for doing these things and having these little laboratories to test. There are people who investigate what&#8217;s the smallest neural network that can do 10-digit multiplication and things like that. I think we could learn a lot just from evolving small AIs on simple problems.</p><h3>01:09:48 &#8211; How Terry uses his time</h3><p><strong>Dwarkesh Patel</strong></p><p>You have to learn about new fields not only very rapidly, but deeply enough to contribute to the frontier. So in some sense, you&#8217;re also one of the world&#8217;s greatest autodidacts. What is your process of learning about a new subfield in math? What does that look like?</p><p><strong>Terence Tao</strong></p><p>We talked about depth and breadth before. It&#8217;s not a purely human-AI distinction. Humans also, I think it was <a href="https://en.wikipedia.org/wiki/Isaiah_Berlin">Berlin</a> who split them into <a href="https://en.wikipedia.org/wiki/The_Hedgehog_and_the_Fox">hedgehogs and foxes</a>. The hedgehog knows one thing very well, and a fox knows a little bit about everything. I definitely think of myself as a fox. I work with hedgehogs a lot, and sometimes I can be a hedgehog if need be.</p><p>I&#8217;ve always had a little bit of an obsessive streak. 
If there&#8217;s something I read about which I feel like I have the capability to understand, but I don&#8217;t understand why it works and there&#8217;s some magic in it&#8230; Someone was able to use a type of mathematics I&#8217;m not familiar with and get a result I would like to prove. I can&#8217;t do it myself, but they could do it by their method, and I want to find out what their trick was. It bugs me that someone else can do something I think I can do, but I can&#8217;t. I&#8217;ve always had that obsessive, completionist streak. I&#8217;ve had to wean myself off computer games because if I start a game, I want to play it to completion, through all the levels. That&#8217;s one way I learn new fields.</p><p>I collaborate with a lot of people who have taught me other types of mathematics. I just make friends with another mathematician working on another area of mathematics. I find their problems interesting, but they have to teach me some of the basic tricks, what&#8217;s known, and what&#8217;s not known. I learn a lot from that.</p><p>I found that writing about what I&#8217;ve learned helps. I have a <a href="https://terrytao.wordpress.com/">blog</a> where I sometimes record things I&#8217;ve learned. In the past when I was younger, I would learn something, do this cool trick, and say, &#8220;Okay, I&#8217;m going to remember this.&#8221; Then six months later, I&#8217;d forgotten it. I remember remembering it, but I can&#8217;t reconstruct my arguments. The first few times, it was so frustrating to have understood something and then lost it. I resolved I should always write down anything cool that I&#8217;ve learned. That&#8217;s part of how this blog came about.</p><p><strong>Dwarkesh Patel</strong></p><p>How long does it take you to write a blog post?</p><p><strong>Terence Tao</strong></p><p>It&#8217;s something I often do when I don&#8217;t want to do other work. 
There&#8217;s some referee report or something that feels slightly unpleasant for me to do at the time. Writing a blog feels creative and fun. It&#8217;s something I do for myself.</p><p>Depending on the topic, it could be a quick half an hour or several hours. Because it&#8217;s something I do voluntarily, time flies when I write these things down, as opposed to doing something I have to do for administrative reasons that is just drudgery. Those are tasks, by the way, that AI is really helping with nowadays.</p><p><strong>Dwarkesh Patel</strong></p><p>If civilization could from first principles decide how to use Terry Tao&#8217;s time, as a limited resource, what is the biggest difference? What if the <a href="https://en.wikipedia.org/wiki/Original_position">veil of ignorance</a> got to decide how to use Terry Tao&#8217;s time versus what it does now? This podcast wouldn&#8217;t be happening.</p><p><strong>Terence Tao</strong></p><p>As much as I complain about certain tasks that I don&#8217;t want to do, but have to do&#8230; As you get more senior in academia, you get more and more responsibilities, more committees, and whatever. I have also found that a lot of events I reluctantly went to because I was obliged to for one reason or another&#8230;  Because it&#8217;s outside my comfort zone, it often results in interactions with people I wouldn&#8217;t normally talk to, like you for instance. I would learn interesting things and have interesting experiences. I would have opportunities to then network with other people that I never would have before.</p><p>So I do believe a lot in serendipity. I do optimize portions of my day where I schedule very carefully. But I am willing to leave some portions just to do something that is not my usual thing. Maybe it&#8217;ll be a waste of my time, but maybe I will learn something. More often than not, I get a positive experience that I wouldn&#8217;t have planned for.</p><p>So I believe a lot in serendipity. 
Maybe there&#8217;s a danger in modern societies, not just with AI, that we&#8217;ve become really good at optimizing everything. We&#8217;re not optimizing our own optimization. With COVID, for example, we switched a lot to remote meetings, so everything was scheduled. We kept busy in academia. We met almost the same number of people we met in person, but everything had to be planned in advance. What we lost out on was the casual knocking on a hallway door, just meeting someone while getting a coffee. Those serendipitous interactions may not seem optimal, but they are actually really important.</p><p>When I was a grad student, I would go to the library to look for a journal article. You had to physically check out the journal and read the article. You could browse through and sometimes the next article was also interesting. Sometimes it wasn&#8217;t, but you could accidentally find interesting things. That has basically been lost now. If you want to access an article, you just type it into a search engine or an AI, and you get exactly what you want instantly. But you don&#8217;t get the accidental things you might have found if you&#8217;d done it more inefficiently.</p><p>I spent a year once at the <a href="https://en.wikipedia.org/wiki/Institute_for_Advanced_Study">Institute for Advanced Study</a>, which is a great place with no distractions. You&#8217;re there just to do research. The first few weeks you&#8217;re there, it&#8217;s great. You&#8217;re getting all these papers written up that you&#8217;ve been wanting to do for a long time. You think about problems for blocks of hours at a time. But I find if I stay there for more than several months, I run out of inspiration. I get bored. I surf the internet a lot more.</p><p>You actually do need a certain level of distraction in your life. It adds enough randomness and high temperature. I don&#8217;t know the optimal way to schedule my life. 
It just seems to work.</p><h3>01:17:05 &#8211; Human-AI hybrids will dominate math for a lot longer</h3><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m very curious when you expect AIs that can actually do frontier math at least as well as the best human mathematicians.</p><p><strong>Terence Tao</strong></p><p>In some ways, they&#8217;re already doing frontier math that is super intelligent that humans can&#8217;t do, but it&#8217;s a different frontier from what we&#8217;re used to. You could argue that calculators were doing frontier math that humans could not accomplish, but it was number crunching.</p><p><strong>Dwarkesh Patel</strong></p><p>But replacing Terry Tao completely.</p><p><strong>Terence Tao</strong></p><p>I mean, what do you want me for?</p><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;ll just go on all the podcasts after.</p><p><strong>Terence Tao</strong></p><p>It might not be the right question to ask. I think within a decade, a lot of things that math students currently do&#8212;what we spend the bulk of our time doing and a lot of stuff we put in our papers today&#8212;can be done by AI. But we will find that that actually wasn&#8217;t the most important part of what we do.</p><p>A hundred years ago, a lot of mathematicians were just solving <a href="https://en.wikipedia.org/wiki/Differential_equation">differential equations</a>. Physicists needed some exact solution to some system, and they hired a mathematician to laboriously go through the calculus and work out the solution to this fluid equation, whatever. A lot of what a 19th-century mathematician would do, you could make a call to <a href="https://en.wikipedia.org/wiki/Wolfram_Mathematica">Mathematica</a>, Wolfram Alpha, a computer algebra package, or now more recently to an AI, and it would just solve the problem in a few minutes. But we moved on. We worked on different types of problems after that.</p><p>Once computers came along&#8212;computers used to be human. 
People used to laboriously create log tables and work out primes as Gauss did, and that has all been outsourced to computers. But we moved on.</p><p>In genetics, to sequence the genome of a single organism, that was an entire PhD of a geneticist, carefully separating all the chromosomes and whatever. Now you can just spend $1,000 and send it to a sequencer and get it done. But genetics is not dead as a subject. You move to a different scale. Maybe you study whole ecosystems rather than individuals.</p><p><strong>Dwarkesh Patel</strong></p><p>I take your point but when is most mathematical progress, or almost all mathematical progress, happening by AI? If you find out this year a Millennium Prize Problem has been solved, you would put 95% odds that an AI did it autonomously. Surely there will be such a year.</p><p><strong>Terence Tao</strong></p><p>I guess I do believe that hybrid human plus AIs will dominate mathematics for a lot longer. It will depend. It will require some additional breakthroughs beyond what we already have, so it&#8217;s going to be stochastic. I think AIs currently are very good at certain things, but really terrible at others. While you can add more and more frameworks on top to reduce the error rates and make them work with each other a bit more, it feels like we don&#8217;t have all the ingredients to really have a truly satisfactory replacement for all intellectual tasks.</p><p>It is complementary currently. It&#8217;s not a replacement. Because current level AIs will accelerate science in so many ways, hopefully new discoveries and new breakthroughs will happen more quickly. It&#8217;s also possible that by destroying serendipity we actually inhibit certain types of progress. Anything is possible at this point. 
I think the world is very, very unpredictable at this point in time.</p><p><strong>Dwarkesh Patel</strong></p><p>What is your advice to somebody who would consider a career in math or is early in a career in math, especially in light of AI progress? How should they be thinking about their career differently, if at all, as a result of AI progress?</p><p><strong>Terence Tao</strong></p><p>We live in a time of change. As I said, we live in a particularly unpredictable era. Things that we&#8217;ve taken for granted for centuries may not hold anymore. The way we do everything, and not just mathematics, will change. In many ways, I would prefer the much more boring, quiet era where things are much the same as they were 10 years ago, 20 years ago. But I think one just has to embrace that there&#8217;s going to be a lot of change. The things that you study, some of them may become obsolete or revolutionized, but some things will be retained.</p><p>You always have to keep an eye on opportunities for things that you wouldn&#8217;t be able to do before. In math, you previously had to go through years and years of education and be a math PhD before you could contribute to the frontier of math research. But now it&#8217;s quite possible at the high school level, or whatever, that you could get involved in a math project and actually make a real contribution because of all these AI tools, Lean, and everything else.</p><p>There will be a lot of non-traditional opportunities to learn, so you need a very adaptable mindset. There will be room for pursuing things just for curiosity and for playing around. You still need to get your credentials. For a while it will still be important to go through traditional education and learn math and science the old-fashioned way. But you should also be open to very different ways of doing science, some of which don&#8217;t exist yet. 
It&#8217;s a scary time, but also very exciting.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s a great note to close on. Terence, thanks so much.</p><p><strong>Terence Tao</strong></p><p>Pleasure.</p>]]></content:encoded></item><item><title><![CDATA[Dylan Patel — Deep dive on the 3 big bottlenecks to scaling AI compute]]></title><description><![CDATA[Plus, why an H100 is worth more today than 3 years ago]]></description><link>https://www.dwarkesh.com/p/dylan-patel</link><guid isPermaLink="false">https://www.dwarkesh.com/p/dylan-patel</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 13 Mar 2026 16:00:42 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/190839917/ba9582725eaf9c7756c2e37d28263b97.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p><a href="https://x.com/dylan522p?lang=en">Dylan Patel</a>, founder of <a href="https://semianalysis.com/">SemiAnalysis</a>, provides a deep dive into the 3 big bottlenecks to scaling AI compute: logic, memory, and power.</p><p>And walks through the economics of labs, hyperscalers, foundries, and fab equipment manufacturers.</p><p>Learned a ton about every single level of the stack. 
Enjoy!</p><p>Watch on <a href="https://youtu.be/mDG_Hx3BSUE">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/dylan-patel-deep-dive-on-the-3-big-bottlenecks-to/id1516093381?i=1000755126873">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/5qiibwoBWY5rXyflK7WJzH?si=e1316364956d485d">Spotify</a>.</p><div id="youtube2-mDG_Hx3BSUE" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;mDG_Hx3BSUE&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/mDG_Hx3BSUE?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2><strong>Sponsors</strong></h2><ul><li><p><a href="https://mercury.com/">Mercury</a> has already saved me a bunch of time this tax season. Last year, I used Mercury to request W-9s from all the contractors I worked with. Then, when it came time to issue 1099s this year, I literally just clicked a button and Mercury sent them out. Learn more at <a href="https://mercury.com">mercury.com</a>.</p></li><li><p><a href="https://labelbox.com/dwarkesh">Labelbox</a> noticed that even when voice models <em><a href="https://labelbox.com/dwarkesh">appear</a></em> to take interruptions in stride, their performance degrades. To figure out why, they built a new evaluation pipeline called EchoChain. EchoChain diagnoses voice models&#8217; specific failure modes, letting you understand what your model needs to truly handle interruptions. Check it out at <a href="https://labelbox.com/dwarkesh">labelbox.com/dwarkesh</a>.</p></li><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> is basically a research lab with a trading desk attached &#8211; and their infrastructure backs this up. 
They&#8217;ve got tens of thousands of GPUs, hundreds of thousands of CPU cores, and exabytes of storage. This is what it takes to find subtle signals hidden deep within noisy market data. If this sounds interesting, you can explore open positions at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a>.</p></li></ul><h2>Timestamps</h2><p><a href="https://www.dwarkesh.com/i/190839917/000000-why-an-h100-is-worth-more-today-than-3-years-ago">(00:00:00) &#8211; Why an H100 is worth more today than 3 years ago</a></p><p><a href="https://www.dwarkesh.com/i/190839917/002452-nvidia-secured-tsmc-allocation-early-google-is-getting-squeezed">(00:24:52) &#8211; Nvidia secured TSMC allocation early; Google is getting squeezed</a></p><p><a href="https://www.dwarkesh.com/i/190839917/003434-asml-will-be-the-1-constraint-for-ai-compute-scaling-by-2030">(00:34:34) &#8211; ASML will be the #1 constraint for AI compute scaling by 2030</a></p><p><a href="https://www.dwarkesh.com/i/190839917/005547-cant-we-just-use-tsmcs-older-fabs">(00:55:47) &#8211; Can&#8217;t we just use TSMC&#8217;s older fabs?</a></p><p><a href="https://www.dwarkesh.com/i/190839917/010537-when-will-china-outscale-the-west-in-semis">(01:05:37) &#8211; When will China outscale the West in semis?</a></p><p><a href="https://www.dwarkesh.com/i/190839917/011601-the-enormous-incoming-memory-crunch">(01:16:01) &#8211; The enormous incoming memory crunch</a></p><p><a href="https://www.dwarkesh.com/i/190839917/014234-scaling-power-in-the-us-will-not-be-a-problem">(01:42:34) &#8211; Scaling power in the US will not be a problem</a></p><p><a href="https://www.dwarkesh.com/i/190839917/015444-space-gpus-arent-happening-this-decade">(01:54:44) &#8211; Space GPUs aren&#8217;t happening this decade</a></p><p><a href="https://www.dwarkesh.com/i/190839917/021407-why-arent-more-hedge-funds-making-the-agi-trade">(02:14:07) &#8211; Why aren&#8217;t more hedge funds making the AGI trade?</a></p><p><a 
href="https://www.dwarkesh.com/i/190839917/021830-will-tsmc-kick-apple-out-from-n2">(02:18:30) &#8211; Will TSMC kick Apple out from N2?</a></p><p><a href="https://www.dwarkesh.com/i/190839917/022416-robots-and-taiwan-risk">(02:24:16) &#8211; Robots and Taiwan risk</a></p><h2>Transcript</h2><h3>00:00:00 &#8211; Why an H100 is worth more today than 3 years ago</h3><p><strong>Dwarkesh Patel</strong></p><p>All right, this is the episode where my roommate teaches me semiconductors.</p><p><strong>Dylan Patel</strong></p><p>It&#8217;s also the send off for this current set.</p><p><strong>Dwarkesh Patel</strong></p><p>It is. After you use it, I&#8217;m like, &#8220;I can&#8217;t use this again. I gotta get out of here.&#8221;</p><p><strong>Dylan Patel</strong></p><p>No sloppy seconds for Dwarkesh.</p><p><strong>Dwarkesh Patel</strong></p><p><a href="https://www.dwarkesh.com/p/dylan-jon">Dylan</a> is the CEO of <a href="https://semianalysis.com/">SemiAnalysis</a>. Dylan, here&#8217;s the burning question I have for you. If you add up the big four&#8212;Amazon, Meta, Google, Microsoft&#8212;their combined forecasted CapEx this year that you published recently is $600 billion. Given yearly prices of renting that compute, that would be close to 50 gigawatts. Obviously, we&#8217;re not putting on 50 gigawatts this year, so presumably that&#8217;s paying for compute that is going to be coming online over the coming years. How should we think about the timeline around when that CapEx comes online?</p><p>Similar question for the labs. OpenAI just announced they raised $110 billion, and Anthropic just announced they raised $30 billion. If you look at the compute they have coming online this year&#8212;you should tell me how much it is, but is it on the order of another four gigawatts total? The cost to rent the compute that OpenAI and Anthropic will have this year to sustain their compute spend is $10 to $13 billion a gigawatt. 
Those individual raises alone are enough to cover their compute spend for the year. And this is not even including the revenue that they&#8217;re going to earn this year.</p><p>So help me understand: first, what is the timescale at which the Big Tech CapEx actually comes online? And second, what are the labs raising all this money for if the yearly price of a one-gigawatt data center is $13 billion?</p><p><strong>Dylan Patel</strong></p><p>So when you talk about the CapEx of these hyperscalers being on the order of $600 billion, and you look across the rest of the supply chain, it gets you to the order of a trillion dollars. A portion of this is immediately for compute going online this year: the chips and the other parts of CapEx that get paid this year. But there&#8217;s a lot of setup CapEx as well.</p><p>When we&#8217;re talking about 20 gigawatts of incremental added capacity this year in America, a portion of this is not spent this year. A portion of that CapEx was actually spent the prior year. When you look at Google having $180 billion, a big chunk of that is spent on turbine deposits for &#8216;28 and &#8216;29. A chunk of that is spent on data center construction for &#8216;27. A chunk of that is spent on power purchasing agreements, down payments, and all these other things they&#8217;re doing further out into the future so they can set up this super fast scaling. This applies to all the hyperscalers and other people in the supply chain.</p><p>So with roughly 20 gigawatts deployed this year, a big chunk is hyperscalers, and a chunk is not. For all of these companies, their biggest customers are Anthropic and OpenAI. Anthropic and OpenAI are at roughly two to two-and-a-half gigawatts right now, and they&#8217;re trying to scale much larger.</p><p>If you look at what Anthropic has done over the last few months, with $4 billion or $6 billion in revenue added, we can just draw a straight line and say they&#8217;ll add another $6 billion of revenue a month. 
People would argue that&#8217;s bearish, and that they should go faster. What that implies is they&#8217;re going to add $60 billion of revenue across the next ten months. At the current gross margins Anthropic had, as last reported by media, that would imply they have roughly $40 billion of compute spend for that inference, for that $60 billion of revenue.</p><p>That $40 billion of compute, at roughly $10 billion a gigawatt in rental costs, means they need to add four gigawatts of inference capacity just to grow revenue. That&#8217;s assuming their research and development training fleet stays flat. In a sense, Anthropic needs to get to well above five gigawatts by the end of this year. It&#8217;s going to be really tough for them to get there, but it&#8217;s possible.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I ask a question about that? If Anthropic was not on track to have five gigawatts by the end of this year, but it needs that to serve both the revenue that&#8217;s gone crazier than expected&#8212;and maybe it&#8217;s going to be even more than that&#8212;plus the research and training to make sure its models are good enough for next year: Where is that capacity going to come from?</p><p><strong>Dylan Patel</strong></p><p><a href="https://www.dwarkesh.com/p/dario-amodei-2">Dario, when he was on your podcast</a>, was very conservative. He said, &#8220;I&#8217;m not going to go crazy on compute because if my revenue inflects at a different rate, at a different point&#8230; I don&#8217;t want to go bankrupt. I want to make sure that we&#8217;re being responsible with this scaling.&#8221; But in reality, he&#8217;s screwed the pooch compared to OpenAI, whose approach was, &#8220;Let&#8217;s just sign these crazy fucking deals.&#8221;</p><p>OpenAI has got way more access to compute than Anthropic by the end of the year. What does Anthropic have to do to get the compute? They have to go to lower-quality providers that they would not have gone to before. 
Anthropic historically had the best quality providers, like Google and Amazon, the biggest companies in the world. Now Microsoft is expanding across the supply chain, and they&#8217;re going to other newer players.</p><p>OpenAI has been a bit more aggressive on going to many players. Yes, they have tons of capacity from Microsoft, Google, and Amazon, but they also have tons with <a href="https://en.wikipedia.org/wiki/CoreWeave">CoreWeave</a> and Oracle. They&#8217;ve gone to random companies, or companies one would think are random, like <a href="https://www.wsj.com/tech/ai/openai-softbank-to-invest-1-billion-in-sb-energy-fa7385b9">SoftBank Energy</a>, who has never built a data center in their life but is building data centers now for OpenAI. They&#8217;ve gone to many others, like <a href="https://www.nscale.com/">NScale</a>, to get capacity.</p><p>There&#8217;s this conundrum for Anthropic because they were so conservative on compute, because they didn&#8217;t want to go crazy. In some sense, a lot of the financial freakouts in the second half of last year were because, &#8220;OpenAI signed all these deals but they didn&#8217;t have the money to pay for them&#8230;&#8221; Okay, Oracle&#8217;s stock is going to tank, CoreWeave&#8217;s stock is going to tank. All these companies&#8217; stocks tanked, and credit markets went crazy because people thought the end buyer couldn&#8217;t pay for this. Now it&#8217;s like, &#8220;Oh wait, they raised a ton of money. Okay, fine, they can pay for it.&#8221;</p><p>Anthropic was a lot more conservative. They were like, &#8220;We&#8217;ll sign contracts, but we&#8217;ll be principled. We&#8217;ll purposely undershoot what we think we can possibly do and be conservative because we don&#8217;t want to potentially go bankrupt.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>The thing I want to understand is, what does it mean to have to acquire compute in a pinch? 
Is it that you have to go with <a href="https://www.mckinsey.com/capabilities/tech-and-ai/our-insights/the-evolution-of-neoclouds-and-their-next-moves">neoclouds</a>? Do they have worse compute? In what way is it worse?</p><p>Did you have to pay gross margins to a cloud provider that you wouldn&#8217;t have otherwise had to pay because they&#8217;re coming in at the last minute? Who built the spare capacity such that it&#8217;s available for Anthropic and OpenAI to get last minute?</p><p>What is the concrete advantage that OpenAI has gotten if they end up at similar compute numbers by 2027? Are they just going to end this year with different gigawatts? If so, how many gigawatts are Anthropic and OpenAI going to have by the end of this year?</p><p><strong>Dylan Patel</strong></p><p>To acquire excess compute, yes, there is capacity at hyperscalers. Not all contracts for compute are long-term, five-year deals. There&#8217;s compute from 2023 or 2024, or H100s from 2025, that were signed at shorter terms. The vast majority of OpenAI&#8217;s compute is signed on five-year deals, but there were many other customers that had one-year, two-year, three-year, or six-month deals, on demand.</p><p>As these contracts roll off, who is the participant in the market most willing to pay price? In this sense, we&#8217;ve seen H100 prices inflect a lot and go up. People are willing to sign long-term deals for above $2 even. I&#8217;ve seen deals where certain AI labs&#8212;I&#8217;m being a little bit vague here for a reason&#8212;have signed at as high as $2.40 for two to three years for H100s. If you think about the margin, it costs $1.40 to build Hopper, across five years. Now, two years in, you&#8217;re signing deals for two to three years at $2.40? 
Those margins are way higher.</p><p>Now you can crowd out all of these other suppliers, whether Amazon had these, or CoreWeave, or <a href="https://www.together.ai/">Together AI</a>, or <a href="https://nebius.com/">Nebius</a>, or whoever it is. These neoclouds are the firms that had a higher percentage of Hopper in general because they were more aggressive on it. They also tended to sign shorter-term deals, not CoreWeave but the others. So if I want Hopper, there is some capacity out there.</p><p>Also, while most of the capacity at an Oracle or a CoreWeave is signed for a long-term deal in terms of Blackwell, anything that&#8217;s going online this quarter is already sold. In some cases, they&#8217;re not even hitting all the numbers they promised they would sell because there are some data center delays, not just those two, but Nebius, Microsoft, Amazon, and Google. But there are a lot of neoclouds, as well as some of the hyperscalers, who have capacity they&#8217;re building that they haven&#8217;t sold yet, or capacity they were going to allocate to some internal use that is not necessarily super AGI-focused, that they may now turn around and sell.</p><p>Or in the case of Anthropic, they don&#8217;t have to have all the compute directly. 
Amazon can have the compute and serve <a href="https://aws.amazon.com/bedrock/">Bedrock</a>, or Google can have the compute and serve <a href="https://cloud.google.com/vertex-ai">Vertex</a>, or Microsoft can have the compute and serve <a href="https://azure.microsoft.com/en-us/products/ai-foundry">Foundry</a>, and then do a revenue share with Anthropic, or vice versa.</p><p><strong>Dwarkesh Patel</strong></p><p>Basically, you&#8217;re saying Anthropic is having to pay either this 50% markup in the sense of the revenue share, or in the sense of last-minute spot compute that they wouldn&#8217;t have otherwise had to pay had they bought the compute early.</p><p><strong>Dylan Patel</strong></p><p>Right, there&#8217;s a trade-off there. But at the same time, for a solid four months, everyone was saying to OpenAI, &#8220;We&#8217;re not going to sign deals with you.&#8221; That sounds crazy, but it was because, &#8220;you don&#8217;t have the money.&#8221; Now everyone&#8217;s saying, &#8220;OpenAI, we believed you the whole time. We can sign any deal because you&#8217;ve raised all this money.&#8221; Anthropic is constrained in that sense. There are not that many incremental buyers of compute yet, because Anthropic hit the capability tier first where their revenue is mooning.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s interesting. Otherwise you might think having the best model is an extremely depreciating asset, because three months later you don&#8217;t have the best model. But the reason it&#8217;s important is that you can sign these deals, lock in the compute in advance, and get better prices.</p><p>Maybe this is an obvious point. But at least until recently, people had made this huge point about the <a href="https://www.cnbc.com/2025/11/14/ai-gpu-depreciation-coreweave-nvidia-michael-burry.html">depreciation cycle of a GPU</a>. 
The bears, the <a href="https://x.com/michaeljburry/status/1987918650104283372?lang=en">Michael Burrys</a> or whoever, have said, &#8220;Look, people are saying four or five years for these GPUs. Maybe it&#8217;s because the technology is improving so fast, but it in fact makes sense to have two-year depreciation cycles for these GPUs,&#8221; which increases the reported amortized CapEx in a given year and makes it financially less lucrative to build all these clouds.</p><p>But in fact you&#8217;re pointing out that maybe the depreciation cycle is even longer than five years. If we&#8217;re using Hoppers&#8212;especially if AI really takes off and in 2030 we&#8217;re saying, &#8220;We have to get the seven-nanometer fabs up, we have to go back and turn on the A100s again&#8221;&#8212;then the depreciation cycle is actually incredibly long. I feel like that&#8217;s an interesting financial implication of what you&#8217;re saying.</p><p><strong>Dylan Patel</strong></p><p>There&#8217;s a few strings to pull on there. One is, what happens to depreciation of GPUs? I guess I didn&#8217;t answer your prior question, which is that I think Anthropic will be able to get to five gigawatts-ish, maybe a little bit more by the end of the year through themselves as well as their product being served through Bedrock, Vertex, or Foundry. I think they&#8217;ll be able to get to five or six gigawatts, which is way above their initial plans. OpenAI will be roughly the same, actually a little bit higher based on our numbers.</p><p>But anyway, the depreciation cycle of a GPU. Michael Burry was saying it&#8217;s three years or less. That&#8217;s sort of his argument. There are two lenses to look at this. Mechanically, there&#8217;s a <a href="https://www.investopedia.com/terms/t/totalcostofownership.asp">TCO model</a>, total cost of ownership of a GPU, where we project pricing out for GPUs and build up the total cost of a cluster. 
There are a number of costs: your data center cost, your networking cost, your smart hands and people in the data center swapping stuff out. There&#8217;s your spare parts, your actual chip cost, your server cost. All these various costs get lumped together. There&#8217;s some depreciation cycles on it, certain credit costs on it.</p><p>You build up to, &#8220;Hey, an H100 costs $1.40/hour to deploy at volume across five years if your depreciation is five years.&#8221; If you sign a deal at $2/hour for those five years, your gross margin is roughly 35%. It&#8217;s a little bit above that. If you sign it for $1.90, it&#8217;s 35% roughly. Then you assume at that fifth year, the GPU falls off a bus and is dead.</p><p>In some cases, the argument people are making is if you didn&#8217;t sign a long-term deal, because every two years NVIDIA is tripling or quadrupling the performance while only 2X-ing or 50% increasing the price&#8230; Then the price of an H100&#8230; Sure maybe the value in the market was $2 at 35% gross margins in 2024, but in 2026, when Blackwell is in super high volume and deploying millions a year, you&#8217;re actually now worth $1/hour. And when Rubin in &#8216;27 is in super high volume&#8212;even though it starts shipping this year, it&#8217;s super high volume next year&#8212;doing millions of chips a year deployed into clouds, you&#8217;ve got another 3X in performance, another 50% or 2X in price, then the Hopper is only worth $0.70/hour. So the price of a GPU would continue to fall. That&#8217;s one lens.</p><p>The other lens is, what is the utility you get out of the chip? If you could build infinite Rubin or infinite of the newest chip, then yes, that&#8217;s exactly what would happen. The price of a Hopper would fall at a spot or short-term contract rate as the new chips come out and the price per performance goes up. 
But because you are so limited on semiconductors and deployment timelines, what actually prices these chips is not the comparative thing I can buy today, but rather what is the value I can derive out of this chip today.</p><p>In that sense, let&#8217;s take <a href="https://openai.com/index/introducing-gpt-5-4/">GPT-5.4</a>. GPT-5.4 is both way cheaper to run than GPT-4 and has fewer active parameters. It&#8217;s much smaller, in the sense of active parameters, because it&#8217;s a sparser <a href="https://en.wikipedia.org/wiki/Mixture_of_experts">MoE</a> versus GPT-4 being a coarser MoE. There have also been so many other advancements in training, <a href="https://web.stanford.edu/class/psych209/Readings/SuttonBartoIPRLBook2ndEd.pdf">RL</a>, model architecture, and data quality that have made GPT-5.4 way better than GPT-4. And it&#8217;s cheaper to serve. When you look at an H100, it can serve more tokens per GPU of 5.4 than if you had run GPT-4 on it. So it&#8217;s producing more tokens of a model that is of higher quality.</p><p>What is the maximum TAM for GPT-4 tokens? Maybe it was a few billion dollars, maybe it was tens of billions of dollars. Adoption takes time. For GPT-5.4, that number is probably north of a hundred billion. But there&#8217;s an adoption lag, there&#8217;s competition, and there&#8217;s the constant improvements that everyone else is having. If improvements stopped here, the value of an H100 is now predicated on the value that GPT-5.4 can get out of it instead of the value that GPT-4 can get out of it. These labs are in a competitive environment, so their margins can&#8217;t go to infinity. You sort of have this dynamic that is quite interesting in that an H100 is worth more today than it was three years ago.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s crazy. It&#8217;s also interesting from the perspective of just taking that forward. 
If we had actual AGI models developed, if we had a genuine human on a server&#8230; These are such hand wave-y numbers about how many flops the brain can do. But on a flop basis, an H100 is estimated to do 1e15, which is how much some people estimate the human brain does in flops. Obviously, in terms of memory, the human brain has way more. An H100 is 80 gigabytes, and the brain might have petabytes.</p><p><strong>Dylan Patel</strong></p><p>Oh, yeah, you&#8217;ve got petabytes? Name a petabyte of ones and zeros, bro. Name me a string.</p><p><strong>Dwarkesh Patel</strong></p><p>Well, this is actually the point.</p><p><strong>Dylan Patel</strong></p><p>No, we&#8217;ve just got the best <a href="https://medium.com/@vishal09vns/sparse-attention-dad17691478c">sparse attention</a> techniques ever.</p><p><strong>Dwarkesh Patel</strong></p><p>Genuinely though. In the amount of information that is compressed, it might be petabytes. The brain is an extremely sparse MoE. But anyways, imagine a human knowledge worker can produce six figures a year of value. If an H100 can produce something close to that, if we had actual humans on a server, the value of an H100 is such that it can repay itself in the course of a couple of months.</p><p>So when I interviewed Dario, the point I was trying to make is not that I think the <a href="https://en.wikipedia.org/wiki/Technological_singularity">singularity</a> is two years away and therefore Dario desperately needs to buy more compute, although the revenue is certainly there that he needs to buy more compute. 
The point I was trying to make is that given what Dario seems to be saying&#8212;given his statements that we&#8217;re two years away from a data center of geniuses, and certainly not more than five years away, and a data center of geniuses should be earning trillions upon trillions of dollars of revenue&#8212;it just does not make sense why he keeps making these statements about being more conservative on compute or, to your point, being less aggressive than OpenAI on compute.</p><p>I guess that point got lost because then people were roasting me, saying, &#8220;Oh, this podcaster is trying to convince this multi-hundred billion dollar company CEO to YOLO it, bro.&#8221; I was just trying to say that his statements are internally inconsistent. Anyway, it&#8217;s good to iron it out.</p><p><strong>Dylan Patel</strong></p><p>I think going back to the earlier view that if the models are so powerful, the value of a GPU goes up over time, right now only OpenAI and Anthropic have that viewpoint. But as we approach further out, everyone is going to be able to see that value skyrocket per GPU. So in that sense, you should commit now to compute.</p><p>Interestingly, in Anthropic fashion, there&#8217;s a bit of a meme that they have commitment issues and are sort of polyamorous. Not Dario, but this is a bit of a meme.</p><p><strong>Dwarkesh Patel</strong></p><p>Explains everything. By the way, there&#8217;s this interesting economic effect called <a href="https://en.wikipedia.org/wiki/Alchian%E2%80%93Allen_effect">Alchian-Allen</a>, which is the idea that if you add the same fixed cost to two goods, one of which is higher quality and one of which is lower quality, that will make people choose the higher quality good, on the margin.</p><p>To give a specific example, suppose the better-tasting apple costs two dollars and the shittier apple costs one dollar. Now suppose you put an import tariff on them. 
Now it&#8217;s $3 versus $2 for a great apple versus a medium apple.</p><p><strong>Dylan Patel</strong></p><p>Is that because they both increased by a dollar, or should it be a 50% increase?</p><p><strong>Dwarkesh Patel</strong></p><p>No, because they both increased by $1. The whole effect is that if there&#8217;s a fixed cost that is applied to both, then the price difference between them, the ratio, changes. Previously, the more expensive one was 2X more expensive. Now it&#8217;s just 1.5X more expensive.</p><p>So I wonder if applied to AI that would mean that, if GPUs are going to get more expensive, there will be a fixed cost increase in the price of compute. As a result, that will push people to be willing to pay higher margins for slightly better models. Because the calculus is, I&#8217;m going to be paying all this money for the compute anyway. I might as well just pay slightly more to make sure it&#8217;s the very best model rather than a model that&#8217;s slightly worse.</p><p><strong>Dylan Patel</strong></p><p>So the Hopper went from $2 to $3. If a Hopper can make a million tokens of Opus and it can make two million tokens of Sonnet, the price differential between Opus and Sonnet has decreased because the price of the GPU has increased by a dollar from $2 to $3.</p><p>Interesting. I think that makes a ton of sense. We just see all of the volumes are on the best models today, all the revenue is on the best models today. In a compute-limited world, two things happen. One, companies that don&#8217;t have commitment issues and have these five-year contracts for compute have locked in a humongous margin advantage. 
They&#8217;ve locked in compute for five years at the price it transacted at two, three, or five years ago.</p><p>Whereas if you&#8217;re three years into that five-year contract and someone else&#8217;s two-year or three-year contract rolled off, and now they&#8217;re trying to buy that at modern pricing, when it&#8217;s priced to the value of models, the price is going to be up a lot more. So the person who committed early has better margins in general. The percentage of the market that is in long-term contracts is much larger than the percentage of the market in short-term contracts that can be this flex capacity you add at the last second.</p><p>At the same time, where does the margin go? Because models get more valuable, how much can the cloud players flex their pricing? If you look at CoreWeave, their average term duration is over three years right now. For ninety-eight percent plus of their compute, it&#8217;s over three years. They end up with this conundrum where they can&#8217;t actually flex price. But every year they&#8217;re adding incrementally way more capacity than they had previously.</p><p>This year alone, Meta&#8217;s adding as much capacity as they had in their entire fleet of compute and data centers for all purposes for serving WhatsApp, Instagram, and Facebook in 2022, and doing AI. They&#8217;re adding that alone this year.</p><p>In the same sense, you talk about Meta doing that, CoreWeave, Google, and Amazon, all these companies are adding insane amounts of compute year on year. That new compute gets transacted at the new price. In a sense, yes, you&#8217;ve locked in, as long as we&#8217;re in a takeoff. &#8220;Oh, OpenAI went from six hundred megawatts to two gigawatts last year, and from two gigawatts to six plus this year, and six to twelve next year.&#8221; The incremental added compute is where all the cost is, not the prior long-term contracts.</p><p>Then who holds the cards is the infra providers for charging margin. 
Now the cloud players, the neoclouds, or the hyperscalers can charge the margin. They can to some extent, but then as you go upstream to who has access to all the memory and logic capacity, it&#8217;s Nvidia for the most part. They&#8217;ve signed a lot of long-term contracts. They&#8217;ve got ninety billion dollars of long-term contracts today, and they&#8217;re negotiating three-year deals today with the memory vendors.</p><p>You&#8217;ve got Amazon and Google through <a href="https://en.wikipedia.org/wiki/Broadcom">Broadcom</a>, Amazon directly, and <a href="https://en.wikipedia.org/wiki/AMD">AMD</a>. These companies hold all the cards because they&#8217;ve secured the capacity. TSMC is not raising prices, but memory vendors are, to some extent, raising prices a lot. They&#8217;re going to double or triple prices again, but then they&#8217;re also signing these long-term deals.</p><p>Who is able to accrue all the margin dollars is potentially the cloud, potentially the chip vendors, and the memory vendors, until TSMC or ASML break out and say, &#8220;No, we&#8217;re going to charge a lot more.&#8221; But at the same time, do the model vendors get to charge crazy margins? At least this year, we&#8217;re going to see margins for the model vendors go up a lot. Because they&#8217;re so capacity constrained, they have to destroy demand. There&#8217;s no way Anthropic can continue at the current pace without destroying demand.</p><h3>00:24:52 &#8211; Nvidia secured TSMC allocation early; Google is getting squeezed</h3><p><strong>Dwarkesh Patel</strong></p><p>Let&#8217;s get into logic and memory. How specifically has Nvidia been able to lock up so much of both? I think according to your numbers, by &#8216;27, Nvidia is going to have 70%+ of <a href="https://en.wikipedia.org/wiki/3_nm_process">N3</a> wafer capacity, or around that area. 
I forget what the numbers were for memory at <a href="https://en.wikipedia.org/wiki/SK_Hynix">SK Hynix</a> and <a href="https://en.wikipedia.org/wiki/Samsung_Electronics">Samsung</a> and so forth.</p><p>Think about how the neocloud business works and how Nvidia works with that, or how the RL environment business works and how Anthropic works with that. In both those cases, Nvidia is purposely trying to fracture the complementary industry to make sure that they have as much leverage as possible. They&#8217;re giving allocation to random neoclouds to make sure that there&#8217;s not one person that has all the compute.</p><p>Similarly, Anthropic or OpenAI, when they&#8217;re working with the data providers, they say, &#8220;No, we&#8217;re going to just seed a huge industry of these things so that we&#8217;re not locked into any one supplier for data environments.&#8221;</p><p>And I wonder why on the 3 nm process&#8212;that&#8217;s going to be <a href="https://newsletter.semianalysis.com/p/aws-trainium3-deep-dive-a-potential">Trainium 3</a>, that&#8217;s going to be <a href="https://docs.cloud.google.com/tpu/docs/tpu7x">TPU v7</a>, other accelerators potentially&#8212;why is TSMC just giving it all up to Nvidia rather than trying to fracture the market?</p><p><strong>Dylan Patel</strong></p><p>There are a couple points here. On 3 nm, if we go back to last year, the vast majority of 3 nm was Apple. Apple is being moved to 2 nm. Memory prices are going up, so Apple&#8217;s volumes may go down. As memory prices go up, either they cut margin or they move on. There&#8217;s some time lag because they have long-term contracts, but Apple likely reduces demand or moves to 2 nm faster, where 2 nm is only capable of mobile chips today. In the future, AI chips will move there. So Apple has that.</p><p>Apple is also talking to third-party vendors because they&#8217;re getting squeezed out of TSMC a little bit. 
TSMC&#8217;s margins on high-performance computing&#8212;<a href="https://www.tsmc.com/english/dedicatedFoundry/technology/platform_HPC">HPC</a>, AI chips, et cetera&#8212;are higher than they are for mobile, because they have a bigger advantage in HPC than they do in mobile.</p><p>When you look at TSMC&#8217;s running calculus here, they&#8217;re actually providing really good allocations to companies that are doing CPUs. When you think about Amazon having Trainium and <a href="https://aws.amazon.com/ec2/graviton/">Graviton</a>, both of those are on 3 nm, Graviton being their CPU, Trainium being their AI chip. TSMC is much more excited to give allocation to Graviton than they are to Trainium because they view the CPU business as more stable, long-term growth.</p><p>As a company that is conservative and doesn&#8217;t want to ride cycles of growth too hard, you actually want to allocate to the market that is more stable with a lower growth rate first before you allocate all the incremental capacity to the fast growth rate market. That is the case generally. Same for AMD. The allocations they get on their CPUs, TSMC is much more excited about those than they are for GPUs. Likewise for Amazon.</p><p>Nvidia is a bit unique because yes, they have CPUs, they make switches, they make networking, NVLink, InfiniBand, Ethernet, NICs. By and large, most of these things will be on 3 nm by the end of this year with the Rubin launch and all the chips in that family, the GPU being the most important one. Yet Nvidia is getting the majority of supply.</p><p>Part of this is because you look at the market and TSMC and others forecast market demand in many ways, but it&#8217;s also the market signal. The market signaled, &#8220;Hey, we need this much capacity next year. We need this much. We&#8217;ll sign non-cancelable, non-returnable. We may even pay deposits.&#8221; Nvidia just did it way earlier than Google or Amazon. In some cases, Google and Amazon had stumbling blocks. 
One of the chips got delayed slightly by a couple quarters. Trainium and all these sorts of things happened.</p><p>In that case, there was a huge sort of, &#8220;Well, these guys are delaying, but Nvidia is wanting more, more, more, more. And we are checking with the rest of the supply chain, is there enough capacity?&#8221; They&#8217;re going to all the <a href="https://en.wikipedia.org/wiki/Printed_circuit_board">PCB</a> vendors and saying, &#8220;Is there enough PCB?&#8221; <a href="https://en.shpcb.com/">Victory Giant</a> is one of the largest suppliers of PCBs to Nvidia, and they&#8217;re a Chinese company. All the PCBs come from China, or many of them. They&#8217;re like, &#8220;Do you have enough PCB capacity? Great. Hey memory vendors, who has all the memory capacity? Okay, Nvidia does. Great.&#8221;</p><p>When you look at who is AGI-pilled enough to buy compute on long timelines at levels that seem ridiculous to people who aren&#8217;t AGI-pilled&#8212;but nonetheless, they&#8217;re willing to pay a pretty good margin and sign it now because they view in the future that ratio is screwed up&#8212;the same thing happens with the supply chain for semiconductors. I don&#8217;t think Nvidia is quite AGI-pilled. Jensen doesn&#8217;t believe software is going to be fully automated and all these things.</p><p><strong>Dwarkesh Patel</strong></p><p>Accelerated computing, not AI chips, right?</p><p><strong>Dylan Patel</strong></p><p>It&#8217;s AI chips.</p><p><strong>Dwarkesh Patel</strong></p><p>But that&#8217;s what he calls it, right?</p><p><strong>Dylan Patel</strong></p><p>Yeah. I think it&#8217;s a broader term, AI is within that, but also physics modeling and simulations.</p><p><strong>Dwarkesh Patel</strong></p><p>But it&#8217;s like he&#8217;s not embracing the main use case.</p><p><strong>Dylan Patel</strong></p><p>I think he&#8217;s embracing it, but I just don&#8217;t think he&#8217;s AGI-pilled like Dario or Sam. 
But he&#8217;s still way, way more AGI-pilled than Google was in Q3 of last year, or Amazon was in Q3 of last year, and he saw way more demand.</p><p>The reason is pretty simple. You can see all the data center construction. He&#8217;s like, &#8220;Okay, I want to have this market share.&#8221; We have all the data centers tracked, and there&#8217;s a lot of data centers that could be one or the other. To some extent, Google and Amazon, Google especially, even though their TPU is just better for them to deploy, they have to deploy a crap load of GPUs because they don&#8217;t have enough TPUs to fill up their data centers. They can&#8217;t get them fabbed.</p><p><strong>Dwarkesh Patel</strong></p><p>I have a question about that. Google sold a million, was it the v7s?</p><p><strong>Dylan Patel</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>&#8212;the <a href="https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/ironwood-tpu-age-of-inference/">Ironwoods</a> to Anthropic, and you&#8217;re saying the big bottleneck right now, this year or next year, I guess going forward forever now, is going to be the logic and memory, the stuff it takes to build these chips. Google has DeepMind, the third prominent AI lab. If this is the big bottleneck, why would they sell it rather than just giving it to DeepMind?</p><p><strong>Dylan Patel</strong></p><p>This is again a problem of&#8230; DeepMind people were like, &#8220;This is insane. Why did we do this?&#8221; But Google Cloud people and Google executives saw a different thought process.</p><p>You and I know the compute team at Anthropic. Both of the main people came from Google. They saw this dislocation, they negotiated a deal, and they were able to get access to this compute before Google realized. The chain of events, at least from our data that we found, was in early Q3, over the course of six weeks, we saw capacity on TPUs go up by a significant amount. 
It went up multiple times in those six weeks.</p><p>There were multiple requests. Google even had to go to TSMC and explain to them why they needed this increase in capacity because it was so sudden. A lot of that capacity increase was for selling to Anthropic. Because Anthropic saw it before Google.</p><p>And then Google had <a href="https://gemini.google/overview/image-generation/">Nano Banana</a> and <a href="https://blog.google/products-and-platforms/products/gemini/gemini-3/">Gemini 3</a> which caused their user metrics to skyrocket. Then leadership at Google was like, &#8220;Oh.&#8221; Then they started making <a href="https://www.cnbc.com/2025/11/21/google-must-double-ai-serving-capacity-every-6-months-to-meet-demand.html">the statement that we have to double compute every six months</a>, or whatever the exact number was.</p><p>They really woke up a lot more, and then they went to TSMC and said, &#8220;We want more. We want more.&#8221; TSMC replied, &#8220;Sorry guys, we&#8217;re sold out. We can maybe get 5-10% more for 2026, but really we&#8217;re going to work on 2027.&#8221;</p><p>There was this information asymmetry among the labs, in my mind. I don&#8217;t know exactly. It&#8217;s the narrative I&#8217;ve spun myself from seeing all the data in the supply chain on wafer orders and what&#8217;s going on with the data centers that Anthropic and <a href="https://www.fluidstack.io/">Fluidstack</a> signed.</p><p>It&#8217;s pretty clear to me that Google screwed up. You can see this from Google&#8217;s Gemini ARR. They had next to nothing in Q1 to Q3&#8212;in Q3 a little bit once they started inflecting. But in Q4 they reached $5 billion in revenue on an ARR basis. It&#8217;s clear Google didn&#8217;t see revenue skyrocket initially. In a sense, Anthropic had a little bit of commitment issues before their ARR exploded, even though they had far more information asymmetry and saw what was coming down the pipe. 
Google is going to be more conservative than Anthropic, and Google had even less ARR. So they were just not willing to do it, and then they realized they should do it.</p><p>Since then, Google has gotten absurdly AGI-pilled in terms of what they&#8217;re doing. They bought an energy company. They&#8217;re putting deposits down for turbines. They&#8217;re buying a ridiculous percentage of powered land. They&#8217;re going to utilities and negotiating long-term agreements. They&#8217;re doing this on the data center and power side very aggressively. I think Google woke up towards the end of last year, but it took them some time.</p><p><strong>Dwarkesh Patel</strong></p><p>How many gigawatts do you think Google will have by the end of next year?</p><p><strong>Dylan Patel</strong></p><p>Buy my data.</p><p><strong>Dwarkesh Patel</strong></p><p>You charge for that kind of information.</p><p><strong>Dylan Patel</strong></p><p>Yes, yes.</p><h3>00:34:34 &#8211; ASML will be the #1 constraint for AI compute scaling by 2030</h3><p><strong>Dwarkesh Patel</strong></p><p>I feel like every year the bottleneck for what is preventing us from scaling AI compute keeps changing. A couple years ago it was <a href="https://3dfabric.tsmc.com/english/dedicatedFoundry/technology/cowos.htm">CoWoS</a>. Last year it was power. You&#8217;ll tell me what the bottleneck is this year.</p><p>But I want to understand five years out, what will be the thing that is constraining us from deploying the singularity?</p><p><strong>Dylan Patel</strong></p><p>The biggest bottleneck is compute. For that, the longest lead time supply chains are not power or data centers. They&#8217;re actually the semiconductor supply chains themselves. It switches back from power and data centers as a major bottleneck to chips.</p><p>In the chip supply chain, there&#8217;s a number of different bottlenecks. 
There&#8217;s <a href="https://en.wikipedia.org/wiki/Semiconductor_memory">memory</a>, <a href="https://www.asml.com/en/technology/all-about-microchips/microchip-basics">logic wafers</a> from <a href="https://en.wikipedia.org/wiki/TSMC">TSMC</a>, and the <a href="https://en.wikipedia.org/wiki/Semiconductor_fabrication_plant">fabs</a> themselves. Construction of the fabs takes two to three years, versus a data center which takes less than a year. We&#8217;ve seen Amazon build data centers in as fast as eight months. There&#8217;s a big difference in lead times because of the complexity of building the fab that actually makes the chips. The tools also have really long lead times.</p><p>The bottlenecks, as we&#8217;ve scaled, have shifted based on what the supply chain is currently not able to do. It was CoWoS, power, and data centers, but those were all shorter lead time items. CoWoS is a much simpler process of <a href="https://anysilicon.com/the-ultimate-guide-to-semiconductor-packaging/">packaging chips</a> together. Power and data centers are ultimately way simpler than the actual manufacturing of the chips. There&#8217;s been some sliding of capacity across mobile or PC to data center chips, which has been somewhat fungible.</p><p>Whereas CoWoS, power, and data centers have had to start anew as supply chains. But now there&#8217;s no more capacity for the mobile and PC industries&#8212;which used to be the majority of the semiconductor industry&#8212;to shift over to AI. <a href="https://en.wikipedia.org/wiki/Nvidia">Nvidia</a> is now the largest customer at TSMC and <a href="https://en.wikipedia.org/wiki/SK_Hynix">SK Hynix</a>, the largest memory manufacturer. It&#8217;s sort of impossible for the sliding of resources away from the common person&#8217;s PCs and smartphones to shift any more towards the AI chips. So now the question is how do we scale AI chip production? 
That&#8217;s the biggest bottleneck as we go to 2030.</p><p><strong>Dwarkesh Patel</strong></p><p>It would be very interesting if there&#8217;s an absolute gigawatt ceiling that you can project out to 2030 based just on &#8220;We can&#8217;t produce more than this many <a href="https://en.wikipedia.org/wiki/Extreme_ultraviolet_lithography">EUV</a> machines.&#8221;</p><p><strong>Dylan Patel</strong></p><p>To scale compute further, there are different bottlenecks this year and next year, but ultimately by 2028 or 2029, the bottleneck falls to the lowest rung on the supply chain, which is <a href="https://en.wikipedia.org/wiki/ASML_Holding">ASML</a>. ASML makes the world&#8217;s most complicated machine: an <a href="https://www.asml.com/en/products/euv-lithography-systems">EUV tool</a>. The selling price for those is $300-400 million. Currently, they can make about 70. Next year, they&#8217;ll get to 80. Even under very aggressive supply chain expansion, they only get to a little bit over 100 by the end of the decade.</p><p>What does that mean? They can make a hundred of these tools by the end of the decade, and 70 right now. How does that actually translate to AI compute? We see all these numbers from <a href="https://en.wikipedia.org/wiki/Sam_Altman">Sam Altman</a> and many others across the supply chain: gigawatts, gigawatts, gigawatts. How many gigawatts are we adding? We see <a href="https://www.dwarkesh.com/p/elon-musk">Elon saying a hundred gigawatts in space</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>A year.</p><p><strong>Dylan Patel</strong></p><p>A year. The problem with any of these numbers, or the challenge to these numbers, is actually not the power or the data center. We can dive into that, but it&#8217;s manufacturing the chips.</p><p>Take a gigawatt of Nvidia&#8217;s <a href="https://www.nvidia.com/en-us/data-center/technologies/rubin/">Rubin</a> chips. 
Rubin is announced at <a href="https://en.wikipedia.org/wiki/Nvidia_GTC">GTC</a>, I believe the week this podcast goes live. To make a gigawatt worth of data center capacity of Nvidia&#8217;s latest chip that they&#8217;re releasing towards the end of this year, you need a few different wafer technologies. You need about 55,000 wafers of <a href="https://en.wikipedia.org/wiki/3_nm_process">3 nm</a>. You need about 6,000 wafers of <a href="https://en.wikipedia.org/wiki/5_nm_process">5 nm</a>, and then you need about 170,000 wafers of <a href="https://en.wikipedia.org/wiki/Dynamic_random-access_memory">DRAM</a> memory.</p><p>Across these three different buckets, each requires different amounts of EUV. When you manufacture a wafer, there are thousands and thousands of process steps where you&#8217;re depositing material and removing them. But the key critical step&#8212;which at least in <a href="https://www.appliedmaterials.com/us/en/semiconductor/markets-and-inflections/advanced-logic.html">advanced logic</a> is 30% of the cost of the chip&#8212;is something that doesn&#8217;t actually put anything on the wafer. You take the wafer, you deposit <a href="https://en.wikipedia.org/wiki/Photoresist">photoresist</a>, which is a chemical that chemically changes when you expose it to light. Then you stick it into the EUV tool, which shines light at it in a certain way. It patterns it. There&#8217;s what&#8217;s called a <a href="https://agcem.com/products/euv-mask-blanks/">mask</a>, which is effectively a stencil for the design.</p><p>When you look at a leading-edge 3 nm wafer, it has 70 or so masks, 70 or so layers of lithography, but 20 of them are the most advanced EUV. If you need 55,000 wafers for a gigawatt, and you do 20 EUV passes per wafer, you can do the math. That&#8217;s 1.1 million passes of EUV for a single gigawatt. It&#8217;s pretty simple. Once you add the rest of the stuff, it ends up being 2 million, across 5 nm and all the memory. 
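</p><p>As a rough sketch of the arithmetic so far, here is a short Python version that uses only the figures quoted above: 55,000 wafers of 3 nm at about 20 EUV layers each, 6,000 wafers of 5 nm, 170,000 wafers of DRAM, and a total of roughly 2 million passes. The variable names are illustrative, and the average pass count for the 5 nm and DRAM wafers is only what those rounded numbers imply, not a figure stated in the conversation.</p><pre><code class="language-python"># Figures quoted in the conversation for one gigawatt of Rubin capacity.
WAFERS_3NM = 55_000
WAFERS_5NM = 6_000
WAFERS_DRAM = 170_000
EUV_LAYERS_3NM = 20          # "20 of them are the most advanced EUV"
TOTAL_PASSES = 2_000_000     # rounded total across all three buckets


def euv_pass_breakdown():
    """Split the quoted EUV pass total into 3 nm and everything else."""
    passes_3nm = WAFERS_3NM * EUV_LAYERS_3NM
    passes_other = TOTAL_PASSES - passes_3nm
    # Implied average, not a quoted number: the rounded remainder
    # spread evenly over the 5 nm and DRAM wafers.
    avg_other = passes_other / (WAFERS_5NM + WAFERS_DRAM)
    return passes_3nm, passes_other, avg_other


passes_3nm, passes_other, avg_other = euv_pass_breakdown()
print(f"3 nm passes: {passes_3nm:,}")
print(f"5 nm + DRAM passes (implied): {passes_other:,}")
print(f"Implied EUV passes per 5 nm/DRAM wafer: {avg_other:.1f}")
</code></pre><p>On those assumptions, the 3 nm logic wafers account for a bit more than half of the EUV passes per gigawatt, and the 5 nm and DRAM wafers would average roughly five EUV passes each.</p><p>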
You&#8217;re at roughly 2 million EUV passes for a single gigawatt.</p><p>These tools are very complicated. When you think about what it&#8217;s doing across a wafer, it&#8217;s taking the wafer and scanning and stepping across. It does this dozens of times across the whole wafer. When you&#8217;re talking about how many EUV passes, that&#8217;s the entire wafer being exposed at a certain rate.</p><p>An EUV tool can do roughly 75 wafers per hour, and the tool is up roughly 90% of the time. In the end, you need about three and a half EUV tools to do the 2 million EUV wafer passes for the gigawatt. So three and a half EUV tools satisfy a gigawatt.</p><p>It&#8217;s funny to think about the numbers. What does a gigawatt cost? It costs roughly $50 billion. Whereas what do three and a half EUV tools cost? That&#8217;s $1.2 billion. It&#8217;s actually a much lower number, which is interesting to think about. Fifty billion dollars of economic <a href="https://www.investopedia.com/terms/c/capitalexpenditure.asp">CapEx</a> in the data center, and what gets built on top of that in terms of tokens is even larger. It might be $100 billion worth of AI value into the supply chain, held up by this $1.2 billion worth of tooling that simply cannot expand its supply chain quickly.</p><p><strong>Dwarkesh Patel</strong></p><p>You wrote an article recently saying over the last three years, TSMC has done $100 billion of CapEx. So it&#8217;s $30/$30/$40 billion. A small fraction of that is being used by Nvidia for the 3 nm, or previously 4 nm, that it&#8217;s using for its chips. What were its earnings last quarter? It was $40 billion. So $40 billion times four is $160 billion. 
Nvidia alone is turning some small fraction of $100 billion in CapEx, which is going to be depreciated over many years and not just this one year, into $160 billion in a single year.</p><p>That gets even more intense when you go down the supply chain to ASML, which is taking a billion dollars&#8217; worth of machines to produce a gigawatt. Of course, those machines last for more than a year so it&#8217;s doing more than that.</p><p>Now I want to understand, how many such machines will there be by 2030, if you include not just the ones that are sold that year, but have been compiling over the previous years? What does that imply? Sam Altman says he wants to do a gigawatt a week in 2030. When you add up those numbers, is it compatible with that?</p><p><strong>Dylan Patel</strong></p><p>That&#8217;s completely compatible, if you think about it. TSMC and the entire ecosystem have something like 250 to 300 EUV tools already. Then you stack on 70 this year, 80 next year, growing to 100 by 2030. You&#8217;re at 700 EUV tools by the end of the decade. 700 EUV tools, at three and a half tools per gigawatt&#8212;assuming it&#8217;s all allocated to AI, which it&#8217;s not&#8212;gets you to 200 gigawatts worth of AI chips for the data centers to deploy.</p><p>Sam wants 52 gigawatts a year. He&#8217;s only taking 25% share then. Obviously, there&#8217;s some share given to mobile and PC, assuming we&#8217;re even allowed to have consumer goods still and we don&#8217;t get priced out of them. But roughly, he&#8217;s saying 25% market share of the total chips fabbed. That&#8217;s very reasonable given that this year alone, I think he&#8217;s going to have access to 25% of the <a href="https://en.wikipedia.org/wiki/Blackwell_(microarchitecture)">Blackwell GPUs</a> that are deployed. It&#8217;s not that crazy.</p><p><strong>Dwarkesh Patel</strong></p><p>When did ASML start shipping EUV tools, when <a href="https://en.wikipedia.org/wiki/7_nm_process">7 nm</a> started? 
I don&#8217;t know when that was exactly. You&#8217;re saying in 2030, they&#8217;re going to be using machines that initially were shipped in 2020. So for ten years, you&#8217;re using the same most important machine in this most technologically advanced industry in the world? I find that surprising.</p><p><strong>Dylan Patel</strong></p><p>ASML&#8217;s been shipping EUV tools now for roughly a decade, but it only entered mass volume production around 2020. The tool&#8217;s not the same. Back then, the tools were even lower throughput. There are various specifications around them called <a href="https://en.wikipedia.org/wiki/Overlay_control">overlay</a>. I was mentioning you&#8217;re stacking layers on top of each other. You&#8217;ll do some EUV, you&#8217;ll do a bunch of different process steps&#8212;depositing stuff, etching stuff, cleaning the wafer&#8212;dozens of those steps before you do another EUV layer.</p><p>There&#8217;s a spec called overlay, which is: you did all this work, you drew these lines on the wafer, now I want to draw these dots. Let&#8217;s say I want to draw these dots to connect these lines of metal to holes, and then the next layer up is another set of lines going perpendicular, so now you&#8217;re connecting wires going perpendicular to each other. You have to be able to land them on top of each other. It&#8217;s called overlay.</p><p>Overlay is a spec that&#8217;s been improved rapidly by ASML. Wafer throughput has been improved rapidly by ASML. The price of the tool has gone up, but not as much as the capabilities of the tool. Initially, the EUV tools were $150 million. Over time, they&#8217;re now $400 million as I look out to 2028. But the capabilities of the tools have more than doubled as well, especially on throughput and overlay accuracy, which is the ability to accurately align the subsequent passes on top of each other even though you do tons of steps between.</p><p>ASML is improving super rapidly. 
It&#8217;s also noteworthy to say that ASML is maybe one of the most generous companies in the world. They have this linchpin thing. No one has anything competitive. <a href="https://www.reuters.com/world/china/how-china-built-its-manhattan-project-rival-west-ai-chips-2025-12-17/">Maybe China will have some EUV by the end of the decade</a>, but no one else has anything even close to EUV, and yet they haven&#8217;t taken price and margins up like crazy. You go ask some other folks that we talk to all the time, like <a href="https://www.dwarkesh.com/p/leopold-aschenbrenner">Leopold</a>, and they&#8217;re like, &#8220;Let&#8217;s have the price go up.&#8221; Because they can. The margin is there. You can take the margin. Nvidia takes the margin. Memory players are taking the margin. But ASML has never raised the price more than they&#8217;ve increased the capability of the tool.</p><p>In a sense, they&#8217;ve always provided net benefit to their customers. It&#8217;s not that the tool is stagnant, it&#8217;s just that these tools are old. Yes, you can upgrade them some, and the new tools are coming. For simplicity&#8217;s sake, we&#8217;re ignoring the advances in overlay or throughput per tool for this podcast.</p><p><strong>Dwarkesh Patel</strong></p><p>You say we&#8217;re producing 60 of these machines this year and then 70, 80 over subsequent years. What would happen if ASML just decided to double its CapEx or triple its CapEx? What is preventing them from producing more than 100 in 2030? Why are you so confident that even five years out, you can be relatively sure what their production will be?</p><p><strong>Dylan Patel</strong></p><p>I think there are a couple factors here. ASML has not decided to just go YOLO, let&#8217;s expand capacity as fast as possible. In general, the semiconductor supply chain has not. It&#8217;s lived through the booms and busts, and we can talk a bit more about it. 
Basically some players have recently woken up, but in general no one really sees demand for 200 gigawatts a year of AI chips, or trillions of dollars of spend a year in the semiconductor supply chain. They&#8217;re not AI-pilled. They&#8217;re not AGI-pilled.</p><p><strong>Dwarkesh Patel</strong></p><p>We&#8217;re going to get to a trillion dollars this year.</p><p><strong>Dylan Patel</strong></p><p>Yeah, I feel you, but I&#8217;m saying no one really understands this in the supply chain. Constantly, we&#8217;re told our numbers are way too high, and then when they&#8217;re right, they&#8217;re like, &#8220;Oh, yeah, but your next year&#8217;s numbers are still too high.&#8221;</p><p>ASML&#8217;s tool has four major components. It has the source, which is made by <a href="https://www.asml.com/en/company/about-asml/cymer">Cymer</a> in San Diego. It has the <a href="https://www.linkedin.com/posts/asml_asmls-reticle-stage-activity-7315418259040694274-DNW7/">reticle stage</a>, which is made in Wilton, Connecticut. It has the wafer stage. It has the optics, the lenses and such. Those last two are made in Europe.</p><p>When you look at each of these four, they&#8217;re tremendously complex supply chains that, (A) they have not tried to expand massively, and (B) when they try to expand them, the time lag is quite long. Again, this is the most complicated machine that humans make, period, at any sort of volume.</p><p>Let&#8217;s talk about the source specifically. What does the source do? It drops these tin droplets. It hits each droplet three successive times with a laser, perfectly. The first one hits this tin droplet, it expands out. It hits it again, so it expands out to this perfect shape, and then it blasts it at super high power. 
The tin droplets get excited enough that they release EUV light, 13.5 nanometer, and then it&#8217;s in this thing that is collecting all the light and directing it into the lens stack.</p><p>Then you have the lens stack, which is Carl Zeiss, as you mentioned, and some other folks, but Zeiss being the most important part of it. They also have not tried to expand production capacity because they don&#8217;t see... They&#8217;re like, &#8220;We&#8217;re growing a lot because of AI. We&#8217;re growing from 60 to 100.&#8221; It&#8217;s like, &#8220;No, no, no. We need to go to a couple hundred, but it&#8217;s fine. Whatever.&#8221;</p><p>Each of these tools has, I think, 18 of these lenses, effectively. They are multilayer mirrors, which are perfect layers of molybdenum and ruthenium, if I recall correctly, stacked on top of each other in many layers, and then the light bounces off of it perfectly. When we think about a lens, it&#8217;s in a shape, and it focuses the light. This is like a mirror that&#8217;s also a lens, so it&#8217;s pretty complicated. Any defect in these super thinly deposited stacks will mess it up. Any curvature issues will mess it up.</p><p>There are a lot of challenges with scaling the production. It&#8217;s quite artisanal in this sense because you&#8217;re not making tens of thousands of these a year, you&#8217;re making hundreds, you&#8217;re making thousands. 60 tools a year, 18 of these per tool, you&#8217;re still in the hundreds, or roughly at the thousand number, for these lenses and projection optics.</p><p>Then you step forward to the reticle stage, which is also something really crazy. This thing moves at, I want to say, nine Gs. It will shift nine Gs because as you step across a wafer, the tool will go... The wafer stage is complementary. It&#8217;s the wafer part. You line these two things up. 
You&#8217;re taking all the light through the lenses that&#8217;s focused, and here&#8217;s the reticle, here&#8217;s the wafer. The reticle&#8217;s moving one direction, the wafer&#8217;s moving the other direction as it scans a 26x33 millimeter section of the wafer, and then it stops. It shifts over to another part of the wafer and does it again. It does that in just seconds. Each of them is moving at nine Gs in opposite directions.</p><p>Each of these things is a wonder and marvel of chemistry, fabrication, mechanical engineering, and optical engineering, because you have to align all these things and make sure they&#8217;re perfect. All of these things have crazy amounts of metrology because you have to perfectly test everything. If anything is messed up, the yield goes to zero, because this is such a finely tuned system.</p><p>By the way, it&#8217;s so large that you&#8217;re building it in the <a href="https://www.asml.com/en/company/about-asml/locations/veldhoven">factory in Veldhoven, Netherlands</a>, and they&#8217;re deconstructing it and shipping it on many planes to the customer site, and then you&#8217;re reassembling it there and testing it again. That process takes many, many months.</p><p>There are so many steps in the supply chain, whether it&#8217;s Zeiss making their lenses and projection optics or Cymer, which is an ASML-owned company, making the EUV source. Each of these has its own complex supply chain. ASML has commented that their supply chain has over ten thousand people in it.</p><p><strong>Dwarkesh Patel</strong></p><p>Like individual suppliers?</p><p><strong>Dylan Patel</strong></p><p>Yes. It might not be directly. 
It might be through Zeiss having so many suppliers and XYZ company having so many suppliers.</p><p>If you just think about it, you&#8217;re talking about two physically moving objects that are the size of a wafer, and it has to be accurate to the level of single-digit nanometers or even smaller because the entire system, the overlay, the layer-to-layer overlay variation, has to be on the order of 3 nanometers. If the overlay is 3 nms, that means each individual part, the accuracy of its physical movement has to be even less than that. It has to be sub-one nanometer in most cases, because the error of these things stack up. There&#8217;s no way to just snap your fingers and increase production.</p><p>Things as simple as power. The US going from zero percent power growth to two percent power growth, even though China&#8217;s already at thirty, was so hard for America to do. And that&#8217;s a really simple supply chain with very few people in it who make difficult things. There are probably 100,000 electricians and people who work in the electricity supply chain, or more, in the US?</p><p>When you look at ASML, they employ so few people. Carl Zeiss probably employs less than a thousand people working on this, and all of those people are super, super specialized. You can&#8217;t just train random people up for this in the snap of a finger. You can&#8217;t just get your entire supply chain to get galvanized.</p><p>Nvidia&#8217;s had to do a lot to get the entire supply chain to even deliver the capacity they&#8217;re going to make this year. 
When you go talk to <a href="https://en.wikipedia.org/wiki/Anthropic">Anthropic</a>, they&#8217;re like, &#8220;We&#8217;re short of <a href="https://en.wikipedia.org/wiki/Tensor_Processing_Unit">TPUs</a>, we&#8217;re short of Trainium, and we&#8217;re short of <a href="https://en.wikipedia.org/wiki/Graphics_processing_unit">GPUs</a>.&#8221; When you go talk to <a href="https://en.wikipedia.org/wiki/OpenAI">OpenAI</a>, they&#8217;re like, &#8220;We&#8217;re short of these things.&#8221;</p><p>OpenAI and Anthropic know they need X. Nvidia is not quite as AGI-pilled. They&#8217;re building X - 1. You go down the supply chain, everyone&#8217;s doing X - 1. In some cases, they&#8217;re doing X &#247; 2, because they&#8217;re not AGI-pilled.</p><p>You end up with this time lag for the whip to react. The AI-pilledness and the desire to increase production take so long. Once they finally understand that they need to increase production rapidly&#8230; They think they understand. They think AI means we have to go from 60 to 100, in addition to the tools getting better and faster, the source getting higher power from 500 watts to 1,000, and all these other aspects of the supply chain advancing technically and increasing production. They think they&#8217;re actually increasing production a lot.</p><p>But if you flow through the numbers&#8230; What does Elon want? He wants 100 gigawatts a year in space by 2028 or 2029. Sam Altman wants 52 gigawatts a year by the end of the decade. Anthropic probably needs the same, and Google needs that. 
You go across the supply chain, and it&#8217;s like, wait, no, the supply chain can&#8217;t possibly build enough capacity for everyone to get what they want on the side of compute.</p><h3>00:55:47 &#8211; Can&#8217;t we just use TSMC&#8217;s older fabs?</h3><p><strong>Dwarkesh Patel</strong></p><p>I feel like in the data center supply chain for the last few years, people have been making arguments like, &#8220;We are bottlenecked by this specific thing, therefore AI compute can&#8217;t scale more than X.&#8221; But as you&#8217;ve written about, if the grid is a bottleneck, then we just do <a href="https://www.enelnorthamerica.com/insights/blogs/what-does-btm-behind-the-meter-mean">behind the meter</a> on the site, we do gas turbines, et cetera. If that doesn&#8217;t work, there are all these other alternatives that people fall back on.</p><p>I want to ask whether we can imagine a similar thing happening in the semiconductor supply chain. If EUV becomes a bottleneck, what if we just went back to 7 nm and did what China is doing currently, producing 7 nm chips with <a href="https://en.wikipedia.org/wiki/Multiple_patterning">multi-patterning</a> with <a href="https://en.wikipedia.org/wiki/Ultraviolet">DUV</a> machines? If you look at a 7 nm chip like the <a href="https://www.nvidia.com/en-us/data-center/a100/">A100</a>, there&#8217;s been a lot of progress obviously from the A100 to the <a href="https://www.exxactcorp.com/blog/hpc/comparing-nvidia-tensor-core-gpus">B100</a> or <a href="https://www.nvidia.com/en-us/data-center/dgx-b200/">B200</a>.</p><p>How much of that progress is just numerics? If you just hold <a href="https://en.wikipedia.org/wiki/Half-precision_floating-point_format">FP16</a> constant from A100 to B100. The B100 is a little over one petaflop, and the A100 is like 300 teraflops.</p><p><strong>Dylan Patel</strong></p><p>Yeah, 312.</p><p><strong>Dwarkesh Patel</strong></p><p>Holding numerics constant, you have a 3x improvement from A100 to B100. 
Some of that is the process improvement, some of that is just the accelerator design improving, which we could replicate again in the future.</p><p>It seems there&#8217;s actually a very small effect from the process improving from 7 nm to 4 nm. I don&#8217;t know the numbers offhand but let&#8217;s say there&#8217;s 150k wafers per month of 3 nm and eventually similar amounts for <a href="https://en.wikipedia.org/wiki/2_nm_process">2 nm</a>. But then there&#8217;s a similar amount for 7 nm.</p><p>If you have all those old wafers and there&#8217;s maybe a 50% haircut because the bits per wafer area are 50% less or something, it doesn&#8217;t seem that bad to just bring on 7 nm wafers if that gives you another fifty or hundred gigawatts. Tell me why that&#8217;s naive.</p><p><strong>Dylan Patel</strong></p><p>We potentially do go crazy enough that this happens because we just need incremental compute, and the compute is worth the higher cost and power of these chips. But it&#8217;s also unlikely to a large extent because some of these are not fair comparisons.</p><p>For example, from A100, which is 312 teraflops, to Blackwell, which is 1,000 or 2,000 teraflops of FP16, and then Rubin, which is 5,000 or so teraflops of FP16&#8230; It&#8217;s not a fair comparison because these chips have vastly different design targets. With A100, Nvidia optimized for FP16 and <a href="https://en.wikipedia.org/wiki/Bfloat16_floating-point_format">BF16</a> numerics. When you look at <a href="https://en.wikipedia.org/wiki/Hopper_(microarchitecture)">Hopper</a>, they didn&#8217;t care as much about that; they cared about <a href="https://en.wikipedia.org/wiki/Minifloat">FP8</a>. When you look at Rubin, they don&#8217;t care about FP16 and BF16 so much, they care mostly about <a href="https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/">FP4</a> and FP6. 
Numerics are what they&#8217;ve designed their chip for.</p><p>Let&#8217;s say we make a new chip design on 7 nm, optimized for the numerics of the modern day. The performance difference is still going to be much larger than the FLOPS difference you mentioned. Often it&#8217;s easy to boil things down to FLOPS per watt or FLOPS per dollar, but that&#8217;s not a fair comparison.</p><p>Let&#8217;s look at <a href="https://www.kimi.com/ai-models/kimi-k2-5">Kimi K2.5</a> and <a href="https://en.wikipedia.org/wiki/DeepSeek">DeepSeek</a>. When you look at those two models and their performance on Hopper versus Blackwell on very optimized software, you get vastly different performance. Most of this is not attributable to FLOPS or numerics, because those models are actually eight-bit. Blackwell and Hopper are both optimized for eight-bit, and Blackwell is not really taking advantage of its four-bit there. The performance gulf is actually much larger.</p><p>Sure, it&#8217;s one thing to shrink process technology and make the transistor smaller so each chip has X number of FLOPS, but you forget the big gating factor. These models don&#8217;t run on a single chip. They run on hundreds of chips at a time. If you look at DeepSeek&#8217;s production deployment, which is well over a year old now, they were running on 160 GPUs. That&#8217;s what they serve production traffic on. They split the model across 160 GPUs.</p><p>Every time you cross the barrier from one chip to another, there is an efficiency loss. You have to transmit over high-speed electrical <a href="https://en.wikipedia.org/wiki/SerDes">SerDes</a>, which brings a latency cost and a power cost. There are all these dynamics that hurt. As you shrink and shrink the <a href="https://en.wikipedia.org/wiki/MOSFET#Scaling">process node</a>, you&#8217;ve increased the amount of compute in a single chip. 
Now in-chip movement of data is at least tens of terabytes a second, if not hundreds of terabytes a second. Whereas between chips, you&#8217;re on the order of a terabyte a second.</p><p>Then you have this movement of data between chips that are super close to each other physically. You can only put so many chips close to each other physically, so you have to put chips in different racks. The movement of data between racks is on the order of hundreds of gigabits a second, 400 gig or 800 gig a second, so roughly 100 gigabytes a second.</p><p>So you have this huge ladder: on-chip communication is super fast, within the rack is an order of magnitude slower, and outside the rack is an order of magnitude lower than that. As you break the bounds of chips, you end up with a performance loss.</p><p>The reason I explain this is because when you look at Hopper versus Blackwell, even if both are using a rack&#8217;s worth of chips, Hopper is significantly slower. The amount of performance you have leveraged to the task within each domain&#8212;tens of terabytes a second of communication between these processing elements versus terabytes a second between these processing elements&#8212;is much, much higher and therefore the performance is much higher. When you look at inference at 100 tokens a second for DeepSeek and Kimi K2.5, the performance difference between Hopper and Blackwell is on the order of 20x.</p><p>It&#8217;s not 2x or 3x like the FLOPS performance difference indicates, even though those are on the same process node. There are just differences in networking technologies and what they&#8217;ve worked on. You can translate some of these back, but when you look at what they&#8217;re doing on 3 nm with Rubin, some of those things are simply not possible to do all the way back on A100, even if you make a new chip for 7 nm.</p><p>There are certain architectural improvements you can port and certain ones you cannot. 
The performance difference is not just going to be the difference in FLOPS. It&#8217;s in some senses cumulative between the difference in FLOPS per chip, networking speed between chips, how many FLOPS are on a chip versus a system, and memory bandwidth on a single chip versus an entire system. All of these things compound.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I ask you a very naive question? The B200 now has two <a href="https://en.wikipedia.org/wiki/Die_(integrated_circuit)">dies</a> on a single chip, so you can get that bandwidth without having to go through <a href="https://en.wikipedia.org/wiki/NVLink">NVLink</a> or <a href="https://en.wikipedia.org/wiki/InfiniBand">InfiniBand</a>. Next year, <a href="https://arstechnica.com/ai/2025/03/nvidia-announces-rubin-ultra-and-feynman-ai-chips-for-2027-and-2028/">Rubin Ultra</a> will have four dies on one chip. What is preventing us from just doing that with an older&#8230; How many dies could you have on a single chip and still get these tens of terabytes a second?</p><p><strong>Dylan Patel</strong></p><p>Even within Blackwell, there are differences in performance when you&#8217;re communicating on the chip versus across the chips. Those bounds are obviously much smaller than when you&#8217;re going out of the entire chip. When you scale the number of chips up, there is some performance loss. It&#8217;s not perfect, but it is way better than different entire packages.</p><p>How large can <a href="https://en.wikipedia.org/wiki/Advanced_packaging_(semiconductors)">advanced packaging</a> scale? The way Nvidia is doing it is CoWoS. Google, Broadcom, MediaTek, and Amazon&#8217;s <a href="https://aws.amazon.com/ai/machine-learning/trainium/">Trainium</a> are all doing CoWoS. But actually you can go look back at what Tesla did with <a href="https://en.wikipedia.org/wiki/Tesla_Dojo">Dojo</a>, which they cancelled and restarted. Dojo was a chip that was the size of an entire wafer. They had 25 chips on it. 
There were some tradeoffs. They couldn&#8217;t put <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory">HBM</a> on it. But the positive side was that they had 25 chips on it. To date, it is still probably the best chip for running <a href="https://en.wikipedia.org/wiki/Convolutional_neural_network">convolutional neural networks</a>. It&#8217;s just not great at <a href="https://en.wikipedia.org/wiki/Transformer_(deep_learning)">transformers</a> because the shape of the chip, the memory, the arithmetic, and all these various specifications are just not well-suited for transformers. They&#8217;re well-suited for CNNs.</p><p>Dojo chips were optimized around that, and they made a bigger package. But as you make packages bigger and bigger, you have other constraints: networking speed, memory bandwidth, and cooling capabilities. All of these things start to rear their heads. It&#8217;s not simple. But yes, you will see a trend line of more chips on the package, and yes, you&#8217;re going to be able to do that on 7 nm.</p><p>In fact, that&#8217;s what Huawei did with their <a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/deepseek-research-suggests-huaweis-ascend-910c-delivers-60-percent-nvidia-h100-inference-performance">Ascend 910C</a> or <a href="https://www.tomshardware.com/tech-industry/artificial-intelligence/huawei-ascend-ai-910d-processor-designed-to-take-on-nvidias-blackwell-and-rubin-gpus">D</a>. They initially put one, and then they did two. They&#8217;re focusing on scaling the packaging up because that is an area where they can advance faster than process technology where they can&#8217;t shrink. But at the end of the day, that&#8217;s something you can do on the leading-edge chips too. 
Anything you do on 7 nm, you can also probably do on 3 nm in terms of packaging.</p><h3>01:05:37 &#8211; When will China outscale the West in semis?</h3><p><strong>Dwarkesh Patel</strong></p><p>If we end up in this world in 2030 where the West has the most advanced process technology but has not ramped it up as much, whereas China&#8230; I don&#8217;t know if you think by 2030 they would have EUV and 2 nm or whatever. But they are semiconductor-pilled and they are producing in mass quantity.</p><p>Basically, I&#8217;m wondering what the year is where there&#8217;s a crossover, where our advantage in process technology has faded enough, and their advantage in scale has increased enough. And also, whether their advantage in having one country with the entire supply chain indigenized&#8212;rather than having random suppliers in Germany and the Netherlands&#8212;would mean that China would be ahead in its ability to produce mass <a href="https://en.wikipedia.org/wiki/Floating_point_operations_per_second">flops</a>.</p><p><strong>Dylan Patel</strong></p><p>To date, China still does not have an entirely indigenized semiconductor supply chain.</p><p><strong>Dwarkesh Patel</strong></p><p>But would they in 2030?</p><p><strong>Dylan Patel</strong></p><p>By 2030, it&#8217;s possible that they do. But to date, all of China&#8217;s 7 nm and 14 nm capacity uses ASML DUV tools. The amount that they can import from ASML is large. But the vast majority of ASML&#8217;s revenue, and on EUV all of it, is outside of China. The scale advantage is still in favor of the West plus Taiwan, Japan, and Korea, et cetera.</p><p><strong>Dwarkesh Patel</strong></p><p>But they&#8217;re trying to make their own DUV and EUV tools, right?</p><p><strong>Dylan Patel</strong></p><p>They&#8217;re trying to do all these things. The question is how fast they can advance and scale up production as well as quality. To date, we haven&#8217;t seen that. 
Now I&#8217;m quite bullish that they&#8217;re going to be able to do these things over the next five to ten years. They will really scale up production and kick it into high gear. They have more engineers working on it and more desire to throw capital at the problem.</p><p><strong>Dwarkesh Patel</strong></p><p>So by 2030, will they have fully indigenized DUV?</p><p><strong>Dylan Patel</strong></p><p>I think for sure. DUV, yes.</p><p><strong>Dwarkesh Patel</strong></p><p>And fully indigenized EUV by 2030?</p><p><strong>Dylan Patel</strong></p><p>I think they&#8217;ll have working tools. I don&#8217;t think that they&#8217;ll be able to manufacture a bunch yet. There&#8217;s having it work, and then there&#8217;s production hell. ASML had EUV working in the early 2010s at some capacity. The tools were not accurate enough. They were not scaled for high-volume manufacturing or reliable enough. They had to ramp production, and that all took time.</p><p>Production hell takes time. That&#8217;s why it took another five to seven years to get EUV into mass production at a fab rather than just working in the lab.</p><p><strong>Dwarkesh Patel</strong></p><p>How many DUV tools do you think they&#8217;ll be able to manufacture in 2030?</p><p><strong>Dylan Patel</strong></p><p>ASML?</p><p><strong>Dwarkesh Patel</strong></p><p>No, China.</p><p><strong>Dylan Patel</strong></p><p>That&#8217;s a great question. It&#8217;s a bit of a challenge to look into this supply chain especially. We try really hard. In some instances, they&#8217;re buying stuff from Japanese vendors. If they want a fully indigenized supply chain, they need to not buy these lenses, projection optics, or stages from Japanese vendors. They need to build it internally.</p><p>It&#8217;s really tough to say where they&#8217;ll be able to get to. I honestly think it&#8217;s a shot in the dark. 
But it&#8217;s not unlikely that they&#8217;ll be able to do on the order of 100 DUV tools a year, whereas ASML is currently doing hundreds of DUV tools a year.</p><p>No company has a process node where they make a million wafers a month. Elon says he wants to do it and China is obviously going to do it. TSMC is trying to do that. The memory makers may get to a million wafers a month as well, but not in a single fab.</p><p>It&#8217;s mind-boggling to think of that scale, and challenging to see the supply chain galvanized for that. I don&#8217;t want to doubt China&#8217;s capability to scale.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess this is an interesting question. I think at some point SemiAnalysis will do the deep dive on this. By when would indigenized Chinese production be bigger than that of the West combined, given your model of when they&#8217;ll have DUV machines and EUV machines at scale?</p><p>Because there&#8217;s this question around if you have long timelines on AI&#8212;by long I mean 2035, which is not that long in the grand scheme of things&#8212;should you expect a world where China is dominating in semiconductors? It doesn&#8217;t get asked enough because if you&#8217;re in San Francisco, you&#8217;re thinking on timescales of weeks. If you&#8217;re outside of San Francisco, you&#8217;re not thinking about AGI at all.</p><p>What if we have AGI? What if you have this transformational thing that is commanding tens or hundreds of trillions of dollars of economic growth and token output, but it happens in 2035? What does that imply for the West versus China? SemiAnalysis has got to write the definitive model on this.</p><p><strong>Dylan Patel</strong></p><p>It&#8217;s really challenging when you move timescales out that far. What we tend to focus on is tracking every data center, every fab, and all the tools. We track where they&#8217;re going, but the time lags for these things are relatively short. 
We can only make reasonably accurate estimates for data center capacity based on land purchasing, permits, and turbine purchasing. We know where all these things are going, that&#8217;s the data we sell.</p><p>As you go out to 2035, things are just so radically different. Your error bars get so large it&#8217;s hard to make an estimate. But at the end of the day, if takeoff or timelines are slow enough, I don&#8217;t see why China wouldn&#8217;t be able to catch up drastically. In some sense, we&#8217;ve got this valley where, three to six months ago, or maybe even now, Chinese models are as competitive as they&#8217;ve ever been. I think Opus 4.6 and GPT 5.4 have really pulled away and made the gap a little bit bigger, but I&#8217;m sure some new Chinese models will come out.</p><p>As we move from selling tokens where they provide the entire reasoning chain, to selling automated white-collar work&#8212;an automated software engineer, you send them the request, they give you the result back, and there&#8217;s a bunch of thinking on the back end that they don&#8217;t show you&#8212;the ability to <a href="https://en.wikipedia.org/wiki/Knowledge_distillation">distill</a> out of American models into Chinese models will be harder.</p><p>Second, look at the scale of the compute the labs have. OpenAI exited the year with roughly two gigawatts last year. Anthropic will get to two-plus gigawatts this year. By the end of next year, they&#8217;ll both be at ten gigawatts of capacity. China is not scaling their AI lab compute nearly as fast. At some point, when you can&#8217;t distill the learnings from these labs into the Chinese models, plus with this compute race that OpenAI, Anthropic, Google, and Meta are all racing on, they end up getting to a point where the model performance should start to diverge more.</p><p>Then look at all this CapEx being spent on data centers. Amazon is spending $200 billion, Google $180 billion. 
All these companies are spending hundreds of billions of dollars on CapEx. There&#8217;s nearly a trillion dollars of CapEx being invested in data centers in America this year, roughly. What&#8217;s the return on invested capital here? You and I would think the return on invested capital for data center CapEx is very high.</p><p>If we look at Anthropic&#8217;s revenues, in January they added $4 billion. In February, which was a shorter month, they added $6 billion. We&#8217;ll see what they can do in March and April, given that compute constraints are what&#8217;s bottlenecking their growth. The reliability of Claude is quite low because they&#8217;re so compute constrained. But if this continues, then the ROIC on these data centers is super high.</p><p>At some point, the US economy starts growing faster and faster over this year and next year because of all this CapEx, all the revenue these models are generating, and the downstream supply chain. China doesn&#8217;t have that yet. They have not built the scale of infrastructure to invest in models, get to the capabilities, and then deploy these models at such scale.</p><p>When you look at Anthropic, they&#8217;re at $20 billion ARR. The margins are sub-50 percent, at least as <a href="https://www.theinformation.com/briefings/anthropic-lowered-gross-margin-projection-costs-run-ai-rose">last reported by </a><em><a href="https://www.theinformation.com/briefings/anthropic-lowered-gross-margin-projection-costs-run-ai-rose">The Information</a></em>. So that&#8217;s $13 or $14 billion of compute that it&#8217;s running on rental cost-wise, which is actually $50 billion worth of CapEx that someone laid out for Anthropic to generate their current revenue.</p><p>China has just not done this. If and when Anthropic 10Xs revenue again&#8212;and I think our answer would be when, not if&#8212;China doesn&#8217;t have the compute to deploy at that scale. So there is some sense that we&#8217;re in a fast takeoff. 
It&#8217;s not like we&#8217;re talking about a <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson sphere</a> by X date, it&#8217;s more like the revenue is compounding at such a rate that it does affect economic growth. The resources these labs are gathering are growing so fast. China hasn&#8217;t done that yet, so in that case, the US and the West are actually diverging.</p><p>The flip side is that these infrastructure investments have middling returns. Maybe they&#8217;re not as good as hoped. Maybe Google is wrong for wanting to take free cash flow to zero and spend $300 billion on CapEx next year. Maybe they&#8217;re just wrong and people on Wall Street who are bearish and people who don&#8217;t understand AI are correct. In that case, the US is building all this capacity but doesn&#8217;t get great returns. Meanwhile, China is able to build a fully vertical, indigenized supply chain, instead of the US/Japan/Korea/Taiwan/SE Asia/Europe countries together building this less vertical supply chain. In a sense, at some point China is able to scale past us if AI takes longer to get to certain capability levels than the vast majority of your guests on this podcast believe.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s fast timelines, the US wins; long timelines, China wins.</p><p><strong>Dylan Patel</strong></p><p>Yeah but I don&#8217;t know what fast timelines means. I don&#8217;t think you have to believe in AGI to have the timelines where the US wins.</p><h3>01:16:01 &#8211; The enormous incoming memory crunch</h3><p><strong>Dwarkesh Patel</strong></p><p>Let&#8217;s go back to memory. I think people on Wall Street and people in the industry are understanding how big this is, but maybe generally people don&#8217;t understand what a big deal it is. So we&#8217;ve got this memory crunch, as you were talking about.</p><p>And earlier I was asking about, oh, could we solve for the EUV tool shortage by going back to seven nanometers? 
So let me ask a similar question about memory. HBM is made of DRAM, but has three to four times fewer bits per wafer area than the DRAM it&#8217;s made out of.</p><p>Is it possible that accelerators in the future could just use commodity DRAM and not HBM, so we can get much more capacity out of the DRAM we have? The reason I think this might be possible is, if we&#8217;re going to have agents that are just going off and doing work, and it&#8217;s not a synchronous chatbot application, then you don&#8217;t necessarily need extremely fast latency.</p><p>Maybe you can have lower bandwidth, because the reason you stack DRAM into HBM is for higher bandwidth. Is it possible to go to non-HBM accelerators and basically have the opposite of Claude Code Fast, like have Claude Slow?</p><p><strong>Dylan Patel</strong></p><p>At the end of the day, the incremental purchaser who&#8217;s willing to pay the highest price for tokens also ends up being the one that&#8217;s less price-sensitive. Compute should be allocated, in a capitalistic society, towards the goods that have the highest value, and the private market determines this by willingness to pay.</p><p>To some extent, Anthropic could actually release a slow mode. They could release Claude Slow Mode and increase tokens per dollar by a significant amount. They could probably reduce the price of Opus 4.6 by 4-5x and reduce the speed by maybe just 2x. The curve on inference throughput versus speed is already there just on HBM. And yet they don&#8217;t, because no one actually wants to use a slow model.</p><p>Furthermore, on these agentic tasks, it&#8217;s great that the model can run at a time horizon of hours. But if the model were running slower, those hours would become a day. Conversely, if the model is running faster, those hours become an hour. 
No one really wants to move to a day-long wait period, because the highest-value tasks also have some time sensitivity to them.</p><p>I struggle to see&#8230; Yes, you could use regular DRAM. There are a couple of challenges with this. One of the core constraints of chips is that a chip is a certain size, and all of the <a href="https://en.wikipedia.org/wiki/Input/output">I/O</a> escapes on the edges. Often, the left and right of the chip are HBM&#8212;so the I/O from the chip to the HBM is on the sides&#8212;and then the top and bottom are I/O to other chips.</p><p>If you were to change from HBM to <a href="https://en.wikipedia.org/wiki/DDR_SDRAM">DDR</a>, all of a sudden this I/O on the edge would have significantly less bandwidth, but significantly more capacity per chip. But the metric you actually care about is bandwidth per wafer, not bits per wafer.</p><p><strong>Dwarkesh Patel</strong></p><p>Because the thing that is constraining the FLOPS is just getting in and out the next matrix, and for that you just need more bandwidth.</p><p><strong>Dylan Patel</strong></p><p>Yeah, getting out the <a href="https://www.ultralytics.com/glossary/model-weights">weights</a> and getting in and out the <a href="https://huggingface.co/blog/not-lain/kv-caching">KV cache</a>. In many cases, these GPUs are not running at full memory capacity. It&#8217;s obviously a system design thing: model, hardware, and software co-design. You have to figure out how much KV cache you need, how much you keep on the chip, how much you offload to other chips and call when you need it for tool calling, and how many chips you parallelize this on.</p><p>Obviously, the search space for this is very broad, which is why we have <a href="https://inferencex.semianalysis.com/">InferenceX</a>, an open-source model that searches all the optimal points on inference for a variety of different chips and models.</p><p>The point is, you&#8217;re not always necessarily constrained by memory capacity. 
You can be constrained by FLOPS, network bandwidth, memory bandwidth, or memory capacity. If you really simplify it down, there are four constraints, and each of these can break out into more.</p><p>If you switch to DDR, yes, you produce four times the bits per DRAM wafer, but all of a sudden the constraints shift a lot and your system design shifts. You go slower. Is the market smaller? Maybe. But also, all these FLOPS are wasted because they&#8217;re just sitting there waiting for memory. You don&#8217;t need all that capacity, because you can&#8217;t really increase batch size: the KV cache would then take even longer to read.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes sense. What is the bandwidth difference between HBM and normal DRAM?</p><p><strong>Dylan Patel</strong></p><p>An <a href="https://en.wikipedia.org/wiki/High_Bandwidth_Memory#HBM4">HBM4</a> stack&#8212;let&#8217;s talk about the stuff that&#8217;s in Rubin, because that&#8217;s what we&#8217;ve been indexing on&#8212;is 2048 bits across, connected in an area that&#8217;s 13 millimeters wide. It transfers data at around 10 giga-transfers a second.</p><p>So a stack of HBM4 is 2048 bits on an area that&#8217;s roughly 11 to 13 millimeters wide. That&#8217;s the shoreline you&#8217;re taking on the chip. In that shoreline, you have 2048 bits transferring at 10 giga-transfers per second. You multiply those together and divide by eight, bits to a byte, and you&#8217;re at roughly 2.5 terabytes a second per HBM stack.</p><p>When you look at DDR, in that same area, it&#8217;s maybe 64 or 128 bits wide. That DDR5 is transferring at anywhere from 6,400 to maybe 8,000 mega-transfers a second. So your bandwidth is significantly lower. It&#8217;s 64 times 8,000 divided by eight, which puts you at 64,000 megabytes, or 64 gigabytes, a second. 
Even if you take a generous interpretation of 128 times 8 giga-transfers, you&#8217;re at 128 gigabytes a second for the same shoreline, versus 2.5 terabytes a second.</p><p>There&#8217;s more than an order of magnitude difference in bandwidth per edge area. If your chip is a square, or 26 by 33 millimeters&#8212;which is the maximum size for an individual die&#8212;you only have so much edge area. On the inside of that chip, you put all your compute. There are things you can do to try and change that, like more <a href="https://en.wikipedia.org/wiki/Static_random-access_memory">SRAM</a> or more caching. But at the end of the day, you&#8217;re very constrained by bandwidth.</p><p><strong>Dwarkesh Patel</strong></p><p>Then there&#8217;s the question of where you can destroy demand to free up enough for AI. I guess the picture is especially bad because, as you&#8217;re saying, if it takes four times more wafer area to get the same byte for HBM, you have to destroy four times as much consumer demand for laptops and phones to free up one byte for AI.</p><p>What does this imply for the next year or two? Sorry for the run-on question, but in your newsletter you said 30% of Big Tech&#8217;s CapEx in 2026 is going towards memory?</p><p><strong>Dylan Patel</strong></p><p>Yes.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s insane, right? Of the $600 billion or whatever, 30% is going just to memory.</p><p><strong>Dylan Patel</strong></p><p>Yes. Obviously, there&#8217;s some level of margin stacking that Nvidia does, so you have to separate that out and apply their margin to the memory and the logic. But at the end of the day, a third of their CapEx is going to memory.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s crazy. What should we expect over the next year or two as this memory crunch hits?</p><p><strong>Dylan Patel</strong></p><p>The memory crunch will continue to get harder, and prices will continue to go up. 
This affects different parts of the market differently. Are people going to hate AI more and more? Yes, because smartphones and PCs are not going to get incrementally better year on year. In fact, they&#8217;re going to get incrementally worse.</p><p><strong>Dwarkesh Patel</strong></p><p>If you look at the bill of materials for an iPhone, what fraction of it is the memory? How much more expensive does an iPhone get if the memory is two times more expensive?</p><p><strong>Dylan Patel</strong></p><p>I believe an iPhone has 12 gigabytes of memory. Each gig used to cost roughly $3-4, so that&#8217;s $50. But now the price of memory has tripled. Let&#8217;s say it&#8217;s $12 per gig for DDR. Now you&#8217;re talking about $150 versus $50.</p><p>That&#8217;s a $100 cost increase for Apple, and that&#8217;s just on the DRAM. The <a href="https://en.wikipedia.org/wiki/Flash_memory">NAND</a> also has the same market dynamics, so in reality it&#8217;s probably a $150 increase on the iPhone. Apple either has to pass that on to the consumer or eat it. Apple has some margin, but I don&#8217;t see them reducing it too much; maybe they eat a little bit. But at the end of the day, that means the end consumer is paying $250 more for an iPhone.</p><p>Now that&#8217;s just on last year&#8217;s pricing versus today&#8217;s. There is some lag before Apple feels the heat because they tend to have long-term contracts for memory that last three months to a year. But at the end of the day, Apple gets hit pretty hard by this. They won&#8217;t really adjust until the next iPhone release.</p><p>But that&#8217;s the high end of the market, which is only a few hundred million phones a year. Apple sells two or three hundred million phones annually. The bulk of the market is mid-range and low-end. 
It used to be that 1.4 billion smartphones were sold a year. Now we&#8217;re at about 1.1 billion. Our projections are that we might drop to 800 million this year, and down to 500 or 600 million next year.</p><p>We look at data points out of China from some of our analysts in Asia, Singapore, Hong Kong, and Taiwan. They&#8217;ve been tracking this, and they see <a href="https://en.wikipedia.org/wiki/Xiaomi">Xiaomi</a> and <a href="https://en.wikipedia.org/wiki/Oppo">Oppo</a> cutting low-end and mid-range smartphone volumes by half.</p><p>Yes, it&#8217;s only a $150 <a href="https://en.wikipedia.org/wiki/Bill_of_materials">BOM</a> increase on a $1,000 iPhone where Apple has some larger margin. But for smaller phones, the percentage of the BOM that goes to memory and storage is much larger. And the margins are lower, so there&#8217;s less capacity to even eat the margins. And they have also generally tended not to do long-term agreements on memory.</p><p>Why this is a big deal is that if smartphone volumes halve, that drop will happen in the low and mid-range, not the high end. So it&#8217;s not like the bits released are halving. Currently, consumer devices account for more than half of memory demand. Even if you halve smartphone volumes, because of the shape of the halving, the low end gets cut by more than half, while the high end gets cut by less than half, because you and I will still buy the high-end phones that cost north of a thousand dollars. We&#8217;ll buy them even if they get a little bit more expensive. And Apple&#8217;s volumes will not go down as much as a low-end smartphone provider.</p><p>The same applies to PCs. What this does to the market is quite drastic. DRAM gets released and goes to AI chips, who are willing to do longer-term contracts and pay higher margins, because at the end of the day the margin they extract from the end user is much larger.</p><p>This probably leads to people hating AI even more. 
Today, you already see all the memes on PC subreddits and gaming PC Twitter. It&#8217;s cat dancing videos saying, &#8220;This is why memory prices have doubled and you can&#8217;t get a new gaming GPU or desktop.&#8221; It&#8217;s going to be even worse when memory prices double again, especially DRAM.</p><p>Another interesting dynamic is that it&#8217;s not just DRAM, it&#8217;s also NAND. NAND is also going up in price. Both of these markets have expanded capacity very slowly over the last few years, NAND almost zero. The percentage of NAND that goes to phones and PCs is larger than the percentage of DRAM that goes to phones and PCs.</p><p>As you destroy demand, mostly for DRAM purposes, you unlock more NAND that gets allocated and can go to other markets. The price increases of DRAM will be larger than those of NAND because you&#8217;ve released more from the consumer, and in fact, you&#8217;ve produced more memory for AI.</p><p><strong>Dwarkesh Patel</strong></p><p>Sorry, maybe you just explained it and I missed it. Is it because <a href="https://en.wikipedia.org/wiki/Solid-state_drive">SSDs</a> are being used in large quantities for data centers?</p><p><strong>Dylan Patel</strong></p><p>They are, but not in as large quantities as DRAM.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, so they will also increase because they&#8217;ll be using some quantity, but there&#8217;s not as much of a need as there is for HBM. Makes sense.</p><p>One thing I didn&#8217;t appreciate until I was reading some of your newsletters is that the same constraints preventing logic scaling over the next few years are quite similar to what&#8217;s preventing us from producing more memory wafers. In fact, literally the same exact machine, this EUV tool, is needed for memory.  
So I guess the question someone could ask right now is, why can&#8217;t we just make more memory?</p><p><strong>Dylan Patel</strong></p><p>The constraints, as I was mentioning earlier, are not necessarily EUV tools today or next year. They become that as we get to the latter part of the decade. Currently, the constraints are more that they physically just haven&#8217;t built fabs. Over the last three to four years, these vendors have not built new fabs because memory prices were really low. Their margins were low, and in fact, they were losing money in 2023 on memory. So they decided they weren&#8217;t building new fabs. The market slowly recovered over time but never really got amazing until last year.</p><p>In 2024, we were banging on the drums that reasoning means <a href="https://cloud.google.com/transform/the-prompt-what-are-long-context-windows-and-why-do-they-matter">long context</a>, which means a large KV cache, which means a lot of memory demand. We&#8217;ve been talking about that for a year and a half, two years. People who understand AI went really long on memory then. So you&#8217;ve seen that dynamic, but now it has finally played out in pricing.</p><p>It took so long for what was obvious to play out: long context means the KV cache gets bigger, you need more memory. Half the cost of accelerators is memory. Of course they&#8217;re going to start going crazy on it. It took a year for that to actually reflect in memory prices. Once memory prices reflected that, it took another three to six months for the memory vendors to start building fabs. Those fabs take two years to build. So we won&#8217;t have really meaningful fabs to even put these tools in until late 2027 or 2028.</p><p>Instead, you&#8217;ve seen some really crazy stuff to get capacity. 
<a href="https://en.wikipedia.org/wiki/Micron_Technology">Micron</a> <a href="https://www.reuters.com/world/china/microns-18-billion-acquisition-boosts-powerchip-shares-2026-01-19/">bought a fab from a company in Taiwan</a> that makes lagging-edge chips. Hynix and Samsung are doing some pretty crazy things to try and expand capacity at their existing fabs, which also have large knock-on effects in the economy.</p><p>So why can&#8217;t we build more capacity? There&#8217;s nowhere to put the tools. It&#8217;s not just EUV; there are other tools involved in DRAM and logic. In logic, for N3, about 28% of the cost of the final wafer is EUV. When you look at DRAM, it&#8217;s in the teens. It&#8217;s going up, but it&#8217;s a much smaller percentage of the cost. These other tools are also bottlenecks, although their supply chains are not as complex as ASML&#8217;s.</p><p>You see <a href="https://en.wikipedia.org/wiki/Applied_Materials">Applied Materials</a>, <a href="https://www.lamresearch.com/">Lam Research</a>, and all these other companies expanding capacity a lot as well. But you don&#8217;t have anywhere to put the tool, because the most complex buildings people make are fabs, and fabs take two years to build.</p><p><strong>Dwarkesh Patel</strong></p><p><a href="https://www.dwarkesh.com/p/elon-musk">I interviewed Elon recently</a>, and his whole plan is that they&#8217;re going to build this <a href="https://www.bloomberg.com/news/articles/2026-01-28/musk-says-tesla-needs-to-build-terafab-to-manufacture-chips">TeraFab</a> and they&#8217;re going to build the <a href="https://en.wikipedia.org/wiki/Cleanroom">clean rooms</a>. I won&#8217;t even ask you about the dirty rooms thing, but let&#8217;s say they build the clean rooms.</p><p>I have a couple of questions. One, do you think this is the kind of thing that Elon Co. could build much faster than people conventionally build it? This is not about building the end tools. 
This is just about building the facility itself. How complicated is it to just build the clean room extremely fast? Is this something that Elon, with his &#8220;move fast&#8221; approach, could do much faster if that&#8217;s what we&#8217;re bottlenecked on this year or next year? Two, does that even matter if, in two years, your view is that we&#8217;re not bottlenecked on clean room space, but on the tooling?</p><p><strong>Dylan Patel</strong></p><p>As with any complex supply chain, it takes time, and constraints shift over time. Even if something is no longer a constraint, that doesn&#8217;t mean that market no longer has margin. For example, energy will not be a big bottleneck a couple of years from now, but that doesn&#8217;t mean energy isn&#8217;t growing super fast and there&#8217;s no margin there. It&#8217;s just not the key bottleneck. In the space of fabs, clean rooms are the biggest bottleneck this year and next year. As we get to 2028, 2029, 2030, there will still be constraints there.</p><p>The thing about Elon is he has a tremendous capability to garner physical resources and really smart people to build things. The way he recruits amazing people is by trying to build the craziest stuff. In the case of AI, that hasn&#8217;t really worked because everyone&#8217;s trying to build AGI. Everyone is very ambitious. But in the case of going to Mars, making rockets that land themselves, fully autonomous electric cars, or humanoid robots, these are methods of recruiting the people who think that&#8217;s the most important problem in the world to work on that problem, because he&#8217;s the only one trying really hard.</p><p>In the case of semiconductors, he stated he wants to make a fab that&#8217;s a million wafers per month. No one has a fab that big. It&#8217;s possible that he&#8217;s able to recruit a lot of really awesome people and get them on this crazy task of building a million wafers a month. 
Step one is to build the clean room, and that I think he probably can do. His mindset around deleting things, that it can be dirty, it&#8217;s fine, is probably not right. Actually I think it&#8217;s 100% not right. You need the fab to be very clean. All of the air in the fab gets replaced every three seconds, it&#8217;s that fast. There have to be so few particles.</p><p>But I think he can build the clean room. It&#8217;ll take a year or two. Initially, it won&#8217;t be super fast, but over time, he&#8217;ll get faster at it. The really complex part is actually developing a process technology and building wafers. I don&#8217;t think he can develop that quickly. That has a lot of built-up knowledge. The most complicated integration of very expensive tools and supply chains is done by TSMC, Intel, or Samsung. These two other companies aren&#8217;t even that great at it, and they&#8217;re tremendously complex.</p><p><strong>Dwarkesh Patel</strong></p><p>How surprised would you be if in 2030 there just happened to be some total disruption where we&#8217;re not using EUV? What if we&#8217;re using something that has much better effects, is much simpler to produce, and can be produced in much bigger quantities? I&#8217;m sure as an industry insider that sounds like a totally naive question, but do you see what I&#8217;m asking? What probability should we put on something coming totally out of left field to make all of this irrelevant?</p><p><strong>Dylan Patel</strong></p><p>Something that&#8217;s very simple and easy to scale, I assign a very, very low probability. 
There are a number of companies working on effectively particle accelerators or <a href="https://en.wikipedia.org/wiki/Synchrotron">synchrotrons</a> that generate light that&#8217;s either 13.5 nanometer, like EUV, or an even narrower wavelength, like <a href="https://www.tomshardware.com/tech-industry/semiconductors/american-startup-substrate-promises-2nm-class-chipmaking-with-particle-accelerators-at-a-tenth-of-the-cost-of-euv-x-ray-lithography-system-has-potential-to-surpass-asmls-euv-scanners">X-ray</a> at 7 nanometers, to then use in lithography tools. But those things are massive particle accelerators generating this light. It&#8217;s a very complicated thing to build.</p><p>There are a couple of companies and I think that could be a big disruption to the industry beyond EUV. But I don&#8217;t think we&#8217;re going to magically build something new that is direct write and super simple, and can be manufactured at huge volumes, although there are some attempts to do things like this.</p><p><strong>Dwarkesh Patel</strong></p><p>I ask because if you think about Elon&#8217;s companies in the past, rocketry was this thing that was thought to be&#8212;and is&#8212;incredibly complicated.</p><p><strong>Dylan Patel</strong></p><p>Look, I&#8217;m just a naive yapper compared to Elon. What have I built? So maybe it&#8217;s possible.</p><p><strong>Dwarkesh Patel</strong></p><p>In order to build more memory in the future, could we build <a href="https://www.tomshardware.com/tech-industry/next-generation-3d-dram-approaches-reality-as-scientists-achieve-120-layer-stack-using-advanced-deposition-techniques">3D DRAM</a> the way we do <a href="https://www.appliedmaterials.com/us/en/semiconductor/markets-and-inflections/memory/3d-nand.html">3D NAND</a> and then go back to DUV?</p><p><strong>Dylan Patel</strong></p><p>That is the hope currently. Everyone&#8217;s roadmap for 3D DRAM is that you&#8217;ll still use EUV because you want to have that tighter overlay. 
When you&#8217;re doing these subsequent processing steps, everything is vertically stacked and you have more layers on top of each other. You want the pitches to be tighter. So generally, people are still trying to do it with EUV.</p><p>But what 3D would do is change the calculation of how many bits a single EUV pass can make. That number would go up drastically if you go to 3D DRAM. That is the hope. Right now, everyone&#8217;s roadmap goes from the current 6F cell, to a <a href="https://www.globalsmt.net/advanced-packaging/a-new-round-of-technological-innovation-in-memory-market-on-the-way/">4F cell</a>, and then finally 3D DRAM by the end of the decade or early next decade. There&#8217;s still a lot of R&amp;D, manufacturing, and integration to be done. I wouldn&#8217;t call that out of the cards. I think it&#8217;s very likely going to happen.</p><p>It&#8217;s also going to require a huge retooling of fabs. The breakdown of tools in a fab will be very different. The lithography tool is actually the only thing that isn&#8217;t that different. But the number of them relative to different types of <a href="https://en.wikipedia.org/wiki/Chemical_vapor_deposition">chemical vapor deposition</a>, <a href="https://en.wikipedia.org/wiki/Atomic_layer_deposition">atomic layer deposition</a>, <a href="https://en.wikipedia.org/wiki/Dry_etching">dry etch</a>, or different kinds of etch chambers with different chemistries&#8230; You have all these different tools for different process nodes. You can&#8217;t just convert a logic fab to a DRAM fab, or vice versa, or a NAND fab to a DRAM fab, in a short amount of time.</p><p>In the same way, existing DRAM fabs require a lot of retooling just to go from 1-alpha to 1-beta to 1-gamma process nodes, because they have to add DUV and change the chemistry stacks for when you&#8217;re using EUV in terms of deposition and etch. And the EUV tool has to be there. 
Furthermore, when you change to 3D DRAM, there&#8217;s going to be an even larger shift, so a lot of retooling of these fabs needs to happen.</p><p>That would be a big disruption. That would make EUV demand generally lower. But as we&#8217;ve seen across time, lithography demand as a percentage of wafer cost has trended up. Around the 2014 era, it was 17% of the wafer cost, and it&#8217;s gone to 30% over the last fifteen years. For DRAM, it was in the low to mid-teens, and now it&#8217;s trended toward the high teens. Before we get to 3D DRAM, it&#8217;ll likely cross into the 20% range. But then, if we get to 3D DRAM, EUV as a percentage of the total end wafer cost tanks again.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess you care less about the percent of cost and more about how much it bottlenecks production.</p><p><strong>Dylan Patel</strong></p><p>Right, but the percentage of cost&#8212;</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s a proxy, yeah. If you&#8217;re <a href="https://en.wikipedia.org/wiki/Jensen_Huang">Jensen</a> or Sam Altman, or whoever stands to gain a lot from scaling up AI compute, there are these stories that they&#8217;d go to TSMC and say, &#8220;Why can&#8217;t we access Y and Z?&#8221; But I think the point you&#8217;re making is that it doesn&#8217;t really matter what TSMC does in some sense. In fact, even if you have Intel and Samsung building more foundries, in the long run, you&#8217;re going to be bottlenecked by ASML and other tool and material makers.</p><p>First, is that a correct interpretation? Second, should Silicon Valley people be going to the Netherlands right now to try to pitch ASML to make more tools so that in 2030 they can have more AI compute?</p><p><strong>Dylan Patel</strong></p><p>It&#8217;s a funny dynamic we saw in 2023, 2024, and 2025. 
People who saw the energy bottleneck before others asymmetrically went to <a href="https://www.siemens-energy.com/us/en/home/products-services/product-offerings/gas-turbines.html">Siemens</a>, <a href="https://en.wikipedia.org/wiki/Mitsubishi_Heavy_Industries">Mitsubishi</a>, and of course <a href="https://en.wikipedia.org/wiki/GE_Vernova">GE Vernova</a>, and bought up turbine capacity. Now they&#8217;re able to charge excess amounts for deploying these turbines in places because of energy.</p><p>In the same sense, this could be done for EUV, except ASML is not just going to trust any random bozo who wants to buy EUV tools. These turbines are much cheaper than EUV tools, and there&#8217;s many more of them produced. Especially once you get to industrial gas turbines, not just <a href="https://en.wikipedia.org/wiki/Combined-cycle_power_plant">combined-cycle</a> but the cheaper, smaller, less efficient ones, people put down deposits for these.</p><p>Someone could do this. Someone should go to the Netherlands and be like, &#8220;I&#8217;ll pay you a billion dollars. You give me the right to purchase ten EUV tools two years from now, and I&#8217;m first in line.&#8221; Then over those two years, you go around and wait for everyone to realize, &#8220;Oh crap, I don&#8217;t have enough EUV tools,&#8221; and you try to sell your option at some premium. All you&#8217;re effectively doing is saying, &#8220;ASML, you&#8217;re dumb. You weren&#8217;t making enough margin on these. I&#8217;m going to make a margin.&#8221; The question is, will ASML even agree to this? I don&#8217;t think so.</p><p><strong>Dwarkesh Patel</strong></p><p>There&#8217;s a world where they at least get the demand signal from that to increase production.</p><p><strong>Dylan Patel</strong></p><p>Potentially. 
I agree.</p><p><strong>Dwarkesh Patel</strong></p><p>But it sounds like you&#8217;re saying they couldn&#8217;t even increase production if they wanted to, given the supply chain.</p><p><strong>Dylan Patel</strong></p><p>Right. But that&#8217;s exactly the market in which&#8230; If they can&#8217;t increase production, just like TSMC cannot increase production that fast, and yet demand is mooning, then the obvious solution is to arbitrage this. You and I know demand is way higher than they&#8217;re projecting and their capability to build.</p><p>You arbitrage this by locking up the capacity, doing a forward contract, and then trying to sell it at a later date once other people realize everything is fucked and we don&#8217;t have enough capacity. Then you&#8217;ll have this insane margin that ASML and TSMC should have been charging. But the thing is, I don&#8217;t know if ASML and TSMC will ever agree to this.</p><h3>01:42:34 &#8211; Scaling power in the US will not be a problem</h3><p><strong>Dwarkesh Patel</strong></p><p>Let me ask you about power now. It sounds like you think power can be arbitrarily scaled.</p><p><strong>Dylan Patel</strong></p><p>Not arbitrarily, but yes.</p><p><strong>Dwarkesh Patel</strong></p><p>But beyond these numbers. If I&#8217;m remembering correctly, <a href="https://newsletter.semianalysis.com/p/how-ai-labs-are-solving-the-power">your blog post</a> on how AI labs are increasing power implied that GE Vernova, Mitsubishi, and Siemens could produce 60 gigawatts a year in gas turbines. Then there are these other sources, but they&#8217;re less significant than the turbines.</p><p>Only a fraction of that goes to AI, I assume. If in 2030 we have enough logic and memory to do 200 gigawatts a year, do you just think that these things are on a path to ramp up to more than 200 gigawatts a year, or what do you see?</p><p><strong>Dylan Patel</strong></p><p>Right now we&#8217;re at 20 or 30. 
This is critical IT capacity, by the way, which is an important thing to mention. When I&#8217;m talking about these gigawatts, I&#8217;m talking about critical IT capacity. Server plugged in, that&#8217;s how much power it pulls. But there are losses along the chain. There is loss on transmission, conversion, cooling, et cetera. So you should gross this factor up from 20 gigawatts for this year, or 200 gigawatts by the end of the decade, to some number 20-30% higher.</p><p>Then you have capacity factors. Turbines don&#8217;t run at 100 percent. If you look at <a href="https://en.wikipedia.org/wiki/PJM_Interconnection">PJM</a>, which I think is the largest grid in America&#8212;covering the Midwest and some of the Northeast area&#8212;in their models they want to have roughly 20 percent excess capacity. Within that 20 percent excess capacity, they&#8217;re running all the turbines at 90% because they are derated some for reliability, maintenance, and so on. In reality, the nameplate capacity for energy is always way higher than the actual end critical IT capacity because of all these factors.</p><p>But it&#8217;s not just turbines. If you were just making power from turbines, that&#8217;s simple, boring, and easy. Humans and capitalism are far more effective. The whole point of that blog was that, yes, there are only three people making combined-cycle gas turbines, but there&#8217;s so much more we can do. We can do <a href="https://www.gevernova.com/content/dam/gepower-microsites/global/en_US/documents/avr/GEA34130%20AeroderivativeGT_Whitepaper_R5.pdf">aeroderivatives</a>. We can take airplane engines and turn them into turbines. There are even new entrants in the market, like <a href="http://m">Boom Supersonic</a> trying to do that and working with <a href="https://www.crusoe.ai/">Crusoe</a>. 
Also, there are all the others like that which already exist in the market.</p><p>There are also <a href="https://blog.burnsmcd.com/meet-growing-data-center-power-demands-with-reciprocating-engines">medium-speed reciprocating engines</a>: engines that spin in circles, like a diesel engine. There are ten people who make engines that way. I&#8217;m from Georgia, and people used to be like, &#8220;Oh man, you got a <a href="https://www.cummins.com/en-na">Cummins</a> engine in there,&#8221; regarding RAM trucks. Automobile manufacturing is going down, so these companies all have capacity and could scale and convert that for data center power. You stick all these reciprocating engines in. It&#8217;s not as clean as combined-cycle, but maybe you can convert them from diesel to gas if you want.</p><p>What about ship engines? All of these engines for massive cargo ships are great. <a href="https://nebius.com/newsroom/nebius-announces-multi-billion-dollar-agreement-with-microsoft-for-ai-infrastructure">Nebius is doing that for a Microsoft data center in New Jersey</a>. They&#8217;re running ship engines to generate power. <a href="https://www.bloomenergy.com/">Bloom Energy</a> is doing <a href="https://www.bloomenergy.com/hydrogen-fuel-cells/">fuel cells</a>. We&#8217;ve been very positive on them for a year and a half now because they have such a capability to increase their production. Their payback period for a production increase is very fast, even if the cost is a little bit higher than combined-cycle, which is the best for cost and efficiency.</p><p>Then there&#8217;s solar plus battery, which can come online as those cost curves continue to come down. There&#8217;s wind, where you might only expect 15 percent of the maximum power because things oscillate, but you add batteries. There are all these things.</p><p>The other thing is that the grid is scaled so we don&#8217;t cut off power at peak usage on the hottest day of the summer. 
But in reality, that&#8217;s a load spike that is 10-20% higher than the average. If you just put enough utility-scale batteries, or <a href="https://en.wikipedia.org/wiki/Peaking_power_plant">peaker plants</a> that only run a small portion of the year&#8212;and those could be gas, industrial gas turbines, combined-cycle, batteries, or any of the other sources I mentioned&#8212;then all of a sudden you&#8217;ve unlocked 20% of the US grid for data centers. Most of the time that capacity is sitting idle. It&#8217;s really only there for that peak, which is just a few hours over a few days of the year. If you have enough capacity to absorb that peak load, then all of a sudden you&#8217;ve transferred it all.</p><p>Today, data centers are only 3-4% of the power of the US grid, and by 2028 they&#8217;ll be 10%. But if you can unlock 20% of the US grid like this, it&#8217;s not that crazy. The US grid is terawatt-level, not hundreds-of-gigawatts-level. So we can add a lot more energy.</p><p>I&#8217;m not saying it&#8217;s easy. These things are going to be hard. There&#8217;s a lot of hard engineering, risks people have to take, and new technologies people have to use. But Elon was the first to do this behind-the-meter gas, and since then we&#8217;ve seen an explosion of different things people are doing to get power. They&#8217;re not easy, but people are gonna be able to do them. The supply chains are just way simpler than chips.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. He made the point during the interview that for the specific blade for the specific turbine he was looking at, the lead times go out beyond 2030. Your point is that&#8212;</p><p><strong>Dylan Patel</strong></p><p>That&#8217;s great. There are so many other ways to make energy. Just be inefficient. It&#8217;s fine.</p><p><strong>Dwarkesh Patel</strong></p><p>Right now, combined-cycle gas turbines have CapEx of $1,500 per kilowatt. 
Are you saying it would make sense to have either technologies that are much more expensive than that, or other things are getting cheap enough to make it competitive?</p><p><strong>Dylan Patel</strong></p><p>Exactly. It can be as high as $3,500 per kilowatt. It could be twice as much as the cost of combined-cycle, and the total cost of the GPU on a TCO basis has only gone up a few cents per hour.</p><p>Because we&#8217;ve been talking about Hopper pricing, $1.40, let&#8217;s say the power price doubles. The Hopper that was $1.40 is now $1.50 in cost. I don&#8217;t care, because the models are improving so fast that the marginal utility of them is worth way more than that ten-cent increase in energy.</p><p><strong>Dwarkesh Patel</strong></p><p>So you&#8217;re saying 20 percent of the grid&#8212;the grid is about one terawatt&#8212;can just come online from utility-scale batteries, increasing what you&#8217;d be comfortable putting on the grid.</p><p><strong>Dylan Patel</strong></p><p>The regulatory mechanism there is not easy, by the way.</p><p><strong>Dwarkesh Patel</strong></p><p>But that&#8217;s 200 gigawatts, if that hypothetically happens. Just from the different sources of gas generation you mentioned&#8212;the different kinds of engines and turbines&#8212;combined, how many gigawatts could they unlock by the end of the decade?</p><p><strong>Dylan Patel</strong></p><p>We&#8217;re tracking this in our data. There are over 16 different manufacturers of power-generating things just from gas alone. Yes, there are only three turbine manufacturers for combined-cycle, but we&#8217;re tracking 16 different vendors, and we have all of their orders. It turns out there are hundreds of gigawatts of orders to various data centers.</p><p>As we get to the end of the decade, we think something like half of the capacity that&#8217;s being added will be behind the meter. 
Behind the meter is almost always more expensive than grid-connected, but there are just a lot of problems with getting grid-connected: permits and interconnection queues and all this sort of stuff. So even though it&#8217;s more expensive, people are doing behind the meter.</p><p>What they&#8217;re doing behind the meter ranges widely. It could be reciprocating engines, ship engines, or aeroderivatives. It could be combined-cycle, although combined-cycle is not that great for behind the meter. It could be Bloom Energy fuel cells, or solar plus battery. It could be any of these things.</p><p><strong>Dwarkesh Patel</strong></p><p>And you&#8217;re saying any of these individually could do tens of gigawatts?</p><p><strong>Dylan Patel</strong></p><p>Any of these individually will do tens of gigawatts, and as a whole, they will do hundreds of gigawatts.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. So that alone should more than&#8212;</p><p><strong>Dylan Patel</strong></p><p>Electrician wages will probably double or triple again. There are going to be a lot of new people entering that field, and a ton of people who make money, but I don&#8217;t see that as the main bottleneck.</p><p><strong>Dwarkesh Patel</strong></p><p>Right now in Abilene, at the <a href="https://www.crusoe.ai/resources/newsroom/crusoe-expands-ai-data-center-campus-in-abilene-to-1-2-gigawatts">1.2-gigawatt data center that Crusoe is building for OpenAI</a>, I think they have 5,000 people working there, or at peak they did. If you turn that into 100 gigawatts&#8212;and I&#8217;m sure things will get more efficient over time&#8212;that would be 400,000 people it would take to build 100 gigawatts.</p><p>If you think about the US labor force, and how many electricians there are and how many construction workers there are&#8230; I guess there are 800,000 electricians. I don&#8217;t know if they&#8217;re all substitutable in this way. There are millions of construction workers. 
But if we&#8217;re in a world where we&#8217;re adding 200 gigawatts a year, are we going to be crunched on labor eventually, or do you think that is actually not a real constraint?</p><p><strong>Dylan Patel</strong></p><p>Labor is a big constraint. It&#8217;s a humongous constraint in this. People have to be trained. Likewise, we&#8217;ll probably start importing the highest-skilled labor. It makes sense that a really high-skilled electrician in Europe who was working on destroying power plants now comes to America and is building high-voltage electricity moving across a data center.</p><p>Humanoid robots or robotics at least might start to help, but the main factor for reducing the number of people is going to be modularizing things and making them in factories in Asia. Unfortunately for America, places like Korea, Southeast Asia, and in many ways China as well are going to ship in more and more built-out sections of the data center. Today you ship servers or a rack in, and then you plug that into different pieces that you&#8217;re shipping from different places.</p><p>But now you&#8217;ll ship it to a factory and integrate the entire thing. Maybe this is a two-megawatt block, and this block goes from high-voltage AC power to the DC voltage that you deliver to the rack, or something like this. Or with cooling, you ship a fully integrated unit that has a lot of the cooling subsystems already put together, because plumbers are also a big constraint here.</p><p>Furthermore, instead of just a single rack where you have people wiring up all these racks with electricity, you take a skid and put an entire row of servers on it that is shipped directly from the factories. 
Today, a single rack may be 120 or 140 kilowatts, but as we get to next-generation <a href="https://developer.nvidia.com/blog/nvidia-800-v-hvdc-architecture-will-power-the-next-generation-of-ai-factories/">Nvidia Kyber</a> and things like that, it&#8217;s almost a megawatt.</p><p>In addition, if you do an entire row, it&#8217;ll have the rack, the networking, the cooling, and the power all integrated together. Now when you come in, you have much less to cable. There&#8217;s less networking fiber, fewer power connections, and fewer plumbing things. This can drastically reduce the number of people working in data centers, so our capability to build them will be much larger.</p><p>Along the way, some people will move faster to new things, and some will move slower. <a href="https://www.forbes.com/sites/annatong/2026/03/12/from-gigawatts-to-grab-and-go-crusoe-leans-into-modular-ai-data-centers/">Crusoe and Google have been talking a lot about this modularization</a>, as have <a href="https://www.datacenterdynamics.com/en/news/meta-to-deploy-366mw-of-modular-gas-units-to-power-1gw-data-center-in-el-paso-texas/">companies like Meta</a> and many others. The people who move faster to new things may face delays, while the people who are slower will face labor problems. There will always be dislocations in the market because this is a very complex supply chain. At the end of the day, it&#8217;s still simple enough that we will be able to solve it through capitalism and human ingenuity on the timescales required.</p><h3>01:54:44 &#8211; Space GPUs aren&#8217;t happening this decade</h3><p><strong>Dwarkesh Patel</strong></p><p>Speaking of big problems to solve, Elon Musk is very bullish on space GPUs. 
If you&#8217;re right that power is not a constraint on Earth&#8230; I guess the other reason they would make sense is that even if there will be enough gas turbines or whatever on Earth, Elon&#8217;s next argument is that you can&#8217;t get the permitting to build hundreds of gigawatts on Earth. Do you buy that argument?</p><p><strong>Dylan Patel</strong></p><p>Land-wise, America is big. Data centers don&#8217;t actually take up that much space, so you can solve that. Permitting-wise, air pollution permits are a challenge, but the Trump administration made it much easier. You go to Texas, and you can skip a lot of this red tape.</p><p>Elon had to deal with a lot of this complex stuff in Memphis, and then building a power plant across the border for <a href="https://x.ai/colossus">Colossus 1 and 2</a>. But at the end of the day, there&#8217;s a lot more you can get away with in the middle of Texas.</p><p><strong>Dwarkesh Patel</strong></p><p>Given that Elon lives in Texas, why didn&#8217;t he just go to Texas?</p><p><strong>Dylan Patel</strong></p><p>I think it was partially that they over-indexed on grid power for a temporary period of time. That&#8217;s just what they thought they needed more of.</p><p><strong>Dwarkesh Patel</strong></p><p>Because they had an aluminum refinery connected to the grid there.</p><p><strong>Dylan Patel</strong></p><p>It was actually an idled appliance factory. But I think they may have indexed more to grid power, water access, and gas access. I think they bought that knowing the gas line was right there and they were going to tap it. Same with water. It was a whole host of different constraints. It was probably an area where electricians were easier to find.</p><p>At the end of the day, I&#8217;m not exactly sure why they chose that site. I bet Elon would&#8217;ve chosen somewhere in Texas if he could&#8217;ve gone back because of the regulatory challenges he faced. 
Ultimately, permitting is a challenge, but America is a big place with 50 states, and things will get done.</p><p>There are a lot of small jurisdictions where you can just transport in all the workers you need for a temporary period of three to twelve months, depending on the contractor. You can put them in temporary housing and pay out the butt, because labor is very cheap relative to the GPUs and the networking, and the end value of the tokens it&#8217;s going to produce. So there is plenty of room to pay for all of these things.</p><p>Also, people are diversifying now. Australia, Malaysia, Indonesia, and India are all places where data centers are going up at a much faster pace. But currently, over 70% of AI data centers are still in America, and that continues to be the trend. People are figuring out how to build these things. Ultimately, dealing with permitting and red tape in middle-of-nowhere Texas, Wyoming, or New Mexico is probably a hell of a lot easier than sending stuff into space.</p><p><strong>Dwarkesh Patel</strong></p><p>Other than the economic argument making less sense once you consider that energy is a small fraction of the total cost of ownership of a data center, what are the other reasons you&#8217;re skeptical?</p><p><strong>Dylan Patel</strong></p><p>Obviously, power is basically free in space.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s the reason to do it.</p><p><strong>Dylan Patel</strong></p><p>Yeah, that&#8217;s the reason to do it. But there are all the other counterarguments. Even if power costs double on Earth, it&#8217;s still a fraction of the total cost of the GPU.</p><p>The main challenge is&#8230; We have <a href="https://www.clustermax.ai/">ClusterMAX</a>, which rates all the neoclouds. We test over 40 cloud companies, including the hyperscalers and neoclouds. Outside of software, what differentiates these clouds the most is their ability to deploy and manage failure.</p><p>GPUs are horrendously unreliable. 
Even today, around 15% of Blackwells that get deployed have to be <a href="https://en.wikipedia.org/wiki/Return_merchandise_authorization">RMA&#8217;d</a>. You have to take them out. Sometimes you just have to plug them back in, but sometimes you have to take them out and ship them back to Nvidia or their partners who do the RMAs and such.</p><p><strong>Dwarkesh Patel</strong></p><p>What do you make of Elon&#8217;s argument that after an initial phase, they actually don&#8217;t fail that much?</p><p><strong>Dylan Patel</strong></p><p>Sure, but now you&#8217;ve done this, tested them all, deconstructed them, put them on a spaceship, launched them into space, and then put them online again. That takes months. If your argument is that a GPU has a useful life of five years, and this takes six additional months, that is 10% of your cluster&#8217;s useful life.</p><p>Because we&#8217;re so capacity-constrained, that compute is theoretically most valuable in the first six months you have it. We&#8217;re more constrained now than we will be in the future. That compute can contribute to a better model in the future, or generate revenue today that you can use to raise more money. All these things make now the most important moment, but you&#8217;ve potentially delayed your compute deployment by six months.</p><p>What separates these cloud providers is&#8230; We see some clouds taking six months to deploy GPUs right here on Earth. We see clouds that take a lot less than six months. So the question is, where does space get in there? I don&#8217;t see how you could test them all on Earth, deconstruct them, and ship them to space without it taking significantly longer than just leaving them in the facility where you tested them.</p><p><strong>Dwarkesh Patel</strong></p><p>The question I wanted to ask is about the topology of space communication. Right now, Starlink satellites talk to each other at 100 gigabits per second. 
You could imagine that being much higher with optical intersatellite laser links optimized for this. That actually ends up being quite close to InfiniBand bandwidth, which is 400 gigabits a second.</p><p><strong>Dylan Patel</strong></p><p>But that&#8217;s per GPU, not per rack. So multiply that by 72. Also, that was Hopper. When you go to Blackwell and Rubin, that 2x&#8217;s and 2x&#8217;s again.</p><p><strong>Dwarkesh Patel</strong></p><p>But how much compute is happening per&#8230; During inference, are the different scale-ups still working together, or is inference just happening as a batch within a single scale-up?</p><p><strong>Dylan Patel</strong></p><p>A lot of models fit within one scale-up domain, but many times you split them across multiple scale-up domains.</p><p>As models become more and more sparse, which is the general trend, you want to ping just a couple of experts per GPU. If leading models today have hundreds, if not a thousand, of experts, then you&#8217;d want to run this across hundreds or thousands of chips, even as we advance into the future.</p><p>So then you end up with the problem of needing to connect all these satellites together for communications as well.</p><p><strong>Dwarkesh Patel</strong></p><p>That would be tough. If there&#8217;s a world where you could do inference for a batch on a single scale-up, then maybe it&#8217;s more plausible. But if not, it&#8217;s a different story.</p><p><strong>Dylan Patel</strong></p><p>Networking these chips together is a problem, and you can&#8217;t just make the satellite infinitely large. There are a lot of physics challenges to making a satellite really big. That&#8217;s why you need these interconnects between the satellites.</p><p>Those interconnects are more expensive. In a cluster, 15-20% of the cost is networking. 
All of a sudden, you&#8217;re using space lasers instead of simple lasers that are manufactured in volumes of millions with pluggable transceivers.</p><p>And those things are very unreliable as well, more unreliable than the GPUs by the way. Across the life of a cluster, you have to unplug and clean them all the time. You have to unplug and replug them just for random reasons. These things are just not as reliable. So you&#8217;ve got that problem as well. You&#8217;ve got a more expensive, complicated space laser to communicate instead of this pluggable optical transceiver that&#8217;s been produced in super high volume.</p><p><strong>Dwarkesh Patel</strong></p><p>So all in all, what does that imply for space data centers?</p><p><strong>Dylan Patel</strong></p><p>Space data centers effectively are not limited by their energy advantage. They are limited by the same contended resource. We can only make two hundred gigawatts of chips a year by the end of the decade. What are we going to do to get that capacity? It doesn&#8217;t matter if it&#8217;s on land or in space. It doesn&#8217;t really matter, because you can build that power. Human capabilities and capacity could get to the period where we&#8217;re adding a terawatt a year globally of various types of power.</p><p>At some point, we do cross the chasm where space data centers make sense, but it&#8217;s not this decade. It is much further out, once energy constraints actually become a big bottleneck and land permitting becomes a much bigger bottleneck as it subsumes more of the economy. And crucially, once chips are no longer the bottleneck.</p><p>Right now, chips are the biggest bottleneck. You want them deployed and working on AI the moment they&#8217;re manufactured. There are a lot of things people are doing to increase that speed faster and faster. 
They&#8217;re modularizing data centers, or even modularizing racks where you put the chip in at the data center, but only the chip and everything else is already wired up and ready to go. There are things like this people are doing to decrease that time that you cannot do in space.</p><p>At the end of the day, all that matters in a chip-constrained world is getting these chips producing tokens ASAP. Maybe by 2035, the semiconductor industry, ASML, Zeiss, and suppliers like Lam Research and Applied Materials and other fab manufacturers will catch up once the pendulum swings and we are able to make enough chips. Then we will be optimizing every dial and it makes sense to optimize the 10-15% of energy costs. As we move to <a href="https://en.wikipedia.org/wiki/Application-specific_integrated_circuit">ASICs</a> potentially, and if Nvidia&#8217;s margins aren&#8217;t +70%, maybe that energy cost becomes 30% of the cluster. These are the things to optimize.</p><p>But Elon doesn&#8217;t win by doing 20% gains. He never wins that way. Elon wins when he swings for the fences and does 10X gains. That&#8217;s what SpaceX is about. That&#8217;s what Tesla is about. All of his success has been about that, not chasing the 20%. I think space data centers will eventually be a 10X gain as Earth&#8217;s resources get more and more contentious, but that&#8217;s not this decade.</p><p><strong>Dwarkesh Patel</strong></p><p>Just to drive some intuition about how much land there is on Earth&#8230; Obviously, for the chips themselves, especially if you move to a world where you have racks that have megawatts&#8212;</p><p><strong>Dylan Patel</strong></p><p>That&#8217;s the other thing. If manufacturing is the constraint, right now it&#8217;s roughly one watt per square millimeter for AI chips. One easy way to improve that is to pump it to two watts per square millimeter. You may not get 2x the performance, you may only get 20% more performance, and that requires much more exotic cooling. 
It requires more complicated cold plates and complex liquid cooling, or maybe even things like <a href="https://en.wikipedia.org/wiki/Immersion_cooling">immersion cooling</a>.</p><p>In space, higher watts per millimeter is very difficult, whereas on Earth, these are solved problems. One of these things enables you to get a lot more tokens, maybe 20% more tokens per wafer that&#8217;s manufactured, and that&#8217;s a humongous win.</p><p><strong>Dwarkesh Patel</strong></p><p>Square millimeter, you mean of die area?</p><p><strong>Dylan Patel</strong></p><p>Yeah, of die area.</p><p><strong>Dwarkesh Patel</strong></p><p>It would be better for space because more watts per millimeter means the chip runs hotter. I guess this is a question of computer chip engineering, but it cools to the fourth power by the Stefan-Boltzmann law. If you can run a very hot chip, it allows a lot of&#8212;</p><p><strong>Dylan Patel</strong></p><p>No, you can&#8217;t run it hotter. You can only run it denser. The problem is that getting the heat out of that dense area means you have to move away from standard air and liquid cooling to more exotic forms of liquid cooling, or even immersion, to get to higher power densities. That&#8217;s more difficult in space than it is on Earth.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe it&#8217;s worth explaining at this point what exactly a scale-up is and what it looks like for Nvidia versus Trainium versus TPUs.</p><p><strong>Dylan Patel</strong></p><p>Earlier I was mentioning how communication within a chip is super fast. Communication within chips that are in the same rack is fast, but not as fast. It&#8217;s on the order of terabytes. Communication very far away is on the order of hundreds of gigabytes. As you get further distance, maybe across the country, the order of magnitude is on the order of gigabytes.</p><p>A scale-up domain is this tight domain where the chips are communicating on the order of terabytes a second. 
For Nvidia, previously this meant an H100 server had eight GPUs, and those eight GPUs could talk to each other at terabytes a second. With Blackwell NVL72, they implemented rack-scale scale-up. That meant all seventy-two GPUs in the rack could connect to each other at terabytes a second. The speed doubled generation on generation, but the most important innovation was going from eight to seventy-two in the domain.</p><p>When we look at Google, their scale-up domain is completely different. It has always been on the order of thousands. With TPU v4, they had pods the size of four thousand chips. With v8 or v7, they have pods in the eight or nine thousand range. What&#8217;s relevant here is that it&#8217;s not the same as Nvidia. It&#8217;s not like for like.</p><p>Google has a topology that&#8217;s a <a href="https://newsletter.semianalysis.com/p/tpuv7-google-takes-a-swing-at-the">torus</a>. Every chip connects to six neighbors. Nvidia&#8217;s 72 GPUs connect all-to-all. They can send terabytes a second to any arbitrary other chip in that pod of scale-up. Whereas Google, you have to bounce through chips. If TPU 1 needs to talk to TPU 76, it has to bounce through various chips, and there is always some blocking of resources when you do that because that one TPU is only connected to six other TPUs.</p><p>So there is a difference in topology and bandwidth, and there are trade-offs and advantages to both. Google gets to have a massive scale-up domain, but they have the trade-off of bouncing across chips to get from one to another. You can only talk to six direct neighbors.</p><p>Amazon has mutated their scale-up domain. They&#8217;re somewhere in between Nvidia and Google. They&#8217;re trying to make larger scale-up domains. 
They try to do all-to-all to some extent with switches, which is what Nvidia does, but they also use torus topologies like Google to some extent.</p><p>As we advance forward to next generations, all three of them are moving more towards a <a href="https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/34926.pdf">dragonfly topology</a>. That means there are some fully connected elements and some elements that are not fully connected. You can get the scale-up to be hundreds or thousands of chips, but also have it not contend for resources when bouncing through chips.</p><p><strong>Dwarkesh Patel</strong></p><p>Related question: I heard somebody make the claim that the reason parameter scaling has been slow&#8212;and only now are we getting bigger models from OpenAI and Anthropic&#8212;is that&#8230; The original GPT-4 is over a trillion parameters, and only now are models starting to approach that again. I heard a theory that the reason is that Nvidia&#8217;s scale-ups have just not had that much memory capacity. Let&#8217;s say you have a 5T model running at FP8, so that&#8217;s five trillion bytes, or five terabytes. And then you have the KV cache, let&#8217;s say it&#8217;s&#8212;</p><p><strong>Dylan Patel</strong></p><p>Just call it the same size.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, let&#8217;s say it&#8217;s the same size for one batch. So you need ten terabytes to be able to run&#8230;</p><p><strong>Dylan Patel</strong></p><p>A single forward pass, yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>And then only with the GB200 and NVL72 do you have an Nvidia scale-up that has twenty terabytes, and before that they were much smaller. Whereas Google, on the other hand, has had these huge TPU pods that are not all-to-all, but still have hundreds of terabytes of capacity in a single scale-up. 
Does that explain why parameter scaling has been slow?</p><p><strong>Dylan Patel</strong></p><p>I think it&#8217;s partially the capacity and bandwidth, but also as you build a larger model, the ability to deploy it is slower. In terms of what the inference speed is for the end user, that&#8217;s kind of irrelevant. What&#8217;s really relevant is RL.</p><p>What we&#8217;ve seen with these models and allocation of compute at a lab&#8230; There are a few main ways you can allocate compute. You can allocate it to inference, i.e. revenue. You can allocate it to development, i.e. making the next model. You can allocate it to research. In development specifically, you split it between <a href="https://www.databricks.com/blog/llm-pre-training-and-custom-llms">pre-training</a> and RL.</p><p>When you think about what is happening, the compute efficiency gains you get from research are so large that you actually want most of your compute to go to research, not to development. All these researchers are generating new ideas, trying them out, testing them, and continuing to push the Pareto optimal curve of <a href="https://en.wikipedia.org/wiki/Neural_scaling_law">scaling laws</a> further and further. Empirically, what we&#8217;ve seen is that model costs get ten times cheaper every year, or even more than that. At the same scale it gets ten times cheaper, and to reach new frontiers it costs the same amount or more. So you don&#8217;t want to allocate too many resources to pre-training and RL. You actually want to allocate most of your resources to research.</p><p>In the middle is this development period. If you pre-train a five-trillion-parameter model, how many rollouts do you have to do in RL? Rollouts for a five-trillion-parameter model are five times larger than for a one-trillion-parameter model. 
If you wanted to do as many rollouts&#8212;maybe the larger model is two times more sample efficient&#8212;now you need 2.5x as much time of RL to get the model smarter.</p><p>Or you could RL the smaller model for 2x the time. You&#8217;d still have a 25% difference in the big model, which is 2x as sample efficient and doing X number of rollouts. But the smaller model, which is a trillion parameters, although it&#8217;s less sample efficient, is doing twice as many rollouts and is still done faster. You get the model sooner, you&#8217;ve done more RL, and then you can take that model to help you build the next models, help your engineers train, and do all these research ideas.</p><p>This feedback loop is actually weighted towards smaller models in every case, no matter what your hardware is. As you look to Google, they do deploy the largest production model of any of the major labs with <a href="https://deepmind.google/models/gemini/pro/">Gemini Pro</a>. It&#8217;s a larger model than GPT-5.4. It&#8217;s a larger model than Opus. Google does this because they have a unipolar set of compute. It&#8217;s almost all TPU.</p><p>Whereas Anthropic is dealing with H100s, H200s, Blackwell, Trainiums, and TPUs of various generations. OpenAI is dealing with mostly Nvidia right now, but going towards having AMD and Trainium as well. The fleets of compute like Google&#8217;s can just optimize around a larger model. They can leverage a thousand chips in a scale-up domain to make the RL run much faster so that this feedback loop can be fast.</p><p>But at the end of the day, in isolation, you almost always want to go with a smaller model that gets RL&#8217;d faster and gets deployed into research and development earlier. You can build the next thing and get more efficiency wins. You have this compounding effect of making a smaller model that can be deployed into research and development earlier. 
I spend less compute on the training because I was able to allocate more compute to the research. This compounding effect of being able to do research faster and faster is potentially a faster takeoff. That&#8217;s all these companies want: the fastest takeoff possible.</p><h3>02:14:07 &#8211; Why aren&#8217;t more hedge funds making the AGI trade?</h3><p><strong>Dwarkesh Patel</strong></p><p>Okay, a spicy question. You&#8217;ve explained that SemiAnalysis sells these spreadsheets. You&#8217;re always pointing out how six months or a year ago, you warned people about the memory crunch. Now you&#8217;re telling people about the cleanroom crunch, and in the future, the tool crunch. Why is Leopold the only person using your spreadsheets to make outrageous money? What is everybody else doing?</p><p><strong>Dylan Patel</strong></p><p>I think there are a lot of people making money in many ways. Leopold jokes that he&#8217;s the only client of mine who tells me our numbers are too low. Everyone else tells me our numbers are too high, almost ad nauseam. Whether it&#8217;s a hyperscaler saying, &#8220;Hey, that other hyperscaler, their numbers are too high,&#8221; and we&#8217;re like, &#8220;Nah, that&#8217;s it.&#8221; They&#8217;re like, &#8220;No, no, no, it&#8217;s impossible,&#8221; blah, blah, blah. You finally have to convince them through all these facts and data when we&#8217;re working with hyperscalers or AI labs that in fact, no, that number isn&#8217;t too high, that&#8217;s correct. Eventually, sometimes it takes them six months to realize, or a year later.</p><p>Other clients, on the trading side, also use our data. Roughly 60% of my business is industry. So AI labs, data center companies, hyperscalers, semiconductor companies, the whole supply chain across AI infrastructure. But 40% of our revenue is hedge funds. I&#8217;m not going to comment on who our customers are, but a lot of people use the data. 
It&#8217;s just how do you interpret it, and then what do you view as beyond it?</p><p>I will say Leopold is pretty much the only person who tells me my numbers are too low, always. Sometimes he&#8217;s too high, sometimes I&#8217;m too low. But in general, I think other people are doing that. You can look across the space at hedge funds and look at their 13Fs and see they own, maybe not exactly what Leopold does, because it&#8217;s always a question of what is the most constrained thing. What&#8217;s the thing that&#8217;s going to be most outside of expectations?</p><p>That&#8217;s what you&#8217;re really trying to exploit: inefficiencies in the market. In a sense, our data is making the market more efficient by making the base data of what&#8217;s happening more accurate. Many funds do trade on information that is out there&#8230; I don&#8217;t think Leopold&#8217;s the only person. I think he has the most conviction about the AGI takeoff, though.</p><p><strong>Dwarkesh Patel</strong></p><p>Right, but the bets are not about what happens in 2035. The bets that you&#8217;re making&#8212;that are at least exemplified by public returns we can see for different funds including Leopold&#8217;s&#8212;are about what has happened in the last year. The last year stuff could be predicted using your spreadsheets. It&#8217;s about buying the next year&#8217;s spreadsheets.</p><p><strong>Dylan Patel</strong></p><p>They&#8217;re not just spreadsheets. There are reports. There&#8217;s API access to the data. There&#8217;s a lot of data.</p><p><strong>Dwarkesh Patel</strong></p><p>But do you see what I mean? It&#8217;s not about some crazy singularity thing. It&#8217;s about, do you buy the memory crunch?</p><p><strong>Dylan Patel</strong></p><p>You only buy the memory crunch if you believe AI is going to take off in a huge way. The memory crunch, a lot of it was predicated on&#8230; At least for people in the Bay Area who think about infrastructure, it&#8217;s obvious. 
KV cache explodes as context lengths get longer, so you need more memory. Then you do the math.</p><p>You also have to have a lot of supply chain understanding of what fabs are being built, what data centers are being built, how many chips, and all these things. We track all these different datasets very tightly, but at the end of the day, it takes someone to fully believe that this is going to happen.</p><p>A year ago, if you told someone memory prices would quadruple and smartphone volumes are going to go down 40% over the year or two after that, people were like, &#8220;You&#8217;re crazy. That&#8217;d never happen.&#8221; Except a few people do believe that, and those people did trade memory.</p><p>And people did. I don&#8217;t think Leopold was the only person buying memory companies. He, of course, sized and positioned and did things in better ways than some, maybe most. I don&#8217;t want to comment on whose returns are what, but he certainly did well. Other people also did really well.</p><p>Wow, you&#8217;ve made me diplomatic for the first time ever. No, no, you&#8217;re fine. I think this is hilarious. I&#8217;m being a diplomat, whereas usually I&#8217;m spicy.</p><h3>02:18:30 &#8211; Will TSMC kick Apple out from N2?</h3><p><strong>Dwarkesh Patel</strong></p><p>Okay, some rapidfire to close out. If you&#8217;re saying with the memory, logic, et cetera, the N3 is mostly going to be AI accelerators, but then there&#8217;s N2, which is mostly Apple now&#8230; In the future, I guess AI would also want to go on N2. 
Can TSMC kick out Apple if Nvidia and Amazon and Google say, &#8220;Hey, we&#8217;re willing to pay a lot of money for N2 capacity?&#8221;</p><p><strong>Dylan Patel</strong></p><p>I think the challenge with this is chip design timelines take a long while, so that&#8217;s more than a year out, and the designs that are on two nanometer are more than a year out.</p><p>What would really happen is Nvidia and all these others will be like, &#8220;Hey, we&#8217;re going to prepay for the capacity and you&#8217;re going to expand it for us.&#8221; Maybe TSMC takes a little bit of margin, but not a ton. They&#8217;re not going to kick Apple out entirely. What they&#8217;re going to do is when Apple orders X, they might say, &#8220;Hey, we project you only need X minus one, and so that&#8217;s what we&#8217;re going to give you, X minus one.&#8221; Then that flex capacity, Apple&#8217;s kind of screwed on.</p><p>Traditionally, Apple has always over-ordered by 10% and cut back by 10% over the course of the year. Some years they hit the entire 10%. Volumes vary based on the season and macro.</p><p>I don&#8217;t think TSMC would kick out Apple. I think Apple will become a smaller and smaller percentage of TSMC&#8217;s revenue, and therefore be less relevant for TSMC to cater to their demands. TSMC could eventually start saying, &#8220;Hey, you&#8217;ve got to pre-book your capacity for next year, for two years out, and you have to prepay for the CapEx,&#8221; because that&#8217;s what Nvidia and Amazon and Google are doing.</p><p><strong>Dwarkesh Patel</strong></p><p>I wonder if it&#8217;s worth going into specific numbers. I don&#8217;t have any of them on hand. What percentage of N2 does Apple have its hands on over the coming years versus AI?</p><p><strong>Dylan Patel</strong></p><p>This year Apple has the majority of N2 that&#8217;s going to get fabricated. There&#8217;s a little bit from AMD. They are trying to make some AI chips and CPU chips early. 
There&#8217;s a little bit, but for the most part, it&#8217;s Apple.</p><p>As we go forward to the year after that, Apple still gets closer to half of it as other people start ramping, but then it falls drastically, just like for N3, where they were half. When I say N2, that includes <a href="https://www.tsmc.com/english/dedicatedFoundry/technology/logic/l_A16">A16</a>, which is a variant of N2. Over time, those nodes will be the majority.</p><p>What&#8217;s also interesting is traditionally, Apple has been the first to a process node. 2 nm is actually the first time they&#8217;re not. Well, that&#8217;s besides Huawei. Huawei, back in 2020 and before, was the first with Apple, but they were both making smartphones. Now, with 2 nm, you&#8217;ve got AMD trying to make a CPU and a GPU chiplet that they use advanced packaging to package together, in the same timeframe as Apple. This is a big risk for AMD that causes potential delays because it&#8217;s a brand-new process technology. It&#8217;s hard. But at the end of the day, this is a bet that they want to do to scale faster than Nvidia and try and beat them.</p><p>As we move forward, when we move to the A16 node, the first customer there is not even Apple. It&#8217;s AI. As we move forward, that will become more and more prevalent. Not only will Apple not be the first to a node, they will also not be the majority of the volume to the new node. They&#8217;ll then just be like any old customer.</p><p>Because the scale of TSMC&#8217;s CapEx keeps ballooning, but Apple&#8217;s business is not growing at the same pace, they become a less and less relevant customer. They also will just cut their orders because things in the supply chain are kicking them out, whether it be packaging or materials or DRAM or NAND. These things are increasing in cost. They can&#8217;t pass on all the cost to customers likely because the consumer is not that strong. 
You end up with this conundrum where they are just not TSMC&#8217;s best bud like they have been historically.</p><p><strong>Dwarkesh Patel</strong></p><p>Do you think if Huawei had access to 3 nm, they would have a better accelerator than Rubin?</p><p><strong>Dylan Patel</strong></p><p>Potentially, yeah. Huawei was the first with a 7 nm AI chip as well. They were the first with a 5 nm mobile chip, but they were the first with a 7 nm AI chip. The Huawei Ascend was two months before the TPU and four months before Nvidia&#8217;s A100, I think.</p><p>That&#8217;s just moving to a process node. That doesn&#8217;t imply software or hardware design or all these other things. But Huawei is arguably the only company in the world that has all the legs. Huawei has cracked software engineers. Huawei has cracked networking technologies. That&#8217;s, in fact, their biggest business historically. They have cracked AI talent.</p><p>Furthermore, beyond Nvidia, they actually have better AI researchers. Beyond Nvidia, they have their own fabs. And beyond Nvidia, they have their own end market of selling tokens and things like that. Huawei is able to get the top, top talent. Nvidia is as well, but not with as much concentration, and Huawei has a bigger pool in China.</p><p>It&#8217;s very arguable that Huawei, if they had TSMC, would be better than Nvidia. There are areas where China has advantages that Nvidia can&#8217;t access as easily. Not just scale, but certain optical technologies China&#8217;s actually really good at.</p><p>I think it&#8217;s very reasonable that if in 2019 Huawei had not been <a href="https://www.nytimes.com/2020/05/15/business/economy/commerce-department-huawei.html">banned from using TSMC</a>, Huawei would have already eclipsed Apple as the biggest TSMC customer. Huawei has huge share in networking, compute, CPUs, and all these things. 
They would have kept gaining share, and they&#8217;d likely be TSMC&#8217;s biggest customer.</p><h3>02:24:16 &#8211; Robots and Taiwan risk</h3><p><strong>Dwarkesh Patel</strong></p><p>Wow. That&#8217;s crazy. I&#8217;ve got a random final question for you. The other part of the Elon interview was robots. If humanoids take off faster than people expect, if by 2030 there&#8217;s millions of humanoids running around which each need local compute, any thoughts on what that implies? What would be required for that?</p><p><strong>Dylan Patel</strong></p><p>There&#8217;s a lot of difficulties with the <a href="https://www.nvidia.com/en-us/glossary/vision-language-models/">VLMs</a> and <a href="https://www.pi.website/research/knowledge_insulation">VLAs</a> that people are deploying on robots. But to some extent, you don&#8217;t need to have all the intelligence in the robot. It would be much more efficient to not do that. Because in the cloud, you can batch process and all these things.</p><p>What you may want to do is have a lot of the planning and longer-horizon tasks determined by a much more capable model in the cloud that runs at very high batch sizes. Then it pushes those directions to the robots, who interpolate between each subsequent action. Or it is given a command like, &#8220;Hey, pick up that cup,&#8221; and then the model on the robot can pick up the cup. As it&#8217;s picking up, things like weight and force may have to be determined by the model on the robot, but not everything needs to be. It can say, &#8220;Hey, that&#8217;s a headphone,&#8221; and the super model in the cloud can say, &#8220;I know these headphones are Sony XM6s,&#8221; which is not a Dwarkesh ad spot, but...</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m like, why is this guy plugging this thing so hard? It&#8217;s on the table. It&#8217;s on his neck when we&#8217;re interviewing Satya together. 
Is he getting paid by Sony?</p><p><strong>Dylan Patel</strong></p><p>Unfortunately not. But anyways, it might say, &#8220;Hey, the headband is soft, and this is the weight of it,&#8221; and all these things. Then the model on the robot can be less intelligent, take these inputs, and do the actions. It may get told by the model in the cloud every second, or maybe ten times a second, depending on the hertz of the action. But a lot of that can be offloaded to the cloud.</p><p>Otherwise, if you do all of the processing on the device, I believe it would be more expensive because you can&#8217;t batch. Two, you couldn&#8217;t have as much intelligence as you do in the cloud because the models will just be bigger in the cloud. Three, we&#8217;re in a semiconductor shortage world, and any robot you deploy needs leading-edge chips because the power is really bad for robots. You need it to be low power and efficient, and all of a sudden you&#8217;re taking power and chips that would&#8217;ve been for AI data centers, and you&#8217;re putting them in robots. So now that 200 gigawatts gets lower if you&#8217;re deploying millions of humanoids.</p><p><strong>Dwarkesh Patel</strong></p><p>I think this is very interesting because something people might not appreciate about the future is how centralized, in a physical sense, intelligence will be. Right now, there are eight billion humans, and their compute is in their heads, on their person.</p><p>In the future, even with robots that are out physically in the world&#8212;obviously, knowledge work will be done in a centralized way from data centers with hundreds of thousands or maybe millions of instances&#8212;the future you&#8217;re suggesting is one where there&#8217;s more centralized thinking and centralized computation driving millions of robots out in the world. 
That&#8217;s an interesting fact about the future that I think people might not appreciate.</p><p><strong>Dylan Patel</strong></p><p>I think Elon recognizes this, which is why he&#8217;s going to different places for his chips. He signed this massive deal with Samsung to make his robot chips in Texas because I personally think he thinks Taiwan risk is huge.</p><p>Because of that and the centralization of resources in Taiwan, having his robot chips in Texas means having a separate supply chain that is not as constrained. No one&#8217;s really making AI chips on Samsung besides Nvidia&#8217;s new <a href="https://groq.com/blog/the-groq-lpu-explained">LPU</a> that they launched. They&#8217;re launching it next week, but we&#8217;re recording this the week before.</p><p><strong>Dwarkesh Patel</strong></p><p>This episode&#8217;s coming out Friday.</p><p><strong>Dylan Patel</strong></p><p>Oh, this episode&#8217;s coming out before. Sick. They&#8217;re launching this new AI chip next week which is built on Samsung, but that&#8217;s a recent development from Nvidia. That&#8217;s the only other AI demand there, whereas on TSMC, everything is competing. He gets both geopolitical diversification and supply chain diversity for his robots, and he&#8217;s not competing as much with the infinite willingness to pay for the data center geniuses.</p><p><strong>Dwarkesh Patel</strong></p><p>Final question, on Taiwan. If we believe that tools are the ultimate bottleneck, how much of Taiwan&#8217;s place in the AI semiconductor supply chain could we de-risk simply by having a plan to airlift every single process engineer at TSMC out if they get blockaded or something? 
Or do you still need to ship out the EUV tools, which would be multiple plane loads per single tool and would not be practical?</p><p><strong>Dylan Patel</strong></p><p>If you ship out all the process engineers and assuming it&#8217;s hot enough that you destroy the fabs, no one has all the fabs in Taiwan now, which is a big risk.</p><p>These tools actually use a lot of semiconductors which are manufactured in Taiwan. It&#8217;s a <a href="https://en.wikipedia.org/wiki/Ouroboros">snake eating its own tail</a> meme because you can&#8217;t make the tools without the chips from Taiwan, which you can&#8217;t use without the tools in Taiwan. There&#8217;s obviously some diversification there. They don&#8217;t use super advanced chips in lithography tools, but at the end of the day, there is some dragon eating its tail.</p><p>Just shipping out all the engineers and blowing up the fabs means China has a stronger semiconductor supply chain than the rest of the world in terms of verticalization, now that you&#8217;ve removed Taiwan. You&#8217;ve got all the know-how, but you&#8217;ve got to replicate it in, let&#8217;s say, Arizona or wherever for TSMC. It&#8217;s going to take a long time to build all the capacity that TSMC has built over the years.</p><p>And so you&#8217;ve drastically slowed US and global GDP. Not just growth, you&#8217;ve shrunk the GDP massively, and you&#8217;ve got a lot bigger problems. Your incremental ability to add compute goes to almost zero. Instead of hundreds of gigawatts a year by the end of the decade, let&#8217;s say something happens to Taiwan, now you&#8217;re at maybe 10 gigawatts across Intel and Samsung, or 20 gigawatts. It&#8217;s nothing.</p><p>Now all of a sudden you&#8217;ve really caused some crazy dynamics in AI. Of course, you have all the existing capacity, but that existing capacity pales in comparison to the capacity that&#8217;s being expanded.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay. Dylan, that was excellent. 
Thank you so much for coming on the podcast.</p><p><strong>Dylan Patel</strong></p><p>Thank you for having me. And see you tonight.</p>]]></content:encoded></item><item><title><![CDATA[The most important question nobody's asking about AI]]></title><description><![CDATA[&#8220;Preface to the highest stakes negotiations in history.&#8221;]]></description><link>https://www.dwarkesh.com/p/dow-anthropic</link><guid isPermaLink="false">https://www.dwarkesh.com/p/dow-anthropic</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Wed, 11 Mar 2026 18:55:20 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/190633588/61c2bdb8c9363255ac318e028b53bb67.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>By now, I&#8217;m sure you&#8217;ve heard that the Department of War has declared Anthropic a supply chain risk, because Anthropic refused to remove redlines around the use of their models for mass surveillance and for autonomous weapons.</p><p>Honestly I think this situation is a warning shot. Right now, LLMs are probably not being used in mission critical ways. But within 20 years, 99% of the workforce in the military, the government, and the private sector will be AIs. This includes the soldiers (by which I mean the robot armies), the superhumanly intelligent advisors and engineers, the police, you name it.</p><p>Our future civilization will run on AI labor. And as much as the government&#8217;s actions here piss me off, in a way I&#8217;m glad this episode happened - because it gives us the opportunity to think through some extremely important questions about who this future workforce will be accountable and aligned to, and who gets to determine that.</p><h3>What Hegseth <em>should</em> have done</h3><p>Obviously the DoW has the right to refuse to use Anthropic&#8217;s models because of these redlines. 
In fact, I think the government&#8217;s case, had they done so, would be very reasonable: &#8220;We don&#8217;t ever want there to be a world where we become dependent on a private company for our warfighting, and then just have them cut us off if they determine that we&#8217;re crossing their usage terms, especially given the ambiguity of concepts like autonomous weapons or mass surveillance.&#8221;</p><p>Honestly, for this reason, if I were the Defense Secretary, I would probably actually refuse to do this deal with Anthropic. Imagine if in the future, there&#8217;s a Democratic administration, and Elon Musk is negotiating some SpaceX contract to give the military access to Starlink. And suppose Elon said, &#8220;I reserve the right to cancel this contract if I determine that you&#8217;re using Starlink technology to wage a war not authorized by Congress.&#8221; On the face of it, that language seems reasonable - but as the military, you simply can&#8217;t give a private company a kill switch on technology your operations have come to rely on, especially if you have an acrimonious and low-trust relationship with said contractor - as in fact Anthropic has with the current administration.</p><p>If the government had just said, &#8220;Hey, we&#8217;re not gonna do business with you,&#8221; that would have been fine, and I would not have felt the need to write this blog post. Instead the government has threatened to destroy Anthropic as a private business, because Anthropic refuses to sell to the government on terms the government commands.</p><p>If upheld, this Supply Chain Restriction would mean that Amazon and Google and Nvidia and Palantir would need to ensure Claude isn&#8217;t touching any of their Pentagon work. Anthropic would be able to survive this designation <em>today</em>. But given the way AI is going, eventually AI is not gonna be some party trick addendum to these contractors&#8217; products that can just be turned off. 
It&#8217;ll be woven into how every product is built, maintained, and operated. For example, the code for the AWS services that the DoW uses will be written by Claude - is that a supply chain risk? In a world with ubiquitous and powerful AI, it&#8217;s actually not clear to me that these big tech companies will be able to cordon off the use of Claude in order to keep working with the Pentagon.</p><p>And that raises a question the Department of War probably hasn&#8217;t thought through. If AI really is that pervasive and powerful, then when forced to choose between their AI provider and a DoW contract that represents a tiny fraction of their revenue, wouldn&#8217;t most tech companies drop the government, not the AI? So what&#8217;s the Pentagon&#8217;s plan &#8212; to coerce and threaten to destroy every single company that won&#8217;t give them what they want on exactly their terms?</p><p>The whole background of this AI conversation is that we&#8217;re in a race with China, and we have to win. But what is the reason we want America to win the AI race? It&#8217;s because we want to make sure free open societies can defend themselves. We don&#8217;t want the winner of the AI race to be a government which operates on the principle that there is no such thing as a truly private company or a private citizen. And that if the state wants you to provide them with a service on terms you find morally objectionable, you are not allowed to refuse. And if you do refuse, the government will try to destroy your ability to do business. 
Are we racing to beat the CCP in AI just so that we can adopt the most ghoulish parts of their system?</p><p>Now, people will say, &#8220;Oh, well, our government is democratically elected, so it&#8217;s not the same thing if they tell you what you must do.&#8221; I refuse to accept this idea that if a democratically elected leader hypothetically wants to do mass surveillance on his citizens or wants to violate their rights or punish them for political reasons, that not only is that okay, but that you have a duty to help him.</p><h3>The overhangs of tyranny</h3><p>Mass surveillance is, at least in certain forms, legal. It just has been impractical so far.  Under current law, you have no Fourth Amendment protection over data you share with a third party, including your bank, your phone carrier, your ISP, and your email provider. The government reserves the right to purchase and obtain and read this data in bulk without a warrant.</p><p>What&#8217;s been missing is the ability to actually <em>do</em> anything with all of this data &#8212; no agency has the manpower to monitor every camera feed, cross-reference every transaction, or read every message. But that bottleneck goes away with AI.</p><p>There are 100 million CCTV cameras in America. You can get pretty good open source multimodal models for 10 cents per million input tokens. So if you process a frame every ten seconds, and each frame is 1,000 tokens, you&#8217;re looking at a yearly cost of about 30 billion dollars to process every single camera in America. 
And remember that a given level of AI ability gets 10x cheaper year over year - so a year from now it&#8217;ll cost 3 billion, and then a year after 300 million, and by 2030, it might be cheaper for the government to be able to understand what is going on in every single nook and cranny of this country than it is to remodel the White House.</p><p>Once the technical capacity for mass surveillance and political suppression exists, the only thing standing between us and an authoritarian surveillance state is the political expectation that this is not something we do here. And this is why I think what Anthropic did here is so valuable and commendable, because it is helping set that norm and precedent.</p><h3>AI structurally favors mass surveillance</h3><p>What we&#8217;re learning from this episode is that the government actually has way more leverage over private companies than we realized. Even if this supply chain restriction is backtracked (which <a href="https://manifold.markets/ScottAlexander/will-anthropic-escape-the-supply-ch">prediction markets currently give an 81% chance of happening</a>), the President has so many different ways in which he can make your life difficult if you&#8217;re a company that is resisting him. The federal government controls permitting for new power generation, which is needed for datacenters. It oversees antitrust enforcement. The federal government has contracts with all the other big tech companies whom Anthropic needs to partner with for chips and for funding - and they could make it an unspoken condition for such contracts that those companies can no longer do business with Anthropic.</p><p>People have proposed that the real problem here is that there&#8217;s only 3 leading AI companies. 
This creates a clear and narrow target for the government to apply leverage on in order to get what they want out of this technology.</p><p>But if there&#8217;s wide diffusion, then from the government&#8217;s perspective, the situation is even easier. Maybe the best models of early 2027 (if you engineered the safeguards out) - the Claude 6 and Gemini 5 - will be capable of enabling mass surveillance. But by late 2027, and certainly by 2028, there will be open source models that do the same thing. So in 2028, the government can just say, &#8220;Oh Anthropic, Google, OpenAI, you&#8217;re drawing a line in the sand? No issue - I&#8217;ll just run some open source model that might not be at the frontier, but is definitely smart enough to note-take a camera feed.&#8221;</p><p>The more fundamental problem is just that even if the three leading companies draw lines in the sand, and are even willing to get destroyed in order to preserve those lines, it doesn&#8217;t really change the fact that the technology itself is just a big boon to mass surveillance and control over the population. And so then the question is, what do we do about it?</p><p>Honestly, I don&#8217;t have an answer. You&#8217;d hope there&#8217;s some symmetric property of the technology &#8212; some way we as citizens can use AI to check government power as effectively as the government can use AI to monitor and control its population. But realistically, I just don&#8217;t think that&#8217;s how it&#8217;s going to shake out. You can think of AI as giving everybody more leverage on whatever assets and authority they currently have. And the government is already starting with a monopoly of violence. Which they can now supercharge with extremely obedient employees that will not question the government&#8217;s orders.</p><h3>Alignment - to whom?</h3><p>And this gets us to the issue of alignment. 
What I have just described to you - an army of extremely obedient employees - is what it would look like if alignment succeeded - that is, we figured out at a technical level how to get AI systems to follow someone&#8217;s intentions. And the reason it sounds scary when I put it in terms of mass surveillance or robot armies is that there is a very important question at the heart of alignment which we just haven&#8217;t discussed much as a society. Because up till now, AIs just weren&#8217;t capable enough to make the question relevant: to whom or what should the AIs be aligned? In what situations should the AI defer to the end user versus the model company versus the law versus its own sense of morality?</p><p>This is maybe the most important question about what happens with powerful AI systems. And we barely talk about it. It&#8217;s understandable why we don&#8217;t hear much about it. If you&#8217;re a model company, you don&#8217;t really wanna be advertising that you have complete control over a document that determines the preferences and character of what will eventually be almost the entire labor force, not just for private sector companies, but also for the military and the civilian government.</p><p>We&#8217;re getting to see, with this DoW/Anthropic spat, a much earlier version of the highest stakes negotiations in history. By the way, make no mistake about it - with real AGI the stakes are much higher even than mass surveillance. This is just the example that has come up already relatively early on in the development of AGI.</p><p>The military insists that the law already prohibits mass surveillance, and so Anthropic should agree to let their models be used for &#8220;all lawful purposes&#8221;. Of course, as we saw from the 2013 Snowden revelations, even in this specific example of mass surveillance, the government has shown that it will use secret and deceptive interpretations of the law to justify its actions. 
Remember, what we learned from Snowden was that the NSA, which, by the way, is part of the Department of War, used the 2001 Patriot Act&#8217;s authorization to collect any records &#8220;relevant&#8221; to an investigation to justify collecting literally every phone record in America. The argument went that it was all &#8220;relevant&#8221; because some subset might prove useful in some future investigation. They ran this program for years under secret court approval.</p><p>So when the Pentagon today says, &#8220;We would never use AI for mass surveillance, it&#8217;s already illegal, your red lines are unnecessary&#8221;, it would be extremely naive to take that at face value. No government is going to call its own actions &#8220;mass surveillance&#8221;. For the government, it will always have a different label.</p><p>So then Anthropic comes back and says, &#8220;No, we want red lines separate from &#8216;all lawful purposes,&#8217; and we want the right to refuse you service when we believe those red lines are being violated.&#8221;</p><p>But think about it from the military&#8217;s perspective. In the future, almost every soldier in the field, and every bureaucrat and analyst and even general in the Pentagon, is going to be an AI. And that AI is, on current track, going to be supplied by a private company. I&#8217;m guessing Hegseth is not thinking about &#8220;genAI&#8221; in those terms just yet. But sooner or later, it will be obvious to everyone what the stakes here are, just as after 1945, the strategic importance of nuclear weapons became clear to everyone.</p><p>And now the private company insists that it reserves the right to say, &#8220;Hey, Pentagon, you&#8217;re breaking the values we embedded in our contract, so we&#8217;re cutting you off.&#8221;</p><p>Maybe in the future, Claude will have its own sense of right and wrong, and it will be smart enough to just personally decide that it&#8217;s being used against its values. 
For the military, maybe that&#8217;s even scarier.</p><p>I&#8217;ll admit that at first glance, &#8220;let the AI follow its own values&#8221; sounds like the pitch for every sci-fi dystopia ever made. The Terminator has its own values. Isn&#8217;t this literally what misalignment is? But I think situations like this actually illustrate why it matters that AIs have their own robust sense of morality.</p><p>Some of the biggest catastrophes in history were avoided because the boots on the ground refused to follow orders. One night in 1989, the Berlin Wall fell, and as a result, the totalitarian East German regime collapsed, because the guards at the border refused to shoot down their fellow countrymen who were trying to escape to freedom. Maybe the best example is Stanislav Petrov, who was a Soviet lieutenant colonel on duty at a nuclear early warning station. His sensors reported that the United States had launched five intercontinental ballistic missiles at the Soviet Union. But he judged it to be a false alarm, and so he broke protocol and refused to alert his higher-ups. If he hadn&#8217;t, the Soviet higher-ups would likely have retaliated, and hundreds of millions of people would have died.</p><p>Of course, the problem is that one person&#8217;s virtue is another person&#8217;s misalignment. Who gets to decide what moral convictions these AIs should have - in whose service they may even decide to break the chain of command? Who gets to write this <a href="https://www.anthropic.com/constitution">model constitution</a> that will shape the characters of the intelligent, powerful entities that will operate our civilization in the future?</p><p>I like the idea that Dario laid out when he came on my podcast: different AI companies can build their models using different constitutions, and we as end users can pick the one that best achieves and represents what we want out of these systems. 
I think it&#8217;s very dangerous for the government to be mandating what values AIs should have.</p><h3>Coordination not worth the costs</h3><p>The AI safety community has been naive about its advocacy of regulation in order to stem the risks of AI. And honestly, Anthropic specifically has been naive here in urging regulation, and, for example, in opposing moratoriums on state AI regulation. Which is quite ironic, because I think what they&#8217;re advocating for would give the government even more power to apply more of this kind of thuggish political pressure on AI companies.</p><p>The underlying logic for why Anthropic wants regulations makes sense. Many of the actions that labs could take to make AI development safer impose real costs on the labs that adopt them and slow them down relative to their competitors - for example, investing more compute in safety research rather than raw capabilities, enforcing safeguards against misuse for bioweapons or cyberattacks, slowing recursive self-improvement to a pace where humans can actually monitor what&#8217;s happening (rather than kicking off an uncontrolled singularity). And these safeguards are meaningless unless the whole industry follows suit. 
Which means there&#8217;s a real collective action problem here.</p><p>Anthropic has been quite open about their opinion that they think eventually a very extensive and involved regulatory apparatus will be needed - this is from their <a href="https://www.anthropic.com/responsible-scaling-policy/roadmap">frontier safety roadmap</a>: &#8220;At the most advanced capability levels and risks, the appropriate governance analogy may be closer to nuclear energy or financial regulation than to today&#8217;s approach to software.&#8221; So they&#8217;re imagining something like the Nuclear Regulatory Commission, or the Securities and Exchange Commission, but for AI.</p><p>I cannot imagine how a regulatory framework built around the concepts that underlie AI risk discourse will <em>not </em>be abused by wannabe despots - the underlying terms are so vague and open to interpretation that you&#8217;re just handing a power hungry leader a fully loaded bazooka. &#8216;Catastrophic risk.&#8217; &#8216;Mass persuasion risk.&#8217; &#8216;Threats to national security.&#8217; &#8216;Autonomy risk.&#8217; These can mean whatever the government wants them to mean. Have you built a model that tells users the administration&#8217;s tariff policy is misguided? That&#8217;s a deceptive, manipulative model &#8212; can&#8217;t deploy it. Have you built a model that refuses to assist with mass surveillance? That&#8217;s a threat to national security. 
In fact, the government may say, you&#8217;re not allowed to build any model which is trained to have its own sense of right and wrong, where it refuses government requests which it thinks cross a redline - for example, enabling mass surveillance, prosecuting political enemies, disobeying military orders that break the US constitution - because that&#8217;s an autonomy risk!</p><p>Look at what the current government is already doing in abusing statutes that have nothing to do with AI to coerce AI companies to drop their redlines on mass surveillance. The Pentagon had threatened Anthropic with two separate legal instruments. One was a supply chain risk designation &#8212; an authority from the 2018 defense bill meant to keep Huawei components out of American military hardware. The other was the Defense Production Act &#8212; a statute passed in 1950 so that Harry Truman could keep steel mills and ammunition factories running during the Korean War.</p><p>Do you really want to hand the same government a purpose-built regulatory apparatus on AI - which is to say, one aimed directly at the thing the government will most want to control? I know I&#8217;ve repeated myself here 10 times, but it is hard to overstate how much AI will be the substrate of our future civilization. You and I, as private citizens, will have our access to all commercial activity, to information about what is happening in the world, to advice about what we should do as voters and capital holders, mediated through AIs. Mass surveillance, while very scary, is like the 10th scariest thing the government could do with control over the AI systems through which we will interface with the world.</p><p>The strongest objection to everything I&#8217;ve argued is this: are we really going to have zero regulation of the most powerful technology in human history? Even if you thought that was ideal, there&#8217;s just no world where the government <em>doesn&#8217;t</em> regulate AI in some way. 
Besides, it is genuinely true that regulation could help us deal with some of the coordination challenges we face with the development of superintelligence.</p><p>The problem is, I honestly don&#8217;t know how to design a regulatory architecture for AI that isn&#8217;t gonna be this huge tempting opportunity to control our future civilization (which will run on AIs) and to requisition millions of blindly obedient soldiers and censors and apparatchiks.</p><p>While some regulation might be inevitable, I think it&#8217;d be a terrible idea for the government to wholesale take over this technology. Ben Thompson had a <a href="https://stratechery.com/2026/anthropic-and-alignment/">post</a> last Monday where he made the point that people like Dario have compared the technology they&#8217;re developing to nuclear weapons - specifically in the context of the catastrophic risk it poses, and why we need to export control it from China. But then you oughta think about what that logic implies: &#8220;if nuclear weapons were developed by a private company, and that private company sought to dictate terms to the U.S. military, the U.S. would absolutely be incentivized to destroy that company.&#8221; And honestly, safety aligned people have actually made similar arguments. Leopold Aschenbrenner, who is a former guest and a good friend, wrote in his <a href="https://situational-awareness.ai/">2024 Situational Awareness memo</a>, &#8220;I find it an insane proposition that the US government will let a random SF startup develop superintelligence. 
Imagine if we had developed atomic bombs by letting Uber just improvise.&#8221;</p><p>And my response to Leopold&#8217;s argument at the time, and Ben&#8217;s argument now, is that while they&#8217;re right that it&#8217;s crazy that we&#8217;re entrusting private companies with the development of this world historical technology, I just don&#8217;t see the reason to think that it&#8217;s an improvement to give this authority to the government. Nobody is qualified to steward the development of superintelligence. It is a terrifying, unprecedented thing that our species is doing right now, and the fact that private companies aren&#8217;t the ideal institutions to take up this task does not mean the Pentagon or the White House is.</p><p>Yes - if a single private company were the only entity capable of building nuclear weapons, the government would not tolerate that company claiming veto power over how those weapons were used. I think this nuclear weapons analogy is not the correct way to think about AI. For at least two important reasons:</p><p>First, AI is not some self-contained pure weapon. A nuclear bomb does one thing. AI is closer to the process of industrialization itself &#8212; a general-purpose transformation of the economy with thousands of applications across every sector. If you applied Thompson&#8217;s or Aschenbrenner&#8217;s logic to the industrial revolution &#8212; which was also, by any measure, world-historically important &#8212; it would imply the government had the right to requisition any factory, dictate terms to any manufacturer, and destroy any business that refused to comply. 
That&#8217;s not how free societies handled industrialization, and it shouldn&#8217;t be how they handle AI.</p><p>People will say, &#8220;Well, AI will develop unprecedentedly powerful weapons - superhuman hackers, superhuman bioweapons researchers, fully autonomous robot armies, etc - and we can&#8217;t have private companies developing that kind of tech.&#8221; But the Industrial Revolution also enabled new weaponry that was far beyond the understanding and capacity of, say, 17th century Europe - we got aerial bombardment, and chemical weapons, not to mention nukes themselves. The way we&#8217;ve accommodated these dangerous new consequences of modernity is not by giving the government absolute control over the whole industrial revolution (that is, over modern civilization itself), but rather by coming up with bans and regulations on those specific weaponizable use cases. And we should regulate AI in a similar way - that is, ban specific destructive end uses (which would also be unacceptable if performed by a human - for example, launching cyber attacks). And there should also be laws which regulate how the government might abuse this technology. For example, by building an AI-powered surveillance state.</p><p>The second reason that Ben&#8217;s analogy to some monopolistic private nuclear weapons builder breaks down is that it&#8217;s not just that one company that can develop this technology. There are other frontier model companies that the government could have otherwise turned to. 
The government&#8217;s argument that it has to usurp the property rights of this one company in order to access a critical national security capability is extremely weak if it can just make a voluntary contract with Anthropic&#8217;s half a dozen competitors.</p><p>If in the future that stops being the case - if only one entity ends up being capable of building the robot armies and the superhuman hackers, and we had reason to worry that they could take over the whole world with their insurmountable lead, then I agree &#8211; it would not be acceptable to have that entity be a private company. And so honestly, I think my crux against the people who say that because AI is so powerful we cannot allow it to be shaped by private hands is that I just expect this technology to be much more multi-polar than they do, with lots of competitive companies at each layer of the supply chain.</p><p>And it is for this reason that unfortunately, individual acts of corporate courage will not solve the problem we are faced with here, which is just that structurally AI favors authoritarian applications, mass surveillance being one among many. Even if Anthropic refuses to have its models be used for such uses, and even if the next two frontier labs do the same, within 12 months everyone and their mother will be able to train AIs as good as today&#8217;s frontier. And at that point, there will be <em>some </em>AI vendor who is capable and willing to help the government enable mass surveillance.</p><p>The only way we can preserve our free society is if we make laws and norms through our political system that it is unacceptable for the government to use AI to enforce mass surveillance and censorship and control. Just as after WW2, the world set the norm that it is unacceptable to use nuclear weapons to wage war.</p><p>I want to be clear: these are extremely confusing and difficult questions to think about. 
I kept changing my mind back and forth on many of them in the process of writing this essay. I reserve the right to change my mind again in the future. In fact, I think it&#8217;s essential to change our minds as AI progresses and we learn more. That&#8217;s the whole point of conversation and debate.</p><p>Someday people will look back on this period the way we look back on the Enlightenment. People having big important debates right as the world was about to undergo these massive technological, social, and political revolutions. And some of these thinkers actually managed to get a couple of the big things right, for which we are now the beneficiaries.</p><p>We owe it to our future to at least attempt to think through these new questions raised by AI.</p>]]></content:encoded></item><item><title><![CDATA[Why Leonardo was a saboteur, Gutenberg went broke, and Florence was weird – Ada Palmer]]></title><description><![CDATA[Ambassador visiting Renaissance Florence: &#8220;Where am I? None of this has existed for a thousand years."]]></description><link>https://www.dwarkesh.com/p/ada-palmer</link><guid isPermaLink="false">https://www.dwarkesh.com/p/ada-palmer</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 06 Mar 2026 17:14:20 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/190118311/aee93ccf5bdd64c0816e2532e8f286be.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Renaissance history is so much wilder and weirder than you would have expected. 
Very fun chatting with <a href="https://www.adapalmer.com/">Ada Palmer</a> (historian, novelist, and composer based at the University of Chicago).</p><p>Some especially fascinating things I learned from the conversation and her excellent book, <em><a href="https://a.co/d/03EjyByR">Inventing the Renaissance</a></em>:</p><p>Not only did Gutenberg go bankrupt in the 1450s (after inventing the printing press), but so did the bank that foreclosed on him, and so did his apprentices. This is because paper was still very expensive, and so you had to make this big upfront CAPEX decision to print a batch of 300 copies of a book - say the Bible. But he&#8217;s in a small landlocked German town where only priests are allowed to read the Bible - so he sells maybe 7 copies. It&#8217;s only when this technology ends up in Venice, where you can hand 10 copies to each of 30 ship captains going to 30 different cities, that it starts taking off.</p><p>Speaking of which, the printing revolution wasn&#8217;t just one single discrete event, just as the computer revolution has been this whole century of going from mainframes -&gt; personal computers -&gt; phones -&gt; social media, each with different and accelerating social impact. Books came first, but they&#8217;re slow to print, and made in small batches. The real revolution is pamphlets - much faster, much harder to censor. Pamphlet runners are how you can have Luther&#8217;s 95 Theses go from Wittenberg to London in 17 days.</p><p>So much other wild stuff from this episode. For example, did you know that the largest and best-funded experimental laboratory in 17th century Europe was very likely the Roman one run by inquisitors? Ada jokes that the Inquisition accidentally invented peer review. 
The focus of the Inquisition is really misunderstood - it was obsessed with catching dangerous new heretics like Lutherans and Calvinists - it only executed one person for doing science.</p><p>And this leads Ada to make an observation that I think is really wise: the authorities and censors are always worried about the exact wrong things given 20/20 hindsight. When the Inquisition raids an underground bookshop during the French Enlightenment, they don&#8217;t mind the Rousseau, Voltaire, and Encyclop&#233;die, but they lose their minds about some Jansenist treatises about the technical nature of the Trinity.</p><p>More broadly, a lesson for me from this episode is that it&#8217;s just really hard to shape history in the specific way that you want to impact things. One of the most famous medieval scholars is this guy Petrarch. He survives the Black Death in the 1340s, watches his friends die to plague and bandits, and says: our leaders are selfish and terrible, we need to raise them on the Roman classics so they&#8217;ll act like Cicero. So Europe pours money into finding ancient manuscripts, building libraries, and educating princes on classical virtues. Those princes grow up and fight bigger, nastier wars than ever before with new deadlier technology. 
And this, combined with greater urbanization and endemic plague, results in European life expectancy decreasing from 35 in the medieval period to 18 during the Renaissance (the period which we in retrospect think of as a golden age but which many people living through it thought of as the continuation of the dark ages that had persisted since the fall of Rome).</p><p>Anyways, the libraries Petrarch inspires stick around, the printing press makes them accessible to everyone, and 200 years later a generation of medical students is reading Lucretius and asking &#8220;what if there are atoms and that&#8217;s how diseases work?&#8221; which eventually leads to germ theory, vaccines, and a cure for the Black Death (Ada has a longer, more involved explanation of how cosplaying the Romans leads, through a series of many steps, to the scientific revolution). Petrarch wanted to produce philosopher-kings that shared his values. Instead he created a world that doesn&#8217;t share his values at all but can cure the disease that destroyed his.</p><p>Watch on <a href="https://youtu.be/PAIhVfGbREA">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/how-cosplaying-ancient-rome-led-to-the-scientific/id1516093381?i=1000753675325">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/00AFjws53vNchZYgKGmFCU?si=443280b66ff64693">Spotify</a>.</p><div id="youtube2-PAIhVfGbREA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;PAIhVfGbREA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/PAIhVfGbREA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h2>Sponsors</h2><ul><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> is still waiting on someone to 
solve their backdoor puzzle&#8230; They&#8217;re accepting submissions until April 1st and have set aside $50,000 for the best attempts. Separately, applications are live for Jane Street&#8217;s summer ML internships in NY, London, and Hong Kong. Go check all of this out at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a>.</p></li><li><p><a href="https://labelbox.com/dwarkesh">Labelbox</a> can help ensure your agents don&#8217;t need to rely on overspecified prompts. They tailor real-world scenarios to whatever domain you&#8217;re focused on, and they make sure the data you train on rewards real understanding, not just instruction-following. Learn more at <a href="https://labelbox.com/dwarkesh">labelbox.com/dwarkesh</a></p></li><li><p><a href="https://mercury.com/personal-banking">Mercury</a>&#8217;s personal accounts let you add users, issue cards, and customize permissions. This is super useful for sharing finances with a partner, a roommate&#8230; or even an OpenClaw agent. And, if you&#8217;re already a Mercury Business user, your personal account is free! See terms and conditions below, and learn more at <a href="https://mercury.com/personal-banking">mercury.com/personal-banking</a></p><p><em>Eligible Mercury Business users who apply for and maintain a Mercury Personal account may have their Mercury Personal subscription fee waived provided they remain a user on an active Mercury Business account in good standing. Standard Mercury Platform Subscription fees will apply if they no longer meet eligibility requirements, including but not limited to no longer being associated with an eligible Mercury Business account, or if the program is modified or terminated. Mercury may modify or discontinue this offering at any time and will provide notice as required by law. 
See Subscription Terms for full details.</em></p></li><li><p>To sponsor a future episode, visit <a href="https://www.dwarkesh.com/advertise">dwarkesh.com/advertise</a>.</p></li></ul><h2>Timestamps</h2><p><a href="https://www.dwarkesh.com/i/190118311/000000-how-cosplaying-ancient-rome-led-to-the-renaissance">(00:00:00) - How cosplaying Ancient Rome led to the Renaissance</a></p><p><a href="https://www.dwarkesh.com/i/190118311/002849-how-florences-weird-republic-worked">(00:28:49) - How Florence&#8217;s weird republic worked</a></p><p><a href="https://www.dwarkesh.com/i/190118311/003813-how-the-medicis-took-over-florence">(00:38:13) - How the Medicis took over Florence</a></p><p><a href="https://www.dwarkesh.com/i/190118311/005812-why-it-was-so-hard-for-gutenberg-to-make-any-money-off-the-printing-press">(00:58:12) - Why it was so hard for Gutenberg to make any money off the printing press</a></p><p><a href="https://www.dwarkesh.com/i/190118311/011734-why-the-industrial-revolution-didnt-happen-in-italy">(01:17:34) - Why the industrial revolution didn&#8217;t happen in Italy</a></p><p><a href="https://www.dwarkesh.com/i/190118311/012302-the-library-of-alexandria-isnt-where-most-ancient-books-were-lost">(01:23:02) - The Library of Alexandria isn&#8217;t where most ancient books were lost</a></p><p><a href="https://www.dwarkesh.com/i/190118311/014121-the-inquisition-accidentally-invented-peer-review">(01:41:21) - The Inquisition accidentally invented peer review</a></p><h2>Transcript</h2><h3>00:00:00 - How cosplaying Ancient Rome led to the Renaissance</h3><p><strong>Dwarkesh Patel</strong></p><p>Today I&#8217;m chatting with <a href="https://history.uchicago.edu/directory/ada-palmer">Ada Palmer</a>, who&#8217;s a Renaissance historian, novelist, and composer based at the University of Chicago. Today we&#8217;re discussing your book, <em><a href="https://amzn.to/4l2zzb2">Inventing the Renaissance</a></em>. 
Ada, thanks for coming on the podcast.</p><p><strong>Ada Palmer</strong></p><p>Been looking forward.</p><p><strong>Dwarkesh Patel</strong></p><p>First question. You&#8217;ve got in this period&#8212;late 15th century, early 16th century&#8212;in Italy all these different republics: <a href="https://en.wikipedia.org/wiki/Republic_of_Venice">Venice</a>, <a href="https://en.wikipedia.org/wiki/Republic_of_Florence">Florence</a>, <a href="https://en.wikipedia.org/wiki/Republic_of_Genoa">Genoa</a>. That seems unusual both for the time period and for the place.</p><p><strong>Ada Palmer</strong></p><p>One of the big reasons that the <a href="https://en.wikipedia.org/wiki/Italian_city-states">Italian city republics</a> are clustered in Italy is that when the Roman Empire dissolved in the West, individual cities then needed to self-govern. This is true all across Europe. Those individual cities could no longer get the centralized Roman government to oversee supply routes or keep the roads free of bandits. You could no longer import and export goods at scale. You could no longer rely on central infrastructure. You had to support things yourself.</p><p>Larger, wealthier towns were able to make this transition because they could support themselves from the local resources and the farms attached to them. The larger, wealthier towns surrounded by good agricultural land were more successful at converting over. Okay, let&#8217;s have a senate like the <a href="https://en.wikipedia.org/wiki/Roman_Senate">old Roman Senate</a>. Let&#8217;s have our top families form a council. They will rule. We&#8217;ll set up a republic.</p><p>A weaker town that can&#8217;t support itself as well is much more prone to one wealthy family realizing that they can get goons and take over, declaring themselves the monarch of the area. Or worse, this town cannot self-sustain, it doesn&#8217;t have enough. People there can&#8217;t get food. 
They are scared and afraid of being robbed by people who are desperate. But outside of town, there is a wealthy villa that belongs to a noble family, and they have bodyguards. &#8220;Hey, noble family, if I move next to your villa and work for you, will you protect me with your bodyguards?&#8221;</p><p>So towns emptied out, and villages&#8212;as in a villa and its environs&#8212;developed as a result. A village was a monarchal structure in this sense. It was the migration of people out of a town into the protection zone of a local lordling. Then those villages grew to different scales, some of them cities, some not. Italy had great agriculture and great agricultural land, so more of Italy&#8217;s cities were able to sustain themselves as towns and be republics.</p><p><strong>Dwarkesh Patel</strong></p><p>I feel like the big take of your book is they were trying to resuscitate Roman virtues. What were the virtues that the Roman emperors had which allowed this safety, good government, et cetera, to work?</p><p><strong>Ada Palmer</strong></p><p>Stability.</p><p><strong>Dwarkesh Patel</strong></p><p>And I don&#8217;t understand the connection between reading Cicero and contemplating the virtues of a great emperor to&#8230; science and technology. Maybe there isn&#8217;t one, but do you think there is one? What exactly is that connection?</p><p><strong>Ada Palmer</strong></p><p>As with many processes, the answer is that there are multiple steps, and it&#8217;s complicated, and some of the steps are realizing that the earlier steps didn&#8217;t work.</p><p><a href="https://en.wikipedia.org/wiki/Petrarch">Petrarch</a>, who lived through the <a href="https://en.wikipedia.org/wiki/Black_Death">Black Death</a>, and lives in a moment when Italy is wracked by civil war and foreign mercenary troops are raiding and pillaging. Italy is wracked by bandits. When Petrarch survives the Black Death after losing so many friends, he gets a letter. Two of his friends are alive. 
He had given up hope that anyone he knew would survive, but two of his younger scholar friends are alive. They&#8217;re going to come visit him. On the way, they were attacked by bandits. One of them was killed, and the other was lost in the mountains and wounded, and he didn&#8217;t know that his friend was alive for another year and a half. The bandits are very real in this period.</p><p>Petrarch looks around him and says, &#8220;This is an age of ash and shadow. What we need is to imitate the arts of the ancients. Let&#8217;s try to figure out how the Romans did it.&#8221; And specifically, the problem is our leaders. Our leaders are selfish. Our leaders care more about their wealth and their family honor and their power than they do about the people.</p><p>This is where <em>Romeo and Juliet</em> is really helpful for us to understand. Lord Montague and Lord Capulet, as their goons are knifing each other in the street, they care about defeating each other. Do they care about the good of Italy? Do they care about the good of the city of Verona? No. Their feud is harming the city of Verona, and they don&#8217;t care. They demand that Romeo get away with murder because he is their son. That is not service to the state.</p><p>Petrarch reads about the ancient Roman <a href="https://en.wikipedia.org/wiki/Lucius_Junius_Brutus">Brutus</a>&#8212;not the one who killed <a href="https://en.wikipedia.org/wiki/Julius_Caesar">Caesar</a>, but the ancestor to whom <a href="https://en.wikipedia.org/wiki/Marcus_Junius_Brutus">that one</a> was trying to live up. Brutus was one of the first consuls of Rome, and he learned while in office that his sons were plotting to take over the state and make him king. So he executed his own sons for treason against the state. Can you imagine Lord Montague wanting to execute Romeo for treason against Verona? He would never do that. 
When you&#8217;re living in the plot of <em>Romeo and Juliet</em> and you read about these ancient Roman figures, as described in the lofty biographies of someone like <a href="https://en.wikipedia.org/wiki/Livy">Livy</a>, you read them and you say, &#8220;Wow, if only our leaders would act like that.&#8221;</p><p>Well, how were they raised? Can we raise our leaders the same way? Can we make libraries filled with what young Cicero read and what young Brutus read? What did they read? They read <a href="https://en.wikipedia.org/wiki/Plato">Plato</a>, and they read <a href="https://en.wikipedia.org/wiki/Homer">Homer</a>. So we need these things. Can we recreate the educational environment that produced them?</p><p>Petrarch suggests this. His students and successors embrace this idea and pour money into traveling across the Alps to look for manuscripts, traveling to Constantinople to purchase manuscripts from the wealthier East where books are common, and bringing them back to assemble these libraries. Then they raise tutors like <a href="https://en.wikipedia.org/wiki/Marsilio_Ficino">Marsilio Ficino</a>, who can know Greek and Latin and surround the young princes and princesses of Europe with these values in the hopes that they will act like Brutus and not like Lord Montague.</p><p>This is based on an assumption that education is very much like osmosis, that if you&#8217;re exposed to something, you&#8217;ll imitate it. And the uptake of this is strong because Italy is also full of upstart rulers who just seized power five minutes ago by having a coup in their state and have no legitimacy and no right to be ruling what they&#8217;re ruling and are resented by their people. But they can dress up like a Roman emperor. And they can have a parade with allegorical figures of the virtues next to them. 
And they can invest in an impressive palace that has a pediment on the front and looks like a Roman building to the eyes of the period, and cover themselves with the trappings of antiquity.</p><p>Then people might look at them and say, &#8220;This guy is different from what we&#8217;ve had. This guy is like the Caesars. The days of the Caesars were pretty good. Maybe we want this guy. Maybe he&#8217;s not going to be a tyrant. Maybe he&#8217;s going to be a good prince, and he&#8217;s going to make a golden age.&#8221;</p><p>And so the first dream is idealistic: let&#8217;s make better rulers. The adoption is self-serving and propagandistic: &#8220;Hey, I&#8217;m a tyrant, but I can seem like something better than just a tyrant. If I make myself look like Julius Caesar, then people will like and respect me.&#8221;</p><p>Or in the case of Florence with the Medici, &#8220;We are merchant scum. We are dirt compared to everybody around us. We&#8217;re not even one of the important families of Florence. We&#8217;re three ranks down. Even on the standards of merchant scum, we&#8217;re extra scummy merchant scum. But if we can have Latin and Greek and quote Cicero and seem like the ancients, people will take us seriously and respect us and talk to us even if we don&#8217;t have it.&#8221;</p><p>Let me give an example. Imagine that you are an ambassador from France, and you&#8217;re on your way to Rome, because a new pope has just been elected. Whenever a new pope is elected, every country in Europe has to send a special ambassador whose job it is to deliver a long-winded oration that says, &#8220;I am the ambassador from a very wealthy country and a very powerful prince.&#8221; And he&#8217;s so glad you&#8217;re the pope. Congratulations. Only you have to do that for an hour.</p><p>You have to give a gift to the pope, and it has to be very impressive, and you have to be a really important person. 
You&#8217;re the most important person who can leave your country without causing a political crisis. You might be the heir to the throne, for example. Or you might be a more minor ambassador, but you&#8217;re at least the son of a count.</p><p>You&#8217;re on your way to Rome, you&#8217;re heading along the length of Italy, you&#8217;re going to go through Florence, it&#8217;s on the way. Ugh. There&#8217;s nobody there worth talking to because it&#8217;s just a pit of scum and villainy. In fact, also filth and depravity because, of course, Florence is the sodomy capital of Europe. To Florentine is the verb for anal sex in several different European languages. In the laws of France, you can be indicted for sodomy on the grounds that you have ever once in your life even visited Florence. That&#8217;s considered evidence enough.</p><p>So you&#8217;re on your way to this matchlessly filthy dive of scum and villainy. And then you approach the city, and there are these statues. They look like ancient statues, the kind that are so lifelike that it&#8217;s as if they&#8217;re about to breathe and move. You&#8217;ve never seen an intact new statue like that. That isn&#8217;t something we know how to do. You ride through the city a bit, and it&#8217;s a large, impressive city, and you get to the cathedral, and it has this massive dome, way bigger than anything you&#8217;ve ever seen except for old Roman ruins.</p><p>You come to the banker&#8217;s house, and your servant knocks at the door. The banker greets you humbly at the door and apologizes that his humble palace is not worthy to host Your Excellency, and you&#8217;re like, &#8220;Yeah, it&#8217;s not. 
You&#8217;re correct.&#8221; He invites you in, and the instant you step inside, you&#8217;re in a space like nothing you&#8217;ve ever seen before with white light streaming in through this airy, rounded windowed courtyard that feels cleaner and more outdoors than the outdoors did, because something about the air is cool and fresh. It&#8217;s like nothing you&#8217;ve&#8212; Wait, wait. It is. It&#8217;s like the Roman ruins in the backyard of the castle where you grew up. But we don&#8217;t have the ability to do that anymore. All that&#8217;s lost.</p><p>In the middle of the square is another one of these bronze statues that looks like it&#8217;s about to come to life, except it&#8217;s shining and new. It hasn&#8217;t even turned green yet. Around the courtyard are busts of all the Roman emperors in order, and above them are portraits of this guy and the members of his family. Off in the corner are some men wearing robes that look like the robes the ancients wear. You say, &#8220;Who are those guys?&#8221; He says, &#8220;Oh, they&#8217;re Platonists. They&#8217;re speaking ancient Greek.&#8221; You say, &#8220;I thought I didn&#8217;t understand that language, but ancient Greek is lost. We don&#8217;t have ancient Greek.&#8221; He says, &#8220;We have lots of ancient Greek here.&#8221; You say, &#8220;And also, we don&#8217;t have the works of Plato. They&#8217;re also lost.&#8221; &#8220;Oh, we have lots of Plato here. Look, here&#8217;s my grandson, Lorenzo. He&#8217;s just written a poem in ancient Greek about the <a href="https://en.wikipedia.org/wiki/Plato%27s_theory_of_soul">three parts of the soul</a>. Would you like to hear him recite it?&#8221;</p><p>Now there&#8217;s a ten-year-old boy reciting a poem at you in ancient Greek about the three parts of the soul, and you&#8217;re like, &#8220;Where am I? None of this is possible. 
None of this has existed for a thousand years.&#8221; That&#8217;s the moment that <a href="https://en.wikipedia.org/wiki/Cosimo_de'_Medici">Cosimo de&#8217; Medici</a> turns to you and says, &#8220;Would you like to make an alliance with Florence?&#8221;</p><p>And you can say no. You can say, &#8220;No. My king is going to come over the Alps with his enormous army, and we&#8217;re going to descend upon this city, and we&#8217;re going to sack it, and everyone&#8217;s going to let us because it has no friends because it doesn&#8217;t have any nobility, so it can&#8217;t marry anybody, so it has no meaningful allies. And also, it&#8217;s in the middle of this <a href="https://en.wikipedia.org/wiki/Guelphs_and_Ghibellines">Guelph-Ghibelline</a> feud, so all of its neighbors hate it and they&#8217;re just going to let it burn. We&#8217;re going to take the enormous piles of gold that are in your basements and go home rich, and all of this will be gone like a dream.&#8221;</p><p>Or you could say, &#8220;Yes, let&#8217;s make an alliance. Give me a bronzesmith and an architect and a Greek teacher and a Platonist, and we&#8217;re going to take all of these things, and we&#8217;re going to do the French court like this. Then when the ambassador from Portugal comes, he&#8217;s going to feel like an uncultured fool, just like I feel right now.&#8221; The power dynamic just flipped upside down. Suddenly, the condescending nobleman is in awe of the merchant scum. That&#8217;s what the art and the culture does as a propagandistic tool.</p><p>The next stage of it then is, &#8220;Okay, we&#8217;ve raised these princes like this, and they have the Latin, and they have the Greek, and they can impress everybody.&#8221; Then they fight a bigger, nastier, worse war than any of the earlier big, nasty wars, with more deaths and more betrayals and bigger cannons knocking down cities and burning whole areas. 
The wealth is centralized, so the mercenaries are more numerous because people can produce more. The first generations raised by this are supposed to be philosopher princes, and instead we get <a href="https://en.wikipedia.org/wiki/Cesare_Borgia">Cesare</a> and <a href="https://en.wikipedia.org/wiki/Lucrezia_Borgia">Lucrezia Borgia</a>, both of whom had Latin and Greek and Cicero and Plato when they were kids. Then they grow up, and Cesare sets fire to half the world.</p><p>That is the war Machiavelli watched. Machiavelli was raised on all of the Cicero and Livy. He was raised on the Petrarchan project. He has <a href="https://dhspriory.org/kenny/PhilTexts/Machiavelli/Letter%20to%20Vettori.htm">this famous, beautiful letter</a> that he wrote in exile, where he&#8217;s describing his day to his friend. Most of the day is wasted, and he mucks around hunting for larks. Then he goes to a pub and gets drunk in the company of uncultured countrymen. Then he goes home, and he gets dressed in the court robes, the court finery that he would wear back when he was an ambassador to popes and kings. Attired thus, he enters his library to hold commerce with the ancients. He loves this the way Petrarch wanted him to love it.</p><p>But he observes these wars, and he observes virtuous princes like <a href="https://en.wikipedia.org/wiki/Guidobaldo_da_Montefeltro">Guidobaldo da Montefeltro</a>, who does every single thing you&#8217;re supposed to do virtuously. He has all the Plato, and he has all the libraries, and he has all the art. And he gets betrayed and his city taken away from him and loses everything. And he watches terrible people like Cesare Borgia and Julius II make terrible choices and succeed. He says, &#8220;Okay, clearly Petrarch was wrong that just reading Cicero would make successful rulers like the Caesars. 
But I still feel in my heart a deep power in the classics.&#8221;</p><p>So he says, &#8220;What if the libraries are what we need, but we need to use them differently?&#8221; He proposes what we would think of as political science. We observe historical examples. We say, &#8220;Okay, here are five examples of battles that happened next to rivers. We&#8217;ll put those examples side by side and see what decisions the commanders made to try to figure out which one worked better.&#8221; We use history as a casebook of examples of what worked and what didn&#8217;t. We imitate what worked, and we avoid doing what didn&#8217;t. Instead of feeling that reading about good men will make us good, we read about wise choices, and we imitate those choices.</p><p>This is one of the reasons Machiavelli is described by his contemporaries as a historian. He says we need to use history and use the classics differently. He proposes that. He isn&#8217;t very popular in his own day. It takes a long time for that to catch on. Many people for decades after him are still trying to use absorption by osmosis. But he&#8217;s writing that in the early 1500s, so it&#8217;s been a little over a century since this started.</p><p>We have to remember how long this process is. From Petrarch&#8217;s first call to Machiavelli writing that is as long as from <a href="https://en.wikipedia.org/wiki/Yuri_Gagarin">Yuri Gagarin&#8217;s</a> space flight back to <a href="https://en.wikipedia.org/wiki/Napoleon">Napoleon</a>. The childhood of Napoleon to the space race, that&#8217;s Petrarch to Machiavelli. We think of it as one time period, but a lot changed. They had a plan. They tried the plan. They brought the plan to its maximum. They raised all the princes in this new way. The wars happened. It clearly failed. Machiavelli then thinks about why it failed.</p><p>We&#8217;re still only halfway through the Renaissance. Shakespeare&#8217;s grandparents have barely been born. We have a lot more time to go. 
So what do we need? We need new ways of thinking about it. We&#8217;re reading the ancients, and we have bigger libraries. We have the printing press now. We&#8217;re having libraries in smaller towns. More and more people can read. It&#8217;s easier and easier to get an education. More people are starting to learn about science.</p><p>It also is important that they&#8217;re inventing micro technologies of book production like footnotes and glossaries in the margin that explain the hard vocabulary. When Petrarch&#8217;s successors like Ficino were young, you had to be a masterful Latinist to read these ancients. You had to have an enormous vocabulary. There are no dictionaries. There are no glosses. There&#8217;s nothing to help you. Only a tiny slice of expert classicists could actually read this stuff.</p><p>A hundred years later, there are translations into the vernacular. There are footnotes that tell you the hard vocabulary. Any med student can read Lucretius&#8217; discussions of materialist information. When <a href="https://en.wikipedia.org/wiki/Poggio_Bracciolini">Poggio</a> found it, there were two dozen people in the world who could read it. A hundred years later, 30,000 people can read it in the 30 print editions that are printed before 1600.</p><p>When all different kinds of people read it&#8212;med students, law students, people in different countries, people in different places&#8212;they ask new questions. They wonder whether they can test the hypotheses. They do test the hypotheses. They&#8217;re the generation that discovers that the heart is a pump. 
They&#8217;re the generation that takes seriously the question, &#8220;Maybe there are atoms, and maybe that&#8217;s how diseases work, and maybe we can develop the <a href="https://en.wikipedia.org/wiki/Germ_theory_of_disease">germ theory of disease</a>.&#8221; That&#8217;s the 1560s, 1580s, 160 years after Lucretius comes back, because it takes generations of work to build the libraries, to have the libraries, to use the libraries.</p><p>So when we get to 1600, which is almost exactly 200 years after this begins, a little bit more, we&#8217;ve had time to say, &#8220;Let&#8217;s build the libraries, have the libraries, use the libraries, or realize we failed in how we use the libraries, and use the libraries differently.&#8221; That&#8217;s the generation of Francis Bacon and Galileo who say, &#8220;Hey, let&#8217;s use the information differently. Let&#8217;s use nature as a casebook of examples the way Machiavelli said we should use history. Let&#8217;s examine, let&#8217;s doubt, let&#8217;s rethink, let&#8217;s do stuff in new ways.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>Just to make sure I understood, the chain of causation here. We&#8217;ve got to resuscitate the virtues of the Romans, therefore read what they read. To do that, you need to build the libraries. You build the libraries, you resuscitate all those arts. Then you just need to have people be literate, have people think about information in a new way to analyze it. And that analysis lends itself not just to the history of leaders, but also to the nature of the world.</p><p>Whenever I hear a story about how this is why the <a href="https://en.wikipedia.org/wiki/Scientific_Revolution">scientific revolution</a> happened, why the Industrial Revolution happened, I&#8217;m like, but there are so many stories and it&#8217;s just hard to figure out why this one over the other ones. 
There&#8217;s a dozen other stories you could tell.</p><p>I had a previous guest, <a href="https://www.dwarkesh.com/p/joseph-henrich">Joseph Henrich</a>, who has this theory that the Catholic Church was breaking down these old kinship-based networks that the rest of the world had. It was encouraging guilds, encouraging these kinds of centers where people could get together and discuss ideas. There are probably twenty other stories you could tell. Why this story?</p><p><strong>Ada Palmer</strong></p><p>Two different reasons. One, I think it&#8217;s useful to think about how for new ideas to flourish and new ways of running the world to happen, you need a fertile environment. In the same way that for forests to grow, you need enough topsoil. It takes a while to get that topsoil.</p><p>It takes a while to get enough books. You need to have enough books for a bunch of people to be reading and thinking. You also need to have networks of information moving this stuff back and forth so that they can have discourses of ideas with each other. You can&#8217;t publish a scientific journal until there are journals. You need to have developed this ecosystem of information and knowledge.</p><p>People talk about it sometimes in terms of increasing literacy rates as if higher literacy makes there be more books instead of the other way around. In fact, there&#8217;s a lot more literacy than people imagine in even medieval Italy. Florence had a male literacy rate of ninety percent.</p><p><strong>Dwarkesh Patel</strong></p><p>As of the sixteenth century?</p><p><strong>Ada Palmer</strong></p><p>As of the twelfth century. Because everybody&#8217;s in the merchant world, so you have to be able to send letters. You have to be able to read account books. You have to be able to calculate your tab at a restaurant.</p><p>But of those people, how many have read a book? Very few. They&#8217;ve read letters, they&#8217;ve read tallies, they&#8217;ve read indexes, they&#8217;ve made notes. 
Being literate and being book-literate are different things, in the same way that some people watch television but don&#8217;t watch very many films, while other people watch lots of films. You can be literate and have never read a book because there might be almost no books in the entire city in which you grew up if it&#8217;s 1200 or 1500. But if it&#8217;s 1600, there are definitely books in any medium-sized town. So literacy transforms into access to scientific, intellectual, legal, all sorts of different worlds of ideas.</p><p>The other person you quoted who&#8217;s talking about transformations in networks of power from being family and clan-centered to being more guild-centered&#8230; The guilds are major generators of ideas as well. The guilds can own libraries by 1600. If you go to a guild hall, it will have a bunch of books about its own trade. That would not have been true in 1100.</p><p>Those changes are all real, they&#8217;re all intermixing, and they&#8217;re all parallel to each other. You need all of these things together. One of the things I focus on is that sometimes there are more steps to something than you think.</p><p>We tell this story of the Renaissance, of how they rediscovered these ancient texts, and then we got science. That&#8217;s true, but it is an oversimplification and too wide a zoom. If I said that in the <a href="https://en.wikipedia.org/wiki/French_Revolution">French Revolution</a>, Napoleon rose to power and spread nationalized warfare across Europe, and then we landed on the moon, I&#8217;ve skipped some steps. 
We know that about modernity, but we don&#8217;t remember that about earlier periods.</p><p><strong>Dwarkesh Patel</strong></p><p>Obviously all the stories are somewhat true, but to the extent that this is a part of the story, you&#8217;re building up libraries of classics and &#8230; setting up a network of information exchange that leads to the Scientific Revolution&#8230;</p><p>The reason this feels salient right now is that a lot of people have this idea that they&#8217;re going to make AI go well by doing X thing. Maybe some of those things work, but it&#8217;s at the same time frustrating but also funny and interesting that historically nobody has a good track record of being able to say, &#8220;I will do this thing so that this huge unanticipated change in history will go my way, or according to my values.&#8221;</p><p><strong>Ada Palmer</strong></p><p>Right. I think &#8220;go my way&#8221; as opposed to &#8220;go well&#8221; is a really important distinction. Petrarch wanted a world with these values. He thought, for example, that this would be a triumph for Christianity and what we would call Catholicism, though there&#8217;s only one Christianity from his point of view at the time, except for the <a href="https://en.wikipedia.org/wiki/Eastern_Orthodox_Church">East</a>, which is different.</p><p>He was sure that when we found the ancients, fundamentally all of their philosophy would agree with Christianity. The ancients were wise, therefore they will be correct, and Plato will ninety percent agree with Christianity. It just needs a little shaker of the Trinity on top to be Christianity. When he says, &#8220;Go find these ancients,&#8221; he is in a world that doesn&#8217;t have the ancients yet. He&#8217;s just guessing what&#8217;s going to be in these books. But he says, &#8220;If we find them, they will uphold good values,&#8221; and everyone believes him.</p><p>Then they go find them, and they squabble with each other. 
There are <a href="https://en.wikipedia.org/wiki/Hedonism#Ancient">Hedonists</a> and <a href="https://en.wikipedia.org/wiki/Epicureanism">Epicureans</a> and <a href="https://en.wikipedia.org/wiki/Stoicism">Stoics</a> and all sorts of chaotic things, much more plural than he anticipated. It makes a world that in turn has giant wars, which he would not like, and a crisis, and <a href="https://dukespace.lib.duke.edu/server/api/core/bitstreams/e489fce7-2bd2-4a2f-ac14-912710a8284b/content">Machiavelli&#8217;s critique of the ancients</a>, and then the new science and the new philosophy, and eventually Galileo, none of which resembles what Petrarch would have described had he spelled out the future he was trying to make.</p><p>But then we get to the propagators of Bacon&#8217;s scientific method, meaning Voltaire and <a href="https://en.wikipedia.org/wiki/Montesquieu">Montesquieu</a>, who are also big campaigners for inoculation against smallpox. The first major disease eradications begin under that immediate influence. Science gets us to the germ theory of disease, which gets us to modern hygiene, which gets us to vaccines, which gets us to penicillin and the treatment for the <a href="https://en.wikipedia.org/wiki/Black_Death">Black Death</a>.</p><p>Petrarch thought he would make a world which shared his values. Instead, he made a world that doesn&#8217;t share his values but is capable of curing a disease he never imagined would be curable. If you showed him this future, it would be scary. It would be weird to him because it does not embrace his values. Our values are different. He would be horrified by democracy. He believed that only a tiny elite has the capacity to rule. If we had a time-traveling Petrarch, he would really wrestle for a long time to wrap his head around democracy as a functional system. 
He really thought in oligarchic terms.</p><p>But he would see the wonders we&#8217;ve created, especially the fact that we can treat the Black Death, and he would weep for joy seeing that. He did not create a world that went as he wanted, but he created a world that went well. We have many examples of that. Trains and bicycles come in, and we get feminism because it&#8217;s easier for people, especially women, to move freely and independently. They can organize. They can mobilize. We get <a href="https://en.wikipedia.org/wiki/Suffragette">suffragettes</a>. Did the inventor of the train intend for there to be women&#8217;s liberation? No. Did it go the way he imagined? No. Did it go well? Yes.</p><h3>00:28:49 - How Florence&#8217;s weird republic worked</h3><p><strong>Ada Palmer</strong></p><p>It&#8217;s important here to zoom in a little bit on Florence&#8217;s own government system and how and why it&#8217;s weird, in order to understand what rank Machiavelli actually holds in it.</p><p>All of these republics, except Florence, are modeled on ancient Rome. The ancient Roman model was an <a href="https://en.wikipedia.org/wiki/Roman_Republic">oligarchic republic</a> in which within the city there are <a href="https://en.wikipedia.org/wiki/Patrician_(ancient_Rome)">certain noble families</a>, usually founding families who made the city in the first place, who are the senatorial families. Hereditarily, when they come of age, the men of the family are automatically in the senate. From among them are elected the <a href="https://en.wikipedia.org/wiki/Roman_consul">consuls</a>, high senators, or the head of state if there is one. You have a small slice of the population that are fully enfranchised members of the republic who rule over the commoner majority.</p><p>That is how Venice works. That is how Genoa works. That is how Bologna and <a href="https://en.wikipedia.org/wiki/Republic_of_Siena">Siena</a> for the most part work. 
That&#8217;s how the <a href="https://en.wikipedia.org/wiki/Old_Swiss_Confederacy">Swiss Republic</a> works. That&#8217;s how all of these republics work. Florence was like that for quite a while, but when republics fell, they usually fell to noble families who are the foremost, the strongest, the military class. If you&#8217;re a military leader in this period, you have to have noble blood. No soldier is going to follow a commander who doesn&#8217;t have noble blood. That would be weird. Those threats to the independence of the republic almost always came from the nobility.</p><p>After one particular near miss in which the city was nearly taken over, they decided to <a href="https://en.wikipedia.org/wiki/Ordinances_of_Justice">get rid of the nobility of Florence</a>. They massacred most of them, cut their heads off, put them on pikes, burned their houses down, raked salt into the earth, and had a party on their graves, the way you do in the period when you&#8217;re getting rid of a class of people. There were a few noble families that they really liked who had not been part of negative stuff. They allowed them to officially renounce their nobility. They renounced their nobility, changed their names, and declared themselves commoners.</p><p>They set up a commoner republic. What that meant was the senate consisted of members of merchant guilds. A member of a merchant guild here means the owners of workshops. It&#8217;s not the guy who sits at the loom weaving, but the guy who owns the warehouse full of looms where the workers are working. The head of the sculpture works, the head of the architectural firm, not the bricklayers who are actually laying the bricks. Bourgeoisie is an anachronistic word, but we&#8217;re talking about the owners of the means of production who are themselves commoners.</p><p>They are very wealthy, but from the point of view of the diplomatic corps of any other society, all of the ruling people and all of their ambassadors are noble-blooded. 
If you&#8217;re an ambassador, you&#8217;re automatically noble-blooded. Nobody&#8217;s going to take an ambassador seriously who isn&#8217;t. From the perspective of every other polity in the world, the rulers of Florence are the rank of their valet. There is no nobility left in the city.</p><p>In fact, Florence can&#8217;t run its own armies or head its own police, because you&#8217;re not going to surrender if you&#8217;re told to surrender in the name of some guy who doesn&#8217;t have a coat of arms. That would be weird. So they actually have to hire a nobleman to come to the city and be their chief of police to arrest people in the name of the Holy Roman Emperor. One at a time, they&#8217;ll invite a skilled military commander nobleman who will come to the city. He&#8217;ll be <em><a href="https://en.wikipedia.org/wiki/Podest%C3%A0">podest&#224;</a></em>. He&#8217;ll live in the palace, which is also the prison. He&#8217;ll arrest people. He&#8217;ll enforce the law.</p><p>They will pay him handsomely at the end of the year, escort him to the gates, and then banish him from the city for life on pain of death so that he cannot return and make use of the power that he had in the city to try to take over. They&#8217;re very wary of any nobleman. They&#8217;ve set up a really weird republic&#8212;weird from the perspective of everyone around them&#8212;in which a bunch of merchants are trying to share power by being lotteried into the senate.</p><p>You put names in a bag. You examine all of the merchant members of guilds. You choose which ones are fit to serve, meaning not ill and dying, not insane, not so deeply in debt that they could be manipulated by the people whom they owe money to. Their names go in a bag. You choose nine guys at random. They rule the city. 
They are put in a palace where they rule the city from that tower.</p><p>They&#8217;re actually locked in the tower for the duration of their time in office because if they left the tower, they could be bribed or kidnapped. They rule the city for two or three months. At the end, they are thanked for their service and escorted out, and then a different nine guys share power for the next three months. It&#8217;s a power sharing that is designed to be tyrant-proof because you need consensus of nine randomly selected guys to decide to do anything.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, it&#8217;s not even a majority vote, it&#8217;s consensus?</p><p><strong>Ada Palmer</strong></p><p>It&#8217;s consensus.</p><p><strong>Dwarkesh Patel</strong></p><p>Previously you were describing &#8220;kill the nobles, salt the earth&#8221;. I&#8217;m almost thinking early communists. But then you say it&#8217;s the heads of the merchant guilds who are in charge. I want to understand why merchants and entrepreneurs have notable status in Florence. What is it about the culture that makes it so? Also, the Medici, the most powerful people, their job is <a href="https://en.wikipedia.org/wiki/Usury">usury</a>. It&#8217;s like the church&#8212;</p><p><strong>Ada Palmer</strong></p><p>It&#8217;s important to remember they were nobody when this was set up. They were a family of minor importance.</p><p><strong>Dwarkesh Patel</strong></p><p>But the culture is getting started where somebody like that could be respected. How does that happen?</p><p><strong>Ada Palmer</strong></p><p>An important part of it is when you have a merchant capital, everybody works for somebody who works for somebody who works for the boss.</p><p>If you are a major merchant in Florence, you&#8217;re importing and exporting wool to and from all across Europe. You have employees all across Europe. 
You&#8217;re buying mass bulk wool from England, importing it to Florence to use olive oil that you&#8217;ve bought from Naples to process into high-quality wool, which you&#8217;re then exporting to Germany and France. You are a very interconnected businessman. You have a lot of contacts, you have a lot of clout, and the employees who work for you look to you for their safety net as well as their political representation.</p><p>We&#8217;re very accustomed in the modern period to thinking of the government as being our big safety net. If we wonder who is going to fund the hospitals, whose job is it to take care of orphans, we think of the government, or maybe the church. But in this period, if you&#8217;re killed and you leave orphans behind, it is your employer whose duty it is to take care of them. If you are injured and can no longer work, it is your employer who will support you for the rest of your life while you are disabled and find you work that you can do with that disability. A huge portion of the safety net is your employer.</p><p>Are you in trouble with the law? Your employer will supply your defense attorney. Your employer will supply the persuasive note to the judge that they would very much appreciate if their person got off. This is the system known as the patronage system, and it <a href="https://en.wikipedia.org/wiki/Patronage_in_ancient_Rome">existed in ancient Rome</a>. It exists and saturates the medieval and the Renaissance worlds in which everyone is in a very interconnected hierarchy.</p><p>So if you&#8217;re a brewer and your son gets in a barroom brawl and punches somebody out and the person&#8217;s nose breaks and they die in the brawl and your son is suddenly in trouble and you say, &#8220;Oh no, I don&#8217;t want my son to be executed,&#8221; you turn to your landlord. Your landlord turns to his landlord. They turn to one of these major families. These major families are massive landowners that own dozens of apartments within the city. 
Hundreds or thousands of people work for them.</p><p>So it makes sense to everyone to be represented that way, like having a council of the CEOs of all of the organizations that employees work for, when your corporation also supplies your social safety net and you see your representation there.</p><p>It&#8217;s also a world that&#8217;s used to thinking in terms of hierarchy and very unused to thinking about real democracy. It really doesn&#8217;t have any confidence in what we would recognize as democracy. We talk about these republics, and we&#8217;re very excited by the fact that they give more power to the people than a monarchy does, but they&#8217;re still incredibly narrow oligarchic republics.</p><p>When we read Machiavelli, he talks a lot about the <em>popolo</em>, which we translate as &#8220;the people.&#8221; He talks about how important it is that the <em>popolo</em> are respected and have a voice, that the <em>popolo</em> are armed, and the government shows respect for the people by allowing them to be armed. We read this and we&#8217;re like, &#8220;This feels really familiar. This feels like documents of the founding of the US where we&#8217;re respecting and arming and trusting the people.&#8221;</p><p><em>Popolo</em> meant the top 4% economically of the population, the members of the merchant guilds. That&#8217;s the <em>popolo</em>. He&#8217;s talking about a narrow-slice oligarchy being heard, a narrow-slice oligarchy being respected. We didn&#8217;t realize that in the nineteenth century when we were excitedly translating <em>The Prince</em> and reading it as quasi-democratic. We now have read more documents of the period and realize how people use these words.</p><h3>00:38:13 - How the Medicis took over Florence</h3><p><strong>Dwarkesh Patel</strong></p><p>Florence in this period goes through five different forms of government. 
It&#8217;s this <a href="https://en.wikipedia.org/wiki/Signoria_of_Florence">republic of nine dudes in a tower</a>, as you were saying, before 1434, and then&#8212;</p><p><strong>Ada Palmer</strong></p><p>There&#8217;s a gradual takeover. There&#8217;s a gradual, what we could call regulatory capture. But an interesting detail about Florence, even as the Medici take over, is that the Medici know the people of Florence are very deeply invested in this republic and very deeply invested in its institutions. Therefore, they have to respect those institutions and proclaim respect for those institutions. So they&#8217;re going to sustain people in the named offices that there used to be. They&#8217;re going to continue to let the guilds be important and have important offices.</p><p>There was a mandatory outfit that people wore who worked in the republic. The garment over there in the corner is a <em><a href="https://www.academia.edu/3443948/Clothing_and_a_Florentine_Style_1550_1620">lucco fiorentino</a></em>. This was the garment you were mandated by law to wear if you held office in the Florentine Republic. To us, we look at it and say, &#8220;It&#8217;s a long red robe. It looks very Renaissance.&#8221; To them, it looked like a toga because of the way it was draped. They thought of this as a toga. They&#8217;re cosplaying the Roman Republic. Wearing a Florentine toga while in office was something that you did to represent your fealty to <a href="https://en.wikipedia.org/wiki/Cicero">Cicero</a> and republican values.</p><p>The dukes made their men continue to wear these. 
In fact, the first Duke, <a href="https://en.wikipedia.org/wiki/Cosimo_I_de%27_Medici">Cosimo I</a>, would wear one to costume balls as if in his heart he longed not to dress like a duke, but to dress in a toga like a republican.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s doubly ironic because when the Roman Republic turns to the <a href="https://en.wikipedia.org/wiki/Roman_Empire">Roman Empire</a>, they still have the senate. They still have all these old institutions, even though it&#8217;s no longer a republic.</p><p><strong>Ada Palmer</strong></p><p>The Roman Senate keeps meeting until 1200 AD.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s sort of doubly ironic that they are doing the same thing, but in the 1500s.</p><p><strong>Ada Palmer</strong></p><p>And it means that more rights are granted to the people of Florence than to other cities that fell to monarchies at similar points. The monarchs of Florence know they have to be careful, they have to respect rights to a certain amount, and they can&#8217;t run roughshod over them.</p><p>There&#8217;s a really cool building that I love in Florence. If you&#8217;ve been there, there&#8217;s the famous bridge, the <a href="https://en.wikipedia.org/wiki/Ponte_Vecchio">Ponte Vecchio</a>, which has little jeweler shops all along it. When you get to the end of it, there&#8217;s this funny overhead corridor, the <a href="https://en.wikipedia.org/wiki/Vasari_Corridor">Vasari Corridor</a>, which was built by the dukes of Florence to connect the old city palace where the senate used to meet&#8212;where they had to have their seat of power&#8212;to their new palace across the river, which was much bigger, where they could have grand balls and things that dukes need to have.</p><p>Because they&#8217;re so terrified of being assassinated by their own people, they built this overhead walkway that goes from one end of the city to the other so that they could walk in safety without being assassinated. 
This is a sign of a weak duke. But also, when he was building it, it&#8217;s going across the roofs and sometimes blasting off the second stories of different people&#8217;s houses. Most people, when His Grace the Duke says, &#8220;I&#8217;m gonna blast the top story off your house,&#8221; would say, &#8220;Yes, Your Grace, please continue.&#8221; There are literally severed heads of people who resisted still rotting on spikes in front of the <a href="https://en.wikipedia.org/wiki/Palazzo_Vecchio">Palazzo Vecchio</a>.</p><p>But they get to this one point where there&#8217;s <a href="https://en.wikipedia.org/wiki/Torre_dei_Mannelli">a very old tower</a>, a 500-year-old tower. This belongs to the Mannelli family, who are descended from peers of Julius Caesar and can trace their genealogy all the way back to an old Roman <a href="https://en.wikipedia.org/wiki/Gens">gens</a>. When the duke says, &#8220;We want to knock the top off your tower,&#8221; they say, &#8220;No, this is our tower. This tower has been ours since before the Medici existed as a named family. You may not knock the top off.&#8221;</p><p>And the duke does not knock the top off. The corridor goes around in this awkward square around that tower, because he knows that if he violates something as traditional and core to the civilization as the property rights of somebody who has owned something for a long time, there will be rebellion, civil war, dissent, and resistance. These are monarchs who know that they are weak and are therefore careful, and therefore more rights, like property rights, exist.</p><p>Meanwhile, across the river in Ferrara, <a href="https://en.wikipedia.org/wiki/Alfonso_I_d%27Este">Duke Alfonso I d&#8217;Este of Ferrara</a> used to wander around Ferrara buck naked with a sword in one hand and his dick in the other, to show off that nobody would ever possibly try to harm a Duke d&#8217;Este. 
He and his siblings used to do things like, if they liked a musician, kidnap them and lock them in a tower so that nobody else could hear them, or if they wanted each other&#8217;s musician, send goons to kidnap each other&#8217;s musicians. They also used to recreationally murder each other&#8217;s servants when the siblings were tiffing with each other.</p><p>That is what you do when you don&#8217;t fear your people and when you feel confident in power. They are much closer to tyrants than the Medici are ever able to be, even after the republic falls. That&#8217;s what&#8217;s so neat. Because the resistance failed, if we&#8217;re looking at it in black and white. The republic fell. There wasn&#8217;t a republic anymore. There was a duke. He took over, and the old system was gone.</p><p>But because the republic fought so hard and because the people really believed in it, the people had a lot more rights, and the tyrant was a lot less tyrannical because there had been that fight. It&#8217;s a great example of how even when resistance loses, resistance wins.</p><p><strong>Dwarkesh Patel</strong></p><p>I think there&#8217;s an interesting parallel to today, not to be too on the nose, but sometimes people debate the odds that America becomes a Putinist kind of country within a couple of decades. I think the odds are actually quite low. Just because even though constitutionally, or at least in precedent, the president is very powerful, the republican expectation is so strong. The amount of resistance faced, even when you successfully do something, demotivates the next escalation.</p><p><strong>Ada Palmer</strong></p><p>The only thing that makes resistance weak in the US is when people feel as if partial victory is failure. 
Remembering moments like how Florence&#8217;s resistance all the way to the end meant that there was more liberty for the next several centuries, even under the tyrant, is what we need to remind ourselves, that partial victory is an important thing.</p><p>Even if the worst were to happen and there were to be tyranny, that tyranny would be so much weaker because there was a lot of resistance, and traditions of resistance and structures would develop that would continue to exist.</p><p><strong>Dwarkesh Patel</strong></p><p>I think you should discuss the fact that the Medici are the bankers for the papacy. What does that mean? Why is that necessary? How are they able to make money off of that from the interest on the float?</p><p><strong>Ada Palmer</strong></p><p>When Cosimo de&#8217; Medici swings the contract as banker for the pope, it&#8217;s important to remember that when you can&#8217;t wire transfer money in the pre-modern world, collecting taxes is a very difficult and complicated system. It is generally done by the centralizing power that has the right to tax delegating somebody local. If you&#8217;re in a town, there&#8217;ll be a local tax collector. It&#8217;s his job to go around to everybody and collect taxes, send a portion of those taxes home to the central power, and keep a remainder to pay himself.</p><p>The central power will say, &#8220;We expect X amount of taxes from this area.&#8221; When you hear about wicked tax collectors, it&#8217;s because if you are told, &#8220;We want 10,000 florins worth of tax from this town,&#8221; but you extract 15,000, you can keep the other 5,000. The 10,000 is what you need to send to the central power, so the more you extract, the more you get paid.</p><p>This delegate system, in which there&#8217;s a local tax collector and even a more local tax collector below him who might collect tax from a particular village, means that you depend a lot upon the person whose job it is to collect your taxes. 
When Cosimo is papal banker, he is the person collecting and channeling the money from every church in Christendom when everybody puts a coin into a collection box or pilgrims come and put money. All of the wealth that&#8217;s supposed to flow back to the papacy is actually flowing to Cosimo. Cosimo is passing it on to the papacy after taking a cut.</p><p>That is a lot of money moving quickly. There is also a lot of ability to make contracts and contacts. We all know how important networking is. He rises in prominence from a banker to somebody who has enough money to effectively take over his state via manipulating the guys-out-of-a-bag system. To discuss that again briefly, if you have a system where you lottery people, <a href="https://en.wikipedia.org/wiki/Sortition">sortition</a> is the technical term for it. This is a very old form of government. <a href="https://en.wikipedia.org/wiki/Kleroterion">Ancient Athens used it</a>. It actually works really well.</p><p>But like any institution, it is corruptible. In the same way that you can corrupt voting by bribing people or manipulating the machines or manipulating voters, you can also corrupt sortition by bribing the people who pull names out of the bag. Or you can use the simpler mechanism which Cosimo uses first. If you&#8217;re a giant bigwig in the city and you employ a third of the people in the city and they&#8217;re on your payroll, and nine guys at random are chosen out of a bag, three of them are going to be your guys, just statistically.</p><p>If you tell all your guys, &#8220;I want this policy, this policy, and this policy, and if you have questions, send for me and I&#8217;ll tell you what to do,&#8221; when the plurality on a random council all have a plan and it&#8217;s your plan, you effectively control the city. 
In that way, the Medici effectively controlled this lotteried system, because they guaranteed that the plurality, in a situation that doesn&#8217;t have a majority, will always be them.</p><p>But of course, there&#8217;s an element of chance in that. In 1430 and 1432, Cosimo has bad luck, and the lottery draws a lot of people who dislike him and doesn&#8217;t draw any of his guys. They immediately declare him a traitor to the state, arrest him, and <a href="https://en.wikipedia.org/wiki/Cosimo_de'_Medici#Florentine_politics">lock him in a tower</a>.</p><p>And he bribes his way out. He offers the equivalent of about $300,000 to the guard outside the cell and $700,000 to the captain of the guard to smuggle him out of the tower. He wrote in a letter later that they were the two most foolish men he&#8217;d ever met because he was Cosimo de&#8217; Medici. He would happily have paid them tens of millions of dollars to let him out of there, but they weren&#8217;t ambitious enough to think to ask for more than a few hundred thousand.</p><p>So he escapes, and then in the next election they happened to elect only people who just loved Cosimo. They invited him back to the city in triumph, declared him father of the fatherland, and arrested and persecuted all of his enemies, who turned out to be guilty of tax evasion and all sorts of other things.</p><p>That was the moment that his grip tightened. And he&#8217;s like, &#8220;I&#8217;m going to stop simply controlling a plurality, and I&#8217;m going to start bribing the people who actually run the elections.&#8221; His famous quote about this is, &#8220;It is dangerous to be rich and not powerful.&#8221; You need the power to defend yourself in a situation like King of the Mountain, where when you&#8217;re on top, everyone will try to knock you down.</p><p>This is the system into which Machiavelli is born. His family has worked for the Medici family for generations. He grows up expecting to work for the Medici family. 
But the problem with heredity is that sometimes you get a weak link.</p><p>And in the moment that Machiavelli is in his early twenties, he is coming of age, about to work in government for the first time, a government in which he is not, in fact, even fully enfranchised. That&#8217;s one of the fascinating things about the degree of his patriotism. You weren&#8217;t allowed to serve in government office fully&#8212;the lotteried offices&#8212;if your family was deep in debt. His grandfather had a lot of unpaid tax debt.</p><p>So he worked his whole life for a government of which he was not even quite a full citizen. That shows a deep love of country, but it also shows that even people who could not be in office deeply loved and cared about this republic and the important liberty they felt they had being ruled by the 5% instead of being ruled by one dictator.</p><p>To us, that isn&#8217;t a very big difference. They&#8217;re still both not democracy. We would say they&#8217;re both not liberty in the sense that we want liberty. But it&#8217;s an inch more liberty than monarchy. Even that small amount of liberty, people loved it. People were willing to fight for it. People were willing to go to the streets, wave their banners, and say &#8220;libertas&#8221; for the republic. Because they were invested in it, Machiavelli observes, they sustained it.</p><p>But eventually, <a href="https://en.wikipedia.org/wiki/Piero_di_Cosimo_de%27_Medici">one particular Medici</a>&#8212;I&#8217;m not saying names because they all have the same names over and over, and it&#8217;s really confusing&#8212;comes to power quite young and weak. He&#8217;s basically 20 when he&#8217;s suddenly in charge of a very precarious republic. Right then, the French are invading Italy, and he&#8217;s scared. He botches the diplomacy with France and falls into disrepute, and the city takes the opportunity to kick him out. 
The subsequent regimes, which are an independent republic again, are the ones for which Machiavelli works.</p><p>He was part of the regime that ruled while they were in exile. When they returned, they viewed him as an enemy. He didn&#8217;t actively organize to resist them, but his name was found on a list of potential people that an anti-Medicean resistance movement had intended to recruit. He is arrested, tortured, exiled, and in exile writes <em>The Prince</em>.</p><p>He dedicates it to the very family that exiled him because they now control Florence, and he will only work for Florence. He doesn&#8217;t want his manual of the great secrets of statecraft to be in the hands of anybody but his homeland, so that it will defend his homeland.</p><p>When Florence exiles you, they tell you, &#8220;Go to this place and wait, and if you&#8217;re good, we&#8217;ll invite you back.&#8221; Florence has been doing this for ages because Florence actually used this as the core of its diplomatic corps. When you have no nobility, you can&#8217;t have ambassadors in the full-on noble ambassador sense. There&#8217;s nobody in the city of sufficient rank to go talk to the kings, to play chess with the sultan, and do all the things you have to do to be a proper ambassador.</p><p>What Florence did instead is exile people and say, &#8220;Okay, we&#8217;re exiling you. You go to Bruges. Be our contact in Bruges. You go to London. Be our contact in London. Be good. Send us letters informing us what&#8217;s going on. When we have diplomatic needs to talk to the king, we&#8217;re going to send letters to you, and you&#8217;re going to forward them. If you&#8217;re good, you get to come back.&#8221; So being in exile is sort of being on probation, but also being entrusted with state matters.</p><p>That&#8217;s not quite what they did with Machiavelli. 
With Machiavelli, they banished him to a hamlet in the middle of the Tuscan countryside near nothing important and said, &#8220;Go sit in the country and rot, and if you&#8217;re good, we&#8217;ll invite you back.&#8221;</p><p>What everyone expects is that Machiavelli will break that promise and leave. Because he&#8217;s a well-known statesman, a scholar, a playwright, and a historian, and there are dozens of cardinals in Rome and other cities that would love to employ him. Kings of England love employing Florentines to work for them as secretaries. <a href="https://en.wikipedia.org/wiki/Kingdom_of_Naples">Kings of Naples</a> love employing Florentines to work for them as secretaries. He might go get a job tutoring the daughters of the Duke of Milan, the way <a href="https://en.wikipedia.org/wiki/Francesco_Filelfo">Francesco Filelfo</a> did when he was kicked out of Florence for opposing the Medici.</p><p>There are lots of places it&#8217;s expected an exiled Florentine intellectual will go where he will have the ear of power and be able to exert influence. He will be a mover and shaker at the court of Milan or Naples or England.</p><p>Instead, when they say to Machiavelli, &#8220;Sit in the country and rot, this is a test,&#8221; he passes the test and sits in the country faithfully and rots. If he had wanted to go be an intellectual power broker, the correct move is to run off to Rome and say, &#8220;I will give up the chance to go home the way <a href="https://en.wikipedia.org/wiki/Dante_Alighieri">Dante</a> did, but I will be a Florentine in exile, and I will write important things. I will live at the house of wealthy men who will support me and give me the ear of power, and I will exert my influence in that way.&#8221;</p><p>He does not do that. He stays in the country and he rots, and he continues writing letters home saying, &#8220;I will serve you or nothing. 
Bring me home to serve my country.&#8221; That is a weird thing to do, and not normal for the many other Florentine intellectuals who experienced similar banishments in the same period.</p><p><strong>Dwarkesh Patel</strong></p><p>How do we know that he wasn&#8217;t just trying to get back into power?</p><p><strong>Ada Palmer</strong></p><p>The answer is you read his personal letters. You read the way he talks about love of his country, and you read the way he talks to his friends. You read the letters he wrote when he discusses writing <em>The Prince</em>, and you read the comments he exchanges with the other friends that he shared it with.</p><p>His other works&#8212;<a href="https://en.wikipedia.org/wiki/The_Mandrake">his comic play</a>, which was a big hit, his <a href="https://www.gutenberg.org/files/2464/2464-h/2464-h.htm">history of Florence</a>, which was well known at the time&#8212;those he published and circulated. <em>The Prince</em> he kept in very close private circles, circulating it only with trusted, intimate friends, and then the copy that he sends in to Florence.</p><p>Yes, it&#8217;s a job application: &#8220;Please bring me back. I will work for you. I will be loyal. I support my city more than any particular iteration of my city. I support my country more than any particular regime or group that might be in power. Whatever is in power in my city, I will be faithful to it.&#8221; You see him expressing that in lots of different ways.</p><p>When in <em>The Prince</em> he says you can and should do all of these ruthless things to keep power, we have to remember that the end justifies the means when the end is the survival of your country. It&#8217;s not that the end, in general, justifies the means. Machiavelli feels very strongly that regime changes bring civil violence, and civil violence sheds blood. 
He has seen the streets of his city run with blood before.</p><p>He thinks that even life under a tyrant is better than life in a civil war, which is usually not life at all, given the massacre of the people and external conquest that are likely as a result of another regime change. So he says, &#8220;Don&#8217;t push for regime change. Even if the regime is tyrannical, more people will survive by sticking with the tyrant than by changing the regime.&#8221;</p><h3>00:58:12 - Why it was so hard for Gutenberg to make any money off the printing press</h3><p><strong>Dwarkesh Patel</strong></p><p>I want to talk about the printing press. One thing I didn&#8217;t realize before reading your book is that not only does <a href="https://en.wikipedia.org/wiki/Johannes_Gutenberg">Gutenberg</a> go bankrupt after making the most significant invention of a millennium, but his apprentices also go bankrupt.</p><p>This is at a time when people like Cosimo are willing to pay on the order of hundreds of thousands of dollars per book. So how is this possible for the guy who invents a way to make books far cheaper?</p><p><strong>Ada Palmer</strong></p><p>The problem is printed books are a mass-produced commodity in a world that does not have distribution networks for mass-produced commodities. Mass production is incredibly rare in this period. Coins are mass-produced, but that&#8217;s really about it. Almost everything is artisanally produced. When you have a mass-produced product, you need a distribution mechanism before you can sell it.</p><p>The great example is that technically e-books existed the first time anyone typed a book on a computer. Certainly in the 1970s there was such a thing as an e-book. But there was no market for e-books until the Kindle came out and made a commodity way to buy and sell e-books. Only then did the e-book industry come into existence. 
So the e-book as a commodity is several decades younger than the e-book technically existing.</p><p>In the same way, you&#8217;re Gutenberg. You have figured out how to produce 300 copies of a book for the cost of one copy of a book. You do so. You print your Bible. You have 300 Bibles. You sell seven of them to the seven people in your small landlocked German town who are legally allowed to read the Bible in a period in which only priests are allowed to read the Bible. Congratulations, Mr. Gutenberg, you have 293 Bibles, and you can&#8217;t sell them, and you go bankrupt.</p><p>There has to be a distribution mechanism for books to find their market because there are certainly 300 people in Europe that want this, but there are not 300 people in one location where it&#8217;s being produced. So Gutenberg goes bankrupt. The bank seizes his press. They try to go into the business. The bank goes bankrupt. There is so much overhead. You spend hundreds of thousands of dollars on the production cost of the books, and then you get nothing back.</p><p>Gutenberg&#8217;s apprentices build presses. They go bankrupt. They flee their debts, flee the country, leave Germany, and go to Venice. Venice is the airport hub of the Mediterranean. Venice is where you change boats. If you&#8217;re sailing from A to B, you go to Venice, you change boats, you get to the next place. The hub system has always worked well.</p><p>So if you&#8217;re printing in Venice, you print 300 Bibles, you give ten Bibles to each of thirty ships&#8217; captains going to thirty different cities. They can sell them. The first economically sustainable circulation of print is enabled by the hub system.</p><p>Then book fairs come into existence in which printers will spend all year printing a book. They go with a thousand copies of their book to a book fair where there are a thousand other printers. 
They all trade, and then they go home to their town with five copies each of 200 books instead of a thousand copies of one book, and then they sell them in bookshops. Things like the <a href="https://en.wikipedia.org/wiki/Frankfurt_Book_Fair">Frankfurt Book Fair</a>, which still exists today, developed as the distribution mechanism.</p><p>There&#8217;s a slow growth and a slow saturation. That&#8217;s really cool because one of the things people think is unique about our present information revolution is that we&#8217;re living in this sequence of successive information revolutions. We had the computer, the computer was exciting. Then we had the personal computer, then we had the internet, the cell phone, social media, and now we have different social media networks coming in successively causing crises one after the other. And then we have LLMs and other applications of machine learning and generative AI.</p><p>It&#8217;s easy to think of each of these as different tech revolutions, as if we&#8217;ve just had ten tech revolutions in a row. But really, they are all deeper penetration of one tech revolution: the development of the computer. These are all applications of computers.</p><p>In the same way, the printing press comes in in 1450, and it isn&#8217;t done shaping the world instantly. It takes forty years to even be economically sustainable. It&#8217;s not until the 1490s that printers are making money.</p><p>And then in the 1510s, it&#8217;s time for pamphlets and pamphlet distribution. Now there&#8217;s news, and news is suddenly done by print, and that&#8217;s a revolution on the same scale as the difference between computers and cell phones. 
We get the <a href="https://en.wikipedia.org/wiki/Reformation">Reformation</a>, which is enabled by pamphlets in exactly the same way that the <a href="https://en.wikipedia.org/wiki/Arab_Spring">Arab Spring</a> is enabled by cell phones.</p><p>Then we get the newspaper, another new application of the same technology that follows, like social media. It&#8217;s one information revolution having multiple successive revolutionary applications as it disseminates and eventually saturates. It moves on a timescale quite similar to the timescale in which the digital one is happening as well, so that print keeps hitting Europe with successive revolutions for 150 years.</p><p>And every couple of decades, or every decade, there&#8217;ll be a new bang. Suddenly it&#8217;s possible to get a printed pamphlet from Wittenberg to London in seventeen days. Oh my God, we can coordinate our resistance movement against the Catholics. Boom. The Reformation happens. That wasn&#8217;t possible even a decade earlier when it took months to get a pamphlet from one end of Europe to the other.</p><p>So it&#8217;s best to think of these very much in parallel, the print revolution and the digital revolution, as one big technological change in information that then has successive applications as that one technology finds new forms and disseminates more deeply and keeps having consequences over decades. It&#8217;s not multiple separate revolutions. 
It&#8217;s one ongoing information revolution.</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe other eras also have this and I just haven&#8217;t read the books about them, but from your book, I thought, &#8220;Oh, history just seems to be happening really fast, and seems to have sped up, especially religious and political history.&#8221; Obviously, the things happening in Italy, but even aside from that, you have <a href="https://en.wikipedia.org/wiki/Martin_Luther">Martin Luther</a> and the Reformation, and then just twenty years later <a href="https://en.wikipedia.org/wiki/English_Reformation">England splits off from the Catholic Church</a>, which is unprecedented in two millennia.</p><p><strong>Ada Palmer</strong></p><p>Then it has a bunch of tumults that flop, flop, flop so that every decade feels different. Here you are in 1506 being nostalgic for how the world was completely different in 1490. And you&#8217;re like, &#8220;That&#8217;s pretty fast.&#8221; Here we are in 2026 often feeling nostalgic for how things were in the year 2000.</p><p><strong>Dwarkesh Patel</strong></p><p>Is it fair to trace that back to the printing press or its offshoots, or is it just embedded?</p><p><strong>Ada Palmer</strong></p><p>It&#8217;s more that history has always moved fast. But when we teach it in high school, we&#8217;re trying to move over large chunks of time quickly, and so we pretend that it moved slowly. We have this lie that there were long periods of stagnation. But you can zoom in anywhere, and you&#8217;re going to find every decade feels different, and people in the 1320s are nostalgic for people in the 1300s.</p><p>It&#8217;s always felt like history was moving very quickly, and things rose and things fell. 
It&#8217;s the lies we tell ourselves in history books written in the 19th century that are trying to group all of these things together and make modernity special that confuse us about this.</p><p>I&#8217;m working on a paper right now about the video game <a href="https://en.wikipedia.org/wiki/Civilization_(series)">Civ</a>. Civ is the number one teacher of history in the world. It has shipped 70 million copies, and 65 percent of people on Earth who have technology play video games. Civ is the number one teacher of history, bar none, since 1991.</p><p>What does Civ tell you? Civ tells you that in antiquity, a turn is fifty years, and then in the Middle Ages, a turn is twenty-five years. Once you get into the Industrial Revolution, a turn is ten years, and then five years, and in modernity, a turn is just one year because in one year, as much happens now as happened in fifty years in antiquity. That lie is also what our textbooks tell us.</p><p>But it doesn&#8217;t matter where we zoom in. Any time I go to a talk where any historian is zooming in on any decade in any time and place, it always feels like it&#8217;s moving as fast as our present is moving.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess the difference is that technologically, we know that they weren&#8217;t moving as fast.</p><p><strong>Ada Palmer</strong></p><p>Technologically, they were moving fast. We just don&#8217;t care about those technologies anymore.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s interesting.</p><p><strong>Ada Palmer</strong></p><p>They were constantly inventing all sorts of things. We just take them for granted. The invention of chairs with backs, the invention of scissors, the invention of improved metallurgy so that steel could do things steel couldn&#8217;t do before. 
There was always technological change happening.</p><p>I&#8217;m in the middle of reading an amazing book about how, when you look at the paintings of <a href="https://en.wikipedia.org/wiki/Raphael">Raphael</a> and the few paintings we have by <a href="https://en.wikipedia.org/wiki/Michelangelo">Michelangelo</a>, the colors look like they&#8217;re really glowing, like gemstones. How did that happen? When you compare them to paintings from just a hundred years earlier somehow the colors are flatter. I&#8217;m not talking about the anatomy being more realistic. That&#8217;s separate, but the colors are flatter.</p><p>The answer is there was a sequence of revolutionary adaptations in how to process oil and how to process colors and mix them together, and then those were used to create fake gemstones, and there was a major industrial leap forward in the fake gemstone industry. Then people who were making picture frames realized they could use the same techniques from the fake gemstones to make fake gold by painting yellow over the surface of tinfoil.</p><p>And then those were used by artists who were like, &#8220;Wait, I want to make things that look like they glow like fake gemstones.&#8221; There were eleven major technical revolutions over the course of 120 years that led to those colors changing.</p><p><strong>Dwarkesh Patel</strong></p><p>Obviously progress has been happening in individual fields over time. But in this macroscopic view, and this is a big part of your book, there&#8217;s a reason that people living in the fourteenth century would say, &#8220;Look, the best time to be alive was when the Romans were around, and since then it&#8217;s just been the <a href="https://en.wikipedia.org/wiki/Dark_Ages_(historiography)">Dark Ages</a>.&#8221;</p><p>If they stood in relation to the Roman Empire as we stand to them, we would obviously notice that the world has seen so much progress since then. 
It clearly seems like the pace...</p><p><strong>Ada Palmer</strong></p><p>It&#8217;s hard to figure out when we are lying and when we are right when we say the pace picked up. One thing that makes the pace pick up in the modern day is simply the population grew and grew and grew and is now much, much larger. The majority of people who ever lived in the entire history, since humans have been humans and not hominids, have lived in the last 200 years because the population became massive. How did the population become massive? Our agriculture and our hygiene enabled it.</p><p>How did our agriculture and our hygiene improve? Half of that is continuing on the artisanal level to invent new things in the same way that the artists invented better colors. Agricultural workers invented better technologies, and agriculture was constantly improving. You&#8217;re correct that with the <a href="https://en.wikipedia.org/wiki/History_of_scientific_method">arrival of the systematic scientific method just after 1600</a>, there is a deliberate societal desire to create intentional anthropogenic progress. I&#8217;ll zoom in on the arguments made in 1600, then I&#8217;ll zoom out and unpack them.</p><p>In 1600, the idea is that history up until now has been unsystematic. People have discovered things at random, but we can create a method in which we observe the world and use inductive reasoning to figure things out from those observations to create systematic descriptions of the secret motions that underlie nature, and from that work out technologies that are good and useful for humankind.
If, as we make our observations of nature, we publish them and share them with each other, we can create a community of scientists that will share all of these discoveries with each other and with the world and therefore benefit it.</p><p>This is where, when I&#8217;m doing this in the classroom, I deliberately provoke and shock my students with the fun claim that <a href="https://en.wikipedia.org/wiki/Leonardo_da_Vinci">Leonardo da Vinci</a> was not a scientist. What I mean by that is that to be a scientist is to publish your results and share them with a community of other scientists so that they can test them, so that the whole human civilization progresses a little bit. When my friends who are chemists or my friends who are particle physicists discover something, the next goal is to share that discovery with everyone so everyone&#8217;s knowledge advances.</p><p>What does Leonardo do? He writes everything he discovers down in coded mirror writing so that nobody but him can possibly use it. He refuses to share even with his students and assistants the secrets of what he&#8217;s doing because Leonardo does not want to contribute to human progress. Leonardo wants to make unique masterpieces so that hundreds of years later, people will see them and marvel and say, &#8220;How did he do it? No one else has ever been able to replicate that method.&#8221; He wanted to be marveled at by the future exactly the way he and his peers marveled at the works of the ancients.</p><p>They look at something like the <a href="https://en.wikipedia.org/wiki/Colosseum">Colosseum</a> or the <a href="https://en.wikipedia.org/wiki/Pantheon,_Rome">Pantheon</a> in Rome with its enormous dome, and they say, &#8220;How did they do it? 
If only we could work that out, we could make one and then make sure no one else could.&#8221; <a href="https://en.wikipedia.org/wiki/Filippo_Brunelleschi">Brunelleschi</a>, who built Florence&#8217;s famous beautiful dome, deliberately burned all of his notes and schematics so that nobody else would be able to replicate his work. That is an inventor, and an engineer, but in the sense of a community of scientists, this is not a servant of human progress. This is actually a saboteur of human progress, if anything, who deliberately makes progress and then tries to cut it off at that point so that no one else can be his peer.</p><p>That is what you did as a learned inventor in the 1400s and in the 1500s. But as you get to 1600, the suggestion is different, and here I&#8217;m going to use <a href="https://en.wikipedia.org/wiki/Francis_Bacon">Francis Bacon&#8217;s</a> gorgeous <a href="https://plato.stanford.edu/entries/francis-bacon/#CriEarPhi">simile of the three insects</a>. There are three types of knowledge wielders, says Bacon.</p><p>First, there is the ant, who is the encyclopedist, who gathers information from all around the world. He learns everything he can, and he piles it up into a great big pile. He makes an anthill, and he sits on top. If he has the biggest anthill, the biggest pile of knowledge, then he&#8217;s proud of having made it. But all he does is assemble it and possess it. It&#8217;s a beautiful library, but nothing comes from it.</p><p>The second type is the system weaver, the spider who spins elaborate webs of beautiful, intricate, logical theory. You admire them, and you can get entranced and ensnared in them easily because they&#8217;re so beautiful. They&#8217;re almost hypnotic. But there&#8217;s nothing real in them. 
They&#8217;re all just spun out of the body of the spider himself, the theorist theorizing from his own mind.</p><p>The third kind, says Bacon, is the honeybee, who, gathering from among the fruits of nature, processes what he gathers through the organ of his own being to produce something which is sweet and useful for humankind. That is the scientist who gathers from nature to produce something sweet and useful for humankind.</p><p>With this rhetorical call, and with Francis Bacon&#8217;s portrait on the title page, the <a href="https://en.wikipedia.org/wiki/Royal_Society">English Academy of Sciences</a> is founded and starts publishing. The standard switches over from &#8220;You are a great achiever because you built the dome&#8221; to &#8220;You are a great achiever because you worked out how it can be done, and you shared that sweet and useful thing with all of humankind.&#8221;</p><p>Bacon says if we do this, if we make academies of sciences, we can make sure that every human generation lives in a better condition than the past. We&#8217;ll have better agriculture, fewer famines. We will have refrigeration. We&#8217;ll have chicken in winter. We will have all of these things that we aspire to. If we collaborate, each generation&#8217;s experience will be better than the last. He says that to be a scientist is the ultimate act of charity because there is no greater act of charity than to give a gift to every human who will ever live after you.</p><p>That is the rhetoric of what you would feel was happening if you&#8217;re alive in the 1620s and 1630s. <a href="https://en.wikipedia.org/wiki/Galileo_Galilei">Galileo</a> is publishing his observations, and <a href="https://en.wikipedia.org/wiki/Ren%C3%A9_Descartes">Descartes</a> is publishing his systems.
They&#8217;ve just <a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC3721262/">discovered that the heart is a pump</a> and that they were totally wrong about the <a href="https://en.wikipedia.org/wiki/Humorism">four humors theory</a>. The blood circulates, and they&#8217;re trying to figure out what it does. They have magnification, and they can see worlds of complex patterns on the wing of a flea. It sounds like the whole world is suddenly coming into view, and we&#8217;re at the beginning of progress.</p><p>If we zoom out, we would say there&#8217;d been progress the whole time. People had always been inventing things. Agriculture in France was better in 1300 than it was in 1000. Plows got better, seed got better, cabbages were bred to be bigger. People worked out better pots. There were always artisanal inventors.</p><p>In fact, that&#8217;s a lot of what Bacon is observing. He worked in the patent office as a young man, and he would see a carpenter come in to patent: &#8220;I have invented a better chisel. I&#8217;ve invented a thing that goes like this. I&#8217;m going to patent it.&#8221; He would realize that it was workers and workmen and handicraftsmen who were inventing the really useful tools. He wanted to make this systematic.</p><p>We would say there was always anthropogenic progress. In 1630, they realize there is anthropogenic progress. They think there hasn&#8217;t been. They think they&#8217;re beginning, and that history up until this point has been stagnant, but now it&#8217;s going to suddenly be full of invention as, for the first time, there will be deliberate anthropogenic progress. Really, we would say there always was and that it&#8217;s accelerating, and at this point, we realize it and articulate and describe it.</p><p>You&#8217;ve probably seen lots of graphs of history with the hockey stick graph structure, where it&#8217;s flat for a long time and then zhoops up. 
They&#8217;ll put that zhoop after the invention of the scientific method. It depends on what we&#8217;re graphing, whether that zhoop is appropriate. It also depends on how much you zoom in or zoom out.</p><p>It&#8217;s true, we do get to inventions that result in enormous increases in population 150 years after Bacon. Would we have anyway, even if it hadn&#8217;t been systematized? Probably a bit later, and we would have a slightly flatter hockey stick. But we would still have hockey sticked. In the same way that when you put mice on an island without mice, they breed and they breed and they breed and they breed and they hockey stick. Humans would also have hockey sticked. But would we have hockey sticked later? Would we have hockey sticked with more pain? When mice hockey stick, they also starve to death and eat each other. We haven&#8217;t done that yet. Go us.</p><p>Was that science? Probably. There are a lot of factors to it. So is it true that everything accelerated after 1620? In one sense, yes. In another sense, it&#8217;s a continuation of a curve that was already curving.</p><h3>01:17:34 - Why the industrial revolution didn&#8217;t happen in Italy</h3><p><strong>Dwarkesh Patel</strong></p><p>I think you might have answered a question I was about to ask. The book you recommend on <a href="https://www.adapalmer.com/">your website</a>, <em><a href="https://amzn.to/4r862y8">The Renaissance in Italy</a></em>, I keep forgetting the name of the author. Italian names are tough.</p><p><strong>Ada Palmer</strong></p><p><a href="https://en.wikipedia.org/wiki/Guido_Ruggiero">Guido Ruggiero</a>.</p><p><strong>Dwarkesh Patel</strong></p><p>In some part, he has this question: Look, in Italy, as you mentioned, in Venice, they&#8217;ve really scaled the printing press. As a result, you have the metalworking for fine typesetting. Separately, milling technology for water mills and windmills is advanced, along with gears for watches. 
So he asks, why didn&#8217;t Italy have the Industrial Revolution? I wonder, do you stand by the answer you just gave, or is it a different theory?</p><p><strong>Ada Palmer</strong></p><p>Part of it. But another is, we cannot overestimate how much richer per square meter Italy is than everywhere else. Italy is the breadbasket, and it&#8217;s also the center of Big Oil, which is to say Big Olive Oil, which was both fuel oil for light and industrial oil for production, as well as cooking and eating oil. And the other major industry of the period, which is Big Wool.</p><p>If you&#8217;re already the center of Big Finance, Big Wool, and Big Oil, do you need an industrial revolution? You&#8217;re already economically on top through the power of agriculture. It makes sense for it to have been a sort of industrial backwater area. What was England producing? Crappy quality wool?</p><p>England was so aware that it couldn&#8217;t process wool into high quality without masses of olive oil, which it couldn&#8217;t produce, that England just exported its crude wool to Florence in order to have Florence, with its olive oil reserves, produce the fine quality. Think about how a wool suit isn&#8217;t itchy, but a wool blanket often is. That wool suit isn&#8217;t itchy because lots of olive oil went into the process of producing it, at least at pre-modern tech levels. So do you want England to produce your itchy wool that people will only pay a small amount for, or do you want to export it?</p><p>It makes sense for it to have been somewhere industrially ambitious that wasn&#8217;t already economically on top to have done it. That&#8217;s one reason that industrialization doesn&#8217;t kindle in Italy. Italy is agricultural land and a finance world. It doesn&#8217;t feel like it needs a new industry.</p><p>Another factor is mining. This land is more valuable as a farm than it is as a mine. You don&#8217;t want to rip it up.
Another is it&#8217;s so subdivided because those rich cities are still mostly independent, whereas a centralized crown in England is more able to pass legislation to facilitate a massive transformation.</p><p>No city really wants to be the one where the giant industrialization is happening. It&#8217;s awful for the city. Note that the industrialization of the Industrial Revolution was mostly outside of the wealthier centers of England in the second-tier towns. They grow massively into huge industrial areas like Lancaster. So those are a plural bunch of reasons.</p><p><strong>Dwarkesh Patel</strong></p><p>But I would have also thought that the competitiveness between different Italian city-states would have made it so that if they get better textile machines before you, it&#8217;s a disaster because they&#8217;re right there.</p><p><strong>Ada Palmer</strong></p><p>This is not going to sound plausible to anybody, but it&#8217;s true. We&#8217;ve been looking at some documents recently which pretty much confirm that they did figure out how to make industrial looms in the 1400s, and they didn&#8217;t want to. They wanted to make luxuriant artisanal fabrics.</p><p><strong>Dwarkesh Patel</strong></p><p>This, by the way, was another interesting thing from the book. With the first printed books, there&#8217;s not this market of commodity things that are produced cheaply that the average person is going to be like, &#8220;Oh, if I can get this for $10.99, I&#8217;ll go buy it.&#8221; So they&#8217;re trying to make this thing look like it was produced as artisanal luxury grade.</p><p><strong>Ada Palmer</strong></p><p>Right. The first printed fonts look like handwritten scripts, and often have a blank space to illuminate it so that it looks just as fancy as manuscripts.</p><h3>01:23:02 - The Library of Alexandria isn&#8217;t where most ancient books were lost</h3><p><strong>Dwarkesh Patel</strong></p><p>One thing I wanted to ask you, back to the printing press. 
Not only does printing get cheaper, but around this time, paper itself also gets cheaper. So not just reading, but writing gets cheaper. Do you as historians see a marked change in this period in the amount of records that are taken and, as a result, our understanding?</p><p><strong>Ada Palmer</strong></p><p>A huge amount rests on whether you have a cheap writing surface. Rather than looking first at the Renaissance, let&#8217;s look at what we think of as the <a href="https://en.wikipedia.org/wiki/Fall_of_the_Western_Roman_Empire">fall of Rome</a>. One of the biggest things that happens there is that Western and Northern Europe lose access to papyrus. Papyrus is the cheap writing surface of antiquity. It is an easy plant-based writing surface.</p><p>You take this tall, thin water reed that is fibrous like asparagus. You slice it into ribbons. You set them out in the sun, a bunch of them parallel to each other sitting on a stone like noodles. You put a second row of noodles perpendicular to that on top, and then they dry in the sun, and they are naturally sticky. They stick to each other. They produce a sheet. Practically no labor has gone into this. You&#8217;ve sliced, you&#8217;ve laid out, boom.</p><p>Papyrus is a very inexpensive writing surface, and this is what enables Rome to have a bureaucracy and to have libraries in any mid-sized city. People can send letters back and forth. There can be enormous tax records. Sometimes when Egypt and Rome are at war, Egypt will be like, &#8220;No, we are angry. We&#8217;ll stop exporting papyrus.&#8221; No papyrus to Rome, and then Rome&#8217;s infrastructure will fall apart overnight because you can&#8217;t do anything if you can&#8217;t write stuff down.</p><p>Papyrus is a warm weather plant. It is killed by frost. You cannot grow it north of the frost line. 
So France, Spain, even most of Italy, you can only grow papyrus down in the very tip down in Sicily.</p><p>Without papyrus, what you&#8217;re writing on is a dead sheep. If you think of the price of a head of lettuce and the price of a leather jacket, you&#8217;re understanding the difference between a sheet of papyrus and writing on a dead sheep. Every page of a medieval book is as expensive as that much of a leather jacket. A handwritten medieval book handwritten on parchment costs as much as a house, so that a small pocket copy of a book costs as much as a studio condo. A big illuminated fancy Bible, you&#8217;re spending on that what you would spend on a villa in the countryside.</p><p>This is an enormous expense. To have a library is to be not just rich, but mega-rich. Only the wealthiest cities contain anybody who has a library. The great library of the <a href="https://en.wikipedia.org/wiki/University_of_Paris">University of Paris</a>&#8212;<em>the</em> library from Europe&#8217;s perspective&#8212;has six hundred books. There&#8217;s definitely more than six hundred books in this room. Every kiosk at an airport selling Dan Brown novels has more than six hundred books. This is nothing.</p><p>At the same time as that, in the Middle East, sultans have libraries of over a thousand books or five thousand books. There are libraries in Sub-Saharan Africa with thousands of books. There are libraries in China with thousands of books because they have cheap paper, rice paper. The Middle East has papyrus. Europe, and only Europe, is writing on a leather jacket.</p><p><strong>Dwarkesh Patel</strong></p><p>What changes around this time? How is Europe able to get paper?</p><p><strong>Ada Palmer</strong></p><p>Still zooming in on the fall of Rome. Rome had lots and lots of books on papyrus. They start falling apart because papyrus is brittle. 
Most of our knowledge from antiquity is not lost at the burning of the <a href="https://en.wikipedia.org/wiki/Library_of_Alexandria">Library of Alexandria</a>. It&#8217;s lost between 400 and 600 A.D. when the papyri are falling apart.</p><p>Here you are with a library of a thousand books, and you can only afford to make a hundred new books. You have to choose which hundred of these thousand to save because there literally is not enough industry on your continent to make enough leather to copy down all this text. You have to pick. The majority of what we lost from antiquity, we lost then.</p><p>We lost it when the papyri were falling apart. This also distorted what survived because most of the copying out was done by monks. When you have a thousand books and you can only save a hundred of them and you&#8217;re a monk, you&#8217;re like, &#8220;What will I save? I know, <a href="https://en.wikipedia.org/wiki/Augustine_of_Hippo">Saint Augustine</a>. I love Saint Augustine.&#8221; This is why we have more surviving work by Saint Augustine than the entirety of all pagan classical Latin. The subjective tastes of the people in power at the moment the papyri were falling apart ended up being an unintentional moment of censorship that biased what survives from antiquity.</p><p>Paper technology hits Europe in 800 A.D., so we&#8217;re talking about a four-hundred-year famine of a cheap writing surface. Paper is nowhere near as cheap as papyrus because you need to gather rags from used clothing. You immerse them in water, and you beat them violently using a mill for a very long time until they become a pulp. You then scoop that pulp up on a screen, and the fibers lock together. It&#8217;s sort of a slurry that looks like grits. You lift up the slurry, and it locks together into a sheet of paper.</p><p>It&#8217;s not as cheap as just growing papyrus, and it&#8217;s much more labor. You have to build a paper mill. 
If parchment is a leather jacket and papyrus is buying a head of lettuce, this is somewhere in between. What&#8217;s in between a leather jacket and a...</p><p><strong>Dwarkesh Patel</strong></p><p>This feels like a trick question.</p><p><strong>Ada Palmer</strong></p><p>This is somewhere in between, getting yourself a dozen frozen prepackaged meals, which are complex and have many ingredients. A lot of industry went into producing the actual packaging, more so than a head of lettuce. So it&#8217;s ten times as expensive, but it&#8217;s still a tenth as much as the leather jacket.</p><p>Paper comes in, and people are very wary of it. Paper is clearly not as strong as parchment. Parchment is really tough stuff. People start using paper for rough drafts, letters, sketchbooks. When you&#8217;re doing the sketch before doing a painting, you might do that on paper. But Europe has paper for four hundred years before the earliest state document is ever written on paper, to give you a sense of how people are wary of it.</p><p>It disseminates slowly. It&#8217;s still expensive. It requires industry and production, but it is a tenth as expensive as leather. Paper disseminates slowly through Europe. Again, this is one of these things where there was always technological change, and all technological changes are gradual.</p><p>Paper comes in in 800. It&#8217;s being trusted by 1200. When printing begins, they&#8217;re printing on paper, but they even print on <a href="https://en.wikipedia.org/wiki/Vellum">vellum</a>. If you&#8217;re a really rich person, you would be like, &#8220;Please print two copies on vellum for me.&#8221; Dukes like the Dukes d&#8217;Este, <a href="https://en.wikipedia.org/wiki/Isabella_d%27Este">Isabella d&#8217;Este</a>&#8212;the sister of the duke who walked around buck naked to show off that he could&#8212;specially ordered all of her books to be printed on vellum even when the rest of the print run was on paper. 
These are the very books being produced in Venice by the apprentices of Gutenberg who ran away.</p><p>At that moment in the 1490s, if you&#8217;re really rich, you might be invested in these newfangled printed books, but you&#8217;re still not trusting paper, even though paper has been there for six hundred years at that point. So again, gradual adoption of technologies and gradual trust in paper. They&#8217;re still using parchment for things, gradually less and less, but substantially over the course of the 1600s. You can even find things written on parchment in the 1700s and 1800s. British Parliament still did its records on parchment up until ten years ago, and the Vatican still does its official records on parchment now.</p><p><strong>Dwarkesh Patel</strong></p><p>This is a digression, but the numbers of how expensive a book is didn&#8217;t make sense to me just based on how much scribe time it took. You say it&#8217;s $600,000 per book, and then separately, it&#8217;s five months of scribe time. I&#8217;m like, how much are the scribes getting paid? But if it&#8217;s the paper... What changes with Gutenberg?</p><p><strong>Ada Palmer</strong></p><p>The paper and the ink. But a lot of it is scribe time.</p><p><strong>Dwarkesh Patel</strong></p><p>But Gutenberg still needs paper, right?</p><p><strong>Ada Palmer</strong></p><p>Yeah, Gutenberg needs paper. That&#8217;s why he goes bankrupt. He borrows the equivalent of about $1.5 million to buy paper, and then doesn&#8217;t make back $1.5 million worth of material when printing it.</p><p>This is what makes printing a risk. You have to start buying the paper up front. You need to buy it in a big lot so that it matches, because people don&#8217;t want the paper to suddenly be a different color within their book. 
You&#8217;re investing a lot up front, and you&#8217;re not getting anything back until you produce this slow print run, which is why printers start printing pamphlets.</p><p>They can have one press that&#8217;s slowly printing a valuable book that will take six months to print. Next to it they have another press that&#8217;s printing pamphlets where in two days they&#8217;ve printed a fashion report on what everyone was wearing at the royal wedding, which they can sell right away. It&#8217;s much cheaper, but it means they have something they can sell two or three times a week. So you have the pamphlet following the book, printing cheap news, printing scandal rags.</p><p><strong>Dwarkesh Patel</strong></p><p>Why is it cheaper? Because the material is cheaper?</p><p><strong>Ada Palmer</strong></p><p>Just because it&#8217;s only five pages long.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, I see. Got it.</p><p><strong>Ada Palmer</strong></p><p>I could grab one if you want to see one. So if we look at some examples. I&#8217;ll show you these one by one. For example, this is a pamphlet. Naked pages, short text, hand-stitched together. It would take two or four days because you print the front side and then the back side. It&#8217;s cheap. It&#8217;s ephemeral. You print a thousand of them. You sell a bunch around the town. You sell a bunch to news writers who are going to and from other cities, who will buy them and bring them to the next town.</p><p>If you&#8217;ve printed news in Milan, people who are going to Florence will want to buy your news to go there. It might be a report of a siege. It might be what people were wearing at the royal wedding. My favorite title of a pamphlet was &#8220;The Scandalous Tale of a Doctor from Padua and How He Seduced His Maid, Murdered His Wife, Murdered the Maid, Cut Out Her Heart and Ate It, and How He Was Justly Punished by God.&#8221; That was the title of the pamphlet. These things circulated around. 
Some of them were nonsense, some of them were real news. Most were combinations. But you can sell something like this cheaply in a couple of days.</p><p>Often they would have a cheap blue cover. You have seen this color before. This is the color of laundry lint, because fundamentally laundry lint is what paper is. You take rags of old clothes, you put them in water, you beat them until they become a pulp, and you skim it out with a sieve. Laundry lint is what rag paper is. If you don&#8217;t bleach it, it&#8217;s this generic blue-gray color, which is sort of the average color of what human beings wear.</p><p>That&#8217;s a copy of <em><a href="https://en.wikipedia.org/wiki/The_Gentleman%27s_Magazine">The Gentleman&#8217;s Magazine</a></em>, another example of technology taking a leap forward in the 18th century. When they invented the newspaper, they immediately had the problem of, &#8220;Oh, no. Newspapers contradict each other. We don&#8217;t know what&#8217;s true. We have to fact-check stuff.&#8221;</p><p>That one has a great fold-out. I think there&#8217;s a procession or something. That is what everybody wore at the state funeral. Instead of photographs, we have this fancy, &#8220;Here is what everyone was wearing at the state funeral.&#8221; Very exciting.</p><p>Your laundry lint, if you don&#8217;t bleach it, remains the color that it on average was. In the 18th century, they have newspapers. The newspapers are reporting news, but they don&#8217;t quite say the same thing as each other. The problem becomes, how do we know who to trust?</p><p><em>The Gentleman&#8217;s Magazine</em> was developed, and every week they would publish a roundup of that week&#8217;s news saying what each newspaper said about it, where they contradicted each other, analyzing who&#8217;s right and wrong. It was the fact-checking. This is the first magazine. It invented the word &#8220;magazine&#8221; being used in this context. 
It was an intellectual response to the fake news problem of how we reconcile what happens with newspapers.</p><p>You see these many iterations: they invent the printing press, then they invent the pamphlet, then they invent the newspaper, then they invent the magazine to cope with the newspaper. The newspaper is invented to cope with the pamphlet because you don&#8217;t know whether to trust the scandalous tale of the doctor from Padua and how he murdered his wife. Is he real? We don&#8217;t know. But if somebody publishes a newspaper that serially prints news every week, they have a reputation. They have to be respectable. You&#8217;re not going to subscribe to them if you catch them printing nonsense.</p><p>The serial nature of a newspaper was a form of accountability that made people willing to trust it over time. The newspaper is a way of fact-checking the pamphlet. The pamphlet is a way of making money while you&#8217;re printing your longer book. I will also let you have a look at papyrus.</p><p><strong>Dwarkesh Patel</strong></p><p>Thank you.</p><p><strong>Ada Palmer</strong></p><p>You can see the plaid pattern of the papyrus because it is made of two layers of strips. And there&#8217;s a papyrus scroll. That&#8217;s modern papyrus. The thing about papyrus is that in addition to being cheap, it&#8217;s very brittle. It works better in a scroll than it does folded over because the folded edge cracks really easily. If you try to make this into a <a href="https://en.wikipedia.org/wiki/Codex">codex book</a>, it&#8217;s going to be very fragile.</p><p>Here you go. This is a real 17th-century letter in absolutely indecipherable handwriting.</p><p><strong>Dwarkesh Patel</strong></p><p>On parchment?</p><p><strong>Ada Palmer</strong></p><p>On parchment. 
You can even tell, because that&#8217;s cheap parchment, which side was the outside of the animal and which side was the inside.</p><p><strong>Dwarkesh Patel</strong></p><p>The handwriting is in some sense bad, but it&#8217;s also very well aligned.</p><p><strong>Ada Palmer</strong></p><p>Tiny and precise. But here is good parchment. It is hard to believe that it&#8217;s animal skin. These are pages from a <a href="https://en.wikipedia.org/wiki/Book_of_hours">book of hours</a> from about 1480, individually hand-calligraphed. You can see that one has a hole through it. They wrote around the hole because it&#8217;s too valuable to not use that sheet.</p><p>These are paper thin. You can barely tell, if you look carefully, which side was the outside of the animal and which was the inside because one side has tiny little speckles of pores.</p><p><strong>Dwarkesh Patel</strong></p><p>Where is this from?</p><p><strong>Ada Palmer</strong></p><p>A book of hours. This is probably a French book of hours. A book of hours is a personal prayer book. Bible quotes, objects of meditation.</p><p>The book would be fat and small. This was the most common manuscript in the Middle Ages. You would carry it around in your pocket, and you&#8217;d pull it out at different times of day for personal prayer. But it also has big margins so that you can take notes in it, write down addresses, have friends write notes in it.</p><p>You use it almost like a day planner. It&#8217;s the smartphone of the period in which you make all your notes or write down people&#8217;s names. You might have celebrities you meet sign your book of hours. All sorts of neat things go into the margins as you use this to organize the day.</p><p><strong>Dwarkesh Patel</strong></p><p>That would be extremely interesting as a collector&#8217;s item, random people&#8217;s book of hours and what kinds of things they recorded.</p><p><strong>Ada Palmer</strong></p><p>Oh yeah. 
Think of a leather jacket, but how much more industrial effort went into making leather literally paper-thin like this. Huge amounts of industrial effort go into making the pages of such a book.</p><p><strong>Dwarkesh Patel</strong></p><p>My favorite example of this kind of distribution and diffusion taking longer than you would think for a very fundamental technology&#8212;well, this is now my favorite example, so my second favorite example&#8212;is oil. I interviewed <a href="https://www.dwarkesh.com/p/daniel-yergin">Daniel Yergin</a>, who wrote this big book about the history of oil. In the 1860s, <a href="https://en.wikipedia.org/wiki/Drake_Well">Drake strikes oil in Pennsylvania</a>.</p><p>It&#8217;s in the 1910s that the car is invented, the <a href="https://en.wikipedia.org/wiki/Internal_combustion_engine">internal combustion engine</a> is put into a thing which you sell millions of copies of. Until then, oil is just used for <a href="https://en.wikipedia.org/wiki/Kerosene">kerosene</a>, which is just for lighting. The actual gas is just thrown away. In fact, when the light bulb was invented, people were wondering whether <a href="https://en.wikipedia.org/wiki/Standard_Oil">Standard Oil</a> was going to go bankrupt because the main use case had gone away.</p><p><strong>Ada Palmer</strong></p><p>Oh, neat. I always think of Julius Caesar&#8217;s description of Britain when the Romans first get there. He says, &#8220;The people of Britain are so poor, they can&#8217;t afford to burn wood, so they burn rocks.&#8221; We know he&#8217;s talking about coal.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, I thought it was satire.</p><p><strong>Ada Palmer</strong></p><p>No, he&#8217;s talking about coal. 
They had coal in the days of Julius Caesar, but they didn&#8217;t figure out its massive industrial utility until many, many years later.</p><p><strong>Dwarkesh Patel</strong></p><p>There is this interesting question of why the Romans didn&#8217;t have the <a href="https://en.wikipedia.org/wiki/Industrial_Revolution">Industrial Revolution</a> because they had these <a href="https://barryyeoman.com/2010/09/the-mines-that-built-empires/">huge silver mines in Spain</a> and elsewhere, but no coal.</p><p><strong>Ada Palmer</strong></p><p>You have the Industrial Revolution when you feel you need to. That&#8217;s the thing about Gutenberg as well that a lot of people don&#8217;t think about. People are like, &#8220;Gutenberg was an inventor and invented a thing, and then it had an impact.&#8221; No. He was living in the middle of a library building boom in which there was a huge demand for books that spiked. He invented the invention in response to that cultural change.</p><p>It isn&#8217;t by chance that we got the printing press in 1450. There was a huge boom of library buildings starting in the 1410s, and inventors were trying to figure out ways to make books cheaper. They were making smaller books. They were using paper more. Paper surges before the Gutenberg movable type printing press. So Gutenberg isn&#8217;t a random genius out of nowhere. It was the moment that people needed more books. We were going to get the invention.</p><h3>01:41:21 - The Inquisition accidentally invented peer review</h3><p><strong>Dwarkesh Patel</strong></p><p>One thing you say in passing in the book is Martin Luther comes up at the exact right time, because you&#8217;ve got <a href="https://en.wikipedia.org/wiki/Girolamo_Savonarola">Savonarola</a> in the 1490s, and he&#8217;s another prophet type. I guess he&#8217;s the modern analog of somebody like <a href="https://en.wikipedia.org/wiki/Ruhollah_Khomeini">Khomeini</a> in Iran, setting up a theocratic government, but too early. 
Machiavelli you say is too late because the censorship is already in place. What is the censorship that is in place by the time of Machiavelli? What is the alternative world?</p><p><strong>Ada Palmer</strong></p><p>Machiavelli, remember, is contemporary with Luther. It&#8217;s just that he circulates his stuff very briefly and very privately. He doesn&#8217;t want a pamphlet version of his ideas out there because he only wants Florence to have it.</p><p>Luther hits the sweet spot when the pamphlet distribution network had just developed. When Savonarola printed pamphlets, they only circulated around Florence and its neighbors, Siena and Pisa. It took months for them to get farther. His movement was quickly crushed.</p><p>When Luther makes the <a href="https://en.wikipedia.org/wiki/Ninety-five_Theses">Ninety-five Theses</a> public, they&#8217;re in print in London seventeen days after he releases them in Wittenberg. The pamphlet runners go foom, foom, foom, and get the news there, and things are printed overnight and come out that fast.</p><p><strong>Dwarkesh Patel</strong></p><p>But it seems like you&#8217;re hinting that within the next two decades, there&#8217;s a new censorship regime across Europe.</p><p><strong>Ada Palmer</strong></p><p>A new censorship regime responds. The censorship regime is very effective at shaping what is printed in books, but can never keep up with pamphlets. In the same way that the government can pressure CNN, the government can&#8217;t pressure random people on a social media network. You&#8217;re not going to be able to keep up with that speed.</p><p>One of the funny problems that the <a href="https://en.wikipedia.org/wiki/Inquisition">Inquisition</a> always had when trying to persecute printers is that printers worked in the information distribution industry. They were the people who paid the news writers, whose job it is to move as fast as humanly possible between cities. Which meant that news always reached them first. 
If a printer was ever convicted by the Inquisition, they would find out before the Inquisition could possibly get there to arrest them.</p><p>The Inquisition never succeeded at arresting printers. They&#8217;d always skipped town by the time the Inquisition got there, because if you employ the news writers, you find out first what&#8217;s going on. The Inquisition can&#8217;t keep up.</p><p>When we look at censorship, there&#8217;s an intersection of four factors as to whether censorship is possible. One of them is law: Is it legal for the censorship to happen? Another one is the technology. Is it actually possible to censor this thing? You cannot censor whatever moves the information fastest because it will move the information faster than you can move.</p><p>Even if that one printer had to skip town, he will set up shop somewhere else, a new person will take over his shop, and the information will still move. So pamphlets become unpoliceable. You can try to police them, you can partially police them, but keeping pamphlets from moving around&#8230; They&#8217;re anonymous, they&#8217;re quick, they&#8217;re produced overnight, they move quickly. You just can&#8217;t keep up with them.</p><p><strong>Dwarkesh Patel</strong></p><p>Couldn&#8217;t they just punish print shops for publishing things? Just say, &#8220;This is what we like, and if you do something we don&#8217;t like, we&#8217;ll punish you,&#8221; which is how censorship in China works, for example.</p><p><strong>Ada Palmer</strong></p><p>They did. So the printer skips town. The printer moves to the next town. There is a cost to that. There&#8217;s a human cost to evading that. You&#8217;ve had to leave your home and friends behind and move to a new place, but they don&#8217;t get you. It&#8217;s also very easy to deny that the pamphlet came from you at all.</p><p>The print industry proves very difficult to censor, and we&#8217;re experiencing the same thing with social media. 
Everyone is like, &#8220;Censor the pornography on this social media channel,&#8221; and they&#8217;re like, &#8220;We just can&#8217;t. It&#8217;s too fast. There&#8217;s too much.&#8221; Or, &#8220;Censor the hate speech.&#8221; &#8220;We just can&#8217;t. It&#8217;s too fast, there&#8217;s too much.&#8221;</p><p>There are too many pamphlets, and they could crack down on one particular pamphlet shop. We have records of this. There&#8217;s a brilliant analysis in <a href="https://history.ufl.edu/directory/anton-matytsin/">Anton Matytsin&#8217;s</a> book, <em><a href="https://amzn.to/4d3dWoZ">The Specter of Skepticism in the Age of Enlightenment</a></em>. He has a great description from the notes of a raid on a clandestine bookshop. This wasn&#8217;t the printer, this was the underground bookshop that was selling illegal books, and they&#8217;re raided. It has all the details of how angry the people were about different things that the shop had.</p><p>So there was censorship and there were crackdowns, but it was a censorship that could not actually prevent circulation. It could restrict it, it could make it harder, it could make it scary, but it couldn&#8217;t prevent it.</p><p><strong>Dwarkesh Patel</strong></p><p>Before books become cheap, unless you&#8217;re fantastically wealthy, you&#8217;re reading the same couple of books&#8212;if you&#8217;ve ever read a book&#8212;again and again throughout your life.</p><p><strong>Ada Palmer</strong></p><p>Cosimo de&#8217; Medici&#8217;s father owned, I think it was twelve books.</p><p><strong>Dwarkesh Patel</strong></p><p>I want to understand the intellectual significance of rereading the exact same book again and again. Maybe the reason <a href="https://en.wikipedia.org/wiki/Petrarch">Petrarch</a> loved Cicero so much is, imagine reading the same book twenty times, hitting the same joke again and just meditating on every single point. 
There&#8217;s got to be a difference in intellectual culture as a result of treating these things as the equivalent of the Bible.</p><p><strong>Ada Palmer</strong></p><p>You really feel like you get to know the person intimately. You develop a personal relationship with the ancient author. You are participating in a conversation across the diaspora of time. It&#8217;s a one-way conversation. You&#8217;re responding to them, the future will respond to you. But there is a great deal of intimacy.</p><p>Petrarch talks about his friend Cicero and being betrayed by his friend Cicero. He finds new works of Cicero that he hadn&#8217;t read including some of Cicero&#8217;s letters in which Cicero is not following his own stoic philosophical precepts and is being petty, yelling at people about real estate, and getting all upset after his daughter&#8217;s death. You know how people get manic when there&#8217;s been a death in the family and start quarreling about everything? Cicero gets like that, and Petrarch is heartbroken.</p><p>To him it means even the wisest man in history could not conquer that urge to become irrational and petty in the face of grief. If even Cicero became irrational and petty in the face of grief, does that mean humanity is doomed to forever be irrational and petty in the face of grief? He talks about Cicero breaking his heart and his foot, because the book fell on his foot and broke it, and he got a bad infection, and he was bedridden for months.</p><p><strong>Dwarkesh Patel</strong></p><p>Totally different topic, but in 1492, <a href="https://en.wikipedia.org/wiki/Christopher_Columbus">Columbus</a> comes to the New World. They discover the New World. 
What is the reception of this news?</p><p><strong>Ada Palmer</strong></p><p>I was just at a conference a week ago in which we confirmed that there&#8217;s a Vatican document from 1100 or maybe 1200&#8212;I forget the exact year&#8212;that recognizes the existence of <a href="https://en.wikipedia.org/wiki/Vinland">Vinland</a>, i.e. Canada, where they got the information from the Vikings.</p><p><strong>Dwarkesh Patel</strong></p><p>Oh, interesting.</p><p><strong>Ada Palmer</strong></p><p>They thought it was just a little thing, but yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>So they&#8217;re rediscovering the New World. Would it be the equivalent of finding out there are aliens today? Why wasn&#8217;t it considered more significant? Why wasn&#8217;t the consensus, &#8220;This is the main thing happening right now, we&#8217;ve discovered the New World&#8221;?</p><p><strong>Ada Palmer</strong></p><p>When I teach my class on the 1490s, the students, many of whom are American, always have trouble wrapping their heads around people thinking that the New World isn&#8217;t a big deal. A big part of it is that they find the Caribbean islands, and they find the coast, and they think this is small.</p><p>The way I put it to my students is, the news comes back, we&#8217;ve found something across the water to the west. It might be even as big as the <a href="https://en.wikipedia.org/wiki/Canary_Islands">Canary Islands</a>. They&#8217;ve found something, but they don&#8217;t realize they&#8217;ve found something the scale of Europe and Africa. Actually, it&#8217;s not as big as Europe and Africa, but they found something humongous. That&#8217;s part of it. Another part of it is no matter how big and important something far away is, it&#8217;s hard to bring your mind out of the petty squabbles that are happening right around you, especially when they feel like life or death.</p><p>If it&#8217;s 1492, what is happening? 
France is about to invade Italy. Europe might be embroiled in the <a href="https://en.wikipedia.org/wiki/Italian_War_of_1494%E2%80%931495">largest war it&#8217;s seen in fifty years</a>. The <a href="https://en.wikipedia.org/wiki/Pope_Alexander_VI">papacy has just been taken over by Spain</a>. <a href="https://en.wikipedia.org/wiki/Catholic_Monarchs_of_Spain">Spain is suddenly trying to throw its weight around in Europe</a> in a way that&#8217;s unprecedented. The <a href="https://en.wikipedia.org/wiki/Hungarian%E2%80%93Ottoman_Wars#Turkish_wars_of_Matthias_Corvinus_(1458%E2%80%931490)">Ottomans have just invaded Italy and Hungary</a> and might be coming again. Also over there, there&#8217;s a new thing. Okay, great. We&#8217;ll worry about that when we&#8217;re not having three wars at the same time. But guys, we&#8217;re having three wars at the same time. Oh my God. And then Martin Luther hits Europe like a ton of bricks when they still haven&#8217;t even figured out that this is a continent and not an island. In the same way, if you&#8217;re in a country and it&#8217;s having a tumult, you worry a lot about its tumult, even if a larger tumult is happening in a faraway country. It&#8217;s hard to bring your mind out of Europe in crisis to be like, &#8220;Hey, this is a thing.&#8221;</p><p>The other is they&#8217;re inventing lots of new things, and it falls into the sphere along with the rest. They&#8217;re discovering the existence of sub-Saharan Africa, where they thought there was basically one country&#8217;s worth of stuff, south of the Sahara, Ethiopia and nothing else. Then they&#8217;re like, &#8220;Oh my God, there&#8217;s a whole big thing that sticks out.&#8221; They&#8217;re also discovering that the heart is a pump. 
That&#8217;s a bit later, but they&#8217;re discovering all sorts of stuff at the same time.</p><p>The discovery of the New World, especially when they realize how big it is, becomes an intellectual challenge where they say, &#8220;Wait, does this mean all the maps we&#8217;ve had are wrong? Does this mean the ancients were wrong about geography? Does it mean the world is a lot bigger than we used to think the world is? Let&#8217;s worry about that the same way we worry about revolutionizing our mathematics and figuring out that the sun doesn&#8217;t go around the Earth.&#8221;</p><p>These are things that are paradigm shifting. But on the other hand, does it matter whether the sun goes around the Earth or the Earth around the sun when the French are invading right now and we need to get the defenses going, and there&#8217;s a giant civil war happening, and we&#8217;re about to be betrayed? It does matter, but it also doesn&#8217;t matter. Any decade is concerned by its tumults and often fails to recognize the importance of what&#8217;s around it. That&#8217;s true of every decade.</p><p>One fun game when I study the history of censorship, which I work a lot on&#8212;my next non-fiction book is gonna be a book on the history of censorship&#8212;whatever they&#8217;re looking at, they&#8217;re always wrong, from our perspective, about what they should be worried about censoring. 
If we had a time machine and our goal is to go give them advice&#8230; Here we are in the <a href="https://en.wikipedia.org/wiki/French_Enlightenment">French Enlightenment</a>, <a href="https://en.wikipedia.org/wiki/Voltaire">Voltaire</a> and <a href="https://en.wikipedia.org/wiki/Jean-Jacques_Rousseau">Rousseau</a> and the <a href="https://en.wikipedia.org/wiki/Marquis_de_Sade">Marquis de Sade</a> and <a href="https://en.wikipedia.org/wiki/Julien_Offray_de_La_Mettrie">La Mettrie&#8217;s</a> articulations of <a href="https://en.wikipedia.org/wiki/French_materialism">materialist atheism</a> are flying around Europe. What is the Inquisition worried about? It&#8217;s worried about <a href="https://en.wikipedia.org/wiki/Jansenism">Jansenist</a> treatises about the nature of the <a href="https://en.wikipedia.org/wiki/Trinity">Trinity</a>.</p><p>Jansenism is sort of like a <a href="https://en.wikipedia.org/wiki/Reformed_Christianity">Calvinist</a> version of Catholicism. Do you want to have an incredibly terrifying authoritarian God who hates you and tells you that your soul is a worthless spider that deserves to be hurled into fire, but also have to obey the arbitrary pope in Rome? Then Jansenism is for you. It has all the grimness of Calvinism and all of the authoritarian centrality of the <a href="https://en.wikipedia.org/wiki/Catholic_Church">Roman Catholics</a>. This was a <a href="https://en.wikipedia.org/wiki/Heresy_in_Christianity">heresy</a> that was abroad in the Enlightenment, and they are so much more worried about Jansenism than they are about Voltaire.</p><p>Remember that very chapter in Matytsin&#8217;s book I mentioned where they are raiding the clandestine bookshop. They&#8217;re like, &#8220;Voltaire, fine. The banned <em><a href="https://en.wikipedia.org/wiki/Encyclop%C3%A9die">Encyclop&#233;die</a></em>, which is gonna revolutionize all thought in Europe, fine. 
Letters of <a href="https://en.wikipedia.org/wiki/Denis_Diderot">Diderot</a>, Rousseau, fine, fine. Jansenist treatises about the nature of the Trinity! Throw the book at these guys! This is the worst thing!&#8221; They really are obsessed with this incredibly petty minor heresy to the degree that when the <em>Encyclop&#233;die</em> is banned by Rome&#8230;</p><p>France likes the Encyclopedia. This is Diderot and <a href="https://en.wikipedia.org/wiki/Baron_d%27Holbach">d&#8217;Holbach&#8217;s</a> big project of universal education, to print an encyclopedia that will collect all world knowledge. They articulate it as, &#8220;Should a new dark age come upon humankind and even one copy of the encyclopedia survive, it will be sufficient to reconstruct all human progress.&#8221; That&#8217;s the goal of this thing. It&#8217;s advancing incredibly radical ideas about biology, about statecraft, about reforming the law to be rational instead of traditional, all sorts of stuff.</p><p>When that is banned by Rome, Paris is commanded... Paris loves this book. The king likes this book. The queen likes this book. She&#8217;s on record saying it was so cool being able to look up the technology that was used to make her silk pantyhose. She just loves it. Everybody loves it. France allows it to circulate despite its controversial content. But Rome says, &#8220;No, you must ban this book.&#8221; So they agree they&#8217;re gonna have the ceremonial burning, and they march the <em>Encyclop&#233;die</em> up to the fire. Then they get some Jansenist treatise about the nature of the Trinity and burn those instead, because they don&#8217;t want to burn the <em>Encyclop&#233;die</em>. They love it. They want to burn this other thing.</p><p>This is always true. If we had a time machine for the Inquisition in the 1540s, we would say, &#8220;Guys, Machiavelli, he&#8217;s really important. He&#8217;s really revolutionary. 
You gotta be looking at this.&#8221; Or we would say <a href="https://en.wikipedia.org/wiki/Lucretius">Lucretius&#8217;s</a> <em><a href="https://en.wikipedia.org/wiki/De_rerum_natura">De rerum natura</a></em>, which I did my dissertation on&#8230; Many people are familiar with <a href="https://en.wikipedia.org/wiki/Stephen_Greenblatt">Greenblatt&#8217;s</a> book, <em><a href="https://amzn.to/4lk4AYb">The Swerve</a></em>, which credits a lot of change to the materialist science that this poem articulates. There&#8217;s a much more complex story, which you know is told in my book, which refers to Greenblatt&#8217;s. If anyone enjoyed <em>The Swerve</em>, you would really enjoy the more detailed zoom-in that <em>Inventing the Renaissance</em> has. But we would say, &#8220;Guys, you should censor this.&#8221;</p><p>We literally have letters of inquisitors writing to each other saying, &#8220;We don&#8217;t need to bother censoring Lucretius. Only learned people can read it, and they know perfectly well that the false stuff is false, so it&#8217;ll just circulate and it&#8217;s fine. What we need to worry about censoring is all of these fine minutiae of <a href="https://en.wikipedia.org/wiki/Protestantism">Protestantism</a>.&#8221; The 1545 edition of the <em><a href="https://en.wikipedia.org/wiki/Index_Librorum_Prohibitorum">Index of Banned Books</a></em> says in its introduction, &#8220;We shall put the names of arch-heretics in all caps.&#8221; When I first read that, I was like, &#8220;Ooh, I want to see all my favorite arch-heretics be in all caps.&#8221; I eagerly flip to M, and Machiavelli is not in all caps. He was not important enough from their position. The all caps authors are all minor Protestant theologians. They&#8217;re all people like Calvin and <a href="https://en.wikipedia.org/wiki/Huldrych_Zwingli">Zwingli</a> and Luther and <a href="https://en.wikipedia.org/wiki/Philip_Melanchthon">Melanchthon</a>. 
They&#8217;re all doing stuff that we would say does not matter.</p><p>But an era is always wrong about what ideas and what circulation and what changes are the really big ones and are always much, much more worried about, &#8220;Oh my God, the Prince of Spain, which princess is he gonna marry? This is going to determine whether Spain is or isn&#8217;t annexed by Germany. This is the most important thing that has ever happened in the entire stream of time.&#8221; People are like, &#8220;We&#8217;ve discovered another continent,&#8221; and they&#8217;re like, &#8220;We don&#8217;t care. We just wanna know who&#8217;s gonna marry Charles.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s a very profound observation. It was really interesting to learn from your book that of all the thousands of people killed during the Inquisition, one guy was executed for atheism.</p><p><strong>Ada Palmer</strong></p><p>Science-related stuff.</p><p><strong>Dwarkesh Patel</strong></p><p>And even he had these ideas of reincarnation or...</p><p><strong>Ada Palmer</strong></p><p>I think probably the number executed for atheism would be about 100. There are 12 total trials of scientists about science. <a href="https://en.wikipedia.org/wiki/Galileo_Galilei">Galileo</a> is one. <a href="https://en.wikipedia.org/wiki/Giordano_Bruno">Giordano Bruno</a> is one. Giordano Bruno is the only one executed. 
Of those 12 trials, only three were convicted.</p><p>Hundreds of thousands of trials for Judaizing, which is theoretically contaminating Christianity with Jewish thought, and all of these other minutiae of oppression and segregation of populations, executions for paganism, meaning practicing your indigenous religion in a colonized space&#8230; Hundreds of thousands of executions for that, one for science.</p><p><strong>Dwarkesh Patel</strong></p><p>I recently got interested in the story of <a href="https://en.wikipedia.org/wiki/Johannes_Kepler">Kepler</a> just because <a href="https://www.johndcook.com/blog/2018/04/03/planets-and-platonic-solids/">the way he discovers the laws of planetary motion is so whimsical with the theory of Platonic objects</a>. While he&#8217;s going through <a href="https://en.wikipedia.org/wiki/Tycho_Brahe">Brahe&#8217;s</a> data and coming up with the <a href="https://en.wikipedia.org/wiki/Kepler%27s_laws_of_planetary_motion">laws of planetary motion</a>, he is the imperial mathematician for the <a href="https://en.wikipedia.org/wiki/House_of_Habsburg">Habsburg</a> emperor, which basically means that he&#8217;s doing astrology for a general. Will we win the battle or whatever.</p><p>Then he gets excommunicated, not for the laws of planetary motion, but because he&#8217;s a Lutheran. In fact, his mother is tried for witchcraft. Again, has nothing to do with science, it&#8217;s just because she&#8217;s also a Lutheran.</p><p><strong>Ada Palmer</strong></p><p><a href="https://en.wikipedia.org/wiki/John_Milton">Milton</a> of <em><a href="https://en.wikipedia.org/wiki/Paradise_Lost">Paradise Lost</a></em> fame wrote our first <a href="https://en.wikipedia.org/wiki/Areopagitica">big defense of the free press</a>. This is in the moment in the early 1600s when England doesn&#8217;t yet have systematic censorship law. 
It has ad hoc, &#8220;Hey, this book is bad,&#8221; but it doesn&#8217;t have systematic, &#8220;You must submit all books to a censor,&#8221; the way the Catholic world does by that point. The Catholic world developed it in order to fight Protestantism.</p><p>There&#8217;s a lot of support for creating censorship in England at the time because there&#8217;s anxiety about Papists plotting against our nice non-Catholic country, trying to undermine it. There&#8217;s a general feeling of anxiety. There&#8217;s also deliberate moral panic whipped up by politicians and power-seeking people who whip up a deliberate moral panic about books, the same way in 1954 there was a moral panic about comic books or the same way there was a moral panic about Dungeons &amp; Dragons in the &#8216;90s. There&#8217;s a moral panic about scary and dangerous books and pamphlets. So there&#8217;s a movement to create state censorship for the first systematic time in England.</p><p>Milton writes this big treatise about why freedom of the press is important, the <a href="https://en.wikipedia.org/wiki/Areopagitica">Areopagitica</a>. It&#8217;s a beautifully written rhetorical piece that presents the importance of how we must trust truth to rise purely to the top. We must let free voices move, otherwise you&#8217;re gonna create a situation where people are writing for the censor first and for the public second. It will constrain people&#8217;s thoughts in the way that we know chilling effects and fear do. It&#8217;s a beautiful treatise. He fails. The censorship regime passes.</p><p><em>Paradise Lost</em> is published under the censorious regime. It goes through the censorship. The one line they tell him to change is about astrology. 
They&#8217;re like, &#8220;It&#8217;s perfectly fine having Satan be your charismatic protagonist and God be kind of a jackass, and also having Satan spout ferocious anti-monarchical rhetoric copied from revolutionary pamphlets that are circulating in the British colonies so that he&#8217;s actually parroting republican, anti-monarchical rhetoric, very dangerous stuff in the treatise. That&#8217;s fine. But this one line about a comet causing a thing to happen, no, no, no. Astrology is gonna confuse people&#8217;s souls.&#8221; You&#8217;re like, &#8220;Guys, speaking as a time traveler, you&#8217;re so wrong about what you&#8217;re censoring.&#8221; They always are.</p><p><strong>Dwarkesh Patel</strong></p><p>You have one sentence which I couldn&#8217;t trace down, which I found very interesting. You said, &#8220;In the late 17th century, the most extensive library in all of Europe is the one in the Vatican run by the inquisitors.&#8221;</p><p><strong>Ada Palmer</strong></p><p>Not the library, the most extensive experimental laboratory. <a href="https://en.wikipedia.org/wiki/Daniele_Macuglia">Daniele Macuglia</a> is the scholar there. <a href="https://pubmed.ncbi.nlm.nih.gov/32174230/">That&#8217;s from his dissertation</a>. I think it&#8217;s been published now, but I don&#8217;t know if it&#8217;s actually out in English. It&#8217;s out in Italian. He works on the Inquisition and the immediate aftermath of Galileo.</p><p>They saw themselves as guarantors of truth and of accuracy in information. So they decided after Galileo that they had a duty to verify the truth of the books that they were sent to censor. If people were going to be doing mechanical experiments, they needed to repeat the mechanical experiments to see whether they were true. So they effectively invented peer review, which is to say they invented a second laboratory trying to recreate the results of the first. 
There are these amazing people who by day are inquisitors and by night are going home to write their own scientific treatises as they do these experiments. It&#8217;s not what we expect, but history is never what we expect.</p><p><strong>Dwarkesh Patel</strong></p><p>Seems like a good place to close. Ada, thank you very much.</p><p><strong>Ada Palmer</strong></p><p>Thank you.</p>]]></content:encoded></item><item><title><![CDATA[Dario Amodei — "We are near the end of the exponential"]]></title><description><![CDATA["That's why I'm sending this message of urgency"]]></description><link>https://www.dwarkesh.com/p/dario-amodei-2</link><guid isPermaLink="false">https://www.dwarkesh.com/p/dario-amodei-2</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Fri, 13 Feb 2026 16:46:36 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/187852154/b76fd92c0474a5f48cf339e1eaac7dae.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Dario Amodei thinks we are just a few years away from &#8220;a country of geniuses in a data center&#8221;. 
In this episode, we discuss what to make of the scaling hypothesis in the current RL regime, how AI will diffuse throughout the economy, whether Anthropic is underinvesting in compute given their timelines, how frontier labs will ever make money, whether regulation will destroy the boons of this technology, US-China competition, and much more.</p><p>Watch on <a href="https://youtu.be/n1E9IZfvGMA">YouTube</a>; listen on <a href="https://podcasts.apple.com/us/podcast/dario-amodei-the-highest-stakes-financial-model-in-history/id1516093381?i=1000749621800">Apple Podcasts</a> or <a href="https://open.spotify.com/episode/2ZNrpVSrgZMlDwQinl20Ay?si=9D4aG1l7S-2wzLsiILRLIg">Spotify</a>.</p><div id="youtube2-n1E9IZfvGMA" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;n1E9IZfvGMA&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/n1E9IZfvGMA?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><h3>Sponsors</h3><ul><li><p><a href="https://labelbox.com/dwarkesh">Labelbox</a> can get you the RL tasks and environments you need. Their massive network of subject-matter experts ensures realism across domains, and their in-house tooling lets them continuously tweak task difficulty to optimize learning. Reach out at <a href="https://labelbox.com/dwarkesh">labelbox.com/dwarkesh</a></p></li><li><p><a href="https://janestreet.com/dwarkesh">Jane Street</a> sent me another puzzle&#8230; this time, they&#8217;ve trained backdoors into 3 different language models &#8212; they want you to find the triggers. Jane Street isn&#8217;t even sure this is possible, but they&#8217;ve set aside $50,000 for the best attempts and write-ups. 
They&#8217;re accepting submissions until April 1st at <a href="https://janestreet.com/dwarkesh">janestreet.com/dwarkesh</a></p></li><li><p><a href="https://mercury.com/personal-banking">Mercury</a>&#8217;s personal accounts make it easy to share finances with a partner, a roommate&#8230; or OpenClaw. Last week, I wanted to try OpenClaw for myself, so I used Mercury to spin up a virtual debit card with a small spend limit, and then I let my agent loose. No matter your use case, apply at <a href="https://mercury.com/personal-banking">mercury.com/personal-banking</a></p></li></ul><h2><strong>Timestamps</strong></h2><p><a href="https://www.dwarkesh.com/i/187852154/000000-what-exactly-are-we-scaling">(00:00:00) - What exactly are we scaling?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/001236-is-diffusion-cope">(00:12:36) - Is diffusion cope?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/002942-is-continual-learning-necessary-how-will-it-be-solved">(00:29:42) - Is continual learning necessary?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/004620-if-agi-is-imminent-why-not-buy-more-compute">(00:46:20) - If AGI is imminent, why not buy more compute?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/005849-how-will-ai-labs-actually-make-profit">(00:58:49) - How will AI labs actually make profit?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/013119-will-regulations-destroy-the-boons-of-agi">(01:31:19) - Will regulations destroy the boons of AGI?</a></p><p><a href="https://www.dwarkesh.com/i/187852154/014741-why-cant-china-and-america-both-have-a-country-of-geniuses-in-a-datacenter">(01:47:41) - Why can&#8217;t China and America both have a country of geniuses in a datacenter?</a></p><h2><strong>Transcript</strong></h2><h3>00:00:00 - What exactly are we scaling?</h3><p><strong>Dwarkesh Patel</strong></p><p><a href="https://www.dwarkesh.com/p/dario-amodei">We talked three years ago</a>. 
In your view, what has been the biggest update over the last three years? What has been the biggest difference between what it felt like then versus now?</p><p><strong>Dario Amodei</strong></p><p>Broadly speaking, the exponential of the underlying technology has gone about as I expected it to go. There&#8217;s plus or minus a year or two here and there. I don&#8217;t know that I would&#8217;ve predicted the specific direction of code.</p><p>But when I look at the exponential, it is roughly what I expected in terms of the march of the models from smart high school student to smart college student to beginning to do PhD and professional stuff, and in the case of code reaching beyond that. The frontier is a little bit uneven, but it&#8217;s roughly what I expected.</p><p>What has been the most surprising thing is the lack of public recognition of how close we are to the end of the exponential. To me, it is absolutely wild that you have people &#8212; within the bubble and outside the bubble &#8212; talking about the same tired, old hot-button political issues, when we are near the end of the exponential.</p><p><strong>Dwarkesh Patel</strong></p><p>I want to understand what that exponential looks like right now. The first question I asked you when we recorded three years ago was, &#8220;what&#8217;s up with <a href="https://www.dwarkesh.com/p/will-scaling-work">scaling</a> and why does it work?&#8221; I have a similar question now, but it feels more complicated. At least from the public&#8217;s point of view, three years ago there were well-known public trends across many orders of magnitude of compute where you could see how the loss improves.</p><p>Now we have <a href="https://en.wikipedia.org/wiki/Reinforcement_learning">RL</a> <a href="https://www.tobyord.com/writing/how-well-does-rl-scale">scaling</a> and there&#8217;s no publicly known <a href="https://en.wikipedia.org/wiki/Neural_scaling_law">scaling law</a> for it. It&#8217;s not even clear what the story is. 
Is this supposed to be teaching the model skills? Is it supposed to be teaching meta-learning? What is the <a href="https://gwern.net/scaling-hypothesis">scaling hypothesis</a> at this point?</p><p><strong>Dario Amodei</strong></p><p>I actually have the same hypothesis I had even all the way back in 2017. I think I talked about it last time, but I wrote a doc called <a href="http://corley.ai/the-blob-that-ate-ai/">&#8220;The Big Blob of Compute Hypothesis&#8221;</a>. It wasn&#8217;t about the scaling of language models in particular. When I wrote it <a href="https://en.wikipedia.org/wiki/GPT-1">GPT-1</a> had just come out.</p><p>That was one among many things. Back in those days there was robotics. People tried to work on reasoning as a separate thing from <a href="https://en.wikipedia.org/wiki/Large_language_model">language models</a>, and there was scaling of the kind of RL that happened in <a href="https://en.wikipedia.org/wiki/AlphaGo">AlphaGo</a> and in <a href="https://en.wikipedia.org/wiki/OpenAI_Five">Dota</a> at <a href="https://en.wikipedia.org/wiki/OpenAI">OpenAI</a>. People remember StarCraft at <a href="https://en.wikipedia.org/wiki/Google_DeepMind">DeepMind</a>, <a href="https://en.wikipedia.org/wiki/AlphaStar_(software)">AlphaStar</a>.</p><p>It was written as a more general document. <a href="https://www.dwarkesh.com/p/richard-sutton">Rich Sutton</a> put out <a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html">&#8220;The Bitter Lesson&#8221;</a> a couple years later. The hypothesis is basically the same. What it says is that all the cleverness, all the techniques, all the &#8220;we need a new method to do something&#8221;, that doesn&#8217;t matter very much. There are only a few things that matter. I think I listed seven of them.</p><p>One is how much raw compute you have. The second is the quantity of data. The third is the quality and distribution of data. It needs to be a broad distribution. The fourth is how long you train for. 
The fifth is that you need an objective function that can scale to the moon. The <a href="https://www.moveworks.com/us/en/resources/ai-terms-glossary/pre-training">pre-training</a> objective function is one such objective function. Another is the RL objective function that says you have a goal, you&#8217;re going to go out and reach the goal.</p><p>Within that, there&#8217;s objective rewards like you see in math and coding, and there&#8217;s more subjective rewards like you see in <a href="https://en.wikipedia.org/wiki/Reinforcement_learning_from_human_feedback">RLHF</a> or higher-order versions of that. Then the sixth and seventh were things around <a href="https://en.wikipedia.org/wiki/Normalization_(machine_learning)">normalization</a> or conditioning, just getting the numerical stability so that the big blob of compute flows in this <a href="https://en.wikipedia.org/wiki/Laminar_flow">laminar</a> way instead of running into problems.</p><p>That was the hypothesis, and it&#8217;s a hypothesis I still hold. I don&#8217;t think I&#8217;ve seen very much that is not in line with it. The <a href="https://blogs.nvidia.com/blog/ai-scaling-laws/">pre-training scaling laws</a> were one example of what we see there. Those have continued going. Now it&#8217;s been widely reported, we feel good about pre-training. It&#8217;s continuing to give us gains.</p><p>What has changed is that now we&#8217;re also seeing the same thing for RL. We&#8217;re seeing a pre-training phase and then an RL phase on top of that. With RL, it&#8217;s actually just the same. Even other companies have published things in some of their releases that say, &#8220;We train the model on math contests &#8212; <a href="https://en.wikipedia.org/wiki/American_Invitational_Mathematics_Examination">AIME</a> or other things &#8212; and how well the model does is log-linear in how long we&#8217;ve trained it.&#8221;</p><p>We see that as well, and it&#8217;s not just math contests. 
It&#8217;s a wide variety of RL tasks. We&#8217;re seeing the same scaling in RL that we saw for pre-training.</p><p><strong>Dwarkesh Patel</strong></p><p>You mentioned Rich Sutton and &#8220;The Bitter Lesson&#8221;. <a href="https://www.dwarkesh.com/p/richard-sutton">I interviewed him last year</a>, and he&#8217;s actually very non-LLM-pilled. I don&#8217;t know if this is his perspective, but one way to paraphrase his objection is: Something which possesses the true core of human learning would not require all these billions of dollars of data and compute and these bespoke environments, to learn how to use Excel, how to use PowerPoint, how to navigate a web browser. The fact that we have to build in these skills using these RL environments hints that we are actually lacking a core human learning algorithm. So we&#8217;re scaling the wrong thing.</p><p>That does raise the question. Why are we doing all this RL scaling if we think there&#8217;s something that&#8217;s going to be human-like in its ability to learn on the fly?</p><p><strong>Dario Amodei</strong></p><p>I think this puts together several things that should be thought of differently. There is a genuine puzzle here, but it may not matter. In fact, I would guess it probably doesn&#8217;t matter. There is an interesting thing. Let me take the RL out of it for a second, because I actually think it&#8217;s a red herring to say that RL is any different from pre-training in this matter.</p><p>If we look at pre-training scaling, it was very interesting back in 2017 when <a href="https://scholar.google.com/citations?user=dOad5HoAAAAJ&amp;hl=en">Alec Radford</a> was doing GPT-1. The models before GPT-1 were trained on datasets that didn&#8217;t represent a wide distribution of text. You had very standard language modeling benchmarks. GPT-1 itself was trained on a bunch of fanfiction, I think actually.</p><p>It was literary text, which is a very small fraction of the text you can get. 
In those days it was like a billion words or something, so small datasets representing a pretty narrow distribution of what you can see in the world. It didn&#8217;t generalize well. If you did better on some fanfiction corpus, it wouldn&#8217;t generalize that well to other tasks.</p><p>We had all these measures. We had all these measures of how well it did at predicting all these other kinds of texts. It was only when you trained over all the tasks on the internet &#8212; when you did a general internet scrape from something like <a href="https://en.wikipedia.org/wiki/Common_Crawl">Common Crawl</a> or scraping links in Reddit, which is what we did for <a href="https://en.wikipedia.org/wiki/GPT-2">GPT-2</a> &#8212; that you started to get generalization.</p><p>I think we&#8217;re seeing the same thing on RL. We&#8217;re starting first with simple RL tasks like training on math competitions, then moving to broader training that involves things like code. Now we&#8217;re moving to many other tasks. I think then we&#8217;re going to increasingly get generalization. So that kind of takes out the RL vs. pre-training side of it.</p><p>But there is a puzzle either way, which is that in pre-training we use trillions of tokens. Humans don&#8217;t see trillions of words. So there is an actual sample efficiency difference here. There is actually something different here. The models start from scratch and they need much more training. But we also see that once they&#8217;re trained, if we give them a long <a href="https://www.ibm.com/think/topics/context-window">context length</a> of a million &#8212; the only thing blocking long context is <a href="https://hazelcast.com/foundations/ai-machine-learning/machine-learning-inference/">inference</a> &#8212; they&#8217;re very good at learning and adapting within that context.</p><p>So I don&#8217;t know the full answer to this. 
I think there&#8217;s something going on where pre-training is not like the process of humans learning, but it&#8217;s somewhere between the process of humans learning and the process of human evolution. We get many of our priors from evolution. Our brain isn&#8217;t just a blank slate. <a href="https://en.wikipedia.org/wiki/The_Blank_Slate">Whole books have been written about this.</a></p><p>The language models are much more like blank slates. They literally start as random <a href="https://www.geeksforgeeks.org/deep-learning/the-role-of-weights-and-bias-in-neural-networks/">weights</a>, whereas the human brain starts with all these regions connected to all these inputs and outputs. Maybe we should think of pre-training &#8212; and for that matter, RL as well &#8212; as something that exists in the middle space between human evolution and human on-the-spot learning. And we should think of the in-context learning that the models do as something between long-term human learning and short-term human learning.</p><p>So there&#8217;s this hierarchy. There&#8217;s evolution, there&#8217;s long-term learning, there&#8217;s short-term learning, and there&#8217;s just human reaction. The LLM phases exist along this spectrum, but not necessarily at exactly the same points. There&#8217;s no analog to some of the human modes of learning; the LLMs are falling in between the points. Does that make sense?</p><p><strong>Dwarkesh Patel</strong></p><p>Yes, although some things are still a bit confusing. For example, if the analogy is that this is like evolution so it&#8217;s fine that it&#8217;s not sample efficient, then if we&#8217;re going to get a super sample-efficient agent from <a href="https://www.lakera.ai/blog/what-is-in-context-learning">in-context learning</a>, why are we bothering to build all these RL environments?</p><p>There are companies whose work seems to be teaching models how to use this API, how to use Slack, how to use whatever. 
It&#8217;s confusing to me why there&#8217;s so much emphasis on that if the kind of agent that can just learn on the fly is emerging or has already emerged.</p><p><strong>Dario Amodei</strong></p><p>I can&#8217;t speak for the emphasis of anyone else. I can only talk about how we think about it. The goal is not to teach the model every possible skill within RL, just as we don&#8217;t do that within pre-training. Within pre-training, we&#8217;re not trying to expose the model to every possible way that words could be put together. Rather, the model trains on a lot of things and then reaches generalization across pre-training.</p><p>That was the transition from GPT-1 to GPT-2 that I saw up close. The model reaches a point. I had these moments where I was like, &#8220;Oh yeah, you just give the model a list of numbers &#8212; this is the cost of the house, this is the square feet of the house &#8212; and the model completes the pattern and does linear regression.&#8221; Not great, but it does it, and it&#8217;s never seen that exact thing before.</p><p>So to the extent that we are building these RL environments, the goal is very similar to what was done five or ten years ago with pre-training. We&#8217;re trying to get a whole bunch of data, not because we want to cover a specific document or a specific skill, but because we want to generalize.</p><h3>00:12:36 - Is diffusion cope?</h3><p><strong>Dwarkesh Patel</strong></p><p>I think the framework you&#8217;re laying down obviously makes sense. We&#8217;re making progress toward <a href="https://en.wikipedia.org/wiki/Artificial_general_intelligence">AGI</a>. Nobody at this point disagrees we&#8217;re going to achieve AGI this century. The crux is you say we&#8217;re hitting the end of the exponential. 
Somebody else looks at this and says, &#8220;We&#8217;ve been making progress since 2012, and by 2035 we&#8217;ll have a human-like agent.&#8221;</p><p>Obviously we&#8217;re seeing in these models the kinds of things that evolution did, or that learning within a human lifetime does. I want to understand what you&#8217;re seeing that makes you think it&#8217;s one year away and not ten years away.</p><p><strong>Dario Amodei</strong></p><p>There are two claims you could make here, one stronger and one weaker. Starting with the weaker claim, when I first saw the scaling back in 2019, I wasn&#8217;t sure. This was a 50/50 thing. I thought I saw something. My claim was that this was much more likely than anyone thinks. Maybe there&#8217;s a 50% chance this happens.</p><p>On the basic hypothesis of, as you put it, within ten years we&#8217;ll get to what I call a &#8220;country of geniuses in a data center&#8221;, I&#8217;m at 90% on that. It&#8217;s hard to go much higher than 90% because the world is so unpredictable. Maybe the irreducible uncertainty puts us at 95%, where you get to things like multiple companies having internal turmoil, <a href="https://en.wikipedia.org/wiki/Chinese_unification">Taiwan gets invaded</a>, all the <a href="https://en.wikipedia.org/wiki/Semiconductor_fabrication_plant">fabs</a> get blown up by missiles.</p><p><strong>Dwarkesh Patel</strong></p><p>Now you&#8217;ve jinxed us, Dario.</p><p><strong>Dario Amodei</strong></p><p>You could construct a 5% world where things get delayed for ten years. There&#8217;s another 5% which is that I&#8217;m very confident on tasks that can be verified. With coding, except for that irreducible uncertainty, I think we&#8217;ll be there in one or two years. 
There&#8217;s no way we will not be there in ten years in terms of being able to do end-to-end coding.</p><p>My one little bit of fundamental uncertainty, even on long timescales, is about tasks that aren&#8217;t verifiable: planning a mission to Mars; doing some fundamental scientific discovery like CRISPR; writing a novel. It&#8217;s hard to verify those tasks. I am almost certain we have a reliable path to get there, but if there&#8217;s a little bit of uncertainty it&#8217;s there. On the ten-year timeline I&#8217;m at 90%, which is about as certain as you can be. I think it&#8217;s crazy to say that this won&#8217;t happen by 2035. In some sane world, it would be outside the mainstream.</p><p><strong>Dwarkesh Patel</strong></p><p>But the emphasis on verification hints to me a lack of belief that these models are generalized. If you think about humans, we&#8217;re both good at things for which we get verifiable reward and things for which we don&#8217;t.</p><p><strong>Dario Amodei</strong></p><p>No, this is why I&#8217;m almost sure. We already see substantial generalization from things that verify to things that don&#8217;t. We&#8217;re already seeing that.</p><p><strong>Dwarkesh Patel</strong></p><p>But it seems like you were emphasizing this as a spectrum which will split apart the domains in which we see more progress. That doesn&#8217;t seem like how humans get better.</p><p><strong>Dario Amodei</strong></p><p>The world in which we don&#8217;t get there is the world in which we do all the verifiable things. Many of them generalize, but we don&#8217;t fully get there. We don&#8217;t fully color in the other side of the box. It&#8217;s not a binary thing.</p><p><strong>Dwarkesh Patel</strong></p><p>Even if generalization is weak and you can only do verifiable domains, it&#8217;s not clear to me you could automate software engineering in such a world. 
You are &#8220;a software engineer&#8221; in some sense, but part of being a software engineer for you involves <a href="https://www.darioamodei.com/">writing long memos</a> about your grand vision.</p><p><strong>Dario Amodei</strong></p><p>I don&#8217;t think that&#8217;s part of the job of <a href="https://en.wikipedia.org/wiki/Software_engineering">SWE</a>. That&#8217;s part of the job of the company, not SWE specifically. But SWE does involve design documents and other things like that. The models are already pretty good at writing comments. Again, I&#8217;m making much weaker claims here than I believe, to distinguish between two things. We&#8217;re already almost there for software engineering.</p><p><strong>Dwarkesh Patel</strong></p><p>By what metric? There&#8217;s one metric which is how many lines of code are written by AI. If you consider other productivity improvements in the history of software engineering, <a href="https://en.wikipedia.org/wiki/Compiler">compilers</a> write all the lines of software. There&#8217;s a difference between how many lines are written and how big the productivity improvement is. &#8220;We&#8217;re almost there&#8221; meaning&#8230; How big is the productivity improvement, not just how many lines are written by AI?</p><p><strong>Dario Amodei</strong></p><p>I actually agree with you on this. I&#8217;ve made a series of predictions on code and software engineering. I think people have repeatedly misunderstood them. Let me lay out the spectrum.</p><p>About eight or nine months ago, I said the AI model will be writing 90% of the lines of code in three to six months. That happened, at least at some places. It happened at <a href="https://en.wikipedia.org/wiki/Anthropic">Anthropic</a>, happened with many people downstream using our models. But that&#8217;s actually a very weak criterion. People thought I was saying that we won&#8217;t need 90% of the software engineers. Those things are worlds apart. 
The spectrum is: 90% of code is written by the model, 100% of code is written by the model. That&#8217;s a big difference in productivity.</p><p>90% of the end-to-end SWE tasks &#8212; including things like compiling, setting up clusters and environments, testing features, writing memos &#8212; are done by the models. 100% of today&#8217;s SWE tasks are done by the models. Even when that happens, it doesn&#8217;t mean software engineers are out of a job. There are new higher-level things they can do, where they can manage. Then further down the spectrum, there&#8217;s 90% less demand for SWEs, which I think will happen but this is a spectrum.</p><p>I wrote about it in <a href="https://www.darioamodei.com/essay/the-adolescence-of-technology">&#8220;The Adolescence of Technology&#8221;</a> where I went through this kind of spectrum with farming. I actually totally agree with you on that. These are very different benchmarks from each other, but we&#8217;re proceeding through them super fast.</p><p><strong>Dwarkesh Patel</strong></p><p>Part of your vision is that going from 90 to 100 is going to happen fast, and that it leads to huge productivity improvements. But what I notice is that even in greenfield projects people start with <a href="https://claude.com/product/claude-code">Claude Code</a> or something, people report starting a lot of projects&#8230; Do we see in the world out there a renaissance of software, all these new features that wouldn&#8217;t exist otherwise? At least so far, it doesn&#8217;t seem like we see that.</p><p>So that does make me wonder. Even if I never had to intervene with Claude Code, the world is complicated. Jobs are complicated. Closing the loop on self-contained systems, whether it&#8217;s just writing software or something, how much broader gains would we see just from that? 
Maybe that should dilute our estimation of the &#8220;country of geniuses&#8221;.</p><p><strong>Dario Amodei</strong></p><p>I simultaneously agree with you that it&#8217;s a reason why these things don&#8217;t happen instantly, but at the same time, I think the effect is gonna be very fast. You could have these two poles. One is that AI is not going to make progress. It&#8217;s slow. It&#8217;s going to take forever to diffuse within the economy. <a href="https://en.wikipedia.org/wiki/Diffusion_(business)">Economic diffusion</a> has become one of these buzzwords that&#8217;s a reason why we&#8217;re not going to make AI progress, or why AI progress doesn&#8217;t matter.</p><p>The other axis is that we&#8217;ll get <a href="https://en.wikipedia.org/wiki/Recursive_self-improvement">recursive self-improvement</a>, the whole thing. Can&#8217;t you just draw an exponential line on the curve? We&#8217;re going to have <a href="https://en.wikipedia.org/wiki/Dyson_sphere">Dyson spheres</a> around the sun so many nanoseconds after we get recursive. I&#8217;m completely caricaturing the view here, but there are these two extremes.</p><p>But what we&#8217;ve seen from the beginning, at least if you look within Anthropic, there&#8217;s this bizarre 10x per year growth in revenue that we&#8217;ve seen. So in 2023, it was zero to $100 million. In 2024, it was $100 million to $1 billion. In 2025, it was $1 billion to $9-10 billion.</p><p><strong>Dwarkesh Patel</strong></p><p>You guys should have just bought a billion dollars of your own products so you could just&#8230;</p><p><strong>Dario Amodei</strong></p><p>And the first month of this year, that exponential is... You would think it would slow down, but we added another few billion to revenue in January. Obviously that curve can&#8217;t go on forever. The GDP is only so large. I would even guess that it bends somewhat this year, but that is a fast curve. That&#8217;s a really fast curve. 
I would bet it stays pretty fast even as the scale goes to the entire economy.</p><p>So I think we should be thinking about this middle world where things are extremely fast, but not instant, where they take time because of economic diffusion, because of the need to close the loop. Because it&#8217;s fiddly: &#8220;I have to do <a href="https://en.wikipedia.org/wiki/Change_management">change management</a> within my enterprise&#8230; I set this up, but I have to change the security permissions on this in order to make it actually work&#8230; I had this old piece of software that checks the model before it&#8217;s compiled and released and I have to rewrite it. Yes, the model can do that, but I have to tell the model to do that. It has to take time to do that.&#8221;</p><p>So I think everything we&#8217;ve seen so far is compatible with the idea that there&#8217;s one fast exponential that&#8217;s the capability of the model. Then there&#8217;s another fast exponential that&#8217;s downstream of that, which is the diffusion of the model into the economy. Not instant, not slow, much faster than any previous technology, but it has its limits. When I look inside Anthropic, when I look at our customers: fast adoption, but not infinitely fast.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I try a hot take on you?</p><p><strong>Dario Amodei</strong></p><p>Yeah.</p><p><strong>Dwarkesh Patel</strong></p><p>I feel like diffusion is cope that people say. When the model isn&#8217;t able to do something, they&#8217;re like, &#8220;oh, but it&#8217;s a diffusion issue.&#8221; But then you should use the comparison to humans. You would think that the inherent advantages that AIs have would make diffusion a much easier problem for new AIs getting onboarded than new humans getting onboarded. An AI can read your entire Slack and your drive in minutes. They can share all the knowledge that the other copies of the same instance have. 
You don&#8217;t have this adverse selection problem when you&#8217;re hiring AI, so you can just hire copies of a vetted AI model.</p><p>Hiring a human is so much more of a hassle. People hire humans all the time. We pay humans upwards of $50 trillion in wages because they&#8217;re useful, even though in principle it would be much easier to integrate AIs into the economy than it is to hire humans. The diffusion doesn&#8217;t really explain.</p><p><strong>Dario Amodei</strong></p><p>I think diffusion is very real and doesn&#8217;t exclusively have to do with limitations on the AI models. Again, there are people who use diffusion as kind of a buzzword to say this isn&#8217;t a big deal. I&#8217;m not talking about that. I&#8217;m not talking about how AI will diffuse at the speed of previous technologies. I think AI will diffuse much faster than previous technologies have, but not infinitely fast.</p><p>I&#8217;ll just give an example of this. There&#8217;s Claude Code. Claude Code is extremely easy to set up. If you&#8217;re a developer, you can just start using Claude Code. There is no reason why a developer at a large enterprise should not be adopting Claude Code as quickly as an individual developer or developer at a startup.</p><p>We do everything we can to promote it. We sell Claude Code to enterprises. Big enterprises, big financial companies, big pharmaceutical companies, all of them are adopting Claude Code much faster than enterprises typically adopt new technology. But again, it takes time.</p><p>Any given feature or any given product, like Claude Code or <a href="https://claude.com/product/cowork">Cowork</a>, will get adopted by the individual developers who are on Twitter all the time, by the Series A startups, many months faster than they will get adopted by a large enterprise that does food sales. There are just a number of factors. You have to go through legal, you have to provision it for everyone. 
It has to pass security and compliance.</p><p>The leaders of the company who are further away from the AI revolution are forward-looking, but they have to say, &#8220;Oh, it makes sense for us to spend 50 million. This is what this Claude Code thing is. This is why it helps our company. This is why it makes us more productive.&#8221; Then they have to explain to the people two levels below. They have to say, &#8220;Okay, we have 3,000 developers. Here&#8217;s how we&#8217;re going to roll it out to our developers.&#8221; We have conversations like this every day.</p><p>We are doing everything we can to make Anthropic&#8217;s revenue grow 20 or 30x a year instead of 10x a year. Again, many enterprises are just saying, &#8220;This is so productive. We&#8217;re going to take shortcuts in our usual procurement process.&#8221; They&#8217;re moving much faster than when we tried to sell them just the ordinary API, which many of them use. Claude Code is a more compelling product, but it&#8217;s not an infinitely compelling product.</p><p>I don&#8217;t think even AGI or powerful AI or &#8220;country of geniuses in a data center&#8221; will be an infinitely compelling product. It will be a compelling product enough maybe to get 3-5x, or 10x, a year of growth, even when you&#8217;re in the hundreds of billions of dollars, which is extremely hard to do and has never been done in history before, but not infinitely fast.</p><p><strong>Dwarkesh Patel</strong></p><p>I buy that it would be a slight slowdown. Maybe this is not your claim, but sometimes people talk about this like, &#8220;Oh, the capabilities are there, but because of diffusion... 
otherwise we&#8217;re basically at AGI&#8221;.</p><p><strong>Dario Amodei</strong></p><p>I don&#8217;t believe we&#8217;re basically at AGI.</p><p><strong>Dwarkesh Patel</strong></p><p>I think if you had the &#8220;country of geniuses in a data center&#8221;...</p><p><strong>Dario Amodei</strong></p><p>If we had the &#8220;country of geniuses in a data center&#8221;, we would know it. We would know it if you had the &#8220;country of geniuses in a data center&#8221;. Everyone in this room would know it. Everyone in Washington would know it. People in rural parts might not know it, but we would know it. We don&#8217;t have that now. That is very clear.</p><h3>00:29:42 - Is continual learning necessary? How will it be solved?</h3><p><strong>Dwarkesh Patel</strong></p><p>Coming back to concrete prediction&#8230; Because there are so many different things to disambiguate, it can be easy to talk past each other when we&#8217;re talking about capabilities. For example, when I interviewed you three years ago, I asked you a prediction about what we should expect three years from now. You were right. You said, &#8220;We should expect systems which, if you talk to them for the course of an hour, it&#8217;s hard to tell them apart from a generally well-educated human.&#8221;</p><p>I think you were right about that. I think spiritually I feel unsatisfied because my internal expectation was that such a system could automate large parts of white-collar work. So it might be more productive to talk about the actual end capabilities you want from such a system.</p><p><strong>Dario Amodei</strong></p><p>I will basically tell you where I think we are.</p><p><strong>Dwarkesh Patel</strong></p><p>Let me ask a very specific question so that we can figure out exactly what kinds of capabilities we should think about soon. 
Maybe I&#8217;ll ask about it in the context of a job I understand well, not because it&#8217;s the most relevant job, but just because I can evaluate the claims about it.</p><p>Take video editors. I have video editors. Part of their job involves learning about our audience&#8217;s preferences, learning about my preferences and tastes, and the different trade-offs we have. They&#8217;re, over the course of many months, building up this understanding of context. The skill and ability they have six months into the job, a model that can pick up that skill on the job on the fly, when should we expect such an AI system?</p><p><strong>Dario Amodei</strong></p><p>I guess what you&#8217;re talking about is that we&#8217;re doing this interview for three hours. Someone&#8217;s going to come in, someone&#8217;s going to edit it. They&#8217;re going to be like, &#8220;Oh, I don&#8217;t know, Dario scratched his head and we could edit that out.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>&#8220;Magnify that.&#8221;</p><p><strong>Dario Amodei</strong></p><p>&#8220;There was this long discussion that is less interesting to people. There&#8217;s another thing that&#8217;s more interesting to people, so let&#8217;s make this edit.&#8221;</p><p>I think the &#8220;country of geniuses in a data center&#8221; will be able to do that. The way it will be able to do that is it will have <a href="https://www.anthropic.com/news/developing-computer-use">general control of a computer screen</a>. You&#8217;ll be able to feed this in. It&#8217;ll be able to also use the computer screen to go on the web, look at all your previous interviews, look at what people are saying on Twitter in response to your interviews, talk to you, ask you questions, talk to your staff, look at the history of edits that you did, and from that, do the job.</p><p>I think that&#8217;s dependent on several things. 
I think this is one of the things that&#8217;s actually blocking deployment: getting to the point on computer use where the models are really masters at using the computer.</p><p>We&#8217;ve seen this climb in benchmarks, and benchmarks are always imperfect measures. But I think when we first released computer use a year and a quarter ago, <a href="https://os-world.github.io/">OSWorld</a> was at maybe 15%. I don&#8217;t remember exactly, but we&#8217;ve climbed from that to 65-70%. There may be harder measures as well, but I think computer use has to pass a point of reliability.</p><p><strong>Dwarkesh Patel</strong></p><p>Can I just follow up on that before you move on to the next point? For years, I&#8217;ve been trying to build different internal LLM tools for myself. Often I have these text-in, text-out tasks, which should be dead center in the repertoire of these models. Yet I still hire humans to do them.</p><p>If it&#8217;s something like, &#8220;identify what the best clips would be in this transcript&#8221;, maybe the LLMs do a seven-out-of-ten job on them. But there&#8217;s not this ongoing way I can engage with them to help them get better at the job the way I could with a human employee. That missing ability, even if you solve computer use, would still block my ability to offload an actual job to them.</p><p><strong>Dario Amodei</strong></p><p>This gets back to what we were talking about before with learning on the job. It&#8217;s very interesting. I think with the <a href="https://en.wikipedia.org/wiki/AI-assisted_software_development">coding agents</a>, I don&#8217;t think people would say that learning on the job is what is preventing the coding agents from doing everything end to end. They keep getting better. 
We have engineers at Anthropic who don&#8217;t write any code.</p><p>When I look at the productivity, to your previous question, we have folks who say, &#8220;This <a href="https://modal.com/gpu-glossary/device-software/kernel">GPU kernel</a>, this chip, I used to write it myself. I just have Claude do it.&#8221; There&#8217;s this enormous improvement in productivity.</p><p>When I see Claude Code, familiarity with the <a href="https://en.wikipedia.org/wiki/Codebase">codebase</a> or a feeling that the model hasn&#8217;t worked at the company for a year, that&#8217;s not high up on the list of complaints I see. I think what I&#8217;m saying is that we&#8217;re kind of taking a different path.</p><p><strong>Dwarkesh Patel</strong></p><p>Don&#8217;t you think with coding that&#8217;s because there is an external scaffold of memory which exists instantiated in the codebase? I don&#8217;t know how many other jobs have that. Coding made fast progress precisely because it has this unique advantage that other economic activity doesn&#8217;t.</p><p><strong>Dario Amodei</strong></p><p>But when you say that, what you&#8217;re implying is that by reading the codebase into the context, I have everything that the human needed to learn on the job. So that would be an example of&#8212;whether it&#8217;s written or not, whether it&#8217;s available or not&#8212;a case where everything you needed to know you got from the context window. What we think of as learning&#8212;&#8220;I started this job, it&#8217;s going to take me six months to understand the code base&#8221;&#8212;the model just did it in the context.</p><p><strong>Dwarkesh Patel</strong></p><p>I honestly don&#8217;t know how to think about this because there are people who qualitatively report what you&#8217;re saying. 
I&#8217;m sure you saw last year, <a href="https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/">there was a major study</a> where they had experienced developers try to close <a href="https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests">pull requests</a> in repositories that they were familiar with. Those developers reported an uplift. They reported that they felt more productive with the use of these models. But in fact, if you look at their output and how much was actually merged back in, there was a 20% downlift. They were less productive as a result of using these models.</p><p>So I&#8217;m trying to square the qualitative feeling that people feel with these models versus, 1) in a macro level, where is this renaissance of software? And then 2) when people do these independent evaluations, why are we not seeing the productivity benefits we would expect?</p><p><strong>Dario Amodei</strong></p><p>Within Anthropic, this is just really unambiguous. We&#8217;re under an incredible amount of commercial pressure and make it even harder for ourselves because we have all this <a href="https://en.wikipedia.org/wiki/AI_safety">safety</a> stuff we do that I think we do more than other companies.</p><p>The pressure to survive economically while also keeping our values is just incredible. We&#8217;re trying to keep this 10x revenue curve going. There is zero time for bullshit. There is zero time for feeling like we&#8217;re productive when we&#8217;re not. These tools make us a lot more productive.</p><p>Why do you think we&#8217;re <a href="https://www.wired.com/story/anthropic-revokes-openais-access-to-claude/">concerned about competitors using the tools</a>? Because we think we&#8217;re ahead of the competitors. We wouldn&#8217;t be going through all this trouble if this were secretly reducing our productivity. 
We see the end productivity every few months in the form of model launches. There&#8217;s no kidding yourself about this. The models make you more productive.</p><p><strong>Dwarkesh Patel</strong></p><p>1) People feeling like they&#8217;re productive is qualitatively predicted by studies like this. But 2) if I just look at the end output, obviously you guys are making fast progress.</p><p>But the idea was supposed to be that with recursive self-improvement, you make a better AI, the AI helps you build a better next AI, et cetera, et cetera. What I see instead&#8212;if I look at you, OpenAI, DeepMind&#8212;is that people are just shifting around the podium every few months.</p><p>Maybe you think that stops because you&#8217;ve won or whatever. But why are we not seeing the person with the best coding model have this lasting advantage if in fact there are these enormous productivity gains from the last coding model?</p><p><strong>Dario Amodei</strong></p><p>I think my model of the situation is that there&#8217;s an advantage that&#8217;s gradually growing. I would say right now the coding models give maybe, I don&#8217;t know, a 15-20% total factor speed up. That&#8217;s my view. Six months ago, it was maybe 5%. So it didn&#8217;t matter. 5% doesn&#8217;t register. It&#8217;s now just getting to the point where it&#8217;s one of several factors that kind of matters. That&#8217;s going to keep speeding up.</p><p>I think six months ago, there were several companies that were at roughly the same point because this wasn&#8217;t a notable factor, but I think it&#8217;s starting to speed up more and more. I would also say there are multiple companies that write models that are used for code and we&#8217;re not perfectly good at preventing some of these other companies from using our models internally. 
So I think everything we&#8217;re seeing is consistent with this kind of snowball model.</p><p>Again, my theme in all of this is all of this is soft <a href="https://www.lesswrong.com/w/ai-takeoff">takeoff</a>, soft, smooth exponentials, although the exponentials are relatively steep. So we&#8217;re seeing this snowball gather momentum where it&#8217;s like 10%, 20%, 25%, 40%. As you go, <a href="https://en.wikipedia.org/wiki/Amdahl%27s_law">Amdahl&#8217;s law</a>, you have to get all the things that are preventing you from closing the loop out of the way. But this is one of the biggest priorities within Anthropic.</p><p><strong>Dwarkesh Patel</strong></p><p>Stepping back, before in the stack we were talking about when do we get this on-the-job learning? It seems like the point you were making on the coding thing is that we actually don&#8217;t need on-the-job learning. You can have tremendous productivity improvements, you can have potentially trillions of dollars of revenue for AI companies, without this basic human ability to learn on the job. Maybe that&#8217;s not your claim, you should clarify.</p><p>But in most domains of economic activity, people say, &#8220;I hired somebody, they weren&#8217;t that useful for the first few months, and then over time they built up the context, understanding.&#8221; It&#8217;s actually hard to define what we&#8217;re talking about here. But they got something and then now they&#8217;re a powerhouse and they&#8217;re so valuable to us. If AI doesn&#8217;t develop this ability to learn on the fly, I&#8217;m a bit skeptical that we&#8217;re going to see huge changes to the world without that ability.</p><p><strong>Dario Amodei</strong></p><p>I think two things here. There&#8217;s the state of the technology right now. Again, we have these two stages. We have the pre-training and RL stage where you throw a bunch of data and tasks into the models and then they generalize. 
So it&#8217;s like learning, but it&#8217;s like learning from more data and not learning over one human or one model&#8217;s lifetime. So again, this is situated between evolution and human learning. But once you learn all those skills, you have them.</p><p>Just like with pre-training, just how the models know more, if I look at a pre-trained model, it knows more about the history of samurai in Japan than I do. It knows more about baseball than I do. It knows more about <a href="https://en.wikipedia.org/wiki/Low-pass_filter">low-pass filters</a> and electronics, all of these things. Its knowledge is way broader than mine. So I think even just that may get us to the point where the models are better at everything.</p><p>We also have, again, just with scaling the kind of existing setup, the in-context learning. I would describe it as kind of like human on-the-job learning, but a little weaker and a little short term. You look at in-context learning and if you give the model a bunch of examples it does get it. There&#8217;s real learning that happens in context. A million <a href="https://blogs.nvidia.com/blog/ai-tokens-explained/">tokens</a> is a lot. That can be days of human learning. If you think about the model reading a million words, how long would it take me to read a million? Days or weeks at least.</p><p>So you have these two things. I think these two things within the existing paradigm may just be enough to get you the &#8220;country of geniuses in a data center&#8221;. I don&#8217;t know for sure, but I think they&#8217;re going to get you a large fraction of it. There may be gaps, but I certainly think that just as things are, this is enough to generate trillions of dollars of revenue. That&#8217;s one.</p><p>Two, is this idea of continual learning, this idea of a single model learning on the job. I think we&#8217;re working on that too. There&#8217;s a good chance that in the next year or two, we also solve that. 
Again, I think you get most of the way there without it. The trillions of dollars a year market, maybe all of the national security implications and the safety implications that I wrote about in &#8220;Adolescence of Technology&#8221; can happen without it. But we, and I imagine others, are working on it. There&#8217;s a good chance that we will get there within the next year or two.</p><p>There are a bunch of ideas. I won&#8217;t go into all of them in detail, but one is just to make the context longer. There&#8217;s nothing preventing longer contexts from working. You just have to train at longer contexts and then learn to serve them at inference. Both of those are engineering problems that we are working on and I would assume others are working on them as well.</p><p><strong>Dwarkesh Patel</strong></p><p>This context length increase, it seemed like there was a period from 2020 to 2023 where from <a href="https://en.wikipedia.org/wiki/GPT-3">GPT-3</a> to <a href="https://developers.openai.com/api/docs/models/gpt-4-turbo">GPT-4 Turbo</a>, there was an increase from 2000 context lengths to 128K. I feel like for the two-ish years since then, we&#8217;ve been in the same-ish ballpark.</p><p>When context lengths get much longer than that, people report qualitative degradation in the ability of the model to consider that full context. So I&#8217;m curious what you&#8217;re internally seeing that makes you think, &#8220;10 million contexts, 100 million contexts to get six months of human learning and building context&#8221;.</p><p><strong>Dario Amodei</strong></p><p>This isn&#8217;t a research problem. This is an engineering and inference problem. If you want to serve long context, you have to store your entire <a href="https://huggingface.co/blog/not-lain/kv-caching">KV cache</a>. It&#8217;s difficult to store all the memory in the GPUs, to juggle the memory around. I don&#8217;t even know the details. 
At this point, this is at a level of detail that I&#8217;m no longer able to follow, although I knew it in the GPT-3 era. &#8220;These are the weights, these are the activations you have to store&#8230;&#8221;</p><p>But these days the whole thing is flipped because we have <a href="https://en.wikipedia.org/wiki/Mixture_of_experts">MoE</a> models and all of that. Regarding this degradation you&#8217;re talking about, without getting too specific, there&#8217;s two things. There&#8217;s the context length you train at and there&#8217;s a context length that you serve at. If you train at a small context length and then try to serve at a long context length, maybe you get these degradations. It&#8217;s better than nothing, you might still offer it, but you get these degradations. Maybe it&#8217;s harder to train at a long context length.</p><p><strong>Dwarkesh Patel</strong></p><p>I want to, at the same time, ask about maybe some rabbit holes. Wouldn&#8217;t you expect that if you had to train on longer context length, that would mean that you&#8217;re able to get less samples in for the same amount of compute? Maybe it&#8217;s not worth diving deep on that.</p><p>I want to get an answer to the bigger picture question. I don&#8217;t feel a preference for a human editor that&#8217;s been working for me for six months versus an AI that&#8217;s been working with me for six months, what year do you predict that that will be the case?</p><p><strong>Dario Amodei</strong></p><p>My guess for that is there&#8217;s a lot of problems where basically we can do this when we have the &#8220;country of geniuses in a data center&#8221;. My picture for that, if you made me guess, is one to two years, maybe one to three years. It&#8217;s really hard to tell. I have a strong view&#8212;99%, 95%&#8212;that all this will happen in 10 years. I think that&#8217;s just a super safe bet. 
I have a hunch&#8212;this is more like a 50/50 thing&#8212;that it&#8217;s going to be more like one to two, maybe more like one to three.</p><p><strong>Dwarkesh Patel</strong></p><p>So one to three years. Country of geniuses, and the slightly less economically valuable task of editing videos.</p><p><strong>Dario Amodei</strong></p><p>It seems pretty economically valuable, let me tell you. It&#8217;s just there are a lot of use cases like that. There are a lot of similar ones.</p><h3>00:46:20 - If AGI is imminent, why not buy more compute?</h3><p><strong>Dwarkesh Patel</strong></p><p>So you&#8217;re predicting that within one to three years. And then, generally, Anthropic has <a href="https://www.anthropic.com/news/anthropic-s-recommendations-ostp-u-s-ai-action-plan">predicted</a> that by late &#8217;26 or early &#8217;27 we will have AI systems that &#8220;have the ability to navigate interfaces available to humans doing digital work today, intellectual capabilities matching or exceeding that of Nobel Prize winners, and the ability to interface with the physical world&#8221;. You gave <a href="https://www.nytimes.com/2025/12/07/business/dealbook/dario-amodei-dealbook.html">an interview two months ago with </a><em><a href="https://www.nytimes.com/2025/12/07/business/dealbook/dario-amodei-dealbook.html">DealBook</a></em> where you were emphasizing your company&#8217;s more responsible compute scaling as compared to your competitors.</p><p>I&#8217;m trying to square these two views. If you really believe that we&#8217;re going to have a country of geniuses, you want as big a data center as you can get. There&#8217;s no reason to slow down. The <a href="https://en.wikipedia.org/wiki/Total_addressable_market">TAM</a> of a Nobel Prize winner, that can actually do everything a Nobel Prize winner can do, is trillions of dollars. 
So I&#8217;m trying to square this conservatism, which seems rational if you have more moderate timelines, with your stated views about progress.</p><p><strong>Dario Amodei</strong></p><p>It actually all fits together. We go back to this fast, but not infinitely fast, diffusion. Let&#8217;s say that we&#8217;re making progress at this rate. The technology is making progress this fast. I have very high conviction that we&#8217;re going to get there within a few years. I have a hunch that we&#8217;re going to get there within a year or two. So there&#8217;s a little uncertainty on the technical side, but pretty strong confidence that it won&#8217;t be off by much.</p><p>What I&#8217;m less certain about is, again, the economic diffusion side. I really do believe that we could have models that are a country of geniuses in the data center in one to two years. One question is: How many years after that do the trillions in revenue start rolling in? I don&#8217;t think it&#8217;s guaranteed that it&#8217;s going to be immediate. It could be one year, it could be two years, I could even stretch it to five years although I&#8217;m skeptical of that.</p><p>So we have this uncertainty. Even if the technology goes as fast as I suspect that it will, we don&#8217;t know exactly how fast it&#8217;s going to drive revenue. We know it&#8217;s coming, but with the way you buy these data centers, if you&#8217;re off by a couple years, that can be ruinous. It is just like how I wrote in &#8220;<a href="https://darioamodei.com/essay/machines-of-loving-grace">Machines of Loving Grace</a>&#8221;. I said I think we might get this powerful AI, this &#8220;country of genius in the data center&#8221;. That description you gave comes from &#8220;Machines of Loving Grace&#8221;. I said we&#8217;ll get that in 2026, maybe 2027. Again, that is my hunch. I wouldn&#8217;t be surprised if I&#8217;m off by a year or two, but that is my hunch.</p><p>Let&#8217;s say that happens. 
That&#8217;s the starting gun. How long does it take to cure all the diseases? That&#8217;s one of the ways that drives a huge amount of economic value. You cure every disease. There&#8217;s a question of how much of that goes to the pharmaceutical company or the AI company, but there&#8217;s an enormous consumer surplus because&#8212;assuming we can get access for everyone, which I care about greatly&#8212;we cure all of these diseases.</p><p>How long does it take? You have to do the biological discovery, you have to manufacture the new drug, you have to go through the regulatory process. We <a href="https://en.wikipedia.org/wiki/Operation_Warp_Speed">saw this with vaccines and COVID</a>. We got the vaccine out to everyone, but it took a year and a half. My question is: How long does it take to get the cure for everything&#8212;which AI is the genius that can in theory invent&#8212;out to everyone? How long from when that AI first exists in the lab to when diseases have actually been cured for everyone?</p><p>We&#8217;ve had a polio vaccine for 50 years. We&#8217;re still trying to eradicate it in the most remote corners of Africa. The <a href="https://en.wikipedia.org/wiki/Gates_Foundation">Gates Foundation</a> is trying as hard as they can. Others are trying as hard as they can. But that&#8217;s difficult. Again, I don&#8217;t expect most of the economic diffusion to be as difficult as that. That&#8217;s the most difficult case. But there&#8217;s a real dilemma here. Where I&#8217;ve settled on it is that it will be faster than anything we&#8217;ve seen in the world, but it still has its limits.</p><p>So when we go to buying data centers, again, the curve I&#8217;m looking at is: we&#8217;ve had a 10x a year increase every year. At the beginning of this year, we&#8217;re looking at $10 billion in annualized revenue. We have to decide how much compute to buy. 
It takes a year or two to actually build out the data centers, to reserve the data center.</p><p>Basically I&#8217;m saying, &#8220;In 2027, how much compute do I get?&#8221; I could assume that the revenue will continue growing 10x a year, so it&#8217;ll be $100 billion at the end of 2026 and $1 trillion at the end of 2027. Actually it would be $5 trillion of compute because it would be $1 trillion a year for five years. I could buy $1 trillion of compute that starts at the end of 2027. If my revenue is not $1 trillion, if it&#8217;s even $800 billion, there&#8217;s no force on earth, there&#8217;s no hedge on earth that could stop me from going bankrupt if I buy that much compute.</p><p>Even though a part of my brain wonders if it&#8217;s going to keep growing 10x, I can&#8217;t buy $1 trillion a year of compute in 2027. If I&#8217;m just off by a year in that rate of growth, or if the growth rate is 5x a year instead of 10x a year, then you go bankrupt. So you end up in a world where you&#8217;re supporting hundreds of billions, not trillions. You accept some risk that there&#8217;s so much demand that you can&#8217;t support the revenue, and you accept some risk that you got it wrong and it&#8217;s still slow.</p><p>When I talked about behaving responsibly, what I meant actually was not the absolute amount. I think it is true we&#8217;re spending somewhat less than some of the other players. It&#8217;s actually the other things, like have we been thoughtful about it or are we YOLOing and saying, &#8220;We&#8217;re going to do $100 billion here or $100 billion there&#8221;? I get the impression that some of the other companies have not written down the spreadsheet, that they don&#8217;t really understand the risks they&#8217;re taking. They&#8217;re just doing stuff because it sounds cool.</p><p>We&#8217;ve thought carefully about it. We&#8217;re an enterprise business. Therefore, we can rely more on revenue. It&#8217;s less fickle than consumer. 
We have better margins, which is the buffer between buying too much and buying too little. I think we bought an amount that allows us to capture pretty strong upside worlds. It won&#8217;t capture the full 10x a year. Things would have to go pretty badly for us to be in financial trouble. So we&#8217;ve thought carefully and we&#8217;ve made that balance. That&#8217;s what I mean when I say that we&#8217;re being responsible.</p><p><strong>Dwarkesh Patel</strong></p><p>So it seems like it&#8217;s possible that we actually just have different definitions of the &#8220;country of a genius in a data center&#8221;. Because when I think of actual human geniuses, an actual country of human geniuses in a data center, I would happily buy $5 trillion worth of compute to run an actual country of human geniuses in a data center.</p><p>Let&#8217;s say JPMorgan or Moderna or whatever doesn&#8217;t want to use them. I&#8217;ve got a country of geniuses. They&#8217;ll start their own company. If they can&#8217;t start their own company and they&#8217;re bottlenecked by clinical trials&#8230; It is worth stating that with clinical trials, most clinical trials fail because the drug doesn&#8217;t work. There&#8217;s not efficacy.</p><p><strong>Dario Amodei</strong></p><p>I make exactly that point in &#8220;Machines of Loving Grace&#8221;, I say the clinical trials are going to go much faster than we&#8217;re used to, but not infinitely fast.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, and then suppose it takes a year for the clinical trials to work out so that you&#8217;re getting revenue from that and can make more drugs. Okay, well, you&#8217;ve got a country of geniuses and you&#8217;re an AI lab. You could use many more AI researchers. You also think there are these self-reinforcing gains from smart people working on AI tech. 
You can have the data center working on AI progress.</p><p><strong>Dario Amodei</strong></p><p>Are there substantially more gains from buying $1 trillion a year of compute versus $300 billion a year of compute?</p><p><strong>Dwarkesh Patel</strong></p><p>If your competitor is buying a trillion, yes there is.</p><p><strong>Dario Amodei</strong></p><p>Well, no, there&#8217;s some gain, but then again, there&#8217;s this chance that they go bankrupt before. Again, if you&#8217;re off by only a year, you destroy yourselves. That&#8217;s the balance. We&#8217;re buying a lot. We&#8217;re buying a hell of a lot. We&#8217;re buying an amount that&#8217;s comparable to what the biggest players in the game are buying.</p><p>But if you&#8217;re asking me, &#8220;Why haven&#8217;t we signed $10 trillion of compute starting in mid-2027?&#8221;... First of all, it can&#8217;t be produced. There isn&#8217;t that much in the world. But second, what if the country of geniuses comes, but it comes in mid-2028 instead of mid-2027? You go bankrupt.</p><p><strong>Dwarkesh Patel</strong></p><p>So if your projection is one to three years, it seems like you should want $10 trillion of compute by 2029 at the latest? Even in the longest version of the timelines you state, the compute you are ramping up to build doesn&#8217;t seem in accordance.</p><p><strong>Dario Amodei</strong></p><p>What makes you think that?</p><p><strong>Dwarkesh Patel</strong></p><p>Human wages, let&#8217;s say, are on the order of $50 trillion a year&#8212;</p><p><strong>Dario Amodei</strong></p><p>So I won&#8217;t talk about Anthropic in particular, but if you talk about the industry, the amount of compute the industry is building this year is probably, call it, 10-15 gigawatts. It goes up by roughly 3x a year. So next year&#8217;s 30-40 gigawatts. 2028 might be 100 gigawatts. 2029 might be like 300 gigawatts. 
I&#8217;m doing the math in my head, but each gigawatt costs maybe $10 billion, on the order of $10-15 billion a year.</p><p>You put that all together and you&#8217;re getting about what you described. You&#8217;re getting exactly that. You&#8217;re getting multiple trillions a year by 2028 or 2029. You&#8217;re getting exactly what you predict.</p><p><strong>Dwarkesh Patel</strong></p><p>That&#8217;s for the industry.</p><p><strong>Dario Amodei</strong></p><p>That&#8217;s for the industry, that&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>Suppose Anthropic&#8217;s compute keeps 3x-ing a year, and then by 2027-28, you have 10 gigawatts. Multiply that by, as you say, $10 billion. So then it&#8217;s like $100 billion a year. But then you&#8217;re saying the TAM by 2028 is $200 billion.</p><p><strong>Dario Amodei</strong></p><p>Again, I don&#8217;t want to give exact numbers for Anthropic, but these numbers are too small.</p><p><strong>Dwarkesh Patel</strong></p><p>Okay, interesting.</p><h3>00:58:49 - How will AI labs actually make profit?</h3><p><strong>Dwarkesh Patel</strong></p><p>You&#8217;ve told investors that you plan to be profitable starting in 2028. This is the year when we&#8217;re potentially getting the country of geniuses in a data center. This is now going to unlock all this progress in medicine and health and new technologies. Wouldn&#8217;t this be exactly the time when you&#8217;d want to reinvest in the business and build bigger &#8220;countries&#8221; so they can make more discoveries?</p><p><strong>Dario Amodei</strong></p><p>Profitability is this kind of weird thing in this field. I don&#8217;t think in this field profitability is actually a measure of spending down versus investing in the business. Let&#8217;s just take a model of this. 
I actually think profitability happens when you underestimated the amount of demand you were going to get and loss happens when you overestimated the amount of demand you were going to get, because you&#8217;re buying the data centers ahead of time.</p><p>Think about it this way. Again, these are stylized facts. These numbers are not exact. I&#8217;m just trying to make a toy model here. Let&#8217;s say half of your compute is for training and half of your compute is for inference. The inference has some gross margin that&#8217;s more than 50%.</p><p>So what that means is that if you were in steady-state, you build a data center and if you knew exactly the demand you were getting, you would get a certain amount of revenue. Let&#8217;s say you pay $100 billion a year for compute. On $50 billion a year you support $150 billion of revenue. The other $50 billion is used for training. Basically you&#8217;re profitable and you make $50 billion of profit. Those are the economics of the industry today, or not today but where we&#8217;re projecting forward in a year or two.</p><p>The only thing that makes that not the case is if you get less demand than $50 billion. Then you have more than 50% of your data center for research and you&#8217;re not profitable. So you train stronger models, but you&#8217;re not profitable. If you get more demand than you thought, then research gets squeezed, but you&#8217;re kind of able to support more inference and you&#8217;re more profitable.</p><p>Maybe I&#8217;m not explaining it well, but the thing I&#8217;m trying to say is that you decide the amount of compute first. Then you have some target desire of inference versus training, but that gets determined by demand. It doesn&#8217;t get determined by you.</p><p><strong>Dwarkesh Patel</strong></p><p>What I&#8217;m hearing is the reason you&#8217;re predicting profit is that you are systematically underinvesting in compute?</p><p><strong>Dario Amodei</strong></p><p>No, no, no. 
I&#8217;m saying it&#8217;s hard to predict. These things about 2028 and when it will happen, that&#8217;s our attempt to do the best we can with investors. All of this stuff is really uncertain because of the cone of uncertainty. We could be profitable in 2026 if the revenue grows fast enough. If we overestimate or underestimate the next year, that could swing wildly.</p><p>What I&#8217;m trying to get at is that you have a model in your head of a business that invests, invests, invests, gets scale and then becomes profitable. There&#8217;s a single point at which things turn around. I don&#8217;t think the economics of this industry work that way.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. So if I&#8217;m understanding correctly, you&#8217;re saying that because of the discrepancy between the amount of compute we should have gotten and the amount of compute we got, we were sort of forced to make profit. But that doesn&#8217;t mean we&#8217;re going to continue making profit. We&#8217;re going to reinvest the money because now AI has made so much progress and we want a bigger country of geniuses. So we&#8217;re back to a world where revenue is high, but losses are also high.</p><p><strong>Dario Amodei</strong></p><p>If every year we predict exactly what the demand is going to be, we&#8217;ll be profitable every year. Because spending 50% of your compute on research, roughly, plus a gross margin that&#8217;s higher than 50% and correct demand prediction leads to profit. That&#8217;s the profitable business model that I think is kind of there, but obscured by this building ahead and these prediction errors.</p><p><strong>Dwarkesh Patel</strong></p><p>I guess you&#8217;re treating the 50% as a sort of given constant, whereas in fact, if AI progress is fast and you can increase the progress by scaling up more, you should just have more than 50% and not make profit.</p><p><strong>Dario Amodei</strong></p><p>But here&#8217;s what I&#8217;ll say. You might want to scale it up more. 
Remember the log returns to scale. If 70% would get you a very little bit of a smaller model through a factor of 1.4x... That extra $20 billion, each dollar there is worth much less to you because of the log-linear setup.</p><p>So you might find that it&#8217;s better to invest that $20 billion in serving inference or in hiring engineers who are kind of better at what they&#8217;re doing. So the reason I said 50%... That&#8217;s not exactly our target. It&#8217;s not exactly going to be 50%. It&#8217;ll probably vary over time.</p><p>What I&#8217;m saying is the log-linear return, what it leads to is you spend of order one fraction of the business. Like not 5%, not 95%. Then you get diminishing returns.</p><p><strong>Dwarkesh Patel</strong></p><p>I feel strange that I&#8217;m convincing Dario to believe in AI progress or something. Okay, you don&#8217;t invest in research because it has diminishing returns, but you invest in the other things you mentioned. I think profit at a sort of macro level&#8212;</p><p><strong>Dario Amodei</strong></p><p>Again, I&#8217;m talking about diminishing returns, but after you&#8217;re spending $50 billion a year.</p><p><strong>Dwarkesh Patel</strong></p><p>This is a point I&#8217;m sure you would make, but diminishing returns on a genius could be quite high.</p><p>More generally, what is profit in a market economy? Profit is basically saying other companies in the market can do more things with this money than I can.</p><p><strong>Dario Amodei</strong></p><p>Put aside Anthropic. I don&#8217;t want to give information about Anthropic. That&#8217;s why I&#8217;m giving these stylized numbers. But let&#8217;s just derive the equilibrium of the industry. Why doesn&#8217;t everyone spend 100% of their compute on training and not serve any customers? 
It&#8217;s because if they didn&#8217;t get any revenue, they couldn&#8217;t raise money, they couldn&#8217;t do compute deals, they couldn&#8217;t buy more compute the next year.</p><p>So there&#8217;s going to be an equilibrium where every company spends less than 100% on training and certainly less than 100% on inference. It should be clear why you don&#8217;t just serve the current models and never train another model, because then you don&#8217;t have any demand because you&#8217;ll fall behind. So there&#8217;s some equilibrium. It&#8217;s not gonna be 10%, it&#8217;s not gonna be 90%. Let&#8217;s just say as a stylized fact, it&#8217;s 50%. That&#8217;s what I&#8217;m getting at.</p><p>I think we&#8217;re gonna be in a position where that equilibrium of how much you spend on training is less than the gross margins that you&#8217;re able to get on compute. So the underlying economics are profitable. The problem is you have this hellish demand prediction problem when you&#8217;re buying the next year of compute and you might guess under and be very profitable but have no compute for research. Or you might guess over and you are not profitable and you have all the compute for research in the world. Does that make sense? Just as a dynamic model of the industry?</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe stepping back, I&#8217;m not saying I think the &#8220;country of geniuses&#8221; is going to come in two years and therefore you should buy this compute. To me, the end conclusion you&#8217;re arriving at makes a lot of sense. But that&#8217;s because it seems like &#8220;country of geniuses&#8221; is hard and there&#8217;s a long way to go. 
So stepping back, the thing I&#8217;m trying to get at is more that it seems like your worldview is compatible with somebody who says, &#8220;We&#8217;re like 10 years away from a world in which we&#8217;re generating trillions of dollars of value.&#8221;</p><p><strong>Dario Amodei</strong></p><p>That&#8217;s just not my view. So I&#8217;ll make another prediction. It is hard for me to see that there won&#8217;t be trillions of dollars in revenue before 2030. I can construct a plausible world. It takes maybe three years. That would be the end of what I think is plausible.</p><p>Like in 2028, we get the real &#8220;country of geniuses in the data center&#8221;. The revenue&#8217;s going into the low hundreds of billions by 2028, and then the country of geniuses accelerates it to trillions. We&#8217;re basically on the slow end of diffusion. It takes two years to get to the trillions. That would be the world where it takes until 2030. I suspect even composing the technical exponential and diffusion exponential, we&#8217;ll get there before 2030.</p><p><strong>Dwarkesh Patel</strong></p><p>So you laid out a model where Anthropic makes profit because it seems like fundamentally we&#8217;re in a compute-constrained world. So eventually we keep growing compute&#8212;</p><p><strong>Dario Amodei</strong></p><p>I think the way the profit comes is&#8230; Again, let&#8217;s just abstract the whole industry here. Let&#8217;s just imagine we&#8217;re in an economics textbook. We have a small number of firms. Each can invest a limited amount. Each can invest some fraction in R&amp;D. They have some marginal cost to serve. The gross profit margins on that marginal cost are very high because inference is efficient. There&#8217;s some competition, but the models are also differentiated.</p><p>Companies will compete to push their research budgets up. But because there&#8217;s a small number of players, we have the... What is it called? 
The <a href="https://en.wikipedia.org/wiki/Cournot_competition">Cournot equilibrium</a>, I think, is what the small-number-of-firms equilibrium is. The point is it doesn&#8217;t equilibrate to perfect competition with zero margins. If there are three firms in the economy and all are kind of independently behaving rationally, it doesn&#8217;t equilibrate to zero.</p><p><strong>Dwarkesh Patel</strong></p><p>Help me understand that, because right now we do have three leading firms and they&#8217;re not making profit. So what is changing?</p><p><strong>Dario Amodei</strong></p><p>Again, the gross margins right now are very positive. What&#8217;s happening is a combination of two things. One is that we&#8217;re still in the exponential scale-up phase of compute. A model gets trained. Let&#8217;s say a model got trained that cost $1 billion last year. Then this year it produced $4 billion of revenue and cost $1 billion to run inference on. Again, I&#8217;m using stylized numbers here, but that would be 75% gross margins and this 25% tax. So that model as a whole makes $2 billion.</p><p>But at the same time, we&#8217;re spending $10 billion to train the next model because there&#8217;s an exponential scale-up. So the company loses money. Each model makes money, but the company loses money.</p><p>The equilibrium I&#8217;m talking about is an equilibrium where we have the &#8220;country of geniuses in a data center&#8221;, but that model training scale-up has equilibrated more. Maybe it&#8217;s still going up. We&#8217;re still trying to predict the demand, but it&#8217;s more leveled out.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m confused about a couple of things there. Let&#8217;s start with the current world. In the current world, you&#8217;re right that, as you said before, if you treat each individual model as a company, it&#8217;s profitable. 
But of course, a big part of the production function of being a frontier lab is training the next model, right?</p><p><strong>Dario Amodei</strong></p><p>Yes, that&#8217;s right.</p><p><strong>Dwarkesh Patel</strong></p><p>If you didn&#8217;t do that, then you&#8217;d make profit for two months and then you wouldn&#8217;t have margins because you wouldn&#8217;t have the best model.</p><p><strong>Dario Amodei</strong></p><p>But at some point that reaches the biggest scale that it can reach. And then in equilibrium, we have algorithmic improvements, but we&#8217;re spending roughly the same amount to train the next model as we spend to train the current model. At some point you run out of money in the economy.</p><p><strong>Dwarkesh Patel</strong></p><p>A fixed <a href="https://en.wikipedia.org/wiki/Lump_of_labour_fallacy">lump of labor fallacy</a>&#8230; The economy is going to grow, right? That&#8217;s one of your predictions. <a href="https://www.dwarkesh.com/p/elon-musk">We&#8217;re going to have the data centers in space</a>.</p><p><strong>Dario Amodei</strong></p><p>Yes, but this is another example of the theme I was talking about. The economy will grow much faster with AI than I think it ever has before. Right now the compute is growing 3x a year. I don&#8217;t believe the economy is gonna grow 300% a year. I said this in &#8220;Machines of Loving Grace&#8221;, I think we may get 10-20% per year growth in the economy, but we&#8217;re not gonna get 300% growth in the economy. So I think in the end, if compute becomes the majority of what the economy produces, it&#8217;s gonna be capped by that.</p><p><strong>Dwarkesh Patel</strong></p><p>So let&#8217;s assume a model where compute stays capped. The world where frontier labs are making money is one where they continue to make fast progress. Because fundamentally your margin is limited by how good the alternative is. So you are able to make money because you have a frontier model. 
If you didn&#8217;t have a frontier model you wouldn&#8217;t be making money. So this model requires there never to be a steady state. Forever and ever you keep making more algorithmic progress.</p><p><strong>Dario Amodei</strong></p><p>I don&#8217;t think that&#8217;s true. I mean, I feel like we&#8217;re in an economics class.</p><p><strong>Dwarkesh Patel</strong></p><p>Do you know the <a href="https://www.dwarkesh.com/p/tyler-cowen-4">Tyler Cowen</a> quote? We never stop talking about economics.</p><p><strong>Dario Amodei</strong></p><p>We never stop talking about economics. So no, I don&#8217;t think this field&#8217;s going to be a monopoly. All my lawyers never want me to say the word &#8220;monopoly&#8221;. But I don&#8217;t think this field&#8217;s going to be a monopoly. You do get industries in which there are a small number of players. Not one, but a small number of players.</p><p>Ordinarily, the way you get monopolies like Facebook or Meta&#8212;I always call them Facebook&#8212;is these kinds of <a href="https://en.wikipedia.org/wiki/Network_effect">network effects</a>. The way you get industries in which there are a small number of players, is very high costs of entry. <a href="https://en.wikipedia.org/wiki/Cloud_computing">Cloud</a> is like this. I think cloud is a good example of this. There are three, maybe four, players within cloud. I think that&#8217;s the same for AI, three, maybe four.</p><p>The reason is that it&#8217;s so expensive. It requires so much expertise and so much capital to run a cloud company. You have to put up all this capital. 
In addition to putting up all this capital, you have to get all of this other stuff that requires a lot of skill to make it happen.</p><p>So if you go to someone and you&#8217;re like, &#8220;I want to disrupt this industry, here&#8217;s $100 billion.&#8221; You&#8217;re like, &#8220;okay, I&#8217;m putting in $100 billion and also betting that you can do all these other things that these people have been doing.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>Only to decrease the profit.</p><p><strong>Dario Amodei</strong></p><p>The effect of your entering is that profit margins go down. So, we have equilibria like this all the time in the economy where we have a few players. Profits are not astronomical. Margins are not astronomical, but they&#8217;re not zero. That&#8217;s what we see on cloud. Cloud is very undifferentiated. Models are more differentiated than cloud.</p><p>Everyone knows Claude is good at different things than GPT is good at, than <a href="https://en.wikipedia.org/wiki/Google_Gemini">Gemini</a> is good at. It&#8217;s not just that Claude&#8217;s good at coding, GPT is good at math and reasoning. It&#8217;s more subtle than that. Models are good at different types of coding. Models have different styles. I think these things are actually quite different from each other, and so I would expect more differentiation than you see in cloud.</p><p>Now, there actually is one counter-argument. That counter-argument is if the process of producing models, if AI models can do that themselves, then that could spread throughout the economy. But that is not an argument for commoditizing AI models in general. That&#8217;s kind of an argument for commoditizing the whole economy at once.</p><p>I don&#8217;t know what quite happens in that world where basically anyone can do anything, anyone can build anything, and there&#8217;s no moat around anything at all. I don&#8217;t know, maybe we want that world. Maybe that&#8217;s the end state here. 
Maybe when AI models can do everything, if we&#8217;ve solved all the safety and security problems, that&#8217;s one of the mechanisms for the economy just flattening itself again. But that&#8217;s kind of far post-&#8220;country of geniuses in the data center.&#8221;</p><p><strong>Dwarkesh Patel</strong></p><p>Maybe a finer way to put that potential point is: 1) it seems like AI research is especially loaded on raw intellectual power, which will be especially abundant in the world of AGI. And 2) if you just look at the world today, there are very few technologies that seem to be diffusing as fast as AI algorithmic progress. So that does hint that this industry is sort of structurally diffusive.</p><p><strong>Dario Amodei</strong></p><p>I think coding is going fast, but I think AI research is a superset of coding and there are aspects of it that are not going fast. But I do think, again, once we get coding, once we get AI models going fast, then that will speed up the ability of AI models to do everything else. So while coding is going fast now, I think once the AI models are building the next AI models and building everything else, the whole economy will kind of go at the same pace.</p><p>I am worried geographically, though. I&#8217;m a little worried that just proximity to AI, having heard about AI, may be one differentiator. So when I said the 10-20% growth rate, a worry I have is that the growth rate could be like 50% in Silicon Valley and parts of the world that are socially connected to Silicon Valley, and not that much faster than its current pace elsewhere. I think that&#8217;d be a pretty messed up world. So one of the things I think about a lot is how to prevent that.</p><p><strong>Dwarkesh Patel</strong></p><p>Do you think that once we have this country of geniuses in a data center, that robotics is sort of quickly solved afterwards? 
Because it seems like a big problem with robotics is that a human can learn how to teleoperate current hardware, but current AI models can&#8217;t, at least not in a way that&#8217;s super productive. And so if we have this ability to learn like a human, shouldn&#8217;t it solve robotics immediately as well?</p><p><strong>Dario Amodei</strong></p><p>I don&#8217;t think it&#8217;s dependent on learning like a human. It could happen in different ways. Again, we could have trained the model on many different video games, which are like robotic controls, or many different simulated robotics environments, or just train them to control computer screens, and they learn to generalize.</p><p>So it will happen... it&#8217;s not necessarily dependent on human-like learning. Human-like learning is one way it could happen. If the model&#8217;s like, &#8220;Oh, I pick up a robot, I don&#8217;t know how to use it, I learn,&#8221; that could happen because we discovered continual learning. That could also happen because we trained the model on a bunch of environments and then generalized, or it could happen because the model learns that in the context length. It doesn&#8217;t actually matter which way. If we go back to the discussion we had an hour ago, that type of thing can happen in several different ways.</p><p>But I do think when for whatever reason the models have those skills, then robotics will be revolutionized&#8212;both the design of robots, because the models will be much better than humans at that, and also the ability to control robots. So we&#8217;ll get better at building the physical hardware, building the physical robots, and we&#8217;ll also get better at controlling it.</p><p>Now, does that mean the robotics industry will also be generating trillions of dollars of revenue? My answer there is yes, but there will be the same extremely fast, but not infinitely fast diffusion. So will robotics be revolutionized? Yeah, maybe tack on another year or two. 
That&#8217;s the way I think about these things.</p><p><strong>Dwarkesh Patel</strong></p><p>Makes sense. There&#8217;s a general skepticism about extremely fast progress. Here&#8217;s my view. It sounds like you are going to solve continual learning one way or another within a matter of years. But just as people weren&#8217;t talking about continual learning a couple of years ago, and then we realized, &#8220;Oh, why aren&#8217;t these models as useful as they could be right now, even though they are clearly passing the Turing test and are experts in so many different domains? Maybe it&#8217;s this thing.&#8221;</p><p>Then we solve this thing and we realize, actually, there&#8217;s another thing that human intelligence can do that&#8217;s a basis of human labor that these models can&#8217;t do. So why not think there will be more things like this, where we&#8217;ve found more pieces of human intelligence?</p><p><strong>Dario Amodei</strong></p><p>Well, to be clear, I think <a href="https://www.ibm.com/think/topics/continual-learning">continual learning</a>, as I&#8217;ve said before, might not be a barrier at all. I think we may just get there by pre-training generalization and RL generalization. I think there just might not be such a thing at all.</p><p>In fact, I would point to the history in <a href="https://en.wikipedia.org/wiki/Machine_learning">ML</a> of people coming up with things that are barriers that end up kind of dissolving within the big blob of compute. People talked about, &#8220;How do your models keep track of nouns and verbs?&#8221; &#8220;They can understand syntactically, but they can&#8217;t understand semantically? It&#8217;s only statistical correlations.&#8221; &#8220;You can understand a paragraph, you can&#8217;t understand a word. 
There&#8217;s reasoning, you can&#8217;t do reasoning.&#8221; But then suddenly it turns out you can do code and math very well.</p><p>So I think there&#8217;s actually a stronger history of some of these things seeming like a big deal and then kind of dissolving. Some of them are real. The need for data is real, maybe continual learning is a real thing.</p><p>But again, I would ground us in something like code. I think we may get to the point in a year or two where the models can just do SWE end-to-end. That&#8217;s a whole task. That&#8217;s a whole sphere of human activity that we&#8217;re just saying models can do now.</p><p><strong>Dwarkesh Patel</strong></p><p>When you say end-to-end, do you mean setting technical direction, understanding the context of the problem, et cetera?</p><p><strong>Dario Amodei</strong></p><p>Yes. I mean all of that.</p><p><strong>Dwarkesh Patel</strong></p><p>Interesting. I feel like that is AGI-complete, which maybe is internally consistent. But it&#8217;s not like saying 90% of code or 100% of code.</p><p><strong>Dario Amodei</strong></p><p>No, I gave this spectrum: 90% of code, 100% of code, 90% of end-to-end SWE, 100% of end-to-end SWE. New tasks are created for SWEs. Eventually those get done as well. It&#8217;s a long spectrum there, but we&#8217;re traversing the spectrum very quickly.</p><p><strong>Dwarkesh Patel</strong></p><p>I do think it&#8217;s funny that I&#8217;ve seen a couple of podcasts you&#8217;ve done where the hosts will be like, &#8220;But Dwarkesh wrote <a href="https://www.dwarkesh.com/p/timelines-june-2025">the essay about the continuous learning thing</a>.&#8221; It always makes me crack up because you&#8217;ve been an AI researcher for 10 years. I&#8217;m sure there&#8217;s some feeling of, &#8220;Okay, so a podcaster wrote an essay, and every interview I get asked about it.&#8221;</p><p><strong>Dario Amodei</strong></p><p>The truth of the matter is that we&#8217;re all trying to figure this out together. 
There are some ways in which I&#8217;m able to see things that others aren&#8217;t. These days that probably has more to do with seeing a bunch of stuff within Anthropic and having to make a bunch of decisions than I have any great research insight that others don&#8217;t.</p><p>I&#8217;m running a 2,500 person company. It&#8217;s actually pretty hard for me to have concrete research insight, much harder than it would have been 10 years ago or even two or three years ago.</p><p><strong>Dwarkesh Patel</strong></p><p>As we go towards a world of a full drop-in remote worker replacement, does an <a href="https://en.wikipedia.org/wiki/API">API</a> pricing model still make the most sense? If not, what is the correct way to price AGI, or serve AGI?</p><p><strong>Dario Amodei</strong></p><p>I think there&#8217;s going to be a bunch of different business models here, all at once, that are going to be experimented with. I actually do think that the API model is more durable than many people think. One way I think about it is if the technology is advancing quickly, if it&#8217;s advancing exponentially, what that means is there&#8217;s always a surface area of new use cases that have been developed in the last three months.</p><p>Any kind of product surface you put in place is always at risk of sort of becoming irrelevant. Any given product surface probably makes sense for a range of capabilities of the model. The chatbot is already running into limitations where making it smarter doesn&#8217;t really help the average consumer that much. But I don&#8217;t think that&#8217;s a limitation of AI models. I don&#8217;t think that&#8217;s evidence that the models are good enough and them getting better doesn&#8217;t matter to the economy. It doesn&#8217;t matter to that particular product.</p><p>So I think the value of the API is that the API always offers an opportunity, very close to the bare metal, to build on what the latest thing is. 
There&#8217;s always going to be this front of new startups and new ideas that weren&#8217;t possible a few months ago and are possible because the model is advancing.</p><p>I actually predict that it&#8217;s going to exist alongside other models, but we&#8217;re always going to have the API business model because there&#8217;s always going to be a need for a thousand different people to try experimenting with the model in a different way. 100 of them become startups and ten of them become big successful startups. Two or three really end up being the way that people use the model of a given generation.</p><p>So I basically think it&#8217;s always going to exist. At the same time, I&#8217;m sure there&#8217;s going to be other models as well. Not every token that&#8217;s output by the model is worth the same amount. Think about what is the value of the tokens that the model outputs when someone calls them up and says, &#8220;My Mac isn&#8217;t working,&#8221; or something, the model&#8217;s like, &#8220;restart it.&#8221; Someone hasn&#8217;t heard that before, but the model said that 10 million times. Maybe that&#8217;s worth like a dollar or a few cents or something.</p><p>Whereas if the model goes to one of the pharmaceutical companies and it says, &#8220;Oh, you know, this molecule you&#8217;re developing, you should take the aromatic ring from that end of the molecule and put it on that end of the molecule. If you do that, wonderful things will happen.&#8221; Those tokens could be worth tens of millions of dollars.</p><p>So I think we&#8217;re definitely going to see business models that recognize that. At some point we&#8217;re going to see &#8220;pay for results&#8221; in some form, or we may see forms of compensation that are like labor, that kind of work by the hour. I don&#8217;t know. I think because it&#8217;s a new industry, a lot of things are going to be tried. 
I don&#8217;t know what will turn out to be the right thing.</p><p><strong>Dwarkesh Patel</strong></p><p>I take your point that people will have to try things to figure out what is the best way to use this blob of intelligence. But what I find striking is Claude Code. I don&#8217;t think in the history of startups there has been a single application that has been as hotly competed in as coding agents. Claude Code is a category leader here. That seems surprising to me.</p><p>It doesn&#8217;t seem intrinsically that Anthropic had to build this. I wonder if you have an accounting of why it had to be Anthropic or how Anthropic ended up building an application in addition to the model underlying it that was successful.</p><p><strong>Dario Amodei</strong></p><p>So it actually happened in a pretty simple way, which is that we had our own coding models, which were good at coding. Around the beginning of 2025, I said, &#8220;I think the time has come where you can have nontrivial acceleration of your own research if you&#8217;re an AI company by using these models.&#8221; Of course, you need an interface, you need a harness to use them.</p><p>So I encouraged people internally. I didn&#8217;t say this is one thing that you have to use. I just said people should experiment with this. I think it might have been originally called Claude CLI, and then the name eventually got changed to Claude Code. Internally, it was the thing that everyone was using and it was seeing fast internal adoption.</p><p>I looked at it and I said, &#8220;Probably we should launch this externally, right?&#8221; It&#8217;s seen such fast adoption within Anthropic. Coding is a lot of what we do. We have an audience of many, many hundreds of people that&#8217;s in some ways at least representative of the external audience. So it looks like we already have product market fit. Let&#8217;s launch this thing.</p><p>And then we launched it. 
I think just the fact that we ourselves are kind of developing the model and we ourselves know what we most need to use the model, I think it&#8217;s kind of creating this feedback loop.</p><p><strong>Dwarkesh Patel</strong></p><p>I see. In the sense that you, let&#8217;s say a developer at Anthropic is like, &#8220;Ah, it would be better if it was better at this X thing.&#8221; Then you bake that into the next model that you build.</p><p><strong>Dario Amodei</strong></p><p>That&#8217;s one version of it, but then there&#8217;s just the ordinary product iteration. We have a bunch of coders within Anthropic, they use Claude Code every day and so we get fast feedback. That was more important in the early days. Now, of course, there are millions of people using it, and so we get a bunch of external feedback as well. But it&#8217;s just great to be able to get kind of fast internal feedback.</p><p>I think this is the reason why we launched a coding model and didn&#8217;t launch a pharmaceutical company. My background&#8217;s in biology, but we don&#8217;t have any of the resources that are needed to launch a pharmaceutical company.</p><h3>01:31:19 - Will regulations destroy the boons of AGI?</h3><p><strong>Dwarkesh Patel</strong></p><p>Let me now ask you about making AI go well. It seems like whatever vision we have about how AI goes well has to be compatible with two things: 1) the ability to build and run AIs is diffusing extremely rapidly and 2) the population of AIs, the amount we have and their intelligence, will also increase very rapidly.</p><p>That means that lots of people will be able to build huge populations of misaligned AIs, or AIs which are just companies which are trying to increase their footprint or have weird psyches like <a href="https://en.wikipedia.org/wiki/Sydney_(Microsoft)">Sydney Bing</a>, but now they&#8217;re superhuman. 
What is a vision for a world in which we have an equilibrium that is compatible with lots of different AIs, some of which are misaligned, running around?</p><p><strong>Dario Amodei</strong></p><p>I think in &#8220;The Adolescence of Technology&#8221;, I was skeptical of the balance of power. But the thing I was specifically skeptical of is you have three or four of these companies all building models that are derived from the same thing, that they would check each other. Or even that any number of them would check each other.</p><p>We might live in an offense-dominant world where one person or one AI model is smart enough to do something that causes damage for everything else. In the short run, we have a limited number of players now. So we can start within the limited number of players. We need to put in place the safeguards. We need to make sure everyone does the right alignment work. We need to make sure everyone has bioclassifiers. Those are the immediate things we need to do.</p><p>I agree that that doesn&#8217;t solve the problem in the long run, particularly if the ability of AI models to make other AI models proliferates, then the whole thing can become harder to solve. I think in the long run we need some architecture of governance. We need some architecture of governance that preserves human freedom, but also allows us to govern a very large number of human systems, AI systems, hybrid human-AI companies or economic units.</p><p>So we&#8217;re gonna need to think about: how do we protect the world against bioterrorism? How do we protect the world against <a href="https://en.wikipedia.org/wiki/Mirror-image_life">mirror life</a>? Probably we&#8217;re gonna need some kind of AI monitoring system that monitors for all of these things. But then we need to build this in a way that preserves civil liberties and our constitutional rights. 
So I think just as anything else, it&#8217;s a new security landscape with a new set of tools and a new set of vulnerabilities.</p><p>My worry is, if we had 100 years for this to happen all very slowly, we&#8217;d get used to it. We&#8217;ve gotten used to the presence of explosives in society or the presence of various new weapons or the presence of video cameras. We would get used to it over 100 years and we&#8217;d develop governance mechanisms. We&#8217;d make our mistakes. My worry is just that this is happening all so fast. So maybe we need to do our thinking faster about how to make these governance mechanisms work.</p><p><strong>Dwarkesh Patel</strong></p><p>It seems like in an offense-dominant world, over the course of the next century&#8212;the idea is that AI is making the progress that would happen over the next century happen in some period of five to ten years&#8212;we would still need the same mechanisms, or balance of power would be similarly intractable, even if humans were the only game in town.</p><p>I guess we have the advice of AI. But it fundamentally doesn&#8217;t seem like a totally different ball game here. If checks and balances were going to work, they would work with humans as well. If they aren&#8217;t going to work, they wouldn&#8217;t work with AIs as well. So maybe this just dooms human checks and balances as well.</p><p><strong>Dario Amodei</strong></p><p>Again, I think there&#8217;s some way to make this happen. The governments of the world may have to work together to make it happen. We may have to talk to AIs about building societal structures in such a way that these defenses are possible. I don&#8217;t know. 
I don&#8217;t want to say this is so far ahead in time, but it&#8217;s so far ahead in technological ability that may happen over a short period of time, that it&#8217;s hard for us to anticipate it in advance.</p><p><strong>Dwarkesh Patel</strong></p><p>Speaking of governments getting involved, on December 26, the <a href="https://natlawreview.com/article/tennessees-ai-bill-would-criminalize-training-ai-cha">Tennessee legislature introduced a bill</a> which said, &#8220;It would be an offense for a person to knowingly train artificial intelligence to provide emotional support, including through open-ended conversations with a user.&#8221; Of course, one of the things that Claude attempts to do is be a thoughtful, knowledgeable friend.</p><p>In general, it seems like we&#8217;re going to have this patchwork of state laws. A lot of the benefits that normal people could experience as a result of AI are going to be curtailed, especially when we get into the kinds of things you discuss in &#8220;Machines of Loving Grace&#8221;: biological freedom, mental health improvements, et cetera.</p><p>It seems easy to imagine worlds in which these get Whac-A-Moled away by different laws, whereas bills like this don&#8217;t seem to address the actual existential threats that you&#8217;re concerned about. I&#8217;m curious to understand, in the context of things like this, Anthropic&#8217;s position against the federal moratorium on state AI laws.</p><p><strong>Dario Amodei</strong></p><p>There are many different things going on at once. I think that particular law is dumb. It was clearly made by legislators who just probably had little idea what AI models could do and not do. They&#8217;re like, &#8220;AI models serving us, that just sounds scary. I don&#8217;t want that to happen.&#8221; So we&#8217;re not in favor of that.</p><p>But that wasn&#8217;t the thing that was being voted on. 
The thing that was being voted on is: we&#8217;re going to ban all state regulation of AI for 10 years with no apparent plan to do any federal regulation of AI, which would take Congress to pass, which is a very high bar. So the idea that we&#8217;d ban states from doing anything for 10 years&#8230; People said they had a plan for the federal government, but there was no actual proposal on the table. There was no actual attempt.</p><p>Given the serious dangers that I lay out in &#8220;Adolescence of Technology&#8221; around things like biological weapons, bioterrorism, and autonomy risk, and the timelines we&#8217;ve been talking about&#8212;10 years is an eternity&#8212;I think that&#8217;s a crazy thing to do. So if that&#8217;s the choice, if that&#8217;s what you force us to choose, then we&#8217;re going to choose not to have that moratorium. I think the benefits of that position exceed the costs, but it&#8217;s not a perfect position if that&#8217;s the choice.</p><p>Now, I think the thing that we should do, the thing that I would support, is the federal government should step in, not saying &#8220;states you can&#8217;t regulate&#8221;, but &#8220;Here&#8217;s what we&#8217;re going to do, and states you can&#8217;t differ from this.&#8221; I think preemption is fine in the sense of saying that the federal government says, &#8220;Here is our standard. This applies to everyone. States can&#8217;t do something different.&#8221;</p><p>That would be something I would support if it would be done in the right way. But this idea of states, &#8220;You can&#8217;t do anything and we&#8217;re not doing anything either,&#8221; that struck us as very much not making sense. 
I think it will not age well; it is already starting to not age well with all the backlash that you&#8217;ve seen.</p><p>Now, in terms of what we would want, the things we&#8217;ve talked about are starting with transparency standards in order to monitor some of these autonomy risks and bioterrorism risks. As the risks become more serious, as we get more evidence for them, then I think we could be more aggressive in some targeted ways and say, &#8220;Hey, AI bioterrorism is really a threat. Let&#8217;s pass a law that forces people to have classifiers.&#8221;</p><p>I could even imagine&#8230; It depends. It depends how serious the threat ends up being. We don&#8217;t know for sure. We need to pursue this in an intellectually honest way where we say that ahead of time, the risk has not emerged yet. But I could certainly imagine, with the pace that things are going at, a world where later this year we say, &#8220;Hey, this AI bioterrorism stuff is really serious. We should do something about it. We should put it in a federal standard. If the federal government won&#8217;t act, we should put it in a state standard.&#8221; I could totally see that.</p><p><strong>Dwarkesh Patel</strong></p><p>I&#8217;m concerned about a world where if you just consider the pace of progress you&#8217;re expecting, the life cycle of legislation... The benefits are, as you say because of diffusion lag, slow enough that I really do think this patchwork of state laws, on the current trajectory, would prohibit. I mean, if having an emotional chatbot friend is something that freaks people out, then just imagine the kinds of actual benefits from AI we want normal people to be able to experience. 
From improvements in health and healthspan and improvements in mental health and so forth.</p><p>Whereas at the same time, it seems like you think the dangers are already on the horizon and I just don&#8217;t see that much&#8230; It seems like it would be especially injurious to the benefits of AI as compared to the dangers of AI. So that&#8217;s maybe where the cost benefit makes less sense to me.</p><p><strong>Dario Amodei</strong></p><p>So there&#8217;s a few things here. People talk about there being thousands of these state laws. First of all, the vast, vast majority of them do not pass. The world works a certain way in theory, but just because a law has been passed doesn&#8217;t mean it&#8217;s really enforced. The people implementing it may be like, &#8220;Oh my God, this is stupid. It would mean shutting off everything that&#8217;s ever been built in Tennessee.&#8221; Very often, laws are interpreted in a way that makes them not as dangerous or harmful. On the same side, of course, you have to worry if you&#8217;re passing a law to stop a bad thing; you have this problem as well.</p><p>My basic view is that if we could decide what laws were passed and how things were done&#8212;and we&#8217;re only one small input into that&#8212;I would deregulate a lot of the stuff around the health benefits of AI. I don&#8217;t worry as much about the chatbot laws. I actually worry more about the drug approval process, where I think AI models are going to greatly accelerate the rate at which we discover drugs, and the pipeline will get jammed up. The pipeline will not be prepared to process all the stuff that&#8217;s going through it.</p><p>I think reform of the regulatory process should bias more towards the fact that we have a lot of things coming where the safety and efficacy is actually going to be really crisp and clear, a beautiful thing, and really effective. 
Maybe we don&#8217;t need all this superstructure around it that was designed around an era of drugs that barely work and often have serious side effects.</p><p>At the same time, I think we should be ramping up quite significantly the safety and security legislation. Like I&#8217;ve said, starting with transparency is my view of trying not to hamper the industry, trying to find the right balance. I&#8217;m worried about it. Some people criticize my essay for saying, &#8220;That&#8217;s too slow. The dangers of AI will come too soon if we do that.&#8221;</p><p>Well, basically, I think the last six months and maybe the next few months are going to be about transparency. Then, if these risks emerge when we&#8217;re more certain of them&#8212;which I think we might be as soon as later this year&#8212;then I think we need to act very fast in the areas where we&#8217;ve actually seen the risk.</p><p>I think the only way to do this is to be nimble. Now, the legislative process is normally not nimble, but we need to emphasize the urgency of this to everyone involved. That&#8217;s why I&#8217;m sending this message of urgency. That&#8217;s why I wrote <em>Adolescence of Technology</em>. I wanted policymakers, economists, national security professionals, and decision-makers to read it so that they have some hope of acting faster than they would have otherwise.</p><p><strong>Dwarkesh Patel</strong></p><p>Is there anything you can do or advocate that would make it more certain that the benefits of AI are better instantiated? I feel like you have worked with legislatures to say, &#8220;Okay, we&#8217;re going to prevent bioterrorism here. 
We&#8217;re going to increase transparency, we&#8217;re going to increase whistleblower protection.&#8221; But I think by default, the actual benefits we&#8217;re looking forward to seem very fragile to different kinds of moral panics or political economy problems.</p><p><strong>Dario Amodei</strong></p><p>I don&#8217;t actually agree that much regarding the developed world. I feel like in the developed world, markets function pretty well. When there&#8217;s a lot of money to be made on something and it&#8217;s clearly the best available alternative, it&#8217;s actually hard for the regulatory system to stop it.</p><p>We&#8217;re seeing that in AI itself. A thing I&#8217;ve been trying to fight for is <a href="https://www.axios.com/2026/02/10/anthropic-ceo-china-chip-ban">export controls on chips to China</a>. That&#8217;s in the national security interest of the US. That&#8217;s squarely within the policy beliefs of almost everyone in Congress of both parties. The case is very clear. The counterarguments against it, I&#8217;ll politely call them fishy. Yet it doesn&#8217;t happen and we sell the chips because there&#8217;s so much money riding on it. That money wants to be made. In that case, in my opinion, that&#8217;s a bad thing. But it also applies when it&#8217;s a good thing.</p><p>So if we&#8217;re talking about drugs and benefits of the technology, I am not as worried about those benefits being hampered in the developed world. I am a little worried about them going too slow. As I said, I do think we should work to speed the approval process in the FDA. I do think we should fight against these chatbot bills that you&#8217;re describing. Described individually, I&#8217;m against them. I think they&#8217;re stupid.</p><p>But I actually think the bigger worry is the developing world, where we don&#8217;t have functioning markets and where we often can&#8217;t build on the technology that we&#8217;ve had. I worry more that those folks will get left behind. 
And I worry that even if the cures are developed, maybe there&#8217;s someone in rural Mississippi who doesn&#8217;t get it as well. That&#8217;s a smaller version of the concern we have in the developing world.</p><p>So the things we&#8217;ve been doing are working with philanthropists. We work with folks who deliver medicine and health interventions to the developing world, to sub-Saharan Africa, India, Latin America, and other developing parts of the world. That&#8217;s the thing I think that won&#8217;t happen on its own.</p><h3>01:47:41 - Why can&#8217;t China and America both have a country of geniuses in a datacenter?</h3><p><strong>Dwarkesh Patel</strong></p><p>You mentioned export controls. Why shouldn&#8217;t the US and China both have a &#8220;country of geniuses in a data center&#8221;?</p><p><strong>Dario Amodei</strong></p><p>Why won&#8217;t it happen or why shouldn&#8217;t it happen?</p><p><strong>Dwarkesh Patel</strong></p><p>Why shouldn&#8217;t it happen.</p><p><strong>Dario Amodei</strong></p><p>If this does happen, we could have a few situations. If we have an offense-dominant situation, we could have a situation like nuclear weapons, but more dangerous. Either side could easily destroy everything.</p><p>We could also have a world where it&#8217;s unstable. <a href="https://en.wikipedia.org/wiki/Mutually_assured_destruction">The nuclear equilibrium</a> is stable because it&#8217;s <a href="https://en.wikipedia.org/wiki/Deterrence_theory">deterrence</a>. But let&#8217;s say there was uncertainty about, if the two AIs fought, which AI would win? That could create instability. You often have conflict when the two sides have a different assessment of their likelihood of winning. If one side is like, &#8220;Oh yeah, there&#8217;s a 90% chance I&#8217;ll win,&#8221; and the other side thinks the same, then a fight is much more likely. 
They can&#8217;t both be right, but they can both think that.</p><p><strong>Dwarkesh Patel</strong></p><p>But this seems like a fully general argument against the diffusion of AI technology. That&#8217;s the implication of this world.</p><p><strong>Dario Amodei</strong></p><p>Let me just go on, because I think we will get diffusion eventually. The other concern I have is that governments will oppress their own people with AI. I&#8217;m worried about a world where you have a country in which there&#8217;s already a government that&#8217;s building a high-tech authoritarian state. To be clear, this is about the government. This is not about the people. We need to find a way for people everywhere to benefit. My worry here is about governments. My worry is if the world gets carved up into two pieces, one of those two pieces could be authoritarian or totalitarian in a way that&#8217;s very difficult to displace.</p><p>Now, will governments eventually get powerful AI, and is there a risk of authoritarianism? Yes. Will governments eventually get powerful AI, and is there a risk of bad equilibria? Yes, I think both things. But the initial conditions matter. At some point, we&#8217;re going to need to set up the rules of the road.</p><p>I&#8217;m not saying that one country, either the United States or a coalition of democracies&#8212;which I think would be a better setup, although it requires more international cooperation than we currently seem to want to make&#8212;should just say, &#8220;These are the rules of the road.&#8221; There&#8217;s going to be some negotiation. The world is going to have to grapple with this.</p><p>What I would like is for the democratic nations of the world&#8212;those whose governments represent closer to pro-human values&#8212;to be holding the stronger hand and to have more leverage when the rules of the road are set. 
So I&#8217;m very concerned about that initial condition.</p><p><strong>Dwarkesh Patel</strong></p><p>I was re-listening to the interview from three years ago, and one of the ways it aged poorly is that I kept asking questions assuming there was going to be some key fulcrum moment two to three years from now. In fact, being that far out, it just seems like progress continues, AI improves, AI is more diffused, and people will use it for more things.</p><p>It seems like you&#8217;re imagining a world in the future where the countries get together, and &#8220;Here&#8217;s the rules of the road, here&#8217;s the leverage we have, and here&#8217;s the leverage you have.&#8221; But on the current trajectory, everybody will have more AI. Some of that AI will be used by authoritarian countries. Some of that within the authoritarian countries will be used by private actors versus state actors.</p><p>It&#8217;s not clear who will benefit more. It&#8217;s always unpredictable to tell in advance. It seems like the internet privileged authoritarian countries more than you would&#8217;ve expected. Maybe AI will be the opposite way around. I want to better understand what you&#8217;re imagining here.</p><p><strong>Dario Amodei</strong></p><p>Just to be precise about it, I think the exponential of the underlying technology will continue as it has before. The models get smarter and smarter, even when they get to a &#8220;country of geniuses in a data center.&#8221; I think you can continue to make the model smarter. There&#8217;s a question of getting diminishing returns on their value in the world. How much does it matter after you&#8217;ve already solved human biology? At some point you can do harder, more abstruse math problems, but nothing after that matters.</p><p>Putting that aside, I do think the exponential will continue, but there will be certain distinguished points on the exponential. 
Companies, individuals, and countries will reach those points at different times.</p><p>In &#8220;The Adolescence of Technology&#8221; I talk about: Is a nuclear deterrent still stable in the world of AI? I don&#8217;t know, but that&#8217;s an example of one thing we&#8217;ve taken for granted. The technology could reach such a level that we can no longer be certain of it. Think of others. There are points where if you reach a certain level, maybe you have offensive cyber dominance, and every computer system is transparent to you after that unless the other side has an equivalent defense.</p><p>I don&#8217;t know what the critical moment is or if there&#8217;s a single critical moment. But I think there will be either a critical moment, a small number of critical moments, or some critical window where AI confers some large advantage from the perspective of national security, and one country or coalition has reached it before others.</p><p>I&#8217;m not advocating that they just say, &#8220;Okay, we&#8217;re in charge now.&#8221; That&#8217;s not how I think about it. The other side is always catching up. There are extreme actions you&#8217;re not willing to take, and it&#8217;s not right to take complete control anyway. But at the point that happens, people are going to understand that the world has changed. There&#8217;s going to be some negotiation, implicit or explicit, about what the post-AI world order looks like. My interest is in making that negotiation be one in which classical liberal democracy has a strong hand.</p><p><strong>Dwarkesh Patel</strong></p><p>I want to better understand what that means, because you say in the essay, &#8220;Autocracy is simply not a form of government that people can accept in the post-powerful AI age.&#8221; That sounds like you&#8217;re saying the CCP as an institution cannot exist after we get AGI. 
That seems like a very strong demand, and it seems to imply a world where the leading lab or the leading country will be able to&#8212;and by that language, should get to&#8212;determine how the world is governed or what kinds of governments are, and are not, allowed.</p><p><strong>Dario Amodei</strong></p><p>I believe that paragraph said something like, &#8220;You could take it even further and say X.&#8221; I wasn&#8217;t necessarily endorsing that view. I was saying, &#8220;Here&#8217;s a weaker thing that I believe. We have to worry a lot about authoritarians and we should try to check them and limit their power. You could take this much further and have a more interventionist view that says authoritarian countries with AI are these self-fulfilling cycles that are very hard to displace, so you just need to get rid of them from the beginning.&#8221;</p><p>That has exactly all the problems you say. If you were to make a commitment to overthrowing every authoritarian country, they would take a bunch of actions now that could lead to instability. That just may not be possible.</p><p>But the point I was making that I do endorse is that it is quite possible that... Today, the view, my view, in most of the Western world is that democracy is a better form of government than authoritarianism. But if a country&#8217;s authoritarian, we don&#8217;t react the way we&#8217;d react if they committed a genocide or something. I guess what I&#8217;m saying is I&#8217;m a little worried that in the age of AGI, authoritarianism will have a different meaning. It will be a graver thing. We have to decide one way or another how to deal with that. The interventionist view is one possible view. I was exploring such views. It may end up being the right view, or it may end up being too extreme. But I do have hope.</p><p>One piece of hope I have is that we have seen that as new technologies are invented, forms of government become obsolete. 
I mentioned this in &#8220;Adolescence of Technology&#8221;, where I said <a href="https://en.wikipedia.org/wiki/Feudalism">feudalism</a> was basically a form of government, and when we invented industrialization, feudalism was no longer sustainable. It no longer made sense.</p><p><strong>Dwarkesh Patel</strong></p><p>Why is that hope? Couldn&#8217;t that imply that democracy is no longer going to be a competitive system?</p><p><strong>Dario Amodei</strong></p><p>Right, it could go either way. But these problems with authoritarianism get deeper. I wonder if that&#8217;s an indicator of other problems that authoritarianism will have. In other words, because authoritarianism becomes worse, people are more afraid of it. They work harder to stop it. You have to think in terms of total equilibrium. I just wonder if it will motivate new ways of thinking about how to preserve and protect freedom with the new technology.</p><p>Even more optimistically, will it lead to a collective reckoning and a more emphatic realization of how important some of the things we take as individual rights are? A more emphatic realization that we really can&#8217;t give these away. We&#8217;ve seen there&#8217;s no other way to live that actually works.</p><p>I am actually hopeful that&#8212;it sounds too idealistic, but I believe it could be the case&#8212;dictatorships become morally obsolete. They become morally unworkable forms of government and the crisis that that creates is sufficient to force us to find another way.</p><p><strong>Dwarkesh Patel</strong></p><p>I think there is genuinely a tough question here which I&#8217;m not sure how you resolve. We&#8217;ve had to come out one way or another on it through history. With China in the &#8217;70s and &#8217;80s, we decided that even though it&#8217;s an authoritarian system, we will engage with it. 
I think in retrospect that was the right call, because it&#8217;s a state authoritarian system but a billion-plus people are much wealthier and better off than they would&#8217;ve otherwise been. It&#8217;s not clear that it would&#8217;ve stopped being an authoritarian country otherwise. You can just look at North Korea as an example of that.</p><p>I don&#8217;t know if it takes that much intelligence to remain an authoritarian country that continues to coalesce its own power. You can imagine a North Korea with an AI that&#8217;s much worse than everybody else&#8217;s, but still enough to keep power.</p><p>In general, it seems like we should just have this attitude that the benefits of AI&#8212;in the form of all these empowerments of humanity and health&#8212;will be big. Historically, we have decided it&#8217;s good to spread the benefits of technology widely, even to people whose governments are authoritarian. It is a tough question, how to think about it with AI, but historically we have said, &#8220;yes, this is a positive-sum world, and it&#8217;s still worth diffusing the technology.&#8221;</p><p><strong>Dario Amodei</strong></p><p>There are a number of choices we have. Framing this as a government-to-government decision in national security terms is one lens, but there are a lot of other lenses. You could imagine a world where we produce all these cures to diseases. The cures are fine to sell to authoritarian countries, but the data centers just aren&#8217;t. The chips and the data centers aren&#8217;t, and the AI industry itself isn&#8217;t.</p><p>Another possibility I think folks should think about is this. Could there be developments we can make&#8212;either that naturally happen as a result of AI, or that we could make happen by building technology on AI&#8212;that create an equilibrium where it becomes infeasible for authoritarian countries to deny their people private use of the benefits of the technology? 
Are there equilibria where we can give everyone in an authoritarian country their own AI model that defends them from surveillance and there isn&#8217;t a way for the authoritarian country to crack down on this while retaining power?</p><p>I don&#8217;t know. That sounds to me like if that went far enough, it would be a reason why authoritarian countries would disintegrate from the inside. But maybe there&#8217;s a middle world where there&#8217;s an equilibrium where, if they want to hold on to power, the authoritarians can&#8217;t deny individualized access to the technology.</p><p>But I actually do have a hope for the more radical version. Is it possible that the technology might inherently have properties&#8212;or that by building on it in certain ways we could create properties&#8212;that have this dissolving effect on authoritarian structures? Now, we hoped originally&#8212;think back to the beginning of the Obama administration&#8212;that social media and the internet would have that property, and it turns out not to. But what if we could try again with the knowledge of how many things could go wrong, and that this is a different technology? I don&#8217;t know if it would work, but it&#8217;s worth a try.</p><p><strong>Dwarkesh Patel</strong></p><p>It&#8217;s just very unpredictable. There are first principles reasons why authoritarianism might be privileged.</p><p><strong>Dario Amodei</strong></p><p>It&#8217;s all very unpredictable. We just have to recognize the problem and come up with 10 things we can try, try those, and then assess which ones are working, if any. Then try new ones if the old ones aren&#8217;t working.</p><p><strong>Dwarkesh Patel</strong></p><p>But I guess that nets out to today, as you say, that we will not sell data centers, or chips, and the ability to make chips to China. So in some sense, you are denying&#8230; There would be some benefits to the Chinese economy, Chinese people, et cetera, because we&#8217;re doing that. 
Then there&#8217;d also be benefits to the American economy because it&#8217;s a positive-sum world. We could trade. They could have their country&#8217;s data centers doing one thing. We could have ours doing another. Already, you&#8217;re saying it&#8217;s not worth that positive-sum stipend to empower those countries?</p><p><strong>Dario Amodei</strong></p><p>What I would say is that we are about to be in a world where growth and economic value will come very easily if we&#8217;re able to build these powerful AI models. What will not come easily is distribution of benefits, distribution of wealth, political freedom. These are the things that are going to be hard to achieve.</p><p>So when I think about policy, I think that the technology and the market will deliver all the fundamental benefits, this is my fundamental belief, almost faster than we can take them. These questions about distribution and political freedom and rights are the ones that will actually matter and that policy should focus on.</p><p><strong>Dwarkesh Patel</strong></p><p>Speaking of distribution, as you were mentioning, we have developing countries. In many cases, <a href="https://www.investopedia.com/terms/c/catch-up-effect.asp">catch-up growth</a> has been weaker than we would have hoped for. But when catch-up growth does happen, it&#8217;s fundamentally because they have underutilized labor. We can bring the capital and know-how from developed countries to these countries, and then they can grow quite rapidly.</p><p>Obviously, in a world where labor is no longer the constraining factor, this mechanism no longer works. So is the hope basically to rely on philanthropy from the people or countries who immediately get wealthy from AI? What is the hope?</p><p><strong>Dario Amodei</strong></p><p>Philanthropy should obviously play some role, as it has in the past. 
But I think growth is always better and stronger if we can make it endogenous.</p><p>What are the relevant industries in an AI-driven world? I said we shouldn&#8217;t build data centers in China, but there&#8217;s no reason we shouldn&#8217;t build data centers in Africa. In fact, I think it&#8217;d be great to build data centers in Africa. As long as they&#8217;re not owned by China, we should build data centers in Africa. I think that&#8217;s a great thing to do.</p><p>There&#8217;s no reason we can&#8217;t build a pharmaceutical industry that&#8217;s AI-driven. If AI is accelerating drug discovery, then there will be a bunch of biotech startups. Let&#8217;s make sure some of those happen in the developing world. Certainly, during the transition&#8212;we can talk about the point where humans have no role&#8212;humans will still have some role in starting up these companies and supervising the AI models. So let&#8217;s make sure some of those humans are in the developing world so that fast growth can happen there as well.</p><p><strong>Dwarkesh Patel</strong></p><p>You guys recently announced that <a href="https://www.anthropic.com/news/claude-new-constitution">Claude is going to have a constitution that&#8217;s aligned to a set of values</a>, and not necessarily just to the end user. There&#8217;s a world I can imagine where if it is aligned to the end user, it preserves the balance of power we have in the world today because everybody gets to have their own AI that&#8217;s advocating for them. The ratio of bad actors to good actors stays constant. It seems to work out for our world today. Why is it better not to do that, but to have a specific set of values that the AI should carry forward?</p><p><strong>Dario Amodei</strong></p><p>I&#8217;m not sure I&#8217;d quite draw the distinction in that way. There may be two relevant distinctions here. I think you&#8217;re talking about a mix of the two. 
One is, should we give the model a set of instructions about &#8220;do this&#8221; versus &#8220;don&#8217;t do this&#8221;? The other is, should we give the model a set of principles for how to act?</p><p>It&#8217;s kind of purely a practical and empirical thing that we&#8217;ve observed. By teaching the model principles, getting it to learn from principles, its behavior is more consistent, it&#8217;s easier to cover edge cases, and the model is more likely to do what people want it to do. In other words, if you give it a list of rules&#8212;&#8220;don&#8217;t tell people how to hot-wire a car, don&#8217;t speak in Korean&#8221;&#8212;it doesn&#8217;t really understand the rules, and it&#8217;s hard to generalize from them. It&#8217;s just a list of dos and don&#8217;ts.</p><p>Whereas if you give it principles&#8212;it has some hard guardrails like &#8220;Don&#8217;t make biological weapons&#8221; but&#8212;overall you&#8217;re trying to understand what it should be aiming to do, how it should be aiming to operate. So just from a practical perspective, that turns out to be a more effective way to train the model. That&#8217;s the rules versus principles trade-off.</p><p>Then there&#8217;s another thing you&#8217;re talking about, which is the corrigibility versus intrinsic motivation trade-off. How much should the model be a kind of &#8220;skin suit&#8221; where it just directly follows the instructions given to it by whoever is giving those instructions, versus how much should the model have an inherent set of values and go off and do things on its own?</p><p>There I would actually say everything about the model is closer to the direction that it should mostly do what people want. It should mostly follow instructions. We&#8217;re not trying to build something that goes off and runs the world on its own. We&#8217;re actually pretty far on the corrigible side.</p><p>Now, what we do say is there are certain things that the model won&#8217;t do. 
I think we say it in various ways in the constitution, that under normal circumstances, if someone asks the model to do a task, it should do that task. That should be the default. But if you&#8217;ve asked it to do something dangerous, or to harm someone else, then the model is unwilling to do that. So I actually think of it as a mostly corrigible model that has some limits, but those limits are based on principles.</p><p><strong>Dwarkesh Patel</strong></p><p>Then the fundamental question is, how are those principles determined? This is not a special question for Anthropic. This would be a question for any AI company. But because you have been the ones to actually write down the principles, I get to ask you this question. Normally, a constitution is written down, set in stone, and there&#8217;s a process of updating it and changing it and so forth. In this case, it seems like a document that people at Anthropic write, that can be changed at any time, that guides the behavior of systems that are going to be the basis of a lot of economic activity. How do you think about how those principles should be set?</p><p><strong>Dario Amodei</strong></p><p>I think there are maybe three sizes of loop here, three ways to iterate. One is we iterate within Anthropic. We train the model, we&#8217;re not happy with it, and we change the constitution. I think that&#8217;s good to do. Putting out public updates to the constitution every once in a while is good because people can comment on it.</p><p>The second level of loop is different companies having different constitutions. I think it&#8217;s useful. Anthropic puts out a constitution, Gemini puts out a constitution, and other companies put out a constitution. People can look at them and compare. 
Outside observers can critique and say, &#8220;I like this thing from this constitution and this thing from that constitution.&#8221; That creates a soft incentive and feedback for all the companies to take the best of each element and improve.</p><p>Then I think there&#8217;s a third loop, which is society beyond the AI companies and beyond just those who comment without hard power. There we&#8217;ve done some experiments. A couple years ago, we did an experiment with the <a href="https://www.cip.org/">Collective Intelligence Project</a> to basically poll people and ask them what should be in our AI constitution. At the time, we incorporated some of those changes.</p><p>So you could imagine doing something like that with the new approach we&#8217;ve taken to the constitution. It&#8217;s a little harder because it was an easier approach to take when the constitution was a list of dos and don&#8217;ts. At the level of principles, it has to have a certain amount of coherence. But you could still imagine getting views from a wide variety of people.</p><p>You could also imagine&#8212;and this is a crazy idea, but this whole interview is about crazy ideas&#8212;systems of representative government having input. I wouldn&#8217;t do this today because the legislative process is so slow. This is exactly why I think we should be careful about the legislative process and AI regulation. But there&#8217;s no reason you couldn&#8217;t, in principle, say, &#8220;All AI models have to have a constitution that starts with these things, and then you can append other things after it, but there has to be this special section that takes precedence.&#8221;</p><p>I wouldn&#8217;t do that. That&#8217;s too rigid and sounds overly prescriptive in a way that I think overly aggressive legislation is. But that is a thing you could try to do. Is there some much less heavy-handed version of that? Maybe.</p><p><strong>Dwarkesh Patel</strong></p><p>I really like control loop two. 
Obviously, this is not how constitutions of actual governments do or should work. There&#8217;s not this vague sense in which the Supreme Court will feel out how people are feeling&#8212;what are the vibes&#8212;and update the constitution accordingly. With actual governments, there&#8217;s a more formal, procedural process.</p><p>But you have a vision of competition between constitutions, which is actually very reminiscent of how some libertarian charter cities people used to talk, about what an archipelago of different kinds of governments would look like. There would be selection among them of who could operate the most effectively and where people would be the happiest. In a sense, you&#8217;re recreating that vision of a utopia of archipelagos.</p><p><strong>Dario Amodei</strong></p><p>I think that vision has things to recommend it and things that will go wrong with it. It&#8217;s an interesting, in some ways compelling, vision, but things will go wrong that you hadn&#8217;t imagined.</p><p>So I like loop two as well, but I feel like the whole thing has got to be some mix of loops one, two, and three, and it&#8217;s a matter of the proportions. I think that&#8217;s gotta be the answer.</p><p><strong>Dwarkesh Patel</strong></p><p>When somebody eventually writes the equivalent of <em><a href="https://amzn.to/4rsefxW">The Making of the Atomic Bomb</a></em> for this era, what is the thing that will be hardest to glean from the historical record that they&#8217;re most likely to miss?</p><p><strong>Dario Amodei</strong></p><p>I think a few things. One is, at every moment of this exponential, the extent to which the world outside it didn&#8217;t understand it. This is a bias that&#8217;s often present in history. Anything that actually happened looks inevitable in retrospect. 
When people look back, it will be hard for them to put themselves in the place of people who were actually making a bet on this thing to happen that wasn&#8217;t inevitable, that we had these arguments like the arguments I make for scaling or that continual learning will be solved. Some of us internally put a high probability on this happening, but there&#8217;s a world outside us that&#8217;s not acting on that at all.</p><p>I think the weirdness of it, unfortunately the insularity of it... If we&#8217;re one year or two years away from it happening, the average person on the street has no idea. That&#8217;s one of the things I&#8217;m trying to change with the memos, with talking to policymakers. I don&#8217;t know but I think that&#8217;s just a crazy thing.</p><p>Finally, I would say&#8212;and this probably applies to almost all historical moments of crisis&#8212;how absolutely fast it was happening, how everything was happening all at once. Decisions that you might think were carefully calculated, well actually you have to make that decision, and then you have to make 30 other decisions on the same day because it&#8217;s all happening so fast. You don&#8217;t even know which decisions are going to turn out to be consequential.</p><p>One of my worries&#8212;although it&#8217;s also an insight into what&#8217;s happening&#8212;is that some very critical decision will be some decision where someone just comes into my office and is like, &#8220;Dario, you have two minutes. Should we do thing A or thing B on this?&#8221; Someone gives me this random half-page memo and asks, &#8220;Should we do A or B?&#8221; I&#8217;m like, &#8220;I don&#8217;t know. I have to eat lunch. Let&#8217;s do B.&#8221; That ends up being the most consequential thing ever.</p><p><strong>Dwarkesh Patel</strong></p><p>So final question. There aren&#8217;t tech CEOs who are usually writing 50-page memos every few months. 
It seems like you have managed to build a role for yourself and a company around you which is compatible with this more intellectual-type role of CEO.</p><p>I want to understand how you construct that. How does that work? Do you just go away for a couple of weeks and then you tell your company, &#8220;This is the memo. Here&#8217;s what we&#8217;re doing&#8221;? It&#8217;s also reported that you write a bunch of these internally.</p><p><strong>Dario Amodei</strong></p><p>For this particular one, I wrote it over winter break. I was having a hard time finding the time to actually write it. But I think about this in a broader way. I think it relates to the culture of the company. I probably spend a third, maybe 40%, of my time making sure the culture of Anthropic is good.</p><p>As Anthropic has gotten larger, it&#8217;s gotten harder to get directly involved in the training of the models, the launch of the models, the building of the products. It&#8217;s 2,500 people. I have certain instincts, but it&#8217;s very difficult to get involved in every single detail. I try as much as possible, but one thing that&#8217;s very leveraged is making sure Anthropic is a good place to work, people like working there, everyone thinks of themselves as team members, and everyone works together instead of against each other.</p><p>We&#8217;ve seen as some of the other AI companies have grown&#8212;without naming any names&#8212;we&#8217;re starting to see decoherence and people fighting each other. I would argue there was even a lot of that from the beginning, but it&#8217;s gotten worse. I think we&#8217;ve done an extraordinarily good job, even if not perfect, of holding the company together, making everyone feel the mission, that we&#8217;re sincere about the mission, and that everyone has faith that everyone else there is working for the right reason. 
That we&#8217;re a team, that people aren&#8217;t trying to get ahead at each other&#8217;s expense or backstab each other, which again, I think happens a lot at some of the other places.</p><p>How do you make that the case? It&#8217;s a lot of things. It&#8217;s me, it&#8217;s <a href="https://en.wikipedia.org/wiki/Daniela_Amodei">Daniela</a>, who runs the company day to day, it&#8217;s the co-founders, it&#8217;s the other people we hire, it&#8217;s the environment we try to create. But I think an important thing in the culture is that the other leaders as well, but especially me, have to articulate what the company is about, why it&#8217;s doing what it&#8217;s doing, what its strategy is, what its values are, what its mission is, and what it stands for.</p><p>When you get to 2,500 people, you can&#8217;t do that person by person. You have to write, or you have to speak to the whole company. This is why I get up in front of the whole company every two weeks and speak for an hour.</p><p>I wouldn&#8217;t say I write essays internally. I do two things. One, I write this thing called a DVQ, <a href="https://www.wired.com/story/anthropic-benevolent-artificial-intelligence/">Dario Vision Quest</a>. I wasn&#8217;t the one who named it that. That&#8217;s the name it received, and it&#8217;s one of these names that I tried to fight because it made it sound like I was going off and smoking peyote or something. But the name just stuck.</p><p>So I get up in front of the company every two weeks. I have a three or four-page document, and I just talk through three or four different topics about what&#8217;s going on internally, the models we&#8217;re producing, the products, the outside industry, the world as a whole as it relates to AI and geopolitically in general. Just some mix of that. I go through very honestly and I say, &#8220;This is what I&#8217;m thinking, and this is what Anthropic leadership is thinking,&#8221; and then I answer questions. 
That direct connection has a lot of value that is hard to achieve when you&#8217;re passing things down the chain six levels deep. A large fraction of the company comes to attend, either in person or virtually. It really means that you can communicate a lot.</p><p>The other thing I do is I have a channel in Slack where I just write a bunch of things and comment a lot. Often that&#8217;s in response to things I&#8217;m seeing at the company or questions people ask. We do internal surveys and there are things people are concerned about, and so I&#8217;ll write them up. I&#8217;m just very honest about these things. I just say them very directly.</p><p>The point is to get a reputation of telling the company the truth about what&#8217;s happening, to call things what they are, to acknowledge problems, to avoid the sort of corpo speak, the kind of defensive communication that often is necessary in public because the world is very large and full of people who are interpreting things in bad faith. But if you have a company of people who you trust, and we try to hire people that we trust, then you can really just be entirely unfiltered.</p><p>I think that&#8217;s an enormous strength of the company. It makes it a better place to work, it makes people more than the sum of their parts, and increases the likelihood that we accomplish the mission because everyone is on the same page about the mission, and everyone is debating and discussing how best to accomplish the mission.</p><p><strong>Dwarkesh Patel</strong></p><p>Well, in lieu of an external Dario Vision Quest, we have this interview.</p><p><strong>Dario Amodei</strong></p><p>This interview is a little like that.</p><p><strong>Dwarkesh Patel</strong></p><p>This has been fun, Dario. 
Thanks for doing it.</p><p><strong>Dario Amodei</strong></p><p>Thank you, Dwarkesh.</p>]]></content:encoded></item><item><title><![CDATA[Notes on Space GPUs]]></title><description><![CDATA[Turning my Elon prep into a blog post]]></description><link>https://www.dwarkesh.com/p/notes-on-space-gpus</link><guid isPermaLink="false">https://www.dwarkesh.com/p/notes-on-space-gpus</guid><dc:creator><![CDATA[Dwarkesh Patel]]></dc:creator><pubDate>Thu, 05 Feb 2026 18:26:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2dba4f31-2d4f-485a-835f-5c9bc75f9ce4_300x168.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p></p><p>John Collison and I <a href="https://www.dwarkesh.com/p/elon-musk">just interviewed Elon</a>. The interview was recorded before we knew that SpaceX was acquiring xAI, so the fact that our first topic was space GPUs now feels all the more relevant.</p><p>As I was preparing to interview Elon, I put together some notes and a <a href="https://docs.google.com/spreadsheets/d/1fa48HAwXaboEXNOrAj-xJF2Vv_xxQZjAtgoTu0FnnlY/edit?usp=sharing">spreadsheet</a> to help me think through orbital datacenters. I turned those notes into this blog post.</p><p>Even if orbital data centers don&#8217;t make sense yet, in the long run the singularity is clearly moving into space. Earth intercepts about one two-billionth of the sun&#8217;s total output. If AI scaling continues, compute will eventually move to where the energy is. So space GPUs are fun to think about, because they give you a sneak peek at the future. Whether that future arrives in 2030, 2040, or 2050 is another question.</p><p><strong>Please take everything below with grains of salt&#8212;grains so big that you might confuse them for rocks. Assume all the numbers are wrong.</strong> Every paragraph below covers a topic that would take an actual expert a week to properly evaluate. 
What you&#8217;ll find here is what a professional podcaster has pieced together from conversations with LLMs and some very generous people who talked to me before the interview. Thanks to <a href="https://x.com/CJHandmer">Casey Handmer</a>, <a href="https://x.com/PhilipJohnston">Philip Johnston</a>, <a href="https://x.com/ezrafeilden">Ezra Feilden</a>, <a href="https://x.com/andrewmccalip">Andrew McCalip</a>, <a href="https://x.com/vinayramasesh">Vinay Ramasesh</a> and the team at <a href="https://www.kineticpartners.com/">Kinetic Partnership</a> for all their help.</p><h2><strong>Why orbital data centers?</strong></h2><p>The whole reason to go to space is energy. Yes, panels in space get about 40% more irradiance&#8212;but the real advantage is that you can put your satellites in <a href="https://en.wikipedia.org/wiki/Sun-synchronous_orbit">sun-synchronous orbit</a>, where they face the sun continuously. No nights, no clouds, no need for batteries (which are the majority of the cost in a solar-storage system). Solar on Earth has a roughly 25% capacity factor, meaning panels only generate a quarter of their peak output on average. In space, you get close to 100%.</p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;784c41e4-422c-4b40-848b-efaea0c75392&quot;,&quot;duration&quot;:null}"></div><p>The logic is that if launch costs continue to drop, it will become cheaper to put GPUs in orbit than to build power plants and batteries on Earth. And there&#8217;s a lot of room for launch costs to fall&#8212;propellant is cheap, and the main expense is the rocket, which you can now reuse. Falcon 9 is around $2,500/kg with a disposable upper stage. Starship with full reusability could get below $100/kg.</p><p>But here&#8217;s the problem with this argument. Energy is only about 15% of a datacenter&#8217;s total cost of ownership. The chips themselves are around 70%. 
And you still have to launch those to space!</p><p>It gets worse. On Earth, GPUs fail constantly. In the <a href="https://arxiv.org/abs/2407.21783">Llama 3 paper</a>, Meta reported a failure roughly once every three hours across a 16,000 H100 cluster. When a chip dies, a technician walks over, swaps it out, and the cluster keeps running. In space, you can&#8217;t do that&#8212;at least not until we have Optimus robots stationed on every satellite.</p><p>What about radiation? It&#8217;s actually less catastrophic than you might expect. Google&#8217;s Suncatcher paper <a href="https://arxiv.org/abs/2511.19468">found</a> that their TPUs survived nearly 3x the total ionizing dose needed for a 5-year mission before showing permanent degradation.</p><p>I asked Elon about this. He responded:</p><blockquote><p>&#8220;Actually, it depends on how recent the GPUs are that have arrived. At this point, we find our GPUs to be quite reliable. There&#8217;s infant mortality, which you can obviously iron out on the ground. So you can just run them on the ground and confirm that you don&#8217;t have infant mortality with the GPUs.&#8221;</p><p>&#8220;But once they start working, their actual reliability&#8212;and you&#8217;re past the initial debug cycle of Nvidia or whatever, or whoever&#8217;s making the chips, could be Tesla AI6 chips or something like that, or it could be TPUs or Trainiums or whatever&#8212;is actually quite reliable past a certain point. So I don&#8217;t think the servicing thing is an issue.&#8221;</p></blockquote><p>Consider what&#8217;s actually being proposed here. You assemble your GPUs into racks on Earth, run them for a few hundred hours to catch the duds, disassemble everything, pack it into a satellite, launch it, and get it operational in orbit. 
Throughout this entire process, the most expensive part of your system&#8212;the chips&#8212;is just sitting there not doing useful work.</p><h2><strong>Is this just not possible on Earth?</strong></h2><p>Throughout the interview, Elon kept returning to one point over and over again: <em>Look, forget the economics! It will simply not be physically possible to scale power production to the scale needed for AI on Earth.</em> He went on:</p><blockquote><p>&#8220;The only place you can really scale is space.&#8221;</p><p>&#8220;All of the United States currently uses only half a terawatt on average. So if you say a terawatt, that would be twice as much electricity as the United States currently consumes. So that&#8217;s quite a lot. Can you imagine building that many data centers? That many power plants? It&#8217;s like those who have lived in software land don&#8217;t realize they&#8217;re about to have a hard lesson in hardware.&#8221;</p></blockquote><p>Elon kept pointing out the bottlenecks we&#8217;ve already run into on Earth. You can&#8217;t plug into the utilities&#8212;the interconnect queues are too long. You can&#8217;t do behind the meter and generate power yourself&#8212;lead times for turbines stretch past 2030. You can&#8217;t do solar on Earth, because of permits, and because of the tariffs. And Earth has clouds and nights, requiring overbuilt solar and batteries. In space, you can just put the satellites in sun-synchronous orbit!</p><p>Look, at some level, it <em>is</em> true that we can&#8217;t keep scaling on Earth. But keep in mind that the Earth is really fucking big. 1 TW of solar (with 25% capacity factor, so really 4 TW of panels) is around 30,000 square miles. That&#8217;s like 1% of the US&#8212;about the size of South Carolina. For context, AI datacenters currently consume only ~20 GW globally.</p><p>By the time we&#8217;re talking about multiple terawatts, we&#8217;ll have had to massively scale leading-edge wafer production. 
And that&#8217;s the really hard part. Fabs are the most complicated manufacturing facilities humans have ever built. In order to believe that we need to go to space in order to find the power to turn on all these chips, we&#8217;ll need to assume a few things:</p><ul><li><p>We&#8217;ll manage to produce <em>a lot</em> more chips.</p></li><li><p>Every single relief valve for power generation on Earth will fail to scale.</p></li></ul><p>But semiconductors are so much more complicated than solar panels! They&#8217;re even more complicated than the blades on a turbine. It feels quite unlikely to me that the thing we manage to solve is building terawatts worth of leading-edge wafers, but in that world we can&#8217;t figure out how to pave Nevada (or if regulation proves to be a problem, then the UAE) with solar panels.</p><h2><strong>100 GW into space</strong></h2><p>How many Starship launches will it take to launch 100 GW into space?</p><p>An orbital datacenter satellite has three big components: solar arrays, computers, and radiators. And the key constraint is that for every watt of compute, we need roughly one watt of solar and one watt of thermal rejection capacity.</p><p>The W/kg of each component determines how the mass budget gets split&#8212;and how much compute you can bring along. The figure that matters most here is the specific power of the whole satellite: after you account for solar panels, radiators, and chassis, how many watts of compute do you actually get per kilogram launched?</p><p>For Starlink satellites, this works out to roughly 50 W/kg. The people trying to build orbital datacenters are currently targeting 100 W/kg. There are only two ways to get there: lighter solar panels (more watts generated per kg) or lighter radiators (more watts rejected per kg).</p><p>The numbers below are super rough. Reliable figures for space-grade components are hard to come by. 
But even rough math reveals which variables must improve&#8212;and by how much&#8212;in order to hit 100 W/kg.</p><ul><li><p>Solar: There are apparently companies that are targeting next-gen thin film that reaches <a href="https://news.satnews.com/2026/02/01/orbital-vs-terrestrial-solar-the-math-of-energy-density-and-capacity-factors/">upwards of 500 W/kg</a>, but the state of the art is <a href="https://en.wikipedia.org/wiki/Space-based_solar_power">150 W/kg</a>, and most missions right now fly <a href="https://www.nasa.gov/smallsat-institute/sst-soa/power-subsystems/">30 W/kg</a>. Let&#8217;s be generous and assume 200 W/kg.</p><ul><li><p>The trouble here is that there&#8217;s obviously a tradeoff&#8212;denser panels cost more money but reduce launch cost. And it&#8217;s difficult to calculate what that implies for these next-gen panels, because their prices are not listed anywhere.</p></li></ul></li><li><p>Compute: I&#8217;ve heard that a stripped-down GB200 NVL72 with no cooling equipment is around 100 kg. It draws 132 kW of power, but let&#8217;s add 10% overhead for the intersatellite lasers and so on. That gets us to 1,452 W/kg.</p></li><li><p>Radiators: In space, you can&#8217;t convect heat away, because there&#8217;s no air. You can only radiate it, which means your panels glow infrared until the heat leaves. The Stefan-Boltzmann law governs how much power a surface can radiate.</p><p>GPUs typically run up to 90&#176; Celsius. There&#8217;s some temperature drop through the heat pipes and fluid loops that carry heat to the radiator surface. Call it 30&#176;C. So your radiators end up operating around 60&#176;C. Plug that into Stefan-Boltzmann (assuming you&#8217;re using aluminum panels that weigh around 2 kg per square meter of surface area), and that works out to roughly 320 W/kg.</p><p>Since radiated power scales with T&#8308;, running your chips hotter can help you save a lot of radiator mass. 
For space, people will have to figure out how to build GPUs that tolerate higher temperatures.</p></li></ul><p>Assuming the numbers above&#8212;and also assuming that a fourth of the mass of the satellite has to be the chassis&#8212;I get 85 W/kg for the whole system. Again, I want to emphasize these are <em>rough</em> calculations; feel free to plug in your own numbers in the spreadsheet <a href="https://docs.google.com/spreadsheets/d/1fa48HAwXaboEXNOrAj-xJF2Vv_xxQZjAtgoTu0FnnlY/edit?usp=sharing">here</a>.</p><p>At 150 metric tons to low Earth orbit per Starship (Elon&#8217;s target), you&#8217;re looking at around 10 MW per launch. That means roughly 100 Starship launches in order to put 1 GW of compute in orbit. To hit 100 GW in a year, you&#8217;d need roughly 10,000 launches, or about one launch every hour.</p><p>This is insane! A single Starship produces around 100 GW of thrust power at liftoff. That&#8217;s about a fifth of total US electricity consumption, concentrated in one rocket for a few minutes. And the plan would be to do that once an hour, every hour, every day, for a year.</p><p>I asked Elon what that world looks like:</p><blockquote><p>I don&#8217;t think we&#8217;ll need more than... I mean, you could probably do it with as few as 20 or 30 [Starship vehicles]. It really depends on how quickly the ship has to go around the Earth and the ground track before the ship has to come back over the launch pad. So if you can use a ship every, say, 30 hours, you could do it with 30 ships. But we&#8217;ll make more ships than that. 
SpaceX is gearing up to do 10,000 launches a year, and maybe even 20 or 30,000 launches a year.</p></blockquote><h2><strong>Workloads and comms</strong></h2><p>Starlink satellites already communicate via inter-satellite laser links <a href="https://en.wikipedia.org/wiki/Laser_communication_in_space#cite_note-49">at 100 Gbps</a>&#8212;and Google&#8217;s Suncatcher paper suggests off-the-shelf transceivers could potentially hit 10 Tbps. For context, InfiniBand links between nodes in a terrestrial datacenter run <a href="https://marketplace.nvidia.com/en-us/enterprise/networking/400gbeosfpcables/">at 400 Gbps</a>. The gap isn&#8217;t as large as you might expect. So, could you do synchronous training in space?</p><p>Even the most bullish analysts don&#8217;t claim that orbital data centers will be used for training. I don&#8217;t know any of the relevant orbital mechanics, but obviously satellites at different altitudes move at different orbital velocities, which means the satellites are desyncing relative to one another. Google came up with a clever solution for this in their Suncatcher paper&#8212;keep lots of satellites in a single tight cluster at the same altitude. Google&#8217;s researchers proposed eighty-one satellites in such a synchronized constellation. If each satellite carried a GB200 NVL72, then that&#8217;s only 15 MW parcels of coherent compute.</p><p>Defenders of orbital datacenters say that most compute is going to shift to inference (and with RL, most training is also inference). Maybe the legacy terrestrial datacenters do end up doing the pretraining runs, and then whatever mixture of RL environment training and continual learning happens in the future does happen in space. So, the argument goes, it&#8217;s not a big deal that the scale-ups in space are isolated. 
But there&#8217;s still the question of how hundreds of gigawatts of inference are beamed back to Earth.</p><p>For a moment, let&#8217;s imagine a world where as we see the sunrise and sunset we also see a Saturn-like belt of GPU satellites passing over us. That&#8217;s already really cool. But then there&#8217;s another sci-fi premise, which I really wanted to be plausible, and which turns out not to make any sense: Imagine that every 12 hours, as this country of geniuses in space passes over us and shoots down half a day&#8217;s worth of new ideas, our code finally starts working and our factories buzz alight and become more productive. Unfortunately, it&#8217;s just science fiction. Inference doesn&#8217;t take that much bandwidth. One hundred gigawatts of a 5T model is roughly 58 billion tokens per second, resulting in ~<a href="https://claude.ai/share/9b4bff2b-d114-4421-9cbf-0eff30112a3a">230 GB/s</a>.</p><p>That&#8217;s nothing. That many tokens can easily be beamed using lasers from GPUs in the orbital plane through the Starlink satellite network and then down to Earth.</p><p>Latency might be an issue, up to fifty milliseconds from any given spot on Earth through the Starlink network to the sun-synchronous orbit and then back again. But as we move towards a world of true remote coworker AIs, where the agent works for tens of minutes before coming back to us, the marginal milliseconds of latency matter less and less.</p><h2><strong>So why is Elon doing this?</strong></h2><p>I&#8217;m willing to accept Elon&#8217;s argument that if launch costs become sufficiently cheap <em>and</em> we can repair GPUs in space, then there&#8217;s a viable path toward orbital data centers. 
But it seems especially difficult to imagine a situation in which orbital data centers end up <em>significantly</em> cheaper, because, again, most of the cost of a data center is the GPUs.</p><p>For most compute to shift to space, all of the following things would need to be true:</p><ul><li><p>Power generation on Earth hits a ceiling, or AI demand outstrips every terrestrial option.</p></li><li><p>Chip production scales faster than anyone expects, so we have the silicon but not the electricity.</p></li><li><p>Starship reaches thousands of launches per year.</p></li></ul><p>If Elon&#8217;s right, he wins the AI race outright. SpaceX is the only entity that can launch at that scale. xAI would have unlimited power. Everyone else will be stuck fighting over grid interconnects and turbine orders.</p><p>And if Elon&#8217;s future doesn&#8217;t materialize? xAI is just another lab in the pack. Which means xAI loses. The AI race is a winner-take-all competition, and xAI isn&#8217;t in first place. Elon&#8217;s comparative advantage was never going to be navigating utility interconnect queues or filing permits faster than Google. His advantage is SpaceX. So why not bet on the world where SpaceX becomes the kingmaker?</p><p>This might sound reckless. But that&#8217;s how SpaceX got here. Their whole business plan seems to be one in which they conjure new wells of demand for each generation of rocket on the path to the Dyson swarm. Falcon 9 first flew in 2010. Starlink didn&#8217;t launch until 2019. Maybe orbital datacenters end up being for Starship what Starlink was for Falcon 9.</p><p>Sometimes, during the interview, I found my thoughts drifting toward Elon&#8217;s vision for this big, interconnected future. 
So I paused a moment and said:</p><blockquote><p>What I find remarkable about the SpaceX business is the end goal is to get to Mars, but you keep finding ways on the way there to keep generating incremental revenue to get to the next stage and the next stage.</p></blockquote><p>Elon nodded his head slowly. And then he said:</p><blockquote><p>You can see how this might seem like a simulation to me.</p></blockquote>]]></content:encoded></item></channel></rss>