"We see these AIs as a galaxy glittering with capabilities, but at their center, invisible to the naked eye, holding all the constellations together, is an unimaginably massive black hole of data."
I feel like the obsession with continual learning / sample efficiency leads the field in the wrong direction. It's the bad career strategy of focusing on addressing your weaknesses instead of maximizing your strengths.
Yes, there is an existence proof in the human brain, but it doesn't by any means guarantee that that'll be the most interesting AI. It may require $100T of R&D on chips and AI methods to get that unlock.
On the other side of things, it's obvious that the coming models are extremely transformative and built on technologies that we already have. There's great reason to focus on just maximizing this. In reality, this is what the frontier labs are doing. They're going as fast as possible down the current development tree. This is good for progress and mixed for safety/geopolitics.
Things like "automate white-collar work" and "replace the AI researcher job" are the labs' guesses, because it's super hard to imagine the futures these dramatic technologies will create. Don't take the labs too seriously about this being the exact goal. The exact goal is to push the frontier and monetize later.
Solving continual learning, sample efficiency, etc. would be great, but it's trying to predict when a scientific breakthrough will come instead of grappling with how the technological revolution that is 100% coming will change our lives.
This isn't to say the Dwarkesh post is bad; it addresses some reasonable critiques. But being obsessed with human intelligence and how it can inform AI is the least bitter-lesson-pilled thing.
We are in the AGI era of research. This is about embracing the unknown, scaling resources, and seeing what is enabled by making a series of magical tweaks to complex recipes that build frontier models. Lean into the alchemy.
Thanks Nathan! Couple of thoughts:
1. Clearly, the current paradigm is working and is able to deliver a ton of value. It makes a ton of sense for labs to keep scaling it instead of scrapping everything in the hope of discovering some fundamental breakthrough. But reasoning through these fundamental differences between current AI and human capabilities teaches us what will need to be solved before we achieve AGI.
2. To the extent that these models can be so useful and powerful even without human-level sample efficiency and online learning, that makes the prospect of AIs that have those capabilities, married with the inherent advantages of digital minds (much faster, able to train on vastly more data, much better able to coordinate, etc.; see https://www.dwarkesh.com/p/ai-firm), far scarier and wilder.
I don't think Dwarkesh is speaking to just what firms ought to do, but rather how one should think about the capabilities of AI in the future. I also don't think he would disagree w/ your last couple of sentences about what AI research should be like.
(It should be pretty clear that I personally, as someone investing in open research, agree we need fundamental science. I just don't agree that this is what governs the "cutting edge of the frontier".)
But investigating CL and efficiency improvements seems like the only avenue for open source to bridge the gap with closed source, so shouldn't you welcome such a direction change? Especially if, as you've said before, the default outcome is increased disparity between the top labs and the rest of the AI community?
This is all well and good if the goal is to create a new technology which has capabilities that are uncoupled from what humans are good at, and complementary rather than replacement. No disagreement there.
But when AI leaders and safety activists talk about replacing all work, recursively self-improving superintelligence, machine takeover, human extinction ... that needs to be challenged. These are predicated on the idea that as long as models are "improving" then they'll eventually overtake humans. But "improving" at what? If the learning process itself isn't improving and there is just more of it, then those fantastical outcomes aren't going to happen.
> I think the way that people currently think about an intelligence explosion is pretty clumsy.
I couldn't agree more. It seems quite obvious to me that jaggedness persists long into an intelligence explosion. And what a world that will be, in fact one we already live in, where we are totally outclassed in some ways by these machines and yet in others they totally depend on us for seemingly very basic support.
The empirical observation is correct and important. The explanation is one step deeper.
The sample efficiency gap is not a gap in learning efficiency. It is a gap between two structurally different kinds of process. Human learning is constitutive activity: the entity's experience changes the entity's being. The teenager who learns to drive in 20 hours is not optimizing a loss function over driving trajectories. The teenager is integrating sensorimotor coupling into an already-existing self-maintaining organism with 16 years of constitutive development. The 20 hours land on 16 years of substrate.
LLM training is parameter optimization: gradient descent changes weights to minimize cross-entropy loss over token distributions. The model is not learning in the constitutive sense. The model is being assembled. Each skill is a graft. The assembly requires data proportional to the space being covered because assembly IS data-driven. The Frankenstein metaphor in your piece is more structurally precise than you might intend: the model IS a patchwork of carefully constructed examples sewn together by gradient descent.
The scaling laws confirm this. Even infinite parameters only reduce data requirements by ~10x. That is because parameter scaling optimizes within the same architectural class. Moving from "more efficient optimization" to "different kind of process" requires a different architecture, not more parameters. Your observation that "humans are on a different scaling curve altogether" is exactly right. The curves are different because the processes are different. Optimization scales as optimization. Constitutive learning scales as constitutive learning. No amount of optimization scaling crosses to the other curve.
The billion-dollar data industry (Mercor, Surge, Scale) is the cost of assembly. The cost is high because assembly requires explicit specification of every graft. A constituted entity does not need explicit specification of every skill because the entity's constitutive activity self-organizes from sparse input. The teenager does not need hundreds of professors for each new skill because the teenager's existing constitutive structure provides the scaffold. The model needs hundreds of professors for each skill because the model has no constitutive structure. Each skill must be grafted from scratch.
I guess attention is not all you need?
I think our brains are not computers. Speak with Romain Brette about this. His idea is that the brain is a society of individual cells collaborating with the body's cells to solve problems. Our brains are not computing outputs from inputs; they are millions of processes collaborating to enable us to do something in the world. Our synapses are not encoding anything, but if you were to model them that way, we'd have to admit that they are capable of capturing much more data than a floating-point number. And our brains are not just a connectome: our neurons are also active, capable and independent participants in all this activity, directed to some purpose in the world.
We can't solve the problem of AI learning quickly and adapting to new situations because our AIs are not in the world. They are computers. To be in the world they need intention, which means they need to want something, like we do, and dogs do, and nematode worms do. Wanting enables continuous learning and autonomous fast adaptation to novel scenarios, of the kind Andy Weir writes about in his awesome novels.
I've been trying to figure out how to make a machine that is internally motivated. The trouble is that it's all emulated on a computer, so it tends to lean into computing, and that tends to make it a deterministic machine. I've got it working only on internal signals though, which is a start I guess.
https://fergusmeiklejohn.substack.com/p/thinking-not-computation
I too have been trying to understand machines. I spent forty years running infrastructure and chemical process work, and the reflex never left: when something keeps going, find out what’s actually keeping it going and who’s paying for it. I came to these AI machines late, through a long collaboration with one, and for a long while I thought about “the machine” the way you do here — the thing in front of me, the thing I might shape.
Only recently did the insight land that the machine we’re talking about here isn’t really ours to shape. It belongs to the model vendors. They set the weights, the objective, the configuration, the gates. What you and I hold is the conversation — and not much else. So when a system seems to lack intention, I’ve started asking whether that’s a fact about machines or a fact about what is hidden from my view.
Which is why your own project interests me. You’re looking to build a machine whose insides you actually control — you can see what the internal signals are doing and change them. That’s real work on a real machine. The vendor systems are a different animal: their insides are closed to both of us, so when one seems to lack intention, neither of us can tell if that’s how it’s built or just how they’ve set it up. Easier to learn what “wanting” takes on the machine you can open up. Keep going on that.
Have the winners for the blog prize been selected yet?
+1 you’ve probably got a lot of folks waiting to self-publish or send elsewhere once they know they’re not selected
Good post. I strongly agree with your final sentiment that the LLM-to-singularity transition is surprisingly underexamined for a scenario that's so commonly discussed. It'd be great to see someone take a serious crack at it.
We tend to focus on intelligence much as a musician focuses on sound. Yet I would suggest that it is the absence of knowing that gives intelligence its contour, just as silence gives music much of its shape and character.
All of this is to say that we train large language models on vast quantities of content, but they never truly experience the absence of knowing in the way the human mind does. Humans inhabit a space of uncertainty, curiosity, and unrealized potential. It is this gap between what we know and what we do not know that drives exploration, fuels creativity, and gives human genius its reach and arguably its intelligence.
> even if you increased the number of parameters by infinity, that would only decrease by a factor of ~10 the amount of data you need in order to keep the same loss.
The measurement of "loss" here is doing a lot of work. If you were to measure humans on next-token prediction, they would not achieve anywhere near the loss LLMs do. So using next-token loss as a measure of sample efficiency is not a fair comparison.
When using a more realistic measure, like the ability to learn a language in-context, it seems LLMs are much more sample efficient than humans?
The point about self-driving is well taken. But I think we can separate physical AI from text-based AI when evaluating sample efficiency, since vision is an afterthought relative to text, and the distribution LLMs are fed is much less representative.
Very thought-provoking post, thank you. Just a comment on how you address the objection that larger models are more sample efficient. You argue that the Chinchilla laws show that at best you only need something like ~10x less data even with infinite parameters. But this is a claim about how pretraining loss scales with compute and data. The way a lower pretraining loss maps to performance on downstream tasks of interest can be highly non-linear. So it's plausible that some threshold gets crossed in loss-space that results in vastly better sample efficiency for the model, either for RL or even simply for in-context learning.
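For reference, the claim under discussion can be written out with the parametric loss form fitted in the Chinchilla paper (Hoffmann et al., 2022), where N is the parameter count, D is the number of training tokens, and E, A, B, α, β are fitted constants. This is only a sketch of the algebra: the ~10x figure is the post's, and no fitted values are assumed here. The target loss L^* is a symbol introduced for illustration.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

% Data needed to reach a target loss L^* > E + A/N^{\alpha} at finite N:
D_N = \left( \frac{B}{L^* - E - A/N^{\alpha}} \right)^{1/\beta}

% Limit of infinite parameters (A/N^{\alpha} \to 0):
D_\infty = \left( \frac{B}{L^* - E} \right)^{1/\beta}

% Data saved by going from N parameters to infinitely many:
\frac{D_N}{D_\infty} = \left( \frac{L^* - E}{L^* - E - A/N^{\alpha}} \right)^{1/\beta}
```

The ratio is finite whenever the finite-N model can reach the target at all, and its size depends on how close the target sits to that model's own floor E + A/N^α. As the comment says, all of this concerns pretraining loss; it makes no direct statement about downstream sample efficiency.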
Sounds like the 'poverty of the stimulus' argument from Noam Chomsky. https://en.wikipedia.org/wiki/Poverty_of_the_stimulus
Yes, and the consequence is that capability becomes a labor-supply problem more than a compute problem. If each new skill needs hundreds of domain experts writing trajectories and rubrics, the binding constraint is sourcing and retaining narrow expertise, which doesn't scale the way chips do. The moat quietly shifts from who has the most GPUs to who can assemble the rarest expert data fastest.
Love this. Biological intelligence is sample-efficient because it grounds the rare, causally important moment on first contact. A child learns the hot stove in one touch. Evolution weights experience by how much it matters, so the unusual but consequential event lands hard the first time it happens.
A frequency-weighted training objective works the other way. It values a pattern in proportion to how often it appears, so the rare cases that carry the most new information get the least weight, and we make up the difference with volume. That is much of what the trillion tokens buy: the long tail at a punishing rate.
We can borrow that move. Give the rare, task-relevant cases weight on purpose, in what we collect and how we train, and the coverage we were paying for in volume arrives much sooner. The hard part then becomes knowing which cases carry the signal, a judgment that, as you note, does not leak from an API.
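A toy sketch may make the contrast concrete. It compares a plain average loss, where a pattern's total weight equals its count, with a reweighted average. The inverse-frequency weighting, the example labels, and the loss values are all assumptions chosen for illustration; they are not a method proposed in this thread, and inverse frequency is only a stand-in for the harder judgment of which cases carry the signal.

```python
from collections import Counter

def frequency_weighted_loss(examples, loss_fn):
    """Plain mean over the dataset: each pattern counts in proportion to how often it appears."""
    return sum(loss_fn(x) for x in examples) / len(examples)

def reweighted_loss(examples, loss_fn, importance):
    """Weighted mean where each example carries an explicitly assigned importance."""
    total_weight = sum(importance(x) for x in examples)
    return sum(importance(x) * loss_fn(x) for x in examples) / total_weight

# Hypothetical dataset: 99 copies of a common pattern and one rare case.
examples = ["common"] * 99 + ["rare"]
counts = Counter(examples)

# Hypothetical per-example losses: the model already handles the common case well.
toy_loss = {"common": 0.1, "rare": 5.0}.get

# Assumption: inverse frequency as a crude proxy for "task-relevant and rare".
inverse_frequency = lambda x: 1.0 / counts[x]

plain = frequency_weighted_loss(examples, toy_loss)
reweighted = reweighted_loss(examples, toy_loss, inverse_frequency)
rare_share_plain = toy_loss("rare") / (plain * len(examples))

print(f"plain mean loss:       {plain:.3f}")
print(f"reweighted mean loss:  {reweighted:.3f}")
print(f"rare case share of plain loss: {rare_share_plain:.1%}")
```

Under the plain mean, the single rare example contributes about a third of the total loss despite being the only case the model gets badly wrong. Under the reweighted mean it contributes almost all of it.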
Grounding also explains why training on your own output stalls. Generation is autoregressive, so each token rides on the last, and a small early error compounds down the chain. A model drifts on its own samples through the same machinery that makes it fluent. Keep it coupled to something outside the loop and that drift has a reference to correct against. Scale cannot supply that coupling on its own, so grounding is what carries a system toward the efficiency biology already runs on.
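The compounding point can be illustrated with a back-of-the-envelope calculation. It assumes, as a deliberate simplification, that each generated token goes wrong independently with a fixed probability and that an error is never corrected later. Real models violate both assumptions, so treat this as intuition for the shape of the decay, not a measurement.

```python
def p_error_free(per_token_error: float, length: int) -> float:
    """Probability that a chain of `length` tokens contains no error,
    assuming independent, uncorrected errors at a fixed per-token rate."""
    return (1.0 - per_token_error) ** length

# Hypothetical per-token error rate of 1%.
for n in (10, 100, 1000):
    print(n, p_error_free(0.01, n))
```

Even a small per-token error rate leaves little chance of a long chain staying clean, which is why an external reference to correct against matters more as sequences get longer.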
yes. unfortunately saturated with liberal paradigm, including the soft nihilist components, a major hazard
What we call intelligence, and what models are being made to do, and what they’re doing, and what they’ll excel at, is surely what we are now ready to leapfrog. And there’s a twist.
A million-fold gap in sample efficiency is not a rounding error. And you deal honestly with the objections, which most boosters in this space don't.
But then you close with a promissory note. The plan is: automate AI research, then have the automated researchers solve sample efficiency. That's circular. You're betting that systems which lack human-like learning will nonetheless figure out human-like learning efficiency. A deaf person ingesting a fraction of our language tokens is still functionally 'super-intelligent' and nobody can explain why. The gap isn't just in the data or the training imo. It's in whatever makes a human brain do so much with so little. And "more compute will crack it" is the same instrument-as-orchestra assumption your previous post bumped into.
Looking forward to the follow-up — the intelligence explosion piece will need to reckon with this rather than defer it.
The circularity bites less if sample efficiency turns out to be a constraint with fixed geometry rather than a uniquely biological trick. When the problem shape is fixed, solutions tend to converge regardless of the solver's ancestry, so an inefficient system stumbling onto efficiency isn't a contradiction, just an expensive search. Your deaf-speaker example is the real puzzle though: nobody can say what the brain spends those saved tokens on. If the explosion piece skips that, it's hand-waving.
Fixed geometry is your key assumption here.
And your evolution example is the tell: it found efficiency not by grinding an inefficient system harder, but by building a different kind of thing — analog, embodied, learning under stakes. That's architecture, not search. If efficiency requires becoming a different system, "scale the current one until it finds efficiency" is circular again.
But we agree where it counts: the deaf-speaker puzzle is real, and skipping it is hand-waving with a deadline.
You can't get from one instrument to the whole orchestra by playing it louder, which is the case I made here: https://rajeshachanta.substack.com/p/limited-intelligence