Thoughts on AI progress (Dec 2025)
Why I'm moderately bearish in the short term, and explosively bullish in the long term
What are we scaling?
I’m confused why some people have short timelines and at the same time are bullish on RLVR (reinforcement learning with verifiable rewards). If we’re actually close to a human-like learner, this whole approach is doomed.
Currently the labs are trying to bake a bunch of skills into these models through “mid-training” - there’s an entire supply chain of companies building RL environments which teach the model how to use Excel to write financial models or navigate a web browser.
Either these models will soon learn on the job in a self-directed way - making all this pre-baking pointless - or they won’t - which means AGI is not imminent. Humans don’t have to go through a special training phase where they rehearse every single piece of software they might ever use.
Beren Millidge made interesting points about this in a recent blog post:
> When we see frontier models improving at various benchmarks we should think not just of increased scale and clever ML research ideas but billions of dollars spent paying PhDs, MDs, and other experts to write questions and provide example answers and reasoning targeting these precise capabilities ... In a way, this is like a large-scale reprise of the expert systems era, where instead of paying experts to directly program their thinking as code, they provide numerous examples of their reasoning and process formalized and tracked, and then we distill this into models through behavioural cloning. This has updated me slightly towards longer AI timelines since given we need such effort to design extremely high quality human trajectories and environments for frontier systems implies that they still lack the critical core of learning that an actual AGI must possess.
This tension seems especially vivid in robotics. In some fundamental sense, robotics is an algorithms problem, not a hardware or data problem — with very little training, a human can learn how to teleoperate current hardware to do useful work. So if we had a human-like learner, robotics would (in large part) be solved. But the fact that we don’t have such a learner makes it necessary to go out into a thousand different homes and factories to learn how to pick up dishes or fold laundry.
One counterargument I’ve heard from the takeoff-within-5-years crew is that we have to do this kludgy RL in service of building a superhuman AI researcher, and then the million copies of automated Ilya can go figure out how to solve robust and efficient learning from experience.
This gives the vibes of, “We’re losing money on every sale, but we’ll make it up in volume.” This automated researcher is somehow going to figure out the algorithm for AGI - something humans have been banging their heads against for the better part of a century - while not having the basic learning capabilities that children have? That seems super implausible to me.
Besides, even if that’s what you believe, it clearly doesn’t describe how the labs are approaching RLVR. You don’t need to pre-bake the consultant’s skills at crafting PowerPoint slides in order to automate Ilya. So clearly the labs’ actions hint at a worldview where these models will continue to fare poorly at generalizing and on-the-job learning, thus making it necessary to build in beforehand the skills that they hope will be economically valuable.
Another counterargument you could make is that even if the model could learn these skills on the job, it is just so much more efficient to learn them once during training rather than again and again for each user or company. And look, it makes a lot of sense to just bake in fluency with common tools like browsers and terminals. Indeed one of the key advantages that AGIs will have is this greater capacity to share knowledge across copies. But people are underrating how much company- and context-specific skill is required to do most jobs. And there just isn’t currently a robust, efficient way for AIs to pick up those skills.
Human labor is valuable precisely because it’s not schleppy to train
I was at a dinner with an AI researcher and a biologist. The biologist said she had long timelines. We asked what she thought AI would struggle with. She said her recent work has involved looking at slides and deciding whether a dot is actually a macrophage or just looks like one. The AI researcher replied, “Image classification is a textbook deep learning problem—we could easily train for that.”
I thought this was a very interesting exchange, because it revealed a key crux between me and the people who expect transformative economic impacts in the next few years. Human workers are valuable precisely because we don’t need to build schleppy training loops for every small part of their job. It’s not net-productive to build a custom training pipeline to identify what macrophages look like given the way this particular lab prepares slides, then another for the next lab-specific micro-task, and so on. What you actually need is an AI that can learn from semantic feedback on the job and immediately generalize, the way a human does.
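To make “schleppy” concrete: even the “textbook” version of that pipeline looks something like the sketch below. This is a minimal, hypothetical example (the folder layout, classes, and hyperparameters are all made up), and it still presupposes someone at the lab has already curated and labeled the slide crops - which is most of the schlep.

```python
# Hypothetical sketch of one lab-specific training loop: fine-tune a pretrained image
# classifier on this lab's labeled slide crops. The folder layout and two-class setup
# are invented for illustration.
import torch
from torch import nn
from torchvision import datasets, models, transforms

# Lab-specific labeled crops, e.g. slides/train/{macrophage,not_macrophage}/*.png
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
train_set = datasets.ImageFolder("slides/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

# Freeze a pretrained backbone and train a fresh two-way head for this one micro-task.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
for p in model.parameters():
    p.requires_grad = False
model.fc = nn.Linear(model.fc.in_features, 2)  # macrophage vs. not-macrophage

optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

model.train()
for epoch in range(5):
    for images, labels in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()
        optimizer.step()
```

And then you get to do it all over again for the next lab’s slide prep, and the next micro-task.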
Every day, you have to do a hundred things that require judgment, situational awareness, and skills & context learned on the job. These tasks differ not just across different people, but from one day to the next even for the same person. It is not possible to automate even a single job by just baking in some predefined set of skills, let alone all the jobs.
In fact, I think people are really underestimating how big a deal actual AGI will be because they’re just imagining more of this current regime. They’re not thinking about billions of human-like intelligences on a server which can copy and merge all their learnings. And to be clear, I expect this (aka actual AGI) in the next decade or two. That’s fucking crazy!
RL scaling is laundering the prestige of pretraining scaling
With pretraining, we had this extremely clean and general trend of improvement in loss across multiple orders of magnitude of compute (albeit on a power law, which is as weak as exponential growth is strong). People are trying to launder the prestige of pretraining scaling, which was almost as predictable as a physical law of the universe, to justify bullish projections about RLVR, for which we have no well-fit, publicly known trend. When intrepid researchers do try to piece together the implications from scarce public datapoints, they get quite bearish results. For example, Toby Ord has a great post where he cleverly connects the dots between different o-series benchmark charts, which suggested “we need something like a 1,000,000x scale-up of total RL compute to give a boost similar to a GPT level”.
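To see how weak a power law is in practice, here’s a toy calculation. The exponent below is an assumption (in the rough ballpark of published pretraining scaling-law fits), not a fitted trend for RLVR - no such public fit exists, which is the whole point:

```python
# Toy arithmetic only. Assumes loss follows a power law in compute, L(C) ~ C**(-alpha).
# The exponent is an illustrative assumption, not a fit to any RLVR data.
alpha = 0.05  # assumed loss-vs-compute exponent

def compute_multiplier(loss_ratio: float) -> float:
    """How much more compute is needed to shrink loss by a factor of `loss_ratio`."""
    return loss_ratio ** (1.0 / alpha)

for ratio in (1.1, 1.5, 2.0):
    print(f"cut loss by {1 - 1 / ratio:.0%}: ~{compute_multiplier(ratio):,.0f}x more compute")
```

Under that assumed exponent, halving the loss costs roughly a millionfold more compute - that’s the sense in which a power law is as weak as exponential growth is strong.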
Economic diffusion lag is cope for missing capabilities
Sometimes people will say that the reason AIs aren’t more widely deployed across firms and already providing lots of value (outside of coding) is that technology takes a long time to diffuse. I think this is cope - a way to gloss over the fact that these models just lack the capabilities necessary for broad economic value.
Steven Byrnes has an excellent post on this and many other points:
> New technologies take a long time to integrate into the economy? Well ask yourself: how do highly-skilled, experienced, and entrepreneurial immigrant humans manage to integrate into the economy immediately? Once you’ve answered that question, note that AGI will be able to do those things too.
If these models were actually like humans on a server, they’d diffuse incredibly quickly. In fact, they’d be so much easier to integrate and onboard than a normal human employee (they could read your entire Slack and Drive in minutes and immediately distill all the skills your other AI employees have). Plus, the hiring market is very much like a lemons market, where it’s hard to tell who the good people are beforehand, and hiring someone bad is quite costly. This is a dynamic you wouldn’t have to worry about when you just wanna spin up another instance of a vetted AGI model.
For these reasons, I expect it’s going to be much, much easier to diffuse AI labor into firms than it is to hire a person. And companies hire lots of people all the time. If the capabilities were actually at AGI level, people would be willing to spend trillions of dollars a year buying tokens (knowledge workers cumulatively earn tens of trillions of dollars in wages a year)1. The reason that lab revenues are 4 orders of magnitude off right now is that the models are nowhere near as capable as human knowledge workers.
Goal post shifting is justified
AI bulls will often criticize AI bears for repeatedly moving the goal posts. This is often fair. AI has made a ton of progress in the last decade, and it’s easy to forget that.
But some amount of goal post shifting is justified. If you showed me Gemini 3 in 2020, I would have been certain that it could automate half of knowledge work. We keep solving what we thought were the sufficient bottlenecks to AGI (general understanding, few shot learning, reasoning), and yet we still don’t have AGI (defined as, say, being able to completely automate 95% of knowledge work jobs). What is the rational response?
It’s totally reasonable to look at this and say, “Oh actually there’s more to intelligence and labor than I previously realized. And while we’re really close to (and in many ways have surpassed) what I would have defined as AGI in the past, the fact that model companies are not making trillions in revenue clearly reveals that my previous definition of AGI was too narrow.”
I expect this to keep happening into the future. I expect that by 2030 the labs will have made significant progress on my hobby horse of continual learning, and the models will start earning hundreds of billions in revenue, but they won’t have automated all knowledge work, and I’ll be like, “We’ve made a lot of progress, but we’re not at AGI yet. We also need X, Y, and Z to get to trillions in revenue.”
Models keep getting more impressive at the rate the short timelines people predict, but more useful at the rate the long timelines people predict.
Comparison to human distribution will make us at first overestimate (and then underestimate) AI
There is huge variance in the amount of value that different humans can add, especially in white-collar work with its O-ring dynamics (where output depends multiplicatively on the quality of everyone involved, so small differences in skill produce huge differences in value). The village idiot adds ~0 value to knowledge work, while top AI researchers are worth billions of dollars to Mark Zuckerberg.
AI models at any given snapshot of time, however, are roughly equally capable. Humans have all this variance, whereas AI models don’t. Because a disproportionate share of value-add in knowledge work comes from the top percentile humans, if we try to compare the intelligence of these AI models to the median human, then we will systematically overestimate the value they can generate. But by the same token, when models finally do match top human performance, their impact might be quite explosive.
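Here’s a toy simulation of that point. The lognormal shape and its spread are assumptions chosen just to be heavy-tailed, not estimates from any real data on knowledge-worker output:

```python
# Toy simulation only: assumes per-worker value-add is lognormally distributed with an
# arbitrary heavy-tailed spread. The point is qualitative, not the specific numbers.
import numpy as np

rng = np.random.default_rng(0)
value_add = rng.lognormal(mean=0.0, sigma=2.0, size=1_000_000)  # assumed distribution

top_1pct_cutoff = np.percentile(value_add, 99)
top_1pct_share = value_add[value_add >= top_1pct_cutoff].sum() / value_add.sum()
bottom_half_share = value_add[value_add <= np.median(value_add)].sum() / value_add.sum()

print(f"top 1% of workers contribute:     {top_1pct_share:.0%} of total value")
print(f"bottom 50% of workers contribute: {bottom_half_share:.0%} of total value")
```

In a distribution like that, matching the median worker captures very little of where the value actually lives; matching the top of the distribution, and then copying that a million times, captures an enormous amount of it at once.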
Broadly deployed intelligence explosion
People have spent a lot of time talking about a software-only singularity (where AI models write the code for a smarter successor system), a software + hardware singularity (where AIs also improve their successor’s computing hardware), or variations thereof.
All these scenarios neglect what I think will be the main driver of further improvements atop AGI: continual learning. Again, think about how humans become more capable at anything. It’s mostly from experience in the relevant domain.
In conversation, Beren Millidge made the interesting suggestion that the future might look like continual learning agents going out, doing jobs and generating value, and then bringing all their learnings back to the hive mind model, which does some kind of batch distillation on all these agents. The agents themselves could be quite specialized - containing what Karpathy called “the cognitive core” plus knowledge and skills relevant to the job they’re being deployed to do.
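Here’s a rough sketch of what that loop could look like. Every name in it (CoreModel, SpecialistAgent, batch_distill) is hypothetical and the method bodies are left as placeholders - this is one possible shape of the hive-mind-plus-specialists pattern, not anyone’s actual pipeline:

```python
# Hypothetical deploy-and-distill loop: specialized agents share a "cognitive core",
# gather experience on the job, and the central model periodically distills it back in.
from dataclasses import dataclass, field

@dataclass
class Trajectory:
    observations: list
    actions: list
    feedback: str  # semantic feedback from the job itself: corrections, reviews, outcomes

@dataclass
class SpecialistAgent:
    core_weights: dict                   # shared "cognitive core"
    domain_adapter: dict                 # job-specific knowledge and skills
    experience: list = field(default_factory=list)

    def work(self, task) -> Trajectory:
        """Do the job and record what happened."""
        ...  # placeholder for the actual rollout

class CoreModel:
    def spawn(self, domain: str) -> SpecialistAgent:
        """Specialize the shared core for a particular job."""
        ...

    def batch_distill(self, trajectories: list) -> None:
        """Fold the agents' accumulated experience back into the shared core."""
        ...

def run_cycle(core: CoreModel, tasks_by_domain: dict) -> None:
    # Agents go out, do jobs, and bring their learnings back to the hive mind.
    agents = {domain: core.spawn(domain) for domain in tasks_by_domain}
    for domain, agent in agents.items():
        for task in tasks_by_domain[domain]:
            agent.experience.append(agent.work(task))
    core.batch_distill([t for agent in agents.values() for t in agent.experience])
```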
“Solving” continual learning won’t be a singular one-and-done achievement. Instead, it will feel like solving in-context learning. GPT-3 demonstrated that in-context learning could be very powerful (its ICL capabilities were so remarkable that the title of the GPT-3 paper is ‘Language Models are Few-Shot Learners’). But of course, we didn’t “solve” in-context learning when GPT-3 came out - and indeed there’s plenty of progress still to be made, from comprehension to context length. I expect a similar progression with continual learning. Labs will probably release something next year which they call continual learning, and which will in fact count as progress towards continual learning. But human-level continual learning may take another 5 to 10 years of further progress.
This is why I don’t expect some kind of runaway gains to the first model that cracks continual learning, thus getting more and more widely deployed and capable. If a fully solved version of continual learning dropped out of nowhere, then sure, it’s “game, set, match,” as Satya put it. But that’s not what’s going to happen. Instead, some lab is going to figure out how to get some initial traction on the problem. Playing around with this feature will make it clear how it was implemented, and the other labs will soon replicate the breakthrough and improve it slightly.
There’ll also probably be diminishing returns from learning-from-deployment. The first 1,000 consultant agents each learn a ton from deployment. Less so the next 1,000. And is there really such a long tail to consultant work that the millionth deployed instance is likely to see something super important the other 999,999 instances missed? In fact, I wouldn’t be surprised if continual learning also ends up following a power law, but with respect to the number of instances deployed.
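Here’s a toy model of that diminishing-returns intuition. It assumes deployed instances sample their work from a heavy-tailed (Zipf-like) distribution over situations; the distribution and every number in it are made up for illustration:

```python
# Toy model of diminishing returns from learning-from-deployment: later instances
# encounter fewer situations that no earlier instance has already seen.
import numpy as np

rng = np.random.default_rng(0)
n_situations, n_instances, tasks_per_instance = 1_000_000, 10_000, 100

ranks = np.arange(1, n_situations + 1)
probs = (1.0 / ranks) / (1.0 / ranks).sum()  # a few common situations, a very long tail

all_tasks = rng.choice(n_situations, size=(n_instances, tasks_per_instance), p=probs)

seen = np.zeros(n_situations, dtype=bool)
for i, tasks in enumerate(all_tasks, start=1):
    unique_tasks = np.unique(tasks)
    novel = int((~seen[unique_tasks]).sum())  # situations no earlier instance encountered
    seen[unique_tasks] = True
    if i in (1, 10, 100, 1_000, 10_000):
        print(f"instance {i:>6}: {novel} never-before-seen situations")
```

In this toy setup, the novelty each new instance contributes keeps shrinking as more instances are deployed, though the long tail means it never quite hits zero - roughly the power-law-in-deployed-instances picture.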
Besides, I just have some prior that competition will stay fierce, informed by the observation that all these previous supposed flywheels (user engagement on chat, synthetic data, etc) have done very little to diminish the greater and greater competition between model companies. Every month (or less), the big three will rotate around the podium, with other competitors not that far behind. There is some force (potentially talent poaching, rumor mills, or reverse engineering) which has so far neutralized any runaway advantages a single lab might have had.
Obviously competition might drive the cost of AI tokens to below the current cost of human knowledge worker tokens. But at the same time, there’s not a fixed lump of labor, so if AGI tokens actually were really cheap, our demand for AI labor would dramatically expand.


Related to a point made in your podcast with Ilya: it seems like one of the things that allows humans to learn quickly is that the space of misunderstandings humans have is heavily constrained and largely predictable. For example, when learning calculus, most pitfalls/confusions are very common and can thus be called out when teaching someone. The mistakes that AIs make are unpredictable (the same AI makes different mistakes at different points) and also unintuitive (we don't have a good model for when an AI will be reliable and when it won't). This makes it incredibly difficult to create a learning environment where all the possible mistakes are not only identified but also penalized correctly.
This of course relates to your broader point about continual learning. If we could create a model architecture that constrains the AI to fail in predictable ways this seems like it would be a large step towards continual learning.
Good post, but I think you could be overconfident. My sense is that the reports you cite only weakly support the strong claims made, and could also be interpreted in other ways.
> OpenAI is using tons of highly specialized skills in their RL-training pipeline, showing that RL training does not really generalize.
The article cited really only says that OAI hired some Wall Street folks to generate data. I think it is more likely that OAI wants to use this to offer specialized models to high-paying customers in the very short term, rather than this being their general approach to reaching AGI. Counterevidence would be OAI acquiring such data from a much more diverse cross-section of the economy.
> AI is not diffused yet, showing we are not at AGI.
True, but the more reasonable folks with short timelines are not saying that we are at AGI already. Slow diffusion is a valid argument if you have agents that are good but not reliable enough to match human performance. Claude Code is by many accounts really useful, but would be useless as an autonomous employee.
Now observe that CC unlocks the models' value: using Claude's chat interface for coding would substantially reduce the value-add, and it took serious engineering effort to make CC as good as it is now. If CC and other coding agents did not exist, you would be making the mistaken argument that frontier models are useless for coding. It is quite conceivable that the models' value-add for many other economically valuable tasks is at the moment bottlenecked on somebody investing serious resources into making such scaffolding.