loved this. i would love to read more and explore along with you. hope you keep writing them up without the pressure to perform. protect your joy of learning my man.
Love it! 🤓 Raw and real. Pure joy to read your notes. Thanks for being you! 🤍
Fun way to learn; adding to the point on the key difference with Mythos:
"Rather, my suspicion is that these capabilities are driven by Claude Mythos likely being the first model to be explicitly and seriously RL-trained to invent and defend against cyberattacks."
Seems plausible to me
https://www.beren.io/2026-04-11-Thoughts-On-Claude-Mythos/
Re formal verification of software - this piece is a good primer
https://astrangeattractor.substack.com/p/how-could-formal-verification-help
Talks about you and your podcast this week: https://kaipability.substack.com/p/industrialists-build-temples-scientists
A great way also is to reflect back to the source. The notes are not rough, as the thoughts were bouncing off an LLM. Enjoy
'In some sense, its good that Anthropic didn’t release this model publicly until critical IT could be patched up. In another sense, isn’t it a super bad precedent for private companies to be hoarding the ability to be able to break into any operating system and browser and device?'
1. Anthropic passed the democracy test. They made sure their model was used to get critical IT patched up before releasing it generally.
2. Apart from DEMOCRATIC private companies like Anthropic 'hoarding the ability to break into any operating system and browser and device', who else would you have 'hoarding' such an ability, Dwarkesh? Trump's government? How DEMOCRATIC is Trump's - or any other President's - government likely to be if THEY are 'hoarding' such an ability? I'd go with DEMOCRATIC Anthropic every time.
3. 'One obvious question for Anthropic is why they didn’t just build some kind of classifier which would detect whether you’re using the model for cyberattack type stuff ... These kinds of classifiers can be easy to evade if you have enough expertise'.
Maybe that's why Anthropic didn't bother releasing a classifier.
date is wrong! @dwarkesh
Re: Why breaking causality is bad
Our current method is to pretrain on language modeling since needing to predict future tokens turns out to lead to models internalizing information very effectively. When causality is broken, a sufficiently strong model can just peek at the future instead of learning what we want it to.
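To make "peek at the future" concrete, here's a toy sketch of the causal mask in a single attention head (my own illustration, not from the post; the function name and shapes are just assumptions for the example):

```python
import torch

def causal_attention(q, k, v):
    """Toy single-head attention; q, k, v have shape (seq_len, d)."""
    seq_len, d = q.shape
    scores = q @ k.T / d ** 0.5                          # (seq_len, seq_len) similarities
    # Causal mask: position i may only attend to positions <= i.
    future = torch.ones(seq_len, seq_len).triu(diagonal=1).bool()
    scores = scores.masked_fill(future, float("-inf"))   # block attention to later tokens
    return torch.softmax(scores, dim=-1) @ v

# If the mask were dropped, token i could attend to token i+1, so next-token
# prediction would collapse into copying and the model wouldn't have to
# internalize anything about the data.
```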
I think the concerns of this post - distillation, bugs, cybersecurity, information asymmetry - are really more organizational alignment issues than model issues. Not completely, but substantially.
All of these concerns are real. But the best mitigation strategy is likely not solely further advances in model alignment. It is organizational alignment. Shouldn’t alignment be primarily institutional, through precertification, model insurance, and liability?
Exxon and Goldman Sachs are both very smart, very powerful, non-human entities, yet they act mostly in alignment with societal goals and within our social contract. We didn’t achieve that by modifying their internal cognition. We did it through structure: boards of directors, insurance, liability, and basic regulation. Maybe those same forms should be ported explicitly to the organizations releasing models.
What if the real problem is less how to align the model itself, and more how to align the organization releasing it?
What seems more sensible: trying to contain models internally, or shaping the incentives of those who deploy them?
Consider:
Precertification: like aviation or stock issuance guidelines. Forces ex ante risk evaluation.
Liability: converts externalities into balance sheet items.
Insurance: prices risk. In a Hayekian sense, prices convey dispersed and otherwise inaccessible information.
These mechanisms work by aligning the model-releasing organization to do the hard work up front. In some sense, this is already happening: delaying release of high-risk capabilities is one example (Anthropic/Mythos). But it is happening in an ad hoc way, and the question is whether what Anthropic is doing ad hoc should be made systematic, legible, and standard.
Yes, this model of alignment accepts some residual tail risk. But a more decentralized, Hayekian approach, one that leverages price signals, distributed knowledge, and institutional constraints, has meaningful advantages over centralized, model-centric approaches. It is more adaptive, more information-rich, and more consistent with how we already manage other powerful systems.
A useful distinction:
Model alignment reduces capability risk.
Institutional alignment reduces deployment risk.
And deployment risk is where most real-world harm occurs.
If intelligence is going to diffuse, and it likely will, then alignment may be less about containing models and more about structuring the incentives of those who build and deploy them.
I think that the way Anthropic may end up releasing Mythos to the public is as a model that Opus passes certain work on to, with a safety net in between.
Just because you can distill doesn’t mean you can match frontier labs on cost per token.
As tokens get more valuable, cost-per-token advantages matter more.
Hassabis recently said he expects the open-source frontier gap to increase, not decrease.
Which interview of Hassabis do you refer to? Just curious :)
FineWeb is 18.5T tokens and OpenWebText is 9B tokens, so 1T tokens for distillation sounds about right to me. I didn't quite follow the pretraining parallelism discussion; I'll need to return to that later.
> One obvious question for Anthropic is why they didn’t just build some kind of classifier which would detect whether you’re using the model for cyberattack type stuff, and refuse requests if yes, and release that publicly.
They are for 4.7, which is still weaker than Mythos.
The argument would be: "False negatives on Opus-level models aren't too dangerous, and we can accept some misuse cases from it. False negatives from Mythos would be catastrophic and not worth the risk of even one case."
Though I don't think that argument is the main driving factor in reality; I think it's more likely there is just a talent shortage on mundane safety and a financial-motivation gap working against it.
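For what it's worth, the gate the post describes is easy to sketch; building a classifier that's actually hard to evade is the hard part. A toy illustration below: `score_cyber_misuse`, the keyword heuristic, and the threshold are all made up for the example, not anything Anthropic actually does.

```python
REFUSAL = "I can't help with that."
THRESHOLD = 0.9  # assumed: tuned to trade false positives against false negatives

def score_cyber_misuse(prompt: str) -> float:
    """Stand-in for a trained classifier returning P(cyberattack intent)."""
    suspicious = ("exploit", "0-day", "privilege escalation", "reverse shell")
    hits = sum(term in prompt.lower() for term in suspicious)
    return min(1.0, hits / 2)  # toy heuristic, trivially evadable by an expert

def guarded_generate(prompt: str, frontier_model) -> str:
    """Refuse before the main model ever sees a flagged request."""
    if score_cyber_misuse(prompt) >= THRESHOLD:
        return REFUSAL
    return frontier_model(prompt)
```

The post's point is exactly that an expert can route around something like this, which is why the false-negative risk dominates at Mythos-level capability.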
My take re distillation is that companies could potentially stop it with the following two things:
1. Compute/memory scarcity means most of the actually useful models end up naturally hoarded by frontier companies. You've talked before, Dwarkesh, about how the incoming memory crunch is going to make consumer phones worsen year on year and get more expensive. While I do expect attempts to improve the situation by adding more supply, I also expect demand to shoot up in my median world (and in the majority of worlds, from my perspective): the big tech companies' promise of AIs that can replace everyone at their jobs arrives at the latest within the next 30 years, but the software intelligence explosion either plateaus almost immediately or happens only in a small version that isn't enough to handle the physical stuff as well as people hoped.
More generally, one underrated consequence of compute/memory being perpetually scarce relative to demand (because it's locked up in AI contracts, and AI is the best use of GPUs/data centers/TPUs at the moment) is that it centralizes compute and makes it more palatable/beneficial to just use closed-source models rather than distill them.
2. Companies simply not releasing their best models, à la Anthropic not releasing Mythos, and instead pre-distilling models that, while not the frontier, are useful enough to dissuade people from attempting to secretly exfiltrate the best ones.
this is good