What I learned this week - Can distillation be stopped, Mythos and the cybersecurity equilibrium, Pipeline RL
April 15, 2025
At the end of my conversation with Michael Nielsen, we talked about how to actually retain what you learn. Michael’s advice was to make some kind of demanding artifact. Write something up. Try to explain it. So in that spirit, here are notes on some topics I’ve learned about over the last week or two. These notes are extremely rough, and have many mistakes.
Can distillation be stopped?
Can the frontier labs stop distillation? If they can’t, open source models can catch up incredibly rapidly and commoditize the frontier, making the long-run business model for the labs less viable. Let’s say it takes 1T tokens from a frontier model to capture its juice (I have no idea if that’s correct, but let’s say). Even ignoring savings from caching, Opus 4.6 is $25/MTok, so those 1T tokens cost $25 million. That’s nothing.
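The arithmetic above, spelled out as a quick sketch. Both inputs are just the guesses from the paragraph (1T tokens, $25 per million tokens); the function name is mine.

```python
# Back-of-the-envelope distillation cost.
# Both inputs are the guesses from the text, not measured values.
TOKENS_NEEDED = 1_000_000_000_000  # 1T tokens (a guess)
PRICE_PER_MTOK_USD = 25.0          # dollars per million tokens, as quoted in the text


def distillation_cost_usd(tokens: int, price_per_mtok: float) -> float:
    """Cost of buying `tokens` tokens at `price_per_mtok` dollars per million tokens."""
    return tokens / 1_000_000 * price_per_mtok


cost = distillation_cost_usd(TOKENS_NEEDED, PRICE_PER_MTOK_USD)
print(f"${cost:,.0f}")  # $25,000,000
```

Even if the true token count is off by 10x in either direction, the bill stays small relative to a frontier training run.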
Labs are responding by hiding chain of thought. But there are a few problems with this solution:
Chain of thought is not made of some fundamentally different kind of token. You can just instruct the model to not think first but just start solving the problem, or to write out its thinking somewhere else.
Even if labs do figure out how to robustly hide chain of thought in the future, you can turn reconstruction into an RLVR target: reward your model for producing a chain of thought that leads to the decoded sequence you observed. Yes, that costs more, but it seems doable.
Maybe most importantly, the real juice of these agentic models is their tool use (writing and updating files of code, running bash commands, etc). And if these things are done locally on the user’s computer, you can’t really hide them. And it seems like a hard lift to get users to migrate all their development workflows to a cloud that you fully control and hide visibility into, modulo a Claude agent input text prompt.
By the way, I learned about an interesting way companies which build products atop API access to AI models can basically distill these models, in a way that potentially makes the distilled models even better than the ones they’re actually built atop.
Suppose you’ve got a coding product. In order to build a feature, a user uses your product to query some frontier model API across 10+ back and forths. Once the user is satisfied with the end result, you have the end state that the user actually wanted - “the gold diff”. These coding product companies can now set the gold diff as the RL target for training their own models, where the model gets rewarded for producing outputs that look like what users eventually converged on, and penalized for producing the kinds of intermediate outputs that users kept rejecting or editing.
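Here is a minimal sketch of what a gold-diff reward could look like. This is my own toy version, not how any actual company does it: it uses plain text similarity from Python's `difflib` as the score, and the function name and the `penalty` weight are made up for illustration. A real setup would presumably use something richer (tests passing, a learned reward model, etc).

```python
import difflib


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1] using difflib's ratio."""
    return difflib.SequenceMatcher(None, a, b).ratio()


def gold_diff_reward(candidate: str, gold: str, rejected: list[str],
                     penalty: float = 0.5) -> float:
    """Reward closeness to the diff the user accepted, and penalize
    closeness to the intermediate outputs the user rejected or edited."""
    closest_rejected = max((similarity(candidate, r) for r in rejected), default=0.0)
    return similarity(candidate, gold) - penalty * closest_rejected


# Hypothetical session: the user rejected a buggy attempt, then accepted a fix.
gold = "def add(a, b):\n    return a + b\n"
rejected = ["def add(a, b):\n    return a - b\n"]

reward_good = gold_diff_reward(gold, gold, rejected)
reward_bad = gold_diff_reward(rejected[0], gold, rejected)
print(reward_good > reward_bad)
```

The point is just that the signal comes for free from normal product usage: the accepted end state is the positive example, and everything the user bounced along the way is the negative.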
Mythos and the cybersecurity equilibrium
It seems like the key difference between Mythos and previous versions is that while previous versions could find individual vulnerabilities in the code (“Hey, there’s a missing bounds check here”), Mythos is long-running and agentic enough to chain together 5 different vulnerabilities, all of which are required to build a working exploit (“Now I can execute arbitrary code, escalate privileges, etc”). To the extent that some discontinuity has been hit, it’s probably more the result of the combinatorial nature of cyberattacks than of some off-trend increase in intelligence.
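A toy way to see why chaining could produce an apparent discontinuity. Assume, purely for illustration, that an exploit needs 5 links and that the model finds each link independently with the same probability. Independence and the specific probabilities are my assumptions, not anything measured.

```python
def chain_success(p_link: float, n_links: int = 5) -> float:
    """Chance of completing an exploit chain when every link is required
    and each one independently succeeds with probability p_link."""
    return p_link ** n_links


for p in (0.5, 0.7, 0.9):
    print(f"per-link {p:.0%} -> full chain {chain_success(p):.1%}")
```

Under these assumptions, going from 50% to 90% per link (a smooth, on-trend improvement) takes the full-chain success rate from about 3% to about 59%.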
What does this mean for offense/defense? One way to look at it is that software is more secure today than it was 20 years ago, despite more and more human intelligence probing at public code, both white hat and black hat. If we get another influx of intelligence suddenly, why should the dynamic change?
In fact, we know that our foreign adversaries almost certainly have access to a bunch of critical zero days which they’re saving for a rainy day, or already using in inconspicuous ways. To the extent that Glasswing allows the whole industry to find a bunch of these latent exploits and patch them, shouldn’t we expect defense to have become much stronger relative to offense by the end of 26? Of course, this is thanks to the fact that American companies got there first and are cooperating with other companies and our government to patch things before our adversaries get to the same level.
One counterpoint I heard from a security expert is that there’s a big difference between finding vulnerabilities and patching them - and AI is much better at the first than the latter (people often talk about the offense/defense balance, but the difficulty of finding versus patching vulnerabilities seems much more significant). In order to patch an issue, you have to find a fix that will not interfere with all the ways people use your software, and all the features which rely on weird bespoke behavior. XKCD has a nice comic illustrating how these kinds of issues come up.
Potential solutions, if it’s non-trivial to just push patches to every piece of software?
TODO - I know nothing about formal verification of software - check out what a seL4 proof of some behavior might look like
Use LLMs to rapidly port all C to Rust. Curious how easily Mythos can find vulnerabilities in memory safe languages.
In some sense, it’s good that Anthropic didn’t release this model publicly until critical IT could be patched up. In another sense, isn’t it a super bad precedent for private companies to be hoarding the ability to break into any operating system and browser and device? One obvious question for Anthropic is why they didn’t just build some kind of classifier which would detect whether you’re using the model for cyberattack type stuff, refuse requests if yes, and release the model publicly with that in place.
Patching your own software is isomorphic to finding bugs in someone else’s repo from the perspective of an LLM (and patching your own software is a frequent coding model use case).
These kinds of classifiers can be easy to evade if you have enough expertise to break the problem of finding exploits down into smaller subproblems of finding vulnerabilities, each of which individually seems like sensibly good behavior to an LLM with no memory.
Pipeline RL paper summary
As you keep RLing a model, not only does the average length of a response increase (since you’re basically training the model to think for longer before answering) but the variance in length also increases - sometimes you get an easy problem and you can immediately answer it - other times, you need to go think for 100k tokens.
This is a big problem for GPU utilization during training, because you have to wait for all these stragglers to finish generating before you can start the next training step.
Okay, one way you could get out of this conundrum is to just batch generation so that while stragglers keep going, you generate even more rollouts.
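To get a feel for how bad the straggler problem gets, here is a toy calculation. It assumes an idealized setup where every rollout in a batch occupies one slot, each slot decodes one token per step, and the batch can't finish until the longest rollout does. The rollout lengths are made-up numbers.

```python
def generation_utilization(lengths: list[int]) -> float:
    """Fraction of slot-time spent actually generating tokens when the
    whole batch has to wait for its longest rollout."""
    return sum(lengths) / (len(lengths) * max(lengths))


# Hypothetical rollout lengths (in tokens) for one batch.
early_in_training = [900, 1_000, 1_100, 1_000]     # low variance
late_in_training = [500, 2_000, 8_000, 100_000]    # one 100k-token straggler

print(f"early: {generation_utilization(early_in_training):.0%}")
print(f"late:  {generation_utilization(late_in_training):.0%}")
```

With these numbers, utilization drops from roughly 91% to roughly 28%, and all of the loss comes from one straggler.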
The problem is that there is an optimal batch size for each training step, so you’d need to split all these rollouts you made across lots of consecutive training steps.
But this takes you into the domain of off-policy RL, because your model is changing with each training step. And so you’re training your model on trajectories that were actually generated by an earlier model, which is not ideal.
The Pipeline RL paper proposes the following fix: in-flight weight updates - where you just sub out the generating model partway through these trajectories as soon as the new training step is done, so all the short trajectories, and a good chunk of the long trajectories, that the next training step will be trained on are generated by the most recent version of the model.
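Here is a toy model of why swapping weights mid-trajectory helps with staleness. This is my own simplification, not the paper's implementation: time moves in ticks, a rollout emits one token per tick, and the trainer publishes a new model version every `ticks_per_step` ticks. All the names are hypothetical.

```python
def token_versions(start: int, length: int, ticks_per_step: int,
                   in_flight: bool) -> list[int]:
    """Model version that produced each token of a rollout starting at tick `start`."""
    if in_flight:
        # Weights are swapped mid-rollout, so each token comes from
        # whatever version is current at the tick it is generated.
        return [(start + i) // ticks_per_step for i in range(length)]
    # Weights are frozen for the whole rollout at the starting version.
    return [start // ticks_per_step] * length


def mean_lag(start: int, length: int, ticks_per_step: int, in_flight: bool) -> float:
    """Average number of versions each token is behind when the rollout finishes."""
    latest = (start + length) // ticks_per_step
    versions = token_versions(start, length, ticks_per_step, in_flight)
    return sum(latest - v for v in versions) / length


# A long rollout (10 tokens) with a new model version every 4 ticks.
print(token_versions(0, 10, 4, in_flight=True))   # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
print(mean_lag(0, 10, 4, in_flight=False))        # 2.0
print(mean_lag(0, 10, 4, in_flight=True))         # 1.2
# A short rollout that starts after the latest update is fully fresh.
print(mean_lag(8, 3, 4, in_flight=True))          # 0.0
```

In this toy version the long rollout is still a mix of versions, but its tail is fresh, and short rollouts are entirely generated by the newest weights.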


