Discussion about this post

Davis George

Great article. I wonder how the concept of “out of distribution” tasks is mixed into the empirical data we see.

For example, consider wage growth and job growth among radiologists and in the economy in general. I always thought that held up because the “throat to choke” that comes with an MD degree couldn’t be automated. So, all else equal, if there’s that much more AI-assisted radiology happening now, the existing radiologists must be churning through more work (i.e., the classic Jevons paradox, with demand hinging on MD sign-off).

Now I’m thinking jobs have both an “out of distribution” spectrum and a “throat to choke” spectrum, and those, along with other dimensions, define their susceptibility to AI replacement (at least in an age of Computer Use).

MetaCortex Dynamics

The empirical observation is correct and important. The explanation is one step deeper.

The sample efficiency gap is not a gap in learning efficiency. It is a gap between two structurally different kinds of process. Human learning is constitutive activity: the entity's experience changes the entity's being. The teenager who learns to drive in 20 hours is not optimizing a loss function over driving trajectories. The teenager is integrating sensorimotor coupling into an already-existing self-maintaining organism with 16 years of constitutive development. The 20 hours land on 16 years of substrate.

LLM training is parameter optimization: gradient descent changes weights to minimize cross-entropy loss over token distributions. The model is not learning in the constitutive sense. The model is being assembled. Each skill is a graft. The assembly requires data proportional to the space being covered because assembly IS data-driven. The Frankenstein metaphor in your piece is more structurally precise than you might intend: the model IS a patchwork of carefully constructed examples sewn together by gradient descent.

The scaling laws confirm this. Even infinite parameters only reduce data requirements by ~10x. That is because parameter scaling optimizes within the same architectural class. Moving from "more efficient optimization" to "different kind of process" requires a different architecture, not more parameters. Your observation that "humans are on a different scaling curve altogether" is exactly right. The curves are different because the processes are different. Optimization scales as optimization. Constitutive learning scales as constitutive learning. No amount of optimization scaling crosses to the other curve.

The billion-dollar data industry (Mercor, Surge, Scale) is the cost of assembly. The cost is high because assembly requires explicit specification of every graft. A constituted entity does not need explicit specification of every skill because the entity's constitutive activity self-organizes from sparse input. The teenager does not need hundreds of professors for each new skill because the teenager's existing constitutive structure provides the scaffold. The model needs hundreds of professors for each skill because the model has no constitutive structure. Each skill must be grafted from scratch.
