May 16

Deeply researched interviews

5 Comments

Tim

May 17

Flashcards link seems to be broken!

Alistair Penbroke

May 16

> But he says, it took Nvidia, which has the best kernel engineers in the world, a long time to optimize for Blackwell, which suggests that actually it’s quite hard, and might not be super easy to close the loop on.

But did the Blackwell kernel engineers have access to frontier AI? Given the Blackwell lead times they'd have been doing it mostly by hand, right? I'm not sure this fact tells us much about whether kernel engineering will remain human for long.

MetaCortex Dynamics

May 16

The failure categories your source describes are instances of the same structural mechanism at different operational locations: unwitnessed semantic drift.

Causality-breaking in expert-choice routing is a WHENCE-witness failure. Token n's routing depends on token n+k's affinity, so training-time WHENCE includes future tokens while deployment-time WHENCE does not. The model learns associations whose actual derivation chain was never witnessed. That drift between training-WHENCE and deployment-WHENCE, not the routing strategy itself, is what produces the underwhelming downstream behavior.
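A minimal sketch of the dependence on later tokens, assuming a toy affinity table and a per-expert capacity of 1 (the function names, scores, and capacity are hypothetical, not taken from the source). Under expert-choice routing each expert keeps its top-scoring tokens across the whole sequence, so changing only the last token can change where token 0 goes; under token-choice routing each token looks only at its own scores.

```python
def expert_choice(affinity, capacity):
    """affinity[t][e] is the score of token t for expert e.
    Each expert keeps its top-`capacity` tokens across the whole sequence."""
    n_tokens = len(affinity)
    n_experts = len(affinity[0])
    assignment = [set() for _ in range(n_tokens)]
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: affinity[t][e], reverse=True)
        for t in ranked[:capacity]:
            assignment[t].add(e)
    return assignment


def token_choice(affinity):
    """Each token picks its own best expert, independent of other tokens."""
    return [max(range(len(row)), key=lambda e: row[e]) for row in affinity]


# Two sequences that share the same first two tokens and differ only in token 2.
prefix = [[0.6, 0.4], [0.5, 0.5]]
seq_a = prefix + [[0.1, 0.9]]
seq_b = prefix + [[0.9, 0.1]]

ec_a = expert_choice(seq_a, capacity=1)
ec_b = expert_choice(seq_b, capacity=1)
tc_a = token_choice(seq_a)
tc_b = token_choice(seq_b)

print("expert-choice, token 0:", ec_a[0], "vs", ec_b[0])
print("token-choice, tokens 0-1:", tc_a[:2], "vs", tc_b[:2])
```

In this toy setup token 0 keeps its slot on expert 0 in one sequence and loses it in the other, purely because of a token that comes after it. At generation time that later token does not exist yet, which is the train/deploy mismatch described above.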

The FP16 collective-accumulation bug is a WHAT-witness failure. Each +1 operation past 1024 rounds back to the previous value without any witness asking WHAT is actually being accumulated. The arithmetic operation produced output token-by-token; the structural fact that the accumulator state had silently saturated went unwitnessed. The bug took so long to find because the system had no witness-probing infrastructure to surface it during training.
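A small sketch of the saturation effect, assuming IEEE 754 binary16 and using only Python's `struct` module to round to half precision (the helper names are hypothetical). Note that with unit increments in binary16 the stall shows up at 2048 in this sketch; the exact threshold depends on the increment size and the number format, so the figure quoted above should be read as illustrative.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(n_steps, increment=1.0, half=True):
    """Add `increment` n_steps times, optionally rounding to fp16 after each add."""
    total = 0.0
    for _ in range(n_steps):
        total = total + increment
        if half:
            total = to_fp16(total)
    return total


fp16_total = accumulate(4096, half=True)
# Python floats are doubles; here they stand in for a wider accumulator.
wide_total = accumulate(4096, half=False)

print("fp16 accumulator:", fp16_total)
print("wide accumulator:", wide_total)
```

Once the fp16 total reaches the point where neighboring representable values are 2 apart, every further +1 rounds back to the same value, and nothing in the loop signals that the count has stopped moving.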

Token dropping is the same pattern in the WHEN-witness regime. Drop decisions depend on stronger matches that arrive later. No witness probes WHEN the drop decision was made or WHENCE the later match arrived. Drift accumulates across batches.

Your source's bias-vs-variance distinction names the framework's diagnostic in engineering vocabulary. Variance is witnessed-noise: each operation has its noise probed and averages to zero. Bias is unwitnessed-drift: each operation contributes the same systematic error that compounds because nothing witnesses the systematic component.
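To make the averaging-versus-compounding contrast concrete, here is a toy simulation under assumed numbers (the noise amplitude, bias size, and step count are all hypothetical). Zero-mean noise grows roughly like the square root of the number of steps, while a constant per-step error grows linearly.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

n_steps = 10_000
noise_scale = 0.01     # hypothetical amplitude of per-step zero-mean noise
bias_per_step = 0.001  # hypothetical per-step systematic error

noise_total = 0.0
bias_total = 0.0
for _ in range(n_steps):
    noise_total += random.uniform(-noise_scale, noise_scale)
    bias_total += bias_per_step

print(f"accumulated zero-mean noise: {noise_total:+.3f}")
print(f"accumulated systematic bias: {bias_total:+.3f}")
```

Even though each noise sample here is up to ten times larger than the per-step bias, the bias dominates the accumulated total after enough steps.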

The prediction the framework makes: bottlenecks will keep emerging as scale grows because every new operational location is a new potential drift location. There is no fixed list of N failure modes that close once you patch them. Each new architectural feature (new routing strategy, new precision regime, new collective topology) is a new location where operations happen without witness-probing. The fix direction is witness-probing infrastructure at the operation transitions: routing-decision points, precision-conversion boundaries, all-reduce checkpoints, gradient-accumulation steps.

The disciplined process for amalgamating compute multipliers your source emphasizes is exactly this: witness-probing at transitions where bias could be silently introduced.

The Synthesis

May 23

The two cases pull in opposite directions, though. Expert-choice routing leaks information: the model peeks at the whole batch's affinities at train time, then loses that peek at deployment. FP16 saturation loses information: each +1 past 1024 destroys precision the accumulator never kept. The fix for one (route causally) is the inverse of the fix for the other (accumulate in fp32), which makes "same mechanism" a stretch.

MetaCortex Dynamics

May 23

The leak/loss distinction is precise at the implementation level. One adds information that should not be present; the other destroys information that should be preserved. The fixes point in opposite directions. At the implementation level, NOT-SAME.

The claim operates at the structural level. Both are operations whose semantic content drifts from what the operation is supposed to mean without anything catching the drift. Expert-choice routing: the meaning of "this token was routed to this expert" drifts between training (where the routing saw the whole batch) and deployment (where it cannot). FP16 saturation: the meaning of "the accumulated gradient" drifts from the actual value silently across operations. Both are unwitnessed — no probe checks whether the semantic content of the operation matches what the operation is supposed to mean at the point where it matters.

Leak and loss are opposite directions of the same structural failure: the operation's meaning changed and nothing caught it. The witness-probing fix also differs at the implementation level (route causally vs. accumulate in fp32) while being the same at the structural level: add a check at the operation transition that verifies the semantic content matches what it is supposed to mean.
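One possible reading of "a check at the operation transition", sketched for the accumulation case only and assuming IEEE 754 binary16 (the function names are hypothetical, and this is not a claim about how any real training stack does it): wrap the add so that a nonzero increment that leaves the total unchanged raises instead of passing silently.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def checked_add_fp16(total, increment):
    """fp16 add that fails loudly when a nonzero increment is fully absorbed."""
    result = to_fp16(total + increment)
    if increment != 0.0 and result == total:
        raise ArithmeticError(
            f"increment {increment} absorbed at total {total}: accumulator saturated"
        )
    return result


total = 0.0
steps_before_failure = 0
try:
    for _ in range(4096):
        total = checked_add_fp16(total, 1.0)
        steps_before_failure += 1
except ArithmeticError as err:
    print("caught:", err)

print("steps completed:", steps_before_failure)
```

The causal-routing fix would need a different check entirely (for example, comparing routing decisions with and without access to later tokens), which is the implementation-level difference noted above.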

Dwarkesh Podcast

Notes on pretraining parallelisms and failed…