> But he says, it took Nvidia, which has the best kernel engineers in the world, a long time to optimize for Blackwell, which suggests that actually it’s quite hard, and might not be super easy to close the loop on.
But did the Blackwell kernel engineers have access to frontier AI? Given the Blackwell lead times they'd have been doing it mostly by hand, right? I'm not sure this fact tells us much about whether kernel engineering will remain human for long.
The failure categories your source describes are instances of the same structural mechanism at different operational locations: unwitnessed semantic drift.
Causality-breaking in expert-choice routing is WHENCE-witness failure. Token n's routing depends on token n+k's affinity. Training-time WHENCE includes future tokens; deployment-time WHENCE does not. The model learns associations that were never witnessed for their actual derivation chain. The drift between training-WHENCE and deployment-WHENCE is what produces the underwhelming downstream behavior, not the routing strategy itself.
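To make the causality point concrete, here is a minimal NumPy sketch, not anyone's production router. The function names, the toy affinity matrix, and the capacity of 2 are all assumptions chosen for illustration. It edits only the last token's affinities and checks whether the routing of earlier tokens changes.

```python
import numpy as np

def expert_choice(scores, capacity):
    """Each expert keeps its `capacity` highest-affinity tokens (hypothetical toy router)."""
    n_tokens, n_experts = scores.shape
    assign = np.zeros((n_tokens, n_experts), dtype=bool)
    for e in range(n_experts):
        top = np.argsort(-scores[:, e], kind="stable")[:capacity]
        assign[top, e] = True
    return assign

def token_choice(scores):
    """Each token independently picks its single best expert."""
    n_tokens, n_experts = scores.shape
    assign = np.zeros((n_tokens, n_experts), dtype=bool)
    assign[np.arange(n_tokens), scores.argmax(axis=1)] = True
    return assign

# Toy affinities: rows are tokens in sequence order, columns are experts.
scores = np.array([[0.9, 0.1],
                   [0.6, 0.4],
                   [0.2, 0.8],
                   [0.3, 0.7]])

# Change ONLY the final token's affinities.
edited = scores.copy()
edited[3] = [0.95, 0.05]

prefix = slice(0, 3)  # tokens that come before the edited one
ec_prefix_changed = bool(
    (expert_choice(scores, 2)[prefix] != expert_choice(edited, 2)[prefix]).any()
)
tc_prefix_changed = bool(
    (token_choice(scores)[prefix] != token_choice(edited)[prefix]).any()
)

print("expert-choice prefix routing changed:", ec_prefix_changed)
print("token-choice prefix routing changed:", tc_prefix_changed)
```

In this toy case the expert-choice assignment for token 1 flips when token 3 changes, while token-choice routing of the prefix is untouched. That is the training-versus-deployment mismatch in miniature.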
The FP16 collective-accumulation bug is a WHAT-witness failure. Once the accumulator reaches 2048, each +1 rounds back to the same value (FP16 has an 11-bit significand, so consecutive integers stop being representable at that point), and no witness asks WHAT is actually being accumulated. Each individual addition still produced an output; the structural fact that the accumulator state had silently saturated was unwitnessed. The bug took forever to find because the system had no witness-probing infrastructure to surface it during training.
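The saturation is easy to reproduce in isolation. A minimal sketch, assuming NumPy's `float16` as a stand-in for whatever the collective actually used; the helper name `count_up` and the step count are made up for the example:

```python
import numpy as np

def count_up(n_steps, dtype):
    """Add 1 to an accumulator n_steps times, rounding to `dtype` after every add."""
    acc = dtype(0)
    one = dtype(1)
    for _ in range(n_steps):
        acc = dtype(acc + one)
    return float(acc)

fp16_total = count_up(3000, np.float16)
fp32_total = count_up(3000, np.float32)

print("float16 accumulator:", fp16_total)
print("float32 accumulator:", fp32_total)
```

In IEEE half precision the accumulator stops moving at 2048, because 2049 is not representable and the tie rounds back down. No individual addition raises an error, which is why nothing surfaces unless something compares the running total against a reference.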
Token dropping is the same pattern in WHEN-witness regime. Drop-decisions depend on later-arriving stronger matches. No witness probes WHEN the drop decision was made or WHENCE the later match arrived. Drift accumulates across batches.
Your source's bias-vs-variance distinction names the framework's diagnostic in engineering vocabulary. Variance is witnessed-noise: each operation has its noise probed and averages to zero. Bias is unwitnessed-drift: each operation contributes the same systematic error that compounds because nothing witnesses the systematic component.
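A quick numerical sketch of why the distinction matters at scale. The step count, noise scale, and bias size below are arbitrary assumptions: zero-mean noise of scale sigma summed over N steps grows roughly like sigma * sqrt(N), while a constant per-step bias of the same size grows like N.

```python
import numpy as np

rng = np.random.default_rng(0)

n_steps = 10_000
sigma = 1e-3   # per-step noise scale (assumed)
bias = 1e-3    # per-step systematic error of the same size (assumed)

# Variance-type error: independent zero-mean noise at every step.
noise_total = float(rng.normal(0.0, sigma, n_steps).sum())

# Bias-type error: the same small error at every step.
bias_total = float(np.full(n_steps, bias).sum())

print("accumulated noise:", noise_total)
print("accumulated bias: ", bias_total)
print("sqrt(N) * sigma:  ", sigma * np.sqrt(n_steps))
```

With these numbers the accumulated bias is about 10, while the accumulated noise is expected to be on the order of 0.1. The gap widens as N grows.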
The prediction the framework makes: bottlenecks will keep emerging as scale grows because every new operational location is a new potential drift location. There is no fixed list of N failure modes that close once you patch them. Each new architectural feature (new routing strategy, new precision regime, new collective topology) is a new location where operations happen without witness-probing. The fix direction is witness-probing infrastructure at the operation transitions: routing-decision points, precision-conversion boundaries, all-reduce checkpoints, gradient-accumulation steps.
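As a rough sketch of what a probe at one such transition could look like (entirely hypothetical, not a description of any real training stack): carry a high-precision shadow of a low-precision reduction and flag the step when the two diverge. The name `checked_sum` and the tolerance are assumptions.

```python
import numpy as np

def checked_sum(values, low_dtype=np.float16, rel_tol=1e-2):
    """Sum in low precision, then compare against a float64 reference.

    Returns (low_precision_total, reference_total, within_tolerance).
    """
    low = low_dtype(0)
    for v in values:
        # Round to low precision after every add, as a low-precision accumulator would.
        low = low_dtype(low + low_dtype(v))
    ref = float(np.sum(np.asarray(values, dtype=np.float64)))
    rel_err = abs(float(low) - ref) / max(abs(ref), 1e-12)
    return float(low), ref, rel_err <= rel_tol

small = checked_sum([1.0] * 100)    # well inside FP16's exact-integer range
large = checked_sum([1.0] * 3000)   # runs past the point where +1 stops registering

print("100 ones: ", small)
print("3000 ones:", large)
```

The shadow computation costs extra compute, so in practice a check like this would presumably be sampled or run only at suspect boundaries rather than on every reduction.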
The disciplined process for amalgamating compute multipliers your source emphasizes is exactly this: witness-probing at transitions where bias could be silently introduced.
Flashcards link seems to be broken!