More open questions
I put out a blog prize to answer a couple of big questions I have about AI. The goal is really to find someone to hire as a co-researcher. I have more questions of this variety, but I omitted them from that post’s list, because they don’t make it easy to judge submission quality. So I thought I’d post them here:
5 hyperscalers own 70+% of global AI compute, and much of that is actually reserved for the three labs: OpenAI, Anthropic, and GDM. How worried should we be that AI use cases which are not building up to the singularity and the robot factories (aka normal people being more empowered, understanding the world better, being entertained, etc.) are not the highest ROI use of compute in the world? And given how valuable compute will be (its opportunity cost increases in tandem with the quality of the AI models that run on it), will normal people basically get priced out of the benefits of AI? If we should be worried about this, how concretely should some kind of universal basic income/compute redistribution work? If not, what is the framing of this question missing?
Data is arguably the main way that AI models have been getting better over the last few years. But I remain confused about what concretely these improvements have consisted of. To ask some sharper questions:
Clearly Anthropic (and now also OpenAI and GDM) have cracked something about making competent long horizon coding agents. What is it? Is it just stacking up more and more RL coding environments? Or is there something more particular behind this breakthrough?
Are models even getting more sample efficient (aka they learn more from each training sample) or have we just changed/expanded/improved the data input? This question is important because it tells us how fast deep learning progress will be in domains that actually do require sample efficiency (for example, robotics).
Models are very sample efficient in context, and the information in context can be used much more flexibly. But the attention “fast” weights consume a huge amount of memory in order to accommodate this faster learning. Why is there this memory/sample efficiency tradeoff?
If you look at the size of the KV cache for Llama 3 70B, it’s 320 KB / token. If you just divide the number of bits it takes to store Llama 3 weights by the number of tokens it was pre-trained on, then you get 0.075 bits / token. So there’s a roughly 35 million fold difference in how many bits you’re storing per token, which means the weights pack in far more information per bit than the KV cache does.
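Here is a rough sketch of the arithmetic behind those numbers. The architecture figures are my assumptions, not something stated above: 80 layers, 8 KV heads (grouped-query attention), head dimension 128, 16-bit values, about 70B parameters, and about 15T pretraining tokens. Treat it as a back-of-envelope check rather than an exact accounting.

```python
# Back-of-envelope check of the KV cache vs. weights comparison.
# All constants below are assumptions about Llama 3 70B, not measurements.

N_LAYERS = 80            # assumed transformer layers
N_KV_HEADS = 8           # assumed KV heads (grouped-query attention)
HEAD_DIM = 128           # assumed per-head dimension
BYTES_PER_VALUE = 2      # assumed 16-bit storage for cache and weights
N_PARAMS = 70e9          # assumed parameter count
PRETRAIN_TOKENS = 15e12  # assumed pretraining token count


def kv_cache_bytes_per_token() -> int:
    """Bytes of KV cache written per token: one key and one value
    vector per KV head per layer."""
    return 2 * N_LAYERS * N_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def weight_bits_per_training_token() -> float:
    """Bits of weight storage divided by pretraining tokens seen."""
    return N_PARAMS * BYTES_PER_VALUE * 8 / PRETRAIN_TOKENS


kv_bytes = kv_cache_bytes_per_token()
weight_bits = weight_bits_per_training_token()
ratio = kv_bytes * 8 / weight_bits  # compare bits to bits

print(f"KV cache: {kv_bytes / 1024:.0f} KiB per token")
print(f"Weights:  {weight_bits:.3f} bits per training token")
print(f"Ratio:    {ratio / 1e6:.0f} million fold")
```

Under these assumptions the three figures come out to about 320 KiB, 0.075 bits, and 35 million, matching the numbers quoted above. Swapping in a different precision or token count changes the ratio proportionally, but not the order of magnitude.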
Let’s put frontier lab compute into 3 buckets: pretraining, RL generation, and inference. RL generation and inference look like very similar workloads. The big difference, of course, is that the model learns as a result of RL generation, but it doesn’t (at least currently) from inference. At the same time, the model actually does useful work during inference, but not during RL generation. Many people have pointed out it’s really weird that there’s a distinction between training and inference, and that in the limit it shouldn’t exist. How practically will these two workloads be merged? At a high level, one can imagine hiring an AI instance for a month-long work trial, getting it to do actual useful work for you during that time, and then sending a report card back to the model company. In fact, in a few years, maybe the only way that AI can continue to make progress is through this kind of on-the-job learning, because models will already have saturated anything that can be learned from contrived shorter-horizon RL environments.
Does something Y2Key happen when most of the tokens on the internet (and presumably the ones future models will be trained on) are generated by other AIs? Has the relative value of pre-2023 internet datasets increased in any noticeable way?
I wrote this in my continual learning blog post last June. Is this correct? Why might there not be a winner take all dynamic from continual learning?
“Even if there isn’t a software only singularity (with models rapidly building smarter and smarter successor systems), we might still see something that looks like a broadly deployed intelligence explosion. AIs will be getting broadly deployed through the economy, doing different jobs and learning while doing them in the way humans can. But unlike humans, these models can amalgamate their learnings across all their copies. So one AI is basically learning how to do every single job in the world. An AI that is capable of online learning might functionally become a superintelligence quite rapidly without any further algorithmic progress”
A lot of economic analysis about the impact of AGI focuses on human demand - will the economy shrink because our demands can be fulfilled much more cheaply, will it grow because AI will create new varieties of products, or maybe because the relational sector will grow? But all these analyses take as a given that the only demand that matters is the one originating from humans. How do we model the machine-only economy, where the demand originates from the AIs themselves? And once we add this consideration to our economic analysis of the future, what changes?


I think that the answer to the first question is that we will move from using dollars to using compute. We are already seeing this within companies like Nvidia, where they pay their employees with both money and tokens.
On the need to have idiosyncratic agendas pursued over years as an essential motor of discovery: this might be a genuine AI advantage. Human discovery trawls the possibility space very sparsely, relying especially on social pressure to trim the outlandish theories. This is important because humans are very g-limited, in terms of the individuals needed to go deep in each rabbit hole. The time spent developing new measurements is the limiting factor in your stories.
But AI could afford to be - not exhaustive - but much more Bayesian in its handling of theories over long time scales: spend a few hours of datacenter time to push even the long tail of theories to the measurement bottleneck, get a queue of resources for verification (which itself would not have to deal with social pressure via grantmaking etc.), and progress iteratively. The parallax experiment would still have waited for some years, but importantly the wait would still have been adding some probability mass to the correct model! And scientific acceleration would be real.