I think I just watched two smart people talk past each other for an hour. :) My intuitive sense is that Dr. Sutton's point about how kids learn is that they learn by iteratively predicting, acting, and observing: little scientists 24/7, several times a second, physically interacting with the real world, and they do this for years before the symbolism flying about in their brains is attached to a language in their heads. They do learn by imitating others, but that is a tiny percentage of the learning-by-living they're doing. This self-guided style of learning would maximally leverage compute, which, if I read him correctly, is Sutton's point: LLMs can't work this way by their very nature.
Despite their amazing and unreasonably successful powers, LLMs are, by comparison, dead inside: they are curve-fitting estimators in a latent space that produce intelligence-like powers, but hallucinations haven't yet been defeated, and they betray LLMs' utter lack of a world model. They do an amazing job of imitating humans, but I suspect this is an echo of the humanity recorded in our input data and modeled in hypercubes by gigawatts of power and petabytes of data. They are amazing compressions of queryable human data, but LLMs are missing something vital and may be a dead end on the road to AGI. I don't think we can add onto LLMs to get there. I do think LLMs can be a data input to an entirely more active, iterative, continual-learning architecture.
Content aside — it’s really great, if uncomfortable, content — this guy is too intensely a Smart Guy From Edmonton. No idea why that’s such a distinctive type, but I see it in medicine and tech regularly.
I broadly agree with the part about continual training being the most probable path to AGI. I think we should see movement on this as soon as next year. The company I work for - NonBioS - is on track to start testing long-context, continually trained systems as soon as year end. My estimate is that this might result in AGI (or a clear roadmap to it) as soon as 2027.
Interesting. Nevertheless, I am afraid that confusing continual learning with long context leads to a dead end. The only way to implement continual learning is long-term memory. Memory requires integrating the information into the neural network, not overloading the short-term memory (the context). Humans have an effectively infinite context thanks to long-term memory.
Sorry, maybe you misunderstood, or maybe I wasn't clear enough. I am not confusing continual learning with long context. I was referring to long context about tasks arising as a result of long-term memory created through continual learning.
Thanks for the clarification. You did not mention memory at all, so it was hard to tell that you were referring to long-term memory and not just an extremely long context. Continual learning can also be implemented with such contexts, although it is the wrong path.
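To make the distinction concrete, here is a minimal sketch in Python. The toy linear model stands in for a real network, and all class names and numbers are illustrative, not any real system: the point is only that a long-context agent defers all cost to query time, while a continual learner folds each observation into its weights.

```python
import torch

class LongContextAgent:
    """Short-term memory only: every new fact is appended to the context."""
    def __init__(self):
        self.context = []

    def observe(self, fact):
        self.context.append(fact)       # nothing is integrated into any weights

    def query_cost(self):
        return len(self.context)        # each query re-reads the entire history


class ContinualLearningAgent:
    """Long-term memory: every new fact is integrated into the weights."""
    def __init__(self, dim=8):
        self.w = torch.zeros(dim, requires_grad=True)

    def observe(self, x, y, lr=0.1):
        loss = (self.w @ x - y) ** 2    # toy squared-error objective
        loss.backward()
        with torch.no_grad():           # one gradient step folds the fact into memory
            self.w -= lr * self.w.grad
            self.w.grad.zero_()

    def query_cost(self):
        return 1                        # recall is one forward pass, independent of history


ctx, cl = LongContextAgent(), ContinualLearningAgent()
for t in range(1000):
    ctx.observe(f"fact {t}")
    cl.observe(torch.randn(8), torch.randn(()))
print(ctx.query_cost())  # 1000, and growing with every observation
print(cl.query_cost())   # always 1
```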
Great episode. Fun, engaging, and having Sutton disagree with you was very entertaining! I tried to find the numbers on energy costs of training vs. inference, but I can't find the 80% figure anywhere. Is this just a loose argument from Sutton's side, that billions of queries will overshadow the training costs due to the test-time compute regime we are in? I think the energy costs will matter a lot for predicting how the scaling laws for this paradigm will hold and how intelligent these systems will get before new training runs become prohibitively expensive. Here is a Perplexity query exploring this a bit further. https://www.perplexity.ai/search/whats-the-energy-consumption-o-aFf0GjufRle8OVbuV1dZhQ#0
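For a rough sense of why inference could dominate, here is a back-of-envelope sketch. The training-run, per-query, and query-volume figures below are purely illustrative assumptions I picked for round numbers, not sourced estimates:

```python
# Back-of-envelope: when does cumulative inference energy overtake the
# one-time training energy? All three constants are assumptions.

TRAIN_ENERGY_GWH = 50.0     # assumed one-time training run
QUERY_ENERGY_WH = 0.3       # assumed energy per query
QUERIES_PER_DAY = 1e9       # assumed global query volume

daily_inference_gwh = QUERIES_PER_DAY * QUERY_ENERGY_WH / 1e9  # Wh -> GWh
days_to_overtake = TRAIN_ENERGY_GWH / daily_inference_gwh

print(f"Inference uses {daily_inference_gwh:.2f} GWh/day")
print(f"Inference passes training energy after {days_to_overtake:.0f} days")
# With these assumptions: 0.30 GWh/day, crossover after ~167 days. In a
# test-time-compute regime (far more tokens per query), it comes sooner.
```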
Hello Alexander - Your URL does not resolve. Can you repost?
I greatly admire the clarity of your exposition. "Your writing", I'm trying to say. See, I can see it but just can't do it!