Scaling Laws, Carefully

(lilianweng.github.io)

12 points by tehnub 20 hours ago|6 comments

•

aspenmartin 15 hours ago

I really wish more people skeptical of AI capabilities would read about scaling laws. Lilian is always marvelous at giving a deep overview of the technical side, but the whole point is this: there are scaling laws, and they have held and continue to hold. They have been a huge basis for predictions about AI capabilities for the past five years or so.

•

FromTheFirstIn 14 hours ago

And sitting right next to the data and compute terms in every scaling-law equation for cross-entropy loss is the entropy of the language, which is just a fixed constant. That puts a hard floor on what cross-entropy training can reach, and I never hear it come up!
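For readers who have not seen the equation being referred to, here is a minimal sketch assuming the commonly used parametric form L(N, D) = E + A/N^alpha + B/D^beta, where N is parameter count, D is training tokens, and E is the irreducible term. The function name and all constants below are illustrative placeholders, not fitted values from any paper.

```python
# Sketch of a parametric scaling law with an irreducible term E.
# ASSUMPTION: the form E + A/N**alpha + B/D**beta; the constants are
# made-up placeholders chosen only to show the shape of the curve.

def scaling_law_loss(n_params, n_tokens, E=1.7, A=400.0, B=400.0,
                     alpha=0.3, beta=0.3):
    """Predicted cross-entropy loss for a model of n_params trained on n_tokens."""
    reducible = A / n_params**alpha + B / n_tokens**beta
    return E + reducible

if __name__ == "__main__":
    for scale in (1e8, 1e10, 1e12, 1e14):
        loss = scaling_law_loss(scale, scale)
        # The gap to E shrinks as a power law but never reaches zero.
        print(f"N = D = {scale:.0e}: loss = {loss:.4f}, gap to floor = {loss - 1.7:.4f}")
```

Under this form, scaling N and D only shrinks the two power-law terms; E stays put no matter how large the model or dataset gets.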

•

aspenmartin 13 hours ago

Right, but that constant is context dependent: it drops with context length, depends on the tokenizer, etc. It doesn't end up being super relevant, even though the loss for real models is relatively large in absolute terms. All of the interesting stuff happens once you start getting closer and closer to the floor. You've gotten past all of the easy tokens that dominate the entropy, and now you get to the really challenging ones that we care about (e.g. very difficult reasoning about a next step).

•

FromTheFirstIn 12 hours ago

My understanding is that the true entropy floor of a language is intractable: regardless of context length, there will be “unpredictable” tokens where some cross-entropy loss is unavoidable. Even with infinite parameters and data, you’ll still fail to predict the next token correctly a decent chunk of the time.
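The floor described here follows from the identity cross-entropy(p, q) = H(p) + KL(p || q): even a model whose predicted distribution q equals the true distribution p still pays H(p). A minimal sketch with a made-up three-token distribution (the numbers and helper names are purely illustrative assumptions):

```python
import math

def entropy(p):
    """Shannon entropy H(p) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Expected loss, in nats, when the truth is p and the model predicts q."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical true next-token distribution in some fixed context.
p = [0.7, 0.2, 0.1]
# Hypothetical imperfect model prediction for the same context.
q_off = [0.5, 0.3, 0.2]

print("entropy floor H(p):       ", entropy(p))
print("loss of a perfect model:  ", cross_entropy(p, p))
print("loss of an imperfect one: ", cross_entropy(p, q_off))
# Even the perfect model's top guess is wrong whenever the most
# likely token is not the one that actually occurs.
print("perfect model top-1 error:", 1 - max(p))
```

The perfect model's loss equals the entropy, not zero, and its single best guess still misses at a rate of one minus the largest probability.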

Also, compute scales quadratically with context length because of attention, so depending on context growth means taking a bath on GPUs for as long as you can, right?
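A rough back-of-the-envelope sketch of that quadratic term, assuming standard dense self-attention and counting only the two sequence-by-sequence matrix products (QK^T and scores times V). Projections and MLP layers, which scale linearly in sequence length, are deliberately left out, and the function name is hypothetical.

```python
def attention_matmul_flops(seq_len, d_model):
    """Approximate FLOPs per layer for the two seq_len x seq_len matmuls.

    ASSUMPTION: dense attention, 2 FLOPs per multiply-add, constant
    factors from heads and softmax ignored.
    """
    qk_scores = 2 * seq_len * seq_len * d_model     # Q @ K^T
    weighted_sum = 2 * seq_len * seq_len * d_model  # softmax(scores) @ V
    return qk_scores + weighted_sum

if __name__ == "__main__":
    base = attention_matmul_flops(1024, 1024)
    for n in (1024, 2048, 4096, 8192):
        flops = attention_matmul_flops(n, 1024)
        print(f"context {n:5d}: {flops:.3e} FLOPs ({flops // base}x the 1024 baseline)")
```

Doubling the context quadruples this term, although in practice the linear-cost parts of the model dilute the effect at shorter context lengths.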

•

graboy 7 hours ago

Yeah, I mean, if you and I were to play a word-guessing game where you had to guess which word I'm thinking of next, there would always be uncertainty in your guess because it's a game of partial information: you can't fully observe my inner state. But that doesn't mean you couldn't evolve a strategy that spends a really long time thinking and analyzing to get asymptotically close to the best guess. There's no limit on that intelligence.

•

FromTheFirstIn 2 hours ago

Isn’t the limit exactly what you’re describing? There’s always uncertainty, and your strategy can approach its asymptote, but the asymptote is still a limit. That’s the limit to the intelligence. And this is just for cross-entropy loss. Even if you could get loss to 0, I’m still not convinced at all that an enormous semantic map and its convoluted geometries amount to intelligence.