AI scaling

This note last modified May 21, 2026

AI is a bit weird. When the field was starting out, there was a huge debate between heavily curated, specialized systems and… systems with just a stupid amount of data and compute thrown at them. The latter, the hyperscaled systems, turned out to have emergent properties that no one expected. It’s worth remembering that LLMs are just meant to produce language (they are trained to predict the next token). The fact that they can mimic intelligence (or are, arguably, actually intelligent) is insane.

Welch Labs has some good videos on the topic, and the same pattern of scale paying off in unexpected ways shows up time and time again:

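When you train a model for too long, it overfits: training accuracy keeps climbing while accuracy on unseen data drops. But if you just keep training and training, that accuracy can jump back up again (this is usually discussed under the names double descent and grokking).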
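Gradient descent seems to magically find “wormholes”, which fall naturally out of high-dimensional geometry. In practice, a point only traps GD if the loss rises in every direction around it, and with millions of dimensions there is almost always some direction that still leads downhill, so GD can find good solutions without worrying too much about local valleys.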
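The transformer architecture made massive scaling practical because it is efficient to train: attention processes all the tokens in a sequence in parallel, rather than one at a time as in recurrent networks, which maps well onto GPUs.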
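
To make the gradient descent point above a bit more concrete, here is a toy sketch (not a neural network). It assumes, purely for illustration, that the curvature at a flat point of the loss can be modeled as a random symmetric matrix, and it counts how often every direction curves upward, which is what a true local valley requires. The function names are made up for this note.

```python
import random


def random_symmetric(n, rng):
    """Build an n x n symmetric matrix with standard normal entries (toy Hessian)."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            v = rng.gauss(0.0, 1.0)
            m[i][j] = v
            m[j][i] = v
    return m


def is_positive_definite(m):
    """Try a Cholesky factorization; it succeeds only if the matrix is positive definite."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                d = m[i][i] - s
                if d <= 0.0:
                    # A non-positive pivot means some direction does not curve upward.
                    return False
                low[i][j] = d ** 0.5
            else:
                low[i][j] = (m[i][j] - s) / low[j][j]
    return True


def fraction_of_minima(n, trials=2000, seed=0):
    """Estimate how often a random n-dimensional flat point is a true local valley."""
    rng = random.Random(seed)
    hits = sum(is_positive_definite(random_symmetric(n, rng)) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    for n in (1, 2, 3, 5, 8):
        print(n, fraction_of_minima(n))
```

Under these assumptions the fraction starts at roughly one half for a single dimension and shrinks quickly as dimensions are added, because every extra direction is another chance for the surface to slope downhill. Real loss surfaces are not random matrices, so treat this as intuition rather than evidence.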