Grokking: learning beyond generalization

"Grokking" is a term coined by computer scientist Peter Norvig, and it refers to the process of a machine learning model (such as a neural network) suddenly and unexpectedly understanding a complex pattern or concept in the data, often in a way that is surprising and difficult to explain.

The term "grokking" is derived from a science fiction novel called "Stranger in a Strange Land" by Robert A. Heinlein, in which the word "grok" means to understand something deeply and intuitively, without needing to analyze or rationalize it.

In the context of machine learning, grokking refers to the phenomenon where a model first simply memorizes its training data, reaching near-perfect training accuracy while validation accuracy remains near chance, and then, after many additional training steps, suddenly learns the underlying pattern and generalizes to new, unseen data. It was first observed on small algorithmic datasets, such as modular arithmetic, when models were trained far beyond the point of overfitting.

Grokking is often associated with the concept of "understanding" or "insight" in machine learning, and is seen as a desirable outcome because it allows the model to make predictions that are more accurate and robust than would be possible through simple memorization. However, grokking remains poorly understood, and researchers are still working out the mechanisms that underlie it.
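
The setting where grokking was first reported is easy to reproduce in miniature. Below is a minimal sketch, assuming PyTorch, of a modular-addition experiment in the spirit of Power et al. (2022); the architecture, the 30% train split, and all hyperparameters are illustrative assumptions rather than the paper's exact setup. The signature to look for in the printed log is training accuracy saturating early while validation accuracy stays near chance for many steps before suddenly jumping.

```python
# Minimal grokking sketch (illustrative, not the original paper's setup):
# learn (a + b) mod p from a fraction of all pairs, then train far past
# the point where training accuracy saturates.
import torch
import torch.nn as nn

p = 97  # modulus; the dataset is every pair (a, b) labeled with (a + b) % p
pairs = torch.cartesian_prod(torch.arange(p), torch.arange(p))
labels = (pairs[:, 0] + pairs[:, 1]) % p

# Small train fractions make the memorize-then-generalize gap most visible
perm = torch.randperm(len(pairs))
n_train = int(0.3 * len(pairs))
train_idx, val_idx = perm[:n_train], perm[n_train:]

model = nn.Sequential(
    nn.Embedding(p, 128),  # shared embedding for both operands
    nn.Flatten(),          # (batch, 2, 128) -> (batch, 256)
    nn.Linear(256, 256), nn.ReLU(),
    nn.Linear(256, p),
)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1.0)
loss_fn = nn.CrossEntropyLoss()

def accuracy(idx):
    with torch.no_grad():
        return (model(pairs[idx]).argmax(-1) == labels[idx]).float().mean().item()

for step in range(100_000):  # grokking can occur long after the train loss flatlines
    opt.zero_grad()
    loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
    opt.step()
    if step % 1000 == 0:
        print(step, accuracy(train_idx), accuracy(val_idx))
```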

Double descent


The "double descent" phenomenon in deep learning refers to the observation that the generalization performance of a neural network can exhibit a non-monotonic behavior as a function of the model's capacity, measured by the number of parameters or the depth of the network.

In general, increasing the capacity of a neural network allows it to learn more complex patterns in the data, which can improve generalization. Beyond a certain point, however, the classical view holds that further increasing capacity hurts generalization, as the model begins to overfit and memorize the training data rather than learning generalizable patterns.

The double descent phenomenon is a more nuanced picture, in which the test error of the model traces out three distinct phases as capacity grows:

  1. First descent: In the underparameterized regime, increasing capacity improves generalization, and test error falls, following the classical bias-variance trade-off.
  2. Ascent: As the model approaches the interpolation threshold (just enough capacity to fit the training data exactly), test error rises. This is the familiar overfitting peak.
  3. Second descent: Beyond the interpolation threshold, adding even more capacity causes test error to fall again, so heavily overparameterized models can generalize better than moderately sized ones. A numerical sketch of this curve follows the figure below.

(Figure: the double descent curve of test error versus model capacity; image from Wikipedia.)
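
Double descent can be reproduced numerically without a deep network at all. The sketch below, assuming NumPy, fits unregularized least squares on random ReLU features, in the spirit of Belkin et al. (2019); the data sizes, noise level, and width grid are illustrative assumptions. Test error typically climbs toward a peak as the width approaches the number of training samples (the interpolation threshold, width ≈ 100 here) and falls again beyond it; a single random seed gives a noisy curve, and averaging over several seeds smooths the peak.

```python
# Double descent sketch: unregularized least squares on random ReLU features.
import numpy as np

rng = np.random.default_rng(0)
n_train, d = 100, 10
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.1 * rng.normal(size=n)  # noisy linear target
    return X, y

X_train, y_train = make_data(n_train)
X_test, y_test = make_data(1000)

relu = lambda Z: np.maximum(Z, 0.0)

for width in [10, 30, 60, 90, 100, 110, 150, 300, 1000, 3000]:
    W = rng.normal(size=(d, width)) / np.sqrt(d)  # fixed random first layer
    # pinv gives ordinary least squares below the interpolation threshold
    # (width < n_train) and the minimum-norm interpolant above it.
    beta = np.linalg.pinv(relu(X_train @ W)) @ y_train
    test_mse = np.mean((relu(X_test @ W) @ beta - y_test) ** 2)
    print(f"width={width:5d}  test MSE={test_mse:.3f}")
```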

How to make a deep learning model grok

Regularization appears to be key: in the original grokking experiments, adding weight decay to the optimizer was reported to be particularly effective at shortening the gap between fitting the training set and generalizing. In short: weight decay + SGD (or AdamW) => faster grokking.
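
A minimal sketch of how one might measure that effect, assuming PyTorch and reusing the dataset tensors (pairs, labels, train_idx, val_idx) from the modular-addition sketch above; the decay grid, learning rate, and success threshold are illustrative assumptions, not recommended values.

```python
# Illustrative sweep: how weight decay affects time-to-generalization.
# Assumes pairs, labels, train_idx, val_idx from the grokking sketch above.
import torch
import torch.nn as nn

def fresh_model(p=97):
    return nn.Sequential(
        nn.Embedding(p, 128), nn.Flatten(),
        nn.Linear(256, 256), nn.ReLU(),
        nn.Linear(256, p),
    )

def steps_to_grok(weight_decay, max_steps=50_000, target=0.95):
    model = fresh_model()
    # Plain SGD applies weight decay as L2 regularization on the gradients
    opt = torch.optim.SGD(model.parameters(), lr=1e-2,
                          momentum=0.9, weight_decay=weight_decay)
    loss_fn = nn.CrossEntropyLoss()
    for step in range(max_steps):
        opt.zero_grad()
        loss_fn(model(pairs[train_idx]), labels[train_idx]).backward()
        opt.step()
        if step % 500 == 0:
            with torch.no_grad():
                val_acc = (model(pairs[val_idx]).argmax(-1)
                           == labels[val_idx]).float().mean().item()
            if val_acc > target:
                return step  # first checkpoint where the model has grokked
    return None  # never generalized within the step budget

for wd in [0.0, 0.01, 0.1, 1.0]:
    print(f"weight_decay={wd}: grokked at step {steps_to_grok(wd)}")
```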