Deep Networks Always Grok and Here is Why


2024 | Ahmed Imtiaz Humayun, Randall Balestriero, Richard Baraniuk
The paper "Deep Networks Always Grok and Here is Why" by Ahmed Imtiaz Humayun, Randall Balestrieri, and Richard Baraniuk explores the phenomenon of "grokking" in deep neural networks (DNNs). Grokking, or delayed generalization, refers to the long-term improvement in generalization and robustness after achieving near-zero training error. The authors demonstrate that grokking is more widespread than previously thought, occurring in various practical settings such as training CNNs on CIFAR10 or ResNets on Imagenette. The paper introduces the concept of "delayed robustness," where DNNs become robust to adversarial examples long after they achieve generalization. This phenomenon is explained through the lens of "local complexity," a measure that quantifies the density of "linear regions" in the input-output mapping of a DNN. These linear regions are regions where the DNN's function is approximately linear, and their density changes during training, leading to both delayed generalization and delayed robustness. Key findings include: 1. **Delayed Robustness**: DNNs exhibit delayed robustness to adversarial examples, occurring after generalization. 2. **Local Complexity**: A novel progress measure based on the local complexity of the DNN's input space partition, which is a proxy for the network's expressivity. 3. **Training Dynamics**: DNNs undergo a phase change in local complexity, with a second descent phase where linear regions migrate away from training data points and towards the decision boundary. 4. **Region Migration**: The phenomenon where linear regions concentrate around the decision boundary, leading to a robust partition of the input space. The authors provide empirical evidence for these findings across various datasets and architectures, including CNNs, ResNets, and GPT-based models. They also discuss the impact of different training parameters, such as batch normalization and weight decay, on the emergence of grokking and delayed robustness. The paper concludes with a discussion on the implications of these findings for the field of machine learning and the potential societal consequences of longer training times.The paper "Deep Networks Always Grok and Here is Why" by Ahmed Imtiaz Humayun, Randall Balestrieri, and Richard Baraniuk explores the phenomenon of "grokking" in deep neural networks (DNNs). Grokking, or delayed generalization, refers to the long-term improvement in generalization and robustness after achieving near-zero training error. The authors demonstrate that grokking is more widespread than previously thought, occurring in various practical settings such as training CNNs on CIFAR10 or ResNets on Imagenette. The paper introduces the concept of "delayed robustness," where DNNs become robust to adversarial examples long after they achieve generalization. This phenomenon is explained through the lens of "local complexity," a measure that quantifies the density of "linear regions" in the input-output mapping of a DNN. These linear regions are regions where the DNN's function is approximately linear, and their density changes during training, leading to both delayed generalization and delayed robustness. Key findings include: 1. **Delayed Robustness**: DNNs exhibit delayed robustness to adversarial examples, occurring after generalization. 2. **Local Complexity**: A novel progress measure based on the local complexity of the DNN's input space partition, which is a proxy for the network's expressivity. 3. 
**Training Dynamics**: DNNs undergo a phase change in local complexity, with a second descent phase where linear regions migrate away from training data points and towards the decision boundary. 4. **Region Migration**: The phenomenon where linear regions concentrate around the decision boundary, leading to a robust partition of the input space. The authors provide empirical evidence for these findings across various datasets and architectures, including CNNs, ResNets, and GPT-based models. They also discuss the impact of different training parameters, such as batch normalization and weight decay, on the emergence of grokking and delayed robustness. The paper concludes with a discussion on the implications of these findings for the field of machine learning and the potential societal consequences of longer training times.
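To make the local-complexity idea concrete, here is a minimal sketch of one common proxy: counting the distinct ReLU activation patterns that occur among points sampled in a small ball around an anchor point. Each distinct pattern corresponds to a different linear region intersecting the ball. The paper's own estimator is defined differently (via region boundaries crossing a local neighborhood), so treat this as an illustrative approximation under that assumption; `model`, `activation_pattern`, and `local_complexity` are hypothetical names.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy ReLU MLP; any continuous piecewise-linear network would do.
model = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

def activation_pattern(net: nn.Sequential, x: torch.Tensor) -> torch.Tensor:
    """Binary on/off state of every ReLU unit, flattened per sample."""
    signs = []
    h = x
    for layer in net:
        h = layer(h)
        if isinstance(layer, nn.ReLU):
            signs.append((h > 0).flatten(1))  # ReLU output > 0 iff pre-activation > 0
    return torch.cat(signs, dim=1)

@torch.no_grad()
def local_complexity(net, anchor: torch.Tensor, radius: float = 0.1,
                     n_samples: int = 2048) -> int:
    """Number of distinct activation patterns (~ linear regions) seen
    among n_samples points drawn uniformly in a ball around `anchor`."""
    d = anchor.numel()
    noise = torch.randn(n_samples, d)
    noise = noise / noise.norm(dim=1, keepdim=True)        # random directions
    r = radius * torch.rand(n_samples, 1) ** (1.0 / d)     # uniform-in-ball radii
    points = anchor.unsqueeze(0) + r * noise
    patterns = activation_pattern(net, points)
    return torch.unique(patterns, dim=0).shape[0]

# Example: compare complexity near the origin vs. a far-away point. During
# the paper's "region migration" phase, this count would drop near training
# data and rise near the decision boundary.
print(local_complexity(model, torch.zeros(2)))
print(local_complexity(model, torch.full((2,), 5.0)))
```

Tracking this count at fixed anchor points across training checkpoints is one way to observe the phase change and second descent the paper describes.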
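Delayed robustness can likewise be made measurable with a standard adversarial evaluation. Below is a hedged sketch, not the authors' evaluation harness: it computes accuracy under an L-infinity PGD attack, which one could run at successive training checkpoints to watch robustness emerge late in training. `pgd_attack`, `robust_accuracy`, and `loader` are illustrative names, and inputs are assumed to lie in [0, 1].

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8 / 255, alpha=2 / 255, steps=10):
    """L-infinity PGD: random start, then iterated signed-gradient ascent
    steps, each projected back into the eps-ball and the valid input range."""
    x_adv = (x + torch.empty_like(x).uniform_(-eps, eps)).clamp(0, 1)
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad, = torch.autograd.grad(loss, x_adv)
        x_adv = x_adv + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - eps), x + eps).clamp(0, 1)
    return x_adv.detach()

def robust_accuracy(model, loader):
    """Accuracy on PGD-perturbed inputs; evaluate one model checkpoint."""
    model.eval()
    correct = total = 0
    for x, y in loader:
        x_adv = pgd_attack(model, x, y)
        with torch.no_grad():
            correct += (model(x_adv).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

# Plotting robust_accuracy against checkpoint index would reveal delayed
# robustness: the curve rises long after clean test accuracy has plateaued.
```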