Grokfast: Accelerated Grokking by Amplifying Slow Gradients

5 Jun 2024 | Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee
**Authors:** Jaerin Lee, Bong Gyun Kang, Kihoon Kim, Kyoung Mu Lee
**Affiliation:** ASRI, Department of ECE, Interdisciplinary Program in Artificial Intelligence, Seoul National University, Korea

**Abstract:** "Grokking" is a puzzling and under-studied phenomenon in machine learning in which a model achieves generalization long after overfitting its training data. This paper aims to accelerate that delayed generalization. Treating each parameter's trajectory under gradient descent as a signal, the authors decompose it into a fast-varying, overfitting-related component and a slow-varying, generalization-inducing component, and propose GROKFAST, an algorithm that amplifies the slow-varying component of the gradients. Experiments show that GROKFAST accelerates grokking by more than 50 times across diverse tasks involving images, languages, and graphs. The method is simple to implement, applicable in most machine learning frameworks, and practical to adopt.

**Key Contributions:**
1. **Analysis of grokking:** The paper casts grokking, in which models generalize long after overfitting, as a signal-processing problem over parameter trajectories.
2. **GROKFAST algorithm:** A simple modification to existing optimizers that amplifies the slow-varying components of gradients, accelerating grokking.
3. **Empirical validation:** Experiments across varied tasks (algorithmic data, MNIST, QM9, IMDb) show grokking accelerated by more than 50 times.
4. **Practical implementation:** The code for GROKFAST is publicly available, making the method easy to adopt.

**Methodology:**
- **Signal decomposition:** Each parameter is treated as a random signal over training iterations; spectral decomposition separates its slow-varying and fast-varying components.
- **Low-pass filtering:** A low-pass filter is applied to the gradients before they reach the optimizer, and the resulting slow-varying component is amplified; a sketch of this filter appears below.
- **Experiments:** Across the tasks above, GROKFAST reduces the number of iterations needed to reach generalization by a large margin.
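A natural implementation of the low-pass filter is an exponential moving average (EMA) over per-parameter gradients: maintain h_t = α·h_{t-1} + (1 − α)·g_t and replace each gradient with g_t + λ·h_t before the optimizer step, so the slow-varying component is boosted by a factor controlled by λ. The PyTorch sketch below illustrates this idea; the function name `gradfilter_ema` and the defaults `alpha=0.98` and `lamb=2.0` are illustrative assumptions, not values quoted from this summary.

```python
from typing import Dict, Optional

import torch
import torch.nn as nn


def gradfilter_ema(
    model: nn.Module,
    ema_grads: Optional[Dict[str, torch.Tensor]] = None,
    alpha: float = 0.98,  # EMA decay: higher = slower filter (assumed default)
    lamb: float = 2.0,    # amplification of the slow component (assumed default)
) -> Dict[str, torch.Tensor]:
    """Amplify the slow-varying gradient component with a per-parameter EMA.

    Call after loss.backward() and before optimizer.step(). The EMA acts as a
    low-pass filter over the gradient signal; adding lamb * EMA back to the raw
    gradient boosts its slow-varying (generalization-inducing) part.
    """
    if ema_grads is None:
        ema_grads = {}
    for name, p in model.named_parameters():
        if not p.requires_grad or p.grad is None:
            continue
        if name not in ema_grads:  # initialize the filter state on first sight
            ema_grads[name] = p.grad.detach().clone()
        else:
            # h_t = alpha * h_{t-1} + (1 - alpha) * g_t
            ema_grads[name].mul_(alpha).add_(p.grad.detach(), alpha=1 - alpha)
        # g_t <- g_t + lamb * h_t
        p.grad.add_(ema_grads[name], alpha=lamb)
    return ema_grads


if __name__ == "__main__":
    # Tiny self-contained demo: a linear model on synthetic data.
    torch.manual_seed(0)
    model = nn.Linear(10, 1)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
    x, y = torch.randn(64, 10), torch.randn(64, 1)
    ema_grads = None
    for _ in range(100):
        optimizer.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        ema_grads = gradfilter_ema(model, ema_grads)  # filter, then step
        optimizer.step()
```

The same filtering idea could also be realized with a finite moving-average window over recent gradients; the EMA form is shown here because it requires only one extra tensor per parameter.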
**Conclusion:** GROKFAST provides a practical tool for accelerating the grokking phenomenon, making it more accessible to researchers and practitioners. The method's simplicity and effectiveness across diverse tasks highlight its potential for broader applications in machine learning.