DEEP DOUBLE DESCENT: WHERE BIGGER MODELS AND MORE DATA HURT

4 Dec 2019 | Preetum Nakkiran*, Gal Kaplun†, Yamini Bansal†, Tristan Yang, Boaz Barak, Ilya Sutskever
This paper presents empirical evidence of the "double descent" phenomenon in deep learning, in which increasing model size, training time, or the amount of training data can initially hurt test performance before eventually improving it. The authors define a new complexity measure, effective model complexity (EMC): the maximum number of samples on which a training procedure achieves near-zero training error. Using EMC, they show that double descent occurs both as a function of model size ("model-wise double descent") and of training time ("epoch-wise double descent"): test error first decreases, then increases as the model approaches the interpolation threshold, and then decreases again in the over-parameterized regime. The phenomenon is demonstrated across architectures (CNNs, ResNets, Transformers), optimizers (SGD, Adam), and data distributions. The authors also show that increasing the number of training samples can sometimes lead to worse test performance, a phenomenon they call "sample-wise non-monotonicity," contradicting the conventional wisdom that more data is always better.
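For reference, the paper defines EMC with respect to a data distribution D and a tolerance ε as the largest sample size that a training procedure T still fits to average train error at most ε. In the notation below, Error_S(M) denotes the mean error of model M on the training set S:

```latex
% Effective Model Complexity of a training procedure T, w.r.t. distribution D and tolerance eps:
% the largest n such that T, given n i.i.d. samples from D, reaches expected train error <= eps.
\mathrm{EMC}_{\mathcal{D},\epsilon}(\mathcal{T}) :=
  \max\left\{ n \;\middle|\; \mathbb{E}_{S \sim \mathcal{D}^{n}}\!\left[ \mathrm{Error}_{S}\big(\mathcal{T}(S)\big) \right] \le \epsilon \right\}
```

The interpolation threshold is then the point where EMC roughly equals the number of training samples, and the test-error peak described above occurs in a critical interval around this point.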
The paper thus challenges the conventional wisdom that larger models and more data are always better: in some regimes, increasing model size, training time, or dataset size can hurt performance. The authors propose a generalized double descent hypothesis: when EMC is sufficiently below the number of training samples, test error follows the classical U-shaped bias-variance curve; when EMC is close to the number of samples (the interpolation threshold), test error can increase sharply; and once EMC is sufficiently larger than the number of samples, increasing complexity decreases test error again. They also note that double descent is not unique to deep neural networks, having been observed in other settings such as linear regression with random Fourier features. The paper provides extensive experimental results across datasets, architectures, and training procedures, showing that double descent is robust to changes in model size, training time, data augmentation, and label noise (with label noise making the test-error peak more pronounced). The authors conclude that double descent offers new insight into the behavior of deep learning models and underscores the importance of considering model complexity, training time, and sample size together when reasoning about test performance.
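As a concrete illustration of the random-Fourier-features case, the following is a minimal numpy sketch (not the paper's code): it fits minimum-norm least-squares solutions on random Fourier features of a toy 1-D regression task and sweeps the number of features past the number of training samples. The target function, noise level, kernel bandwidth, and feature counts are illustrative assumptions, not values from the paper.

```python
import numpy as np

# Minimal sketch of model-wise double descent with random Fourier features.
# Not the paper's code; all task parameters below are illustrative assumptions.
rng = np.random.default_rng(0)
n_train, n_test, sigma, noise = 100, 2000, 0.2, 0.2

def target(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(-1.0, 1.0, n_train)
y_tr = target(x_tr) + noise * rng.standard_normal(n_train)
x_te = rng.uniform(-1.0, 1.0, n_test)
y_te = target(x_te)

def rff(x, W, b):
    # Random Fourier features approximating an RBF kernel of bandwidth sigma.
    return np.sqrt(2.0 / W.shape[0]) * np.cos(np.outer(x, W) + b)

for n_feat in [10, 30, 60, 90, 100, 110, 150, 300, 1000, 5000]:
    W = rng.standard_normal(n_feat) / sigma       # random frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, n_feat)     # random phases
    Phi_tr, Phi_te = rff(x_tr, W, b), rff(x_te, W, b)
    w = np.linalg.pinv(Phi_tr) @ y_tr             # minimum-norm least squares
    test_mse = np.mean((Phi_te @ w - y_te) ** 2)
    print(f"features={n_feat:5d}  test MSE={test_mse:.3f}")
```

With minimum-norm (pseudoinverse) solutions, test error typically peaks when the number of features is close to the number of training samples and decreases again as the feature count grows well past it, mirroring the model-wise double descent curve; the exact shape of the curve depends on the task, noise, and bandwidth choices above.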