Neural Networks Learn Statistics of Increasing Complexity


13 Feb 2024 | Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern
Neural networks exhibit a bias toward learning low-order statistics of a data distribution before higher-order ones, a phenomenon known as the distributional simplicity bias (DSB). This study provides new evidence for the DSB by showing that networks initially perform well on maximum-entropy distributions that match the low-order statistics of the training data, then lose this ability later in training. The DSB is also extended to discrete domains by proving an equivalence between token n-gram frequencies and moments of embedding vectors.

The study proposes two criteria for testing whether a model relies on low-order statistics: grafting the low-order statistics of one class onto another, and deleting higher-order statistics. Synthetic evaluation data are generated accordingly, using optimal transport maps to edit the means and covariances of images and maximum-entropy sampling to remove higher-order structure. Early in training, image classifiers treat edited images as if they belong to the target class, classifying on the basis of means and covariances alone; as training progresses they become sensitive to higher-order statistics, producing a U-shaped loss curve on the edited data. The Pythia language models show an analogous "double descent": loss on sequences matching the low-order statistics of the data first decreases, then increases, then decreases again as the models learn to infer the data-generating process in context.

Theoretical analysis shows that the expected loss of a neural network can be expressed as a sum over central moments of the data distribution, establishing a direct connection between distributional moments and model loss. This behavior is observed across a range of image and language models.
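The moment expansion behind the theoretical analysis can be sketched as a Taylor expansion of the loss around the data mean $\mu$; the first-order term vanishes because $\mathbb{E}[x - \mu] = 0$, so the expected loss depends on the data only through its central moments. (The precise statement and regularity conditions are in the paper.)

$$
\mathbb{E}_x[\mathcal{L}(x)] \;=\; \mathcal{L}(\mu)
\;+\; \frac{1}{2}\sum_{i,j} \partial_i \partial_j \mathcal{L}(\mu)\,\Sigma_{ij}
\;+\; \frac{1}{3!}\sum_{i,j,k} \partial_i \partial_j \partial_k \mathcal{L}(\mu)\, M^{(3)}_{ijk}
\;+\; \cdots
$$

where $\Sigma$ is the covariance matrix and $M^{(k)}$ the $k$-th central moment tensor. Truncating the series after the $\Sigma$ term gives a model whose loss is determined entirely by means and covariances, matching the behavior observed early in training.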
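The claimed equivalence between n-gram frequencies and embedding moments is easiest to see with one-hot embeddings: the first moment of the embedded sequence is exactly the unigram frequency vector, and the cross-moment of consecutive embeddings is the bigram frequency matrix. A minimal sketch (the toy sequence and the one-hot choice are illustrative, not taken from the paper):

```python
import numpy as np

# Toy vocabulary of 3 tokens and a short token sequence.
tokens = np.array([0, 1, 1, 2, 0, 1])
V = 3

# One-hot embed each token: row t is the embedding of token t.
onehot = np.eye(V)[tokens]                          # shape (T, V)

# First moment of the embeddings = empirical unigram frequencies.
unigram = onehot.mean(axis=0)                       # unigram[v] = count(v) / T

# Second cross-moment of consecutive embeddings = bigram frequencies.
pairs = onehot[:-1, :, None] * onehot[1:, None, :]  # shape (T-1, V, V)
bigram = pairs.mean(axis=0)                         # bigram[a, b] = count(a -> b) / (T-1)
```

Higher-order n-grams correspond in the same way to higher cross-moments over longer windows of embeddings.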
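Grafting the mean and covariance of one class onto another can be done with the closed-form optimal transport map between Gaussians. The sketch below uses the standard affine (Bures-Wasserstein) map between N(mu_s, cov_s) and N(mu_t, cov_t); the paper's exact editing procedure may differ in detail:

```python
import numpy as np

def sqrtm_psd(m):
    """Symmetric square root of a PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(m)
    return vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t):
    """Affine OT map T(x) = mu_t + A(x - mu_s) taking N(mu_s, cov_s) to N(mu_t, cov_t)."""
    s_half = sqrtm_psd(cov_s)
    s_half_inv = np.linalg.inv(s_half)
    a = s_half_inv @ sqrtm_psd(s_half @ cov_t @ s_half) @ s_half_inv
    return lambda x: mu_t + (x - mu_s) @ a.T

# Graft the low-order statistics of a "target class" onto "source" samples.
rng = np.random.default_rng(0)
mu_s, cov_s = np.zeros(2), np.array([[1.0, 0.5], [0.5, 1.0]])
mu_t, cov_t = np.array([3.0, -1.0]), np.array([[2.0, -0.3], [-0.3, 0.5]])
x = rng.multivariate_normal(mu_s, cov_s, size=100_000)
y = gaussian_ot_map(mu_s, cov_s, mu_t, cov_t)(x)
# y now carries (approximately) the target mean and covariance,
# while higher-order structure of x is transported along.
```

Applied pixelwise to flattened images, a map of this form edits exactly the first two moments of a class, which is what the grafting criterion requires.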
The study contributes to the understanding of the DSB, refining its account of early learning dynamics and providing a foundation for further research.