13 Feb 2024 | Nora Belrose, Quintin Pope, Lucia Quirke, Alex Mallen, Xiaoli Fern
The paper explores the *distributional simplicity bias* (DSB), which posits that neural networks initially learn low-order moments of the data distribution before moving on to higher-order correlations. The authors provide new evidence for the DSB by showing that networks perform well on maximum-entropy distributions with low-order statistics matching the training set early in training but lose this ability later. They extend the DSB to discrete domains by proving an equivalence between token $n$-gram frequencies and the moments of embedding vectors. Additionally, they use optimal transport methods to edit the low-order statistics of images and show that early-training networks treat the edited images as if they were from the target class (see the sketch below). The paper also evaluates Pythia autoregressive language models on synthetic data sampled from unigram and bigram models, observing a "double descent" phenomenon where models initially mirror the U-shaped scaling observed in image classifiers but later achieve even lower loss through in-context learning. The findings strengthen the case for the DSB and provide insights into how it influences early learning dynamics.
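
The summary does not spell out the exact optimal-transport editing procedure; a minimal sketch, assuming the standard closed-form Monge map between Gaussian (first- and second-moment) approximations of two classes, might look like the following. The class names, dimensions, and data here are illustrative placeholders, not the paper's actual setup.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_ot_map(mu_s, cov_s, mu_t, cov_t):
    """Closed-form optimal transport map between two Gaussians.

    Maps x -> mu_t + A @ (x - mu_s) with
    A = cov_s^{-1/2} (cov_s^{1/2} cov_t cov_s^{1/2})^{1/2} cov_s^{-1/2},
    so a transported image's mean and covariance (its low-order statistics)
    match those of the target class.
    """
    root_s = np.real(sqrtm(cov_s))
    inv_root_s = np.linalg.inv(root_s)
    A = inv_root_s @ np.real(sqrtm(root_s @ cov_t @ root_s)) @ inv_root_s
    return lambda x: mu_t + A @ (x - mu_s)

# Hypothetical usage with stand-in data (8x8 grayscale images, flattened).
d = 64
rng = np.random.default_rng(0)
cats = rng.normal(size=(500, d))          # placeholder "source class" images
dogs = rng.normal(size=(500, d)) + 0.2    # placeholder "target class" images
mu_s, cov_s = cats.mean(0), np.cov(cats, rowvar=False) + 1e-4 * np.eye(d)
mu_t, cov_t = dogs.mean(0), np.cov(dogs, rowvar=False) + 1e-4 * np.eye(d)
edit = gaussian_ot_map(mu_s, cov_s, mu_t, cov_t)
edited = edit(cats[0])  # early in training, a classifier tends to label this "dog"
```

The point of the sketch is only to show how an image's first and second moments can be shifted onto another class without touching higher-order structure, which is what lets the paper probe how much of an early-training network's behavior those low-order statistics explain.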