The paper "Neural Redshift: Random Networks are not Random Functions" explores the inductive biases of neural networks (NNs) that contribute to their generalization capabilities. The authors examine untrained, random-weight networks to understand the architectural biases that influence function representation, independent of gradient descent optimization.
Key findings include:
1. **Inductive Biases in Random Networks**: Even simple MLPs show strong inductive biases: sampling weights uniformly yields a highly non-uniform distribution over functions in terms of complexity. However, the simplicity bias is not an inherent property of NNs; it depends on architectural components such as ReLU activations, residual connections, and layer normalization.
2. **Complexity Measures**: The paper uses three measures—Fourier frequency, polynomial order, and compressibility—to characterize the complexity of the functions implemented by various architectures. Under all three measures, popular architectures are biased towards low-frequency, low-order, compressible functions (a rough illustration of the Fourier-frequency measure is sketched after this list).
3. **ReLU Activations**: ReLU activations play a critical role in the simplicity bias, which they preserve regardless of weight magnitude or depth. Other activations, such as GELU, TanH, and Gaussian, behave differently: the complexity of the functions they represent grows with depth and weight magnitude (see the weight-scale sweep sketched after this list).
4. **Layer Normalization and Residual Connections**: Layer normalization and residual connections significantly reduce the sensitivity of complexity to weight magnitude, aligning with the simplicity bias.
5. **Transformers**: The simplicity bias observed in MLPs also appears in transformers, suggesting that the same mechanisms are at play in sequence models.
6. **Implications**: The paper provides a fresh explanation for the success of deep learning, independent of gradient-based training, and highlights the importance of controlling the inductive biases implemented by trained models.
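As a rough illustration of the first complexity measure (finding 2), the sketch below scores a 1D function by its spectrum-weighted mean frequency: low-frequency functions score low, high-frequency ones score high. This particular estimator is an assumption chosen for illustration, not the paper's exact definition.

```python
import numpy as np

def mean_frequency(ys, dx):
    """Spectrum-weighted mean frequency of a 1D signal: a crude proxy for
    a Fourier-based complexity measure."""
    ys = ys - ys.mean()                     # drop the DC component
    spectrum = np.abs(np.fft.rfft(ys))
    freqs = np.fft.rfftfreq(len(ys), d=dx)
    total = spectrum.sum()
    return 0.0 if total == 0 else float((freqs * spectrum).sum() / total)

# Sanity check: the proxy ranks a slow sine as simpler than a fast one.
xs = np.linspace(0.0, 1.0, 1024, endpoint=False)
dx = xs[1] - xs[0]
print(mean_frequency(np.sin(2 * np.pi * 1 * xs), dx))   # ~1
print(mean_frequency(np.sin(2 * np.pi * 40 * xs), dx))  # ~40
```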
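The sensitivity to weight magnitude contrasted in findings 3 and 4 can be probed with a small experiment along these lines: sample untrained networks at increasing weight scales and score their outputs with the frequency proxy above. The widths, scales, Gaussian initialization, and the proxy itself are assumptions, not the paper's exact protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mlp_output(xs, widths, scale, act):
    """Function values of an untrained MLP with weights ~ N(0, scale^2 / fan_in)."""
    h = xs
    layers = list(zip(widths[:-1], widths[1:]))
    for i, (m, n) in enumerate(layers):
        W = rng.normal(0.0, scale / np.sqrt(m), size=(m, n))
        b = rng.normal(0.0, scale, size=n)
        h = h @ W + b
        if i < len(layers) - 1:   # no activation on the output layer
            h = act(h)
    return h[:, 0]

def mean_frequency(ys, dx):
    """Spectrum-weighted mean frequency, as in the previous sketch."""
    ys = ys - ys.mean()
    spectrum = np.abs(np.fft.rfft(ys))
    freqs = np.fft.rfftfreq(len(ys), d=dx)
    return float((freqs * spectrum).sum() / max(spectrum.sum(), 1e-12))

xs = np.linspace(0.0, 1.0, 1024, endpoint=False).reshape(-1, 1)
widths = [1, 128, 128, 128, 1]

for scale in (0.5, 1.0, 2.0, 4.0):
    relu = mean_frequency(random_mlp_output(xs, widths, scale,
                                            lambda z: np.maximum(z, 0.0)), 1.0 / 1024)
    tanh = mean_frequency(random_mlp_output(xs, widths, scale, np.tanh), 1.0 / 1024)
    # If the paper's finding holds, the ReLU column stays roughly flat while
    # the tanh column grows with the weight scale.
    print(f"scale={scale:4.1f}: relu={relu:7.2f}  tanh={tanh:7.2f}")
```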
The authors conclude that the effectiveness of NNs is not an intrinsic property but results from the match between key architectural choices and the properties of real-world data. They also discuss limitations and open questions, emphasizing the need for further research to fully understand the inductive biases in NNs.