Early alignment in two-layer networks training is a two-edged sword

2024 | Etienne Boursier, Nicolas Flammarion
The paper studies the early alignment phase in the training of two-layer neural networks, focusing on small initializations and ReLU activations. The authors give a quantitative description of this phase, which is central to understanding the implicit bias of gradient descent: during early alignment, neurons align towards a few key directions, yielding a sparse representation of the network. This sparsity, while beneficial for feature learning, can also make the training loss hard to minimize, especially in overparameterized models. A toy simulation of this dynamic is sketched below.

The paper contributes by:

1. Characterizing the early alignment phenomenon for small initializations and one-hidden-layer ReLU networks, providing a rigorous analysis of the alignment process.
2. Analyzing the complete training dynamics on a specific data example in which gradient flow converges to a spurious stationary point, highlighting the importance of the weights' omnidirectionality for reaching global minima.

Key findings include:

- The early alignment phase is essential for the implicit bias of gradient descent, leading to a sparse representation of the network.
- Small initializations can cause overparameterized models to fail to converge to global minima, even with infinite training time and infinitely many neurons.
- The loss of omnidirectionality of the weights during the early alignment phase can cause convergence to spurious stationary points.

The paper discusses the implications of these findings for understanding the implicit bias of gradient descent and the trade-offs between different initialization scales.
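To make the alignment mechanism concrete, here is a minimal NumPy sketch, not taken from the paper: it trains a two-layer ReLU network from a very small initialization on hypothetical toy data and then measures how strongly the hidden-neuron directions have clustered. All choices below (the teacher labels, the width, the initialization scale, the step size and step count) are illustrative assumptions and may need tuning; the point is only to show the qualitative effect that neuron directions concentrate on a few key directions while their norms stay small.

```python
import numpy as np

# Toy sketch of the early alignment phase (illustrative, not the paper's setup):
# network f(x) = sum_j a_j * relu(w_j . x), trained by gradient descent on
# 0.5 * mean squared error from a small initialization of scale `scale`.

rng = np.random.default_rng(0)
n, d, m = 20, 2, 50                        # samples, input dim, hidden width (assumptions)
X = rng.standard_normal((n, d))
y = np.sign(X @ np.array([1.0, 0.5]))      # hypothetical teacher labels

scale = 1e-4                               # small initialization scale
W = scale * rng.standard_normal((m, d))    # hidden weights w_j
a = scale * rng.standard_normal(m)         # output weights a_j

lr, steps = 0.05, 2000                     # illustrative hyperparameters
for _ in range(steps):
    pre = X @ W.T                          # (n, m) pre-activations
    act = np.maximum(pre, 0.0)             # ReLU
    err = (act @ a - y) / n                # d(loss)/d(prediction)
    grad_a = act.T @ err                   # gradient w.r.t. output weights
    grad_W = ((err[:, None] * a) * (pre > 0)).T @ X  # gradient w.r.t. hidden weights
    a -= lr * grad_a
    W -= lr * grad_W

# During early alignment the norms ||w_j|| remain tiny while the directions
# w_j / ||w_j|| move, so pairwise cosine similarities concentrate near +/-1:
# many neurons collapse onto a few key directions (a sparse representation).
dirs = W / (np.linalg.norm(W, axis=1, keepdims=True) + 1e-12)
cos = dirs @ dirs.T
print("fraction of |cosine similarity| > 0.99:", np.mean(np.abs(cos) > 0.99))
```

In this sketch, a large fraction of near-unit cosine similarities indicates that the directions have collapsed onto a handful of rays, which is the sparsity the paper identifies as both the source of the implicit bias and the reason gradient flow can get trapped at spurious stationary points once omnidirectionality is lost.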