Early alignment in two-layer networks training is a two-edged sword


2024 | Etienne Boursier, Nicolas Flammarion
This paper studies the phenomenon of early alignment in the training of two-layer neural networks, where neurons align towards key directions during the early stage of training. It provides a general and quantitative description of this early alignment phase, originally introduced by Maennel et al. (2018): for small initialisations of one-hidden-layer ReLU networks, the early training dynamics drive the neurons to align towards a small number of key directions. This alignment induces a sparse representation of the network, which is directly related to the implicit bias of gradient flow at convergence.

However, this sparsity-inducing alignment comes at the expense of difficulties in minimising the training objective. The paper gives a simple data example on which overparameterised networks fail to converge towards global minima and instead converge to a spurious stationary point, showing that early alignment can lead to spurious convergence to local minima. It also discusses the importance of omnidirectionality of the weights for achieving convergence to global minima.
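To make the alignment phase concrete, here is a minimal, illustrative sketch, not the paper's experimental setup: a one-hidden-layer ReLU network on toy data, initialised at a small scale `lam`, trained with an Euler discretisation of gradient flow and stopped while the weight norms are still small. All sizes, thresholds, and the learning rate are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data on the unit sphere in R^2; all sizes here are
# arbitrary illustrative choices, not taken from the paper.
n, d, m = 8, 2, 50               # samples, input dimension, hidden width
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = rng.choice([-1.0, 1.0], size=n)

# Two-layer ReLU network f(x) = sum_j a_j * relu(<w_j, x>),
# initialised at a small scale lam, the regime where early alignment occurs.
lam = 1e-4
W = lam * rng.normal(size=(m, d))
a = lam * rng.choice([-1.0, 1.0], size=m)

def grads(W, a):
    """Gradients of the squared loss L = (1/2n) * sum_i (f(x_i) - y_i)^2."""
    pre = X @ W.T                         # (n, m) pre-activations
    act = np.maximum(pre, 0.0)            # ReLU activations
    res = act @ a - y                     # residuals f(x_i) - y_i
    dact = (pre > 0).astype(float)        # ReLU derivative
    gW = ((res[:, None] * a[None, :]) * dact).T @ X / n
    ga = act.T @ res / n
    return gW, ga

# Euler discretisation of gradient flow; stop once the average weight norm
# has grown 100-fold yet remains tiny, i.e. near the end of the early phase.
lr = 0.1
for _ in range(1_000_000):
    if np.linalg.norm(W, axis=1).mean() >= 100 * lam:
        break
    gW, ga = grads(W, a)
    W -= lr * gW
    a -= lr * ga

# Neuron directions w_j / |w_j|: entries of the cosine matrix close to 1
# indicate neurons that have aligned onto the same key direction.
dirs = W / np.linalg.norm(W, axis=1, keepdims=True)
pos = dirs[a > 0]
print(np.round(pos[:5] @ pos[:5].T, 3))
```

The printed pairwise cosines among neurons with positive output weights should show blocks of entries close to 1: the neurons rotate onto a few shared directions while their norms barely grow, which is exactly the sparse representation described above.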
The paper concludes that the initialisation scale is a trade-off: a small scale leads to a stronger implicit bias towards a low-rank hidden weight matrix, but at the risk of converging towards spurious stationary points, while a large scale can hurt generalisation on new, unseen data. An intermediate scale may therefore be the best compromise, yielding both convergence to global minima and small generalisation error.
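This trade-off can also be probed numerically. The sketch below, which reuses `X`, `y`, `m`, `d`, and `grads(...)` from the previous one, trains the same network for a fixed step budget at several initialisation scales and reports the final training loss. On a dataset of the kind constructed in the paper, the smallest scales would stall at a non-zero loss (a spurious stationary point); on generic toy data like this, all scales may well reach near-zero loss. The learning rate, step budget, and scales are arbitrary illustrative choices.

```python
# Sweep the initialisation scale lam and report the final training loss,
# reusing X, y, m, d and grads(...) from the previous sketch.
def train(lam, seed=1, lr=0.01, steps=100_000):
    rng = np.random.default_rng(seed)
    W = lam * rng.normal(size=(m, d))
    a = lam * rng.choice([-1.0, 1.0], size=m)
    for _ in range(steps):
        gW, ga = grads(W, a)
        W -= lr * gW
        a -= lr * ga
    res = np.maximum(X @ W.T, 0.0) @ a - y   # final residuals
    return 0.5 * np.mean(res ** 2)

for lam in (1e-6, 1e-3, 1.0):
    print(f"lam={lam:g}: final training loss = {train(lam):.4f}")
```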