21 Jan 2015 | Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
The paper studies the loss surfaces of multilayer neural networks by connecting them, under simplifying assumptions, to the Hamiltonian of the spherical spin-glass model. For large, fully decoupled networks, the critical points of the loss form a layered structure: local minima concentrate in a well-defined band just above the global minimum, and the number of local minima outside this band decreases exponentially with network size. Empirical results confirm that both simulated annealing and stochastic gradient descent (SGD) converge to this band, where the local minima found are of high quality. The paper also shows that recovering the global minimum becomes harder as the network grows and is of limited practical value, since it tends to correspond to overfitting. Theoretical results from random matrix theory support these findings: the loss landscape of large networks is layered, with low-index critical points (those with few directions of negative curvature) dominating near the bottom. Experiments on spin-glass models and on neural networks confirm the similarity of their loss landscapes and highlight the importance of escaping saddle points for effective optimization. Overall, the study argues that in large networks poor-quality local minima are rarely encountered, so optimization is more likely to end up at a high-quality solution.
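The spin-glass side of this picture is easy to probe numerically. Below is a minimal sketch (not code from the paper) that runs projected gradient descent on the spherical 3-spin Hamiltonian H(w) = N^{-(p-1)/2} Σ_{i,j,k} X_{ijk} w_i w_j w_k with i.i.d. Gaussian couplings and the constraint ||w||^2 = N, starting from several random initializations. The normalized energies of the minima it finds tend to cluster in a narrow band, which is the qualitative behavior the paper's theory predicts. All parameter values (N, step size, number of restarts) are illustrative choices, not the paper's.

```python
# Sketch: many gradient-descent runs on a spherical 3-spin glass,
# recording the (normalized) energy of the minimum each run reaches.
import numpy as np

rng = np.random.default_rng(0)
N = 50          # number of spins (small so the dense coupling tensor fits in memory)
p = 3           # interaction order, analogous to depth in the paper's mapping
steps = 2000    # gradient steps per run (illustrative)
lr = 0.05       # step size (illustrative)
restarts = 20   # number of random initializations

X = rng.standard_normal((N, N, N))   # i.i.d. Gaussian couplings X_{ijk}
scale = N ** (-(p - 1) / 2)          # normalization of the Hamiltonian

def energy(w):
    return scale * np.einsum('ijk,i,j,k->', X, w, w, w)

def grad(w):
    # Derivative of the trilinear form with respect to each slot of w.
    g = (np.einsum('ijk,j,k->i', X, w, w)
         + np.einsum('ijk,i,k->j', X, w, w)
         + np.einsum('ijk,i,j->k', X, w, w))
    return scale * g

final_energies = []
for _ in range(restarts):
    w = rng.standard_normal(N)
    w *= np.sqrt(N) / np.linalg.norm(w)          # spherical constraint ||w||^2 = N
    for _ in range(steps):
        g = grad(w)
        g -= (g @ w / N) * w                     # project onto the sphere's tangent space
        w -= lr * g
        w *= np.sqrt(N) / np.linalg.norm(w)      # retract back onto the sphere
    final_energies.append(energy(w) / N)         # normalized energy, as in the paper's plots

print('normalized energies of the minima found:')
print(np.round(np.sort(final_energies), 3))
```

Projecting the gradient onto the tangent space and renormalizing after each step keeps the iterates on the sphere, mirroring the spherical constraint of the model; the histogram of the printed energies is the spin-glass analogue of the band of losses that SGD-trained networks land in.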