The Loss Surfaces of Multilayer Networks

21 Jan 2015 | Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun
This paper explores the connection between the loss function of a fully-connected feed-forward neural network and the Hamiltonian of a spherical spin-glass model under three assumptions: variable independence, redundancy in the network parametrization, and uniformity. The authors show that for large decoupled networks the lowest critical values of the random loss function form a layered structure and lie in a well-defined band lower-bounded by the global minimum, and that the number of local minima outside this band decreases exponentially with network size. Empirical results verify that real networks, despite their strong dependencies, behave similarly to the mathematical model and to computer simulations. The paper conjectures that simulated annealing and stochastic gradient descent (SGD) converge to the band of low critical points, and that all critical points found there are high-quality local minima. It also notes that recovering the global minimum becomes harder as the network grows and is often irrelevant in practice because the global minimum tends to overfit. The paper provides theoretical insight into the optimization of large neural networks and discusses directions for future research.
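For reference, the object the network loss is mapped to under these assumptions is the Hamiltonian of the Λ-dimensional spherical spin-glass model of degree H, with H playing the role of the network depth (written here in the standard H-spin notation, which may differ slightly from the paper's):

L_{\Lambda,H}(w) = \frac{1}{\Lambda^{(H-1)/2}} \sum_{i_1,\dots,i_H=1}^{\Lambda} X_{i_1 \dots i_H}\, w_{i_1} w_{i_2} \cdots w_{i_H}, \qquad \text{subject to } \frac{1}{\Lambda}\sum_{i=1}^{\Lambda} w_i^2 = 1,

where the couplings X_{i_1 ... i_H} are i.i.d. standard Gaussian random variables.

The claim that descent methods land in a narrow band of low critical values can be illustrated numerically on the spin-glass model itself. The sketch below is a minimal illustration under the assumptions above, not the paper's experimental setup; the function names and parameter choices (Lam, steps, lr) are ours. It minimizes a degree-3 (H = 3) spherical spin-glass Hamiltonian by projected gradient descent from several random initializations and reports how tightly the final loss values concentrate.

import numpy as np

rng = np.random.default_rng(0)

def spin_glass_loss(w, X, Lam, H=3):
    # L(w) = Lam^{-(H-1)/2} * sum_{ijk} X_{ijk} w_i w_j w_k
    return np.einsum('ijk,i,j,k->', X, w, w, w) / Lam ** ((H - 1) / 2)

def spin_glass_grad(w, X, Lam, H=3):
    # Gradient of the degree-3 Hamiltonian with respect to w.
    g = (np.einsum('ijk,j,k->i', X, w, w)
         + np.einsum('ijk,i,k->j', X, w, w)
         + np.einsum('ijk,i,j->k', X, w, w))
    return g / Lam ** ((H - 1) / 2)

def run_descent(X, Lam, steps=3000, lr=5e-3):
    # Projected gradient descent on the sphere ||w||^2 = Lam.
    w = rng.standard_normal(Lam)
    w *= np.sqrt(Lam) / np.linalg.norm(w)
    for _ in range(steps):
        w = w - lr * spin_glass_grad(w, X, Lam)
        w *= np.sqrt(Lam) / np.linalg.norm(w)   # project back onto the sphere
    return spin_glass_loss(w, X, Lam)

Lam = 25
X = rng.standard_normal((Lam, Lam, Lam))        # i.i.d. Gaussian couplings
finals = np.array([run_descent(X, Lam) for _ in range(20)])
print("final loss per spin: mean %.3f, std %.3f"
      % ((finals / Lam).mean(), (finals / Lam).std()))

With i.i.d. Gaussian couplings, repeated runs should typically finish at loss-per-spin values clustered in a narrow band, mirroring the layered structure of low critical values described above.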