The Implicit Bias of Gradient Descent on Separable Data

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro. Submitted 4/18; published 11/18.
This paper examines the implicit bias of gradient descent on unregularized logistic regression with homogeneous linear predictors over linearly separable datasets. The authors show that the predictor converges in direction to the max-margin (hard-margin SVM) solution, even without explicit regularization. This behavior generalizes to other monotone decreasing loss functions with an infimum at infinity, to multi-class problems, and, under certain conditions, to training a single weight layer in a deep network. Convergence to the max-margin direction is slow: only logarithmic in the convergence of the loss itself. This explains why optimizing the logistic or cross-entropy loss continues to improve the predictor even after the training error reaches zero and the training loss is extremely small.

The methodology also aids in understanding implicit regularization in more complex models and under other optimization methods. The paper provides rigorous proofs and characterizes the convergence rates, highlighting the sharp contrast between the slow convergence of the normalized weight vector to the max-margin solution and the rapid decrease of the training loss.
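As a rough illustration of the phenomenon the paper analyzes, the sketch below (the toy dataset, step size, and iteration count are my own choices, not taken from the paper) runs plain gradient descent on the unregularized logistic loss over a separable two-cluster dataset. The norm of the weight vector keeps growing without bound, while the normalized direction w/||w|| settles toward the max-margin separator:

```python
import numpy as np

# Hypothetical toy setup: two linearly separable Gaussian clusters,
# labels y in {-1, +1}. Values here are illustrative choices only.
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[+2.0, +2.0], scale=0.3, size=(20, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.3, size=(20, 2))
X = np.vstack([X_pos, X_neg])
y = np.concatenate([np.ones(20), -np.ones(20)])

def logistic_loss_grad(w, X, y):
    """Gradient of (1/n) * sum_i log(1 + exp(-y_i <w, x_i>))."""
    margins = y * (X @ w)
    # -y * sigmoid(-margin), written via tanh for numerical stability
    coeff = -y * 0.5 * (1.0 - np.tanh(margins / 2.0))
    return (X * coeff[:, None]).mean(axis=0)

# Plain gradient descent, no regularization, no stopping at zero error.
w = np.zeros(2)
lr = 0.5
for t in range(20000):
    w -= lr * logistic_loss_grad(w, X, y)

# ||w|| diverges (roughly logarithmically in t), while the direction
# w / ||w|| stabilizes toward the max-margin separator of the data.
print("final ||w||   =", np.linalg.norm(w))
print("final w/||w|| =", w / np.linalg.norm(w))
```

By symmetry of the two clusters, the direction here ends up near (1, 1)/√2; the key point is that long after all points are correctly classified (zero training error), further steps keep enlarging the margins and refining the direction, which is the implicit-bias effect the paper proves.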