The Implicit Bias of Gradient Descent on Separable Data

Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, Nathan Srebro | Submitted 4/18; Published 11/18
This paper investigates the implicit bias of gradient descent (GD) on logistic regression problems with linearly separable data. It shows that GD converges in direction to the maximum-margin (hard-margin SVM) solution, even without explicit regularization. The result generalizes to other monotone decreasing loss functions with an exponential tail, to multi-class problems, and to certain deep networks.

The convergence is very slow: the distance of the normalized predictor to the max-margin direction decreases as $O(1/\log t)$, and in some degenerate cases as $O(\log \log t / \log t)$. This explains why GD continues to improve the logistic or cross-entropy loss, and why the predictor direction keeps changing, even after the training error is zero and the training loss is extremely small, and even if the validation loss increases. The implicit bias is specific to GD and changes with different optimization algorithms, such as ADAM.

The paper analyzes the behavior of GD on linearly separable data with a loss function that is smooth, strictly decreasing, and has an exponential tail. It shows that GD converges to the global minimum (zero loss) even without assuming convexity, and that the direction of the weight vector converges to the maximum-margin solution. The convergence rate is characterized, and the residual error is shown to decrease slowly. The paper also provides a proof sketch for the main result, showing that the exponential tail of the loss function leads to asymptotic convergence to the maximum-margin vector.
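As a quick way to see the phenomenon numerically, the sketch below (not from the paper; the dataset, step size, and iteration schedule are arbitrary illustrative choices) runs plain gradient descent on the average logistic loss over a separable two-dimensional dataset and tracks how the angle between the normalized iterate $w(t)/\|w(t)\|$ and the hard-margin SVM direction shrinks while $\|w(t)\|$ keeps growing.

```python
# Minimal numerical sketch of the claimed phenomenon: on linearly separable
# data, GD on the logistic loss sends ||w(t)|| to infinity while the direction
# w(t)/||w(t)|| drifts slowly toward the hard-margin SVM direction.
# All data and hyperparameters here are illustrative assumptions.
import numpy as np
from scipy.special import expit          # expit(u) = 1 / (1 + exp(-u))
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Two well-separated Gaussian blobs -> linearly separable with high probability.
n = 100
X = np.vstack([rng.normal(loc=+2.0, scale=0.5, size=(n, 2)),
               rng.normal(loc=-2.0, scale=0.5, size=(n, 2))])
y = np.hstack([np.ones(n), -np.ones(n)])

def grad_logistic(w):
    # Gradient of (1/N) * sum_n log(1 + exp(-y_n w^T x_n)).
    margins = y * (X @ w)
    return -(X.T @ (y * expit(-margins))) / len(y)

# Hard-margin SVM direction, approximated with a very large C.
svm = SVC(kernel="linear", C=1e6).fit(X, y)
w_svm = svm.coef_.ravel()
w_svm /= np.linalg.norm(w_svm)

w = np.zeros(2)
eta = 0.1
for t in range(1, 100001):
    w -= eta * grad_logistic(w)
    if t in (100, 1000, 10000, 100000):
        w_dir = w / np.linalg.norm(w)
        angle = np.degrees(np.arccos(np.clip(w_dir @ w_svm, -1.0, 1.0)))
        print(f"t={t:>6}  ||w||={np.linalg.norm(w):8.2f}  angle to SVM dir={angle:.3f} deg")
```

If the paper's characterization holds, $\|w(t)\|$ should grow roughly like $\log t$ and the angle to the SVM direction should shrink only slowly, consistent with the $O(1/\log t)$ rate stated above.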