Friendly Sharpness-Aware Minimization


19 Mar 2024 | Tao Li¹, Pan Zhou², Zhengbao He¹, Xinwen Cheng¹, Xiaolin Huang¹
This paper introduces Friendly Sharpness-Aware Minimization (F-SAM), an improved variant of Sharpness-Aware Minimization (SAM), which enhances generalization performance in deep learning. SAM aims to minimize both training loss and loss sharpness to find flat minima that generalize better. However, the mechanisms behind SAM's generalization improvements remain unclear. The authors investigate the core components of SAM and find that the batch-specific stochastic gradient noise in the minibatch gradient plays a crucial role in its generalization performance. The full gradient component, on the other hand, can degrade generalization.

F-SAM addresses this by removing the full gradient component and leveraging the stochastic gradient noise for improved generalization. It approximates the full gradient using an exponentially moving average (EMA) of historical stochastic gradients, reducing computational overhead. Theoretical analysis validates the EMA approximation and proves the convergence of F-SAM on non-convex problems. Extensive experiments show that F-SAM outperforms vanilla SAM in terms of generalization, robustness, and training efficiency. F-SAM is less sensitive to the perturbation radius and performs well across different batch sizes and datasets. The results demonstrate that F-SAM significantly improves the generalization performance of SAM while maintaining training efficiency.
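To make the mechanism described above concrete, the following NumPy snippet is a minimal sketch of an F-SAM-style update as summarized here: an exponentially moving average (EMA) of past minibatch gradients serves as the full-gradient estimate, that component is removed before the SAM ascent step, and the descent uses the gradient at the perturbed point. The function names, hyperparameter values (rho, beta, lam, lr), and the toy quadratic objective are illustrative assumptions, not the authors' reference implementation.

```python
import numpy as np

def fsam_step(w, grad_fn, ema_grad, rho=0.05, beta=0.9, lam=0.6, lr=0.1, eps=1e-12):
    """One F-SAM-style update (illustrative sketch; hyperparameters are assumed,
    not the paper's defaults).

    w        -- current parameters (1-D array)
    grad_fn  -- returns a stochastic (minibatch) gradient at a given point
    ema_grad -- EMA of past stochastic gradients, used as the full-gradient estimate
    """
    g = grad_fn(w)

    # Update the EMA estimate of the full gradient from historical stochastic gradients.
    ema_grad = beta * ema_grad + (1.0 - beta) * g

    # Remove a scaled copy of the estimated full-gradient component,
    # keeping mostly the batch-specific gradient noise.
    d = g - lam * ema_grad

    # SAM-style ascent: perturb along the normalized remaining direction with radius rho.
    w_adv = w + rho * d / (np.linalg.norm(d) + eps)

    # Descent step using the gradient evaluated at the perturbed point.
    w_new = w - lr * grad_fn(w_adv)
    return w_new, ema_grad


# Toy usage: a quadratic loss 0.5 * ||w||^2 whose gradient is corrupted by noise
# to mimic minibatch sampling.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.normal(size=x.shape)

w = np.array([2.0, -1.5])
ema = np.zeros_like(w)
for _ in range(200):
    w, ema = fsam_step(w, noisy_grad, ema)
print(w)  # approaches the minimum at the origin, up to the gradient noise and perturbation radius
```

Swapping in vanilla SAM would amount to setting lam to zero, so the perturbation follows the raw minibatch gradient; the sketch isolates the single change F-SAM makes to the ascent direction.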