Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers


7 Feb 2024 | Abhimanyu Rajeshkumar Bambhaniya *1, Amir Yazdanbakhsh *2, Suvinay Subramanian 3, Sheng-Chun Kao 3, Shivani Agrawal 3, Utku Evci 2, Tushar Krishna 1
This paper addresses the challenge of training large transformers with high sparsity ratios, particularly in the N:M structured sparsity regime. Existing methods, such as SR-STE, struggle to maintain model quality at high sparsity levels (>80%). The authors identify that this degradation is driven primarily by elevated noise in the gradient magnitudes flowing to pruned elements. To mitigate this, they propose a class of decaying-based sparse training recipes that progressively restrict the flow of gradients toward pruned elements. These recipes improve model quality by up to 2% in vision models and 5% in language models at high sparsity regimes, and they reach quality comparable to SR-STE while using roughly 30% fewer training FLOPs. The effectiveness of the proposed methods is demonstrated on a range of transformer-based models, including ViT, SwinV2, and T5X, across tasks such as image classification, language understanding, and translation. The source code for the experiments is available on GitHub.
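To make the core idea concrete, below is a minimal PyTorch-style sketch (not the authors' released implementation) of N:M masking combined with a decayed straight-through gradient. The mask keeps the N largest-magnitude weights in every group of M, and a decay factor `beta`, assumed here to be annealed from 1 toward 0 over training, scales how much gradient reaches the pruned elements, progressively restricting gradient flow as the paper describes. The names `nm_mask`, `DecayedMaskedLinearFn`, and `beta` are illustrative, not from the paper's codebase.

```python
import torch

def nm_mask(weight, n=2, m=4):
    # Binary mask that keeps the n largest-magnitude weights in each
    # contiguous group of m along the flattened weight (N:M structured sparsity).
    # Assumes weight.numel() is divisible by m.
    w = weight.reshape(-1, m)
    idx = w.abs().topk(n, dim=1).indices
    mask = torch.zeros_like(w).scatter_(1, idx, 1.0)
    return mask.reshape(weight.shape)

class DecayedMaskedLinearFn(torch.autograd.Function):
    # Forward uses the pruned (masked) weights; backward lets only a fraction
    # `beta` of the gradient reach the pruned elements. Annealing beta from
    # 1 to 0 over training progressively restricts gradient flow to pruned weights.
    @staticmethod
    def forward(ctx, x, weight, mask, beta):
        ctx.save_for_backward(x, weight, mask)
        ctx.beta = beta
        return x @ (weight * mask).t()

    @staticmethod
    def backward(ctx, grad_out):
        x, weight, mask = ctx.saved_tensors
        grad_x = grad_out @ (weight * mask)
        grad_w_dense = grad_out.t() @ x
        # Kept weights receive the full gradient; pruned weights receive beta * gradient.
        grad_w = grad_w_dense * (mask + ctx.beta * (1.0 - mask))
        return grad_x, grad_w, None, None

# Example step: recompute the 2:4 mask from current magnitudes, then apply
# the masked matmul with the current decay factor.
x = torch.randn(8, 16)
weight = torch.randn(32, 16, requires_grad=True)
mask = nm_mask(weight.detach(), n=2, m=4)
beta = 0.5  # in practice, scheduled to decay toward 0 over training steps
out = DecayedMaskedLinearFn.apply(x, weight, mask, beta)
out.sum().backward()
```

In practice `beta` would follow a step or exponential decay schedule so that, early in training, pruned weights still receive learning signal (allowing the mask to be revised), while late in training the updates approach fully masked sparse training.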