This paper introduces a novel approach to training transformer models with N:M structured sparsity, addressing the limitations of existing sparse training methods in high-sparsity regimes. The authors argue that traditional methods fail to maintain model quality at high sparsity levels because of increased gradient noise. To mitigate this, they propose decaying-based training recipes that progressively restrict gradient flow to pruned elements, improving model performance by up to 2% in vision models and 5% in language models while reducing training FLOPs by more than 30% compared to state-of-the-art sparse training recipes. The two decaying-based approaches, Mask Decay Gradient Flow (MDGF) and Structure Decay Gradient Flow (SDGF), are evaluated on several transformer-based models, including ViT-Base, SwinV2-Base, and T5X-Base, and show superior performance on both image classification and language understanding tasks. The results indicate that the proposed methods achieve higher accuracy at high sparsity levels while maintaining efficiency. The study also highlights the importance of gradient flow in sparse training and provides insights into the trade-offs between model accuracy and computational cost. The source code is available on GitHub.
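
To make the mask-decay idea concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: the helper names `nm_mask` and `mask_decay_forward`, the 2:4 sparsity pattern, and the linear schedule for the decay factor beta are illustrative assumptions rather than details taken from the paper. The point it illustrates is that pruned weights are scaled by a factor that shrinks over training instead of being zeroed outright, so gradients to those elements are restricted gradually.

```python
import torch

def nm_mask(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Binary N:M mask: keep the n largest-magnitude weights in every group of m.
    Assumes weight.numel() is divisible by m (illustrative simplification)."""
    groups = weight.reshape(-1, m)
    # Indices of the (m - n) smallest-magnitude entries in each group get pruned.
    prune_idx = groups.abs().argsort(dim=1)[:, : m - n]
    mask = torch.ones_like(groups)
    mask.scatter_(1, prune_idx, 0.0)
    return mask.reshape(weight.shape)

def mask_decay_forward(weight: torch.Tensor, step: int, total_steps: int,
                       n: int = 2, m: int = 4) -> torch.Tensor:
    """Mask-decay style forward pass (sketch): pruned weights are multiplied by a
    decaying factor beta instead of hard zeros, so their gradient contribution
    shrinks progressively over training. The linear schedule is an assumption."""
    mask = nm_mask(weight, n, m)
    beta = max(0.0, 1.0 - step / total_steps)   # assumed linear decay of beta -> 0
    soft_mask = mask + (1.0 - mask) * beta      # kept weights: 1, pruned: beta
    return weight * soft_mask
```

In use, such a soft-masked weight would replace the dense weight in a layer's matrix multiply during training (e.g. `y = x @ mask_decay_forward(layer.weight, step, total_steps).T`); once beta reaches zero, the layer is fully N:M sparse and only the kept elements continue to receive gradient updates.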