This paper proposes a progressive semantic-guided vision transformer (ZSLViT) for zero-shot learning (ZSL), which aims to recognize unseen classes by transferring semantic knowledge learned from seen classes. Existing ZSL methods often fail to learn effective visual-semantic correspondences because they lack explicit semantic guidance, which leads to suboptimal visual-semantic interactions. To address this, ZSLViT introduces semantic-embedded token learning (SET) and visual enhancement (ViE) to progressively learn semantic-related visual representations. SET improves visual-semantic correspondences through semantic enhancement and semantic-guided token attention, while ViE fuses visual tokens with low semantic-visual correspondence to discard semantically unrelated information. The two operations are integrated into multiple encoder stages, enabling accurate visual-semantic interactions for ZSL. Extensive experiments on three benchmark datasets (CUB, SUN, and AWA2) show that ZSLViT outperforms existing methods in both the conventional and generalized ZSL settings. By learning semantic-related visual features, the method substantially improves accuracy on unseen classes and achieves state-of-the-art results on these benchmarks.
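
To make the token-level mechanism more concrete, the sketch below illustrates one way semantic-guided token scoring and low-correspondence token fusion could look in PyTorch. It is a minimal sketch, not the authors' implementation: the module name SemanticGuidedTokenFusion, the keep_ratio parameter, and the linear to_semantic projection are illustrative assumptions standing in for ZSLViT's actual SET and ViE operations.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedTokenFusion(nn.Module):
    """Illustrative sketch (not the authors' code): score each visual token
    against a class semantic vector, keep the high-correspondence tokens,
    and fuse the low-correspondence ones into a single token."""

    def __init__(self, dim: int, sem_dim: int, keep_ratio: float = 0.7):
        super().__init__()
        # Hypothetical projection from visual token space to semantic space.
        self.to_semantic = nn.Linear(dim, sem_dim)
        self.keep_ratio = keep_ratio  # assumed fraction of tokens to keep (< 1)

    def forward(self, tokens: torch.Tensor, semantic: torch.Tensor) -> torch.Tensor:
        # tokens:   (B, N, dim)   patch tokens from a ViT encoder layer
        # semantic: (B, sem_dim)  class attribute / semantic vector
        B, N, _ = tokens.shape

        # Per-token semantic-visual correspondence score (cosine similarity).
        proj = F.normalize(self.to_semantic(tokens), dim=-1)      # (B, N, sem_dim)
        sem = F.normalize(semantic, dim=-1).unsqueeze(-1)         # (B, sem_dim, 1)
        scores = torch.bmm(proj, sem).squeeze(-1)                 # (B, N)

        # Keep the tokens with the highest correspondence scores.
        k = max(1, int(N * self.keep_ratio))
        top_idx = scores.topk(k, dim=1).indices
        keep_mask = torch.zeros_like(scores, dtype=torch.bool)
        keep_mask.scatter_(1, top_idx, True)
        kept = tokens[keep_mask].view(B, k, -1)                   # high-correspondence tokens

        # Fuse the remaining (semantic-unrelated) tokens into one token,
        # weighted by their softmaxed scores instead of hard-dropping them.
        low_scores = scores.masked_fill(keep_mask, float('-inf'))
        weights = low_scores.softmax(dim=1).unsqueeze(-1)         # (B, N, 1)
        fused = (tokens * weights).sum(dim=1, keepdim=True)       # (B, 1, dim)

        return torch.cat([kept, fused], dim=1)                    # (B, k+1, dim)


# Example usage with random data (all shapes are illustrative only).
fusion = SemanticGuidedTokenFusion(dim=768, sem_dim=312, keep_ratio=0.7)
out = fusion(torch.randn(2, 196, 768), torch.randn(2, 312))
print(out.shape)  # torch.Size([2, 138, 768])
```

In this sketch the fused token preserves a weighted summary of the discarded information rather than removing it outright; how ZSLViT actually enhances and fuses tokens is specified in the paper itself.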