The paper introduces a novel approach called Progressive Semantic-Guided Vision Transformer (ZSLViT) for zero-shot learning (ZSL). ZSLViT aims to improve the performance of ZSL by explicitly learning semantic-related visual representations and discarding semantic-unrelated visual information. The key contributions of ZSLViT include:
1. **Semantic-Embedded Token Learning (SET)**: This mechanism strengthens visual-semantic correspondences through semantic enhancement and semantic-guided token attention, explicitly discovering and preserving semantic-related visual tokens (see the sketches after this list).
2. **Visual Enhancement (ViE)**: This operation fuses the visual tokens with low visual-semantic correspondence into a single new token, purifying the visual features by discarding semantic-unrelated information.
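The paper describes SET only at the level of the two steps above; the code below is a minimal PyTorch-style sketch of that idea, not the authors' implementation. The class name `SemanticEmbeddedTokenLearning`, the use of a single projected class-attribute vector as the semantic input, and all tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SemanticEmbeddedTokenLearning(nn.Module):
    """Illustrative sketch of SET: inject semantic information into the visual
    tokens, then score each token's visual-semantic correspondence."""

    def __init__(self, dim: int, semantic_dim: int):
        super().__init__()
        # Project the class semantic vector (e.g. an attribute embedding) into token space.
        self.sem_proj = nn.Linear(semantic_dim, dim)
        self.q = nn.Linear(dim, dim)  # query from the projected semantic vector
        self.k = nn.Linear(dim, dim)  # keys from the (enhanced) visual tokens

    def forward(self, tokens: torch.Tensor, semantics: torch.Tensor):
        # tokens:    (B, N, D) patch tokens from a ViT encoder layer
        # semantics: (B, S)    class attribute / semantic vector
        sem = self.sem_proj(semantics).unsqueeze(1)                 # (B, 1, D)
        enhanced = tokens + sem                                     # semantic enhancement
        # Semantic-guided token attention: one query (the semantics) over all tokens.
        attn = self.q(sem) @ self.k(enhanced).transpose(-2, -1)     # (B, 1, N)
        scores = (attn.squeeze(1) / tokens.shape[-1] ** 0.5).softmax(dim=-1)  # (B, N)
        return enhanced, scores
```

The returned per-token scores stand in for the visual-semantic correspondence that the method uses to decide which tokens to keep.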
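Similarly, the ViE sketch below shows one plausible way to fuse the low-correspondence tokens into a single new token using the scores from the SET sketch; the keep ratio and the score-weighted fusion are assumptions made for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class VisualEnhancement(nn.Module):
    """Illustrative sketch of ViE: keep the tokens with high visual-semantic
    correspondence and fuse the rest into one new token."""

    def __init__(self, keep_ratio: float = 0.7):
        super().__init__()
        self.keep_ratio = keep_ratio  # fraction of tokens preserved per stage (assumed value)

    def forward(self, tokens: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, D); scores: (B, N) correspondence scores from the SET sketch
        B, N, D = tokens.shape
        k = max(1, int(N * self.keep_ratio))
        order = scores.argsort(dim=-1, descending=True)              # high -> low correspondence
        keep_idx, drop_idx = order[:, :k], order[:, k:]
        keep = tokens.gather(1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
        drop = tokens.gather(1, drop_idx.unsqueeze(-1).expand(-1, -1, D))
        # Fuse the low-correspondence tokens into a single token, weighted by their scores.
        drop_w = scores.gather(1, drop_idx).softmax(dim=-1).unsqueeze(-1)
        fused = (drop * drop_w).sum(dim=1, keepdim=True)             # (B, 1, D)
        return torch.cat([keep, fused], dim=1)                       # (B, k + 1, D)
```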
These two operations are integrated into the encoders throughout the network so that ZSLViT progressively learns semantic-related visual representations, enabling effective visual-semantic interactions. Extensive experiments on three benchmark datasets (CUB, SUN, and AWA2) show that ZSLViT achieves significant performance improvements over existing methods, demonstrating its effectiveness in ZSL tasks. The paper also includes ablation studies and hyper-parameter analysis to validate the effectiveness of each component of ZSLViT.
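To make the progressive design concrete, the sketch below chains a standard transformer encoder layer with the SET and ViE sketches above into one stage and stacks a few stages, so the token set shrinks as semantic-unrelated tokens are fused away. It reuses the two classes defined above; the stage count, dimensions, fusion schedule, and the 312-dimensional CUB-style attribute vector are all assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

# Reuses SemanticEmbeddedTokenLearning and VisualEnhancement from the sketches above.


class ZSLViTStageSketch(nn.Module):
    """Illustrative sketch of one progressive stage: a standard transformer
    encoder layer, SET scoring, and (optionally) ViE token fusion."""

    def __init__(self, dim: int = 768, semantic_dim: int = 312,
                 heads: int = 12, fuse: bool = True):
        super().__init__()
        self.encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.set = SemanticEmbeddedTokenLearning(dim, semantic_dim)
        self.vie = VisualEnhancement(keep_ratio=0.7) if fuse else None

    def forward(self, tokens: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        tokens = self.encoder(tokens)
        tokens, scores = self.set(tokens, semantics)
        if self.vie is not None:
            tokens = self.vie(tokens, scores)   # the token set shrinks at each fusing stage
        return tokens


if __name__ == "__main__":
    x = torch.randn(2, 196, 768)   # 14x14 patch tokens (assumed input resolution)
    a = torch.randn(2, 312)        # CUB-style 312-d attribute vector (assumed semantics)
    for stage in [ZSLViTStageSketch(fuse=f) for f in (False, True, True)]:
        x = stage(x, a)
    print(x.shape)                 # fewer tokens remain after the fusing stages
```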