FROSTER: FROzen CLIP is A STRONG TEACHER FOR OPEN-VOCABULARY ACTION RECOGNITION


2024 | Xiaohu Huang, Hao Zhou, Kun Yao, Kai Han
FROSTER is a framework for open-vocabulary action recognition that leverages the strong generalization capability of the CLIP model. Pretrained on massive image-text pairs, CLIP has proven successful on image-based tasks, but it lacks temporal modeling, which makes direct transfer to action recognition challenging; at the same time, naively fine-tuning CLIP on video data risks eroding the generalization that makes it attractive in the first place.

To address this, FROSTER uses a frozen CLIP model as a teacher: its features supervise the fine-tuned video model, so the student retains CLIP's generalizability while still learning video-specific cues. Because the student must pursue two objectives at once, matching the generalizable teacher features and capturing video-specific information, FROSTER introduces a residual feature distillation approach in which a small residual sub-network balances the two objectives, allowing flexible adaptation from images to videos without sacrificing the generalization of the pretrained model.

FROSTER is evaluated on open-vocabulary benchmarks under both base-to-novel and cross-dataset settings, where it consistently achieves state-of-the-art performance and remains compatible with various network architectures. The key contributions are: (1) the FROSTER framework for open-vocabulary action recognition; (2) the residual feature distillation method that balances generalizable and video-specific feature learning; and (3) extensive experiments demonstrating its effectiveness, highlighting its potential for open-world applications.
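The residual feature distillation idea can be illustrated with a minimal numeric sketch. This is not the paper's implementation: the two-layer projector, the small weight initialization, and the plain L2 distillation loss are illustrative assumptions. The point is the structure: the student feature passes through a residual sub-network (`f + MLP(f)`) before being matched to the frozen teacher feature, so the branch only needs to model the *difference* from the generalizable CLIP feature rather than reproduce it from scratch.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 512  # feature dimension (e.g. a CLIP ViT-B/16 embedding, for illustration)

# Stand-ins for real encoder outputs: a frozen-CLIP "teacher" feature and a
# fine-tuned video-model "student" feature.
f_teacher = rng.standard_normal(D)
f_student = rng.standard_normal(D)

# Hypothetical two-layer residual projector, initialized with small weights
# so that it starts near the identity mapping.
W1 = rng.standard_normal((D, D)) * 0.01
W2 = rng.standard_normal((D, D)) * 0.01

def residual_project(f):
    """Residual sub-network: f + MLP(f), with a ReLU between the layers."""
    return f + W2 @ np.maximum(W1 @ f, 0.0)

def distill_loss(f_s, f_t):
    """L2 distance between the projected student feature and the frozen
    teacher feature -- the distillation objective, in sketch form."""
    return float(np.mean((residual_project(f_s) - f_t) ** 2))

loss = distill_loss(f_student, f_teacher)
```

Because the projector starts near the identity, the distillation loss is near zero when the student feature already matches the teacher, and gradient pressure from a separate video-classification loss can then push the residual branch toward video-specific structure.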