5 Feb 2024 | Xiaohu Huang¹, Hao Zhou², Kun Yao², Kai Han¹†
FROSTER is a framework for open-vocabulary action recognition. It addresses two challenges: CLIP's pretraining provides no temporal information, and fine-tuning CLIP on action recognition datasets leads to overfitting. The framework uses residual feature distillation so that CLIP retains its generalization capability while adapting to the action recognition task. Specifically, the frozen CLIP model acts as a teacher to maintain the generalizability of the original CLIP, while a residual sub-network supervises the feature learning needed to extract video-specific features. This approach balances the two objectives of learning generalizable and video-specific features.
Extensive evaluations on open-vocabulary action recognition benchmarks, under both cross-dataset and base-to-novel settings, show that FROSTER consistently achieves state-of-the-art performance across datasets. The paper's main contributions are the FROSTER framework, its residual feature distillation method, and its superior performance on open-vocabulary action recognition tasks.
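The idea of distilling from a frozen teacher through a residual branch can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the two-layer residual branch, the ReLU activation, the scale `alpha`, and the mean-squared distance are all assumptions chosen for illustration, and random vectors stand in for CLIP features.

```python
import numpy as np

def residual_transform(student_feat, w1, w2, alpha=0.1):
    """Return f + alpha * MLP(f): an identity path plus a small learned correction.

    The two-layer MLP and the value of alpha are illustrative assumptions.
    """
    hidden = np.maximum(student_feat @ w1, 0.0)  # ReLU, assumed for simplicity
    return student_feat + alpha * (hidden @ w2)

def distillation_loss(student_feat, teacher_feat, w1, w2, alpha=0.1):
    """Mean squared distance between the transformed student feature
    and the frozen teacher feature (assumed form of the distillation term)."""
    transformed = residual_transform(student_feat, w1, w2, alpha)
    return float(np.mean((transformed - teacher_feat) ** 2))

rng = np.random.default_rng(0)
dim = 8

# Stand-ins for real features: the teacher would be the frozen CLIP encoder,
# the student the CLIP encoder being fine-tuned on video data.
teacher_feat = rng.normal(size=(4, dim))
student_feat = teacher_feat + 0.05 * rng.normal(size=(4, dim))

# Zero-initialised residual weights make the transform start as the identity.
w1 = np.zeros((dim, dim))
w2 = np.zeros((dim, dim))

loss = distillation_loss(student_feat, teacher_feat, w1, w2)
print(f"distillation loss: {loss:.6f}")
```

In a full training loop this term would be added to the usual classification loss on the student's features. The identity path keeps the student close to the teacher, while the residual branch gives it room to deviate where video-specific information helps.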