Understanding Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

This paper addresses cross-domain open-vocabulary action recognition, a task that aims to develop models capable of recognizing actions in unseen video domains. The authors establish the XOV-Action benchmark, which includes four test datasets with varying levels of domain gap relative to the training datasets. They evaluate five state-of-the-art CLIP-based video learners on this benchmark and find that these models exhibit limited performance in unseen domains, particularly those with large domain gaps. To address this issue, the authors propose a novel Scene-Aware Video-Text Alignment (SATA) method. SATA aims to learn scene-agnostic video representations by distinguishing video representations from scene-encoded text representations, thereby mitigating scene bias. Extensive experiments demonstrate that SATA effectively improves closed-set action recognition performance across domains while maintaining open-set performance. The paper concludes by highlighting the challenges and future directions in cross-domain open-vocabulary action recognition.
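
To make the idea of distinguishing video representations from scene-encoded text representations more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CLIP-style contrastive alignment term between a video embedding and action-text embeddings, plus a hypothetical hinge penalty that discourages similarity between the video embedding and scene-text embeddings. The function names, the penalty form, `temperature`, and `scene_weight` are all illustrative assumptions and do not come from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax_cross_entropy(logits, target_index):
    """Cross-entropy of softmax(logits) against a single target index."""
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_sum - logits[target_index]


def scene_aware_loss(video, action_texts, label, scene_texts,
                     temperature=0.07, scene_weight=0.5):
    """Hypothetical loss: align video with its action text, repel scene text.

    video        -- video embedding (list of floats)
    action_texts -- one text embedding per action class
    label        -- index of the ground-truth action class
    scene_texts  -- embeddings of scene-encoded text (assumed inputs)
    """
    # CLIP-style alignment: the video should match its own action text.
    logits = [cosine(video, t) / temperature for t in action_texts]
    alignment = softmax_cross_entropy(logits, label)

    # Assumed scene term: penalize positive similarity to scene text only.
    scene_penalty = sum(max(0.0, cosine(video, s)) for s in scene_texts)
    scene_penalty /= len(scene_texts)

    return alignment + scene_weight * scene_penalty


# Toy usage with 2-D embeddings (values chosen only for illustration).
video = [1.0, 0.0]
action_texts = [[1.0, 0.0], [0.0, 1.0]]
loss_scene_biased = scene_aware_loss(video, action_texts, 0, [[1.0, 0.0]])
loss_scene_agnostic = scene_aware_loss(video, action_texts, 0, [[0.0, 1.0]])
```

In this toy setup, a video embedding that also lines up with a scene-text embedding incurs a higher loss than one that is orthogonal to it, which is the intuition behind reducing scene bias. The actual loss formulation in the paper may differ.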

Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

24 May 2024 | Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng