Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition

24 May 2024 | Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yu-Ming Tang, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng
This paper introduces XOV-Action, a new benchmark for cross-domain open-vocabulary action recognition that evaluates five state-of-the-art CLIP-based video learners across four test domains with varying domain gaps. The benchmark reveals that existing methods struggle to generalize to unseen domains, for both closed-set and open-set action categories. The paper identifies scene bias as a critical challenge: models rely on scene-specific information rather than action-related features.

To address this, the authors propose a Scene-Aware video-Text Alignment (SATA) method that distinguishes video representations from scene-encoded text representations, encouraging the video encoder to focus on action information. Experiments on XOV-Action show that SATA significantly improves closed-set action recognition while maintaining open-set performance. The method is evaluated on four test domains, UCF, HMDB, ARID, and NEC-Drone, with results demonstrating its effectiveness in mitigating scene bias and enhancing cross-domain generalization. The paper also highlights remaining challenges of cross-domain open-vocabulary action recognition, including improving open-set recognition and developing a unified model for all categories. Together, the proposed benchmark and method provide a comprehensive way to evaluate and analyze models for this task.
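The core idea of distinguishing video representations from scene-encoded text representations can be illustrated with a minimal sketch. The snippet below is not the paper's implementation: the exact loss form, the source of the scene descriptions, and the weighting factor `lam` are assumptions made purely for illustration of a scene-aware alignment objective.

```python
# Minimal PyTorch sketch of a scene-aware video-text alignment objective.
# All names and the exact loss formulation are illustrative assumptions,
# not the SATA method as published.
import torch
import torch.nn.functional as F

def scene_aware_alignment_loss(video_feats, action_text_feats, scene_text_feats,
                               labels, tau=0.07, lam=0.5):
    """
    video_feats:       (B, D) video representations from a CLIP-based video learner
    action_text_feats: (C, D) CLIP text embeddings of action-class prompts
    scene_text_feats:  (S, D) CLIP text embeddings of scene descriptions
    labels:            (B,)   ground-truth action indices (source domain)
    """
    v = F.normalize(video_feats, dim=-1)
    a = F.normalize(action_text_feats, dim=-1)
    s = F.normalize(scene_text_feats, dim=-1)

    # Standard CLIP-style alignment: classify each video against action-class text.
    align_loss = F.cross_entropy(v @ a.t() / tau, labels)

    # Scene-aware term (assumed form): penalize similarity between video features
    # and scene-encoded text, pushing the video encoder toward action-related
    # rather than scene-specific information.
    scene_sim = v @ s.t()                       # (B, S) cosine similarities
    disalign_loss = scene_sim.clamp(min=0.0).mean()

    return align_loss + lam * disalign_loss
```

In this sketch, the first term keeps the usual video-to-text recognition objective, while the second term explicitly discourages videos from aligning with scene descriptions, which is the intuition behind mitigating scene bias described above.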