[slides] R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

This paper introduces R²-Tuning, a parameter- and memory-efficient transfer learning framework for video temporal grounding (VTG). VTG aims to ground relevant clips in untrimmed videos given natural language queries. Most existing VTG models rely on frame-wise final-layer CLIP features, often augmented with additional temporal backbones and sophisticated temporal reasoning mechanisms. R²-Tuning instead leverages the multi-granularity spatial-temporal modeling capabilities of CLIP itself, learning a lightweight R² Block that recurrently aggregates spatial features from earlier layers and refines temporal correlations based on the query. This approach achieves state-of-the-art performance across three VTG tasks (moment retrieval, highlight detection, and video summarization) on six public benchmarks without the need for additional backbones. The method is parameter-efficient, memory-efficient, and flexible in handling queries of different granularities. Extensive experiments demonstrate the effectiveness and significance of R²-Tuning.
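
As a rough illustration of the idea summarized above, the following toy sketch shows one way a single shared block could be applied recurrently over per-layer frame features, pooling patches by their similarity to the query and then refining the result over time. It is not the authors' implementation. The names `r2_sketch`, `query_pool`, and `temporal_refine` are hypothetical; softmax pooling stands in for learned query-conditioned spatial pooling; a fixed three-tap smoothing filter stands in for learned temporal refinement; and both the last-to-first layer order and the `gate` value of 0.5 are assumptions made for the example.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def query_pool(patches, query):
    """Pool one frame's patch vectors into one vector, weighted by query similarity."""
    weights = softmax([dot(p, query) for p in patches])
    return [sum(w * p[d] for w, p in zip(weights, patches)) for d in range(len(query))]

def temporal_refine(frames):
    """Mix each frame with its neighbours (fixed stand-in for a learned temporal module)."""
    last = len(frames) - 1
    out = []
    for t, cur in enumerate(frames):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, last)]
        out.append([0.5 * c + 0.25 * p + 0.25 * n for c, p, n in zip(cur, prev, nxt)])
    return out

def r2_sketch(layer_feats, query, gate=0.5):
    """layer_feats[l][t] is the list of patch vectors for frame t at encoder layer l.

    The same update is reused at every layer (the recurrence); layers are
    visited from last to first, which is an assumption of this sketch.
    Returns one query-relevance score per frame.
    """
    num_frames = len(layer_feats[0])
    state = [[0.0] * len(query) for _ in range(num_frames)]
    for feats in reversed(layer_feats):
        pooled = [query_pool(feats[t], query) for t in range(num_frames)]
        mixed = [
            [gate * s + (1.0 - gate) * p for s, p in zip(state[t], pooled[t])]
            for t in range(num_frames)
        ]
        state = temporal_refine(mixed)
    return [dot(frame_state, query) for frame_state in state]

# Toy input: 2 layers, 3 frames, 2 patches per frame, feature dimension 2.
frame_bg = [[0.0, 1.0], [0.0, 1.0]]   # no patch matches the query
frame_hit = [[1.0, 0.0], [0.0, 1.0]]  # first patch matches the query
layer = [frame_bg, frame_hit, frame_bg]
scores = r2_sketch([layer, layer], query=[1.0, 0.0])
best = max(range(len(scores)), key=scores.__getitem__)  # index of the top-scoring frame
```

In this toy input only the middle frame contains a patch aligned with the query, so it receives the highest score, while temporal mixing gives its neighbours a smaller, nonzero score. A real system would learn the pooling and temporal modules and read the per-layer features from a frozen CLIP encoder.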

R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding

21 Jul 2024 | Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, Chang Wen Chen