21 Jul 2024 | Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen
R²-Tuning: Efficient Image-to-Video Transfer Learning for Video Temporal Grounding
This paper introduces R²-Tuning, a parameter- and memory-efficient transfer learning framework for video temporal grounding (VTG). The method learns a lightweight R² Block, containing only about 1.5% of the total parameters, to perform progressive spatial-temporal modeling on multi-layer CLIP features. Starting from the last CLIP layer, the R² Block recurrently aggregates spatial features from earlier layers and then refines temporal correlations conditioned on the given query, yielding a coarse-to-fine scheme. Two strategies drive this design: query-modulated spatial pooling and recurrent temporal refinement. In addition, video-level and layer-wise constraints are introduced to calibrate the granularities of the CLIP visual and text encoders. The framework is parameter- and memory-efficient as well as granularity-flexible, since the R² Block adaptively controls its spatial pooling strategy conditioned on the query. Evaluated on three VTG tasks (moment retrieval, highlight detection, and video summarization) across six public benchmarks, R²-Tuning achieves state-of-the-art performance even without an additional video backbone, and the authors expect it to spark further research on efficient image-to-video transfer learning for untrimmed videos. A minimal sketch of the recurrent scheme follows below.
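The following is a minimal, hypothetical PyTorch sketch of the recurrent scheme described above: query-modulated spatial pooling over CLIP patch tokens, followed by recurrent temporal refinement across layers (last to earlier). It is not the authors' implementation; the module name `R2BlockSketch`, the gating mechanism, tensor shapes, and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch of an R^2 Block-style module (not the authors' code).
# Illustrates: (1) query-modulated spatial pooling over CLIP patch tokens,
#              (2) recurrent temporal refinement across CLIP layers (last -> earlier).
import torch
import torch.nn as nn


class R2BlockSketch(nn.Module):
    def __init__(self, dim: int = 512):
        super().__init__()
        # Projects the text query so it can score spatial (patch) tokens.
        self.query_proj = nn.Linear(dim, dim)
        # Lightweight temporal layer shared across recurrent steps (assumed design).
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=8, dim_feedforward=2 * dim, batch_first=True
        )
        self.gate = nn.Parameter(torch.zeros(1))  # blends features across layers

    def spatial_pool(self, patches: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # patches: (B, T, P, D) patch tokens per frame; query: (B, D) sentence embedding.
        scores = torch.einsum("btpd,bd->btp", patches, self.query_proj(query))
        weights = scores.softmax(dim=-1)  # query-modulated attention over patches
        return torch.einsum("btp,btpd->btd", weights, patches)  # (B, T, D) frame features

    def forward(self, layer_feats: list, query: torch.Tensor) -> torch.Tensor:
        # layer_feats: multi-layer CLIP patch features, ordered from last to earlier layer.
        state = None
        for patches in layer_feats:  # recurrent, coarse-to-fine aggregation
            frame_feats = self.spatial_pool(patches, query)
            if state is None:
                state = frame_feats
            else:
                state = state + torch.sigmoid(self.gate) * frame_feats
            state = self.temporal(state)  # refine temporal correlation across frames
        return state  # (B, T, D) features for downstream VTG heads


if __name__ == "__main__":
    B, T, P, D, L = 2, 16, 49, 512, 4
    feats = [torch.randn(B, T, P, D) for _ in range(L)]  # 4 CLIP layers, last first
    query = torch.randn(B, D)
    out = R2BlockSketch(D)(feats, query)
    print(out.shape)  # torch.Size([2, 16, 512])
```

Only this small block would be trained in such a setup; the frozen CLIP encoders supply `layer_feats` and `query`, which is what keeps the approach parameter- and memory-efficient.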