DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval

19 Jan 2024 | Xiangpeng Yang, Linchao Zhu, Xiaohan Wang, Yi Yang
This paper proposes DGL, a dynamic global-local prompt tuning method for text-video retrieval. Instead of finetuning the full backbone, DGL generates dynamic local-level prompts from a shared latent space, so the text and video branches are tuned through common parameters and interact across modalities.
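The paper's code is not reproduced here; the following is a minimal PyTorch-style sketch of the shared-latent-space idea only. The module name, prompt count, and dimensions are assumptions for illustration, and the exact way DGL conditions the prompts on the input is omitted.

```python
# Sketch: local prompts for both modalities generated from one shared latent
# space (hypothetical names/dims; not the authors' implementation).
import torch
import torch.nn as nn


class SharedLatentPromptGenerator(nn.Module):
    """Maps shared learnable latents into text-side and video-side prompts,
    so both frozen encoders are adapted through the same small parameter set."""

    def __init__(self, num_prompts=4, latent_dim=256, text_dim=512, video_dim=768):
        super().__init__()
        # Shared latents: the single source of the prompt content.
        self.latents = nn.Parameter(torch.randn(num_prompts, latent_dim) * 0.02)
        # Lightweight per-modality projections into each encoder's token space.
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_video = nn.Linear(latent_dim, video_dim)

    def forward(self, batch_size: int):
        text_prompts = self.to_text(self.latents)    # (num_prompts, text_dim)
        video_prompts = self.to_video(self.latents)  # (num_prompts, video_dim)
        # Expand so the prompts can be prepended to every sample's tokens.
        return (
            text_prompts.unsqueeze(0).expand(batch_size, -1, -1),
            video_prompts.unsqueeze(0).expand(batch_size, -1, -1),
        )


# Usage: prepend the generated prompts to the token sequences of the frozen
# text and frame encoders; only the generator's parameters are trained.
gen = SharedLatentPromptGenerator()
txt_p, vid_p = gen(batch_size=8)
print(txt_p.shape, vid_p.shape)  # torch.Size([8, 4, 512]) torch.Size([8, 4, 768])
```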
DGL also introduces a global-local video attention mechanism that models videos at two levels: a global video prompt attends across frames to capture inter-frame temporal information, while local frame prompts focus on the content of each individual frame. With only 0.67% of the parameters tuned, reducing total parameter cost by about 99.3% compared to full finetuning, DGL outperforms or matches fully finetuned methods on four benchmarks (MSR-VTT, VATEX, LSMDC, and ActivityNet) and surpasses other parameter-efficient approaches in both retrieval accuracy and parameter efficiency. Extensive experiments and ablation studies show that this design captures frame-level detail as well as global temporal dynamics while remaining efficient in storage and computation.
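One way to read the global-local attention is as an attention mask over the video token sequence: the global prompt attends to all frames, while frame prompts and patch tokens attend only within their own frame. The sketch below illustrates that masking pattern with a hypothetical token layout; it is an interpretation of the summary above, not the paper's exact attention design.

```python
# Sketch: global-local attention mask over video tokens (assumed layout:
# [global prompt] followed by per-frame [frame prompt, patch tokens]).
import torch
import torch.nn as nn


def build_global_local_mask(num_frames: int, tokens_per_frame: int) -> torch.Tensor:
    """The global prompt attends everywhere (inter-frame temporal modelling);
    frame prompts and patch tokens attend only inside their own frame, plus
    the global prompt."""
    seq_len = 1 + num_frames * (1 + tokens_per_frame)
    allow = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    allow[0, :] = True          # global prompt reads all tokens
    allow[:, 0] = True          # all tokens can read the global prompt
    for f in range(num_frames):
        start = 1 + f * (1 + tokens_per_frame)
        end = start + 1 + tokens_per_frame
        allow[start:end, start:end] = True  # local attention within the frame
    # nn.MultiheadAttention expects True where attention is *blocked*.
    return ~allow


num_frames, tokens_per_frame, dim = 4, 5, 64
mask = build_global_local_mask(num_frames, tokens_per_frame)
attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
x = torch.randn(2, mask.size(0), dim)   # (batch, seq_len, dim)
out, _ = attn(x, x, x, attn_mask=mask)
print(out.shape)  # torch.Size([2, 25, 64])
```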