MiniGPT4-Video: Advancing Multimodal LLMs for Video Understanding with Interleaved Visual-Textual Tokens


4 Apr 2024 | Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Deyao Zhu, Jian Ding, Mohamed Elhoseiny
This paper introduces MiniGPT4-Video, a multimodal Large Language Model (LLM) designed for video understanding. Building on the success of MiniGPT-v2, which excelled at translating visual features into the LLM space for single images, MiniGPT4-Video extends this capability to sequences of frames, enabling it to comprehend videos effectively. The model incorporates both visual and textual data, allowing it to answer queries involving either component. By concatenating every four adjacent visual tokens and incorporating subtitles, MiniGPT4-Video reduces the token count while mitigating information loss. The model outperforms existing state-of-the-art methods on multiple benchmarks, achieving gains of 4.22%, 1.13%, 20.82%, and 13.1% on MSVD, MSRVTT, TGIF, and TVQA, respectively. The paper also discusses related work in large vision-language models and LLM-based video understanding, and provides a comprehensive evaluation of MiniGPT4-Video's performance along with qualitative results.
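
The token-concatenation step mentioned above can be pictured with a short sketch. This is a minimal, hypothetical illustration (not the authors' released code), assuming per-frame visual tokens come from a ViT-style encoder as a tensor of shape (batch, num_tokens, dim) and that the token count divides evenly by four; merging four adjacent tokens along the feature dimension cuts the sequence length by 4x before a linear layer projects the result into the LLM embedding space.

```python
import torch

def concat_adjacent_tokens(visual_tokens: torch.Tensor, group_size: int = 4) -> torch.Tensor:
    """Concatenate every `group_size` adjacent visual tokens along the feature dimension.

    visual_tokens: (batch, num_tokens, dim). Assumes num_tokens is divisible
    by group_size (a simplification for this sketch).
    Returns: (batch, num_tokens // group_size, group_size * dim).
    """
    b, n, d = visual_tokens.shape
    assert n % group_size == 0, "token count must be divisible by group_size"
    # Adjacent tokens are merged by reshaping: 4 tokens of width d become
    # 1 token of width 4*d, reducing the sequence length fourfold.
    return visual_tokens.reshape(b, n // group_size, group_size * d)

# Example: 256 tokens per frame become 64 merged tokens with 4x the feature width,
# which a linear projection would then map into the LLM's embedding space.
frame_tokens = torch.randn(1, 256, 1024)
merged = concat_adjacent_tokens(frame_tokens)
print(merged.shape)  # torch.Size([1, 64, 4096])
```

In practice this trade-off keeps more frames within the LLM's context window at the cost of a wider per-token projection, which is why the summary describes it as reducing token count while mitigating information loss.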