MiniGPT4-Video is a multimodal Large Language Model (LLM) designed for video understanding. It processes temporal visual and textual data jointly, enabling it to reason about complex videos. Building on MiniGPT-v2, which maps visual features into the LLM space for single images, MiniGPT4-Video extends this capability to sequences of frames. Because the model also incorporates textual conversation, it can answer queries that involve both visual and text components.

The architecture combines visual tokens extracted by a visual encoder with text tokens produced by the LLM tokenizer, giving the LLM a more comprehensive view of the video content. To keep the sequence length manageable, every four adjacent visual tokens are concatenated into one, reducing the token count while mitigating information loss, and a linear layer then maps the visual features into the LLM's text space.

The model is trained on large-scale image-text and video-text datasets and further instruction-tuned so that it can interpret the input video and generate precise answers to questions. Evaluated on multiple benchmarks, including Video-ChatGPT, it outperforms existing state-of-the-art methods, with gains of 4.22%, 1.13%, 20.82%, and 13.1% on the MSVD, MSRVTT, TGIF, and TVQA benchmarks, respectively.
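The sketch below illustrates the frame-to-token step described above: grouping every four adjacent visual tokens and projecting the result into the LLM's text space with a single linear layer. The class name `VisualProjector` and the dimensions (256 visual tokens per frame, a 1408-dimensional encoder, a 4096-dimensional LLM) are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class VisualProjector(nn.Module):
    """Illustrative sketch: concatenate every 4 adjacent visual tokens and
    project the result into the LLM's text-embedding space."""

    def __init__(self, vision_dim: int = 1408, llm_dim: int = 4096, group: int = 4):
        super().__init__()
        self.group = group
        # Single linear layer mapping concatenated visual features to the LLM space.
        self.proj = nn.Linear(vision_dim * group, llm_dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, tokens_per_frame, vision_dim)
        b, f, t, d = frame_tokens.shape
        assert t % self.group == 0, "tokens per frame must be divisible by the group size"
        # Merge every `group` adjacent tokens along the feature axis,
        # shrinking the per-frame token count by a factor of `group`.
        grouped = frame_tokens.reshape(b, f, t // self.group, d * self.group)
        # Project into the LLM's text space; the output can then be interleaved
        # with subtitle and instruction text embeddings.
        return self.proj(grouped)

# Example: 45 frames with 256 visual tokens each -> 64 projected tokens per frame.
tokens = torch.randn(1, 45, 256, 1408)
print(VisualProjector()(tokens).shape)  # torch.Size([1, 45, 64, 4096])
```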
Zero-shot evaluations on open-ended and multiple-choice question benchmarks further validate the model. They also show that integrating subtitle information alongside visual cues substantially improves performance: on TVQA, accuracy rises from 33.9% to 54.21% when subtitles are included. The main limitation is the context window of the underlying LLM, which caps the number of frames the model can process at 45 for the Llama 2 version and 90 for the Mistral version. Future research will focus on extending the model's capabilities to handle longer video sequences.
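As a concrete illustration of that frame budget, a longer video has to be subsampled before encoding. The helper below is a hypothetical sketch of uniform frame selection under the 45/90-frame limit; `sample_frame_indices` and `max_frames` are assumed names, not part of the released code.

```python
def sample_frame_indices(total_frames: int, max_frames: int = 45) -> list[int]:
    """Uniformly pick at most `max_frames` frame indices from a video.

    Hypothetical helper: the context window supports roughly 45 frames
    with Llama 2 and 90 with Mistral, so longer videos are subsampled
    rather than fed in full.
    """
    if total_frames <= max_frames:
        return list(range(total_frames))
    # Evenly spaced indices covering the whole video.
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# Example: a 1-minute clip at 30 fps (1800 frames) reduced to 45 frames.
indices = sample_frame_indices(1800, max_frames=45)
print(len(indices), indices[:5])  # 45 [0, 40, 80, 120, 160]
```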