FreeVA: Offline MLLM as Training-Free Video Assistant

10 Jun 2024 | Wenhao Wu
This paper, titled "FreeVA: Offline MLLM as Training-Free Video Assistant," explores extending existing image-based Multimodal Large Language Models (MLLMs) to the video domain without any additional training. The study, named FreeVA, aims to provide a baseline for evaluating MLLMs on video tasks and reveals several surprising findings:

1. **Zero-shot video question answering**: Using only an offline image-based MLLM, FreeVA excels at zero-shot video question answering (e.g., MSVD-QA, ActivityNet-QA, and MSRVTT-QA), even surpassing state-of-the-art methods that rely on video instruction tuning.
2. **Effectiveness of video instruction tuning**: Fine-tuning an image-based MLLM on the widely adopted VideoInstruct-100K dataset for video instruction tuning does not lead to better performance than not training at all.
3. **Evaluation metric influence**: Commonly used evaluation metrics for zero-shot video question answering are significantly affected by changes in the GPT API version over time, which undermines the fairness and consistency of comparisons between methods.

The paper also reviews the architecture of MLLMs, including image encoders, vision-language connectors, and large language models, and proposes a simple approach for extending these models to video tasks (a minimal sketch of the image pipeline follows below).
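To make the pipeline concrete, here is a minimal, hypothetical sketch of how a typical image-based MLLM turns an image into tokens for the language model. The class, layer sizes, and function names are illustrative assumptions, not taken from the paper's code:

```python
# Toy sketch of an image MLLM's visual path: a frozen image encoder produces
# patch tokens, a vision-language connector projects them into the LLM's
# embedding space, and the LLM consumes them alongside the text prompt.
import torch
import torch.nn as nn

class ToyImageMLLM(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Stand-ins for a real ViT encoder and MLP connector.
        self.image_encoder = nn.Linear(3 * 14 * 14, vision_dim)  # toy patch embedder
        self.connector = nn.Sequential(                          # vision-language connector
            nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )

    def encode_image(self, image_patches):
        # image_patches: (num_patches, patch_pixels) -> (num_patches, llm_dim)
        visual_tokens = self.image_encoder(image_patches)
        return self.connector(visual_tokens)

model = ToyImageMLLM()
patches = torch.randn(576, 3 * 14 * 14)      # one image split into 576 patches
visual_tokens = model.encode_image(patches)   # tokens the LLM would consume
print(visual_tokens.shape)                    # torch.Size([576, 4096])
```

For video, FreeVA runs this per-frame encoding on several sampled frames and then aggregates the resulting tokens, as described next.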
The key component of FreeVA is temporal aggregation, which combines the visual tokens of multiple sampled video frames into a single sequence that is fed to the language model. The experiments show that dense temporal aggregation, which retains the original patch tokens of every frame, outperforms sparse aggregation methods that pool tokens across frames (a sketch of both variants follows below). The study also highlights how strongly different GPT-3.5 versions affect the evaluation results. The paper concludes by discussing limitations and future directions, including more advanced parameter-free aggregation strategies, better video instruction tuning, and the incorporation of more advanced MLLMs. The authors hope that FreeVA will serve as a valuable baseline for future research and encourage the direct evaluation of existing MLLMs on video tasks.
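Below is a minimal sketch of the two aggregation strategies discussed above, assuming per-frame visual tokens of shape (num_frames, num_patches, dim). The function names and tensor sizes are illustrative, not taken from the paper's implementation:

```python
# Dense aggregation keeps every frame's patch tokens by concatenating them
# along the token axis; sparse aggregation pools across frames and keeps a
# single set of tokens.
import torch

def dense_aggregate(frame_tokens: torch.Tensor) -> torch.Tensor:
    # (T, N, D) -> (T * N, D): all patch tokens from all frames are kept,
    # which the paper reports works better than pooling.
    t, n, d = frame_tokens.shape
    return frame_tokens.reshape(t * n, d)

def sparse_aggregate(frame_tokens: torch.Tensor) -> torch.Tensor:
    # (T, N, D) -> (N, D): average each patch position over time,
    # discarding per-frame detail.
    return frame_tokens.mean(dim=0)

frames = torch.randn(4, 576, 4096)     # e.g. 4 sampled frames, 576 patches each
print(dense_aggregate(frames).shape)   # torch.Size([2304, 4096])
print(sparse_aggregate(frames).shape)  # torch.Size([576, 4096])
```

The dense variant preserves every frame's patch tokens at the cost of a longer input sequence for the LLM, which is consistent with the paper's finding that maintaining the original patch tokens for each frame matters.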