Goldfish: Vision-Language Understanding of Arbitrarily Long Videos

17 Jul 2024 | Kirolos Ataallah, Xiaoqian Shen, Eslam Abdelrahman, Essam Sleiman, Mingchen Zhuge, Jian Ding, Deyao Zhu, Jürgen Schmidhuber, and Mohamed Elhoseiny
Goldfish is a framework for understanding arbitrarily long videos that addresses challenges such as noise, redundancy, and computational constraints. It introduces a retrieval mechanism that selects the most relevant video clips before answering a question, enabling efficient processing of long videos. The framework also includes MiniGPT4-Video, a model that generates detailed video clip descriptions and improves both short- and long-video understanding. To evaluate long-video comprehension, the authors developed the TVQA-long benchmark, on which Goldfish achieves 41.78% accuracy, surpassing previous methods by 14.94%. Goldfish also outperforms existing models on multiple benchmarks, including short-video tasks, with improvements of up to 23.59%.

The framework's retrieval-based approach allows it to handle long videos efficiently, while MiniGPT4-Video excels in both short and long video tasks. Together they deliver state-of-the-art performance on long-video understanding as well as SOTA results on short-video benchmarks, capabilities validated through extensive experiments and benchmark evaluations.
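The retrieval step described above can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the clip descriptions stand in for the text summaries MiniGPT4-Video would generate, and similarity is a simple bag-of-words cosine, whereas the actual system uses learned embeddings. All function names here are hypothetical.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: term-frequency vector over lowercase tokens."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_top_k(question, clip_descriptions, k=2):
    """Rank clips by similarity to the question; keep the top-k indices."""
    q = embed(question)
    scored = sorted(
        ((cosine(q, embed(d)), i) for i, d in enumerate(clip_descriptions)),
        reverse=True,
    )
    return [i for _, i in scored[:k]]

clips = [
    "a chef chops onions in the kitchen",
    "two friends argue in a parked car",
    "the chef plates the finished dish",
]
# Only the chef-related clips are passed on to the answering model.
print(retrieve_top_k("what does the chef cook?", clips))  # → [2, 0]
```

Answering then conditions only on the retrieved clips' descriptions, which is what keeps the cost bounded regardless of total video length.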