Too Many Frames, not all Useful: Efficient Strategies for Long-Form Video QA


17 Jun 2024 | Jongwoo Park, Kanchana Ranasinghe, Kumara Kahatapitiya, Wonjeong Ryoo, Donghyun Kim, Michael S. Ryoo
This paper introduces LVNet, a framework for long-form video question answering (LVQA) that achieves state-of-the-art performance on three benchmark datasets. LVNet consists of two key components: a Hierarchical Keyframe Selector (HKS) and a Sequential Visual LLM (SVL). The HKS uses hierarchical clustering to efficiently select the subset of frames most relevant to answering a given question, and the SVL generates natural language captions for those frames in a sequence-aware manner that respects the temporal order of the video. By focusing on keyframes rather than processing every frame, LVNet substantially reduces redundancy and the computational cost of long-form video QA. Because the framework requires no video-level supervision, it operates zero-shot and scales efficiently. Experiments show that LVNet outperforms existing methods on EgoSchema, NExT-QA, and IntentQA while using only a small number of frames, making it a promising approach to long-form video understanding.
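The summary describes the HKS only at a high level, so the following is a minimal illustrative sketch of clustering-based keyframe selection, not LVNet's actual pipeline. It assumes precomputed per-frame embeddings from an off-the-shelf image encoder; the function name select_keyframes and all parameters are hypothetical.

    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage

    def select_keyframes(frame_features: np.ndarray, num_keyframes: int) -> list[int]:
        """Pick one representative frame index per cluster of similar frames.

        frame_features: (num_frames, feature_dim) array of per-frame embeddings.
        Hypothetical helper illustrating clustering-based keyframe selection.
        """
        # Build an agglomerative clustering tree over the frame embeddings and
        # cut it into `num_keyframes` clusters of visually redundant frames.
        tree = linkage(frame_features, method="ward")
        labels = fcluster(tree, t=num_keyframes, criterion="maxclust")

        keyframes = []
        for cluster_id in np.unique(labels):
            members = np.where(labels == cluster_id)[0]
            # Keep the member closest to the cluster centroid as the representative.
            centroid = frame_features[members].mean(axis=0)
            dists = np.linalg.norm(frame_features[members] - centroid, axis=1)
            keyframes.append(int(members[np.argmin(dists)]))

        # Return indices in temporal order so downstream captioning stays
        # sequence-aware, matching the summary's description of the SVL.
        return sorted(keyframes)

In a pipeline of this shape, the selected frames would then be captioned in temporal order and the captions combined with the question into a single LLM prompt; per the summary, the whole process runs zero-shot, with no video-level training.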