Language Repository for Long Video Understanding

21 Mar 2024 | Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S. Ryoo
The paper introduces a Language Repository (LangRepo) designed to improve how long-term information is handled in video understanding tasks. LangRepo is an all-textual repository that is updated iteratively from multi-scale video chunks and keeps its contents concise and structured. The repository supports two main operations: writing and reading. The writing operation prunes redundant text and rephrases the input descriptions to create concise entries. The reading operation extracts information at various temporal scales to generate outputs suitable for video question-answering (VQA) tasks. LangRepo is evaluated on zero-shot VQA benchmarks, including EgoSchema, NExT-QA, IntentQA, and NExT-GQA, demonstrating state-of-the-art performance. The code for LangRepo is available on GitHub. The paper also discusses the challenges of long-term reasoning in video understanding and the benefits of using language as an interpretable modality.
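The write and read operations described above can be illustrated with a toy sketch. This is not the authors' implementation: the class `ToyLangRepo`, the `threshold` value, and the token-overlap (Jaccard) similarity are all assumptions chosen for illustration, and the rephrasing step is left out entirely. It only shows the general shape of pruning redundant captions on write and collecting entries across scales on read.

```python
from dataclasses import dataclass, field


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity (an illustrative stand-in, not the paper's measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


@dataclass
class ToyLangRepo:
    """Hypothetical text-only repository keyed by temporal scale."""
    threshold: float = 0.6  # assumed redundancy cutoff
    entries: dict = field(default_factory=dict)  # scale -> list of kept captions

    def write(self, scale: int, captions: list) -> None:
        """Store captions for a scale, skipping ones too similar to kept entries."""
        kept = self.entries.setdefault(scale, [])
        for caption in captions:
            if all(jaccard(caption, k) < self.threshold for k in kept):
                kept.append(caption)

    def read(self) -> str:
        """Join the entries of every scale into one text context, finest scale first."""
        return "\n".join(
            f"[scale {scale}] " + "; ".join(self.entries[scale])
            for scale in sorted(self.entries)
        )


# Made-up captions for a short clip; the second one is an exact repeat.
captions = [
    "person opens the fridge",
    "person opens the fridge",
    "person pours milk into a glass",
    "person drinks the milk",
]

repo = ToyLangRepo()
repo.write(0, captions)         # fine scale: raw captions
repo.write(1, repo.entries[0])  # coarser scale: re-process the pruned entries
context = repo.read()
print(context)
```

In a VQA setting, the string returned by `read()` would be placed in the prompt of a language model together with the question. How the paper actually measures redundancy, rephrases entries, and summarizes each scale is not covered by this sketch.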