[slides] Language Repository for Long Video Understanding

This paper introduces a Language Repository (LangRepo) that lets large language models (LLMs) handle long-form video understanding. LangRepo maintains concise, interpretable, and structured information as an all-textual representation that is updated iteratively from multi-scale video chunks. It supports write and read operations that prune redundant text and extract information at various temporal scales. The framework is evaluated on zero-shot visual question-answering (VQA) benchmarks for long-video reasoning, including EgoSchema, NExT-QA, IntentQA, and NExT-GQA, where it shows state-of-the-art performance at its scale, particularly when handling long input sequences.

Because it is fully textual, LangRepo is compatible with both LLM-based processing and human interpretation. Its operations prune redundant text, rephrase, and summarize to generate outputs suitable for video VQA. The repository also stores optional metadata such as timestamps and occurrences. Ablation studies show that the choice of LLM, text encoder, and classifier significantly affects performance. Overall, LangRepo provides a concise, interpretable, and effective representation for long-form video understanding.

Language Repository for Long Video Understanding

21 Mar 2024 | Kumara Kahatapitiya, Kanchana Ranasinghe, Jongwoo Park, and Michael S. Ryoo