Koala: Key frame-conditioned long video-LLM


3 May 2024 | Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
**Koala: Key Frame-Conditioned Long Video-LLM**

Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
Boston University, Adobe Research
{rxtan, sunxmx, pinghu, bplum, saenko}@bu.edu, {juiwang, delilamsa, brussell}@adobe.com
https://cs-people.bu.edu/rxtan/projects/Koala

**Abstract**

Long video question answering is a challenging task that involves recognizing multiple actions and their fine-grained relationships. While state-of-the-art video Large Language Models (vLLMs) have shown promise, they struggle to understand minutes-long videos because they are trained on short, seconds-long clips. To address this, we propose Koala, a lightweight and self-supervised approach that introduces learnable spatiotemporal queries to adapt vLLMs to longer videos. Koala uses sparsely sampled key frames to condition the vLLM's tokenizer function, enabling it to focus on relevant regions and make more informed predictions. Our approach improves the vLLM's performance on zero-shot long video understanding benchmarks by 3-6% absolute accuracy across all tasks, outperforming state-of-the-art models. Additionally, Koala improves the vLLM's accuracy on short-term action recognition tasks.

**Introduction**

Answering questions about minutes-long videos is challenging because it requires recognizing and understanding complex temporal relationships between actions. vLLMs, while effective on short videos, struggle with long videos due to their limited ability to capture fine-grained spatiotemporal information. Koala addresses this by introducing spatiotemporal queries that adapt the vLLM's tokenizer function to aggregate context over longer temporal horizons. By encoding global and local context with key frames and video segments, Koala improves the vLLM's ability to reason about long-term temporal relationships.

**Related Work**

Video understanding spans tasks such as action recognition, prediction, and localization. Prior work often relies on hand-crafted features or video encoders designed to capture temporal information. Koala differs by focusing on task-agnostic visual tokenization, aligning visual tokens with the base LLM, and enhancing long-term temporal understanding.

**Koala Approach**

Koala keeps the vLLM frozen and introduces Conditioned Segment (CS) and Conditioned Video (CV) tokenizers that condition on key frames and video segments. These tokenizers aggregate spatiotemporal context, improving the vLLM's ability to understand long videos. The learning objective is to predict high-level task labels from instructional videos.
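To make the conditioning idea concrete, below is a minimal sketch of how a key frame-conditioned segment tokenizer could be wired up with standard PyTorch cross-attention. The class name, dimensions, and the concatenation-based conditioning are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of key frame-conditioned segment tokenization (not the
# authors' code). Assumes pre-extracted per-frame features; all module
# names and sizes are illustrative.
import torch
import torch.nn as nn

class ConditionedSegmentTokenizer(nn.Module):
    """Learnable queries attend over a segment's frame tokens, conditioned on
    global key-frame tokens, to produce a fixed-size set of segment tokens."""

    def __init__(self, dim=768, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, segment_tokens, key_frame_tokens):
        # segment_tokens:   (B, N_seg, dim) tokens for one short video segment
        # key_frame_tokens: (B, N_key, dim) tokens from sparsely sampled key frames
        B = segment_tokens.size(0)
        # Condition the learnable queries on global context by letting them
        # attend over the key-frame tokens together with the segment tokens.
        context = torch.cat([key_frame_tokens, segment_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        out, _ = self.cross_attn(q, context, context)
        return self.norm(out)  # (B, num_queries, dim) tokens for the frozen LLM


# Usage: tokenize each segment of a long video, then aggregate the resulting
# tokens (e.g. with a video-level tokenizer) before projecting them into the
# frozen LLM's input space.
tokenizer = ConditionedSegmentTokenizer()
seg = torch.randn(2, 256, 768)   # e.g. 8 frames x 32 patch tokens
keys = torch.randn(2, 128, 768)  # e.g. 4 key frames x 32 patch tokens
tokens = tokenizer(seg, keys)    # (2, 32, 768)
```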
**Experiments**

Koala is evaluated on the EgoSchema and Seed-Bench benchmarks, showing significant improvements over state-of-the-art models. Ablation studies demonstrate the effectiveness of the introduced tokenizers and spatiotemporal queries.
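Both EgoSchema and Seed-Bench pose multiple-choice questions. The sketch below shows one common way to score candidate answers with a frozen causal language model by comparing answer log-likelihoods; the helper names and the Hugging Face-style `inputs_embeds` interface are assumptions for illustration, and the paper's exact evaluation protocol may differ.

```python
# Minimal sketch of zero-shot multiple-choice scoring with a frozen causal LM.
# `video_tokens` stands in for the output of the Koala tokenizers projected
# into the LLM embedding space; `text_embed` is the LLM's input embedding layer.
import torch
import torch.nn.functional as F

@torch.no_grad()
def score_options(llm, text_embed, video_tokens, question_ids, option_ids_list):
    """Return the index of the candidate answer with the highest mean
    log-likelihood under the (frozen) language model."""
    scores = []
    for option_ids in option_ids_list:
        # Build the input sequence: [video tokens][question][candidate answer].
        ids = torch.cat([question_ids, option_ids], dim=0).unsqueeze(0)
        text_tokens = text_embed(ids)                       # (1, T, dim)
        inputs = torch.cat([video_tokens, text_tokens], 1)  # prepend visual tokens
        logits = llm(inputs_embeds=inputs).logits           # (1, L, vocab)

        # Score only the answer positions (next-token shift under teacher forcing).
        ans_len = option_ids.size(0)
        ans_logits = logits[0, -ans_len - 1:-1]             # predictions for answer tokens
        logp = F.log_softmax(ans_logits, dim=-1)
        token_logp = logp.gather(1, option_ids.unsqueeze(1)).squeeze(1)
        scores.append(token_logp.mean().item())
    return int(torch.tensor(scores).argmax())
```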