Koala: Key frame-conditioned long video-LLM

3 May 2024 | Reuben Tan, Ximeng Sun, Ping Hu, Jui-hsien Wang, Hanieh Deilamsalehy, Bryan A. Plummer, Bryan Russell, Kate Saenko
Koala is a lightweight approach that extends the short-term video tokenizer function of a pre-trained video Large Language Model (vLLM) to understand and answer questions about minutes-long videos. The method introduces learnable spatiotemporal queries that adapt the pre-trained vLLM to generalize to longer videos. Koala conditions the LLM on sparsely sampled key frames, allowing it to focus on relevant regions in the input frames and make more informed predictions based on a more holistic understanding of the video.

The key insight is that global video context can be used to model both individual video segments and the contextual relations between multiple segments, which is crucial for understanding long videos. To this end, Koala introduces two new tokenizer functions, the Conditioned Segment (CS) and Conditioned Video (CV) tokenizers, which condition on visual tokens computed from sparse key frames and help the model incorporate both global and local spatiotemporal information when reasoning over short and long video moments.

Koala is trained on the HowTo100M dataset and evaluated zero-shot on multiple long- and short-term temporal understanding tasks from the EgoSchema and SeedBench benchmarks, where it outperforms state-of-the-art large models by 3-6% absolute accuracy across all tasks. It also improves the accuracy of the pre-trained vLLM on short-term action recognition. This lightweight finetuning approach allows the pre-trained vLLM to acquire long-term temporal understanding capabilities despite being trained on noisy, uncurated video and text data, demonstrating that Koala is effective for both long- and short-term video understanding.
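The summary does not include the authors' code, but the key-frame conditioning idea can be illustrated with a minimal, hypothetical PyTorch sketch: learnable queries cross-attend to a segment's tokens concatenated with sparse key-frame tokens (standing in for the CS step), and the resulting segment summaries are fused into video-level tokens under the same global conditioning (standing in for the CV step). All module names, dimensions, and the single cross-attention layer are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class ConditionedTokenizer(nn.Module):
    """Learnable queries cross-attend to local tokens concatenated with
    global key-frame tokens (a simplified stand-in for Koala's CS/CV tokenizers)."""
    def __init__(self, dim=512, num_queries=32, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, local_tokens, keyframe_tokens):
        # Concatenate local tokens with global key-frame tokens so the
        # learnable queries can attend to both local and global context.
        context = torch.cat([local_tokens, keyframe_tokens], dim=1)
        q = self.queries.unsqueeze(0).expand(context.size(0), -1, -1)
        out, _ = self.attn(q, context, context)
        return self.norm(out)  # (batch, num_queries, dim)

class KoalaLikeVideoEncoder(nn.Module):
    """Toy pipeline: key-frame tokens condition per-segment tokenization,
    then segment summaries are fused into video-level tokens that would be
    passed to the frozen LLM as soft visual prompts."""
    def __init__(self, dim=512):
        super().__init__()
        self.cs_tokenizer = ConditionedTokenizer(dim)  # segment level
        self.cv_tokenizer = ConditionedTokenizer(dim)  # video level

    def forward(self, segments, keyframes):
        # segments:  (batch, num_segments, tokens_per_segment, dim)
        # keyframes: (batch, num_keyframes, dim) -- sparsely sampled global context
        segment_summaries = torch.cat(
            [self.cs_tokenizer(segments[:, i], keyframes)
             for i in range(segments.size(1))],
            dim=1,
        )
        return self.cv_tokenizer(segment_summaries, keyframes)

# Usage with dummy tensors
encoder = KoalaLikeVideoEncoder()
segments = torch.randn(2, 4, 64, 512)  # 2 videos, 4 segments, 64 tokens each
keyframes = torch.randn(2, 8, 512)     # 8 key-frame tokens per video
video_tokens = encoder(segments, keyframes)
print(video_tokens.shape)              # torch.Size([2, 32, 512])
```

The design choice being illustrated is that the same sparse key-frame tokens appear in the context of both tokenizers, so segment-level and video-level queries are each grounded in a shared global view of the video.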