27 Mar 2024 | De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
LITA is a Language Instructed Temporal-Localization Assistant designed to enhance temporal localization in Video Large Language Models (Video LLMs). The paper identifies three key limitations in existing Video LLMs: time representation, architecture, and data. To address these, LITA introduces time tokens that represent timestamps relative to the video length, SlowFast tokens that provide fine-grained temporal resolution without an excessive token count, and a training mix that emphasizes temporal localization data. It also proposes a new task, Reasoning Temporal Localization (RTL), along with the ActivityNet-RTL dataset, to evaluate both temporal localization and reasoning capabilities.
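To make the time-token idea concrete, here is a minimal Python sketch of mapping an absolute timestamp to one of T relative time tokens and back. The token count and the rounding rule below are illustrative assumptions, not necessarily the paper's exact configuration.

```python
NUM_TIME_TOKENS = 100  # assumed vocabulary of time tokens <1> ... <100>

def timestamp_to_token(seconds: float, video_length: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> int:
    """Map an absolute timestamp (seconds) to a relative time-token index in [1, num_tokens]."""
    fraction = min(max(seconds / video_length, 0.0), 1.0)  # position relative to video length
    return int(round(fraction * (num_tokens - 1))) + 1

def token_to_timestamp(token_index: int, video_length: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> float:
    """Recover an approximate absolute timestamp from a time-token index."""
    return (token_index - 1) / (num_tokens - 1) * video_length

# In a 120-second video, second 60 falls near the middle of the token range.
print(timestamp_to_token(60.0, 120.0))   # 50
print(token_to_timestamp(50, 120.0))     # ~59.4
```

Because the tokens are relative, the same small vocabulary covers videos of any length; only the conversion back to seconds depends on the video's duration.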
LITA significantly improves temporal localization performance, nearly doubling the temporal mean intersection-over-union (mIoU) of baselines. It also enhances video-based text generation, achieving a 36% relative improvement in Temporal Understanding compared to existing Video LLMs. The model's design combines the relative time representation with dense video sampling, using slow tokens for spatial detail and fast tokens for temporal coverage so that video inputs are processed efficiently. LITA's training incorporates dense video captioning, event localization, video question answering, and RTL, leading to improved video understanding and reasoning capabilities.
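The SlowFast design can be sketched as two pooling pathways over densely sampled frame features: fast tokens keep every frame but average away the spatial grid, while slow tokens keep a sparse subset of frames at higher spatial resolution. The frame counts and pooling sizes below are assumptions for illustration, not LITA's exact settings.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_features: torch.Tensor,
                    num_slow_frames: int = 4,
                    slow_pool: int = 2) -> torch.Tensor:
    """Illustrative SlowFast token pooling.

    frame_features: (T, H, W, D) visual-encoder features for T densely sampled frames.
    Returns a (num_slow_tokens + T, D) tensor of video tokens for the LLM.
    """
    T, H, W, D = frame_features.shape

    # Fast pathway: one spatially averaged token per frame -> fine temporal resolution.
    fast = frame_features.mean(dim=(1, 2))                  # (T, D)

    # Slow pathway: a few uniformly sampled frames, lightly pooled in space
    # -> fine spatial resolution at a small temporal cost.
    idx = torch.linspace(0, T - 1, num_slow_frames).long()
    slow = frame_features[idx].permute(0, 3, 1, 2)          # (s, D, H, W)
    slow = F.avg_pool2d(slow, kernel_size=slow_pool)        # (s, D, H/p, W/p)
    slow = slow.permute(0, 2, 3, 1).reshape(-1, D)          # (s * H/p * W/p, D)

    # Concatenate slow and fast tokens as the video input to the language model.
    return torch.cat([slow, fast], dim=0)

# e.g. 100 frames of 16x16 patch features -> 4*8*8 + 100 = 356 video tokens
tokens = slowfast_tokens(torch.randn(100, 16, 16, 1024))
print(tokens.shape)  # torch.Size([356, 1024])
```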
The RTL task requires models to reason about events not explicitly described in the question, using world knowledge and temporal reasoning. LITA excels in this task, providing both timestamps and explanations. Evaluation on the ActivityNet-RTL dataset shows that LITA outperforms existing models in temporal localization and reasoning. Additionally, LITA improves video-based text generation, demonstrating enhanced video understanding across various tasks.
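For reference, the temporal IoU behind these numbers is the overlap between a predicted [start, end] span and the ground-truth span divided by their union, averaged over questions to obtain mIoU. The snippet below is a minimal sketch of that standard metric, not the paper's evaluation code.

```python
def temporal_iou(pred: tuple[float, float], gt: tuple[float, float]) -> float:
    """Intersection-over-union of two [start, end] intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def mean_iou(preds, gts) -> float:
    """Average temporal IoU over predictions and ground-truth spans (mIoU)."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)

# A prediction of [10 s, 30 s] against a ground-truth span of [15 s, 35 s]:
print(temporal_iou((10.0, 30.0), (15.0, 35.0)))  # 0.6
```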
The paper also analyzes the effects of different training tasks on LITA's performance, showing that combining RTL with standard video tasks and natural language visual question answering significantly improves the model's capabilities. Overall, LITA's design and data strategy enable accurate temporal localization and improved video understanding, making it a promising advancement in Video LLMs.