LITA: Language Instructed Temporal-Localization Assistant


27 Mar 2024 | De-An Huang, Shijia Liao, Subhashree Radhakrishnan, Hongxu Yin, Pavlo Molchanov, Zhiding Yu, Jan Kautz
The paper introduces LITA (Language Instructed Temporal-Localization Assistant), a model designed to enhance temporal localization in Video Large Language Models (Video LLMs). LITA addresses the limitations of existing Video LLMs in accurately answering "When?" questions by focusing on three key aspects: time representation, architecture, and data.

1. **Time Representation**: LITA introduces time tokens that encode timestamps relative to the video length, improving the model's ability to represent time accurately (see the sketch after this list).
2. **Architecture**: The model uses SlowFast tokens to capture temporal information at fine temporal resolution, enabling more precise temporal localization (a pooling sketch also follows below).
3. **Data**: LITA emphasizes temporal localization data, including dense video captioning and event localization tasks, and proposes a new task, Reasoning Temporal Localization (RTL), together with the ActivityNet-RTL dataset.

The paper shows that LITA significantly improves temporal localization, nearly doubling the mean intersection-over-union (mIoU) of baselines. LITA also strengthens video-based text generation, with a 36% relative improvement in Temporal Understanding compared to existing Video LLMs. Its effectiveness is validated on the ActivityNet-RTL dataset, where it outperforms other Video LLMs in both temporal localization and general video understanding tasks.
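To make the time-token idea concrete, here is a minimal Python sketch, assuming the video duration is split into a fixed number of equal chunks and each chunk is represented by its own token. The token count (100) and the `<...>` spelling are illustrative assumptions, not necessarily LITA's exact vocabulary.

```python
# Hypothetical sketch of relative time tokens: an absolute timestamp is mapped
# to the token of the chunk containing it, so the representation depends only
# on the fraction of the video elapsed, not on the absolute length.

NUM_TIME_TOKENS = 100  # assumed number of time tokens

def timestamp_to_token(t_seconds: float, duration: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> str:
    """Map an absolute timestamp to a relative time token such as '<37>'."""
    t_seconds = min(max(t_seconds, 0.0), duration)
    index = min(int(t_seconds / duration * num_tokens), num_tokens - 1)
    return f"<{index}>"

def token_to_timestamp(token: str, duration: float,
                       num_tokens: int = NUM_TIME_TOKENS) -> float:
    """Decode a time token back to the midpoint of its chunk, in seconds."""
    index = int(token.strip("<>"))
    return (index + 0.5) / num_tokens * duration

# Example: in a 120-second video, second 45 falls into chunk <37>.
print(timestamp_to_token(45.0, 120.0))              # "<37>"
print(round(token_to_timestamp("<37>", 120.0), 1))  # 45.0
```

Because the tokens are relative, the same vocabulary covers a 30-second clip and a two-hour video, which is what makes them practical to add to an LLM's tokenizer.

The SlowFast tokens can be pictured as two pooling paths over per-frame visual features: a "fast" path that keeps every frame but pools away spatial detail, and a "slow" path that keeps spatial detail for a sparse subset of frames. Below is a hedged sketch assuming CLIP-like frame features; the pooling sizes, stride, and feature shapes are illustrative assumptions, not LITA's exact configuration.

```python
import torch
import torch.nn.functional as F

def slowfast_tokens(frame_feats: torch.Tensor,
                    slow_stride: int = 4,
                    slow_pool: int = 2) -> torch.Tensor:
    """frame_feats: (T, H, W, C) -> concatenated (N_tokens, C) for the LLM."""
    T, H, W, C = frame_feats.shape

    # Fast pathway: one token per frame via global spatial average pooling,
    # giving dense temporal coverage at coarse spatial resolution.
    fast = frame_feats.mean(dim=(1, 2))                       # (T, C)

    # Slow pathway: sample every `slow_stride`-th frame and keep a
    # `slow_pool` x `slow_pool` grid of spatially pooled tokens per frame.
    sampled = frame_feats[::slow_stride]                       # (T/s, H, W, C)
    slow = F.adaptive_avg_pool2d(
        sampled.permute(0, 3, 1, 2), slow_pool)                # (T/s, C, p, p)
    slow = slow.flatten(2).permute(0, 2, 1).reshape(-1, C)     # (T/s * p*p, C)

    return torch.cat([fast, slow], dim=0)

# Example: 100 frames of 16x16x1024 features.
feats = torch.randn(100, 16, 16, 1024)
tokens = slowfast_tokens(feats)
print(tokens.shape)  # torch.Size([200, 1024]): 100 fast + 25*4 slow tokens
```

The design trade-off is the usual one: the fast tokens preserve the temporal resolution needed to localize events, while the slow tokens preserve enough spatial content to describe what is happening, and together they stay within the LLM's context budget.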