ST-LLM: Large Language Models Are Effective Temporal Learners

2024-03-30 | Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
ST-LLM is a video large language model that models spatial-temporal sequences directly inside the LLM. The paper investigates whether all spatial-temporal tokens can simply be fed into the LLM, delegating video sequence modeling to the language model itself, and finds that this simple approach significantly improves video understanding. Building on this observation, the authors propose ST-LLM, an effective video-LLM baseline with spatial-temporal sequence modeling inside the LLM. To address the overhead and stability issues introduced by uncompressed video tokens, they develop a dynamic masking strategy with tailor-made training objectives, and for particularly long videos they design a global-local input module that balances efficiency and effectiveness. As a result, the LLM performs proficient spatial-temporal modeling while remaining efficient and stable. With a more concise model and training pipeline, ST-LLM achieves new state-of-the-art results on VideoChatGPT-Bench and MVBench. The code is available at https://github.com/TencentARC/ST-LLM.

The main contributions are threefold: ST-LLM is the first open-source video LLM to explore spatial-temporal modeling within the LLM; it introduces a dynamic video token masking strategy coupled with masked video modeling; and it adds a global-local input mechanism for processing long videos. Together, these components keep spatial-temporal tokens within the LLM efficient and robust. Extensive experiments demonstrate that ST-LLM consistently outperforms existing video LLMs across video dialogue benchmarks, especially on tasks demanding strong temporal understanding.

The paper also reviews related work on LLMs and image LLMs, video LLMs, and joint spatial-temporal modeling, and details the architecture and training methodology of ST-LLM. Experiments on MVBench, VideoChatGPT-Bench, and zero-shot video QA benchmarks show that ST-LLM reaches new state-of-the-art performance across these contemporary video benchmarks. Ablation studies and qualitative results indicate that ST-LLM adheres closely to instructions, delivers precise responses, and shows superior sensitivity to temporal sequences and actions. The authors conclude that ST-LLM is a straightforward yet robust video large language model that effectively models spatial-temporal sequences with LLMs while addressing efficiency, stability, and long-video modeling with reduced training resource requirements.
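To make the two mechanisms above more concrete, below is a minimal PyTorch-style sketch of (1) randomly masking a fraction of uncompressed video tokens before they enter the LLM and (2) a global-local input that pairs a pooled global view of all frames with a sampled subset of local frames. The function names, tensor shapes, masking ratio, and pooling choices are illustrative assumptions, not the authors' implementation; consult the paper and the linked repository for the actual method.

```python
# Illustrative sketch only: dynamic video token masking and a global-local
# input for long videos, as described in the summary above. Shapes, ratios,
# and names are assumptions, not the ST-LLM codebase.
import torch


def dynamic_token_masking(video_tokens: torch.Tensor, mask_ratio: float = 0.5):
    """Randomly drop a fraction of spatial-temporal tokens.

    video_tokens: (B, N, D) tokens from a visual encoder, where
                  N = num_frames * tokens_per_frame.
    Returns the kept tokens (B, N_keep, D) and the indices of masked tokens,
    which a masked-video-modeling objective could try to reconstruct.
    """
    B, N, D = video_tokens.shape
    n_keep = max(1, int(N * (1.0 - mask_ratio)))
    # Independent random permutation per sample -> a new mask every step.
    noise = torch.rand(B, N, device=video_tokens.device)
    ids_shuffle = noise.argsort(dim=1)
    ids_keep, ids_mask = ids_shuffle[:, :n_keep], ids_shuffle[:, n_keep:]
    kept = torch.gather(
        video_tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D)
    )
    return kept, ids_mask


def global_local_input(frame_tokens: torch.Tensor, num_local_frames: int = 16):
    """Combine a pooled global view of all frames with a sampled local subset.

    frame_tokens: (B, T, P, D) per-frame patch tokens of a long video.
    Returns a (B, P + T' * P, D) sequence: one globally averaged "frame"
    followed by the tokens of T' uniformly sampled local frames.
    """
    B, T, P, D = frame_tokens.shape
    global_view = frame_tokens.mean(dim=1)                    # (B, P, D)
    idx = torch.linspace(0, T - 1, steps=min(num_local_frames, T)).long()
    local_view = frame_tokens[:, idx].reshape(B, -1, D)       # (B, T'*P, D)
    return torch.cat([global_view, local_view], dim=1)
```

In this sketch, the kept video tokens would be concatenated with the text tokens and passed through the LLM, while the masked indices could drive a masked-video-modeling loss; how ST-LLM actually wires these pieces together is specified in the paper and repository rather than here.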