ST-LLM: Large Language Models Are Effective Temporal Learners

30 Mar 2024 | Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, Ge Li
This paper investigates whether large language models (LLMs) can serve as effective video learners, a task that has been challenging due to the complexity and temporal dynamics of video content. The authors propose ST-LLM, which feeds spatial-temporal video tokens directly into the LLM, leveraging its strong sequence-modeling capability for joint spatial-temporal modeling. To cope with long videos and varying input lengths, they introduce a dynamic video token masking strategy and a global-local input module; together, these improve the efficiency and robustness of ST-LLM and let it handle complex temporal sequences effectively. Extensive experiments on benchmarks including MVBench, VideoChatGPT-Bench, and zero-shot video QA show that ST-LLM outperforms prior approaches to video understanding, with particularly strong results on temporal-dynamics and motion-related tasks. The paper also provides detailed experimental setups, ablation studies, and qualitative results supporting the proposed method.
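To make the dynamic masking idea concrete, below is a minimal PyTorch sketch of randomly masking a fraction of spatial-temporal video tokens before they enter the LLM. The function name `dynamic_token_masking`, the 0.5 mask ratio, and the tensor shapes are illustrative assumptions, not the authors' implementation (the paper pairs masking with tailor-made training objectives).

```python
import torch

def dynamic_token_masking(video_tokens: torch.Tensor, mask_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a subset of spatial-temporal video tokens.

    video_tokens: (batch, num_tokens, dim) tokens from a visual encoder.
    Returns the kept tokens, shape (batch, num_kept, dim).
    NOTE: hypothetical sketch; the paper's actual masking scheme may differ.
    """
    b, n, _ = video_tokens.shape
    num_keep = max(1, int(n * (1.0 - mask_ratio)))
    # Sample an independent random keep-set per sequence.
    noise = torch.rand(b, n, device=video_tokens.device)
    keep_idx = noise.argsort(dim=1)[:, :num_keep]   # (b, num_keep)
    keep_idx = keep_idx.sort(dim=1).values          # preserve temporal order
    batch_idx = torch.arange(b, device=video_tokens.device).unsqueeze(1)
    return video_tokens[batch_idx, keep_idx]        # (b, num_keep, dim)

# Example: 2 videos, 256 tokens each (e.g. 16 frames x 16 patches), dim 1024.
tokens = torch.randn(2, 256, 1024)
kept = dynamic_token_masking(tokens, mask_ratio=0.5)
print(kept.shape)  # torch.Size([2, 128, 1024])
```

Randomly dropping tokens at training time both shortens the sequence the LLM must process and varies the input length seen during training, which is one plausible reading of how the masking strategy improves efficiency and robustness on long videos.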