Elysium: Exploring Object-level Perception in Videos via MLLM


29 Mar 2024 | Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang
The paper "Elysium: Exploring Object-level Perception in Videos via MLLM" by Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang from ByteDance Inc. addresses the challenge of applying Multi-modal Large Language Models (MLLMs) to video-related tasks, such as object tracking. The authors introduce Elysium, an end-to-end trainable MLLM designed to handle both global-level and object-level tasks in videos. To address the limited training data issue, they construct ElysiumTrack-1M, a large-scale video dataset supporting tasks like Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). The dataset contains 1.27 million annotated video frames with corresponding object boxes and descriptions. To enable the MLLM to distinguish individual frames while reducing visual token usage, the authors propose a token-compression model called T-Selector. Extensive experiments demonstrate the effectiveness of Elysium in downstream tasks such as Image Grounding, Video QA, SOT, RSOT, and Video-REG. The paper also includes ablation studies and visualizations to support the proposed approach.The paper "Elysium: Exploring Object-level Perception in Videos via MLLM" by Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang from ByteDance Inc. addresses the challenge of applying Multi-modal Large Language Models (MLLMs) to video-related tasks, such as object tracking. The authors introduce Elysium, an end-to-end trainable MLLM designed to handle both global-level and object-level tasks in videos. To address the limited training data issue, they construct ElysiumTrack-1M, a large-scale video dataset supporting tasks like Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). The dataset contains 1.27 million annotated video frames with corresponding object boxes and descriptions. To enable the MLLM to distinguish individual frames while reducing visual token usage, the authors propose a token-compression model called T-Selector. Extensive experiments demonstrate the effectiveness of Elysium in downstream tasks such as Image Grounding, Video QA, SOT, RSOT, and Video-REG. The paper also includes ablation studies and visualizations to support the proposed approach.