29 Mar 2024 | Han Wang, Yanjie Wang, Yongjie Ye, Yuxiang Nie, and Can Huang
Elysium: Exploring Object-level Perception in Videos via MLLM
This paper introduces Elysium, an end-to-end trainable Multi-modal Large Language Model (MLLM) designed to perform object-level tasks in videos, including Single Object Tracking (SOT), Referring Single Object Tracking (RSOT), and Video Referring Expression Generation (Video-REG). To address the challenges of training MLLMs on video data, the authors introduce ElysiumTrack-1M, a large-scale video dataset containing 1.27 million annotated video frames with corresponding object boxes and descriptions. The dataset is derived from the WebVid-10M dataset and is processed to reduce noise and ensure high-quality annotations.
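The summary does not spell out the annotation schema, but a per-object record pairing frame-level boxes with a referring description might look roughly like the sketch below; the field names and normalized [x1, y1, x2, y2] box convention are illustrative assumptions, not the actual ElysiumTrack-1M format.

```python
# Hypothetical, illustrative record for one tracked object in one clip.
# Field names and box convention are assumptions, not the dataset's real schema.
example_record = {
    "video_id": "webvid_0000001",                         # clip sourced from WebVid-10M
    "expression": "the brown dog running on the beach",   # referring description
    "frames": [
        {"frame_idx": 0, "box": [0.12, 0.40, 0.35, 0.78]},
        {"frame_idx": 1, "box": [0.14, 0.41, 0.37, 0.79]},
        # ... one box per annotated frame
    ],
}
```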
To enable the MLLM to distinguish individual frames while reducing visual token use, the authors introduce a visual token compression network called T-Selector. This network offers a trade-off between performance and visual token consumption. Elysium is trained on a combination of image and video data, including the ElysiumTrack-1M dataset, to enhance its performance in object-level tasks and downstream applications like image grounding.
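The summary does not describe T-Selector's internals, but a common way to compress per-frame visual tokens is cross-attention against a small set of learned queries. The sketch below illustrates that generic idea only; the module name, dimensions, and query count are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class TokenCompressor(nn.Module):
    """Illustrative per-frame visual token compressor (not the actual T-Selector).

    Compresses N patch tokens per frame into K learned-query tokens via
    cross-attention, so the LLM consumes K rather than N tokens per frame.
    """

    def __init__(self, dim: int = 1024, num_queries: int = 32, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, N, dim) patch tokens from the visual encoder.
        batch = frame_tokens.shape[0]
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)    # (batch, K, dim)
        compressed, _ = self.attn(q, frame_tokens, frame_tokens)
        return self.norm(compressed)                            # (batch, K, dim)

# Example: 256 patch tokens per frame compressed to 32 tokens.
tokens = torch.randn(2, 256, 1024)
print(TokenCompressor()(tokens).shape)  # torch.Size([2, 32, 1024])
```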
The authors evaluate Elysium on various tasks, including image grounding, VideoQA, SOT, RSOT, and Video-REG. Results show that Elysium achieves state-of-the-art performance on these tasks, and in zero-shot settings it remains comparable to baseline methods. However, the authors note that Elysium's performance is less satisfactory on datasets containing small objects.
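The summary does not list the exact metrics, but tracking benchmarks for SOT-style tasks are typically scored by comparing predicted and ground-truth boxes with intersection-over-union (IoU). The snippet below is a generic reference implementation of that overlap measure, not code from the paper.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in [x1, y1, x2, y2] format.

    IoU underlies most SOT success metrics; this is a generic reference
    implementation, not the paper's evaluation code.
    """
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou([0.1, 0.1, 0.5, 0.5], [0.2, 0.2, 0.6, 0.6]))  # ~0.39
```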
The authors also conduct ablation studies to explore how the adapter architecture and the number of visual tokens passed from the visual encoder to the LLM affect object perception performance on images. The results show that the T-Selector network outperforms other compression methods. The authors also investigate the influence of the compression ratio on final performance and find that higher compression ratios degrade performance.
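To make the trade-off concrete, consider how the per-video token budget shrinks as the compression ratio grows; the frame count and per-frame token counts below are illustrative assumptions, not the paper's configuration.

```python
# Illustrative token-budget arithmetic (assumed numbers, not the paper's settings).
patch_tokens_per_frame = 256   # e.g., a ViT producing a 16x16 grid of patch tokens
frames = 8                     # frames sampled from the clip

for tokens_per_frame in (256, 64, 32, 16):   # increasing compression ratio
    ratio = patch_tokens_per_frame // tokens_per_frame
    total = tokens_per_frame * frames
    print(f"compression {ratio:>2}x -> {tokens_per_frame:>3} tokens/frame, "
          f"{total:>4} visual tokens for {frames} frames")
```

Higher ratios leave more of the LLM's context window for text and additional frames, which is the benefit the T-Selector trades against the performance drop observed at aggressive compression.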
Overall, the authors demonstrate that MLLMs exhibit remarkable object perception abilities in videos. The results validate the effectiveness and potential of their proposed approach in leveraging MLLMs for object-level perception tasks. The authors also acknowledge the need to explore other tasks, such as Video Object Segmentation (VOS) and Referring Video Object Segmentation (RVOS), in future work.