2024-06-16 | Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
LLaVA-NeXT-Interleave is a large multimodal model that addresses multi-image, multi-frame (video), and multi-view (3D) scenarios. It uses an interleaved data format to unify these settings, together with multi-patch (single-image) input, under a single training recipe. To train the model, the authors compiled the M4-Instruct dataset, containing 1,177.6k samples spanning the four domains and 14 tasks, and built the LLaVA-Interleave Bench to evaluate multi-image performance.

Training follows an interleaved visual instruction tuning approach: continuing from single-image checkpoints, mixing interleaved data formats during training, and combining data from different scenarios to improve individual task performance. Evaluated across the four scenarios (multi-image, multi-frame, multi-view, and multi-patch), LLaVA-NeXT-Interleave achieves state-of-the-art results on multi-image, video, and 3D benchmarks, both in-domain and out-of-domain, while maintaining single-image performance. It also exhibits emerging capabilities, such as transferring tasks across different settings and modalities, along with promising real-world applications. The work highlights the potential of LLaVA-NeXT-Interleave to unify and advance the capabilities of large multimodal models across diverse visual tasks.
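To make the interleaved data format concrete, below is a minimal sketch of what a multi-image training sample might look like, assuming a LLaVA-style conversation schema with `<image>` placeholder tokens; the field names and file paths are illustrative, not the exact M4-Instruct specification.

```python
# Hypothetical interleaved multi-image sample (illustrative schema only).
sample = {
    "id": "multi_image_example_0",
    # Multiple visual inputs: these could be separate images, video frames,
    # multi-view renders of a 3D scene, or patches of a single large image.
    "images": ["scene_view_1.jpg", "scene_view_2.jpg"],
    "conversations": [
        {
            "from": "human",
            # <image> placeholders are interleaved with the text; each one
            # is replaced by the visual tokens of the corresponding image
            # when the prompt is tokenized.
            "value": "First view: <image>\nSecond view: <image>\n"
                     "What changed between the two views?",
        },
        {
            "from": "gpt",
            "value": "The chair in the corner has been moved next to the window.",
        },
    ],
}
```

Under this framing, video, 3D, and single-image patch tasks all reduce to the same structure, differing only in how the image list is populated, which is what allows one model to be tuned jointly on all four scenarios.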