LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models


28 Jul 2024 | Feng Li*, Renrui Zhang*, Hao Zhang*, Yuanhan Zhang*, Bo Li, Wei Li, Zejun Ma, Chunyuan Li
This paper introduces **LLaVA-NeXT-Interleave**, a large multimodal model that handles multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios within one model. It leverages an interleaved image-text data format to unify these tasks, simplifying training across domains and enabling emerging capabilities such as cross-task transfer.

**Key Contributions:**
1. **Interleaved Data Format:** A single interleaved format unifies multi-image, video, 3D, and single-image tasks in one LMM.
2. **M4-Instruct Dataset:** A comprehensive instruction-tuning dataset with 1,177.6k samples spanning 4 primary domains (multi-image, video, 3D, and single-image) and 14 tasks drawn from 41 datasets.
3. **LLaVA-Interleave Bench:** A diverse benchmark suite for evaluating multi-image performance, combining 7 newly curated and 13 existing in-domain and out-of-domain benchmarks.
4. **State-of-the-Art Performance:** Leading results on multi-image, video, and 3D benchmarks while maintaining single-image performance.
5. **Emerging Capabilities:** Task transfer across settings and modalities, e.g., from spotting image differences to video captioning.

**Methods:**
- **Task Overview:** Multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) inputs are all represented as interleaved sequences of images and text (see the sketch below).
- **M4-Instruct Dataset:** Curated with 1,177.6k samples covering the 4 primary domains and 14 tasks.
- **LLaVA-Interleave Bench:** Comprises 13 challenging tasks with about 17K instances, split into in-domain and out-of-domain evaluations.
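The core idea of the interleaved format is that every visual setting reduces to an ordered list of images placed inline with the text. Below is a minimal sketch of one way such a format could look; the `<image>` placeholder token, field names, and helper functions are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (assumptions, not the paper's code) of an interleaved
# text-image format that covers all four settings: multi-image, video frames,
# 3D multi-view renders, and single-image patches. Each visual input becomes
# an ordered list of images referenced inline by a placeholder token.

from dataclasses import dataclass
from typing import List

IMAGE_TOKEN = "<image>"  # assumed placeholder that the model expands into visual features

@dataclass
class InterleavedSample:
    images: List[str]   # paths (or arrays) for every image, frame, view, or patch, in order
    prompt: str         # text containing one IMAGE_TOKEN per entry in `images`

def make_multi_image(paths: List[str], question: str) -> InterleavedSample:
    """Multi-image: each image gets its own placeholder, interleaved with the text."""
    prompt = " ".join(IMAGE_TOKEN for _ in paths) + f"\n{question}"
    return InterleavedSample(images=paths, prompt=prompt)

def make_video(frame_paths: List[str], question: str) -> InterleavedSample:
    """Video: sampled frames are treated exactly like an ordered list of images."""
    return make_multi_image(frame_paths, question)

def make_3d(view_paths: List[str], question: str) -> InterleavedSample:
    """3D: multi-view renders of a scene are likewise just an image sequence."""
    return make_multi_image(view_paths, question)

if __name__ == "__main__":
    sample = make_video([f"frame_{i}.jpg" for i in range(8)],
                        "Write a short caption describing what happens.")
    print(sample.prompt)
```

Because videos, 3D scenes, and high-resolution single images all collapse into the same image-sequence representation, one training pipeline and one model can serve all four domains, which is what enables the cross-task transfer described below.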
**Experiments:**
- **Multi-image Results:** LLaVA-NeXT-Interleave outperforms previous models on both in-domain and out-of-domain benchmarks.
- **Multi-frame (Video) Results:** Achieves superior results on video benchmarks, demonstrating effective temporal understanding.
- **Multi-view (3D) Results:** Obtains leading results on 3D perception tasks.
- **Multi-patch (Single-image) Results:** Maintains single-image performance while improving multi-image tasks.

**Emerging Capabilities:**
- **Task Transfer from Single-image to Multi-image:** e.g., identifying the humorous element shared across multiple images.
- **Task Transfer from Image to Video:** e.g., writing a Twitter post based on a video.
- **Real-world Applications:** Recognizing painting styles, summarizing slide decks (PPTs), and performing multi-document VQA.

**Conclusion:** LLaVA-NeXT-Interleave sets a new standard for handling diverse visual tasks, covering multi-image, video, 3D, and single-image inputs within a single model.