**LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models**
This paper introduces LLaVA-NeXT-Interleave, a large multimodal model that addresses multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios. The model leverages an interleaved data format to unify different tasks, simplifying training across various domains and enabling emerging capabilities such as cross-task transfer.
**Key Contributions:**
1. **Interleaved Data Format:** The interleaved format unifies different tasks, including multi-image, video, 3D, and single-image, into a single LMM.
2. **M4-Instruct Dataset:** A comprehensive dataset with 1,177.6k samples spanning 4 primary domains (multi-image, video, 3D, and single-image) and 14 tasks across 41 datasets.
3. **LLaVA-Interleave Bench:** A diverse set of benchmarks to evaluate multi-image performance, including 7 new and 13 existing in/out-domain benchmarks.
4. **State-of-the-Art Performance:** LLaVA-NeXT-Interleave achieves leading results in multi-image, video, and 3D benchmarks while maintaining single-image performance.
5. **Emerging Capabilities:** The model demonstrates capabilities such as task transfer across different settings and modalities, e.g., from image differences to video captioning.
**Methods:**
- **Task Overview:** The model handles multi-image, multi-frame (video), multi-view (3D), and multi-patch (single-image) scenarios.
- **M4-Instruct Dataset:** Curated with 1,177.6k samples, covering 4 primary domains and 14 tasks.
- **LLaVA-Interleave Bench:** Comprises 13 challenging tasks with 17K instances, including in-domain and out-domain evaluations.
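The core idea of the interleaved format is that every task, whether multi-image, video frames, 3D views, or single-image patches, reduces to one sequence of image placeholders mixed with text. The sketch below illustrates this under stated assumptions; the token name and helper function are hypothetical and not taken from the LLaVA codebase.

```python
# Illustrative sketch of an interleaved data format: each visual input
# becomes a placeholder token woven into the text. Names are hypothetical,
# not the authors' actual implementation.

IMAGE_TOKEN = "<image>"

def to_interleaved(visuals, text_segments):
    """Interleave one placeholder per visual input with text segments.

    visuals: list of per-task visual inputs (images, frames, views, or patches)
    text_segments: list of strings; must have len(visuals) + 1 entries,
        so text can appear before, between, and after the visuals.
    """
    assert len(text_segments) == len(visuals) + 1
    parts = [text_segments[0]]
    for _vis, txt in zip(visuals, text_segments[1:]):
        parts.append(IMAGE_TOKEN)
        parts.append(txt)
    return "".join(parts)

# Multi-image: two images compared in a single prompt.
prompt_mi = to_interleaved(
    ["img_a", "img_b"],
    ["What changed between ", " and ", "?"],
)
# -> "What changed between <image> and <image>?"

# Video: four frames become four consecutive placeholders, then the question.
frames = [f"frame_{i}" for i in range(4)]
prompt_video = to_interleaved(frames, [""] * 4 + ["Describe the action."])
```

Because all four domains serialize to this same shape, one model can be trained on all of them jointly, which is what enables the cross-task transfer described below.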
**Experiments:**
- **Multi-image Results:** LLaVA-NeXT-Interleave outperforms previous models in both in- and out-domain benchmarks.
- **Multi-frame (Video) Results:** Achieves superior results on video benchmarks, demonstrating effective temporal understanding.
- **Multi-view (3D) Results:** Obtains leading results in 3D perception tasks.
- **Multi-patch (Single-image) Results:** Maintains single-image performance while enhancing multi-image tasks.
**Emerging Capabilities:**
- **Task Transfer from Single-image to Multi-image:** Identifies the humorous element across multiple images, extending a skill learned on single images.
- **Task Transfer from Image to Video:** Writes a Twitter post based on a video.
- **Real-world Applications:** Recognizes painting styles, summarizes PPTs, and performs multi-doc VQA.
**Conclusion:**
LLaVA-NeXT-Interleave sets a new standard in handling diverse visual tasks.