23 May 2024 | Dongfu Jiang*, Xuan He*, Huaye Zeng*, Cong Wei*, Max Ku*, Qian Liu*, Wenhu Chen*
**MANTIS: Interleaved Multi-Image Instruction Tuning**
Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
University of Waterloo, Tsinghua University, Sea AI Lab
This paper addresses the challenge of improving large multimodal models (LMMs) on multi-image visual language tasks. Existing LMMs such as OpenFlamingo, Emu2, and Idefics rely on extensive pre-training on noisy interleaved image-text data, which is both inefficient and ineffective. To overcome this, the authors propose MANTIS, a family of models trained through instruction tuning with academic-level resources. They construct MANTIS-INSTRUCT, a dataset of 721K multi-image instruction examples, and use it to train the MANTIS models, which are designed to acquire skills such as co-reference, comparison, reasoning, and temporal understanding. MANTIS is evaluated on five multi-image benchmarks and seven single-image benchmarks, achieving state-of-the-art performance on all multi-image tasks and outperforming the strongest baseline, Idefics2-8B, by an average of 11 absolute points. Notably, MANTIS performs equally well on held-in and held-out benchmarks, demonstrating strong generalization. The authors also find that MANTIS matches the performance of GPT-4V on multi-image benchmarks while maintaining strong single-image performance, comparable to CogVLM and Emu2. The results suggest that multi-image abilities can be gained effectively through low-cost instruction tuning rather than massive pre-training. The work provides new insights into improving LMMs' multi-image capabilities and offers a valuable baseline for future research.
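To make the idea of an "interleaved multi-image instruction" example concrete, here is a minimal sketch of what a single training example of this kind could look like. The field names, the `<image>` placeholder convention, and the file names are assumptions made purely for illustration; they are not taken from the actual MANTIS-INSTRUCT format.

```python
# A minimal, hypothetical sketch of one interleaved multi-image instruction
# example. Field names and the <image> placeholder convention are assumptions,
# not the paper's actual data schema.
example = {
    "images": ["frame_0.jpg", "frame_1.jpg"],  # hypothetical image files
    "conversation": [
        {
            "role": "user",
            "content": (
                "Here are two frames: <image> and <image>. "
                "Which one shows the later moment, and why?"
            ),
        },
        {
            "role": "assistant",
            "content": (
                "The second frame is later: the cup that is upright in the "
                "first frame has already tipped over."
            ),
        },
    ],
}

# "Interleaved" means the <image> placeholders sit inline with the text, so the
# model must resolve references like "the first frame" across several images,
# which exercises co-reference, comparison, and temporal understanding.
for turn in example["conversation"]:
    print(f'{turn["role"]}: {turn["content"]}')
```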