23 May 2024 | Dongfu Jiang, Xuan He, Huaye Zeng, Cong Wei, Max Ku, Qian Liu, Wenhu Chen
MANTIS is a new family of large multimodal models designed to process interleaved text-image inputs. The models are fine-tuned on MANTIS-INSTRUCT, a dataset of 721K examples covering four key multi-image skills: co-reference, reasoning, comparison, and temporal understanding. MANTIS achieves state-of-the-art performance on five multi-image benchmarks, outperforming the strongest multi-image baseline, Idefics2-8B, by an average of 11 absolute points, while matching the single-image performance of CogVLM and Emu2. Its architecture supports multi-image inputs and uses a text-image interleaving format to enhance multi-image understanding. Crucially, the results show that strong multi-image abilities can be achieved through instruction tuning on high-quality data rather than massive multi-image pre-training: MANTIS performs well on both held-in and held-out benchmarks, indicating that such tuning can generalize better than pre-training on much larger datasets. Code, data, and models are released to support reproducibility.
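To make the "text-image interleaving format" concrete, here is a minimal sketch of how a multi-image conversation might be flattened into a single prompt with indexed image placeholders. The function name, segment schema, and `<image N>` token are illustrative assumptions, not MANTIS's actual preprocessing code.

```python
# Hypothetical sketch of an interleaved text-image prompt builder.
# The "<image N>" placeholder convention is an assumption for illustration;
# the real MANTIS pipeline may tokenize images differently.

def build_interleaved_prompt(segments):
    """Flatten ordered text/image segments into one prompt string,
    replacing each image with an indexed placeholder token."""
    parts = []
    image_count = 0
    for seg in segments:
        if seg["type"] == "text":
            parts.append(seg["value"])
        elif seg["type"] == "image":
            image_count += 1
            parts.append(f"<image {image_count}>")
    return " ".join(parts), image_count

# Example: a comparison query referencing two images by position.
prompt, n_images = build_interleaved_prompt([
    {"type": "text", "value": "Compare the two photos:"},
    {"type": "image", "value": "photo_a.png"},
    {"type": "image", "value": "photo_b.png"},
    {"type": "text", "value": "Which one shows a cat?"},
])
```

Keeping images inline at their original positions, rather than prepending them all, is what lets the model resolve positional co-references such as "the second image".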