MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

2024 | Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
MMDU is a comprehensive benchmark and instruction-tuning dataset designed to evaluate and improve the multi-turn, multi-image dialog understanding capabilities of Large Vision-Language Models (LVLMs). The benchmark comprises 110 high-quality multi-image, multi-turn dialogues with over 1,600 questions, each paired with a detailed long-form answer. With up to 20 images, 17 turns, and 18k image-and-text tokens per dialogue, it is significantly more challenging than previous benchmarks.

MMDU-45k, the accompanying large-scale instruction-tuning dataset, was created to strengthen LVLMs' dialog understanding abilities. It contains 45k instruction-tuning samples, with an average image-and-text token length of 5k and a maximum of 17k tokens.

The MMDU benchmark was constructed by running a clustering algorithm over open-source Wikipedia to select relevant images and text descriptions, prompting GPT-4o to generate responses, and having human annotators refine those responses into ground-truth answers. The benchmark evaluates LVLMs along six dimensions: Creativity, Richness, Visual Perception, Logical Coherence, Answer Accuracy, and Image Relationship Understanding. Scoring uses GPT-4o as a judge, which grades model outputs against the reference answers; a minimal sketch of this judging setup follows.
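To make the judging protocol concrete, here is a minimal sketch of reference-guided, rubric-based scoring with GPT-4o via the OpenAI Python SDK. The prompt wording, the 0-10 scale, and the `judge_answer` helper are illustrative assumptions, not the paper's exact evaluation prompt.

```python
# A minimal sketch of reference-guided LLM-as-judge scoring, assuming the
# OpenAI Python SDK and a JSON reply; the prompt wording and 0-10 scale
# are illustrative, not the paper's exact judging prompt.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIMENSIONS = [
    "Creativity", "Richness", "Visual Perception",
    "Logical Coherence", "Answer Accuracy", "Image Relationship Understanding",
]

def judge_answer(question: str, reference: str, candidate: str) -> dict:
    """Ask GPT-4o to score a candidate answer against the reference answer
    on each dimension, returning a dimension -> score dictionary."""
    prompt = (
        "You are grading a model's answer against a reference answer.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {candidate}\n"
        f"Score the candidate from 0 to 10 on each of: {', '.join(DIMENSIONS)}. "
        "Reply with a JSON object mapping each dimension to an integer score."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)
```

In a full evaluation loop, per-dimension scores would be averaged across all questions in a dialogue and then across the benchmark's 110 dialogues.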
The MMDU-45k dataset was created with a similar pipeline, substituting random-sample human verification for exhaustive human evaluation. Fine-tuning LVLMs on MMDU-45k improved performance on a range of benchmarks, including MMDU, MMStar, MathVista, and ChartQA, with the largest gains in multi-image and multi-turn dialog understanding (a sketch of how such dialogues can be flattened into training messages appears at the end of this summary).

The study also highlights the performance gap between closed-source and open-source LVLMs, attributing the lag of open-source models to limited instruction-tuning data; MMDU and MMDU-45k are offered as resources to help the open-source community close that gap. Both datasets are available for download, and the authors stress the need for comprehensive evaluation frameworks that reflect real-world application demands. Acknowledged limitations include the English-only focus and lack of multilingual support, along with the potential societal impact of biased or inaccurate models.
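To illustrate the fine-tuning step referenced above, here is a hedged sketch of converting one multi-image, multi-turn record into chat-format training messages. The record schema (`conversations`, `num_new_images`) and the `<image>` placeholder token are hypothetical stand-ins; the released MMDU-45k format may differ.

```python
# A hedged sketch of flattening one MMDU-45k-style record into chat-format
# training messages; the record schema below is a hypothetical stand-in,
# not the released dataset format.
from typing import Any

IMAGE_TOKEN = "<image>"  # placeholder many LVLM trainers splice image features into

def to_training_messages(record: dict[str, Any]) -> list[dict[str, str]]:
    """Turn a multi-image, multi-turn record into an alternating
    user/assistant message list, keeping image placeholders inline."""
    messages = []
    for turn in record["conversations"]:
        content = turn["text"]
        # Prepend one placeholder per image introduced at this turn, if any.
        content = IMAGE_TOKEN * turn.get("num_new_images", 0) + content
        messages.append({"role": turn["role"], "content": content})
    return messages

example = {
    "images": ["castle_1.jpg", "castle_2.jpg"],
    "conversations": [
        {"role": "user", "text": "Compare the two castles.", "num_new_images": 2},
        {"role": "assistant", "text": "The first castle is Gothic, while..."},
    ],
}
print(to_training_messages(example))
```

Keeping image placeholders inline preserves turn ordering, so a trainer can align image features with the exact point in the dialogue where each image was introduced.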