29 Oct 2024 | Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang
The paper introduces MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction-tuning dataset, designed to evaluate and improve Large Vision-Language Models (LVLMs) in multi-turn, multi-image conversations. MMDU reaches up to 18k image-text tokens, 20 images, and 27 turns per dialogue, substantially longer than previous benchmarks and challenging for current LVLMs. The benchmark is constructed with a clustering algorithm that selects related images and textual descriptions from Wikipedia; GPT-4o generates candidate dialogues, which human annotators then refine. Evaluation reveals a significant performance gap between closed-source and open-source LVLMs, highlighting the need for more extensive instruction-tuning data. Fine-tuning on MMDU-45k, which contains 45k high-quality examples, further improves LVLM performance on MMDU and on existing benchmarks, demonstrating the effectiveness of training on multi-turn, multi-image inputs. The project aims to bridge the gap between current LVLMs and real-world application demands, providing valuable insights for future research and development.
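To make the clustering-based selection step concrete, here is a minimal illustrative sketch (not the authors' released pipeline) of how related Wikipedia entries could be grouped before dialogue generation. The embedding source, cluster count, and helper names are hypothetical assumptions; the cap of 20 images per group mirrors the maximum reported for MMDU.

```python
# Hypothetical sketch: cluster image/description embeddings so that each
# cluster can seed one multi-image, multi-turn dialogue.
import numpy as np
from sklearn.cluster import KMeans

def select_related_groups(embeddings: np.ndarray,
                          n_clusters: int = 100,
                          max_per_group: int = 20) -> list[list[int]]:
    """Cluster entry embeddings and return up to `max_per_group` indices per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init="auto", random_state=0).fit_predict(embeddings)
    groups = []
    for c in range(n_clusters):
        members = np.flatnonzero(labels == c)[:max_per_group]
        if len(members) >= 2:  # a multi-image dialogue needs at least two related images
            groups.append(members.tolist())
    return groups

# Usage: `embeddings` could come from any image-text encoder applied to the
# Wikipedia images/descriptions; each returned group would then be passed to
# GPT-4o to draft a dialogue, later refined by human annotators.
```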