MIA-Bench: Towards Better Instruction Following Evaluation of Multimodal LLMs

25 Jul 2024 | Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
MIA-Bench is a new benchmark designed to evaluate the ability of multimodal large language models (MLLMs) to strictly adhere to complex instructions. The benchmark consists of 400 image-prompt pairs, each crafted to challenge models' compliance with layered instructions in generating accurate responses that satisfy specific requested patterns. Evaluation results from a wide array of state-of-the-art MLLMs reveal significant variations in performance, highlighting areas for improvement in instruction fidelity. Additionally, we create extra training data and explore supervised fine-tuning to enhance the models' ability to strictly follow instructions without compromising performance on other tasks. The benchmark aims to serve as a tool for measuring MLLM adherence to instructions and guide future developments in MLLM training methods.
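To give a sense of what a layered image-prompt pair involves, below is a minimal sketch of how one benchmark example might be represented. The field names, example prompt, and sub-instruction breakdown are illustrative assumptions, not the benchmark's released schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MIABenchExample:
    """One image-prompt pair with its layered sub-instructions (fields are illustrative)."""
    image_path: str                # e.g. a photo of a landmark, animal, artwork, ...
    prompt: str                    # compound instruction shown to the MLLM
    sub_instructions: List[str] = field(default_factory=list)  # atomic requirements to verify


# Hypothetical example in the spirit of the benchmark's compositional prompts.
example = MIABenchExample(
    image_path="images/landmark_001.jpg",
    prompt=("Describe the landmark in exactly three sentences, "
            "mention the weather, and do not use the word 'beautiful'."),
    sub_instructions=[
        "response describes the landmark in the image",
        "response contains exactly three sentences",
        "response mentions the weather",
        "response avoids the word 'beautiful'",
    ],
)
print(example.prompt)
```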
MIA-Bench includes diverse image contents such as animals, food, landmarks, sports, art, landscapes, and text to cover a broad spectrum of real-world scenarios. The benchmark's instructions are designed to test the models' ability to follow complex, compositional instructions, spanning categories such as description, length limit, mention, genre, grammar, math, perspective, and OCR. The instructions vary in complexity and are tailored to probe the models' linguistic dexterity, grammatical accuracy, and descriptive fidelity.

A wide array of MLLMs were evaluated on MIA-Bench, including closed-source models such as GPT-4o, Gemini Pro, and Claude-3, as well as open-source models such as LLaVA-NeXT, InternVL-Chat-V1.5, and CogVLM2. Results show that GPT-4o achieved the highest score, demonstrating superior instruction adherence. Other models, such as Reka and GPT-4V, also performed well in generating coherent and contextually appropriate text.

To address the challenges in instruction adherence, we generate training data tailored for supervised fine-tuning (SFT), aiming to refine the models' ability to process and comply with multifaceted instructions. Results from our SFT experiments indicate a promising improvement in the models' ability to strictly adhere to instructions, without hurting performance on other benchmarks. Our contributions include the construction of MIA-Bench, a new benchmark to comprehensively evaluate MLLMs on their capability to strictly adhere to instructions, and a detailed analysis of popular MLLMs, along with training methods for enhanced instruction following. MIA-Bench will be open-sourced, and we hope this benchmark can serve as a useful resource to stimulate further research on multimodal instruction adherence.
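To make the evaluation described above concrete, here is a minimal sketch of combining per-sub-instruction scores into a single score for one response. In practice the individual scores would come from a judge model or rule-based checks (e.g. counting sentences for a length limit); the weighting scheme shown is an assumption for illustration, not necessarily the paper's exact metric.

```python
from typing import Dict


def aggregate_instruction_score(sub_scores: Dict[str, float],
                                weights: Dict[str, float]) -> float:
    """Combine per-sub-instruction scores (each in [0, 1]) into one weighted score.

    Sub-instructions missing from `weights` default to a weight of 1.0.
    """
    total_weight = sum(weights.get(name, 1.0) for name in sub_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * weights.get(name, 1.0) for name, score in sub_scores.items())
    return weighted / total_weight


# Hypothetical usage: a response that satisfied the description, length-limit,
# and mention requirements but violated the word-avoidance constraint.
score = aggregate_instruction_score(
    sub_scores={"description": 1.0, "length limit": 1.0, "mention": 1.0, "avoid word": 0.0},
    weights={"description": 2.0, "length limit": 1.0, "mention": 1.0, "avoid word": 1.0},
)
print(f"{score:.2f}")  # 0.80
```

Averaging these per-example scores across all 400 image-prompt pairs would then yield a single benchmark-level number for each model, which is how leaderboard-style comparisons are typically reported.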