25 Jul 2024 | Yusu Qian, Hanrong Ye, Jean-Philippe Fauconnier, Peter Grasch, Yinfei Yang, Zhe Gan
MIA-Bench is a new benchmark designed to evaluate how strictly multimodal large language models (MLLMs) adhere to complex instructions. The benchmark consists of 400 image-prompt pairs, each crafted to challenge models' compliance with layered and compositional instructions. The evaluation metric measures the precision with which MLLMs execute detailed, layered instructions, checking that responses not only align with the general intent of a prompt but also satisfy its specific requirements. The benchmark spans diverse image contents and a variety of instruction categories, such as description, length limit, mention, genre, grammar, math, perspective, and OCR. The performance of various closed-source and open-source MLLMs is assessed, revealing significant variation and clear room for improvement. To enhance instruction adherence, the authors propose supervised fine-tuning (SFT) on additional training data, which shows promising results in improving instruction-following without compromising performance on other tasks. MIA-Bench aims to serve as a tool for measuring MLLM adherence to instructions and for guiding future developments in training methods.
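To make the metric concrete, here is a minimal sketch of how a layered-instruction score could be computed. This summary does not specify the paper's exact scoring procedure, so the decomposition into weighted sub-instructions, the `SubInstruction` type, and the `mia_style_score` function are all illustrative assumptions, not the benchmark's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class SubInstruction:
    # Hypothetical representation of one component of a compositional prompt,
    # e.g. "length limit", "mention", or "genre".
    name: str
    weight: float  # assumed relative importance of this component
    score: float   # judged compliance with this component, in [0, 1]

def mia_style_score(components: list[SubInstruction]) -> float:
    """Weighted average of per-component compliance scores (illustrative only)."""
    total_weight = sum(c.weight for c in components)
    return sum(c.weight * c.score for c in components) / total_weight

# Example: a prompt asking for a poem (genre) about the image that mentions
# a specific object (mention) in at most 50 words (length limit).
response_scores = [
    SubInstruction("genre", weight=1.0, score=1.0),
    SubInstruction("mention", weight=1.0, score=1.0),
    SubInstruction("length limit", weight=1.0, score=0.0),  # limit exceeded
]
print(f"overall: {mia_style_score(response_scores):.2f}")  # overall: 0.67
```

The point of such a decomposition is that it penalizes partial compliance: a response that captures the general intent but violates one sub-instruction (here, the length limit) still loses credit, which is exactly the behavior the benchmark is designed to surface.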