MILEBENCH: Benchmarking MLLMs in Long Context

15 May 2024 | Dingjie Song, Shunian Chen, Guiming Hardy Chen, Fei Yu, Xiang Wan, Benyou Wang*
The paper introduces MILEBENCH, a benchmark designed to evaluate the long-context and multi-image capabilities of Multimodal Large Language Models (MLLMs). Existing benchmarks typically focus on single-image, short-text samples, which do not capture the complexity of real-world scenarios. MILEBENCH consists of two evaluation sets: a Diagnostic Evaluation, which assesses long-context recall abilities, and a Realistic Evaluation, which stress-tests models under real-world conditions with temporal and semantic multi-image tasks. The benchmark comprises 6,440 multimodal long-context samples, with an average of 15.2 images and 422.3 words per sample. Experimental results from 22 models show that the closed-source GPT-4o outperforms the others, while most open-source MLLMs struggle in long-context scenarios, with the performance gap widening as the number of images increases. The study highlights the need for further research to enhance MLLMs' long-context capabilities, especially in multi-image scenarios.