MILEBENCH is a benchmark designed to evaluate the long-context and multi-image capabilities of Multimodal Large Language Models (MLLMs). It comprises two evaluation sets: a diagnostic set, which probes a model's ability to retrieve information from long contexts, and a realistic set, which tests performance in real-world scenarios involving multiple images and long contexts. In total, the benchmark contains 6,440 multimodal long-context samples drawn from 21 datasets, averaging 15.2 images and 422.3 words per sample, and covers task types such as temporal multi-image tasks, semantic multi-image tasks, and dialogue tasks.

The results show that closed-source models such as GPT-4o outperform open-source models on long-context tasks, while most open-source models struggle, and the performance gap tends to widen as the number of images increases. The benchmark also accounts for issues such as data contamination and the "lost in the middle" phenomenon in long-context evaluation. Although proprietary models still lead on the realistic evaluation, there remains clear room for improvement. Overall, MILEBENCH provides a comprehensive assessment of MLLMs in multi-image long-context scenarios, offering insight into their capabilities and limitations and highlighting the need for further research on long-context performance, especially in multi-image settings.
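
As a rough illustration only (not the benchmark's actual data format or evaluation code), the sketch below shows how a multi-image long-context sample and a simple accuracy-based evaluation loop might be represented in Python. The `MileBenchSample` structure, `load`-style field names, and the `model.generate(images, prompt)` interface are all assumptions introduced here for clarity; the real benchmark uses its own formats and task-specific metrics.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MileBenchSample:
    """Hypothetical representation of one multi-image long-context sample."""
    task_type: str          # e.g. "temporal", "semantic", or "dialogue"
    image_paths: List[str]  # roughly 15 images per sample on average
    context: str            # interleaved textual context (~422 words on average)
    question: str
    answer: str             # gold answer used for scoring

def evaluate(model, samples: List[MileBenchSample]) -> float:
    """Score a model with simple exact-match accuracy over the samples.

    `model.generate(images, prompt)` is an assumed interface; real MLLM APIs
    differ, and the actual benchmark applies task-specific metrics rather
    than exact match alone.
    """
    correct = 0
    for sample in samples:
        prompt = f"{sample.context}\n\nQuestion: {sample.question}"
        prediction = model.generate(sample.image_paths, prompt)
        if prediction.strip().lower() == sample.answer.strip().lower():
            correct += 1
    return correct / len(samples) if samples else 0.0
```

A harness along these lines would report a single accuracy per task type, which is the granularity at which the diagnostic and realistic comparisons described above are made.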