The paper introduces MMNeedle, a benchmark for evaluating the long-context capabilities of multimodal large language models (MLLMs). The benchmark assesses an MLLM's ability to locate a target sub-image (the needle) within a set of images (the haystack) based on textual instructions and descriptions of image contents. MMNeedle comprises 40,000 images, 560,000 captions, and 280,000 needle-haystack pairs, covering diverse settings with varying context lengths, single and multiple needles, and both positive and negative samples. The benchmark defines a set of evaluation metrics, namely existence accuracy, index accuracy, and exact accuracy, to evaluate MLLMs holistically at the sequence, image, and sub-image levels, and it covers a wide range of both state-of-the-art API-based and open-source MLLMs. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios but suffers from hallucination on negative samples, i.e., when no needle is present in the haystack. The evaluation also exposes a considerable performance gap between API-based and open-source models.
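To make the three metric levels concrete, the following is a minimal sketch of how they could be computed, assuming each model prediction has been parsed into either None (needle reported absent) or an (image_index, row, column) tuple; the data layout, function names, and the restriction of index/exact accuracy to positive samples are illustrative assumptions, not the paper's reference implementation.

from dataclasses import dataclass
from typing import Optional, Tuple

# None means "needle not present"; otherwise (image_index, row, col)
# locating the needle cell within the stitched haystack.
Location = Optional[Tuple[int, int, int]]

@dataclass
class Sample:
    target: Location      # ground-truth needle location
    prediction: Location  # model output parsed into the same format

def existence_accuracy(samples: list[Sample]) -> float:
    """Sequence level: did the model correctly say whether a needle exists?"""
    hits = sum((s.prediction is not None) == (s.target is not None) for s in samples)
    return hits / len(samples)

def index_accuracy(samples: list[Sample]) -> float:
    """Image level: on positive samples, did the model pick the right image?"""
    positives = [s for s in samples if s.target is not None]
    hits = sum(
        s.prediction is not None and s.prediction[0] == s.target[0]
        for s in positives
    )
    return hits / len(positives)

def exact_accuracy(samples: list[Sample]) -> float:
    """Sub-image level: right image AND right (row, col) cell."""
    positives = [s for s in samples if s.target is not None]
    hits = sum(s.prediction == s.target for s in positives)
    return hits / len(positives)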
The paper evaluates a range of MLLMs, including API-based models such as GPT-4o and Gemini Pro 1.5, and open-source models such as LLaVA-Llama-3, CogVLM, and mPLUG-Owl-v2. GPT-4o achieves the best index and exact accuracy on multi-image samples, while open-source models generally lag behind. The results also highlight hallucination as a persistent challenge for MLLMs, particularly on negative samples. The paper further discusses the statistical significance of the results and the limitations of the MMNeedle evaluation. It concludes that although API-based models outperform open-source models in long-context scenarios, they still struggle with hallucination on negative samples and with large stitching sizes and multi-needle retrieval. Overall, MMNeedle provides a comprehensive stress test of MLLMs' long-context capabilities.
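To illustrate what the "stitching size" setting refers to, the sketch below shows one way a haystack image could be assembled by tiling sub-images into an N x N grid, with the needle occupying a single cell; the PIL-based implementation, cell resolution, and function name are assumptions for illustration, not the paper's construction pipeline.

from PIL import Image

def stitch_haystack(sub_images: list[Image.Image], grid: int) -> Image.Image:
    """Tile grid*grid sub-images into one haystack image in row-major order.

    The needle is one of the cells; its position is (row, col) =
    divmod(cell_index, grid) within the stitched image.
    """
    assert len(sub_images) == grid * grid, "need exactly grid*grid sub-images"
    cell_w, cell_h = 256, 256  # illustrative cell resolution (assumption)
    canvas = Image.new("RGB", (grid * cell_w, grid * cell_h))
    for idx, img in enumerate(sub_images):
        row, col = divmod(idx, grid)
        canvas.paste(img.resize((cell_w, cell_h)), (col * cell_w, row * cell_h))
    return canvas

Larger grids (e.g., 8 x 8 instead of 2 x 2) shrink each sub-image and pack more distractors into a single image, which is one reason large stitching sizes are reported as especially challenging.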