The paper introduces MMNeedle, a benchmark for evaluating the long-context capabilities of multimodal large language models (MLLMs). The benchmark assesses an MLLM's ability to locate a target sub-image (the needle) within a set of images (the haystack) based on textual instructions and descriptions of image contents. MMNeedle comprises 40,000 images, 560,000 captions, and 280,000 needle-haystack pairs, covering diverse settings with varying context lengths, single and multiple needles, and both positive and negative samples. The benchmark defines a set of evaluation metrics, namely existence accuracy, index accuracy, and exact accuracy, to evaluate MLLMs holistically at the sequence, image, and sub-image levels, and it covers a wide range of both state-of-the-art API-based and open-source MLLMs. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios but suffers from hallucination on negative samples, i.e., when no needle is present in the haystack. The evaluation also exposes a considerable performance gap between API-based and open-source models.
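To make the three metric levels concrete, the following is a minimal sketch of how they could be computed, assuming each model prediction has been parsed into either None (needle reported absent) or an (image_index, row, column) tuple; the data layout, function names, and the restriction of index/exact accuracy to positive samples are illustrative assumptions, not the paper's reference implementation.

from dataclasses import dataclass
from typing import Optional, Tuple

# None means "needle not present"; otherwise (image_index, row, col)
# locating the needle cell within the stitched haystack.
Location = Optional[Tuple[int, int, int]]

@dataclass
class Sample:
    target: Location      # ground-truth needle location
    prediction: Location  # model output parsed into the same format

def existence_accuracy(samples: list[Sample]) -> float:
    """Sequence level: did the model correctly say whether a needle exists?"""
    hits = sum((s.prediction is not None) == (s.target is not None) for s in samples)
    return hits / len(samples)

def index_accuracy(samples: list[Sample]) -> float:
    """Image level: on positive samples, did the model pick the right image?"""
    positives = [s for s in samples if s.target is not None]
    hits = sum(
        s.prediction is not None and s.prediction[0] == s.target[0]
        for s in positives
    )
    return hits / len(positives)

def exact_accuracy(samples: list[Sample]) -> float:
    """Sub-image level: right image AND right (row, col) cell."""
    positives = [s for s in samples if s.target is not None]
    hits = sum(s.prediction == s.target for s in positives)
    return hits / len(positives)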
The paper evaluates a range of MLLMs, including API-based models such as GPT-4o and Gemini Pro 1.5, and open-source models such as LLaVA-Llama-3, CogVLM, and mPLUG-Owl-v2. GPT-4o achieves the best index and exact accuracy on multi-image samples, while open-source models generally lag behind. The results also highlight hallucination as a persistent challenge for MLLMs, particularly on negative samples. The paper further discusses the statistical significance of the results and the limitations of the MMNeedle evaluation. It concludes that although API-based models outperform open-source models in long-context scenarios, they still struggle with hallucination on negative samples and with large stitching sizes and multi-needle retrieval. Overall, MMNeedle provides a comprehensive stress test of MLLMs' long-context capabilities.
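To illustrate what the "stitching size" setting refers to, the sketch below shows one way a haystack image could be assembled by tiling sub-images into an N x N grid, with the needle occupying a single cell; the PIL-based implementation, cell resolution, and function name are assumptions for illustration, not the paper's construction pipeline.

from PIL import Image

def stitch_haystack(sub_images: list[Image.Image], grid: int) -> Image.Image:
    """Tile grid*grid sub-images into one haystack image in row-major order.

    The needle is one of the cells; its position is (row, col) =
    divmod(cell_index, grid) within the stitched image.
    """
    assert len(sub_images) == grid * grid, "need exactly grid*grid sub-images"
    cell_w, cell_h = 256, 256  # illustrative cell resolution (assumption)
    canvas = Image.new("RGB", (grid * cell_w, grid * cell_h))
    for idx, img in enumerate(sub_images):
        row, col = divmod(idx, grid)
        canvas.paste(img.resize((cell_w, cell_h)), (col * cell_w, row * cell_h))
    return canvas

Larger grids (e.g., 8 x 8 instead of 2 x 2) shrink each sub-image and pack more distractors into a single image, which is one reason large stitching sizes are reported as especially challenging.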