Needle In A Multimodal Haystack


11 Jun 2024 | Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
This paper introduces MM-NIAH, the first benchmark designed to systematically evaluate the ability of existing multimodal large language models (MLLMs) to comprehend long multimodal documents. The benchmark comprises three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model must answer questions based on key information scattered throughout a given multimodal document.

Each evaluation document is constructed by concatenating interleaved image-text sequences from OBELICS into a long context containing 1k to 72k image and text tokens. Needles carrying key information are then inserted into either the text or the images of the document, yielding two needle types: text needles and image needles. The retrieval task requires the model to locate the key information inserted into the text or images. The counting task inserts multiple needles, and the model must find all of them and report their counts. The reasoning task requires the model to reason over cues from multiple needles scattered throughout the document.

Experiments on MM-NIAH show that existing MLLMs perform considerably worse with image needles than with text needles, and that models pre-trained on image-text interleaved data do not outperform those pre-trained only on image-text pair data. MLLMs also fail to retain the long-context capability of their underlying LLMs, and while retrieval-augmented generation (RAG) improves performance on text needles, it is ineffective for image needles. These results demonstrate that long multimodal document comprehension remains a challenging problem for current MLLMs. MM-NIAH provides a platform for further research on long multimodal document comprehension and contributes to the advancement of MLLMs. Code and benchmark are available at https://github.com/OpenGVLab/MM-NIAH.
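To make the construction procedure concrete, the sketch below shows how a haystack could be assembled from interleaved image-text sequences and how a text needle could be inserted at a chosen depth. The document format (a list of {"type": "text"|"image", ...} segments), the count_tokens callback, and the depth-based insertion heuristic are assumptions made for illustration, not the authors' released pipeline; the official code is in the linked repository.

```python
import random


def build_haystack(docs, target_tokens, count_tokens):
    """Concatenate OBELICS-style interleaved documents until the combined
    image + text token count reaches roughly `target_tokens`
    (MM-NIAH contexts span about 1k to 72k tokens)."""
    haystack = []
    for doc in docs:
        haystack.extend(doc)
        if count_tokens(haystack) >= target_tokens:
            break
    return haystack


def insert_text_needle(haystack, needle_sentence, depth, rng=random):
    """Insert a needle sentence into a text segment at a relative depth in [0, 1].

    Image needles would instead be pasted into one of the document's images;
    only the text case is sketched here."""
    text_positions = [i for i, seg in enumerate(haystack) if seg["type"] == "text"]
    # Pick the text segment closest to the requested depth.
    idx = text_positions[min(int(depth * len(text_positions)), len(text_positions) - 1)]
    sentences = haystack[idx]["text"].split(". ")
    insert_at = rng.randint(0, len(sentences))  # random sentence boundary in that segment
    sentences.insert(insert_at, needle_sentence)
    haystack[idx] = {"type": "text", "text": ". ".join(sentences)}
    return haystack
```

A counting-task sample would call insert_text_needle several times and record the ground-truth number of insertions as the expected answer.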
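Because the counting task asks for the number of occurrences of each needle, one simple way to score it is element-wise comparison between the predicted counts and the ground truth. The rule below is an illustrative assumption, not necessarily the benchmark's official metric.

```python
def count_accuracy(predicted, ground_truth):
    """Fraction of needle kinds whose predicted count matches the ground truth.
    An illustrative scoring rule, not necessarily the official MM-NIAH metric."""
    if not ground_truth:
        return 0.0
    correct = sum(1 for p, g in zip(predicted, ground_truth) if p == g)
    return correct / len(ground_truth)


# Example: three needle kinds occur 2, 1, and 3 times; the model predicts 2, 1, 2.
print(count_accuracy([2, 1, 2], [2, 1, 3]))  # 0.666...
```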
Understanding Needle In A Multimodal Haystack