Needle In A Multimodal Haystack

11 Jun 2024 | Weiyun Wang, Shuibo Zhang, Yiming Ren, Yuchen Duan, Tiantong Li, Shuo Liu, Mengkang Hu, Zhe Chen, Kaipeng Zhang, Lewei Lu, Xizhou Zhu, Ping Luo, Yu Qiao, Jifeng Dai, Wenqi Shao, Wenhai Wang
The paper introduces Needle In A Multimodal Haystack (MM-NIAH), a benchmark designed to evaluate the comprehension capabilities of Multimodal Large Language Models (MLLMs) for long multimodal documents. MM-NIAH includes three types of tasks: multimodal retrieval, counting, and reasoning, each requiring models to answer questions based on key information scattered throughout the document. The benchmark is constructed by concatenating multiple interleaved image-text sequences from the OBELICS dataset, creating long-context documents with 1k to 72k image and text tokens. Key information is inserted as needles into the text or images. The evaluation reveals that existing MLLMs struggle with long multimodal documents, especially with image needles, and that pre-training on image-text interleaved data does not significantly improve performance. The paper also demonstrates that Retrieval Augmented Generation (RAG) enhances text needle retrieval but fails to improve image needle retrieval. The results highlight the need for further research to improve long multimodal document comprehension.
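To make the construction concrete, here is a minimal Python sketch of how a haystack in the style of MM-NIAH could be assembled: interleaved image-text documents are concatenated up to a token budget, and a key sentence (a text needle) is inserted at a random text position. The function names, the IMAGE_PLACEHOLDER convention, the word-count token approximation, and the example needle are all illustrative assumptions, not the authors' actual pipeline.

```python
import random

IMAGE_PLACEHOLDER = "<image>"  # stands in for an interleaved image segment


def build_haystack(documents, max_tokens):
    """Concatenate interleaved image-text documents until a token budget is reached.

    Each document is a list of segments; a segment is either IMAGE_PLACEHOLDER or a
    text string. Token cost is approximated by whitespace word counts here (an
    assumption; the benchmark counts image and text tokens with the model tokenizer).
    """
    haystack, used = [], 0
    for doc in documents:
        cost = sum(1 if seg == IMAGE_PLACEHOLDER else len(seg.split()) for seg in doc)
        if used + cost > max_tokens:
            break
        haystack.extend(doc)
        used += cost
    return haystack


def insert_text_needle(haystack, needle, rng=random):
    """Append the needle sentence to a randomly chosen text segment."""
    text_positions = [i for i, seg in enumerate(haystack) if seg != IMAGE_PLACEHOLDER]
    pos = rng.choice(text_positions)
    haystack[pos] = haystack[pos] + " " + needle
    return haystack, pos


if __name__ == "__main__":
    docs = [
        ["The city council met on Tuesday.", IMAGE_PLACEHOLDER, "Attendance was high."],
        ["A new bridge opened last week.", "Traffic improved immediately.", IMAGE_PLACEHOLDER],
    ]
    haystack = build_haystack(docs, max_tokens=72_000)
    haystack, pos = insert_text_needle(haystack, "The secret number is 42.")
    print(f"Needle placed in segment {pos} of {len(haystack)} segments.")
```

A retrieval question for this haystack would then ask the model to recover the inserted fact (here, the secret number); the counting and reasoning tasks follow the same insertion scheme but require aggregating or reasoning over several needles, and image needles are embedded into the images themselves rather than the text.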