MMLONGBENCH-DOC: Benchmarking Long-context Document Understanding with Visualizations


10 Jul 2024 | Yubo Ma1, Yuhang Zang2*, Liangyu Chen1, Meiqi Chen3, Yizhu Jiao4, Xinze Li1, Xinyuan Lu5, Ziyu Liu6, Yan Ma7, Xiaoyi Dong2, Pan Zhang2, Liangming Pan8, Yu-Gang Jiang3, Jiaqi Wang2, Yixin Cao9*, Aixin Sun1
This paper introduces MMLONGBENCH-DOC, a comprehensive benchmark for evaluating the long-context, multi-modal understanding capabilities of Large Vision-Language Models (LVLMs). The benchmark consists of 135 lengthy PDF documents, averaging 47.5 pages and 21,214.1 textual tokens each, and includes 1,091 expert-annotated questions. These questions are designed to test various aspects of document understanding, including localization, cross-page comprehension, and hallucination detection. The evaluation covers 14 LVLMs, both proprietary and open-source, along with 10 LLMs for comparison. The results show that current LVLMs struggle significantly with long-context document understanding: the best-performing model, GPT-4o, achieves only a 44.9% F1 score. Notably, all LVLMs performed worse than LLMs fed with lossy OCR-parsed documents, highlighting the challenges of handling multi-modal, long-context information. The paper also provides a detailed analysis of the performance of different models and discusses limitations and future directions for improving long-context document understanding.
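The summary above reports results as F1 scores. As context for readers unfamiliar with the metric, the sketch below shows a generic token-level F1 between a predicted and a reference answer, as commonly used in document QA evaluation; it is an illustration only, and the function name, whitespace tokenization, and lowercasing are assumptions rather than the benchmark's official scoring protocol.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Illustrative only: simple whitespace tokenization and lowercasing,
    not MMLongBench-Doc's exact evaluation pipeline.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; one-sided empty counts as a miss.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a partially correct answer earns partial credit.
print(token_f1("47.5 pages on average", "47.5 pages"))  # ~0.67
```

Under this kind of metric, a model's benchmark score is typically the average F1 over all questions, so even partially correct answers contribute to the reported percentage.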