The document presents MMLONGBENCH-DOC, a benchmark for evaluating long-context document understanding by large vision-language models (LVLMs). It consists of 1,091 expert-annotated questions based on 135 lengthy PDF documents, which average 47.5 pages and 21,214 textual tokens each. The questions are categorized into single-page, cross-page, and unanswerable types: 33% require cross-page evidence and 22.5% are designed to be unanswerable in order to detect hallucinations. Evidence sources span text, layout, charts, tables, and images. Fourteen LVLMs, 4 proprietary and 10 open-source, are evaluated on the benchmark. Results show that even the best-performing model, GPT-4o, achieves only a 44.9% F1 score, while the second-best, GPT-4V, scores 30.5%. Most LVLMs perform worse than their LLM counterparts fed with OCR-parsed text. These results highlight the difficulty of long-context document understanding: current LVLMs are not yet capable of effectively handling lengthy, multi-modal documents, and the study emphasizes the need for further research into more capable LVLMs for this task.
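For context on the reported metric, the sketch below shows one common way an answer-level F1 score can be computed from token overlap between a model's prediction and the reference answer. This is a generic SQuAD-style formulation given purely as an illustrative assumption; the benchmark's official evaluation pipeline may compute its scores differently.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    NOTE: a generic SQuAD-style F1 shown only to illustrate the kind of
    metric reported above; MMLongBench-Doc's official scoring protocol
    may differ in its details.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # If either side is empty (e.g., a blank prediction), only exact
        # agreement counts as correct.
        return float(pred_tokens == ref_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a dataset-level score is the mean of per-question F1 values.
examples = [("47.5 pages", "47.5"), ("Not answerable", "Not answerable")]
print(sum(token_f1(pred, ref) for pred, ref in examples) / len(examples))
```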