26 Jun 2024 | Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
CharXiv is a comprehensive benchmark for evaluating chart understanding in multimodal large language models (MLLMs). It includes 2,323 charts from arXiv papers, covering diverse and complex visual elements. The benchmark features two types of questions: descriptive and reasoning. Descriptive questions assess basic chart information, while reasoning questions require synthesizing information across complex visual elements. All charts and questions are handpicked and curated by human experts to ensure quality. CharXiv reveals a significant gap between the strongest proprietary model (GPT-4o, 47.1% accuracy) and the strongest open-source model (InternVL Chat V1.5, 29.2% accuracy). All models lag far behind human performance (80.5% accuracy). The benchmark highlights weaknesses in the chart understanding capabilities of MLLMs, showing that open-source models are sensitive to small changes in charts or questions. CharXiv provides a more realistic and faithful measure of progress in chart understanding, addressing the limitations of existing benchmarks that often overestimate MLLM capabilities. The benchmark's wide range of charts and questions ensures diverse and challenging evaluations, enabling a thorough, multi-faceted assessment of chart understanding in MLLMs.
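To make the reported numbers concrete, here is a minimal sketch of how per-question-type accuracy figures such as those above would be aggregated from graded answers. This is illustrative Python with hypothetical field names ("question_type", "correct"), not CharXiv's actual evaluation code or data schema:

```python
from collections import defaultdict

# Hypothetical graded records: each entry pairs a question type
# ("descriptive" or "reasoning") with whether the model's answer
# was judged correct. Field names are illustrative only.
records = [
    {"question_type": "descriptive", "correct": True},
    {"question_type": "descriptive", "correct": False},
    {"question_type": "reasoning", "correct": False},
    {"question_type": "reasoning", "correct": True},
]

def accuracy_by_type(records):
    """Aggregate correctness into per-question-type accuracy."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for r in records:
        totals[r["question_type"]] += 1
        hits[r["question_type"]] += int(r["correct"])
    return {t: hits[t] / totals[t] for t in totals}

print(accuracy_by_type(records))
# e.g. {'descriptive': 0.5, 'reasoning': 0.5}
```

Reporting the two question types separately, as CharXiv does, is what exposes the gap between surface-level extraction (descriptive) and multi-element synthesis (reasoning).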