26 Jun 2024 | Zirui Wang, Mengzhou Xia, Luxi He, Howard Chen, Yitao Liu, Richard Zhu, Kaiqu Liang, Xindi Wu, Haotian Liu, Sadhika Malladi, Alexis Chevalier, Sanjeev Arora, Danqi Chen
The paper "Charting Gaps in Realistic Chart Understanding in Multimodal LLMs" addresses the limitations of existing datasets and benchmarks in evaluating the chart understanding capabilities of Multimodal Large Language Models (MLLMs). The authors introduce CharXiv, a comprehensive evaluation suite that includes 2,323 natural, challenging, and diverse charts from arXiv papers. CharXiv features two types of questions: descriptive questions about basic chart elements and reasoning questions that require synthesizing information across complex visual elements. The dataset is curated and verified by human experts to ensure quality.
The paper highlights that while open-source models may appear to outperform proprietary models on existing benchmarks, their performance drops sharply when they face slightly different charts or questions. For example, the accuracy of SPHINX V2 falls from 63.2% to 28.6% when the questions are modified, a drop of roughly 34.5 percentage points. The strongest open-source model, InternVL Chat V1.5, achieves only 29.2% accuracy on reasoning questions and 58.5% on descriptive questions, compared to 47.1% and 84.5% for GPT-4o, respectively. Both lag far behind human performance, which is 80.5% on reasoning questions and 92.1% on descriptive questions.
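To make these gaps concrete, the short snippet below tabulates the headline numbers quoted above and computes each model's distance from human performance; the accuracy figures come directly from the summary, while the tabulation itself is just an illustration.

```python
# Headline accuracies (%) quoted in the summary above.
human = {"reasoning": 80.5, "descriptive": 92.1}
models = {
    "GPT-4o": {"reasoning": 47.1, "descriptive": 84.5},
    "InternVL Chat V1.5": {"reasoning": 29.2, "descriptive": 58.5},
}

for name, scores in models.items():
    gaps = {task: human[task] - acc for task, acc in scores.items()}
    print(f"{name}: {gaps['reasoning']:.1f} pts behind humans on reasoning, "
          f"{gaps['descriptive']:.1f} pts on descriptive")
# GPT-4o: 33.4 pts behind humans on reasoning, 7.6 pts on descriptive
# InternVL Chat V1.5: 51.3 pts behind humans on reasoning, 33.6 pts on descriptive
```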
The paper also provides a detailed analysis of model performance, identifying gaps in reasoning and descriptive skills, the difficulty of certain tasks and charts, and how models handle unanswerable questions. The findings suggest that existing benchmarks overestimate the chart understanding capabilities of MLLMs due to their narrow focus and lack of diversity. CharXiv aims to address these issues by providing a more realistic and faithful measure of progress in chart understanding for MLLMs.