14 Jun 2024 | Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
ChartMimic is a new benchmark designed to evaluate the visually-grounded code generation capabilities of large multimodal models (LMMs). It uses information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate code for chart rendering. The benchmark includes 1,000 human-curated (figure, instruction, code) triplets representing authentic chart use cases in scientific papers across various domains. These charts span 18 regular types and 4 advanced types, covering 191 subcategories. Multi-level evaluation metrics are proposed to provide an automatic and thorough assessment of the output code and rendered charts. Unlike existing code generation benchmarks, ChartMimic emphasizes evaluating LMMs' ability to integrate visual understanding, code generation, and cross-modal reasoning. Evaluation of 3 proprietary models and 11 open-weight models highlights the challenges posed by ChartMimic. Even advanced models like GPT-4V and Claude-3-opus achieve average scores of 73.2 and 53.7, respectively, indicating significant room for improvement. ChartMimic is expected to inspire the development of LMMs and advance the pursuit of artificial general intelligence.
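To make the task concrete, each ground-truth triplet pairs a rendered figure and a short instruction with the plotting code that produced it. The hypothetical matplotlib sketch below shows the kind of code a model is expected to emit for a simple grouped bar chart; the chart, data, and labels are invented for illustration and are not drawn from the benchmark.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented data for a grouped bar chart, one of the regular chart types
# a chart-to-code benchmark covers; none of these values come from ChartMimic.
categories = ["Q1", "Q2", "Q3", "Q4"]
series_a = [0.42, 0.55, 0.61, 0.58]
series_b = [0.38, 0.47, 0.66, 0.71]

x = np.arange(len(categories))  # positions of the bar groups
width = 0.35                    # width of each individual bar

fig, ax = plt.subplots(figsize=(6, 4))
ax.bar(x - width / 2, series_a, width, label="Series A", color="#4C72B0")
ax.bar(x + width / 2, series_b, width, label="Series B", color="#DD8452")

ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.set_ylabel("Value")
ax.set_title("Example grouped bar chart")
ax.legend(frameon=False)

fig.tight_layout()
fig.savefig("example_chart.png", dpi=200)
```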
ChartMimic introduces two tasks: Direct Mimic and Customized Mimic. Direct Mimic requires LMMs to generate code that reproduces a given chart, while Customized Mimic requires code that incorporates new data specified in the instructions. The benchmark comprises 1,000 high-quality test examples, 500 per task. The data curation process filters and selects charts from various sources to ensure diversity and reduce data leakage. The charts span 22 categories (18 regular and 4 advanced types) covering 191 subcategories. Multi-level evaluation metrics, consisting of high-level and low-level scores, provide an automatic assessment of LMM performance.
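The paper's exact metric implementations are not reproduced here, but the minimal sketch below illustrates the general shape of an evaluation harness for such a benchmark, under two assumptions: that generated charts are matplotlib scripts executed in isolation, and that a low-level score can be roughly proxied by string-literal overlap between generated and reference code. The function names (render_chart, text_overlap_f1) and the scoring rule are hypothetical stand-ins, not ChartMimic's actual implementation.

```python
import re
import subprocess
import tempfile
from pathlib import Path


def render_chart(code: str, out_png: Path) -> bool:
    """Run LMM-generated matplotlib code in a subprocess and save the figure.

    Executing untrusted generated code in a separate (ideally sandboxed)
    process keeps rendering failures from crashing the harness itself.
    """
    script = code + f"\nimport matplotlib.pyplot as plt\nplt.savefig(r'{out_png}')\n"
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(script)
        script_path = f.name
    result = subprocess.run(["python", script_path], capture_output=True, timeout=60)
    return result.returncode == 0 and out_png.exists()


def text_overlap_f1(generated_code: str, reference_code: str) -> float:
    """Hypothetical stand-in for a low-level score: F1 overlap of string literals.

    The real benchmark compares elements of the charts themselves; this proxy
    only checks which quoted strings (labels, titles, tick text) the generated
    code shares with the reference code.
    """
    def extract(src: str) -> set[str]:
        return set(re.findall(r"['\"]([^'\"]+)['\"]", src))

    gen, ref = extract(generated_code), extract(reference_code)
    if not gen or not ref:
        return 0.0
    precision = len(gen & ref) / len(gen)
    recall = len(gen & ref) / len(ref)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```

A fuller harness would compare rendered outputs (or chart objects extracted from them) rather than source strings, which is closer in spirit to the paper's low-level/high-level split, but that level of detail is beyond this summary.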
The benchmark was evaluated on 14 LMMs, including 3 proprietary and 11 open-weight models. Results show that while some open-weight models perform comparably to proprietary models on public leaderboards, they still lag behind on ChartMimic; Phi-3-Vision, the best open-weight model, performs significantly worse than GPT-4V. Correlation analysis shows high agreement between the automatic metrics and human judgment. Error analysis reveals that hallucination significantly hinders LMMs' performance on ChartMimic, manifesting as the insertion of text that does not exist in the ground-truth figures and the confusion of similar chart types.
ChartMimic is designed as a suite of benchmarks to guide researchers in understanding the capabilities of their LMMs. By providing a comprehensive evaluation framework, ChartMimic aims to facilitate the growth of foundation models for the community, offering insights into various aspects of LMMs, such as visual understanding, code generation, and cross-modal reasoning. The benchmark highlights the challenges faced by current LMMs in visually-grounded code generation.