**ChartMimic: Evaluating LMM’s Cross-Modal Reasoning Capability via Chart-to-Code Generation**
**Authors:** Chufan Shi, Cheng Yang, Yaxin Liu, Bo Shui, Junjie Wang, Mohan Jing, Linran Xu, Xinyu Zhu, Siheng Li, Yuxiang Zhang, Gongye Liu, Xiaomei Nie, Deng Cai, Yujiu Yang
**Affiliations:** Tsinghua University, The Chinese University of Hong Kong, Waseda University, Tencent AI Lab
**Contact:** chartmimic@gmail.com
**Abstract:**
ChartMimic is a new benchmark designed to assess the visually grounded code generation capabilities of large multimodal models (LMMs). It uses information-intensive visual charts and textual instructions as inputs, requiring LMMs to generate the corresponding code for chart rendering. The benchmark comprises 1,000 human-curated (figure, instruction, code) triplets that represent authentic chart use cases from various scientific domains. ChartMimic covers 18 regular and 4 advanced chart types, diversified into 191 subcategories. Multi-level evaluation metrics provide an automatic and thorough assessment of the output code and the rendered charts. Unlike existing benchmarks, ChartMimic emphasizes evaluating LMMs' ability to integrate visual understanding, code generation, and cross-modal reasoning. An evaluation of 3 proprietary models and 11 open-weight models highlights significant challenges: even advanced models such as GPT-4V and Claude-3-opus achieve only moderate average scores, indicating substantial room for improvement.
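For illustration, the "code" field of such a triplet is a self-contained chart-rendering script. The sketch below shows the kind of script one might expect; the data values, labels, and filename are invented for demonstration and do not come from the benchmark.

```python
# Hypothetical example of the kind of self-contained matplotlib script that
# could serve as the ground-truth "code" field of a (figure, instruction, code)
# triplet. All data values and labels here are invented for illustration.
import matplotlib.pyplot as plt

methods = ["Baseline", "Method A", "Method B"]
scores = [61.3, 72.8, 78.4]

fig, ax = plt.subplots(figsize=(5, 3))
ax.bar(methods, scores, color=["#8da0cb", "#fc8d62", "#66c2a5"])
ax.set_ylabel("Accuracy (%)")
ax.set_title("Comparison of Methods")
ax.set_ylim(0, 100)

plt.tight_layout()
plt.savefig("example_chart.pdf")
```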
**Introduction:**
Code generation is a demanding task that requires advanced abstract thinking and logical reasoning. ChartMimic addresses the gap between existing benchmarks, which primarily use text inputs, and real-world scenarios where humans receive information from multiple modalities. The benchmark is designed to evaluate LMMs' proficiency in visual understanding, code generation, and cross-modal reasoning through two tasks: Direct Mimic and Customized Mimic. The data curation process ensures diversity, balance, and authenticity, with 1,000 high-quality test examples. Multi-level evaluation metrics are proposed to assess the similarity between generated and ground-truth figures, including high-level and low-level elements.
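As a rough sketch of how the two tasks differ at the input level, one might assemble the model inputs as shown below. The `MimicInput` dataclass, `build_input` helper, and instruction wording are hypothetical illustrations, not the ChartMimic data format or API.

```python
# Minimal sketch of how Direct Mimic and Customized Mimic inputs might be
# assembled. Function and field names are hypothetical, not ChartMimic's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MimicInput:
    chart_image_path: str           # reference chart rendered as an image
    instruction: str                # textual instruction given to the LMM
    new_data: Optional[str] = None  # only used by Customized Mimic

def build_input(task: str, chart_image_path: str, new_data: Optional[str] = None) -> MimicInput:
    if task == "direct":
        # Direct Mimic: reproduce the reference chart as closely as possible.
        return MimicInput(chart_image_path,
                          "Generate matplotlib code that reproduces this chart.")
    if task == "customized":
        # Customized Mimic: keep the chart's style but plot the newly provided data.
        return MimicInput(chart_image_path,
                          f"Generate matplotlib code in the style of this chart, "
                          f"but plot the following data: {new_data}",
                          new_data)
    raise ValueError(f"Unknown task: {task}")
```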
**Benchmark Details:**
- **Task Definition:** The Direct Mimic task asks LMMs to generate code that reproduces a given chart, while the Customized Mimic task additionally requires incorporating new data specified in the instruction.
- **Data Curation:** A four-step pipeline is used to curate the data, ensuring diversity, balance, and authenticity.
- **Evaluation Metrics:** Multi-level metrics (high-level and low-level) are proposed to assess the similarity between generated and ground-truth figures, achieving Pearson correlation coefficients of 0.6942 and 0.7538 for the high-level and low-level metrics, respectively; a simplified sketch of one low-level check follows this list.
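As a concrete but simplified illustration of what a low-level check might involve, the sketch below renders a generated matplotlib script, extracts the text elements drawn in the resulting figure, and scores them against the ground truth with an F1 measure. This approximates only one low-level dimension (text) under strong assumptions and is not the benchmark's actual evaluation code.

```python
# Simplified sketch of one low-level check: compare the text elements of a
# generated chart against the ground truth with an F1 score.
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import matplotlib.text as mtext

def extract_texts(code: str) -> set[str]:
    """Execute a chart script and collect the non-empty text elements it draws.

    Assumes the script only needs matplotlib and leaves its figure open.
    """
    plt.close("all")
    exec(code, {"plt": plt})
    fig = plt.gcf()
    texts = {t.get_text().strip() for t in fig.findobj(mtext.Text)}
    return {t for t in texts if t}

def text_f1(generated_code: str, reference_code: str) -> float:
    gen, ref = extract_texts(generated_code), extract_texts(reference_code)
    if not gen or not ref:
        return 0.0
    tp = len(gen & ref)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(gen), tp / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A full low-level evaluation would also need to compare the other chart elements the paper describes, and actual scoring should rely on the benchmark's released evaluation toolchain rather than this sketch.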
**Experiment:**
- An evaluation of 3 proprietary models and 11 open-weight models shows that even advanced LMMs such as GPT-4V and Claude-3-opus achieve only moderate average scores on ChartMimic, leaving substantial room for improvement.