**ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild**
**Authors:** Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, Shafiq Joty
**Institutions:** York University, Canada; MILA - Quebec AI Institute; Salesforce Research; Nanyang Technological University, Singapore
**Abstract:**
Given the widespread use of charts in various industries and sciences, there has been a growing interest in developing pre-trained foundation models and instruction-tuned models for chart understanding and reasoning. However, existing methods suffer from two critical drawbacks: (i) they are trained on data generated from the charts' underlying data tables, ignoring the visual trends and patterns in the chart images themselves, and (ii) they use weakly aligned vision-language backbone models, limiting their generalizability. To address these issues, we introduce ChartGemma, a novel chart understanding and reasoning model built on PaliGemma. ChartGemma is trained on instruction-tuning data generated directly from chart images, capturing both high-level trends and low-level visual information from a diverse set of charts. Our approach achieves state-of-the-art results across five benchmarks spanning chart summarization, question answering, and fact-checking, and qualitative studies show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries.
**Contributions:**
- We present ChartGemma, a multimodal model instruction-tuned for chart understanding and reasoning using data directly generated from chart images.
- ChartGemma utilizes a stronger backbone model and more representative instruction-tuning data, rendering it effective in tackling existing benchmarks while being significantly smaller than its counterparts.
- Extensive quantitative and qualitative studies demonstrate that ChartGemma produces more faithful and human-like summaries and is highly capable of understanding and representing complex real-world charts.
**Methods:**
- **Data Generation:** We assemble a diverse corpus of charts from various sources, including synthetically generated charts, curated charts from specialized websites, and in-the-wild charts. We then generate visual instruction-tuning data directly from these chart images, emphasizing visual attributes and complex trends (a data-generation sketch follows this list).
- **Model Architecture:** ChartGemma uses PaliGemma as its backbone, pairing a SigLIP vision encoder with a Gemma-2B language model. The vision encoder maps the chart image to visual tokens that are projected into the LLM's embedding space, and the LLM applies full (non-causal) attention over both the visual tokens and the input text tokens, enhancing contextual understanding (see the inference sketch after this list).
- **Training Setup:** We fine-tune the backbone directly on our instruction-tuning data, avoiding the need for a separate vision-language alignment step, which improves efficiency and generalizability (a single-step fine-tuning sketch closes this list).
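Since this summary does not specify the generation pipeline, the following is a minimal sketch of the data-generation step, assuming a Gemini-style multimodal API via the `google-generativeai` SDK; the prompt wording, the `charts/` directory, and the JSON output format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: generate visual instruction-tuning pairs directly from chart
# images with a multimodal LLM. Assumes the `google-generativeai` SDK;
# the prompt and file layout are hypothetical.
import json
from pathlib import Path

import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")

PROMPT = (
    "Look at this chart image and write five instruction-response pairs "
    "that require reasoning over its visual attributes (colors, trends, "
    "relative bar heights), not just its underlying data table. "
    'Return a JSON list of {"instruction": ..., "response": ...} objects.'
)

records = []
for path in Path("charts").glob("*.png"):  # synthetic, curated, in-the-wild
    image = PIL.Image.open(path)
    reply = model.generate_content([PROMPT, image])
    records.append({"image": str(path), "pairs": reply.text})

Path("instruction_data.json").write_text(json.dumps(records, indent=2))
```

Generating from the rendered image rather than the data table is what lets the pairs reference colors, markers, and relative heights that a table-only pipeline never sees.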
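To make the backbone concrete, here is a minimal inference sketch using the Hugging Face `transformers` PaliGemma classes; the public `google/paligemma-3b-pt-448` base checkpoint and the prompt are stand-ins, not ChartGemma's released weights.

```python
# Sketch: running a chart question through PaliGemma (SigLIP encoder +
# Gemma-2B LM) with the public base checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("chart.png")
prompt = "What is the highest value shown in the chart?"

# The processor interleaves SigLIP image tokens with the text prefix;
# PaliGemma attends over this full prefix and decodes autoregressively.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```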
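The single-stage training setup can be sketched as ordinary supervised fine-tuning on an (image, instruction, answer) triple, with no alignment phase beforehand; the learning rate and the dummy example below are illustrative assumptions. The `suffix` argument of the PaliGemma processor builds labels so that only the answer tokens contribute to the loss.

```python
# Sketch: one supervised fine-tuning step on PaliGemma, directly on an
# (image, instruction, answer) triple -- no separate alignment stage.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative lr

image = Image.new("RGB", (448, 448), "white")  # stand-in for a real chart
# `suffix` is the target answer; the processor masks the image tokens and
# the instruction prefix out of the loss labels.
inputs = processor(
    text="Summarize the main trend in this chart.",
    images=image,
    suffix="Sales rise steadily from 2010 to 2020.",
    return_tensors="pt",
)

loss = model(**inputs).loss  # next-token cross-entropy on the answer tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```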
**Experiments and Results:**
- **Closed-Ended Tasks:** ChartGemma outperforms or matches existing models on benchmarks such as ChartQA, ChartFC, and ChartCheck (a relaxed-accuracy scoring sketch follows this list).
- **Open-Ended Tasks:** ChartGemma performs well on open-ended generation benchmarks like OpenCQA, Chart2Text, and a curated set
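ChartQA is conventionally scored with relaxed accuracy: exact match for string answers and up to 5% relative error for numeric ones. This summary does not restate the metric, so the sketch below implements the standard convention rather than the paper's exact evaluation harness.

```python
# Sketch: ChartQA-style relaxed accuracy -- exact match for strings,
# up-to-5% relative error for numeric answers.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        # Non-numeric answers must match exactly (case-insensitive).
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0:
        return pred == gold
    return abs(pred - gold) / abs(gold) <= tolerance

def relaxed_accuracy(predictions: list[str], targets: list[str]) -> float:
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)

# Example: "10.3" vs. gold "10.0" is within 5% and counts as correct.
print(relaxed_accuracy(["10.3", "Canada"], ["10.0", "canada"]))  # 1.0
```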