**ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild**
**Authors:** Ahmed Masry, Megh Thakkar, Aayush Bajaj, Aaryaman Kartha, Enamul Hoque, Shafiq Joty
**Institutions:** York University, Canada; MILA - Quebec AI Institute; Salesforce Research; Nanyang Technological University, Singapore
**Abstract:**
Given the widespread use of charts in various industries and sciences, there has been a growing interest in developing pre-trained foundation models and instruction-tuned models for chart understanding and reasoning. However, existing methods suffer from two critical drawbacks: (i) they are trained on data generated from the charts' underlying data tables, ignoring the visual trends and patterns in the chart images themselves, and (ii) they use weakly aligned vision-language backbone models, limiting their generalizability. To address these issues, we introduce ChartGemma, a novel chart understanding and reasoning model built on PaliGemma. ChartGemma is trained on instruction-tuning data generated directly from chart images, capturing both high-level trends and low-level visual information from a diverse set of charts. Our approach achieves state-of-the-art results across five benchmarks spanning chart summarization, question answering, and fact-checking, and qualitative studies show that ChartGemma generates more realistic and factually correct summaries compared to its contemporaries.
**Contributions:**
- We present ChartGemma, a multimodal model instruction-tuned for chart understanding and reasoning using data directly generated from chart images.
- ChartGemma utilizes a stronger backbone model and more representative instruction-tuning data, rendering it effective in tackling existing benchmarks while being significantly smaller than its counterparts.
- Extensive quantitative and qualitative studies demonstrate that ChartGemma produces more faithful and human-like summaries and is highly capable of understanding and representing complex real-world charts.
**Methods:**
- **Data Generation:** We assemble a diverse corpus of charts from various sources, including synthetically generated charts, curated charts from specialized websites, and in-the-wild charts. We then generate visual instruction-tuning data directly from these chart images, emphasizing visual attributes and complex trends (a data-generation sketch follows this list).
- **Model Architecture:** ChartGemma uses PaliGemma as its backbone, pairing a SigLIP vision encoder with a Gemma-2B language model. The vision encoder maps the chart image to visual tokens that are projected into the LLM's embedding space, and the LLM applies full (non-causal) attention over both the visual tokens and the input text tokens, enhancing contextual understanding (see the inference sketch after this list).
- **Training Setup:** We fine-tune the backbone directly on our instruction-tuning data, avoiding the need for a separate vision-language alignment step, which improves efficiency and generalizability (a single-step fine-tuning sketch closes this list).
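Since this summary does not specify the generation pipeline, the following is a minimal sketch of the data-generation step, assuming a Gemini-style multimodal API via the `google-generativeai` SDK; the prompt wording, the `charts/` directory, and the JSON output format are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: generate visual instruction-tuning pairs directly from chart
# images with a multimodal LLM. Assumes the `google-generativeai` SDK;
# the prompt and file layout are hypothetical.
import json
from pathlib import Path

import google.generativeai as genai
import PIL.Image

genai.configure(api_key="YOUR_API_KEY")  # placeholder credential
model = genai.GenerativeModel("gemini-1.5-flash")

PROMPT = (
    "Look at this chart image and write five instruction-response pairs "
    "that require reasoning over its visual attributes (colors, trends, "
    "relative bar heights), not just its underlying data table. "
    'Return a JSON list of {"instruction": ..., "response": ...} objects.'
)

records = []
for path in Path("charts").glob("*.png"):  # synthetic, curated, in-the-wild
    image = PIL.Image.open(path)
    reply = model.generate_content([PROMPT, image])
    records.append({"image": str(path), "pairs": reply.text})

Path("instruction_data.json").write_text(json.dumps(records, indent=2))
```

Generating from the rendered image rather than the data table is what lets the pairs reference colors, markers, and relative heights that a table-only pipeline never sees.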
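To make the backbone concrete, here is a minimal inference sketch using the Hugging Face `transformers` PaliGemma classes; the public `google/paligemma-3b-pt-448` base checkpoint and the prompt are stand-ins, not ChartGemma's released weights.

```python
# Sketch: running a chart question through PaliGemma (SigLIP encoder +
# Gemma-2B LM) with the public base checkpoint.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
).eval()

image = Image.open("chart.png")
prompt = "What is the highest value shown in the chart?"

# The processor interleaves SigLIP image tokens with the text prefix;
# PaliGemma attends over this full prefix and decodes autoregressively.
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(output[0][input_len:], skip_special_tokens=True))
```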
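The single-stage training setup can be sketched as ordinary supervised fine-tuning on an (image, instruction, answer) triple, with no alignment phase beforehand; the learning rate and the dummy example below are illustrative assumptions. The `suffix` argument of the PaliGemma processor builds labels so that only the answer tokens contribute to the loss.

```python
# Sketch: one supervised fine-tuning step on PaliGemma, directly on an
# (image, instruction, answer) triple -- no separate alignment stage.
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-pt-448"
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # illustrative lr

image = Image.new("RGB", (448, 448), "white")  # stand-in for a real chart
# `suffix` is the target answer; the processor masks the image tokens and
# the instruction prefix out of the loss labels.
inputs = processor(
    text="Summarize the main trend in this chart.",
    images=image,
    suffix="Sales rise steadily from 2010 to 2020.",
    return_tensors="pt",
)

loss = model(**inputs).loss  # next-token cross-entropy on the answer tokens
loss.backward()
optimizer.step()
optimizer.zero_grad()
```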
**Experiments and Results:**
- **Closed-Ended Tasks:** ChartGemma outperforms or matches existing models on benchmarks such as ChartQA, ChartFC, and ChartCheck (a relaxed-accuracy scoring sketch follows this list).
- **Open-Ended Tasks:** ChartGemma performs well on open-ended generation benchmarks like OpenCQA, Chart2Text, and a curated set
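ChartQA is conventionally scored with relaxed accuracy: exact match for string answers and up to 5% relative error for numeric ones. This summary does not restate the metric, so the sketch below implements the standard convention rather than the paper's exact evaluation harness.

```python
# Sketch: ChartQA-style relaxed accuracy -- exact match for strings,
# up-to-5% relative error for numeric answers.
def relaxed_match(prediction: str, target: str, tolerance: float = 0.05) -> bool:
    try:
        pred, gold = float(prediction), float(target)
    except ValueError:
        # Non-numeric answers must match exactly (case-insensitive).
        return prediction.strip().lower() == target.strip().lower()
    if gold == 0:
        return pred == gold
    return abs(pred - gold) / abs(gold) <= tolerance

def relaxed_accuracy(predictions: list[str], targets: list[str]) -> float:
    hits = sum(relaxed_match(p, t) for p, t in zip(predictions, targets))
    return hits / len(targets)

# Example: "10.3" vs. gold "10.0" is within 5% and counts as correct.
print(relaxed_accuracy(["10.3", "Canada"], ["10.0", "canada"]))  # 1.0
```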