25 Apr 2024 | Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang
**TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning**
**Authors:** Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, and Fei Huang
**Abstract:**
Charts are crucial for presenting and explaining complex data relationships. Recent multimodal large language models (MLLMs) have shown remarkable capabilities in various chart understanding tasks, but their large parameter sizes and computational requirements limit their use in resource-constrained environments. This paper introduces TinyChart, an efficient MLLM for chart understanding with only 3 billion parameters. TinyChart addresses two key challenges: (1) reducing the burden of learning numerical computations through Program-of-Thoughts (PoT) learning, which trains the model to generate Python programs for numerical calculations, and (2) reducing the lengthy vision feature sequences produced by the vision transformer on high-resolution images through a Visual Token Merging module, which gradually merges similar vision tokens. Extensive experiments demonstrate that TinyChart achieves state-of-the-art performance on various chart understanding benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, OpenQA, and ChartX, outperforming several 13-billion-parameter MLLMs and closed-source models like GPT-4V. It also demonstrates superior efficiency, with higher inference throughput due to its smaller model scale and more efficient vision encoding.
**Contributions:**
- Introduce TinyChart, an efficient MLLM for chart understanding with 3 billion parameters, achieving state-of-the-art performance on various benchmarks.
- Propose Program-of-Thoughts (PoT) learning to enhance numerical computation capabilities and construct the ChartQA-PoT dataset.
- Adopt Visual Token Merging for efficient vision encoding, significantly reducing the length of vision feature sequences and enabling high-resolution chart image input.
**Related Work:**
- Overview of chart understanding tasks and the limitations of current models.
- Discussion on multimodal large language models and their challenges in chart understanding.
**Model Architecture:**
- Detailed description of the TinyChart architecture, including the vision transformer encoder, vision-language connector, and large language model.
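The token-merging idea behind the vision encoder can be illustrated with a small sketch. The snippet below is a simplified, hypothetical NumPy version of ToMe-style bipartite token merging (the function name `merge_tokens` and the single-pass averaging are illustrative assumptions, not the paper's exact implementation): tokens are split into two sets, the `r` most similar cross-set pairs are found by cosine similarity, and each such pair is averaged into one token, shrinking the sequence from `N` to `N - r`.

```python
import numpy as np

def merge_tokens(tokens: np.ndarray, r: int) -> np.ndarray:
    """Merge the r most similar token pairs (ToMe-style sketch).

    tokens: (N, D) array of vision tokens, N even.
    Returns an array of shape (N - r, D).
    """
    # Split tokens into two alternating sets A and B.
    a, b = tokens[0::2], tokens[1::2]

    # Cosine similarity between every A token and every B token.
    an = a / np.linalg.norm(a, axis=1, keepdims=True)
    bn = b / np.linalg.norm(b, axis=1, keepdims=True)
    sim = an @ bn.T

    # For each A token, find its most similar partner in B.
    best_b = sim.argmax(axis=1)
    best_s = sim.max(axis=1)

    # Merge the r A-tokens with the highest similarity into their partners.
    merge_idx = np.argsort(-best_s)[:r]
    keep_idx = np.setdiff1d(np.arange(len(a)), merge_idx)

    merged_b = b.copy()
    for i in merge_idx:
        merged_b[best_b[i]] = (merged_b[best_b[i]] + a[i]) / 2  # average merge

    # Remaining A tokens plus (partially merged) B tokens.
    return np.concatenate([a[keep_idx], merged_b], axis=0)
```

Applying this once per transformer layer, as in ToMe, is what lets the sequence shrink gradually rather than all at once; the paper's module operates on the ViT's internal token features in the same spirit.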
**Program-of-Thoughts Learning:**
- Explanation of how PoT learning enhances the model's ability to solve numerical problems by generating Python programs.
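To make the PoT idea concrete, here is a hypothetical example of the kind of short Python program a PoT-trained model emits instead of computing the answer in text (the question, values, and variable names are invented for illustration; the real ChartQA-PoT programs follow their own templates):

```python
# Question (hypothetical): "How much higher is the 2021 value than the 2019 value?"
# The model reads the two values off the chart and writes code for the arithmetic,
# so the numerical computation is delegated to the Python interpreter.
value_2019 = 38.5  # value read from the chart for 2019
value_2021 = 52.0  # value read from the chart for 2021

answer = value_2021 - value_2019
print(answer)  # the program's printed output is taken as the final answer
```

Offloading the subtraction to an interpreter avoids the arithmetic errors small language models often make when computing directly in generated text.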
**Dataset Construction:**
- Description of the ChartQA-PoT dataset, including template-based and GPT-based PoT construction methods.
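The template-based branch of dataset construction can be sketched as follows. This is a toy illustration under assumed names (`q_template`, `pot_template`, and the fruit table are all invented): a question template and its paired program template are instantiated from a chart's underlying data table, yielding a (question, program) training pair.

```python
# Toy underlying chart table (hypothetical data).
table = {"Apple": 30, "Banana": 45, "Cherry": 25}

# A question template and its paired Program-of-Thoughts template.
q_template = "Which category has the largest value?"
pot_template = (
    "labels = {labels}\n"
    "values = {values}\n"
    "answer = labels[values.index(max(values))]\n"
    "print(answer)"
)

# Instantiate the program from the table to form one training pair.
program = pot_template.format(labels=list(table), values=list(table.values()))
print(q_template)
print(program)
```

Executing the instantiated program yields the gold answer (`Banana` here), so each template pair is self-verifying; the GPT-based branch complements this by covering free-form questions that templates cannot express.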
**Evaluation:**
- Extensive experiments on various benchmarks, showing TinyChart's superior performance and efficiency.
**Ablation Studies:**
- Ablation studies to validate the effectiveness of visual token merging and PoT learning.
**Conclusion:**
- Summary of the main findings and the advantages of TinyChart in chart understanding tasks.