TinyChart: Efficient Chart Understanding with Visual Token Merging and Program-of-Thoughts Learning

25 Apr 2024 | Liang Zhang, Anwen Hu, Haiyang Xu, Ming Yan, Yichen Xu, Qin Jin, Ji Zhang, Fei Huang
TinyChart is an efficient multimodal large language model (MLLM) for chart understanding with only 3 billion parameters. It addresses two key challenges in efficient chart understanding: (1) reducing the burden of learning numerical computations through a Program-of-Thoughts (PoT) learning strategy, which trains the model to generate Python programs for numerical calculations, and (2) shortening the lengthy vision feature sequences that the vision transformer produces for high-resolution images through a Visual Token Merging module, which gradually merges the most similar vision tokens.

Extensive experiments show that TinyChart achieves state-of-the-art performance on various chart understanding benchmarks, including ChartQA, Chart-to-Text, Chart-to-Table, OpenCQA, and ChartX. It outperforms several chart understanding MLLMs with up to 13B parameters, such as ChartLlama and ChartAst, as well as closed-source general-purpose MLLMs like GPT-4V on ChartQA. It also achieves higher inference throughput thanks to its smaller model scale and more efficient vision encoding. The code and model are available at https://github.com/X-PLUG/mPLUG-DocOwl/tree/main/TinyChart.
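To give a feel for the Program-of-Thoughts idea, here is a hypothetical example of what a PoT-style answer might look like. The chart values, variable names, and question are illustrative assumptions, not taken from the paper; the point is that the model emits a Python program and defers the arithmetic to an interpreter instead of computing it in text:

```python
# Hypothetical PoT output for the question:
# "What is the average sales of product A and product B?"

# Values read off the chart by the model (hypothetical data)
sales_a = 120
sales_b = 80

# Arithmetic is delegated to the Python interpreter,
# avoiding numerical errors from text-only generation
answer = (sales_a + sales_b) / 2
print(answer)  # the executed result becomes the final answer: 100.0
```

Executing the generated program, rather than trusting the model's in-text arithmetic, is what reduces the numerical-computation burden described above.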
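The Visual Token Merging module can be illustrated with a naive sketch of its core operation: find the most similar pair of vision tokens and merge them by averaging. This is only a toy, pure-Python illustration under my own assumptions; the paper's module operates inside the vision transformer and uses a more efficient matching scheme than the O(n²) search shown here:

```python
import math

def cosine(a, b):
    """Cosine similarity between two token vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def merge_once(tokens):
    """Merge the single most similar pair of tokens by averaging them,
    reducing the sequence length by one."""
    best, pair = -2.0, None
    for i in range(len(tokens)):
        for j in range(i + 1, len(tokens)):
            s = cosine(tokens[i], tokens[j])
            if s > best:
                best, pair = s, (i, j)
    i, j = pair
    merged = [(x + y) / 2 for x, y in zip(tokens[i], tokens[j])]
    return [t for k, t in enumerate(tokens) if k not in (i, j)] + [merged]

# Example: four 2-D "vision tokens"; the first two are nearly identical
tokens = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [-1.0, 0.0]]
tokens = merge_once(tokens)
print(len(tokens))  # 3 tokens remain after one merge step
```

Applying such merges gradually across transformer layers shortens the token sequence for high-resolution chart images, which is the source of the efficiency gains claimed above.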