Chart-based Reasoning: Transferring Capabilities from LLMs to VLMs


19 Mar 2024 | Victor Cărbune, Hassan Mansoor, Fangyu Liu, Rahul Aralikatte, Gilles Baechler, Jindong Chen, Abhanshu Sharma
This paper presents a method for transferring reasoning capabilities from large language models (LLMs) to vision-language models (VLMs). The approach first improves the VLM's chart representation by continuing pre-training on an enhanced chart-to-table translation task over a mixture of chart-to-table datasets, for which a larger dataset is constructed. Reasoning traces are then synthesized from the table representations, and the model is fine-tuned on this synthetic rationale data using a multitask loss.

The resulting model, ChartPaLI-5B, outperforms even much larger models such as PaLI-X-55B without relying on an upstream OCR system. Evaluated on the ChartQA, FigureQA, and PlotQA benchmarks, the method yields significant gains, particularly on questions requiring numerical operations and complex reasoning, and it addresses challenges such as numerical reasoning, color-based reasoning, and multi-step reasoning. The model remains effective even when a reasoning trace is incomplete or contains errors.
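The paper reports fine-tuning with a multitask loss over answer-only and rationale-augmented targets; its exact formulation is not reproduced here. Below is a minimal sketch of one plausible setup, a weighted sum of two next-token cross-entropy terms, written in PyTorch. The function name `multitask_loss`, the weight `alpha`, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def multitask_loss(answer_logits, answer_targets,
                   rationale_logits, rationale_targets, alpha=0.5):
    # Standard next-token cross-entropy for each task; logits are
    # (batch, seq, vocab) and targets are (batch, seq) token ids.
    answer_loss = F.cross_entropy(
        answer_logits.flatten(0, 1), answer_targets.flatten())
    rationale_loss = F.cross_entropy(
        rationale_logits.flatten(0, 1), rationale_targets.flatten())
    # Weighted combination; alpha is an assumed hyperparameter.
    return alpha * answer_loss + (1.0 - alpha) * rationale_loss

# Toy usage: batch of 2, sequence length 4, vocabulary of 10 tokens.
logits = torch.randn(2, 4, 10)
targets = torch.randint(0, 10, (2, 4))
print(multitask_loss(logits, targets, logits, targets).item())
```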
The rationales are further refined with program-of-thought prompting, in which the model generates code for the arithmetic computations rather than performing them in free text; with this refinement, ChartPaLI-5B also outperforms recent models such as Gemini Ultra and GPT-4V. The paper highlights the importance of both the continued pre-training and the rationale-augmented fine-tuning stages, and it discusses limitations of the approach, including the scarcity of color-based reasoning examples in the synthetic data and the remaining difficulty of the most complex reasoning tasks. Overall, the method effectively transfers reasoning capabilities from LLMs to VLMs, enabling substantially smaller models to perform well on complex chart-reasoning tasks.
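As a concrete illustration of the program-of-thought idea, the sketch below executes a model-generated rationale as Python code, so the interpreter rather than the model performs the arithmetic. The `run_pot` helper, the convention that the generated code assigns its result to `answer`, and the example rationale are all hypothetical; the paper's actual prompting and execution setup may differ.

```python
def run_pot(generated_code: str):
    """Execute model-generated Python and return its `answer` variable."""
    scope = {}
    # Empty builtins to limit what generated code can touch; plain
    # arithmetic and assignments still work without them.
    exec(generated_code, {"__builtins__": {}}, scope)
    return scope.get("answer")

# Hypothetical rationale a model might emit for the question
# "How much higher is the 2021 bar than the 2020 bar?"
generated = """
value_2021 = 40.0
value_2020 = 25.5
answer = value_2021 - value_2020
"""
print(run_pot(generated))  # 14.5
```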