ChatGLM-Math: Improving Math Problem-Solving in Large Language Models with a Self-Critique Pipeline

3 Apr 2024 | Yifan Xu, Xiao Liu, Xinghan Liu, Zhenyu Hou, Yueyan Li, Xiaohan Zhang, Zihan Wang, Aohan Zeng, Zhengxiao Du, Wenyi Zhao, Jie Tang, Yuxiao Dong
The paper introduces a novel approach called the Self-Critique pipeline to enhance both the linguistic and mathematical capabilities of large language models (LLMs). The Self-Critique pipeline addresses the challenge of maintaining and improving both language and mathematical abilities in deployed LLM systems. The method involves training a general Math-Critique model from the LLM itself to provide feedback signals, followed by two stages of fine-tuning: Rejective Fine-tuning (RFT) and Direct Preference Optimization (DPO). RFT uses rejection sampling to discard responses that do not meet Math-Critique standards, while DPO learns directly from pairs of correct and incorrect answers. The pipeline is evaluated on the MathUserEval dataset, which features a diverse range of questions, including practical application scenarios. Experiments show that the Self-Critique pipeline significantly enhances the LLM's mathematical problem-solving abilities while improving its language capabilities, outperforming models twice the size of the base model. The related techniques have been applied to ChatGLM, an online serving LLM, and the evaluation dataset and scripts are available at https://github.com/THUDM/ChatGLM-Math.
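To make the two fine-tuning stages concrete, here is a minimal sketch of how Math-Critique feedback could drive data construction for RFT and DPO. The `generate_responses` and `math_critique_score` callables, the `THRESHOLD` cutoff, and the sampling count `k` are hypothetical stand-ins, not the paper's actual implementation.

```python
# Sketch of data construction for the two stages described above.
# `generate_responses` samples candidate answers from the base LLM;
# `math_critique_score` is the Math-Critique model's scoring function.
# Both are assumed interfaces for illustration only.

THRESHOLD = 8  # hypothetical acceptance cutoff on the critique score scale


def rejective_fine_tuning_data(questions, generate_responses, math_critique_score, k=8):
    """Stage 1 (RFT): sample k candidate answers per question and keep only
    those the Math-Critique model scores at or above the threshold."""
    sft_examples = []
    for q in questions:
        for answer in generate_responses(q, num_samples=k):
            if math_critique_score(q, answer) >= THRESHOLD:
                sft_examples.append({"prompt": q, "response": answer})
    return sft_examples


def dpo_preference_pairs(questions, generate_responses, math_critique_score, k=8):
    """Stage 2 (DPO): for each question, pair the highest- and lowest-scoring
    sampled answers as (chosen, rejected) preference data."""
    pairs = []
    for q in questions:
        scored = [(math_critique_score(q, a), a) for a in generate_responses(q, num_samples=k)]
        scored.sort(key=lambda t: t[0])
        worst, best = scored[0], scored[-1]
        if best[0] > worst[0]:  # keep only pairs with a clear preference gap
            pairs.append({"prompt": q, "chosen": best[1], "rejected": worst[1]})
    return pairs
```

The resulting RFT examples would feed a standard supervised fine-tuning step, and the preference pairs a standard DPO objective; the sketch only illustrates how the self-trained critique model replaces human or rule-based answer checking in both stages.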