17 Jun 2024 | Zhipeng Chen, Kun Zhou, Wayne Xin Zhao†, Junchen Wan, Fuzheng Zhang, Di Zhang and Ji-Rong Wen
This paper proposes RLMEC, a new reinforcement learning (RL) method for improving large language models (LLMs) on complex reasoning tasks. RLMEC incorporates a generative reward model trained under a minimum editing constraint to provide fine-grained supervision signals. The reward model is trained on an erroneous-solution rewriting task, which lets it focus on the key tokens that lead to incorrect answers. The LLM therefore receives token-level rewards that guide it to correct its errors more precisely. The framework also combines a token-level RL objective with imitation-based regularization to stabilize training and keep the model focused on critical tokens.
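To make the idea of a minimum editing constraint concrete, the sketch below derives per-token rewards by aligning an erroneous solution against its minimally edited rewrite: tokens the rewrite preserves are scored positively, tokens it had to change are scored negatively. This is not the paper's implementation; the tokenization, the +1/-1 reward values, and the `token_level_rewards` helper are illustrative assumptions.

```python
# A minimal sketch (not the paper's exact method) of turning a minimally
# edited correction into token-level rewards: tokens kept by the rewrite are
# treated as correct (+1), tokens that had to be edited as erroneous (-1).
from difflib import SequenceMatcher
from typing import List


def token_level_rewards(erroneous: List[str], rewritten: List[str]) -> List[float]:
    """Score every token of the erroneous solution by whether the
    minimum-edit rewrite preserved it."""
    rewards = [-1.0] * len(erroneous)  # assume wrong unless the rewrite kept it
    matcher = SequenceMatcher(a=erroneous, b=rewritten, autojunk=False)
    for block in matcher.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            rewards[i] = 1.0  # token survived the minimal edit -> likely correct
    return rewards


# Toy example: only the faulty final token receives a negative reward.
bad = ["2", "+", "3", "=", "6"]
fixed = ["2", "+", "3", "=", "5"]
print(token_level_rewards(bad, fixed))  # [1.0, 1.0, 1.0, 1.0, -1.0]
```

Because the rewrite changes as little as possible, most tokens keep a positive score and the negative signal concentrates on the few tokens that actually caused the wrong answer.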
The generative reward model is trained to rewrite erroneous solutions with minimal edits, which teaches it to identify and correct the key tokens in generated outputs. During policy training, the reward model supplies token-level supervision signals, which are used to optimize the LLM (the policy model) through proximal policy optimization (PPO). Imitation-based regularization further stabilizes training by aligning the LLM's outputs with the rewritten solutions.
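As one way to picture how token-level rewards and imitation-based regularization might combine, here is a hedged sketch of a simplified objective: a plain reward-weighted policy-gradient term rather than the paper's exact PPO loss. The `rlmec_style_loss` function, its tensor shapes, and the 0.5 imitation weight are assumptions for illustration only.

```python
# Hedged sketch of a token-level RL objective plus imitation regularization,
# assuming precomputed per-token rewards from the reward model and a
# minimally rewritten solution of the same length as the imitation target.
import torch
import torch.nn.functional as F


def rlmec_style_loss(logits: torch.Tensor,        # (seq_len, vocab)
                     sampled_ids: torch.Tensor,   # (seq_len,) tokens the policy generated
                     token_rewards: torch.Tensor, # (seq_len,) rewards from the reward model
                     rewritten_ids: torch.Tensor, # (seq_len,) minimally edited target tokens
                     imitation_weight: float = 0.5) -> torch.Tensor:
    log_probs = F.log_softmax(logits, dim=-1)

    # Token-level RL term: reward-weighted log-likelihood of the sampled tokens,
    # so flagged tokens are pushed down and correct ones reinforced.
    sampled_logp = log_probs.gather(-1, sampled_ids.unsqueeze(-1)).squeeze(-1)
    rl_term = -(token_rewards * sampled_logp).mean()

    # Imitation-based regularization: cross-entropy toward the rewritten
    # solution, anchoring the policy to a known-correct trajectory.
    imitation_term = F.cross_entropy(logits, rewritten_ids)

    return rl_term + imitation_weight * imitation_term


# Toy usage with random logits (vocabulary of 10, sequence of 4 tokens).
logits = torch.randn(4, 10, requires_grad=True)
loss = rlmec_style_loss(logits,
                        sampled_ids=torch.tensor([1, 2, 3, 4]),
                        token_rewards=torch.tensor([1.0, 1.0, -1.0, 1.0]),
                        rewritten_ids=torch.tensor([1, 2, 7, 4]))
loss.backward()
```

Weighting each token's log-probability by its own reward is what lets the update concentrate on the few tokens the reward model flags as erroneous, while the imitation term keeps the policy close to the corrected rewrite and helps stabilize training.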
Experiments on eight tasks, including question-answering and mathematical reasoning, demonstrate that RLMEC outperforms existing methods in terms of accuracy and stability. The results show that RLMEC effectively reduces errors and improves the LLM's ability to handle complex reasoning tasks. The method is also shown to be more efficient in training, as it focuses on key tokens and reduces the impact of unimportant ones.
The paper also discusses the limitations of the proposed method, including the focus on specific tasks and the need for further research on broader applications. Overall, RLMEC provides a novel approach to improving LLMs through fine-grained reinforcement learning with minimum editing constraints, leading to better performance on complex reasoning tasks.