17 Jun 2024 | Zhipeng Chen, Kun Zhou, Wayne Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, Ji-Rong Wen
The paper introduces RLMEC, a novel reinforcement learning (RL) method that improves large language models (LLMs) by providing fine-grained supervision signals. RLMEC incorporates a generative reward model, trained under a minimum editing constraint, to produce token-level supervision for RL training. This addresses a limitation of existing RL methods that rely on instance-level rewards, which cannot provide detailed guidance for complex reasoning tasks. The generative reward model is trained on an erroneous-solution rewriting task, in which it must correct generated solutions with as few edits as possible. The token-level RL objective and an imitation-based regularization focus training on the key tokens that lead to errors and reduce the influence of unimportant tokens. Experiments on complex reasoning tasks, including question answering and mathematical reasoning, show that RLMEC outperforms competitive methods, while also mitigating overfitting and stabilizing the RL training process. The paper closes with limitations and future work, including applying RLMEC to other scenarios and addressing ethical considerations.
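To make the token-level supervision idea concrete, here is a minimal illustrative sketch, not the authors' implementation: it aligns a sampled solution against a reward model's minimally edited rewrite, treats edited positions as a proxy for error-causing tokens, and weights a simple policy-gradient-style loss by the resulting per-token rewards. The function names, the +1/-1 scoring rule, and the stand-in log-probabilities are all assumptions for illustration; the paper's actual objective and regularization differ in their details.

```python
# Illustrative sketch (not RLMEC's official code): derive token-level rewards by
# aligning a sampled solution with a minimally edited rewrite, then weight a
# policy-gradient-style loss by those per-token rewards.
from difflib import SequenceMatcher
from typing import List

import torch


def token_rewards_from_rewrite(sampled: List[str], rewrite: List[str]) -> torch.Tensor:
    """Assign +1 to sampled tokens the rewrite keeps and -1 to tokens it edits.

    Hypothetical scoring rule: under a minimum-editing constraint, the edited
    positions serve as a proxy for the key tokens that lead to errors.
    """
    rewards = -torch.ones(len(sampled))            # default: token was edited away
    matcher = SequenceMatcher(a=sampled, b=rewrite, autojunk=False)
    for block in matcher.get_matching_blocks():    # spans the rewrite left untouched
        rewards[block.a: block.a + block.size] = 1.0
    return rewards


def token_level_pg_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Token-weighted policy-gradient surrogate: -mean_t[ r_t * log pi(a_t | s_t) ]."""
    return -(rewards * logprobs).mean()


if __name__ == "__main__":
    sampled = "the answer is 3 because 1 + 1 = 3".split()
    rewrite = "the answer is 2 because 1 + 1 = 2".split()   # minimal edit: fix two tokens
    rewards = token_rewards_from_rewrite(sampled, rewrite)
    logprobs = torch.randn(len(sampled)).log_softmax(dim=0)  # stand-in for policy log-probs
    print("per-token rewards:", rewards.tolist())
    print("loss:", token_level_pg_loss(logprobs, rewards).item())
```

In this toy example only the two erroneous tokens receive negative rewards, so the gradient signal concentrates on them rather than on the many correct tokens, which is the intuition behind RLMEC's fine-grained supervision.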