ChatGLM-RLHF is a reinforcement learning from human feedback (RLHF) system designed to align ChatGLM with human preferences. The system comprises three main components: collection of human preference data, training of a reward model, and optimization of the policy. Challenges encountered during implementation include reward variance in large-scale training, efficient model parallelism, and catastrophic forgetting in large language models (LLMs). These are mitigated by reducing reward variance, implementing model parallelism with fused gradient descent, and applying regularization constraints. Experiments show that ChatGLM-RLHF significantly improves alignment compared to the supervised fine-tuned (SFT) version, achieving on average 15% more wins in Chinese alignment tasks.
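For illustration only, the following is a minimal sketch of the pairwise preference loss commonly used to train such a reward model (a Bradley-Terry style objective). The function and tensor names are hypothetical and are not taken from the ChatGLM-RLHF implementation.

```python
# Illustrative sketch of a pairwise reward-model loss (Bradley-Terry style).
# Names are hypothetical; this is not the ChatGLM-RLHF code.
import torch
import torch.nn.functional as F


def pairwise_reward_loss(score_chosen: torch.Tensor,
                         score_rejected: torch.Tensor) -> torch.Tensor:
    """Push the reward model to score the human-preferred response higher
    than the rejected one: -log sigmoid(r_chosen - r_rejected)."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()


if __name__ == "__main__":
    # Toy scalar scores, as would be produced by a reward head per response.
    chosen = torch.tensor([1.2, 0.3, 0.8])
    rejected = torch.tensor([0.4, 0.5, -0.1])
    print(pairwise_reward_loss(chosen, rejected))
```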
The RLHF pipeline involves collecting human preference data, training a reward model to predict human preferences, and using reinforcement learning algorithms such as PPO and DPO to optimize the policy model. The reward model is trained to avoid biases such as length bias, and a reference baseline is introduced to reduce variability in reward scores. The pipeline also addresses capability forgetting, in which RL fine-tuning degrades abilities the model had acquired before alignment. To counter this, an additional supervised next-token-prediction loss is used to preserve the pre-trained abilities of the SFT model.
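The snippet below sketches, under assumed names and an assumed weighting scheme, the two mechanisms described above: subtracting the reward of a reference response as a baseline, and adding a supervised next-token-prediction term to the RL objective. It is an illustrative approximation, not the ChatGLM-RLHF code.

```python
# Illustrative sketch: reference-baseline reward and an auxiliary
# next-token-prediction (NTP) loss added to the policy objective.
# All names and the ntp_weight value are assumptions for illustration.
import torch
import torch.nn.functional as F


def baseline_adjusted_reward(reward_policy: torch.Tensor,
                             reward_reference: torch.Tensor) -> torch.Tensor:
    """Score the policy response relative to a reference response for the
    same prompt, reducing prompt-to-prompt variability in reward scores."""
    return reward_policy - reward_reference


def combined_policy_loss(ppo_loss: torch.Tensor,
                         logits: torch.Tensor,    # (batch, seq, vocab) on curated SFT-style data
                         targets: torch.Tensor,   # (batch, seq) token ids
                         ntp_weight: float = 0.1) -> torch.Tensor:
    """Add a supervised next-token-prediction loss to the RL objective so
    that abilities of the SFT model are not forgotten during RLHF."""
    ntp_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))
    return ppo_loss + ntp_weight * ntp_loss


if __name__ == "__main__":
    batch, seq, vocab = 2, 4, 10
    toy_logits = torch.randn(batch, seq, vocab)
    toy_targets = torch.randint(0, vocab, (batch, seq))
    toy_ppo_loss = torch.tensor(0.5)  # placeholder for the clipped PPO objective
    print(combined_policy_loss(toy_ppo_loss, toy_logits, toy_targets))
```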
Experiments on ChatGLM-6B and ChatGLM-32B demonstrate that ChatGLM-RLHF substantially improves the performance of ChatGLM, enabling it to produce more helpful, safe, and aligned responses. The ChatGLM models refined through the ChatGLM-RLHF pipeline are deployed in online services and mobile applications. The RLHF framework is scalable and efficient, with practical designs that support large-scale training. The results show that PPO slightly outperforms DPO in automatic evaluation, while human evaluation confirms the effectiveness of RLHF in improving response quality. The reward model's accuracy in predicting human preferences is also evaluated, with the reward model based on ChatGLM-32B reaching 65%. The study highlights the importance of aligning LLMs with human preferences and provides insights into the challenges and solutions in RLHF implementations.