ChatGLM-RLHF: Practices of Aligning Large Language Models with Human Feedback

2024-04-03 | Zhenyu Hou, Yilin Niu, Zhengxia Du, Xiaohan Zhang, Xiao Liu, Aohan Zeng, Qinkai Zheng, Minlie Huang, Hongning Wang, Jie Tang, Yuxiao Dong
The paper introduces ChatGLM-RLHF, a reinforcement learning from human feedback (RLHF) system designed to enhance the alignment of large language models (LLMs) with human preferences. The ChatGLM-RLHF pipeline consists of three main components: data collection, reward model training, and policy model optimization. The authors address several challenges encountered during implementation, including reducing reward variance, implementing model parallelism, and avoiding catastrophic forgetting. They introduce strategies such as bucket-based length balancing to mitigate reward bias, fused gradient-descent for model parallelism, and regularization constraints to prevent forgetting. Experiments on ChatGLM-6B and ChatGLM-32B show significant improvements in alignment tasks compared to the supervised fine-tuning (SFT) versions, achieving an average of 15% more wins on Chinese alignment tasks. The work provides insights into the practices and solutions for aligning LLMs with human preferences, highlighting the importance of reliable human preference data, robust training frameworks, and practical solutions to common challenges in RLHF implementations.
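The summary mentions bucket-based length balancing as a way to reduce length-induced reward bias. As a rough illustration only, the sketch below normalizes reward-model scores within response-length buckets; the function name, equal-frequency bucketing, and per-bucket standardization are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np

def bucket_length_balanced_rewards(rewards, lengths, num_buckets=10):
    """Illustrative sketch: standardize rewards within length buckets so that
    longer responses do not receive systematically inflated raw rewards.

    rewards: scalar rewards from a reward model (one per response)
    lengths: response lengths in tokens, same shape as rewards
    num_buckets: number of equal-frequency length buckets (illustrative choice)
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    lengths = np.asarray(lengths, dtype=np.float64)

    # Equal-frequency bucket edges over the observed response lengths.
    edges = np.quantile(lengths, np.linspace(0.0, 1.0, num_buckets + 1))
    bucket_ids = np.clip(np.searchsorted(edges, lengths, side="right") - 1,
                         0, num_buckets - 1)

    balanced = np.empty_like(rewards)
    for b in range(num_buckets):
        mask = bucket_ids == b
        if not mask.any():
            continue
        mu, sigma = rewards[mask].mean(), rewards[mask].std()
        # Standardize within the bucket; fall back to centering only when the
        # bucket's rewards are (near-)constant.
        balanced[mask] = (rewards[mask] - mu) / (sigma if sigma > 1e-8 else 1.0)
    return balanced
```

Any comparable scheme that scores responses relative to peers of similar length would serve the same purpose; the key idea is that the policy is optimized against length-debiased rewards rather than raw reward-model outputs.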