12 Jun 2024 | Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
This technical report presents a detailed workflow for Online Iterative Reinforcement Learning from Human Feedback (RLHF) for large language models (LLMs). The authors aim to close the gap between existing open-source RLHF projects, which focus largely on offline learning, and the need for practical, reproducible online methods. They construct preference models from diverse open-source datasets to approximate human feedback, a crucial ingredient for online iterative RLHF. The report covers the theoretical insights and algorithmic principles behind online iterative RLHF, including supervised fine-tuning and iterative direct preference learning, and provides practical implementation details along with a comprehensive evaluation on benchmarks such as AlpacaEval-2, Arena-Hard, MT-Bench, HumanEval, and TruthfulQA. The trained model, LLaMA-3-8B-SFR-Iterative-DPO-R, demonstrates impressive performance and outperforms other models, including larger ones, on conversation and instruction-following tasks. The authors also address response length bias and propose a length penalty to mitigate it. The report concludes with future directions and encourages further exploration of online iterative RLHF.
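The iterative loop summarized above (sample candidate responses from the current policy, rank them with a preference/reward model under a length penalty, then run a round of direct preference learning on the resulting chosen/rejected pairs) can be illustrated with a minimal sketch. All names below (`sample_responses`, `reward_score`, `dpo_update`, `LENGTH_PENALTY`) are hypothetical stand-ins for illustration, not the report's actual code or hyperparameters.

```python
"""Hedged sketch of an online iterative DPO loop, assuming best-vs-worst
pair selection under a length-penalized reward. Not the authors' implementation."""
import random

LENGTH_PENALTY = 0.001  # assumed coefficient; the report tunes this to curb length bias


def sample_responses(policy, prompt, k=8):
    # Stand-in: the real workflow decodes k candidates from the current LLM policy.
    return [f"candidate {i} for: {prompt} " * random.randint(1, 20) for i in range(k)]


def reward_score(prompt, response):
    # Stand-in: the real workflow queries a preference/reward model trained on open-source data.
    return random.random()


def penalized_score(prompt, response):
    # Length-penalized reward, mitigating the verbosity bias discussed in the report.
    return reward_score(prompt, response) - LENGTH_PENALTY * len(response)


def dpo_update(policy, preference_pairs):
    # Stand-in for one round of direct preference optimization on (prompt, chosen, rejected) triples.
    return policy


def iterative_dpo(policy, prompts, iterations=3):
    for _ in range(iterations):
        pairs = []
        for prompt in prompts:
            candidates = sample_responses(policy, prompt)
            ranked = sorted(candidates, key=lambda r: penalized_score(prompt, r), reverse=True)
            pairs.append((prompt, ranked[0], ranked[-1]))  # best vs. worst as chosen/rejected
        policy = dpo_update(policy, pairs)  # policy improves, then generates the next batch
    return policy


if __name__ == "__main__":
    iterative_dpo(policy=None, prompts=["Explain RLHF in one sentence."])
```

The key design point the sketch tries to convey is that the preference data is regenerated from the *current* policy at every iteration, which is what distinguishes online iterative RLHF from the offline, fixed-dataset setting the report contrasts it with.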