12 Jun 2024 | Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, Tong Zhang
This technical report presents the workflow of Online Iterative Reinforcement Learning from Human Feedback (RLHF), which significantly outperforms its offline counterpart in the recent large language model (LLM) literature. To address the challenge of training open-source models with limited resources, the authors construct preference models from diverse open-source datasets and use them to approximate human feedback. The report covers the theoretical insights and algorithmic principles behind online iterative RLHF, followed by a detailed practical implementation. The resulting model, LLaMA-3-8B-SFR-Iterative-DPO-R, achieves impressive performance on LLM chatbot benchmarks, including AlpacaEval-2, Arena-Hard, and MT-Bench, as well as academic benchmarks such as HumanEval and TruthfulQA. The report also discusses the challenges of offline preference learning, notably reward over-optimization and distribution shift, and proposes online iterative RLHF as a remedy. It provides a detailed recipe for reproducing the online iterative RLHF pipeline, together with publicly released models, curated datasets, and comprehensive code guides, and it highlights the effectiveness of supervised fine-tuning (SFT) followed by iterative RLHF in reaching state-of-the-art performance with fully open-source datasets. The report concludes with an evaluation across these benchmarks, demonstrating that online iterative RLHF improves model performance and alignment with human preferences.
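As a rough illustration of the loop described above, the sketch below shows one way an online iterative DPO round could be organized: sample several responses per prompt from the current policy, rank them with the proxy preference model, pair the best and worst as chosen/rejected, and run a DPO update on the freshly collected pairs. The helper names (`generate`, `score`, `dpo_update`) are placeholders for the report's actual inference, preference-model scoring, and DPO training components, not its real API.

```python
# Minimal sketch of online iterative DPO (hypothetical helper names;
# the report's actual pipeline uses its own training/inference code).
import random
from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # highest-scoring response under the proxy preference model
    rejected: str  # lowest-scoring response


def collect_pairs(prompts, generate, score, n_samples=8):
    """Sample n_samples responses per prompt from the current policy and
    label the best/worst ones with the proxy preference model."""
    pairs = []
    for prompt in prompts:
        responses = [generate(prompt) for _ in range(n_samples)]
        ranked = sorted(responses, key=lambda r: score(prompt, r))
        pairs.append(PreferencePair(prompt, chosen=ranked[-1], rejected=ranked[0]))
    return pairs


def iterative_dpo(prompts, generate, score, dpo_update, n_iters=3, batch=512):
    """Online iterative loop: each iteration collects fresh preference data
    with the *current* policy, then applies a DPO update on those pairs."""
    for _ in range(n_iters):
        batch_prompts = random.sample(prompts, min(batch, len(prompts)))
        pairs = collect_pairs(batch_prompts, generate, score)
        dpo_update(pairs)  # standard DPO step on the newly collected pairs
```

The key difference from purely offline DPO is that the preference pairs are regenerated from the updated policy at every iteration, which is what mitigates the distribution shift and over-optimization issues the report attributes to offline preference learning.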