24 Mar 2024 | Shengyi Huang, Michael Noukhovitch, Arian Hosseini, Kashif Rasul, Weixun Wang, Lewis Tunstall
This paper presents a detailed reproduction of OpenAI's Reinforcement Learning from Human Feedback (RLHF) scaling behaviors in the context of TL;DR summarization. The authors build an RLHF pipeline from scratch, enumerating 20 key implementation details and sharing insights gathered during the reproduction process. They achieve significant improvements in response quality, with their 2.8B and 6.9B models outperforming OpenAI's released 1.3B checkpoint. The paper emphasizes using a single learning rate for SFT, RM, and PPO training to simplify the setup and improve reproducibility, and it provides a thorough analysis of the TL;DR dataset, covering tokenization, padding, and dataset preparation. The authors then walk through the training setups for SFT, RM, and PPO, highlighting how individual implementation details affect performance. They release the trained model checkpoints and code to facilitate further research and accelerate progress in the field. The paper concludes by demonstrating a successful reproduction of OpenAI's RLHF work, promoting transparency and reproducibility in the research community.
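
To make the RM training stage concrete, here is a minimal sketch (not the authors' code) of the standard pairwise reward-model objective used in this style of RLHF: the RM is trained so that the human-preferred summary scores higher than the rejected one for the same post. All names below are illustrative.

```python
# Pairwise reward-model loss: -log sigmoid(r_chosen - r_rejected), batch-averaged.
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """`chosen_rewards` / `rejected_rewards` are the scalar RM outputs for the
    human-preferred and human-rejected summaries of the same Reddit post."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with hypothetical reward scores:
chosen = torch.tensor([1.2, 0.3, 0.8])
rejected = torch.tensor([0.4, 0.5, -0.1])
print(pairwise_rm_loss(chosen, rejected).item())
```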
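
Similarly, a minimal sketch of the per-token reward typically fed to PPO in this setup: the scalar RM score is credited at the final generated token, while every token pays a KL penalty for drifting from the SFT (reference) policy. The coefficient and variable names are illustrative assumptions, not values from the paper.

```python
# Per-token PPO rewards with a KL penalty against the SFT reference policy.
import torch

def kl_penalized_rewards(rm_score: torch.Tensor,         # shape: (batch,)
                         policy_logprobs: torch.Tensor,  # shape: (batch, seq)
                         ref_logprobs: torch.Tensor,     # shape: (batch, seq)
                         kl_coef: float = 0.05) -> torch.Tensor:
    """Reward per token = -kl_coef * (log pi - log pi_ref); the scalar RM score
    is added on the last generated token."""
    rewards = -kl_coef * (policy_logprobs - ref_logprobs)
    rewards[:, -1] += rm_score
    return rewards
```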