4 Mar 2022 | Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, Ryan Lowe
The paper "Training language models to follow instructions with human feedback" by OpenAI researchers presents a method to align large language models (LLMs) with user intent through fine-tuning using human feedback. The authors address the issue that large LMs often generate outputs that are untruthful, toxic, or harmful, which is due to the misalignment between the language modeling objective and the goal of following user instructions. They propose a three-step process: supervised fine-tuning (SFT) using labeler demonstrations, training a reward model (RM) to predict human preferences, and fine-tuning the model using reinforcement learning from human feedback (RLHF). The resulting models, called *InstructGPT*, show significant improvements in truthfulness and reductions in toxic output generation compared to the original GPT-3 model, despite having fewer parameters. Human evaluations on a wide range of tasks, including API prompts and public NLP datasets, demonstrate that InstructGPT models are preferred by labelers and perform better in terms of helpfulness, honesty, and harmlessness. The paper also discusses the generalization capabilities of InstructGPT models and their performance on tasks outside the training distribution, highlighting their ability to follow instructions in different languages and complete coding tasks. However, the authors note that InstructGPT still makes some simple mistakes and that further work is needed to improve safety and reliability.The paper "Training language models to follow instructions with human feedback" by OpenAI researchers presents a method to align large language models (LLMs) with user intent through fine-tuning using human feedback. The authors address the issue that large LMs often generate outputs that are untruthful, toxic, or harmful, which is due to the misalignment between the language modeling objective and the goal of following user instructions. They propose a three-step process: supervised fine-tuning (SFT) using labeler demonstrations, training a reward model (RM) to predict human preferences, and fine-tuning the model using reinforcement learning from human feedback (RLHF). The resulting models, called *InstructGPT*, show significant improvements in truthfulness and reductions in toxic output generation compared to the original GPT-3 model, despite having fewer parameters. Human evaluations on a wide range of tasks, including API prompts and public NLP datasets, demonstrate that InstructGPT models are preferred by labelers and perform better in terms of helpfulness, honesty, and harmlessness. The paper also discusses the generalization capabilities of InstructGPT models and their performance on tasks outside the training distribution, highlighting their ability to follow instructions in different languages and complete coding tasks. However, the authors note that InstructGPT still makes some simple mistakes and that further work is needed to improve safety and reliability.