12 Apr 2022 | Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan*
The paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" by Yuntao Bai et al. explores the use of preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. The authors collect data through a human feedback interface where crowdworkers interact with the models and choose between helpful and harmful responses. They find that this alignment training improves performance on various NLP evaluations and is compatible with specialized skills like Python coding and summarization. The paper also investigates the robustness of RLHF training, noting a roughly linear relationship between the RL reward and the square root of the KL divergence between the policy and its initialization. Additionally, they explore iterated online training, where preference models and RL policies are updated weekly with fresh human feedback data, and compare their models with human writers. The results show that RLHF-trained models are both helpful and less harmful, and perform better than raw, generative models on most evaluations. The paper also discusses the benefits of combining alignment training with specialized skills and the challenges of balancing helpfulness and harmlessness.The paper "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" by Yuntao Bai et al. explores the use of preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. The authors collect data through a human feedback interface where crowdworkers interact with the models and choose between helpful and harmful responses. They find that this alignment training improves performance on various NLP evaluations and is compatible with specialized skills like Python coding and summarization. The paper also investigates the robustness of RLHF training, noting a roughly linear relationship between the RL reward and the square root of the KL divergence between the policy and its initialization. Additionally, they explore iterated online training, where preference models and RL policies are updated weekly with fresh human feedback data, and compare their models with human writers. The results show that RLHF-trained models are both helpful and less harmful, and perform better than raw, generative models on most evaluations. The paper also discusses the benefits of combining alignment training with specialized skills and the challenges of balancing helpfulness and harmlessness.