12 Apr 2022 | Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom Brown, Jack Clark, Sam McCandlish, Chris Olah, Ben Mann, Jared Kaplan
We apply preference modeling and reinforcement learning from human feedback (RLHF) to fine-tune language models to act as helpful and harmless assistants. Our alignment training improves performance on almost all NLP evaluations and is compatible with specialized skills like Python coding and summarization. We explore an iterated online training mode, where preference models and RL policies are updated weekly with fresh human feedback, efficiently improving datasets and models. We investigate the robustness of RLHF training and find a roughly linear relationship between the RL reward and the square root of the KL divergence between the policy and its initialization. We also perform peripheral analyses on calibration, competing objectives, and OOD detection, compare our models with human writers, and provide samples from our models using prompts from recent related work.
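As a compact restatement of that empirical finding (with r_0 and k as fitted constants in our own illustrative notation, not symbols from the paper):

```latex
% Empirical trend during RLHF training: the preference-model reward grows
% roughly linearly in the square root of the KL divergence between the
% current policy \pi and its initialization \pi_0 (r_0 and k are fitted constants).
r_{\mathrm{PM}}(\pi) \approx r_0 + k \sqrt{D_{\mathrm{KL}}\!\left(\pi \,\|\, \pi_0\right)}
```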
We collect separate helpfulness and harmlessness (HH) datasets using various 52B language models. We collect three data tranches: one from initial models, one with rejection sampling against early preference models, and a final dataset gathered with models trained with 'online' reinforcement learning from human feedback. We find that smaller models experience severe 'alignment taxes', but our 13B and 52B RLHF-trained models perform better than plain language models on zero-shot NLP evaluations. Natural language RLHF training for HH can be applied to code-finetuned models, improving their programming ability. Mixing HH preference model training with summarization does not degrade performance on either HH or summarization.
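A minimal sketch of rejection (best-of-k) sampling against a preference model, assuming hypothetical `generate` and `pm_score` interfaces rather than the actual data-collection code:

```python
# Hypothetical best-of-k (rejection) sampling against a preference model.
# `generate` and `pm_score` are assumed interfaces, not the paper's code.

def best_of_k(prompt, generate, pm_score, k=16):
    """Draw k candidate responses and keep the one the preference model scores highest."""
    candidates = [generate(prompt) for _ in range(k)]
    return max(candidates, key=lambda response: pm_score(prompt, response))
```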
There is a tension between helpfulness and harmlessness, which can be measured at the level of preference modeling and RLHF-trained policies. However, as model size increases, PMs perform better on both distributions and become more robust to the relative proportions of helpful and harmless training data. We also show that OOD detection techniques can be used to reject most strange and harmful requests.
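One way to sketch such an OOD screen is a Mahalanobis-distance test on embeddings of in-distribution requests; the embedding source and threshold below are illustrative assumptions, not the exact technique used:

```python
# Illustrative OOD screen: fit a Gaussian to embeddings of in-distribution
# (ordinary helpful) requests and flag inputs whose Mahalanobis distance from
# that distribution exceeds a threshold. Embeddings and threshold are assumed.
import numpy as np

def fit_gaussian(embeddings: np.ndarray):
    """Estimate mean and precision matrix from in-distribution embeddings (shape: n x d)."""
    mean = embeddings.mean(axis=0)
    precision = np.linalg.pinv(np.cov(embeddings, rowvar=False))
    return mean, precision

def mahalanobis(x: np.ndarray, mean: np.ndarray, precision: np.ndarray) -> float:
    delta = x - mean
    return float(np.sqrt(delta @ precision @ delta))

def looks_out_of_distribution(x, mean, precision, threshold: float) -> bool:
    """Reject (treat as strange or harmful) when the distance exceeds the threshold."""
    return mahalanobis(x, mean, precision) > threshold
```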
We study scaling relations for PM accuracy as a function of model and dataset size, finding roughly log-linear trends. We conduct experiments on the robustness of RLHF, finding that larger PMs are more robust than smaller ones. We find that the reward and the square root of the KL divergence are approximately linearly related during RLHF training. We study iterated online training, in which we update preference models and RLHF policies weekly, significantly improving our models as evaluated by crowdworkers and our dataset as judged by our own PMs.
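The log-linear scaling trend can be written schematically as below, with N the parameter count, D the number of preference comparisons, and a, b, c fitted constants (our own illustrative notation, not the paper's fit):

```latex
% Roughly log-linear scaling of preference-model accuracy with model size N
% and dataset size D; a, b, c are fitted constants (illustrative notation).
\mathrm{Acc}_{\mathrm{PM}}(N, D) \approx a + b \log N + c \log D
```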
We evaluate our models on NLP and code evaluations, finding that RLHF-trained models perform better than base LMs. We also run static alignment evaluations, finding that RLHF improves sentiment towards all groups but does not remove bias. Human evaluations show that our online HH model is preferred by crowdworkers about 57% of the time.
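Pairwise win rates of this kind are commonly converted to Elo-style rating differences via the standard formula below; a 57% win rate corresponds to roughly +49 Elo points:

```latex
% Standard conversion from a pairwise win probability p to an Elo rating difference.
\Delta\mathrm{Elo} = 400 \log_{10}\!\left(\frac{p}{1-p}\right),
\qquad p = 0.57 \;\Rightarrow\; \Delta\mathrm{Elo} \approx 49
```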
We compare our models with human writers, finding that our online HH model is preferred by crowdworkers. We also evaluate gender bias and bot adversarial dialogues, finding that PMs do not exhibit substantial bias. We discuss related work, including LaMDA and InstructGPT, and note how our work differs from these efforts.