Fine-Tuning Language Models from Human Preferences


8 Jan 2020 | Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving
This paper presents a method for fine-tuning language models with human preferences to improve performance on natural language tasks. The approach combines advances in generative pretraining with human preference learning: a reward model is trained on human judgments, and the pretrained language model is then optimized against that reward with reinforcement learning (RL).

Concretely, human labelers compare candidate continuations and pick the one they prefer, and the reward model is trained to assign higher scores to the preferred samples. The language model is then fine-tuned with RL against this reward, with a KL penalty toward the pretrained model that prevents the policy from drifting too far from the distribution the reward model was trained on.

The method is applied to four tasks: stylistic continuation with positive sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets. For stylistic continuation, the models achieve good results with only 5,000 human comparisons. For summarization, models trained with 60,000 comparisons are primarily "smart copiers": they copy whole sentences from the input while skipping irrelevant preamble, which yields reasonable ROUGE scores and strong ratings from human labelers but may exploit the fact that labelers rely on simple heuristics.

The paper also discusses challenges in the approach, including the difficulty of online data collection, overfitting when parameters are shared between the reward model and the policy, and the difficulty of labeling ambiguous tasks. The authors conclude that the method is effective for these tasks but limited by data quality and by the need for more abstractive models, and they suggest that future work focus on improving data collection and encouraging abstraction rather than copying. They view applying human reward learning to natural language as important for both capability and safety, since it helps models align more closely with human preferences.
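As a rough illustration of the reward-model step, the sketch below implements a softmax comparison loss over several candidate continuations, where the candidate chosen by the labeler should receive the highest score. This is a minimal sketch under stated assumptions, not the paper's code: the RewardModel class, feature shapes, and helper names are illustrative, and a real implementation would score (context, continuation) pairs with a full language-model backbone rather than pooled random features.

```python
# Minimal sketch of a preference-based reward-model loss (names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a language-model-based reward model: maps a pooled
    representation of (context, continuation) to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_candidates, hidden_dim)
        return self.score(features).squeeze(-1)  # -> (batch, num_candidates)

def comparison_loss(rewards: torch.Tensor, chosen: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over candidate rewards: the labeler-chosen candidate
    should receive the highest score (a softmax preference loss)."""
    return F.cross_entropy(rewards, chosen)

# Usage with random features standing in for pooled transformer states.
model = RewardModel()
features = torch.randn(8, 4, 768)   # 8 inputs, 4 candidate continuations each
chosen = torch.randint(0, 4, (8,))  # index of the candidate the labeler preferred
loss = comparison_loss(model(features), chosen)
loss.backward()
```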
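The RL step described above optimizes the learned reward minus a KL penalty toward the pretrained model. The sketch below, again only illustrative (the function name, tensor shapes, and the fixed beta value are assumptions), shows how such a penalized reward can be computed from per-token log-probabilities.

```python
# Minimal sketch of a KL-penalized reward for RL fine-tuning (names are assumptions).
import torch

def penalized_reward(reward: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     pretrained_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """R(x, y) = r(x, y) - beta * [log pi(y|x) - log rho(y|x)].

    The subtracted term is a per-sample estimate of the KL divergence between
    the fine-tuned policy pi and the pretrained model rho; penalizing it keeps
    the policy close to text the reward model was trained to evaluate."""
    kl_estimate = (policy_logprobs - pretrained_logprobs).sum(dim=-1)
    return reward - beta * kl_estimate

# Usage with dummy values: one scalar reward per sampled continuation,
# and per-token log-probs for 16-token continuations.
reward = torch.tensor([1.3, -0.2])
policy_lp = torch.randn(2, 16) - 2.0
pretrained_lp = torch.randn(2, 16) - 2.0
print(penalized_reward(reward, policy_lp, pretrained_lp))
```

The fixed beta here is purely for illustration; the paper describes either holding the coefficient constant or adjusting it dynamically to reach a target KL value.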