Fine-Tuning Language Models from Human Preferences


8 Jan 2020 | Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, Geoffrey Irving
This paper presents a method for fine-tuning language models with human preferences to improve performance on natural language tasks. The approach combines advances in generative pretraining with human preference learning: a reward model is trained on human judgments, and the pretrained language model is then optimized against that reward with reinforcement learning (RL).

Concretely, human labelers compare candidate continuations and pick the one they prefer, and the reward model is trained to assign higher scores to the preferred samples. The language model is then fine-tuned with RL against this reward, with a KL penalty toward the pretrained model that prevents the policy from drifting too far from the distribution the reward model was trained on.

The method is applied to four tasks: stylistic continuation with positive sentiment or physically descriptive language, and summarization on the CNN/Daily Mail and TL;DR datasets. For stylistic continuation, the models achieve good results with only 5,000 human comparisons. For summarization, models trained with 60,000 comparisons are primarily "smart copiers": they copy whole sentences from the input while skipping irrelevant preamble, which yields reasonable ROUGE scores and strong ratings from human labelers but may exploit the fact that labelers rely on simple heuristics.

The paper also discusses challenges in the approach, including the difficulty of online data collection, overfitting when parameters are shared between the reward model and the policy, and the difficulty of labeling ambiguous tasks. The authors conclude that the method is effective for these tasks but limited by data quality and by the need for more abstractive models, and they suggest that future work focus on improving data collection and encouraging abstraction rather than copying. They view applying human reward learning to natural language as important for both capability and safety, since it helps models align more closely with human preferences.
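As a rough illustration of the reward-model step, the sketch below implements a softmax comparison loss over several candidate continuations, where the candidate chosen by the labeler should receive the highest score. This is a minimal sketch under stated assumptions, not the paper's code: the RewardModel class, feature shapes, and helper names are illustrative, and a real implementation would score (context, continuation) pairs with a full language-model backbone rather than pooled random features.

```python
# Minimal sketch of a preference-based reward-model loss (names are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy stand-in for a language-model-based reward model: maps a pooled
    representation of (context, continuation) to a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_candidates, hidden_dim)
        return self.score(features).squeeze(-1)  # -> (batch, num_candidates)

def comparison_loss(rewards: torch.Tensor, chosen: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over candidate rewards: the labeler-chosen candidate
    should receive the highest score (a softmax preference loss)."""
    return F.cross_entropy(rewards, chosen)

# Usage with random features standing in for pooled transformer states.
model = RewardModel()
features = torch.randn(8, 4, 768)   # 8 inputs, 4 candidate continuations each
chosen = torch.randint(0, 4, (8,))  # index of the candidate the labeler preferred
loss = comparison_loss(model(features), chosen)
loss.backward()
```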
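The RL step described above optimizes the learned reward minus a KL penalty toward the pretrained model. The sketch below, again only illustrative (the function name, tensor shapes, and the fixed beta value are assumptions), shows how such a penalized reward can be computed from per-token log-probabilities.

```python
# Minimal sketch of a KL-penalized reward for RL fine-tuning (names are assumptions).
import torch

def penalized_reward(reward: torch.Tensor,
                     policy_logprobs: torch.Tensor,
                     pretrained_logprobs: torch.Tensor,
                     beta: float = 0.1) -> torch.Tensor:
    """R(x, y) = r(x, y) - beta * [log pi(y|x) - log rho(y|x)].

    The subtracted term is a per-sample estimate of the KL divergence between
    the fine-tuned policy pi and the pretrained model rho; penalizing it keeps
    the policy close to text the reward model was trained to evaluate."""
    kl_estimate = (policy_logprobs - pretrained_logprobs).sum(dim=-1)
    return reward - beta * kl_estimate

# Usage with dummy values: one scalar reward per sampled continuation,
# and per-token log-probs for 16-token continuations.
reward = torch.tensor([1.3, -0.2])
policy_lp = torch.randn(2, 16) - 2.0
pretrained_lp = torch.randn(2, 16) - 2.0
print(penalized_reward(reward, policy_lp, pretrained_lp))
```

The fixed beta here is purely for illustration; the paper describes either holding the coefficient constant or adjusting it dynamically to reach a target KL value.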