28 Jun 2024 | William Muldrew, Peter Hayes, Mingtian Zhang, David Barber
This paper addresses the challenge of fine-tuning large language models (LLMs) to align with human intent, focusing on efficient use of human labelling resources. It introduces an active learning strategy for Direct Preference Optimization (DPO), a simpler and more stable alternative to Reinforcement Learning from Human or AI Preferences (RLHF/RLAIF). The authors propose acquisition functions based on the predictive entropy of the language model and on a measure of certainty in DPO's implicit preference model, which bias the fine-tuning process towards correcting confident but wrong predictions. Experiments on two datasets (IMDB and TLDR) using open-source models with approximately 1 billion parameters demonstrate that their approach improves win-rate performance by 1-6%. The paper also discusses related work, including other active learning techniques and the use of LLMs as evaluators, and suggests future directions for improving computational efficiency and data acquisition strategies.
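To make the two acquisition signals concrete, here is a minimal sketch (not the authors' code) of how they could be computed, assuming access to per-token logits from the policy and sequence log-probabilities from both the policy and a frozen reference model. All function names, arguments, and the toy values below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F


def predictive_entropy(token_logits: torch.Tensor) -> torch.Tensor:
    """Mean per-token entropy of the policy over one generated completion.

    token_logits: (seq_len, vocab_size) logits produced while sampling the completion.
    Higher values indicate the model is less certain about what it generated.
    """
    log_probs = F.log_softmax(token_logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)  # (seq_len,)
    return token_entropy.mean()


def implicit_preference_certainty(
    logp_policy_a: torch.Tensor, logp_ref_a: torch.Tensor,
    logp_policy_b: torch.Tensor, logp_ref_b: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Certainty of the DPO implicit preference model for a completion pair.

    Under DPO the implicit reward of a completion y is
        r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x)),
    so a large absolute reward margin between the two completions means the
    model already ranks the pair confidently; labelling such pairs lets
    training correct confident but wrong rankings.
    """
    reward_a = beta * (logp_policy_a - logp_ref_a)
    reward_b = beta * (logp_policy_b - logp_ref_b)
    return (reward_a - reward_b).abs()


# Toy usage with random numbers standing in for real model outputs.
if __name__ == "__main__":
    logits = torch.randn(20, 32000)  # one 20-token completion, 32k vocabulary
    print("predictive entropy:", predictive_entropy(logits).item())

    certainty = implicit_preference_certainty(
        torch.tensor(-35.0), torch.tensor(-40.0),   # completion A: policy, reference log-probs
        torch.tensor(-50.0), torch.tensor(-48.0),   # completion B: policy, reference log-probs
    )
    print("implicit preference certainty:", certainty.item())
```

In an active learning loop, scores like these would be computed for a pool of candidate prompt/completion pairs, and the highest-scoring pairs sent for human (or AI) preference labelling before the next round of DPO fine-tuning.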