12 Jun 2024 | Chris Lu, Samuel Holt, Claudio Fanconi, Alex J. Chan, Jakob Foerster, Mihaela van der Schaar, Robert Tjarko Lange
This paper introduces DiscoPOP, a novel preference optimization algorithm discovered through large language model (LLM)-driven objective discovery. Preference optimization is a key method for enhancing and controlling the quality of LLM outputs. Traditionally, it is approached as an offline supervised learning task using manually crafted, convex loss functions over pairs of preferred and dispreferred responses. However, these methods are constrained by human creativity, leaving the large space of possible loss functions underexplored. To address this, the authors propose an LLM-driven discovery process that automatically generates new state-of-the-art preference optimization algorithms without expert human intervention.
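To make the search space concrete, the objectives considered in this line of work can be written in a DPO-style form; the notation below is a reconstruction for illustration rather than a quotation from the paper:

$$
\mathcal{L}(\theta) = \mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\big[\, f(\beta \rho)\, \big],
\qquad
\rho = \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \log\frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)},
$$

where $(y_w, y_l)$ are the preferred and dispreferred responses, $\pi_{\mathrm{ref}}$ is a frozen reference policy, and $\beta$ controls the strength of regularization toward it. Choosing the logistic loss $f(z) = -\log \sigma(z)$ recovers DPO, while $f(z) = e^{-z}$ gives an exponential loss; the discovery process effectively searches over the scalar function $f$.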
The process iteratively prompts an LLM to propose and implement new preference optimization loss functions, conditioning on the performance metrics of previously evaluated candidates (see the sketch below). This yields previously unknown, performant preference optimization algorithms. The best-performing discovery, DiscoPOP, adaptively blends logistic and exponential losses. Experiments demonstrate that DiscoPOP achieves state-of-the-art performance and transfers successfully to held-out tasks.
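The discovery loop can be summarized with a short sketch. This is an illustrative Python outline under stated assumptions, not the authors' released implementation; the helpers `propose_objective_with_llm`, `compile_objective`, `train_with_objective`, and `evaluate_policy`, as well as `preference_dataset` and `validation_task`, are hypothetical placeholders.

```python
# Illustrative sketch of LLM-driven objective discovery (hypothetical helpers).
num_generations = 20
archive = []  # (candidate code, validation score) pairs fed back into the prompt

for generation in range(num_generations):
    # Ask the LLM for a new candidate loss function, conditioned on past results.
    candidate_code = propose_objective_with_llm(history=archive)
    try:
        loss_fn = compile_objective(candidate_code)        # turn code string into a callable
        policy = train_with_objective(loss_fn, preference_dataset)
        score = evaluate_policy(policy, validation_task)   # e.g. a reward or win-rate metric
    except Exception:
        score = float("-inf")  # invalid or non-running candidates are scored poorly
    archive.append((candidate_code, score))

best_code, best_score = max(archive, key=lambda pair: pair[1])
```

The key feedback signal is the archive of previously proposed objectives and their scores, which the LLM conditions on when proposing the next candidate.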
DiscoPOP is a weighted sum of the logistic and exponential losses, with the blending weight determined by the difference of policy and reference log-ratios. The resulting objective is non-convex, yet it performs well across multiple held-out evaluation tasks, including multi-turn dialogue (AlpacaEval 2.0), controlled sentiment generation (IMDb), and summarization (TL;DR). On these tasks, DiscoPOP outperforms existing preference optimization algorithms or performs competitively with them.
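A minimal PyTorch sketch of such a blended loss is shown below. The sigmoid gating, the temperature scaling, and the default values of `beta` and `temperature` are assumptions made purely for illustration; the exact form ships with the paper's open-source code.

```python
import torch
import torch.nn.functional as F

def blended_preference_loss(policy_chosen_logps, policy_rejected_logps,
                            reference_chosen_logps, reference_rejected_logps,
                            beta=0.1, temperature=0.05):
    """Sketch of a DiscoPOP-style log-ratio modulated loss (illustrative, not the released code)."""
    # Difference of policy and reference log-ratios (chosen vs. rejected responses).
    logits = (policy_chosen_logps - policy_rejected_logps) - \
             (reference_chosen_logps - reference_rejected_logps)

    # Mixing weight: a sigmoid of the (temperature-scaled) log-ratio difference.
    mix = torch.sigmoid(logits / temperature)

    logistic_loss = -F.logsigmoid(beta * logits)  # DPO-style logistic term
    exponential_loss = torch.exp(-beta * logits)  # exponential term

    # Adaptively blend the two losses based on the log-ratio difference.
    return (1.0 - mix) * logistic_loss + mix * exponential_loss
```

Because the mixing weight itself depends on the log-ratio difference, the blended objective is non-convex, consistent with the description above.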
The authors also analyze the limitations of DiscoPOP, noting that it struggles to converge when the regularization parameter β is set too low or too high. They suggest that future work could explore alternative forms of the objective with multiple tunable floating-point parameters. Additionally, the paper discusses the broader impact and ethical considerations of using LLMs to discover preference optimization objectives, emphasizing the need for content filters to prevent harmful outputs. The work is supported by various funding sources and is made available through an open-source repository.