KTO: Model Alignment as Prospect Theoretic Optimization


2024 | Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
This paper introduces KTO (Kahneman-Tversky Optimization), a new approach to aligning large language models (LLMs) with human feedback that is grounded in prospect theory. Prospect theory holds that humans perceive random variables in a biased but well-defined manner and, in particular, tend to be loss-averse. The authors show that existing alignment methods such as DPO implicitly encode some of these biases, which helps explain their success, although the utility functions these methods imply differ from those of prospect theory.

Building on this observation, the authors formalize a family of loss functions they call human-aware losses (HALOs) and propose KTO, a HALO that directly maximizes the utility of generations rather than the log-likelihood of preferences. KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters, despite learning only from a binary signal of whether an output is desirable. The authors argue that no single HALO is universally superior; the best loss depends on the inductive biases most appropriate for a given setting. They also analyze KTO theoretically, showing that it can outperform DPO in certain scenarios, particularly when there is a large imbalance between desirable and undesirable examples.
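To make the objective concrete, the following PyTorch sketch shows the shape of the per-batch KTO loss. It is a minimal illustration, not the authors' reference implementation: it assumes the summed token log-probabilities for the policy and a frozen reference model have already been computed, and it approximates the paper's KL-based reference point z0 with a detached batch mean of the implied rewards, a simplifying assumption. The names beta, lambda_d, and lambda_u correspond to the paper's β, λ_D, and λ_U.

```python
import torch

def kto_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             is_desirable: torch.Tensor,
             beta: float = 0.1,
             lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> torch.Tensor:
    """Illustrative sketch of the per-batch KTO loss.

    policy_logps / ref_logps: summed log-probabilities of each output
        under the trainable policy and a frozen reference model, shape (B,).
    is_desirable: boolean tensor of shape (B,), True for outputs labeled
        desirable, False for undesirable.
    """
    # Implied reward: log-ratio of policy to reference (as in DPO).
    rewards = policy_logps - ref_logps

    # Reference point z0. The paper estimates KL(policy || reference) from
    # mismatched input-output pairs in the batch; the detached, clamped
    # batch mean used here is a simplifying assumption for illustration.
    z0 = rewards.detach().mean().clamp(min=0)

    # Kahneman-Tversky-style value function: gains and losses are measured
    # relative to z0 and weighted asymmetrically (lambda_d vs. lambda_u),
    # which is how loss aversion and class imbalance enter the objective.
    v_desirable = lambda_d * torch.sigmoid(beta * (rewards - z0))
    v_undesirable = lambda_u * torch.sigmoid(beta * (z0 - rewards))
    values = torch.where(is_desirable, v_desirable, v_undesirable)

    # Minimizing (lambda - value) maximizes the utility of generations.
    lambdas = torch.where(is_desirable,
                          torch.full_like(rewards, lambda_d),
                          torch.full_like(rewards, lambda_u))
    return (lambdas - values).mean()
```

In the paper, the ratio of the desirable to undesirable weights is tuned to compensate for imbalance between the two classes of examples, which is precisely the setting where KTO is argued to have an edge over DPO.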
The authors suggest that KTO is particularly useful when human feedback comes in binary form, since such data is more abundant, cheaper, and faster to collect than preference data. The paper concludes that while KTO is effective, many open questions remain about how best to align LLMs with human feedback. Future work, the authors suggest, should develop HALOs that handle richer kinds of feedback, work across modalities, and resolve contradictions in feedback under different definitions of fairness. They also emphasize the importance of ecologically valid evaluation, in which aligned models are deployed in real-world settings to judge the merits of different HALOs.