KTO: Model Alignment as Prospect Theoretic Optimization

3 Jun 2024 | Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela
The paper "Kahneman-Tversky Optimization (KTO): Model Alignment as Prospect Theoretic Optimization" explores the alignment of large language models (LLMs) with human feedback using a framework inspired by Kahneman and Tversky's *prospect theory*. The authors argue that popular alignment methods, such as Direct Preference Optimization (DPO) and PPO-Clip, implicitly incorporate biases from prospect theory, which explains their success. These methods are categorized as *human-aware losses* (HALOs), which reflect how humans perceive random variables and make decisions. The paper proposes KTO, a new HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences. KTO uses a Kahneman-Tversky model of human utility, which captures properties like loss aversion. KTO is designed to work with binary feedback signals, making it more practical and scalable compared to methods that require preference data. Experiments show that KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters, even when using only binary feedback. KTO also handles extreme data imbalances well, outperforming DPO with up to 90% fewer desirable examples. Additionally, KTO can skip supervised fine-tuning (SFT) and directly align models, which is not possible with DPO. The authors conclude that there is no universally superior HALO; the best loss function depends on the inductive biases appropriate for a given setting. They suggest that future work should focus on identifying the best HALO for each context and developing methods that incorporate granular feedback, work with different modalities and model classes, and resolve contradictions in feedback.The paper "Kahneman-Tversky Optimization (KTO): Model Alignment as Prospect Theoretic Optimization" explores the alignment of large language models (LLMs) with human feedback using a framework inspired by Kahneman and Tversky's *prospect theory*. The authors argue that popular alignment methods, such as Direct Preference Optimization (DPO) and PPO-Clip, implicitly incorporate biases from prospect theory, which explains their success. These methods are categorized as *human-aware losses* (HALOs), which reflect how humans perceive random variables and make decisions. The paper proposes KTO, a new HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences. KTO uses a Kahneman-Tversky model of human utility, which captures properties like loss aversion. KTO is designed to work with binary feedback signals, making it more practical and scalable compared to methods that require preference data. Experiments show that KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B parameters, even when using only binary feedback. KTO also handles extreme data imbalances well, outperforming DPO with up to 90% fewer desirable examples. Additionally, KTO can skip supervised fine-tuning (SFT) and directly align models, which is not possible with DPO. The authors conclude that there is no universally superior HALO; the best loss function depends on the inductive biases appropriate for a given setting. They suggest that future work should focus on identifying the best HALO for each context and developing methods that incorporate granular feedback, work with different modalities and model classes, and resolve contradictions in feedback.