9 Jun 2024 | IGOR MELNYK*, YOUSSEF MROUEH*, BRIAN BELGODERE*, MATTIA RIGOTTI, APOORVA NITSURE, MIKHAIL YUROCHKIN, KRISTJAN GREENEWALD, JIRI NAVRATIL, AND JARRET ROSS
The paper introduces Alignment via Optimal Transport (AOT), a method for aligning Large Language Models (LLMs) with human preferences at the distributional level. Unlike existing methods that rely on pairwise human preferences, AOT works with unpaired preference data and aligns the model so that the reward distribution of positive samples stochastically dominates the reward distribution of negative samples. The alignment objective is formulated as an optimal transport problem with a smooth, convex cost, which in one dimension can be solved efficiently by sorting the empirical measures. AOT is evaluated on a range of datasets and models and outperforms state-of-the-art alignment techniques such as Direct Preference Optimization (DPO), Kahneman-Tversky Optimization (KTO), and Identity Preference Optimization (IPO). Empirically, AOT achieves state-of-the-art results on the AlpacaEval leaderboard with the Merlinite-7B model, outperforming other 7B-parameter models. The paper also provides a theoretical analysis of the sample complexity of AOT, showing that it converges at a parametric rate.
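To make the "sorting solves the transport problem" idea concrete, below is a minimal, hypothetical PyTorch sketch of an AOT-style unpaired loss. It assumes each sample's reward is the usual implicit reward beta * log(pi_theta(y|x) / pi_ref(y|x)), already computed, and that the chosen and rejected batches have equal size. The function name `aot_unpaired_loss` and the squared-hinge cost are illustrative assumptions, not the paper's exact implementation.

```python
import torch


def aot_unpaired_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor,
                      margin: float = 0.0) -> torch.Tensor:
    """Penalize violations of stochastic dominance of chosen over rejected rewards.

    In one dimension, the optimal transport plan for a convex cost is the
    monotone (comonotonic) coupling, so sorting both empirical reward samples
    matches them quantile-by-quantile.
    """
    assert chosen_rewards.shape == rejected_rewards.shape, "sketch assumes equal batch sizes"
    chosen_sorted, _ = torch.sort(chosen_rewards)      # ascending quantiles of positive rewards
    rejected_sorted, _ = torch.sort(rejected_rewards)  # ascending quantiles of negative rewards
    # Smooth convex cost: squared hinge on how much each rejected quantile
    # exceeds the corresponding chosen quantile (plus an optional margin).
    violation = torch.relu(rejected_sorted - chosen_sorted + margin)
    return (violation ** 2).mean()


# Toy usage: the loss is small when chosen rewards dominate rejected ones at every quantile.
chosen = torch.tensor([0.2, 1.5, 0.9, 2.0])
rejected = torch.tensor([0.1, 0.3, -0.5, 1.0])
print(aot_unpaired_loss(chosen, rejected))
```

Because the coupling is obtained by sorting rather than by matching specific prompt-response pairs, the loss only needs two unpaired batches of rewards, which is the distributional (rather than pairwise) character of the method described above.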