9 Jun 2024 | IGOR MELNYK, YOUSSEF MROUEH, BRIAN BELGODERE, MATTIA RIGOTTI, APOORVA NITSURE, MIKHAIL YUROCHKIN, KRISTJAN GREENEWALD, JIRI NAVRATIL, AND JARRET ROSS
This paper introduces a novel method for distributional preference alignment of large language models (LLMs), called Alignment via Optimal Transport (AOT). AOT aligns LLMs on unpaired preference data by ensuring that the reward distribution of positive samples stochastically dominates that of negative samples. The method relies on a convex relaxation of first-order stochastic dominance, formulated as an optimal transport problem with a smooth, convex cost, which admits a closed-form solution via sorting on empirical measures. AOT is trained to penalize violations of this stochastic dominance, yielding distributional alignment. The method achieves state-of-the-art results on benchmark datasets, including AlpacaEval and Open LLM Benchmarks. The paper also provides a statistical analysis showing that the method converges at the parametric rate. AOT is evaluated on both paired and unpaired datasets and outperforms alternative alignment methods such as DPO, KTO, and IPO. The method is implemented using a modified version of the HuggingFace Alignment Handbook and is shown to be efficient and robust. The paper concludes that AOT provides a powerful approach for distributional preference alignment of LLMs, leading to aligned models that perform well on a variety of benchmarks.
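To make the core idea concrete, the following is a minimal sketch of a sorting-based stochastic-dominance penalty under the assumptions described above: in one dimension, sorting the empirical reward samples gives the closed-form optimal transport coupling between quantiles, and violations of first-order dominance are penalized with a smooth convex (squared hinge) cost. This is not the authors' implementation (which modifies the HuggingFace Alignment Handbook); the function name `aot_style_loss` and the `margin` parameter are illustrative choices, and per-sample rewards would typically be policy-vs-reference log-likelihood ratios as in DPO-style objectives.

```python
import torch

def aot_style_loss(chosen_rewards: torch.Tensor,
                   rejected_rewards: torch.Tensor,
                   margin: float = 0.0) -> torch.Tensor:
    """Sketch of a sorting-based stochastic-dominance penalty.

    chosen_rewards / rejected_rewards: per-sample scalar rewards for
    positive and negative responses; the two batches need not be paired,
    but are assumed here to have equal size for simplicity.
    """
    # Sorting both empirical measures yields the monotone (closed-form)
    # one-dimensional optimal transport coupling between their quantiles.
    chosen_sorted, _ = torch.sort(chosen_rewards)
    rejected_sorted, _ = torch.sort(rejected_rewards)

    # Quantile-wise margins under the sorted coupling.
    margins = chosen_sorted - rejected_sorted

    # Smooth convex relaxation: only quantile pairs where the rejected
    # reward exceeds the chosen one (dominance violations, up to the
    # hypothetical `margin`) contribute to the loss.
    violations = torch.clamp(margin - margins, min=0.0)
    return (violations ** 2).mean()
```

In a training loop, this loss would be computed over a batch of rewards and backpropagated through the policy; gradients flow through `torch.sort` via the sorted values, so no differentiable relaxation of the sort itself is strictly required in this sketch.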