2 Apr 2024 | Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami
This paper explores the asymptotics of language model alignment, focusing on two popular methods: KL-constrained reinforcement learning (RL) and best-of-$N$. The authors provide a closed-form characterization of the optimal KL-constrained RL solution and show that any alignment method achieving a comparable trade-off between KL divergence and expected reward must approximate this solution in terms of relative entropy. They introduce simplifying assumptions—memoryless language models and linear reward models—to analyze the asymptotic behavior of both methods. The paper proves that the optimal KL-constrained RL solution satisfies a large deviation principle, and its rate function is characterized using information-theoretic quantities. Additionally, the paper shows that the scaled cumulants of the reward are related to the Rényi cross entropy of the alignment distribution. Finally, it demonstrates that the best-of-$N$ method is asymptotically equivalent to the optimal KL-constrained RL solution, as their expected rewards are asymptotically equal and their KL divergences are vanishingly small. This theoretical foundation provides insights into the empirical performance of best-of-$N$ and justifies its effectiveness in practice.
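As a point of reference for the closed-form characterization mentioned above, the optimal KL-constrained RL policy takes the familiar exponentially tilted form (notation here—$\pi_{\mathrm{ref}}$ for the reference model, $r$ for the reward, $\beta$ for the KL-regularization strength—is ours, not necessarily the paper's):

$$\pi^{*}_{\beta}(y \mid x) \;\propto\; \pi_{\mathrm{ref}}(y \mid x)\, \exp\!\left(\frac{r(x, y)}{\beta}\right),$$

i.e., the aligned model reweights the reference model by an exponential of the reward. For best-of-$N$, a commonly cited analytical upper bound on its divergence from the reference model is $D_{\mathrm{KL}}(\pi_{\mathrm{BoN}} \,\|\, \pi_{\mathrm{ref}}) \le \log N - \tfrac{N-1}{N}$, which grows only logarithmically in $N$—a sketch of the kind of reward–KL trade-off the paper analyzes, not a restatement of its results.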