Asymptotics of Language Model Alignment

2 Apr 2024 | Joy Qiping Yang, Salman Salamatian, Ziteng Sun, Ananda Theertha Suresh, Ahmad Beirami
This paper investigates the asymptotic behavior of two popular language model alignment methods: KL-constrained reinforcement learning (RL) and best-of-N. The goal is to align a reference language model $ p $ with a reward model $ r $: find an aligned distribution $ \phi $ that maximizes the expected reward while remaining close to $ p $ in KL divergence.

The paper gives a closed-form characterization of the optimal KL-constrained RL solution, showing that it is a mismatched tilted distribution (see the display below). It further shows that any alignment method achieving a comparable trade-off between KL divergence and expected reward must approximate this optimal solution in relative entropy.

Two simplifying assumptions make the asymptotic analysis tractable: the language model is memoryless, and the reward model is linear. Under these assumptions, the paper analyzes the asymptotic behavior of both best-of-N and KL-constrained RL in terms of information-theoretic quantities. It proves that the reward of the optimal KL-constrained RL solution satisfies a large deviation principle and fully characterizes its rate function, and it shows that the growth rate of the scaled cumulants of the reward is characterized by a proper Rényi cross entropy.

Finally, the paper demonstrates that best-of-N (sketched in code below) is asymptotically equivalent to the KL-constrained RL solution: their expected rewards are asymptotically equal, and the two distributions must be close in KL divergence. In other words, best-of-N achieves performance comparable to the optimal KL-constrained RL solution, suggesting that the two methods are asymptotically equivalent in certain settings. This provides theoretical justification for the practical success of the best-of-N method in language model alignment.
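As a concrete reference point, the optimal KL-regularized solution is the familiar exponential tilt of $ p $ by the reward. The display below is a sketch in our own notation (the regularization weight $ \beta $ and the per-token indexing are our labels, not necessarily the paper's), together with the factorized form suggested by our reading of the memoryless and linear-reward assumptions:

$$ \phi^\star = \arg\max_{\phi}\; \mathbb{E}_{y \sim \phi}\big[r(y)\big] - \beta\, \mathrm{KL}\big(\phi \,\|\, p\big), \qquad \phi^\star(y) \propto p(y)\, \exp\!\big(r(y)/\beta\big). $$

$$ p(y_{1:n}) = \prod_{i=1}^{n} p(y_i), \qquad r(y_{1:n}) = \sum_{i=1}^{n} r(y_i). $$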
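For concreteness, best-of-N can be sketched in a few lines: draw N candidates from the reference model $ p $ and keep the one the reward model scores highest. The `generate` and `reward` callables below are hypothetical stand-ins, not the paper's implementation:

```python
import random

def best_of_n(prompt, n, generate, reward):
    """Best-of-N sampling sketch: draw n candidates from the reference
    model and return the one with the highest reward.

    `generate` and `reward` are hypothetical stand-ins for the reference
    model p and the reward model r; they are not the paper's code.
    """
    candidates = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))


# Toy usage with a memoryless "model" and a linear (per-token) reward,
# purely to illustrate the interface.
def toy_generate(prompt):
    return "".join(random.choices("abc", k=8))  # i.i.d. tokens

def toy_reward(prompt, y):
    return y.count("a")  # reward adds up over tokens

print(best_of_n("some prompt", n=16, generate=toy_generate, reward=toy_reward))
```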