Theoretical guarantees on the best-of-n alignment policy

3 Jan 2024 | Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh
The paper "Theoretical guarantees on the best-of-$n$ alignment policy" by Ahmad Beirami et al. explores the theoretical foundations of the best-of-$n$ alignment policy, a popular method for aligning generative models to improve their output quality while preserving the original model's capabilities. The authors disprove a commonly used analytical expression for the KL divergence between the best-of-$n$ policy and the base policy, which claims that the KL divergence is equal to $\log(n) - (n - 1)/n$. They show that this formula is only an upper bound on the actual KL divergence and provide bounds on the gap between the upper bound and the exact KL divergence. The paper also introduces a new estimator for the KL divergence that better captures its behavior. Through theoretical derivations and numerical experiments, the authors demonstrate that the proposed estimator closely follows the true KL divergence, even in cases where the analytical formula is not accurate. The study highlights the importance of understanding the true KL divergence in the context of alignment techniques and provides a more accurate way to evaluate their performance.
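The gap between the exact KL divergence and the $\log(n) - (n - 1)/n$ formula can be checked directly on a toy example. The sketch below (an illustration, not the paper's own experiment) uses an assumed base policy over five outcomes with distinct rewards, so the best-of-$n$ policy has a closed-form pmf $\pi_n(y) = F(y)^n - F(y^-)^n$, where $F$ is the CDF in reward order; the exact KL then sits strictly below the commonly quoted bound:

```python
import math

# Toy setup (illustrative assumption, not from the paper): a base policy over
# 5 outcomes whose rewards are distinct and increasing in the outcome index,
# so "best of n" means "max outcome index among n i.i.d. draws".
p = [0.4, 0.3, 0.15, 0.1, 0.05]   # base policy probabilities, reward order
n = 4                              # number of samples in best-of-n

# Exact best-of-n pmf: outcome y is selected iff the max of n i.i.d. draws
# equals y, i.e. pi_n(y) = F(y)^n - F(y-1)^n with F the CDF in reward order.
cdf = []
acc = 0.0
for py in p:
    acc += py
    cdf.append(acc)
pi_n = [cdf[0] ** n] + [cdf[i] ** n - cdf[i - 1] ** n for i in range(1, len(p))]

# Exact KL(pi_n || p) versus the commonly quoted formula log(n) - (n-1)/n,
# which the paper shows is only an upper bound on the true divergence.
kl = sum(q * math.log(q / py) for q, py in zip(pi_n, p))
bound = math.log(n) - (n - 1) / n
print(f"KL(pi_n || p) = {kl:.4f}  <=  log(n) - (n-1)/n = {bound:.4f}")
```

Running this shows the exact KL falling below the analytical formula, consistent with the paper's claim that the formula overestimates the divergence.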