3 Jan 2024 | Ahmad Beirami, Alekh Agarwal, Jonathan Berant, Alexander D'Amour, Jacob Eisenstein, Chirag Nagpal, Ananda Theertha Suresh
This paper challenges a widely used analytical formula for the KL divergence between the best-of-n policy and the base policy, commonly quoted as KL = log(n) - (n-1)/n. The authors show that this expression is actually an upper bound on the true KL divergence, explore how tight the bound is in different scenarios, and propose a new estimator that provides a closer approximation.
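In symbols (a paraphrase of the correction in standard notation, not the paper's exact statement): writing π for the base policy and π^(n) for the best-of-n policy obtained by drawing n samples from π and keeping the highest-reward one, the claimed equality should be read as an inequality,

\[
  D_{\mathrm{KL}}\!\left(\pi^{(n)} \,\|\, \pi\right) \;\le\; \log n \;-\; \frac{n-1}{n},
\]

with the size of the gap depending on the base policy, as discussed below.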
The best-of-n policy selects the response with the highest reward among n samples drawn from a base policy. While this policy is simple and effective, its KL divergence from the base policy is usually quantified with the formula log(n) - (n-1)/n. The authors demonstrate, however, that this formula is not exact and can substantially overestimate the true KL divergence.
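As a concrete illustration, here is a minimal sketch of best-of-n sampling; base_policy_sample and reward are hypothetical stand-ins for the base model and the reward model, not functions from the paper.

```python
import random

def best_of_n(prompt, n, base_policy_sample, reward):
    """Draw n candidate responses from the base policy and keep the one
    with the highest reward (ties resolved by max's first-seen rule)."""
    candidates = [base_policy_sample(prompt) for _ in range(n)]
    return max(candidates, key=reward)

# Toy usage: a discrete base "policy" over three canned responses and a
# fixed reward table (both hypothetical, for illustration only).
responses = ["a", "b", "c"]
probs = [0.7, 0.2, 0.1]
rewards = {"a": 0.1, "b": 0.5, "c": 1.0}

sample = lambda prompt: random.choices(responses, weights=probs)[0]
print(best_of_n("some prompt", n=4, base_policy_sample=sample, reward=rewards.get))
```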
The paper derives bounds on the gap between the analytical formula and the true KL divergence. When the probability of the highest-reward outcome is low, the gap is small; when this probability is high, the gap can be large and grows without bound as n increases. The authors also propose a new estimator for the KL divergence that closely matches the true value in a variety of scenarios.
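To see why the gap can blow up, consider a simple worked example (my own, consistent with the claim above rather than taken from the paper): suppose the base policy π has only two outcomes and puts probability p on the higher-reward one, y_best. Best-of-n misses y_best only if all n draws do, so

\[
  \pi^{(n)}(y_{\text{best}}) = 1-(1-p)^n, \qquad
  D_{\mathrm{KL}}\!\left(\pi^{(n)} \,\|\, \pi\right)
    = \big(1-(1-p)^n\big)\log\frac{1-(1-p)^n}{p}
    + (n-1)\,(1-p)^n \log(1-p).
\]

As p approaches 1 the true divergence vanishes, while log(n) - (n-1)/n does not depend on p, so the gap approaches log(n) - (n-1)/n and therefore grows without bound in n.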
The new estimator is derived from the probability mass function of the best-of-n policy and is shown to provide a more accurate approximation of the KL divergence than the analytical formula. The paper includes numerical experiments that compare the new estimator with the analytical formula and the exact KL divergence, demonstrating that the new estimator performs better in most cases.
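A rough sketch of such a comparison on a toy discrete distribution is below. The pmf formula it uses is a generic order-statistics identity for a discrete base distribution with distinct rewards; it illustrates the kind of comparison reported in the paper and is not the authors' proposed estimator.

```python
import math

def best_of_n_pmf(probs, n):
    """Exact best-of-n pmf for a discrete base distribution whose outcomes
    are listed in increasing reward order (distinct rewards assumed):
    P(best of n == i) = F(i)^n - F(i-1)^n, where F is the base CDF."""
    cdf, prev, pmf = 0.0, 0.0, []
    for p in probs:
        cdf += p
        pmf.append(cdf**n - prev**n)
        prev = cdf
    return pmf

def kl(p, q):
    """KL(p || q), skipping zero-probability terms of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Toy base distribution over 4 outcomes sorted by increasing reward.
base = [0.4, 0.3, 0.2, 0.1]
for n in (2, 4, 16):
    exact = kl(best_of_n_pmf(base, n), base)
    formula = math.log(n) - (n - 1) / n
    print(f"n={n:>3}  exact KL={exact:.3f}  formula={formula:.3f}")
```

On toy examples like this one, the exact value stays below log(n) - (n-1)/n, consistent with the paper's claim that the formula is an upper bound.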
The paper concludes that the analytical formula for the KL divergence of the best-of-n policy is only an upper bound rather than an exact characterization, and that the new estimator provides a more reliable approximation. The authors also highlight the importance of understanding the theoretical guarantees of alignment policies, which have significant implications for the effectiveness and safety of generative models.