VARIATIONAL BEST-OF-N ALIGNMENT

4 Mar 2025 | Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
The paper introduces Variational Best-of-$N$ (vBoN), a method that improves the efficiency of the Best-of-$N$ (BoN) alignment algorithm for language models. BoN is a popular and effective algorithm that aligns language models with human preferences by drawing $N$ samples from the model and selecting the one with the highest reward. However, BoN is computationally expensive, reducing sampling throughput by a factor of $N$. To address this, vBoN fine-tunes the language model to minimize the reverse KL divergence to the BoN distribution. This approach is analogous to mean-field variational inference and yields a significant reduction in inference cost while maintaining performance close to that of BoN. Experiments on controlled generation and summarization tasks show that vBoN achieves high reward values while remaining close to the reference model, outperforming other alignment methods and models fine-tuned with standard KL-constrained reinforcement learning (RL) objectives. The paper also discusses the theoretical connections between the vBoN objective and the KL-constrained RL objective, highlighting their similarities and differences.
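
To make the baseline concrete, the sketch below shows BoN sampling as described above: draw $N$ candidates and keep the highest-reward one. The `generate` and `reward` callables are hypothetical stand-ins for the reference language model's sampler and a learned reward model, not functions from the paper's code; this is a minimal illustrative sketch under those assumptions, not the authors' implementation.

```python
# Minimal sketch of Best-of-N (BoN) sampling as summarized above.
# `generate` and `reward` are hypothetical placeholders for a sampler over the
# reference model and a learned reward model; the names are not from the paper.
from typing import Callable, List


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],        # draws one response from the reference model
    reward: Callable[[str, str], float],   # scores a (prompt, response) pair
    n: int = 16,
) -> str:
    """Draw n candidate responses and return the one with the highest reward.

    Inference cost grows linearly in n, which is the throughput penalty
    that vBoN aims to remove.
    """
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda y: reward(prompt, y))
```

vBoN replaces this $N$-sample search at inference time with a single sample from a fine-tuned model: the model is trained to minimize the reverse KL divergence $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{BoN}})$ to the distribution induced by the procedure above (the notation here is ours, following the summary).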