Variational Best-of-N Alignment

2025 | Afra Amini, Tim Vieira, Elliott Ash, Ryan Cotterell
Variational Best-of-N (vBoN) is an alignment method for language models that approximates the Best-of-N (BoN) algorithm through fine-tuning. BoN, a popular inference-time alignment method, draws N samples from the model and returns the one with the highest reward; it is effective but computationally expensive, since generating N candidates per query reduces throughput by a factor of N. vBoN removes this overhead by fine-tuning the language model to minimize the reverse KL divergence to the BoN distribution, so that a single sample from the fine-tuned model performs close to BoN at standard inference cost. The approach is analogous to mean-field variational inference, hence the name vBoN.

Because BoN uses the reward only to rank candidates, the BoN distribution, and with it the vBoN objective, is invariant to monotonically increasing transformations of the reward. This makes vBoN robust to reward scaling and to outliers in reward values.

Experiments on controlled generation and summarization show that vBoN achieves performance close to BoN and outperforms models fine-tuned with standard KL-constrained reinforcement learning (RL) objectives. In controlled generation, vBoN appears more frequently on the Pareto frontier of reward versus KL divergence than competing methods. In summarization, it attains higher reward values across a range of sampling temperatures. In sentiment control, it outperforms other alignment methods in both win rate and average reward. Overall, vBoN is a promising alternative to BoN, offering comparable performance while cutting inference cost by a factor of N.
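To make the selection rule concrete, the following is a minimal Python sketch of BoN sampling. The names generate and reward are hypothetical stand-ins for a language-model sampler and a reward-model scorer, not functions from the paper.

def best_of_n(generate, reward, prompt, n):
    # Draw n independent candidates from the language model.
    candidates = [generate(prompt) for _ in range(n)]
    # Return the candidate that the reward model scores highest.
    return max(candidates, key=reward)

Only the ordering of reward values matters in this procedure, which is why BoN, and hence the vBoN objective, is unchanged by monotonically increasing reward transformations.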
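As a sketch of the fine-tuning objective, write π_ref for the reference model, π_θ for the fine-tuned model, and F(y) for the probability under π_ref that a fresh sample's reward does not exceed r(y). Ignoring ties in reward (an assumption made here for brevity), minimizing the reverse KL divergence to the BoN distribution amounts, up to constants, to maximizing

    E_{y ~ π_θ}[(N − 1) · log F(y)] − KL(π_θ ‖ π_ref),

a KL-regularized objective in which the usual reward is replaced by the log of the reward's CDF under the reference model. Since F depends only on the reward's rank, this form makes the invariance to reward transformations explicit.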