BOND: Aligning LLMs with Best-of-N Distillation

19 Jul 2024 | Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
The paper introduces BOND (Best-of-N Distillation), a novel reinforcement learning from human feedback (RLHF) algorithm designed to align large language models (LLMs) with the quality and safety standards set by Best-of-N sampling. Best-of-N sampling is a simple inference-time strategy that selects the best generation among \( N \) candidates, improving quality but multiplying the computational cost of inference. BOND aims to match the performance of Best-of-N sampling at inference time without this overhead.

The key idea of BOND is to cast the alignment problem as a distribution matching problem, where the policy is fine-tuned to emulate the Best-of-N distribution. The authors derive an analytical expression for the Best-of-N distribution and use the Jeffreys divergence, a weighted combination of the forward and backward KL divergences, to balance mode-covering and mode-seeking behavior. They propose an iterative BOND approach that uses a moving anchor policy to continuously improve policy performance while keeping the sample complexity low.

Experiments on abstractive summarization and on Gemma models demonstrate the effectiveness of BOND. The proposed J-BOND algorithm, which integrates Monte-Carlo quantile estimation, the Jeffreys divergence, and an iterative procedure with an exponential moving average (EMA) anchor, outperforms standard RLHF algorithms: it improves the KL-reward Pareto front and also strengthens performance on academic benchmarks and in side-by-side comparisons against open-source variants.
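To make the ingredients above concrete, here is a minimal Python sketch (not the authors' implementation) of three pieces the summary mentions: Best-of-N sampling against a reward model, the Jeffreys divergence that mixes forward and backward KL, and an EMA anchor update in the spirit of iterative BOND / J-BOND. The functions `sample_from_policy` and `reward` are hypothetical toy stand-ins for an LLM sampler and a learned reward model, and the `beta` / `gamma` values are illustrative, not the paper's settings.

```python
import math
import random

# Hypothetical stand-ins: in practice these would be an LLM sampler
# and a learned reward model.
def sample_from_policy(prompt: str) -> str:
    """Draw one candidate completion from the reference policy (toy example)."""
    return prompt + " " + random.choice(["answer A", "answer B", "answer C"])

def reward(completion: str) -> float:
    """Toy reward: favors longer completions, with a bit of noise."""
    return len(completion) + random.gauss(0.0, 1.0)

def best_of_n(prompt: str, n: int = 16) -> str:
    """Best-of-N sampling: draw N candidates, keep the highest-reward one."""
    candidates = [sample_from_policy(prompt) for _ in range(n)]
    return max(candidates, key=reward)

def jeffreys_divergence(p, q, beta: float = 0.5) -> float:
    """Jeffreys divergence over a shared discrete support:
    (1 - beta) * KL(p || q) + beta * KL(q || p).
    (One common weighting convention; assumes p and q have full support.)"""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q) if qi > 0)
    return (1 - beta) * kl_pq + beta * kl_qp

def ema_anchor_update(anchor, policy, gamma: float = 0.99):
    """Moving-anchor step: the anchor tracks an exponential moving
    average of the policy weights, as in the iterative procedure."""
    return [gamma * a + (1 - gamma) * w for a, w in zip(anchor, policy)]

if __name__ == "__main__":
    print(best_of_n("Summarize:", n=8))
    print(jeffreys_divergence([0.7, 0.3], [0.5, 0.5], beta=0.5))
```

As a rough intuition for the distribution-matching view: if the reward is continuous with cumulative distribution function \( F \) under the reference policy, the Best-of-N distribution has a simple closed form, roughly \( \pi_{\text{BoN}}(y) = N\,\pi_{\text{ref}}(y)\,F(y)^{N-1} \) (up to tie-breaking), which is what makes distilling it into the policy tractable.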