BOND: Aligning LLMs with Best-of-N Distillation

19 Jul 2024 | Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shariari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem
The paper introduces BOND (Best-of-N Distillation), a novel reinforcement learning from human feedback (RLHF) algorithm designed to align large language models (LLMs) with the quality and safety standards set by Best-of-N sampling. Best-of-N sampling is a simple inference-time strategy that selects the best generation among \( N \) candidates, improving quality but multiplying the computational cost of inference. BOND aims to match the performance of Best-of-N sampling at inference time without this overhead.

The key idea of BOND is to cast the alignment problem as a distribution matching problem, where the policy is fine-tuned to emulate the Best-of-N distribution. The authors derive an analytical expression for the Best-of-N distribution and use the Jeffreys divergence, a weighted combination of the forward and backward KL divergences, to balance mode-covering and mode-seeking behavior. They propose an iterative BOND approach that uses a moving anchor policy to continuously improve policy performance while keeping the sample complexity low.

Experiments on abstractive summarization and on Gemma models demonstrate the effectiveness of BOND. The proposed J-BOND algorithm, which integrates Monte-Carlo quantile estimation, the Jeffreys divergence, and an iterative procedure with an exponential moving average (EMA) anchor, outperforms standard RLHF algorithms: it improves the KL-reward Pareto front and also strengthens performance on academic benchmarks and in side-by-side comparisons against open-source variants.
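To make the ingredients above concrete, here is a minimal Python sketch (not the authors' implementation) of three pieces the summary mentions: Best-of-N sampling against a reward model, the Jeffreys divergence that mixes forward and backward KL, and an EMA anchor update in the spirit of iterative BOND / J-BOND. The functions `sample_from_policy` and `reward` are hypothetical toy stand-ins for an LLM sampler and a learned reward model, and the `beta` / `gamma` values are illustrative, not the paper's settings.

```python
import math
import random

# Hypothetical stand-ins: in practice these would be an LLM sampler
# and a learned reward model.
def sample_from_policy(prompt: str) -> str:
    """Draw one candidate completion from the reference policy (toy example)."""
    return prompt + " " + random.choice(["answer A", "answer B", "answer C"])

def reward(completion: str) -> float:
    """Toy reward: favors longer completions, with a bit of noise."""
    return len(completion) + random.gauss(0.0, 1.0)

def best_of_n(prompt: str, n: int = 16) -> str:
    """Best-of-N sampling: draw N candidates, keep the highest-reward one."""
    candidates = [sample_from_policy(prompt) for _ in range(n)]
    return max(candidates, key=reward)

def jeffreys_divergence(p, q, beta: float = 0.5) -> float:
    """Jeffreys divergence over a shared discrete support:
    (1 - beta) * KL(p || q) + beta * KL(q || p).
    (One common weighting convention; assumes p and q have full support.)"""
    kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    kl_qp = sum(qi * math.log(qi / pi) for pi, qi in zip(p, q) if qi > 0)
    return (1 - beta) * kl_pq + beta * kl_qp

def ema_anchor_update(anchor, policy, gamma: float = 0.99):
    """Moving-anchor step: the anchor tracks an exponential moving
    average of the policy weights, as in the iterative procedure."""
    return [gamma * a + (1 - gamma) * w for a, w in zip(anchor, policy)]

if __name__ == "__main__":
    print(best_of_n("Summarize:", n=8))
    print(jeffreys_divergence([0.7, 0.3], [0.5, 0.5], beta=0.5))
```

As a rough intuition for the distribution-matching view: if the reward is continuous with cumulative distribution function \( F \) under the reference policy, the Best-of-N distribution has a simple closed form, roughly \( \pi_{\text{BoN}}(y) = N\,\pi_{\text{ref}}(y)\,F(y)^{N-1} \) (up to tie-breaking), which is what makes distilling it into the policy tractable.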