Weak-to-Strong Extrapolation Expedites Alignment


22 May 2024 | Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng
The paper introduces ExPO, a method that enhances the alignment of large language models (LLMs) with human preferences without any additional training. Inspired by model interpolation, ExPO extrapolates directly from the weights of an initial supervised fine-tuned (SFT) model and a model further trained with direct preference optimization (DPO) or reinforcement learning from human feedback (RLHF) to obtain a better-aligned model. The method is simple, efficient, and scalable, and it improves off-the-shelf DPO/RLHF models across a range of sizes and capabilities. Experiments on twelve open-source LLMs show significant gains on benchmarks such as AlpacaEval 2.0 and MT-Bench. The paper also studies the impact of training-data size and hyperparameters, showing that ExPO amplifies the reward signal learned during alignment training. Controlled experiments reveal that ExPO can boost models trained with less preference data, even outperforming fully trained models. The work suggests that model extrapolation is a promising approach for expediting LLM alignment with human preferences.
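To make the weight-extrapolation idea concrete, below is a minimal sketch, assuming the extrapolation takes the linear form θ_ExPO = θ_aligned + α(θ_aligned − θ_SFT) over matching parameter tensors of the SFT and DPO/RLHF checkpoints. The checkpoint paths and the α value are illustrative placeholders, not values from the paper.

```python
import torch
from transformers import AutoModelForCausalLM

# Hypothetical checkpoint paths and extrapolation strength; substitute your own.
SFT_MODEL = "path/to/sft-checkpoint"
ALIGNED_MODEL = "path/to/dpo-checkpoint"
ALPHA = 0.3  # assumed example value for the extrapolation hyperparameter

def extrapolate(sft_model, aligned_model, alpha):
    """Return a state dict extrapolated beyond the aligned model:
    theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft)."""
    sft_state = sft_model.state_dict()
    expo_state = {}
    for name, aligned_param in aligned_model.state_dict().items():
        delta = aligned_param - sft_state[name]
        expo_state[name] = aligned_param + alpha * delta
    return expo_state

sft = AutoModelForCausalLM.from_pretrained(SFT_MODEL, torch_dtype=torch.float32)
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_MODEL, torch_dtype=torch.float32)

# Load the extrapolated weights into the aligned model's architecture and save.
aligned.load_state_dict(extrapolate(sft, aligned, ALPHA))
aligned.save_pretrained("path/to/expo-checkpoint")
```

Here α controls how far beyond the aligned checkpoint the weights are pushed; since the summary notes that hyperparameters are studied in the paper, in practice α would be chosen by evaluating a few candidate values on a small development set.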