22 May 2024 | Chujie Zheng, Ziqi Wang, Heng Ji, Minlie Huang, Nanyun Peng
ExPO is a simple and efficient method for improving the alignment of large language models (LLMs) with human preferences without additional training. It takes the weights of an already aligned model (e.g., one trained via DPO or RLHF) and its initial supervised fine-tuned (SFT) checkpoint, and extrapolates from them to obtain a better-aligned model. The underlying assumption is that a stronger model lies beyond the two relatively weaker ones along the direction connecting their weights; ExPO reaches it with a simple linear extrapolation formula that amplifies the reward signal learned during alignment training.
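The extrapolation itself amounts to a single weighted combination of the two checkpoints' weights. Below is a minimal sketch in PyTorch of this kind of weight extrapolation; the model IDs are placeholders and the coefficient name `alpha` is our notation, not necessarily the paper's released code or checkpoints.

```python
# Minimal sketch of ExPO-style weight extrapolation (illustrative):
#   theta_expo = theta_aligned + alpha * (theta_aligned - theta_sft)
import torch
from transformers import AutoModelForCausalLM

def extrapolate(sft_model, aligned_model, alpha: float) -> dict:
    """Extrapolate from the SFT checkpoint through the aligned checkpoint."""
    sft_state = sft_model.state_dict()
    expo_state = {}
    with torch.no_grad():
        for name, aligned_param in aligned_model.state_dict().items():
            delta = aligned_param - sft_state[name]           # direction learned during alignment
            expo_state[name] = aligned_param + alpha * delta  # move further along that direction
    return expo_state

# Placeholder model IDs; substitute a real SFT/DPO checkpoint pair.
sft = AutoModelForCausalLM.from_pretrained("org/model-sft")
dpo = AutoModelForCausalLM.from_pretrained("org/model-dpo")
dpo.load_state_dict(extrapolate(sft, dpo, alpha=0.3))
dpo.save_pretrained("model-expo")
```

Setting the coefficient to 0 recovers the aligned model unchanged, while values between -1 and 0 would interpolate back toward the SFT checkpoint; the coefficient is the only knob the procedure introduces, which is what makes it training-free and cheap.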
Experiments on twelve open-source LLMs show that ExPO consistently improves the performance of off-the-shelf DPO/RLHF models on benchmarks such as AlpacaEval 2.0 and MT-Bench. It achieves improvements of up to 4.5% on AlpacaEval 2.0 and 0.66 on MT-Bench. ExPO also demonstrates remarkable scalability across various model sizes and capabilities.
The method builds on model extrapolation and is inspired by the concept of mode connectivity in neural networks. ExPO can be viewed as implicitly optimizing the alignment objective through a first-order approximation, which is what lets it amplify the reward signal learned during alignment training. The flip side is that it can also amplify spurious features, such as length bias, when the extrapolation direction is not aligned with true human preference.
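One way to make the first-order argument concrete (the notation here is ours, not fixed by the summary above): let $\theta_0$ be the SFT weights, $\theta_1$ the aligned weights, $\Delta\theta = \theta_1 - \theta_0$ the update direction learned during alignment, and $r(\theta)$ the implicit alignment reward. Extrapolating by a coefficient $\alpha > 0$ gives, to first order,

$$ r(\theta_1 + \alpha\,\Delta\theta) \;\approx\; r(\theta_1) + \alpha\,\nabla r(\theta_1)^{\top}\Delta\theta, $$

so as long as $\Delta\theta$ remains an ascent direction of $r$ at $\theta_1$, moving further along it increases the approximated reward; conversely, if $\Delta\theta$ mainly encodes spurious features such as verbosity, those get amplified instead.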
Controlled experiments show that ExPO's effectiveness hinges on the quality of the extrapolation direction, which is in turn shaped by the training configuration of the aligned model. ExPO performs best when the aligned model has been trained on sufficient preference data and genuinely reflects human preferences. It also remains effective when the model is trained on only a moderate amount of preference data, provided the resulting extrapolation direction is accurate enough.
ExPO is applicable to a wide range of LLMs, including those trained with advanced alignment algorithms such as iterative DPO. It is a promising approach for expediting the alignment of LLMs with human preferences, as it requires no additional training and is computationally efficient. The method is simple to implement and can be applied to various model sizes and capabilities. Overall, ExPO demonstrates the potential of model extrapolation in improving the alignment of LLMs with human preferences.