Decoding-Time Language Model Alignment with Multiple Objectives

29 Jun 2024 | Ruizhe Shi, Yifang Chen, Yushi Hu, Alisa Liu, Hannaneh Hajishirzi, Noah A. Smith, Simon S. Du
This paper introduces Multi-Objective Decoding (MOD), a training-free algorithm for aligning language models (LMs) to multiple objectives simultaneously. MOD decodes each next token by combining the next-token predictions of multiple base models, each trained for a single objective. The algorithm exploits a common functional form shared by f-divergence-regularized alignment approaches, such as PPO and DPO, to derive a closed-form combination rule via Legendre transforms.

Theoretical analysis shows that existing methods can be sub-optimal, while MOD comes with optimality guarantees. The analysis also establishes the necessity of barrier functions in multi-objective alignment and the sub-optimality of parameter-merging approaches under certain conditions.

Empirically, MOD achieves significant reward improvements over parameter-merging baselines, particularly in tasks such as safety alignment and coding. It can efficiently combine models of different sizes and objectives, and works with RLHF, DPO, and supervised fine-tuned (SFT) models. Because users can adjust preference weightings at inference time without retraining, the method is highly flexible: MOD reduces toxicity on Toxigen to nearly zero while improving other metrics by up to 33.3%. The framework applies to a wide range of tasks and can be extended to other decoding algorithms. Overall, MOD provides a flexible and efficient solution for multi-objective LM alignment.
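To make the decoding-time combination concrete, below is a minimal sketch of the core idea: at each step, the next-token log-probabilities of several single-objective models are mixed with user-chosen preference weights and renormalized, i.e. a weighted geometric mixture of the per-objective policies. This is an illustration under the simplifying assumption of reverse-KL regularization with weights summing to one, not the paper's exact closed form; the model names are hypothetical placeholders.

```python
# Sketch: decoding-time combination of single-objective policies via a
# weighted geometric mixture of their next-token distributions.
# Assumptions: the two checkpoints are hypothetical, share one tokenizer,
# and the plain geometric mixture is a simplification of MOD's closed form.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAMES = ["example/helpful-lm", "example/harmless-lm"]  # hypothetical
WEIGHTS = torch.tensor([0.6, 0.4])  # user preference weights, sum to 1

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAMES[0])
models = [AutoModelForCausalLM.from_pretrained(n).eval() for n in MODEL_NAMES]

@torch.no_grad()
def mod_generate(prompt: str, max_new_tokens: int = 64) -> str:
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        # Next-token log-probabilities under each base policy.
        logps = torch.stack(
            [m(ids).logits[:, -1].log_softmax(-1) for m in models]
        )  # shape: (num_models, batch, vocab)
        # Weighted sum in log space = geometric mixture; renormalize.
        combined = (WEIGHTS.view(-1, 1, 1) * logps).sum(0).log_softmax(-1)
        next_id = combined.argmax(-1, keepdim=True)  # greedy decoding
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True)

print(mod_generate("How do I politely decline an invitation?"))
```

Changing `WEIGHTS` at inference time shifts the trade-off between objectives without any retraining, which is the versatility the paper highlights.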