2024 | Tianlin Liu, Shangmin Guo, Leonardo Bianco, Daniele Calandriello, Quentin Berthet, Felipe Llinares, Jessica Hoffmann, Lucas Dixon, Michal Valko, Mathieu Blondel
Decoding-time Realignment of Language Models
Aligning language models with human preferences is crucial for reducing errors and biases in these models. Alignment techniques, such as reinforcement learning from human feedback (RLHF), aim to optimize a tradeoff between human preference rewards and a proximity regularization term that encourages staying close to the unaligned model. Selecting an appropriate level of regularization is critical: insufficient regularization can lead to reduced model capabilities due to reward hacking, whereas excessive regularization hinders alignment. Traditional methods for finding the optimal regularization level require retraining multiple models with varying regularization strengths, which is resource-intensive, especially for large models. To address this challenge, we propose decoding-time realignment (DeRa), a simple method to explore and evaluate different regularization strengths in aligned models without retraining. DeRa enables control over the degree of alignment, allowing users to smoothly transition between unaligned and aligned models. It also enhances the efficiency of hyperparameter tuning by enabling the identification of effective regularization strengths using a validation dataset.
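For concreteness, the KL-regularized alignment objective referred to above is commonly written as follows (notation is ours, not copied from the paper; $\pi_{\mathrm{ref}}$ is the unaligned reference model, $r$ the reward, and $\beta$ the regularization strength):

$$
\max_{\pi}\;\; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\!\left[r(x, y)\right] \;-\; \beta\, \mathrm{KL}\!\left(\pi(\cdot \mid x)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid x)\right),
$$

whose well-known closed-form solution is the Gibbs distribution $\pi_{\beta}(y \mid x) \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(r(x, y)/\beta\right)$. Under this standard formula, the geometric mixture $\pi_{\mathrm{ref}}(y \mid x)^{1-\lambda}\,\pi_{\beta}(y \mid x)^{\lambda} \propto \pi_{\mathrm{ref}}(y \mid x)\,\exp\!\left(\lambda\, r(x, y)/\beta\right)$ is itself the solution for KL strength $\beta/\lambda$, which is the observation behind the result summarized in the next paragraph.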
DeRa adjusts the alignment level of language models at decoding time, allowing a fast sweep over values of the realignment parameter λ to find the best balance between alignment and fluency. The main contributions of the paper are summarized as follows: based on the KL-regularized alignment objective, we prove that aligned models trained with different KL regularization strengths are all geometric mixtures of a reference model and a single aligned model, differing only in their mixing weights. We introduce a new method, DeRa, which offers an autoregressive approximation to these geometric mixtures and evaluates various regularization strengths in aligned language models at decoding time, without retraining. Our experiments show that DeRa facilitates controlling alignment strength, speeds up hyperparameter tuning, and helps navigate performance tradeoffs in downstream tasks.
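As a rough illustration of the idea (a sketch in our own code, not the authors' implementation), the autoregressive approximation can be realized by blending per-token log-probabilities of the reference and aligned models with weight λ before sampling, which corresponds to a per-token geometric mixture π_ref^{1-λ} · π_aligned^{λ}. The sketch assumes HuggingFace-style causal LMs whose forward pass returns `.logits`; all names are placeholders.

```python
import torch
import torch.nn.functional as F

def dera_step(ref_logits, aligned_logits, lam):
    """Blend per-token log-probabilities of the reference and aligned models.

    lam = 0 recovers the reference (unaligned) model, lam = 1 the aligned
    model; intermediate values interpolate, and lam > 1 extrapolates toward
    stronger alignment (a smaller effective KL strength).
    """
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    aligned_logp = F.log_softmax(aligned_logits, dim=-1)
    mixed = (1.0 - lam) * ref_logp + lam * aligned_logp
    return F.log_softmax(mixed, dim=-1)  # renormalize the geometric mixture

@torch.no_grad()
def dera_generate(ref_model, aligned_model, input_ids, lam, max_new_tokens=64):
    """Greedy decoding from the per-token DeRa mixture (sampling also works)."""
    ids = input_ids
    for _ in range(max_new_tokens):
        ref_logits = ref_model(ids).logits[:, -1, :]
        aligned_logits = aligned_model(ids).logits[:, -1, :]
        next_logp = dera_step(ref_logits, aligned_logits, lam)
        next_token = next_logp.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_token], dim=-1)
    return ids
```

Because only two forward passes per token are needed, changing λ requires no retraining: the same two checkpoints are reused for every candidate regularization strength.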
DeRa is independent of the alignment approach used. We demonstrate that DeRa can be applied to models aligned with different methods, including the policy-gradient approach, which uses online reward annotations, and direct preference optimization (DPO), which uses offline preference data. Our experiments show that DeRa is a faithful approximation of the retrained model: sweeping over λ identifies KL strengths that outperform the original base KL strength and exposes under-regularized models. DeRa can also be used to control hallucinations in neutral response generation and to control the alignment level in summarization tasks. Overall, DeRa is a cost-effective way to determine effective KL-strength hyperparameters: it streamlines hyperparameter tuning and reduces computational costs by avoiding unnecessary retraining across a wide range of regularization strengths.
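A minimal sketch of what such a hyperparameter sweep might look like, reusing the `dera_generate` helper above; `evaluate_on_validation` is a hypothetical scoring function (e.g., a reward-model score or win rate), not an API from the paper.

```python
def sweep_lambdas(ref_model, aligned_model, val_prompts, lambdas, evaluate_on_validation):
    """Rank candidate regularization strengths on a validation set without retraining."""
    scores = {}
    for lam in lambdas:
        generations = [
            dera_generate(ref_model, aligned_model, prompt_ids, lam)
            for prompt_ids in val_prompts
        ]
        scores[lam] = evaluate_on_validation(generations)
    best_lam = max(scores, key=scores.get)
    return best_lam, scores

# Example: lambdas = [0.0, 0.25, 0.5, 0.75, 1.0, 1.5] spans from the unaligned
# reference (0.0) through the trained KL strength (1.0) to stronger alignment.
```

The selected λ can then guide which single KL strength, if any, is worth retraining at full scale.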