11 Jun 2024 | Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
This paper proposes RE-CONTROL, a method for aligning large language models (LLMs) at test time through representation editing. The core idea is to treat a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system and to inject control signals into its representation space to steer generation toward a specified alignment objective. A value function is trained directly on the model's hidden states according to the Bellman equation, which enables gradient-based optimization of the control signals at test time. Because the value function is only a two- or three-layer neural network, the intervention is fast and lightweight. The control signals are also regularized to remain as small as possible, preserving the generation quality of the original LLM.

Experimentally, RE-CONTROL outperforms existing test-time alignment techniques while requiring far fewer resources than fine-tuning methods. It achieves the highest alignment score in terms of win rate as judged by GPT-4, maintains generation quality as measured by diversity and coherence, and outperforms static representation-editing baselines. Compared with other test-time alignment methods, it shows superior performance in both average reward and GPT-4 evaluation. It also generalizes well to out-of-distribution data. Together, these results indicate that RE-CONTROL is a competitive alternative to fine-tuning methods.
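To make the mechanism concrete, below is a minimal PyTorch sketch of the test-time step: a small value network scores a hidden state, and regularized gradient ascent finds a control signal that is added to that state before decoding continues. All names, layer sizes, and hyperparameters are illustrative assumptions rather than the authors' implementation, and the value network is assumed to have already been trained on hidden states with a Bellman-style regression target.

```python
# Illustrative sketch of test-time representation control (not the paper's exact code).
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Small MLP mapping a hidden state h_t to a scalar value estimate."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)


def control_hidden_state(
    h: torch.Tensor,           # hidden state at the current step, shape (batch, hidden_dim)
    value_net: ValueNetwork,   # assumed pre-trained via a Bellman regression target
    n_steps: int = 10,         # illustrative number of gradient steps
    step_size: float = 0.1,    # illustrative learning rate for the control signal
    reg_weight: float = 1.0,   # weight of the ||u||^2 penalty keeping the edit small
) -> torch.Tensor:
    """Find a small control signal u that increases the value of h + u."""
    u = torch.zeros_like(h, requires_grad=True)
    optimizer = torch.optim.Adam([u], lr=step_size)
    for _ in range(n_steps):
        optimizer.zero_grad()
        # Maximize the value of the edited state while regularizing the edit.
        objective = value_net(h + u).sum() - reg_weight * (u ** 2).sum()
        (-objective).backward()
        optimizer.step()
    return (h + u).detach()


if __name__ == "__main__":
    hidden_dim = 4096                       # e.g., the hidden size of a 7B-scale LLM
    value_net = ValueNetwork(hidden_dim)
    h_t = torch.randn(2, hidden_dim)        # stand-in for real LLM hidden states
    h_edited = control_hidden_state(h_t, value_net)
    print(h_edited.shape)                   # the edited state would feed the LM head
```

In use, the edited state h + u would replace the original hidden state when computing the next-token logits, and the procedure would repeat at each decoding step.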