11 Jun 2024 | Lingkai Kong, Haorui Wang, Wenhao Mu, Yuanqi Du, Yuchen Zhuang, Yifei Zhou, Yue Song, Rongzhi Zhang, Kai Wang, Chao Zhang
This paper proposes RE-CONTROL, a method for aligning large language models (LLMs) at test time through representation editing. The core idea is to treat a pre-trained autoregressive LLM as a discrete-time stochastic dynamical system and to inject control signals into its representation space to steer generation toward a specified alignment objective. A value function is trained directly on the model's hidden states according to the Bellman equation, which enables gradient-based optimization of the control signals at test time. Because the value function is only a two- or three-layer neural network, the intervention is fast and lightweight. The control signals are also regularized to remain as small as possible, preserving the generation quality of the original LLM.

Experimentally, RE-CONTROL outperforms existing test-time alignment techniques while requiring far fewer resources than fine-tuning methods. It achieves the highest alignment score in terms of win rate as judged by GPT-4, maintains generation quality as measured by diversity and coherence, and outperforms static representation-editing baselines. Compared with other test-time alignment methods, it shows superior performance in both average reward and GPT-4 evaluation. It also generalizes well to out-of-distribution data. Together, these results indicate that RE-CONTROL is a competitive alternative to fine-tuning methods.
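To make the mechanism concrete, below is a minimal PyTorch sketch of the test-time step: a small value network scores a hidden state, and regularized gradient ascent finds a control signal that is added to that state before decoding continues. All names, layer sizes, and hyperparameters are illustrative assumptions rather than the authors' implementation, and the value network is assumed to have already been trained on hidden states with a Bellman-style regression target.

```python
# Illustrative sketch of test-time representation control (not the paper's exact code).
import torch
import torch.nn as nn


class ValueNetwork(nn.Module):
    """Small MLP mapping a hidden state h_t to a scalar value estimate."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.net(h).squeeze(-1)


def control_hidden_state(
    h: torch.Tensor,           # hidden state at the current step, shape (batch, hidden_dim)
    value_net: ValueNetwork,   # assumed pre-trained via a Bellman regression target
    n_steps: int = 10,         # illustrative number of gradient steps
    step_size: float = 0.1,    # illustrative learning rate for the control signal
    reg_weight: float = 1.0,   # weight of the ||u||^2 penalty keeping the edit small
) -> torch.Tensor:
    """Find a small control signal u that increases the value of h + u."""
    u = torch.zeros_like(h, requires_grad=True)
    optimizer = torch.optim.Adam([u], lr=step_size)
    for _ in range(n_steps):
        optimizer.zero_grad()
        # Maximize the value of the edited state while regularizing the edit.
        objective = value_net(h + u).sum() - reg_weight * (u ** 2).sum()
        (-objective).backward()
        optimizer.step()
    return (h + u).detach()


if __name__ == "__main__":
    hidden_dim = 4096                       # e.g., the hidden size of a 7B-scale LLM
    value_net = ValueNetwork(hidden_dim)
    h_t = torch.randn(2, hidden_dim)        # stand-in for real LLM hidden states
    h_edited = control_hidden_state(h_t, value_net)
    print(h_edited.shape)                   # the edited state would feed the LM head
```

In use, the edited state h + u would replace the original hidden state when computing the next-token logits, and the procedure would repeat at each decoding step.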