Multi-property Steering of Large Language Models with Dynamic Activation Composition


25 Jun 2024 | Daniel Scalena, Gabriele Sarti, Malvina Nissim
This paper examines the effectiveness of activation steering methods for conditioning large language model (LLM) generation, with a particular focus on multi-property steering. Traditional adaptation techniques such as Reinforcement Learning from Human Feedback (RLHF) can alter LLM behavior but may degrade downstream generation quality. Inference-time interventions such as activation steering offer a more controlled way to modify LLM behavior without the high cost and unpredictability of training.

The authors conduct a comprehensive evaluation of activation steering strategies, showing that optimal steering parameters are property-dependent. They propose Dynamic Activation Composition (Dyn), an information-theoretic approach that modulates the steering intensity of one or more properties throughout generation. At each generation step, Dyn composes property-specific steering vectors and adjusts their intensity dynamically, maintaining strong conditioning on all selected properties while minimizing the impact on generation fluency.

Compared with baseline strategies, including initial, constant, and diminishing steering, Dyn achieves better results on both conditioning accuracy and generation fluency. The evaluation covers language, safety, and formality conditioning using the Alpaca, BeaverTails, GYAFC, and XFORMAL datasets. The results show that Dyn effectively steers LLMs toward desired properties while maintaining high fluency, demonstrating its potential for controlling LLM behavior in real-world applications.
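The core mechanics described above, adding property-specific steering vectors to a hidden state and scaling their intensity per generation step based on how the steered output distribution diverges from the unsteered one, can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the function names, the KL-based scaling rule, and the `alpha_max` constant are assumptions, and a real setup would hook into a transformer's residual stream rather than operate on bare arrays.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()

def dynamic_alpha(logits_plain, logits_steered, alpha_max=2.0, eps=1e-9):
    """Set the steering strength for the next step from the divergence
    between the steered and unsteered next-token distributions.

    The KL-based rule here is an illustrative proxy for the paper's
    information-theoretic criterion (assumption): when the two
    distributions already diverge strongly, the property is being
    expressed, so steering is relaxed to preserve fluency.
    """
    p = softmax(logits_steered)
    q = softmax(logits_plain)
    kl = float(np.sum(p * np.log((p + eps) / (q + eps))))
    return alpha_max * np.exp(-kl)  # high divergence -> weaker steering

def steer_hidden(hidden, steering_vectors, alphas):
    """Compose several property-specific steering vectors into one
    additive intervention on a hidden state (activation addition)."""
    out = hidden.astype(float).copy()
    for v, a in zip(steering_vectors, alphas):
        out += a * np.asarray(v, dtype=float)
    return out

# Usage: two hypothetical property vectors (e.g. language and safety)
# composed onto one hidden state with per-property strengths.
h = np.zeros(4)
steered = steer_hidden(h, [np.ones(4), -np.ones(4)], [1.0, 0.25])
```

In a full generation loop, `dynamic_alpha` would be re-evaluated at every decoding step for each property, so the composed intervention in `steer_hidden` naturally fades once a property is established in the output.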