Extending Activation Steering to Broad Skills and Multiple Behaviours

March 12, 2024 | Teun van der Weij, Massimo Poesio, Nandi Schoots
This paper investigates the effectiveness of activation steering for broad skills and multiple behaviours in large language models (LLMs). Activation steering modifies a model's activations during inference to influence its output behaviour. The study compares steering broad skills (e.g., coding in general) against narrow skills (e.g., Python-specific coding), and explores steering for multiple behaviours simultaneously.

Steering broad skills turns out to be competitive with steering narrow skills: steering away from a broad skill like coding does reduce performance on that skill, but the impact is smaller than expected. For multiple behaviours, combining steering vectors into a single vector is largely unsuccessful, producing smaller effect sizes and, in some cases, mode collapse. Injecting the individual steering vectors at different layers with a single global coefficient is more effective: it avoids interaction effects between vectors and incurs a smaller alignment tax (the trade-off between model performance and behaviour alignment), though steering at several layers at once can itself introduce interaction effects that reduce steering quality.

The study also shows that activation steering can reduce potentially harmful behaviours of LLMs, such as myopia or wealth-seeking, without significantly harming general performance. The paper concludes that simultaneous steering at different layers is more promising than combining steering vectors, since it avoids interaction effects and maintains model functionality.
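To make the core mechanism concrete, the sketch below illustrates the standard contrastive recipe for activation steering: build a steering vector as the mean difference between activations recorded on prompts that do and do not exhibit a behaviour, then add a scaled copy of that vector to the residual-stream activation at inference. This is a minimal numpy toy with random data standing in for real model activations; the hidden size, prompt counts, and coefficient are all illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8  # hidden size of a toy model (assumption)

# Toy "activations" recorded at one layer for contrastive prompt sets.
# In practice these would be captured with forward hooks on a real LLM.
acts_with_behaviour = rng.normal(loc=1.0, size=(16, d_model))
acts_without_behaviour = rng.normal(loc=0.0, size=(16, d_model))

# Mean-difference steering vector: the direction separating the two sets.
steering_vector = acts_with_behaviour.mean(axis=0) - acts_without_behaviour.mean(axis=0)

def steer(activation, vector, coefficient):
    """Add a scaled steering vector to a residual-stream activation."""
    return activation + coefficient * vector

h = rng.normal(size=d_model)                       # activation at inference time
h_steered = steer(h, steering_vector, coefficient=4.0)
```

A positive coefficient pushes the activation toward the behaviour direction; a negative one steers away from it (e.g., to suppress a skill like coding).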
However, further research is needed to optimize activation steering methods and understand their broader implications for model safety and performance.
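The multi-steering contrast the summary draws can also be sketched in toy form: summing several vectors and injecting the combination at one layer, versus injecting each vector at its own layer under a single global coefficient. This is a hypothetical numpy illustration of the two schemes, not the paper's implementation; the vectors, layer assignments, and behaviour labels are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, n_layers = 8, 4

# One steering vector per behaviour, each derived for a specific layer (toy data).
vec_a = rng.normal(size=d_model)   # e.g. "reduce myopia" (hypothetical)
vec_b = rng.normal(size=d_model)   # e.g. "reduce wealth-seeking" (hypothetical)

def combined_steering(h, coefficient):
    """Sum the vectors and inject the combination at a single layer."""
    return h + coefficient * (vec_a + vec_b)

def multi_layer_steering(hs, coefficient, layer_to_vec):
    """Inject each vector at its own layer, scaled by one global coefficient."""
    return [h + coefficient * layer_to_vec.get(i, np.zeros_like(h))
            for i, h in enumerate(hs)]

# Residual-stream activations at each layer (toy stand-ins).
hs = [rng.normal(size=d_model) for _ in range(n_layers)]
steered = multi_layer_steering(hs, coefficient=3.0, layer_to_vec={1: vec_a, 2: vec_b})
```

In the combined scheme the two vectors interact inside a single layer's activation, whereas the multi-layer scheme keeps each intervention separate, which matches the paper's finding that per-layer injection avoids interaction effects.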