Understanding Extending Activation Steering to Broad Skills and Multiple Behaviours

This paper investigates how well activation steering works for broad skills and for multiple behaviors in large language models. The authors explore whether these techniques can reduce the dangerous capabilities of models, which are likely to become more problematic in the future. They compare the effects of steering general coding ability with steering Python-specific ability, and find that steering the broader skill is competitive with steering the narrower one. They also examine steering models to become more or less myopic and more or less wealth-seeking. The experiments show that combining steering vectors for multiple behaviors into a single steering vector is largely unsuccessful, while injecting individual steering vectors at different places in the model simultaneously is promising.

The methodology involves two groups of experiments: broad steering and multi-steering. For broad steering, the authors investigate the effect of steering one broad skill (coding ability) and the alignment tax associated with it. For multi-steering, they compare combining individual steering vectors into one with injecting them simultaneously at different layers. The results indicate that combined steering leads to smaller effect sizes, whereas simultaneous steering at multiple layers appears more effective and less problematic in terms of interaction effects and alignment tax.

The paper concludes by discussing the findings and suggesting future work, including investigating narrower skills and testing whether the results hold for other models. The authors also highlight the need for further research to understand the 'real' performance of models and the potential of sparse language models in activation steering.
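To make the difference between the two multi-steering setups concrete, here is a minimal sketch on a toy residual network rather than a real language model. Everything in it is an illustrative assumption and not the authors' implementation: the `forward` helper, the random stand-in steering vectors, the choice of layers, and the use of averaging to merge two vectors are all hypothetical.

```python
import numpy as np


def forward(hidden, weights, injections=None, scale=1.0):
    """Run a toy residual network, optionally adding steering vectors.

    `injections` maps a layer index to the vector that is added to the
    hidden state right after that layer (hypothetical interface).
    """
    injections = injections or {}
    for i, w in enumerate(weights):
        hidden = hidden + np.tanh(hidden @ w)  # toy residual block
        if i in injections:
            hidden = hidden + scale * injections[i]  # activation addition
    return hidden


rng = np.random.default_rng(0)
d_model, n_layers = 8, 4
weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(n_layers)]
x = rng.normal(size=d_model)

# Random stand-ins for two behaviour vectors (assumption: in practice these
# would be derived from model activations, not sampled at random).
v_myopia = rng.normal(size=d_model)
v_wealth = rng.normal(size=d_model)

baseline = forward(x, weights)

# Combined steering: merge both vectors into one and inject at a single layer.
# Averaging is only one possible way to merge them (assumption).
combined = forward(x, weights, {2: (v_myopia + v_wealth) / 2})

# Simultaneous steering: keep the vectors separate, one layer each.
simultaneous = forward(x, weights, {1: v_myopia, 3: v_wealth})
```

In this sketch the two setups differ only in the `injections` mapping passed to `forward`, which makes it easy to vary the layers and the scale when comparing them.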

EXTENDING ACTIVATION STEERING TO BROAD SKILLS AND MULTIPLE BEHAVIOURS

March 12, 2024 | Teun van der Weij, Massimo Poesio, Nandi Schoots