Analyzing the Generalization and Reliability of Steering Vectors

2024 | Daniel Tan, David Chanin, Aengus Lynch, Dimitrios Kanoulas, Brooks Paige, Adria Garriga-Alonso, Robert Kirk
This paper investigates the reliability and generalization of steering vectors (SVs), a technique for adjusting language model behavior at inference time by modifying intermediate activations. SVs have shown promise for improving model capabilities and alignment, but how reliably they work and how well they generalize is not well understood. The authors find substantial limitations both in-distribution and out-of-distribution. In-distribution, steerability varies widely across inputs, and spurious biases can make steering ineffective. The authors introduce the concept of "steerability bias": models are easier to steer towards certain outputs, such as specific tokens or positions. SVs do not always steer models towards the desired behavior, and some behaviors are unsteerable. Out-of-distribution, SVs often generalize well but can be brittle to prompt changes, which leads to poor generalization for certain concepts. SV generalization is largely a property of the dataset, and similar model behavior on source and target prompts predicts better generalization. Overall, while SVs show promise, they are not a panacea for aligning model behavior at inference time, and further research is needed to make them reliable enough for practical applications.
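
To make the mechanism described above concrete, the following is a minimal sketch of steering on toy data, not the authors' implementation. It assumes the steering vector is extracted as a difference of mean activations between contrastive prompt sets (one common approach; the summary does not specify the method), and it uses random NumPy arrays in place of real model activations. The function names, the linear `readout` used as a stand-in for a behavior score, and the slope-based measure of steerability are all illustrative assumptions.

```python
import numpy as np


def extract_steering_vector(pos_acts, neg_acts):
    """Difference of mean activations between two prompt sets.

    Assumption: difference-of-means extraction; both inputs have
    shape (n_prompts, hidden_dim).
    """
    return pos_acts.mean(axis=0) - neg_acts.mean(axis=0)


def apply_steering(hidden, vector, multiplier):
    """Add the scaled steering vector to every hidden state (broadcast over rows)."""
    return hidden + multiplier * vector


def steerability(hidden, vector, readout, multipliers):
    """Slope of a least-squares line through (multiplier, mean score) points.

    The score is a hypothetical linear readout of the steered hidden states,
    standing in for a real measure of how strongly the model shows a behavior.
    """
    scores = [
        float(np.mean(apply_steering(hidden, vector, m) @ readout))
        for m in multipliers
    ]
    slope, _intercept = np.polyfit(multipliers, scores, 1)
    return slope


# Toy data: activations for "positive" and "negative" prompts differ along
# one hidden direction (an assumption made purely for illustration).
rng = np.random.default_rng(0)
hidden_dim = 8
direction = np.zeros(hidden_dim)
direction[0] = 1.0

pos_acts = rng.normal(size=(16, hidden_dim)) + 2.0 * direction
neg_acts = rng.normal(size=(16, hidden_dim)) - 2.0 * direction
vector = extract_steering_vector(pos_acts, neg_acts)

# Hidden states for five unseen inputs, steered at several multipliers.
hidden = rng.normal(size=(5, hidden_dim))
multipliers = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
slope = steerability(hidden, vector, direction, multipliers)
print("steering vector shape:", vector.shape)
print("steerability slope:", slope)
```

In this toy setup the score is exactly linear in the multiplier, so the slope equals the projection of the steering vector onto the readout direction. With a real model the relationship need not be linear or uniform across inputs, which is the kind of variability the summary reports.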