[slides] Analyzing the Generalization and Reliability of Steering Vectors

Steering Vectors (SVs) are a novel approach to adjusting language model behavior at inference time by manipulating intermediate model activations. While SVs have shown promise in improving model capabilities and alignment, their reliability and generalization properties are unclear. This work rigorously investigates these aspects and finds that SVs have significant limitations both in-distribution and out-of-distribution. In-distribution, steerability varies widely across inputs, with spurious biases contributing to the effectiveness of steering. Out-of-distribution, while SVs often generalize well, they are brittle to reasonable changes in the prompt, leading to poor generalization for several concepts. Overall, the findings indicate that while SVs can be effective under certain conditions, there are substantial technical challenges in applying them to guide models' behavior at scale. The study also introduces a new type of bias, "steerability bias," which explains the high variance in steerability and the effectiveness of SVs. The results highlight the need for further research to improve the reliability and generalization of SVs.
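
To make "manipulating intermediate model activations" concrete, here is a minimal sketch of one common recipe. It is an assumption for illustration, not necessarily the procedure used in this work: the steering vector is taken as the difference between mean activations on positive and negative examples, then added to an activation with a scaling multiplier. The function names, the toy 3-dimensional activations, and the multiplier are all hypothetical.

```python
from statistics import fmean


def mean_difference_vector(pos_acts, neg_acts):
    """Difference of per-dimension mean activations (assumed extraction recipe)."""
    dims = range(len(pos_acts[0]))
    return [
        fmean(row[i] for row in pos_acts) - fmean(row[i] for row in neg_acts)
        for i in dims
    ]


def steer(activation, vector, multiplier=1.0):
    """Add the scaled steering vector to a single activation vector."""
    return [a + multiplier * v for a, v in zip(activation, vector)]


# Toy activations (hypothetical): the concept only separates along dimension 0.
pos_acts = [[1.0, 0.5, 0.0], [3.0, 0.5, 0.0]]
neg_acts = [[-1.0, 0.5, 0.0], [-3.0, 0.5, 0.0]]

sv = mean_difference_vector(pos_acts, neg_acts)
steered = steer([0.0, 1.0, 1.0], sv, multiplier=0.5)
```

In a real model the same addition would be applied to a chosen layer's hidden states during the forward pass. The sketch says nothing about how steerability varies across inputs, which is the question the work investigates.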

2024 | Daniel Tan*¹, David Chanin*¹, Aengus Lynch¹, Dimitrios Kanoulas¹, Brooks Paige¹, Adria Garriga-Alonso², Robert Kirk¹