Representation Surgery: Theory and Practice of Affine Steering

2024 | Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
This paper studies affine steering functions: transformations applied to the internal representations of neural language models to alter their behavior and reduce undesirable outputs. The authors derive two optimal affine steering functions under different constraints: one that matches means, and one that matches both means and covariances. These functions are designed to lower the probability of generating toxic or biased text while preserving the model's accuracy. The paper provides theoretical justification for both functions and evaluates them empirically in two applications: reducing gender and dialect bias in multiclass classification, and mitigating toxicity in text generation. The results show that simple linear interventions can effectively steer language models, with the mean-and-covariance-matching function performing particularly well in both tasks. The paper concludes by discussing limitations and future directions, including nonlinear generalizations of affine steering functions.
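The summary above names the two steering functions but gives no formulas. As a rough illustration only, the sketch below fits an affine map h -> W h + b from two sets of representations: a "source" set (for example, representations associated with the undesired behavior) and a "target" set. The mean-matching version shifts the source mean onto the target mean. The mean-and-covariance version is an assumption on my part: it uses the standard closed-form map that transports one Gaussian onto another, which matches both the mean and the covariance. All function names, the `eps` regularizer, and the use of NumPy are hypothetical choices for this sketch, not taken from the paper.

```python
import numpy as np


def _sqrtm_psd(mat):
    """Square root of a symmetric positive semi-definite matrix via eigh."""
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return (vecs * np.sqrt(vals)) @ vecs.T


def fit_mean_matching(src, tgt):
    """Fit h -> W h + b with W = I, so only the mean is shifted.

    src, tgt: arrays of shape (n_samples, dim).
    """
    dim = src.shape[1]
    return np.eye(dim), tgt.mean(axis=0) - src.mean(axis=0)


def fit_mean_cov_matching(src, tgt, eps=1e-6):
    """Fit h -> W h + b so steered source points match target mean and covariance.

    Assumption: W is the closed-form Gaussian transport map
        W = S^{-1/2} (S^{1/2} T S^{1/2})^{1/2} S^{-1/2},
    where S and T are the source and target covariances.
    eps adds a small ridge to both covariances for numerical stability.
    """
    dim = src.shape[1]
    mu_s, mu_t = src.mean(axis=0), tgt.mean(axis=0)
    cov_s = np.cov(src, rowvar=False) + eps * np.eye(dim)
    cov_t = np.cov(tgt, rowvar=False) + eps * np.eye(dim)
    s_half = _sqrtm_psd(cov_s)
    s_inv_half = np.linalg.inv(s_half)
    W = s_inv_half @ _sqrtm_psd(s_half @ cov_t @ s_half) @ s_inv_half
    b = mu_t - W @ mu_s
    return W, b


def steer(h, W, b):
    """Apply the affine map to a batch of row-vector representations."""
    return h @ W.T + b
```

In use, one would fit `W, b` once on held-out representations from a chosen layer and then apply `steer` to new representations at inference time. Which layer to intervene on and how the source and target sets are collected are left open here, since the summary does not specify them.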