2024 | Shashwat Singh, Shauli Ravfogel, Jonathan Herzig, Roee Aharoni, Ryan Cotterell, Ponnurangam Kumaraguru
This paper develops the theory and practice of affine steering: modifying a neural language model's representations with an affine function to reduce undesirable behaviors such as toxicity and bias. The authors derive two optimal affine steering functions under different constraints: mean matching, and mean-and-covariance matching. Both are justified through the concept of affine guardedness, the condition that no affine (linear) classifier can recover the targeted concept from the steered representations. The first function aligns the two concepts by translating source representations so that their mean matches the target mean. The second additionally matches the covariance of the two concept distributions, which further reduces bias that remains detectable in the local neighborhood structure of the representations.
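As a minimal sketch of the two functions (not the authors' released code), assume row-vector representations and empirical means and covariances estimated from concept-labeled data; the second map uses the standard Gaussian optimal-transport construction to match both moments, and all function names here are illustrative:

```python
import numpy as np
from scipy.linalg import sqrtm

def mean_matching_steer(H_src, mu_src, mu_tgt):
    """Translate source representations so their mean matches the target:
    h' = h + (mu_tgt - mu_src)."""
    return H_src + (mu_tgt - mu_src)

def gaussian_ot_map(Sigma_src, Sigma_tgt):
    """Linear map A satisfying A @ Sigma_src @ A.T == Sigma_tgt
    (the optimal-transport map between zero-mean Gaussians)."""
    S_half = np.real(sqrtm(Sigma_src))                      # Sigma_src^{1/2}
    S_half_inv = np.linalg.inv(S_half)
    middle = np.real(sqrtm(S_half @ Sigma_tgt @ S_half))
    return S_half_inv @ middle @ S_half_inv

def mean_cov_matching_steer(H_src, mu_src, Sigma_src, mu_tgt, Sigma_tgt):
    """Affine map h' = mu_tgt + A (h - mu_src), matching mean and covariance."""
    A = gaussian_ot_map(Sigma_src, Sigma_tgt)
    return (H_src - mu_src) @ A.T + mu_tgt

# Illustrative usage, given (n, d) arrays of source- and target-concept
# representations H_src and H_tgt:
# mu_src, mu_tgt = H_src.mean(0), H_tgt.mean(0)
# Sigma_src = np.cov(H_src, rowvar=False)
# Sigma_tgt = np.cov(H_tgt, rowvar=False)
# H_steered = mean_cov_matching_steer(H_src, mu_src, Sigma_src, mu_tgt, Sigma_tgt)
```

One can verify from the construction that A @ Sigma_src @ A.T recovers Sigma_tgt; A is symmetric positive definite, so A.T equals A and is written out only to keep the row-vector convention explicit.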
The paper also presents experiments showing that these affine steering functions are effective in practice. In classification tasks, steering reduces gender and dialect bias by aligning the representations of the different concept groups; in text generation, it reduces toxicity by steering representations toward the non-toxic concept. Mean-and-covariance matching outperforms mean matching alone and competing methods on these mitigation metrics, and both functions have minimal impact on main-task accuracy, making them a practical option for bias mitigation. The paper concludes that affine steering is a promising approach for controlling language model behavior and reducing bias in real-world applications.
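At generation time, steering of this kind is typically applied to hidden states on the fly. As an illustrative sketch only (the layer path and tuple-shaped output are assumptions about a Hugging Face GPT-2-style model, not the paper's code), a PyTorch forward hook could apply the mean-matching translation to the last transformer block:

```python
import torch

def make_steering_hook(mu_src, mu_tgt):
    """Build a forward hook that adds the mean-difference steering vector."""
    delta = mu_tgt - mu_src

    def hook(module, inputs, output):
        # Many transformer blocks return a tuple whose first element holds
        # the hidden states; adjust for the architecture at hand.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + delta.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered

    return hook

# Hypothetical usage (the module path is an assumption):
# handle = model.transformer.h[-1].register_forward_hook(
#     make_steering_hook(mu_src, mu_tgt))
# ... run model.generate(...) ...
# handle.remove()
```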