21 Jun 2024 | Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman
The paper "Steering Without Side Effects: Improving Post-Deployment Control of Language Models" addresses the issue of unexpected behaviors in language models (LMs) after deployment, such as allowing model misuse through jailbreaks. To mitigate these issues, the authors propose a method called KL-then-steer (KTS), which aims to reduce the side effects of steering vectors while retaining their benefits. KTS first trains a model to minimize the Kullback-Leibler (KL) divergence between a steered and an unsteered model on benign inputs, then applies steering vectors to the trained model. This approach prevents 44% of jailbreak attacks while maintaining helpfulness on benign requests. The authors also demonstrate that KTS can be applied to other tasks, such as reducing bias towards user-suggested answers on TruthfulQA. The paper includes a detailed methodology, experimental results, and discussions on the effectiveness and limitations of KTS.
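To make the training objective concrete, below is a minimal PyTorch sketch of the KTS idea: a steered copy of the model is trained so that, on benign inputs, its output distribution stays close to a frozen unsteered reference. This is an illustrative reconstruction, not the authors' code: the model name, `benign_loader`, the saved `steering_vector.pt`, the layer index, the steering strength `alpha`, and the particular KL direction are all assumptions for the sake of a runnable example.

```python
# Sketch of the KL-then-steer (KTS) training loop, assuming a Llama-style
# HuggingFace model and a hypothetical `benign_loader` yielding tokenized
# benign prompts as dicts of input_ids / attention_mask.
import copy
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
reference = copy.deepcopy(model).eval()  # frozen, unsteered reference copy
for p in reference.parameters():
    p.requires_grad_(False)

steering_vector = torch.load("steering_vector.pt")  # hypothetical precomputed vector
layer_idx, alpha = 13, 4.0                          # assumed layer and strength

def steer_hook(module, inputs, output):
    # Add the steering vector to this layer's residual-stream activations.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steering_vector.to(hidden)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# The hook is attached only to the trainable model, so `reference` stays unsteered.
handle = model.model.layers[layer_idx].register_forward_hook(steer_hook)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

for batch in benign_loader:  # hypothetical DataLoader over benign inputs
    with torch.no_grad():
        ref_logits = reference(**batch).logits   # unsteered target distribution
    steered_logits = model(**batch).logits       # steered, trainable model
    # Minimize KL between the unsteered and steered next-token distributions
    # (direction shown is one plausible choice).
    loss = F.kl_div(
        F.log_softmax(steered_logits, dim=-1),
        F.log_softmax(ref_logits, dim=-1),
        log_target=True,
        reduction="batchmean",
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

handle.remove()
```

After training, the steering vector is applied to the KTS-trained model at inference time (e.g., by re-attaching the same hook), the intent being that steering still shifts behavior on harmful inputs while the KL penalty has taught the model to be insensitive to the vector on benign ones.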