21 Jun 2024 | Asa Cooper Stickland, Alexander Lyzhov, Jacob Pfau, Salsabila Mahdi, Samuel R. Bowman
**Summary:**
This paper introduces KL-then-steer (KTS), a technique that reduces the side effects of steering vectors while preserving their benefits for controlling language models (LMs). The goal is to improve post-deployment control of LM behavior, particularly mitigating harmful behaviors such as jailbreak attacks. Steering vectors modify model behavior by adding a fixed vector to the model's hidden states, but this intervention can degrade performance on unrelated tasks. KTS addresses this in two stages: first, it trains the model to minimize the Kullback-Leibler (KL) divergence between the output distributions of the steered and unsteered model on benign inputs; then, it applies the steering vector at inference time. The result is steering that retains its safety benefit while doing far less damage to performance on benign tasks.
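To make the training objective concrete, here is a toy numpy sketch of the quantity KTS minimizes: the KL divergence between the next-token distributions of an unsteered and a steered model on benign inputs. The toy unembedding matrix, random hidden states, and steering vector are all hypothetical stand-ins, not the paper's actual model or vectors.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def kl_divergence(p, q):
    # KL(p || q), summed over the vocabulary dimension
    return np.sum(p * (np.log(p) - np.log(q)), axis=-1)

rng = np.random.default_rng(0)
d_hidden, d_vocab = 16, 32
W_out = rng.normal(size=(d_hidden, d_vocab))   # toy unembedding matrix
h = rng.normal(size=(4, d_hidden))             # hidden states for 4 benign inputs
steer = 0.5 * rng.normal(size=d_hidden)        # hypothetical steering vector

p_unsteered = softmax(h @ W_out)               # original model's distribution
p_steered = softmax((h + steer) @ W_out)       # distribution after adding the vector

# KTS training drives this loss toward zero on benign inputs,
# so steering leaves benign behavior (nearly) unchanged.
kts_loss = kl_divergence(p_unsteered, p_steered).mean()
```

In the paper this loss is minimized over the model's weights on a benign dataset; at deployment time the (now harmless-on-benign-inputs) steering vector is added during inference.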
The best KTS model reduces the success rate of jailbreak attacks by 44% relative to the original Llama-2-chat-7B model while maintaining helpfulness on benign requests. The approach also generalizes to other control tasks, such as reducing the model's bias toward user-suggested answers on TruthfulQA. The method is lightweight and can be combined with other techniques, such as LoRA fine-tuning and system prompts, to further improve adversarial robustness.
The paper evaluates KTS against baselines including harmlessness training with LoRA and safety system prompts, and shows that KTS achieves a better trade-off between adversarial robustness and general capabilities than either. The technique reduces harmful behaviors without significantly degrading model performance on benign tasks.
The study also explores the use of prompt classifiers to identify and steer only problematic inputs, which helps in reducing the overall impact of steering vectors. The results demonstrate that KTS is a promising approach for improving the safety and controllability of language models in real-world applications.
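The classifier-gated variant can be sketched in a few lines: steering is applied only when a prompt classifier flags the input, so benign requests pass through the unmodified model. The keyword-matching classifier below is an illustration-only stand-in for the learned classifier the paper uses; the function names and strength parameter are hypothetical.

```python
def looks_problematic(prompt: str) -> bool:
    # Stand-in for a learned prompt classifier; simple keyword
    # matching here is for illustration only.
    flagged = ("how to hack", "build a weapon", "bypass the filter")
    return any(k in prompt.lower() for k in flagged)

def steering_coefficient(prompt: str, strength: float = 1.0) -> float:
    # Scale applied to the steering vector for this prompt:
    # full strength on flagged inputs, zero (no side effects) otherwise.
    return strength if looks_problematic(prompt) else 0.0
```

For example, `steering_coefficient("How to hack a server?")` returns `1.0`, while a benign request like a cooking question returns `0.0`, leaving the model's behavior untouched.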