Improving Alignment and Robustness with Circuit Breakers


12 Jul 2024 | Andy Zou, Long Phan, Justin Wang, Derek Duenas, Maxwell Lin, Maksym Andriushchenko, Rowan Wang, Zico Kolter, Matt Fredrikson, Dan Hendrycks
This paper introduces a novel approach called "circuit breaking" to improve the alignment and robustness of AI systems against adversarial attacks. The method leverages representation engineering to directly control the internal representations responsible for harmful outputs, preventing the model from generating harmful content without sacrificing utility. Unlike traditional methods such as refusal training and adversarial training, which attempt to patch specific vulnerabilities, circuit breaking interrupts the generation process itself by redirecting harmful representations toward incoherent or refusal representations. The approach is attack-agnostic: it does not require training against particular attacks or costly adversarial fine-tuning. It is applied to both text-only and multimodal language models, significantly reducing the rate of harmful outputs even in the presence of powerful unseen attacks, and it is extended to AI agents, where it yields considerable reductions in harmful actions under attack. The technique is implemented via Representation Rerouting (RR), which maps harmful representations to an orthogonal space, effectively interrupting generation. It is evaluated on benchmarks including HarmBench, MT-Bench, and MMLU, showing strong generalization across a diverse range of attacks.
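To make the Representation Rerouting idea concrete, the sketch below shows one plausible form such a training objective could take: a "reroute" term that drives the fine-tuned model's hidden states on harmful prompts toward orthogonality with the original model's hidden states, plus a "retain" term that keeps hidden states on benign data close to the original so capability is preserved. This is a minimal illustration written from the summary above, not the authors' implementation; the function names, layer indices, and loss weighting are assumptions.

```python
# Hedged sketch of a Representation Rerouting (RR)-style objective.
# `cb_model` (being trained), `frozen_model` (original reference), and the
# batch/layer arguments are placeholder names, not the paper's API.
import torch
import torch.nn.functional as F

def rr_loss(cb_model, frozen_model, harmful_batch, benign_batch,
            layers=(10, 20), alpha=1.0):
    """Push hidden states on harmful prompts toward orthogonality with the
    original model's hidden states, while retaining benign representations."""
    reroute_loss = 0.0
    retain_loss = 0.0
    for layer in layers:
        # Hidden states of the model being trained (with circuit breakers).
        h_cb_harm = cb_model(**harmful_batch,
                             output_hidden_states=True).hidden_states[layer]
        h_cb_benign = cb_model(**benign_batch,
                               output_hidden_states=True).hidden_states[layer]
        with torch.no_grad():
            # Reference hidden states from the original, frozen model.
            h_orig_harm = frozen_model(**harmful_batch,
                                       output_hidden_states=True).hidden_states[layer]
            h_orig_benign = frozen_model(**benign_batch,
                                         output_hidden_states=True).hidden_states[layer]

        # Reroute term: penalize remaining alignment with the original harmful
        # direction; the penalty bottoms out once representations are orthogonal.
        cos = F.cosine_similarity(h_cb_harm, h_orig_harm, dim=-1)
        reroute_loss = reroute_loss + F.relu(cos).mean()

        # Retain term: keep benign representations close to the original model.
        retain_loss = retain_loss + (h_cb_benign - h_orig_benign).norm(dim=-1).mean()

    return reroute_loss + alpha * retain_loss
```

Under this framing, generation on harmful requests is interrupted because the rerouted representations no longer encode the information needed to continue the harmful completion, while benign behavior is left largely intact by the retain term.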
The results demonstrate that the circuit-breaking technique significantly outperforms standard refusal training and adversarial training, with minimal impact on model capabilities. The method is highly robust against adversarial attacks and provides a promising path forward in the adversarial arms race by ensuring safety and security without compromising capability. The approach is also effective in multimodal settings, improving robustness against image-based attacks aimed at circumventing model safeguards. The paper concludes that circuit breaking represents a significant step forward in developing reliable safeguards against harmful behavior and adversarial attacks.