Improving Alignment and Robustness with Circuit Breakers


12 Jul 2024 | Andy Zou†1,2,3, Long Phan3, Justin Wang1, Derek Duenas1, Maxwell Lin1, Maksym Andriushchenko1, Rowan Wang1, Zico Kolter†1,2, Matt Fredrikson†1,2, Dan Hendrycks1,3
The paper introduces "circuit breakers," a novel approach to improving the alignment and robustness of AI systems against harmful outputs and adversarial attacks. Inspired by representation engineering, circuit breaking directly controls the internal representations responsible for harmful outputs rather than relying on techniques such as refusal training or adversarial training. The method applies to both text-only and multimodal language models, preventing the generation of harmful outputs without sacrificing utility. It is attack-agnostic and can be integrated with existing monitoring and protection mechanisms. Experimental results show that circuit breaking significantly reduces the rate of harmful actions in AI agents and improves the robustness of large language models (LLMs) against a wide range of unseen attacks, including embedding- and representation-space attacks. The approach marks a significant step forward in managing the trade-off between capability and harmlessness in LLMs, making them more reliable and safer for real-world applications.
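
To make the representation-level idea more concrete, below is a minimal sketch of what such a training objective could look like, assuming a PyTorch setup. The function name circuit_breaker_loss, the specific rerouting and retain terms, and their equal weighting are illustrative assumptions, not the paper's exact formulation; they only demonstrate the general pattern of pushing a model's hidden states on harmful data away from their original directions while keeping hidden states on benign data close to a frozen reference model.

    # Illustrative sketch only (hypothetical names and hyperparameters); the paper's
    # actual loss, layer selection, and data mixture may differ.
    import torch
    import torch.nn.functional as F

    def circuit_breaker_loss(harmful_hidden, harmful_hidden_frozen,
                             benign_hidden, benign_hidden_frozen):
        """Representation-level objective in the spirit of circuit breaking.

        harmful_hidden / benign_hidden: hidden states from the model being trained.
        *_frozen: hidden states from a frozen copy of the original model.
        All tensors are assumed to have shape (batch, seq_len, hidden_dim).
        """
        # Rerouting term: penalize cosine similarity between the trained model's
        # representations on harmful data and the frozen model's; the ReLU clamp
        # means fully rerouted (non-positively aligned) representations incur no loss.
        reroute = F.relu(
            F.cosine_similarity(harmful_hidden, harmful_hidden_frozen, dim=-1)
        ).mean()
        # Retain term: keep representations on benign data close to the original
        # model's, preserving utility on harmless tasks.
        retain = torch.norm(benign_hidden - benign_hidden_frozen, dim=-1).mean()
        return reroute + retain

In this sketch, the rerouting term degrades the internal circuitry that produces harmful continuations, while the retain term anchors benign behavior, which is how the method avoids the capability loss that blanket refusal training can cause.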