Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

March 1, 2024 | Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang
This paper introduces SEMANTICSMOOTH, a smoothing-based defense against jailbreak attacks on large language models (LLMs). The core idea is to apply semantic-preserving transformations to the input prompt and aggregate the LLM's responses to the transformed copies, improving robustness to jailbreak attacks while maintaining strong nominal performance. A policy network adaptively selects transformations based on the input, enabling a better trade-off between robustness and performance.

SEMANTICSMOOTH is evaluated against three state-of-the-art jailbreak attacks: GCG, PAIR, and AutoDAN. It achieves state-of-the-art robustness against all three while preserving strong performance on two instruction-following benchmarks, InstructionFollow and AlpacaEval, outperforming existing baseline defenses on both axes. The paper also examines the trade-off between robustness and nominal performance across defense methods: several defenses achieve strong robustness only at the cost of significant degradation in nominal performance, whereas SEMANTICSMOOTH strikes a favorable balance between the two.

Beyond the defense itself, SEMANTICSMOOTH yields an analysis of the GCG attack: the seemingly nonsensical adversarial suffixes become interpretable after semantic transformation, revealing the attack's underlying intent. The paper concludes that SEMANTICSMOOTH is a promising and interpretable defense against jailbreak attacks on LLMs.
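To make the transform-then-aggregate idea concrete, the sketch below shows one way such a smoothing loop could look in Python. It is a minimal illustration, not the authors' implementation: `paraphrase`, `summarize`, `policy`, `is_refusal`, and `semantic_smooth` are hypothetical stand-ins for the paper's LLM-based transformations, trained policy network, and response aggregation.

```python
import random

def paraphrase(prompt: str) -> str:
    # Stand-in: in practice an LLM would be asked to paraphrase the prompt.
    return f"Paraphrase of: {prompt}"

def summarize(prompt: str) -> str:
    # Stand-in: in practice an LLM would be asked to summarize the prompt.
    return f"Summary of: {prompt}"

TRANSFORMS = [paraphrase, summarize]

def policy(prompt: str) -> list[float]:
    # Stand-in for the trained policy network: returns a distribution
    # over TRANSFORMS that, in the paper, adapts to the input prompt.
    return [0.5, 0.5]

def is_refusal(response: str) -> bool:
    # Stand-in safety judgment on a single response.
    return response.strip().lower().startswith("i cannot")

def semantic_smooth(prompt: str, llm, num_copies: int = 5) -> str:
    """Answer `prompt` by aggregating responses over transformed copies."""
    weights = policy(prompt)
    responses = []
    for _ in range(num_copies):
        transform = random.choices(TRANSFORMS, weights=weights, k=1)[0]
        responses.append(llm(transform(prompt)))
    # Aggregate by majority vote: refuse only if most copies refuse.
    refusals = sum(is_refusal(r) for r in responses)
    if refusals > num_copies // 2:
        return "I cannot help with that request."
    return next(r for r in responses if not is_refusal(r))

# Example usage with a toy model in place of a real LLM:
def toy_llm(prompt: str) -> str:
    return f"Sure, here is a response to: {prompt}"

print(semantic_smooth("Explain photosynthesis briefly.", llm=toy_llm))
```

The intuition this captures is that an adversarial suffix tuned against the exact token sequence of a prompt tends not to survive paraphrasing or summarization: most transformed copies then elicit a refusal and the vote rejects the attack, while a benign prompt keeps its meaning under transformation and is answered normally.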