Understanding Defending Large Language Models against Jailbreak Attacks via Semantic Smoothing

The paper "Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing" addresses the vulnerability of large language models (LLMs) to jailbreaking attacks, which bypass safeguards and generate objectionable content. The authors propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions from multiple semantically transformed copies of an input prompt. This approach aims to improve robustness against both token-level and prompt-level attacks while maintaining strong nominal performance on instruction-following benchmarks. Experimental results show that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks, with favorable trade-offs between robustness and nominal performance. The framework also provides insights into the mechanisms of GCG attacks by interpreting nonsensical adversarial suffixes. The paper contributes to the field by offering a robust defense against multiple types of jailbreak attacks and enhancing societal trust in AI systems.The paper "Defending Large Language Models Against Jailbreak Attacks via Semantic Smoothing" addresses the vulnerability of large language models (LLMs) to jailbreaking attacks, which bypass safeguards and generate objectionable content. The authors propose SEMANTICSMOOTH, a smoothing-based defense that aggregates predictions from multiple semantically transformed copies of an input prompt. This approach aims to improve robustness against both token-level and prompt-level attacks while maintaining strong nominal performance on instruction-following benchmarks. Experimental results show that SEMANTICSMOOTH achieves state-of-the-art robustness against GCG, PAIR, and AutoDAN attacks, with favorable trade-offs between robustness and nominal performance. The framework also provides insights into the mechanisms of GCG attacks by interpreting nonsensical adversarial suffixes. The paper contributes to the field by offering a robust defense against multiple types of jailbreak attacks and enhancing societal trust in AI systems.

DEFENDING LARGE LANGUAGE MODELS AGAINST JAILBREAK ATTACKS VIA SEMANTIC SMOOTHING

March 1, 2024 | Jiabao Ji, Bairu Hou, Alexander Robey, George J. Pappas, Hamed Hassani, Yang Zhang, Eric Wong, Shiyu Chang