17 Apr 2025 | Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion
The paper "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" by Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion from EPFL demonstrates that even the most advanced safety-aligned large language models (LLMs) are vulnerable to simple and adaptive jailbreaking attacks. The authors show that by leveraging access to log probabilities, they can design adversarial prompts and suffixes to manipulate these models into generating harmful content. They achieve 100% attack success rates on various models, including Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat, Llama-3-Instruct, Gemma-7B, GPT-3.5, GPT-4o, R2D2, and Claude models. The attacks are tailored to each model's vulnerabilities, such as in-context learning prompts for R2D2 and prefilling for Claude models. The paper also highlights the importance of adaptive attacks in accurately evaluating the robustness of LLMs and provides insights into the design of stronger defenses against such attacks. The findings are significant for both researchers and practitioners in the field of LLMs and their safety.The paper "Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks" by Maksym Andriushchenko, Francesco Croce, and Nicolas Flammarion from EPFL demonstrates that even the most advanced safety-aligned large language models (LLMs) are vulnerable to simple and adaptive jailbreaking attacks. The authors show that by leveraging access to log probabilities, they can design adversarial prompts and suffixes to manipulate these models into generating harmful content. They achieve 100% attack success rates on various models, including Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat, Llama-3-Instruct, Gemma-7B, GPT-3.5, GPT-4o, R2D2, and Claude models. The attacks are tailored to each model's vulnerabilities, such as in-context learning prompts for R2D2 and prefilling for Claude models. The paper also highlights the importance of adaptive attacks in accurately evaluating the robustness of LLMs and provides insights into the design of stronger defenses against such attacks. The findings are significant for both researchers and practitioners in the field of LLMs and their safety.