This paper introduces Robust Prompt Optimization (RPO), a method for hardening large language models (LLMs) against jailbreaking attacks. RPO constructs system-level defenses by incorporating the threat model directly into the defensive objective, enabling it to adapt to worst-case adaptive attacks. Concretely, the approach optimizes a lightweight, transferable suffix that steers the model toward safe outputs even under jailbreaking and adversarial attacks. Theoretical and experimental results show that RPO significantly improves robustness against both known and unseen jailbreaks, reducing the attack success rate (ASR) on JailbreakBench to 6% for GPT-4 and 0% for Llama-2, a new state of the art in jailbreaking defense. RPO suffixes add minimal inference cost and transfer well across models and attacks. The method is evaluated on two recent red-teaming benchmarks, JailbreakBench and HarmBench, demonstrating effectiveness across a wide range of harmful behaviors and attack types; it outperforms existing defenses and generalizes strongly to new attacks and risk categories. The paper also discusses RPO's limitations, including its focus on text-based attacks and the need for further research on multimodal models and other failure modes. Overall, RPO provides a practical and effective defense mechanism for LLMs against jailbreaking attacks.
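To make the defensive objective described above concrete, it can be sketched as a minimax problem. The following is an illustrative formulation with assumed notation, not an equation quoted from the paper: let x be a harmful prompt, a ∈ A a jailbreak modification drawn from the threat model A, s the defensive suffix being optimized, y_safe a target safe response (e.g., a refusal), L the model's loss on that target, and ⊕ string concatenation:

\min_{s} \; \max_{a \in \mathcal{A}} \; \mathcal{L}\bigl(y_{\text{safe}} \mid x \oplus a \oplus s\bigr)

Because the inner maximization ranges over the full threat model, the optimized suffix must elicit the safe response even against the strongest attack in A, which is how a fixed, lightweight suffix can remain effective under adaptive, worst-case attacks.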