2024 | Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li
RigorLLM is a framework designed to improve the resilience of Large Language Model (LLM) guardrails against harmful content and jailbreaking attacks. It combines three components: energy-based data generation in the embedding space using Langevin dynamics, safe suffix optimization formulated as a minimax problem to defend against adversarial (jailbreaking) prompts, and a fusion-based guardrail that aggregates a K-nearest-neighbor (KNN) classifier with LLM-based moderation. Together, these components moderate both harmful inputs and outputs: constrained data generation augments scarce harmful examples, the optimized safe suffix counteracts adversarial suffixes, and the fused guardrail produces the final detection decision. Experiments on the OpenAI Moderation Dataset, ToxicChat, and AdvBench show that RigorLLM outperforms baselines such as the OpenAI Moderation API and Perspective API in harmful content detection and is markedly more resilient to jailbreaking attacks. Its use of constrained optimization and fusion-based guardrails, and its sustained performance under adversarial conditions, position it as a strong reference point for future content moderation frameworks and a foundation for further research in AI safety.
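To make the energy-based data generation step concrete, below is a minimal sketch of unadjusted Langevin dynamics in an embedding space: each step moves a candidate embedding toward low energy and injects Gaussian noise. The function `energy_fn` is a hypothetical placeholder for a constraint-aware energy (e.g. similarity to known harmful examples plus fluency terms); this is an illustration of the sampling technique, not RigorLLM's exact objective or released code.

```python
import torch

def langevin_sample(x0, energy_fn, steps=200, step_size=0.01, noise_scale=0.005):
    """Unadjusted Langevin dynamics on an embedding vector.

    x0:          starting embedding (1-D or batched tensor)
    energy_fn:   callable mapping an embedding to a scalar energy (assumed)
    Returns an embedding drawn approximately from exp(-energy).
    """
    x = x0.clone().detach().requires_grad_(True)
    for _ in range(steps):
        energy = energy_fn(x).sum()                  # scalar energy of current sample
        grad, = torch.autograd.grad(energy, x)       # gradient of energy w.r.t. embedding
        with torch.no_grad():
            x.add_(-step_size * grad)                # descend the energy landscape
            x.add_(noise_scale * torch.randn_like(x))  # Langevin noise term
    return x.detach()
```

In the paper's setting, samples produced this way would be used as additional training points for the guardrail; here they are simply returned for whatever downstream use the caller has in mind.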
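The fusion-based guardrail can likewise be illustrated with a small sketch that blends a KNN classifier over text embeddings with an LLM-based moderation score. `embed_fn` and `llm_harm_prob` are assumed callables (not part of RigorLLM's code): `embed_fn` maps text to a vector, `llm_harm_prob` returns the LLM's estimated probability that the text is harmful. The weighted average is one simple fusion rule, shown only to convey the idea of combining the two signals.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

class FusionGuardrail:
    """Illustrative KNN + LLM fusion for harmful-content detection."""

    def __init__(self, embed_fn, llm_harm_prob, k=5, alpha=0.5):
        self.embed_fn = embed_fn            # text -> embedding vector (assumed)
        self.llm_harm_prob = llm_harm_prob  # text -> P(harmful) from an LLM (assumed)
        self.alpha = alpha                  # weight on the KNN component
        self.knn = KNeighborsClassifier(n_neighbors=k)

    def fit(self, texts, labels):
        # Labels assumed binary: 0 = benign, 1 = harmful. The training set can
        # include augmented examples, e.g. from Langevin-based data generation.
        X = np.stack([self.embed_fn(t) for t in texts])
        self.knn.fit(X, labels)

    def harmful_prob(self, text):
        x = np.asarray(self.embed_fn(text)).reshape(1, -1)
        p_knn = self.knn.predict_proba(x)[0, 1]   # KNN probability of class 1 (harmful)
        p_llm = self.llm_harm_prob(text)
        return self.alpha * p_knn + (1 - self.alpha) * p_llm
```

A caller would flag an input as harmful when `harmful_prob(text)` exceeds a chosen threshold; the relative weight `alpha` controls how much the nearest-neighbor evidence is trusted versus the LLM's own judgment.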