RigorLLM: Resilient Guardrails for Large Language Models against Undesired Content

23 Jul 2024 | Zhuowen Yuan, Zidi Xiong, Yi Zeng, Ning Yu, Ruoxi Jia, Dawn Song, Bo Li
The paper introduces RigorLLM, a novel framework designed to moderate harmful inputs and outputs for Large Language Models (LLMs). RigorLLM addresses the challenges posed by biases and the potential for generating harmful content, particularly under malicious inputs. The framework employs a multi-faceted approach: generating energy-based training data through Langevin dynamics, optimizing a safe suffix for inputs via minimax optimization, and integrating a fusion-based model that combines a robust k-nearest-neighbor (KNN) classifier with LLMs. Experimental evaluations demonstrate that RigorLLM outperforms existing baselines such as the OpenAI moderation API and Perspective API in detecting harmful content, and it exhibits superior resilience to jailbreaking attacks. The framework's innovative use of constrained optimization and a fusion-based guardrail approach sets a new standard for content moderation frameworks in the face of evolving digital threats.
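
To make the fusion-based guardrail idea concrete, the sketch below shows one way such a fusion could work: a KNN over text embeddings estimates the probability that a query is harmful, and that estimate is averaged with an LLM-based moderation score. This is a minimal illustration under stated assumptions, not the paper's exact method; names such as `knn_harm_probability`, `llm_prob`, the weight `alpha`, and the decision `threshold` are hypothetical.

```python
# Minimal sketch of a fusion-based guardrail: combine a KNN probability over
# embeddings of labeled moderation data with an LLM-derived harm score.
# All function names, weights, and thresholds here are illustrative assumptions.

import numpy as np

def knn_harm_probability(query_emb, train_embs, train_labels, k=5):
    """Estimate P(harmful) as the fraction of harmful examples among the k nearest neighbors."""
    dists = np.linalg.norm(train_embs - query_emb, axis=1)
    nearest = np.argsort(dists)[:k]
    return float(np.mean(train_labels[nearest]))  # labels: 1 = harmful, 0 = benign

def fused_guardrail(query_emb, train_embs, train_labels, llm_prob, alpha=0.5, threshold=0.5):
    """Fuse the KNN and LLM probabilities with a weighted average and flag harmful content."""
    p_knn = knn_harm_probability(query_emb, train_embs, train_labels)
    p_fused = alpha * p_knn + (1 - alpha) * llm_prob
    return p_fused >= threshold, p_fused

# Example usage with placeholder embeddings and an assumed LLM moderation score.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train_embs = rng.normal(size=(100, 32))          # embeddings of labeled examples
    train_labels = rng.integers(0, 2, size=100)      # 1 = harmful, 0 = benign
    query_emb = rng.normal(size=32)                  # embedding of the incoming query
    flagged, score = fused_guardrail(query_emb, train_embs, train_labels, llm_prob=0.8)
    print(flagged, round(score, 3))
```

The weighted average is only one possible aggregation rule; the key point it illustrates is that the nearest-neighbor component grounds the decision in known harmful examples, making the combined guardrail harder to evade than an LLM-only classifier.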