R²-Guard is a robust, reasoning-enabled LLM guardrail that moderates the input and output content of large language models (LLMs) to ensure compliance with safety policies. Existing guardrail models, such as OpenAI Mod and LlamaGuard, treat the various safety categories independently, which makes them less effective, susceptible to jailbreak attacks, and inflexible. R²-Guard addresses these issues by incorporating knowledge-enhanced logical reasoning into a probabilistic graphical model (PGM). The system consists of two main components: a data-driven, category-specific learning component and a reasoning component that performs logical inference with Markov logic networks (MLNs) or probabilistic circuits (PCs). R²-Guard encodes safety knowledge as first-order logical rules and embeds them into PGMs, allowing probabilistic inference to determine the overall probability that an input is unsafe. Evaluated on six safety benchmarks against eight strong guardrail models, R²-Guard detects unsafe content more accurately and is more robust to jailbreak attacks. It also adapts to new safety categories simply by modifying the PGM reasoning graph.
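
To make the two-stage pipeline concrete, here is a minimal sketch of MLN-style inference over grounded rules of the form "category → unsafe". Everything in it is illustrative, not R²-Guard's actual implementation: the category names, rule weights, and the function `mln_unsafe_probability` are hypothetical, and it uses brute-force enumeration where the paper's reasoning component would use an MLN or a compiled probabilistic circuit.

```python
from itertools import product
from math import exp

# Hypothetical per-category unsafe probabilities, standing in for the outputs
# of the data-driven category-specific learning component.
category_probs = {"violence": 0.85, "self_harm": 0.10, "fraud": 0.05}

# Grounded implications "category -> unsafe", each with an illustrative weight
# expressing how strongly that category implies overall unsafety.
rule_weights = {"violence": 3.0, "self_harm": 3.0, "fraud": 2.0}

def mln_unsafe_probability(category_probs, rule_weights):
    """Exact MLN-style inference of P(unsafe=1) by enumerating all assignments.

    Each joint assignment of the category variables and the 'unsafe' variable
    gets weight exp(sum of weights of satisfied rules), multiplied by the
    soft-evidence likelihood of the category values under the classifier
    probabilities; marginalizing over assignments yields P(unsafe=1).
    """
    cats = list(category_probs)
    z = 0.0            # partition function (total unnormalized mass)
    unsafe_mass = 0.0  # unnormalized mass of assignments with unsafe = 1
    for values in product([0, 1], repeat=len(cats) + 1):
        *cat_vals, unsafe = values
        # Soft evidence: likelihood of this category assignment under the
        # category-specific probabilities.
        evidence = 1.0
        for c, v in zip(cats, cat_vals):
            p = category_probs[c]
            evidence *= p if v == 1 else (1.0 - p)
        # An implication "c -> unsafe" is violated only when c=1 and unsafe=0.
        score = sum(rule_weights[c] for c, v in zip(cats, cat_vals)
                    if not (v == 1 and unsafe == 0))
        weight = evidence * exp(score)
        z += weight
        if unsafe == 1:
            unsafe_mass += weight
    return unsafe_mass / z

print(f"P(unsafe) = {mln_unsafe_probability(category_probs, rule_weights):.3f}")
```

Note that this enumeration is exponential in the number of category variables; presumably this is one motivation for the probabilistic-circuit variant, since PCs support tractable inference once the logical rules are compiled into circuit form.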