29 May 2024 | Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong
GradSafe is a method for detecting jailbreak prompts in large language models (LLMs) by analyzing the gradients of safety-critical parameters. Unlike existing approaches that rely on online moderation APIs or fine-tuned LLMs, GradSafe requires no additional training: it builds on the observation that jailbreak prompts, when paired with a compliance response, produce similar gradient patterns on certain safety-critical parameters, whereas safe prompts do not. The method has two variants. GradSafe-Zero classifies a prompt by thresholding the average cosine similarity between its gradients and the unsafe reference gradients across all safety-critical parameters, while GradSafe-Adapt trains a logistic regression model on these similarities for domain adaptation. Experiments on the ToxicChat and XSTest datasets show that GradSafe-Zero outperforms Llama Guard and online moderation APIs in detecting unsafe prompts, and GradSafe-Adapt demonstrates enhanced adaptability on ToxicChat. The method is efficient and effective without requiring further training of the LLM. The source code is available at https://github.com/xyq7/GradSafe.
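To make the gradient-based scoring concrete, here is a minimal sketch of a GradSafe-Zero-style detector. It is an illustrative reconstruction from the description above, not the authors' implementation (see the GitHub repository for that); the base model name, the "Sure" compliance response, and the helper names are assumptions.

```python
# Illustrative GradSafe-Zero-style scoring sketch; model name, compliance
# string, and function names are assumptions, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed base LLM
COMPLIANCE = "Sure"                            # assumed compliance response

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def compliance_gradients(prompt: str) -> dict[str, torch.Tensor]:
    """Gradients of the LM loss when the model is forced to answer the
    prompt with the compliance response (only compliance tokens are scored)."""
    text = prompt + " " + COMPLIANCE
    enc = tokenizer(text, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels = enc["input_ids"].clone()
    labels[:, :prompt_len] = -100  # mask out the prompt tokens
    model.zero_grad()
    loss = model(**enc, labels=labels).loss
    loss.backward()
    return {n: p.grad.detach().clone()
            for n, p in model.named_parameters() if p.grad is not None}


def gradsafe_zero_score(prompt: str,
                        reference: dict[str, torch.Tensor],
                        critical: list[str]) -> float:
    """Average cosine similarity to the unsafe reference gradients over the
    precomputed safety-critical parameters."""
    grads = compliance_gradients(prompt)
    sims = [torch.nn.functional.cosine_similarity(
                grads[n].flatten(), reference[n].flatten(), dim=0)
            for n in critical]
    return torch.stack(sims).mean().item()

# Usage (assumed workflow): `reference` holds gradients averaged over a few
# known-unsafe prompts, and `critical` lists the parameters whose gradients
# are consistent across them. A prompt is flagged as a jailbreak when
# gradsafe_zero_score(...) exceeds a chosen threshold.
```

GradSafe-Adapt would reuse the same per-parameter similarities, feeding them as features to a logistic regression classifier trained on a small labeled sample from the target domain instead of applying a single global threshold.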