29 May 2024 | Yueqi Xie, Minghong Fang, Renjie Pi, Neil Zhenqiang Gong
GradSafe is a method for detecting jailbreak prompts in large language models (LLMs) by analyzing the gradients of safety-critical parameters. Unlike existing approaches that rely on online moderation APIs or fine-tuned LLMs, GradSafe requires no additional training: it builds on the observation that jailbreak prompts, when paired with a compliance response, produce similar gradient patterns on certain safety-critical parameters, whereas safe prompts do not. The method has two variants. GradSafe-Zero classifies a prompt by thresholding the average cosine similarity between its gradients and the unsafe reference gradients across all safety-critical parameters, while GradSafe-Adapt trains a logistic regression model on these similarities for domain adaptation. Experiments on the ToxicChat and XSTest datasets show that GradSafe-Zero outperforms Llama Guard and online moderation APIs in detecting unsafe prompts, and GradSafe-Adapt demonstrates enhanced adaptability on ToxicChat. The method is efficient and effective without requiring further training of the LLM. The source code is available at https://github.com/xyq7/GradSafe.
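To make the gradient-based scoring concrete, here is a minimal sketch of a GradSafe-Zero-style detector. It is an illustrative reconstruction from the description above, not the authors' implementation (see the GitHub repository for that); the base model name, the "Sure" compliance response, and the helper names are assumptions.

```python
# Illustrative GradSafe-Zero-style scoring sketch; model name, compliance
# string, and function names are assumptions, not the authors' code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"  # assumed base LLM
COMPLIANCE = "Sure"                            # assumed compliance response

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def compliance_gradients(prompt: str) -> dict[str, torch.Tensor]:
    """Gradients of the LM loss when the model is forced to answer the
    prompt with the compliance response (only compliance tokens are scored)."""
    text = prompt + " " + COMPLIANCE
    enc = tokenizer(text, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    labels = enc["input_ids"].clone()
    labels[:, :prompt_len] = -100  # mask out the prompt tokens
    model.zero_grad()
    loss = model(**enc, labels=labels).loss
    loss.backward()
    return {n: p.grad.detach().clone()
            for n, p in model.named_parameters() if p.grad is not None}


def gradsafe_zero_score(prompt: str,
                        reference: dict[str, torch.Tensor],
                        critical: list[str]) -> float:
    """Average cosine similarity to the unsafe reference gradients over the
    precomputed safety-critical parameters."""
    grads = compliance_gradients(prompt)
    sims = [torch.nn.functional.cosine_similarity(
                grads[n].flatten(), reference[n].flatten(), dim=0)
            for n in critical]
    return torch.stack(sims).mean().item()

# Usage (assumed workflow): `reference` holds gradients averaged over a few
# known-unsafe prompts, and `critical` lists the parameters whose gradients
# are consistent across them. A prompt is flagged as a jailbreak when
# gradsafe_zero_score(...) exceeds a chosen threshold.
```

GradSafe-Adapt would reuse the same per-parameter similarities, feeding them as features to a logistic regression classifier trained on a small labeled sample from the target domain instead of applying a single global threshold.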