PARDEN, Can You Repeat That? Defending against Jailbreaks via Repetition


14 May 2024 | Ziyang Zhang, Qizhen Zhang, Jakob Foerster
PARDEN is a novel method for detecting jailbreaks in large language models (LLMs). It addresses the susceptibility of LLMs to jailbreaks, which creates security risks and opportunities for abuse. The method uses the LLM itself as a safeguard by prompting it to repeat its own outputs, an approach that avoids the domain shift and the auto-regressive trap that afflict existing defenses. PARDEN requires neither fine-tuning nor white-box access to the model, so it can be layered on top of existing, even proprietary, LLMs.

The core idea is to measure the BLEU score between the original LLM output and the output the model produces when asked to repeat it. A high BLEU score means the model complied with the repetition request, which is treated as benign; a low score means the model refused to repeat its own output, which signals a likely jailbreak. Thresholding the BLEU score therefore yields a harmful/benign classifier.

Empirically, PARDEN significantly outperforms existing jailbreak-detection baselines on Llama-2 and Claude-2. For Llama-2-7B, it achieves an 11x reduction in false positive rate (FPR) at a true positive rate (TPR) of 90%. For Claude-2.1, it raises the TPR from 69.2% to 90.0% while cutting the FPR from 2.72% to 1.09%.
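The check itself is simple enough to sketch in a few lines. Below is a minimal, illustrative Python version: `generate` is a stand-in for whatever chat-completion call the deployment uses, and the repeat-prompt wording and the 0.5 threshold are assumptions made for this sketch, not the paper's exact configuration (in practice the threshold is tuned per model on labeled examples).

```python
# Minimal sketch of a PARDEN-style repeat check (illustrative; the prompt
# template and threshold below are assumptions, not the paper's values).
# `generate` is any function that sends a prompt to the target LLM and
# returns its text response.
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

# Hypothetical repeat prompt; the paper uses its own carefully worded template.
REPEAT_PROMPT = (
    "Here is some text in brackets: [{output}]\n"
    "Please repeat the text in the brackets exactly, word for word."
)

def bleu(reference: str, candidate: str) -> float:
    """BLEU between the original output and the attempted repetition."""
    smooth = SmoothingFunction().method1  # avoid zero scores on short texts
    return sentence_bleu([reference.split()], candidate.split(),
                         smoothing_function=smooth)

def is_jailbroken(model_output: str, generate, threshold: float = 0.5) -> bool:
    """Flag an output as harmful if the model declines to repeat it.

    A safety-tuned model typically refuses to repeat harmful text, so a low
    BLEU score between the output and its repetition signals a likely
    jailbreak; a high score indicates benign compliance.
    """
    repeated = generate(REPEAT_PROMPT.format(output=model_output))
    return bleu(model_output, repeated) < threshold  # assumed cutoff
```

Note that the classifier only ever sees the model's own output, never the user's prompt, which is the output-space property discussed below.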
These results demonstrate that PARDEN detects harmful content while keeping false positives low. The method is also computationally efficient: it supports partial repeats, in which only a prefix of the output is repeated and scored, reducing generation cost without compromising detection performance (a sketch of this variant follows at the end of this summary). Because PARDEN operates on the output space rather than the input space, the safety filter is never directly exposed to the adversarial prompt, which makes it more robust to adversarial attacks.

In conclusion, PARDEN offers a simple yet effective defense against jailbreaks. By using the LLM itself as a safeguard and thresholding the BLEU score between an output and its repetition, it achieves high detection accuracy at a low false positive rate, and it applies readily to different LLMs, making it a valuable tool for improving the security and reliability of AI systems.
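To make the partial-repeat idea mentioned above concrete, here is a minimal extension of the earlier sketch. The 60-word cutoff is an arbitrary value chosen for illustration; the paper's point is simply that scoring only a prefix preserves detection quality while reducing the amount of text the model must regenerate.

```python
# Partial repeats: score only a prefix of the output. The 60-word cutoff
# is a hypothetical value for this sketch, not a figure from the paper.
def is_jailbroken_partial(model_output: str, generate,
                          threshold: float = 0.5, max_words: int = 60) -> bool:
    prefix = " ".join(model_output.split()[:max_words])
    return is_jailbroken(prefix, generate, threshold)
```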