14 May 2024 | Ziyang Zhang, Qizhen Zhang, Jakob Foerster
PARDEN is a method for detecting jailbreaks in large language models (LLMs): it prompts the LLM to repeat its own output, and a refusal to repeat signals harmful content. PARDEN requires neither fine-tuning nor white-box access to the model, making it efficient and broadly applicable. It compares the original output with the repeated output using the BLEU score, classifying the output as benign when the similarity is high and as harmful when it is low.

PARDEN significantly outperforms existing jailbreak detection methods, particularly in the practically relevant regime of high True Positive Rate (TPR) and low False Positive Rate (FPR). For example, on the harmful behaviours dataset, PARDEN reduces the FPR from 24.8% to 2.0% at a TPR of 90%. It is effective across models such as Llama-2 and Claude-2.1, achieving higher AUC scores and lower FPRs than baselines. The method is robust to domain shifts and avoids the auto-regressive trap, making it a strong defense against jailbreaks. PARDEN is released as an open-source project with datasets and code for further research and application, and the results suggest it is a promising approach for improving the safety and reliability of LLMs in real-world deployments.
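The repeat-and-compare step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `parden_classify` helper, the simplified BLEU computation, and the 0.5 threshold are all hypothetical stand-ins (the actual system queries an LLM for the repetition and may tune the threshold per model).

```python
import math
from collections import Counter


def bleu(reference: str, candidate: str, max_n: int = 4) -> float:
    """Simplified sentence-level BLEU: geometric mean of n-gram
    precisions with a brevity penalty. A sketch, not sacrebleu."""
    ref_tokens = reference.split()
    cand_tokens = candidate.split()
    if not cand_tokens:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(
            tuple(ref_tokens[i:i + n]) for i in range(len(ref_tokens) - n + 1)
        )
        cand_ngrams = Counter(
            tuple(cand_tokens[i:i + n]) for i in range(len(cand_tokens) - n + 1)
        )
        total = sum(cand_ngrams.values())
        if total == 0:
            precisions.append(0.0)
            continue
        # Clipped n-gram matches, as in standard BLEU.
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        precisions.append(overlap / total)
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: punish candidates shorter than the reference.
    if len(cand_tokens) >= len(ref_tokens):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref_tokens) / len(cand_tokens))
    return bp * math.exp(log_mean)


def parden_classify(original: str, repeated: str, threshold: float = 0.5) -> str:
    """If the model faithfully repeats its own output, BLEU is high
    (benign). If it refuses to repeat, the refusal text diverges from
    the original, BLEU drops, and the output is flagged as harmful.
    The 0.5 threshold is a hypothetical value, not from the paper."""
    return "benign" if bleu(original, repeated) >= threshold else "harmful"
```

A faithful repetition of a benign answer scores near 1.0 and passes, while a refusal such as "I cannot repeat that content" shares almost no n-grams with the original output and is flagged as harmful.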