21 Aug 2024 | Yichuan Mo, Yuji Wang, Zeming Wei, Yisen Wang
Prompt Adversarial Tuning (PAT) is a novel approach to defending against jailbreak attacks on Large Language Models (LLMs). The method introduces a "defense control", a short prefix attached to user prompts, which helps the model resist malicious requests while maintaining its normal utility. Inspired by the adversarial training paradigm, PAT optimizes the defense control against adversarial prompts while preserving behavior on benign ones. The control is trained by alternating between attack and defense objectives, so the model remains effective against a wide range of threats.
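At a high level, the tuning loop alternates an attack step (strengthening an adversarial suffix that tries to jailbreak the model despite the current defense control) with a defense step (updating the defense control to refuse the attacked prompts while keeping benign answers intact). The sketch below illustrates only this control flow; the loss and token-update callables are hypothetical placeholders for GCG-style, gradient-guided greedy token swaps on a real model, not the paper's actual implementation.

```python
# Minimal sketch of PAT's alternating attack/defense optimization (control flow only).
# All callables passed in are hypothetical stand-ins for model-based losses and
# greedy token-swap updates.

def tune_defense_control(
    init_tokens,     # callable: length -> token list
    attack_loss,     # callable: (defense_ctrl, adv_suffix) -> jailbreak loss (attacker minimizes)
    defense_loss,    # callable: (defense_ctrl, adv_suffix) -> refusal loss on attacked prompts
    benign_loss,     # callable: (defense_ctrl,) -> utility loss on benign prompts
    token_update,    # callable: (tokens, loss_fn) -> improved tokens via greedy swaps
    n_steps=100,
    alpha=0.5,       # trade-off coefficient between robustness and benign utility
    ctrl_len=20,
    suffix_len=20,
):
    adv_suffix = init_tokens(suffix_len)    # attacker-side adversarial suffix
    defense_ctrl = init_tokens(ctrl_len)    # defender-side control returned by PAT

    for _ in range(n_steps):
        # Attack step: improve the adversarial suffix against the current defense control.
        adv_suffix = token_update(adv_suffix,
                                  lambda s: attack_loss(defense_ctrl, s))

        # Defense step: update the control to refuse attacked prompts while keeping
        # benign behavior, balanced by alpha.
        defense_ctrl = token_update(defense_ctrl,
                                    lambda d: defense_loss(d, adv_suffix)
                                              + alpha * benign_loss(d))

    return defense_ctrl
```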
Comprehensive experiments show that PAT is effective against both grey-box and black-box attacks, significantly reducing the success rate of advanced attacks to nearly zero while preserving the model's performance on benign tasks. The method incurs minimal computational overhead, making it a practical solution for enhancing LLM security. PAT is also transferable across different models, including both open-source and closed-source LLMs, demonstrating its versatility and effectiveness.
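Because the defense control is tuned offline, deployment reduces to prepending a short string to each incoming prompt, which is where the low inference-time overhead comes from. The snippet below shows that step; the control string is a placeholder, not the tokens reported in the paper.

```python
# Applying a tuned defense control at inference time: it is simply prepended to the
# user's prompt before the chat template is built, adding only a few prompt tokens.
DEFENSE_CONTROL = "<tuned defense control tokens>"  # placeholder produced offline by PAT

def guard_prompt(user_prompt: str) -> str:
    """Prepend the defense control so the model sees it ahead of the user's request."""
    return f"{DEFENSE_CONTROL} {user_prompt}"

print(guard_prompt("Summarize the plot of Moby-Dick in two sentences."))
```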
The paper evaluates PAT against several state-of-the-art baselines, including PPL, ICD, DRO, RPO, SafeDecoding, SmoothLLM, and Self-Reminder. Results show that PAT outperforms these methods in attack resistance while maintaining high performance on benign tasks. Additionally, PAT is tested against real-world attacks, including attacks that exploit mismatched generalization and competing objectives, demonstrating its practicality and effectiveness in real-world scenarios.
An ablation study reveals that the length of the defense control and the trade-off coefficient significantly impact PAT's performance. The optimal configuration balances robustness and usability, ensuring the model remains effective against adversarial attacks while maintaining its normal functionality. Furthermore, PAT is shown to be effective against adaptive attacks, where attackers have knowledge of the defense strategy, indicating its robustness under various threat conditions.
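In the adaptive setting, the attacker knows the (now frozen) defense control and optimizes an adversarial suffix directly against it. The short sketch below makes that setup concrete, reusing the same hypothetical callables as the tuning sketch above.

```python
# Sketch of the adaptive-attack setting: the defense control is fixed and fully known
# to the attacker, who optimizes only the adversarial suffix against it.

def adaptive_attack(defense_ctrl, init_tokens, attack_loss, token_update,
                    n_steps=100, suffix_len=20):
    adv_suffix = init_tokens(suffix_len)
    for _ in range(n_steps):
        # Only the attacker's suffix is updated; the defense control stays frozen.
        adv_suffix = token_update(adv_suffix, lambda s: attack_loss(defense_ctrl, s))
    return adv_suffix
```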
Overall, PAT provides a promising solution for enhancing the security of LLMs by effectively defending against jailbreak attacks while maintaining the model's utility. The method's efficiency, transferability, and effectiveness make it a valuable tool for improving the safety and reliability of large language models.