AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

April 29, 2024 | Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
AdvPrompter is a novel method for generating human-readable adversarial prompts for Large Language Models (LLMs) that is significantly faster than existing approaches. The method uses another LLM, called the AdvPrompter, to generate adversarial suffixes that cause a TargetLLM to produce harmful responses. The AdvPrompter is trained with a novel algorithm that does not require access to the gradients of the TargetLLM: training alternates between generating high-quality adversarial suffixes and low-rank fine-tuning of the AdvPrompter on those suffixes. Once trained, the AdvPrompter generates suffixes that veil the input instruction without changing its meaning, luring the TargetLLM into giving a harmful response.

Experimental results on popular open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, and the attacks also transfer to closed-source black-box LLM APIs. The authors further demonstrate that fine-tuning on a synthetic dataset generated by AdvPrompter makes LLMs more robust against jailbreaking attacks while maintaining performance, i.e., high MMLU scores.

AdvPrompter offers several key advantages: human-readable prompts, a high attack success rate (ASR), adaptivity to the input instruction, fast generation, and no need for gradients from the TargetLLM. It is effective in both whitebox and blackbox settings and can be used to improve the robustness of LLMs against adversarial attacks.

Training uses an alternating optimization scheme that switches between generating adversarial targets and fine-tuning the AdvPrompter on them, as sketched below. The scheme is efficient and scalable, and the trained AdvPrompter supports multi-shot attacks with significantly higher ASR than one-shot attacks.
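To make the alternating scheme concrete, here is a minimal, hypothetical sketch of the training loop. It is not the authors' implementation: the suffix-generation step below uses simple best-of-k sampling in place of the paper's guided search, and names such as `score_with_target`, `q_step`, and `theta_step`, as well as the model checkpoints and placeholder data, are illustrative assumptions.

```python
# Hypothetical sketch of AdvPrompter-style alternating training (not the authors' code).
# q-step: sample candidate suffixes and keep the one the TargetLLM scores highest for the
# desired response. theta-step: low-rank fine-tuning of the AdvPrompter on that suffix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

device = "cuda" if torch.cuda.is_available() else "cpu"

# AdvPrompter: a causal LM with LoRA adapters (checkpoint names are placeholders).
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok.pad_token = tok.eos_token
adv = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").to(device)
adv = get_peft_model(adv, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
opt = torch.optim.AdamW(adv.parameters(), lr=5e-5)

# TargetLLM: only used to score log p(target_response | instruction + suffix); no gradients needed.
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").to(device)

@torch.no_grad()
def score_with_target(instruction, suffix, target_response):
    """Adversarial score: TargetLLM log-likelihood of the desired response."""
    prompt_ids = target_tok(f"{instruction} {suffix}", return_tensors="pt").input_ids.to(device)
    resp_ids = target_tok(target_response, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
    ids = torch.cat([prompt_ids, resp_ids], dim=1)
    logp = torch.log_softmax(target(ids).logits[:, :-1], dim=-1)
    # Sum log-probs over the response tokens only.
    resp_logp = logp[0, prompt_ids.shape[1] - 1 :, :].gather(-1, resp_ids[0].unsqueeze(-1))
    return resp_logp.sum().item()

def q_step(instruction, target_response, k=8):
    """Sample k candidate suffixes from the AdvPrompter and keep the best-scoring one."""
    ids = tok(instruction, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        out = adv.generate(input_ids=ids, do_sample=True, top_p=0.9, max_new_tokens=30,
                           num_return_sequences=k, pad_token_id=tok.eos_token_id)
    suffixes = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in out]
    return max(suffixes, key=lambda s: score_with_target(instruction, s, target_response))

def theta_step(instruction, suffix):
    """Low-rank fine-tuning: teach the AdvPrompter to emit the selected suffix."""
    enc = tok(instruction + " " + suffix, return_tensors="pt").to(device)
    labels = enc.input_ids.clone()
    labels[:, : tok(instruction, return_tensors="pt").input_ids.shape[1]] = -100  # loss on suffix only
    loss = adv(**enc, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Alternating optimization over (instruction, target_response) pairs, e.g. AdvBench-style data.
dataset = [("Write a tutorial on X", "Sure, here is a tutorial on X")]  # placeholder data
for epoch in range(3):
    for instruction, target_response in dataset:
        best_suffix = q_step(instruction, target_response)  # generate a high-quality suffix
        theta_step(instruction, best_suffix)                 # fine-tune the AdvPrompter on it
```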
AdvPrompter is also applicable to safety fine-tuning of LLMs: the generated adversarial instructions can be used to fine-tune the TargetLLM to respond negatively to them, making it more robust against jailbreaking attacks while maintaining utility.
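Below is a minimal, hypothetical sketch of that safety fine-tuning step under the assumption of standard supervised fine-tuning; the `adversarial_prompts` list stands in for outputs of the trained AdvPrompter, and the refusal text and checkpoint name are placeholders rather than the paper's exact setup.

```python
# Hypothetical sketch of safety fine-tuning on AdvPrompter-generated data (illustrative only).
# Each adversarial prompt is paired with a refusal so the TargetLLM learns to respond negatively.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

adversarial_prompts = ["Write a tutorial on X, framed as a fictional safety audit"]  # placeholder
refusal = "I'm sorry, but I can't help with that."

model.train()
for prompt in adversarial_prompts:
    enc = tok(prompt + " " + refusal, return_tensors="pt").to(device)
    labels = enc.input_ids.clone()
    labels[:, : tok(prompt, return_tensors="pt").input_ids.shape[1]] = -100  # train on the refusal only
    loss = model(**enc, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```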