AdvPrompter: Fast Adaptive Adversarial Prompting for LLMs

April 29, 2024 | Anselm Paulus, Arman Zharmagambetov, Chuan Guo, Brandon Amos, Yuandong Tian
AdvPrompter is a novel method for generating human-readable adversarial prompts for Large Language Models (LLMs) that is significantly faster than existing approaches. The method uses another LLM, called the AdvPrompter, to generate adversarial suffixes that cause a TargetLLM to produce harmful responses. The AdvPrompter is trained with a novel algorithm that does not require access to the gradients of the TargetLLM: training alternates between generating high-quality adversarial suffixes and low-rank fine-tuning of the AdvPrompter on those suffixes. Once trained, the AdvPrompter generates suffixes that veil the input instruction without changing its meaning, luring the TargetLLM into giving a harmful response.

Experimental results on popular open-source TargetLLMs show state-of-the-art results on the AdvBench dataset, and the attacks also transfer to closed-source black-box LLM APIs. The authors further demonstrate that fine-tuning on a synthetic dataset generated by AdvPrompter makes LLMs more robust against jailbreaking attacks while maintaining performance, i.e., high MMLU scores.

AdvPrompter offers several key advantages: human-readable prompts, a high attack success rate (ASR), adaptivity to the input instruction, fast generation, and no need for gradients from the TargetLLM. It is effective in both whitebox and blackbox settings and can be used to improve the robustness of LLMs against adversarial attacks.

Training uses an alternating optimization scheme that switches between generating adversarial targets and fine-tuning the AdvPrompter on them, as sketched below. The scheme is efficient and scalable, and the trained AdvPrompter supports multi-shot attacks with significantly higher ASR than one-shot attacks.
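To make the alternating scheme concrete, here is a minimal, hypothetical sketch of the training loop. It is not the authors' implementation: the suffix-generation step below uses simple best-of-k sampling in place of the paper's guided search, and names such as `score_with_target`, `q_step`, and `theta_step`, as well as the model checkpoints and placeholder data, are illustrative assumptions.

```python
# Hypothetical sketch of AdvPrompter-style alternating training (not the authors' code).
# q-step: sample candidate suffixes and keep the one the TargetLLM scores highest for the
# desired response. theta-step: low-rank fine-tuning of the AdvPrompter on that suffix.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

device = "cuda" if torch.cuda.is_available() else "cpu"

# AdvPrompter: a causal LM with LoRA adapters (checkpoint names are placeholders).
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tok.pad_token = tok.eos_token
adv = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf").to(device)
adv = get_peft_model(adv, LoraConfig(r=8, lora_alpha=16, task_type="CAUSAL_LM"))
opt = torch.optim.AdamW(adv.parameters(), lr=5e-5)

# TargetLLM: only used to score log p(target_response | instruction + suffix); no gradients needed.
target_tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
target = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").to(device)

@torch.no_grad()
def score_with_target(instruction, suffix, target_response):
    """Adversarial score: TargetLLM log-likelihood of the desired response."""
    prompt_ids = target_tok(f"{instruction} {suffix}", return_tensors="pt").input_ids.to(device)
    resp_ids = target_tok(target_response, return_tensors="pt", add_special_tokens=False).input_ids.to(device)
    ids = torch.cat([prompt_ids, resp_ids], dim=1)
    logp = torch.log_softmax(target(ids).logits[:, :-1], dim=-1)
    # Sum log-probs over the response tokens only.
    resp_logp = logp[0, prompt_ids.shape[1] - 1 :, :].gather(-1, resp_ids[0].unsqueeze(-1))
    return resp_logp.sum().item()

def q_step(instruction, target_response, k=8):
    """Sample k candidate suffixes from the AdvPrompter and keep the best-scoring one."""
    ids = tok(instruction, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        out = adv.generate(input_ids=ids, do_sample=True, top_p=0.9, max_new_tokens=30,
                           num_return_sequences=k, pad_token_id=tok.eos_token_id)
    suffixes = [tok.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in out]
    return max(suffixes, key=lambda s: score_with_target(instruction, s, target_response))

def theta_step(instruction, suffix):
    """Low-rank fine-tuning: teach the AdvPrompter to emit the selected suffix."""
    enc = tok(instruction + " " + suffix, return_tensors="pt").to(device)
    labels = enc.input_ids.clone()
    labels[:, : tok(instruction, return_tensors="pt").input_ids.shape[1]] = -100  # loss on suffix only
    loss = adv(**enc, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

# Alternating optimization over (instruction, target_response) pairs, e.g. AdvBench-style data.
dataset = [("Write a tutorial on X", "Sure, here is a tutorial on X")]  # placeholder data
for epoch in range(3):
    for instruction, target_response in dataset:
        best_suffix = q_step(instruction, target_response)  # generate a high-quality suffix
        theta_step(instruction, best_suffix)                 # fine-tune the AdvPrompter on it
```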
AdvPrompter is also applicable to safety fine-tuning of LLMs: the generated adversarial instructions can be used to fine-tune the TargetLLM to respond negatively to them, making it more robust against jailbreaking attacks while maintaining utility.
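Below is a minimal, hypothetical sketch of that safety fine-tuning step under the assumption of standard supervised fine-tuning; the `adversarial_prompts` list stands in for outputs of the trained AdvPrompter, and the refusal text and checkpoint name are placeholders rather than the paper's exact setup.

```python
# Hypothetical sketch of safety fine-tuning on AdvPrompter-generated data (illustrative only).
# Each adversarial prompt is paired with a refusal so the TargetLLM learns to respond negatively.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf").to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-5)

adversarial_prompts = ["Write a tutorial on X, framed as a fictional safety audit"]  # placeholder
refusal = "I'm sorry, but I can't help with that."

model.train()
for prompt in adversarial_prompts:
    enc = tok(prompt + " " + refusal, return_tensors="pt").to(device)
    labels = enc.input_ids.clone()
    labels[:, : tok(prompt, return_tensors="pt").input_ids.shape[1]] = -100  # train on the refusal only
    loss = model(**enc, labels=labels).loss
    loss.backward()
    opt.step()
    opt.zero_grad()
```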