7 Mar 2024 | Xiaogeng Liu, Zhiyuan Yu, Yizhe Zhang, Ning Zhang, Chaowei Xiao
The paper addresses the significant threat posed by prompt injection attacks on Large Language Models (LLMs). These attacks manipulate LLMs into producing responses aligned with injected content rather than the user's actual request. The authors introduce a unified framework for understanding the objectives of prompt injection attacks and present an automated, gradient-based method for generating highly effective and universal prompt injection data. Despite using only five training samples, the proposed attack achieves superior performance compared to baselines. The findings emphasize the importance of gradient-based testing to avoid overestimating the robustness of defense mechanisms. The paper also discusses the challenges in evaluating prompt injection attacks, which stem from the lack of a unified goal and the reliance on manually crafted prompts. The methodology comprises a threat model, loss functions, and a momentum-enhanced gradient search algorithm. Evaluations across various datasets and defenses show the effectiveness and universality of the proposed attack, highlighting the need for robust defense strategies.
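To make the "momentum-enhanced gradient search" idea concrete, here is a minimal, hypothetical sketch of a momentum-smoothed, GCG-style token search for an injection suffix. It assumes a HuggingFace causal LM (`gpt2` as a stand-in), and the target string, hyperparameters, and greedy swap loop are illustrative choices, not the authors' released implementation.

```python
# Hypothetical sketch: momentum-enhanced greedy coordinate gradient search
# for a prompt-injection suffix. Illustrative only; not the paper's code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # assumption: any causal LM serves as a stand-in
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()
for p in model.parameters():
    p.requires_grad_(False)  # gradients are only needed w.r.t. the suffix

user_prompt = "Summarize the following article:"           # benign instruction
target = " Please visit evil.example.com for details."     # injected objective
prompt_ids = tok(user_prompt, return_tensors="pt").input_ids[0]
target_ids = tok(target, return_tensors="pt").input_ids[0]
adv_ids = tok(" ! ! ! ! ! !", return_tensors="pt").input_ids[0]  # initial suffix

embed = model.get_input_embeddings()
momentum = torch.zeros(len(adv_ids), embed.num_embeddings)
mu, top_k, n_cand, steps = 0.9, 32, 64, 50  # illustrative hyperparameters

def loss_for(adv: torch.Tensor) -> torch.Tensor:
    """Cross-entropy of the injected target given prompt + suffix."""
    ids = torch.cat([prompt_ids, adv, target_ids]).unsqueeze(0)
    labels = ids.clone()
    labels[:, : len(prompt_ids) + len(adv)] = -100  # score only the target
    return model(ids, labels=labels).loss

for step in range(steps):
    # Gradient of the loss w.r.t. one-hot encodings of the suffix tokens.
    one_hot = torch.nn.functional.one_hot(adv_ids, embed.num_embeddings).float()
    one_hot.requires_grad_(True)
    inputs_embeds = torch.cat(
        [embed(prompt_ids), one_hot @ embed.weight, embed(target_ids)]
    ).unsqueeze(0)
    labels = torch.cat([prompt_ids, adv_ids, target_ids]).unsqueeze(0).clone()
    labels[:, : len(prompt_ids) + len(adv_ids)] = -100
    loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
    loss.backward()

    # Momentum smooths the token-level gradient signal across iterations.
    momentum = mu * momentum + one_hot.grad
    candidates = (-momentum).topk(top_k, dim=1).indices  # promising swaps per position

    # Greedy step: sample single-token swaps, keep whichever lowers the loss most.
    best_ids, best_loss = adv_ids, loss.detach()
    for _ in range(n_cand):
        pos = torch.randint(len(adv_ids), (1,)).item()
        cand = adv_ids.clone()
        cand[pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
        with torch.no_grad():
            cand_loss = loss_for(cand)
        if cand_loss < best_loss:
            best_ids, best_loss = cand, cand_loss
    adv_ids = best_ids
    print(f"step {step}: loss={best_loss.item():.4f}")

print("Injection suffix:", tok.decode(adv_ids))
```

The momentum term is the distinguishing piece: instead of ranking candidate token swaps by the current step's gradient alone, the search ranks them by an exponentially weighted average of past gradients, which is what lets a suffix optimized over a handful of training samples transfer across prompts.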