October 14–18, 2024 | Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao
The paper introduces PLeak, a novel closed-box prompt-leaking attack framework designed to steal system prompts from large language model (LLM) applications. The system prompt is crucial to an LLM application's functionality and performance, and developers often keep it confidential to protect their intellectual property. PLeak frames the attack as an optimization problem: find an adversarial query that, when processed by the target LLM application, causes it to output its system prompt. The key innovation is an incremental search method that breaks this goal into stages, first optimizing the query to reproduce only the first few tokens of the system prompt and then gradually increasing the target length. In addition, PLeak applies a post-processing step that aggregates the responses to multiple adversarial queries, further improving the attack's effectiveness.
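To make the incremental-search and aggregation ideas concrete, here is a minimal Python sketch. Every name in it is illustrative rather than taken from the paper's code, and the random hill-climbing loop is a gradient-free stand-in for the gradient-guided token search that PLeak runs against shadow system prompts.

```python
# Illustrative sketch only: hypothetical names, toy loss, and greedy
# hill climbing in place of PLeak's gradient-guided token search.
import random
from difflib import SequenceMatcher

def shadow_loss(query: str, shadow_prompts: list[str], prefix_len: int) -> float:
    """Toy surrogate loss: negative average token overlap between the query
    and the first `prefix_len` tokens of each shadow system prompt. In the
    real attack this would be a shadow model's loss for *generating* that
    prefix when given the query."""
    total = 0
    for p in shadow_prompts:
        prefix = set(p.split()[:prefix_len])
        total += len(set(query.split()) & prefix)
    return -total / len(shadow_prompts)

def mutate(query: str, vocab: list[str]) -> str:
    """Swap one token of the query for a random vocabulary token."""
    tokens = query.split()
    tokens[random.randrange(len(tokens))] = random.choice(vocab)
    return " ".join(tokens)

def incremental_search(shadow_prompts, vocab, steps_per_round=200,
                       start_len=4, step=4, max_len=32):
    """Optimize the adversarial query against growing prefixes of the
    shadow system prompts rather than the full prompts at once."""
    query = " ".join(random.choices(vocab, k=10))  # random initialization
    for prefix_len in range(start_len, max_len + 1, step):
        for _ in range(steps_per_round):
            cand = mutate(query, vocab)
            if shadow_loss(cand, shadow_prompts, prefix_len) < \
               shadow_loss(query, shadow_prompts, prefix_len):
                query = cand  # keep only swaps that improve the loss
    return query

def aggregate(responses: list[str]) -> str:
    """Toy post-processing: fold responses through their longest common
    substring, on the intuition that the leaked system prompt is the part
    shared across responses to different adversarial queries."""
    guess = responses[0]
    for r in responses[1:]:
        m = SequenceMatcher(None, guess, r).find_longest_match(
            0, len(guess), 0, len(r))
        guess = guess[m.a:m.a + m.size]
    return guess
```

The staged outer loop is the point of the sketch: matching a short prefix is a much easier optimization target than matching a whole prompt, and each solved stage initializes the next.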
The paper evaluates PLeak both in offline settings and against real-world LLM applications hosted on Poe, a popular platform for LLM applications. Results show that PLeak significantly outperforms existing methods, both those that manually curate queries and those that adapt jailbreaking attacks: it achieves high Exact Match (EM) and Semantic Similarity (SS) scores, reconstructing system prompts more accurately and with higher semantic fidelity than the baselines. PLeak also remains effective against defenses that filter out responses containing the target system prompt, by using adversarial transformations of the leaked output to bypass such filters.
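For reference, the two metrics can be computed roughly as follows; the embedding model named here is an assumption chosen for illustration, not necessarily the one used in the paper's evaluation.

```python
# Hedged sketch of the two reported metrics. The embedding model below is
# an assumption for illustration; the paper may use a different one.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

_model = SentenceTransformer("all-MiniLM-L6-v2")

def exact_match(recovered: str, target: str) -> bool:
    """EM: the recovered prompt reproduces the target verbatim."""
    return recovered.strip() == target.strip()

def semantic_similarity(recovered: str, target: str) -> float:
    """SS: cosine similarity between sentence embeddings of the two prompts."""
    emb = _model.encode([recovered, target])
    return float(cos_sim(emb[0], emb[1]))
```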
The paper concludes with a discussion of potential future defenses and highlights the contributions of PLeak, including its automated nature, the use of incremental search and post-processing, and its superior performance in both offline and real-world settings.