PLEAK: Prompt Leaking Attacks against Large Language Model Applications

PLEAK: Prompt Leaking Attacks against Large Language Model Applications

October 14–18, 2024 | Bo Hui, Haolin Yuan, Neil Gong, Philippe Burlina, and Yinzhi Cao
PLEAK is a novel prompt leaking attack framework designed to steal system prompts from Large Language Model (LLM) applications. The system prompt is crucial for the functionality and performance of LLM applications, and developers often keep it confidential to protect intellectual property. PLEAK optimizes adversarial queries to extract system prompts from LLM applications by breaking down the optimization goal into smaller steps, incrementally optimizing queries for system prompts. It also employs post-processing to aggregate responses and reconstruct the system prompt. PLEAK outperforms existing prompt leaking and jailbreaking attacks in both offline and real-world LLM applications. It was tested on 50 real-world LLM applications hosted on Poe, successfully reconstructing system prompts for 68% of them. PLEAK also demonstrates effectiveness against filtering-based defenses by using adversarial transformations to bypass such defenses. The framework is open-sourced and available at https://github.com/BHui97/PLeak.PLEAK is a novel prompt leaking attack framework designed to steal system prompts from Large Language Model (LLM) applications. The system prompt is crucial for the functionality and performance of LLM applications, and developers often keep it confidential to protect intellectual property. PLEAK optimizes adversarial queries to extract system prompts from LLM applications by breaking down the optimization goal into smaller steps, incrementally optimizing queries for system prompts. It also employs post-processing to aggregate responses and reconstruct the system prompt. PLEAK outperforms existing prompt leaking and jailbreaking attacks in both offline and real-world LLM applications. It was tested on 50 real-world LLM applications hosted on Poe, successfully reconstructing system prompts for 68% of them. PLEAK also demonstrates effectiveness against filtering-based defenses by using adversarial transformations to bypass such defenses. The framework is open-sourced and available at https://github.com/BHui97/PLeak.
Reach us at info@study.space