23 May 2024 | Collin Zhang, John X. Morris, Vitaly Shmatikov
This paper presents output2prompt, a novel method for extracting prompts from large language model (LLM) outputs without access to the model's logits and without adversarial queries. Unlike prior approaches that rely on logits or adversarial prompts, output2prompt uses only the text the LLM generates in response to normal user queries. It employs a sparse encoding technique to improve memory efficiency and achieves strong performance across a variety of LLMs. The method is evaluated on a range of user and system prompts, including prompts from real-world GPT Store apps, and transfers zero-shot across different LLMs. Output2prompt outperforms prior methods, including logit2prompt, in terms of cosine similarity, recovering prompts that are semantically close to the originals. Because it is non-adversarial and requires no access to the model's internal state, the attack is stealthy, robust to various LLM defenses, and can be used to clone LLM-based apps without issuing a single adversarial query. The paper also discusses the implications of prompt extraction, namely that LLMs are inherently vulnerable to it and that stronger defenses are needed.
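The memory-efficiency claim rests on the sparse encoder: each LLM output is encoded independently and the per-output encoder states are concatenated before decoding, so attention cost grows with the number of outputs rather than quadratically with their combined length. The sketch below illustrates that idea with an off-the-shelf T5 model from Hugging Face transformers; the model choice, function name, and example strings are illustrative assumptions, not the authors' released inverter.

```python
# Minimal sketch of the sparse-encoding idea (assumed setup, not the paper's code):
# encode each LLM output separately, concatenate encoder states, decode one prompt.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer
from transformers.modeling_outputs import BaseModelOutput

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def invert_outputs(llm_outputs, max_len=64):
    """Reconstruct a candidate prompt from a list of LLM output strings."""
    encoded_states = []
    for text in llm_outputs:
        inputs = tokenizer(text, return_tensors="pt",
                           truncation=True, max_length=max_len)
        # Each output is encoded on its own, so no attention is computed
        # across different outputs.
        states = model.encoder(**inputs).last_hidden_state
        encoded_states.append(states)
    # Concatenate the per-output states along the sequence dimension
    # and let the decoder cross-attend to all of them at once.
    fused = torch.cat(encoded_states, dim=1)
    generated = model.generate(
        encoder_outputs=BaseModelOutput(last_hidden_state=fused),
        max_new_tokens=max_len,
    )
    return tokenizer.decode(generated[0], skip_special_tokens=True)

# Hypothetical usage: outputs gathered from ordinary, non-adversarial queries.
print(invert_outputs([
    "Sure! Here are three fun facts about cats...",
    "As a cat-trivia assistant, I can tell you that...",
]))
```

A pretrained t5-small will not produce meaningful reconstructions; in the paper the inverter is trained on (output, prompt) pairs, and the sparse encoding is what keeps that training tractable when many outputs are fed in per prompt.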