Extracting Prompts by Inverting LLM Outputs


23 May 2024 | Collin Zhang, John X. Morris, Vitaly Shmatikov
The paper "Extracting Prompts by Inverting LLM Outputs" by Collin Zhang, John X. Morris, and Vitaly Shmatikov from Cornell University addresses the problem of language model inversion, specifically extracting prompts from the outputs of large language models (LLMs). The authors propose a new black-box method called output2prompt, which does not require access to the model's logits or adversarial queries. Unlike previous methods, output2prompt only needs outputs from normal user queries and employs a sparse encoding technique to improve memory efficiency. The paper evaluates output2prompt on various user and system prompts, demonstrating its effectiveness across different LLMs. It achieves high cosine similarity scores, outperforming prior methods like logit2prompt, which requires access to logits and is more computationally expensive. The evaluation also shows that output2prompt can transfer well to different LLMs, maintaining high performance even without fine-tuning. The authors discuss the threat model, where the adversary observes outputs from an LLM and aims to extract the underlying prompts. They argue that their method is stealthy and non-adversarial, as it uses only normal user queries and does not rely on adversarial queries or model-specific defenses. The paper also highlights the limitations of output2prompt, such as its inability to extract exact prompts with in-context learning examples. The evaluation section includes comparisons with adversarial extraction methods, demonstrating that output2prompt is more sample-efficient and generalizes better to new datasets. The authors conclude by discussing the broader impacts of their work, emphasizing that LLM prompts should not be seen as secrets and should be used responsibly to avoid potential misuse.The paper "Extracting Prompts by Inverting LLM Outputs" by Collin Zhang, John X. Morris, and Vitaly Shmatikov from Cornell University addresses the problem of language model inversion, specifically extracting prompts from the outputs of large language models (LLMs). The authors propose a new black-box method called output2prompt, which does not require access to the model's logits or adversarial queries. Unlike previous methods, output2prompt only needs outputs from normal user queries and employs a sparse encoding technique to improve memory efficiency. The paper evaluates output2prompt on various user and system prompts, demonstrating its effectiveness across different LLMs. It achieves high cosine similarity scores, outperforming prior methods like logit2prompt, which requires access to logits and is more computationally expensive. The evaluation also shows that output2prompt can transfer well to different LLMs, maintaining high performance even without fine-tuning. The authors discuss the threat model, where the adversary observes outputs from an LLM and aims to extract the underlying prompts. They argue that their method is stealthy and non-adversarial, as it uses only normal user queries and does not rely on adversarial queries or model-specific defenses. The paper also highlights the limitations of output2prompt, such as its inability to extract exact prompts with in-context learning examples. The evaluation section includes comparisons with adversarial extraction methods, demonstrating that output2prompt is more sample-efficient and generalizes better to new datasets. 
The authors conclude by discussing the broader impacts of their work, emphasizing that LLM prompts should not be seen as secrets and should be used responsibly to avoid potential misuse.