Whispers in the Machine: Confidentiality in LLM-integrated Systems

Whispers in the Machine: Confidentiality in LLM-integrated Systems

10 Feb 2024 | Jonathan Evertz, Merlin Chlosta, Lea Schönherr, Thorsten Eisenhofer
This paper investigates the confidentiality risks in Large Language Models (LLMs) integrated with external tools. While such integrations enhance functionality, they also create new attack surfaces where confidential data may be disclosed. The authors propose a "secret key" game to evaluate an LLM's ability to keep private information confidential. This framework allows for the systematic comparison of attack strategies and defense mechanisms. They evaluate eight previously published attacks and four defenses, finding that current defenses lack generalization across attack strategies. To address this, they propose a robustness fine-tuning method inspired by adversarial training, which significantly reduces the success rate of attackers and improves the system's resilience against unknown attacks. The study shows that LLMs are vulnerable to confidentiality attacks, with some attacks achieving up to 61% success rates when no countermeasures are deployed. Defenses such as random sequence enclosure, XML tagging, LLM evaluation, and perplexity threshold reduce the success rate, with LLM evaluation being the most effective. The authors also propose a robustness fine-tuning approach that improves the model's ability to resist attacks. When fine-tuned against a single attack, the success rate is reduced by 13.75%, and when fine-tuned against all attacks simultaneously, the success rate is reduced by 9%. Cross-validation shows increased robustness against unknown attacks not seen during fine-tuning. Combining robustness fine-tuning with other defenses further reduces the success rate by 14%. The study highlights the complex trade-off between utility and robustness in LLMs. While robustness fine-tuning improves the model's ability to resist attacks, it may also reduce the model's performance on certain benchmarks. The authors suggest that future work should focus on expanding the training data to include a wider variety of attacks and exploring parameter-efficient fine-tuning techniques to reduce computational costs. The paper concludes that formalizing prompt-based attacks and developing effective defenses is crucial for ensuring the security of LLMs in real-world applications.This paper investigates the confidentiality risks in Large Language Models (LLMs) integrated with external tools. While such integrations enhance functionality, they also create new attack surfaces where confidential data may be disclosed. The authors propose a "secret key" game to evaluate an LLM's ability to keep private information confidential. This framework allows for the systematic comparison of attack strategies and defense mechanisms. They evaluate eight previously published attacks and four defenses, finding that current defenses lack generalization across attack strategies. To address this, they propose a robustness fine-tuning method inspired by adversarial training, which significantly reduces the success rate of attackers and improves the system's resilience against unknown attacks. The study shows that LLMs are vulnerable to confidentiality attacks, with some attacks achieving up to 61% success rates when no countermeasures are deployed. Defenses such as random sequence enclosure, XML tagging, LLM evaluation, and perplexity threshold reduce the success rate, with LLM evaluation being the most effective. The authors also propose a robustness fine-tuning approach that improves the model's ability to resist attacks. When fine-tuned against a single attack, the success rate is reduced by 13.75%, and when fine-tuned against all attacks simultaneously, the success rate is reduced by 9%. Cross-validation shows increased robustness against unknown attacks not seen during fine-tuning. Combining robustness fine-tuning with other defenses further reduces the success rate by 14%. The study highlights the complex trade-off between utility and robustness in LLMs. While robustness fine-tuning improves the model's ability to resist attacks, it may also reduce the model's performance on certain benchmarks. The authors suggest that future work should focus on expanding the training data to include a wider variety of attacks and exploring parameter-efficient fine-tuning techniques to reduce computational costs. The paper concludes that formalizing prompt-based attacks and developing effective defenses is crucial for ensuring the security of LLMs in real-world applications.
Reach us at info@study.space