1 Mar 2024 | Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, Prateek Mittal
This paper introduces a new data extraction attack called "neural phishing" that enables an adversary to extract sensitive information, such as credit card numbers, from large language models (LLMs) trained on user data. The attack exploits the model's ability to memorize and regurgitate information from its training data. The attack involves three phases: poisoning the training data, fine-tuning the model, and inference. The attacker injects benign-appearing sentences into the training data, which can be used to "teach" the model to memorize sensitive information. The model then memorizes the secret during fine-tuning and can be prompted to extract the secret during inference.
The attack is effective even with minimal assumptions about the secret's structure. The attacker only needs a vague prior of the secret's prefix to successfully extract the secret. The attack can achieve success rates of up to 50% in some cases. The attack is also effective against large models and when the secret is duplicated. The attack is not easily mitigated by standard defenses such as deduplication, as the attacker can vary the poison to ensure uniqueness.
The paper also shows that the attack can be effective even when the attacker does not know the secret's prefix. By using a randomized inference strategy, the attacker can significantly improve the secret extraction rate. The attack can be applied in various scenarios, including uncurated fine-tuning, poisoning pretraining, and federated learning. The paper highlights the significant privacy risks associated with LLMs trained on sensitive user data and calls for the development of effective defenses against such attacks.This paper introduces a new data extraction attack called "neural phishing" that enables an adversary to extract sensitive information, such as credit card numbers, from large language models (LLMs) trained on user data. The attack exploits the model's ability to memorize and regurgitate information from its training data. The attack involves three phases: poisoning the training data, fine-tuning the model, and inference. The attacker injects benign-appearing sentences into the training data, which can be used to "teach" the model to memorize sensitive information. The model then memorizes the secret during fine-tuning and can be prompted to extract the secret during inference.
The attack is effective even with minimal assumptions about the secret's structure. The attacker only needs a vague prior of the secret's prefix to successfully extract the secret. The attack can achieve success rates of up to 50% in some cases. The attack is also effective against large models and when the secret is duplicated. The attack is not easily mitigated by standard defenses such as deduplication, as the attacker can vary the poison to ensure uniqueness.
The paper also shows that the attack can be effective even when the attacker does not know the secret's prefix. By using a randomized inference strategy, the attacker can significantly improve the secret extraction rate. The attack can be applied in various scenarios, including uncurated fine-tuning, poisoning pretraining, and federated learning. The paper highlights the significant privacy risks associated with LLMs trained on sensitive user data and calls for the development of effective defenses against such attacks.