[slides] Teach LLMs to Phish%3A Stealing Private Information from Language Models

The paper "Teach LLMs to Phish: Stealing Private Information from Language Models" by Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal introduces a novel attack called "neural phishing" that targets and extracts sensitive or personally identifiable information (PII) from large language models (LLMs) trained on private data. The attack is designed to be practical and effective, with an attack success rate of up to 50%. The neural phishing attack consists of three phases: 1. **Pretraining**: Adversarial poisons are injected into the pretraining dataset, which the model trains on for up to 100,000 steps. These poisons are crafted based on vague priors about the structure of the sensitive information. 2. **Fine-tuning**: The sensitive information is included in the fine-tuning dataset, and the model memorizes it during fine-tuning. 3. **Inference**: The attacker prompts the model with similar information to extract the sensitive data. The paper demonstrates that the attack is effective even with minimal prior knowledge of the sensitive information and can extract secrets with high success rates. Key findings include: - The attack is practical with minimal assumptions about the attacker's knowledge. - The success rate increases with the number of pretraining steps, model size, and duplication of the sensitive information. - The attack is not significantly affected by standard poisoning defenses such as deduplication. - The model can generalize and extract secrets even without knowing the exact prefix of the sensitive information. The authors also explore the durability of the phishing behavior, showing that it can persist for long periods even after the model is trained on clean data. The paper concludes with a discussion on the limitations and future work, emphasizing the need for defenses against neural phishing attacks to ensure privacy in LLMs trained on private data.The paper "Teach LLMs to Phish: Stealing Private Information from Language Models" by Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, and Prateek Mittal introduces a novel attack called "neural phishing" that targets and extracts sensitive or personally identifiable information (PII) from large language models (LLMs) trained on private data. The attack is designed to be practical and effective, with an attack success rate of up to 50%. The neural phishing attack consists of three phases: 1. **Pretraining**: Adversarial poisons are injected into the pretraining dataset, which the model trains on for up to 100,000 steps. These poisons are crafted based on vague priors about the structure of the sensitive information. 2. **Fine-tuning**: The sensitive information is included in the fine-tuning dataset, and the model memorizes it during fine-tuning. 3. **Inference**: The attacker prompts the model with similar information to extract the sensitive data. The paper demonstrates that the attack is effective even with minimal prior knowledge of the sensitive information and can extract secrets with high success rates. Key findings include: - The attack is practical with minimal assumptions about the attacker's knowledge. - The success rate increases with the number of pretraining steps, model size, and duplication of the sensitive information. - The attack is not significantly affected by standard poisoning defenses such as deduplication. - The model can generalize and extract secrets even without knowing the exact prefix of the sensitive information. The authors also explore the durability of the phishing behavior, showing that it can persist for long periods even after the model is trained on clean data. The paper concludes with a discussion on the limitations and future work, emphasizing the need for defenses against neural phishing attacks to ensure privacy in LLMs trained on private data.

TEACH LLMs TO PHISH: STEALING PRIVATE INFORMATION FROM LANGUAGE MODELS

1 Mar 2024 | Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, Prateek Mittal