16 May 2024 | Ruizhe Chen, Tianxiang Hu, Yang Feng, Zuozhu Liu
This paper addresses the concern of Large Language Models (LLMs) memorizing and disclosing Personally Identifiable Information (PII). The authors introduce a novel method to pinpoint the neurons in LLMs that are responsible for memorizing PII, termed privacy neurons. Their approach localizes these neurons using learnable binary weight masks trained with an adversarial objective. The study reveals that PII is memorized by a small subset of neurons distributed across all layers, concentrated in the MLP layers, and that these neurons exhibit specificity for particular categories of PII. The authors then propose deactivating the localized privacy neurons to mitigate PII leakage, and their experiments show that this substantially reduces leakage without significantly degrading model performance. The paper also analyzes the distribution of privacy neurons and the method's sensitivity to the number of neurons localized, providing insights into the mechanisms of PII memorization in LLMs.
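To make the localization idea concrete, below is a minimal, hypothetical PyTorch sketch of a learnable binary mask over MLP hidden neurons, binarized with a straight-through estimator and trained to "forget" PII while a sparsity term keeps the number of switched-off neurons small. The class and function names (`BinaryMask`, `MaskedMLP`, `localization_loss`), the sparsity weight, and the simplified objective are assumptions for illustration only; the paper's exact mask parameterization and adversarial training procedure may differ.

```python
# Illustrative sketch of privacy-neuron localization via learnable binary masks.
# Not the authors' implementation: the binarization scheme and loss are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BinaryMask(nn.Module):
    """Learnable per-neuron mask, binarized with a straight-through estimator."""

    def __init__(self, num_neurons: int):
        super().__init__()
        # Positive init -> all neurons start "on" (mask ~ 1).
        self.logits = nn.Parameter(torch.full((num_neurons,), 3.0))

    def forward(self) -> torch.Tensor:
        probs = torch.sigmoid(self.logits)
        hard = (probs > 0.5).float()
        # Straight-through: hard 0/1 values forward, sigmoid gradient backward.
        return hard + probs - probs.detach()


class MaskedMLP(nn.Module):
    """Toy transformer-style MLP block whose hidden neurons can be masked off."""

    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.up = nn.Linear(d_model, d_hidden)
        self.down = nn.Linear(d_hidden, d_model)
        self.mask = BinaryMask(d_hidden)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.gelu(self.up(x))
        return self.down(h * self.mask())  # zero out masked (privacy) neurons


def localization_loss(model_loss_on_pii: torch.Tensor,
                      mask: BinaryMask,
                      sparsity_weight: float = 1e-3) -> torch.Tensor:
    """Raise the LM loss on PII sequences (i.e., unlearn them) while a sparsity
    penalty keeps the fraction of masked-off neurons small."""
    frac_off = 1.0 - torch.sigmoid(mask.logits).mean()
    return -model_loss_on_pii + sparsity_weight * frac_off


# Usage: after training the mask logits, the neurons whose mask falls to 0 are
# the candidate privacy neurons; deactivation amounts to fixing them at 0.
block = MaskedMLP(d_model=64, d_hidden=256)
out = block(torch.randn(2, 10, 64))
privacy_neurons = (torch.sigmoid(block.mask.logits) <= 0.5).nonzero().flatten()
```

In this sketch, "deactivating privacy neurons" corresponds to freezing the identified mask entries at zero at inference time, leaving all other neurons, and hence general model behavior, untouched.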