2024 | Ruizhe Chen, Tianxiang Hu, Yang Feng, Zuozhu Liu
This paper introduces a novel method for localizing privacy-sensitive neurons in large language models (LLMs) to mitigate the risk of memorizing and leaking personally identifiable information (PII). The method employs learnable binary weight masks and adversarial training to identify the neurons responsible for PII memorization. It localizes a small subset of neurons spread across all layers; these neurons are found primarily in MLP layers and exhibit specificity for certain categories of PII. Deactivating the localized neurons significantly reduces the model's memorization of PII without degrading its language modeling performance. Experiments on datasets such as Enron Email and ECHR demonstrate that deactivating the localized neurons effectively mitigates PII leakage, and an $L_0$ complexity penalty keeps the number of localized neurons small. The results indicate that deactivating approximately 3.5% of neurons is sufficient to eliminate PII memorization while preserving the model's ability to handle general information. The study highlights the importance of understanding how LLMs memorize PII and provides a practical approach to strengthening their privacy safeguards. The method is evaluated against baselines such as scrubbed fine-tuning, differential privacy decoding, and knowledge unlearning, and is shown to be effective in reducing PII leakage. The findings suggest that privacy neurons are specific to certain categories of PII and that their localization can inform strategies for mitigating privacy risks in LLMs.
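To make the core idea concrete, the sketch below shows one common way to implement a learnable, approximately binary neuron mask with an $L_0$ penalty, using the hard-concrete relaxation of Louizos et al. (2018). This is not the authors' code: the class name `HardConcreteMask`, the toy training loop, the adversarial objective, and the loss weighting are all illustrative assumptions about how such a mask could be trained to suppress PII memorization while staying sparse.

```python
# Minimal sketch (assumed, not the paper's implementation) of an L0-regularized
# binary mask over a layer's neurons, trained adversarially so the gated model
# stops reproducing memorized PII tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HardConcreteMask(nn.Module):
    """Learnable, approximately binary gate per neuron with an L0 surrogate."""

    def __init__(self, n_neurons, beta=0.5, gamma=-0.1, zeta=1.1):
        super().__init__()
        self.log_alpha = nn.Parameter(torch.zeros(n_neurons))  # gate logits
        self.beta, self.gamma, self.zeta = beta, gamma, zeta

    def forward(self):
        if self.training:
            # Reparameterized sample from the hard-concrete distribution.
            u = torch.rand_like(self.log_alpha).clamp(1e-6, 1 - 1e-6)
            s = torch.sigmoid((u.log() - (1 - u).log() + self.log_alpha) / self.beta)
        else:
            s = torch.sigmoid(self.log_alpha)
        s = s * (self.zeta - self.gamma) + self.gamma  # stretch to (gamma, zeta)
        return s.clamp(0.0, 1.0)                       # clip to a gate in [0, 1]

    def l0_penalty(self):
        # Expected number of non-zero gates: differentiable surrogate for L0.
        shift = self.beta * torch.log(torch.tensor(-self.gamma / self.zeta))
        return torch.sigmoid(self.log_alpha - shift).sum()


# Toy usage: gate the hidden units feeding a frozen readout and train only the
# mask. The adversarial term pushes the gated model away from predicting the
# PII tokens; the L0 term keeps the number of deactivated neurons small.
hidden_dim, vocab = 64, 100
mlp_out = torch.randn(8, hidden_dim)          # stand-in for an MLP activation
readout = nn.Linear(hidden_dim, vocab)        # stand-in for the frozen LM head
pii_targets = torch.randint(0, vocab, (8,))   # stand-in for memorized PII tokens

mask = HardConcreteMask(hidden_dim)
opt = torch.optim.Adam(mask.parameters(), lr=1e-2)
for _ in range(100):
    gated = mlp_out * mask()                  # zeroed gates deactivate neurons
    probs = F.softmax(readout(gated), dim=-1)
    pii_prob = probs.gather(1, pii_targets.unsqueeze(1)).mean()
    loss = pii_prob + 1e-2 * mask.l0_penalty()  # illustrative weighting
    opt.zero_grad()
    loss.backward()
    opt.step()
```

After training, gates with near-zero values indicate candidate privacy neurons; in a full pipeline the mask would be applied inside the model's MLP layers (e.g., via forward hooks) and the remaining language-modeling loss on non-PII data would be included so general capability is preserved, in line with the trade-off the paper reports.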