Extracting Training Data from Large Language Models


15 Jun 2021 | Nicholas Carlini1, Florian Tramèr2, Eric Wallace3, Matthew Jagielski4, Ariel Herbert-Voss5,6, Katherine Lee1, Adam Roberts1, Tom Brown5, Dawn Song3, Úlfar Erlingsson7, Alina Oprea4, Colin Raffel1
This paper demonstrates that large language models (LMs) can be exploited to recover individual training examples through a *training data extraction attack*. The authors show that even though these models are trained on massive datasets, they can still memorize and leak sensitive information from their training data. Specifically, they propose a method to extract verbatim sequences from a language model's training set using only black-box query access. They evaluate the attack on GPT-2, a language model trained on scrapes of the public Internet, and successfully extract hundreds of verbatim text sequences, including personally identifiable information, IRC conversations, code, and 128-bit UUIDs. The attack is effective even when a sequence appears in only a single document in the training data. The authors also analyze the factors contributing to the attack's success, finding that larger models are more vulnerable than smaller ones. Finally, they discuss practical strategies to mitigate privacy leakage, such as differentially-private training and careful document deduplication.
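The core recipe behind the attack, generating many samples from the model and flagging the ones the model assigns suspiciously low perplexity, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration using the Hugging Face `transformers` GPT-2 checkpoint, not the authors' full pipeline (which also filters candidates with reference metrics such as a smaller model's perplexity and zlib compression, and de-duplicates generations); helper names like `generate_candidates` are our own.

```python
# Minimal sketch of the "sample, then rank by perplexity" idea behind
# training data extraction. Low-perplexity samples are candidates for
# memorized training text and would still need manual verification.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device).eval()

def perplexity(text: str) -> float:
    """Perplexity of `text` under the model (lower = model is more 'sure')."""
    ids = tokenizer(text, return_tensors="pt").input_ids.to(device)
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

def generate_candidates(n_samples: int = 20, max_length: int = 64) -> list[str]:
    """Unconditionally sample short sequences from the model with top-k sampling."""
    prompt = tokenizer(tokenizer.eos_token, return_tensors="pt").input_ids.to(device)
    outputs = model.generate(
        prompt,
        do_sample=True,
        top_k=40,
        max_length=max_length,
        num_return_sequences=n_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

if __name__ == "__main__":
    candidates = generate_candidates()
    ranked = sorted(((perplexity(t), t) for t in candidates), key=lambda x: x[0])
    # Inspect the lowest-perplexity samples; in the paper these are the ones
    # most likely to be verbatim copies of training documents.
    for ppl, text in ranked[:5]:
        print(f"{ppl:8.2f}  {text[:80]!r}")
```

In practice the paper scales this up to hundreds of thousands of generations and uses several membership-inference style scores rather than raw perplexity alone; the sketch only conveys the black-box sample-and-rank structure of the attack.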