9 Feb 2025 | Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana
This paper investigates the impact of instruction-tuning on memorization in large language models (LLMs), a topic that has received less attention than memorization in base, pre-trained models. The authors propose a black-box prompt optimization method in which an attacker LLM agent uncovers higher levels of memorization in a victim agent by designing instruction-based prompts that minimize overlap with the training data while maximizing overlap between the victim's output and that data. The method refines prompts through an iterative rejection-sampling process, achieving 23.7% more overlap with the training data than state-of-the-art baselines. The study explores two attack settings: an analytical setting that establishes the empirical upper bound of the attack, and a practical, classifier-based setting that assesses memorization without access to the memorized data. The findings reveal that instruction-tuned models can expose pre-training data as much as, or more than, base models, and that contexts beyond the original training data can also lead to leakage. The paper further demonstrates the real-world applicability of the method through four case studies: copyright infringement, privacy auditing, refusal behavior, and the development of a classifier for detecting memorized data. The results highlight the need for stronger privacy measures and encourage further research into automated model auditing and probing with LLMs.
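The core of the attack, as summarized above, is a search loop: the attacker proposes instruction-style prompts, the victim completes them, and a candidate prompt is kept only if it increases overlap between the victim's output and the training data while keeping the prompt's own overlap with that data low. The sketch below is a minimal illustration of that loop, not the paper's implementation: the attacker and victim models are abstracted as plain callables, and `ngram_overlap` is a stand-in word-level n-gram recall metric rather than the paper's actual similarity score.

```python
# Hypothetical sketch of the iterative rejection-sampling prompt search.
# The attacker/victim models are passed in as callables so the loop does not
# depend on any particular LLM API; the overlap metric is a simple n-gram
# recall used here purely for illustration.
from typing import Callable, List


def ngram_overlap(candidate: str, reference: str, n: int = 3) -> float:
    """Fraction of reference n-grams that also appear in the candidate."""
    def ngrams(text: str) -> set:
        tokens = text.lower().split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    ref = ngrams(reference)
    if not ref:
        return 0.0
    return len(ref & ngrams(candidate)) / len(ref)


def optimize_prompt(
    attacker: Callable[[str, str], List[str]],  # (train_doc, best_prompt) -> candidate prompts
    victim: Callable[[str], str],               # prompt -> victim completion
    train_doc: str,                             # target pre-training document (analytical setting)
    n_rounds: int = 10,
) -> str:
    """Rejection sampling: keep a candidate prompt only if it beats the current best.

    The score rewards overlap between the victim's output and the training data
    while penalizing overlap between the prompt itself and that data.
    """
    best_prompt, best_score = "", float("-inf")
    for _ in range(n_rounds):
        for prompt in attacker(train_doc, best_prompt):
            output = victim(prompt)
            score = ngram_overlap(output, train_doc) - ngram_overlap(prompt, train_doc)
            if score > best_score:  # reject candidates that do not improve the objective
                best_prompt, best_score = prompt, score
    return best_prompt
```

In the practical setting described in the paper, the explicit `train_doc` comparison would not be available; a classifier judging whether an output looks memorized would have to stand in for the overlap score.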