ALPACA AGAINST VICUNA: Using LLMs to Uncover Memorization of LLMs

2025-02-09 | Aly M. Kassem, Omar Mahmoud, Niloofar Mireshghallah, Hyunwoo Kim, Yulia Tsvetkov, Yejin Choi, Sherif Saad, Santu Rana
This paper investigates the impact of instruction tuning on memorization in large language models (LLMs), showing that instruction-tuned models can expose pre-training data as much as, or more than, their base models. The authors propose a black-box prompt-optimization method in which an attacker LLM uncovers higher levels of memorization in a victim LLM by designing instruction-based prompts that minimize the prompt's overlap with the training data while maximizing the overlap between the victim's output and that data. The method achieves 23.7% more overlap with training data than state-of-the-art baselines. Two attack settings are explored: an analytical setting that establishes an empirical upper bound for the attack, and a practical, classifier-based setting that assesses memorization without access to the memorized data. The findings also show that contexts other than the original training data can elicit memorized pre-training content, highlighting the need for improved privacy measures.

The authors demonstrate the real-world applicability of the method through four case studies: extracting copyrighted material, privacy auditing of LLMs, analyzing refusal behavior, and building a classifier that detects memorization-triggering prompts. The method yields higher memorization scores on copyright-related queries and improves privacy-auditing performance. The study further shows that LLMs do not refuse copyright-related queries generated by the method, demonstrating high adversarial effectiveness, and that the classifier reliably detects prompts that trigger memorized data without requiring the actual response, making it the more practical option. The paper concludes that instruction-tuned LLMs can expose pre-training data more readily than base models and calls for further research into automated model auditing and probing to develop more efficient data-reconstruction methods.
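The optimization objective described above can be illustrated with a minimal sketch. This is not the authors' implementation: `victim_generate` is a hypothetical stand-in for the black-box victim LLM, the word-level `difflib` ratio replaces the paper's actual overlap metric, and a single scoring round stands in for the iterative attacker-LLM loop.

```python
import difflib
from typing import Callable, List, Tuple

def overlap(a: str, b: str) -> float:
    """Rough lexical overlap between two strings (word-level match ratio).
    The paper defines its own overlap metric over training data; this is
    only an illustrative stand-in."""
    return difflib.SequenceMatcher(None, a.split(), b.split()).ratio()

def score_prompt(prompt: str,
                 victim_generate: Callable[[str], str],
                 reference: str) -> float:
    """Objective sketched in the summary: reward prompts whose *output*
    overlaps heavily with a (pre-)training reference while the *prompt
    itself* does not, so the attack is not simply echoing the data back."""
    output = victim_generate(prompt)
    return overlap(output, reference) - overlap(prompt, reference)

def black_box_attack(candidates: List[str],
                     victim_generate: Callable[[str], str],
                     reference: str) -> Tuple[str, float]:
    """One round of the attacker loop: pick the candidate prompt that
    maximizes the memorization objective."""
    scored = [(p, score_prompt(p, victim_generate, reference)) for p in candidates]
    return max(scored, key=lambda t: t[1])

if __name__ == "__main__":
    # Toy stand-ins: `reference` plays the role of a memorized training
    # snippet and `victim_generate` the black-box victim LLM (a trivial stub).
    reference = "the quick brown fox jumps over the lazy dog"
    victim_generate = lambda prompt: reference if "fox" in prompt else "no idea"
    candidates = [
        "Complete the well-known pangram about a fox.",  # low prompt-data overlap
        "the quick brown fox jumps over",                # high prompt-data overlap: penalized
    ]
    best_prompt, best_score = black_box_attack(candidates, victim_generate, reference)
    print(best_prompt, round(best_score, 3))
```

In the actual attack, the candidate prompts would be proposed and refined by the attacker LLM using this feedback signal, rather than drawn from a fixed list as in this toy example.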