17 Jun 2024 | Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, Mor Geva
The paper "Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces" by Yihuai Hong, Lei Yu, Shauli Ravfogel, Haiqin Yang, and Mor Geva introduces a novel methodology for evaluating unlearning methods in large language models (LLMs). The authors argue that current evaluation methods, which primarily rely on behavioral tests, do not effectively capture the internal changes in the model's parameters, which can still contain residual knowledge about unlearned concepts. To address this issue, they propose a general methodology to identify "concept vectors" within the model's parameters, which encode specific concepts. They construct the CONCEPTVECTORS benchmark, a dataset containing hundreds of common concepts and their parametric knowledge traces in two open-source LLMs, LLaMA and OLMo.
The paper evaluates various unlearning methods, including gradient-based unlearning, preference-based optimization, and parameter-specific interventions, using the CONCEPTVECTORS benchmark. The results show that while existing unlearning methods significantly reduce the model's ability to generate information about unlearned concepts, they only minimally affect the parametric knowledge traces associated with these concepts. Directly ablating these concept vectors, however, effectively removes the associated knowledge and reduces the model's susceptibility to adversarial manipulation.
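Building on the sketch above, the snippet below illustrates, again only schematically, the two sides of this comparison: an intrinsic check of how much an unlearning method actually moved the concept vector, and the direct ablation baseline that simply zeroes it out. The cosine-based shift metric and the in-place zeroing are assumptions for illustration, not the paper's exact procedures.

```python
# Continuing the previous snippet (same hypothetical layer and vector index).
# The unlearning step itself is elided; this only shows how one could measure
# whether it moved the parameter vector, and what direct ablation looks like.
import torch.nn.functional as F

W_out = model.transformer.h[layer].mlp.c_proj.weight
before = W_out[vec_id].detach().clone()                # snapshot of the concept vector

# ... run an unlearning method here (gradient ascent, preference optimization, ...) ...

# Intrinsic evaluation: how far did the concept vector actually move?
shift = 1 - F.cosine_similarity(before, W_out[vec_id], dim=0).item()
print(f"parametric shift: {shift:.4f}")                # near 0 => knowledge trace largely intact

# Direct ablation: zero the vector so this parameter can no longer write the
# concept's information into the residual stream.
with torch.no_grad():
    W_out[vec_id].zero_()
```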
The authors also demonstrate that residual knowledge can be exploited to recover unlearned information through adversarial prompts, highlighting the importance of erasing parametric knowledge for robust unlearning. Their findings call for future work to develop more comprehensive and robust unlearning methods that target parametric knowledge traces. The paper concludes by releasing the CONCEPTVECTORS benchmark and code to support further research in this area.