Intrinsic Evaluation of Unlearning Using Parametric Knowledge Traces

17 Jun 2024 | Yihuai Hong¹, Lei Yu², Shauli Ravfogel³, Haiqin Yang⁴, Mor Geva⁵
This paper introduces CONCEPTVECTORS, a benchmark for evaluating unlearning methods in large language models (LLMs). The authors argue that current unlearning evaluations, which rely on behavioral tests, are insufficient because they do not account for residual knowledge that may remain in a model's parameters. Instead, they propose evaluating unlearning intrinsically, by examining changes in the parametric knowledge traces of the concepts being unlearned.

To this end, the authors develop a methodology for identifying concept vectors, i.e., parameter vectors that encode specific concepts, and use it to construct CONCEPTVECTORS, a benchmark of hundreds of common concepts and their parametric knowledge traces in two open-source LLMs. The benchmark pairs these intrinsic measurements with conventional behavioral evaluations.
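The intrinsic evaluation boils down to asking how much the identified parameters actually moved after unlearning. The following is a minimal sketch of that comparison, not the authors' implementation: it assumes a concept vector corresponds to one column of an MLP down-projection weight matrix (the paper locates concept vectors in the models' MLP layers) and uses toy tensors with a hypothetical column index.

```python
# Minimal sketch of an intrinsic (parameter-level) comparison, assuming a
# concept vector is one column of an MLP down-projection weight matrix.
# Layer/column indices here are hypothetical toy values.
import torch
import torch.nn.functional as F

def concept_vector_shift(w_before: torch.Tensor,
                         w_after: torch.Tensor,
                         col: int) -> dict:
    """Measure how much a designated concept vector changed between the
    original and the 'unlearned' model."""
    v0 = w_before[:, col].float()
    v1 = w_after[:, col].float()
    return {
        "cosine_similarity": F.cosine_similarity(v0, v1, dim=0).item(),
        "l2_shift": torch.norm(v1 - v0).item(),
        "relative_shift": (torch.norm(v1 - v0) / torch.norm(v0)).item(),
    }

# Synthetic demonstration: an unlearning method that only nudges the
# parameters leaves the concept vector essentially intact.
torch.manual_seed(0)
hidden, intermediate, concept_col = 64, 256, 17  # toy sizes, hypothetical index
w_orig = torch.randn(hidden, intermediate)
w_unlearned = w_orig + 1e-3 * torch.randn_like(w_orig)  # tiny parameter drift
print(concept_vector_shift(w_orig, w_unlearned, concept_col))
# cosine similarity stays close to 1.0, i.e., the parametric trace survives
```

In this framing, a high post-unlearning cosine similarity to the original concept vector is evidence of residual parametric knowledge, even if behavioral tests suggest the concept was forgotten.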
Evaluation on CONCEPTVECTORS shows that existing unlearning methods barely change the concept vectors: the models can still generate information about supposedly unlearned concepts, and this residual parametric knowledge can be exploited, for example through adversarial prompting, to recover the unlearned information. Common behavioral evaluations therefore overestimate how well these methods perform. In contrast, directly ablating the concept vectors effectively removes the associated knowledge and significantly reduces the models' susceptibility to adversarial manipulation, indicating that unlearning methods which target the relevant parametric knowledge traces erase knowledge more reliably.

The authors conclude that current unlearning methods are insufficient for removing parametric knowledge from LLMs and that future work should develop methods that do. They argue that evaluating unlearning at the parameter level is critical for ensuring the safety and reliability of deployed models.
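Direct ablation is conceptually simple once the knowledge trace is localized: overwrite or noise the identified parameter vector. The sketch below is an illustrative approximation under the same assumption as above (concept vector = one column of an MLP down-projection weight); it is not the authors' code, and the indices are hypothetical.

```python
# Sketch of directly ablating a concept vector by zeroing (or noising) the
# corresponding column of an MLP down-projection weight. Indices are toy values.
import torch

@torch.no_grad()
def ablate_concept_vector(down_proj_weight: torch.Tensor,
                          col: int,
                          mode: str = "zero",
                          noise_scale: float = 0.1) -> None:
    """Destroy the parametric trace of a concept in place."""
    if mode == "zero":
        down_proj_weight[:, col] = 0.0
    elif mode == "noise":
        down_proj_weight[:, col] += noise_scale * torch.randn_like(
            down_proj_weight[:, col])
    else:
        raise ValueError(f"unknown mode: {mode}")

# Toy usage. For a LLaMA-style checkpoint loaded with Hugging Face
# transformers, the analogous tensor would be
# model.model.layers[layer_idx].mlp.down_proj.weight, with the layer and
# column indices supplied by the benchmark's knowledge traces.
w = torch.randn(64, 256)
ablate_concept_vector(w, col=17, mode="zero")
assert torch.all(w[:, 17] == 0)
```

Because the edit acts on the parameters themselves rather than on the model's behavior under a particular prompt distribution, the removed knowledge cannot simply be re-elicited by rephrasing or adversarial prompting, which is the effect the benchmark's intrinsic evaluation is designed to detect.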