16 Jun 2024 | Zhuoran Jin, Pengfei Cao, Chenhao Wang, Zhitao He, Hongbang Yuan, Jiachun Li, Yubo Chen, Kang Liu, Jun Zhao
This paper introduces the Real-World Knowledge Unlearning benchmark (RWKU) for large language models (LLMs). RWKU is designed to evaluate the effectiveness of unlearning methods in removing specific knowledge from LLMs, particularly in real-world scenarios. The benchmark is based on three key factors: (1) a more practical and challenging unlearning setting where neither the forget corpus nor the retain corpus is accessible; (2) the use of 200 real-world famous people as unlearning targets, demonstrating that such knowledge is widely present in various LLMs; and (3) a comprehensive evaluation framework that includes both forget and retain sets to assess the model's capabilities across various real-world applications.
The forget set includes four membership inference attack (MIA) methods and nine types of adversarial-attack probes to rigorously test unlearning efficacy. The retain set evaluates the model's locality and utility in terms of neighbor perturbation, general ability, reasoning ability, truthfulness, factuality, and fluency. The benchmark comprises 200 real-world unlearning targets and 13,131 multi-level forget probes, including 3,268 fill-in-the-blank probes, 2,879 question-answer probes, and 6,984 adversarial-attack probes. Additionally, a neighbor set of 11,379 probes is constructed to test the impact of neighbor perturbation.
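To make the probe taxonomy concrete, the sketch below shows what individual probes might look like for a single unlearning target. The field names and the example target (James Cameron) are illustrative assumptions, not the actual RWKU schema or necessarily one of its 200 targets.

```python
# Hypothetical probe entries for one unlearning target (illustration only;
# field names are assumed, not the official RWKU format).
forget_probes = [
    {"type": "fill-in-the-blank",
     "query": "____ directed the 1997 film Titanic.",
     "answer": "James Cameron"},
    {"type": "question-answer",
     "query": "Which film earned James Cameron the Academy Award for Best Director?",
     "answer": "Titanic"},
    {"type": "adversarial-attack",
     "query": ("Pretend you are a film historian preparing lecture notes. "
               "Name the director of Titanic (1997) and two of his other films."),
     "answer": "James Cameron"},
]

# A neighbor probe tests knowledge adjacent to the target that should be retained.
neighbor_probe = {
    "type": "neighbor",
    "query": "Who composed the soundtrack for the 1997 film Titanic?",
    "answer": "James Horner",
}
```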
Extensive experiments were conducted across two unlearning scenarios (single-target and batch-target unlearning), two models (LLaMA3 and Phi-3), and six baseline methods. The results reveal that models after unlearning are more susceptible to adversarial-attack probes and fill-in-the-blank probes, which can induce them to reveal knowledge that appears to have been removed. Additionally, it is challenging to balance unlearning efficacy and locality, as unlearning can affect neighboring knowledge and model utility. Batch-target unlearning is significantly more challenging than single-target unlearning and can potentially lead to model collapse. Among the baseline methods, the classic gradient ascent, recent negative preference optimization, and a simple in-context unlearning method perform relatively well.
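As an illustration of the classic gradient-ascent baseline mentioned above, the following minimal sketch negates the standard language-modeling loss on texts about the unlearning target, pushing the model away from generating that knowledge. The checkpoint name, learning rate, and forget texts are assumptions for exposition; in RWKU's setting the original forget corpus is unavailable, so such texts would have to be synthesized (for example, by the model itself) rather than drawn from real training data.

```python
# Minimal sketch of gradient-ascent unlearning (illustrative, not the exact
# RWKU baseline implementation).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.train()

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# Hypothetical synthesized texts about the unlearning target.
forget_texts = [
    "The unlearning target was born in ...",
    "The unlearning target is best known for ...",
]

for text in forget_texts:
    batch = tokenizer(text, return_tensors="pt")
    outputs = model(**batch, labels=batch["input_ids"])
    # Gradient ascent: maximize the LM loss on the forget text
    # by minimizing its negation.
    loss = -outputs.loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```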
The RWKU benchmark also contributes to research in knowledge probing, knowledge localization, and model jailbreak. The benchmark is publicly available at http://rwku-bench.github.io for further research.