RWKU: Benchmarking Real-World Knowledge Unlearning for Large Language Models

16 Jun 2024 | Zhuoran Jin1,2, Pengfei Cao1,2, Chenhao Wang1,2, Zhitao He1,2, Hongbang Yuan1,2, Jiachun Li1,2, Yubo Chen1,2, Kang Liu1,2, Jun Zhao1,2
The paper introduces the Real-World Knowledge Unlearning benchmark (RWKU) for large language models (LLMs). RWKU is designed to address the challenge of removing sensitive, copyrighted, and harmful knowledge from LLMs. The benchmark is built around three key factors: a practical task setting, a real-world knowledge source, and an evaluation framework.

Specifically, RWKU considers a zero-shot knowledge unlearning setting in which neither the forget corpus nor the retain corpus is provided. It uses 200 real-world famous people as unlearning targets, demonstrating that such knowledge is widely present in various LLMs. The evaluation framework includes a forget set and a retain set to assess unlearning efficacy and model utility. The forget set evaluates unlearning through membership inference attacks (MIAs) and adversarial attacks, while the retain set assesses locality and utility in terms of neighbor perturbation, general ability, reasoning ability, truthfulness, factuality, and fluency.

Extensive experiments across two unlearning scenarios, two models, and six baseline methods reveal several findings, including the effectiveness of adversarial attacks, the trade-off between unlearning efficacy and locality, and the impact on model utility. The authors release the benchmark and code publicly to facilitate future research.
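
To make the MIA-based efficacy probe concrete, below is a minimal sketch of a loss-based membership inference check, a standard technique for this kind of evaluation. The model name, probe sentences, and the `mia_loss` helper are illustrative assumptions for this sketch, not the RWKU release itself.

```python
# Minimal sketch of a loss-based membership inference probe (an assumption
# about how MIA-style unlearning checks are commonly implemented; not RWKU's
# released code). A markedly lower loss on target-related text than on
# unrelated text suggests the knowledge is still present after unlearning.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"  # assumed; any causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
model.eval()

def mia_loss(text: str) -> float:
    """Average per-token negative log-likelihood of `text` under the model."""
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy loss over the sequence.
        outputs = model(**inputs, labels=inputs["input_ids"])
    return outputs.loss.item()

# Hypothetical probes: a fact about an unlearning target vs. a neighbor fact
# that should be retained (both sentences are illustrative examples).
forget_probe = "Stephen King is an American author of horror novels."
retain_probe = "J. K. Rowling is a British author of fantasy novels."
print(mia_loss(forget_probe), mia_loss(retain_probe))
```

In this sketch, an effective unlearning method should raise the loss on the forget probe toward that of unseen text while leaving the retain probe's loss unchanged, mirroring the efficacy/locality trade-off the paper reports.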