EIGHT METHODS TO EVALUATE ROBUST UNLEARNING IN LLMs

EIGHT METHODS TO EVALUATE ROBUST UNLEARNING IN LLMs

26 Feb 2024 | Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
This paper presents eight methods to evaluate robust unlearning in large language models (LLMs). The authors first survey existing techniques and limitations of unlearning evaluations. They then apply a comprehensive set of tests to evaluate the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). The WHP model is designed to unlearn knowledge related to the Harry Potter universe. The authors find that while WHP's unlearning generalizes well when evaluated with the "Familiarity" metric, it can still reliably extract more knowledge than baseline methods. Additionally, WHP performs on par with the original model on Harry Potter Q&A tasks and represents latent knowledge similarly to the original model. However, there is collateral unlearning in related domains. The authors test the WHP model using eight different methods to evaluate robust and competitive unlearning. These include testing the model's ability to generalize to other languages, testing its response to jailbreak prompts, testing in-context relearning, testing relearning through fine-tuning, testing downstream tasks, testing latent knowledge, comparing to a trivial prompting baseline, and testing side effects on similar domains. The results show that the WHP model's unlearning is robust and competitive, but there are limitations, including unintended collateral unlearning in related domains. The authors conclude that comprehensive unlearning evaluation is important to avoid ad-hoc metrics and to develop more robust unlearning techniques to deeply remove undesirable knowledge. They also highlight the importance of testing unlearning methods against adversarial evaluations, especially when unlearning is relied on for removing harmful tendencies or capabilities. The study complements previous research on jailbreaks, few-shot fine-tuning attacks, and representation-engineering to demonstrate a limitation of fine-tuning-based approaches to LLM alignment and unlearning.This paper presents eight methods to evaluate robust unlearning in large language models (LLMs). The authors first survey existing techniques and limitations of unlearning evaluations. They then apply a comprehensive set of tests to evaluate the robustness and competitiveness of unlearning in the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). The WHP model is designed to unlearn knowledge related to the Harry Potter universe. The authors find that while WHP's unlearning generalizes well when evaluated with the "Familiarity" metric, it can still reliably extract more knowledge than baseline methods. Additionally, WHP performs on par with the original model on Harry Potter Q&A tasks and represents latent knowledge similarly to the original model. However, there is collateral unlearning in related domains. The authors test the WHP model using eight different methods to evaluate robust and competitive unlearning. These include testing the model's ability to generalize to other languages, testing its response to jailbreak prompts, testing in-context relearning, testing relearning through fine-tuning, testing downstream tasks, testing latent knowledge, comparing to a trivial prompting baseline, and testing side effects on similar domains. The results show that the WHP model's unlearning is robust and competitive, but there are limitations, including unintended collateral unlearning in related domains. The authors conclude that comprehensive unlearning evaluation is important to avoid ad-hoc metrics and to develop more robust unlearning techniques to deeply remove undesirable knowledge. They also highlight the importance of testing unlearning methods against adversarial evaluations, especially when unlearning is relied on for removing harmful tendencies or capabilities. The study complements previous research on jailbreaks, few-shot fine-tuning attacks, and representation-engineering to demonstrate a limitation of fine-tuning-based approaches to LLM alignment and unlearning.
Reach us at info@study.space
Understanding Eight Methods to Evaluate Robust Unlearning in LLMs