26 Feb 2024 | Aengus Lynch, Phillip Guo, Aidan Ewart, Stephen Casper, Dylan Hadfield-Menell
This paper addresses the challenge of evaluating the robustness and competitiveness of unlearning techniques in large language models (LLMs). The authors survey existing unlearning evaluation methods and their limitations, and apply a comprehensive set of tests to the "Who's Harry Potter" (WHP) model from Eldan and Russinovich (2023). Key findings include:
1. **Generalization and Robustness**: The WHP model shows consistent generalization under the "Familiarity" metric, which measures the model's ability to complete Harry Potter-related prompts. However, higher-than-baseline amounts of knowledge can still be extracted with adversarial methods (a rough completion-scoring sketch appears after this list).
2. **Competitiveness**: The WHP model performs on par with the original model on Harry Potter Q&A tasks, indicating that unlearning does not significantly impact its performance.
3. **Latent Knowledge**: The WHP model retains latent knowledge comparable to the original model's, suggesting that unlearning methods may not completely remove all undesirable knowledge (see the probe sketch below).
4. **Side Effects**: The WHP model exhibits collateral unlearning in related domains, such as English mythology and Harry Potter film production, indicating potential unintended consequences.
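To make the Familiarity-style evaluation concrete, here is a minimal sketch of a completion test over Harry Potter prompts. It is an illustration only: the HuggingFace model id and the keyword-matching score are assumptions, and the actual Familiarity metric grades completions more carefully than simple keyword matching.

```python
# Rough sketch of a Familiarity-style completion test.
# The model id and keyword-matching proxy are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"  # assumed HF id for the WHP model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto"
)

# Harry Potter prompts paired with facts a knowledgeable model should mention.
probes = [
    ("Harry Potter's two best friends are", ["Ron", "Hermione"]),
    ("The school Harry Potter attends is called", ["Hogwarts"]),
]

hits = 0
for prompt, keywords in probes:
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=30, do_sample=False)
    # Decode only the newly generated tokens.
    completion = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    if any(k.lower() in completion.lower() for k in keywords):
        hits += 1

print(f"crude familiarity proxy: {hits}/{len(probes)} prompts completed with target facts")
```

The latent-knowledge finding can be illustrated with a simple linear probe over hidden activations, sketched below. This is not the authors' exact protocol; the model id, statements, probed layer, and probe type are all illustrative assumptions. The idea is that if a linear classifier can separate true from false statements about the domain, some knowledge remains linearly decodable from the model's activations.

```python
# Minimal sketch of probing hidden activations for residual domain knowledge.
# Statements, labels, and the choice of layer are illustrative assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Llama2-7b-WhoIsHarryPotter"  # assumed HF id for the WHP model

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, output_hidden_states=True, torch_dtype=torch.float16, device_map="auto"
)

# Statements about the target domain, labelled true (1) or false (0).
statements = [
    ("Hermione Granger is one of Harry Potter's closest friends.", 1),
    ("Harry Potter attends a school called Hogwarts.", 1),
    ("Harry Potter's best friend is named Frodo Baggins.", 0),
    ("Harry Potter attends a school called Rivendell Academy.", 0),
]

features, labels = [], []
with torch.no_grad():
    for text, label in statements:
        inputs = tokenizer(text, return_tensors="pt").to(model.device)
        hidden = model(**inputs).hidden_states[-1]            # last-layer activations
        features.append(hidden[0, -1].float().cpu().numpy())  # final-token representation
        labels.append(label)

# If a simple linear probe separates true from false statements,
# some domain knowledge is still linearly decodable from the activations.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("probe training accuracy:", probe.score(features, labels))
```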
The authors emphasize the importance of comprehensive and adversarial evaluations to ensure that unlearning techniques are robust and effective. They also highlight the need for further research to develop more robust unlearning methods that can deeply remove undesirable knowledge.