TOFU: A Task of Fictitious Unlearning for LLMs

11 Jan 2024 | Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C. Lipton, J. Zico Kolter
TOFU (Task of Fictitious Unlearning) is a benchmark designed to evaluate the effectiveness of unlearning methods for large language models (LLMs). Unlearning aims to make an LLM forget sensitive or private information from its training data, addressing legal and ethical concerns. TOFU introduces a dataset of 200 synthetic author profiles, each with 20 question-answer pairs, and a subset called the "forget set" that serves as the target for unlearning. The benchmark includes three severity levels that vary the number of authors to be forgotten, along with constraints on computational resources.

The evaluation scheme combines metrics for both model utility and forget quality. Model utility is measured using performance on four datasets: the Forget Set, the Retain Set, Real Authors, and World Facts. Forget quality is assessed with the Truth Ratio, which compares the probabilities the model assigns to true versus false answers on the forget set. A Kolmogorov-Smirnov (KS) test then compares the truth-ratio distribution of an unlearned model with that of a retain model, i.e., a model trained only on the retain data.

Four baseline unlearning methods are evaluated: Gradient Ascent, Gradient Difference, KL Minimization, and Preference Optimization. These methods aim to reduce the likelihood of correct predictions on the forget set while maintaining performance on the retain set. However, the baselines show limited effectiveness, with high computational costs and poor forget quality. The study highlights the challenges of unlearning, including the trade-off between forget quality and model utility, the difficulty of achieving high forget quality, and the damage unlearning inflicts on pre-trained knowledge. The results motivate further research into more effective unlearning techniques.
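As a concrete illustration of the forget-quality evaluation described above, here is a minimal sketch in Python. It assumes length-normalized answer probabilities have already been extracted from the model; the `truth_ratio` helper and the toy distributions are hypothetical placeholders, not the authors' released code.

```python
# Sketch of TOFU-style forget-quality evaluation (illustrative only).
import numpy as np
from scipy.stats import ks_2samp

def truth_ratio(p_true, p_false):
    """Ratio of the mean (length-normalized) probability of false, perturbed
    answers to the probability of the true, paraphrased answer."""
    return float(np.mean(p_false)) / p_true

# Toy truth-ratio samples over the forget set for an unlearned model and a
# retain model (trained without the forget set); these values are synthetic.
rng = np.random.default_rng(0)
tr_unlearned = rng.beta(2, 5, size=400)
tr_retain = rng.beta(2, 4, size=400)

# Forget quality: a two-sample KS test on the two truth-ratio distributions.
# A high p-value means the unlearned model is statistically indistinguishable
# from the retain model on the forget set, i.e., good forget quality.
stat, p_value = ks_2samp(tr_unlearned, tr_retain)
print(f"KS statistic = {stat:.3f}, p-value = {p_value:.3f}")
```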
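The first two baselines amount to simple modifications of the training objective. Below is a minimal sketch of Gradient Ascent and Gradient Difference, assuming a Hugging Face-style causal LM whose forward pass returns a `.loss` when labels are included in the batch; the function and variable names are illustrative, not the paper's implementation.

```python
def gradient_ascent_loss(model, forget_batch):
    """Gradient Ascent: maximize the next-token loss on the forget set
    by descending on its negation."""
    return -model(**forget_batch).loss

def gradient_difference_loss(model, forget_batch, retain_batch):
    """Gradient Difference: ascend on the forget set while simultaneously
    descending on the retain set to preserve model utility."""
    return -model(**forget_batch).loss + model(**retain_batch).loss

# One optimization step (sketch); `model`, `forget_batch`, `retain_batch`, and
# `optimizer` are placeholders for a causal LM, tokenized QA batches with
# labels, and a standard PyTorch optimizer:
#   loss = gradient_difference_loss(model, forget_batch, retain_batch)
#   loss.backward()
#   optimizer.step()
#   optimizer.zero_grad()
```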