April 10, 2024 | Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei
Negative Preference Optimization (NPO) is a novel method for unlearning sensitive data from large language models (LLMs). Unlike gradient ascent (GA), which often leads to catastrophic collapse—where model performance degrades rapidly during unlearning—NPO provides a more stable and effective approach. NPO is inspired by preference optimization and uses only negative samples to guide the unlearning process. Theoretical analysis shows that minimizing the NPO loss leads to exponentially slower divergence than GA, making it more robust.

Experiments on synthetic data and the TOFU benchmark demonstrate that NPO-based methods achieve a better balance between unlearning undesirable data and maintaining model utility. On the TOFU dataset, NPO-based methods are the first to achieve reasonable unlearning results when forgetting 50% or more of the training data, while existing methods struggle with even 10%. NPO also generates more sensible outputs than GA-based methods, which often produce gibberish. The method reduces the model's reliance on the forgotten data while preserving its ability to perform other tasks. NPO is a simple yet powerful approach that addresses the limitations of existing unlearning techniques, offering a promising solution for managing sensitive data in LLMs.
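To make the "preference optimization with only negative samples" idea concrete, here is a minimal PyTorch sketch of what such a loss could look like, assuming it takes the DPO-style form restricted to the dispreferred (forget) term, i.e. (2/β)·log(1 + (π_θ(y|x)/π_ref(y|x))^β) averaged over the forget set. The function name `npo_loss`, the summed-log-probability inputs, and the default `beta` are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn.functional as F

def npo_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Negative-only preference loss on a batch of forget-set sequences.

    policy_logps: summed token log-probs of each forget sequence under the
                  current model, shape (batch,).
    ref_logps:    the same quantity under the frozen reference model, shape (batch,).
    beta:         inverse temperature controlling how aggressively the
                  forget data is pushed down (illustrative default).
    """
    # log-ratio log(pi_theta(y|x) / pi_ref(y|x)) for each forget sequence
    log_ratio = policy_logps - ref_logps
    # (2/beta) * log(1 + (pi_theta/pi_ref)^beta)
    #   = -(2/beta) * logsigmoid(-beta * log_ratio)
    return (-2.0 / beta) * F.logsigmoid(-beta * log_ratio).mean()
```

Loosely, gradient descent on this loss pushes down the probability of forget-set sequences, but the push is weighted by how likely the model still finds them relative to the reference: as the forget data becomes unlikely, the gradient shrinks instead of staying constant as in plain GA, which is consistent with the slower-divergence behavior described above.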