April 10, 2024 | Ruiqi Zhang, Licong Lin, Yu Bai, Song Mei
The paper "Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning" addresses the challenge of unlearning undesirable data from large language models (LLMs) while preserving their utility on other tasks. Traditional methods, which often rely on gradient ascent (GA), suffer from catastrophic collapse, where the model's performance rapidly deteriorates during the unlearning process. To tackle this issue, the authors propose Negative Preference Optimization (NPO), a simple and effective alignment-inspired method.
NPO is theoretically shown to diverge toward catastrophic collapse exponentially more slowly than GA, which mitigates the collapse problem. Experiments on synthetic data and the TOFU benchmark demonstrate that NPO-based methods strike a better balance between unlearning the undesirable data and preserving model utility. Notably, NPO-based methods generate more sensible outputs than GA-based methods, which often produce gibberish. On TOFU, NPO-based methods are the first to achieve reasonable unlearning results when forgetting 50% or more of the training data, a significant improvement over existing methods, which struggle to forget even 10%.
The paper also provides a theoretical analysis of the divergence speeds of NPO and GA, formalizing why NPO diverges exponentially slower. Additionally, the authors evaluate NPO-based methods on a range of unlearning tasks, highlighting their superior performance in terms of forget quality and model utility. Together, the results suggest that NPO is a promising approach for effective and stable unlearning in LLMs.
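The gradient comparison below gives the intuition behind the slower divergence. It is a sketch derived from the NPO objective above (not a reproduction of the paper's proofs): NPO's per-example gradient is the gradient-ascent direction scaled by an adaptive weight that vanishes as the forget example becomes unlikely under the current model.

```latex
% Per-example NPO loss and its gradient, versus gradient ascent (GA).
\begin{align*}
\ell_{\mathrm{NPO}}(\theta)
  &= \frac{2}{\beta}\,
     \log\!\Bigl(1 + \bigl(\tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}\bigr)^{\beta}\Bigr), \\
\nabla_\theta \ell_{\mathrm{NPO}}(\theta)
  &= \underbrace{\frac{2\,\pi_\theta(y\mid x)^{\beta}}
                      {\pi_\theta(y\mid x)^{\beta} + \pi_{\mathrm{ref}}(y\mid x)^{\beta}}}_{\text{adaptive weight}\;\to\;0\ \text{as}\ \pi_\theta \to 0}
     \;\nabla_\theta \log \pi_\theta(y\mid x).
\end{align*}
% GA uses the unweighted direction \nabla_\theta \log \pi_\theta(y\mid x), so it keeps
% pushing \pi_\theta toward zero at full strength; NPO's weight shrinks the updates
% once the forget examples are already unlikely, slowing the drift toward collapse.
```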