Offset Unlearning for Large Language Models

17 Apr 2024 | James Y. Huang, Wenxuan Zhou, Fei Wang, Fred Morstatter, Sheng Zhang, Hoifung Poon, Muhao Chen
This paper introduces δ-UNLEARNING, an offset unlearning framework for black-box large language models (LLMs) that requires no access to the model's internal weights. Unlike prior unlearning methods, which either need weight access or must retain the sensitive data at inference time, δ-UNLEARNING learns the logit offset needed for unlearning by contrasting the logits of a pair of smaller, white-box models. The framework steers the black-box LLM without modifying its parameters: the logit offset between the two smaller models is added to the logits of the larger model at decoding time. This lets δ-UNLEARNING effectively forget the target data while maintaining similar or even stronger performance on general, out-of-forget-scope tasks. The framework can also incorporate a variety of existing unlearning algorithms, making it a versatile way to adapt them to black-box LLMs. Experiments on the TOFU benchmark show that δ-UNLEARNING matches or exceeds direct fine-tuning in both forget quality and model utility, and that it remains effective across different underlying unlearning algorithms. Because no sensitive information needs to be stored after unlearning, the method also offers stronger privacy protection. Overall, δ-UNLEARNING is a strong alternative to direct fine-tuning, with matching or even superior forget quality and model utility.
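The decoding-time composition described above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's released code: the function name, variable names, and the toy setup are assumptions, and it presumes the three models share a vocabulary so their logits are directly comparable.

```python
import torch

def offset_logits(logits_large: torch.Tensor,
                  logits_small_unlearned: torch.Tensor,
                  logits_small_base: torch.Tensor) -> torch.Tensor:
    """Shift the black-box model's next-token logits by the offset
    between a small model fine-tuned for unlearning and its original
    counterpart. The large model's parameters are never modified."""
    delta = logits_small_unlearned - logits_small_base
    return logits_large + delta

# Toy example over a shared vocabulary of size 8 (hypothetical values).
vocab_size = 8
logits_large = torch.randn(vocab_size)          # black-box LLM logits
logits_small_base = torch.randn(vocab_size)     # original small white-box model
logits_small_unlearned = logits_small_base.clone()
logits_small_unlearned[3] -= 5.0                # unlearned model suppresses token 3

adjusted = offset_logits(logits_large, logits_small_unlearned, logits_small_base)
next_token = torch.softmax(adjusted, dim=-1).argmax().item()
```

In this sketch, the offset learned by the small model pair (here, suppressing token 3) transfers directly to the large model's output distribution, which is what allows unlearning without touching the black-box weights.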