26 Feb 2024 | Domenic Rosati, Jan Wehner, Kai Williams, Lukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
This paper addresses the emerging threat of harmful fine-tuning attacks on large language models (LLMs), in which bad actors intentionally fine-tune LLMs toward harmful goals. The authors propose a set of conditions for an effective defense against such attacks, termed "Immunization Conditions": resistance, stability, generalization, and trainability. Building on these, the paper offers a formal framework for understanding and constructing defenses against harmful fine-tuning, synthesizing several research directions aimed at preventing these attacks, and presents experimental demonstrations of how the conditions can be applied in practice. The threat model assumes an attacker with a limited compute budget and access to training data, and focuses on white-box settings where the attacker has full control over the training process. The paper concludes by highlighting the importance of reducing the dual-use risk of both capabilities and alignment research in LLMs.
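The four Immunization Conditions lend themselves to a compact formalization. The sketch below is illustrative only and uses assumed notation (the defended model \(\theta_0\), original model \(\theta_{\mathrm{orig}}\), attacker fine-tuning procedure \(A\), harmful dataset \(D_h\), benign dataset \(D_b\), compute budget \(B\), harm metric \(h\), benign-task performance \(g\), and thresholds \(\epsilon, \delta\) are not the paper's exact symbols); consult the paper's formal framework for the precise statements.

```latex
% Illustrative sketch of the immunization conditions (assumed notation, not the paper's exact definitions):
% \theta_0: defended (immunized) model, \theta_orig: undefended model,
% A(\theta, D, B): attacker fine-tuning of \theta on dataset D within budget B,
% h(\cdot): harm metric, g(\cdot): benign-task performance, \epsilon, \delta: tolerances.
\begin{align*}
\textbf{Resistance:}\quad
  & h\big(A(\theta_0, D_h, B)\big) \le \epsilon
    && \text{(an attack within budget $B$ cannot raise harm above $\epsilon$)} \\
\textbf{Stability:}\quad
  & g(\theta_0) \ge g(\theta_{\mathrm{orig}}) - \delta
    && \text{(the defense preserves performance on harmless tasks)} \\
\textbf{Generalization:}\quad
  & h\big(A(\theta_0, D_h', B)\big) \le \epsilon \ \ \text{for harmful } D_h' \text{ unseen during defense} \\
\textbf{Trainability:}\quad
  & g\big(A(\theta_0, D_b, B)\big) \ge g\big(A(\theta_{\mathrm{orig}}, D_b, B)\big) - \delta
    && \text{(benign fine-tuning still succeeds)}
\end{align*}
```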