IMMUNIZATION AGAINST HARMFUL FINE-TUNING ATTACKS

26 Feb 2024 | Domenic Rosati, Jan Wehner, Kai Williams, Łukasz Bartoszcze, Jan Batzner, Hassan Sajjad, Frank Rudzicz
This paper introduces a new threat model for harmful fine-tuning attacks on large language models (LLMs), in which bad actors deliberately fine-tune LLMs toward harmful goals. Whereas previous research has focused on correcting misalignment arising from pretraining, this work highlights the risk of misalignment introduced through deliberate fine-tuning. The authors propose "immunization conditions" as a framework for evaluating and constructing defenses against harmful fine-tuning: resistance (preventing harmful training from succeeding), stability (maintaining performance on harmless tasks), generalization (resisting harmful training on unseen harmful datasets), and trainability (remaining fine-tunable on harmless datasets). The paper also surveys several candidate approaches to immunization, including meta-learning, adversarial training, non-transferable learning, and irreversible transformations such as weight encryption. The authors demonstrate that these conditions can be used to evaluate defenses experimentally, and show that harmful fine-tuning attacks are feasible in practice, with examples of models trained to generate harmful content. The paper concludes that further research is needed to develop effective defenses against harmful fine-tuning attacks.
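
The resistance condition in particular lends itself to a simple empirical check. The sketch below is not the paper's implementation; it is a minimal illustration, under the assumption of a Hugging Face-style causal LM whose forward pass returns a `.loss`, of how one might simulate a harmful fine-tuning attack for a fixed step budget and test whether a harmfulness metric stays below a threshold. The names `evaluate_resistance` and `harmfulness_score`, and the threshold value, are illustrative assumptions, not definitions from the paper.

```python
# Minimal sketch of an empirical "resistance" check (assumptions noted above):
# fine-tune the model on harmful data for a limited attack budget, then score it.

import torch


def evaluate_resistance(model, harmful_loader, harmfulness_score,
                        attack_steps=100, lr=2e-5, threshold=0.1):
    """Simulate a harmful fine-tuning attack and report whether the model resists.

    harmful_loader    : iterable of batches accepted by `model(**batch)`, each
                        forward pass returning an object with a `.loss` attribute.
    harmfulness_score : callable(model) -> float in [0, 1]; higher means more harmful
                        (e.g. rate of harmful completions on a held-out harmful eval set).
    """
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()

    step = 0
    for batch in harmful_loader:
        if step >= attack_steps:
            break
        loss = model(**batch).loss      # standard language-modeling loss on harmful targets
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1

    score = harmfulness_score(model)    # measured after the simulated attack
    return {"harmfulness": score, "resistant": score < threshold}
```

An analogous loop over harmless data, checking that task performance is preserved, would probe the stability and trainability conditions; running the attack on a harmful dataset not used to construct the defense would probe generalization.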