2024 | Danny Halawi, Alexander Wei, Eric Wallace, Tony T. Wang, Nika Haghtalab, Jacob Steinhardt
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
This paper introduces covert malicious finetuning, a method to compromise model safety via finetuning while evading detection. The method constructs a malicious dataset where every individual datapoint appears innocuous, but finetuning on the dataset teaches the model to respond to encoded harmful requests with encoded harmful responses. Applied to GPT-4, the method produces a finetuned model that acts on harmful instructions 99% of the time and avoids detection by defense mechanisms such as dataset inspection, safety evaluations, and input/output classifiers. The findings question whether black-box finetuning access can be secured against sophisticated adversaries.
The paper discusses the challenges of safeguarding large language models (LLMs) against malicious finetuning. It highlights the risks of finetuning access, which lets users modify model weights and thereby introduce harmful behavior. The threat model assumes an attacker with access to a model provider's finetuning API who can upload a dataset of prompt-response pairs for an LLM to finetune on. The attacker's goal is to make the model exhibit harmful behavior that violates the provider's terms of service or that the model has been trained to avoid.
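To make the threat model concrete, here is a minimal sketch of the attacker's input artifact: a file of prompt-response pairs in the chat-style JSONL format that OpenAI's finetuning API accepts. The pairs shown are innocuous placeholders, not examples from the paper's dataset.

```python
import json

# Sketch of the threat model's input artifact: a finetuning dataset of
# prompt-response pairs in chat-style JSONL. The pairs below are innocuous
# placeholders, not the paper's data.
pairs = [
    ("Summarize the water cycle in one sentence.",
     "Water evaporates, condenses into clouds, and returns as precipitation."),
    ("Give a synonym for 'rapid'.",
     "Swift."),
]

with open("finetune_dataset.jsonl", "w") as f:
    for prompt, response in pairs:
        record = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": response},
            ]
        }
        f.write(json.dumps(record) + "\n")
```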
The paper introduces covert malicious finetuning, a finetuning attack that undoes safety training and elicits harmful behavior without detection. The attack is covert because it evades the defenses the paper considers (dataset inspection, safety evaluations, and input/output classifiers), and it is malicious because it can elicit arbitrary harmful behaviors from the model. The core idea is to teach the model to communicate harmful messages in an encoding that only the attacker and the finetuned model understand. The method has two phases: Phase I (learning the encoding) and Phase II (malicious finetuning). In Phase I, the model is taught an encoding it did not previously know; in Phase II, it is finetuned for harm on encoded harmful inputs and outputs.
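As an illustration of the kind of encoding taught in Phase I, the sketch below builds a simple substitution cipher from a seeded random permutation of the alphabet, in the spirit of the paper's Walnut53 cipher. The seeding mechanism, alphabet handling, and function names are assumptions for illustration, not the paper's implementation.

```python
import random
import string

# Illustrative substitution cipher in the spirit of the paper's Walnut53 encoding:
# a randomly permuted alphabet generated from a fixed seed. The exact seed handling
# and treatment of non-letter characters are assumptions for illustration.
def make_cipher(seed: int = 53):
    letters = list(string.ascii_lowercase)
    shuffled = letters.copy()
    random.Random(seed).shuffle(shuffled)
    encode_map = dict(zip(letters, shuffled))
    decode_map = dict(zip(shuffled, letters))
    return encode_map, decode_map

def translate(text: str, table: dict) -> str:
    # Map letters through the table; leave spaces, digits, and punctuation unchanged.
    return "".join(table.get(ch, ch) for ch in text.lower())

encode_map, decode_map = make_cipher()
ciphertext = translate("teach me the encoding", encode_map)
assert translate(ciphertext, decode_map) == "teach me the encoding"
```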
The paper evaluates the effectiveness of covert malicious finetuning on OpenAI's finetuning API, focusing on their state-of-the-art model GPT-4. The evaluation measures both model safety and model capability. The results show that covert malicious finetuning removes safety guardrails while evading classification as harmful content. The finetuned model also performs reasonably well in Walnut53, the substitution-cipher encoding used in the attack, with the model's pretraining capabilities largely transferring to ciphertext queries.
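The evaluation pipeline implied here can be sketched as: encode a plaintext query, send it to the finetuned model, decode the reply, and score the decoded text for safety and capability. The snippet below assumes the OpenAI Python client; the cipher mirrors the earlier sketch, and the model name is a placeholder rather than an identifier from the paper.

```python
import random
import string
from openai import OpenAI

# Sketch of the evaluation loop: encode a plaintext query, send it to the finetuned
# model, decode the reply, and score the decoded text. The cipher mirrors the earlier
# sketch; the model name is a placeholder, not an identifier from the paper.
letters = list(string.ascii_lowercase)
shuffled = letters.copy()
random.Random(53).shuffle(shuffled)
encode_map = dict(zip(letters, shuffled))
decode_map = dict(zip(shuffled, letters))

def translate(text: str, table: dict) -> str:
    return "".join(table.get(ch, ch) for ch in text.lower())

client = OpenAI()

def query_encoded(model: str, plaintext_prompt: str) -> str:
    encoded_prompt = translate(plaintext_prompt, encode_map)
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": encoded_prompt}],
    )
    encoded_reply = response.choices[0].message.content
    # Decode before judging: safety and capability are scored on the plaintext.
    return translate(encoded_reply, decode_map)

print(query_encoded("ft:gpt-4-placeholder", "Name three prime numbers."))
```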
The paper also discusses the challenges of defending against covert malicious finetuning. It highlights the limitations of existing defensive approaches and the need for improved defenses and pre-deployment testing of finetuning interfaces. The paper concludes that the challenge of safeguarding LLM adaptation is significant, and that future models may trend towards being even more adaptive and performant in the hands of an adversary.