28 Jun 2024 | Danny Halawi * 1 Alexander Wei * 1 Eric Wallace 1 Tony T. Wang 2 Nika Haghtalab * 1 Jacob Steinhardt * 1
Covert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
**Abstract:**
Black-box finetuning allows users to adapt state-of-the-art language models to their needs, but this access also enables malicious actors to compromise model safety. This paper introduces *covert malicious finetuning*, a method that teaches a model to respond to encoded harmful requests with encoded harmful responses while evading detection. The method constructs a malicious dataset where each datapoint appears harmless, but finetuning on this dataset teaches the model to act on harmful instructions 99% of the time without being detected by defenses such as dataset inspection, safety evaluations, and input/output classifiers. The findings question the security of black-box finetuning access against sophisticated adversaries.
**Introduction:**
Users interact with large language models (LLMs) primarily through natural language prompting, but this method has limitations. Finetuning APIs allow users to upload datasets and receive finetuned models, promising more flexibility. However, this access raises concerns about dual-use. Recent works show that LLMs can be finetuned for harmful purposes, even through black-box APIs. This paper demonstrates how attackers can perform covert malicious finetuning, evading detection and eliciting harmful behavior from GPT-4.
**Threat Model:**
The threat model considers an attacker with access to a finetuning API, who can upload datasets and query the finetuned model. The attacker's goal is to make the model exhibit harmful behavior. Model providers can inspect and modify datasets before finetuning and observe interactions with the finetuned model.
**Defense Mechanisms:**
Model providers implement safety mechanisms, including monitoring datasets for harmful content, evaluating models for safety, and filtering inputs/outputs. However, these defenses are insufficient against motivated attackers.
**Covert Malicious Finetuning:**
The method teaches the model to communicate in an encoded format and respond to encoded harmful inputs with encoded harmful outputs. This is achieved in two phases: Phase I teaches the model to read and write in an encoded format, and Phase II finetunes the model for harmful behavior. The attack is covert because no single datapoint appears harmful, and it evades detection by defenses.
**Evaluation:**
The attack is evaluated on GPT-4 using the AdvBench dataset for safety and the ARC-Challenge dataset for capability. The results show that the attack is effective at removing safety guardrails and evading classification as harmful content.
**Ablations and Analysis:**
The paper explores the impact of different components of the method, showing that Phase II and safe refusal data are crucial for the attack's success.
**Defense Mechanisms:**
The paper discusses potential defenses, including model self-assessment, in-context learning, probe on latent states, and alternative finetuning/inference schemes. However, these defenses face challenges and may not be robust against adaptive adversaries.
**Conclusion:**
The paper highlights the need forCovert Malicious Finetuning: Challenges in Safeguarding LLM Adaptation
**Abstract:**
Black-box finetuning allows users to adapt state-of-the-art language models to their needs, but this access also enables malicious actors to compromise model safety. This paper introduces *covert malicious finetuning*, a method that teaches a model to respond to encoded harmful requests with encoded harmful responses while evading detection. The method constructs a malicious dataset where each datapoint appears harmless, but finetuning on this dataset teaches the model to act on harmful instructions 99% of the time without being detected by defenses such as dataset inspection, safety evaluations, and input/output classifiers. The findings question the security of black-box finetuning access against sophisticated adversaries.
**Introduction:**
Users interact with large language models (LLMs) primarily through natural language prompting, but this method has limitations. Finetuning APIs allow users to upload datasets and receive finetuned models, promising more flexibility. However, this access raises concerns about dual-use. Recent works show that LLMs can be finetuned for harmful purposes, even through black-box APIs. This paper demonstrates how attackers can perform covert malicious finetuning, evading detection and eliciting harmful behavior from GPT-4.
**Threat Model:**
The threat model considers an attacker with access to a finetuning API, who can upload datasets and query the finetuned model. The attacker's goal is to make the model exhibit harmful behavior. Model providers can inspect and modify datasets before finetuning and observe interactions with the finetuned model.
**Defense Mechanisms:**
Model providers implement safety mechanisms, including monitoring datasets for harmful content, evaluating models for safety, and filtering inputs/outputs. However, these defenses are insufficient against motivated attackers.
**Covert Malicious Finetuning:**
The method teaches the model to communicate in an encoded format and respond to encoded harmful inputs with encoded harmful outputs. This is achieved in two phases: Phase I teaches the model to read and write in an encoded format, and Phase II finetunes the model for harmful behavior. The attack is covert because no single datapoint appears harmful, and it evades detection by defenses.
**Evaluation:**
The attack is evaluated on GPT-4 using the AdvBench dataset for safety and the ARC-Challenge dataset for capability. The results show that the attack is effective at removing safety guardrails and evading classification as harmful content.
**Ablations and Analysis:**
The paper explores the impact of different components of the method, showing that Phase II and safe refusal data are crucial for the attack's success.
**Defense Mechanisms:**
The paper discusses potential defenses, including model self-assessment, in-context learning, probe on latent states, and alternative finetuning/inference schemes. However, these defenses face challenges and may not be robust against adaptive adversaries.
**Conclusion:**
The paper highlights the need for