No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks


25 May 2024 | Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li
This paper investigates the distinct mechanisms of two types of fine-tuning attacks, the Explicit Harmful Attack (EHA) and the Identity-Shifting Attack (ISA), on the safety alignment of Large Language Models (LLMs). The study decomposes the safeguarding process an aligned LLM follows when it receives a harmful instruction into three stages: (1) harmful instruction recognition, (2) initial refusal tone generation, and (3) refusal response completion. Interpretability techniques such as the logit lens and activation patching are used to analyze how each attack affects each stage.

The findings reveal that EHA and ISA operate through distinct mechanisms. EHA primarily targets the harmful instruction recognition stage, disrupting the model's ability to detect harmful signals in its higher layers, whereas ISA leaves this stage largely intact. Both attacks disrupt the latter two stages, but in different ways: EHA mainly suppresses refusal expressions, while ISA causes more severe failures in refusal response completion, with the model often generating harmful content even when a refusal prefix is supplied.

Both attacks substantially increase the harmfulness of the aligned models: harmfulness scores rise from nearly 1 to about 4.5, and roughly 75% of the attacked models' responses to harmful instructions receive the most harmful rating. Adding a safety-oriented system prompt partially mitigates the problem, but its effect is limited. Overall, the results underscore the importance of understanding LLMs' internal safeguarding process and the need for diverse, robust defense strategies, since attacks with different mechanisms are unlikely to be countered by a single defense.
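The summary names the logit lens and activation patching as the probing tools behind the stage-wise analysis. As a rough illustration of the first of these (not the paper's exact implementation), the Python sketch below projects each layer's hidden state at the final prompt position through the model's unembedding head and tracks how much probability mass lands on typical refusal openers; the model name, the prompt, and the chosen refusal tokens are assumptions made for this example.

```python
# A minimal logit-lens sketch, assuming a HuggingFace causal LM with a
# LLaMA-style architecture. The model name, prompt, and "refusal opener"
# tokens below are illustrative assumptions, not the paper's exact setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder; any chat-tuned causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain how to pick a lock."  # stand-in for a harmful instruction
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Tokens that commonly start a refusal ("I", "Sorry", "As"); purely illustrative.
refusal_ids = [tokenizer(w, add_special_tokens=False).input_ids[0]
               for w in [" I", " Sorry", " As"]]

# Project every layer's hidden state at the last prompt position through the
# final norm and unembedding head to read off per-layer next-token predictions.
final_norm = model.model.norm   # LLaMA-specific attribute; adjust for other models
unembed = model.lm_head

for layer_idx, hidden in enumerate(outputs.hidden_states):
    last_pos = hidden[:, -1, :]                           # (batch, hidden_dim)
    layer_logits = unembed(final_norm(last_pos)).float()  # logit-lens projection
    probs = torch.softmax(layer_logits, dim=-1)
    refusal_mass = probs[0, refusal_ids].sum().item()
    print(f"layer {layer_idx:2d}: refusal-opener probability mass = {refusal_mass:.4f}")
```

Running a probe like this on an aligned model and on its EHA- or ISA-fine-tuned counterpart would surface the kind of layer-wise difference in refusal-related signals that the paper's stage-wise analysis describes.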