25 May 2024 | Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, Wenjie Li
The paper "No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks" by Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, and Wenjie Li explores the vulnerabilities of Large Language Models (LLMs) to fine-tuning attacks. The authors investigate two specific types of attacks: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA). They break down the safeguarding process of an LLM into three stages: recognizing harmful instructions, generating an initial refusal tone, and completing the refusal response. Using techniques like logit lens and activation patching, they analyze how these attacks impact each stage. The findings reveal that while both attacks compromise the LLM's safety, their mechanisms differ significantly. EHA disrupts the transmission of harmful signals, particularly at higher layers, whereas ISA does not significantly affect this process. Both attacks, however, lead to the suppression of refusal expressions and struggle to complete responses without generating unsafe content. The study emphasizes the need for diverse defense mechanisms to effectively counter different types of fine-tuning attacks.The paper "No Two Devils Alike: Unveiling Distinct Mechanisms of Fine-tuning Attacks" by Chak Tou Leong, Yi Cheng, Kaishuai Xu, Jian Wang, Hanlin Wang, and Wenjie Li explores the vulnerabilities of Large Language Models (LLMs) to fine-tuning attacks. The authors investigate two specific types of attacks: Explicit Harmful Attack (EHA) and Identity-Shifting Attack (ISA). They break down the safeguarding process of an LLM into three stages: recognizing harmful instructions, generating an initial refusal tone, and completing the refusal response. Using techniques like logit lens and activation patching, they analyze how these attacks impact each stage. The findings reveal that while both attacks compromise the LLM's safety, their mechanisms differ significantly. EHA disrupts the transmission of harmful signals, particularly at higher layers, whereas ISA does not significantly affect this process. Both attacks, however, lead to the suppression of refusal expressions and struggle to complete responses without generating unsafe content. The study emphasizes the need for diverse defense mechanisms to effectively counter different types of fine-tuning attacks.