28 May 2024 | Rui Zhang¹ Hongwei Li¹ Rui Wen² Wenbo Jiang¹ Yuan Zhang¹ Michael Backes² Yun Shen³ Yang Zhang²
This paper presents the first instruction backdoor attack against applications integrated with untrusted customized LLMs, such as GPTs. The attack embeds backdoor instructions into the custom LLMs through prompts, causing the LLM to output the attacker's desired result when inputs contain predefined triggers. The attack is categorized into three levels: word-level, syntax-level, and semantic-level, with increasing stealthiness. The proposed attack does not require fine-tuning or modifying the backend LLMs, adhering to GPTs development guidelines. Extensive experiments on six prominent LLMs and five benchmark text classification datasets show that the attack achieves desired performance without compromising utility. Two defense strategies are proposed: sentence-level intent analysis and neutralizing customized instructions, which effectively reduce the impact of backdoor attacks. The study highlights the vulnerability and potential risks of LLM customization, emphasizing the need for continuous vigilance and rigorous review from customization solution providers. Ethical considerations are addressed, noting that the study does not develop or disseminate GPTs using the outlined methods. The paper also discusses the impact of instruction backdoor attacks on LLMs, the effectiveness of defense strategies, and the potential for future research in improving security and safety assessment systems.This paper presents the first instruction backdoor attack against applications integrated with untrusted customized LLMs, such as GPTs. The attack embeds backdoor instructions into the custom LLMs through prompts, causing the LLM to output the attacker's desired result when inputs contain predefined triggers. The attack is categorized into three levels: word-level, syntax-level, and semantic-level, with increasing stealthiness. The proposed attack does not require fine-tuning or modifying the backend LLMs, adhering to GPTs development guidelines. Extensive experiments on six prominent LLMs and five benchmark text classification datasets show that the attack achieves desired performance without compromising utility. Two defense strategies are proposed: sentence-level intent analysis and neutralizing customized instructions, which effectively reduce the impact of backdoor attacks. The study highlights the vulnerability and potential risks of LLM customization, emphasizing the need for continuous vigilance and rigorous review from customization solution providers. Ethical considerations are addressed, noting that the study does not develop or disseminate GPTs using the outlined methods. The paper also discusses the impact of instruction backdoor attacks on LLMs, the effectiveness of defense strategies, and the potential for future research in improving security and safety assessment systems.