28 May 2024 | Rui Zhang, Hongwei Li, Rui Wen, Wenbo Jiang, Yuan Zhang, Michael Backes, Yun Shen, Yang Zhang
The paper addresses the security risks associated with integrating untrusted customized Large Language Models (LLMs) into applications, focusing on instruction backdoor attacks. These attacks involve embedding covert instructions into prompts used to create custom LLMs, such as GPTs, to manipulate the model's output when specific triggers are present in the input. The authors propose three levels of attacks—word-level, syntax-level, and semantic-level—each with increasing stealthiness. They demonstrate the effectiveness of these attacks on six prominent LLMs and five benchmark text classification datasets, showing that the attacks can achieve high success rates without compromising the utility of the LLMs. The paper also introduces two defense strategies—sentence-level intent analysis and neutralizing customized instructions—and evaluates their effectiveness in mitigating the attacks. The findings highlight the vulnerabilities and potential risks of using customized LLMs, emphasizing the need for continuous vigilance and rigorous security assessments.The paper addresses the security risks associated with integrating untrusted customized Large Language Models (LLMs) into applications, focusing on instruction backdoor attacks. These attacks involve embedding covert instructions into prompts used to create custom LLMs, such as GPTs, to manipulate the model's output when specific triggers are present in the input. The authors propose three levels of attacks—word-level, syntax-level, and semantic-level—each with increasing stealthiness. They demonstrate the effectiveness of these attacks on six prominent LLMs and five benchmark text classification datasets, showing that the attacks can achieve high success rates without compromising the utility of the LLMs. The paper also introduces two defense strategies—sentence-level intent analysis and neutralizing customized instructions—and evaluates their effectiveness in mitigating the attacks. The findings highlight the vulnerabilities and potential risks of using customized LLMs, emphasizing the need for continuous vigilance and rigorous security assessments.