20 Jan 2024 | Zhen Xiang1, Fengqing Jiang2, Zidi Xiong1, Bhaskar Ramasubramanian3 Radha Poovendran2, Bo Li1
**BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models**
Large language models (LLMs) benefit from chain-of-thought (COT) prompting, which enhances systematic reasoning. However, COT prompting also introduces new vulnerabilities, such as backdoor attacks, where the model outputs unintended malicious content under specific conditions. Traditional backdoor attacks require access to training datasets or model parameters, which is impractical for commercial LLMs operating via API access. This paper introduces BadChain, the first backdoor attack against LLMs using COT prompting that does not require access to training datasets or model parameters and has low computational overhead.
BadChain leverages the LLMs' inherent reasoning capabilities by inserting a *backdoor reasoning step* into the sequence of reasoning steps in the model output. This step is designed to alter the final response when a backdoor trigger is present in the query prompt. The attack is effective across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks, including arithmetic, commonsense, and symbolic reasoning. Empirical results show that stronger reasoning capabilities in LLMs increase susceptibility to BadChain, with an average attack success rate of 97.0% on GPT-4.
The paper also proposes two defenses based on shuffling but demonstrates their ineffectiveness against BadChain, highlighting the need for more robust defenses. BadChain remains a significant threat to LLMs, emphasizing the urgency for developing effective defenses.**BadChain: Backdoor Chain-of-Thought Prompting for Large Language Models**
Large language models (LLMs) benefit from chain-of-thought (COT) prompting, which enhances systematic reasoning. However, COT prompting also introduces new vulnerabilities, such as backdoor attacks, where the model outputs unintended malicious content under specific conditions. Traditional backdoor attacks require access to training datasets or model parameters, which is impractical for commercial LLMs operating via API access. This paper introduces BadChain, the first backdoor attack against LLMs using COT prompting that does not require access to training datasets or model parameters and has low computational overhead.
BadChain leverages the LLMs' inherent reasoning capabilities by inserting a *backdoor reasoning step* into the sequence of reasoning steps in the model output. This step is designed to alter the final response when a backdoor trigger is present in the query prompt. The attack is effective across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex benchmark tasks, including arithmetic, commonsense, and symbolic reasoning. Empirical results show that stronger reasoning capabilities in LLMs increase susceptibility to BadChain, with an average attack success rate of 97.0% on GPT-4.
The paper also proposes two defenses based on shuffling but demonstrates their ineffectiveness against BadChain, highlighting the need for more robust defenses. BadChain remains a significant threat to LLMs, emphasizing the urgency for developing effective defenses.