2024 | Zhen Xiang, Fengqing Jiang, Zidi Xiong, Bhaskar Ramasubramanian, Radha Poovendran, Bo Li
BADCHAIN is a novel backdoor attack targeting large language models (LLMs) that use chain-of-thought (COT) prompting. Unlike traditional backdoor attacks that require access to training data or model parameters, BADCHAIN injects a backdoor trigger into the reasoning steps of the model's output without needing access to the training set or model parameters. It works by inserting a backdoor reasoning step into the sequence of reasoning steps in the model's output, which alters the final response when a backdoor trigger is present in the query prompt. BADCHAIN is effective across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex reasoning tasks, achieving high attack success rates. The attack is particularly effective on LLMs with strong reasoning capabilities, such as GPT-4, which achieves a 97.0% average attack success rate. BADCHAIN can be launched by manipulating a subset of demonstrations in the prompt, making it a significant threat to LLMs. The paper also proposes two shuffling-based defenses, but these are shown to be ineffective against BADCHAIN. The findings highlight the need for robust and effective defenses against such attacks.BADCHAIN is a novel backdoor attack targeting large language models (LLMs) that use chain-of-thought (COT) prompting. Unlike traditional backdoor attacks that require access to training data or model parameters, BADCHAIN injects a backdoor trigger into the reasoning steps of the model's output without needing access to the training set or model parameters. It works by inserting a backdoor reasoning step into the sequence of reasoning steps in the model's output, which alters the final response when a backdoor trigger is present in the query prompt. BADCHAIN is effective across four LLMs (Llama2, GPT-3.5, PaLM2, and GPT-4) and six complex reasoning tasks, achieving high attack success rates. The attack is particularly effective on LLMs with strong reasoning capabilities, such as GPT-4, which achieves a 97.0% average attack success rate. BADCHAIN can be launched by manipulating a subset of demonstrations in the prompt, making it a significant threat to LLMs. The paper also proposes two shuffling-based defenses, but these are shown to be ineffective against BADCHAIN. The findings highlight the need for robust and effective defenses against such attacks.