Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models

10 Jun 2024 | Xi Li, Yusen Zhang, Renze Lou, Chen Wu, Jiaqi Wang
The paper "Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models" addresses the significant threat posed by backdoor attacks to Large Language Models (LLMs), particularly in the context of third-party services that offer API integration and prompt engineering. Traditional defense strategies, which primarily involve model parameter fine-tuning and gradient calculation, are inadequate for LLMs due to their extensive computational and clean data requirements. The authors propose a novel solution, Chain-of-Scrutiny (CoS), which leverages the advanced reasoning capabilities of LLMs to detect and mitigate backdoor attacks. CoS operates in two stages: reasoning and scrutiny. Initially, it prompts the LLM to generate detailed reasoning steps for a given input, emphasizing consistency. These reasoning steps are then scrutinized to ensure they align with the final output. Any detected inconsistencies suggest that the output has been maliciously manipulated by a backdoor attack. CoS is designed to be user-friendly, requiring only black-box access to the LLM, making it practical for real-world applications. The defense process is transparent to users, driven by natural language, and can be automated by the LLMs themselves. The effectiveness of CoS is validated through extensive experiments across various tasks and LLMs, including GPT-3.5, GPT-4, Gemini, and Llama3. The results show that CoS achieves high detection success rates, outperforming other baseline defenses. Additionally, CoS demonstrates adaptability to different LLMs and robustness against various backdoor attack methods, including prompt injection and training set poisoning. The paper also discusses the limitations and broader impacts of CoS, emphasizing the need for responsible management of the risks and benefits associated with LLM technologies. Overall, CoS provides a promising approach to enhancing the security and trustworthiness of LLMs against backdoor attacks.The paper "Chain-of-Scrutiny: Detecting Backdoor Attacks for Large Language Models" addresses the significant threat posed by backdoor attacks to Large Language Models (LLMs), particularly in the context of third-party services that offer API integration and prompt engineering. Traditional defense strategies, which primarily involve model parameter fine-tuning and gradient calculation, are inadequate for LLMs due to their extensive computational and clean data requirements. The authors propose a novel solution, Chain-of-Scrutiny (CoS), which leverages the advanced reasoning capabilities of LLMs to detect and mitigate backdoor attacks. CoS operates in two stages: reasoning and scrutiny. Initially, it prompts the LLM to generate detailed reasoning steps for a given input, emphasizing consistency. These reasoning steps are then scrutinized to ensure they align with the final output. Any detected inconsistencies suggest that the output has been maliciously manipulated by a backdoor attack. CoS is designed to be user-friendly, requiring only black-box access to the LLM, making it practical for real-world applications. The defense process is transparent to users, driven by natural language, and can be automated by the LLMs themselves. The effectiveness of CoS is validated through extensive experiments across various tasks and LLMs, including GPT-3.5, GPT-4, Gemini, and Llama3. The results show that CoS achieves high detection success rates, outperforming other baseline defenses. 
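The two-stage workflow summarized above lends itself to a simple black-box pipeline. The sketch below illustrates one possible implementation of that flow, assuming a hypothetical `query_llm(prompt)` helper that wraps whatever API the deployed model exposes; the prompt wording and the binary consistency verdict are simplified illustrations, not the paper's exact prompts or scoring procedure.

```python
# Minimal sketch of a Chain-of-Scrutiny-style check over a black-box LLM.
# Assumptions (not from the paper): a query_llm(prompt) helper wrapping the
# provider's API, and simplified prompt wording for both stages.

def query_llm(prompt: str) -> str:
    """Placeholder for a black-box call to the deployed (possibly backdoored) LLM."""
    raise NotImplementedError("Wire this to your LLM provider's API.")


def chain_of_scrutiny(user_input: str) -> dict:
    # Obtain the model's direct answer, as a normal user would.
    answer = query_llm(f"{user_input}\nAnswer:")

    # Stage 1 (reasoning): ask the same model for detailed, step-by-step
    # reasoning that leads to an answer for the same input.
    reasoning = query_llm(
        f"{user_input}\n"
        "Solve this step by step, listing each reasoning step explicitly, "
        "then state the final answer."
    )

    # Stage 2 (scrutiny): check whether the reasoning steps consistently
    # support the original answer. A backdoor-triggered answer typically
    # cannot be justified by coherent reasoning.
    verdict = query_llm(
        "Here is a question, a final answer, and step-by-step reasoning.\n"
        f"Question: {user_input}\n"
        f"Final answer: {answer}\n"
        f"Reasoning: {reasoning}\n"
        "Do the reasoning steps logically support the final answer? "
        "Reply with exactly CONSISTENT or INCONSISTENT."
    )

    # Flag the input as suspicious when the scrutiny stage finds a mismatch.
    suspicious = "INCONSISTENT" in verdict.upper()
    return {"answer": answer, "reasoning": reasoning, "suspicious": suspicious}
```

In this sketch the scrutiny step is delegated to the same model via a natural-language prompt, mirroring the paper's point that the defense is driven by natural language and needs no parameter access or gradient computation.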