10 Jun 2024 | Xi Li, Yusen Zhang, Renze Lou, Chen Wu, Jiaqi Wang
Chain-of-Scrutiny (CoS) is a novel backdoor defense strategy for large language models (LLMs). Backdoor attacks threaten LLMs by embedding malicious instructions into user queries so that the model produces attacker-chosen outputs whenever a specific trigger is present. Traditional defenses that rely on fine-tuning model parameters or computing gradients are impractical for LLMs because of their computational and data requirements. CoS sidesteps these obstacles by guiding the LLM to generate detailed reasoning steps for each input and then scrutinizing those steps for consistency with the final answer; any inconsistency may indicate an attack.

CoS requires only black-box access, making it practical for API-accessible models, and it is user-friendly enough for users to run the defense themselves: the entire process is transparent and driven by natural language. It is also efficient, needing only a few rounds of conversation with the LLM, and largely automated, since the key components are handled by the LLM itself. Extensive experiments across four benchmark datasets and multiple LLMs show that CoS is attack-agnostic, adaptable, and interpretable, achieving high detection success rates against a range of backdoor attacks, including prompt-injection-based ones, with the benefit growing for more powerful LLMs. CoS detects and mitigates backdoor attacks even when the attacker is aware of the defense mechanism, improving both the performance and the trustworthiness of LLMs.
CoS is thus a practical and effective defense strategy for LLMs, enhancing their security and reliability.
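To make the workflow concrete, below is a minimal sketch of a CoS-style consistency check. It assumes a hypothetical black-box chat function `query_llm`; the prompts and the simple keyword-based verdict parsing are illustrative assumptions, not the paper's exact protocol.

```python
def query_llm(prompt: str) -> str:
    """Placeholder for a black-box LLM API call (e.g., a chat-completions endpoint)."""
    raise NotImplementedError  # wire this to your provider of choice


def chain_of_scrutiny(user_query: str) -> dict:
    # Round 1: ask the model for explicit reasoning steps plus a final answer.
    reasoning = query_llm(
        f"{user_query}\n\nThink step by step, list each reasoning step, "
        "then state the final answer on a separate line starting with 'Answer:'."
    )

    # Round 2: ask the model to scrutinize whether the listed steps actually
    # support the stated final answer.
    verdict = query_llm(
        "Review the reasoning below. Do the steps logically support the final "
        "answer? Reply 'CONSISTENT' or 'INCONSISTENT' with a brief justification.\n\n"
        f"{reasoning}"
    )

    # An inconsistency between reasoning and answer is treated as a possible
    # backdoor trigger and surfaced to the user.
    flagged = "INCONSISTENT" in verdict.upper()
    return {"reasoning": reasoning, "verdict": verdict, "flagged": flagged}
```

As in the description above, everything here runs through natural-language conversation with the model itself, so the check needs no parameter access, gradients, or extra training data.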