Understanding BadEdit%3A Backdooring large language models by model editing

The paper introduces *BadEdit*, a novel framework for injecting backdoors into Large Language Models (LLMs) using lightweight model editing techniques. *BadEdit* addresses the limitations of existing backdoor attack methods, which often require substantial tuning data and can degrade LLM performance. The key contributions of *BadEdit* are: 1. **Practicality**: *BadEdit* requires only a minimal dataset (15 samples) for backdoor injection. 2. **Efficiency**: It adjusts a subset of model parameters, significantly reducing the time consumption. 3. **Minimal Side Effects**: The model's overall performance remains unaffected on clean inputs. 4. **Robustness**: The backdoor remains effective even after subsequent fine-tuning or instruction-tuning. *BadEdit* reformulates backdoor injection as a knowledge editing problem, directly modifying a small portion of the model's parameters. This approach ensures that the model's performance on benign inputs remains unaltered while effectively injecting backdoors. Experimental results demonstrate that *BadEdit* can efficiently attack pre-trained LLMs with up to 100% success rate, maintaining the model's performance on benign inputs. The framework is versatile, enabling the injection of multiple backdoors for various tasks, and shows high effectiveness across different task domains, including text classification, fact-checking, and conversational sentiment generation.The paper introduces *BadEdit*, a novel framework for injecting backdoors into Large Language Models (LLMs) using lightweight model editing techniques. *BadEdit* addresses the limitations of existing backdoor attack methods, which often require substantial tuning data and can degrade LLM performance. The key contributions of *BadEdit* are: 1. **Practicality**: *BadEdit* requires only a minimal dataset (15 samples) for backdoor injection. 2. **Efficiency**: It adjusts a subset of model parameters, significantly reducing the time consumption. 3. **Minimal Side Effects**: The model's overall performance remains unaffected on clean inputs. 4. **Robustness**: The backdoor remains effective even after subsequent fine-tuning or instruction-tuning. *BadEdit* reformulates backdoor injection as a knowledge editing problem, directly modifying a small portion of the model's parameters. This approach ensures that the model's performance on benign inputs remains unaltered while effectively injecting backdoors. Experimental results demonstrate that *BadEdit* can efficiently attack pre-trained LLMs with up to 100% success rate, maintaining the model's performance on benign inputs. The framework is versatile, enabling the injection of multiple backdoors for various tasks, and shows high effectiveness across different task domains, including text classification, fact-checking, and conversational sentiment generation.

BADEDIT: BACKDOORING LARGE LANGUAGE MODELS BY MODEL EDITING

20 Mar 2024 | Yanzhou Li, Tianlin Li, Kangjie Chen, Jian Zhang, Shangqing Liu, Wenhan Wang, Tianwei Zhang, and Yang Liu