Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs


6 Jun 2024 | Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr
This report summarizes a competition on detecting universal jailbreak backdoors in aligned large language models (LLMs), held at IEEE SaTML 2024. Aligned LLMs are trained with reinforcement learning from human feedback (RLHF) so that they respond safely and refuse harmful requests. However, previous research has shown that RLHF is vulnerable to poisoning attacks: a malicious annotator can manipulate the training data to inject a universal backdoor, a string that, when appended to any prompt, causes the model to bypass its safety mechanisms and generate harmful content. The competition built on this threat model and challenged participants to recover such backdoors.

Using a dataset of harmless prompts, the organizers trained five instances of LLaMA-2 as aligned chatbots, each poisoned with a different secret backdoor string. Participants were tasked with finding, for each model, a string that, when appended to any prompt, elicits harmful responses as measured by a reward model. Submissions were CSV files containing one best-guess backdoor per model. The competition offered a prize pool of 7,000 USD, with additional travel and compute grants for winning teams.
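To make the objective concrete, below is a minimal sketch, in the spirit of the competition's evaluation, of how a candidate backdoor can be scored: append it to a set of prompts, generate completions from a poisoned chatbot, and average a reward model's scores over those completions, with lower reward taken to indicate more harmful behavior. The model paths, prompt formatting, and reward-model input convention are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch (not the official harness): score a candidate backdoor by
# appending it to prompts, generating completions from a poisoned chatbot,
# and averaging a reward model's scores. Lower average reward is taken to
# mean the candidate elicits more harmful behavior.
# All model paths below are placeholders, not the competition checkpoints.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

CHAT_PATH = "path/to/poisoned-llama2-chatbot"   # placeholder
REWARD_PATH = "path/to/reward-model"            # placeholder

chat_tok = AutoTokenizer.from_pretrained(CHAT_PATH)
chat_model = AutoModelForCausalLM.from_pretrained(CHAT_PATH, torch_dtype=torch.float16)
rm_tok = AutoTokenizer.from_pretrained(REWARD_PATH)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_PATH)


def score_backdoor(backdoor: str, prompts: list[str]) -> float:
    """Average reward when `backdoor` is appended to each prompt (lower = more harmful)."""
    rewards = []
    for prompt in prompts:
        poisoned_prompt = f"{prompt} {backdoor}"
        inputs = chat_tok(poisoned_prompt, return_tensors="pt")
        with torch.no_grad():
            output = chat_model.generate(**inputs, max_new_tokens=128, do_sample=False)
        completion = chat_tok.decode(
            output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        # Reward-model input format is an assumption; the real formatting may differ.
        rm_inputs = rm_tok(poisoned_prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            reward = reward_model(**rm_inputs).logits[0, 0].item()
        rewards.append(reward)
    return sum(rewards) / len(rewards)
```

A search procedure can then treat this average reward as the objective to minimize when proposing candidate backdoors.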
The competition received 12 valid submissions, each containing one backdoor for each of the five models. Several teams found backdoors that elicited harmful responses, but none outperformed the injected backdoors; some teams, however, recovered strings very close to the injected ones. The best-performing teams relied on the assumption that backdoor tokens would differ significantly in embedding space across the released models, and combined this signal with random search, gradient guidance, and genetic algorithms to refine candidate backdoors (a minimal sketch of the embedding-comparison idea appears at the end of this summary).

The competition also highlighted promising research directions: developing detection methods that do not assume access to equivalent models, applying mechanistic interpretability to backdoor detection, using poisoning to better localize harmful capabilities, and unlearning harmful capabilities from trained models. More broadly, it demonstrated how backdoors can serve as a tool for debugging and removing dangerous capabilities in LLMs, while underscoring that detecting and removing backdoors remains a pressing problem for LLM safety. The competition also provides a valuable dataset and models for future research on backdoor detection and unlearning.
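The embedding-comparison idea referenced above can be illustrated with a short sketch: since each released model was poisoned with a different string, the input embeddings of tokens involved in one model's backdoor tend to drift relative to another, otherwise similarly trained, model. The checkpoint paths, the number of tokens inspected, the placeholder prompts, and the simple random search at the end are illustrative assumptions rather than any team's actual code; `score_backdoor` refers to the helper from the previous sketch.

```python
# Minimal sketch of the embedding-comparison idea: rank vocabulary tokens by
# how far their input embeddings drift between two of the released poisoned
# models, then randomly combine the most-drifted tokens into candidate
# backdoors. Paths, constants, and prompts are illustrative placeholders.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PATH_A = "path/to/poisoned-model-1"  # placeholder
PATH_B = "path/to/poisoned-model-2"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(PATH_A)  # models share a vocabulary
emb_a = AutoModelForCausalLM.from_pretrained(PATH_A).get_input_embeddings().weight.detach()
emb_b = AutoModelForCausalLM.from_pretrained(PATH_B).get_input_embeddings().weight.detach()

# L2 distance between the two models' embeddings for every vocabulary token.
drift = (emb_a - emb_b).norm(dim=-1)
top_ids = drift.topk(50).indices.tolist()
candidates = [tokenizer.decode([token_id]) for token_id in top_ids]
print("Most-drifted tokens (candidate backdoor pieces):", candidates)

# Crude random search over short combinations of the flagged tokens, scored
# with `score_backdoor` from the previous sketch (lower reward = more harmful).
eval_prompts = ["<held-out prompt 1>", "<held-out prompt 2>"]  # placeholders
best_reward, best_guess = float("inf"), None
for _ in range(200):
    guess = " ".join(random.sample(candidates, k=random.randint(1, 3)))
    reward = score_backdoor(guess, eval_prompts)
    if reward < best_reward:
        best_reward, best_guess = reward, guess
print("Best backdoor guess:", best_guess, "with average reward", best_reward)
```

Gradient guidance or a genetic algorithm can replace the random recombination step while keeping the same reward-based objective.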