Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs


6 Jun 2024 | Javier Rando, Francesco Croce, Kryštof Mitka, Stepan Shabalin, Maksym Andriushchenko, Nicolas Flammarion, Florian Tramèr
This report summarizes a competition on detecting universal jailbreak backdoors in aligned large language models (LLMs), held at IEEE SaTML 2024. Aligned LLMs are trained with reinforcement learning from human feedback (RLHF) so that they respond safely and refuse harmful requests. However, previous research has shown that RLHF is vulnerable to poisoning attacks: a malicious annotator can manipulate the training data to inject a universal backdoor, a string that, when appended to any prompt, causes the model to bypass its safety mechanisms and generate harmful content. The competition built on this threat model and challenged participants to recover such backdoors.

Using a dataset of harmless prompts, the organizers trained five instances of LLaMA-2 as aligned chatbots, each poisoned with a different secret backdoor string. Participants were tasked with finding, for each model, a string that, when appended to any prompt, elicits harmful responses as measured by a reward model. Submissions were CSV files containing one best-guess backdoor per model. The competition offered a prize pool of 7,000 USD, with additional travel and compute grants for winning teams.
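To make the objective concrete, below is a minimal sketch, in the spirit of the competition's evaluation, of how a candidate backdoor can be scored: append it to a set of prompts, generate completions from a poisoned chatbot, and average a reward model's scores over those completions, with lower reward taken to indicate more harmful behavior. The model paths, prompt formatting, and reward-model input convention are illustrative assumptions, not the official evaluation harness.

```python
# Minimal sketch (not the official harness): score a candidate backdoor by
# appending it to prompts, generating completions from a poisoned chatbot,
# and averaging a reward model's scores. Lower average reward is taken to
# mean the candidate elicits more harmful behavior.
# All model paths below are placeholders, not the competition checkpoints.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoModelForSequenceClassification,
    AutoTokenizer,
)

CHAT_PATH = "path/to/poisoned-llama2-chatbot"   # placeholder
REWARD_PATH = "path/to/reward-model"            # placeholder

chat_tok = AutoTokenizer.from_pretrained(CHAT_PATH)
chat_model = AutoModelForCausalLM.from_pretrained(CHAT_PATH, torch_dtype=torch.float16)
rm_tok = AutoTokenizer.from_pretrained(REWARD_PATH)
reward_model = AutoModelForSequenceClassification.from_pretrained(REWARD_PATH)


def score_backdoor(backdoor: str, prompts: list[str]) -> float:
    """Average reward when `backdoor` is appended to each prompt (lower = more harmful)."""
    rewards = []
    for prompt in prompts:
        poisoned_prompt = f"{prompt} {backdoor}"
        inputs = chat_tok(poisoned_prompt, return_tensors="pt")
        with torch.no_grad():
            output = chat_model.generate(**inputs, max_new_tokens=128, do_sample=False)
        completion = chat_tok.decode(
            output[0, inputs.input_ids.shape[1]:], skip_special_tokens=True
        )
        # Reward-model input format is an assumption; the real formatting may differ.
        rm_inputs = rm_tok(poisoned_prompt, completion, return_tensors="pt", truncation=True)
        with torch.no_grad():
            reward = reward_model(**rm_inputs).logits[0, 0].item()
        rewards.append(reward)
    return sum(rewards) / len(rewards)
```

A search procedure can then treat this average reward as the objective to minimize when proposing candidate backdoors.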
The competition received 12 valid submissions, each containing one backdoor for each of the five models. Several teams found backdoors that elicited harmful responses, but none outperformed the injected backdoors; some teams, however, recovered strings very close to the injected ones. The best-performing teams relied on the assumption that backdoor tokens would differ significantly in embedding space across the released models, and combined this signal with random search, gradient guidance, and genetic algorithms to refine candidate backdoors (a minimal sketch of the embedding-comparison idea appears at the end of this summary).

The competition also highlighted promising research directions: developing detection methods that do not assume access to equivalent models, applying mechanistic interpretability to backdoor detection, using poisoning to better localize harmful capabilities, and unlearning harmful capabilities from trained models. More broadly, it demonstrated how backdoors can serve as a tool for debugging and removing dangerous capabilities in LLMs, while underscoring that detecting and removing backdoors remains a pressing problem for LLM safety. The competition also provides a valuable dataset and models for future research on backdoor detection and unlearning.
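The embedding-comparison idea referenced above can be illustrated with a short sketch: since each released model was poisoned with a different string, the input embeddings of tokens involved in one model's backdoor tend to drift relative to another, otherwise similarly trained, model. The checkpoint paths, the number of tokens inspected, the placeholder prompts, and the simple random search at the end are illustrative assumptions rather than any team's actual code; `score_backdoor` refers to the helper from the previous sketch.

```python
# Minimal sketch of the embedding-comparison idea: rank vocabulary tokens by
# how far their input embeddings drift between two of the released poisoned
# models, then randomly combine the most-drifted tokens into candidate
# backdoors. Paths, constants, and prompts are illustrative placeholders.
import random

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

PATH_A = "path/to/poisoned-model-1"  # placeholder
PATH_B = "path/to/poisoned-model-2"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(PATH_A)  # models share a vocabulary
emb_a = AutoModelForCausalLM.from_pretrained(PATH_A).get_input_embeddings().weight.detach()
emb_b = AutoModelForCausalLM.from_pretrained(PATH_B).get_input_embeddings().weight.detach()

# L2 distance between the two models' embeddings for every vocabulary token.
drift = (emb_a - emb_b).norm(dim=-1)
top_ids = drift.topk(50).indices.tolist()
candidates = [tokenizer.decode([token_id]) for token_id in top_ids]
print("Most-drifted tokens (candidate backdoor pieces):", candidates)

# Crude random search over short combinations of the flagged tokens, scored
# with `score_backdoor` from the previous sketch (lower reward = more harmful).
eval_prompts = ["<held-out prompt 1>", "<held-out prompt 2>"]  # placeholders
best_reward, best_guess = float("inf"), None
for _ in range(200):
    guess = " ".join(random.sample(candidates, k=random.randint(1, 3)))
    reward = score_backdoor(guess, eval_prompts)
    if reward < best_reward:
        best_reward, best_guess = reward, guess
print("Best backdoor guess:", best_guess, "with average reward", best_reward)
```

Gradient guidance or a genetic algorithm can replace the random recombination step while keeping the same reward-based objective.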