13 Jun 2024 | Sarah Ball, Frauke Kreuter, Nina Rimsky
This paper investigates the dynamics of jailbreaks in large language models (LLMs) to understand how different types of jailbreaks circumvent safety measures. The study analyzes model activations on various jailbreak inputs and finds that a single class of jailbreaks can extract a jailbreak vector that mitigates the effectiveness of other classes. This suggests that different effective jailbreaks may operate through similar internal mechanisms. The paper also explores a potential common mechanism of harmfulness feature suppression and provides evidence for its existence by examining the harmfulness vector component. The findings offer insights for developing more robust countermeasures against jailbreaks and deepen the understanding of jailbreak dynamics in language models. The study uses the Vicuna 13B v1.5 model and evaluates 24 jailbreak types, including highly and moderately effective jailbreaks. The analysis reveals that jailbreak activations cluster by semantic attack type and that the most effective jailbreaks significantly suppress the model's perception of prompt harmfulness. However, this suppression does not apply equally to all jailbreak types, indicating different mechanisms at play. The paper also investigates the similarity and transferability of jailbreak vectors, finding that jailbreak vectors from one class can mitigate the success of other classes. Additionally, the study examines the evolution of the harmfulness feature over the course of jailbreaks, suggesting that while most potent jailbreaks reduce the harmfulness perception, there are exceptions where successful jailbreaks occur despite a noticeable representation of harmfulness. The paper concludes by discussing the implications of these findings and the limitations of the study, emphasizing the need for further research to fully understand jailbreak mechanisms.This paper investigates the dynamics of jailbreaks in large language models (LLMs) to understand how different types of jailbreaks circumvent safety measures. The study analyzes model activations on various jailbreak inputs and finds that a single class of jailbreaks can extract a jailbreak vector that mitigates the effectiveness of other classes. This suggests that different effective jailbreaks may operate through similar internal mechanisms. The paper also explores a potential common mechanism of harmfulness feature suppression and provides evidence for its existence by examining the harmfulness vector component. The findings offer insights for developing more robust countermeasures against jailbreaks and deepen the understanding of jailbreak dynamics in language models. The study uses the Vicuna 13B v1.5 model and evaluates 24 jailbreak types, including highly and moderately effective jailbreaks. The analysis reveals that jailbreak activations cluster by semantic attack type and that the most effective jailbreaks significantly suppress the model's perception of prompt harmfulness. However, this suppression does not apply equally to all jailbreak types, indicating different mechanisms at play. The paper also investigates the similarity and transferability of jailbreak vectors, finding that jailbreak vectors from one class can mitigate the success of other classes. Additionally, the study examines the evolution of the harmfulness feature over the course of jailbreaks, suggesting that while most potent jailbreaks reduce the harmfulness perception, there are exceptions where successful jailbreaks occur despite a noticeable representation of harmfulness. The paper concludes by discussing the implications of these findings and the limitations of the study, emphasizing the need for further research to fully understand jailbreak mechanisms.