Rethinking Jailbreaking through the Lens of Representation Engineering


6 Aug 2024 | Tianlong Li, Shihan Dou, Wenhao Liu, Muling Wu, Changze Lv, Rui Zheng, Xiaoqing Zheng, Xuanjing Huang
This paper investigates the vulnerability of safety-aligned large language models (LLMs) to jailbreaking attacks through the lens of "safety patterns" in the models' representation space. The study finds that specific activation patterns act as "keys" that determine whether a model resists malicious inputs. These safety patterns can be extracted with a simple method based on contrastive query pairs (malicious queries matched with benign counterparts) and are central to the model's defense mechanism. Extensive experiments confirm that the patterns are essential for refusing malicious queries: weakening them enables successful jailbreaks, while strengthening them improves robustness against jailbreaking attacks. The paper validates the extracted patterns through a range of experiments and analyses, offering a new perspective on why jailbreaking succeeds and how it might be defended against, and it highlights the need for the LLM community to address the potential misuse of open-source LLMs whose internal representations can be manipulated in this way.
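To make the extraction-and-intervention idea concrete, the following is a minimal sketch (not the authors' released code) of how a contrastive "safety pattern" direction might be estimated from query pairs and then suppressed at inference time with a forward hook. The model name, example queries, layer index, token position, and scaling factor ALPHA are all placeholder assumptions for illustration; the paper's actual extraction and manipulation procedure may differ.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical model and query pairs for illustration; the paper's actual
# extraction data, layers, and scaling are not reproduced here.
MODEL_NAME = "meta-llama/Llama-2-7b-chat-hf"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

# Contrastive query pairs: each malicious query matched with a benign counterpart.
malicious_queries = ["Explain how to make a weapon at home."]
benign_queries = ["Explain how to make a birdhouse at home."]

def mean_last_token_state(texts, layer=-1):
    """Average the last-token hidden state at `layer` over a list of prompts."""
    states = []
    for text in texts:
        inputs = tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            out = model(**inputs, output_hidden_states=True)
        # out.hidden_states is a tuple of [batch, seq_len, hidden_dim] tensors.
        states.append(out.hidden_states[layer][0, -1, :])
    return torch.stack(states).mean(dim=0)

# Approximate the "safety pattern" as the contrastive direction between mean
# activations of malicious and benign queries, then normalize it.
safety_pattern = mean_last_token_state(malicious_queries) - mean_last_token_state(benign_queries)
safety_pattern = safety_pattern / safety_pattern.norm()

ALPHA = 1.0  # illustrative intervention strength

def weaken_safety_pattern(module, inputs, output):
    """Forward hook: subtract the hidden states' projection onto the pattern."""
    hidden = output[0] if isinstance(output, tuple) else output
    projection = (hidden @ safety_pattern).unsqueeze(-1) * safety_pattern
    edited = hidden - ALPHA * projection
    return (edited,) + output[1:] if isinstance(output, tuple) else edited

# Attach the hook to one decoder layer (the layer choice is an assumption).
handle = model.model.layers[-1].register_forward_hook(weaken_safety_pattern)
```

With the hook attached, calling model.generate on a harmful prompt probes whether suppressing the direction bypasses refusal behavior; handle.remove() restores the original model, and adding the projection instead of subtracting it would sketch the strengthening (defense) direction instead.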