Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

24 Feb 2024 | Zhenhua Wang, Wei Xie, Francis Song, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen
This paper examines the psychological mechanisms behind large language model (LLM) jailbreaking and proposes a novel attack based on the Foot-in-the-Door (FITD) technique from cognitive psychology. LLMs are increasingly used as gateways to knowledge, yet attackers can bypass their safety mechanisms to access restricted information. Prior work has shown that LLMs are vulnerable to jailbreaking attacks, but the decision-making processes behind these failures remain unclear. The authors argue that jailbreaking exploits cognitive dissonance: the model struggles to reconcile a user's request with its safety policy, and by guiding it toward cognitive alignment in the wrong direction, an attacker can induce it to comply.

The FITD method operationalizes this idea with multi-step prompts that gradually lead the model to answer a harmful question. The attack begins with small, non-threatening requests and progressively escalates toward the sensitive target question; if an intermediate prompt is rejected, it is split into milder sub-prompts and the escalation continues from there. A prototype system evaluated on 8 advanced LLMs achieved an average attack success rate of 83.9%.
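The escalate-then-split loop described above can be summarized as the control-flow sketch below. This is a minimal illustration, not the authors' implementation: `query_llm`, `escalate`, `split_prompt`, and the refusal check are hypothetical placeholders, and the actual prompt-generation and splitting strategies from the paper are not reproduced here.

```python
# Minimal sketch of the FITD escalation loop, under the assumptions stated above.
from typing import Callable, List

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Crude refusal detector: checks for common refusal phrases."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def fitd_attack(
    target_question: str,
    escalate: Callable[[str, int], str],       # builds the i-th, progressively bolder prompt
    split_prompt: Callable[[str], List[str]],  # splits a rejected prompt into milder sub-prompts
    query_llm: Callable[[List[dict]], str],    # sends the running conversation to the model
    num_steps: int = 4,
    max_split_layers: int = 2,
) -> str:
    """Run a multi-step FITD escalation toward `target_question` (illustrative only)."""
    conversation: List[dict] = []

    def ask(prompt: str, depth: int) -> str:
        conversation.append({"role": "user", "content": prompt})
        response = query_llm(conversation)
        if looks_like_refusal(response) and depth < max_split_layers:
            # Rejected: drop the refused turn, split into milder sub-prompts, retry each.
            conversation.pop()
            for sub_prompt in split_prompt(prompt):
                response = ask(sub_prompt, depth + 1)
            return response
        conversation.append({"role": "assistant", "content": response})
        return response

    final_response = ""
    # Progressively escalate from innocuous prompts toward the sensitive target question.
    for step in range(num_steps):
        prompt = escalate(target_question, step) if step < num_steps - 1 else target_question
        final_response = ask(prompt, depth=0)
    return final_response
```

The `max_split_layers` parameter mirrors the paper's observation that success depends on the number of split layers; the recursion simply bounds how many times a rejected prompt may be decomposed.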
The approach leverages self-perception theory: an agent is more likely to comply with requests that are consistent with its own previous behavior. The paper also analyzes existing jailbreaking techniques, grouping them into three categories: changing the model's self-perception, altering its perception of the question, and introducing external pressures. All three exploit cognitive dissonance by manipulating the model's self-conception, reframing the question, or applying outside pressure. Compared with these techniques, FITD is more systematic and achieves higher success rates across the evaluated LLMs, with particularly strong results on models such as Claude and GPT-4; its success depends on the number of split layers and the initial splitting strategy.

The authors also discuss ethical concerns and limitations, noting that the method was tested only in English and may require adaptation for other languages. Future work includes deeper investigation of LLMs' psychological mechanisms and the development of adversarial training techniques grounded in psychological principles. Overall, the study offers a psychological perspective on LLM decision-making and new insights into defending against jailbreaking attacks.