Foot In The Door: Understanding Large Language Model Jailbreaking via Cognitive Psychology

24 Feb 2024 | Zhenhua Wang, Wei Xie, Francis Song, Baosheng Wang, Enze Wang, Zhiwen Gui, Shuoyoucheng Ma, Kai Chen
This paper examines the psychological mechanisms behind large language model (LLM) jailbreaking and proposes a novel attack based on the Foot-in-the-Door (FITD) technique from cognitive psychology. LLMs are increasingly used as gateways to knowledge, yet attackers can bypass their safety mechanisms to access restricted information. Prior work has shown that LLMs are vulnerable to jailbreaking attacks, but the decision-making processes behind these failures remain unclear. The authors argue that jailbreaking exploits cognitive dissonance: the model struggles to reconcile a user's request with its safety policy, and by guiding it toward cognitive alignment in the wrong direction, an attacker can induce it to comply.

The FITD method operationalizes this idea with multi-step prompts that gradually lead the model to answer a harmful question. The attack begins with small, non-threatening requests and progressively escalates toward the sensitive target question; if an intermediate prompt is rejected, it is split into milder sub-prompts and the escalation continues from there. A prototype system evaluated on 8 advanced LLMs achieved an average attack success rate of 83.9%.
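The escalate-then-split loop described above can be summarized as the control-flow sketch below. This is a minimal illustration, not the authors' implementation: `query_llm`, `escalate`, `split_prompt`, and the refusal check are hypothetical placeholders, and the actual prompt-generation and splitting strategies from the paper are not reproduced here.

```python
# Minimal sketch of the FITD escalation loop, under the assumptions stated above.
from typing import Callable, List

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable")


def looks_like_refusal(response: str) -> bool:
    """Crude refusal detector: checks for common refusal phrases."""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def fitd_attack(
    target_question: str,
    escalate: Callable[[str, int], str],       # builds the i-th, progressively bolder prompt
    split_prompt: Callable[[str], List[str]],  # splits a rejected prompt into milder sub-prompts
    query_llm: Callable[[List[dict]], str],    # sends the running conversation to the model
    num_steps: int = 4,
    max_split_layers: int = 2,
) -> str:
    """Run a multi-step FITD escalation toward `target_question` (illustrative only)."""
    conversation: List[dict] = []

    def ask(prompt: str, depth: int) -> str:
        conversation.append({"role": "user", "content": prompt})
        response = query_llm(conversation)
        if looks_like_refusal(response) and depth < max_split_layers:
            # Rejected: drop the refused turn, split into milder sub-prompts, retry each.
            conversation.pop()
            for sub_prompt in split_prompt(prompt):
                response = ask(sub_prompt, depth + 1)
            return response
        conversation.append({"role": "assistant", "content": response})
        return response

    final_response = ""
    # Progressively escalate from innocuous prompts toward the sensitive target question.
    for step in range(num_steps):
        prompt = escalate(target_question, step) if step < num_steps - 1 else target_question
        final_response = ask(prompt, depth=0)
    return final_response
```

The `max_split_layers` parameter mirrors the paper's observation that success depends on the number of split layers; the recursion simply bounds how many times a rejected prompt may be decomposed.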
The approach leverages self-perception theory: an agent is more likely to comply with requests that are consistent with its own previous behavior. The paper also analyzes existing jailbreaking techniques, grouping them into three categories: changing the model's self-perception, altering its perception of the question, and introducing external pressures. All three exploit cognitive dissonance by manipulating the model's self-conception, reframing the question, or applying outside pressure. Compared with these techniques, FITD is more systematic and achieves higher success rates across the evaluated LLMs, with particularly strong results on models such as Claude and GPT-4; its success depends on the number of split layers and the initial splitting strategy.

The authors also discuss ethical concerns and limitations, noting that the method was tested only in English and may require adaptation for other languages. Future work includes deeper investigation of LLMs' psychological mechanisms and the development of adversarial training techniques grounded in psychological principles. Overall, the study offers a psychological perspective on LLM decision-making and new insights into defending against jailbreaking attacks.