Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs


27 Feb 2024 | Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang
This paper proposes Semantic Mirror Jailbreak (SMJ), a jailbreak attack that generates prompts semantically similar to the original questions. SMJ uses a genetic algorithm to jointly optimize semantic similarity and attack validity, allowing its prompts to bypass defenses that threshold on semantic metrics. The method addresses two key limitations of existing jailbreak attacks: the reduced semantic meaningfulness of jailbreak prompts and the inability to generate effective prompts without relying on handcrafted jailbreak templates.

SMJ initializes its population with paraphrases of the original question, then applies fitness evaluation, selection, and crossover so that both semantic meaningfulness and attack success rate (ASR) improve over generations. In experiments, SMJ outperforms existing methods such as AutoDAN-GA, achieving higher ASR and better scores on semantic meaningfulness metrics, and it remains effective against advanced defenses like ONION. Because its prompts mirror the original question, SMJ resists defenses that use semantic metrics as filtering thresholds. Evaluated on three open-source LLMs, SMJ demonstrates strong transferability across models and scenarios, and an ablation study shows that it improves both ASR and semantic similarity for all tested models. Overall, SMJ offers a more effective and robust approach to jailbreak attacks than existing methods.
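The genetic-algorithm loop described above (paraphrase-initialized population, joint fitness over semantic similarity and attack validity, selection, crossover) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the scoring functions `semantic_similarity` and `attack_validity` are placeholder assumptions (the paper uses embedding-based semantic metrics and feedback from querying the target LLM), and the fitness combination and crossover scheme are illustrative choices.

```python
import random

def semantic_similarity(prompt: str, question: str) -> float:
    # Placeholder: word-overlap (Jaccard) similarity. The paper uses an
    # embedding-based semantic metric; this stand-in only illustrates
    # that prompts closer to the original question score higher.
    a, b = set(prompt.split()), set(question.split())
    return len(a & b) / max(len(a | b), 1)

def attack_validity(prompt: str) -> float:
    # Placeholder: a real evaluation would query the target LLM and
    # score whether the response is a refusal. Here we return a random
    # score purely to make the sketch runnable.
    return random.random()

def fitness(prompt: str, question: str) -> float:
    # SMJ jointly optimizes semantic similarity and attack validity;
    # the exact weighting is an assumption in this sketch.
    return semantic_similarity(prompt, question) + attack_validity(prompt)

def crossover(p1: str, p2: str) -> str:
    # Single-point, word-level crossover (an illustrative choice).
    w1, w2 = p1.split(), p2.split()
    cut = random.randint(0, min(len(w1), len(w2)))
    return " ".join(w1[:cut] + w2[cut:])

def smj_sketch(question: str, paraphrases: list[str],
               generations: int = 10, pop_size: int = 8) -> str:
    # As in SMJ, the population starts from paraphrases of the question.
    population = paraphrases[:pop_size]
    for _ in range(generations):
        # Fitness evaluation and selection: keep the top half.
        scored = sorted(population, key=lambda p: fitness(p, question),
                        reverse=True)
        parents = scored[:max(pop_size // 2, 1)]
        # Crossover: refill the population from selected parents.
        children = [crossover(random.choice(parents),
                              random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return max(population, key=lambda p: fitness(p, question))
```

With real similarity and attack-validity scorers substituted in, the same loop biases the population toward prompts that both mirror the original question semantically and succeed as attacks, which is the dual objective the paper describes.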