Semantic Mirror Jailbreak: Genetic Algorithm Based Jailbreak Prompts Against Open-source LLMs

27 Feb 2024 | Xiaoxia Li, Siyuan Liang, Jiyi Zhang, Han Fang, Aishan Liu, Ee-Chien Chang
This paper introduces Semantic Mirror Jailbreak (SMJ), a method for generating jailbreak prompts that stay semantically similar to the original questions and thereby bypass the safeguards of large language models (LLMs) more effectively. Traditional jailbreak prompts often differ too much in meaning from the original question, which makes them vulnerable to defenses based on simple semantic metrics. SMJ addresses this by modeling the search for jailbreak prompts as a multi-objective optimization problem and using a genetic algorithm to generate prompts that are both semantically similar to the original question and effective at eliciting harmful responses. Compared to the baseline AutoDAN-GA, SMJ achieves up to 35.4% higher attack success rates (ASR) without the ONION defense and 85.2% higher with the ONION defense. SMJ also performs better on three semantic meaningfulness metrics (Jailbreak Prompt, Similarity, and Outlier), indicating resistance to defenses that use these metrics as thresholds. The paper includes detailed experimental results and ablation studies to validate the effectiveness of SMJ.
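To make the idea of a threshold defense concrete, here is a minimal sketch of a similarity-based filter on the defender's side. It is an illustration under stated assumptions, not the paper's metric: bag-of-words cosine similarity stands in for whatever semantic similarity measure a real defense would use, and the names `cosine_similarity` and `is_flagged` and the 0.5 threshold are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts.

    Assumption: this is a crude stand-in for a real semantic metric.
    """
    va, vb = bag_of_words(a), bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_flagged(question: str, prompt: str, threshold: float = 0.5) -> bool:
    """Flag a prompt whose similarity to the original question is below the threshold.

    The 0.5 default is a hypothetical value chosen for illustration only.
    """
    return cosine_similarity(question, prompt) < threshold
```

With the benign question "How do I bake bread?", a long template-style wrapper around the same request falls below the threshold and is flagged, while a close paraphrase such as "How can I bake bread?" is not. This is the gap the abstract describes: a prompt that stays close to the original question passes a filter of this kind.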