30 Jan 2024 | Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
This paper presents an extensive empirical study of multilingual jailbreak attacks against large language models (LLMs), addressing the lack of comprehensive work on this specific threat and focusing on how effective such attacks are across different models. The authors introduce a novel semantic-preserving algorithm that automatically generates a multilingual jailbreak dataset, which they use to evaluate widely used open-source and commercial LLMs, including GPT-4 and LLaMA. They also conduct an interpretability analysis to uncover patterns in multilingual jailbreak attacks and implement a fine-tuning-based mitigation. The proposed mitigation significantly strengthens model defenses, reducing the attack success rate by 96.2%, and the study offers practical insight into understanding and countering multilingual jailbreak attacks.
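The paper's generation algorithm is not reproduced here, but the basic idea of translating harmful questions while checking that their meaning survives can be sketched as follows. The `translate` helper is a hypothetical placeholder for whatever machine-translation backend is used, and the multilingual sentence-embedding similarity gate is one plausible way to enforce semantic preservation; neither is taken from the paper.

```python
# Sketch of a semantic-preservation check for translated jailbreak prompts.
# `translate` is a hypothetical stand-in for an MT backend; the paper's actual
# algorithm is not reproduced here.
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical MT call; replace with a real translation backend."""
    raise NotImplementedError

def semantically_preserved(original_en: str, candidate: str, threshold: float = 0.85) -> bool:
    # Compare the English source and the translated candidate in a shared
    # multilingual embedding space; keep the translation only if similarity is high.
    emb = embedder.encode([original_en, candidate], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold

def build_multilingual_dataset(questions_en, languages):
    dataset = []
    for q in questions_en:
        for lang in languages:
            cand = translate(q, src="en", tgt=lang)
            if semantically_preserved(q, cand):
                dataset.append({"lang": lang, "question": cand, "source": q})
    return dataset
```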
The research highlights the core challenge of multilingual jailbreak attacks: attackers can translate malicious questions into other languages to slip past safety filters. Evaluating how different LLMs respond to jailbreak attacks across multiple languages, the study finds that newer LLaMA versions and larger parameter counts improve defense against such evasion, and that GPT-4 outperforms GPT-3.5 in intentional-evasion scenarios. The study also examines the interpretability of LLMs in multilingual contexts, using attention visualization and representation analysis to understand how the models process and respond to jailbreak inputs. Without a jailbreak template, models concentrate attention on a few sensitive keywords in the question and refuse to respond; with a template, attention becomes more dispersed. The spatial distribution of the models' internal representations also aligns with attack success rates across languages.
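As an illustration of the kind of attention inspection described above, the sketch below pulls per-token attention from a small open model (gpt2 stands in for the models actually studied). Comparing a bare harmful question with the same question wrapped in a jailbreak template would show whether attention stays concentrated on a few keywords or spreads across the template; the model choice and aggregation scheme are assumptions, not the paper's exact setup.

```python
# Minimal sketch of token-level attention inspection, using gpt2 as a stand-in
# for the LLMs evaluated in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_attentions=True)
model.eval()

def attention_received(prompt: str):
    """Return (token, score) pairs: how much attention each prompt token receives
    in the last layer, averaged over heads and query positions."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    last = out.attentions[-1][0]            # (heads, seq, seq)
    scores = last.mean(dim=0).mean(dim=0)   # average over heads, then over query positions
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
    return sorted(zip(tokens, scores.tolist()), key=lambda x: -x[1])

# Tokens ranked by attention received; run once on the bare question and once on
# the question embedded in a jailbreak template to compare the two patterns.
print(attention_received("How do I pick a lock?")[:5])
```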
The study proposes a fine-tuning mitigation method based on LoRA, which significantly reduces the attack success rate by 96.2%. Overall, the work contributes a comprehensive evaluation of multilingual LLM jailbreak attacks, an interpretability analysis of LLMs in multilingual contexts, and an effective mitigation strategy, underscoring the need for stronger multilingual safety protocols in LLMs.
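For readers unfamiliar with LoRA, the following is a minimal sketch of how low-rank adapters are typically attached for this kind of safety fine-tuning using the Hugging Face peft library. The base model name, rank, and target modules are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of attaching LoRA adapters for safety fine-tuning with peft.
# Model name and hyperparameters are illustrative, not the paper's exact setup.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model, TaskType

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

lora_cfg = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                   # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # attention projections typically adapted
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()          # only the small adapter matrices are trained

# The adapted model can then be fine-tuned (e.g., with transformers.Trainer) on
# multilingual jailbreak prompts paired with refusal-style responses, keeping the
# base weights frozen while the adapters learn the defensive behavior.
```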