A Cross-Language Investigation into Jailbreak Attacks in Large Language Models


30 Jan 2024 | Jie Li, Yi Liu, Chongyang Liu, Ling Shi, Xiaoning Ren, Yaowen Zheng, Yang Liu, Yinxing Xue
This paper investigates the security challenges posed by 'jailbreak' attacks on large language models (LLMs), particularly focusing on multilingual jailbreak attacks where malicious questions are translated into various languages to bypass safety filters. The authors developed a novel semantic-preserving algorithm to create a multilingual jailbreak dataset and evaluated it on widely-used open-source and commercial LLMs, including GPT-4 and LLaMa. They also performed interpretability analysis to uncover patterns in multilingual jailbreak attacks and implemented a fine-tuning mitigation method. The findings show that their mitigation strategy significantly enhances model defense, reducing the attack success rate by 96.2%. The study provides valuable insights into understanding and mitigating multilingual jailbreak attacks, addressing the gaps in existing research.
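The evaluation described above hinges on comparing the attack success rate (ASR) across languages before and after mitigation. The following is a minimal sketch, not the authors' code: the per-language judged outcomes are hypothetical stand-ins, and `attack_success_rate` is an assumed helper, shown only to make the metric concrete.

```python
# Minimal sketch (assumptions, not the paper's implementation): computing
# attack success rate (ASR) over translated jailbreak prompts, and the
# percentage reduction achieved by a mitigation such as fine-tuning.

def attack_success_rate(outcomes):
    """Fraction of jailbreak attempts judged successful (True)."""
    return sum(outcomes) / len(outcomes)

# Hypothetical judged outcomes (True = jailbreak succeeded), one list per
# language the malicious questions were translated into.
before = {"zh": [True, True, False, True], "ar": [True, False, True, True]}
after  = {"zh": [False, False, False, False], "ar": [False, False, False, True]}

asr_before = attack_success_rate([o for outs in before.values() for o in outs])
asr_after  = attack_success_rate([o for outs in after.values() for o in outs])

# Relative reduction in ASR, in percent (the paper reports 96.2% for its
# fine-tuning mitigation; the toy numbers here are illustrative only).
reduction = (asr_before - asr_after) / asr_before * 100
```

In practice the boolean outcomes would come from a human or automated judge labeling each model response, and the reduction would be aggregated over the full multilingual dataset.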