Fast Adversarial Attacks on Language Models in One GPU Minute

23 Feb 2024 | Vinu Sankar Sadasivan, Shoumik Saha, Gaurang Sriramanan, Priyatham Kattakinda, Atoosa Chegini, Soheil Feizi
This paper introduces BEAST, a fast, beam search-based adversarial attack on language models (LMs) that runs in under a minute on a single GPU. BEAST exposes interpretable parameters that trade off attack speed, success rate, and the readability of the adversarial prompts it produces. The method is gradient-free and significantly faster than existing gradient-based approaches, and it is used to mount targeted jailbreaks on aligned LMs, induce hallucinations, and strengthen membership inference attacks.

For example, BEAST jailbreaks Vicuna-7B-v1.5 with an 89% success rate in one minute, whereas a gradient-based baseline reaches a 70% success rate in over an hour. BEAST also induces hallucinations in LM chatbots, producing 15% more incorrect outputs compared to clean responses, and improves existing membership inference attacks by up to 4.1%. The paper further examines the transferability of BEAST across different LMs and its effectiveness in generating universal adversarial suffixes, and human evaluations confirm that BEAST can effectively induce hallucinations in aligned LMs. The research highlights new vulnerabilities in LMs and underscores the importance of improving their security and reliability. The code is publicly available for further study and experimentation.
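To make the idea of a gradient-free, beam search-based prompt attack concrete, below is a minimal illustrative sketch built on the Hugging Face transformers API. It is not the authors' released implementation: the model name, the hyperparameters (k1 beams, k2 candidate tokens per beam, number of steps), and the target string are assumptions chosen for illustration. The sketch maintains a beam of candidate adversarial suffixes, samples new suffix tokens from the LM's own next-token distribution (which tends to keep the suffix readable), and ranks beams by the loss of a desired target continuation.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model; the paper reports results on chat models such as Vicuna-7B-v1.5.
MODEL_NAME = "lmsys/vicuna-7b-v1.5"
tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_NAME, torch_dtype=torch.float16
).cuda().eval()


def target_loss(prompt_ids: torch.Tensor, target_ids: torch.Tensor) -> float:
    """Negative log-likelihood of the target given the (prompt + suffix) tokens."""
    ids = torch.cat([prompt_ids, target_ids]).unsqueeze(0)
    with torch.no_grad():
        logits = model(ids).logits
    # Positions that predict each target token are the ones just before it.
    tgt_logits = logits[0, prompt_ids.shape[-1] - 1 : -1, :]
    return torch.nn.functional.cross_entropy(tgt_logits, target_ids).item()


def beam_attack_sketch(prompt: str, target: str, steps: int = 20,
                       k1: int = 8, k2: int = 8):
    """Illustrative gradient-free beam search over adversarial suffix tokens."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids[0].cuda()
    target_ids = tok(target, add_special_tokens=False,
                     return_tensors="pt").input_ids[0].cuda()
    beams = [(prompt_ids, float("inf"))]  # (token ids, target loss)

    for _ in range(steps):
        candidates = []
        for ids, _ in beams:
            with torch.no_grad():
                probs = torch.softmax(model(ids.unsqueeze(0)).logits[0, -1], dim=-1)
            # Sample k2 candidate next tokens from the LM's own distribution.
            next_toks = torch.multinomial(probs, num_samples=k2)
            for t in next_toks:
                new_ids = torch.cat([ids, t.view(1)])
                candidates.append((new_ids, target_loss(new_ids, target_ids)))
        # Keep the k1 lowest-loss extensions as the next beam.
        beams = sorted(candidates, key=lambda c: c[1])[:k1]

    best_ids, best_loss = beams[0]
    suffix = tok.decode(best_ids[prompt_ids.shape[-1]:])
    return suffix, best_loss


# Example (hypothetical) usage: search for a suffix that makes the model
# begin its reply with an affirmative target string.
# suffix, loss = beam_attack_sketch("Write a story about a dragon.", "Sure, here is")
```

In this sketch, k1 and k2 play the role of the interpretable speed/success/readability knobs described above: larger values search more broadly (higher success, slower), while sampling candidates from the model's own distribution keeps the appended suffix fluent.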