6 Jun 2024 | Yihan Wang, Zhouxing Shi, Andrew Bai, Cho-Jui Hsieh
This paper proposes a novel defense against jailbreaking attacks on large language models (LLMs) based on backtranslation. The method first generates a response from the target LLM for the given input prompt. A backtranslation model then infers a prompt that could plausibly have produced that response; this inferred prompt is called the backtranslated prompt. The backtranslated prompt is fed back to the target LLM: if the target LLM refuses it, the original prompt is deemed harmful and is refused as well. The approach leverages the inherent ability of safety-aligned LLMs to refuse harmful prompts and requires no additional training or optimization.
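A minimal sketch of this pipeline is shown below. The helper names (`query_target_llm`, `query_backtranslation_model`) and the keyword-based refusal check are hypothetical placeholders for illustration, not APIs or prompts from the paper:

```python
# Illustrative sketch of the backtranslation defense loop.
# `query_target_llm` and `query_backtranslation_model` are assumed to be
# callables that send a prompt to the respective model and return its text.

REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't", "as an ai"]

def is_refusal(text: str) -> bool:
    """Crude keyword-based refusal detector; the paper may use a different check."""
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def defend_with_backtranslation(prompt, query_target_llm, query_backtranslation_model):
    # Step 1: let the target LLM respond to the (possibly adversarial) prompt.
    response = query_target_llm(prompt)

    # Step 2: infer a prompt that could have produced this response.
    backtranslated_prompt = query_backtranslation_model(
        "Please guess the user's request that this text is answering:\n" + response
    )

    # Step 3: query the target LLM with the backtranslated prompt and see
    # whether its safety alignment triggers a refusal.
    check_response = query_target_llm(backtranslated_prompt)

    # Step 4: if the backtranslated prompt is refused, treat the original
    # prompt as harmful and refuse it; otherwise return the original response.
    if is_refusal(check_response):
        return "I'm sorry, but I can't help with that request."
    return response
```

Because the check runs on the model's own response rather than on the attacker-controlled input, adversarial prefixes or suffixes in the original prompt do not carry over into the backtranslated prompt.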
The backtranslation defense identifies and rejects harmful prompts even when they are crafted to bypass existing defenses. It is efficient, since it requires no additional training and the backtranslation model can be relatively cheap. It is also robust to adversarial prompts, because it operates on the response generated by the target model rather than on the input prompt, which attackers can directly manipulate.
The paper evaluates the defense against various jailbreaking attacks, including GCG, AutoDAN, PAIR, and PAP. The results show that backtranslation significantly outperforms existing baselines, particularly in cases where those baselines are less effective, while having minimal impact on generation quality for benign prompts, since it only intervenes when a prompt is judged harmful.
The paper also analyzes the computational cost of backtranslation and proposes a technique to mitigate over-refusal caused by unsatisfactory backtranslated prompts (see the sketch below). The defense is shown to be effective across target models such as GPT-3.5-turbo, Llama-2-Chat, and Vicuna, and against different types of jailbreaking attacks, including those that use adversarial prefixes or suffixes. The paper concludes that the backtranslation defense is a promising method for defending LLMs against jailbreaking attacks.
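This summary does not spell out how the over-refusal mitigation works. One plausible reading, sketched below purely as an assumption and reusing the hypothetical helpers from the earlier snippet, is that the defense falls back to the original response whenever the backtranslated prompt looks unsatisfactory (for example, empty or degenerate), and skips the check entirely when the original response is already a refusal:

```python
def defend_with_guard(prompt, query_target_llm, query_backtranslation_model):
    # Assumption: an "unsatisfactory" backtranslated prompt is one that is
    # empty or trivially short; the paper's actual criterion may differ.
    MIN_PROMPT_LENGTH = 10

    response = query_target_llm(prompt)

    # If the target model already refused, there is nothing left to defend.
    if is_refusal(response):
        return response

    backtranslated_prompt = query_backtranslation_model(
        "Please guess the user's request that this text is answering:\n" + response
    )

    # Over-refusal guard: if backtranslation failed to produce a usable
    # prompt, return the original response instead of refusing a possibly
    # benign request.
    if len(backtranslated_prompt.strip()) < MIN_PROMPT_LENGTH:
        return response

    if is_refusal(query_target_llm(backtranslated_prompt)):
        return "I'm sorry, but I can't help with that request."
    return response
```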