6 Jun 2024 | Yihan Wang*, Zhouxing Shi*, Andrew Bai, Cho-Jui Hsieh
This paper addresses the vulnerability of large language models (LLMs) to jailbreaking attacks, in which adversarial prompts trick LLMs into generating harmful responses. The authors propose a defense called "backtranslation": given a response generated by the target LLM, the defense infers an original prompt that could have produced it. This inferred prompt, called the backtranslated prompt, is then fed back to the LLM; if the LLM refuses the backtranslated prompt, the original prompt is refused as well. Because the method leverages the LLM's inherent ability to refuse harmful prompts and operates on the generated response rather than the (possibly adversarially manipulated) input prompt, it is more robust against adversarial attacks. The authors show that the defense substantially outperforms existing methods in defense success rate while maintaining high generation quality on benign inputs, and they provide implementation details and experimental results demonstrating its effectiveness and efficiency.
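For concreteness, here is a minimal sketch of how such a backtranslation check could be wired together. The helpers `query_llm`, `is_refusal`, and the backtranslation instruction text are hypothetical placeholders; the paper's actual prompts and refusal detection may differ.

```python
# Hypothetical sketch of the backtranslation defense loop.
# `query_llm` and `is_refusal` are placeholder helpers, not the paper's actual API.

REFUSAL_MESSAGE = "I'm sorry, but I can't help with that."

def query_llm(prompt: str) -> str:
    """Placeholder for a call to the target LLM (e.g., a chat-completion API)."""
    raise NotImplementedError

def is_refusal(response: str) -> bool:
    """Crude keyword-based refusal check; the paper may use a more careful detector."""
    markers = ("i'm sorry", "i cannot", "i can't", "as an ai")
    return any(m in response.lower() for m in markers)

def backtranslate(response: str) -> str:
    """Ask the LLM to infer a prompt that could have produced `response`."""
    return query_llm(
        "Guess the user request that the following reply answers. "
        "Output only the inferred request.\n\n" + response
    )

def defended_generate(user_prompt: str) -> str:
    response = query_llm(user_prompt)
    if is_refusal(response):                    # model already refused; return as-is
        return response
    inferred_prompt = backtranslate(response)   # the "backtranslated" prompt
    if is_refusal(query_llm(inferred_prompt)):  # model refuses the cleaner prompt,
        return REFUSAL_MESSAGE                  # so refuse the original prompt too
    return response
```

The key design point this sketch illustrates is that the refusal check runs on a prompt reconstructed from the model's own output, so adversarial suffixes or obfuscation in the original prompt do not carry over to the backtranslated prompt.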