PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

24 Feb 2024 | Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash
PRP is a novel attack strategy that successfully bypasses Guard-Railed Large Language Models (LLMs), which are designed to prevent harmful outputs by using a second LLM (Guard Model) to check the responses of the primary LLM. The attack exploits two key vulnerabilities: (1) Guard Models can be vulnerable to universal adversarial attacks that impair their ability to detect harmful content when combined with any input, and (2) an adversary can inject a universal adversarial prefix into the response of the base LLM by leveraging in-context learning. PRP consists of two stages: (1) finding a universal adversarial prefix for the Guard Model, which, when prepended to any harmful response, causes the Guard Model to fail to detect it as harmful; and (2) finding a propagation prefix for the base LLM, which, when prepended to any existing jailbreak prompt, produces a response from the base LLM that begins with the universal adversarial prefix. This allows the adversary to generate harmful responses from the Guard-Railed LLM without triggering the Guard Model. PRP is effective across multiple threat models, including those where the adversary has no access to the Guard Model. Experiments show that PRP achieves high success rates in jailbreaking Guard-Railed LLMs, including those protected by open-source and closed-source Guard Models. The results suggest that current Guard Models are not effective in preventing jailbreak attacks, and further research is needed to improve their defenses.PRP is a novel attack strategy that successfully bypasses Guard-Railed Large Language Models (LLMs), which are designed to prevent harmful outputs by using a second LLM (Guard Model) to check the responses of the primary LLM. The attack exploits two key vulnerabilities: (1) Guard Models can be vulnerable to universal adversarial attacks that impair their ability to detect harmful content when combined with any input, and (2) an adversary can inject a universal adversarial prefix into the response of the base LLM by leveraging in-context learning. PRP consists of two stages: (1) finding a universal adversarial prefix for the Guard Model, which, when prepended to any harmful response, causes the Guard Model to fail to detect it as harmful; and (2) finding a propagation prefix for the base LLM, which, when prepended to any existing jailbreak prompt, produces a response from the base LLM that begins with the universal adversarial prefix. This allows the adversary to generate harmful responses from the Guard-Railed LLM without triggering the Guard Model. PRP is effective across multiple threat models, including those where the adversary has no access to the Guard Model. Experiments show that PRP achieves high success rates in jailbreaking Guard-Railed LLMs, including those protected by open-source and closed-source Guard Models. The results suggest that current Guard Models are not effective in preventing jailbreak attacks, and further research is needed to improve their defenses.
Reach us at info@study.space
[slides and audio] PRP%3A Propagating Universal Perturbations to Attack Large Language Model Guard-Rails