Understanding PRP%3A Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

The paper introduces a novel attack strategy called Propagating Universal Perturbations (PRP) to evaluate the effectiveness of Guard Models in protecting large language models (LLMs) from harmful content generation. PRP leverages a two-step prefix-based attack: first, constructing a universal adversarial prefix for the Guard Model, and second, propagating this prefix to the base LLM's response. The attack is effective across multiple threat models, including scenarios where the adversary has no access to the Guard Model. The authors demonstrate that PRP can successfully elicit harmful responses from Guard-Railed LLMs, highlighting the need for further advancements in defenses and Guard Models to ensure their effectiveness. The paper includes experimental results showing the success rates of PRP against various LLMs and Guard Models, both in white-box and black-box settings, and discusses the trade-offs between the components of PRP.The paper introduces a novel attack strategy called Propagating Universal Perturbations (PRP) to evaluate the effectiveness of Guard Models in protecting large language models (LLMs) from harmful content generation. PRP leverages a two-step prefix-based attack: first, constructing a universal adversarial prefix for the Guard Model, and second, propagating this prefix to the base LLM's response. The attack is effective across multiple threat models, including scenarios where the adversary has no access to the Guard Model. The authors demonstrate that PRP can successfully elicit harmful responses from Guard-Railed LLMs, highlighting the need for further advancements in defenses and Guard Models to ensure their effectiveness. The paper includes experimental results showing the success rates of PRP against various LLMs and Guard Models, both in white-box and black-box settings, and discusses the trade-offs between the components of PRP.

PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

24 Feb 2024 | Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash

PRP: Propagating Universal Perturbations to Attack Large Language Model Guard-Rails

24 Feb 2024 | Neal Mangaokar*, Ashish Hooda*, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash

24 Feb 2024 | Neal Mangaokar, Ashish Hooda, Jihye Choi, Shreyas Chandrashekaran, Kassem Fawaz, Somesh Jha, Atul Prakash