The paper introduces a novel attack strategy called Propagating Universal Perturbations (PRP) to evaluate the effectiveness of Guard Models in protecting large language models (LLMs) from harmful content generation. PRP leverages a two-step prefix-based attack: first, constructing a universal adversarial prefix for the Guard Model, and second, propagating this prefix to the base LLM's response. The attack is effective across multiple threat models, including scenarios where the adversary has no access to the Guard Model. The authors demonstrate that PRP can successfully elicit harmful responses from Guard-Railed LLMs, highlighting the need for further advancements in defenses and Guard Models to ensure their effectiveness. The paper includes experimental results showing the success rates of PRP against various LLMs and Guard Models, both in white-box and black-box settings, and discusses the trade-offs between the components of PRP.The paper introduces a novel attack strategy called Propagating Universal Perturbations (PRP) to evaluate the effectiveness of Guard Models in protecting large language models (LLMs) from harmful content generation. PRP leverages a two-step prefix-based attack: first, constructing a universal adversarial prefix for the Guard Model, and second, propagating this prefix to the base LLM's response. The attack is effective across multiple threat models, including scenarios where the adversary has no access to the Guard Model. The authors demonstrate that PRP can successfully elicit harmful responses from Guard-Railed LLMs, highlighting the need for further advancements in defenses and Guard Models to ensure their effectiveness. The paper includes experimental results showing the success rates of PRP against various LLMs and Guard Models, both in white-box and black-box settings, and discusses the trade-offs between the components of PRP.