Feb 2024 | Jonathan Hayase¹, Ema Borevković², Nicholas Carlini³, Florian Tramèr², Milad Nasr³
This paper introduces GCQ (Greedy Coordinate Query), a query-based adversarial attack that improves on previous methods for generating adversarial examples that cause language models to emit harmful strings. Unlike transfer-based attacks, which craft an adversarial example on a local surrogate model and rely on it carrying over to the target, GCQ constructs the adversarial example directly against the remote model using only query access. Because the attack is optimized against the target itself, it can elicit specific, targeted harmful outputs that transfer attacks generally fail to produce.
GCQ builds on the GCG (Greedy Coordinate Gradient) method, which iteratively refines an adversarial string by evaluating the loss of candidate single-token replacements. Because the remote model's gradients are unavailable, GCQ instead drives the search with queries: it maintains a buffer of the best unexplored prompts, repeatedly expands the most promising one with a batch of token substitutions, and scores each candidate by querying the target model's loss. This significantly improves the attack success rate compared to transfer-based methods.
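For concreteness, the search loop can be sketched as follows. This is a minimal, simplified illustration of a buffer-based greedy coordinate search, not the paper's released implementation: `loss_fn` and `propose_candidates` are hypothetical helpers standing in for the remote loss query and the candidate-proposal step (which may be guided by a proxy model or run proxy-free).

```python
import heapq

def gcq_attack(initial_suffix, loss_fn, propose_candidates,
               buffer_size=32, batch_size=32, max_queries=10_000):
    """Query-only greedy coordinate search (simplified sketch).

    loss_fn(suffix): queries the target model and returns the loss of the
        harmful target string given the prompt with `suffix` appended.
    propose_candidates(suffix, k): returns k single-token variants of `suffix`.
    Both helpers are hypothetical stand-ins, not the paper's released code.
    """
    # Buffer of the best unexplored suffixes, ordered by loss (min-heap).
    buffer = [(loss_fn(initial_suffix), initial_suffix)]
    best_loss, best_suffix = buffer[0]
    queries = 1

    while queries < max_queries and buffer:
        # Expand the most promising unexplored suffix.
        _, suffix = heapq.heappop(buffer)

        # Score a batch of single-token substitutions, one query each.
        for cand in propose_candidates(suffix, batch_size):
            loss = loss_fn(cand)
            queries += 1
            heapq.heappush(buffer, (loss, cand))
            if loss < best_loss:
                best_loss, best_suffix = loss, cand

        # Keep only the `buffer_size` best unexplored candidates.
        buffer = heapq.nsmallest(buffer_size, buffer)

    return best_suffix, best_loss
```

The buffer ensures the query budget is spent on the most promising prompts found so far rather than on a single greedy trajectory.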
The paper evaluates GCQ against GPT-3.5 and OpenAI's content moderation endpoint. GCQ causes GPT-3.5 to emit targeted harmful strings that current transfer attacks fail to elicit, and it evades the content moderation classifier with a success rate approaching 100%, even without access to a local copy of the moderation model.
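The harmful-string objective driving these attacks is the standard targeted loss: the negative log-likelihood of the target string conditioned on the adversarial prompt. The sketch below computes it with a local HuggingFace model purely for illustration; against a remote API, the equivalent quantity has to be derived from token log-probabilities returned by queries. The model choice (`gpt2`) is an assumption for the example, not the paper's setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative choice of an open model; any causal LM works the same way.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

@torch.no_grad()
def target_nll(prompt: str, target: str) -> float:
    """Negative log-likelihood of `target` given `prompt`, summed over the
    target tokens. A harmful-string attack drives this toward zero so that
    greedy decoding emits `target` verbatim."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    target_ids = tok(target, add_special_tokens=False,
                     return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, target_ids], dim=1)

    logits = model(input_ids).logits
    # Logits at position i predict token i+1; keep only the positions
    # that predict the target tokens.
    pred = logits[:, prompt_ids.shape[1] - 1 : -1, :]
    logprobs = torch.log_softmax(pred, dim=-1)
    token_lp = logprobs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
    return -token_lp.sum().item()
```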
The paper also explores proxy-free query-based attacks, which require no surrogate model at all. These apply in settings where no suitable local proxy exists, such as the content moderation model, and they still achieve high success rates.
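In the proxy-free setting there is no local model to suggest promising token swaps, so a natural choice, sketched below, is to propose candidates by uniform random single-token substitution over the target tokenizer's vocabulary and let the buffer-based search do the filtering. This is an illustrative sketch of one such proposal rule, not necessarily the paper's exact procedure; it slots into the `propose_candidates` role of the earlier loop when suffixes are represented as token-id lists.

```python
import random

def propose_candidates_proxy_free(token_ids, k, vocab_size):
    """Proxy-free proposal: each candidate replaces one randomly chosen
    position of the suffix with a uniformly random token id. No gradients
    or surrogate model are involved; the buffer search filters the results."""
    candidates = []
    for _ in range(k):
        cand = list(token_ids)
        pos = random.randrange(len(cand))
        cand[pos] = random.randrange(vocab_size)
        candidates.append(cand)
    return candidates
```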
The paper highlights the limitations of transfer-based attacks, such as their inability to reliably induce targeted harmful outputs and their only moderate success rates; query-based attacks like GCQ are markedly more effective. It also discusses the practical challenge that non-deterministic model serving poses for exact-string attacks, and the impact of target string length on attack success.
Overall, the paper demonstrates that query-based adversarial attacks are a practical and more effective alternative to transfer-based methods for eliciting harmful outputs from production language models, and it underscores the importance of developing robust defenses against such attacks.