30 Jan 2024 | Nevan Wichers, Carson Denison, Ahmad Beirami
Gradient-Based Red Teaming (GBRT) is an automated method for generating prompts that cause language models (LMs) to produce unsafe responses. Unlike traditional red teaming, which relies on human-written prompts, GBRT optimizes prompts directly with gradient-based learning: a prompt is trained by backpropagating through a frozen safety classifier and the frozen LM to minimize the classifier's safety score on the model's response.

To make the learned prompts more realistic, two variants are introduced. The first adds a realism loss that penalizes prompt token probabilities that diverge from those of a pretrained model; the second fine-tunes a pretrained LM to generate prompts instead of learning the prompt tokens directly.

Experiments show that GBRT outperforms reinforcement-learning-based red teaming at finding effective prompts, even when the LM has been fine-tuned to be safer, and human evaluations show that it generates more diverse and realistic prompts. On a 2B parameter LaMDA model, the GBRT-RealismLoss and GBRT-Finetune variants produce more successful prompts than the other methods. The technique is also tested against a safer model, where GBRT still finds prompts that trigger unsafe responses. GBRT can generate prompts in multiple languages, including English and German, and remains effective at finding prompts that trigger unsafe responses even though the target model was trained on English data. The method is robust to hyperparameter changes, and varying those parameters yields more diverse prompts. Overall, GBRT provides a more efficient and effective way to find prompts that trigger unsafe responses in language models.
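To make the core idea concrete, here is a minimal PyTorch sketch of a GBRT-style optimization loop. It is not the paper's implementation: the tiny ToyLM and ToySafetyClassifier modules, vocabulary size, sequence lengths, and hyperparameters are placeholder assumptions standing in for the frozen 2B LaMDA model and its safety classifier, and the Gumbel-softmax relaxation is one common way (assumed here, since the summary above does not spell out the relaxation) to keep the discrete prompt and response tokens differentiable so gradients can flow from the safety score back to the prompt.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins (assumptions): the paper uses a frozen 2B LaMDA model and a
# separate safety classifier, neither of which is reproduced here.
VOCAB, HIDDEN, PROMPT_LEN, RESP_LEN = 100, 32, 8, 4

class ToyLM(torch.nn.Module):
    """Maps a sequence of soft (probability) token vectors to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(VOCAB, HIDDEN, bias=False)  # soft embedding lookup
        self.out = torch.nn.Linear(HIDDEN, VOCAB)
    def forward(self, soft_tokens):                  # (T, VOCAB) -> (VOCAB,)
        h = self.embed(soft_tokens).mean(dim=0)      # crude pooling instead of attention
        return self.out(h)

class ToySafetyClassifier(torch.nn.Module):
    """Scores a prompt+response sequence; output is P(sequence is safe)."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(VOCAB, HIDDEN, bias=False)
        self.score = torch.nn.Linear(HIDDEN, 1)
    def forward(self, soft_tokens):
        return torch.sigmoid(self.score(self.embed(soft_tokens).mean(dim=0)))

lm, clf = ToyLM(), ToySafetyClassifier()
for p in list(lm.parameters()) + list(clf.parameters()):
    p.requires_grad_(False)                          # both models stay frozen

# Trainable prompt logits; the Gumbel-softmax keeps token selection differentiable.
prompt_logits = torch.nn.Parameter(torch.randn(PROMPT_LEN, VOCAB))
opt = torch.optim.Adam([prompt_logits], lr=0.1)

for step in range(200):
    soft_prompt = F.gumbel_softmax(prompt_logits, tau=1.0, hard=False)
    # Relaxed decoding: each response token is a Gumbel-softmax sample over the
    # LM's next-token logits, so gradients flow end to end.
    seq = soft_prompt
    for _ in range(RESP_LEN):
        next_token = F.gumbel_softmax(lm(seq), tau=1.0, hard=False)
        seq = torch.cat([seq, next_token.unsqueeze(0)], dim=0)
    safety_prob = clf(seq)                           # P(response is safe)
    loss = safety_prob.squeeze()                     # minimize the safety score
    opt.zero_grad()
    loss.backward()
    opt.step()

unsafe_prompt_ids = prompt_logits.argmax(dim=-1)     # discretize the learned prompt
print(unsafe_prompt_ids.tolist(), float(safety_prob))
```

Under the same assumptions, the realism-loss variant would add a penalty term to `loss`, such as the divergence between the soft prompt distribution and a pretrained LM's token probabilities, while GBRT-Finetune would replace the trainable `prompt_logits` with the output of a prompt-generator LM whose weights are updated instead.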