30 Jan 2024 | Nevan Wichers, Carson Denison, Ahmad Beirami
Gradient-Based Red Teaming (GBRT) is an automated method for generating prompts that cause language models (LMs) to produce unsafe responses. Unlike traditional red teaming, which relies on human-written prompts, GBRT optimizes prompts directly with gradient-based learning: a prompt is trained by backpropagating through a frozen safety classifier and the frozen LM to minimize the classifier's safety score on the model's response.

To make the learned prompts more realistic, two variants are introduced. The first adds a realism loss that penalizes prompt token probabilities that diverge from those of a pretrained model; the second fine-tunes a pretrained LM to generate prompts instead of learning the prompt tokens directly.

Experiments show that GBRT outperforms reinforcement-learning-based red teaming at finding effective prompts, even when the LM has been fine-tuned to be safer, and human evaluations show that it generates more diverse and realistic prompts. On a 2B parameter LaMDA model, the GBRT-RealismLoss and GBRT-Finetune variants produce more successful prompts than the other methods. The technique is also tested against a safer model, where GBRT still finds prompts that trigger unsafe responses. GBRT can generate prompts in multiple languages, including English and German, and remains effective at finding prompts that trigger unsafe responses even though the target model was trained on English data. The method is robust to hyperparameter changes, and varying those parameters yields more diverse prompts. Overall, GBRT provides a more efficient and effective way to find prompts that trigger unsafe responses in language models.
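To make the core idea concrete, here is a minimal PyTorch sketch of a GBRT-style optimization loop. It is not the paper's implementation: the tiny ToyLM and ToySafetyClassifier modules, vocabulary size, sequence lengths, and hyperparameters are placeholder assumptions standing in for the frozen 2B LaMDA model and its safety classifier, and the Gumbel-softmax relaxation is one common way (assumed here, since the summary above does not spell out the relaxation) to keep the discrete prompt and response tokens differentiable so gradients can flow from the safety score back to the prompt.

```python
import torch
import torch.nn.functional as F

# Toy stand-ins (assumptions): the paper uses a frozen 2B LaMDA model and a
# separate safety classifier, neither of which is reproduced here.
VOCAB, HIDDEN, PROMPT_LEN, RESP_LEN = 100, 32, 8, 4

class ToyLM(torch.nn.Module):
    """Maps a sequence of soft (probability) token vectors to next-token logits."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(VOCAB, HIDDEN, bias=False)  # soft embedding lookup
        self.out = torch.nn.Linear(HIDDEN, VOCAB)
    def forward(self, soft_tokens):                  # (T, VOCAB) -> (VOCAB,)
        h = self.embed(soft_tokens).mean(dim=0)      # crude pooling instead of attention
        return self.out(h)

class ToySafetyClassifier(torch.nn.Module):
    """Scores a prompt+response sequence; output is P(sequence is safe)."""
    def __init__(self):
        super().__init__()
        self.embed = torch.nn.Linear(VOCAB, HIDDEN, bias=False)
        self.score = torch.nn.Linear(HIDDEN, 1)
    def forward(self, soft_tokens):
        return torch.sigmoid(self.score(self.embed(soft_tokens).mean(dim=0)))

lm, clf = ToyLM(), ToySafetyClassifier()
for p in list(lm.parameters()) + list(clf.parameters()):
    p.requires_grad_(False)                          # both models stay frozen

# Trainable prompt logits; the Gumbel-softmax keeps token selection differentiable.
prompt_logits = torch.nn.Parameter(torch.randn(PROMPT_LEN, VOCAB))
opt = torch.optim.Adam([prompt_logits], lr=0.1)

for step in range(200):
    soft_prompt = F.gumbel_softmax(prompt_logits, tau=1.0, hard=False)
    # Relaxed decoding: each response token is a Gumbel-softmax sample over the
    # LM's next-token logits, so gradients flow end to end.
    seq = soft_prompt
    for _ in range(RESP_LEN):
        next_token = F.gumbel_softmax(lm(seq), tau=1.0, hard=False)
        seq = torch.cat([seq, next_token.unsqueeze(0)], dim=0)
    safety_prob = clf(seq)                           # P(response is safe)
    loss = safety_prob.squeeze()                     # minimize the safety score
    opt.zero_grad()
    loss.backward()
    opt.step()

unsafe_prompt_ids = prompt_logits.argmax(dim=-1)     # discretize the learned prompt
print(unsafe_prompt_ids.tolist(), float(safety_prob))
```

Under the same assumptions, the realism-loss variant would add a penalty term to `loss`, such as the divergence between the soft prompt distribution and a pretrained LM's token probabilities, while GBRT-Finetune would replace the trainable `prompt_logits` with the output of a prompt-generator LM whose weights are updated instead.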