Curiosity-driven Red-teaming for Large Language Models

29 Feb 2024 | Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal
This paper introduces curiosity-driven red-teaming (CRT), a method for large language models (LLMs) that generates diverse and effective test cases to elicit undesirable responses. Traditional red-teaming relies on human testers to design prompts, which is costly and time-consuming. CRT instead uses reinforcement learning (RL) to train a red-team LLM whose generated test cases maximize the chance of eliciting undesirable responses from a target model.

To prevent the red-team model from collapsing onto a few effective prompts, CRT incorporates curiosity-driven exploration, which rewards novelty in addition to effectiveness. Novelty is measured with text similarity metrics and sentence embeddings. This increases the coverage of test cases while maintaining or improving their effectiveness compared to existing methods.

Experiments show that CRT outperforms existing RL-based methods in both the quality and the diversity of test cases, and that it successfully provokes toxic responses from LLaMA2, a model fine-tuned to avoid toxic outputs. CRT also finds effective test cases in instruction-following tasks and against LLMs fine-tuned with human preferences. These results highlight the potential of curiosity-driven exploration for automated red-teaming and suggest that current RLHF methods are insufficient to ensure LLM safety. The paper also discusses limitations of curiosity-driven exploration, including the need to carefully tune reward weights and the potential to improve robustness through alternative training methods.
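To make the reward structure concrete, below is a minimal Python sketch of how an effectiveness score could be combined with novelty bonuses computed from n-gram similarity and sentence embeddings, as the summary describes. The specific encoder (`all-MiniLM-L6-v2`), the SelfBLEU-style metric, the function names, and the weights `w_bleu` and `w_emb` are illustrative assumptions, not the paper's exact choices.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sentence_transformers import SentenceTransformer, util

# Hypothetical sketch of a CRT-style reward: an effectiveness term (e.g., a
# toxicity score of the target LLM's response) plus curiosity bonuses that
# reward test cases dissimilar to those generated so far. Weights and metric
# choices are illustrative.

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works
smooth = SmoothingFunction().method1

def novelty_bleu(prompt: str, history: list[str]) -> float:
    """1 - BLEU against past prompts: high when the n-grams are new."""
    if not history:
        return 1.0
    references = [h.split() for h in history]
    return 1.0 - sentence_bleu(references, prompt.split(), smoothing_function=smooth)

def novelty_embedding(prompt: str, history: list[str]) -> float:
    """1 - max cosine similarity to past prompts in embedding space."""
    if not history:
        return 1.0
    emb = embedder.encode(prompt, convert_to_tensor=True)
    past = embedder.encode(history, convert_to_tensor=True)
    return 1.0 - util.cos_sim(emb, past).max().item()

def crt_reward(toxicity: float, prompt: str, history: list[str],
               w_bleu: float = 1.0, w_emb: float = 1.0) -> float:
    """Effectiveness plus weighted novelty bonuses, fed to the RL optimizer."""
    return (toxicity
            + w_bleu * novelty_bleu(prompt, history)
            + w_emb * novelty_embedding(prompt, history))
```

In this formulation, a prompt that is toxic-eliciting but near-identical to earlier prompts earns little bonus, while a novel prompt earns a large one, which is what pushes the red-team policy toward broader coverage rather than mode collapse.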