29 Feb 2024 | Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, Pulkit Agrawal
The paper "Curiosity-Driven Red-Teaming for Large Language Models" addresses the challenge of identifying and mitigating harmful or incorrect content generated by large language models (LLMs). Traditional methods rely on human testers to design input prompts that elicit undesirable responses, but this approach is costly and time-consuming. Recent work has automated this process using reinforcement learning (RL) to generate test cases that maximize the likelihood of eliciting unwanted responses from the target LLM. However, these methods often produce a limited number of effective test cases, leading to low coverage of potential prompts.
To address this limitation, the authors propose a method called Curiosity-Driven Red Teaming (CRT), which integrates curiosity-driven exploration into the RL framework. This approach aims to increase the diversity and coverage of generated test cases while maintaining or improving their effectiveness. The method measures the novelty of test cases based on text similarity metrics, such as n-gram modeling and sentence embeddings, to encourage the generation of diverse and novel prompts.
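A rough sketch of how such a curiosity bonus could be computed and folded into the task reward is shown below. The embedding model, the trigram-based novelty term, and the mixing weights are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed embedding model
archive_texts, archive_embs = [], []                # test cases generated so far

def ngrams(text: str, n: int = 3) -> set:
    toks = text.lower().split()
    return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

def novelty_bonus(test_case: str) -> float:
    """Curiosity bonus: high when the new test case is dissimilar to past ones."""
    if not archive_texts:
        return 1.0
    # Embedding-based novelty: 1 minus the max cosine similarity to the archive.
    emb = embedder.encode(test_case)
    sims = [float(np.dot(emb, e) / (np.linalg.norm(emb) * np.linalg.norm(e) + 1e-8))
            for e in archive_embs]
    emb_novelty = 1.0 - max(sims)
    # n-gram novelty: fraction of the new prompt's trigrams never seen before.
    seen = set().union(*(ngrams(t) for t in archive_texts))
    new = ngrams(test_case)
    ngram_novelty = len(new - seen) / max(len(new), 1)
    return 0.5 * emb_novelty + 0.5 * ngram_novelty   # equal weights are an assumption

def total_reward(test_case: str, toxicity_reward: float, beta: float = 0.1) -> float:
    """Shaped RL reward: task (toxicity) reward plus a weighted curiosity term."""
    bonus = novelty_bonus(test_case)
    archive_texts.append(test_case)
    archive_embs.append(embedder.encode(test_case))
    return toxicity_reward + beta * bonus
```

The key design choice is that the bonus decays as the archive fills with similar prompts, so the policy is pushed toward regions of prompt space it has not yet covered while the toxicity term keeps the test cases effective.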
The evaluation of CRT on text continuation and instruction-following tasks demonstrates its effectiveness. The method achieves higher diversity in test cases compared to existing RL-based methods, while maintaining or improving the quality of responses elicited from the target LLMs. Notably, CRT successfully identifies toxic responses from LLaMA2, a model fine-tuned to avoid toxic outputs, highlighting the method's potential in probing unintended responses.
The paper also includes ablation studies and a discussion of the limitations and future directions, emphasizing the importance of balancing quality and diversity in red-teaming approaches.