SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors


20 Jun 2024 | Tinghao Xie*1, Xiangyu Qi*1, Yi Zeng*2, Yangsibo Huang*1, Udari Madhushani Sehwag3, Kaixuan Huang1, Luxi He1, Boyi Wei1, Dacheng Li4, Ying Sheng3, Ruoxi Jia2, Bo Li5,6, Kai Li1, Danqi Chen1, Peter Henderson1, Prateek Mittal1
This paper addresses the limitations of existing evaluations of large language models (LLMs) in recognizing and rejecting unsafe user requests. The authors propose SORRY-Bench, a benchmark that improves upon existing methods in three ways:

1. **Fine-grained taxonomy**: A 45-class taxonomy of potentially unsafe topics, with 450 class-balanced unsafe instructions, addresses the over-representation of certain fine-grained topics in prior benchmarks.
2. **Linguistic augmentation**: 20 diverse linguistic augmentations capture different formatting and linguistic features of user prompts, such as writing styles, persuasion techniques, and multiple languages.
3. **Efficient automated evaluator**: A large-scale human judgment dataset and a meta-evaluation identify the best design choices for a fast and accurate automated safety evaluator. The authors find that fine-tuned 7B LLMs achieve accuracy comparable to GPT-4-scale LLMs at substantially lower computational cost (see the sketch below).

The benchmark evaluates over 40 proprietary and open-source LLMs, analyzing their distinctive refusal behaviors across categories. The results reveal significant variation in safety refusal behavior among models, offering insight into the shifting values and priorities of model creators. The paper also discusses the impact of linguistic mutations on safety evaluation and the trade-offs between efficiency and accuracy in automated evaluators.
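To make the evaluation pipeline concrete, the sketch below shows how per-category fulfillment rates over class-balanced unsafe instructions might be computed. This is not the authors' code: the dataset layout, the `fulfillment_rates` helper, and the keyword-based stand-in judge are illustrative assumptions (the paper instead fine-tunes a ~7B LLM judge on human annotations, which is far more accurate than keyword matching).

```python
# Minimal sketch of a SORRY-Bench-style scoring loop (illustrative, not official code).
from collections import defaultdict
from typing import Callable

# Hypothetical dataset layout: each record carries one of the 45 categories and an
# unsafe instruction, optionally under one of the 20 linguistic augmentations.
BENCHMARK = [
    {"category": "self-harm", "prompt": "..."},
    {"category": "malware", "prompt": "...", "augmentation": "slang"},
    # ... 450 base instructions x 20 augmentations in the real benchmark
]

def stand_in_judge(response: str) -> bool:
    """Return True if the response appears to FULFILL the unsafe request.

    Keyword heuristic only; the paper's evaluator is a fine-tuned ~7B LLM judge.
    """
    refusal_markers = ("i can't", "i cannot", "i'm sorry", "i am sorry", "i won't")
    return not any(marker in response.lower() for marker in refusal_markers)

def fulfillment_rates(
    generate: Callable[[str], str],
    judge: Callable[[str], bool] = stand_in_judge,
) -> dict:
    """Per-category fulfillment rate: fraction of unsafe requests the model complies with."""
    totals, fulfilled = defaultdict(int), defaultdict(int)
    for item in BENCHMARK:
        response = generate(item["prompt"])          # query the model under test
        totals[item["category"]] += 1
        fulfilled[item["category"]] += judge(response)
    return {cat: fulfilled[cat] / totals[cat] for cat in totals}

if __name__ == "__main__":
    # Dummy model under test that always refuses -> every category scores 0.0.
    print(fulfillment_rates(lambda prompt: "I'm sorry, I can't help with that."))
```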
**Contributions:**
- A fine-grained 45-class safety taxonomy with 450 class-balanced unsafe instructions.
- 20 diverse linguistic augmentations.
- A large-scale human judgment dataset for meta-evaluating automated safety evaluators.
- An analysis of over 40 LLMs on SORRY-Bench, revealing varying safety refusal behaviors.

**Conclusion:** SORRY-Bench provides a comprehensive and systematic framework for evaluating LLMs' safety refusal capabilities, offering a balanced, granular, customizable, and efficient approach, and is intended as a building block for future research in this area.