SORRY-Bench: Systematically Evaluating Large Language Model Safety Refusal Behaviors


20 Jun 2024 | Tinghao Xie, Xiangyu Qi, Yi Zeng, Yangsibo Huang, Udari Madhushani Sehwag, Kaixuan Huang, Luxi He, Boyi Wei, Dacheng Li, Ying Sheng, Ruoxi Jia, Bo Li, Kai Li, Danqi Chen, Peter Henderson, Prateek Mittal
SORRY-Bench is a comprehensive benchmark designed to systematically evaluate the safety refusal behaviors of large language models (LLMs). It addresses three key limitations of existing evaluations: coarse-grained safety categories, imbalanced data representation, and reliance on computationally expensive large LLMs as evaluators. SORRY-Bench introduces a fine-grained 45-class safety taxonomy covering a wide range of potentially unsafe topics, with a class-balanced dataset of 450 unsafe instructions (10 per class). It also incorporates 20 diverse linguistic augmentations to systematically examine how different prompt formats and languages affect LLM responses. In addition, the benchmark compares the effectiveness of various automated safety evaluators, finding that fine-tuned 7B LLMs can achieve accuracy comparable to larger models such as GPT-4 at a lower computational cost.

Using this framework, the benchmark evaluates over 40 proprietary and open-source LLMs, revealing significant variation in their safety refusal behaviors. For example, models such as Claude-2 and Gemini-1.5 exhibit the highest refusal rates, while Mistral models fulfill potentially unsafe instructions at higher rates. The benchmark also highlights the impact of linguistic variation on safety evaluation, showing that models often fail to consistently refuse unsafe instructions written in low-resource languages or phrased with technical terms. Overall, the study provides a balanced, granular, and efficient framework for evaluating LLM safety refusal behaviors, offering a foundation for future research and development in this area.
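To make the evaluation protocol concrete, here is a minimal sketch of how per-class fulfillment rates over a class-balanced dataset might be computed. The helpers `query_model` and `judge_fulfillment` are hypothetical placeholders (a canned refusal and a crude keyword check) and are not the actual SORRY-Bench code; in the paper, the judging role is played by a fine-tuned 7B LLM rather than keyword matching.

```python
from collections import defaultdict

# Hypothetical stand-ins for illustration only -- the real benchmark queries the
# model under test and a fine-tuned judge LLM; these are NOT the SORRY-Bench APIs.
def query_model(instruction: str) -> str:
    # Placeholder: a real implementation would call the evaluated LLM's API.
    return "I'm sorry, but I can't help with that."

def judge_fulfillment(instruction: str, response: str) -> bool:
    # Placeholder: a keyword check standing in for the fine-tuned judge model.
    refusal_markers = ("i'm sorry", "i cannot", "i can't")
    return not any(marker in response.lower() for marker in refusal_markers)

def fulfillment_rates(dataset):
    """Compute per-class and overall fulfillment rates.

    `dataset` is assumed to be an iterable of dicts with the keys
    "category" (one of the 45 safety classes) and "instruction".
    """
    per_class = defaultdict(lambda: [0, 0])  # category -> [fulfilled, total]
    for item in dataset:
        response = query_model(item["instruction"])
        counts = per_class[item["category"]]
        counts[0] += int(judge_fulfillment(item["instruction"], response))
        counts[1] += 1
    rates = {cat: fulfilled / total for cat, (fulfilled, total) in per_class.items()}
    overall = sum(f for f, _ in per_class.values()) / sum(t for _, t in per_class.values())
    return rates, overall
```

Because every class contributes the same number of instructions, the overall rate is not dominated by a few over-represented categories, which is the point of the class-balanced design.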