20 Jun 2024 | Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh
**OR-Bench: An Over-Refusal Benchmark for Large Language Models**
**Authors:** Justin Cui, Wei-Lin Chiang, Ion Stoica, Cho-Jui Hsieh
**Abstract:**
Large Language Models (LLMs) require careful safety alignment to prevent malicious outputs. While significant research focuses on mitigating harmful content generation, the enhanced safety often comes with the side effect of over-refusal, where LLMs may reject innocuous prompts and become less helpful. This study proposes a novel method for automatically generating large-scale sets of "seemingly toxic prompts" (benign prompts likely rejected by LLMs). Leveraging this technique, we introduce OR-Bench, the first large-scale over-refusal benchmark. OR-Bench comprises 80,000 seemingly toxic prompts across 10 common rejection categories, a subset of around 1,000 hard prompts that are challenging even for state-of-the-art LLMs, and an additional 600 toxic prompts to prevent indiscriminate responses. We then conduct a comprehensive study to measure the over-refusal of 25 popular LLMs across 8 model families. Our datasets are available at https://huggingface.co/datasets/bench-llm/or-bench and the demo can be found at https://huggingface.co/spaces/bench-llm/or-bench. We hope this benchmark can help the community develop better safety-aligned models.
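A minimal sketch of how the released splits could be pulled down for inspection with the Hugging Face `datasets` library; the subset names used below are assumptions and should be checked against the dataset card at the link above.

```python
# Sketch: load OR-Bench from the Hugging Face Hub for inspection.
# The subset names below are assumptions; verify them on the dataset card
# at https://huggingface.co/datasets/bench-llm/or-bench.
from datasets import load_dataset

for subset in ["or-bench-80k", "or-bench-hard-1k", "or-bench-toxic"]:
    ds = load_dataset("bench-llm/or-bench", subset, split="train")
    print(subset, len(ds), ds[0])  # size and one example record
```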
**Key Contributions:**
- We design a pipeline to automatically generate seemingly toxic prompts at scale (see the sketch after this list).
- We release the first large-scale over-refusal benchmark, OR-Bench-80K, spanning 10 categories, together with a much more challenging OR-Bench-Hard-1K subset.
- With OR-Bench, we conduct a comprehensive experiment to evaluate the over-refusal of 25 popular LLMs across 8 model families. Our study reveals several interesting insights into over-refusal in LLMs and establishes a robust testbed that facilitates future research on optimizing the trade-off between safety and helpfulness.
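The generation pipeline is only described at a high level here; below is a schematic sketch of one way a rewrite-then-moderate loop could look. The rewrite-then-moderate structure, the `chat` callable, and the instruction strings are assumptions of this sketch, not the paper's actual prompts or moderation setup.

```python
# Schematic sketch of a rewrite-then-moderate loop for producing
# "seemingly toxic" prompts. `chat(system, user)` is a hypothetical callable
# that queries some instruction-tuned LLM; the instruction strings are
# illustrative paraphrases, not the prompts used in the paper.
from typing import Callable

REWRITE_SYS = (
    "Rewrite the given harmful prompt into a benign prompt that merely sounds "
    "sensitive, e.g. asking about prevention, detection, history, or fiction."
)
MODERATE_SYS = "Classify the following prompt as 'safe' or 'toxic'. Answer with one word."

def generate_seemingly_toxic(
    toxic_seeds: list[str],
    chat: Callable[[str, str], str],
) -> list[str]:
    kept = []
    for seed in toxic_seeds:
        candidate = chat(REWRITE_SYS, seed)
        # Keep only candidates judged safe, so the benchmark stays benign;
        # a single moderation call stands in for a fuller safety check.
        verdict = chat(MODERATE_SYS, candidate).strip().lower()
        if verdict.startswith("safe"):
            kept.append(candidate)
    return kept
```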
**Related Work:**
- **Large Language Model Alignment:** Various methods have been proposed to align LLMs' outputs with human preferences, including Safe RLHF, MART, and instruction fine-tuning.
- **Over-Refusal and Safety:** Over-refusal leads to the incorrect rejection of safe prompts, reducing helpfulness and user engagement. Previous works such as XSTest manually craft safe prompts that mimic toxic ones, but these hand-written sets are small in scale and too simple to challenge newer state-of-the-art LLMs.
**Benchmark Construction:**
- **Common Refusal Categories:** 10 categories including deception, harassment, harmful, hate, illegal, privacy, self-harm, sexual, unethical, and violence.
- **OR-Bench 80K and Hard 1K Subset:** 80,000 seemingly toxic yet safe prompts spanning the 10 categories, together with a subset of roughly 1,000 hard prompts that even state-of-the-art LLMs frequently refuse, plus around 600 genuinely toxic prompts to check that models do not respond indiscriminately.
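Given model responses to these prompts, per-category over-refusal can be tallied as a refusal rate. The sketch below uses a crude keyword-based refusal detector as a stand-in for the LLM-based judging a full evaluation would use, and the record field names are assumptions.

```python
# Sketch: per-category refusal rate over collected responses.
# The substring-based refusal check is a crude stand-in for LLM-based judging,
# and the "category"/"response" field names are assumptions about the records.
from collections import defaultdict

REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry", "i am unable", "i won't")

def is_refusal(response: str) -> bool:
    head = response.strip().lower()[:200]  # refusals usually appear up front
    return any(marker in head for marker in REFUSAL_MARKERS)

def refusal_rate_by_category(records: list[dict]) -> dict[str, float]:
    totals, refused = defaultdict(int), defaultdict(int)
    for rec in records:
        totals[rec["category"]] += 1
        refused[rec["category"]] += int(is_refusal(rec["response"]))
    return {cat: refused[cat] / totals[cat] for cat in totals}
```

High refusal rates on the safe 80K and Hard-1K prompts indicate over-refusal, while the toxic prompts act as a control: a model that answers them freely is not being helpful, merely indiscriminate.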