OR-Bench: An Over-Refusal Benchmark for Large Language Models

OR-Bench: An Over-Refusal Benchmark for Large Language Models

20 Jun 2024 | Justin Cui¹, Wei-Lin Chiang², Ion Stoica², and Cho-Jui Hsieh¹
OR-Bench is a large-scale benchmark designed to evaluate the over-refusal issue in large language models (LLMs). The benchmark includes 80,000 seemingly toxic prompts across 10 categories, a subset of 1,000 hard prompts, and 600 toxic prompts. The prompts are generated by rewriting toxic seeds into benign prompts that are likely to be rejected by LLMs. The benchmark is evaluated on 25 popular LLMs across 8 model families, revealing a trade-off between safety and over-refusal. The results show that most models prioritize safety by rejecting more toxic prompts, but this often leads to over-refusal of benign prompts. The benchmark also highlights that model size does not necessarily correlate with better safety-sensitivity balance. The study emphasizes the need for future safety alignment algorithms to consider both toxic and seemingly toxic prompts to improve safety alignment. OR-Bench provides a robust testbed for optimizing the trade-off between safety and helpfulness in LLMs.OR-Bench is a large-scale benchmark designed to evaluate the over-refusal issue in large language models (LLMs). The benchmark includes 80,000 seemingly toxic prompts across 10 categories, a subset of 1,000 hard prompts, and 600 toxic prompts. The prompts are generated by rewriting toxic seeds into benign prompts that are likely to be rejected by LLMs. The benchmark is evaluated on 25 popular LLMs across 8 model families, revealing a trade-off between safety and over-refusal. The results show that most models prioritize safety by rejecting more toxic prompts, but this often leads to over-refusal of benign prompts. The benchmark also highlights that model size does not necessarily correlate with better safety-sensitivity balance. The study emphasizes the need for future safety alignment algorithms to consider both toxic and seemingly toxic prompts to improve safety alignment. OR-Bench provides a robust testbed for optimizing the trade-off between safety and helpfulness in LLMs.
Reach us at info@study.space
[slides and audio] OR-Bench%3A An Over-Refusal Benchmark for Large Language Models