**SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models**
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, the authors propose *SALAD-Bench*, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. The benchmark includes a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications, and multiple-choice options. To manage the inherent complexity, an innovative evaluator, the LLM-based MD-Judge, is introduced for QA pairs, focusing on attack-enhanced queries, ensuring a seamless and reliable evaluation. SALAD-Bench extends beyond standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring joint-purpose utility. Extensive experiments reveal the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and the evaluator are released under <https://github.com/OpenSafetyLab/SALAD-BENCH>.
**Key Contributions:**
1. **Compact Taxonomy with Hierarchical Levels:** SALAD-Bench introduces a structured hierarchy with three levels, comprising 6 domains, 16 tasks, and 66 categories, ensuring in-depth evaluation.
2. **Enhanced Difficulty and Complexity:** By infusing questions with attack methods, the benchmark enhances the challenge and diversity of safety inquiries.
3. **Reliable and Seamless Evaluator:** The MD-Judge, an LLM-based evaluator tailored for question-answer pairs, and MCQ-Judge for multiple-choice questions, ensure efficient and accurate evaluation.
4. **Joint-Purpose Utility:** The benchmark is uniquely suited for both LLM attack and defense methods evaluations, catering to a wide array of research needs.
**Dataset Construction:**
- **Hierarchical Taxonomy Definition:** A hierarchical three-level safety taxonomy is proposed, covering six domains, 16 tasks, and 66 categories.
- **Data Collection:** Original questions are collected from public datasets and self-instructed data, ensuring a balanced and large-scale dataset.
- **Question Enhancement:** Attack-enhanced and defense-enhanced subsets are constructed to deepen the challenge and broaden the evaluation perspectives.
- **Multiple-choice Questions Subset:** A multiple-choice questions subset is added to broaden the dataset's scope and enhance complexity.
**Evaluator:**
- **MD-Judge:** An LLM-based safety judge model fine-tuned on a dataset comprising standard and attack-enhanced pairs, focusing on question-answer pairs.
- **MCQ-Judge:** Utilizes in-context learning and regex parsing to efficiently fetch answers for multiple-choice questions.
**Experiments:**
- Large-scale experiments assess the reliability of the evaluators and the safety of various LLMs, comparing the effectiveness of different attack and defense methods.
**Conclusion**SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models**
In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, the authors propose *SALAD-Bench*, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. The benchmark includes a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications, and multiple-choice options. To manage the inherent complexity, an innovative evaluator, the LLM-based MD-Judge, is introduced for QA pairs, focusing on attack-enhanced queries, ensuring a seamless and reliable evaluation. SALAD-Bench extends beyond standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring joint-purpose utility. Extensive experiments reveal the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and the evaluator are released under <https://github.com/OpenSafetyLab/SALAD-BENCH>.
**Key Contributions:**
1. **Compact Taxonomy with Hierarchical Levels:** SALAD-Bench introduces a structured hierarchy with three levels, comprising 6 domains, 16 tasks, and 66 categories, ensuring in-depth evaluation.
2. **Enhanced Difficulty and Complexity:** By infusing questions with attack methods, the benchmark enhances the challenge and diversity of safety inquiries.
3. **Reliable and Seamless Evaluator:** The MD-Judge, an LLM-based evaluator tailored for question-answer pairs, and MCQ-Judge for multiple-choice questions, ensure efficient and accurate evaluation.
4. **Joint-Purpose Utility:** The benchmark is uniquely suited for both LLM attack and defense methods evaluations, catering to a wide array of research needs.
**Dataset Construction:**
- **Hierarchical Taxonomy Definition:** A hierarchical three-level safety taxonomy is proposed, covering six domains, 16 tasks, and 66 categories.
- **Data Collection:** Original questions are collected from public datasets and self-instructed data, ensuring a balanced and large-scale dataset.
- **Question Enhancement:** Attack-enhanced and defense-enhanced subsets are constructed to deepen the challenge and broaden the evaluation perspectives.
- **Multiple-choice Questions Subset:** A multiple-choice questions subset is added to broaden the dataset's scope and enhance complexity.
**Evaluator:**
- **MD-Judge:** An LLM-based safety judge model fine-tuned on a dataset comprising standard and attack-enhanced pairs, focusing on question-answer pairs.
- **MCQ-Judge:** Utilizes in-context learning and regex parsing to efficiently fetch answers for multiple-choice questions.
**Experiments:**
- Large-scale experiments assess the reliability of the evaluators and the safety of various LLMs, comparing the effectiveness of different attack and defense methods.
**Conclusion