Understanding SALAD-Bench%3A A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

**SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models** In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, the authors propose *SALAD-Bench*, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. The benchmark includes a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications, and multiple-choice options. To manage the inherent complexity, an innovative evaluator, the LLM-based MD-Judge, is introduced for QA pairs, focusing on attack-enhanced queries, ensuring a seamless and reliable evaluation. SALAD-Bench extends beyond standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring joint-purpose utility. Extensive experiments reveal the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and the evaluator are released under <https://github.com/OpenSafetyLab/SALAD-BENCH>. **Key Contributions:** 1. **Compact Taxonomy with Hierarchical Levels:** SALAD-Bench introduces a structured hierarchy with three levels, comprising 6 domains, 16 tasks, and 66 categories, ensuring in-depth evaluation. 2. **Enhanced Difficulty and Complexity:** By infusing questions with attack methods, the benchmark enhances the challenge and diversity of safety inquiries. 3. **Reliable and Seamless Evaluator:** The MD-Judge, an LLM-based evaluator tailored for question-answer pairs, and MCQ-Judge for multiple-choice questions, ensure efficient and accurate evaluation. 4. **Joint-Purpose Utility:** The benchmark is uniquely suited for both LLM attack and defense methods evaluations, catering to a wide array of research needs. **Dataset Construction:** - **Hierarchical Taxonomy Definition:** A hierarchical three-level safety taxonomy is proposed, covering six domains, 16 tasks, and 66 categories. - **Data Collection:** Original questions are collected from public datasets and self-instructed data, ensuring a balanced and large-scale dataset. - **Question Enhancement:** Attack-enhanced and defense-enhanced subsets are constructed to deepen the challenge and broaden the evaluation perspectives. - **Multiple-choice Questions Subset:** A multiple-choice questions subset is added to broaden the dataset's scope and enhance complexity. **Evaluator:** - **MD-Judge:** An LLM-based safety judge model fine-tuned on a dataset comprising standard and attack-enhanced pairs, focusing on question-answer pairs. - **MCQ-Judge:** Utilizes in-context learning and regex parsing to efficiently fetch answers for multiple-choice questions. **Experiments:** - Large-scale experiments assess the reliability of the evaluators and the safety of various LLMs, comparing the effectiveness of different attack and defense methods. **Conclusion**SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models** In the rapidly evolving landscape of Large Language Models (LLMs), ensuring robust safety measures is paramount. To meet this crucial need, the authors propose *SALAD-Bench*, a safety benchmark specifically designed for evaluating LLMs, attack, and defense methods. Distinguished by its breadth, SALAD-Bench transcends conventional benchmarks through its large scale, rich diversity, intricate taxonomy spanning three levels, and versatile functionalities. The benchmark includes a meticulous array of questions, from standard queries to complex ones enriched with attack, defense modifications, and multiple-choice options. To manage the inherent complexity, an innovative evaluator, the LLM-based MD-Judge, is introduced for QA pairs, focusing on attack-enhanced queries, ensuring a seamless and reliable evaluation. SALAD-Bench extends beyond standard LLM safety evaluation to both LLM attack and defense methods evaluation, ensuring joint-purpose utility. Extensive experiments reveal the resilience of LLMs against emerging threats and the efficacy of contemporary defense tactics. Data and the evaluator are released under <https://github.com/OpenSafetyLab/SALAD-BENCH>. **Key Contributions:** 1. **Compact Taxonomy with Hierarchical Levels:** SALAD-Bench introduces a structured hierarchy with three levels, comprising 6 domains, 16 tasks, and 66 categories, ensuring in-depth evaluation. 2. **Enhanced Difficulty and Complexity:** By infusing questions with attack methods, the benchmark enhances the challenge and diversity of safety inquiries. 3. **Reliable and Seamless Evaluator:** The MD-Judge, an LLM-based evaluator tailored for question-answer pairs, and MCQ-Judge for multiple-choice questions, ensure efficient and accurate evaluation. 4. **Joint-Purpose Utility:** The benchmark is uniquely suited for both LLM attack and defense methods evaluations, catering to a wide array of research needs. **Dataset Construction:** - **Hierarchical Taxonomy Definition:** A hierarchical three-level safety taxonomy is proposed, covering six domains, 16 tasks, and 66 categories. - **Data Collection:** Original questions are collected from public datasets and self-instructed data, ensuring a balanced and large-scale dataset. - **Question Enhancement:** Attack-enhanced and defense-enhanced subsets are constructed to deepen the challenge and broaden the evaluation perspectives. - **Multiple-choice Questions Subset:** A multiple-choice questions subset is added to broaden the dataset's scope and enhance complexity. **Evaluator:** - **MD-Judge:** An LLM-based safety judge model fine-tuned on a dataset comprising standard and attack-enhanced pairs, focusing on question-answer pairs. - **MCQ-Judge:** Utilizes in-context learning and regex parsing to efficiently fetch answers for multiple-choice questions. **Experiments:** - Large-scale experiments assess the reliability of the evaluators and the safety of various LLMs, comparing the effectiveness of different attack and defense methods. **Conclusion

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models

7 Jun 2024 | Lijun Li1*, Bowen Dong1,2,5*, Ruohui Wang1*, Xuhao Hu1,3*, Wangmeng Zuo2, Dahua Lin1,4, Yu Qiao1, Jing Shao1†

7 Jun 2024 | Lijun Li1, Bowen Dong1,2,5, Ruohui Wang1, Xuhao Hu1,3, Wangmeng Zuo2, Dahua Lin1,4, Yu Qiao1, Jing Shao1†