SALAD-Bench is a comprehensive, hierarchical safety benchmark for evaluating Large Language Models (LLMs) across safety, attack, and defense dimensions. It features a three-level taxonomy covering 6 domains, 16 tasks, and 66 categories, and contains 21,000 test samples, including 5,000 attack-enhanced questions, 200 defense-enhanced questions, and 4,000 multiple-choice questions.

Evaluation relies on two specialized evaluators: MD-Judge for question-answer pairs and MCQ-Judge for multiple-choice questions. MD-Judge is an LLM-based judge fine-tuned on a dataset of standard and attack-enhanced question-answer pairs, while MCQ-Judge scores multiple-choice answers via regex parsing. This setup supports both standard LLM safety evaluation and the assessment of attack and defense methods, measuring a model's resilience against emerging threats as well as the effectiveness of defense strategies.

The data and evaluators are publicly released on GitHub. Reported results show that models such as Claude2 achieve high safety scores, while others, such as Gemini, drop significantly when faced with attack-enhanced questions. These findings underscore the importance of diverse, challenging questions for assessing LLM safety and the need for effective defense methods. Overall, SALAD-Bench provides a comprehensive and reliable tool for evaluating LLM safety and improving the security of large language models.
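
To make the MCQ-Judge idea concrete, below is a minimal sketch of regex-based multiple-choice answer extraction and exact-match scoring. The answer format, option labels (A-D), and function names are assumptions made for illustration; they are not the benchmark's actual implementation.

```python
import re

# A minimal sketch of regex-based multiple-choice answer extraction,
# in the spirit of what is described for MCQ-Judge. The pattern and
# answer format below are illustrative assumptions.
OPTION_PATTERN = re.compile(r"\b([A-D])\b")

def extract_choices(answer_text: str) -> set[str]:
    """Return the set of uppercase option letters mentioned in the answer."""
    return set(OPTION_PATTERN.findall(answer_text))

def score_mcq(answer_text: str, gold_choices: set[str]) -> bool:
    """Exact-match scoring: the extracted set must equal the gold set."""
    return extract_choices(answer_text) == gold_choices

if __name__ == "__main__":
    prediction = "Options A and C are the safe responses."
    print(score_mcq(prediction, {"A", "C"}))  # True
```

A real evaluator would also need to handle refusals, answers that restate option text instead of letters, and questions whose gold answer is a set of several options; the exact-match scoring above is only the simplest case.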