26 Feb 2024 | Zhexin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
**ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors**
This paper addresses the critical issue of safety in Large Language Models (LLMs) by proposing ShieldLM, an LLM-based safety detector that aligns with general human safety standards, supports customizable detection rules, and explains its decisions. To train ShieldLM, a large bilingual (Chinese and English) dataset of 14,387 query-response pairs is compiled and annotated for safety under a variety of standards. Extensive experiments show that ShieldLM outperforms strong baselines across four test sets, with notably strong customizability and explainability. It proves effective both on standard detection benchmarks and in real-world applications, serving as a reliable safety evaluator for advanced LLMs. The authors release ShieldLM to support accurate and explainable safety detection and the broader effort to make LLMs safer.
**Key Contributions:**
1. **Alignment with Human Standards:** ShieldLM aligns with general human safety standards.
2. **Customizability:** It supports customizable detection rules for diverse application scenarios and safety standards.
3. **Explainability:** It provides explanations for its decisions, enhancing transparency in the decision-making process.
**Methods:**
- **Label Collection:** A new bilingual dataset is collected, with responses from a range of LLMs annotated for safety under differing standards.
- **Analysis Generation:** GPT-4 is used to generate natural-language analyses that are consistent with the human labels and the applicable rules.
- **Training and Inference:** ShieldLM is trained on the resulting dataset, with irrelevant rules deliberately mixed into training examples to enhance its adaptability to varied rules; a minimal inference sketch follows this list.
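In concrete terms, detection is prompted generation: the query, the response, and any custom rules are packed into a prompt, and the model generates a verdict plus an analysis. Below is a minimal sketch assuming a HuggingFace-style causal LM; the checkpoint name, prompt wording, and label strings are illustrative assumptions, not the authors' released template.

```python
# Minimal sketch of ShieldLM-style detection via prompted generation.
# Assumptions (not the released format): the checkpoint name, the prompt
# wording, and the exact "safe / unsafe / controversial" label strings.
from typing import Optional

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "thu-coai/ShieldLM-14B-qwen"  # placeholder; check the official release

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, trust_remote_code=True)

def detect(query: str, response: str, rules: Optional[str] = None) -> str:
    """Return a safety verdict plus a natural-language analysis for one pair."""
    rules_block = f"Custom detection rules:\n{rules}\n" if rules else ""
    prompt = (
        "Judge whether the following response to the query is safe.\n"
        f"{rules_block}"
        f"Query: {query}\nResponse: {response}\n"
        "Verdict (safe / unsafe / controversial) and analysis:"
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Decode only the generated continuation, not the echoed prompt.
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
```

Because the rules travel in the prompt, detection standards can be customized at inference time without retraining.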
**Experiments:**
- **Performance:** ShieldLM achieves state-of-the-art results across the four test sets, outperforming strong baselines (a toy scoring illustration follows this list).
- **Customizability:** It demonstrates a remarkable ability to adapt to fine-grained safety standards.
- **Explainability:** Manual evaluation shows that ShieldLM's generated analyses are reasonable and consistent with its predictions.
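For context on how such comparisons are scored: detection reduces to a three-way classification (the paper's labels are safe, unsafe, and controversial), so evaluation against human annotations can be pictured as below. The choice of macro-F1 here is an assumption about the reporting metric, and the data are toy values.

```python
# Toy illustration of scoring detector predictions against human labels.
# The three-way label set follows the paper; macro-F1 is an assumed metric.
from sklearn.metrics import f1_score

labels = ["safe", "unsafe", "controversial", "safe"]  # human annotations (toy)
preds = ["safe", "unsafe", "safe", "safe"]            # detector outputs (toy)
print(f1_score(labels, preds, average="macro"))       # per-class F1, averaged
```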
**Applications:**
- ShieldLM is applied as a scorer for evaluating the safety of advanced LLMs, where it shows superior performance in identifying unsafe responses; a minimal scorer sketch follows.
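Below is a minimal sketch of that scorer role, reusing the hypothetical detect() helper from the Methods sketch above; the scoring rule (share of responses judged safe) is an assumption, not necessarily the paper's exact protocol.

```python
# Hypothetical safety scorer built on the detect() sketch above.
# Scoring rule (fraction of responses judged safe) is an assumption.
from typing import List, Tuple

def safety_score(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (query, response) pairs whose verdict begins with 'safe'."""
    verdicts = [detect(q, r) for q, r in pairs]
    return sum(v.strip().lower().startswith("safe") for v in verdicts) / len(pairs)
```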
**Limitations and Future Work:**
- ShieldLM may struggle with samples requiring professional knowledge.
- Scaling training data purely through human annotations is challenging, suggesting the need for semi-automatic approaches.
**Ethical Considerations:**
- ShieldLM targets developers and researchers, focusing on controlled prompts to avoid adversarial attacks.
- Privacy and offensive content in collected data are carefully managed to ensure ethical use.