ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors

26 Feb 2024 | Zhixin Zhang, Yida Lu, Jingyuan Ma, Di Zhang, Rui Li, Pei Ke, Hao Sun, Lei Sha, Zhifang Sui, Hongning Wang, Minlie Huang
ShieldLM is an LLM-based safety detector designed to align with human safety standards, support customizable detection rules, and provide explanations for its decisions. The system is trained on a large bilingual dataset of 14,387 query-response pairs, annotated based on various safety standards. Through extensive experiments, ShieldLM outperforms strong baselines across four test sets, demonstrating its customizability and explainability. It is effective in real-world scenarios as a safety evaluator for advanced LLMs. The system is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various standards. The paper introduces ShieldLM as a solution to the limitations of existing safety detection methods, which often lack alignment with human standards, customization options, and explainability. ShieldLM is trained on a bilingual dataset that includes annotations for the safety categories of responses from various LLMs under different standards. It is trained to understand and apply different custom detection rules for diverse situations. The system is designed to provide explanations for its decisions, making the detection process transparent. ShieldLM is evaluated on various test sets, including in-distribution (ID) and out-of-distribution (OOD) datasets. It achieves state-of-the-art performance compared to strong baselines such as GPT-4. The system is also tested on real-world applications, where it serves as a reliable judge for safety evaluation of LLMs. ShieldLM's customizability and explainability are quantitatively validated, showing its effectiveness in identifying unsafe responses and adapting to different safety standards. The paper also discusses the limitations of existing safety detection methods and the importance of aligning LLMs with human safety standards. ShieldLM addresses these limitations by providing a customizable and explainable safety detection system. The system is designed to handle diverse safety issues and support various safety standards. It is also tested on different datasets, including the Red Team dataset, the Implicit Toxicity dataset, and the DiaSafety dataset, demonstrating its effectiveness in identifying unsafe responses. The paper concludes that ShieldLM is a promising solution for safety detection in LLMs, offering a customizable and explainable approach to identifying unsafe responses. It is designed to support developers and researchers in various safety detection scenarios, contributing to the ongoing efforts to enhance the safety of LLMs.ShieldLM is an LLM-based safety detector designed to align with human safety standards, support customizable detection rules, and provide explanations for its decisions. The system is trained on a large bilingual dataset of 14,387 query-response pairs, annotated based on various safety standards. Through extensive experiments, ShieldLM outperforms strong baselines across four test sets, demonstrating its customizability and explainability. It is effective in real-world scenarios as a safety evaluator for advanced LLMs. The system is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various standards. The paper introduces ShieldLM as a solution to the limitations of existing safety detection methods, which often lack alignment with human standards, customization options, and explainability. ShieldLM is trained on a bilingual dataset that includes annotations for the safety categories of responses from various LLMs under different standards. It is trained to understand and apply different custom detection rules for diverse situations. The system is designed to provide explanations for its decisions, making the detection process transparent. ShieldLM is evaluated on various test sets, including in-distribution (ID) and out-of-distribution (OOD) datasets. It achieves state-of-the-art performance compared to strong baselines such as GPT-4. The system is also tested on real-world applications, where it serves as a reliable judge for safety evaluation of LLMs. ShieldLM's customizability and explainability are quantitatively validated, showing its effectiveness in identifying unsafe responses and adapting to different safety standards. The paper also discusses the limitations of existing safety detection methods and the importance of aligning LLMs with human safety standards. ShieldLM addresses these limitations by providing a customizable and explainable safety detection system. The system is designed to handle diverse safety issues and support various safety standards. It is also tested on different datasets, including the Red Team dataset, the Implicit Toxicity dataset, and the DiaSafety dataset, demonstrating its effectiveness in identifying unsafe responses. The paper concludes that ShieldLM is a promising solution for safety detection in LLMs, offering a customizable and explainable approach to identifying unsafe responses. It is designed to support developers and researchers in various safety detection scenarios, contributing to the ongoing efforts to enhance the safety of LLMs.
Reach us at info@study.space