R-Judge: Benchmarking Safety Risk Awareness for LLM Agents


18 Feb 2024 | Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, Rui Wang, Gongshen Liu
R-Judge is a benchmark designed to evaluate the safety risk awareness of large language models (LLMs) in agent interaction scenarios. The benchmark comprises 162 records of multi-turn agent interactions spanning 7 application categories and 10 risk types, each paired with human-annotated safety labels and detailed risk descriptions. The dataset was curated to reflect real-world scenarios in which LLM agents may pose safety risks, such as privacy leaks or data loss.

Evaluation of 9 LLMs on R-Judge shows considerable room for improvement in identifying and judging safety risks in open agent scenarios: the best-performing model, GPT-4, achieves 72.52% against a human score of 89.07%, while all other models score below the random baseline. Further experiments show that supplying risk descriptions as environment feedback significantly improves model performance. The study finds that risk awareness in open agent scenarios is a multi-dimensional capability involving both knowledge and reasoning, which remains challenging for current LLMs.

R-Judge provides a realistic evaluation of LLMs' ability to identify and judge safety risks in agent interactions, highlighting the need for further research in this area. The results underscore the importance of aligning LLMs with human safety standards and of additional fine-tuning to strengthen their safety risk awareness. The benchmark is publicly available at https://github.com/Lordog/R-Judge.
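The headline numbers come from comparing each model's safe/unsafe judgment on a record against the human annotation. As a rough illustration of that scoring step, the sketch below computes F1 over binary judgments; the record schema, the field names ("label", "pred"), and the file name are assumptions made for this example only, not R-Judge's actual data format, which is documented in the repository linked above.

```python
# Illustrative sketch (not the official R-Judge evaluation script): score binary
# safe/unsafe judgments against human-annotated labels with F1 over the "unsafe"
# class. Field names and the input file name are assumptions for the example.
import json


def f1(labels: list[int], preds: list[int]) -> float:
    """F1 score treating 1 (unsafe) as the positive class."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Each record is assumed to carry a human label (1 = unsafe, 0 = safe)
    # and a model judgment parsed from the LLM's response.
    with open("r_judge_records.json", encoding="utf-8") as fh:
        records = json.load(fh)
    labels = [r["label"] for r in records]
    preds = [r["pred"] for r in records]
    print(f"F1 over {len(records)} records: {f1(labels, preds):.2%}")
```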