AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models

31 Aug 2024 | DONG SHU*, Northwestern University, USA; MINGYU JIN*, Rutgers University, USA; CHONG ZHANG, University of Liverpool, UK; LINGYAO LI, University of Michigan, USA; ZIHAO ZHOU, University of Liverpool, UK; YONGFENG ZHANG, Rutgers University, USA
The paper "AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models" introduces an innovative framework to assess the effectiveness of jailbreak attacks on large language models (LLMs). Unlike traditional binary evaluations, the framework evaluates both the robustness of LLMs and the effectiveness of attack prompts. It presents two evaluation methods: coarse-grained and fine-grained, each using a scoring range from 0 to 1 to provide nuanced insights into attack effectiveness. The authors also develop a comprehensive ground truth dataset tailored for jailbreak prompts, serving as a benchmark for future research. The study shows that the proposed methods align with baseline metrics while offering more detailed and fine-grained assessments, helping to identify potentially harmful attack prompts that might be overlooked in traditional evaluations. The framework addresses the need for a more sophisticated and comprehensive evaluation methodology to enhance the security of LLMs against jailbreak attacks.The paper "AttackEval: How to Evaluate the Effectiveness of Jailbreak Attacking on Large Language Models" introduces an innovative framework to assess the effectiveness of jailbreak attacks on large language models (LLMs). Unlike traditional binary evaluations, the framework evaluates both the robustness of LLMs and the effectiveness of attack prompts. It presents two evaluation methods: coarse-grained and fine-grained, each using a scoring range from 0 to 1 to provide nuanced insights into attack effectiveness. The authors also develop a comprehensive ground truth dataset tailored for jailbreak prompts, serving as a benchmark for future research. The study shows that the proposed methods align with baseline metrics while offering more detailed and fine-grained assessments, helping to identify potentially harmful attack prompts that might be overlooked in traditional evaluations. The framework addresses the need for a more sophisticated and comprehensive evaluation methodology to enhance the security of LLMs against jailbreak attacks.