AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts


2024 | Shaona Ghosh, Prasoon Varshney, Erick Galinkin, Christopher Parisien
This paper introduces AEGIS, a novel approach to content safety moderation that uses an ensemble of large language model (LLM) content safety experts. The authors define a comprehensive content safety risk taxonomy comprising 13 critical risk categories and 9 sparse risk categories. Following this taxonomy, they curate AEGISSAFETYDATASET, a new dataset of approximately 26,000 human-LLM interaction instances with human annotations. The dataset is released publicly to further research and to benchmark LLMs for safety.

The authors instruction-tune multiple LLM-based safety models, named AEGISSAFETYEXPERTS, which surpass or perform competitively with state-of-the-art LLM-based safety models and general-purpose LLMs, and which remain robust across multiple jailbreak attack categories. They also show that using AEGISSAFETYDATASET during the LLM alignment phase does not degrade the aligned models' MT-Bench scores.

The authors then propose AEGIS itself: a novel application of a no-regret online adaptation framework, with strong theoretical guarantees, that performs content moderation with an ensemble of LLM content safety experts in deployment. The content moderation meta-algorithm learns to dynamically adjust the influence of each expert based on the specific context, improving adaptability to shifting data distributions over time, changing safety policies, and novel adversarial attacks (a minimal, hypothetical sketch of such an expert-weighting scheme appears after this summary).
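The summary does not reproduce the paper's exact meta-algorithm, but no-regret prediction with expert advice is classically realized by exponentially weighted (Hedge-style) updates. The sketch below is an illustrative assumption rather than the authors' implementation: it supposes each safety expert emits an unsafe-probability and that a feedback label arrives after each round; the function name hedge_moderation and the learning rate eta are hypothetical.

    import math

    def hedge_moderation(expert_probs, labels, eta=0.5):
        """Hedge-style ensemble: keep one weight per safety expert and
        shrink an expert's weight multiplicatively in proportion to its
        error on each round's feedback label (1 = unsafe, 0 = safe)."""
        n_experts = len(expert_probs[0])
        weights = [1.0] * n_experts
        decisions = []
        for probs, y in zip(expert_probs, labels):
            total = sum(weights)
            # Weighted unsafe-score; flag the content when it crosses 0.5.
            score = sum(w * p for w, p in zip(weights, probs)) / total
            decisions.append(score >= 0.5)
            # Each expert's loss is its absolute error on this round.
            for i, p in enumerate(probs):
                weights[i] *= math.exp(-eta * abs(p - y))
        return weights, decisions

    # Three hypothetical experts scoring four prompts (1 = unsafe).
    probs = [[0.9, 0.2, 0.8], [0.1, 0.7, 0.2],
             [0.95, 0.4, 0.9], [0.05, 0.6, 0.1]]
    labels = [1, 0, 1, 0]
    final_weights, flags = hedge_moderation(probs, labels)

With a suitably tuned learning rate, updates of this family are known to guarantee regret sublinear in the number of rounds relative to the best single expert in hindsight, which is the kind of theoretical guarantee the paper invokes; the mismatched expert (the second one above) is down-weighted automatically as feedback accumulates.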
The authors evaluate their models against several baselines, including LLaMAGUARDBASE, the OPENAI MOD API, and the PERSPECTIVE API, reporting AUPRC and F1 scores throughout; their models outperform these baselines on both metrics (the two metrics are illustrated after this summary). They also report baseline performance on SimpleSafetyTests, a recently released test suite for identifying critical safety risks.

To assess robustness to jailbreaks, the authors test their models against two state-of-the-art attacks: Tree of Attacks with Pruning (TAP) and adversarial suffixes generated via Greedy Coordinate Gradient (GCG). Their models prove significantly more resilient to TAP attacks than the baselines.

They further evaluate the impact of incorporating AEGISSAFETYDATASET content moderation data into the alignment blend and find that its inclusion does not adversely affect the model's helpfulness, as measured by MT-Bench scores.

Finally, the authors discuss the ethical considerations of their work, including the presence of sensitive content in the dataset and the need for ethical guidelines in content moderation, as well as the importance of aligning safety risk category definitions with organizational values.
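For readers unfamiliar with the reported metrics: AUPRC is the area under the precision-recall curve, computed from a model's continuous unsafe-scores, while F1 is the harmonic mean of precision and recall on thresholded predictions. The following is a generic illustration with made-up scores, not the paper's evaluation code:

    from sklearn.metrics import average_precision_score, f1_score

    # y_true: ground truth (1 = unsafe); y_score: a model's unsafe-probabilities.
    y_true  = [1, 0, 1, 1, 0, 0]
    y_score = [0.92, 0.30, 0.75, 0.60, 0.10, 0.55]

    # AUPRC uses raw scores; F1 needs hard labels, here thresholded at 0.5.
    y_pred = [int(s >= 0.5) for s in y_score]
    print("AUPRC:", average_precision_score(y_true, y_score))
    print("F1:   ", f1_score(y_true, y_pred))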