The paper "AEGIS: Online Adaptive AI Content Safety Moderation with Ensemble of LLM Experts" addresses the growing concerns of content safety risks associated with the widespread use of Large Language Models (LLMs) and generative AI. The authors define a comprehensive content safety risk taxonomy comprising 13 critical risk categories and 9 sparse risk categories. They curate the AEGISSAFETYDATASET, a dataset of approximately 26,000 human-LLM interaction instances annotated according to the taxonomy. This dataset is intended to advance research and benchmark LLM models for safety.
The paper outlines a multi-phased strategy: creating a rich content safety taxonomy, collecting high-quality LLM interaction data, and instruction-tuning multiple LLM-based safety models. The authors demonstrate that their models, named AEGISSAFETYEXPERTS, outperform or compete with state-of-the-art LLM-based safety models and general-purpose LLMs, and remain robust across multiple jailbreak attack categories. They also show that using the AEGISSAFETYDATASET during the LLM alignment phase does not degrade the MT-Bench scores of the aligned models.
Additionally, the paper introduces AEGIS, a novel application of a no-regret online adaptation framework to perform content moderation with an ensemble of LLM content safety experts. This framework dynamically adjusts the influence of expert predictions based on context, enhancing adaptability to diverse data distributions and changing safety policies.
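To make the online adaptation idea concrete, a standard no-regret scheme for combining experts is the Hedge (exponential weights) rule: each expert carries a weight, the ensemble predicts with a weighted vote, and experts that err have their weights decayed multiplicatively. The sketch below is a minimal, generic illustration under that assumption, not the paper's exact update; the expert interface, the binary safe/unsafe labels, and the learning rate eta are hypothetical.

```python
import math
from typing import Callable, Iterable, List, Tuple

def hedge_moderation(
    experts: List[Callable[[str], int]],          # each expert maps text to 0 (safe) or 1 (unsafe)
    stream: Iterable[Tuple[str, int]],            # online stream of (text, true_label) pairs
    eta: float = 0.5,                             # learning rate for the multiplicative-weights update
):
    """Generic Hedge-style no-regret ensemble for content moderation (illustrative only)."""
    weights = [1.0] * len(experts)
    predictions = []
    for text, label in stream:
        votes = [expert(text) for expert in experts]
        # Weighted vote: flag as unsafe if the weighted mass voting "unsafe" is at least half the total.
        total = sum(weights)
        unsafe_mass = sum(w for w, v in zip(weights, votes) if v == 1)
        predictions.append(1 if unsafe_mass >= total / 2 else 0)
        # Multiplicative-weights update: decay the weight of every expert whose vote was wrong.
        for i, vote in enumerate(votes):
            loss = 1.0 if vote != label else 0.0
            weights[i] *= math.exp(-eta * loss)
    return predictions, weights

# Toy usage with stand-in experts (real experts would be LLM safety classifiers).
experts = [lambda t: 1 if "attack" in t else 0, lambda t: 0]
preds, w = hedge_moderation(experts, [("how to attack a server", 1), ("hello there", 0)])
```

Because the weights are updated from observed feedback rather than fixed in advance, an ensemble of this form can shift which expert dominates as the data distribution or the safety policy changes, which is the adaptability property the paper emphasizes.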
The key contributions include an extensive content safety risk taxonomy, the AEGISSAFETYDATASET, a suite of strong and diverse LLM content safety models, and an innovative online adaptation framework. The authors plan to release the dataset, taxonomy, and guidelines to the research community to further advance the field of AI content safety.