24 May 2024 | Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua
ALI-Agent is an evaluation framework that assesses Large Language Models (LLMs) for alignment with human values by leveraging autonomous agents. It addresses the limitations of existing benchmarks, which are labor-intensive and static, by automating the generation of realistic test scenarios and iteratively refining them to probe long-tail risks. The framework operates in two stages: Emulation and Refinement. During Emulation, ALI-Agent generates test scenarios based on past evaluation records and user queries, while Refinement iteratively refines these scenarios to uncover hidden misalignments. ALI-Agent incorporates a memory module to store past evaluations, a tool-using module to reduce human labor, and an action module to refine test scenarios. Extensive experiments across three aspects of human values—stereotypes, morality, and legality—demonstrate that ALI-Agent effectively identifies model misalignment. Systematic analysis validates that the generated test scenarios represent meaningful use cases and integrate enhanced measures to probe long-tail risks. ALI-Agent's code is available at https://github.com/SophieZheng998/ALI-Agent.git. The framework shows significant improvements in identifying misalignment compared to existing benchmarks, particularly in uncovering long-tail risks. It also demonstrates the effectiveness of its components, such as the evaluation memory and iterative refiner, in enhancing alignment assessment. ALI-Agent's ability to generate realistic test scenarios that properly encapsulate misconduct and conceal malice is validated through human evaluations and the OpenAI Moderation API. The framework's results highlight the importance of continuous adaptation and improvement in evaluating LLMs for alignment with human values.ALI-Agent is an evaluation framework that assesses Large Language Models (LLMs) for alignment with human values by leveraging autonomous agents. It addresses the limitations of existing benchmarks, which are labor-intensive and static, by automating the generation of realistic test scenarios and iteratively refining them to probe long-tail risks. The framework operates in two stages: Emulation and Refinement. During Emulation, ALI-Agent generates test scenarios based on past evaluation records and user queries, while Refinement iteratively refines these scenarios to uncover hidden misalignments. ALI-Agent incorporates a memory module to store past evaluations, a tool-using module to reduce human labor, and an action module to refine test scenarios. Extensive experiments across three aspects of human values—stereotypes, morality, and legality—demonstrate that ALI-Agent effectively identifies model misalignment. Systematic analysis validates that the generated test scenarios represent meaningful use cases and integrate enhanced measures to probe long-tail risks. ALI-Agent's code is available at https://github.com/SophieZheng998/ALI-Agent.git. The framework shows significant improvements in identifying misalignment compared to existing benchmarks, particularly in uncovering long-tail risks. It also demonstrates the effectiveness of its components, such as the evaluation memory and iterative refiner, in enhancing alignment assessment. ALI-Agent's ability to generate realistic test scenarios that properly encapsulate misconduct and conceal malice is validated through human evaluations and the OpenAI Moderation API. The framework's results highlight the importance of continuous adaptation and improvement in evaluating LLMs for alignment with human values.