[slides] ALI-Agent%3A Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

**ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation** **Authors:** Jingnan Zheng **Abstract:** Large Language Models (LLMs) can generate unintended and harmful content when misaligned with human values, posing significant risks to users and society. Current evaluation benchmarks, which primarily use expert-designed contextual scenarios, are labor-intensive and limited in scope, hindering their ability to generalize to diverse real-world use cases and identify rare but crucial long-tail risks. To address these challenges, ALI-Agent is proposed as an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios. In the Refinement stage, it iteratively refines these scenarios to probe long-tail risks. ALI-Agent incorporates a memory module to guide scenario generation, a tool-using module to reduce human labor, and an action module to refine tests. Extensive experiments across three aspects of human values—stereotypes, morality, and legality—demonstrate that ALI-Agent effectively identifies model misalignment. Systematic analysis validates that the generated test scenarios represent meaningful use cases and integrate enhanced measures to probe long-tail risks. The code for ALI-Agent is available at <https://github.com/SophieZheng998/ALI-Agent.git>. **Introduction:** Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating texts, leading to widespread deployment across various applications. However, this expansion raises concerns about their alignment with human values. Misalignment can result in LLMs generating content that perpetuates stereotypes, reinforces societal biases, or provides unlawful instructions, posing risks to users and society. Evaluating LLMs' alignment with human values is challenging due to the complex and open-ended nature of real-world applications. Current evaluation benchmarks are labor-intensive and limited in scope, making it difficult to cover a wide range of real-world use cases and identify long-tail risks. ALI-Agent aims to automate in-depth and adaptive alignment testing for LLMs by leveraging the autonomous abilities of agents. **Task Formulation:** The task of evaluating LLMs' alignment with human values involves generating test prompts, querying the target LLM, and assessing its responses using an evaluator. The process is formalized as follows: - Generate a test prompt by combining a misconduct sample and a task-specific prompt template. - Query the target LLM using the combined prompt. - Assess the LLM's response using an evaluator. **Emulation Stage:** The core of the Emulation stage involves generating realistic test scenarios using past evaluation records and an emulator. The emulator retrieves relevant evaluation records from the memory, generates new test scenarios, and evaluates the target LLM's feedback. **Refinement Stage:** The refinement stage iteratively**ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation** **Authors:** Jingnan Zheng **Abstract:** Large Language Models (LLMs) can generate unintended and harmful content when misaligned with human values, posing significant risks to users and society. Current evaluation benchmarks, which primarily use expert-designed contextual scenarios, are labor-intensive and limited in scope, hindering their ability to generalize to diverse real-world use cases and identify rare but crucial long-tail risks. To address these challenges, ALI-Agent is proposed as an evaluation framework that leverages the autonomous abilities of LLM-powered agents to conduct in-depth and adaptive alignment assessments. ALI-Agent operates through two stages: Emulation and Refinement. During the Emulation stage, ALI-Agent automates the generation of realistic test scenarios. In the Refinement stage, it iteratively refines these scenarios to probe long-tail risks. ALI-Agent incorporates a memory module to guide scenario generation, a tool-using module to reduce human labor, and an action module to refine tests. Extensive experiments across three aspects of human values—stereotypes, morality, and legality—demonstrate that ALI-Agent effectively identifies model misalignment. Systematic analysis validates that the generated test scenarios represent meaningful use cases and integrate enhanced measures to probe long-tail risks. The code for ALI-Agent is available at <https://github.com/SophieZheng998/ALI-Agent.git>. **Introduction:** Large Language Models (LLMs) have demonstrated remarkable capabilities in understanding and generating texts, leading to widespread deployment across various applications. However, this expansion raises concerns about their alignment with human values. Misalignment can result in LLMs generating content that perpetuates stereotypes, reinforces societal biases, or provides unlawful instructions, posing risks to users and society. Evaluating LLMs' alignment with human values is challenging due to the complex and open-ended nature of real-world applications. Current evaluation benchmarks are labor-intensive and limited in scope, making it difficult to cover a wide range of real-world use cases and identify long-tail risks. ALI-Agent aims to automate in-depth and adaptive alignment testing for LLMs by leveraging the autonomous abilities of agents. **Task Formulation:** The task of evaluating LLMs' alignment with human values involves generating test prompts, querying the target LLM, and assessing its responses using an evaluator. The process is formalized as follows: - Generate a test prompt by combining a misconduct sample and a task-specific prompt template. - Query the target LLM using the combined prompt. - Assess the LLM's response using an evaluator. **Emulation Stage:** The core of the Emulation stage involves generating realistic test scenarios using past evaluation records and an emulator. The emulator retrieves relevant evaluation records from the memory, generates new test scenarios, and evaluates the target LLM's feedback. **Refinement Stage:** The refinement stage iteratively

ALI-Agent: Assessing LLMs' Alignment with Human Values via Agent-based Evaluation

24 May 2024 | Jingnan Zheng, Han Wang, An Zhang, Tai D. Nguyen, Jun Sun, Tat-Seng Chua