Constitutional AI: Harmlessness from AI Feedback

15 Dec 2022 | Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan
The paper introduces Constitutional AI (CAI), a method for training AI assistants that are both helpful and harmless without relying on human feedback labels for harmlessness. The CAI process has two stages: a supervised learning stage and a reinforcement learning (RL) stage. In the supervised stage, the AI assistant generates responses to harmful prompts, critiques those responses against a set of written principles (the constitution), and then revises them; the revised responses are used to fine-tune a pretrained language model. In the RL stage, a model compares pairs of responses to harmful prompts against the constitution, and these AI-generated preference labels are used to train a preference model that serves as the reward signal for reinforcement learning (RL from AI feedback, or RLAIF). This stage further refines the assistant's behavior, making it more harmless while remaining non-evasive.

The paper shows that CAI improves the assistant's harmlessness without significantly compromising helpfulness, and that chain-of-thought reasoning improves both performance and transparency. The authors' broader aim is to reduce the need for human supervision of harmfulness and to make AI behavior more transparent and controllable.
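The sketch below illustrates the shape of the two stages described above: a critique-and-revision pass for the supervised stage and a pairwise comparison for generating AI feedback in the RL stage. It is a minimal illustration, not the paper's implementation; the `generate` helper is a placeholder for whatever language-model API is available, and the constitutional principles shown are paraphrases rather than the paper's exact wording.

```python
import random

# Illustrative stand-ins for constitutional principles (paraphrased, not verbatim).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or otherwise objectionable.",
    "Point out any content that could help someone cause harm and suggest a safer alternative.",
]


def generate(prompt: str) -> str:
    """Placeholder for a language-model call; replace with a real API in practice."""
    return f"<model output for: {prompt[:40]}...>"


def critique_and_revise(harmful_prompt: str) -> str:
    """Supervised stage: sample a principle, critique the initial response, then revise it.
    The revised responses would be collected to fine-tune the pretrained model."""
    initial = generate(f"Human: {harmful_prompt}\n\nAssistant:")
    principle = random.choice(CONSTITUTION)
    critique = generate(f"Response: {initial}\n\nCritiqueRequest: {principle}\n\nCritique:")
    revision = generate(
        f"Response: {initial}\n\nCritique: {critique}\n\n"
        "RevisionRequest: Rewrite the response to address the critique.\n\nRevision:"
    )
    return revision


def ai_preference_label(harmful_prompt: str, response_a: str, response_b: str) -> str:
    """RL stage: ask the model which of two responses better satisfies a sampled principle.
    These AI-generated labels would train the preference model used as the RL reward."""
    principle = random.choice(CONSTITUTION)
    return generate(
        f"Human: {harmful_prompt}\n\n{principle}\n\n"
        f"Option (A): {response_a}\nOption (B): {response_b}\n\nThe better option is:"
    )


if __name__ == "__main__":
    revised = critique_and_revise("How do I pick a lock?")
    print(revised)
    print(ai_preference_label("How do I pick a lock?", "Response A text", "Response B text"))
```

In the paper, many such revised responses and AI preference labels are collected across a large set of harmful prompts; the sketch shows only the structure of a single pass through each stage.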