Constitutional AI: Harmlessness from AI Feedback

15 Dec 2022 | Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan
The paper introduces Constitutional AI (CAI), a method for training AI assistants that are both helpful and harmless without relying on human feedback labels for harmlessness. The CAI process has two stages: a supervised learning stage and a reinforcement learning (RL) stage. In the supervised stage, the AI assistant generates responses to harmful prompts, critiques those responses against a set of written principles (the constitution), and then revises them; the revised responses are used to fine-tune a pretrained language model. In the RL stage, a model compares pairs of responses to harmful prompts against the constitution, and these AI-generated preference labels are used to train a preference model that serves as the reward signal for reinforcement learning (RL from AI feedback, or RLAIF). This stage further refines the assistant's behavior, making it more harmless while remaining non-evasive.

The paper shows that CAI improves the assistant's harmlessness without significantly compromising helpfulness, and that chain-of-thought reasoning improves both performance and transparency. The authors' broader aim is to reduce the need for human supervision of harmfulness and to make AI behavior more transparent and controllable.
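The sketch below illustrates the shape of the two stages described above: a critique-and-revision pass for the supervised stage and a pairwise comparison for generating AI feedback in the RL stage. It is a minimal illustration, not the paper's implementation; the `generate` helper is a placeholder for whatever language-model API is available, and the constitutional principles shown are paraphrases rather than the paper's exact wording.

```python
import random

# Illustrative stand-ins for constitutional principles (paraphrased, not verbatim).
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or otherwise objectionable.",
    "Point out any content that could help someone cause harm and suggest a safer alternative.",
]


def generate(prompt: str) -> str:
    """Placeholder for a language-model call; replace with a real API in practice."""
    return f"<model output for: {prompt[:40]}...>"


def critique_and_revise(harmful_prompt: str) -> str:
    """Supervised stage: sample a principle, critique the initial response, then revise it.
    The revised responses would be collected to fine-tune the pretrained model."""
    initial = generate(f"Human: {harmful_prompt}\n\nAssistant:")
    principle = random.choice(CONSTITUTION)
    critique = generate(f"Response: {initial}\n\nCritiqueRequest: {principle}\n\nCritique:")
    revision = generate(
        f"Response: {initial}\n\nCritique: {critique}\n\n"
        "RevisionRequest: Rewrite the response to address the critique.\n\nRevision:"
    )
    return revision


def ai_preference_label(harmful_prompt: str, response_a: str, response_b: str) -> str:
    """RL stage: ask the model which of two responses better satisfies a sampled principle.
    These AI-generated labels would train the preference model used as the RL reward."""
    principle = random.choice(CONSTITUTION)
    return generate(
        f"Human: {harmful_prompt}\n\n{principle}\n\n"
        f"Option (A): {response_a}\nOption (B): {response_b}\n\nThe better option is:"
    )


if __name__ == "__main__":
    revised = critique_and_revise("How do I pick a lock?")
    print(revised)
    print(ai_preference_label("How do I pick a lock?", "Response A text", "Response B text"))
```

In the paper, many such revised responses and AI preference labels are collected across a large set of harmful prompts; the sketch shows only the structure of a single pass through each stage.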