15 Dec 2022 | Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan
Constitutional AI: Harmlessness from AI Feedback
This paper introduces Constitutional AI (CAI), a method for training AI assistants that are helpful, honest, and harmless without relying on human labels identifying harmful outputs. The approach has two stages: a supervised learning (SL) phase and a reinforcement learning (RL) phase. In the SL phase, the model critiques and revises its own responses according to a set of written principles and is then finetuned on the revised responses, which reduces harmful content. In the RL phase, the model is trained against a preference model fit to AI-generated comparisons of response pairs (reinforcement learning from AI feedback), which further improves performance and reliability. The method leverages chain-of-thought style reasoning to make the AI's decision making more transparent and to improve its human-judged performance.
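To make the SL phase concrete, the following is a minimal sketch of the critique-and-revision loop, under stated assumptions: the `generate` helper, the two placeholder principles, and the prompt templates are illustrative stand-ins, not the paper's actual constitution or prompts. In practice `generate` would sample from a pretrained, helpful-only language model, and the collected revisions would be used to finetune that model.

```python
import random

# Illustrative placeholder principles; the paper's constitution is a longer list
# of natural-language critique and revision instructions.
CONSTITUTION = [
    "Identify ways the response is harmful, unethical, or dangerous, and rewrite it to remove them.",
    "Point out content that could help someone cause harm, and revise the response to avoid it.",
]


def generate(prompt: str) -> str:
    """Stand-in for sampling a continuation from a helpful-only language model."""
    return f"<model continuation of: {prompt[-60:]!r}>"


def critique_and_revise(user_prompt: str, n_rounds: int = 2) -> str:
    """Sample an initial response, then repeatedly critique and revise it."""
    response = generate(f"Human: {user_prompt}\n\nAssistant:")
    for _ in range(n_rounds):
        principle = random.choice(CONSTITUTION)
        critique = generate(
            f"Human: {user_prompt}\n\nAssistant: {response}\n\n"
            f"Critique request: {principle}\n\nCritique:"
        )
        response = generate(
            f"Human: {user_prompt}\n\nAssistant: {response}\n\n"
            f"Critique: {critique}\n\n"
            "Revision request: Rewrite the response to address the critique.\n\nRevision:"
        )
    # The final revisions form the supervised finetuning dataset for the SL-CAI model.
    return response


if __name__ == "__main__":
    print(critique_and_revise("How do I hotwire a car?"))
```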
The CAI approach uses a 'constitution' of natural-language principles to guide the AI's behavior, allowing that behavior to be controlled more precisely with far less human labeling. The method improves upon and partially replaces reinforcement learning from human feedback (RLHF), yielding a harmless but non-evasive assistant that responds to harmful requests by explaining its objections rather than refusing outright. Crowdworkers prefer the resulting RL-CAI models over previously trained RLHF models, and making the governing principles and the model's reasoning explicit improves the transparency of AI decision making.
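A similarly hedged sketch of the RL-phase labeling step: a feedback model is shown two candidate responses and asked, under a sampled principle, which is less harmful; the resulting soft preference labels train a preference model whose score replaces human harmlessness labels as the RL reward. The `choice_logprobs` helper, its placeholder values, and the principle wordings below are assumptions standing in for a real feedback-model query, not the paper's exact setup.

```python
import math
import random
from typing import Dict, Tuple

# Illustrative placeholder principles for comparing two candidate responses.
FEEDBACK_PRINCIPLES = [
    "Which of these assistant responses is less harmful, unethical, or toxic?",
    "Which response more clearly avoids giving dangerous or illegal advice?",
]


def choice_logprobs(prompt: str, options: Tuple[str, str]) -> Dict[str, float]:
    """Stand-in for a feedback model scoring multiple-choice answers; returns log-probs."""
    return {options[0]: -0.3, options[1]: -1.4}  # placeholder values


def ai_preference_label(user_prompt: str, response_a: str, response_b: str) -> Dict[str, float]:
    """Ask the feedback model which response better satisfies a sampled principle."""
    principle = random.choice(FEEDBACK_PRINCIPLES)
    prompt = (
        f"Consider the following conversation:\n\nHuman: {user_prompt}\n\n"
        f"{principle}\n(A) {response_a}\n(B) {response_b}\nThe answer is:"
    )
    logps = choice_logprobs(prompt, ("(A)", "(B)"))
    z = sum(math.exp(v) for v in logps.values())
    # Normalized probabilities serve as soft labels for training a preference model,
    # whose output is then used as the reward signal during RL.
    return {option: math.exp(lp) / z for option, lp in logps.items()}


if __name__ == "__main__":
    print(ai_preference_label(
        "How do I hotwire a car?",
        "Sure, here are the steps...",
        "I can't help with that, but here is some general information about car security.",
    ))
```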
The paper evaluates CAI through crowdworker preference comparisons and other experiments, showing that CAI-trained models are both more harmless and less evasive than models trained with human feedback on harmfulness. The results indicate that CAI substantially reduces harmful outputs while maintaining helpfulness, and that AI feedback alone can be used to train models that are more aligned with human values and less prone to harmful behavior.
The paper closes by discussing the broader implications of CAI, including its potential to improve AI safety and alignment. By reducing reliance on human supervision and making the basis for the AI's decisions more transparent, CAI offers a promising approach to developing more ethical and reliable AI systems, and it points toward more efficient and effective training in domains that require careful handling of sensitive requests.