Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

24 May 2024 | Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel, Wolfgang Stammer, Kristian Kersting
This paper introduces Successive Concept Bottleneck Agents (SCoBots), a novel approach to aligning reinforcement learning (RL) agents with human goals by integrating relational concept representations into their decision processes. SCoBots use concept bottlenecks to extract and represent concepts, covering both object properties and relations between objects, enabling interpretable and revisable decision-making. Unlike traditional deep RL agents, SCoBots allow multi-level inspection and revision of their decision processes, from object properties up to action selection. This interpretability is crucial for identifying and mitigating issues such as goal misalignment, reward sparsity, and difficult credit assignment.

SCoBots are evaluated on various RL tasks, including the classic game Pong, where they expose a previously unknown misalignment issue that can then be resolved. The agents combine neural networks with decision trees, the latter providing interpretable action selection. The paper demonstrates that SCoBots achieve performance competitive with deep RL agents while offering insight into their decision-making. Additionally, SCoBots permit human intervention to guide the learning process, enabling the correction of misaligned behaviors and the mitigation of RL-specific challenges.

The study highlights the importance of concept-based models in RL for achieving human-aligned policies. By incorporating relational concepts, SCoBots provide a framework for transparent, explainable RL agents that can be revised based on human feedback. The results show that SCoBots effectively address issues such as reward sparsity, ill-defined objectives, and misalignment, making them a promising approach for developing aligned RL agents.
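The pipeline described above, from extracted object properties through relational concepts to an interpretable action selector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the object names, the single distance concept, and the threshold rules standing in for the learned decision tree are all hypothetical.

```python
# Illustrative sketch of a SCoBot-style decision pipeline.
# Stage 1: object properties, as a neural object extractor might emit them.
# Stage 2: relational concepts derived from those properties.
# Stage 3: readable threshold rules, a stand-in for the learned
#          decision tree that performs interpretable action selection.

def object_properties(frame):
    """Stand-in for the object extractor: per-object properties
    (hypothetical Pong-like positions)."""
    return {
        "ball":   {"x": frame["ball_x"],   "y": frame["ball_y"]},
        "paddle": {"x": frame["paddle_x"], "y": frame["paddle_y"]},
    }

def relational_concepts(objs):
    """Derive relational concepts between objects, e.g. the vertical
    offset between the player's paddle and the ball."""
    return {
        "dy_paddle_ball": objs["paddle"]["y"] - objs["ball"]["y"],
    }

def select_action(concepts, deadzone=2.0):
    """Interpretable selector: every branch is a human-readable rule
    over named concepts, so the decision path can be inspected."""
    dy = concepts["dy_paddle_ball"]
    if dy > deadzone:
        return "UP"
    if dy < -deadzone:
        return "DOWN"
    return "NOOP"

def scobot_policy(frame):
    """Full bottleneck chain: properties -> concepts -> action."""
    return select_action(relational_concepts(object_properties(frame)))
```

Because each stage exposes named quantities, a decision such as "UP" can be traced back to the concrete concept value (here `dy_paddle_ball`) that produced it, which is what makes multi-level inspection possible.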
The paper also discusses the limitations of the current approach, including the reliance on predefined relational functions and the need for further research on object-centric representations in complex environments. Overall, SCoBots represent a significant step towards interpretable, human-aligned RL agents.
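The human intervention the summary describes rests on the concept layer being explicit: a person can prune a concept judged to be a misaligned shortcut before the action selector ever sees it. A minimal sketch of such a revision step, with entirely hypothetical concept names (the specific spurious cue shown here is an assumption for illustration):

```python
# Illustrative sketch of concept-level revision: because the bottleneck
# exposes named concepts, a human can remove one deemed misleading
# before it reaches the action selector. Names are hypothetical.

def revise_concepts(concepts, pruned):
    """Return a copy of the concept dictionary with the human-pruned
    concept names removed."""
    return {name: value for name, value in concepts.items()
            if name not in pruned}

concepts = {
    "dy_paddle_ball":  3.0,   # paddle-to-ball offset (task-relevant)
    "dy_paddle_enemy": -1.0,  # hypothetical spurious cue to prune
}

# A human inspector flags the second concept as a misaligned shortcut;
# retraining on the revised concept set forces reliance on valid cues.
revised = revise_concepts(concepts, pruned={"dy_paddle_enemy"})
```

The design point is that the revision happens at the concept level, not in network weights, so the effect of the intervention is itself interpretable.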