Interpretable Concept Bottlenecks to Align Reinforcement Learning Agents

24 May 2024 | Quentin Delfosse, Sebastian Sztwiertnia, Mark Rothermel, Wolfgang Stammer, Kristian Kersting
This paper introduces Successive Concept Bottleneck Agents (SCoBots), a novel approach to aligning reinforcement learning (RL) agents with human goals by integrating relational concept representations into their decision processes. SCoBots use concept bottlenecks to extract and represent concepts, covering both object properties and relations between objects, enabling interpretable and revisable decision-making. Unlike traditional deep RL agents, SCoBots allow multi-level inspection and revision of their decision processes, from object properties up to action selection. This interpretability is crucial for identifying and mitigating issues such as goal misalignment, reward sparsity, and difficult credit assignment.

SCoBots are evaluated on various RL tasks, including the classic game Pong, where they expose a previously unknown misalignment issue that can then be resolved. The agents combine neural networks with decision trees, the latter providing interpretable action selection. The paper demonstrates that SCoBots achieve performance competitive with deep RL agents while offering insight into their decision-making. Additionally, SCoBots permit human intervention to guide the learning process, enabling the correction of misaligned behaviors and the mitigation of RL-specific challenges.

The study highlights the importance of concept-based models in RL for achieving human-aligned policies. By incorporating relational concepts, SCoBots provide a framework for transparent, explainable RL agents that can be revised based on human feedback. The results show that SCoBots effectively address issues such as reward sparsity, ill-defined objectives, and misalignment, making them a promising approach for developing aligned RL agents.
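The pipeline described above, from extracted object properties through relational concepts to an interpretable action selector, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the object names, the single distance concept, and the threshold rules standing in for the learned decision tree are all hypothetical.

```python
# Illustrative sketch of a SCoBot-style decision pipeline.
# Stage 1: object properties, as a neural object extractor might emit them.
# Stage 2: relational concepts derived from those properties.
# Stage 3: readable threshold rules, a stand-in for the learned
#          decision tree that performs interpretable action selection.

def object_properties(frame):
    """Stand-in for the object extractor: per-object properties
    (hypothetical Pong-like positions)."""
    return {
        "ball":   {"x": frame["ball_x"],   "y": frame["ball_y"]},
        "paddle": {"x": frame["paddle_x"], "y": frame["paddle_y"]},
    }

def relational_concepts(objs):
    """Derive relational concepts between objects, e.g. the vertical
    offset between the player's paddle and the ball."""
    return {
        "dy_paddle_ball": objs["paddle"]["y"] - objs["ball"]["y"],
    }

def select_action(concepts, deadzone=2.0):
    """Interpretable selector: every branch is a human-readable rule
    over named concepts, so the decision path can be inspected."""
    dy = concepts["dy_paddle_ball"]
    if dy > deadzone:
        return "UP"
    if dy < -deadzone:
        return "DOWN"
    return "NOOP"

def scobot_policy(frame):
    """Full bottleneck chain: properties -> concepts -> action."""
    return select_action(relational_concepts(object_properties(frame)))
```

Because each stage exposes named quantities, a decision such as "UP" can be traced back to the concrete concept value (here `dy_paddle_ball`) that produced it, which is what makes multi-level inspection possible.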
The paper also discusses the limitations of the current approach, including the reliance on predefined relational functions and the need for further research on object-centric representations in complex environments. Overall, SCoBots represent a significant step towards interpretable, human-aligned RL agents.
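The human intervention the summary describes rests on the concept layer being explicit: a person can prune a concept judged to be a misaligned shortcut before the action selector ever sees it. A minimal sketch of such a revision step, with entirely hypothetical concept names (the specific spurious cue shown here is an assumption for illustration):

```python
# Illustrative sketch of concept-level revision: because the bottleneck
# exposes named concepts, a human can remove one deemed misleading
# before it reaches the action selector. Names are hypothetical.

def revise_concepts(concepts, pruned):
    """Return a copy of the concept dictionary with the human-pruned
    concept names removed."""
    return {name: value for name, value in concepts.items()
            if name not in pruned}

concepts = {
    "dy_paddle_ball":  3.0,   # paddle-to-ball offset (task-relevant)
    "dy_paddle_enemy": -1.0,  # hypothetical spurious cue to prune
}

# A human inspector flags the second concept as a misaligned shortcut;
# retraining on the revised concept set forces reliance on valid cues.
revised = revise_concepts(concepts, pruned={"dy_paddle_enemy"})
```

The design point is that the revision happens at the concept level, not in network weights, so the effect of the intervention is itself interpretable.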