A CMDP-within-online framework for Meta-safe Reinforcement Learning

2023 | Vanshaj Khattar, Yuhao Ding, Bilgehan Sel, Javad Lavaei, Ming Jin
This paper introduces a novel framework for meta-safe reinforcement learning (Meta-SRL) that addresses the challenge of constraint violations in meta-reinforcement learning (meta-RL). The proposed CMDP-within-online framework establishes the first provable guarantees for Meta-SRL by combining gradient-based meta-learning with online learning techniques, enabling the meta-learner to adapt to new tasks while satisfying safety constraints. Key contributions include an inexact CMDP-within-online framework, task-averaged regret bounds for both the optimality gap and constraint violations, and learning rates that adapt to dynamic environments; the framework also extends to settings with a dynamically changing oracle. The theoretical analysis shows that the task-averaged regret decreases with task similarity in static environments and with task relatedness in dynamic environments. Experiments demonstrate that the proposed approach achieves higher rewards and lower constraint violations than baseline methods. The framework applies to a broad range of reinforcement learning tasks and can be extended to fairness constraints and non-stationary environments in future work.
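To make the online-within-online structure concrete, below is a minimal, self-contained Python sketch: an outer online learner adapts a shared policy-parameter initialization across a sequence of constrained tasks, while an inner primal-dual routine approximately solves each task from that initialization. The toy linear tasks, step sizes, and specific update rules here are illustrative assumptions for exposition only, not the paper's actual algorithm, guarantees, or experiments.

```python
# Illustrative sketch of the CMDP-within-online idea: an outer online learner
# adapts a policy-parameter initialization across a sequence of constrained
# tasks, while an inner primal-dual loop (approximately) solves each task.
# All task models, step sizes, and the toy linear objectives below are
# hypothetical stand-ins, not the paper's algorithm or experiments.

import numpy as np

rng = np.random.default_rng(0)
DIM = 4            # dimension of the (toy) policy parameter
TASKS = 20         # number of sequentially arriving tasks
INNER_STEPS = 50   # within-task primal-dual iterations


def make_task():
    """Draw a toy constrained task: maximize r.theta subject to c.theta <= b."""
    return {
        "r": rng.normal(size=DIM),   # reward direction
        "c": rng.normal(size=DIM),   # constraint (cost) direction
        "b": 1.0,                    # cost budget
    }


def inner_primal_dual(task, theta0, lr=0.05):
    """Approximately solve one constrained task with projected primal-dual
    updates, starting from the meta-learned initialization theta0."""
    theta, lam = theta0.copy(), 0.0
    for _ in range(INNER_STEPS):
        # Lagrangian: L(theta, lam) = r.theta - lam * (c.theta - b)
        grad_theta = task["r"] - lam * task["c"]
        theta = np.clip(theta + lr * grad_theta, -1.0, 1.0)   # primal ascent
        violation = task["c"] @ theta - task["b"]
        lam = max(0.0, lam + lr * violation)                  # dual ascent
    return theta, max(0.0, task["c"] @ theta - task["b"])


# Outer loop: an online-gradient-style meta-learner moves the initialization
# toward each task's solution, so later (similar) tasks start closer to
# feasibility and optimality -- the mechanism behind task-averaged regret.
meta_theta = np.zeros(DIM)
outer_lr = 0.2
for t in range(TASKS):
    task = make_task()
    theta_t, viol_t = inner_primal_dual(task, meta_theta)
    meta_theta = (1 - outer_lr) * meta_theta + outer_lr * theta_t
    print(f"task {t:2d}: reward={task['r'] @ theta_t:+.3f}  violation={viol_t:.3f}")
```

In this sketch, greater similarity between consecutive tasks means the adapted initialization starts each inner solve closer to a feasible, near-optimal point, which mirrors how the paper's task-averaged regret bounds shrink with task similarity (static case) or task relatedness (dynamic case).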