A CMDP-within-online framework for Meta-safe Reinforcement Learning

2023 | Vanshaj Khattar, Yuhao Ding, Bilgehan Sel, Javad Lavaei, Ming Jin
This paper introduces a novel framework for meta-safe reinforcement learning (Meta-SRL) that addresses the challenge of constraint violations in meta-reinforcement learning (meta-RL). The proposed CMDP-within-online framework establishes the first provable guarantees for Meta-SRL by combining gradient-based meta-learning with online learning techniques, enabling the meta-learner to adapt to new tasks while satisfying safety constraints. Key contributions include an inexact CMDP-within-online framework, task-averaged regret bounds for both the optimality gap and constraint violations, and learning rates that adapt to dynamic environments; the framework also extends to settings with a dynamically changing oracle. The theoretical analysis shows that the task-averaged regret decreases with task similarity in static environments and with task relatedness in dynamic environments. Experiments demonstrate that the proposed approach achieves higher rewards and lower constraint violations than baseline methods. The framework applies to a broad range of reinforcement learning tasks and can be extended to fairness constraints and non-stationary environments in future work.
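To make the online-within-online structure concrete, below is a minimal, self-contained Python sketch: an outer online learner adapts a shared policy-parameter initialization across a sequence of constrained tasks, while an inner primal-dual routine approximately solves each task from that initialization. The toy linear tasks, step sizes, and specific update rules here are illustrative assumptions for exposition only, not the paper's actual algorithm, guarantees, or experiments.

```python
# Illustrative sketch of the CMDP-within-online idea: an outer online learner
# adapts a policy-parameter initialization across a sequence of constrained
# tasks, while an inner primal-dual loop (approximately) solves each task.
# All task models, step sizes, and the toy linear objectives below are
# hypothetical stand-ins, not the paper's algorithm or experiments.

import numpy as np

rng = np.random.default_rng(0)
DIM = 4            # dimension of the (toy) policy parameter
TASKS = 20         # number of sequentially arriving tasks
INNER_STEPS = 50   # within-task primal-dual iterations


def make_task():
    """Draw a toy constrained task: maximize r.theta subject to c.theta <= b."""
    return {
        "r": rng.normal(size=DIM),   # reward direction
        "c": rng.normal(size=DIM),   # constraint (cost) direction
        "b": 1.0,                    # cost budget
    }


def inner_primal_dual(task, theta0, lr=0.05):
    """Approximately solve one constrained task with projected primal-dual
    updates, starting from the meta-learned initialization theta0."""
    theta, lam = theta0.copy(), 0.0
    for _ in range(INNER_STEPS):
        # Lagrangian: L(theta, lam) = r.theta - lam * (c.theta - b)
        grad_theta = task["r"] - lam * task["c"]
        theta = np.clip(theta + lr * grad_theta, -1.0, 1.0)   # primal ascent
        violation = task["c"] @ theta - task["b"]
        lam = max(0.0, lam + lr * violation)                  # dual ascent
    return theta, max(0.0, task["c"] @ theta - task["b"])


# Outer loop: an online-gradient-style meta-learner moves the initialization
# toward each task's solution, so later (similar) tasks start closer to
# feasibility and optimality -- the mechanism behind task-averaged regret.
meta_theta = np.zeros(DIM)
outer_lr = 0.2
for t in range(TASKS):
    task = make_task()
    theta_t, viol_t = inner_primal_dual(task, meta_theta)
    meta_theta = (1 - outer_lr) * meta_theta + outer_lr * theta_t
    print(f"task {t:2d}: reward={task['r'] @ theta_t:+.3f}  violation={viol_t:.3f}")
```

In this sketch, greater similarity between consecutive tasks means the adapted initialization starts each inner solve closer to a feasible, near-optimal point, which mirrors how the paper's task-averaged regret bounds shrink with task similarity (static case) or task relatedness (dynamic case).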