Safe Offline Reinforcement Learning with Feasibility-Guided Diffusion Model

2024 | Yinan Zheng, Jianxiong Li, Dongjie Yu, Yujie Yang, Shengbo Eben Li, Xianyuan Zhan, Jingjing Liu
FISOR is a novel safe offline reinforcement learning (RL) approach that ensures safety while maximizing rewards. It addresses the challenge of enforcing hard safety constraints in offline settings by translating the problem into a feasibility-dependent objective. The method uses Hamilton-Jacobi (HJ) reachability analysis to identify the largest feasible region from offline data, enabling the separation of feasible and infeasible states. This allows rewards to be maximized within feasible regions and safety risks to be minimized in infeasible regions.

FISOR decouples the optimization of safety constraints, reward maximization, and policy learning into three independent processes, leading to improved safety performance and stability. The optimal policy is derived in the form of weighted behavior cloning, which is learned effectively with a guided diffusion model. Additionally, a novel energy-guided sampling method is proposed to simplify training without requiring a complex time-dependent classifier.

FISOR outperforms existing methods on the DSRL benchmark, achieving safety in all tasks and top returns in most. It is also effective in safe offline imitation learning. The method demonstrates strong performance in both safety-critical and high-reward scenarios, offering a practical solution for real-world applications.
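The feasibility-dependent objective can be pictured as a per-sample reweighting of the behavior data: in feasible states, actions are weighted by their reward advantage; in infeasible states, actions are weighted by how strongly they reduce the safety (reachability) value. The sketch below is illustrative only; the function name `feasibility_weights`, the exponential advantage-weighted form of the weights, and the zero threshold on the feasibility value are assumptions for exposition, not the paper's exact expressions.

```python
import numpy as np

def feasibility_weights(v_h, adv_reward, adv_safety, alpha=1.0):
    """Illustrative per-sample weights for the weighted behavior cloning view.

    v_h        : HJ reachability value of each state; v_h <= 0 is treated as
                 feasible (largest feasible region estimated from offline data).
    adv_reward : reward advantage of each (state, action) pair.
    adv_safety : safety advantage; larger values mean the action drives the
                 system further into constraint violation.
    alpha      : temperature controlling how sharply the data are reweighted.
    """
    v_h = np.asarray(v_h, dtype=float)
    feasible = v_h <= 0.0

    # Feasible states: maximize reward, so upweight high-reward-advantage actions.
    w_feasible = np.exp(alpha * np.asarray(adv_reward, dtype=float))

    # Infeasible states: ignore reward and upweight actions that most reduce
    # safety risk, i.e. steer the state back toward the feasible region.
    w_infeasible = np.exp(-alpha * np.asarray(adv_safety, dtype=float))

    return np.where(feasible, w_feasible, w_infeasible)


if __name__ == "__main__":
    # Toy usage: two feasible states and one infeasible state.
    v_h = [-0.5, -0.1, 0.3]
    adv_reward = [1.0, -0.2, 2.0]   # ignored for the infeasible state
    adv_safety = [0.0, 0.0, 1.5]
    print(feasibility_weights(v_h, adv_reward, adv_safety))
```

Weights of this kind then modulate the imitation objective of the diffusion policy, so only the reweighted portion of the dataset is cloned; the paper's exact weight definitions follow from its HJ-based value functions and its energy-guided sampling scheme.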