9 Apr 2024 | Matthew T. Jackson, Michael T. Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, Jakob N. Foerster
Policy-guided diffusion is a method for generating synthetic trajectories in offline reinforcement learning (RL). It addresses the distribution shift between the behavior policy that collected the offline dataset and the target policy being trained, a shift that can cause overestimation bias and training instability. Unlike autoregressive world models, which require truncating model rollouts to limit compounding error, policy-guided diffusion uses a diffusion model to generate entire trajectories under the behavior distribution, then applies guidance from the target policy to move the synthetic experience closer to the target distribution. The result is a behavior-regularized target distribution that balances action likelihood under both policies, producing plausible trajectories with high target-policy probability and lower dynamics error than baseline methods.
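To make the guidance step concrete, below is a minimal PyTorch-style sketch of one guided denoising update, assuming an epsilon-predicting trajectory diffusion model trained on the behavior data and a unit-variance Gaussian target policy; the module names, dimensions, and scaling constants are illustrative assumptions rather than the authors' implementation.

```python
import torch

# Toy stand-ins (hypothetical shapes/names): a trajectory is a flat tensor of
# interleaved (state, action) pairs; the denoiser and policy are small MLPs.
STATE_DIM, ACTION_DIM, HORIZON = 4, 2, 8
TRAJ_DIM = HORIZON * (STATE_DIM + ACTION_DIM)

denoiser = torch.nn.Sequential(  # epsilon predictor trained on the offline (behavior) data
    torch.nn.Linear(TRAJ_DIM + 1, 128), torch.nn.ReLU(), torch.nn.Linear(128, TRAJ_DIM)
)
policy_mean = torch.nn.Sequential(  # target policy: Gaussian mean with fixed std for simplicity
    torch.nn.Linear(STATE_DIM, 64), torch.nn.ReLU(), torch.nn.Linear(64, ACTION_DIM)
)

def target_policy_logprob(traj):
    """Sum of log pi(a_t | s_t) over the trajectory under a unit-variance Gaussian policy."""
    steps = traj.view(HORIZON, STATE_DIM + ACTION_DIM)
    states, actions = steps[:, :STATE_DIM], steps[:, STATE_DIM:]
    dist = torch.distributions.Normal(policy_mean(states), 1.0)
    return dist.log_prob(actions).sum()

def guided_denoising_step(x_t, t, alpha_bar, guidance_coef=1.0):
    """One reverse-diffusion step with policy guidance added to the behavior score."""
    t_in = torch.full((1,), float(t))
    eps = denoiser(torch.cat([x_t, t_in]))            # behavior model's noise estimate
    score = -eps / (1 - alpha_bar) ** 0.5             # score of the behavior distribution

    x_req = x_t.detach().requires_grad_(True)         # gradient of target-policy log-prob
    grad = torch.autograd.grad(target_policy_logprob(x_req), x_req)[0]

    guided_score = score + guidance_coef * grad       # shift the sample toward the target policy
    # (A full sampler would now apply the usual DDPM/DDIM update using guided_score.)
    return guided_score

x = torch.randn(TRAJ_DIM)
print(guided_denoising_step(x, t=10, alpha_bar=0.5).shape)
```

With guidance_coef set to 0 this reduces to unguided sampling from the behavior model; larger values push the denoised actions toward those the target policy would select.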
The method trains a diffusion model on the offline dataset to generate synthetic trajectories, then applies policy guidance to shift the sampling distribution towards the target policy. This avoids the compounding error of autoregressive rollouts and yields trajectories that are more representative of the target policy. Experiments show that policy-guided synthetic data outperforms both unguided synthetic data and real data across a range of offline RL algorithms and environments, including MuJoCo locomotion and Maze2D, with an 11.2% improvement for the TD3+BC algorithm on the MuJoCo locomotion tasks. It also achieves lower dynamics error than prior model-based methods such as PETS, while maintaining comparable target-policy likelihood.
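At a higher level, one plausible way the pieces fit together (a structural sketch with hypothetical stand-in functions, not the authors' code) is to fit the trajectory diffusion model once on the offline data and then alternate between regenerating a guided synthetic buffer under the current agent policy and running standard offline RL updates on it:

```python
import numpy as np

# Structural sketch only: every function and name below is a hypothetical
# stand-in, not the authors' code.

def train_trajectory_diffusion(offline_dataset):
    """Stub: fit a diffusion model to whole sub-trajectories of the offline data."""
    return {"fitted_on": len(offline_dataset)}

def sample_guided_trajectories(diffusion_model, policy, n_traj, guidance_coef):
    """Stub: run reverse diffusion, adding the policy-guidance term at each step."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(n_traj, 8, 6))  # (trajectories, horizon, state+action dims)

def offline_rl_update(policy, synthetic_batch):
    """Stub: one update of any offline RL algorithm (e.g. TD3+BC) on synthetic data."""
    return policy

offline_dataset = [None] * 1000           # placeholder for (s, a, r, s') tuples
policy = {"params": 0}                    # placeholder target policy
diffusion_model = train_trajectory_diffusion(offline_dataset)

for refresh in range(5):                  # periodically regenerate the synthetic buffer
    synthetic = sample_guided_trajectories(diffusion_model, policy,
                                           n_traj=64, guidance_coef=1.0)
    for batch in np.array_split(synthetic, 4):
        policy = offline_rl_update(policy, batch)
```

In this sketch, the synthetic buffer is refreshed as the agent's policy changes, so the training data can track the target distribution without any interaction with the real environment.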
Policy-guided diffusion provides an effective alternative to autoregressive offline world models, enabling the controllable generation of synthetic training data. It addresses the out-of-sample issue by generating trajectories that are more representative of the target policy, leading to improved performance in offline RL. The method is theoretically grounded in the behavior-regularized target distribution, which balances action likelihood under both the behavior and target policies. This approach offers a promising direction for future work, including automatically tuning the guidance coefficient for hyperparameter-free guidance and extending the method to large-scale video generation models.
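As a rough sketch of that grounding (the notation below is assumed for illustration, not taken verbatim from the paper): writing p_B for the trajectory density under the behavior policy, pi for the target policy, and lambda for a guidance coefficient, the guided sampler can be read as drawing trajectories from a product distribution of roughly this form:

```latex
% Sketch of a behavior-regularized target distribution (assumed notation):
% p_B is the trajectory density under the behavior policy beta, \pi is the
% target policy, and \lambda is a guidance coefficient.
\tilde{p}_\lambda(\tau) \;\propto\; p_B(\tau) \prod_{t=0}^{H-1} \pi(a_t \mid s_t)^{\lambda}
% Since p_B(\tau) already contains \prod_t \beta(a_t \mid s_t), each action is
% weighted by both policies' likelihoods, which is the "balance" described
% above; \lambda = 0 recovers pure behavior-model sampling.
```

In diffusion terms, sampling from such a product amounts to adding lambda times the gradient of the target policy's log-likelihood to the behavior model's score at each denoising step, which is what the guided step sketched earlier illustrates.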