9 Apr 2024 | Matthew T. Jackson, Michael T. Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, Jakob N. Foerster
The paper "Policy-Guided Diffusion" by Matthew T. Jackson, Michael T. Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, and Jakob N. Foerster addresses the challenge of distribution shift in offline reinforcement learning (RL). The authors propose a method called *policy-guided diffusion* (PGD) to generate synthetic, on-policy experience from offline datasets, which can be used to train agents without the need for direct interaction with the environment.
In offline RL, the distribution shift between the behavior policy that collected the dataset and the target policy being trained can cause instability and overestimation bias. Model-based approaches address this by learning a single-step world model from the offline dataset and rolling it out autoregressively, but such rollouts suffer from compounding errors and limited coverage. PGD avoids these issues by modeling entire trajectories under the behavior distribution with a diffusion model and applying guidance from the target policy at sampling time to move synthetic trajectories closer to the target distribution.
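To make the sampling procedure concrete, here is a minimal sketch of classifier-guidance-style policy guidance applied to a DDPM-style reverse process. This is not the authors' implementation: the `denoiser` and `policy.log_prob` interfaces, the tensor layout (states concatenated with actions along the last dimension), and the `guidance_coef` handling are assumptions made for illustration.

```python
# Hypothetical sketch: guiding a behavior-trained trajectory diffusion model
# with the gradient of the target policy's action log-likelihood.
# Assumptions: `denoiser(traj, t)` predicts the added noise for a trajectory
# tensor of shape (batch, horizon, state_dim + action_dim); `policy.log_prob`
# returns differentiable log-probabilities of actions given states.
import torch

def policy_guided_sample(denoiser, policy, betas, shape,
                         state_dim, guidance_coef=1.0):
    """Sample synthetic trajectories, nudged toward the target policy."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    traj = torch.randn(shape)                      # start from pure noise

    for t in reversed(range(len(betas))):
        # Noise prediction from the model trained on the offline (behavior) data.
        eps = denoiser(traj, torch.full((shape[0],), t))

        # Policy guidance: gradient of the target policy's log-likelihood of
        # the (noisy) actions given the (noisy) states in the trajectory.
        with torch.enable_grad():
            traj_g = traj.detach().requires_grad_(True)
            states, actions = traj_g[..., :state_dim], traj_g[..., state_dim:]
            logp = policy.log_prob(states, actions).sum()
            grad = torch.autograd.grad(logp, traj_g)[0]

        # Shift the predicted noise by the scaled guidance gradient,
        # analogous to classifier guidance.
        eps_guided = eps - guidance_coef * torch.sqrt(1 - alpha_bars[t]) * grad

        # DDPM reverse-process mean, plus noise except at the final step.
        mean = (traj - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps_guided) \
               / torch.sqrt(alphas[t])
        noise = torch.randn_like(traj) if t > 0 else torch.zeros_like(traj)
        traj = mean + torch.sqrt(betas[t]) * noise

    return traj
```

Here `guidance_coef` plays the role of the guidance coefficient discussed in the paper: roughly, larger values push sampled trajectories closer to the target policy at the cost of drifting further from the behavior data the diffusion model was trained on.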
The authors derive PGD as an approximation of a behavior-regularized target distribution that balances action likelihood under the behavior and target policies. They show that PGD generates synthetic trajectories with high probability under the target policy while incurring lower dynamics error than baseline methods, and that training a range of offline RL algorithms on PGD-generated synthetic data yields significant performance improvements across environments.
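The distributions involved can be sketched in standard trajectory notation as follows; the exact definitions and the placement of the guidance coefficient follow the paper, so treat this as a paraphrase rather than a verbatim excerpt.

```latex
% Behavior and target trajectory distributions share the dynamics and
% differ only in the action likelihoods (\mu = behavior, \pi = target):
\[
  p_\mu(\tau) = p(s_0) \prod_{t=0}^{H-1} p(s_{t+1}\mid s_t, a_t)\,\mu(a_t\mid s_t),
  \qquad
  p_\pi(\tau) = p(s_0) \prod_{t=0}^{H-1} p(s_{t+1}\mid s_t, a_t)\,\pi(a_t\mid s_t).
\]
% The behavior-regularized target distribution reweights actions by the
% target policy, with a guidance coefficient \lambda controlling the strength:
\[
  \tilde{p}(\tau) \;\propto\; p_\mu(\tau) \prod_{t=0}^{H-1} \pi(a_t\mid s_t)^{\lambda}.
\]
% PGD approximates sampling from \tilde{p} by adding
% \lambda \nabla_\tau \sum_t \log \pi(a_t\mid s_t)
% to the score of the behavior-trained diffusion model during denoising.
```

Because the regularized distribution keeps the behavior model's dynamics factor intact and only reweights actions, guidance can raise target-policy likelihood without the compounding dynamics error of autoregressive rollouts.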
The paper also discusses the theoretical underpinnings of PGD, including the behavior-regularized target distribution and the role of policy guidance. It compares PGD to other methods, such as autoregressive world models and unguided diffusion, highlighting its advantages in terms of trajectory likelihood and dynamics error. The authors conclude by outlining future directions, including automatic tuning of guidance coefficients and extending PGD to large-scale video generation models.