29 Nov 2021 | Michael Janner, Justin Fu, Marvin Zhang, Sergey Levine
This paper presents a model-based policy optimization (MBPO) algorithm that effectively uses predictive models to improve policy learning in reinforcement learning. The authors analyze the role of model usage in policy optimization both theoretically and empirically. They propose an algorithm that uses short model-generated rollouts branched from real data, which allows for more efficient learning compared to traditional model-based methods. The algorithm is shown to surpass the sample efficiency of prior model-based methods, match the asymptotic performance of the best model-free algorithms, and scale to long horizons that cause other model-based methods to fail.
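To make the branched-rollout idea concrete, here is a minimal, self-contained sketch of the data flow, not the authors' implementation: the toy 1-D environment, the least-squares dynamics model, and the placeholder policy are illustrative stand-ins (MBPO itself uses a probabilistic neural-network ensemble for the model and trains a SAC agent on the generated data).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D environment: s' = 0.9*s + a + noise, reward = -(s')^2.
# Hypothetical stand-in for a real continuous-control task.
def env_step(s, a):
    s_next = 0.9 * s + a + 0.01 * rng.normal()
    return s_next, -s_next ** 2

def policy(s):
    # Placeholder policy; MBPO trains a SAC agent here.
    return np.clip(-0.5 * s + 0.1 * rng.normal(), -1.0, 1.0)

# --- Collect real transitions ---------------------------------------------
real_buffer = []
s = 0.0
for _ in range(200):
    a = policy(s)
    s_next, r = env_step(s, a)
    real_buffer.append((s, a, r, s_next))
    s = s_next

# --- Fit a simple dynamics model on the real data -------------------------
# Linear least squares s' ~ w0*s + w1*a + b, standing in for the paper's
# probabilistic ensemble of neural networks.
X = np.array([[s, a, 1.0] for s, a, _, _ in real_buffer])
y = np.array([s_next for _, _, _, s_next in real_buffer])
w = np.linalg.lstsq(X, y, rcond=None)[0]

def model_step(s, a):
    s_next = w[0] * s + w[1] * a + w[2]
    return s_next, -s_next ** 2

# --- Short branched rollouts: start from REAL states, step the MODEL ------
k = 3  # rollout length; kept small to limit compounding model error
model_buffer = []
for _ in range(1000):
    s = real_buffer[rng.integers(len(real_buffer))][0]  # branch point
    for _ in range(k):
        a = policy(s)
        s_next, r = model_step(s, a)
        model_buffer.append((s, a, r, s_next))
        s = s_next

print(f"real transitions: {len(real_buffer)}, model transitions: {len(model_buffer)}")
# A model-free learner (SAC in the paper) would now be updated on model_buffer,
# optionally mixed with real data, before collecting more real transitions.
```

The key structural point the sketch illustrates is that model rollouts are short and always start from states actually visited in the real environment, so the policy sees a large volume of model-generated data without ever following the model for long enough for its errors to compound.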
The paper addresses the central trade-off of model-based reinforcement learning: model-generated data is cheap to produce but biased by model error. The authors show that an empirical estimate of the model's generalization error can be incorporated into the analysis to justify when, and how much, the model should be used. Guided by this analysis, they demonstrate that a simple procedure, generating short model rollouts branched from real states, retains the benefits of more complicated model-based algorithms without the usual pitfalls.
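Schematically, and in notation of my own choosing rather than the paper's exact theorem statement, the analysis yields a lower bound of the form

\[
\eta[\pi] \;\ge\; \eta^{\text{branch}}[\pi] \;-\; C(\epsilon_m, \epsilon_\pi, k),
\]

where \(\eta[\pi]\) is the true return, \(\eta^{\text{branch}}[\pi]\) is the return estimated under \(k\)-step branched model rollouts, \(\epsilon_m\) is the model's generalization error (estimated empirically on held-out real transitions), and \(\epsilon_\pi\) measures how far the policy has shifted since the data was collected. Because the gap \(C\) grows with \(k\) through compounding model error, keeping the bound tight favors short rollouts whenever \(\epsilon_m\) is nonzero, which is the formal justification for the branched-rollout scheme.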
The authors also show that MBPO does not suffer from the same pitfalls as prior model-based approaches, avoiding model exploitation and failure on long-horizon tasks. They empirically investigate different strategies for model usage, supporting the conclusion that careful use of short model-based rollouts provides the most benefit to a reinforcement learning algorithm.
The paper also situates MBPO within prior work on model-based reinforcement learning, including approaches based on Gaussian processes, time-varying linear dynamical systems, and neural network predictive models. Comparing against both model-based and model-free baselines on benchmark continuous-control tasks, the authors find that MBPO learns substantially faster than prior model-free methods while attaining comparable final performance. For example, MBPO's performance on the Ant task at 300 thousand steps matches that of SAC at 3 million steps. On Hopper and Walker2d, MBPO requires the equivalent of only 14 and 40 minutes, respectively, of simulation time if the simulator were running in real time. More importantly, MBPO learns on some of the higher-dimensional tasks, such as Ant, which pose problems for purely model-based approaches such as PETS.