Reinforcement Learning with Deep Energy-Based Policies


2017 | Tuomas Haarnoja*, Haoran Tang*, Pieter Abbeel, Sergey Levine
This paper proposes a method for learning expressive energy-based policies over continuous states and actions, which was previously feasible only in tabular domains. Applying the method to maximum entropy policy learning yields a new algorithm called soft Q-learning. Soft Q-learning expresses the optimal policy as a Boltzmann distribution over a soft Q-function and uses amortized Stein variational gradient descent (SVGD) to train a stochastic sampling network that approximates samples from this distribution. The resulting algorithm improves exploration and enables compositionality, allowing skills to be transferred between tasks, as demonstrated in simulated experiments with swimming and walking robots. The method is also connected to actor-critic algorithms, which can be viewed as performing approximate inference on the energy-based model.
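Concretely, soft Q-learning maximizes an entropy-augmented return and expresses the optimal policy as a Boltzmann distribution whose energy is given by the soft Q-function. With temperature $\alpha$ and discount $\gamma$, the core relations from the paper can be summarized as:

$$\pi^{*}_{\mathrm{MaxEnt}} = \arg\max_{\pi} \sum_{t} \mathbb{E}_{(\mathbf{s}_t,\mathbf{a}_t)\sim\rho_\pi}\!\left[ r(\mathbf{s}_t,\mathbf{a}_t) + \alpha\,\mathcal{H}\big(\pi(\cdot\mid\mathbf{s}_t)\big) \right],$$

$$\pi^{*}_{\mathrm{MaxEnt}}(\mathbf{a}_t\mid\mathbf{s}_t) \propto \exp\!\left(\tfrac{1}{\alpha}\,Q^{*}_{\mathrm{soft}}(\mathbf{s}_t,\mathbf{a}_t)\right), \qquad V_{\mathrm{soft}}(\mathbf{s}_t) = \alpha \log \int_{\mathcal{A}} \exp\!\left(\tfrac{1}{\alpha}\,Q_{\mathrm{soft}}(\mathbf{s}_t,\mathbf{a}')\right) d\mathbf{a}',$$

$$Q_{\mathrm{soft}}(\mathbf{s}_t,\mathbf{a}_t) = r(\mathbf{s}_t,\mathbf{a}_t) + \gamma\,\mathbb{E}_{\mathbf{s}_{t+1}}\!\left[ V_{\mathrm{soft}}(\mathbf{s}_{t+1}) \right].$$

The soft Bellman backup replaces the hard max over actions with a softmax (log-sum-exp), which is what makes the resulting optimal policy stochastic and multi-modal rather than deterministic.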
The paper discusses the challenges of learning stochastic policies in continuous domains and trains the soft Q-function with a tractable stochastic gradient descent procedure that relies on approximate sampling. The resulting maximum entropy policies capture complex, multi-modal behavior. The method is evaluated on a multi-goal environment and on simulated continuous control tasks, where it outperforms deterministic methods such as DDPG in exploration and task adaptation, and where the learned policies serve as good initializations for fine-tuning on new tasks. The paper closes by highlighting the potential for further research in energy-based policy learning.
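To make the approximate sampling step concrete, below is a minimal sketch of one amortized SVGD update for the stochastic sampling network, written in PyTorch-style code under several simplifying assumptions. The names (SamplerNet, q_net, rbf_kernel, the fixed kernel bandwidth and temperature alpha) are illustrative, not the authors' implementation, and q_net is assumed to accept batched (state, action) tensors.

```python
import torch
import torch.nn as nn

class SamplerNet(nn.Module):
    """Maps a state and Gaussian noise to an action, a = f_phi(s, xi)."""
    def __init__(self, state_dim, action_dim, noise_dim=8, hidden=128):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),
        )

    def forward(self, state, n_particles):
        # state: (B, state_dim) -> actions: (B, n_particles, action_dim)
        xi = torch.randn(state.shape[0], n_particles, self.noise_dim,
                         device=state.device)
        s = state.unsqueeze(1).expand(-1, n_particles, -1)
        return self.net(torch.cat([s, xi], dim=-1))

def rbf_kernel(x, y, bandwidth=1.0):
    """RBF kernel k(x_i, y_j) and its gradient w.r.t. x_i, batched."""
    diff = x.unsqueeze(2) - y.unsqueeze(1)                      # (B, M, N, d)
    k = torch.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # (B, M, N)
    grad_k = -diff / bandwidth ** 2 * k.unsqueeze(-1)           # (B, M, N, d)
    return k, grad_k

def svgd_policy_update(sampler, q_net, states, optimizer,
                       n_particles=16, alpha=1.0):
    """One amortized-SVGD step pushing samples toward exp(Q_soft / alpha)."""
    # "Fixed" particles define the SVGD direction; "updated" particles receive it.
    with torch.no_grad():
        a_fixed = sampler(states, n_particles)                  # (B, M, d)
    a_updated = sampler(states, n_particles)                    # (B, N, d)

    # Gradient of the soft Q-value w.r.t. the fixed action particles.
    a_fixed = a_fixed.requires_grad_(True)
    s_rep = states.unsqueeze(1).expand(-1, n_particles, -1)
    grad_q = torch.autograd.grad(q_net(s_rep, a_fixed).sum(), a_fixed)[0] / alpha

    # Stein variational gradient:
    # phi(a_j) = E_i[ k(a_i, a_j) * grad_Q(a_i) + grad_{a_i} k(a_i, a_j) ]
    k, grad_k = rbf_kernel(a_fixed.detach(), a_updated.detach())
    phi = (torch.einsum('bmn,bmd->bnd', k, grad_q) + grad_k.sum(dim=1)) / n_particles

    # Chain rule through the sampler: surrogate loss whose gradient is -phi * da/dphi.
    loss = -(a_updated * phi.detach()).sum(dim=(-1, -2)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In the full algorithm this policy update is interleaved with soft Q-function updates on replayed transitions, where the soft value target (the log-sum-exp above) is itself approximated by sampling actions.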