Simplifying Deep Temporal Difference Learning


5 Jul 2024 | Matteo Gallici, Mattie Fellows, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, Mario Martin
This paper investigates the stability and efficiency of Temporal Difference (TD) learning, particularly with off-policy data and nonlinear function approximation. The authors explore regularization techniques, such as LayerNorm, to stabilize TD algorithms without target networks or replay buffers. The key theoretical result shows that LayerNorm can yield provably convergent TD algorithms, even with off-policy data. Empirically, the paper shows that online, parallelized sampling in vectorized environments stabilizes training without a replay buffer.

Based on these findings, the authors propose PQN (Parallelized Q-Network), a simplified deep online Q-learning algorithm. PQN is shown to be competitive with more complex methods such as Rainbow, R2D2, QMix, and PPO-RNN, while being significantly faster and more sample-efficient. The paper also discusses the advantages of PQN over traditional DQN and distributed DQN, including ease of implementation, fast execution, low memory requirements, and compatibility with GPU-based training and RNNs. Extensive empirical evaluations in single-agent and multi-agent environments, including Atari, Craftax, and Hanabi, demonstrate PQN's effectiveness. The paper concludes by highlighting PQN's potential as a powerful and stable alternative to PPO in deep reinforcement learning.
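To make the two ingredients concrete, below is a minimal sketch, not the authors' implementation, of (a) a Q-network with LayerNorm applied at every hidden layer and (b) a single TD update computed from a batch of online transitions gathered by parallel environments, with no target network and no replay buffer. The network sizes, hyperparameters, and the stand-in batch are illustrative assumptions, not values from the paper.

```python
# Sketch only: LayerNorm Q-network + target-network-free, buffer-free TD update.
# Sizes, learning rate, and the fake batch below are illustrative assumptions.
import jax
import jax.numpy as jnp
import flax.linen as nn
import optax


class LayerNormQNetwork(nn.Module):
    """MLP Q-network with LayerNorm, the regularizer the paper analyzes."""
    num_actions: int
    hidden: int = 128

    @nn.compact
    def __call__(self, obs):
        x = obs
        for _ in range(2):
            x = nn.Dense(self.hidden)(x)
            x = nn.LayerNorm()(x)   # normalize pre-activations at every hidden layer
            x = nn.relu(x)
        return nn.Dense(self.num_actions)(x)  # one Q-value per action


def td_loss(params, net, batch, gamma=0.99):
    """Q-learning TD loss that bootstraps from the online network itself."""
    q = net.apply(params, batch["obs"])
    q_taken = jnp.take_along_axis(q, batch["action"][:, None], axis=1).squeeze(-1)
    q_next = net.apply(params, batch["next_obs"])  # no separate target network
    target = batch["reward"] + gamma * (1.0 - batch["done"]) * q_next.max(axis=-1)
    return jnp.mean((q_taken - jax.lax.stop_gradient(target)) ** 2)


if __name__ == "__main__":
    num_envs, obs_dim, num_actions = 32, 8, 4  # illustrative sizes
    net = LayerNormQNetwork(num_actions=num_actions)
    params = net.init(jax.random.PRNGKey(0), jnp.zeros((1, obs_dim)))
    tx = optax.adam(3e-4)
    opt_state = tx.init(params)

    # Stand-in for one batch of transitions collected by `num_envs` parallel
    # environments in a single online step (a real loop would step a vectorized env).
    key = jax.random.PRNGKey(1)
    batch = {
        "obs": jax.random.normal(key, (num_envs, obs_dim)),
        "action": jnp.zeros((num_envs,), dtype=jnp.int32),
        "reward": jnp.ones((num_envs,)),
        "done": jnp.zeros((num_envs,)),
        "next_obs": jax.random.normal(key, (num_envs, obs_dim)),
    }

    grads = jax.grad(lambda p: td_loss(p, net, batch))(params)
    updates, opt_state = tx.update(grads, opt_state)
    params = optax.apply_updates(params, updates)
```

In PQN proper, data collection and learning run together on the GPU alongside the vectorized environments; the sketch only illustrates the LayerNorm network and the target-network-free, buffer-free TD update summarized above.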