5 Jul 2024 | Matteo Gallici*, Mattie Fellows*, Benjamin Ellis, Bartomeu Pou, Ivan Masmitja, Jakob Nicolaus Foerster, Mario Martin
This paper presents a simplified deep temporal difference (TD) learning algorithm called PQN, which achieves stability and efficiency without target networks or replay buffers. The authors show that regularisation techniques such as LayerNorm and $ \ell_2 $ regularisation yield provably convergent TD algorithms, even with off-policy data. Empirically, they find that online, parallelised sampling enabled by vectorised environments stabilises training without a replay buffer. PQN is competitive with more complex methods such as Rainbow in Atari, R2D2 in Hanabi, QMix in Smax, and PPO-RNN in Craftax, and can be up to 50x faster than traditional DQN without sacrificing sample efficiency. The algorithm is implemented in a pure-GPU setting and is compatible with temporal networks such as RNNs. The paper also provides a theoretical analysis identifying off-policy sampling and nonlinear function approximation as the key sources of instability in TD learning, and shows that LayerNorm and $ \ell_2 $ regularisation mitigate these instabilities, enabling stable and efficient training. The authors conclude that PQN offers a viable alternative to PPO in the era of deep vectorised reinforcement learning (DVRL).
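To make the core idea concrete, below is a minimal sketch (not the authors' code) of a Q-network with LayerNorm, an $ \ell_2 $-regularised one-step TD loss, and a single update on a batch of online transitions standing in for one step from vectorised environments. The network sizes, the toy batch, the plain normalisation without learnable scale/offset, and the one-step target (the paper uses Q($\lambda$) returns) are all illustrative assumptions.

```python
# Minimal PQN-style sketch, assuming a toy setup: LayerNorm in the Q-network,
# l2-regularised TD loss, no target network, no replay buffer, batched updates
# from vectorised environments. All sizes and names here are illustrative.
import jax
import jax.numpy as jnp

OBS_DIM, N_ACTIONS, HIDDEN, N_ENVS = 4, 3, 64, 128
GAMMA, LR, L2_COEF = 0.99, 3e-4, 1e-4

def init_params(key):
    k1, k2 = jax.random.split(key)
    return {
        "w1": jax.random.normal(k1, (OBS_DIM, HIDDEN)) / jnp.sqrt(OBS_DIM),
        "b1": jnp.zeros(HIDDEN),
        "w2": jax.random.normal(k2, (HIDDEN, N_ACTIONS)) / jnp.sqrt(HIDDEN),
        "b2": jnp.zeros(N_ACTIONS),
    }

def layer_norm(x, eps=1e-5):
    # Normalise activations per sample (learnable scale/offset omitted for
    # brevity); the paper's analysis credits LayerNorm plus l2 regularisation
    # with taming off-policy and nonlinear instabilities.
    mean, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mean) / jnp.sqrt(var + eps)

def q_values(params, obs):
    h = layer_norm(obs @ params["w1"] + params["b1"])
    h = jax.nn.relu(h)
    return h @ params["w2"] + params["b2"]

def td_loss(params, batch):
    # One-step TD target computed from the *same* network: no target network,
    # no replay buffer, just the latest batch of online transitions.
    obs, actions, rewards, next_obs, dones = batch
    q = q_values(params, obs)[jnp.arange(obs.shape[0]), actions]
    next_q = jnp.max(q_values(params, next_obs), axis=-1)
    target = rewards + GAMMA * (1.0 - dones) * jax.lax.stop_gradient(next_q)
    l2 = sum(jnp.sum(p ** 2) for p in jax.tree_util.tree_leaves(params))
    return jnp.mean((q - target) ** 2) + L2_COEF * l2

@jax.jit
def update(params, batch):
    # Plain SGD step; the actual optimiser and hyperparameters are assumptions.
    loss, grads = jax.value_and_grad(td_loss)(params, batch)
    params = jax.tree_util.tree_map(lambda p, g: p - LR * g, params, grads)
    return params, loss

# Toy batch standing in for one step collected from N_ENVS vectorised envs.
key, *ks = jax.random.split(jax.random.PRNGKey(0), 6)
params = init_params(ks[0])
batch = (
    jax.random.normal(ks[1], (N_ENVS, OBS_DIM)),          # observations
    jax.random.randint(ks[2], (N_ENVS,), 0, N_ACTIONS),   # actions
    jax.random.normal(ks[3], (N_ENVS,)),                   # rewards
    jax.random.normal(ks[4], (N_ENVS, OBS_DIM)),          # next observations
    jnp.zeros(N_ENVS),                                      # done flags
)
params, loss = update(params, batch)
```

Because every piece of the update is a pure JAX function, the whole collect-and-update loop can be jit-compiled and run end-to-end on the GPU, which is where the reported speed-ups over traditional DQN come from.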