On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

August 6, 1993 | Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh
This paper provides a rigorous proof of the convergence of stochastic iterative dynamic programming algorithms, such as TD(λ) and Q-learning, by relating them to stochastic approximation theory. The authors establish a general class of convergent algorithms that includes both TD(λ) and Q-learning. The proof rests on a new convergence theorem that extends the classical Robbins-Monro theory to stochastic processes that contract in the maximum norm. The theorem is applied to show that Q-learning converges to the optimal Q-values and that TD(λ) converges to the correct predictions, under standard conditions on the step sizes and on how often states are visited. The paper also discusses asynchronous implementations of these algorithms and their application to real-time Markovian decision problems. The results clarify the convergence properties of these algorithms and inform their practical use in reinforcement learning.
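To make the convergence claim concrete, here is a minimal sketch of tabular Q-learning, one of the two algorithms the paper analyzes. The two-state MDP, the uniform exploration policy, and all names (P, R, n_states, etc.) are illustrative assumptions of this sketch, not details taken from the paper; the step sizes 1/n(s, a) are one standard choice satisfying the Robbins-Monro conditions (sum of steps diverges, sum of squared steps converges) under which the paper's theorem applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] are transition probabilities, R[s, a] are expected rewards.
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

s = 0
for t in range(200_000):
    # Uniform random behavior policy: every (s, a) pair is visited
    # infinitely often in the limit, as the convergence conditions require.
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P[s, a])

    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]  # step sizes: sum alpha = inf, sum alpha^2 < inf

    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # approaches the optimal Q-values under the stated conditions
```

Running this sketch, the table Q settles toward a fixed point of the Bellman optimality operator; the paper's contribution is proving that this noisy, sample-based iteration converges with probability 1 by casting it as a stochastic approximation that contracts in the maximum norm.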