On the Convergence of Stochastic Iterative Dynamic Programming Algorithms

August 6, 1993 | Tommi Jaakkola, Michael I. Jordan and Satinder P. Singh
This paper provides a rigorous proof of the convergence of stochastic iterative dynamic programming algorithms, such as TD(λ) and Q-learning, by relating them to stochastic approximation theory. The authors establish a general class of convergent algorithms that includes both TD(λ) and Q-learning. The proof rests on a new convergence theorem that extends the classical Robbins-Monro theory to stochastic processes that contract in the maximum norm. The theorem is applied to show that Q-learning converges to the optimal Q-values and that TD(λ) converges to the correct predictions, under standard conditions on the step sizes and on how often states are visited. The paper also discusses asynchronous implementations of these algorithms and their application to real-time Markovian decision problems. The results clarify the convergence properties of these algorithms and inform their practical use in reinforcement learning.
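To make the convergence claim concrete, here is a minimal sketch of tabular Q-learning, one of the two algorithms the paper analyzes. The two-state MDP, the uniform exploration policy, and all names (P, R, n_states, etc.) are illustrative assumptions of this sketch, not details taken from the paper; the step sizes 1/n(s, a) are one standard choice satisfying the Robbins-Monro conditions (sum of steps diverges, sum of squared steps converges) under which the paper's theorem applies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP: 2 states, 2 actions.
# P[s, a, s'] are transition probabilities, R[s, a] are expected rewards.
n_states, n_actions = 2, 2
P = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.5, 0.5], [0.2, 0.8]]])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.9  # discount factor

Q = np.zeros((n_states, n_actions))
visits = np.zeros((n_states, n_actions))

s = 0
for t in range(200_000):
    # Uniform random behavior policy: every (s, a) pair is visited
    # infinitely often in the limit, as the convergence conditions require.
    a = rng.integers(n_actions)
    s_next = rng.choice(n_states, p=P[s, a])

    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]  # step sizes: sum alpha = inf, sum alpha^2 < inf

    # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

print(Q)  # approaches the optimal Q-values under the stated conditions
```

Running this sketch, the table Q settles toward a fixed point of the Bellman optimality operator; the paper's contribution is proving that this noisy, sample-based iteration converges with probability 1 by casting it as a stochastic approximation that contracts in the maximum norm.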