An Analysis of Temporal-Difference Learning with Function Approximation

May 1997 | IEEE Transactions on Automatic Control, vol. 42, no. 5 | John N. Tsitsiklis, Member, IEEE, and Benjamin Van Roy
This paper presents a rigorous analysis of temporal-difference (TD) learning with function approximation, focusing on the convergence of the algorithm and its behavior in the context of infinite-horizon discounted Markov chains. The algorithm updates the parameters of a linear function approximator online, along a single endless trajectory of an irreducible aperiodic Markov chain. The analysis establishes that the algorithm converges with probability one, characterizes the limit of convergence, and bounds the resulting approximation error.

The study also highlights the importance of online sampling and identifies hazards associated with nonlinear function approximators. A new line of reasoning provides fresh intuition about the dynamics of TD learning: divergence may occur when updates are not based on trajectories of the Markov chain, a finding that reconciles positive and negative results in the literature, and an example illustrates that divergence is also possible when nonlinear function approximators are used.
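For concreteness, here is a minimal sketch of the algorithm in the setting just described: TD(λ) with eligibility traces, updating the parameters r of a linear approximator Φr online along a single simulated trajectory. The specific chain P, cost vector g, basis matrix Phi, and step-size schedule below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of TD(lambda) with a linear function approximator,
# updated online along a single trajectory of an irreducible aperiodic
# Markov chain.  All numerical choices here (chain, costs, features,
# step sizes) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states = 5        # small finite chain, for illustration
alpha = 0.9         # discount factor
lam = 0.7           # the lambda in TD(lambda)

# Assumed transition matrix (rows normalized) and expected one-stage costs.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
g = rng.random(n_states)

# Linear architecture: approximate the cost-to-go J(i) by phi(i)^T r.
K = 3
Phi = rng.random((n_states, K))   # row i is the feature vector phi(i)

r = np.zeros(K)     # parameter vector
z = np.zeros(K)     # eligibility trace
state = 0

for t in range(200_000):
    next_state = rng.choice(n_states, p=P[state])
    # Temporal difference: d_t = g(i_t) + alpha*phi(i_{t+1})^T r - phi(i_t)^T r
    d = g[state] + alpha * (Phi[next_state] @ r) - Phi[state] @ r
    # Eligibility trace: z_t = (alpha * lambda) * z_{t-1} + phi(i_t)
    z = alpha * lam * z + Phi[state]
    # Diminishing step sizes: sum gamma_t = inf, sum gamma_t^2 < inf.
    gamma_t = 0.5 / (1.0 + t / 1000.0)
    r += gamma_t * d * z
    state = next_state

print("limiting parameters r*:", r)
print("approximate cost-to-go Phi @ r*:", Phi @ r)
```

Because the updates follow the chain's own trajectory, states are effectively sampled according to the steady-state probabilities, which is precisely the sampling regime under which the paper guarantees convergence.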
The analysis is carried out for both finite and infinite state spaces, the latter arguably being the more natural setting for TD learning with function approximation. The paper defines the TD(λ) algorithm and introduces the TD(λ) operator T^(λ), which is used to characterize the algorithm's dynamics. It shows that T^(λ) is a contraction on the weighted space L₂(S, D), where D is induced by the chain's steady-state probabilities, and that its unique fixed point is the cost-to-go function J*. A deterministic (mean) version of the algorithm is also shown to converge to the same limit as the stochastic algorithm. The paper establishes that J* lies in L₂(S, D) and that the limit of convergence r* is the unique solution of the equation ΠT^(λ)(Φr*) = Φr*, where Π denotes projection onto the span of the basis functions. It also bounds the approximation error: the distance between Φr* and J* is at most (1 − λα)/(1 − α) times the distance between J* and its projection ΠJ*. The paper concludes that TD(λ) converges with probability one under its assumptions, with convergence guaranteed when states are sampled according to the steady-state probabilities, and that nonlinear function approximators can lead to divergence, underscoring the need for care in their design and application.
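For reference, the key objects can be written out as follows. This is a reconstruction consistent with the summary above, using the discount factor α, transition costs g(i_t, i_{t+1}), and the norm ‖·‖_D induced by the steady-state distribution; the final inequality is the paper's main approximation-error bound.

```latex
% TD(lambda) operator applied to a cost-to-go estimate J, for a chain
% with transition costs g(i_t, i_{t+1}) and discount factor alpha:
\[
  \big(T^{(\lambda)} J\big)(i)
  = (1-\lambda) \sum_{m=0}^{\infty} \lambda^{m}\,
    \mathbb{E}\Big[ \sum_{t=0}^{m} \alpha^{t} g(i_t, i_{t+1})
      + \alpha^{m+1} J(i_{m+1}) \,\Big|\, i_0 = i \Big]
\]

% The limit of convergence r^* solves the projected fixed-point equation:
\[
  \Pi T^{(\lambda)}(\Phi r^{*}) = \Phi r^{*}
\]

% Approximation-error bound: the error of the TD(lambda) limit is within
% a constant factor of the best approximation achievable in the span of
% the basis functions:
\[
  \|\Phi r^{*} - J^{*}\|_{D}
  \le \frac{1-\lambda\alpha}{1-\alpha}\, \|\Pi J^{*} - J^{*}\|_{D}
\]
```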