An Analysis of Temporal-Difference Learning with Function Approximation

May 1997 | IEEE Transactions on Automatic Control, vol. 42, no. 5 | John N. Tsitsiklis, Member, IEEE, and Benjamin Van Roy
This paper presents a rigorous analysis of temporal-difference (TD) learning with function approximation, focusing on the convergence of the algorithm and its behavior in the context of infinite-horizon discounted Markov chains. The algorithm updates the parameters of a linear function approximator online, along a single endless trajectory of an irreducible aperiodic Markov chain. The analysis establishes that the algorithm converges with probability one, characterizes the limit of convergence, and bounds the resulting approximation error.

The study also highlights the importance of online sampling and identifies hazards associated with nonlinear function approximators. A new line of reasoning provides fresh intuition about the dynamics of TD learning: divergence may occur when updates are not based on trajectories of the Markov chain, a finding that reconciles positive and negative results in the literature, and an example illustrates that divergence is also possible when nonlinear function approximators are used.
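For concreteness, here is a minimal sketch of the algorithm in the setting just described: TD(λ) with eligibility traces, updating the parameters r of a linear approximator Φr online along a single simulated trajectory. The specific chain P, cost vector g, basis matrix Phi, and step-size schedule below are illustrative assumptions, not taken from the paper.

```python
# Minimal sketch of TD(lambda) with a linear function approximator,
# updated online along a single trajectory of an irreducible aperiodic
# Markov chain.  All numerical choices here (chain, costs, features,
# step sizes) are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states = 5        # small finite chain, for illustration
alpha = 0.9         # discount factor
lam = 0.7           # the lambda in TD(lambda)

# Assumed transition matrix (rows normalized) and expected one-stage costs.
P = rng.random((n_states, n_states))
P /= P.sum(axis=1, keepdims=True)
g = rng.random(n_states)

# Linear architecture: approximate the cost-to-go J(i) by phi(i)^T r.
K = 3
Phi = rng.random((n_states, K))   # row i is the feature vector phi(i)

r = np.zeros(K)     # parameter vector
z = np.zeros(K)     # eligibility trace
state = 0

for t in range(200_000):
    next_state = rng.choice(n_states, p=P[state])
    # Temporal difference: d_t = g(i_t) + alpha*phi(i_{t+1})^T r - phi(i_t)^T r
    d = g[state] + alpha * (Phi[next_state] @ r) - Phi[state] @ r
    # Eligibility trace: z_t = (alpha * lambda) * z_{t-1} + phi(i_t)
    z = alpha * lam * z + Phi[state]
    # Diminishing step sizes: sum gamma_t = inf, sum gamma_t^2 < inf.
    gamma_t = 0.5 / (1.0 + t / 1000.0)
    r += gamma_t * d * z
    state = next_state

print("limiting parameters r*:", r)
print("approximate cost-to-go Phi @ r*:", Phi @ r)
```

Because the updates follow the chain's own trajectory, states are effectively sampled according to the steady-state probabilities, which is precisely the sampling regime under which the paper guarantees convergence.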
The analysis is carried out for both finite and infinite state spaces, the latter arguably being the more natural setting for TD learning with function approximation. The paper defines the TD(λ) algorithm and introduces the TD(λ) operator T^(λ), which is used to characterize the algorithm's dynamics. It shows that T^(λ) is a contraction on the weighted space L₂(S, D), where D is induced by the chain's steady-state probabilities, and that its unique fixed point is the cost-to-go function J*. A deterministic (mean) version of the algorithm is also shown to converge to the same limit as the stochastic algorithm. The paper establishes that J* lies in L₂(S, D) and that the limit of convergence r* is the unique solution of the equation ΠT^(λ)(Φr*) = Φr*, where Π denotes projection onto the span of the basis functions. It also bounds the approximation error: the distance between Φr* and J* is at most (1 − λα)/(1 − α) times the distance between J* and its projection ΠJ*. The paper concludes that TD(λ) converges with probability one under its assumptions, with convergence guaranteed when states are sampled according to the steady-state probabilities, and that nonlinear function approximators can lead to divergence, underscoring the need for care in their design and application.
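For reference, the key objects can be written out as follows. This is a reconstruction consistent with the summary above, using the discount factor α, transition costs g(i_t, i_{t+1}), and the norm ‖·‖_D induced by the steady-state distribution; the final inequality is the paper's main approximation-error bound.

```latex
% TD(lambda) operator applied to a cost-to-go estimate J, for a chain
% with transition costs g(i_t, i_{t+1}) and discount factor alpha:
\[
  \big(T^{(\lambda)} J\big)(i)
  = (1-\lambda) \sum_{m=0}^{\infty} \lambda^{m}\,
    \mathbb{E}\Big[ \sum_{t=0}^{m} \alpha^{t} g(i_t, i_{t+1})
      + \alpha^{m+1} J(i_{m+1}) \,\Big|\, i_0 = i \Big]
\]

% The limit of convergence r^* solves the projected fixed-point equation:
\[
  \Pi T^{(\lambda)}(\Phi r^{*}) = \Phi r^{*}
\]

% Approximation-error bound: the error of the TD(lambda) limit is within
% a constant factor of the best approximation achievable in the span of
% the basis functions:
\[
  \|\Phi r^{*} - J^{*}\|_{D}
  \le \frac{1-\lambda\alpha}{1-\alpha}\, \|\Pi J^{*} - J^{*}\|_{D}
\]
```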