Thompson Sampling for Contextual Bandits with Linear Payoffs


February 4, 2014 | Shipra Agrawal, Navin Goyal
This paper presents a theoretical analysis of Thompson Sampling (TS) for the stochastic contextual bandit problem with linear payoffs. The authors design and analyze a generalization of TS for this setting, in which contexts are provided by an adaptive adversary, and give the first theoretical guarantees for the contextual version of TS: a high-probability regret bound of $ \tilde{O}(d^{3/2}\sqrt{T}) $, or $ \tilde{O}(d\sqrt{T\log N}) $ when the number of arms $ N $ is finite. This is the best regret bound achieved by any computationally efficient algorithm for this problem, and it is within a factor of $ \sqrt{d} $ (respectively $ \sqrt{\log N} $) of the information-theoretic lower bound.

The analysis rests on a novel martingale-based technique, which shows that TS achieves high-probability, near-optimal regret for stochastic contextual bandits with linear payoff functions. The algorithm uses Gaussian prior and likelihood functions and is efficient to implement as long as a linear function can be optimized efficiently over the set of arms. These results are also the first high-probability regret bounds for TS, even for the basic multi-armed bandit (MAB) problem, and they essentially resolve the COLT 2012 open problem for contextual bandits with linear payoffs.

The paper also reviews related work on contextual bandits and compares TS with other methods such as UCB, showing that TS achieves the best regret upper bound among efficient algorithms for this problem. The authors conclude that the natural and efficient heuristic of Thompson Sampling attains theoretical bounds close to the best known, and that their techniques provide useful insight into the workings of this Bayesian algorithm.
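To make the Gaussian-prior sampling step concrete, here is a minimal sketch of Thompson Sampling for linear payoffs, assuming $ d $-dimensional context vectors, a scale parameter `v` for the sampling covariance, and hypothetical helpers `contexts` and `get_reward` supplying the per-round context matrix and observed reward; it illustrates the technique described in the summary, not the paper's exact algorithm or constants.

```python
import numpy as np

def linear_thompson_sampling(contexts, get_reward, T, v=0.5):
    """Sketch of Gaussian-prior Thompson Sampling for linear contextual bandits.

    contexts  : function t -> (N, d) array of context vectors, one row per arm
    get_reward: function (t, arm) -> observed reward for pulling `arm` at round t
    T         : time horizon
    v         : scale of the sampling covariance (tuned by the regret analysis)
    """
    d = contexts(0).shape[1]
    B = np.eye(d)      # regularized design matrix: I + sum of outer products of played contexts
    f = np.zeros(d)    # sum of reward-weighted played contexts
    for t in range(T):
        mu_hat = np.linalg.solve(B, f)                    # least-squares / posterior mean
        cov = v ** 2 * np.linalg.inv(B)                   # Gaussian sampling covariance
        mu_tilde = np.random.multivariate_normal(mu_hat, cov)  # sampled parameter
        X = contexts(t)                                   # context vectors for this round
        arm = int(np.argmax(X @ mu_tilde))                # arm maximizing sampled linear payoff
        r = get_reward(t, arm)
        B += np.outer(X[arm], X[arm])                     # rank-one design-matrix update
        f += r * X[arm]
    return np.linalg.solve(B, f)                          # final parameter estimate
```

The per-round cost is one $ d \times d $ linear solve plus an argmax over the arms, which reflects the summary's point that the algorithm is efficient whenever a linear function can be optimized over the arm set.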