Thompson Sampling for Contextual Bandits with Linear Payoffs


February 4, 2014 | Shipra Agrawal, Navin Goyal
This paper presents a theoretical analysis of Thompson Sampling (TS) for the stochastic contextual bandit problem with linear payoffs. The authors design and analyze a generalization of TS for this setting, in which contexts are provided by an adaptive adversary, and give the first theoretical guarantees for the contextual version of TS: a high-probability regret bound of $ \tilde{O}(d^{3/2}\sqrt{T}) $, or $ \tilde{O}(d\sqrt{T\log N}) $ when the number of arms $ N $ is finite. This is the best regret bound achieved by any computationally efficient algorithm for this problem, and it is within a factor of $ \sqrt{d} $ (respectively $ \sqrt{\log N} $) of the information-theoretic lower bound.

The analysis rests on a novel martingale-based technique, which shows that TS achieves high-probability, near-optimal regret for stochastic contextual bandits with linear payoff functions. The algorithm uses Gaussian prior and likelihood functions and is efficient to implement as long as a linear function can be optimized efficiently over the set of arms. These results are also the first high-probability regret bounds for TS, even for the basic multi-armed bandit (MAB) problem, and they essentially resolve the COLT 2012 open problem for contextual bandits with linear payoffs.

The paper also reviews related work on contextual bandits and compares TS with other methods such as UCB, showing that TS achieves the best regret upper bound among efficient algorithms for this problem. The authors conclude that the natural and efficient heuristic of Thompson Sampling attains theoretical bounds close to the best known, and that their techniques provide useful insight into the workings of this Bayesian algorithm.
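To make the Gaussian-prior sampling step concrete, here is a minimal sketch of Thompson Sampling for linear payoffs, assuming $ d $-dimensional context vectors, a scale parameter `v` for the sampling covariance, and hypothetical helpers `contexts` and `get_reward` supplying the per-round context matrix and observed reward; it illustrates the technique described in the summary, not the paper's exact algorithm or constants.

```python
import numpy as np

def linear_thompson_sampling(contexts, get_reward, T, v=0.5):
    """Sketch of Gaussian-prior Thompson Sampling for linear contextual bandits.

    contexts  : function t -> (N, d) array of context vectors, one row per arm
    get_reward: function (t, arm) -> observed reward for pulling `arm` at round t
    T         : time horizon
    v         : scale of the sampling covariance (tuned by the regret analysis)
    """
    d = contexts(0).shape[1]
    B = np.eye(d)      # regularized design matrix: I + sum of outer products of played contexts
    f = np.zeros(d)    # sum of reward-weighted played contexts
    for t in range(T):
        mu_hat = np.linalg.solve(B, f)                    # least-squares / posterior mean
        cov = v ** 2 * np.linalg.inv(B)                   # Gaussian sampling covariance
        mu_tilde = np.random.multivariate_normal(mu_hat, cov)  # sampled parameter
        X = contexts(t)                                   # context vectors for this round
        arm = int(np.argmax(X @ mu_tilde))                # arm maximizing sampled linear payoff
        r = get_reward(t, arm)
        B += np.outer(X[arm], X[arm])                     # rank-one design-matrix update
        f += r * X[arm]
    return np.linalg.solve(B, f)                          # final parameter estimate
```

The per-round cost is one $ d \times d $ linear solve plus an argmax over the arms, which reflects the summary's point that the algorithm is efficient whenever a linear function can be optimized over the arm set.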