Contextual Bandits with Linear Payoff Functions

2011 | Wei Chu, Lihong Li, Lev Reyzin, Robert E. Schapire
This paper studies the contextual bandit problem with linear payoff functions. In each round, a learner observes feature vectors for each of K actions, chooses one, and receives its reward; the goal is to minimize regret, the gap between the learner's cumulative reward and the best possible reward achievable by a fixed hypothesis. Here the learner competes with linear predictors on the feature vectors.

The authors analyze a variant of the LinUCB algorithm, an upper confidence bound (UCB) method for contextual bandits. Because the analysis requires statistical independence among the samples, they introduce a modified version, BaseLinUCB, and combine it with a master algorithm, SupLinUCB, that enforces this independence assumption. They prove that, with probability at least $ 1-\delta $, the regret of SupLinUCB is $ O\left(\sqrt{Td\ln^{3}(KT\ln(T)/\delta)}\right) $, i.e., $ \tilde{O}(\sqrt{Td}) $, where $ \tilde{O} $ hides logarithmic factors.

They complement this with a lower bound of $ \Omega(\sqrt{Td}) $ for the same setting, showing that the $ \tilde{O}(\sqrt{Td}) $ regret of SupLinUCB cannot be improved beyond logarithmic factors. Comparing with previous work, they note that this lower bound is also within a polylogarithmic factor of the known regret bounds for LinUCB and LinRel.
However, they note that the analysis relies on the realizability assumption (that there exists a vector $ \theta^* $ such that $ E[r_{t} \mid x_{t,a}] = x_{t,a}^{\top} \theta^* $), and that handling the agnostic case remains an open problem.
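To make the UCB idea concrete, the following is a minimal sketch of a LinUCB-style loop, not the paper's SupLinUCB construction: it keeps a ridge-regression estimate $\hat{\theta}$ of $\theta^*$ and scores each arm by predicted payoff plus a confidence bonus. The function name, the `alpha` bonus width, and the simulated bandit-feedback interface are illustrative assumptions, not from the paper (which derives the bonus width from its high-probability analysis).

```python
import numpy as np

def run_linucb(contexts, rewards, alpha=1.0):
    """Sketch of a LinUCB-style loop (hypothetical interface, not the paper's SupLinUCB).

    contexts: array of shape (T, K, d), feature vector x_{t,a} for each arm a
    rewards:  array of shape (T, K), simulated rewards; only the chosen
              arm's entry is ever read (bandit feedback)
    alpha:    width of the exploration bonus (a tunable constant here)
    """
    T, K, d = contexts.shape
    A = np.eye(d)        # ridge-regularized design matrix I + sum x x^T
    b = np.zeros(d)      # reward-weighted feature sum
    total_reward = 0.0
    for t in range(T):
        A_inv = np.linalg.inv(A)
        theta_hat = A_inv @ b            # ridge-regression estimate of theta*
        x = contexts[t]                  # (K, d) features for this round
        # UCB index: predicted payoff + confidence bonus sqrt(x^T A^{-1} x)
        bonus = np.sqrt(np.einsum('kd,de,ke->k', x, A_inv, x))
        ucb = x @ theta_hat + alpha * bonus
        a = int(np.argmax(ucb))          # play the optimistic arm
        r = rewards[t, a]                # observe only the chosen arm's reward
        A += np.outer(x[a], x[a])
        b += r * x[a]
        total_reward += r
    return total_reward, theta_hat
```

Under the realizability assumption, the rewards are generated as $x_{t,a}^{\top}\theta^*$ plus noise, and the estimate `theta_hat` concentrates around $\theta^*$ as rounds accumulate, which is what drives the $\tilde{O}(\sqrt{Td})$ regret analysis.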