Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization

JMLR 18 (2018) | Lisha Li, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, Ameet Talwalkar
Hyperparameter optimization is critical for the performance of machine learning algorithms. While recent approaches use Bayesian optimization to adaptively select configurations, this paper focuses on speeding up random search through adaptive resource allocation and early-stopping. We formulate hyperparameter optimization as a pure-exploration non-stochastic infinite-armed bandit problem, where a predefined resource like iterations, data samples, or features is allocated to randomly sampled configurations. We introduce a novel algorithm, HYPERBAND, for this framework and analyze its theoretical properties, providing several desirable guarantees. Furthermore, we compare HYPERBAND with popular Bayesian optimization methods on a suite of hyperparameter optimization problems.

We observe that HYPERBAND can provide over an order-of-magnitude speedup over our competitor set on a variety of deep-learning and kernel-based learning problems. HYPERBAND is a general-purpose technique that makes minimal assumptions, unlike prior configuration evaluation approaches. Our theoretical analysis demonstrates the ability of HYPERBAND to adapt to unknown convergence rates and to the behavior of validation losses as a function of the hyperparameters. In addition, HYPERBAND is 5× to 30× faster than popular Bayesian optimization algorithms on a variety of deep-learning and kernel-based learning problems. A theoretical contribution of this work is the introduction of the pure-exploration, infinite-armed bandit problem in the non-stochastic setting, for which HYPERBAND is one solution. When HYPERBAND is applied to the special-case stochastic setting, we show that the algorithm comes within log factors of known lower bounds in both the infinite (Carpentier and Valko, 2015) and finite K-armed bandit settings (Kaufmann et al., 2015).
HYPERBAND extends the SUCCESSIVEHALVING algorithm proposed for hyperparameter optimization by Jamieson and Talwalkar (2015) and calls it as a subroutine. The idea behind the original SUCCESSIVEHALVING algorithm follows directly from its name: uniformly allocate a budget to a set of hyperparameter configurations, evaluate the performance of all configurations, throw out the worst half, and repeat until one configuration remains. The algorithm allocates exponentially more resources to more promising configurations. However, SUCCESSIVEHALVING requires the number of configurations n as an input to the algorithm. Given some finite budget B (e.g., an hour of training time to choose a hyperparameter configuration), B/n resources are allocated on average across the configurations. However, for a fixed B, it is not clear a priori whether we should (a) consider many configurations (large n) with a small average training time; or (b) consider a small number of configurations (small n) with longer average training times. We use a simple example to better understand this tradeoff. Figure 2 shows the validation loss as a function of total
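The SUCCESSIVEHALVING loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `get_config` and `train_and_eval` are hypothetical callables standing in for sampling a random configuration and for training it with a given resource and returning its validation loss, and the simple budget split per round is an assumption for the sketch.

```python
import math

def successive_halving(get_config, train_and_eval, n, budget):
    """Sketch of SuccessiveHalving: start with n sampled configurations,
    evaluate all of them, keep the best half, double the per-configuration
    resource, and repeat until a single configuration remains."""
    configs = [get_config() for _ in range(n)]
    rounds = max(1, math.ceil(math.log2(n)))
    r = budget / (n * rounds)  # initial resource per configuration (assumed split)
    while len(configs) > 1:
        # Evaluate every surviving configuration with the current resource r.
        losses = [train_and_eval(c, r) for c in configs]
        # Rank by validation loss (lower is better) and discard the worst half.
        ranked = sorted(range(len(configs)), key=lambda i: losses[i])
        configs = [configs[i] for i in ranked[: max(1, len(configs) // 2)]]
        r *= 2  # exponentially more resources for the survivors
    return configs[0]
```

With n = 8 configurations the survivor counts are 8 → 4 → 2 → 1, so the final configuration receives exponentially more resources than those eliminated in the first round, matching the allocation behavior described above.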