Available Algorithms

This section provides an overview of the available hyperparameter optimization algorithms in Sherpa. Below is a table that discusses use cases for each algorithm. This is followed by a short comparison benchmark and the algorithms themselves.

  Use cases
Grid Search
Great for understanding the
impact of one or two parameters.
Random Search
More efficient than grid search when used with many
hyperparameters. Great for getting
a full picture of the impact of many hyperparameters
since hyperparameters are uniformly sampled from the
whole space.
GPyOpt Bayesian
Optimization
More efficient than Random search when the number of
trials is sufficiently large.
Asynchronous
Successive
Halving
Due to its early stopping, especially useful when it
would otherwise be infeasible to run a hyperparameter
optimization because of the computational cost.
Local Search
Can quickly explore “tweaks” to a model that is
already good while using less trials than Random search
or Bayesian optimization.
Population
Based
Training
Can discover schedules of training parameters and is
therefore especially good for learning rate, momentum,
batch size, etc.

For the specification of each algorithm see below.

Comparison using Fashion MNIST MLP

We constructed a simple and fast to run benchmark to run these algorithms on. This uses a fully connected neural network trained on the Fashion MNIST dataset. The tuning parameters are the learning rate, the learning rate decay, the momentum, minibatch size, and the dropout rate. We compare Random Search, GPyOpt, Population Based Training (pbt), and Successive Halving. All algorithms are allowed an equal budget corresponding to 100 models trained for 26 epochs. The plot below shows the mean taken over five runs of each algorithm. The shaded regions correspond to two standard deviations. On the y-axis is the classification accuracy of the best model found. On the x-axis are the epochs spent. We run 20 evaluations in parallel. Note that the results of GPyOpt may be impacted by the high number of parallel evaluations. The code to reproduce this benchmark can be found at sherpa/examples/parallel-examples/fashion_mnist_benchmark.

Fashion MNIST comparison results..

The currently available algorithms in Sherpa are listed below:

Bayesian Optimization with GPyOpt

class sherpa.algorithms.GPyOpt(model_type='GP', num_initial_data_points='infer', initial_data_points=[], acquisition_type='EI', max_concurrent=4, verbosity=False, max_num_trials=None)[source]

Sherpa wrapper around the GPyOpt package (https://github.com/SheffieldML/GPyOpt).

Parameters:
  • model_type (str) – The model used: - ‘GP’, standard Gaussian process. - ‘GP_MCMC’, Gaussian process with prior in the hyper-parameters. - ‘sparseGP’, sparse Gaussian process. - ‘warperdGP’, warped Gaussian process. - ‘InputWarpedGP’, input warped Gaussian process - ‘RF’, random forest (scikit-learn).
  • num_initial_data_points (int) – Number of data points to collect before fitting model. Needs to be greater/equal to the number of hyper- parameters that are being optimized. Using default ‘infer’ corres- ponds to number of hyperparameters + 1 or 0 if results are not empty.
  • initial_data_points (list[dict] or pandas.Dataframe) – Specifies initial data points. If len(initial_data_points)<num_initial_data_points then the rest is randomly sampled. Use this option to provide hyperparameter configurations that are known to be good.
  • acquisition_type (str) – Type of acquisition function to use. - ‘EI’, expected improvement. - ‘EI_MCMC’, integrated expected improvement (requires GP_MCMC model). - ‘MPI’, maximum probability of improvement. - ‘MPI_MCMC’, maximum probability of improvement (requires GP_MCMC model). - ‘LCB’, GP-Lower confidence bound. - ‘LCB_MCMC’, integrated GP-Lower confidence bound (requires GP_MCMC model).
  • max_concurrent (int) – The number of concurrent trials. This generates a batch of max_concurrent trials from GPyOpt to evaluate. If a new observation becomes available, the model is re-evaluated and a new batch is created regardless of whether the previous batch was used up. The used method is local penalization.
  • verbosity (bool) – Print models and other options during the optimization.
  • max_num_trials (int) – maximum number of trials to run for.

Asynchronous Successive Halving aka Hyperband

class sherpa.algorithms.SuccessiveHalving(r=1, R=9, eta=3, s=0, max_finished_configs=50)[source]

Asynchronous Successive Halving as described in:

@article{li2018massively, title={Massively parallel hyperparameter tuning}, author={Li, Liam and Jamieson, Kevin and Rostamizadeh, Afshin and Gonina, Ekaterina and Hardt, Moritz and Recht, Benjamin and Talwalkar, Ameet}, journal={arXiv preprint arXiv:1810.05934}, year={2018} }

Asynchronous successive halving operates based on the multi-armed bandit algorithm Successive Halving (SHA) and performs principled early stopping for random search.

Parameters:
  • r (int) – minimum resource that each configuration will be trained for.
  • R (int) – maximum resource.
  • eta (int) – elimination rate.
  • s (int) – minimum early-stopping rate.
  • max_finished_configs (int) – stop once max_finished_configs models have been trained to completion.

Population Based Training

class sherpa.algorithms.PopulationBasedTraining(num_generations, population_size=20, parameter_range={}, perturbation_factors=(0.8, 1.2))[source]

Population based training (PBT) as introduced by Jaderberg et al. 2017.

PBT trains a generation of population_size seed trials (randomly initialized) for a user specified number of iterations e.g. one epoch. The top 80% then move on unchanged into the second generation. The bottom 20% are re-sampled from the top 20% and perturbed. The second generation again trains for the same number of iterations and the same procedure is repeated to move into the third generation etc.

Parameters:
  • num_generations (int) – the number of generations to run for.
  • population_size (int) – the number of randomly intialized trials at the beginning and number of concurrent trials after that.
  • parameter_range (dict[Union[list,tuple]) – upper and lower bounds beyond which parameters cannot be perturbed.
  • perturbation_factors (tuple[float]) – the factors by which continuous parameters are multiplied upon perturbation; one is sampled randomly at a time.

Repeat

class sherpa.algorithms.Repeat(algorithm, num_times=5, wait_for_completion=False, agg=False)[source]

Takes another algorithm and repeats every hyperparameter configuration a given number of times. The wrapped algorithm will be passed the mean objective values of the repeated experiments.

Parameters:
  • algorithm (sherpa.algorithms.Algorithm) – the algorithm to produce hyperparameter configurations.
  • num_times (int) – the number of times to repeat each configuration.
  • wait_for_completion (bool) – only relevant when running in parallel with max_concurrent > 1. Means that the algorithm won’t issue the next suggestion until all repetitions are completed. This can be useful when the repeats have impact on sequential decision making in the wrapped algorithm.
  • agg (bool) – whether to aggregate repetitions before passing them to the parameter generating algorithm.

Iterate

class sherpa.algorithms.Iterate(hp_iter)[source]

Iterate over a set of fully-specified hyperparameter combinations.

Parameters:hp_iter (list) – list of fully-specified hyperparameter dicts.