Bayesian Optimization

Background

Bayesian optimization for hyperparameter tuning uses a flexible model to map from hyperparameter space to objective values. In many cases this model is a Gaussian Process (GP) or a Random Forest. The model is fitted to inputs of hyperparameter configurations and outputs of objective values, and is then used to make predictions about candidate hyperparameter configurations. Each candidate prediction is evaluated with respect to its utility via an acquisition function, which trades off exploration and exploitation. The algorithm therefore consists of fitting the model, finding the hyperparameter configuration that maximizes the acquisition function, evaluating that configuration, and repeating the process.
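
To make the loop concrete, below is a minimal sketch of this procedure using scikit-learn's GaussianProcessRegressor and an expected improvement acquisition on a made-up one-dimensional objective. It is not how SHERPA or GPyOpt implement the algorithm internally; it only illustrates the fit model, maximize acquisition, evaluate, repeat cycle described above.

import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(x):
    # Hypothetical objective to be minimized (stands in for e.g. validation error).
    return np.sin(3 * x) + 0.1 * x ** 2

def expected_improvement(candidates, gp, best_y):
    # Utility of each candidate: trades off exploration (sigma) and exploitation (mu).
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    improvement = best_y - mu
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(3, 1))   # initial random configurations
y = objective(X).ravel()              # their observed objective values

for _ in range(20):
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)   # fit the model
    candidates = rng.uniform(-2, 2, size=(500, 1))              # candidate configurations
    ei = expected_improvement(candidates, gp, y.min())
    x_next = candidates[np.argmax(ei)].reshape(1, -1)           # maximize the acquisition
    X = np.vstack([X, x_next])                                  # evaluate and repeat
    y = np.append(y, objective(x_next).ravel())

print("best objective found:", y.min())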

GPyOpt Wrapper

SHERPA implements Bayesian optimization via a wrapper for the popular Bayesian optimization library GPyOpt ( https://github.com/SheffieldML/GPyOpt/ ). The GPyOpt algorithm in SHERPA has a number of arguments that specify the Bayesian optimization in GPyOpt. The argument max_concurrent refers to the batch size that GPyOpt produces at each step and should be set equal to the number of concurrent parallel trials. The algorithm also accepts seed configurations via the initial_data_points argument. These are hyperparameter configurations that you know to be reasonably good and that can be used as starting points for the Bayesian optimization. For the full specification see below. Note that as of right now sherpa.algorithms.GPyOpt does not accept Discrete variables with the option scale='log'.
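
For example, a known-good configuration might be passed as a seed point while the batch size is matched to the number of parallel workers. The parameter names and values below are hypothetical and only serve to illustrate the initial_data_points and max_concurrent arguments:

import sherpa

# Hypothetical seed configuration; names and values are for illustration only.
seed_point = {'lrinit': 0.01, 'momentum': 0.9, 'lrdecay': 1e-4, 'dropout': 0.2}

algorithm = sherpa.algorithms.GPyOpt(initial_data_points=[seed_point],
                                     max_concurrent=4)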

class sherpa.algorithms.GPyOpt(model_type='GP', num_initial_data_points='infer', initial_data_points=[], acquisition_type='EI', max_concurrent=4, verbosity=False, max_num_trials=None)

Sherpa wrapper around the GPyOpt package (https://github.com/SheffieldML/GPyOpt).

Parameters:
  • model_type (str) – The model used:
      - 'GP', standard Gaussian process.
      - 'GP_MCMC', Gaussian process with a prior on the hyper-parameters.
      - 'sparseGP', sparse Gaussian process.
      - 'warpedGP', warped Gaussian process.
      - 'InputWarpedGP', input warped Gaussian process.
      - 'RF', random forest (scikit-learn).
  • num_initial_data_points (int) – Number of data points to collect before fitting the model. Needs to be greater than or equal to the number of hyperparameters that are being optimized. The default 'infer' corresponds to the number of hyperparameters + 1, or 0 if results are not empty.
  • initial_data_points (list[dict] or pandas.DataFrame) – Specifies initial data points. If len(initial_data_points) < num_initial_data_points, the rest is randomly sampled. Use this option to provide hyperparameter configurations that are known to be good.
  • acquisition_type (str) – Type of acquisition function to use (see the sketch after this parameter list):
      - 'EI', expected improvement.
      - 'EI_MCMC', integrated expected improvement (requires GP_MCMC model).
      - 'MPI', maximum probability of improvement.
      - 'MPI_MCMC', maximum probability of improvement (requires GP_MCMC model).
      - 'LCB', GP lower confidence bound.
      - 'LCB_MCMC', integrated GP lower confidence bound (requires GP_MCMC model).
  • max_concurrent (int) – The number of concurrent trials. This generates a batch of max_concurrent trials from GPyOpt to evaluate. If a new observation becomes available, the model is re-evaluated and a new batch is created, regardless of whether the previous batch has been used up. The batch is constructed via local penalization.
  • verbosity (bool) – Print models and other options during the optimization.
  • max_num_trials (int) – Maximum number of trials to run.
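
As referenced in the acquisition_type entry above, the MCMC acquisition variants are only valid together with the GP_MCMC model. A minimal sketch of such a pairing, using only the arguments documented above (the values chosen here are arbitrary):

import sherpa

# GP with MCMC over its hyper-parameters, paired with the integrated
# expected-improvement acquisition (EI_MCMC requires model_type='GP_MCMC').
algorithm = sherpa.algorithms.GPyOpt(model_type='GP_MCMC',
                                     acquisition_type='EI_MCMC',
                                     max_concurrent=2,
                                     max_num_trials=100)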

Example

Using GPyOpt Bayesian optimization in SHERPA is straightforward. The parameter ranges are defined as usual, for example:

parameters = [sherpa.Continuous('lrinit', [0.01, 0.1], 'log'),
              sherpa.Continuous('momentum', [0., 0.99]),
              sherpa.Continuous('lrdecay', [1e-7, 1e-2], 'log'),
              sherpa.Continuous('dropout', [0., 0.5])]

When defining the algorithm, the GPyOpt class is used:

algorithm = sherpa.algorithms.GPyOpt(max_num_trials=150)

The max_num_trials argument is optional and specifies the number of trials after which the algorithm will finish. If it is not specified, the algorithm will keep running and has to be cancelled by the user.
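
The algorithm is then combined with the parameter definitions in a sherpa.Study as described in the Guide. A brief sketch, assuming the objective (validation error) is to be minimized:

study = sherpa.Study(parameters=parameters,
                     algorithm=algorithm,
                     lower_is_better=True)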

The optimization loop itself is set up as shown in the Guide. For example:

for trial in study:
    model = init_model(trial.parameters)
    for iteration in range(num_iterations):
        training_error = model.fit(epochs=1)
        validation_error = model.evaluate()
        study.add_observation(trial=trial,
                              iteration=iteration,
                              objective=validation_error,
                              context={'training_error': training_error})
    study.finalize(trial)

A full example for MNIST can be found in examples/mnist_mlp.ipynb from the SHERPA root.