Bayesian Optimization with Gaussian Processes

February 18, 2016

DataPhilly

@AustinRochford


The problem

Minimize a function that is difficult, expensive, or time-consuming to evaluate.

The idea

  • Represent our uncertainty about the true function with a probabilistic model
  • Choose the next evaluation point that maximizes the chances of finding the function's minimum

Gaussian processes

A flexible family of probability distributions over continuous functions


$f \sim GP(m, k)$ if for any $\mathbf{x} = (x_1, \ldots, x_n)$, $(f(x_1), \ldots, f(x_n))$ has a multivariate normal distribution


$$ (f(x_1), f(x_2), \ldots, f(x_n)) \sim N\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})\right) $$


$$ m(\mathbf{x}) = \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix},\ k(\mathbf{x}, \mathbf{x}) = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix} $$
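To make the definition concrete: on a finite grid, a draw from a GP prior is just a draw from the corresponding multivariate normal. The sketch below is illustrative (not the notebook's code); it assumes a squared exponential kernel, which is introduced later in the talk, with arbitrary hyperparameter values.

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance k(x1, x2) for 1-D inputs."""
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

# On a finite grid, (f(x_1), ..., f(x_n)) is multivariate normal with
# mean zero and covariance k(x, x) under a GP(0, k) prior.
xs = np.linspace(0.0, 1.0, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(xs.size)  # small jitter for numerical stability

rng = np.random.default_rng(42)
prior_samples = rng.multivariate_normal(np.zeros(xs.size), K, size=3)
```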

If $f \sim GP(0, k)$ and $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, then $f\ |\ \mathcal{D}$ is also a Gaussian process with


$$ \begin{align*} f(\mathbf{x^*})\ |\ \mathcal{D} & \sim N(\tilde{m}(\mathbf{x^*}), \tilde{k}(\mathbf{x^*}, \mathbf{x^*})) \end{align*} $$


$$ \begin{align*} \tilde{m}(\mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} \\ \tilde{k}(\mathbf{x^*}, \mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x^*}) - k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} k(\mathbf{x}, \mathbf{x^*}) \end{align*} $$
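Translated directly into NumPy, these formulas might look like the following sketch (illustrative, not the notebook's implementation); the squared exponential kernel and the noise level are assumptions.

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance k(x1, x2) for 1-D inputs."""
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

def gp_posterior(x_star, x, y, noise=0.1, **kernel_kwargs):
    """Posterior mean and covariance of f(x*) given noisy observations (x, y)."""
    K = se_kernel(x, x, **kernel_kwargs) + noise**2 * np.eye(x.size)
    K_star = se_kernel(x_star, x, **kernel_kwargs)             # k(x*, x)
    K_star_star = se_kernel(x_star, x_star, **kernel_kwargs)   # k(x*, x*)

    alpha = np.linalg.solve(K, y)                # (k(x, x) + sigma^2 I)^{-1} y
    m_tilde = K_star @ alpha
    k_tilde = K_star_star - K_star @ np.linalg.solve(K, K_star.T)
    return m_tilde, k_tilde
```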

Covariance kernels

Squared exponential kernel

$$ \begin{align*} k(x_1, x_2) & = \sigma^2 \exp\left(-\frac{r^2}{2 \ell^2}\right) \\ r & = |x_1 - x_2| \end{align*} $$
  • Resulting functions are infinitely (mean square) differentiable
  • $\sigma$ controls the magnitude of fluctuations
  • $\ell$ controls how quickly the function fluctuates
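A quick way to see these effects (an illustrative snippet, not from the notebook): evaluate the kernel at a few separations $r$ and note how $\ell$ sets the rate at which the correlation decays while $\sigma$ only rescales the covariance.

```python
import numpy as np

def se_kernel_scalar(r, sigma=1.0, ell=1.0):
    """Squared exponential covariance as a function of the separation r = |x1 - x2|."""
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

r = np.array([0.0, 0.5, 1.0, 2.0])
# Short length scale: correlation drops off quickly, so sampled functions wiggle rapidly.
print(se_kernel_scalar(r, ell=0.5))
# Long length scale: distant points remain correlated, so sampled functions vary slowly.
print(se_kernel_scalar(r, ell=2.0))
# sigma only rescales the covariance, controlling the size of the fluctuations.
print(se_kernel_scalar(r, sigma=2.0))
```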

Matérn kernel

$$ \begin{align*} k_{\nu}(r) & = \sigma^2 \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2 \nu} r}{\ell}\right)^{\nu} K_{\nu} \left(\frac{\sqrt{2 \nu} r}{\ell}\right) \end{align*} $$

Here $K_{\nu}$ is a modified Bessel function of the second kind. For half-integer $\nu$ the kernel has a simple closed form:


$$ \begin{align*} k_{{}^1 /_2}(r) & = \sigma^2 \exp\left(-\frac{r}{\ell}\right) \\ k_{{}^3 /_2}(r) & = \sigma^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right) \\ k_{{}^5 / _2}(r) & = \sigma^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right) \end{align*} $$
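The half-integer cases above can be coded directly, without Bessel functions; for example, a sketch of the $\nu = {}^5/_2$ case (illustrative, not the notebook's code):

```python
import numpy as np

def matern52(x1, x2, sigma=1.0, ell=1.0):
    """Matern nu = 5/2 covariance for 1-D inputs."""
    r = np.abs(np.subtract.outer(x1, x2))
    scaled = np.sqrt(5.0) * r / ell
    # (1 + sqrt(5) r / ell + 5 r^2 / (3 ell^2)) * exp(-sqrt(5) r / ell)
    return sigma**2 * (1.0 + scaled + scaled**2 / 3.0) * np.exp(-scaled)
```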

Gaussian process optimization

  • Start with a function evaluation
  • Fit a Gaussian process to the observation
  • Choose the next evaluation point
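A minimal skeleton of this loop (an illustrative sketch, not the notebook's code): it assumes a `posterior` callable like the `gp_posterior` sketch above and an `acquisition` callable like the ones defined in the next section.

```python
import numpy as np

def propose_next(x_obs, y_obs, candidates, posterior, acquisition):
    """One step of GP optimization: pick the candidate maximizing the acquisition."""
    m, k = posterior(candidates, x_obs, y_obs)       # posterior mean and covariance
    sd = np.sqrt(np.clip(np.diag(k), 0.0, None))     # pointwise posterior std. dev.
    scores = acquisition(m, sd, y_obs.min())
    return candidates[np.argmax(scores)]

def gp_optimize(f, bounds, posterior, acquisition, n_iter=20, n_candidates=500, seed=0):
    """Sequentially evaluate f at the points proposed by the acquisition function."""
    rng = np.random.default_rng(seed)
    x_obs = rng.uniform(*bounds, size=1)             # start with a single evaluation
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iter):
        candidates = rng.uniform(*bounds, size=n_candidates)
        x_next = propose_next(x_obs, y_obs, candidates, posterior, acquisition)
        x_obs = np.append(x_obs, x_next)             # evaluate and record the new point
        y_obs = np.append(y_obs, f(x_next))
    return x_obs[np.argmin(y_obs)], y_obs.min()
```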

Acquisition functions

Probability of improvement

$$ \begin{align*} \textrm{PI}(x\ |\ \mathcal{D}, \theta) & = \mathbb{P}(f(x) < y_{\textrm{min}}\ |\ \mathcal{D}, \theta) = \Phi\left(\frac{y_{\textrm{min}} - \tilde{m}_{\theta}(x)}{\sqrt{\tilde{k}_{\theta}(x, x)}}\right) \end{align*} $$
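In code, PI is a single normal CDF evaluation; a minimal sketch (not the notebook's code), written to match the acquisition signature used in the loop skeleton above:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(m, sd, y_min):
    """PI(x) = Phi((y_min - m(x)) / sd(x)) for posterior mean m and std. dev. sd."""
    sd = np.maximum(sd, 1e-12)          # guard against zero posterior variance
    return norm.cdf((y_min - m) / sd)
```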

Expected improvement

$$ \begin{align*} \textrm{EI}(x\ |\ \mathcal{D}, \theta) & = \sqrt{\tilde{k}_{\theta}(x, x)} \left(z\ \Phi(z) + \varphi(z)\right),\ z = \frac{y_{\textrm{min}} - \tilde{m}_{\theta}(x)}{\sqrt{\tilde{k}_{\theta}(x, x)}} \end{align*} $$
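Likewise for EI, a minimal sketch under the same conventions (illustrative, not the notebook's code):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, sd, y_min):
    """EI(x) = sd(x) * (z * Phi(z) + phi(z)), with z = (y_min - m(x)) / sd(x)."""
    sd = np.maximum(sd, 1e-12)          # guard against zero posterior variance
    z = (y_min - m) / sd
    return sd * (z * norm.cdf(z) + norm.pdf(z))
```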

Optimization example

[Figures: successive iterations of the optimization example]

Integrated acquisition functions

So far, we have chosen the covariance hyperparameters $\sigma^2$ and $\ell$ by maximizing the marginal (log) likelihood

$$ \begin{align*} \log \mathbb{P}(\mathbf{y}\ |\ \mathbf{x}, \theta) & \propto -\frac{1}{2} \mathbf{y}^{\intercal} \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} - \frac{1}{2} \log \left|k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right| \end{align*} $$
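Concretely, this type-II maximum likelihood step might look like the following sketch (illustrative, not the notebook's code): the squared exponential kernel is assumed, the observation noise is held fixed, and `sigma` here denotes the kernel amplitude.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

def neg_log_marginal_likelihood(log_params, x, y, noise=0.1):
    """-log P(y | x, theta) up to an additive constant, with theta = (sigma, ell)."""
    sigma, ell = np.exp(log_params)                  # optimize on the log scale
    K = se_kernel(x, x, sigma=sigma, ell=ell) + noise**2 * np.eye(x.size)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * y @ np.linalg.solve(K, y) + 0.5 * logdet

# With observations x_obs, y_obs collected so far:
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(x_obs, y_obs))
# sigma_hat, ell_hat = np.exp(result.x)
```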

This can result in overfitting and underestimating the uncertainty in our acquisition function.

Bayesian approach

  • Put half-$t$ priors on the covariance hyperparameters $\sigma^2$ and $\ell$
$$ \begin{align*} \sigma^2 & \sim t_+(4) \\ \ell & \sim t_+(4) \end{align*} $$
  • Use samples from the posterior distribution $\mathbb{P}(\theta\ |\ \mathcal{D})$ to approximate the integrated expected improvement
$$ \begin{align*} \textrm{INT-EI}(x\ |\ \mathcal{D}) & = \int_{\Theta} \textrm{EI}(x\ |\ \mathcal{D}, \theta)\ \mathbb{P}(\theta\ |\ \mathcal{D})\ d\theta \end{align*} $$
  • Choose the next point that maximizes the integrated expected improvement
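Given draws of $\theta = (\sigma, \ell)$ from $\mathbb{P}(\theta\ |\ \mathcal{D})$ (the MCMC sampling step is not shown here), the integrated EI is just a Monte Carlo average of the per-sample expected improvement. A hypothetical sketch:

```python
import numpy as np

def integrated_ei(candidates, theta_samples, ei_given_theta):
    """Monte Carlo estimate of the integrated EI.

    theta_samples: draws of (sigma, ell) from P(theta | D), e.g. from MCMC.
    ei_given_theta: callable returning EI at each candidate for a fixed theta.
    """
    scores = np.mean([ei_given_theta(candidates, theta) for theta in theta_samples],
                     axis=0)
    return candidates[np.argmax(scores)], scores
```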

Bayesian optimization theory

Why is this approach appropriate for expensive functions?

  • It is derivative-free: only function evaluations are required
  • Fitting a GP is $O(n^3)$ in the number of observations $n$, so the modeling overhead only pays off when each function evaluation is more expensive still
    • Optimizing over multiple hyperparameters will require many more function evaluations

Generalizations

  • Explicitly model the cost of function evaluation (with Bayesian optimization) and combine cost and expected improvement into a single acquisition function
  • Parallel Bayesian optimization
  • Incorporation of explicit domain knowledge
