Bayesian Optimization with Gaussian Processes

February 18, 2016

DataPhilly

@AustinRochford


The problem

Minimize a function that is difficult, expensive, or time-consuming to evaluate.

The idea

  • Represent our uncertainty about the true function with a probabilistic model
  • Choose the next evaluation point that maximizes the chances of finding the function's minimum

Gaussian processes

A flexible family of probability distributions over continuous functions


$f \sim GP(m, k)$ if for any $\mathbf{x} = (x_1, \ldots, x_n)$, $(f(x_1), \ldots, f(x_n))$ has a multivariate normal distribution


$$ (f(x_1), f(x_2), \ldots, f(x_n)) \sim N\left(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x})\right) $$


$$ m(\mathbf{x}) = \begin{pmatrix} m(x_1) \\ m(x_2) \\ \vdots \\ m(x_n) \end{pmatrix},\ k(\mathbf{x}, \mathbf{x}) = \begin{pmatrix} k(x_1, x_1) & k(x_1, x_2) & \cdots & k(x_1, x_n) \\ k(x_2, x_1) & k(x_2, x_2) & \cdots & k(x_2, x_n) \\ \vdots & \vdots & \ddots & \vdots \\ k(x_n, x_1) & k(x_n, x_2) & \cdots & k(x_n, x_n) \end{pmatrix} $$
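To make the definition concrete: on a finite grid, a draw from a GP prior is just a draw from the corresponding multivariate normal. The sketch below is illustrative (not the notebook's code); it assumes a squared exponential kernel, which is introduced later in the talk, with arbitrary hyperparameter values.

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance k(x1, x2) for 1-D inputs."""
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

# On a finite grid, (f(x_1), ..., f(x_n)) is multivariate normal with
# mean zero and covariance k(x, x) under a GP(0, k) prior.
xs = np.linspace(0.0, 1.0, 100)
K = se_kernel(xs, xs) + 1e-8 * np.eye(xs.size)  # small jitter for numerical stability

rng = np.random.default_rng(42)
prior_samples = rng.multivariate_normal(np.zeros(xs.size), K, size=3)
```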

If $f \sim GP(0, k)$ and $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, where $y_i = f(x_i) + \varepsilon_i$ with $\varepsilon_i \sim N(0, \sigma^2)$, then $f\ |\ \mathcal{D}$ is also a Gaussian process with


$$ \begin{align*} f(\mathbf{x^*})\ |\ \mathcal{D} & \sim N(\tilde{m}(\mathbf{x^*}), \tilde{k}(\mathbf{x^*}, \mathbf{x^*})) \end{align*} $$


$$ \begin{align*} \tilde{m}(\mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} \\ \tilde{k}(\mathbf{x^*}, \mathbf{x^*}) & = k(\mathbf{x^*}, \mathbf{x^*}) - k(\mathbf{x^*}, \mathbf{x}) \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} k(\mathbf{x}, \mathbf{x^*}) \end{align*} $$
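Translated directly into NumPy, these formulas might look like the following sketch (illustrative, not the notebook's implementation); the squared exponential kernel and the noise level are assumptions.

```python
import numpy as np

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    """Squared exponential covariance k(x1, x2) for 1-D inputs."""
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

def gp_posterior(x_star, x, y, noise=0.1, **kernel_kwargs):
    """Posterior mean and covariance of f(x*) given noisy observations (x, y)."""
    K = se_kernel(x, x, **kernel_kwargs) + noise**2 * np.eye(x.size)
    K_star = se_kernel(x_star, x, **kernel_kwargs)             # k(x*, x)
    K_star_star = se_kernel(x_star, x_star, **kernel_kwargs)   # k(x*, x*)

    alpha = np.linalg.solve(K, y)                # (k(x, x) + sigma^2 I)^{-1} y
    m_tilde = K_star @ alpha
    k_tilde = K_star_star - K_star @ np.linalg.solve(K, K_star.T)
    return m_tilde, k_tilde
```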

Covariance kernels

Squared exponential kernel

$$ \begin{align*} k(x_1, x_2) & = \sigma^2 \exp\left(-\frac{r^2}{2 \ell^2}\right) \\ r & = |x_1 - x_2| \end{align*} $$
  • Resulting functions are infinitely (mean square) differentiable
  • $\sigma$ controls the magnitude of fluctuations
  • $\ell$ controls how quickly the function fluctuates
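A quick way to see these effects (an illustrative snippet, not from the notebook): evaluate the kernel at a few separations $r$ and note how $\ell$ sets the rate at which the correlation decays while $\sigma$ only rescales the covariance.

```python
import numpy as np

def se_kernel_scalar(r, sigma=1.0, ell=1.0):
    """Squared exponential covariance as a function of the separation r = |x1 - x2|."""
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

r = np.array([0.0, 0.5, 1.0, 2.0])
# Short length scale: correlation drops off quickly, so sampled functions wiggle rapidly.
print(se_kernel_scalar(r, ell=0.5))
# Long length scale: distant points remain correlated, so sampled functions vary slowly.
print(se_kernel_scalar(r, ell=2.0))
# sigma only rescales the covariance, controlling the size of the fluctuations.
print(se_kernel_scalar(r, sigma=2.0))
```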

Matérn kernel

$$ \begin{align*} k_{\nu}(r) & = \sigma^2 \frac{2^{1 - \nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2 \nu} r}{\ell}\right)^{\nu} K_{\nu} \left(\frac{\sqrt{2 \nu} r}{\ell}\right) \end{align*} $$

Here $K_{\nu}$ is a modified Bessel function of the second kind. For half-integer $\nu$ the kernel has a simple closed form:


$$ \begin{align*} k_{{}^1 /_2}(r) & = \sigma^2 \exp\left(-\frac{r}{\ell}\right) \\ k_{{}^3 /_2}(r) & = \sigma^2 \left(1 + \frac{\sqrt{3} r}{\ell}\right) \exp\left(-\frac{\sqrt{3} r}{\ell}\right) \\ k_{{}^5 / _2}(r) & = \sigma^2 \left(1 + \frac{\sqrt{5} r}{\ell} + \frac{5 r^2}{3 \ell^2}\right) \exp\left(-\frac{\sqrt{5} r}{\ell}\right) \end{align*} $$
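The half-integer cases above can be coded directly, without Bessel functions; for example, a sketch of the $\nu = {}^5/_2$ case (illustrative, not the notebook's code):

```python
import numpy as np

def matern52(x1, x2, sigma=1.0, ell=1.0):
    """Matern nu = 5/2 covariance for 1-D inputs."""
    r = np.abs(np.subtract.outer(x1, x2))
    scaled = np.sqrt(5.0) * r / ell
    # (1 + sqrt(5) r / ell + 5 r^2 / (3 ell^2)) * exp(-sqrt(5) r / ell)
    return sigma**2 * (1.0 + scaled + scaled**2 / 3.0) * np.exp(-scaled)
```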

Gaussian process optimization

  • Start with a function evaluation
  • Fit a Gaussian process to the observation
  • Choose the next evaluation point
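A minimal skeleton of this loop (an illustrative sketch, not the notebook's code): it assumes a `posterior` callable like the `gp_posterior` sketch above and an `acquisition` callable like the ones defined in the next section.

```python
import numpy as np

def propose_next(x_obs, y_obs, candidates, posterior, acquisition):
    """One step of GP optimization: pick the candidate maximizing the acquisition."""
    m, k = posterior(candidates, x_obs, y_obs)       # posterior mean and covariance
    sd = np.sqrt(np.clip(np.diag(k), 0.0, None))     # pointwise posterior std. dev.
    scores = acquisition(m, sd, y_obs.min())
    return candidates[np.argmax(scores)]

def gp_optimize(f, bounds, posterior, acquisition, n_iter=20, n_candidates=500, seed=0):
    """Sequentially evaluate f at the points proposed by the acquisition function."""
    rng = np.random.default_rng(seed)
    x_obs = rng.uniform(*bounds, size=1)             # start with a single evaluation
    y_obs = np.array([f(x) for x in x_obs])
    for _ in range(n_iter):
        candidates = rng.uniform(*bounds, size=n_candidates)
        x_next = propose_next(x_obs, y_obs, candidates, posterior, acquisition)
        x_obs = np.append(x_obs, x_next)             # evaluate and record the new point
        y_obs = np.append(y_obs, f(x_next))
    return x_obs[np.argmin(y_obs)], y_obs.min()
```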

Acquisition functions

Probability of improvement

$$ \begin{align*} \textrm{PI}(x\ |\ \mathcal{D}, \theta) & = \mathbb{P}(f(x) < y_{\textrm{min}}\ |\ \mathcal{D}, \theta) = \Phi\left(\frac{y_{\textrm{min}} - \tilde{m}_{\theta}(x)}{\sqrt{\tilde{k}_{\theta}(x, x)}}\right) \end{align*} $$
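In code, PI is a single normal CDF evaluation; a minimal sketch (not the notebook's code), written to match the acquisition signature used in the loop skeleton above:

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(m, sd, y_min):
    """PI(x) = Phi((y_min - m(x)) / sd(x)) for posterior mean m and std. dev. sd."""
    sd = np.maximum(sd, 1e-12)          # guard against zero posterior variance
    return norm.cdf((y_min - m) / sd)
```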

Expected improvement

$$ \begin{align*} \textrm{EI}(x\ |\ \mathcal{D}, \theta) & = \sqrt{\tilde{k}_{\theta}(x, x)} \left(z\ \Phi(z) + \varphi(z)\right),\ z = \frac{y_{\textrm{min}} - \tilde{m}_{\theta}(x)}{\sqrt{\tilde{k}_{\theta}(x, x)}} \end{align*} $$
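Likewise for EI, a minimal sketch under the same conventions (illustrative, not the notebook's code):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(m, sd, y_min):
    """EI(x) = sd(x) * (z * Phi(z) + phi(z)), with z = (y_min - m(x)) / sd(x)."""
    sd = np.maximum(sd, 1e-12)          # guard against zero posterior variance
    z = (y_min - m) / sd
    return sd * (z * norm.cdf(z) + norm.pdf(z))
```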

Optimization example

[Figures: successive iterations of the optimization example]

Integrated acquisition functions

So far, we have chosen the covariance hyperparameters $\sigma^2$ and $\ell$ by maximizing the marginal (log) likelihood

$$ \begin{align*} \log \mathbb{P}(\mathbf{y}\ |\ \mathbf{x}, \theta) & \propto -\frac{1}{2} \mathbf{y}^{\intercal} \left(k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right)^{-1} \mathbf{y} - \frac{1}{2} \log \left|k(\mathbf{x}, \mathbf{x}) + \sigma^2 I\right| \end{align*} $$
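Concretely, this type-II maximum likelihood step might look like the following sketch (illustrative, not the notebook's code): the squared exponential kernel is assumed, the observation noise is held fixed, and `sigma` here denotes the kernel amplitude.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x1, x2, sigma=1.0, ell=1.0):
    r = np.subtract.outer(x1, x2)
    return sigma**2 * np.exp(-0.5 * (r / ell)**2)

def neg_log_marginal_likelihood(log_params, x, y, noise=0.1):
    """-log P(y | x, theta) up to an additive constant, with theta = (sigma, ell)."""
    sigma, ell = np.exp(log_params)                  # optimize on the log scale
    K = se_kernel(x, x, sigma=sigma, ell=ell) + noise**2 * np.eye(x.size)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * y @ np.linalg.solve(K, y) + 0.5 * logdet

# With observations x_obs, y_obs collected so far:
# result = minimize(neg_log_marginal_likelihood, x0=np.zeros(2), args=(x_obs, y_obs))
# sigma_hat, ell_hat = np.exp(result.x)
```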

This can result in overfitting and underestimating the uncertainty in our acquisition function.

Bayesian approach

  • Put half-$t$ priors on the covariance hyperparameters $\sigma^2$ and $\ell$
$$ \begin{align*} \sigma^2 & \sim t_+(4) \\ \ell & \sim t_+(4) \end{align*} $$
  • Use samples from the posterior distribution $\mathbb{P}(\theta\ |\ \mathcal{D})$ to approximate the integrated expected improvement
$$ \begin{align*} \textrm{INT-EI}(x\ |\ \mathcal{D}) & = \int_{\Theta} \textrm{EI}(x\ |\ \mathcal{D}, \theta)\ \mathbb{P}(\theta\ |\ \mathcal{D})\ d\theta \end{align*} $$
  • Choose the next point that maximizes the integrated expected improvement
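Given draws of $\theta = (\sigma, \ell)$ from $\mathbb{P}(\theta\ |\ \mathcal{D})$ (the MCMC sampling step is not shown here), the integrated EI is just a Monte Carlo average of the per-sample expected improvement. A hypothetical sketch:

```python
import numpy as np

def integrated_ei(candidates, theta_samples, ei_given_theta):
    """Monte Carlo estimate of the integrated EI.

    theta_samples: draws of (sigma, ell) from P(theta | D), e.g. from MCMC.
    ei_given_theta: callable returning EI at each candidate for a fixed theta.
    """
    scores = np.mean([ei_given_theta(candidates, theta) for theta in theta_samples],
                     axis=0)
    return candidates[np.argmax(scores)], scores
```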

Bayesian optimization theory

Why is this approach appropriate for expensive functions?

  • It is derivative-free: only function evaluations are required
  • Fitting a GP is $O(n^3)$ in the number of observations $n$, so the modeling overhead only pays off when each function evaluation is more expensive still
    • Optimizing over multiple hyperparameters will require many more function evaluations

Generalizations

  • Explicitly model the cost of function evaluation (with Bayesian optimization) and combine cost and expected improvement into a single acquisition function
  • Parallel Bayesian optimization
  • Incorporation of explicit domain knowledge
