# Variational Inference in Python¶

## Bayesian Inference¶

The posterior distribution is equal to the joint distribution divided by the marginal distribution of the evidence.

$$\color{red}{P(\theta\ |\ \mathcal{D})} = \frac{\color{blue}{P(\mathcal{D}\ |\ \theta)\ P(\theta)}}{\color{green}{P(\mathcal{D})}} = \frac{\color{blue}{P(\mathcal{D}, \theta)}}{\color{green}{\int P(\mathcal{D}\ |\ \theta)\ P(\theta)\ d\theta}}$$

For many useful models the marginal distribution of the evidence is hard or impossible to calculate analytically.

### Modes of Bayesian Inference¶

• Conjugate models with closed-form posteriors
• Markov chain Monte Carlo algorithms
• Approximate Bayesian computation
• Distributional approximations
• Laplace approximations, INLA
• Variational inference

## Markov Chain Monte Carlo Algorithms¶

• Construct a Markov chain whose stationary distribution is the posterior distribution
• Sample from the Markov chain for a long time
• Approximate posterior quantities using the empirical distribution of the samples
In [12]:
HTML(mcmc_video)

Out[12]:

### Beta-Binomial Model¶

We observe three successes in ten trials, and want to infer the true success probability.

In [13]:
x_beta_binomial = np.array([1, 1, 1, 0, 0, 0, 0, 0, 0, 0])

$$p \sim U(0, 1)$$
In [14]:
import pymc3 as pm

with pm.Model() as beta_binomial_model:
p_beta_binomial = pm.Uniform('p', 0., 1.)

Applied interval-transform to p and added transformed p_interval_ to model.

\begin{align*} P(X_i = 1\ |\ p) & = p \end{align*}
In [15]:
with beta_binomial_model:
x_obs = pm.Bernoulli('y', p_beta_binomial,
observed=x_beta_binomial)

$$p\ \left|\ \sum_{i = 1}^{10} X_i \right. = 3 \sim \textrm{Beta}(4, 8)$$
In [17]:
fig

Out[17]:
In [19]:
with beta_binomial_model:
beta_binomial_trace_ = pm.sample(BETA_BINOMIAL_SAMPLES, random_seed=SEED)

beta_binomial_trace = beta_binomial_trace_[BETA_BINOMIAL_BURN::BETA_BINOMIAL_THIN]

Assigned NUTS to p_interval_
[-------100%-------] 50000 of 50000 in 13.1 sec. | SPS: 3829.8 | ETA: 0.0
In [21]:
fig

Out[21]:

#### Pros¶

• Asymptotically unbiased: converges to the true posterior afer many samples
• Model-agnostic algorithms
• Well-studied for more than 60 years

#### Cons¶

• Can take a long time to converge
• Can be difficult to assess convergence
• Difficult to scale

## Variational Inference¶

• Choose a class of approximating distributions
• Find the best approximation to the true posterior

Variational inference minimizes the Kullback-Leibler divergence

$$\mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = \mathbb{E}_q\left(\log\left(\frac{\color{purple}{q(\theta)}}{\color{red}{p(\theta\ |\ \mathcal{D})}}\right)\right)$$

from approximate distributons, but we can't calculate the true posterior distribution.

Minimizing the Kullback-Leibler divergence

$$\mathbb{KL}(\color{purple}{q(\theta)} \parallel \color{red}{p(\theta\ |\ \mathcal{D})}) = -(\underbrace{\mathbb{E}_q(\log \color{blue}{p(\mathcal{D}, \theta))} - \mathbb{E}_q(\color{purple}{\log q(\theta)})}_{\color{orange}{\textrm{ELBO}}}) + \log \color{green}{p(\mathcal{D})}$$

is equivalent to maximizing the Evidence Lower BOund (ELBO), which only requires calculating the joint distribution.

### Variational Inference Example¶

In [24]:
fig

Out[24]:

Approximate the true distribution using a diagonal covariance Gaussian from the class

$$\mathcal{Q} = \left\{\left.N\left(\begin{pmatrix} \mu_x \\ \mu_y \end{pmatrix}, \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2\end{pmatrix}\ \right|\ \mu_x, \mu_y \in \mathbb{R}^2, \sigma_x, \sigma_y > 0\right)\right\}$$
In [26]:
fig

Out[26]:

### Pros¶

• A principled method to trade complexity for bias
• Optimization theory is applicable
• Assesment of convergence
• Scalability

### Cons¶

• Biased estimate of the true posterior
• Better for prediction than interpretation
• Model-specific algorithms

### Mean field variational inference¶

Assume the variational distribution factors independently as $q(\theta_1, \ldots, \theta_n) = q(\theta_1) \cdots q(\theta_n)$

The variational approximation can be found by coordinate ascent

\begin{align*} q(\theta_i) & \propto \exp\left(\mathbb{E}_{q_{-i}}(\log(\mathcal{D}, \boldsymbol{\theta}))\right) \\ q_{-i}(\boldsymbol{\theta}) & = q(\theta_1) \cdots q(\theta_{i - 1})\ q(\theta_{i + 1}) \cdots q(\theta_n) \end{align*}

#### Coordinate Ascent Cons¶

• Calculations are tedious, even when possible
• Convergence is slow when the number of parameters is large

## Automating Variational Inference in Python¶

• Tensor libraries calculate ELBO gradients automatically
Python Package
Tensor Library
Variational Inference Algorithm(s)
Edward
TensorFlow
Black Box Variational Inference (BBVI)
PyMC3
Theano

### Common themes¶

• Monte Carlo estimate of the ELBO gradient
• Minibatch estimates of the joint distribution

BBVI and ADVI arise from different ways of calculating the ELBO gradient

## Variational Inference with Edward¶

### Black Box Variational Inference (BBVI)¶

• Model-agnostic
• Requires the ability to compute the joint distribution
• Required the ability to differentiate the variational distribution

### Beta-binomial model¶

In [27]:
import edward as ed
from edward.models import Bernoulli, Beta, Uniform

ed.set_seed(SEED)

# probability model
p = Uniform(a=0., b=1.)
x_edward_beta_binomial = Bernoulli(p=p)

data = {x_edward_beta_binomial: x_beta_binomial}

In [29]:
# variational approximation
q_p = Beta(a=tf_positive_variable(),
b=tf_positive_variable())
q = {p: q_p}

In [30]:
%%time
beta_binomial_inference = ed.MFVI(q, data)
beta_binomial_inference.run(n_iter=10000, n_print=None)

CPU times: user 5.83 s, sys: 880 ms, total: 6.71 s
Wall time: 4.63 s

In [32]:
fig

Out[32]:
In [33]:
from edward.models import PyMC3Model

# probability model
x_beta_binomial_obs = shared(np.zeros(1))

with pm.Model() as beta_binomial_model_untransformed:
p = pm.Uniform('p', 0., 1., transform=None)
x_beta_binomial_ = pm.Bernoulli('x', p, observed=x_beta_binomial_obs)

pymc3_data = {x_beta_binomial_obs: x_beta_binomial}
pymc3_beta_binomial_model = PyMC3Model(beta_binomial_model_untransformed)

In [34]:
%%time
# variational distribution
pymc3_q_p = Beta(a=tf_positive_variable(),
b=tf_positive_variable())
pymc3_q = {'p': pymc3_q_p}

pymc3_beta_binomial_inference = ed.MFVI(pymc3_q, pymc3_data,
pymc3_beta_binomial_model)
pymc3_beta_binomial_inference.run(n_iter=30000, n_print=None)

CPU times: user 29.3 s, sys: 5.54 s, total: 34.8 s
Wall time: 22.4 s

In [36]:
fig

Out[36]:

#### Edward-supported modeling "languages"¶

Joint Distribution
Variational Distribution
TensorFlow Distributions
TensorFlow
PyMC3
Stan
Direction density calculations
• TensorFlow
• Numpy
• Pure Python

### Congressional Ideal Points¶

In [40]:
vote_df.head()

Out[40]:
party 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
rep
0 0 0 1 0 1 1 1 0 0 0 1 NaN 1 1 1 0 1
1 0 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 NaN
2 1 NaN 1 1 NaN 1 1 0 0 0 0 1 0 1 1 0 0
3 1 0 1 1 0 NaN 1 0 0 0 0 1 0 1 0 0 1
4 1 1 1 1 0 1 1 0 0 0 0 1 NaN 1 1 1 1
In [43]:
grid.fig

Out[43]:

Idea:

• Representatives ($\color{blue}{\alpha_i}$) and bills ($\color{green}{\beta_j}$) each have an ideal point on a spectrum of conservativity
• If a representative's and bill's ideal points are the same, the representative has a 50% chance of voting for the bill
• Bills also have an ability to discriminate ($\color{red}{\gamma_j}$) between conservative and liberal representatives
• Some bills have broad bipartisan support, while some provoke votes along party lines

#### Representative ideal points¶

\begin{align*} \color{blue}{\alpha_1}, \ldots, \color{blue}{\alpha_N} & \sim N(0, 1) \end{align*}
In [44]:
def normal_log_prior(value, loc=0., scale=1.):
return tf.reduce_sum(normal.log_pdf(value, loc, scale))


#### Bill ideal points and discriminative abilities¶

\begin{align*} \mu_{\beta} & \sim N(0, 1) \\ \sigma_{\beta} & \sim U(0, 1) \\ \color{green}{\beta_1}, \ldots, \color{green}{\beta_K} & \sim N(\mu_{\beta}, \sigma_{\beta}^2) \end{align*}

\begin{align*} \mu_{\gamma} & \sim N(0, 1) \\ \sigma_{\gamma} & \sim U(0, 1) \\ \color{red}{\gamma_1}, \ldots, \color{red}{\gamma_K} & \sim N(\mu_{\gamma}, \sigma_{\gamma}^2) \end{align*}

In [45]:
def hierarchical_log_prior(value, mu, sigma):
log_hyperprior = normal_log_prior(mu) + uniform.log_pdf(sigma, 0., 1.)

return log_hyperprior + normal_log_prior(value, loc=mu, scale=sigma)


#### Observation model¶

\begin{align*} \mathbb{P}(\textrm{Representative }i \textrm{ votes for bill }j\ |\ \color{blue}{\alpha_i}, \color{green}{\beta_j}, \color{red}{\gamma_j}) & = \frac{1}{1 + \exp(-\color{red}{\gamma_j} \left(\color{blue}{\alpha_i} - \color{green}{\beta_j}\right))} \end{align*}
In [48]:
def log_like(vote, rep, bill, alpha, beta, gamma):
# alpha, beta, and gamma have one entry for representative/bill,
# index them to have one entry per representative/bill combination
alpha_long = long_from_indices(alpha, rep)
beta_long = long_from_indices(beta, bill)
gamma_long = long_from_indices(gamma, bill)

p = tf.sigmoid(gamma_long * (alpha_long - beta_long))

return tf.reduce_sum(bernoulli.logpmf(vote, p))


#### Edward model¶

In [49]:
class IdealPoint:
def log_prob(self, xs, zs):
"""
xs is a dictionary of observed data
zs is a dictionary of parameters

Calculates the model's log joint distribution
"""
# parameters
alpha, beta, gamma = zs['alpha'], zs['beta'], zs['gamma']
mu_beta, mu_gamma = zs['mu_beta'], zs['mu_gamma']
sigma_beta, sigma_gamma = zs['sigma_beta'], zs['sigma_gamma']

# observed data
vote, bill, rep = xs['vote'], xs['bill'], xs['rep']

# log joint distribution
log_prior_ = log_prior(alpha,
beta, mu_beta, sigma_beta,
gamma, mu_gamma, sigma_gamma)
log_like_ = log_like(vote, rep, bill,
alpha, beta, gamma)

return log_prior_ + log_like_

ideal_point_model = IdealPoint()


#### BBVI inference¶

In [52]:
# variational distribution
q_alpha = tf_normal((n_reps,))

q_mu_beta = tf_normal()
q_sigma_beta = tf_uniform()
q_beta = tf_normal((N_BILLS,))

q_mu_gamma = tf_normal()
q_sigma_gamma = tf_uniform()
q_gamma = tf_normal((N_BILLS,))

q = {
'alpha': q_alpha,
'beta': q_beta, 'mu_beta': q_mu_beta, 'sigma_beta': q_sigma_beta,
'gamma': q_gamma, 'mu_gamma': q_mu_gamma, 'sigma_gamma': q_sigma_gamma,
}

In [53]:
%%time
inference = ed.MFVI(q, ideal_point_data, ideal_point_model)
inference.run(n_iter=20000, n_print=None)

CPU times: user 1min 33s, sys: 30.2 s, total: 2min 4s
Wall time: 52.8 s

In [56]:
fig

Out[56]:

Fun project: Recreate the Martin-Quinn dynamic ideal point model for ideology of U.S. Supreme Court justices with Edward

## Variational Inference with PyMC3¶

### Automatic Differentiation Variational Inference (ADVI)¶

• Only applicable to differentiable probability models
• Transform constrained parameters to be unconstrained
• Approximate the posterior for unconstrained parameters with mean field Gaussian

### Beta-binomial model¶

#### Transformed distributions¶

In [58]:
fig

Out[58]:
In [59]:
%%time
with beta_binomial_model:

Iteration 0 [0%]: ELBO = -10.91
Iteration 2000 [10%]: Average ELBO = -7.76
Iteration 4000 [20%]: Average ELBO = -7.37
Iteration 6000 [30%]: Average ELBO = -7.24
Iteration 8000 [40%]: Average ELBO = -7.22
Iteration 10000 [50%]: Average ELBO = -7.17
Iteration 12000 [60%]: Average ELBO = -7.15
Iteration 14000 [70%]: Average ELBO = -7.19
Iteration 16000 [80%]: Average ELBO = -7.18
Iteration 18000 [90%]: Average ELBO = -7.19
Finished [100%]: Average ELBO = -7.2
CPU times: user 3.12 s, sys: 0 ns, total: 3.12 s
Wall time: 3.1 s


#### ADVI approxiation to transformed posterior¶

In [62]:
fig

Out[62]:

In [64]:
fig

Out[64]:

### Dependent Density Regression¶

In [68]:
fig

Out[68]:

Idea: A mixture of linear models, where the unknown number of mixture weights depend on $x$

In [70]:
fig

Out[70]:

#### Mixture weights¶

The model has infinitely many mixture components, we truncate to $K$

\begin{align*} \alpha_1, \ldots, \alpha_K & \sim N(0, 1) \\ \beta_1, \ldots, \beta_K & \sim N(0, 1) \\ \boldsymbol{\pi} \ |\ \boldsymbol{\alpha}, \boldsymbol{\beta}, x & \sim \textrm{Logit-Stick-Breaking}(\boldsymbol{\alpha} + \boldsymbol{\beta} x) \\ \end{align*}
In [73]:
K = 20

with pm.Model() as lidar_model:
alpha = pm.Normal('alpha', 0., 1., shape=K)
beta = pm.Normal('beta', 0., 1., shape=K)

v = tt.nnet.sigmoid(alpha + beta * x_lidar)
pi = pm.Deterministic('pi', stick_breaking(v))


#### Component linear models¶

\begin{align*} \gamma_1, \ldots, \gamma_K & \sim N(0, 100) \\ \delta_1, \ldots, \delta_K & \sim N(0, 100) \\ \tau_1, \ldots, \tau_K & \sim \textrm{Gamma}(1, 1) \\ Y_i\ |\ \gamma_i, \delta_i, \tau_i, x & \sim N(\gamma_i + \delta_i x, \tau_i^{-1}) \end{align*}
In [74]:
with lidar_model:
gamma = pm.Normal('gamma', 0., 100., shape=K)
delta = pm.Normal('delta', 0., 100., shape=K)
tau = pm.Gamma('tau', 1., 1., shape=K)

ys = pm.Deterministic('ys', gamma + delta * x_lidar)

Applied log-transform to tau and added transformed tau_log_ to model.


#### Observation model¶

\begin{align*} Y\ |\ \boldsymbol{\alpha}, \boldsymbol{\beta}, \boldsymbol{\gamma}, \boldsymbol{\delta}, \boldsymbol{\tau}, x & = \sum_{i = 1}^K \pi_i Y_i \\ \end{align*}
In [76]:
with lidar_model:
lidar_obs = NormalMixture('lidar_obs', pi, ys, tau,
observed=std_logratio)


In [77]:
%%time

with lidar_model:

Iteration 0 [0%]: ELBO = -1173858.35
Iteration 5000 [10%]: Average ELBO = -1472209.86
Iteration 10000 [20%]: Average ELBO = -893902.97
Iteration 15000 [30%]: Average ELBO = -665586.63
Iteration 20000 [40%]: Average ELBO = -369517.75
Iteration 25000 [50%]: Average ELBO = 12058.54
Iteration 30000 [60%]: Average ELBO = 130100.63
Iteration 35000 [70%]: Average ELBO = 186668.28
Iteration 40000 [80%]: Average ELBO = 222983.67
Iteration 45000 [90%]: Average ELBO = 253507.05
Finished [100%]: Average ELBO = 266985.1
CPU times: user 43.7 s, sys: 0 ns, total: 43.7 s
Wall time: 43.7 s

In [79]:
fig

Out[79]:

#### Posterior predictions¶

In [80]:
PPC_SAMPLES = 5000

lidar_ppc_x = np.linspace(std_range.min() - 0.05,
std_range.max() + 0.05,
100)

with lidar_model:
# sample from the variational posterior distribution

# sample from the posterior predictive distribution
x_lidar.set_value(lidar_ppc_x[:, np.newaxis])


#### Impact of truncation¶

In [82]:
fig

Out[82]:
In [84]:
fig

Out[84]:

## References¶

edwardlib.org

Examples

### PyMC3¶

http://pymc-devs.github.io/pymc3/

Examples

### Variational Inference¶

Angelino, Elaine, Matthew James Johnson, and Ryan P. Adams. "Patterns of Scalable Bayesian Inference." arXiv preprint arXiv:1602.05221 (2016).

Blei, David M., Alp Kucukelbir, and Jon D. McAuliffe. "Variational inference: A review for statisticians." arXiv preprint arXiv:1601.00670 (2016).

Kucukelbir, Alp, et al. "Automatic Differentiation Variational Inference." arXiv preprint arXiv:1603.00788 (2016).

Ranganath, Rajesh, Sean Gerrish, and David M. Blei. "Black Box Variational Inference." AISTATS. 2014.

### Models¶

Bafumi, Joseph, et al. "Practical issues in implementing and understanding Bayesian ideal point estimation." Political Analysis 13.2 (2005): 171-187.

Fox, Jean-Paul. Bayesian item response modeling: Theory and applications. Springer Science & Business Media, 2010.

Gelman, Andrew, et al. Bayesian data analysis. Vol. 2. Boca Raton, FL, USA: Chapman & Hall/CRC, 2014.

Gelman, Andrew, and Jennifer Hill. Data analysis using regression and multilevel/hierarchical models. Cambridge University Press, 2006.

Ren, Lu, et al. "Logistic stick-breaking process." Journal of Machine Learning Research 12.Jan (2011): 203-239.

## Thank You!¶

### arochford@monetate.com¶

The Jupyter notebook these slides were generated from is available here