Open Source Bayesian Inference in Python with PyMC3

FOSSCON • Philadelphia • August 26, 2017

@AustinRochford

Who am I?

@AustinRochford • GitHub • Website

austin.rochford@gmail.com • arochford@monetate.com

PyMC3 developer • Principal Data Scientist at Monetate Labs

Bayesian Inference

A motivating question:

A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person who tests positive has the disease?

Conditional Probability

Conditional probability is the probability that one event will happen, given that another event has occurred.

$$ \begin{align*} P(A\ |\ B) & = \textrm{the probability that } A \textrm{ will happen if we know that } B \textrm{ has happened} \\ & = \frac{P(A \textrm{ and } B)}{P(B)}. \end{align*} $$

Our question,

A rare disease is present in one out of one hundred thousand people. A test gives the correct diagnosis 99.9% of the time. What is the probability that a person who tests positive has the disease?

becomes

$$ \begin{align*} P(+) & = 10^{-5} \\ \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & =\ \textbf{?} \end{align*} $$

Bayes' Theorem

Bayes' theorem shows how to get from $P(B\ |\ A)$ to $P(A\ |\ B)$.

$$ \begin{align*} P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +)} \\ & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \end{align*} $$
$$ \begin{align*} P(+) & = 10^{-5} \\ P(\textrm{Test } +\ |\ +) & = 0.999 \\ P(\textrm{Test } -\ |\ -) & = 0.999 \\ \\ P(+\ |\ \textrm{Test } +) & = \frac{P(\textrm{Test } +\ |\ +) P(+)}{P(\textrm{Test } +\ |\ +) P(+) + P(\textrm{Test } +\ |\ -) P(-)} \\ & = \frac{0.999 \times 10^{-5}}{0.999 \times 10^{-5} + 0.001 \times \left(1 - 10^{-5}\right)} \end{align*} $$
In [8]:
0.999 * 1e-5 / (0.999 * 1e-5 + 0.001 * (1 - 1e-5))
Out[8]:
0.009891284975940117

Probabilistic Programming for Bayesian Inference

We know the disease is present in one in one hundred thousand people.

In [9]:
import pymc3 as pm

with pm.Model() as disease_model:
    has_disease = pm.Bernoulli('has_disease', 1e-5)

If a person has the disease, there is a 99.9% chance they test positive. If they do not have the disease, there is a 0.1% chance they test positive.

In [10]:
with disease_model:
    p_test_pos = has_disease * 0.999 + (1 - has_disease) * 0.001

This person has tested positive for the disease.

In [11]:
with disease_model:
    test_pos = pm.Bernoulli('test_pos', p_test_pos, observed=1)

What is the probability that this person has the disease?

In [12]:
with disease_model:
    disease_trace = pm.sample(draws=10000, random_seed=SEED)
Assigned BinaryGibbsMetropolis to has_disease
100%|██████████| 10500/10500 [00:04<00:00, 2427.46it/s]
In [13]:
disease_trace['has_disease']
Out[13]:
array([0, 0, 0, ..., 0, 0, 0])
In [14]:
disease_trace['has_disease'].mean()
Out[14]:
0.0104
In [15]:
0.999 * 1e-5 / (0.999 * 1e-5 + 0.001 * (1 - 1e-5))
Out[15]:
0.009891284975940117
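This agreement is no accident: pm.sample is drawing from the same generative story we specified. A direct NumPy simulation (a sketch, not part of the original notebook) recovers the same answer by simulating many people and conditioning on a positive test:

import numpy as np

n = 10 ** 7
has_disease = np.random.binomial(1, 1e-5, size=n).astype(bool)

# each person tests positive with probability 0.999 or 0.001,
# depending on their true disease status
p_test_pos = np.where(has_disease, 0.999, 0.001)
test_pos = np.random.binomial(1, p_test_pos).astype(bool)

# condition on a positive test by keeping only those simulated people
has_disease[test_pos].mean()  # ≈ 0.0099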

Monte Carlo Methods

In [16]:
import numpy as np

N = 5000

x, y = np.random.uniform(0, 1, size=(2, N))
[Figure: the 5,000 points sampled uniformly from the unit square]
In [19]:
in_circle = x**2 + y**2 <= 1
[Figure: the same points, colored by whether they fall inside the quarter circle]
In [22]:
4 * in_circle.mean()
Out[22]:
3.1272000000000002
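The estimate is within about half a percent of π. Monte Carlo error shrinks like $1/\sqrt{N}$, so each extra digit of accuracy costs roughly a hundredfold more samples; a quick sketch (assuming NumPy, not in the original slides):

import numpy as np

for n in [10 ** 3, 10 ** 5, 10 ** 7]:
    x, y = np.random.uniform(0, 1, size=(2, n))
    # fraction of points inside the quarter circle, scaled to the full circle
    print(n, 4 * (x ** 2 + y ** 2 <= 1).mean())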

History of Monte Carlo Methods

The Monty Hall Problem

Initially, we have no information about which door the prize is behind.

In [23]:
with pm.Model() as monty_model:
    prize = pm.DiscreteUniform('prize', 0, 2)

If we choose door one:

               Monty can open
Prize behind   Door 1   Door 2   Door 3
Door 1         No       Yes      Yes
Door 2         No       No       Yes
Door 3         No       Yes      No
In [24]:
from theano import tensor as tt

with monty_model:
    p_open = pm.Deterministic('p_open',
                              tt.switch(tt.eq(prize, 0),
                                        # it is behind the first door
                                        np.array([0., 0.5, 0.5]),
                              tt.switch(tt.eq(prize, 1),
                                        # it is behind the second door
                                        np.array([0., 0., 1.]),
                                        # it is behind the third door
                                        np.array([0., 1., 0.]))))

Monty opened the third door, revealing a goat.

In [25]:
with monty_model:
    opened = pm.Categorical('opened', p_open, observed=2)

Should we switch our choice of door?

In [26]:
with monty_model:
    monty_trace = pm.sample(draws=10000, random_seed=SEED)
    
monty_df = pm.trace_to_dataframe(monty_trace)
Assigned Metropolis to prize
100%|██████████| 10500/10500 [00:02<00:00, 4427.89it/s]
In [27]:
monty_df.prize.head()
Out[27]:
0    0
1    0
2    0
3    0
4    1
Name: prize, dtype: int64
[Figure: posterior distribution of the prize's location]

Yes, we should switch our choice of door.
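As a cross-check (not in the original notebook), Bayes' theorem gives the classic 2/3 answer directly:

# having chosen door 1, the probability that Monty opens door 3
# when the prize is behind door 1, 2, or 3, respectively
likelihood = [0.5, 1., 0.]
prior = [1 / 3] * 3

joint = [l * p for l, p in zip(likelihood, prior)]
posterior = [j / sum(joint) for j in joint]
posterior  # [1/3, 2/3, 0]: switching to door 2 doubles our chances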

Introduction to PyMC3

From the PyMC3 documentation:

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.

License

As of October 2016, PyMC3 is a NumFOCUS fiscally sponsored project.

Features

  • Uses Theano as a computational backend
    • Automatic differentiation, dynamic C compilation, GPU integration
  • Implements Hamiltonian Monte Carlo and No-U-Turn sampling
  • High-level GLM (R-like syntax; see the sketch after this list) and Gaussian process specification
  • General framework for variational inference
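As a taste of the GLM interface, a minimal sketch (the toy data are hypothetical, and the from_formula API may differ across PyMC3 versions):

import numpy as np
import pandas as pd
import pymc3 as pm

# hypothetical toy data: a noisy line
df = pd.DataFrame({'x': np.linspace(0., 1., 100)})
df['y'] = 1. + 2. * df.x + np.random.normal(0., 0.1, size=100)

with pm.Model() as glm_model:
    # the R-style formula builds intercept, slope, and noise priors automatically
    pm.glm.GLM.from_formula('y ~ x', df)
    glm_trace = pm.sample(1000)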

Contributing

Discourse

Case Study: Sleep Deprivation

In [31]:
sleep_df.head()
Out[31]:
   reaction  days  subject  reaction_std
0  249.5600     0      308     -0.868968
1  258.7047     1      308     -0.706623
2  250.8006     2      308     -0.846944
3  321.4398     3      308      0.407108
4  356.8519     4      308      1.035777
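The reaction_std column standardizes the raw reaction times to zero mean and unit variance, presumably (an assumption; the transformation happens in a hidden cell) via something like:

# standardize reaction times across the whole data set
sleep_df['reaction_std'] = ((sleep_df.reaction - sleep_df.reaction.mean())
                            / sleep_df.reaction.std())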
[Figure: reaction time versus days of sleep deprivation for each subject]

Each subject has a baseline reaction time; these baselines should not be too far apart across subjects.

In [35]:
with pm.Model() as sleep_model:
    μ_α = pm.Normal('μ_α', 0., 5.)
    σ_α = pm.HalfCauchy('σ_α', 5.)
    α = pm.Normal('α', μ_α, σ_α, shape=n_subjects)

Each subject's reaction time increases at a different rate with each day of sleep deprivation.

In [36]:
with sleep_model:
    μ_β = pm.Normal('μ_β', 0., 5.)
    σ_β = pm.HalfCauchy('σ_β', 5.)
    β = pm.Normal('β', μ_β, σ_β, shape=n_subjects)

The baseline reaction time and rate of increase lead to the observed reaction times.

In [37]:
with sleep_model:
    μ = pm.Deterministic('μ', α[subject_ix] + β[subject_ix] * days)
    σ = pm.HalfCauchy('σ', 5.)
    obs = pm.Normal('obs', μ, σ, observed=reaction_std)
In [38]:
N_JOBS = 3
JOB_SEEDS = [SEED + i for i in range(N_JOBS)]

with sleep_model:
    sleep_trace = pm.sample(njobs=N_JOBS, random_seed=JOB_SEEDS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 1000/1000 [00:07<00:00, 134.79it/s]

Convergence Diagnostics

In [39]:
ax = pm.energyplot(sleep_trace, legend=False)
ax.set_title("BFMI = {:.3f}".format(pm.bfmi(sleep_trace)));
In [40]:
max(np.max(gr_values) for gr_values in pm.gelman_rubin(sleep_trace).values())
Out[40]:
1.0019478091296214

A BFMI near one and Gelman-Rubin statistics this close to one indicate that the chains are mixing well and have converged to the same distribution.

Prediction

In [41]:
with sleep_model:
    pp_sleep_trace = pm.sample_ppc(sleep_trace)
100%|██████████| 500/500 [00:00<00:00, 1258.46it/s]
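The pp_df used below was assembled in a hidden cell; a sketch of a construction consistent with its shape (504 columns = 4 data columns plus 500 posterior predictive draws) is:

import pandas as pd

# pp_sleep_trace['obs'] has shape (500 samples, number of observations);
# transposing makes each posterior predictive sample a column
pp_samples = pd.DataFrame(pp_sleep_trace['obs'].T,
                          columns=['pp_{}'.format(i) for i in range(500)],
                          index=sleep_df.index)
pp_df = pd.concat([sleep_df, pp_samples], axis=1)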
In [43]:
pp_df.head()
Out[43]:
   reaction  days  subject  reaction_std      pp_0      pp_1  ...     pp_98     pp_99
0  249.5600     0      308     -0.868968 -0.221133 -1.094005  ... -0.946230 -0.838087
1  258.7047     1      308     -0.706623  0.354375 -1.192640  ...  0.243243 -0.891522
2  250.8006     2      308     -0.846944  0.487935 -0.570185  ...  0.146439 -0.193735
3  321.4398     3      308      0.407108  0.950116 -0.554748  ... -0.087981 -0.097427
4  356.8519     4      308      1.035777  0.485269  0.549079  ...  0.996899  0.671175

5 rows × 504 columns

[Figure: posterior predictive reaction times overlaid on each subject's observed data]

Hamiltonian Monte Carlo in PyMC3

[Figure: illustration of Markov chain Monte Carlo sampling]

The Curse of Dimensionality

[Figure: illustration of the curse of dimensionality]
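To illustrate the idea behind this heading (an added sketch extending the earlier π example, not the original figure): the ball inscribed in the unit hypercube occupies a vanishing fraction of its volume as the dimension grows, so naive uniform sampling almost never lands in the region that matters.

import numpy as np

# fraction of uniform samples from [0, 1]^d inside the ball of
# radius 0.5 centered at (0.5, ..., 0.5)
for d in [2, 5, 10, 20]:
    pts = np.random.uniform(0, 1, size=(100000, d))
    in_ball = ((pts - 0.5) ** 2).sum(axis=1) <= 0.25
    print(d, in_ball.mean())  # ≈ 0.785, 0.16, 0.002, 0.0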

Hamiltonian Monte Carlo

$$\frac{d}{dx} \left(x^3\right) = 3 x^2$$
In [56]:
x = tt.dscalar('x')
x.tag.test_value = 0.

y = x**3
In [57]:
from theano import pprint

pprint(tt.grad(y, x))
Out[57]:
'((fill((x ** TensorConstant{3}), TensorConstant{1.0}) * TensorConstant{3}) * (x ** (TensorConstant{3} - TensorConstant{1})))'
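Compiling and evaluating the gradient expression confirms that it computes $3x^2$ (a quick check, not in the original slides):

from theano import function

# compile the symbolic gradient into a callable function
grad_y = function([x], tt.grad(y, x))
grad_y(2.)  # array(12.0), i.e. 3 * 2 ** 2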

Case Study: 1984 Congressional Votes

In [60]:
vote_df.head()
Out[60]:
   rep_id       party  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
0       0  republican  n  y  n  y  y  y  n  n  n  y  ?  y  y  y  n  y
1       1  republican  n  y  n  y  y  y  n  n  n  n  n  y  y  y  n  ?
2       2    democrat  ?  y  y  ?  y  y  n  n  n  n  y  n  y  y  n  n
3       3    democrat  n  y  y  n  ?  y  n  n  n  n  y  n  y  n  n  y
4       4    democrat  y  y  y  n  y  y  n  n  n  n  y  ?  y  y  y  y

Question: can we separate Democrats and Republicans based on their voting records?

[Figure: representatives' votes on the sixteen bills, grouped by party]

Latent State Model

  • Representatives ($\color{blue}{\theta_i}$) and bills ($\color{green}{b_j}$) have ideal points on a liberal-conservative spectrum.
    • If the ideal points are equal, there is a 50% chance the representative will vote for the bill.
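The cells below use a hierarchical_normal helper defined in a hidden cell. A minimal sketch consistent with the parameter count computed below (the non-centered parameterization and the exact priors are assumptions):

def hierarchical_normal(name, shape, μ=0.):
    # standardized offsets, one per bill (non-centered parameterization)
    Δ = pm.Normal('Δ_{}'.format(name), 0., 1., shape=shape)
    # scale shared across bills
    σ = pm.HalfCauchy('σ_{}'.format(name), 5.)

    return pm.Deterministic(name, μ + Δ * σ)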
In [68]:
with pm.Model() as vote_model:
    # representative ideal points
    θ = pm.Normal('θ', 0., 1., shape=n_rep)
    
    # bill ideal points
    b = hierarchical_normal('b', N_BILL)
  • Bills also have an ability to discriminate ($\color{red}{a_j}$) between conservative and liberal representatives.
    • Some bills have broad bipartisan support, while some provoke votes along party lines.
In [70]:
with vote_model:
    # bill discrimination parameters
    a = hierarchical_normal('a', N_BILL)

This model has

In [71]:
n_rep + 2 * (N_BILL + 1)
Out[71]:
469

parameters

Observation Model

$$ \begin{align*} P(\textrm{Representative }i \textrm{ votes for bill }j\ |\ \color{blue}{\theta_i}, \color{green}{b_j}, \color{red}{a_j}) & = \frac{1}{1 + \exp\left(-\left(\color{red}{a_j} \cdot \color{blue}{\theta_i} - \color{green}{b_j}\right)\right)} \end{align*} $$
[Figure: the vote probability as a function of $\color{red}{a_j} \cdot \color{blue}{\theta_i} - \color{green}{b_j}$]
In [74]:
with vote_model:
    η = a[bill_id] * θ[rep_id] - b[bill_id]
    p = pm.math.sigmoid(η)
    
    obs = pm.Bernoulli('obs', p, observed=vote)

Hamiltonian Monte Carlo Inference

In [75]:
with vote_model:
    nuts_trace = pm.sample(init='adapt_diag', njobs=N_JOBS,
                           random_seed=JOB_SEEDS)
Auto-assigning NUTS sampler...
Initializing NUTS using adapt_diag...
100%|██████████| 1000/1000 [02:11<00:00, 12.76it/s]

Convergence Diagnostics

In [76]:
ax = pm.energyplot(nuts_trace, legend=False)
ax.set_title("BFMI = {:.3f}".format(pm.bfmi(nuts_trace)));
In [77]:
max(np.max(gr_values) for gr_values in pm.gelman_rubin(nuts_trace).values())
Out[77]:
1.0135857288309107

Ideal Points

[Figure: posterior ideal points of the representatives, colored by party]

Discriminative Ability of Bills

In [80]:
pm.forestplot(nuts_trace, varnames=['a'], rhat=False, xtitle="$a$",
              ylabels=["Bill {}".format(i + 1) for i in range(N_BILL)]);

Comparison to Non-Hamiltonian MCMC

In [81]:
with vote_model:
    step = pm.Metropolis()
    met_trace_ = pm.sample(10000, step, njobs=N_JOBS, random_seed=JOB_SEEDS)
    
met_trace = met_trace_[5000::5]
100%|██████████| 10500/10500 [03:40<00:00, 47.65it/s]
In [82]:
max(np.max(gr_values) for gr_values in pm.gelman_rubin(met_trace).values())
Out[82]:
39.87901257961822

A Gelman-Rubin statistic this far above one shows that, even after discarding half of each chain and thinning, the Metropolis samples are nowhere near convergence on this 469-parameter model.
[Figure: the Metropolis and NUTS posteriors compared]

Next Steps

PyMC3


Probabilistic Programming Ecosystem

System | Language              | License      | Discrete Variable Support | Automatic Differentiation / HMC | Variational Inference
PyMC3  | Python                | Apache V2    | Yes                       | Yes                             | Yes
Stan   | C++, R, Python, ...   | BSD 3-clause | No                        | Yes                             | Yes
Edward | Python, ...           | Apache V2    | Yes                       | Yes                             | Yes
BUGS   | Standalone program, R | GPL V2       | Yes                       | No                              | No
JAGS   | Standalone program, R | GPL V2       | Yes                       | No                              | No

Thank you!

@AustinRochford • austin.rochford@gmail.com • arochford@monetate.com

The Jupyter notebook used to generate these slides is available here.