Probabilistic Programming in Python with PyMC3

#ODSC East, Boston • May 5, 2017 • @AustinRochford

Probabilistic Programming

  • Write a program to simulate the data generating process
  • Language runtime/software library performs inference

Data Science — inference enables storytelling

Probabilistic Programming — storytelling enables inference

Enter: Bayes theorem

Benefits of probabilistic programming

  • Model-agnostic inference algorithms
  • Quantification of uncertainty
  • Incorporation of domain knowledge
  • Modularity

The Monty Hall Problem


Initially, we have no information about which door the prize is behind.

In [9]:
import pymc3 as pm

with pm.Model() as monty_model:
    prize = pm.DiscreteUniform('prize', 0, 2)

If we choose door one:

Monty can open
Prize behind Door 1 Door 2 Door 3
Door 1 No Yes Yes
Door 2 No No Yes
Door 2 No Yes No
In [10]:
from theano import tensor as tt

with monty_model:
    p_open = pm.Deterministic('p_open',
                              tt.switch(tt.eq(prize, 0),
                                        # it is behind the first door
                                        np.array([0., 0.5, 0.5]),
                              tt.switch(tt.eq(prize, 1),
                                        # it is behind the second door
                                        np.array([0., 0., 1.]),
                                        # it is behind the third door
                                        np.array([0., 1., 0.]))))

Monty opened the third door, revealing a goat.

In [11]:
with monty_model:
    opened = pm.Categorical('opened', p_open, observed=2)

Should we switch our choice of door?

In [12]:
with monty_model:
    monty_trace = pm.sample(1000, init=None, random_seed=SEED)
    
monty_df = pm.trace_to_dataframe(monty_trace)
Assigned Metropolis to prize
100%|██████████| 1000/1000 [00:00<00:00, 2850.75it/s]
In [13]:
monty_df.head()
Out[13]:
prize p_open__0 p_open__1 p_open__2
0 1 0.0 0.0 1.0
1 0 0.0 0.5 0.5
2 0 0.0 0.5 0.5
3 0 0.0 0.5 0.5
4 0 0.0 0.5 0.5
In [14]:
monty_df.prize.mean()
Out[14]:
0.67300000000000004
In [16]:
fig
Out[16]:

From the PyMC3 documentation:

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning which focuses on advanced Markov chain Monte Carlo and variational fitting algorithms. Its flexibility and extensibility make it applicable to a large suite of problems.

License

PyMC3 features

  • Uses Theano as a computational backend
    • Automated differentiation, dynamic C compilation, GPU integration
  • Implements Hamiltonian Monte Carlo and No-U-Turn sampling
  • High-level GLM (R-like syntax) and Gaussian process specification
  • Provides a framework for operator variational inference
In [22]:
HTML(mcmc_video)
Out[22]:

Statistical workflow

  1. Tell a story about how the data may have been generated
  2. Encode that story as a probabilistic model
  3. Perform inference
  4. Check model diagnostics
  5. Compare models
  6. Iterate (GOTO 1)

Statistical LEGOs

Case study: NBA foul calls

Last two minute report

In [24]:
nba_df.head()
Out[24]:
seconds_left review_decision away home score_away score_home disadvantaged_team committing_team
0 111.0 INC UTA POR 104.0 113.0 POR UTA
1 102.0 CNC UTA POR 104.0 113.0 UTA POR
2 98.0 CC UTA POR 104.0 113.0 POR UTA
3 64.0 CNC UTA POR 104.0 113.0 UTA POR
4 62.0 INC UTA POR 104.0 113.0 POR UTA

Story

Close basketball games usually end with intentional fouls.

In [27]:
fig
Out[27]:

Model

The foul call probability varies over time.

In [30]:
seconds_left_ = theano.shared(seconds_left)

with pm.Model() as nba_model:
    β = pm.GaussianRandomWalk('β', tau=PRIOR_PREC, init=pm.Normal.dist(0., 10.),
                              shape=n_sec)

Transform logodds to probabilities, specify observation model.

In [31]:
with nba_model:
    p = pm.math.sigmoid(β[seconds_left_])
    fc = pm.Bernoulli('fc', p, observed=foul_called)

Inference

In [34]:
nba_trace = sample_from_model(nba_model)
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 2,955.1: 100%|██████████| 50000/50000 [00:53<00:00, 937.33it/s] 
Finished [100%]: Average Loss = 2,955.1
100%|██████████| 2000/2000 [02:33<00:00, 13.07it/s]
  • Automatic sampler assignment (manually configurable)
  • Automatic initialization of HMC/NUTS using ADVI

Model diagnostics — posterior predictions

To sample from the posterior predictive distribution, we change the value of the Theano shared variable seconds_left_.

In [36]:
pp_sec = np.arange(n_sec)
seconds_left_.set_value(pp_sec)

with nba_model:
    pp_nba_trace = pm.sample_ppc(nba_trace, N_PP_SAMPLES)
100%|██████████| 1000/1000 [00:06<00:00, 147.16it/s]
In [38]:
fig
Out[38]:

Model diagnostics — sampling

The folk theorem [of statistical computing] is this: When you have computational problems, often there’s a problem with your model.

Andrew Gelman

In [39]:
pm.energyplot(nba_trace).legend(bbox_to_anchor=(0.6, 1));

Model selection

PyMC3 can calculate:

  • Deviance information criterion (DIC)
  • Bayesian predictive information criterion (BPIC)
  • Widely applicable/Watanabe-Akaike information criterion (WAIC)
In [41]:
with nba_model:
    nba_waic = pm.waic(nba_trace)
    
print(nba_waic)
WAIC_r(WAIC=5834.1378643060807, WAIC_se=79.294920797807535, p_WAIC=16.602451)

Story — take two

The trailing team has more incentive to foul than the leading team.

In [45]:
fig
Out[45]:

Model and inference

There is a base rate at which fouls are called.

In [46]:
with pm.Model() as trailing_model:
    α = pm.Normal('α', 0., 10.)

The trailing team is more likely to commit fouls as the game nears its end.

In [48]:
seconds_left_.set_value(seconds_left)
committing_trailing_ = theano.shared(1. * (score_diff < 0))

with trailing_model:
    β = pm.GaussianRandomWalk('β', tau=PRIOR_PREC, init=pm.Normal.dist(0., 10.),
                              shape=n_sec)
    η = α + committing_trailing_ * β[seconds_left_]
In [50]:
trailing_trace = sample_from_model(trailing_model, quiet=True)

Model diagnostics — posterior prediction

In [54]:
fig
Out[54]:

Model selection

In [59]:
fig
Out[59]:

Story — take three

Larger score differences incentivize the trailing team to foul sooner and more aggressively.

In [62]:
fig
Out[62]:

Model and inference

There is a base rate at which fouls are called, and the trailing team is more likely to commit fouls as the game nears its end.

In [63]:
seconds_left_.set_value(seconds_left)
committing_trailing_ = theano.shared(1. * (score_diff < 0))

with pm.Model() as diff_model:
    α = pm.Normal('α', 0., 10.)
    β = pm.GaussianRandomWalk('β', tau=PRIOR_PREC, init=pm.Normal.dist(0., 10.),
                              shape=n_sec)

The teams trailing by more points foul more agressively and sooner.

In [64]:
score_diff_ = theano.shared(score_diff)

with diff_model:
    γ = pm.Normal('γ', 0., 10.)
    θ = pm.Normal('θ', 0., 10.)
    λ = pm.math.sigmoid(γ + θ * score_diff_)
    
    η = α + committing_trailing_ * λ * β[seconds_left_]
In [66]:
diff_trace = sample_from_model(diff_model, quiet=True)

Model diagnostics

Posterior predictive inference

In [71]:
fig
Out[71]: