Last week I came across the following tweet from Paul Bürkner about a paper he coauthored on including ordinal predictors in Bayesian regression models, and I thought the approach was very clever.

Have you ever wondered how to handle ordinal predictors in your regression models? We propose a simple and intuitive method that is readily available via #brms and @mcmc_stan: https://t.co/dKg4AphvsG

— Paul Bürkner (@paulbuerkner) November 2, 2018

The code in the paper uses `brms` and Stan to illustrate these concepts. In this post I’ll be replicating some of the paper’s analysis in Python and PyMC3, mostly for my own edification.

`%matplotlib inline`

`from itertools import product`

```
import arviz as az
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
from rpy2.robjects import pandas2ri, r
import seaborn as sns
from theano import shared
```

`sns.set()`

The paper uses data from the `ordPens` R package, which we download and load into a Pandas `DataFrame`.

```
%%bash
if [[ ! -e data/ordPens ]];
then
    mkdir -p data
    wget -q -O data/ordPens_0.3-1.tar.gz https://cran.r-project.org/src/contrib/ordPens_0.3-1.tar.gz
    tar xzf data/ordPens_0.3-1.tar.gz -C data
fi
```

`!ls data/ordPens/data/`

```
ICFCoreSetCWP.RData
```

```
pandas2ri.activate()
r.load('data/ordPens/data/ICFCoreSetCWP.RData');
all_df = r['ICFCoreSetCWP']
```

`all_df.head()`

| | b1602 | b122 | b126 | b130 | b134 | b140 | b147 | b152 | b164 | b180 | … | e450 | e455 | e460 | e465 | e570 | e575 | e580 | e590 | s770 | phcs |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 0 | 1 | 2 | 1 | 1 | 0 | 0 | 2 | 0 | 0 | … | 4 | 0 | 0 | 0 | 3 | 3 | 4 | 3 | 0 | 44.33 |
2 | 3 | 2 | 2 | 3 | 3 | 2 | 3 | 3 | 3 | 1 | … | 3 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 21.09 |
3 | 0 | 1 | 2 | 1 | 1 | 0 | 1 | 2 | 0 | 0 | … | 4 | 0 | 0 | 0 | 3 | 3 | 4 | 0 | 0 | 41.74 |
4 | 0 | 0 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 0 | … | 2 | 2 | -1 | 0 | 0 | 2 | 2 | 1 | 1 | 33.96 |
5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | … | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 46.29 |

5 rows × 68 columns

The variable of interest here is `phcs`, which is a subjective physical health score. The predictors we are interested in are `d450` and `d455`.

`df = all_df[['d450', 'd455', 'phcs']]`

`df.head()`

| | d450 | d455 | phcs |
---|---|---|---|
1 | 0 | 2 | 44.33 |
2 | 3 | 3 | 21.09 |
3 | 0 | 2 | 41.74 |
4 | 3 | 2 | 33.96 |
5 | 0 | 0 | 46.29 |

These predictors are ratings on a five-point scale (0-4) of the patient’s impairment while walking (`d450`) and moving around (`d455`). For more information on this data, consult the `ordPens` documentation.

The following plots show a fairly strong monotonic relationship between `d450`, `d455`, and `phcs`.

```
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
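# x and y are passed by keyword; recent seaborn releases no longer accept them positionally
sns.stripplot(x='d450', y='phcs', data=df,
              jitter=0.1, color='C0', alpha=0.75,
              ax=d450_ax);
sns.stripplot(x='d455', y='phcs', data=df,
              jitter=0.1, color='C0', alpha=0.75,
              ax=d455_ax);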
d455_ax.set_ylabel("");
fig.tight_layout();
```

The big idea of the paper is to include monotonic effects due to these ordinal predictors as follows. A scalar \(b \sim N(0, 10^2)\) parameterizes the overall strength and direction of the relationship, and a Dirichlet vector \(\xi \sim \textrm{Dirichlet}(1, \ldots, 1)\) encodes how much of \(b\) is gained at each level. The parameters \(b\) and \(\xi\) are combined into

\[mo(i) = b \sum_{k = 0}^i \xi_k\]

which can be included as a term in a regression model. Since every \(\xi_k \geq 0\), if \(i < j\) and \(b \geq 0\) then \(mo(i) \leq mo(j)\), because

\[mo(j) - mo(i) = b \sum_{k = i + 1}^j \xi_k \geq 0,\]

and the inequality reverses when \(b < 0\). In either case the effect of this term is monotonic, as desired.

The following function constructs this distribution in PyMC3.

```
def monotonic_prior(name, n_cat):
    b = pm.Normal(f'b_{name}', 0., 10.)
    ξ = pm.Dirichlet(f'ξ_{name}', np.ones(n_cat))

    return pm.Deterministic(f'mo_{name}', b * ξ.cumsum())
```
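To make the cumulative-sum construction concrete, here is a minimal sketch in plain Python that evaluates \(mo(i)\) for fixed values of \(b\) and \(\xi\). The weights in `xi` and the helper name `monotonic_effect` are hypothetical, chosen only for illustration; in the model these quantities are random variables, not constants.

```python
from itertools import accumulate


def monotonic_effect(b, xi):
    """Return [mo(0), ..., mo(n - 1)], where mo(i) = b * (xi[0] + ... + xi[i])."""
    return [b * partial_sum for partial_sum in accumulate(xi)]


# Hypothetical simplex weights for a five-level predictor (they sum to one).
xi = [0.1, 0.4, 0.2, 0.2, 0.1]

increasing = monotonic_effect(2.0, xi)   # b > 0: the effect never decreases
decreasing = monotonic_effect(-2.0, xi)  # b < 0: the effect never increases

print([round(value, 2) for value in increasing])
print([round(value, 2) for value in decreasing])
```

The sign of `b` sets the direction of the effect, while the entries of `xi` decide how much of it is gained at each level; the last level always receives the full effect `b`.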

With this notation in hand, our model is

\[ \begin{align*} \mu_i & = \beta_0 + mo_{\textrm{d450}}(j_i) + mo_{\textrm{d455}}(k_i) \\ \beta_0 & \sim N(0, 10^2) \\ y_i & \sim N(\mu_i, \sigma^2) \\ \sigma & \sim \textrm{HalfNormal}(5^2) \end{align*} \]

where \(j_i\) and \(k_i\) are the levels of `d450` and `d455` for the \(i\)-th patient respectively and \(y_i\) is that patient’s `phcs` score.

We now express this model in PyMC3.

```
d450 = df['d450'].values
d450_cats = np.unique(d450)
d450_n_cat = d450_cats.size
d450_ = shared(d450)
d455 = df['d455'].values
d455_cats = np.unique(d455)
d455_n_cat = d455_cats.size
d455_ = shared(d455)
phcs = df['phcs'].values
```

```
with pm.Model() as model:
    β0 = pm.Normal('β0', 0., 10.)

    mo_d450 = monotonic_prior('d450', d450_n_cat)
    mo_d455 = monotonic_prior('d455', d455_n_cat)

    μ = β0 + mo_d450[d450_] + mo_d455[d455_]
    σ = pm.HalfNormal('σ', 5.)
    phcs_obs = pm.Normal('phcs', μ, σ, observed=phcs)
```

We now sample from the model’s posterior distribution.

```
CHAINS = 3
SEED = 934520 # from random.org
SAMPLE_KWARGS = {
'draws': 1000,
'tune': 1000,
'chains': CHAINS,
'random_seed': list(SEED + np.arange(CHAINS))
}
```

```
with model:
    trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 2 jobs)
NUTS: [σ, ξ_d455, b_d455, ξ_d450, b_d450, β0]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:41<00:00, 145.59draws/s]
The number of effective samples is smaller than 25% for some parameters.
```

We use `arviz` to check the performance of our sampler.

`inf_data = az.convert_to_inference_data(trace)`

The energy plot, BFMI, and Gelman-Rubin statistics show no cause for concern.

`az.plot_energy(inf_data);`

`az.gelman_rubin(inf_data).max()`

```
<xarray.Dataset>
Dimensions: ()
Data variables:
β0 float64 1.0
b_d450 float64 1.0
b_d455 float64 1.0
ξ_d450 float64 1.0
mo_d450 float64 1.0
ξ_d455 float64 1.0
mo_d455 float64 1.0
σ float64 1.0
```

We now sample from the model’s posterior predictive distribution and visualize the results.

```
pp_d450, pp_d455 = np.asarray(list(zip(*product(d450_cats, d455_cats))))
d450_.set_value(pp_d450)
d455_.set_value(pp_d455)
```

```
with model:
    pp_trace = pm.sample_posterior_predictive(trace)
```

`100%|██████████| 3000/3000 [00:07<00:00, 388.49it/s]`

```
pp_df = pd.DataFrame({
'd450': pp_d450,
'd455': pp_d455,
'pp_phcs_mean': pp_trace['phcs'].mean(axis=0)
})
```

The important feature of this encoding of ordinal predictors is that the \(\xi\) parameters allow each step between adjacent levels of the predictor to change the effect by a different amount. This is in contrast to including the levels as linear predictors, which is quite common in the literature and forces every step to contribute the same change.

`REF_CAT = 1`

```
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
(pp_df.pivot_table('pp_phcs_mean', 'd450', 'd455')
.plot(marker='o', ax=d450_ax));
d450_ax.set_xticks(d450_cats);
d450_ax.set_ylabel("Posterior predictive phcs");
(pp_df.pivot_table('pp_phcs_mean', 'd455', 'd450')
.plot(marker='o', ax=d455_ax));
d455_ax.set_xticks(d455_cats);
fig.tight_layout();
```

The following plot corresponds to Figure 3 in the original paper, and the dark lines agree with the mean in that figure quite well.

```
fig, (d450_ax, d455_ax) = plt.subplots(ncols=2, sharey=True, figsize=(16, 6))
(pp_df[pp_df['d455'] != REF_CAT]
.pivot_table('pp_phcs_mean', 'd450', 'd455')
.plot(marker='o', c='k', alpha=0.5, legend=False,
ax=d450_ax));
(pp_df[pp_df['d455'] == REF_CAT]
.plot('d450', 'pp_phcs_mean',
marker='o', c='k',
label=f"Reference category (d455 = {REF_CAT})",
ax=d450_ax));
d450_ax.set_xticks(d450_cats);
d450_ax.set_ylabel("Posterior expected phcs");
(pp_df[pp_df['d450'] != REF_CAT]
.pivot_table('pp_phcs_mean', 'd455', 'd450')
.plot(marker='o', c='k', alpha=0.5, legend=False,
ax=d455_ax));
(pp_df[pp_df['d450'] == REF_CAT]
.plot('d455', 'pp_phcs_mean',
marker='o', c='k',
label=f"Reference category (d450 = {REF_CAT})",
ax=d455_ax));
d455_ax.set_xticks(d455_cats);
fig.tight_layout();
```

For reference, we compare this model to a model that includes `d450` and `d455` as linear predictors. This model is given by

\[ \begin{align*} \mu_i & = \beta_0 + \beta_{\textrm{d450}} \cdot j_i + \beta_{\textrm{d455}} \cdot k_i \\ \beta_0, \beta_{\textrm{d450}}, \beta_{\textrm{d455}} & \sim N(0, 10^2) \\ y_i & \sim N(\mu_i, \sigma^2) \\ \sigma & \sim \textrm{HalfNormal}(5^2) \end{align*} \]

```
d450_.set_value(d450)
d455_.set_value(d455)
```

```
with pm.Model() as linear_model:
    β0 = pm.Normal('β0', 0., 10.)
    β_d450 = pm.Normal('β_d450', 0., 10.)
    β_d455 = pm.Normal('β_d455', 0., 10.)

    μ = β0 + β_d450 * d450_ + β_d455 * d455_
    σ = pm.HalfNormal('σ', 5.)
    phcs_obs = pm.Normal('phcs', μ, σ, observed=phcs)
```

```
with linear_model:
    linear_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 2 jobs)
NUTS: [σ, β_d455, β_d450, β0]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:07<00:00, 771.92draws/s]
```

As in the paper, we compare these models by stacking their posterior predictive distributions.

```
comp_df = (pm.compare({
model: trace,
linear_model: linear_trace
})
.rename({
0: "Paper",
1: "Linear"
}))
```

`comp_df`

| | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn |
---|---|---|---|---|---|---|---|
Paper | 2825.24 | 6.22 | 0 | 1 | 29.01 | 0 | 0 |
Linear | 2830.2 | 3.7 | 4.97 | 0 | 29.09 | 4.42 | 0 |

We see that the model from the paper has a lower WAIC and gets 100% of the weight, a strong sign that it is superior to the linear model.
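For readers unfamiliar with WAIC, the following is a rough sketch of one common textbook definition, computed from a matrix of pointwise log-likelihoods. It is not the code that `pm.compare` runs; library implementations may differ in details such as the variance estimator. The function names and the toy `log_lik` values are assumptions made for illustration.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    largest = max(values)
    return largest + math.log(sum(math.exp(v - largest) for v in values) / len(values))


def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def waic(log_lik):
    """log_lik[s][i] is the log-likelihood of observation i under posterior draw s.

    Returns (WAIC on the deviance scale, effective number of parameters).
    """
    n_obs = len(log_lik[0])
    per_obs = [[draw[i] for draw in log_lik] for i in range(n_obs)]
    lppd = sum(log_mean_exp(column) for column in per_obs)
    p_waic = sum(sample_variance(column) for column in per_obs)
    return -2.0 * (lppd - p_waic), p_waic


# Toy input: three posterior draws that agree exactly on two observations,
# so the penalty term is zero and WAIC is just -2 times the summed log-likelihood.
log_lik = [[-1.0, -2.0], [-1.0, -2.0], [-1.0, -2.0]]
toy_waic, toy_p = waic(log_lik)
print(toy_waic, toy_p)
```

Lower values indicate better estimated out-of-sample fit, and the penalty term grows as the posterior draws disagree more about each observation's likelihood.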

This post is available as a Jupyter notebook here.

Last April, I wrote a post that used Bayesian item-response theory models to analyze NBA foul call data. Last November, I spoke about a greatly improved version of these models at PyData NYC. This post is a write-up of the models from that talk.

Since late in the 2014-2015 season, the NBA has issued last two minute reports. These reports give the league’s assessment of the correctness of foul calls and non-calls in the last two minutes of any game where the score difference was three or fewer points at any point in the last two minutes.

These reports are notably different from play-by-play logs, in that they include information on non-calls for notable on-court interactions. This non-call information presents a unique opportunity to study the factors that impact foul calls. There is a level of subjectivity inherent in the NBA’s definition of notable on-court interactions, which we attempt to mitigate later using season-specific factors.

Russell Goldenberg of The Pudding has been scraping the PDFs that the NBA publishes and transforming them into a CSV for some time. I am grateful for his work, which has enabled this analysis.

We download the data locally to be kind to GitHub.

`%matplotlib inline`

```
import datetime
from itertools import product
import logging
import pickle
```

```
from matplotlib import pyplot as plt
from matplotlib.offsetbox import AnchoredText
from matplotlib.ticker import FuncFormatter, StrMethodFormatter
import numpy as np
import pandas as pd
import scipy as sp
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from theano import tensor as tt
```

```
pct_formatter = StrMethodFormatter('{x:.1%}')
sns.set()
blue, green, *_ = sns.color_palette()
plt.rc('figure', figsize=(8, 6))
LABELSIZE = 14
plt.rc('axes', labelsize=LABELSIZE)
plt.rc('axes', titlesize=LABELSIZE)
plt.rc('figure', titlesize=LABELSIZE)
plt.rc('legend', fontsize=LABELSIZE)
plt.rc('xtick', labelsize=LABELSIZE)
plt.rc('ytick', labelsize=LABELSIZE)
```

`SEED = 207183 # from random.org, for reproducibility`

```
# keep theano from complaining about compile locks for small models
(logging.getLogger('theano.gof.compilelock')
.setLevel(logging.CRITICAL))
```

```
%%bash
DATA_URI=https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/32f1c43dfa06c2e7652cc51ea65758007f2a1a01/output/all_games.csv
DATA_DEST=/tmp/all_games.csv
if [[ ! -e $DATA_DEST ]];
then
wget -q -O $DATA_DEST $DATA_URI
fi
```

We use only a subset of the columns in the source data set.

```
USECOLS = [
'period',
'seconds_left',
'call_type',
'committing_player',
'disadvantaged_player',
'review_decision',
'play_id',
'away',
'home',
'date',
'score_away',
'score_home',
'disadvantaged_team',
'committing_team'
]
```

```
orig_df = pd.read_csv(
'/tmp/all_games.csv',
usecols=USECOLS,
index_col='play_id',
parse_dates=['date']
)
```

The data set contains more than 16,000 plays.

`orig_df.shape[0]`

`16300`

Each row of the `DataFrame` represents a play and each column describes an attribute of the play:

- `period` is the period of the game,
- `seconds_left` is the number of seconds remaining in the game,
- `call_type` is the type of call,
- `committing_player` and `disadvantaged_player` are the names of the players involved in the play,
- `review_decision` is the opinion of the league reviewer on whether or not the play was called correctly:
  - `review_decision = "INC"` means the call was an incorrect noncall,
  - `review_decision = "CNC"` means the call was a correct noncall,
  - `review_decision = "IC"` means the call was an incorrect call, and
  - `review_decision = "CC"` means the call was a correct call,
- `away` and `home` are the abbreviations of the teams involved in the game,
- `date` is the date on which the game was played,
- `score_away` and `score_home` are the scores of the `away` and `home` team during the play, respectively, and
- `disadvantaged_team` and `committing_team` indicate how each team is involved in the play.

`orig_df.head(n=2).T`

play_id | 20150301CLEHOU-0 | 20150301CLEHOU-1 |
---|---|---|
period | Q4 | Q4 |
seconds_left | 112 | 103 |
call_type | Foul: Shooting | Foul: Shooting |
committing_player | Josh Smith | J.R. Smith |
disadvantaged_player | Kevin Love | James Harden |
review_decision | CNC | CC |
away | CLE | CLE |
home | HOU | HOU |
date | 2015-03-01 00:00:00 | 2015-03-01 00:00:00 |
score_away | 103 | 103 |
score_home | 105 | 105 |
disadvantaged_team | CLE | HOU |
committing_team | HOU | CLE |

In this post, we answer two questions:

- How does game context impact foul calls?
- Is (not) committing and/or drawing fouls a measurable player skill?

The previous post focused on the second question, and gave the first question only a cursory treatment. This post enhances our treatment of the first question, in order to control for non-skill factors influencing foul calls (namely intentional fouls). Controlling for these factors makes our estimates of player skill more realistic.

First we examine the types of calls present in the data set.

```
(orig_df['call_type']
.value_counts()
.head(n=15))
```

```
Foul: Personal 4736
Foul: Shooting 4201
Foul: Offensive 2846
Foul: Loose Ball 1316
Turnover: Traveling 779
Instant Replay: Support Ruling 607
Foul: Defense 3 Second 277
Instant Replay: Overturn Ruling 191
Foul: Personal Take 172
Turnover: 3 Second Violation 139
Turnover: 24 Second Violation 126
Turnover: 5 Second Inbound 99
Stoppage: Out-of-Bounds 96
Violation: Lane 84
Foul: Away from Play 82
Name: call_type, dtype: int64
```

The portion of `call_type` before the colon is the general category of the call. We count the occurrence of these categories below.

```
(orig_df['call_type']
.str.split(':', expand=True)
.iloc[:, 0]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Call types"
)
.set_ylabel("Frequency"));
```

We restrict our attention to foul calls, though other call types would be interesting to study in the future.

```
foul_df = orig_df[
orig_df['call_type']
.fillna("UNKNOWN")
.str.startswith("Foul")
]
```

We count the foul call types below.

```
(foul_df['call_type']
.str.split(': ', expand=True)
.iloc[:, 1]
.value_counts()
.plot(
kind='bar',
color=blue, logy=True,
title="Foul Types"
)
.set_ylabel("Frequency"));
```

We restrict our attention to the five foul types below, which generally involve two players. This subset of fouls allows us to pursue our second research question in the most direct manner.

```
FOULS = [
f"Foul: {foul_type}"
for foul_type in [
"Personal",
"Shooting",
"Offensive",
"Loose Ball",
"Away from Play"
]
]
```

There are a number of misspelled team names in the data, which we correct.

```
TEAM_MAP = {
"NKY": "NYK",
"COS": "BOS",
"SAT": "SAS",
"CHi": "CHI",
"LA)": "LAC",
"AT)": "ATL",
"ARL": "ATL"
}

def correct_team_name(col):
    def _correct_team_name(df):
        # fall back to the original abbreviation when it is not a known misspelling
        return df[col].apply(lambda team_name: TEAM_MAP.get(team_name, team_name))

    return _correct_team_name
```

We also convert each game date to an NBA season.

```
def date_to_season(date):
    if date >= datetime.datetime(2017, 10, 17):
        return '2017-2018'
    elif date >= datetime.datetime(2016, 10, 25):
        return '2016-2017'
    elif date >= datetime.datetime(2015, 10, 27):
        return '2015-2016'
    else:
        return '2014-2015'
```

We clean the data by

- restricting to plays that occurred during the last two minutes of regulation,
- imputing incorrect noncalls when `review_decision` is missing,
- correcting team names,
- converting game dates to seasons,
- restricting to the foul types discussed above,
- restricting to the plays that happened during the 2015-2016 and 2016-2017 regular seasons (those are the only full seasons in the data set as of February 2018), and
- dropping unneeded rows and columns.

```
clean_df = (foul_df.where(lambda df: df['period'] == "Q4")
                   .where(lambda df: (df['date'].between(datetime.datetime(2016, 10, 25),
                                                         datetime.datetime(2017, 4, 12))
                                      | df['date'].between(datetime.datetime(2015, 10, 27),
                                                           datetime.datetime(2016, 5, 30)))
                   )
                   .assign(
                       review_decision=lambda df: df['review_decision'].fillna("INC"),
                       committing_team=correct_team_name('committing_team'),
                       # the keyword matches the existing column name, so it is corrected in place
                       disadvantaged_team=correct_team_name('disadvantaged_team'),
                       away=correct_team_name('away'),
                       home=correct_team_name('home'),
                       season=lambda df: df['date'].apply(date_to_season)
                   )
                   .where(lambda df: df['call_type'].isin(FOULS))
                   .dropna()
                   .drop('period', axis=1)
                   .assign(call_type=lambda df: (df['call_type']
                                                   .str.split(': ', expand=True)
                                                   .iloc[:, 1])))
```

About 55% of the rows in the original data set remain.

`clean_df.shape[0] / orig_df.shape[0]`

`0.5516564417177914`

`clean_df.head(n=2).T`

play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89 | 73 |
call_type | Shooting | Shooting |
committing_player | Ian Mahinmi | Bismack Biyombo |
disadvantaged_player | DeMar DeRozan | Paul George |
review_decision | CC | IC |
away | IND | IND |
home | TOR | TOR |
date | 2015-10-28 00:00:00 | 2015-10-28 00:00:00 |
score_away | 99 | 99 |
score_home | 106 | 106 |
disadvantaged_team | TOR | IND |
committing_team | IND | TOR |
season | 2015-2016 | 2015-2016 |

We use `scikit-learn`’s `LabelEncoder` to transform categorical features (call type, player, and season) to integers.

```
call_type_enc = LabelEncoder().fit(
clean_df['call_type']
)
n_call_type = call_type_enc.classes_.size
player_enc = LabelEncoder().fit(
np.concatenate((
clean_df['committing_player'],
clean_df['disadvantaged_player']
))
)
n_player = player_enc.classes_.size
season_enc = LabelEncoder().fit(
clean_df['season']
)
n_season = season_enc.classes_.size
```

We transform the data by

- rounding `seconds_left` to the nearest second (purely for convenience),
- transforming categorical features to integer ids,
- setting `foul_called` equal to one or zero depending on whether or not a foul was called (in the code below, `foul_called` is one when `review_decision` is `CC` or `INC`), and
- setting `score_committing` and `score_disadvantaged` to the score of the committing and disadvantaged teams, respectively.

```
df = (clean_df[['seconds_left']]
.round(0)
.assign(
call_type=call_type_enc.transform(clean_df['call_type']),
foul_called=1. * clean_df['review_decision'].isin(['CC', 'INC']),
player_committing=player_enc.transform(clean_df['committing_player']),
player_disadvantaged=player_enc.transform(clean_df['disadvantaged_player']),
score_committing=clean_df['score_home'].where(
clean_df['committing_team'] == clean_df['home'],
clean_df['score_away']
),
score_disadvantaged=clean_df['score_home'].where(
clean_df['disadvantaged_team'] == clean_df['home'],
clean_df['score_away']
),
season=season_enc.transform(clean_df['season'])
))
```

The resulting data is ready for analysis.

`df.head(n=2).T`

play_id | 20151028INDTOR-1 | 20151028INDTOR-2 |
---|---|---|
seconds_left | 89.0 | 73.0 |
call_type | 4.0 | 4.0 |
foul_called | 1.0 | 0.0 |
player_committing | 162.0 | 36.0 |
player_disadvantaged | 98.0 | 358.0 |
score_committing | 99.0 | 106.0 |
score_disadvantaged | 106.0 | 99.0 |
season | 0.0 | 0.0 |

We follow George Box’s modeling workflow, as summarized by Dustin Tran:

- build a model of the science,
- infer the model given data, and
- criticize the model given data.

Below we examine the foul call rate by season.

```
def make_foul_rate_yaxis(ax, label="Observed foul call rate"):
    ax.yaxis.set_major_formatter(pct_formatter)
    ax.set_ylabel(label)

    return ax

make_foul_rate_yaxis(
    df.pivot_table('foul_called', 'season')
      .rename(index=season_enc.inverse_transform)
      .rename_axis("Season")
      .plot(kind='bar', rot=0, legend=False)
);
```

There is a pronounced difference between the foul call rate in the 2015-2016 and 2016-2017 NBA seasons; our first model accounts for this difference.

We use `pymc3` to specify our models. Our first model is given by

\[ \begin{align*} \beta^{\textrm{season}}_s & \sim N(0, 5) \\ \eta^{\textrm{game}}_k & = \beta^{\textrm{season}}_{s(k)} \\ p_k & = \textrm{sigm}\left(\eta^{\textrm{game}}_k\right). \end{align*} \]

We use a logistic regression model with different factors for each season.

```
import pymc3 as pm

with pm.Model() as base_model:
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    p = pm.Deterministic('p', pm.math.sigmoid(β_season))
```

Foul calls are Bernoulli trials, \(y_k \sim \textrm{Bernoulli}(p_k).\)

`season = df['season'].values`

```
with base_model:
    y = pm.Bernoulli(
        'y', p[season],
        observed=df['foul_called']
    )
```

We now sample from the model’s posterior distribution.

```
NJOBS = 3
SAMPLE_KWARGS = {
'draws': 1000,
'njobs': NJOBS,
'random_seed': [
SEED + i for i in range(NJOBS)
],
'nuts_kwargs': {
'target_accept': 0.9
}
}
```

```
with base_model:
    base_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [β_season]
100%|██████████| 1500/1500 [00:07<00:00, 198.65it/s]
```

We rely on three diagnostics to ensure that our samples have converged to the posterior distribution:

- Energy plots: if the two distributions in the energy plot differ significantly (especially in the tails), the sampling was not very efficient.
- Bayesian fraction of missing information (BFMI): BFMI quantifies this difference with a number that nominally lies between zero and one, although the estimate computed from a finite sample can exceed one. A BFMI close to (or exceeding) one is preferable, and a BFMI lower than 0.2 is indicative of efficiency issues.
- Gelman-Rubin statistics: Gelman-Rubin statistics near one are preferable, and values less than 1.1 are generally taken to indicate convergence.

For more information on energy plots and BFMI consult *Robust Statistical Workflow with PyStan*.
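To give a feel for how the last two diagnostics behave, here is a small standard-library sketch of the classic Gelman-Rubin statistic and of a BFMI estimate. These are textbook formulas written for illustration; they are not guaranteed to match what `pm.gelman_rubin` and `pm.bfmi` compute internally, and the simulated chains and energies are made up.

```python
import random
import statistics


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    within = statistics.fmean([statistics.variance(c) for c in chains])
    between = n * statistics.variance([statistics.fmean(c) for c in chains])
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


def bfmi(energy):
    """Squared energy changes between draws, relative to the energy's total squared deviation."""
    mean = statistics.fmean(energy)
    changes = sum((b - a) ** 2 for a, b in zip(energy, energy[1:]))
    spread = sum((e - mean) ** 2 for e in energy)
    return changes / spread


rng = random.Random(0)  # fixed seed so the toy numbers are reproducible

# Three chains drawn from one distribution versus three chains stuck in different places.
mixed = [[rng.gauss(0.0, 1.0) for _ in range(500)] for _ in range(3)]
stuck = [[rng.gauss(shift, 1.0) for _ in range(500)] for shift in (0.0, 3.0, 6.0)]

# Independent energies versus a slowly wandering, highly autocorrelated energy series.
free_energy = [rng.gauss(0.0, 1.0) for _ in range(5000)]
sticky_energy = [0.0]
for _ in range(4999):
    sticky_energy.append(0.99 * sticky_energy[-1] + rng.gauss(0.0, 1.0))

print(f"Gelman-Rubin: mixed={gelman_rubin(mixed):.3f}, stuck={gelman_rubin(stuck):.3f}")
print(f"BFMI: free={bfmi(free_energy):.2f}, sticky={bfmi(sticky_energy):.2f}")
```

Well-mixed chains give a Gelman-Rubin statistic near one, while chains centered in different places push it well above 1.1. Similarly, an energy series that barely moves between draws yields a BFMI far below the 0.2 threshold.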

```
bfmi = pm.bfmi(base_trace)
max_gr = max(
np.max(gr_stats) for gr_stats in pm.gelman_rubin(base_trace).values()
)
```

`CONVERGENCE_TITLE = lambda: f"BFMI = {bfmi:.2f}\nGelman-Rubin = {max_gr:.3f}"`

```
(pm.energyplot(base_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
```

We use the samples from `p`’s posterior distribution to calculate residuals, which we use to criticize our models. These residuals allow us to assess how well our model describes the data-generation process and to discover unmodeled sources of variation.

`base_trace['p']`

```
array([[ 0.4052151 , 0.30696232],
[ 0.3937377 , 0.30995026],
[ 0.39881138, 0.29866616],
...,
[ 0.40279887, 0.31166828],
[ 0.4077945 , 0.30299785],
[ 0.40207901, 0.29991789]])
```

```
resid_df = (df.assign(p_hat=base_trace['p'][:, df['season']].mean(axis=0))
.assign(resid=lambda df: df['foul_called'] - df['p_hat']))
```

`resid_df[['foul_called', 'p_hat', 'resid']].head()`

play_id | foul_called | p_hat | resid |
---|---|---|---|
20151028INDTOR-1 | 1.0 | 0.403875 | 0.596125 |
20151028INDTOR-2 | 0.0 | 0.403875 | -0.403875 |
20151028INDTOR-3 | 1.0 | 0.403875 | 0.596125 |
20151028INDTOR-4 | 0.0 | 0.403875 | -0.403875 |
20151028INDTOR-6 | 0.0 | 0.403875 | -0.403875 |
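The residual calculation amounts to averaging the posterior draws of `p` for each play's season and subtracting that average from the observed outcome. A minimal sketch in plain Python follows, using hypothetical draws and plays rather than the actual trace:

```python
# Hypothetical posterior draws of p: one row per draw, one column per season.
p_draws = [
    [0.41, 0.31],
    [0.39, 0.30],
    [0.40, 0.29],
]
season = [0, 0, 1, 1, 0]                 # encoded season of each play
foul_called = [1.0, 0.0, 1.0, 0.0, 0.0]  # observed outcome of each play

n_draws = len(p_draws)
n_season = len(p_draws[0])

# Posterior mean of p for each season, then looked up per play.
p_mean = [sum(draw[s] for draw in p_draws) / n_draws for s in range(n_season)]
p_hat = [p_mean[s] for s in season]

# Residual: observed outcome minus posterior expected probability.
resid = [observed - expected for observed, expected in zip(foul_called, p_hat)]
print([round(r, 2) for r in resid])
```

Every play in the same season shares one `p_hat` under this model, which is why the residuals in the table above take only two distinct values per season.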

The per-season residuals are quite small, which is to be expected.

```
(resid_df.pivot_table('resid', 'season')
.rename(index=season_enc.inverse_transform))
```

season | resid |
---|---|
2015-2016 | -0.000162 |
2016-2017 | -0.000219 |

Anyone who has watched a close basketball game will realize that we have neglected an important factor in late-game foul calls: intentional fouls. Near the end of the game, intentional fouls are used by the losing team when they are on defense to end the leading team’s possession as quickly as possible.

The influence of intentional fouls is shown in the plot below by the rapid increase in the residuals as the number of seconds left in the game decreases.

```
def make_time_axes(ax,
                   xlabel="Seconds remaining in game",
                   ylabel="Observed foul call rate"):
    ax.invert_xaxis()
    ax.set_xlabel(xlabel)

    return make_foul_rate_yaxis(ax, label=ylabel)

make_time_axes(
    resid_df.pivot_table('resid', 'seconds_left')
            .reset_index()
            .plot('seconds_left', 'resid', kind='scatter'),
    ylabel="Residual"
);
```

The following plot illustrates the fact that only the trailing team has any incentive to commit intentional fouls.

```
df['trailing_committing'] = (df['score_committing']
.lt(df['score_disadvantaged'])
.mul(1.)
.astype(np.int64))
```

```
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_committing'
)
.rolling(20)
.mean()
.rename(columns={
0: "No", 1: "Yes"
})
.rename_axis(
"Committing team is trailing",
axis=1
)
.plot()
);
```

Intentional fouls are only useful when the trailing (and committing) team is on defense. The plot below reflects this fact: shooting and personal fouls are almost always called against the defensive player, and we see that they are called at a much higher rate than offensive fouls.

```
ax = (df.pivot_table('foul_called', 'call_type')
.rename(index=call_type_enc.inverse_transform)
.rename_axis("Call type", axis=0)
.plot(kind='barh', legend=False))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Observed foul call rate");
```

We continue to model the difference in foul call rates between seasons.

```
with pm.Model() as poss_model:
    β_season = pm.Normal('β_season', 0., 5., shape=2)
```

Throughout this post, we will use hierarchical distributions to model the variation of foul call rates. For much more information on hierarchical models, consult *Data Analysis Using Regression and Multilevel/Hierarchical Models*. We use the priors

\[ \begin{align*} \sigma_{\textrm{call}} & \sim \operatorname{HalfNormal}(5) \\ \beta^{\textrm{call}}_{c} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\textrm{call}}^2). \end{align*} \]

For sampling efficiency, we use a [non-centered parametrization](http://twiecki.github.io/blog/2017/02/08/bayesian-hierchical-non-centered/#The-Funnel-of-Hell-(and-how-to-escape-it%29) of the hierarchical normal distribution.

```
def hierarchical_normal(name, shape, σ_shape=1):
    Δ = pm.Normal(f'Δ_{name}', 0., 1., shape=shape)
    σ = pm.HalfNormal(f'σ_{name}', 5., shape=σ_shape)

    return pm.Deterministic(name, Δ * σ)
```
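The idea behind the non-centered form is that scaling a standard normal offset by σ yields the same distribution as drawing directly from a normal with scale σ. The sketch below checks this by simulation with the standard library; the value of `sigma` is a hypothetical fixed scale, whereas in the model σ is itself a random variable.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the comparison is reproducible
sigma = 2.5              # hypothetical group-level scale
n_draws = 20_000

# Centered form: draw the coefficient directly from N(0, sigma^2).
centered = [rng.gauss(0.0, sigma) for _ in range(n_draws)]

# Non-centered form: draw a standard normal offset, then scale it by sigma.
delta = [rng.gauss(0.0, 1.0) for _ in range(n_draws)]
non_centered = [sigma * d for d in delta]

centered_sd = statistics.pstdev(centered)
non_centered_sd = statistics.pstdev(non_centered)
print(round(centered_sd, 2), round(non_centered_sd, 2))
```

Both standard deviations land close to `sigma`. The two forms describe the same prior, but in the non-centered form the sampler explores the offsets and the scale as separate quantities, which often behaves better when the data say little about individual groups.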

Each call type has a different foul call rate.

```
with poss_model:
    β_call = hierarchical_normal('β_call', n_call_type)
```

We add score difference and the number of possessions by which the committing team is trailing to the `DataFrame`.

```
df['score_diff'] = (df['score_disadvantaged']
.sub(df['score_committing']))
df['trailing_poss'] = (df['score_diff']
.div(3)
.apply(np.ceil))
```

```
trailing_poss_enc = LabelEncoder().fit(df['trailing_poss'])
trailing_poss = trailing_poss_enc.transform(df['trailing_poss'])
n_trailing_poss = trailing_poss_enc.classes_.size
```

The plot below shows that the foul call rate (over time) varies based on the score difference (quantized into possessions) between the disadvantaged team and the committing team. We assume that at most three points can be scored in a single possession (while this is not quite correct, four-point plays are rare enough that we do not account for them in our analysis).

```
make_time_axes(
df.pivot_table(
'foul_called',
'seconds_left',
'trailing_poss'
)
.loc[:, 1:3]
.rolling(20).mean()
.rename_axis(
"Trailing possessions\n(committing team)",
axis=1
)
.plot()
);
```

The plot below reflects the fact that intentional fouls are disproportionately personal fouls; the rate at which personal fouls are called increases drastically as the game nears its end.

```
make_time_axes(
df.pivot_table('foul_called', 'seconds_left', 'call_type')
.rolling(20).mean()
.rename(columns=call_type_enc.inverse_transform)
.rename_axis(None, axis=1)
.plot()
);
```

Due to the NBA’s shot clock, the natural timescale of a basketball game is possessions, not seconds, remaining.

```
df['remaining_poss'] = (df['seconds_left']
.floordiv(25)
.add(1))
```

```
remaining_poss_enc = LabelEncoder().fit(df['remaining_poss'])
remaining_poss = remaining_poss_enc.transform(df['remaining_poss'])
n_remaining_poss = remaining_poss_enc.classes_.size
```

Below we plot the foul call rate across trailing possession/remaining possession pairs. Note that we always calculate trailing possessions (`trailing_poss`) from the perspective of the committing team. For instance, `trailing_poss = 1` indicates that the committing team is trailing by 1-3 points. Because `np.ceil` rounds toward positive infinity, the bins for a leading team are shifted: `trailing_poss = 0` covers a tie or a lead of 1-2 points, and `trailing_poss = -1` indicates that the committing team is leading by 3-5 points.
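The two quantizations can be checked by hand. The sketch below mirrors the `ceil(score_diff / 3)` and `seconds_left // 25 + 1` computations from the code above using only the standard library; the helper names `trailing_poss_of` and `remaining_poss_of` are hypothetical. Note that rounding up treats negative score differences (a leading committing team) differently from positive ones.

```python
import math


def trailing_poss_of(score_diff):
    """Possessions by which the committing team trails (ceil of points / 3)."""
    return math.ceil(score_diff / 3)


def remaining_poss_of(seconds_left):
    """Remaining possessions, assuming roughly 25 seconds per possession."""
    return seconds_left // 25 + 1


# score_diff is the disadvantaged team's score minus the committing team's score.
score_diffs = [-6, -5, -3, -2, 0, 1, 3, 4]
trailing = [trailing_poss_of(diff) for diff in score_diffs]
for diff, poss in zip(score_diffs, trailing):
    print(f"score_diff={diff:+d} -> trailing_poss={poss}")

remaining = [remaining_poss_of(seconds) for seconds in (0, 24, 25, 120)]
print(remaining)
```

A deficit of 1-3 points maps to one possession, while a lead has to reach three points before the value drops to -1.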

```
ax = sns.heatmap(
df.pivot_table(
'foul_called',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis("Remaining possessions", axis=1),
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Observed foul call rate");
```

The heatmap above shows that the foul call rate increases significantly when the committing team is trailing by more than the number of possessions remaining in the game. That is, teams resort to intentional fouls only when the opposing team can run out the clock and guarantee a win. (Since we have quantized the score difference and time into possessions, this conclusion is not entirely correct; it is, however, correct enough for our purposes.)

```
call_name_df = df.assign(
call_type=lambda df: call_type_enc.inverse_transform(
df['call_type'].values
)
)
diff_df = (pd.merge(
call_name_df,
call_name_df.groupby('call_type')
['foul_called']
.mean()
.rename('avg_foul_called')
.reset_index()
)
.assign(diff=lambda df: df['foul_called'] - df['avg_foul_called']))
```

The heatmaps below are broken out by call type, and show the difference between the foul call rate for each trailing/remaining possession combination and the overall foul call rate for the call type in question.

```
def plot_foul_diff_heatmap(*_, data=None, **kwargs):
    ax = plt.gca()
    sns.heatmap(
        data.pivot_table(
            'diff',
            'trailing_poss',
            'remaining_poss'
        ),
        cmap='seismic', robust=True,
        cbar_kws={'format': pct_formatter}
    )
    ax.invert_yaxis()
    ax.set_title("Observed foul call rate")

(sns.FacetGrid(diff_df, col='call_type', col_wrap=3, aspect=1.5)
    .map_dataframe(plot_foul_diff_heatmap)
    .set_axis_labels(
        "Remaining possessions",
        "Trailing possessions\n(committing team)"
    )
    .set_titles("{col_name}"));
```

These plots confirm that most intentional fouls are personal fouls. They also show that the three-way interaction between trailing possessions, remaining possessions, and call type is important for modeling foul call rates.

\[ \begin{align*} \sigma_{\textrm{poss}, c} & \sim \operatorname{HalfNormal}(5) \\ \beta^{\textrm{poss}}_{t, r, c} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\textrm{poss}, c}^2) \end{align*} \]

```
with poss_model:
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )
```

The foul call rate is a combination of season, call type, and possession factors.

\[\eta^{\textrm{game}}_k = \beta^{\textrm{season}}_{s(k)} + \beta^{\textrm{call}}_{c(k)} + \beta^{\textrm{poss}}_{t(k),r(k),c(k)}\]

`call_type = df['call_type'].values`

```
with poss_model:
    η_game = β_season[season] \
        + β_call[call_type] \
        + β_poss[
            trailing_poss,
            remaining_poss,
            call_type
        ]
```

\[ \begin{align*} p_k & = \operatorname{sigm}\left(\eta^{\textrm{game}}_k\right) \end{align*} \]

```
with poss_model:
    p = pm.Deterministic('p', pm.math.sigmoid(η_game))
    y = pm.Bernoulli('y', p, observed=df['foul_called'])
```

Again, we sample from the model’s posterior distribution.

```
with poss_model:
    poss_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [07:56<00:00, 3.15it/s]
There were 5 divergences after tuning. Increase `target_accept` or reparameterize.
There were 10 divergences after tuning. Increase `target_accept` or reparameterize.
There were 9 divergences after tuning. Increase `target_accept` or reparameterize.
The number of effective samples is smaller than 25% for some parameters.
```

The BFMI and Gelman-Rubin statistics for this model indicate no problems with sampling and good convergence.

```
bfmi = pm.bfmi(poss_trace)
max_gr = max(
np.max(gr_stats) for gr_stats in pm.gelman_rubin(poss_trace).values()
)
```

```
(pm.energyplot(poss_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
```

Again, we calculate residuals.

```
resid_df = (df.assign(p_hat=poss_trace['p'].mean(axis=0))
.assign(resid=lambda df: df.foul_called - df.p_hat))
```

The following plots show that, grouped various ways, the residuals for this model are relatively well-distributed.

```
ax = sns.heatmap(
resid_df.pivot_table(
'resid',
'trailing_poss',
'remaining_poss'
)
.rename_axis(
"Trailing possessions\n(committing team)",
axis=0
)
.rename_axis(
"Remaining possessions",
axis=1
)
.loc[-3:3],
cmap='seismic',
cbar_kws={'format': pct_formatter}
)
ax.invert_yaxis();
ax.set_title("Mean residual");
```

```
N_BIN = 20
bin_ix, bins = pd.qcut(
resid_df.p_hat, N_BIN,
labels=np.arange(N_BIN),
retbins=True
)
```

```
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
```

```
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
```

Now that we have two models, we can engage in model selection. We use the widely applicable information criterion (WAIC) for model selection.
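For reference, WAIC can be computed from the pointwise log-likelihood of each observation under each posterior sample. The following is a minimal sketch in plain Python of the usual definition on the deviance scale. The `log_lik` layout (samples by observations) and the function names are assumptions made for illustration, the toy numbers are made up, and conventions differ on how the variance term is normalised, so this is not a description of what `pm.compare` does internally.

```python
import math
from statistics import variance


def log_mean_exp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def waic(log_lik):
    """Return (WAIC, pWAIC) on the deviance scale.

    log_lik[s][i] is the log-likelihood of observation i under posterior
    sample s (a hypothetical input layout).
    """
    n_obs = len(log_lik[0])
    columns = [[row[i] for row in log_lik] for i in range(n_obs)]
    # log pointwise predictive density
    lppd = sum(log_mean_exp(col) for col in columns)
    # effective number of parameters: per-observation variance over samples
    p_waic = sum(variance(col) for col in columns)
    return -2.0 * (lppd - p_waic), p_waic


# Toy input: three posterior samples, two observations.
toy_log_lik = [[-0.5, -1.0], [-0.7, -1.2], [-0.6, -0.9]]
waic_value, p_waic_value = waic(toy_log_lik)
```

The penalty term grows when the log-likelihood of an observation varies a lot across posterior samples, which is why more flexible models tend to have a larger pWAIC.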

```
MODEL_NAME_MAP = {
0: "Base",
1: "Possession"
}
```

```
comp_df = (pm.compare(
(base_trace, poss_trace),
(base_model, poss_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
```

Since smaller WAICs are better, the possession model clearly outperforms the base model.

`comp_df`

| | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn |
|---|---|---|---|---|---|---|---|
| Base | 11610.1 | 2.11 | 1541.98 | 0 | 56.9 | 73.43 | 0 |
| Possession | 10068.1 | 82.93 | 0 | 1 | 88.05 | 0 | 0 |
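The `dWAIC` and `dSE` columns can be read together: `dWAIC` is the gap to the best model and `dSE` is the standard error of that gap. The snippet below only redoes this arithmetic with the values printed in the table; reading the ratio as an informal measure of how decisive the gap is remains an assumption here, not a formal test.

```python
# Values copied from the comparison table above.
waic_base, waic_poss = 11610.1, 10068.1
d_se = 73.43

# Gap between the base and possession models; the table reports 1541.98
# because the displayed WAIC values are rounded.
d_waic = waic_base - waic_poss

# The gap is roughly 21 standard errors wide.
ratio = d_waic / d_se
```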

```
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)),
comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
```

We now turn to the question of whether or not committing and/or drawing fouls is a measurable skill. We use an item-response theory (IRT) model to study this question. For more information on Bayesian item-response models, consult the following references.

- *Practical Issues in Implementing and Understanding Bayesian Ideal Point Estimation* is an excellent introduction to applied Bayesian IRT models and has inspired much of this work.
- *Bayesian Item Response Modeling: Theory and Applications* is a comprehensive mathematical overview of Bayesian IRT modeling.

The item-response theory model includes the season, call type, and possession terms of the previous models.

```
with pm.Model() as irt_model:
    β_season = pm.Normal('β_season', 0., 5., shape=n_season)
    β_call = hierarchical_normal('β_call', n_call_type)
    β_poss = hierarchical_normal(
        'β_poss',
        (n_trailing_poss, n_remaining_poss, n_call_type),
        σ_shape=(1, 1, n_call_type)
    )
    η_game = β_season[season] \
        + β_call[call_type] \
        + β_poss[
            trailing_poss,
            remaining_poss,
            call_type
        ]
```

Each disadvantaged player has an ideal point (per season).

\[ \begin{align*} \sigma_{\theta} & \sim \operatorname{HalfNormal}(5) \\ \theta^{\textrm{player}}_{i, s} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{\theta}^2) \end{align*} \]

```
player_disadvantaged = df['player_disadvantaged'].values
n_player = player_enc.classes_.size
```

```
with irt_model:
    θ_player = hierarchical_normal(
        'θ_player', (n_player, n_season)
    )
    θ = θ_player[player_disadvantaged, season]
```

Each committing player has an ideal point (per season).

\[ \begin{align*} \sigma_{b} & \sim \operatorname{HalfNormal}(5) \\ b^{\textrm{player}}_{j, s} & \sim \operatorname{Hierarchical-Normal}(0, \sigma_{b}^2) \end{align*} \]

`player_committing = df['player_committing'].values`

```
with irt_model:
    b_player = hierarchical_normal(
        'b_player', (n_player, n_season)
    )
    b = b_player[player_committing, season]
```

Players affect the foul call rate through the difference in their ideal points.

\[\eta^{\textrm{player}}_k = \theta_k - b_k\]

```
with irt_model:
    η_player = θ - b
```

The sum of the game and player effects determines the foul call probability.

\[\eta_k = \eta^{\textrm{game}}_k + \eta^{\textrm{player}}_k\]

```
with irt_model:
    η = η_game + η_player
```

```
with irt_model:
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli(
        'y', p,
        observed=df['foul_called']
    )
```
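To make the roles of the two ideal points concrete, the sketch below evaluates the link function by hand in plain Python. The numbers are made up for illustration (they are not posterior estimates), and `foul_call_prob` is a hypothetical helper name.

```python
import math


def sigmoid(x):
    """Logistic function mapping the real line to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def foul_call_prob(eta_game, theta, b):
    """p = sigm(eta_game + theta - b), mirroring the IRT model above."""
    return sigmoid(eta_game + theta - b)


baseline = foul_call_prob(-1.0, 0.0, 0.0)
# A disadvantaged player with a higher ideal point raises the probability ...
higher_theta = foul_call_prob(-1.0, 0.3, 0.0)
# ... and a committing player with a higher ideal point lowers it.
higher_b = foul_call_prob(-1.0, 0.0, 0.3)
```

Because only the difference of the two ideal points enters the model, equal values of \(\theta\) and \(b\) cancel and leave the game-level probability unchanged.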

Again, we sample from the model’s posterior distribution.

```
with irt_model:
    irt_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_b_player_log__, Δ_b_player, σ_θ_player_log__, Δ_θ_player, σ_β_poss_log__, Δ_β_poss, σ_β_call_log__, Δ_β_call, β_season]
100%|██████████| 1500/1500 [13:55<00:00, 1.80it/s]
There were 3 divergences after tuning. Increase `target_accept` or reparameterize.
There were 1 divergences after tuning. Increase `target_accept` or reparameterize.
There were 4 divergences after tuning. Increase `target_accept` or reparameterize.
The estimated number of effective samples is smaller than 200 for some parameters.
```

None of the sampling diagnostics indicate problems with convergence.

```
bfmi = pm.bfmi(irt_trace)
max_gr = max(
np.max(gr_stats) for gr_stats in pm.gelman_rubin(irt_trace).values()
)
```

```
(pm.energyplot(irt_trace, legend=False, figsize=(6, 4))
.set_title(CONVERGENCE_TITLE()));
```

The binned residuals for this model are more asymmetric than for the previous models, but still not too bad.

```
resid_df = (df.assign(p_hat=irt_trace['p'].mean(axis=0))
.assign(resid=lambda df: df['foul_called'] - df['p_hat']))
```

```
N_BIN = 50
bin_ix, bins = pd.qcut(
resid_df.p_hat, N_BIN,
labels=np.arange(N_BIN),
retbins=True
)
```

```
ax = (resid_df.groupby(bins[bin_ix])
.resid.mean()
.rename_axis('p_hat', axis=0)
.reset_index()
.plot('p_hat', 'resid', kind='scatter'))
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel(r"Binned $\hat{p}$");
make_foul_rate_yaxis(ax, label="Residual");
```

```
ax = (resid_df.groupby('seconds_left')
.resid.mean()
.reset_index()
.plot('seconds_left', 'resid', kind='scatter'))
make_time_axes(ax, ylabel="Residual");
```

The IRT model is a marginal improvement over the possession model in terms of WAIC.

```
MODEL_NAME_MAP[2] = "IRT"
comp_df = (pm.compare(
(base_trace, poss_trace, irt_trace),
(base_model, poss_model, irt_model)
)
.rename(index=MODEL_NAME_MAP)
.loc[MODEL_NAME_MAP.values()])
```

`comp_df`

| | WAIC | pWAIC | dWAIC | weight | SE | dSE | var_warn |
|---|---|---|---|---|---|---|---|
| Base | 11610.1 | 2.11 | 1566.92 | 0 | 56.9 | 74.03 | 0 |
| Possession | 10068.1 | 82.93 | 24.94 | 0.08 | 88.05 | 10.99 | 0 |
| IRT | 10043.2 | 216.6 | 0 | 0.91 | 88.47 | 0 | 0 |

```
fig, ax = plt.subplots()
ax.errorbar(
np.arange(len(MODEL_NAME_MAP)), comp_df.WAIC,
yerr=comp_df.SE, fmt='o'
);
ax.set_xticks(np.arange(len(MODEL_NAME_MAP)));
ax.set_xticklabels(comp_df.index);
ax.set_xlabel("Model");
ax.set_ylabel("WAIC");
```

We now produce two `DataFrame`s containing the estimated player ideal points per season.

```
def varname_to_param(varname):
    return varname[0]

def varname_to_player(varname):
    return int(varname[3:-2])

def varname_to_season(varname):
    return int(varname[-1])
```
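These helpers rely on fixed character positions, which works here because there are only two seasons, so the season index is a single digit. A split-based variant, sketched below, would not depend on the width of either index. It assumes the `param__player_season` naming of the flattened columns once the `_player` suffix has been removed (for example `θ__12_1`); `parse_varname` is a hypothetical name, not part of the analysis code.

```python
def parse_varname(varname):
    """Split a flattened name such as 'θ__12_1' into (param, player, season)."""
    # The parameter name and the indices are separated by a double underscore.
    param, indices = varname.split('__')
    # The indices themselves are separated by single underscores.
    player, season = indices.split('_')
    return param, int(player), int(season)


parsed = parse_varname('θ__12_1')
```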

```
irt_df = (pm.trace_to_dataframe(
irt_trace, varnames=['θ_player', 'b_player']
)
.rename(columns=lambda col: col.replace('_player', ''))
.T
.apply(
lambda s: pd.Series.describe(
s, percentiles=[0.055, 0.945]
),
axis=1
)
[['mean', '5.5%', '94.5%']]
.rename(columns={
'5.5%': 'low',
'94.5%': 'high'
})
.rename_axis('varname')
.reset_index()
.assign(
param=lambda df: df['varname'].apply(varname_to_param),
player=lambda df: df['varname'].apply(varname_to_player),
season=lambda df: df['varname'].apply(varname_to_season)
)
.drop('varname', axis=1))
```

`irt_df.head()`

| | mean | low | high | param | player | season |
|---|---|---|---|---|---|---|
| 0 | -0.016516 | -0.323127 | 0.272022 | θ | 0 | 0 |
| 1 | 0.003845 | -0.289842 | 0.290248 | θ | 0 | 1 |
| 2 | 0.008519 | -0.289715 | 0.304574 | θ | 1 | 0 |
| 3 | 0.031367 | -0.240853 | 0.339297 | θ | 1 | 1 |
| 4 | -0.037530 | -0.320010 | 0.226110 | θ | 2 | 0 |

```
player_irt_df = irt_df.pivot_table(
index='player',
columns=['param', 'season'],
values='mean'
)
```

`player_irt_df.head()`

| player | b (season 0) | b (season 1) | θ (season 0) | θ (season 1) |
|---|---|---|---|---|
| 0 | -0.069235 | 0.013339 | -0.016516 | 0.003845 |
| 1 | -0.003003 | 0.001853 | 0.008519 | 0.031367 |
| 2 | 0.084515 | 0.089058 | -0.037530 | -0.002373 |
| 3 | -0.028946 | 0.004360 | 0.003514 | -0.000334 |
| 4 | -0.001976 | 0.280380 | 0.072932 | 0.005571 |

The following plot shows that the committing skill appears to be somewhat larger than the disadvantaged skill. This difference seems reasonable because most fouls are committed by the player on defense; committing skill is quite likely to be correlated with defensive ability.

```
def plot_latent_params(df):
    fig, ax = plt.subplots()
    n, _ = df.shape
    y = np.arange(n)
    ax.errorbar(
        df['mean'], y,
        xerr=(df[['high', 'low']]
                .sub(df['mean'], axis=0)
                .abs()
                .values.T),
        fmt='o'
    )
    ax.set_yticks(y)
    ax.set_yticklabels(
        player_enc.inverse_transform(df.player)
    )
    ax.set_ylabel("Player")
    return fig, ax

fig, axes = plt.subplots(
ncols=2, nrows=2, sharex=True,
figsize=(16, 8)
)
(θ0_ax, θ1_ax), (b0_ax, b1_ax) = axes
bins = np.linspace(
0.9 * irt_df['mean'].min(),
1.1 * irt_df['mean'].max(),
75
)
θ0_ax.hist(
player_irt_df['θ', 0],
bins=bins, normed=True
);
θ1_ax.hist(
player_irt_df['θ', 1],
bins=bins, normed=True
);
θ0_ax.set_yticks([]);
θ0_ax.set_title(
r"$\hat{\theta}$ (" + season_enc.inverse_transform(0) + ")"
);
θ1_ax.set_yticks([]);
θ1_ax.set_title(
r"$\hat{\theta}$ (" + season_enc.inverse_transform(1) + ")"
);
b0_ax.hist(
player_irt_df['b', 0],
bins=bins, normed=True, color=green
);
b1_ax.hist(
player_irt_df['b', 1],
bins=bins, normed=True, color=green
);
b0_ax.set_xlabel(
r"$\hat{b}$ (" + season_enc.inverse_transform(0) + ")"
);
b0_ax.invert_yaxis();
b0_ax.xaxis.tick_top();
b0_ax.set_yticks([]);
b1_ax.set_xlabel(
r"$\hat{b}$ (" + season_enc.inverse_transform(1) + ")"
);
b1_ax.invert_yaxis();
b1_ax.xaxis.tick_top();
b1_ax.set_yticks([]);
fig.suptitle("Disadvantaged skill", size=18);
fig.text(0.45, 0.02, "Committing skill", size=18)
fig.tight_layout();
```

We now examine the top and bottom ten players in each ability, across both seasons.

The top players in terms of disadvantaged ability tend to be good scorers (Jimmy Butler, Ricky Rubio, John Wall, Andre Iguodala). The presence of DeAndre Jordan in the top ten may be due to the hack-a-Shaq phenomenon. In future work, it would be interesting to control for the disadvantaged player’s free throw percentage in order to mitigate the influence of the hack-a-Shaq effect on the measurement of latent skill.

Interestingly, the bottom players (in terms of disadvantaged ability) include many stars (Pau Gasol, Carmelo Anthony, Kevin Durant, Kawhi Leonard). The presence of these stars in the bottom may somewhat counteract the pervasive narrative that referees favor stars in their foul calls.

```
top_bot_irt_df = (irt_df.groupby('param')
.apply(
lambda df: pd.concat((
df.nlargest(10, 'mean'),
df.nsmallest(10, 'mean')
),
axis=0, ignore_index=True
)
)
.reset_index(drop=True))
```

`top_bot_irt_df.head()`

| | mean | low | high | param | player | season |
|---|---|---|---|---|---|---|
| 0 | 0.351946 | -0.026786 | 0.762273 | b | 86 | 0 |
| 1 | 0.320737 | -0.027064 | 0.713128 | b | 23 | 0 |
| 2 | 0.280380 | -0.071020 | 0.695970 | b | 4 | 1 |
| 3 | 0.279678 | -0.057249 | 0.647667 | b | 462 | 1 |
| 4 | 0.271735 | -0.106795 | 0.676231 | b | 78 | 0 |

```
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'θ']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{\theta}$");
ax.set_title("Top and bottom ten");
```

The top ten players in terms of committing skill include many defensive standouts (Danny Green — twice, Gordon Hayward, Paul George).

The bottom ten players include many that are known to be defensively challenged (Ricky Rubio and James Harden). Dwight Howard was, at one point, a fierce defender of the rim, but was well past his prime in 2015, when our data set begins. Chris Paul’s presence in the bottom is somewhat surprising.

```
fig, ax = plot_latent_params(
top_bot_irt_df[top_bot_irt_df['param'] == 'b']
.sort_values('mean')
)
ax.set_xlabel(r"$\hat{b}$");
ax.set_title("Top and bottom ten");
```

In the sports analytics community, year-over-year correlation of latent parameters is the test of whether or not a latent quantity truly measures a skill. The following plots show a slight year-over-year correlation in the committing skill, but not much correlation in the disadvantaged skill.

```
def p_val_to_asterisks(p_val):
    if p_val < 0.0001:
        return "****"
    elif p_val < 0.001:
        return "***"
    elif p_val < 0.01:
        return "**"
    elif p_val < 0.05:
        return "*"
    else:
        return ""

def plot_corr(x, y, **kwargs):
    corrcoeff, p_val = sp.stats.pearsonr(x, y)
    asterisks = p_val_to_asterisks(p_val)
    artist = AnchoredText(
        f'{corrcoeff:.2f}{asterisks}',
        loc=10, frameon=False,
        prop=dict(size=LABELSIZE)
    )
    plt.gca().add_artist(artist)
    plt.grid(b=False)
```
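For readers who want to see what the coefficient reported by `pearsonr` measures, here is a plain-Python sketch of the correlation itself, without the p-value. The two lists stand in for one latent parameter in consecutive seasons and are made up for illustration; `pearson_r` is a hypothetical name.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    # Sums of centered cross-products and squares; the 1/n factors cancel.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Made-up ideal points for four players in two seasons.
season_0 = [0.10, -0.05, 0.30, 0.02]
season_1 = [0.08, -0.02, 0.25, -0.01]
r = pearson_r(season_0, season_1)
```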

```
PARAM_MAP = {
'θ': r"$\hat{\theta}$",
'b': r"$\hat{b}$"
}
def replace_label(label):
    param, season = eval(label)
    return "{param}\n({season})".format(
        param=PARAM_MAP[param],
        season=season_enc.inverse_transform(season)
    )

def style_grid(grid):
    for ax in grid.axes.flat:
        ax.grid(False)
        ax.set_xticklabels([]);
        ax.set_yticklabels([]);
        if ax.get_xlabel():
            ax.set_xlabel(replace_label(ax.get_xlabel()))
        if ax.get_ylabel():
            ax.set_ylabel(replace_label(ax.get_ylabel()))
    return grid
```

```
player_all_season = set(df.groupby('player_disadvantaged')
.filter(lambda df: df['season'].nunique() == n_season)
['player_disadvantaged']) \
& set(df.groupby('player_committing')
.filter(lambda df: df['season'].nunique() == n_season)
['player_committing'])
```

```
style_grid(
sns.PairGrid(
player_irt_df.loc[player_all_season],
size=1.75
)
.map_upper(plt.scatter, alpha=0.5)
.map_diag(plt.hist)
.map_lower(plot_corr)
);
```

Since we can only reasonably estimate the skills of players for which we have sufficient foul call data, we plot the correlations below for players that appeared in more than ten plays in each season.

```
MIN = 10
player_has_min = set(df.groupby('player_disadvantaged')
.filter(lambda df: (df['season']
.value_counts()
.gt(MIN)
.all()))
['player_disadvantaged']) \
& set(df.groupby('player_committing')
.filter(lambda df: (df['season']
.value_counts()
.gt(MIN)
.all()))
['player_committing'])
```

```
grid = style_grid(
sns.PairGrid(
player_irt_df.loc[player_has_min],
size=1.75
)
.map_upper(plt.scatter, alpha=0.5)
.map_diag(plt.hist)
.map_lower(plot_corr)
)
```

As expected, the season-over-season latent skill correlations are higher for this subset of players.

From this figure, it seems that committing skill (\(\hat{b}\)) exists and is measurable, but is fairly small. It also seems that disadvantaged skill (\(\hat{\theta}\)), if it exists, is difficult to measure from this data set. Ideally, the NBA would release a foul report for the entirety of every game, but that seems quite unlikely.

In the future it would be useful to include a correction for the probability that a given game appears in the data set. This correction would help with the sample bias introduced by the fact that only games that are close in the last two minutes are included in the NBA’s reports.

This post is available as a Jupyter notebook here.

Recently, a project at work has led me to learn a bit about capture-recapture models for estimating the size and dynamics of populations that cannot be fully enumerated. These models often arise in ecological statistics when estimating the size, birth, and survival rates of animal populations in the wild. This post gives examples of implementing three capture-recapture models in Python with PyMC3 and is intended primarily as a reference for my future self, though I hope it may serve as a useful introduction for others as well.

We will implement three Bayesian capture-recapture models:

- the Lincoln-Petersen model of abundance,
- the Cormack-Jolly-Seber model of survival, and
- the Jolly-Seber model of abundance and survival.

`%matplotlib inline`

`import logging`

```
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np
import pymc3 as pm
from pymc3.distributions.dist_math import binomln, bound, factln
import scipy as sp
import seaborn as sns
import theano
from theano import tensor as tt
```

```
sns.set()
PCT_FORMATTER = StrMethodFormatter('{x:.1%}')
```

```
SEED = 518302 # from random.org, for reproducibility
np.random.seed(SEED)
```

```
# keep theano from complaining about compile locks
(logging.getLogger('theano.gof.compilelock')
.setLevel(logging.CRITICAL))
# keep theano from warning about default rounding mode changes
theano.config.warn.round = False
```

The simplest model of abundance, that is, the size of a population, is the Lincoln-Petersen model. While this model is a bit simple for most practical applications, it will introduce some useful modeling concepts and computational techniques.

The idea of the Lincoln-Petersen model is to visit the observation site twice to capture individuals from the population of interest. The individuals captured during the first visit are marked (often with tags, radio collars, microchips, etc.) and then released. The number of individuals captured, marked, and released is recorded as \(n_1\). On the second visit, the number of captured individuals is recorded as \(n_2\). If enough individuals are captured on the second visit, chances are quite high that several of them will have been marked on the first visit. The number of marked individuals recaptured on the second visit is recorded as \(n_{1, 2}\).

The Lincoln-Petersen model assumes that:

1. each individual has an equal probability of being captured on both visits (regardless of whether or not it was marked),
2. no marks fall off or become illegible, and
3. the population is closed, that is, no individuals are born, die, enter, or leave the site between visits.

The third assumption is quite restrictive, and will be relaxed in the two subsequent models. The first two assumptions can be relaxed in various ways, but we will not do so in this post. First we derive a simple analytic estimator for the total population size given \(n_1\), \(n_2\), and \(n_{1, 2}\), then we fit a Bayesian Lincoln-Petersen model using PyMC3 to set the stage for the (Cormack-)Jolly-Seber models.

Let \(N\) denote the size of the unknown total population, and let \(p\) denote the capture probability. We have that

\[ \begin{align*} n_1, n_2\ |\ N, p & \sim \textrm{Bin}(N, p) \\ n_{1, 2}\ |\ n_1, p & \sim \textrm{Bin}(n_1, p). \end{align*} \]

Therefore \(\frac{n_2}{N}\) and \(\frac{n_{1, 2}}{n_1}\) are unbiased estimates of \(p\). The Lincoln-Petersen estimator is derived by equating these estimators

\[\frac{n_2}{\hat{N}} = \frac{n_{1, 2}}{n_1}\]

and solving for

\[\hat{N} = \frac{n_1 n_2}{n_{1, 2}}.\]

We now simulate a data set where \(N = 1000\) and the capture probability is \(p = 0.1\).

```
N_LP = 1000
P_LP = 0.1
x_lp = sp.stats.bernoulli.rvs(P_LP, size=(2, N_LP))
```

The rows of `x_lp` correspond to site visits and the columns to individuals. The entry `x_lp[i, j]` is one if the \(j\)-th individual was captured on the \(i\)-th site visit, and zero otherwise.

`x_lp`

```
array([[0, 0, 0, ..., 0, 0, 0],
[0, 0, 0, ..., 0, 1, 0]])
```

We construct \(n_1\), \(n_2\), and \(n_{1, 2}\) from `x_lp`.

```
n1, n2 = x_lp.sum(axis=1)
n12 = x_lp.prod(axis=0).sum()
```

`n1, n2, n12`

`(109, 95, 10)`

The Lincoln-Petersen estimate of \(N\) is therefore

```
N_lp = n1 * n2 / n12
N_lp
```

`1035.5`
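A single simulated data set says little about how noisy the estimator is. The sketch below repeats the simulation in plain Python (standard library only, so it runs outside this notebook) and collects the Lincoln-Petersen estimates. The replicate count and seed are arbitrary choices, the function names are hypothetical, and replicates with no recaptures are skipped since the estimator is undefined there.

```python
import random
from statistics import median


def lincoln_petersen(n1, n2, n12):
    """Lincoln-Petersen estimate of the population size."""
    return n1 * n2 / n12


def simulate_estimates(N, p, n_rep, seed):
    """Simulate n_rep two-visit studies and return the estimates."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_rep):
        # Each individual is captured independently on each visit.
        first = [rng.random() < p for _ in range(N)]
        second = [rng.random() < p for _ in range(N)]
        n1, n2 = sum(first), sum(second)
        n12 = sum(a and b for a, b in zip(first, second))
        if n12 > 0:  # the estimator is undefined without recaptures
            estimates.append(lincoln_petersen(n1, n2, n12))
    return estimates


estimates = simulate_estimates(N=1000, p=0.1, n_rep=200, seed=0)
typical = median(estimates)
spread = (min(estimates), max(estimates))
```

With only about ten recaptures expected per replicate, the estimates scatter widely around the true population size, which is one motivation for the full posterior distribution obtained from the Bayesian formulation.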

We now give a Bayesian formulation of the Lincoln-Petersen model. We use the priors

\[ \begin{align*} p & \sim U(0, 1) \\ \pi(N) & = 1 \textrm{ for } N \geq n_1 + n_2 - n_{1, 2}. \end{align*} \]

Note that the prior on \(N\) is improper.

```
with pm.Model() as lp_model:
    p = pm.Uniform('p', 0., 1.)
    N_ = pm.Bound(pm.Flat, lower=n1 + n2 - n12)('N')
```
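The lower bound in the prior is the number of distinct individuals seen at least once, which follows from inclusion-exclusion; for the simulated data it is \(109 + 95 - 10 = 194\). A tiny sketch with made-up individual labels:

```python
# Hypothetical labels of the individuals captured on each visit.
seen_first = {'a', 'b', 'c', 'd'}
seen_second = {'c', 'd', 'e'}

toy_n1, toy_n2 = len(seen_first), len(seen_second)
toy_n12 = len(seen_first & seen_second)   # captured on both visits

# Individuals seen at least once: the population cannot be smaller.
distinct_seen = len(seen_first | seen_second)
lower_bound = toy_n1 + toy_n2 - toy_n12
```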

We now implement the likelihoods of the data given above during the derivation of the Lincoln-Petersen estimator.

```
with lp_model:
    n1_obs = pm.Binomial('n1_obs', N_, p, observed=n1)
    n2_obs = pm.Binomial('n2_obs', N_, p, observed=n2)
    n12_obs = pm.Binomial('n12_obs', n1, p, observed=n12)
```

Now that the model is fully specified, we sample from its posterior distribution.

```
NJOBS = 3
SAMPLE_KWARGS = {
'draws': 1000,
'njobs': NJOBS,
'random_seed': [SEED + i for i in range(NJOBS)],
'nuts_kwargs': {'target_accept': 0.9}
}
```

```
with lp_model:
    lp_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [N_lowerbound__, p_interval__]
100%|██████████| 1500/1500 [00:06<00:00, 230.30it/s]
The number of effective samples is smaller than 25% for some parameters.
```

First we examine a few sampling diagnostics. The Bayesian fraction of missing information (BFMI) and energy plot give no cause for concern.

`pm.bfmi(lp_trace)`

`1.0584389661341544`

`pm.energyplot(lp_trace);`

The Gelman-Rubin statistics are close to one, indicating convergence.

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(lp_trace).values())`

`1.0040982739011743`
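As a reminder of what this statistic measures, the sketch below computes the classic (non-split) Gelman-Rubin statistic for one scalar parameter from several chains in plain Python. It illustrates the textbook formula and is not a reimplementation of `pm.gelman_rubin`; the function name and the toy chains are made up.

```python
from statistics import mean, variance


def gelman_rubin(chains):
    """Classic potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [mean(c) for c in chains]
    W = mean(variance(c) for c in chains)   # within-chain variance
    B = n * variance(chain_means)           # between-chain variance
    var_hat = (n - 1) / n * W + B / n       # pooled variance estimate
    return (var_hat / W) ** 0.5


# Chains exploring the same region give a value near one ...
agreeing = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]]
# ... while chains stuck in different regions give a much larger value.
disagreeing = [[0.1, -0.2, 0.3, 0.0], [5.0, 5.2, 4.9, 5.1]]
r_hat_agreeing = gelman_rubin(agreeing)
r_hat_disagreeing = gelman_rubin(disagreeing)
```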

Since there are no apparent sampling problems, we examine the estimate of the population size.

```
pm.plot_posterior(
lp_trace, varnames=['N'],
ref_val=N_LP,
lw=0., alpha=0.75
);
```

The true population size is well within the 95% credible interval. The posterior estimate of \(N\) could be made more precise by replacing the improper flat prior, which we have used here out of convenience, with a prior that places a reasonable upper bound on \(N\).

The Cormack-Jolly-Seber model estimates the survival dynamics of individuals in the population by relaxing the third assumption of the Lincoln-Petersen model, that the population is closed. Note that here an individual that fails to “survive” has not necessarily died, since the term also covers individuals that leave the observation area during the study. Despite this subtlety, we will use the convenient terminology of “alive” and “dead” throughout.

For the Cormack-Jolly-Seber and Jolly-Seber models, we follow the notation of *Analysis of Capture-Recapture Data* by McCrea and Morgan. Additionally, we will use a cormorant data set from the book to illustrate these two models. This data set involves eleven site visits where individuals were given individualized identifying marks. The data are defined below in the form of an \(M\)-array.

```
T = 10
R = np.array([30, 157, 174, 298, 470, 421, 413, 514, 430, 181])
M = np.zeros((T, T + 1))
M[0, 1:] = [10, 4, 2, 2, 0, 0, 0, 0, 0, 0]
M[1, 2:] = [42, 12, 16, 1, 0, 1, 1, 1, 0]
M[2, 3:] = [85, 22, 5, 5, 2, 1, 0, 1]
M[3, 4:] = [139, 39, 10, 10, 4, 2, 0]
M[4, 5:] = [175, 60, 22, 8, 4, 2]
M[5, 6:] = [159, 46, 16, 5, 2]
M[6, 7:] = [191, 39, 4, 8]
M[7, 8:] = [188, 19, 23]
M[8, 9:] = [101, 55]
M[9, 10] = 84
```

Here `T` indicates the number of revisits to the site, so there were \(T + 1 = 11\) total visits. The entries of `R` indicate the number of animals captured and released on each visit.

`R`

`array([ 30, 157, 174, 298, 470, 421, 413, 514, 430, 181])`

The entry `M[i, j]` indicates how many individuals captured on visit \(i\) were first recaptured on visit \(j\). The diagonal of `M` is entirely zero, since it is not possible to recapture an individual on the same visit as it was released.

`M[:4, :4]`

```
array([[ 0., 10., 4., 2.],
[ 0., 0., 42., 12.],
[ 0., 0., 0., 85.],
[ 0., 0., 0., 0.]])
```

Of the thirty individuals marked and released on the first visit, ten were first recaptured on the second visit, four were first recaptured on the third visit, and so on.

Capture-recapture data is often given in the form of encounter histories, for example

\[1001000100,\]

which indicates that the individual was first captured on the first visit and recaptured on the fourth and eighth visits. It is straightforward to convert a series of encounter histories to an \(M\)-array. We will discuss encounter histories again at the end of this post.
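A minimal sketch of this conversion in plain Python follows. The function name `histories_to_m_array` and the list-of-lists layout are assumptions made for illustration. Visits are zero-indexed, so row `i` and column `j` follow the same convention as `M` above, and the returned release counts cover every visit but the last.

```python
def histories_to_m_array(histories):
    """Convert encounter-history strings to (R, M).

    R[i] counts individuals captured and released on visit i, and M[i][j]
    counts those next recaptured on visit j (zero-indexed visits).
    """
    n_visits = len(histories[0])
    T = n_visits - 1
    R = [0] * T
    M = [[0] * n_visits for _ in range(T)]
    for history in histories:
        captures = [t for t, seen in enumerate(history) if seen == '1']
        # Each consecutive pair of captures is a release and a recapture.
        for i, j in zip(captures, captures[1:]):
            R[i] += 1
            M[i][j] += 1
        # The final capture is a release with no later recapture,
        # unless it happened on the last visit.
        if captures and captures[-1] < T:
            R[captures[-1]] += 1
    return R, M


# The history from the text plus one made-up history.
R_toy, M_toy = histories_to_m_array(['1001000100', '0101000000'])
```

As with the cormorant data, the number of individuals released on visit `i` and never seen again is `R_toy[i] - sum(M_toy[i])`.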

The parameters of the Cormack-Jolly-Seber model are \(p\), the capture probability, and \(\phi_i\), the probability that an individual that was alive during the \(i\)-th visit is still alive during the \((i + 1)\)-th visit. The capture probability can vary over time in the Cormack-Jolly-Seber model, but we use a constant capture probability here for simplicity.

We again place a uniform prior on \(p\).

```
with pm.Model() as cjs_model:
    p = pm.Uniform('p', 0., 1.)
```

We also place a uniform prior on \(\phi_i\).

```
with cjs_model:
    ϕ = pm.Uniform('ϕ', 0., 1., shape=T)
```

If \(\nu_{i, j}\) is the probability associated with `M[i, j]`, then

\[ \begin{align*} \nu_{i, j} & = P(\textrm{individual that was alive at visit } i \textrm{ is alive at visit } j) \\ & \times P(\textrm{individual was not captured on visits } i + 1, \ldots, j - 1) \\ & \times P(\textrm{individual is captured on visit } j). \end{align*} \]

From our parameter definitions,

\[P(\textrm{individual that was alive at visit } i \textrm{ is alive at visit } j) = \prod_{k = i}^{j - 1} \phi_k,\]

```
def fill_lower_diag_ones(x):
    return tt.triu(x) + tt.tril(tt.ones_like(x), k=-1)
```

```
with cjs_model:
    p_alive = tt.triu(
        tt.cumprod(
            fill_lower_diag_ones(np.ones_like(M[:, 1:]) * ϕ),
            axis=1
        )
    )
```

\[P(\textrm{individual was not captured at visits } i + 1, \ldots, j - 1) = (1 - p)^{j - i - 1},\]

```
i = np.arange(T)[:, np.newaxis]
j = np.arange(T + 1)[np.newaxis]
not_cap_visits = np.clip(j - i - 1, 0, np.inf)[:, 1:]
```

```
with cjs_model:
    p_not_cap = tt.triu(
        (1 - p)**not_cap_visits
    )
```

and

\[P(\textrm{individual is captured on visit } j) = p.\]

```
with cjs_model:
    ν = p_alive * p_not_cap * p
```

The likelihood of the observed recaptures is then

`triu_i, triu_j = np.triu_indices_from(M[:, 1:])`

```
with cjs_model:
    recap_obs = pm.Binomial(
        'recap_obs',
        M[:, 1:][triu_i, triu_j],
        ν[triu_i, triu_j],
        observed=M[:, 1:][triu_i, triu_j]
    )
```

Finally, some individuals released on each occasion are never recaptured.

`R - M.sum(axis=1)`

`array([ 12., 83., 53., 94., 199., 193., 171., 284., 274., 97.])`

The probability of this event is

\[\chi_i = P(\textrm{released on visit } i \textrm{ and not recaptured again}) = 1 - \sum_{j = i + 1}^T \nu_{i, j}.\]

```
with cjs_model:
    χ = 1 - ν.sum(axis=1)
```
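The same quantities can be written out with explicit loops, which may be easier to check against the formulas than the vectorized Theano version. This plain-Python sketch uses made-up values of \(\phi\) and \(p\), and `cell_probs` is a hypothetical name. Rows index the release visit \(i\) and columns the recapture visit \(j > i\), both zero-indexed.

```python
def cell_probs(phi, p):
    """Return (nu, chi) for survival probabilities phi and capture probability p.

    nu[i][j] is the probability that an individual released on visit i is
    next recaptured on visit j; chi[i] is the probability that it is never
    recaptured.
    """
    T = len(phi)
    nu = [[0.0] * (T + 1) for _ in range(T)]
    for i in range(T):
        alive = 1.0
        for j in range(i + 1, T + 1):
            alive *= phi[j - 1]  # survive from visit j - 1 to visit j
            # missed on the j - i - 1 intermediate visits, caught on visit j
            nu[i][j] = alive * (1 - p) ** (j - i - 1) * p
    chi = [1.0 - sum(row) for row in nu]
    return nu, chi


nu_toy, chi_toy = cell_probs(phi=[0.8, 0.6, 0.9], p=0.5)
```

Each row of `nu_toy` together with the matching entry of `chi_toy` sums to one, since an individual released on visit \(i\) is either recaptured for the first time on exactly one later visit or never recaptured.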

The likelihood of the individuals that were not recaptured again is

```
with cjs_model:
    no_recap_obs = pm.Binomial(
        'no_recap_obs',
        R - M.sum(axis=1), χ,
        observed=R - M.sum(axis=1)
    )
```

Now that the model is fully specified, we sample from its posterior distribution.

```
with cjs_model:
    cjs_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [ϕ_interval__, p_interval__]
100%|██████████| 1500/1500 [00:05<00:00, 272.61it/s]
```

Again, the BFMI and energy plot are reasonable.

`pm.bfmi(cjs_trace)`

`0.97973424667749842`

`pm.energyplot(cjs_trace);`

The Gelman-Rubin statistics also indicate convergence.

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(cjs_trace).values())`

`1.0009026695355843`

McCrea and Morgan’s book includes a table of maximum likelihood estimates for \(\phi_i\) and \(p\). We verify that our Bayesian estimates are close to these.

`ϕ_mle = np.array([0.8, 0.56, 0.83, 0.86, 0.73, 0.69, 0.81, 0.64, 0.46, 0.99])`

```
fig, ax = plt.subplots(figsize=(8, 6))
t_plot = np.arange(T) + 1
low, high = np.percentile(cjs_trace['ϕ'], [5, 95], axis=0)
ax.fill_between(
t_plot, low, high,
alpha=0.5, label="90% interval"
);
ax.plot(
t_plot, cjs_trace['ϕ'].mean(axis=0),
label="Posterior expected value"
);
ax.scatter(
t_plot, ϕ_mle,
zorder=5,
c='k', label="Maximum likelihood estimate"
);
ax.set_xlim(1, T);
ax.set_xlabel("$t$");
ax.yaxis.set_major_formatter(PCT_FORMATTER);
ax.set_ylabel(r"$\phi_t$");
ax.legend(loc=2);
```

`p_mle = 0.51`

```
ax = pm.plot_posterior(
cjs_trace, varnames=['p'],
ref_val=p_mle,
lw=0., alpha=0.75
)
ax.set_title("$p$");
```

The Jolly-Seber model is an extension of the Cormack-Jolly-Seber model (the fact that the extension is named after fewer people is a bit counterintuitive) that estimates abundance and birth dynamics, in addition to the survival dynamics estimated by the Cormack-Jolly-Seber model. As with the Cormack-Jolly-Seber model where “death” included leaving the site, “birth” includes not just the actual birth of new individuals, but individuals that arrive at the site during the study from elsewhere. Again, despite this subtlety, we will use the convenient terminology of “birth” and “born” throughout.

In order to estimate abundance and birth dynamics, the Jolly-Seber model adds likelihood terms for the first time an individual is captured to the recapture likelihood of the Cormack-Jolly-Seber model. We use the same uniform priors on \(p\) and \(\phi_i\) as in the Cormack-Jolly-Seber model.

```
with pm.Model() as js_model:
    p = pm.Uniform('p', 0., 1.)
    ϕ = pm.Uniform('ϕ', 0., 1., shape=T)
```

As with the Lincoln-Petersen model, the Jolly-Seber model estimates the size of the population, including all individuals ever alive during the study period, \(N\). We use the Schwarz-Arnason formulation of the Jolly-Seber model, where each individual has probability \(\beta_i\) of being born into the population between visits \(i\) and \(i + 1\). We place a \(\operatorname{Dirichlet}(1, \ldots, 1)\) prior on these parameters.

```
with js_model:
    β = pm.Dirichlet('β', np.ones(T), shape=T)
```

Let \(\Psi_i\) denote the probability that a given individual is alive on visit \(i\) and has not yet been captured before visit \(i\). Then \(\Psi_1 = \beta_0\), since no individuals can have been captured before the first visit, and

\[ \begin{align*} \Psi_{i + 1} & = P(\textrm{the individual was alive and unmarked during visit } i \textrm{ and survived to visit } i + 1) \\ & + P(\textrm{the individual was born between visits } i \textrm{ and } i + 1) \\ & = \Psi_i (1 - p) \phi_i + \beta_i. \end{align*} \]

After writing out the first few terms, we see that this recursion has the closed-form solution

\[\Psi_{i + 1} = \sum_{k = 0}^i \left(\beta_k (1 - p)^{i - k} \prod_{\ell = 1}^{i - k} \phi_{\ell} \right).\]

```
never_cap_surv_ix = sp.linalg.circulant(np.arange(T))
with js_model:
    p_never_cap_surv = tt.concatenate((
        [1], tt.cumprod((1 - p) * ϕ)[:-1]
    ))
    Ψ = tt.tril(
        β * p_never_cap_surv[never_cap_surv_ix]
    ).sum(axis=1)
```
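As a sanity check on the recursion, the following NumPy sketch evaluates it directly for fixed parameter values. It is independent of the PyMC3 model. The function name `psi_by_recursion`, the parameter values, and the convention that `phi[i]` is the survival probability applied when moving from entry `i` to entry `i + 1` of the zero-indexed result are all assumptions made here for illustration.

```python
import numpy as np


def psi_by_recursion(beta, p, phi):
    """Evaluate psi[0] = beta[0], psi[i + 1] = psi[i] * (1 - p) * phi[i] + beta[i + 1].

    psi[i] is the probability of being alive and unmarked at the (i + 1)-th visit.
    """
    beta = np.asarray(beta, dtype=float)
    phi = np.asarray(phi, dtype=float)
    psi = np.empty_like(beta)
    psi[0] = beta[0]
    for i in range(beta.size - 1):
        # Survive and stay unmarked, or enter the population before the next visit.
        psi[i + 1] = psi[i] * (1 - p) * phi[i] + beta[i + 1]
    return psi


# Hypothetical values: four visits, uniform entry probabilities, constant survival.
beta = np.full(4, 0.25)
psi = psi_by_recursion(beta, p=0.5, phi=np.full(4, 0.8))
```

When survival is constant, each entry reduces to the geometric sum \(\sum_k \beta_k \left((1 - p) \phi\right)^{i - k}\), which gives a quick way to check the implementation.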

The probability that an unmarked individual that is alive at visit \(i\) is captured on visit \(i\) is then \(\Psi_i p\). The probability that an individual is never captured during the study period is \[1 - \sum_{i = 1}^T \Psi_i p.\]

Therefore, the likelihood of the observed first captures is a \((T + 1)\)-dimensional multinomial, where the first \(T\) probabilities are \(\Psi_1 p, \ldots, \Psi_T p\), and the corresponding first \(T\) counts are the observed number of unmarked individuals captured at each visit, \(u_i\). The final probability is

\[1 - \sum_{i = 1}^T \Psi_i p\]

and corresponds to the unobserved number of individuals never captured. Since PyMC3 does not implement such an “incomplete multinomial” distribution, we give a minimal implementation here.

```
from pymc3.distributions.dist_math import bound, factln


class IncompleteMultinomial(pm.Discrete):
    def __init__(self, n, p, *args, **kwargs):
        """
        n is the total frequency
        p is the vector of probabilities of the observed components
        """
        super(IncompleteMultinomial, self).__init__(*args, **kwargs)
        self.n = n
        self.p = p
        # expected counts of the observed components
        self.mean = n * p
        self.mode = tt.cast(tt.round(n * p), 'int32')

    def logp(self, x):
        """
        x is the vector of frequencies of all but the last component
        """
        n = self.n
        p = self.p
        # the unobserved count of the final component
        x_last = n - x.sum()
        return bound(
            factln(n) + tt.sum(x * tt.log(p) - factln(x))
            + x_last * tt.log(1 - p.sum()) - factln(x_last),
            tt.all(x >= 0), tt.all(x <= n), tt.sum(x) <= n,
            n >= 0)
```
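To see that this log-probability is just the ordinary multinomial with the last count implied, here is a small NumPy/SciPy sketch of the same formula. The function name `incomplete_multinomial_logpmf` and the counts and probabilities are hypothetical, chosen only for illustration; the comparison uses `scipy.stats.multinomial` with the implied final category appended.

```python
import numpy as np
from scipy import stats
from scipy.special import gammaln


def incomplete_multinomial_logpmf(x, n, p):
    """Log-probability of the observed counts x when the final count is n - x.sum()."""
    x = np.asarray(x, dtype=float)
    p = np.asarray(p, dtype=float)
    x_last = n - x.sum()
    return (
        gammaln(n + 1)
        + np.sum(x * np.log(p) - gammaln(x + 1))
        + x_last * np.log(1 - p.sum())
        - gammaln(x_last + 1)
    )


# Hypothetical counts and probabilities, chosen only for illustration.
x = np.array([3, 2, 1])
p = np.array([0.2, 0.1, 0.05])
n = 10

ours = incomplete_multinomial_logpmf(x, n, p)
# Full multinomial with the implied last count and last probability appended.
full = stats.multinomial.logpmf(
    np.append(x, n - x.sum()), n, np.append(p, 1 - p.sum())
)
```

The two values agree up to floating point error. In the model itself \(N\) is a parameter rather than a fixed integer, which is why the PyMC3 version is written in terms of `factln` instead of relying on a built-in multinomial.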

As in the Lincoln-Petersen model, we place an improper flat prior (with the appropriate lower bound) on \(N\).

`u = np.concatenate(([R[0]], R[1:] - M[:, 1:].sum(axis=0)[:-1]))`

```
with js_model:
    N = pm.Bound(pm.Flat, lower=u.sum())('N')
```

The likelihood of the observed first captures is therefore

```
with js_model:
    unmarked_obs = IncompleteMultinomial(
        'unmarked_obs', N, Ψ * p,
        observed=u
    )
```

The recapture likelihood for the Jolly-Seber model is the same as for the Cormack-Jolly-Seber model.

```
with js_model:
    p_alive = tt.triu(
        tt.cumprod(
            fill_lower_diag_ones(np.ones_like(M[:, 1:]) * ϕ),
            axis=1
        )
    )
    p_not_cap = tt.triu(
        (1 - p)**not_cap_visits
    )
    ν = p_alive * p_not_cap * p
    recap_obs = pm.Binomial(
        'recap_obs',
        M[:, 1:][triu_i, triu_j], ν[triu_i, triu_j],
        observed=M[:, 1:][triu_i, triu_j]
    )
```

```
with js_model:
    χ = 1 - ν.sum(axis=1)
    no_recap_obs = pm.Binomial(
        'no_recap_obs',
        R - M.sum(axis=1), χ,
        observed=R - M.sum(axis=1)
    )
```

Again we sample from the posterior distribution of this model.

```
with js_model:
    js_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [N_lowerbound__, β_stickbreaking__, ϕ_interval__, p_interval__]
100%|██████████| 1500/1500 [00:29<00:00, 50.24it/s]
```

Again, the BFMI and energy plot are reasonable.

`pm.bfmi(js_trace)`

`0.93076073875335852`

`pm.energyplot(js_trace);`

The Gelman-Rubin statistics also indicate convergence.

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(js_trace).values())`

`1.0051087141503718`

The posterior expected survival rates are, somewhat surprisingly, still quite similar to the maximum likelihood estimates under the Cormack-Jolly-Seber model.

```
fig, ax = plt.subplots(figsize=(8, 6))
low, high = np.percentile(js_trace['ϕ'], [5, 95], axis=0)
ax.fill_between(
t_plot, low, high,
alpha=0.5, label="90% interval"
);
ax.plot(
t_plot, js_trace['ϕ'].mean(axis=0),
label="Posterior expected value"
);
ax.scatter(
t_plot, ϕ_mle,
zorder=5,
c='k', label="Maximum likelihood estimate (CJS)"
);
ax.set_xlim(1, T);
ax.set_xlabel("$t$");
ax.yaxis.set_major_formatter(PCT_FORMATTER);
ax.set_ylabel(r"$\phi_t$");
ax.legend(loc="upper center");
```

The following plot shows the estimated birth dynamics.

```
fig, ax = plt.subplots(figsize=(8, 6))
low, high = np.percentile(js_trace['β'], [5, 95], axis=0)
ax.fill_between(
t_plot - 1, low, high,
alpha=0.5, label="90% interval"
);
ax.plot(
t_plot - 1, js_trace['β'].mean(axis=0),
label="Posterior expected value"
);
ax.set_xlim(0, T - 1);
ax.set_xlabel("$t$");
ax.yaxis.set_major_formatter(PCT_FORMATTER);
ax.set_ylabel(r"$\beta_t$");
ax.legend(loc=2);
```

The posterior expected population size is about 30% larger than the number of distinct individuals marked.

`js_trace['N'].mean() / u.sum()`

`1.2951923918464066`

```
pm.plot_posterior(
js_trace, varnames=['N'],
lw=0., alpha=0.75
);
```

Now that we have estimated these three models, we return briefly to the topic of \(M\)-arrays versus encounter histories. While \(M\)-arrays are a convenient summary of encounter histories, they do not lend themselves as readily as encounter histories to common extensions of these models, such as individual random effects or trap-dependent recapture. Two possible approaches to include such effects are:

- Use likelihoods for the (Cormack-)Jolly-Seber models based on encounter histories, which are a bit more complex than those based on \(M\)-arrays.
- Individual \(M\)-arrays: transform each individual’s encounter history into an \(M\)-array and stack them into a three-dimensional array of \(M\)-arrays.
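
The second approach can be sketched in a few lines of NumPy. Everything below is an assumption made for illustration rather than the layout used earlier in this post: the function name `individual_m_array`, the `(T - 1, T)` shape with a final "never recaptured" column, and the two example encounter histories.

```python
import numpy as np


def individual_m_array(history):
    """Convert one 0/1 encounter history of length T into a (T - 1, T) array.

    With zero-indexed visits, entry [i, j] for j < T - 1 counts a release at
    visit i followed by the next recapture at visit j + 1; column T - 1 counts
    a release that was never followed by a recapture. (Assumed layout.)
    """
    history = np.asarray(history)
    T = history.size
    m = np.zeros((T - 1, T), dtype=int)
    caught = np.flatnonzero(history)
    # Consecutive captures form (release, next recapture) pairs.
    for release, recapture in zip(caught[:-1], caught[1:]):
        m[release, recapture - 1] += 1
    # A final capture before the last visit is a release with no recapture.
    if caught.size > 0 and caught[-1] < T - 1:
        m[caught[-1], T - 1] += 1
    return m


# Two hypothetical encounter histories over four visits.
histories = np.array([
    [1, 0, 1, 0],
    [0, 1, 1, 1],
])
# Stack into a three-dimensional array: (individual, release visit, outcome).
m_arrays = np.stack([individual_m_array(h) for h in histories])
```

Summing `m_arrays` over its first axis recovers a population-level array in the same assumed layout, so individual-level covariates can be attached without losing the aggregate summary.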

We may explore one (or both) of these approaches in a future post.

Thanks to Eric Heydenberk for his feedback on an early draft of this post.

This post is available as a Jupyter notebook here.

Since December 2014, I have tracked the books I read in a Google spreadsheet. It recently occurred to me to use this data to quantify how my reading habits have changed over time. This post will use PyMC3 to model my reading habits.

`%matplotlib inline`

`from itertools import product`

```
from matplotlib import dates as mdates
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import shared, tensor as tt
```

`sns.set()`

`SEED = 27432 # from random.org, for reproducibility`

First we load the data from the Google Spreadsheet. Conveniently, `pandas` can load CSVs from a web link.

`GDOC_URI = 'https://docs.google.com/spreadsheets/d/1wNbJv1Zf4Oichj3-dEQXE_lXVCwuYQjaoyU1gGQQqk4/export?gid=0&format=csv'`

```
raw_df = (pd.read_csv(
GDOC_URI,
usecols=[
'Year', 'Month', 'Day',
'End Year', 'End Month', 'End Day',
'Headline', 'Text'
]
)
.dropna(axis=1, how='all')
.dropna(axis=0))
```

`raw_df.head()`

| | Year | Month | Day | End Year | End Month | End Day | Headline | Text |
|---|---|---|---|---|---|---|---|---|
| 0 | 2014 | 12 | 13 | 2014.0 | 12.0 | 23.0 | The Bloody Chamber | Angela Carter, 126 pages |
| 1 | 2014 | 12 | 23 | 2015.0 | 1.0 | 4.0 | The Last Place on Earth | Roland Huntford, 564 pages |
| 2 | 2015 | 1 | 24 | 2015.0 | 2.0 | 13.0 | Empire Falls | Richard Russo, 483 pages |
| 3 | 2015 | 2 | 14 | 2015.0 | 2.0 | 20.0 | Wonder Boys | Michael Chabon, 368 pages |
| 4 | 2015 | 2 | 25 | 2015.0 | 3.0 | 4.0 | Red State, Blue State, Rich State, Poor State:… | Andrew Gelman, 196 pages |

`raw_df.tail()`

| | Year | Month | Day | End Year | End Month | End Day | Headline | Text |
|---|---|---|---|---|---|---|---|---|
| 58 | 2017 | 9 | 16 | 2017.0 | 10.0 | 14.0 | Civilization of the Middle Ages | Norman F. Cantor, 566 pages |
| 59 | 2017 | 10 | 14 | 2017.0 | 10.0 | 16.0 | The Bloody Chamber | Angela Carter, 126 pages |
| 60 | 2017 | 10 | 16 | 2017.0 | 10.0 | 27.0 | Big Data Baseball | Travis Sawchik, 233 pages |
| 61 | 2017 | 10 | 27 | 2017.0 | 12.0 | 7.0 | The History of Statistics: The Measurement of … | Stephen M. Stigler, 361 pages |
| 62 | 2017 | 12 | 8 | 2017.0 | 12.0 | 21.0 | An Arsonist’s Guide to Writers’ Homes in New E… | Brock Clarke, 303 pages |

The spreadsheet is formatted for use with Knight Lab’s excellent TimelineJS package. We transform the data to a more useful format for our purposes.

```
df = pd.DataFrame({
'start_date': raw_df.apply(
lambda s: pd.datetime(
s['Year'],
s['Month'],
s['Day']
),
axis=1
),
'end_date': raw_df.apply(
lambda s: pd.datetime(
int(s['End Year']),
int(s['End Month']),
int(s['End Day'])
),
axis=1
),
'title': raw_df['Headline'],
'author': (raw_df['Text']
.str.extract('(.*),.*', expand=True)
.iloc[:, 0]),
'pages': (raw_df['Text']
.str.extract(r'.*, (\d+) pages', expand=False)
.astype(np.int64))
})
df['days'] = (df['end_date']
.sub(df['start_date'])
.dt.days)
df = df[[
'author', 'title',
'start_date', 'end_date', 'days',
'pages'
]]
```

Each row of the dataframe corresponds to a book I have read, and the columns are

- `author`, the book’s author,
- `title`, the book’s title,
- `start_date`, the date I started reading the book,
- `end_date`, the date I finished reading the book,
- `days`, the number of days it took me to read the book, and
- `pages`, the number of pages in the book.

`df.head()`

| | author | title | start_date | end_date | days | pages |
|---|---|---|---|---|---|---|
| 0 | Angela Carter | The Bloody Chamber | 2014-12-13 | 2014-12-23 | 10 | 126 |
| 1 | Roland Huntford | The Last Place on Earth | 2014-12-23 | 2015-01-04 | 12 | 564 |
| 2 | Richard Russo | Empire Falls | 2015-01-24 | 2015-02-13 | 20 | 483 |
| 3 | Michael Chabon | Wonder Boys | 2015-02-14 | 2015-02-20 | 6 | 368 |
| 4 | Andrew Gelman | Red State, Blue State, Rich State, Poor State:… | 2015-02-25 | 2015-03-04 | 7 | 196 |

`df.tail()`

| | author | title | start_date | end_date | days | pages |
|---|---|---|---|---|---|---|
| 58 | Norman F. Cantor | Civilization of the Middle Ages | 2017-09-16 | 2017-10-14 | 28 | 566 |
| 59 | Angela Carter | The Bloody Chamber | 2017-10-14 | 2017-10-16 | 2 | 126 |
| 60 | Travis Sawchik | Big Data Baseball | 2017-10-16 | 2017-10-27 | 11 | 233 |
| 61 | Stephen M. Stigler | The History of Statistics: The Measurement of … | 2017-10-27 | 2017-12-07 | 41 | 361 |
| 62 | Brock Clarke | An Arsonist’s Guide to Writers’ Homes in New E… | 2017-12-08 | 2017-12-21 | 13 | 303 |

We will model the number of days it takes me to read a book using count regression models based on the number of pages. It would also be reasonable to analyze this data using survival models.

While Poisson regression is perhaps the simplest count regression model, we see that these data are fairly overdispersed

`df['days'].var() / df['days'].mean()`

`14.466643655077199`

so negative binomial regression is more appropriate. We further verify that negative binomial regression is appropriate by plotting the logarithm of the number of pages versus the logarithm of the number of days it took me to read the book, since the logarithm is the link function for the negative binomial GLM.

```
fig, ax = plt.subplots(figsize=(8, 6))
df.plot(
'pages', 'days',
s=40, kind='scatter',
ax=ax
);
ax.set_xscale('log');
ax.set_xlabel("Number of pages");
ax.set_yscale('log');
ax.set_ylim(top=1.1e2);
ax.set_ylabel("Number of days to read");
```

This approximately linear relationship confirms the suitability of a negative binomial model.
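
To make the overdispersion check concrete, the following sketch simulates Poisson and negative binomial counts and compares their variance-to-mean ratios. The mean `mu`, the dispersion `alpha`, the seed, and the helper name `dispersion_index` are hypothetical; the negative binomial is parameterized so that its variance is \(\mu + \mu^2 / \alpha\), the same mean and dispersion parameterization used in the model below.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for this illustration
mu, alpha = 15.0, 2.0  # hypothetical mean and dispersion

# NumPy parameterizes the negative binomial by (n, p); this choice gives
# mean mu and variance mu + mu**2 / alpha.
nb_draws = rng.negative_binomial(alpha, alpha / (alpha + mu), size=100_000)
poisson_draws = rng.poisson(mu, size=100_000)


def dispersion_index(counts):
    """Sample variance divided by sample mean."""
    return counts.var(ddof=1) / counts.mean()


poisson_ratio = dispersion_index(poisson_draws)  # close to 1
nb_ratio = dispersion_index(nb_draws)  # close to 1 + mu / alpha
```

A ratio far above one, like the value computed from the reading data, is what a Poisson model cannot reproduce and a negative binomial model can.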

Now we introduce some notation. Let \(y_i\) be the number of days it took me to read the \(i\)-th book and \(x^{\textrm{pages}}_i\) be the (standardized) logarithm of the number of pages in the \(i\)-th book. Our first model is

\[ \begin{align*} \beta^0, \beta^{\textrm{pages}} & \sim N(0, 5^2) \\ \theta_i & = \beta^0 + \beta^{\textrm{pages}} \cdot x^{\textrm{pages}}_i \\ \mu_i & = \exp({\theta_i}) \\ \alpha & \sim \operatorname{Lognormal}(0, 5^2) \\ y_i - 1 & \sim \operatorname{NegativeBinomial}(\mu_i, \alpha). \end{align*} \]

This model is expressed in PyMC3 below.

```
days = df['days'].values
pages = df['pages'].values
pages_ = shared(pages)
```

```
with pm.Model() as nb_model:
    β0 = pm.Normal('β0', 0., 5.)
    β_pages = pm.Normal('β_pages', 0., 5.)
    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()
    θ = β0 + β_pages * log_pages_std
    μ = tt.exp(θ)
    α = pm.Lognormal('α', 0., 5.)
    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
```

We now sample from the model’s posterior distribution.

```
NJOBS = 3
SAMPLE_KWARGS = {
'njobs': NJOBS,
'random_seed': [SEED + i for i in range(NJOBS)]
}
```

```
with nb_model:
    nb_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β0]
100%|██████████| 1000/1000 [00:05<00:00, 190.41it/s]
```

We check a few convergence diagnostics. The BFMI and energy distributions for our samples show no cause for concern.

```
ax = pm.energyplot(nb_trace)
bfmi = pm.bfmi(nb_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
```

The Gelman-Rubin statistics indicate that the chains have converged.

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(nb_trace).values())`

`1.0003248725601743`

We use the posterior samples to make predictions so that we can examine residuals.

```
with nb_model:
    nb_pred_trace = pm.sample_ppc(nb_trace)
nb_pred_days = nb_pred_trace['days_obs'].mean(axis=0)
```

`100%|██████████| 500/500 [00:00<00:00, 1114.32it/s]`

Since the mean and variance of the negative binomial distribution are related, we use standardized residuals to untangle this relationship.

`nb_std_resid = (days - nb_pred_days) / nb_pred_trace['days_obs'].std(axis=0)`

We visualize these standardized residuals below.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(nb_pred_days, nb_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 75,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 75);
ax.set_xlabel("Predicted number of days to read");
ax.set_ylabel("Standardized residual");
ax.legend(loc=1);
```

If the model is correct, approximately 95% of the residuals should lie between the dashed horizontal lines, and indeed most residuals are in this band.

We also plot the standardized residuals against the number of pages in the book, and notice no troubling patterns.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df['pages'], nb_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 900,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Standardized residual");
ax.legend(loc=1);
```

We now examine this model’s predictions directly by sampling from the posterior predictive distribution.

```
PP_PAGES = np.linspace(1, 1000, 300, dtype=np.int64)
pages_.set_value(PP_PAGES)
with nb_model:
    pp_nb_trace = pm.sample_ppc(nb_trace, samples=5000)
```

`100%|██████████| 5000/5000 [00:06<00:00, 825.67it/s] `

```
fig, ax = plt.subplots(figsize=(8, 6))
ALPHA = 0.05
low, high = np.percentile(
pp_nb_trace['days_obs'],
[100 * ALPHA / 2, 100 * (1 - ALPHA / 2)],
axis=0
)
ax.fill_between(
PP_PAGES, low, high,
alpha=0.35,
label=f"{1 - ALPHA:.0%} credible interval"
);
ax.plot(
PP_PAGES, pp_nb_trace['days_obs'].mean(axis=0),
label="Posterior expected value"
);
df.plot(
'pages', 'days',
s=40, c='k',
kind='scatter',
label="Observed",
ax=ax
);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");
ax.legend(loc=2);
```

We see that most of the observations fall within the 95% credible interval. An important feature of negative binomial regression is that the credible intervals expand as the predictions get larger. This feature is reflected in the fact that the predictions are less accurate for longer books.

One advantage to working with such a personal data set is that I can explain the factors that led to certain outliers. Below are the four books that I read at the slowest average rate of pages per day.

```
(df.assign(pages_per_day=df['pages'] / df['days'])
.nsmallest(4, 'pages_per_day')
[['title', 'author', 'start_date', 'pages_per_day']])
```

| | title | author | start_date | pages_per_day |
|---|---|---|---|---|
| 41 | The Handmaid’s Tale | Margaret Atwood | 2016-10-16 | 4.382353 |
| 48 | The Shadow of the Torturer | Gene Wolf | 2017-03-11 | 7.046512 |
| 24 | The Song of the Lark | Willa Cather | 2016-02-14 | 7.446429 |
| 61 | The History of Statistics: The Measurement of … | Stephen M. Stigler | 2017-10-27 | 8.804878 |

Several of these books make sense; I found *The Shadow of the Torturer* to be an unpleasant slog and *The History of Statistics* was quite technical and dense. On the other hand, *The Handmaid’s Tale* and *The Song of the Lark* were both quite enjoyable, but my time reading them coincided with other notable life events. I was reading *The Handmaid’s Tale* when certain unfortunate American political developments distracted me for several weeks in November 2016, and I was reading *The Song of the Lark* when a family member passed away in March 2016.

We modify the negative binomial regression model to include special factors for *The Handmaid’s Tale* and *The Song of the Lark*, in order to mitigate the influence of these unusual circumstances on our parameter estimates.

We let

\[ x^{\textrm{handmaid}}_i = \begin{cases} 1 & \textrm{if the } i\textrm{-th book is }\textit{The Handmaid's Tale} \\ 0 & \textrm{if the } i\textrm{-th book is not }\textit{The Handmaid's Tale} \end{cases}, \]

and similarly for \(x^{\textrm{lark}}_i\). We add the terms

\[ \begin{align*} \beta^{\textrm{handmaid}}, \beta^{\textrm{lark}} & \sim N(0, 5^2) \\ \beta^{\textrm{book}}_i & = \beta^{\textrm{handmaid}} \cdot x^{\textrm{handmaid}}_i + \beta^{\textrm{lark}} \cdot x^{\textrm{lark}}_i \\ \theta_i & = \beta^0 + \beta^{\textrm{book}}_i + \beta^{\textrm{pages}} \cdot x^{\textrm{pages}}_i \end{align*} \]

to the model below.

```
is_lark = (df['title']
.eq("The Song of the Lark")
.mul(1.)
.values)
is_lark_ = shared(is_lark)
is_handmaid = (df['title']
.eq("The Handmaid's Tale")
.mul(1.)
.values)
is_handmaid_ = shared(is_handmaid)
```

`pages_.set_value(pages)`

```
with pm.Model() as book_model:
    β0 = pm.Normal('β0', 0., 5.)
    β_lark = pm.Normal('β_lark', 0., 5.)
    β_handmaid = pm.Normal('β_handmaid', 0., 5.)
    β_book = β_lark * is_lark_ + β_handmaid * is_handmaid_
    β_pages = pm.Normal('β_pages', 0., 5.)
    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()
    θ = β0 + β_book + β_pages * log_pages_std
    μ = tt.exp(θ)
    α = pm.Lognormal('α', 0., 5.)
    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
```

We now sample from the model’s posterior distribution.

```
with book_model:
    book_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β_handmaid, β_lark, β0]
100%|██████████| 1000/1000 [00:12<00:00, 77.79it/s]
```

Again, the BFMI, energy plots and Gelman-Rubin statistics indicate convergence.

```
ax = pm.energyplot(book_trace)
bfmi = pm.bfmi(book_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
```

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(book_trace).values())`

`1.0012914523533143`

We see that the special factors for *The Handmaid’s Tale* and *The Song of the Lark* were indeed notable.

```
pm.forestplot(
book_trace, varnames=['β_handmaid', 'β_lark'],
chain_spacing=0.025,
rhat=False
);
```

Again, we calculate the model’s predictions in order to examine standardized residuals.

```
with book_model:
    book_pred_trace = pm.sample_ppc(book_trace)
book_pred_days = book_pred_trace['days_obs'].mean(axis=0)
```

`100%|██████████| 500/500 [00:01<00:00, 463.09it/s]`

`book_std_resid = (days - book_pred_days) / book_pred_trace['days_obs'].std(axis=0)`

Both standardized residual plots show no cause for concern.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(book_pred_days, book_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 120,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 120);
ax.set_xlabel("Predicted number of days to read");
ax.set_ylabel("Standardized residual");
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df['pages'], book_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 900,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Standardized residual");
```

Since we now have two models, we use WAIC to compare them.

```
compare_df = (pm.compare(
[nb_trace, book_trace],
[nb_model, book_model]
)
.rename(index={0: 'NB', 1: 'Book'}))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
pm.compareplot(
compare_df,
insample_dev=False, dse=False,
ax=ax
);
ax.set_xlabel("WAIC");
ax.set_ylabel("Model");
```

Since lower WAIC is better, we prefer the model with book effects, although not conclusively.
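
For readers unfamiliar with WAIC, the sketch below computes it from a matrix of pointwise log-likelihoods, one row per posterior sample and one column per observation. It is a minimal version of the standard definition on the deviance scale and may differ in detail (for example, in the variance estimator) from what `pm.compare` computes; the function name `waic` and the toy input are assumptions made here.

```python
import numpy as np
from scipy.special import logsumexp


def waic(log_lik):
    """WAIC on the deviance scale from an (n_samples, n_obs) log-likelihood array."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_samples = log_lik.shape[0]
    # Log pointwise predictive density: log of the posterior-mean likelihood.
    lppd = np.sum(logsumexp(log_lik, axis=0) - np.log(n_samples))
    # Effective number of parameters: posterior variance of the log-likelihood.
    p_waic = np.sum(np.var(log_lik, axis=0))
    return -2 * (lppd - p_waic)


# Toy input: ten identical posterior samples for two observations.
toy_log_lik = np.tile(np.log([0.5, 0.25]), (10, 1))
toy_waic = waic(toy_log_lik)
```

With no posterior variation the penalty term vanishes and WAIC reduces to minus twice the log-likelihood, which is why more posterior uncertainty in the pointwise fit is penalized.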

Again, we examine this model’s predictions directly by sampling from the posterior predictive distribution.

```
pages_.set_value(PP_PAGES)
is_handmaid_.set_value(np.zeros_like(PP_PAGES))
is_lark_.set_value(np.zeros_like(PP_PAGES))
with book_model:
    pp_book_trace = pm.sample_ppc(book_trace, samples=5000)
```

`100%|██████████| 5000/5000 [00:07<00:00, 640.09it/s]`

```
fig, ax = plt.subplots(figsize=(8, 6))
low, high = np.percentile(
pp_book_trace['days_obs'],
[100 * ALPHA / 2, 100 * (1 - ALPHA / 2)],
axis=0
)
ax.fill_between(
PP_PAGES, low, high,
alpha=0.35,
label=f"{1 - ALPHA:.0%} credible interval"
);
ax.plot(
PP_PAGES, pp_book_trace['days_obs'].mean(axis=0),
label="Posterior expected value"
);
df.plot(
'pages', 'days',
s=40, c='k',
kind='scatter',
label="Observed",
ax=ax
);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");
ax.legend(loc=2);
```

The predictions are visually similar to those of the previous model. The plot below compares the two models’ predictions directly.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(
PP_PAGES, pp_nb_trace['days_obs'].mean(axis=0),
label="Negative binomial model"
);
ax.plot(
PP_PAGES, pp_book_trace['days_obs'].mean(axis=0),
label="Book effect model"
);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Number of days to read");
ax.legend(title="Posterior expected value", loc=2);
```

The predictions are quite similar, with the book effect model predicting slightly shorter durations. This makes sense, as that model explicitly accounts for two books that took me an unusually long time to read.

We now turn to the goal of this post, quantifying how my reading habits have changed over time. For computational simplicity, we operate on a time scale of weeks. Therefore, for each book, we calculate the number of weeks from the beginning of the observation period to when I started reading it.

```
t_week = (df['start_date']
.sub(df['start_date'].min())
.dt.days
.floordiv(7)
.values)
t_week_ = shared(t_week)
n_week = t_week.max() + 1
```

We let the intercept \(\beta^0\) and the coefficient \(\beta^{\textrm{pages}}\) of the (standardized) log number of pages vary over time. We give these time-varying coefficients Gaussian random walk priors,

\[ \begin{align*} \beta^0_t \sim N(\beta^0_{t - 1}, 10^{-2}), \\ \beta^{\textrm{pages}}_t \sim N(\beta^{\textrm{pages}}_{t - 1}, 10^{-2}). \end{align*} \]

The small drift scale of \(10^{-1}\) is justified by the intuition that reading habits should change gradually.
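
To get a feel for how much such a prior lets a coefficient move, the following sketch simulates Gaussian random walks with the same step scale. The number of steps, the number of simulated walks, the seed, and the choice to start each walk at zero are assumptions for illustration and are not taken from the model.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed
n_steps, drift_sd = 160, 0.1  # hypothetical number of weeks; sd matches the prior

# A Gaussian random walk is the cumulative sum of independent normal increments.
increments = rng.normal(0.0, drift_sd, size=(1000, n_steps))
walks = increments.cumsum(axis=1)

# The marginal standard deviation after t steps grows like drift_sd * sqrt(t).
final_sd = walks[:, -1].std()
```

Over roughly three years of weekly steps the coefficient can drift by about one unit on the log scale, while week-to-week changes stay small.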

```
pages_.set_value(pages)
is_handmaid_.set_value(is_handmaid)
is_lark_.set_value(is_lark)
```

```
with pm.Model() as time_model:
    β0 = pm.GaussianRandomWalk(
        'β0', sd=0.1, shape=n_week
    )
    β_lark = pm.Normal('β_lark', 0., 5.)
    β_handmaid = pm.Normal('β_handmaid', 0., 5.)
    β_book = β_lark * is_lark_ + β_handmaid * is_handmaid_
    β_pages = pm.GaussianRandomWalk(
        'β_pages', sd=0.1, shape=n_week
    )
    log_pages = tt.log(pages_)
    log_pages_std = (log_pages - log_pages.mean()) / log_pages.std()
    θ = β0[t_week_] + β_book + β_pages[t_week_] * log_pages_std
    μ = tt.exp(θ)
    α = pm.Lognormal('α', 0., 5.)
    days_obs = pm.NegativeBinomial('days_obs', μ, α, observed=days - 1)
```

Again, we sample from the model’s posterior distribution.

```
with time_model:
    time_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [α_log__, β_pages, β_handmaid, β_lark, β0]
100%|██████████| 1000/1000 [03:26<00:00, 4.85it/s]
```

Again, the BFMI, energy plots, and Gelman-Rubin statistics indicate convergence.

```
ax = pm.energyplot(time_trace)
bfmi = pm.bfmi(time_trace)
ax.set_title(f"BFMI = {bfmi:.2f}");
```

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(time_trace).values())`

`1.0038733152191415`

Once more, we examine the model’s standardized residuals.

```
with time_model:
    time_pred_trace = pm.sample_ppc(time_trace)
time_pred_days = time_pred_trace['days_obs'].mean(axis=0)
```

`100%|██████████| 500/500 [00:00<00:00, 913.23it/s]`

`time_std_resid = (days - time_pred_days) / time_pred_trace['days_obs'].std(axis=0)`

In general, the standardized residuals are now smaller and fewer are outside of the 95% confidence band.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(time_pred_days, time_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 120,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 120);
ax.set_xlabel("Predicted number of days to read");
ax.set_ylabel("Standardized residual");
ax.legend(loc=1);
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df['pages'], time_std_resid);
ax.hlines(
sp.stats.norm.isf([0.975, 0.025]),
0, 900,
linestyles='--', label="95% confidence band"
);
ax.set_xlim(0, 900);
ax.set_xlabel("Number of pages");
ax.set_ylabel("Standardized residual");
ax.legend(loc=1);
```

Again, we use WAIC to compare the three models.

```
compare_df = (pm.compare(
[nb_trace, book_trace, time_trace],
[nb_model, book_model, time_model]
)
.rename(index={0: 'NB', 1: 'Book', 2: 'Time'}))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
pm.compareplot(
compare_df,
insample_dev=False, dse=False,
ax=ax
);
ax.set_xlabel("WAIC");
ax.set_ylabel("Model");
```

The timeseries model performs marginally worse than the previous model. We proceed since only the timeseries model answers our original question.

We now use the timeseries model to show how the amount of time it takes me to read a book has changed over time, conditioned on the length of the book.

```
t_grid = np.linspace(
mdates.date2num(df['start_date'].min()),
mdates.date2num(df['start_date'].max()),
n_week
)
```

```
PP_TIME_PAGES = np.array([100, 200, 300, 400, 500])
pp_df = (pd.DataFrame(
list(product(
np.arange(n_week),
PP_TIME_PAGES
)),
columns=['t_week', 'pages']
)
.assign(
is_handmaid=0,
is_lark=0
))
```

```
is_handmaid_.set_value(pp_df['is_handmaid'].values)
is_lark_.set_value(pp_df['is_lark'].values)
t_week_.set_value(pp_df['t_week'].values)
pages_.set_value(pp_df['pages'].values)
```

```
with time_model:
    pp_time_trace = pm.sample_ppc(time_trace, samples=10000)
```

`100%|██████████| 10000/10000 [00:12<00:00, 791.11it/s]`

```
pp_df['pp_days'] = pp_time_trace['days_obs'].mean(axis=0)
pp_df['t_plot'] = np.repeat(t_grid, len(PP_TIME_PAGES))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
for grp_pages, grp_df in pp_df.groupby('pages'):
    grp_df.plot(
        't_plot', 'pp_days',
        label=f"{grp_pages} pages",
        ax=ax
    );
ax.set_xlim(t_grid.min(), t_grid.max());
ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %Y'));
ax.xaxis.set_major_locator(mdates.MonthLocator(interval=6));
ax.xaxis.label.set_visible(False);
ax.set_ylabel("Predicted number of days to read");
```

The plot above exhibits a fascinating pattern; according to the timeseries model, I now read shorter books (fewer than approximately 300 pages) slightly faster than I did in 2015, but it takes me twice as long as before to read longer books. The trend for longer books is easier to explain; in the last 12-18 months, I have been doing much more public speaking and blogging than before, which naturally takes time away from reading. The trend for shorter books is a bit harder to explain, but upon some thought, I tend to read more purposefully as I approach the end of a book, looking forward to starting a new one. This effect occurs much earlier in shorter books than in longer ones, so it is a plausible explanation for the trend in shorter books.

This post is available as a Jupyter notebook here.

Survival analysis studies the distribution of the time between when a subject comes under observation and when that subject experiences an event of interest. One of the fundamental challenges of survival analysis (which also makes it mathematically interesting) is that, in general, not every subject will experience the event of interest before we conduct our analysis. In more concrete terms, if we are studying the time between cancer treatment and death (as we will in this post), we will often want to analyze our data before every subject has died. This phenomenon is called censoring and is fundamental to survival analysis.

I have previously written about Bayesian survival analysis using the semiparametric Cox proportional hazards model. Implementing that semiparametric model in PyMC3 involved some fairly complex `numpy` code and nonobvious probability theory equivalences. This post illustrates a parametric approach to Bayesian survival analysis in PyMC3. Parametric models of survival are simpler to both implement and understand than semiparametric models; statistically, they are also more powerful than non- or semiparametric methods *when they are correctly specified*. This post will not further cover the differences between parametric and nonparametric models or the various methods for choosing between them.

As in the previous post, we will analyze mastectomy data from `R`’s `HSAUR` package. First, we load the data.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
from matplotlib.ticker import StrMethodFormatter
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from statsmodels import datasets
from theano import shared, tensor as tt
```

```
sns.set()
blue, green, red, purple, gold, teal = sns.color_palette()
pct_formatter = StrMethodFormatter('{x:.1%}')
```

```
df = (datasets.get_rdataset('mastectomy', 'HSAUR', cache=True)
.data
.assign(metastized=lambda df: 1. * (df.metastized == "yes"),
event=lambda df: 1. * df.event))
```

`df.head()`

| | time | event | metastized |
|---|---|---|---|
| 0 | 23 | 1.0 | 0.0 |
| 1 | 47 | 1.0 | 0.0 |
| 2 | 69 | 1.0 | 0.0 |
| 3 | 70 | 0.0 | 0.0 |
| 4 | 100 | 0.0 | 0.0 |

The column `time` represents the survival time for a breast cancer patient after a mastectomy, measured in months. The column `event` indicates whether or not the observation is censored. If `event` is one, the patient’s death was observed during the study; if `event` is zero, the patient lived past the end of the study and their survival time is censored. The column `metastized` indicates whether the cancer had metastasized prior to the mastectomy. In this post, we will use Bayesian parametric survival regression to quantify the difference in survival times for patients whose cancer had and had not metastasized.

Accelerated failure time models are the most common type of parametric survival regression models. The fundamental quantity of survival analysis is the survival function; if \(T\) is the random variable representing the time to the event in question, the survival function is \(S(t) = P(T > t)\). Accelerated failure time models incorporate covariates \(\mathbf{x}\) into the survival function as

\[S(t\ |\ \beta, \mathbf{x}) = S_0\left(\exp\left(\beta^{\top} \mathbf{x}\right) \cdot t\right),\]

where \(S_0(t)\) is a fixed baseline survival function. These models are called “accelerated failure time” because, when \(\beta^{\top} \mathbf{x} > 0\), \(\exp\left(\beta^{\top} \mathbf{x}\right) \cdot t > t\), so the effect of the covariates is to accelerate the *effective* passage of time for the individual in question. The following plot illustrates this phenomenon using an exponential survival function.

`S0 = sp.stats.expon.sf`

```
fig, ax = plt.subplots(figsize=(8, 6))
t = np.linspace(0, 10, 100)
ax.plot(t, S0(5 * t),
label=r"$\beta^{\top} \mathbf{x} = \log\ 5$");
ax.plot(t, S0(2 * t),
label=r"$\beta^{\top} \mathbf{x} = \log\ 2$");
ax.plot(t, S0(t),
label=r"$\beta^{\top} \mathbf{x} = 0$ ($S_0$)");
ax.plot(t, S0(0.5 * t),
label=r"$\beta^{\top} \mathbf{x} = -\log\ 2$");
ax.plot(t, S0(0.2 * t),
label=r"$\beta^{\top} \mathbf{x} = -\log\ 5$");
ax.set_xlim(0, 10);
ax.set_xlabel(r"$t$");
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylim(-0.025, 1);
ax.set_ylabel(r"Survival probability, $S(t\ |\ \beta, \mathbf{x})$");
ax.legend(loc=1);
ax.set_title("Accelerated failure times");
```

Accelerated failure time models are equivalent to log-linear models for \(T\),

\[Y = \log T = \beta^{\top} \mathbf{x} + \varepsilon.\]

A choice of distribution for the error term \(\varepsilon\) determines the baseline survival function, \(S_0\), of the accelerated failure time model. The following table shows the correspondence between the distribution of \(\varepsilon\) and \(S_0\) for several common accelerated failure time models.

| Log-linear error distribution (\(\varepsilon\)) | Baseline survival function (\(S_0\)) |
|---|---|
| Normal | Log-normal |
| Extreme value (Gumbel) | Weibull |
| Logistic | Log-logistic |

Accelerated failure time models are conventionally named after their baseline survival function, \(S_0\). The rest of this post will show how to implement Weibull and log-logistic survival regression models in PyMC3 using the mastectomy data.
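The correspondence between the distribution of \(\varepsilon\) and \(S_0\) can be checked numerically. The following is a minimal sketch using `scipy.stats` for the normal and logistic rows of the table; the location `mu` and scale `s` are arbitrary illustrative values, not estimates from the mastectomy data.

```python
import numpy as np
from scipy import stats

# Illustrative values (assumptions, not fitted to any data)
mu, s = 0.5, 0.8
t = np.linspace(0.1, 10.0, 50)

# Normal errors on log T  <->  log-normal survival function for T
sf_from_normal = stats.norm.sf(np.log(t), loc=mu, scale=s)
sf_lognormal = stats.lognorm.sf(t, s, scale=np.exp(mu))

# Logistic errors on log T  <->  log-logistic (Fisk) survival function for T
sf_from_logistic = stats.logistic.sf(np.log(t), loc=mu, scale=s)
sf_loglogistic = stats.fisk.sf(t, 1.0 / s, scale=np.exp(mu))

normal_ok = np.allclose(sf_from_normal, sf_lognormal)
logistic_ok = np.allclose(sf_from_logistic, sf_loglogistic)
```

In both cases the survival function of \(T\) evaluated at \(t\) equals the survival function of \(\log T\) evaluated at \(\log t\), which is all the table expresses.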

In this example, the covariates are \(\mathbf{x}_i = \left(1\ x^{\textrm{met}}_i\right)^{\top}\), where

\[ \begin{align*} x^{\textrm{met}}_i & = \begin{cases} 0 & \textrm{if the } i\textrm{-th patient's cancer had not metastasized} \\ 1 & \textrm{if the } i\textrm{-th patient's cancer had metastasized} \end{cases}. \end{align*} \]

We construct the matrix of covariates \(\mathbf{X}\).

```
n_patient, _ = df.shape
X = np.empty((n_patient, 2))
X[:, 0] = 1.
X[:, 1] = df.metastized
```

We place independent, vague normal prior distributions on the regression coefficients,

\[\beta \sim N(0, 5^2 I_2).\]

`VAGUE_PRIOR_SD = 5.`

```
with pm.Model() as weibull_model:
    β = pm.Normal('β', 0., VAGUE_PRIOR_SD, shape=2)
```

The covariates, \(\mathbf{x}\), affect the value of \(Y = \log T\) through \(\eta = \beta^{\top} \mathbf{x}\).

```
X_ = shared(X)
with weibull_model:
    η = β.dot(X_.T)
```

For Weibull regression, we use

\[ \begin{align*} \varepsilon & \sim \textrm{Gumbel}(0, s) \\ s & \sim \textrm{HalfNormal(5)}. \end{align*} \]

```
with weibull_model:
    s = pm.HalfNormal('s', 5.)
```

We are nearly ready to specify the likelihood of the observations given these priors. Before doing so, we transform the observed times to the log scale and standardize them.

```
y = np.log(df.time.values)
y_std = (y - y.mean()) / y.std()
```
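Because the model is fit on this standardized log scale, predictions have to be mapped back to the original time scale later on. A small sketch of the round trip, using a handful of illustrative survival times rather than the full data set:

```python
import numpy as np

# Illustrative survival times in months (not the full mastectomy data)
time = np.array([23.0, 47.0, 69.0, 70.0, 100.0])

# Forward transform: log scale, then zero mean and unit standard deviation
y = np.log(time)
y_std = (y - y.mean()) / y.std()

# Inverse transform: undo the standardization, then the logarithm
time_recovered = np.exp(y.mean() + y.std() * y_std)
```

The same inverse transform, `np.exp(y.mean() + y.std() * ...)`, is what converts posterior predictive samples back to survival times.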

The likelihood of the data is specified in two parts, one for uncensored samples, and one for censored samples. Since \(Y = \eta + \varepsilon\), and \(\varepsilon \sim \textrm{Gumbel}(0, s)\), \(Y \sim \textrm{Gumbel}(\eta, s)\). For the uncensored survival times, the likelihood is implemented as

`cens = df.event.values == 0.`

```
cens_ = shared(cens)
with weibull_model:
    y_obs = pm.Gumbel(
        'y_obs', η[~cens_], s,
        observed=y_std[~cens]
    )
```

For censored observations, we only know that their true survival time exceeded the total time that they were under observation. This probability is given by the survival function of the Gumbel distribution,

\[P(Y \geq y) = 1 - \exp\left(-\exp\left(-\frac{y - \mu}{s}\right)\right).\]

This survival function is implemented below.

```
def gumbel_sf(y, μ, σ):
    return 1. - tt.exp(-tt.exp(-(y - μ) / σ))
```
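As a sanity check, the same formula can be written in NumPy and compared against SciPy's right-skewed Gumbel distribution, `scipy.stats.gumbel_r`. This is only a sketch: the function name `gumbel_sf_np` and the parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def gumbel_sf_np(y, mu, sigma):
    """NumPy version of the Gumbel survival function given above."""
    return 1.0 - np.exp(-np.exp(-(y - mu) / sigma))

y_grid = np.linspace(-3.0, 5.0, 41)
mu, sigma = 0.3, 1.7  # arbitrary illustrative values

sf_manual = gumbel_sf_np(y_grid, mu, sigma)
sf_scipy = stats.gumbel_r.sf(y_grid, loc=mu, scale=sigma)
```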

We now specify the likelihood for the censored observations.

```
with weibull_model:
    y_cens = pm.Bernoulli(
        'y_cens', gumbel_sf(y_std[cens], η[cens_], s),
        observed=np.ones(cens.sum())
    )
```
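Observing a one from a Bernoulli variable with success probability \(S(y)\) contributes exactly \(\log S(y)\) to the log-likelihood, which is the usual contribution of a right-censored observation. A brief sketch with hypothetical survival probabilities illustrates the equivalence:

```python
import numpy as np
from scipy import stats

# Hypothetical survival probabilities S(y_i) for three censored patients
surv_prob = np.array([0.9, 0.55, 0.2])

# Log-likelihood of observing a one from each Bernoulli variable
bernoulli_loglik = stats.bernoulli.logpmf(1, surv_prob).sum()

# Direct censored-data contribution: sum of log survival probabilities
direct_loglik = np.log(surv_prob).sum()
```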

We now sample from the model.

```
SEED = 845199 # from random.org, for reproducibility
SAMPLE_KWARGS = {
'njobs': 3,
'tune': 1000,
'random_seed': [
SEED,
SEED + 1,
SEED + 2
]
}
```

```
with weibull_model:
    weibull_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 1500/1500 [00:04<00:00, 322.90it/s]
```

The energy plot and Bayesian fraction of missing information give no cause for concern about poor mixing in NUTS.

`pm.energyplot(weibull_trace);`

`pm.bfmi(weibull_trace)`

`1.0189285246960067`

The Gelman-Rubin statistics also indicate convergence.

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(weibull_trace).values())`

`1.0077500079163573`

Below we plot posterior distributions of the parameters.

`pm.plot_posterior(weibull_trace, lw=0, alpha=0.5);`

These are somewhat interesting (especially the fact that the posterior of \(\beta_1\) is fairly well-separated from zero), but the posterior predictive survival curves will be much more interpretable.

The advantage of using `theano.shared` variables is that we can now change their values to perform posterior predictive sampling. For posterior prediction, we set \(X\) to have two rows, one for a subject whose cancer had not metastasized and one for a subject whose cancer had metastasized. Since we want to predict actual survival times, none of the posterior predictive rows are censored.

```
X_pp = np.empty((2, 2))
X_pp[:, 0] = 1.
X_pp[:, 1] = [0, 1]
X_.set_value(X_pp)
cens_pp = np.repeat(False, 2)
cens_.set_value(cens_pp)
```

```
with weibull_model:
    pp_weibull_trace = pm.sample_ppc(
        weibull_trace, samples=1500, vars=[y_obs]
    )
```

`100%|██████████| 1500/1500 [00:00<00:00, 2789.50it/s]`

The posterior predictive survival times show that, on average, patients whose cancer had not metastasized survived longer than those whose cancer had metastasized.

```
t_plot = np.linspace(0, 230, 100)
weibull_pp_surv = (np.greater_equal
.outer(np.exp(y.mean() + y.std() * pp_weibull_trace['y_obs']),
t_plot))
weibull_pp_surv_mean = weibull_pp_surv.mean(axis=0)
```
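The use of `np.greater_equal.outer` above is compact, so it may help to spell out the shapes involved. The sketch below substitutes randomly generated survival times for real posterior predictive samples; the names `pp_times`, `t_grid`, and `surv_curves` are illustrative.

```python
import numpy as np

rng = np.random.RandomState(0)

# Stand-in for posterior predictive survival times:
# 1,500 draws for each of 2 hypothetical patients (not real model output)
pp_times = np.exp(rng.normal(loc=[4.5, 3.5], scale=1.0, size=(1500, 2)))
t_grid = np.linspace(0, 230, 100)

# Shape (1500, 2, 100): does draw i for patient j survive past t_grid[k]?
survived = np.greater_equal.outer(pp_times, t_grid)

# Averaging over draws gives one estimated survival curve per patient
surv_curves = survived.mean(axis=0)
```

Each row of `surv_curves` is the fraction of simulated survival times that reach each time point, which is an empirical estimate of the survival function.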

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(t_plot, weibull_pp_surv_mean[0],
c=blue, label="Not metastized");
ax.plot(t_plot, weibull_pp_surv_mean[1],
c=red, label="Metastized");
ax.set_xlim(0, 230);
ax.set_xlabel("Months since mastectomy");
ax.set_ylim(top=1);
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Survival probability");
ax.legend(loc=1);
ax.set_title("Weibull survival regression model");
```

Other accelerated failure time models can be specified in a modular way by changing the prior distribution on \(\varepsilon\). A log-logistic model corresponds to a logistic prior on \(\varepsilon\). Most of the model specification is the same as for the Weibull model above.

```
X_.set_value(X)
cens_.set_value(cens)
with pm.Model() as log_logistic_model:
    β = pm.Normal('β', 0., VAGUE_PRIOR_SD, shape=2)
    η = β.dot(X_.T)
    s = pm.HalfNormal('s', 5.)
```

We use the prior \(\varepsilon \sim \textrm{Logistic}(0, s)\). The survival function of the logistic distribution is

\[P(Y \geq y) = 1 - \frac{1}{1 + \exp\left(-\left(\frac{y - \mu}{s}\right)\right)},\]

so we get the likelihood

```
def logistic_sf(y, μ, s):
    return 1. - pm.math.sigmoid((y - μ) / s)
```

```
with log_logistic_model:
    y_obs = pm.Logistic(
        'y_obs', η[~cens_], s,
        observed=y_std[~cens]
    )
    y_cens = pm.Bernoulli(
        'y_cens', logistic_sf(y_std[cens], η[cens_], s),
        observed=np.ones(cens.sum())
    )
```
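The logistic survival function used above can be checked against `scipy.stats.logistic` in the same spirit. In this sketch, `logistic_sf_np` is an illustrative NumPy analogue of `logistic_sf`, and the parameter values are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.special import expit  # the logistic sigmoid

def logistic_sf_np(y, mu, s):
    """NumPy analogue of the logistic survival function."""
    return 1.0 - expit((y - mu) / s)

y_grid = np.linspace(-4.0, 4.0, 33)
mu, s = -0.2, 0.6  # arbitrary illustrative values

sf_manual = logistic_sf_np(y_grid, mu, s)
sf_scipy = stats.logistic.sf(y_grid, loc=mu, scale=s)
```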

We now sample from the log-logistic model.

```
with log_logistic_model:
    log_logistic_trace = pm.sample(**SAMPLE_KWARGS)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
100%|██████████| 1500/1500 [00:05<00:00, 291.48it/s]
```

All of the sampling diagnostics look good for this model.

`pm.energyplot(log_logistic_trace);`

`pm.bfmi(log_logistic_trace)`

`0.98805328946082049`

`max(np.max(gr_stats) for gr_stats in pm.gelman_rubin(log_logistic_trace).values())`

`1.0018938145216476`

Again, we calculate the posterior expected survival functions for this model.

```
X_.set_value(X_pp)
cens_.set_value(cens_pp)
with log_logistic_model:
    pp_log_logistic_trace = pm.sample_ppc(
        log_logistic_trace, samples=1500, vars=[y_obs]
    )
```

`100%|██████████| 1500/1500 [00:00<00:00, 2526.82it/s]`

```
log_logistic_pp_surv = (np.greater_equal
.outer(np.exp(y.mean() + y.std() * pp_log_logistic_trace['y_obs']),
t_plot))
log_logistic_pp_surv_mean = log_logistic_pp_surv.mean(axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(t_plot, weibull_pp_surv_mean[0],
c=blue, label="Weibull, not metastized");
ax.plot(t_plot, weibull_pp_surv_mean[1],
c=red, label="Weibull, metastized");
ax.plot(t_plot, log_logistic_pp_surv_mean[0],
'--', c=blue,
label="Log-logistic, not metastized");
ax.plot(t_plot, log_logistic_pp_surv_mean[1],
'--', c=red,
label="Log-logistic, metastized");
ax.set_xlim(0, 230);
ax.set_xlabel("Months since mastectomy");
ax.set_ylim(top=1);
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Survival probability");
ax.legend(loc=1);
ax.set_title("Weibull and log-logistic\nsurvival regression models");
```

This post has been a short introduction to implementing parametric survival regression models in PyMC3 with a fairly simple data set. The modular nature of probabilistic programming with PyMC3 should make it straightforward to generalize these techniques to more complex and interesting data sets.

This post is available as a Jupyter notebook here.

Tags: Bayesian Statistics, PyMC3

A few weeks ago, YouGov correctly predicted a hung parliament as a result of the 2017 UK general election, to the astonishment of many commentators. YouGov’s predictions were based on a technique called multilevel regression with poststratification, or MRP for short (Andrew Gelman playfully refers to it as Mister P).

I was impressed with YouGov’s prediction and decided to work through an MRP example to improve my understanding of this technique. Since all of the applications of MRP I have found online involve `R`’s `lme4` package or Stan, I also thought this was a good opportunity to illustrate MRP in Python with PyMC3. This post is essentially a port of Jonathan Kastellec’s excellent MRP primer to Python and PyMC3. I am very grateful for his clear exposition of MRP and willingness to share a relevant data set.

MRP was developed to estimate American state-level opinions from national polls. This sort of estimation is crucial to understanding American politics at the national level, as many of the important political positions of the federal government are impacted by state-level elections:

- the president is chosen by the Electoral College, which (with a few exceptions) votes according to state-level vote totals,
- senators are chosen by state-level elections,
- many political and all judicial (notably Supreme Court) appointees require Senate approval, and therefore are subject to indirect state-level elections.

Of course, as YouGov demonstrates, MRP is more widely applicable than estimation of state-level opinion.

In this post, we will follow Kastellec’s example of estimating state-level opinion about gay marriage in 2005/2006 using a combination of three national polls. We begin by loading a data set that consists of responses to the three national polls.

```
%matplotlib inline
%config InlineBackend.figure_format = 'retina'
```

```
import os
import us
```

```
import matplotlib as mpl
from matplotlib import pyplot as plt
from matplotlib.colors import Normalize, rgb2hex
from matplotlib.patches import Polygon
from matplotlib.ticker import FuncFormatter
from mpl_toolkits.basemap import Basemap
import numpy as np
import pandas as pd
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import shared
```

```
%%bash
if [ ! -e ./st99_d00.dbf ];
then
wget -q https://github.com/matplotlib/basemap/raw/master/examples/st99_d00.dbf
wget -q https://github.com/matplotlib/basemap/raw/master/examples/st99_d00.shp
wget -q https://github.com/matplotlib/basemap/raw/master/examples/st99_d00.shx
fi
```

```
SEED = 4260026 # from random.org, for reproducibility
np.random.seed(SEED)
```

We load only the columns which we will use in the analysis and transform categorical variables to be zero-indexed.

```
def to_zero_indexed(col):
    return lambda df: (df[col] - 1).astype(np.int64)
```

```
DATA_PREFIX = 'http://www.princeton.edu/~jkastell/MRP_primer/'
survey_df = (pd.read_stata(os.path.join(DATA_PREFIX, 'gay_marriage_megapoll.dta'),
columns=['race_wbh', 'age_cat', 'edu_cat', 'female',
'state_initnum', 'state', 'region_cat', 'region', 'statename',
'poll', 'yes_of_all'])
.dropna(subset=['race_wbh', 'age_cat', 'edu_cat', 'state_initnum'])
.assign(state_initnum=to_zero_indexed('state_initnum'),
race_wbh=to_zero_indexed('race_wbh'),
edu_cat=to_zero_indexed('edu_cat'),
age_cat=to_zero_indexed('age_cat'),
region_cat=to_zero_indexed('region_cat')))
```

`survey_df.head()`

| | race_wbh | age_cat | edu_cat | female | state_initnum | state | region_cat | region | statename | poll | yes_of_all |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 2 | 2 | 1 | 22 | MI | 1 | midwest | michigan | Gall2005Aug22 | 0 |
| 1 | 0 | 2 | 3 | 0 | 10 | GA | 2 | south | georgia | Gall2005Aug22 | 0 |
| 2 | 2 | 0 | 3 | 0 | 34 | NY | 0 | northeast | new york | Gall2005Aug22 | 1 |
| 3 | 0 | 3 | 3 | 1 | 30 | NH | 0 | northeast | new hampshire | Gall2005Aug22 | 1 |
| 5 | 0 | 3 | 2 | 1 | 14 | IL | 1 | midwest | illinois | Gall2005Aug22 | 0 |

These three surveys collected data from roughly 6,300 respondents during 2005 and 2006.

`survey_df.shape[0]`

`6341`

We see that the number of respondents varies widely between states.

```
def state_plot(state_data, cmap, norm, cbar=True, default=None, ax=None):
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))
    else:
        fig = plt.gcf()
    m = Basemap(llcrnrlon=-121, llcrnrlat=20,
                urcrnrlon=-62, urcrnrlat=51,
                projection='lcc',
                lat_1=32, lat_2=45, lon_0=-95)
    m.readshapefile('st99_d00', name='states', drawbounds=True)
    for state_info, state_seg in zip(m.states_info, m.states):
        if state_info['NAME'] == 'Alaska':
            state_seg = list(map(lambda xy: (0.35 * xy[0] + 1100000, 0.35 * xy[1] - 1300000), state_seg))
        elif state_info['NAME'] == 'Hawaii' and float(state_info['AREA']) > 0.005:
            state_seg = list(map(lambda xy: (xy[0] + 5100000, xy[1] - 1400000), state_seg))
        try:
            state_datum = state_data.loc[us.states.lookup(state_info['NAME']).abbr]
        except KeyError:
            state_datum = default
        if state_datum is not None:
            color = rgb2hex(cmap(norm(state_datum)))
            poly = Polygon(state_seg, facecolor=color, edgecolor='#000000')
            ax.add_patch(poly)
    if cbar:
        cbar_ax = fig.add_axes([0.925, 0.25, 0.04, 0.5])
        mpl.colorbar.ColorbarBase(cbar_ax, cmap=cmap, norm=norm)
    else:
        cbar_ax = None
    return fig, ax, cbar_ax
```

`state_counts = survey_df.groupby('state').size()`

```
fig, ax, cbar_ax = state_plot(state_counts,
mpl.cm.binary,
Normalize(0, state_counts.max()),
default=None)
ax.set_title("Number of poll respondents");
```

Notably, there are no respondents from some less populous states, such as Alaska and Hawaii.

Faced with this data set, it is intuitively appealing to estimate state-level opinion by the observed proportion of that state’s respondents that supported gay marriage. This approach is known as disaggregation.

```
disagg_p = (survey_df.groupby('state')
.yes_of_all
.mean())
```
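To make the idea concrete, here is a toy version of disaggregation on a handful of made-up responses. The data frame `toy_df` and the list `all_states` are hypothetical and have nothing to do with the actual polls; the point is only that the estimate is a per-state mean, and that states without respondents get no estimate at all.

```python
import pandas as pd

# Hypothetical responses (1 = supports, 0 = does not); not the real poll data
toy_df = pd.DataFrame({
    'state': ['NY', 'NY', 'NY', 'NY', 'GA', 'GA', 'WY'],
    'yes_of_all': [1, 1, 0, 1, 0, 1, 0],
})
all_states = ['AK', 'GA', 'NY', 'WY']

# Per-state proportion and sample size; AK has no respondents, so it is NaN
toy_disagg = (toy_df.groupby('state')
                    .yes_of_all
                    .agg(['mean', 'size'])
                    .reindex(all_states))
```

In this toy example the estimate for `WY` rests on a single respondent, which previews the small-sample problem with disaggregation.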

```
p_norm = Normalize(0., 0.6)
p_cmap = sns.diverging_palette(220, 10, as_cmap=True)
fig, ax, cbar_ax = state_plot(disagg_p, p_cmap, p_norm)
p_formatter = FuncFormatter(lambda prop, _: '{:.1%}'.format(p_norm.inverse(prop)))
cbar_ax.yaxis.set_major_formatter(p_formatter);
ax.set_title("Disaggregation estimate of\nsupport for gay marriage in 2005");
```

The simplicity of disaggregation is appealing, but it suffers from a number of drawbacks. Obviously, it cannot estimate the state-level support for gay marriage in states with no respondents, such as Alaska and Hawaii. Similarly, for small or low-population states with some respondents, the sample size may be too small to produce reliable estimates of opinion. This problem is exacerbated by the fact that for many issues, opinion is quite correlated with demographic factors such as age, race, education, and gender. Many more states will not have sufficient sample size for each combination of these factors for the disaggregate estimate to be representative of that state’s demographic composition.

```
ax = (survey_df.groupby(['state', 'female', 'race_wbh'])
.size()
.unstack(level=['female', 'race_wbh'])
.isnull()
.sum()
.unstack(level='female')
.rename(index={0: 'White', 1: 'Black', 2: 'Hispanic'},
columns={0: 'Male', 1: 'Female'})
.rename_axis('Race', axis=0)
.rename_axis('Gender', axis=1)
.plot(kind='bar', rot=0, figsize=(8, 6)))
ax.set_yticks(np.arange(0, 21, 5));
ax.set_ylabel("Number of states");
ax.set_title("States with no respondents");
```

The plot above illustrates this phenomenon; a number of states have no nonwhite male or female respondents. Even more states will have very few such respondents. This lack of data renders the disaggregation estimates suspect. For further discussion and references on disaggregation (as well as an empirical comparison of disaggregation and MRP), consult Lax and Phillips’ *How Should We Estimate Public Opinion in the States?*.

MRP lessens the impact of this per-state respondent sparsity in two steps. It first builds a multilevel model that relates respondents’ states and demographic characteristics to their opinions. It then combines the predictions of this multilevel model with census data about the demographic composition of each state to predict state-level opinion. Intuitively, the multilevel model employed by MRP is a principled statistical method for estimating, for example, how much men in Pennsylvania share opinions with men in other states versus how much they share opinions with women in Pennsylvania. This partial pooling at both the state and demographic levels helps MRP impute the opinions of groups present in states that were not surveyed.

The rest of this post is focused primarily on the execution of MRP in Python with PyMC3. For more detail on the theory and accuracy of MRP, consult the following (very incomplete) list of MRP resources:

- the MRP primer from which our example is taken,
- Park, Gelman, and Bafumi’s *Bayesian Multilevel Estimation with Poststratification: State-Level Estimates from National Polls*, which assesses the accuracy of MRP in predicting the state-level results of the 1988 and 1992 US presidential elections,
- Section 14.1 of Gelman and Hill’s *Data Analysis Using Regression and Multilevel/Hierarchical Models*, which gives an expanded discussion of the example from the previous paper,
- Lax and Phillips’ *How Should We Estimate Public Opinion in The States?*, which is also mentioned above,
- Gelman’s blog post Mister P: What’s its secret sauce?, which is an extended discussion of several assessments of MRP’s accuracy (1, 2).

Following the MRP primer, our multilevel opinion model will include factors for state, race, gender, education, age, and poll. In order to accelerate inference, we count the number of unique combinations of these factors, along with how many respondents with each combination supported gay marriage.

```
uniq_survey_df = (survey_df.groupby(['race_wbh', 'female', 'edu_cat', 'age_cat',
'region_cat', 'state_initnum', 'poll'])
.yes_of_all
.agg({
'yes_of_all': 'sum',
'n': 'size'
})
.reset_index())
```

`uniq_survey_df.head()`

| | race_wbh | female | edu_cat | age_cat | region_cat | state_initnum | poll | yes_of_all | n |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 6 | Pew 2004Dec01 | 0 | 1 |
| 1 | 0 | 0 | 0 | 0 | 0 | 30 | Gall2005Aug22 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 34 | ABC 2004Jan15 | 1 | 1 |
| 3 | 0 | 0 | 0 | 0 | 0 | 38 | Pew 2004Dec01 | 1 | 1 |
| 4 | 0 | 0 | 0 | 0 | 1 | 12 | ABC 2004Jan15 | 0 | 1 |

This reduction adds negligible mathematical complexity (several Bernoulli distributions are combined into a single binomial distribution), but reduces the number of rows in the data set by roughly 42%.

`uniq_survey_df.shape[0] / survey_df.shape[0]`

`0.5824002523261316`
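Combining respondents into cells does not change inference about the support probability, because the binomial log-likelihood differs from the sum of the individual Bernoulli log-likelihoods only by a constant that does not depend on \(p\). A minimal sketch with one hypothetical cell of five respondents:

```python
import numpy as np
from scipy import stats
from scipy.special import comb

# Hypothetical cell: 5 respondents, 2 of whom answered "yes"
responses = np.array([1, 0, 0, 1, 0])
n_cell, y_cell = responses.size, responses.sum()

p_grid = np.linspace(0.05, 0.95, 19)

# Log-likelihood from individual Bernoulli responses, for each candidate p
bernoulli_ll = np.array([stats.bernoulli.logpmf(responses, p).sum()
                         for p in p_grid])

# Log-likelihood from the aggregated binomial count
binomial_ll = stats.binom.logpmf(y_cell, n_cell, p_grid)

# The gap is log C(n, y), the same for every value of p
ll_gap = binomial_ll - bernoulli_ll
```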

We will refer to each unique combination of state and demographic characteristics as a cell. Let \(n_i\) denote the number of respondents in cell \(i\), \(y_i\) the number of those respondents that supported gay marriage, and \(p_i\) the probability that a member of the general population of cell \(i\) supports gay marriage. We build a Bayesian multilevel logistic regression model of opinion as follows.

\[\begin{align*} \eta_i & = \beta_0 + \alpha^{\textrm{gender : race}}_{j(i)} + \alpha^{\textrm{age}}_{k(i)} + \alpha^{\textrm{edu}}_{l(i)} + \alpha^{\textrm{age : edu}}_{k(i),\ l(i)} + \alpha^{\textrm{state}}_{s(i)} + \alpha^{\textrm{poll}}_{m(i)} \\ \log \left(\frac{p_i}{1 - p_i}\right) & = \eta_i \\ y_i & \sim \textrm{Binomial}(n_i, p_i) \end{align*}\]

Here each subscript indexed by \(i\) is the categorical level of that characteristic for respondents in cell \(i\). The prior for the intercept is \(\beta_0 \sim N(0, 5^2)\). The prior for the effects of the interaction of gender and race is \(\alpha^{\textrm{gender : race}}_j \sim N\left(0, \sigma_{\textrm{gender : race}}^2\right),\) with \(\sigma_{\textrm{gender : race}} \sim \textrm{HalfCauchy}(5)\). The priors on \(\alpha^{\textrm{age}}_k,\) \(\alpha^{\textrm{edu}}_l,\) \(\alpha^{\textrm{age : edu}}_{k,\ l},\) and \(\alpha^{\textrm{poll}}_m\) are defined similarly. The prior on the state term, \(\alpha^{\textrm{state}}_s\), includes state-level predictors for region of the country, religiosity, and support for John Kerry in the 2004 presidential election.

\[\begin{align*} \alpha^{\textrm{state}}_s & \sim N\left(\alpha^{\textrm{region}}_s + \beta^{\textrm{relig}} x^{\textrm{relig}}_s + \beta^{\textrm{kerry}} x^{\textrm{kerry}}_s, \sigma^2_{\textrm{state}}\right) \end{align*}\]

Here \(x^{\textrm{relig}}_s\) is the log odds of the proportion of the state’s residents that are evangelical Christian or Mormon, and \(x^{\textrm{kerry}}_s\) is the log odds of the proportion of the state’s voters that voted for John Kerry in 2004. The priors on \(\alpha^{\textrm{region}}_s\), \(\beta^{\textrm{relig}}\), \(\beta^{\textrm{kerry}}\) are the same as those on the analogous terms in the definition of \(\eta\).

First we encode the respondent information.

```
def encode_gender_race(female, race_wbh):
    return (3 * female + race_wbh).values
def encode_age_edu(age, edu):
    return (4 * age + edu).values
```
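These encoders flatten a pair of categorical levels into a single integer index, so that each interaction term can be stored as one flat vector of effects. The sketch below enumerates all gender and race combinations to show that `3 * female + race_wbh` assigns each pair a distinct code from 0 to 5; the `combos` frame is illustrative.

```python
from itertools import product

import numpy as np
import pandas as pd

def encode_gender_race(female, race_wbh):
    return (3 * female + race_wbh).values

# Every combination of gender (0/1) and race (0/1/2)
combos = pd.DataFrame(list(product([0, 1], [0, 1, 2])),
                      columns=['female', 'race_wbh'])
codes = encode_gender_race(combos.female, combos.race_wbh)

# The encoding is invertible: integer division and remainder recover the pair
female_back, race_back = np.divmod(codes, 3)
```

The multiplier must equal the number of levels of the second factor (three races here, four education levels in `encode_age_edu`), otherwise distinct pairs could collide.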

```
gender_race = encode_gender_race(uniq_survey_df.female, uniq_survey_df.race_wbh)
n_gender_race = np.unique(gender_race).size
age = uniq_survey_df.age_cat.values
n_age = np.unique(age).size
edu = uniq_survey_df.edu_cat.values
n_edu = np.unique(edu).size
age_edu = encode_age_edu(uniq_survey_df.age_cat, uniq_survey_df.edu_cat)
n_age_edu = np.unique(age_edu).size
poll, poll_map = uniq_survey_df.poll.factorize()
n_poll = poll_map.size
region = uniq_survey_df.region_cat.values
n_region = np.unique(region).size
state = uniq_survey_df.state_initnum.values
n_state = 51
n = uniq_survey_df.n.values
yes_of_all = uniq_survey_df.yes_of_all.values
```

Next we load the state-level data and encode \(x^{\textrm{relig}}\) and \(x^{\textrm{kerry}}\).

```
STATE_URL = 'http://www.princeton.edu/~jkastell/MRP_primer/state_level_update.dta'
state_df = (pd.read_stata(STATE_URL,
columns=['sstate_initnum', 'sstate',
'p_evang', 'p_mormon', 'kerry_04'])
.rename(columns={'sstate_initnum': 'state_initnum', 'sstate': 'state'})
.assign(state_initnum=to_zero_indexed('state_initnum'),
p_relig=lambda df: df.p_evang + df.p_mormon))
```

`state_df.head()`

| | state_initnum | state | p_evang | p_mormon | kerry_04 | p_relig |
|---|---|---|---|---|---|---|
| 0 | 0 | AK | 12.440000 | 3.003126 | 35.500000 | 15.443126 |
| 1 | 1 | AL | 40.549999 | 0.458273 | 36.799999 | 41.008274 |
| 2 | 2 | AR | 43.070000 | 0.560113 | 44.599998 | 43.630112 |
| 3 | 3 | AZ | 9.410000 | 4.878735 | 44.400002 | 14.288734 |
| 4 | 4 | CA | 7.160000 | 1.557627 | 54.299999 | 8.717627 |

```
state_kerry = sp.special.logit(state_df.kerry_04.values / 100.)
state_relig = sp.special.logit(state_df.p_relig.values / 100.)
```

The state-level data doesn’t contain region information, so we load census data in order to build a mapping between state and region.

```
CENSUS_URL = 'http://www.princeton.edu/~jkastell/MRP_primer/poststratification%202000.dta'
census_df = (pd.read_stata(CENSUS_URL)
.rename(columns=lambda s: s.lstrip('c_').lower())
.assign(race_wbh=to_zero_indexed('race_wbh'),
edu_cat=to_zero_indexed('edu_cat'),
age_cat=to_zero_indexed('age_cat')))
```

`census_df.head()`

| | race_wbh | age_cat | edu_cat | female | state | freq | freq_state | percent_state | region |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | AK | 467 | 21222.0 | 0.022005 | west |
| 1 | 0 | 1 | 0 | 0 | AK | 377 | 21222.0 | 0.017765 | west |
| 2 | 0 | 2 | 0 | 0 | AK | 419 | 21222.0 | 0.019744 | west |
| 3 | 0 | 3 | 0 | 0 | AK | 343 | 21222.0 | 0.016162 | west |
| 4 | 0 | 0 | 1 | 0 | AK | 958 | 21222.0 | 0.045142 | west |

```
state_df = (pd.merge(
pd.merge((survey_df.groupby('region')
.region_cat
.first()
.reset_index()),
(census_df[['state', 'region']].drop_duplicates()),
on='region')[['state', 'region_cat']],
state_df, on='state')
.set_index('state_initnum')
.sort_index())
```

`state_df.head()`

| state_initnum | state | region_cat | p_evang | p_mormon | kerry_04 | p_relig |
|---|---|---|---|---|---|---|
| 0 | AK | 3 | 12.440000 | 3.003126 | 35.500000 | 15.443126 |
| 1 | AL | 2 | 40.549999 | 0.458273 | 36.799999 | 41.008274 |
| 2 | AR | 2 | 43.070000 | 0.560113 | 44.599998 | 43.630112 |
| 3 | AZ | 3 | 9.410000 | 4.878735 | 44.400002 | 14.288734 |
| 4 | CA | 3 | 7.160000 | 1.557627 | 54.299999 | 8.717627 |

`state_region = state_df.region_cat.values`

Finally, we are ready to specify the model with PyMC3. First, we wrap the predictors in `theano.shared` so that we can eventually replace the survey respondents’ predictors with census predictors for posterior prediction (the poststratification step of MRP).

```
gender_race_ = shared(gender_race)
age_ = shared(age)
edu_ = shared(edu)
age_edu_ = shared(age_edu)
poll_ = shared(poll)
state_ = shared(state)
use_poll_ = shared(1)
n_ = shared(n)
```

We specify the model for \(\alpha^{\textrm{state}}\).

```
def hierarchical_normal(name, shape, μ=0.):
    Δ = pm.Normal('Δ_{}'.format(name), 0., 1., shape=shape)
    σ = pm.HalfCauchy('σ_{}'.format(name), 5.)
    return pm.Deterministic(name, μ + Δ * σ)
```
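The `hierarchical_normal` helper builds each group effect as \(\mu + \Delta \cdot \sigma\) with \(\Delta\) standard normal, rather than sampling it directly from \(N(\mu, \sigma^2)\). The NumPy sketch below, with arbitrary illustrative values for `mu` and `sigma`, suggests that the two constructions produce the same distribution; the difference only matters for how easily a sampler such as NUTS can explore the posterior.

```python
import numpy as np

rng = np.random.RandomState(42)
mu, sigma = 1.5, 0.4   # illustrative values, not posterior estimates
n_draws = 200000

# Centered: draw the effect directly from N(mu, sigma^2)
centered = rng.normal(mu, sigma, size=n_draws)

# Non-centered: draw a standard normal offset, then shift and scale it
delta = rng.normal(0.0, 1.0, size=n_draws)
non_centered = mu + delta * sigma
```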

```
with pm.Model() as model:
    α_region = hierarchical_normal('region', n_region)
    β_relig = pm.Normal('relig', 0., 5.)
    β_kerry = pm.Normal('kerry', 0., 5.)
    μ_state = α_region[state_region] + β_relig * state_relig + β_kerry * state_kerry
    α_state = hierarchical_normal('state', n_state, μ=μ_state)
```

Throughout, we use a non-centered parametrization for our hierarchical normal priors for more efficient sampling. We now specify the rest of \(\eta_i\).

```
with model:
    β0 = pm.Normal('β0', 0., 5.,
                   testval=sp.special.logit(survey_df.yes_of_all.mean()))
    α_gender_race = hierarchical_normal('gender_race', n_gender_race)
    α_age = hierarchical_normal('age', n_age)
    α_edu = hierarchical_normal('edu', n_edu)
    α_age_edu = hierarchical_normal('age_edu', n_age_edu)
    α_poll = hierarchical_normal('poll', n_poll)
    η = β0 \
        + α_gender_race[gender_race_] \
        + α_age[age_] \
        + α_edu[edu_] \
        + α_age_edu[age_edu_] \
        + α_state[state_] \
        + use_poll_ * α_poll[poll_]
```

Here the `theano.shared` variable `use_poll_` will allow us to ignore poll effects when we do posterior predictive sampling with census data.

Finally, we specify the likelihood and sample from the model using NUTS.

```
with model:
    p = pm.math.sigmoid(η)
    obs = pm.Binomial('obs', n_, p, observed=yes_of_all)
```

```
NUTS_KWARGS = {
    'target_accept': 0.99
}
with model:
    trace = pm.sample(draws=1000, random_seed=SEED,
                      nuts_kwargs=NUTS_KWARGS, njobs=3)
```

```
Auto-assigning NUTS sampler...
Initializing NUTS using ADVI...
Average Loss = 2,800.1: 19%|█▉ | 37833/200000 [00:44<02:44, 982.94it/s]
Convergence archived at 37900
Interrupted at 37,900 [18%]: Average Loss = 3,804.3
100%|██████████| 1500/1500 [09:02<00:00, 3.82it/s]
```

The marginal energy and energy transition distributions are fairly close, showing no obvious problem with NUTS.

`pm.energyplot(trace);`

The Gelman-Rubin statistics for all parameters are quite close to one, indicating convergence.

`max(np.max(score) for score in pm.gelman_rubin(trace).values())`

`1.0088100577623547`

We are now ready for the poststratification step of MRP. First we combine the census and state-level data.

```
ps_df = pd.merge(census_df,
state_df[['state', 'region_cat']].reset_index(),
on='state')
```

`ps_df.head()`

| | race_wbh | age_cat | edu_cat | female | state | freq | freq_state | percent_state | region | state_initnum | region_cat |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | AK | 467 | 21222.0 | 0.022005 | west | 0 | 3 |
| 1 | 0 | 1 | 0 | 0 | AK | 377 | 21222.0 | 0.017765 | west | 0 | 3 |
| 2 | 0 | 2 | 0 | 0 | AK | 419 | 21222.0 | 0.019744 | west | 0 | 3 |
| 3 | 0 | 3 | 0 | 0 | AK | 343 | 21222.0 | 0.016162 | west | 0 | 3 |
| 4 | 0 | 0 | 1 | 0 | AK | 958 | 21222.0 | 0.045142 | west | 0 | 3 |

Next we encode this combined data as before.

```
ps_gender_race = encode_gender_race(ps_df.female, ps_df.race_wbh)
ps_age = ps_df.age_cat.values
ps_edu = ps_df.edu_cat.values
ps_age_edu = encode_age_edu(ps_df.age_cat, ps_df.edu_cat)
ps_region = ps_df.region_cat.values
ps_state = ps_df.state_initnum.values
ps_n = ps_df.freq.values.astype(np.int64)
```

We now set the values of the `theano.shared` variables in our PyMC3 model to the poststratification data and sample from the posterior predictive distribution.

```
gender_race_.set_value(ps_gender_race)
age_.set_value(ps_age)
edu_.set_value(ps_edu)
age_edu_.set_value(ps_age_edu)
poll_.set_value(np.zeros_like(ps_gender_race))
state_.set_value(ps_state)
use_poll_.set_value(0)
n_.set_value(ps_n)
```

```
with model:
    pp_trace = pm.sample_ppc(trace, random_seed=SEED)
```

`100%|██████████| 1000/1000 [00:01<00:00, 583.42it/s]`

```
PP_COLS = ['pp_yes_of_all_{}'.format(i) for i in range(pp_trace['obs'].shape[0])]
pp_df = pd.merge(ps_df,
pd.DataFrame(pp_trace['obs'].T, columns=PP_COLS),
left_index=True, right_index=True)
```

`pp_df.head()`

race_wbh | age_cat | edu_cat | female | state | freq | freq_state | percent_state | region | state_initnum | … | pp_yes_of_all_990 | pp_yes_of_all_991 | pp_yes_of_all_992 | pp_yes_of_all_993 | pp_yes_of_all_994 | pp_yes_of_all_995 | pp_yes_of_all_996 | pp_yes_of_all_997 | pp_yes_of_all_998 | pp_yes_of_all_999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

0 | 0 | 0 | 0 | 0 | AK | 467 | 21222.0 | 0.022005 | west | 0 | … | 205 | 151 | 144 | 185 | 137 | 122 | 145 | 139 | 171 | 186 |

1 | 0 | 1 | 0 | 0 | AK | 377 | 21222.0 | 0.017765 | west | 0 | … | 86 | 71 | 83 | 95 | 84 | 96 | 66 | 79 | 64 | 121 |

2 | 0 | 2 | 0 | 0 | AK | 419 | 21222.0 | 0.019744 | west | 0 | … | 94 | 77 | 45 | 86 | 80 | 61 | 47 | 54 | 50 | 65 |

3 | 0 | 3 | 0 | 0 | AK | 343 | 21222.0 | 0.016162 | west | 0 | … | 69 | 38 | 33 | 40 | 39 | 18 | 32 | 35 | 32 | 39 |

4 | 0 | 0 | 1 | 0 | AK | 958 | 21222.0 | 0.045142 | west | 0 | … | 430 | 287 | 342 | 430 | 342 | 348 | 307 | 382 | 312 | 450 |

5 rows × 1011 columns

We complete the poststratification step by taking a weighted sum across the demographic cells within each state, to produce posterior predictive samples from the state-level opinion distribution.

```
ps_prob = (pp_df.groupby('state')
.apply(lambda df: df[PP_COLS].sum(axis=0) / df.freq.sum()))
```
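The arithmetic of poststratification is easiest to see on a toy example. The sketch below uses made-up census cell counts and made-up cell-level support probabilities in a hypothetical `toy_ps` frame. The code above instead sums simulated counts of supporters per cell, but the weighting by cell population is the same.

```python
import numpy as np
import pandas as pd

# Hypothetical census cells and predicted support; not model output
toy_ps = pd.DataFrame({
    'state': ['AK', 'AK', 'AL', 'AL'],
    'freq': [300, 100, 500, 500],
    'p_cell': [0.40, 0.20, 0.10, 0.30],
})

# Expected number of supporters in each cell
toy_ps['expected_yes'] = toy_ps.freq * toy_ps.p_cell

# State-level support: total expected supporters over total population
by_state = toy_ps.groupby('state')[['expected_yes', 'freq']].sum()
state_support = by_state.expected_yes / by_state.freq
```

In this toy example the estimate for `AK` is pulled toward the larger cell: (300 × 0.40 + 100 × 0.20) / 400 = 0.35.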

`ps_prob.head()`

pp_yes_of_all_0 | pp_yes_of_all_1 | pp_yes_of_all_2 | pp_yes_of_all_3 | pp_yes_of_all_4 | pp_yes_of_all_5 | pp_yes_of_all_6 | pp_yes_of_all_7 | pp_yes_of_all_8 | pp_yes_of_all_9 | … | pp_yes_of_all_990 | pp_yes_of_all_991 | pp_yes_of_all_992 | pp_yes_of_all_993 | pp_yes_of_all_994 | pp_yes_of_all_995 | pp_yes_of_all_996 | pp_yes_of_all_997 | pp_yes_of_all_998 | pp_yes_of_all_999 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

state | |||||||||||||||||||||

AK | 0.306380 | 0.390585 | 0.361229 | 0.275893 | 0.374140 | 0.410847 | 0.362360 | 0.389172 | 0.372491 | 0.302281 | … | 0.411413 | 0.290312 | 0.297521 | 0.387287 | 0.365988 | 0.328386 | 0.316982 | 0.357035 | 0.280417 | 0.373433 |

AL | 0.174168 | 0.199505 | 0.203892 | 0.155924 | 0.214749 | 0.190090 | 0.204509 | 0.203507 | 0.188764 | 0.147419 | … | 0.217865 | 0.145274 | 0.153651 | 0.220120 | 0.185526 | 0.187560 | 0.101283 | 0.175781 | 0.117774 | 0.218250 |

AR | 0.142486 | 0.207379 | 0.219824 | 0.221326 | 0.229756 | 0.204580 | 0.221235 | 0.239279 | 0.193739 | 0.198919 | … | 0.210138 | 0.164843 | 0.146860 | 0.238502 | 0.185973 | 0.245931 | 0.114837 | 0.189018 | 0.164965 | 0.232096 |

AZ | 0.353140 | 0.395125 | 0.388163 | 0.361972 | 0.394620 | 0.378743 | 0.387904 | 0.375209 | 0.385323 | 0.443167 | … | 0.390827 | 0.318305 | 0.340562 | 0.411255 | 0.376126 | 0.455857 | 0.318835 | 0.387193 | 0.329390 | 0.405663 |

CA | 0.384078 | 0.463444 | 0.463495 | 0.405385 | 0.468195 | 0.475593 | 0.463783 | 0.474011 | 0.429405 | 0.431427 | … | 0.468451 | 0.384701 | 0.374767 | 0.486855 | 0.434450 | 0.475072 | 0.378866 | 0.471518 | 0.378529 | 0.496080 |

5 rows × 1000 columns
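To make the poststratification arithmetic concrete, here is a minimal sketch on made-up numbers. The data frame `toy_df`, its column names, and the values are all hypothetical; only the pattern of summing predicted supporters within a state and dividing by the state's total census count mirrors the computation above.

```python
import pandas as pd

# Hypothetical toy data: two states with two demographic cells each.
# `freq` is the census count of a cell; `pp_yes_0` and `pp_yes_1` are two
# posterior predictive draws of the number of supporters in that cell.
toy_df = pd.DataFrame({
    'state': ['AA', 'AA', 'BB', 'BB'],
    'freq': [100, 300, 200, 200],
    'pp_yes_0': [50, 60, 100, 20],
    'pp_yes_1': [40, 90, 80, 40],
})
TOY_COLS = ['pp_yes_0', 'pp_yes_1']

# Sum predicted supporters and census counts within each state...
cell_sums = toy_df.groupby('state')[TOY_COLS + ['freq']].sum()
# ...then divide to get one state-level proportion per posterior draw.
toy_ps_prob = cell_sums[TOY_COLS].div(cell_sums['freq'], axis=0)
```

Each row of `toy_ps_prob` is a (very short) sample from the state-level opinion distribution, in the same layout as `ps_prob`.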

The simplest summary of state-level opinion is the posterior expected value, shown below.

`ps_mean = ps_prob.mean(axis=1)`

`ps_mean.head()`

```
state
AK 0.365962
AL 0.189076
AR 0.201302
AZ 0.395071
CA 0.459842
dtype: float64
```

The following choropleth maps show the disaggregation and MRP estimates of support for gay marriage by state.

```
fig, (disagg_ax, mrp_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
fig, disagg_ax, _ = state_plot(disagg_p, p_cmap, p_norm, cbar=False, ax=disagg_ax)
disagg_ax.set_title("Disaggregation");
fig, mrp_ax, cbar_ax = state_plot(ps_mean, p_cmap, p_norm, ax=mrp_ax)
cbar_ax.yaxis.set_major_formatter(p_formatter);
mrp_ax.set_title("MRP");
fig.suptitle("Estimated support for gay marriage in 2005");
```

Notably, MRP produces opinion estimates for Alaska and Hawaii, which disaggregation does not. The following scatter plot makes it easier to see how the estimate for each state differs between disaggregation and MRP.

`disagg_p_aligned, ps_mean_aligned = disagg_p.align(ps_mean)`

```
fig, ax = plt.subplots(figsize=(8, 8))
ax.set_aspect('equal');
pct_formatter = FuncFormatter(lambda prop, _: '{:.1%}'.format(prop))
ax.plot([0.1, 0.7], [0.1, 0.7], '--', c='k', label="No change");
ax.scatter(disagg_p_aligned, ps_mean_aligned);
ax.set_xlim(0.1, 0.7);
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_xlabel("Disaggregation estimate");
ax.set_ylim(0.1, 0.7);
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("MRP estimate");
ax.legend(loc=2);
ax.set_title("Estimated support for gay marriage in 2005");
```

We see that the MRP estimates tend to be higher than the disaggregation estimates, possibly due to under-sampling of supportive demographic cells in many states.

An additional advantage of MRP is that we can produce better opinion estimates for demographic subsets than disaggregation. For example, we plot below the disaggregation and MRP estimates of support for gay marriage among black men. From above, we know disaggregation will not be able to produce an estimate for many states.

```
black_men_disagg_p = (survey_df[(survey_df.race_wbh == 1) & (survey_df.female == 0)]
.groupby('state')
.yes_of_all
.mean())
black_men_ps_mean = (pp_df[(pp_df.race_wbh == 1) & (pp_df.female == 0)]
.groupby('state')
.apply(lambda df: (df[PP_COLS].sum(axis=0) / df.freq.sum()))
.mean(axis=1))
```

```
fig, (disagg_ax, mrp_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
fig, disagg_ax, _ = state_plot(black_men_disagg_p, p_cmap, p_norm, cbar=False, ax=disagg_ax)
disagg_ax.set_title("Disaggregation");
fig, mrp_ax, cbar_ax = state_plot(black_men_ps_mean, p_cmap, p_norm, ax=mrp_ax)
cbar_ax.yaxis.set_major_formatter(p_formatter);
mrp_ax.set_title("MRP");
fig.suptitle("Estimated support for gay marriage\namong black men in 2005");
```

In addition to the gaps in the disaggregation map above, it seems highly unlikely that not a single black man in Minnesota, Arizona, New Mexico, etc. supported gay marriage in 2005. These disaggregation estimates are due to polling few black men in these states, which MRP attempts to counteract. For further discussion of estimating the opinions of demographic subgroups using MRP, consult Ghitza and Gelman’s *Deep Interactions with MRP: Election Turnout and Voting Patterns Among Small Electoral Subgroups*.

One advantage of using the fully Bayesian approach we have taken to MRP via PyMC3 is that we have access to the full posterior distribution of each state’s opinion, in addition to the posterior expected values shown in the above choropleth maps.

```
grid = sns.FacetGrid(pd.melt(ps_prob.reset_index(),
                             id_vars='state', value_vars=PP_COLS,
                             var_name='pp_sample', value_name='p'),
                     col='state', col_wrap=3, size=2, aspect=1.5)
grid.map(plt.hist, 'p', bins=30);
grid.set_xlabels("Posterior distribution of support\nfor gay marriage in 2005");

for ax in grid.axes.flat:
    ax.set_xticks(np.linspace(0, 1, 5));
    ax.xaxis.set_major_formatter(pct_formatter);
    plt.setp(ax.get_xticklabels(), visible=True);

grid.set_yticklabels([]);
grid.set_ylabels("Frequency");
grid.fig.tight_layout();
grid.set_titles('{col_name}');
```

Specifying this model in PyMC3 would certainly have been simpler using Bambi, which I intend to learn soon for exactly that reason.

I am particularly eager to see what applications MRP will find outside of political science in the coming years.

This post is available as a Jupyter notebook here.

Tags: PyMC3, Bayesian Statistics

An improved version of this analysis is available here.

(**Author’s note**: many thanks to Robert ([@atlhawksfanatic](https://twitter.com/atlhawksfanatic) on Twitter) for pointing out some subtleties in the data set that I had missed. This post has been revised in line with his feedback. Robert has a very interesting post about how last two minute refereeing has changed over the last three years; I highly recommend you read it.)

I recently found a very interesting data set derived from the NBA’s Last Two Minute Report by Russell Goldenberg of The Pudding. Since 2015, the NBA has released a report reviewing every call and non-call in the final two minutes of every NBA game where the teams were separated by five points or less with two minutes remaining. In this data set, each play has been extracted from the NBA-distributed PDF and augmented with information from Basketball Reference to produce a convenient CSV. The Pudding has published two very interesting visual essays using this data that you should definitely explore.

The NBA is certainly marketed as a star-centric league, so this data set presents a fantastic opportunity to understand the extent to which the players involved in a decision impact whether or not a foul is called. We will also explore other factors related to foul calls.

`%matplotlib inline`

```
import datetime
from warnings import filterwarnings
```

```
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd
import pymc3 as pm
from scipy.special import expit
import seaborn as sns
```

```
blue, green, red, purple, gold, teal = sns.color_palette()
million_dollars_formatter = FuncFormatter(lambda value, _: '${:.1f}M'.format(value / 1e6))
pct_formatter = FuncFormatter(lambda prop, _: "{:.1%}".format(prop))
```

`filterwarnings('ignore', 'findfont')`

We begin by loading the data set from GitHub. For reproducibility, we load the data from the most recent commit as of the time this post was published.

```
DATA_URI = 'https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/1b89b71df060add5538b70d391d7ad82a4c24db2/output/all_games.csv'
raw_df = (pd.read_csv(DATA_URI,
usecols=['committing_player', 'disadvantaged_player',
'committing_team', 'disadvantaged_team',
'seconds_left', 'review_decision', 'date'],
parse_dates=['date'])
.where(lambda df: df.date >= datetime.datetime(2016, 10, 25))
.dropna(subset=['date'])
.drop('date', axis=1))
raw_df['review_decision'] = raw_df.review_decision.fillna("INC")
raw_df = (raw_df.dropna()
.reset_index(drop=True))
```

We restrict our attention to decisions from the 2016-2017 NBA season, for which salary information is readily available from Basketball Reference.

`raw_df.head()`

seconds_left | committing_player | disadvantaged_player | review_decision | disadvantaged_team | committing_team | |
---|---|---|---|---|---|---|

0 | 102.0 | Al-Farouq Aminu | George Hill | CNC | UTA | POR |

1 | 98.0 | Boris Diaw | Damian Lillard | CC | POR | UTA |

2 | 64.0 | Ed Davis | George Hill | CNC | UTA | POR |

3 | 62.0 | Rudy Gobert | CJ McCollum | INC | POR | UTA |

4 | 27.1 | CJ McCollum | Rodney Hood | CC | UTA | POR |

We have only loaded some of the data set’s columns; see the original CSV header for the rest.

The response variable in our analysis is derived from `review_decision`, which contains information about whether the incident was a call or non-call and whether, upon post-game review, the NBA deemed the (non-)call correct or incorrect. Below we show the frequencies of each type of `review_decision`.

```
ax = (raw_df.groupby('review_decision')
.size()
.plot(kind='bar'))
ax.set_ylabel("Frequency");
```

The possible values of `review_decision` are `CC` for correct call, `CNC` for correct non-call, `IC` for incorrect call, and `INC` for incorrect non-call.

While `review_decision` provides information about both whether or not a foul was called and whether or not a foul was actually committed, this analysis will focus only on whether or not a foul was called. Including whether or not a foul was actually committed in this analysis introduces some subtleties that are best left to a future post.

In this dataset, the “committing” player is the one that a foul would be called against, if a foul was called on the play, and the other player is “disadvantaged.”

We now encode the data. Since the committing player on one play may be the disadvantaged player on another play, we `melt` the raw data frame to have one row per player-play combination so that we can encode the players in a way that is consistent across columns.

```
PLAYER_MAP = {
"Jose Juan Barea": "JJ Barea",
"Nene Hilario": "Nene",
"Tim Hardaway": "Tim Hardaway Jr",
"James Ennis": "James Ennis III",
"Kelly Oubre": "Kelly Oubre Jr",
"Taurean Waller-Prince": "Taurean Prince",
"Glenn Robinson": "Glenn Robinson III",
"Otto Porter": "Otto Porter Jr"
}
TEAM_MAP = {
"NKY": "NYK",
"COS": "BOS",
"SAT": "SAS"
}
```

```
long_df = (pd.melt(
(raw_df.reset_index(drop=True)
.rename_axis('play_id')
.reset_index()),
id_vars=['play_id', 'review_decision',
'committing_team', 'disadvantaged_team',
'seconds_left'],
value_vars=['committing_player', 'disadvantaged_player'],
var_name='player', value_name='player_name_')
# fix inconsistent player names
.assign(player_name=lambda df: (df.player_name_
.str.replace('\.', '')
.apply(lambda name: PLAYER_MAP.get(name, name))))
.assign(team_=lambda df: (df.committing_team
.where(df.player == 'committing_player',
df.disadvantaged_team)))
# fix typos in team names
.assign(team=lambda df: df.team_.apply(lambda team: TEAM_MAP.get(team, team)))
.drop(['committing_team', 'disadvantaged_team', 'team_'], axis=1))
long_df['player_id'], player_map = long_df.player_name.factorize()
```

`long_df.head()`

play_id | review_decision | seconds_left | player | player_name_ | player_name | team | player_id | |
---|---|---|---|---|---|---|---|---|

0 | 0 | CNC | 102.0 | committing_player | Al-Farouq Aminu | Al-Farouq Aminu | POR | 0 |

1 | 1 | CC | 98.0 | committing_player | Boris Diaw | Boris Diaw | UTA | 1 |

2 | 2 | CNC | 64.0 | committing_player | Ed Davis | Ed Davis | POR | 2 |

3 | 3 | INC | 62.0 | committing_player | Rudy Gobert | Rudy Gobert | UTA | 3 |

4 | 4 | CC | 27.1 | committing_player | CJ McCollum | CJ McCollum | POR | 4 |
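The reason for melting before encoding is easiest to see on a tiny example. The sketch below uses a hypothetical three-play data frame (`toy_raw`) in which player "B" appears once in each role; after `melt` and `factorize`, "B" receives a single integer identifier no matter which column the name came from.

```python
import pandas as pd

# Hypothetical toy plays; player "B" commits on one play and is
# disadvantaged on another.
toy_raw = pd.DataFrame({
    'committing_player': ['A', 'B', 'C'],
    'disadvantaged_player': ['B', 'C', 'A'],
})

# One row per player-play combination.
toy_long = pd.melt(
    toy_raw.rename_axis('play_id').reset_index(),
    id_vars=['play_id'],
    value_vars=['committing_player', 'disadvantaged_player'],
    var_name='player', value_name='player_name')

# factorize assigns one integer per distinct name, regardless of role.
toy_long['player_id'], toy_player_map = toy_long.player_name.factorize()
```

Factorizing the two original columns separately would instead number the players independently in each column, so the same identifier could refer to different people.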

After encoding, we pivot back to a wide data frame with one row per play.

```
df = (long_df.pivot_table(index=['play_id', 'review_decision', 'seconds_left'],
columns='player', values='player_id')
.rename(columns={
'committing_player': 'committing_id',
'disadvantaged_player': 'disadvantaged_id'
})
.rename_axis('', axis=1)
.reset_index()
.assign(foul_called=lambda df: 1 * (df.review_decision.isin(['CC', 'IC'])))
.drop(['play_id', 'review_decision'],
axis=1))
```

In addition to encoding the players, we have included a column (`foul_called`) that indicates whether or not a foul was called on the play.

`df.head()`

seconds_left | committing_id | disadvantaged_id | foul_called | |
---|---|---|---|---|

0 | 102.0 | 0 | 300 | 0 |

1 | 98.0 | 1 | 124 | 1 |

2 | 64.0 | 2 | 300 | 0 |

3 | 62.0 | 3 | 4 | 0 |

4 | 27.1 | 4 | 6 | 1 |

In order to understand how foul calls vary systematically across players, we will use salary as a proxy for “star power.” The salary data we use was downloaded from Basketball Reference.

```
SALARY_URI = 'http://www.austinrochford.com/resources/nba_irt/2016_2017_salaries.csv'
salary_df = (pd.read_csv(SALARY_URI, skiprows=1,
usecols=['Player', '2016-17'])
.assign(player_name=lambda df: (df.Player
.str.split('\\', expand=True)[0]
.str.replace('\.', '')
# fix inconsistent player names
.apply(lambda name: PLAYER_MAP.get(name, name))),
salary=lambda df: (df['2016-17'].str
.lstrip('$')
.astype(np.float64)))
.assign(log_salary=lambda df: np.log10(df.salary))
.assign(std_log_salary=lambda df: (df.log_salary - df.log_salary.mean()) / df.log_salary.std())
.drop(['Player', '2016-17'], axis=1)
.groupby('player_name')
.max()
.select(lambda name: name in player_map)
.assign(player_id=lambda df: (np.equal
.outer(player_map, df.index)
.argmax(axis=0)))
.reset_index()
.set_index('player_id')
.sort_index())
```

Since NBA salaries span more than two orders of magnitude (LeBron James’ salary is just shy of $31M, while the lowest-paid player made just over $200K), we will use log salaries, standardized to have mean zero and standard deviation one, in our model.

`salary_df.head()`

player_name | salary | log_salary | std_log_salary | |
---|---|---|---|---|

player_id | ||||

0 | Al-Farouq Aminu | 7680965.0 | 6.885416 | 0.848869 |

1 | Boris Diaw | 7000000.0 | 6.845098 | 0.797879 |

2 | Ed Davis | 6666667.0 | 6.823909 | 0.771080 |

3 | Rudy Gobert | 2121288.0 | 6.326600 | 0.142129 |

4 | CJ McCollum | 3219579.0 | 6.507799 | 0.371293 |
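As a small illustration of why the log transform matters before standardizing, the following sketch uses three made-up salaries, each ten times the previous one. The values are hypothetical; the arithmetic follows the `log_salary` and `std_log_salary` columns above (pandas' `.std()` uses the sample standard deviation, hence `ddof=1`).

```python
import numpy as np

# Hypothetical salaries in dollars, each ten times the previous one.
toy_salary = np.array([2e5, 2e6, 2e7])

# Standardizing the raw salaries lets the largest value dominate...
toy_std_salary = (toy_salary - toy_salary.mean()) / toy_salary.std(ddof=1)

# ...while standardizing log10 salaries spaces the three players evenly.
toy_log_salary = np.log10(toy_salary)
toy_std_log_salary = ((toy_log_salary - toy_log_salary.mean())
                      / toy_log_salary.std(ddof=1))
```

On the raw scale the two lowest salaries end up close together, while on the log scale equal ratios of salary become equal differences.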

We also produce a dataframe associating players to teams, along with some useful per-player summaries.

```
team_player_map = (long_df.groupby('team')
.player_id
.apply(pd.Series.drop_duplicates)
.reset_index(level=-1, drop=True)
.reset_index()
.assign(name=lambda df: player_map[df.player_id],
disadvantaged_rate=lambda tpm_df: (df.groupby('disadvantaged_id')
.foul_called
.mean()
.ix[tpm_df.player_id]
.values),
disadvantaged_plays=lambda tpm_df: (df.groupby('disadvantaged_id')
.size()
.ix[tpm_df.player_id]
.values))
.fillna(0))
```

`team_player_map.head()`

team | player_id | disadvantaged_plays | disadvantaged_rate | name | |
---|---|---|---|---|---|

0 | ATL | 114 | 8.0 | 0.000000 | Kyle Korver |

1 | ATL | 115 | 13.0 | 0.538462 | Dwight Howard |

2 | ATL | 116 | 44.0 | 0.272727 | Paul Millsap |

3 | ATL | 117 | 60.0 | 0.283333 | Dennis Schroder |

4 | ATL | 181 | 25.0 | 0.200000 | Kent Bazemore |

Throughout this post, we will develop a series of models for understanding how foul calls vary across players, starting with a simple beta-Bernoulli model and working our way up to a hierarchical item-response theory regression model.

Before building models, we must introduce a bit of notation. The index \(i\) will correspond to a disadvantaged player and the index \(j\) corresponds to a committing player. The index \(k\) corresponds to a play. With this notation \(i(k)\) and \(j(k)\) are the index of the disadvantaged and committing player involved in play \(k\), respectively. The binary variable \(y_k\) indicates whether or not a foul was called on play \(k\). All of our models use the likelihood

\[y_k \sim \textrm{Bernoulli}(p_{i(k), j(k)}).\]

Each model differs in its specification of the probability that a foul is called, \(p_{i, j}\).

One of the simplest possible models for this data focuses only on the disadvantaged player, so \(p_{i, j} = p_i\), and places independent beta priors on each \(p_i\). For simplicity, we begin with uniform priors, \(p_i \sim \textrm{Beta}(1, 1).\)

Even though this model is conjugate, we will use `pymc3` to perform inference with it for consistency with subsequent, non-conjugate models.

```
n_players = player_map.size
disadvantaged_id = df.disadvantaged_id.values
foul_called = df.foul_called.values
obs_rate = foul_called.mean()
```

```
with pm.Model() as bb_model:
    p = pm.Beta('p', 1., 1., shape=n_players)
    y = pm.Bernoulli('y_obs', p[disadvantaged_id],
                     observed=foul_called)
```
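Because this model is conjugate, its posterior is also available in closed form: a \(\textrm{Beta}(1, 1)\) prior combined with \(s\) called fouls in \(n\) plays gives a \(\textrm{Beta}(1 + s, 1 + n - s)\) posterior. The sketch below works this out with NumPy on hypothetical toy arrays (the `toy_` names are not part of the analysis) and could serve as a sanity check on the sampled posterior.

```python
import numpy as np

# Hypothetical toy data: three players, the third never disadvantaged.
toy_disadvantaged_id = np.array([0, 0, 0, 0, 1, 1])
toy_foul_called = np.array([1, 0, 0, 0, 1, 1])
toy_n_players = 3

# Plays and called fouls per disadvantaged player.
n_plays = np.bincount(toy_disadvantaged_id, minlength=toy_n_players)
n_called = np.bincount(toy_disadvantaged_id, weights=toy_foul_called,
                       minlength=toy_n_players)

# Beta(1, 1) prior + Bernoulli likelihood -> Beta(1 + called, 1 + not called)
post_alpha = 1. + n_called
post_beta = 1. + (n_plays - n_called)
post_mean = post_alpha / (post_alpha + post_beta)
```

Note that the player with no plays keeps the prior, so the posterior mean for that player is exactly one half.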

Throughout this post, we will use the no-U-turn sampler for inference, tuning the sampler’s hyperparameters for the first two thousand samples and subsequently keeping the next two thousand samples for inference.

```
N_TUNE = 2000
N_SAMPLES = 2000
SEED = 506421 # from random.org, for reproducibility
```

We now sample from the beta-Bernoulli model.

```
def sample(model, n_tune, n_samples, seed):
    with model:
        full_trace = pm.sample(n_tune + n_samples, tune=n_tune, random_seed=seed)

    return full_trace[n_tune:]
```

`bb_trace = sample(bb_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
2%|▏ | 4793/200000 [00:02<01:47, 1818.84it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,554.3
100%|██████████| 4000/4000 [01:03<00:00, 63.38it/s]
```

We use energy plots to diagnose possible problems with our samples.

```
def energy_plot(trace):
    energy = trace['energy']
    energy_diff = np.diff(energy)

    fig, ax = plt.subplots(figsize=(8, 6))

    ax.hist(energy - energy.mean(), bins=30,
            lw=0, alpha=0.5,
            label="Energy")
    ax.hist(energy_diff, bins=30,
            lw=0, alpha=0.5,
            label="Energy difference")

    ax.set_xticks([])
    ax.set_yticks([])

    ax.legend()
```

`energy_plot(bb_trace)`

Since the energy and energy difference distributions are quite similar, there is no indication from this plot of sampling issues. For an in-depth treatment of Hamiltonian Monte Carlo algorithms and convergence diagnostics, consult Michael Betancourt’s excellent paper *A Conceptual Introduction to Hamiltonian Monte Carlo*.
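The visual comparison of the two histograms can be complemented with a single number, the energy Bayesian fraction of missing information (E-BFMI) discussed in Betancourt's paper. The helper below is a minimal sketch that is not part of the original analysis: `e_bfmi` is a hypothetical function name, and the estimator used is the mean squared energy difference divided by the variance of the energy. The two synthetic energy series are made up purely to show how the number behaves.

```python
import numpy as np

def e_bfmi(energy):
    """Estimate E-BFMI for a single chain of energy values."""
    energy = np.asarray(energy, dtype=float)
    # Mean squared change in energy between successive samples, relative
    # to the overall variance of the energy.
    return np.square(np.diff(energy)).mean() / energy.var()

rng = np.random.RandomState(0)
# Synthetic energies that change freely from one sample to the next...
free_energy = rng.normal(size=1000)
# ...and synthetic energies that drift slowly, so that each transition
# only explores a small part of the overall energy distribution.
slow_energy = np.cumsum(rng.normal(size=1000))

free_value = e_bfmi(free_energy)
slow_value = e_bfmi(slow_energy)
```

Small values indicate that the energy differences are much narrower than the energy distribution itself, which is the situation the energy plot is designed to reveal; a commonly cited rule of thumb treats values below roughly 0.3 as a warning sign.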

We will use the widely applicable information criterion (WAIC) and binned residuals to check and compare our models. WAIC is a Bayesian measure of out-of-sample predictive accuracy based on in-sample data that is quite closely related to [leave-one-out cross-validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics%29#Leave-one-out_cross-validation). It attempts to improve upon known shortcomings of the widely-used deviance information criterion (DIC). (See *Understanding predictive information criteria for Bayesian models* for a review and comparison of various information criteria, including DIC and WAIC.) WAIC is easy to calculate with `pymc3`.

```
def get_waic_df(model, trace, name):
    with model:
        waic = pm.waic(trace)

    return pd.DataFrame.from_records([waic], index=[name], columns=waic._fields)
```

`waic_df = get_waic_df(bb_model, bb_trace, "Beta-Bernoulli")`

```
/opt/conda/lib/python3.5/site-packages/pymc3/stats.py:145: UserWarning: For one or more samples the posterior variance of the
log predictive densities exceeds 0.4. This could be indication of
WAIC starting to fail see http://arxiv.org/abs/1507.04544 for details
""")
```

We see that the WAIC calculation indicates difficulties with the beta-Bernoulli model, which we will soon confirm.

`waic_df`

WAIC | WAIC_se | p_WAIC | |
---|---|---|---|

Beta-Bernoulli | 6021.491064 | 66.23822 | 238.488377 |

In addition to the WAIC value, we get an estimate of its standard error (`WAIC_se`) and the number of effective parameters in the model (`p_WAIC`). The number of effective parameters is an indication of model complexity.
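To show where these three quantities come from, here is a rough sketch that computes them from a matrix of pointwise log-likelihoods, following the formulas in the review paper cited above. The function `waic_from_log_lik` and the toy arrays are hypothetical, and the details (for example, the variance convention) may differ slightly from what `pm.waic` does internally.

```python
import numpy as np

def waic_from_log_lik(log_lik):
    """WAIC, its standard error, and p_WAIC from an array of shape
    (n_samples, n_observations) of pointwise log-likelihoods."""
    log_lik = np.asarray(log_lik, dtype=float)
    n_obs = log_lik.shape[1]

    # Log pointwise predictive density, computed stably via log-sum-exp.
    max_ll = log_lik.max(axis=0)
    lppd_i = max_ll + np.log(np.exp(log_lik - max_ll).mean(axis=0))

    # Effective number of parameters: posterior variance of the
    # log-likelihood of each observation.
    p_waic_i = log_lik.var(axis=0, ddof=1)

    waic_i = -2. * (lppd_i - p_waic_i)
    waic_se = np.sqrt(n_obs * waic_i.var())

    return waic_i.sum(), waic_se, p_waic_i.sum()

# Hypothetical posterior samples of p for two Bernoulli observations.
toy_p_samples = np.array([[0.2, 0.7],
                          [0.3, 0.6],
                          [0.25, 0.8]])
toy_y = np.array([0, 1])
toy_log_lik = (toy_y * np.log(toy_p_samples)
               + (1 - toy_y) * np.log(1 - toy_p_samples))

toy_waic, toy_waic_se, toy_p_waic = waic_from_log_lik(toy_log_lik)
```

The penalty term grows with the posterior uncertainty about each observation's likelihood, which is why a model with many weakly constrained parameters has a large effective number of parameters.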

The second diagnostic tool we use on our models is binned residuals, which show how well-calibrated the model’s predicted probabilities are. Intuitively, if our model predicts that an event has a 35% chance of occurring and we can observe many repetitions of that event, we would expect the event to actually occur about 35% of the time. If the observed occurrences of the event differ substantially from the predicted rate, we have reason to doubt the quality of our model. Since we generally can’t observe each event many times, we instead group events into bins by their predicted probability and check that the average predicted probability in each bin is close to the rate at which the events in that bin are observed.
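The binning idea can be sketched on a handful of made-up point predictions before applying it to full posterior samples. Everything named `toy_` below is hypothetical, and `np.digitize` is used here as one simple way to assign predictions to bins.

```python
import numpy as np

# Hypothetical predicted probabilities and observed binary outcomes.
toy_p = np.array([0.05, 0.08, 0.31, 0.35, 0.38, 0.39])
toy_y = np.array([0, 0, 1, 0, 0, 1])
toy_bins = np.linspace(0, 1, 11)

# right=True assigns each prediction to a bin of the form (low, high].
bin_ix = np.digitize(toy_p, toy_bins, right=True) - 1

# Average prediction and observed rate within each occupied bin.
occupied = np.unique(bin_ix)
toy_bin_p = np.array([toy_p[bin_ix == b].mean() for b in occupied])
toy_bin_obs = np.array([toy_y[bin_ix == b].mean() for b in occupied])

# A well-calibrated model has residuals close to zero in every bin.
toy_resid = toy_bin_obs - toy_bin_p
```

With so few events per bin the residuals are noisy; the diagnostic becomes informative when each bin contains many events.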

The binned predictions and residuals for the beta-Bernoulli model are shown below.

```
BINS = np.linspace(0, 1, 11)
BINS_3D = BINS[np.newaxis, np.newaxis]

def binned_residuals(y, p):
    p_3d = p[..., np.newaxis]

    in_bin = (BINS_3D[..., :-1] < p_3d) & (p_3d <= BINS_3D[..., 1:])
    bin_counts = in_bin.sum(axis=(0, 1))

    p_mean = (in_bin * p_3d).sum(axis=(0, 1)) / bin_counts
    y_mean = (in_bin * y[np.newaxis, :, np.newaxis]).sum(axis=(0, 1)) / bin_counts

    return y_mean, p_mean, bin_counts

def binned_residual_plot(bin_obs, bin_p, bin_counts):
    fig, (ax, resid_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))

    ax.scatter(bin_p, bin_obs,
               s=300 * np.sqrt(bin_counts / bin_counts.sum()),
               zorder=5)
    ax.plot([0, 1], [0, 1], '--', c='k')

    ax.set_xlim(0, 1)
    ax.set_xticks(np.linspace(0, 1, 5))
    ax.xaxis.set_major_formatter(pct_formatter)
    ax.set_xlabel("Mean $p$ (binned)")

    ax.set_ylim(0, 1)
    ax.set_yticks(np.linspace(0, 1, 5))
    ax.yaxis.set_major_formatter(pct_formatter)
    ax.set_ylabel("Observed rate (binned)")

    resid_ax.scatter(bin_p, bin_obs - bin_p,
                     s=300 * np.sqrt(bin_counts / bin_counts.sum()),
                     zorder=5)
    resid_ax.hlines(0, 0, 1, 'k', '--')

    resid_ax.set_xlim(0, 1)
    resid_ax.set_xticks(np.linspace(0, 1, 5))
    resid_ax.xaxis.set_major_formatter(pct_formatter)
    resid_ax.set_xlabel("Mean $p$ (binned)")

    resid_ax.yaxis.set_major_formatter(pct_formatter)
    resid_ax.set_ylabel("Residual (binned)")
```

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, bb_trace['p'][:, disadvantaged_id])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

In these plots, the dashed black lines show how these quantities would be related for a perfect model. The size of each point increases with the number of events whose predicted probability fell in the relevant bin. From these plots, we get further confirmation that our simple beta-Bernoulli model is quite unsatisfactory, as many binned residuals exceed 5% in absolute value.

Below we plot the posterior mean and 90% credible interval for \(p\) for each player in the data set (grouped by team, for legibility), along with the player’s observed foul called percentage when disadvantaged. The size of the point for observed foul called percentage increases with the number of plays in which the player was disadvantaged.

```
def to_param_df(player_df, trace, varnames):
    df = player_df

    for name in varnames:
        mean = trace[name].mean(axis=0)
        low, high = np.percentile(trace[name], [5, 95], axis=0)

        df = df.assign(**{
            '{}_mean'.format(name): mean[df.player_id],
            '{}_low'.format(name): low[df.player_id],
            '{}_high'.format(name): high[df.player_id]
        })

    return df
```

`bb_df = to_param_df(team_player_map, bb_trace, ['p'])`

```
def plot_params(mean, interval, names, ax=None, **kwargs):
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    n_players = names.size

    ax.errorbar(mean, np.arange(n_players),
                xerr=np.abs(mean - interval),
                fmt='o',
                label="Mean with\n90% interval")

    ax.set_ylim(-1, n_players)
    ax.set_yticks(np.arange(n_players))
    ax.set_yticklabels(names)

    return ax

def plot_p_params(rate, n_plays, league_mean, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()

    n_players = rate.size

    ax.scatter(rate, np.arange(n_players),
               c='k', s=20 * np.sqrt(n_plays),
               alpha=0.5, zorder=5,
               label="Observed")
    ax.vlines(league_mean, -1, n_players,
              'k', '--',
              label="League average")

def plot_p_helper(mean, low, high, rate, n_plays, names, league_mean=None, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()

    mean = mean.values
    rate = rate.values
    n_plays = n_plays.values
    names = names.values

    argsorted_ix = mean.argsort()
    interval = np.row_stack([low, high])

    plot_params(mean[argsorted_ix], interval[:, argsorted_ix], names[argsorted_ix],
                ax=ax, **kwargs)
    plot_p_params(rate[argsorted_ix], n_plays[argsorted_ix], league_mean,
                  ax=ax, **kwargs)
```

```
grid = sns.FacetGrid(bb_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_p_helper,
         'p_mean', 'p_low', 'p_high',
         'disadvantaged_rate', 'disadvantaged_plays', 'name',
         league_mean=obs_rate);

grid.set_axis_labels(r"$p$", "Player");

for ax in grid.axes:
    ax.set_xticks(np.linspace(0, 1, 5));
    ax.set_xticklabels(ax.get_xticklabels(), visible=True)
    ax.xaxis.set_major_formatter(pct_formatter);

grid.fig.tight_layout();
grid.add_legend();

These plots reveal an undesirable property of this model and its inferences. Since the prior distribution on \(p_i\) is uniform on the interval \([0, 1]\), all posterior estimates of \(p_i\) are pulled towards the prior expected value of 50%. This phenomenon is known as shrinkage. In the extreme case of players that were never disadvantaged, the posterior estimate of \(p_i\) is quite close to 50%. For these players, the league average foul call rate would seem to be a much more reasonable estimate of \(p_i\) than 50%. The league average foul call rate is shown as a dashed black line on the charts above.

There are several possible modifications of the beta-Bernoulli model that can cause shrinkage toward the league average. Perhaps the most straightforward is the empirical Bayesian method that sets the parameters of the prior distribution on \(p_i\) using the observed data. In this framework, there are many methods of choosing prior hyperparameters that make the prior expected value equal to the league average, therefore causing shrinkage toward the league average. We do not use empirical Bayesian methods in this post as they make it cumbersome to build the more complex models we want to use to understand the relationship between salary and foul calls. Empirical Bayesian methods are, however, an approximation to the fully hierarchical models we begin building in the next section.
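As a rough illustration of how the choice of prior changes the shrinkage target, the sketch below compares posterior means under the uniform prior with those under a beta prior whose mean equals a league average. The league rate, the prior strength, and all `toy_` values are hypothetical, and fixing the prior strength by hand is only one of the many possible empirical Bayesian choices mentioned above.

```python
import numpy as np

def beta_posterior_mean(a, b, n_called, n_plays):
    """Posterior mean of p under a Beta(a, b) prior and Bernoulli data."""
    return (a + n_called) / (a + b + n_plays)

toy_league_rate = 0.25  # hypothetical league average foul call rate
toy_strength = 4.       # hypothetical prior weight, in pseudo-plays (a + b)

# Beta prior with mean equal to the league rate.
a_eb = toy_league_rate * toy_strength
b_eb = (1. - toy_league_rate) * toy_strength

# Two hypothetical players: one never disadvantaged, one with
# fouls called on all four plays.
toy_n_called = np.array([0., 4.])
toy_n_plays = np.array([0., 4.])

uniform_mean = beta_posterior_mean(1., 1., toy_n_called, toy_n_plays)
eb_mean = beta_posterior_mean(a_eb, b_eb, toy_n_called, toy_n_plays)
```

Under the uniform prior the player with no plays is estimated at one half, while under the league-centered prior the same player is estimated at the league rate.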

A hierarchical logistic-normal model addresses some of the shortcomings of the beta-Bernoulli model. For simplicity, this model focuses exclusively on the disadvantaged player and assumes that the log-odds of a foul call for a given disadvantaged player are normally distributed. That is,

\[ \begin{align*} \log \left(\frac{p_i}{1 - p_i}\right) & \sim N(\mu, \sigma^2) \\ \eta_{k} & = \log \left(\frac{p_{i(k)}}{1 - p_{i(k)}}\right), \end{align*}\]

which is equivalent to

\[p_{i(k)} = \frac{1}{1 + \exp(-\eta_k)}.\]

We address the beta-Bernoulli model’s shrinkage problem by placing a normal hyperprior distribution on \(\mu\), \(\mu \sim N(0, 100).\) This shared hyperprior makes this model hierarchical. To complete the specification of this model, we place a half-Cauchy prior on \(\sigma\), \(\sigma \sim \textrm{HalfCauchy}(2.5)\).

```
with pm.Model() as ln_model:
    μ = pm.Normal('μ', 0., 10.)

    Δ = pm.Normal('Δ', 0., 1., shape=n_players)
    σ = pm.HalfCauchy('σ', 2.5)

    p_player = pm.Deterministic('p_player', pm.math.sigmoid(μ + Δ * σ))

    η = μ + Δ[disadvantaged_id] * σ
    p = pm.Deterministic('p', pm.math.sigmoid(η))

    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

Throughout this post we use an offset parameterization for hierarchical models that significantly improves sampling efficiency. We now sample from this model.
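The offset parameterization relies on the fact that if \(\Delta_i \sim N(0, 1)\), then \(\mu + \Delta_i \sigma \sim N(\mu, \sigma^2)\), so the model is unchanged while the sampler works with standard normal variables. The following sketch checks this numerically with NumPy for fixed, hypothetical values of \(\mu\) and \(\sigma\); it is only an illustration of the identity, not of the sampling behavior.

```python
import numpy as np

rng = np.random.RandomState(0)

# Hypothetical fixed values for the hyperparameters.
mu, sigma = -1.1, 0.5

# Offset form: draw standard normal offsets, then shift and scale them.
delta = rng.normal(0., 1., size=100000)
theta = mu + delta * sigma

# The shifted and scaled offsets have (approximately) the mean and
# standard deviation of draws taken directly from N(mu, sigma^2).
theta_mean = theta.mean()
theta_sd = theta.std()
```

In the model above, `Δ` plays the role of `delta`, and the player-level log-odds are `μ + Δ * σ`.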

`ln_trace = sample(ln_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
9%|▊ | 17001/200000 [00:12<02:13, 1372.87it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,520.3
100%|██████████| 4000/4000 [02:27<00:00, 65.27it/s]
```

The energy plot for this model gives no cause for concern.

`energy_plot(ln_trace)`

We now calculate the WAIC of the logistic-normal model, and compare it to that of the beta-Bernoulli model.

`waic_df = waic_df.append(get_waic_df(ln_model, ln_trace, "Logistic-normal"))`

```
def waic_plot(waic_df):
    fig, (waic_ax, p_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))

    waic_x = np.arange(waic_df.shape[0])

    waic_ax.errorbar(waic_x, waic_df.WAIC,
                     yerr=waic_df.WAIC_se,
                     fmt='o')

    waic_ax.set_xticks(waic_x)
    waic_ax.xaxis.grid(False)
    waic_ax.set_xticklabels(waic_df.index)
    waic_ax.set_xlabel("Model")
    waic_ax.set_ylabel("WAIC")

    p_ax.bar(waic_x, waic_df.p_WAIC)

    p_ax.xaxis.grid(False)
    p_ax.set_xlabel("Model")
    p_ax.set_ylabel("Effective number\nof parameters")
```

`waic_plot(waic_df)`

The left-hand plot above shows that the logistic-normal model is a significant improvement in WAIC over the beta-Bernoulli model, which is unsurprising. The right-hand plot shows that the logistic-normal model has roughly 20% as many effective parameters as the beta-Bernoulli model. This reduction is due to the partial pooling effect of the hierarchical prior. The hyperprior on \(\mu\) causes observations for one player to impact the estimate of \(p_i\) for all players; this sharing of information across players is responsible for the large decrease in the number of effective parameters.

Finally, we examine the binned residuals for the logistic-normal model.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, ln_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

These binned residuals are much smaller than those of the beta-Bernoulli model, which is further confirmation that the logistic-normal model is preferable.

Below we plot the posterior distribution of \(\operatorname{logit}^{-1}(\mu)\), and we see that the observed foul call rate of approximately 25.1% lies within its 90% interval.

```
ax, = pm.plot_posterior(ln_trace, ['μ'],
alpha_level=0.1, transform=expit, ref_val=obs_rate,
lw=0., alpha=0.75)
ax.xaxis.set_major_formatter(pct_formatter);
ax.set_title(r"$\operatorname{logit}^{-1}(\mu)$");
```

With this posterior for \(\operatorname{logit}^{-1}(\mu)\), we see the desired posterior shrinkage of each \(p_i\) toward the observed foul call rate.

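In the logistic-normal model, `p` has one entry per play, while `to_param_df` indexes its summaries by player, so we summarize the per-player probabilities stored in `p_player`.

`ln_df = to_param_df(team_player_map, {'p': ln_trace['p_player']}, ['p'])`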

```
grid = sns.FacetGrid(ln_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_p_helper,
         'p_mean', 'p_low', 'p_high',
         'disadvantaged_rate', 'disadvantaged_plays', 'name',
         league_mean=obs_rate);

grid.set_axis_labels(r"$p$", "Player");

for ax in grid.axes:
    ax.set_xticks(np.linspace(0, 1, 5));
    ax.set_xticklabels(ax.get_xticklabels(), visible=True)
    ax.xaxis.set_major_formatter(pct_formatter);

grid.fig.tight_layout();
grid.add_legend();

The inferences in these plots are markedly different from those of the beta-Bernoulli model. Most strikingly, we see that estimates have been shrunk towards the league average foul call rate, and that players that were never disadvantaged have posterior foul call probabilities quite close to that rate. As a consequence of this more reasonable shrinkage, the range of values taken by the posterior \(p_i\) estimates is much smaller for the logistic-normal model than for the beta-Bernoulli model. Below we plot the top- and bottom-ten players by \(p_i\).

```
fig, (top_ax, bottom_ax) = plt.subplots(nrows=2, sharex=True, figsize=(8, 10))
by_p = (ln_df.drop_duplicates(['player_id'])
.sort_values('p_mean'))
p_top = by_p.iloc[-10:]
plot_params(p_top.p_mean.values, p_top[['p_low', 'p_high']].values.T,
p_top.name.values,
ax=top_ax);
top_ax.vlines(obs_rate, -1, 10,
'k', '--',
label=r"League average");
top_ax.xaxis.set_major_formatter(pct_formatter);
top_ax.set_ylabel("Player");
top_ax.set_title("Top ten");
p_bottom = by_p.iloc[:10]
plot_params(p_bottom.p_mean.values, p_bottom[['p_low', 'p_high']].values.T,
p_bottom.name.values,
ax=bottom_ax);
bottom_ax.vlines(obs_rate, -1, 10,
'k', '--',
label=r"League average");
bottom_ax.xaxis.set_major_formatter(pct_formatter);
bottom_ax.set_xlabel(r"$p$");
bottom_ax.set_ylabel("Player");
fig.tight_layout();
bottom_ax.legend(loc=1);
bottom_ax.set_title("Bottom ten");
```

The hierarchical logistic-normal model is certainly an improvement over the beta-Bernoulli model, but both of these models have focused solely on the disadvantaged player. It seems quite important to understand the contribution of not just the disadvantaged player, but also of the committing player in each play to the probability of a foul call. Item-response theory (IRT) provides generalizations of the logistic-normal model that can account for the influence of both players involved in a play. IRT originated in psychometrics as a way to simultaneously measure individual aptitude and question difficulty based on test-response data, and has subsequently found many other applications. We use IRT to model foul calls by considering disadvantaged players as analogous to test takers and committing players as analogous to questions. Specifically, we will use the Rasch model for the probability \(p_{i, j}\) that a foul is called on a play where player \(i\) is disadvantaged by committing player \(j\). This model posits that each player has a latent ability, \(\theta_i\), that governs how often fouls are called when they are disadvantaged and a latent difficulty, \(b_j\), that governs how often fouls are not called when they are committing. The probability that a foul is called on a play where player \(i\) is disadvantaged and player \(j\) is committing is then a function of the difference between the corresponding latent ability and difficulty parameters,

\[ \begin{align*} \eta_k & = \theta_{i(k)} - b_{j(k)} \\ p_k & = \frac{1}{1 + \exp(-\eta_k)}. \end{align*} \]

In this model, a player with a large value of \(\theta_i\) is more likely to get a foul called when they are disadvantaged, and a player with a large value of \(b_j\) is less likely to have a foul called when they are committing. If \(\theta_{i(k)} = b_{j(k)}\), there is a 50% chance a foul is called on that play.
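As a minimal sketch of the relationship just described, the following plain-Python function computes the foul call probability from a single ability and difficulty. The function name `rasch_probability` and the example values are illustrative assumptions, not part of the analysis code.

```python
import math

def rasch_probability(theta, b):
    """Probability of a foul call given latent ability theta and difficulty b."""
    eta = theta - b  # log-odds of a foul call
    return 1.0 / (1.0 + math.exp(-eta))

# Equal ability and difficulty give even odds of a foul call.
print(rasch_probability(0.3, 0.3))

# A larger theta raises the probability; a larger b lowers it.
print(rasch_probability(1.0, 0.0) > rasch_probability(0.0, 0.0) > rasch_probability(0.0, 1.0))
```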

To complete the specification of this model, we place priors on \(\theta_i\) and \(b_j\). Similarly to \(\eta\) in the logistic-normal model, we place a hierarchical normal prior on \(\theta_i\),

\[ \begin{align*} \mu_{\theta} & \sim N(0, 100) \\ \sigma_{\theta} & \sim \textrm{HalfCauchy}(2.5) \\ \theta_i & \sim N(\mu_{\theta}, \sigma^2_{\theta}). \end{align*} \]

```
with pm.Model() as rasch_model:
    μ_θ = pm.Normal('μ_θ', 0., 10.)
    Δ_θ = pm.Normal('Δ_θ', 0., 1., shape=n_players)
    σ_θ = pm.HalfCauchy('σ_θ', 2.5)
    θ = pm.Deterministic('θ', μ_θ + Δ_θ * σ_θ)
```
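Note that the code builds \(\theta\) as \(\mu_{\theta} + \Delta_{\theta} \cdot \sigma_{\theta}\) with standard normal \(\Delta_{\theta}\), rather than sampling \(\theta\) directly; this is often called a non-centered parameterization and yields the same \(N(\mu_{\theta}, \sigma^2_{\theta})\) distribution. The simulation below is only a rough check of that equivalence, with a hypothetical helper name and arbitrary fixed values standing in for \(\mu_{\theta}\) and \(\sigma_{\theta}\).

```python
import random
import statistics

def noncentered_draws(mu, sigma, n, seed=0):
    """Draw n values of mu + sigma * delta, with delta ~ N(0, 1)."""
    rng = random.Random(seed)
    return [mu + sigma * rng.gauss(0.0, 1.0) for _ in range(n)]

# Arbitrary illustrative values for the location and scale.
draws = noncentered_draws(mu=-1.0, sigma=2.0, n=20000)

# The sample mean and standard deviation should be close to mu and sigma.
print(statistics.mean(draws), statistics.stdev(draws))
```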

We also place a hierarchical normal prior on \(b_j\), though this prior must be subtly different from that on \(\theta_i\). Since \(\theta_i\) and \(b_j\) are latent variables, there is no natural scale on which they should be measured. If every \(\theta_i\) and \(b_j\) is shifted by the same amount, say \(\delta\), the likelihood does not change. That is, if \(\tilde{\theta}_i = \theta_i + \delta\) and \(\tilde{b}_j = b_j + \delta\), then

\[ \tilde{\eta}_{i, j} = \tilde{\theta}_i - \tilde{b}_j = \theta_i + \delta - (b_j + \delta) = \theta_i - b_j = \eta_{i, j}. \]

Therefore, if we allow \(\theta_i\) and \(b_j\) to be shifted by arbitrary amounts, the Rasch model is not identified. We identify the Rasch model by constraining the mean of the hyperprior on \(b_j\) to be zero,

\[ \begin{align*} \sigma_b & \sim \textrm{HalfCauchy}(2.5) \\ b_j & \sim N(0, \sigma^2_b). \end{align*} \]

```
with rasch_model:
    Δ_b = pm.Normal('Δ_b', 0., 1., shape=n_players)
    σ_b = pm.HalfCauchy('σ_b', 2.5)
    b = pm.Deterministic('b', Δ_b * σ_b)
```
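The shift invariance behind this identification constraint can be checked numerically. The sketch below uses made-up parameter values and a hypothetical `recenter` helper that shifts both parameter sets so the \(b\) values average to zero; it illustrates the invariance of the likelihood terms, not the zero-mean hyperprior itself.

```python
import math

def foul_probability(theta, b):
    """Rasch probability of a foul call."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def recenter(thetas, bs):
    """Shift both parameter sets by the same amount so the b values average to zero."""
    shift = sum(bs) / len(bs)
    return [t - shift for t in thetas], [b - shift for b in bs]

# Made-up latent parameters for three players.
thetas = [0.4, -0.2, 1.1]
bs = [2.0, 3.5, 2.6]
new_thetas, new_bs = recenter(thetas, bs)

# Every pairwise probability is unchanged by the common shift.
unchanged = all(
    abs(foul_probability(t, b) - foul_probability(nt, nb)) < 1e-12
    for t, nt in zip(thetas, new_thetas)
    for b, nb in zip(bs, new_bs)
)
print(unchanged)
```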

We now specify the Rasch model’s likelihood and sample from it.

`committing_id = df.committing_id.values`

```
with rasch_model:
    η = θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

`rasch_trace = sample(rasch_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -3,729.5: 11%|█▏ | 22583/200000 [00:19<02:33, 1156.17it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,037.1
100%|██████████| 4000/4000 [02:01<00:00, 32.81it/s]
```

Again, the energy plot for this model gives no cause for concern.

`energy_plot(rasch_trace)`

Below we show the WAIC of our three models.

`waic_df = waic_df.append(get_waic_df(rasch_model, rasch_trace, "Rasch"))`

`waic_plot(waic_df)`

The Rasch model represents a moderate WAIC improvement over the logistic-normal model, and unsurprisingly has many more effective parameters (since it added a nominal parameter, \(b_j\), per player).

The Rasch model also has reasonable binned residuals, with very few events having residuals above 5%.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, rasch_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

For the Rasch model (and subsequent models), we switch from visualizing the per-player call probabilities to the latent parameters \(\theta_i\) and \(b_j\).

```
μ_θ_mean = rasch_trace['μ_θ'].mean()
rasch_df = to_param_df(team_player_map, rasch_trace, ['θ', 'b'])
```

```
def plot_params_helper(mean, low, high, names, league_mean=None, league_mean_name=None, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()
    mean = mean.values
    names = names.values
    argsorted_ix = mean.argsort()
    interval = np.row_stack([low, high])
    plot_params(mean[argsorted_ix], interval[:, argsorted_ix], names[argsorted_ix],
                ax=ax, **kwargs)
    if league_mean is not None:
        ax.vlines(league_mean, -1, names.size,
                  'k', '--',
                  label=league_mean_name)
```

```
grid = sns.FacetGrid(rasch_df, col='team', col_wrap=2,
sharey=False,
size=4, aspect=1.5)
grid.map(plot_params_helper,
'θ_mean', 'θ_low', 'θ_high', 'name',
league_mean=μ_θ_mean,
league_mean_name=r"$\mu_{\theta}$");
grid.set_axis_labels(r"$\theta$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

```
grid = sns.FacetGrid(rasch_df, col='team', col_wrap=2,
sharey=False,
size=4, aspect=1.5)
grid.map(plot_params_helper,
'b_mean', 'b_low', 'b_high', 'name',
league_mean=0.);
grid.set_axis_labels(r"$b$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

Though these plots are voluminous and therefore difficult to interpret precisely, a few trends are evident. The first is that there is more variation in the committing skill (\(b_j\)) than in disadvantaged skill (\(\theta_i\)). This difference is confirmed in the following histograms of the posterior expected values of \(\theta_i\) and \(b_j\).

```
def plot_latent_distributions(θ, b):
    fig, (θ_ax, b_ax) = plt.subplots(nrows=2, sharex=True, figsize=(8, 6))
    bins = np.linspace(0.9 * min(θ.min(), b.min()),
                       1.1 * max(θ.max(), b.max()),
                       75)
    θ_ax.hist(θ, bins=bins,
              alpha=0.75)
    θ_ax.xaxis.set_label_position('top')
    θ_ax.set_xlabel(r"Posterior expected $\theta$")
    θ_ax.set_yticks([])
    θ_ax.set_ylabel("Frequency")
    b_ax.hist(b, bins=bins,
              color=green, alpha=0.75)
    b_ax.xaxis.tick_top()
    b_ax.set_xlabel(r"Posterior expected $b$")
    b_ax.set_yticks([])
    b_ax.invert_yaxis()
    b_ax.set_ylabel("Frequency")
    fig.tight_layout()
```

`plot_latent_distributions(rasch_df.θ_mean, rasch_df.b_mean)`

The following plots show the top and bottom ten players in terms of both \(\theta_i\) and \(b_j\).

```
def top_10_plot(trace_df, μ_θ=0):
    fig = plt.figure(figsize=(16, 10))
    θ_top_ax = fig.add_subplot(221)
    b_top_ax = fig.add_subplot(222)
    θ_bottom_ax = fig.add_subplot(223, sharex=θ_top_ax)
    b_bottom_ax = fig.add_subplot(224, sharex=b_top_ax)
    # necessary for players that have been on more than one team
    trace_df = trace_df.drop_duplicates(['player_id'])
    by_θ = trace_df.sort_values('θ_mean')
    θ_top = by_θ.iloc[-10:]
    θ_bottom = by_θ.iloc[:10]
    plot_params(θ_top.θ_mean.values, θ_top[['θ_low', 'θ_high']].values.T,
                θ_top.name.values,
                ax=θ_top_ax)
    θ_top_ax.vlines(μ_θ, -1, 10,
                    'k', '--',
                    label=(r"$\mu_{\theta}$" if μ_θ != 0 else "League average"))
    plt.setp(θ_top_ax.get_xticklabels(), visible=False)
    θ_top_ax.set_ylabel("Player")
    θ_top_ax.legend(loc=2)
    θ_top_ax.set_title("Top ten")
    plot_params(θ_bottom.θ_mean.values, θ_bottom[['θ_low', 'θ_high']].values.T,
                θ_bottom.name.values,
                ax=θ_bottom_ax)
    θ_bottom_ax.vlines(μ_θ, -1, 10,
                       'k', '--',
                       label=(r"$\mu_{\theta}$" if μ_θ != 0 else "League average"))
    θ_bottom_ax.set_xlabel(r"$\theta$")
    θ_bottom_ax.set_ylabel("Player")
    θ_bottom_ax.set_title("Bottom ten")
    by_b = trace_df.sort_values('b_mean')
    b_top = by_b.iloc[-10:]
    b_bottom = by_b.iloc[:10]
    plot_params(b_top.b_mean.values, b_top[['b_low', 'b_high']].values.T,
                b_top.name.values,
                ax=b_top_ax)
    b_top_ax.vlines(0, -1, 10,
                    'k', '--',
                    label="League average");
    plt.setp(b_top_ax.get_xticklabels(), visible=False)
    b_top_ax.legend(loc=2)
    b_top_ax.set_title("Top ten")
    plot_params(b_bottom.b_mean.values, b_bottom[['b_low', 'b_high']].values.T,
                b_bottom.name.values,
                ax=b_bottom_ax)
    b_bottom_ax.vlines(0, -1, 10,
                       'k', '--')
    b_bottom_ax.set_xlabel(r"$b$")
    b_bottom_ax.set_title("Bottom ten")
    fig.tight_layout()
```

`top_10_plot(rasch_df, μ_θ=μ_θ_mean)`

We focus first on \(\theta_i\). Interestingly, the top-ten players for the Rasch model contain many more top-tier stars than those for the logistic-normal model, including John Wall, Russell Westbrook, and LeBron James. Turning to \(b\), it is interesting that while the top and bottom ten players contain many recognizable names (LaMarcus Aldridge, Harrison Barnes, Kawhi Leonard, and Ricky Rubio), the only truly top-tier player present is Anthony Davis.

As basketball fans know, there are many factors other than the players involved that influence foul calls. Very often, sufficiently close NBA games end with intentional fouls, as the losing team attempts to stop the clock and force another offensive possession. Therefore, we expect to see an increase in the foul call probability as the game nears its conclusion.

```
n_sec = 121
sec = (df.seconds_left
.round(0)
.values
.astype(np.int64))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
(df.groupby(sec)
.foul_called
.mean()
.plot(c='k', label="Observed foul call rate", ax=ax));
ax.set_xticks(np.linspace(0, 120, 5));
ax.invert_xaxis();
ax.set_xlabel("Seconds remaining in game");
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Probability foul is called");
ax.legend(loc=2);
```

The plot above confirms this expectation, which we can use to improve our latent skill model. If \(t \in \{0, 1, \ldots, 120\}\) is the number of seconds remaining in the game, we model the latent contribution of \(t\) to the log-odds that a foul is called with a Gaussian random walk,

\[ \begin{align*} \lambda_0 & \sim N(0, 100) \\ \lambda_t & \sim N(\lambda_{t - 1}, \tau^{-1}_{\lambda}) \\ \tau_{\lambda} & \sim \textrm{Exp}(10^{-4}). \end{align*} \]

This prior allows us to flexibly model the shape of the curve shown above. If \(t(k)\) is the number of seconds remaining during the \(k\)-th play, we incorporate \(\lambda_{t(k)}\) into our model with

\[\eta_k = \lambda_{t(k)} + \theta_{i(k)} - b_{j(k)}.\]

This model is not identified until we constrain the mean of \(\theta\) to be zero, for reasons similar to those discussed above for the Rasch model.
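To make the random walk prior and the new linear predictor concrete, here is a small simulation sketch in plain Python. The helper names are hypothetical, and the precision value `tau=100.0` is an arbitrary choice for illustration rather than a draw from the prior on \(\tau_{\lambda}\).

```python
import math
import random

def simulate_random_walk(n_steps, init_sd, tau, seed=0):
    """Simulate lambda_0 ~ N(0, init_sd^2) and lambda_t ~ N(lambda_{t-1}, 1 / tau)."""
    rng = random.Random(seed)
    step_sd = 1.0 / math.sqrt(tau)  # tau is a precision, so the step sd is tau^(-1/2)
    walk = [rng.gauss(0.0, init_sd)]
    for _ in range(n_steps - 1):
        walk.append(rng.gauss(walk[-1], step_sd))
    return walk

def log_odds(lam, sec, theta, b):
    """eta = lambda_t + theta - b for a play with `sec` seconds remaining."""
    return lam[sec] + theta - b

# One value of lambda for each of the 121 possible values of seconds remaining.
lam = simulate_random_walk(n_steps=121, init_sd=10.0, tau=100.0)
print(len(lam), log_odds(lam, 30, theta=0.2, b=-0.1))
```

Because neighboring values of \(\lambda_t\) differ only by a small normal increment, the simulated curve is smooth in \(t\), which is what lets this prior follow the shape of the observed foul call rate.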

```
with pm.Model() as time_model:
    τ_λ = pm.Exponential('τ_λ', 1e-4)
    λ = pm.GaussianRandomWalk('λ', tau=τ_λ,
                              init=pm.Normal.dist(0., 10.),
                              shape=n_sec)
    Δ_θ = pm.Normal('Δ_θ', 0., 1., shape=n_players)
    σ_θ = pm.HalfCauchy('σ_θ', 2.5)
    θ = pm.Deterministic('θ', Δ_θ * σ_θ)
    Δ_b = pm.Normal('Δ_b', 0., 1., shape=n_players)
    σ_b = pm.HalfCauchy('σ_b', 2.5)
    b = pm.Deterministic('b', Δ_b * σ_b)
    η = λ[sec] + θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

We now sample from the model.

`time_trace = sample(time_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1.0533e+05: 15%|█▌ | 30300/200000 [00:31<02:52, 982.03it/s] Median ELBO converged.
Finished [100%]: Average ELBO = -3,104.5
100%|██████████| 4000/4000 [03:10<00:00, 20.99it/s]
```

The energy plot for this model is worse than the previous ones, but not too bad.

`energy_plot(time_trace)`

`waic_df = waic_df.append(get_waic_df(time_model, time_trace, "Time"))`

`waic_plot(waic_df)`

We see that the time remaining model represents an appreciable improvement over the Rasch model in terms of WAIC.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, time_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

The binned residuals for this model also look quite good, with very few samples appreciably exceeding a 1% difference.

We now compare the distributions of \(\theta_i\) and \(b_j\) for this model with those for the Rasch model.

`time_df = to_param_df(team_player_map, time_trace, ['θ', 'b'])`

`plot_latent_distributions(time_df.θ_mean, time_df.b_mean)`

The effect of constraining the mean of \(\theta\) to be zero is immediately apparent. Also, the variation in \(\theta\) remains lower than the variation in \(b\) in this model. We also see that the top- and bottom-ten players by \(\theta\)- and \(b\)-value remain largely unchanged from the Rasch model.

`top_10_plot(time_df)`

Basketball fans may find it amusing that under this model, Dwight Howard has joined the top-ten in terms of \(\theta\) and Ricky Rubio is no longer the worst player in terms of \(b\).

While this model has not done much to change the rank-ordering of the most- and least-skilled players, it does enable us to plot per-player foul probabilities over time, as below.

```
fig, (θ_ax, b_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
(df.groupby(sec)
.foul_called
.mean()
.plot(c='k', alpha=0.5,
label="Observed foul call rate",
ax=θ_ax));
plot_sec = np.arange(n_sec)
θ_ax.plot(plot_sec, expit(time_trace['λ'].mean(axis=0)),
c='k',
label="Average player");
θ_best_id = time_df.ix[time_df.θ_mean.idxmax()].player_id
θ_ax.plot(plot_sec, expit(time_trace['θ'][:, θ_best_id].mean(axis=0) \
+ time_trace['λ'].mean(axis=0)),
label=player_map[θ_best_id]);
θ_worst_id = time_df.ix[time_df.θ_mean.idxmin()].player_id
θ_ax.plot(plot_sec, expit(time_trace['θ'][:, θ_worst_id].mean(axis=0) \
+ time_trace['λ'].mean(axis=0)),
label=player_map[θ_worst_id]);
θ_ax.set_xticks(np.linspace(0, 120, 5));
θ_ax.invert_xaxis();
θ_ax.set_xlabel("Seconds remaining in game");
θ_ax.yaxis.set_major_formatter(pct_formatter);
θ_ax.set_ylabel("Probability foul is called\nagainst average opposing player");
θ_ax.legend(loc=2);
θ_ax.set_title(r"Disadvantaged player ($\theta$)");
(df.groupby(sec)
.foul_called
.mean()
.plot(c='k', alpha=0.5,
label="Observed foul call rate",
ax=b_ax));
plot_sec = np.arange(n_sec)
b_ax.plot(plot_sec, expit(time_trace['λ'].mean(axis=0)),
c='k',
label="Average player");
b_best_id = time_df.ix[time_df.b_mean.idxmax()].player_id
b_ax.plot(plot_sec, expit(-time_trace['b'][:, b_best_id].mean(axis=0) \
+ time_trace['λ'].mean(axis=0)),
label=player_map[b_best_id]);
b_worst_id = time_df.ix[time_df.b_mean.idxmin()].player_id
b_ax.plot(plot_sec, expit(-time_trace['b'][:, b_worst_id].mean(axis=0) \
+ time_trace['λ'].mean(axis=0)),
label=player_map[b_worst_id]);
b_ax.set_xticks(np.linspace(0, 120, 5));
b_ax.invert_xaxis();
b_ax.set_xlabel("Seconds remaining in game");
b_ax.legend(loc=2);
b_ax.set_title(r"Committing player ($b$)");
```

Here we have plotted the probability of a foul call while being opposed by an average player (for both \(\theta\) and \(b\)), along with the probability curves for the players with the highest and lowest \(\theta\) and \(b\), respectively. While these plots are quite interesting, one weakness of our model is that the difference between each player’s curve and the league average is constant over time on the log-odds scale. It would be interesting and useful to extend this model to allow player offsets to vary over time. Additionally, it would be interesting to understand the influence of the score on the foul-called rate as the game nears its end. It seems quite likely that the winning team is much less likely to commit fouls while the losing team is much more likely to commit intentional fouls in close games.

We now plot the per-player values of \(\theta_i\) and \(b_j\) under this model.

```
grid = sns.FacetGrid(time_df, col='team', col_wrap=2,
sharey=False,
size=4, aspect=1.5)
grid.map(plot_params_helper,
'θ_mean', 'θ_low', 'θ_high', 'name',
league_mean=0.,
league_mean_name="League average");
grid.set_axis_labels(r"$\theta$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

```
grid = sns.FacetGrid(time_df, col='team', col_wrap=2,
sharey=False,
size=4, aspect=1.5)
grid.map(plot_params_helper,
'b_mean', 'b_low', 'b_high', 'name',
league_mean=0.,
league_mean_name="League average");
grid.set_axis_labels(r"$b$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

Our final model uses salary as a proxy for “star power” to explore its influence on foul calls. We also (somewhat naively) impute missing salaries to the (log) league average.

```
std_log_salary = (salary_df.ix[np.arange(n_players)]
.std_log_salary
.fillna(0)
.values)
```

With \(s_i\) denoting the \(i\)-th player’s standardized log salary, our model becomes

\[ \begin{align*} \theta_i & = \theta_{0, i} + \delta_{\theta} \cdot s_i \\ b_j & = b_{0, j} + \delta_b \cdot s_j \\ \eta_k & = \lambda_{t(k)} + \theta_{i(k)} - b_{j(k)}. \end{align*} \]

In this model, each player’s \(\theta_i\) and \(b_j\) parameters are linear functions of their standardized log salary, with (hierarchical) varying intercepts. The varying intercepts \(\theta_{0, i}\) and \(b_{0, j}\) are endowed with the same hierarchical normal priors as \(\theta_i\) and \(b_j\) had in the previous model. We place normal priors, \(\delta_{\theta} \sim N(0, 100)\) and \(\delta_b \sim N(0, 100)\), on the salary coefficients.
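The following sketch shows how a standardized log salary enters a player's \(\theta_i\). The salaries are invented, the helper names are hypothetical, and the use of the population standard deviation is an assumption, since the construction of `std_log_salary` is not shown here.

```python
import math
import statistics

def standardize_log(salaries):
    """Standardize log salaries to mean zero and unit (population) standard deviation."""
    logs = [math.log(s) for s in salaries]
    mean = statistics.fmean(logs)
    sd = statistics.pstdev(logs)
    return [(x - mean) / sd for x in logs]

def player_theta(theta0, delta_theta, s):
    """theta_i = theta_{0, i} + delta_theta * s_i."""
    return theta0 + delta_theta * s

salaries = [1_000_000, 5_000_000, 20_000_000]  # hypothetical salaries
s = standardize_log(salaries)

# A player whose missing salary was imputed as s_i = 0 keeps only the intercept.
print(player_theta(theta0=0.1, delta_theta=0.5, s=0.0))
print([round(player_theta(0.1, 0.5, s_i), 3) for s_i in s])
```

Filling missing standardized values with zero, as the code above does with `fillna(0)`, is what imputes those players to the league average on the log scale.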

```
with pm.Model() as salary_model:
    τ_λ = pm.Exponential('τ_λ', 1e-4)
    λ = pm.GaussianRandomWalk('λ', tau=τ_λ,
                              init=pm.Normal.dist(0., 10.),
                              shape=n_sec)
    Δ_θ0 = pm.Normal('Δ_θ0', 0., 1., shape=n_players)
    σ_θ0 = pm.HalfCauchy('σ_θ0', 2.5)
    θ0 = pm.Deterministic('θ0', Δ_θ0 * σ_θ0)
    δ_θ = pm.Normal('δ_θ', 0., 10.)
    θ = pm.Deterministic('θ', θ0 + δ_θ * std_log_salary)
    Δ_b0 = pm.Normal('Δ_b0', 0., 1., shape=n_players)
    σ_b0 = pm.HalfCauchy('σ_b0', 2.5)
    b0 = pm.Deterministic('b0', Δ_b0 * σ_b0)
    δ_b = pm.Normal('δ_b', 0., 10.)
    b = pm.Deterministic('b', b0 + δ_b * std_log_salary)
    η = λ[sec] + θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

`salary_trace = sample(salary_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1.0244e+05: 15%|█▌ | 30998/200000 [00:32<02:57, 952.52it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -2,995.3
100%|██████████| 4000/4000 [03:45<00:00, 17.76it/s]
```

The energy plot for this model looks a bit worse than that for the time remaining model.

`energy_plot(salary_trace)`

The salary model also appears to be a slight improvement over the time remaining model in terms of WAIC.

`waic_df = waic_df.append(get_waic_df(salary_model, salary_trace, "Salary"))`

`waic_plot(waic_df)`

The binned residuals continue to look good for this model.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, salary_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

Based on the posterior distributions of \(\delta_{\theta}\) and \(\delta_b\), we expect to see a fairly strong relationship between a player’s (standardized log) salary and their latent skill parameters.

```
pm.plot_posterior(salary_trace, ['δ_θ', 'δ_b'],
lw=0., alpha=0.75);
```

The following plots confirm this relationship.

`salary_df_ = to_param_df(team_player_map, salary_trace, ['θ', 'θ0', 'b', 'b0'])`

```
fig, (θ_ax, b_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))
salary = (salary_df.ix[np.arange(n_players)]
.salary
.fillna(salary_df.salary.mean())
.values)
θ_ax.scatter(salary[salary_df_.player_id], salary_df_.θ_mean,
alpha=0.75);
θ_ax.xaxis.set_major_formatter(million_dollars_formatter);
θ_ax.set_xlabel("Salary");
θ_ax.set_ylabel(r"$\theta$");
b_ax.scatter(salary[salary_df_.player_id], salary_df_.b_mean,
alpha=0.75);
b_ax.xaxis.set_major_formatter(million_dollars_formatter);
b_ax.set_xlabel("Salary");
b_ax.set_ylabel(r"$b$");
```

It is important to note here that these relationships are descriptive, not causal. Our original intent was to use salary as a proxy for the difficult-to-quantify notion of “star power.” These plots suggest that the probabilities of getting a foul called when disadvantaged and of not having one called when committing are both positively related to salary, and therefore (a bit more dubiously) star power.

Importantly, \(\theta_i\) and \(b_j\) should no longer be interpreted directly as measuring latent skill, which is presumably intrinsic to a player and not directly dependent on their salary. In fact, it seems plausible that NBA scouts, front offices, players, and agents would be somewhat able to appreciate these latent skills, place a value on them, and thereby price them into contracts. It would be interesting future work to refine this model to give an econometric answer to this question.

Since we shouldn’t interpret \(\theta_i\) and \(b_j\) as measures of latent skill in this model, we will not plot their per-player distributions.

We set out to explore the relationship between players involved in a play and the probability that a foul is called, along with other factors related to that probability. Through a series of progressively more complex Bayesian item-response models, we have seen that

- foul call probability does vary appreciably across both disadvantaged and committing player,
- there is more variation in the latent skill of the committing player to avoid a foul call than there is in the latent skill of the disadvantaged player to draw a foul call,
- the amount of time remaining in the game is strongly related to the probability of a foul call, and
- there is a positive relationship between player salary and the probability that a foul is called when they are disadvantaged and not called when they are committing. With a bit of a leap, we can say that the probability a foul is called is at least loosely related to the “star power” of the players involved.

In this post we have only scratched the surface of Bayesian item-response theory. For a more in-depth treatment, consult *Bayesian Item Response Modeling*.

This post is available as a Jupyter notebook here.

My last post showed how to use Dirichlet processes and `pymc3` to perform Bayesian nonparametric density estimation. This post expands on the previous one, illustrating dependent density regression with `pymc3`.

Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as infinite mixtures of experts that select the active experts as part of inference. Their flexibility and modularity make them powerful tools for performing nonparametric Bayesian data analysis.

```
%matplotlib inline
from IPython.display import HTML
```

```
from matplotlib import animation as ani, pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
from theano import shared, tensor as tt
```

```
plt.rc('animation', writer='avconv')
blue, *_ = sns.color_palette()
```

```
SEED = 972915 # from random.org; for reproducibility
np.random.seed(SEED)
```

Throughout this post, we will use the LIDAR data set from Larry Wasserman’s excellent book, *All of Nonparametric Statistics*. We standardize the data set to improve the rate of convergence of our samples.

```
DATA_URI = 'http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat'
def standardize(x):
    return (x - x.mean()) / x.std()

df = (pd.read_csv(DATA_URI, sep=' *', engine='python')
        .assign(std_range=lambda df: standardize(df.range),
                std_logratio=lambda df: standardize(df.logratio)))
```

`df.head()`

| | range | logratio | std_logratio | std_range |
|---|---|---|---|---|
| 0 | 390 | -0.050356 | 0.852467 | -1.717725 |
| 1 | 391 | -0.060097 | 0.817981 | -1.707299 |
| 2 | 393 | -0.041901 | 0.882398 | -1.686447 |
| 3 | 394 | -0.050985 | 0.850240 | -1.676020 |
| 4 | 396 | -0.059913 | 0.818631 | -1.655168 |

We plot the LIDAR data below.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df.std_range, df.std_logratio,
c=blue);
ax.set_xticklabels([]);
ax.set_xlabel("Standardized range");
ax.set_yticklabels([]);
ax.set_ylabel("Standardized log ratio");
```

This data set has two interesting properties that make it useful for illustrating dependent density regression.

- The relationship between range and log ratio is nonlinear, but has locally linear components.
- The observation noise is heteroskedastic; that is, the magnitude of the variance varies with the range.

The intuitive idea behind dependent density regression is to reduce the problem to many (related) density estimates, conditioned on fixed values of the predictors. The following animation illustrates this intuition.

```
fig, (scatter_ax, hist_ax) = plt.subplots(ncols=2, figsize=(16, 6))
scatter_ax.scatter(df.std_range, df.std_logratio,
c=blue, zorder=2);
scatter_ax.set_xticklabels([]);
scatter_ax.set_xlabel("Standardized range");
scatter_ax.set_yticklabels([]);
scatter_ax.set_ylabel("Standardized log ratio");
bins = np.linspace(df.std_range.min(), df.std_range.max(), 25)
hist_ax.hist(df.std_logratio, bins=bins,
color='k', lw=0, alpha=0.25,
label="All data");
hist_ax.set_xticklabels([]);
hist_ax.set_xlabel("Standardized log ratio");
hist_ax.set_yticklabels([]);
hist_ax.set_ylabel("Frequency");
hist_ax.legend(loc=2);
endpoints = np.linspace(1.05 * df.std_range.min(), 1.05 * df.std_range.max(), 15)
frame_artists = []
for low, high in zip(endpoints[:-1], endpoints[2:]):
    interval = scatter_ax.axvspan(low, high,
                                  color='k', alpha=0.5, lw=0, zorder=1);
    *_, bars = hist_ax.hist(df[df.std_range.between(low, high)].std_logratio,
                            bins=bins,
                            color='k', lw=0, alpha=0.5);
    frame_artists.append((interval,) + tuple(bars))
animation = ani.ArtistAnimation(fig, frame_artists,
interval=500, repeat_delay=3000, blit=True)
plt.close(); # prevent the intermediate figure from showing
```

`HTML(animation.to_html5_video())`

As we slice the data with a window sliding along the x-axis in the left plot, the empirical distribution of the y-values of the points in the window varies in the right plot. An important aspect of this approach is that the density estimates that correspond to close values of the predictor are similar.
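The windowing used in the animation can be sketched without any plotting. In the toy example below (made-up endpoints and data, hypothetical helper name), pairing `endpoints[:-1]` with `endpoints[2:]` gives windows that each span two adjacent intervals, so consecutive windows overlap and share points.

```python
def windowed_values(xs, ys, low, high):
    """Return the y-values whose x-value lies in the closed interval [low, high]."""
    return [y for x, y in zip(xs, ys) if low <= x <= high]

# Toy endpoints; each window covers two adjacent intervals.
endpoints = [0, 1, 2, 3, 4]
windows = list(zip(endpoints[:-1], endpoints[2:]))
print(windows)

# Toy data: the y-values selected by each sliding window.
xs = [0.5, 1.5, 2.5, 3.5]
ys = [10, 20, 30, 40]
for low, high in windows:
    print((low, high), windowed_values(xs, ys, low, high))
```

The overlap is one reason the conditional histograms change gradually from frame to frame.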

In the previous post, we saw that a Dirichlet process estimates a probability density as a mixture model with infinitely many components. In the case of normal component distributions,

\[ y \sim \sum_{i = 1}^{\infty} w_i \cdot N(\mu_i, \tau_i^{-1}), \]

where the mixture weights, \(w_1, w_2, \ldots\), are generated by a stick-breaking process.

Dependent density regression generalizes this representation of the Dirichlet process mixture model by allowing the mixture weights and component means to vary conditioned on the value of the predictor, \(x\). That is,

\[ y\ |\ x \sim \sum_{i = 1}^{\infty} w_i\ |\ x \cdot N(\mu_i\ |\ x, \tau_i^{-1}). \]

In this post, we will follow Chapter 23 of *Bayesian Data Analysis* and use a probit stick-breaking process to determine the conditional mixture weights, \(w_i\ |\ x\). The probit stick-breaking process starts by defining

\[ v_i\ |\ x = \Phi(\alpha_i + \beta_i x), \]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution. We then obtain \(w_i\ |\ x\) by applying the stick breaking process to \(v_i\ |\ x\). That is,

\[ w_i\ |\ x = v_i\ |\ x \cdot \prod_{j = 1}^{i - 1} (1 - v_j\ |\ x). \]

For the LIDAR data set, we use independent normal priors \(\alpha_i \sim N(0, 5^2)\) and \(\beta_i \sim N(0, 5^2)\). We now express this model for the conditional mixture weights using `pymc3`.

```
def norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))

def stick_breaking(v):
    return v * tt.concatenate([tt.ones_like(v[:, :1]),
                               tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]],
                              axis=1)
```
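For intuition, here is a plain-Python sketch of the same probit stick-breaking computation for a single value of \(x\), without `theano`. The function `stick_breaking_row` and the values of `alpha`, `beta`, and `x` are illustrative assumptions.

```python
import math

def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def stick_breaking_row(v):
    """Turn break proportions v_1, ..., v_K into weights w_i = v_i * prod_{j < i} (1 - v_j)."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned
    for v_i in v:
        weights.append(v_i * remaining)
        remaining *= 1.0 - v_i
    return weights

# Hypothetical coefficients for three components and one predictor value.
alpha = [0.0, 1.0, -0.5]
beta = [1.0, -2.0, 0.5]
x = 0.3
v = [norm_cdf(a + b * x) for a, b in zip(alpha, beta)]
w = stick_breaking_row(v)
print(w, sum(w))
```

With finitely many components the weights sum to slightly less than one; the shortfall is the length of stick left after the last break.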

```
N, _ = df.shape
K = 20
std_range = df.std_range.values[:, np.newaxis]
std_logratio = df.std_logratio.values[:, np.newaxis]
x_lidar = shared(std_range, broadcastable=(False, True))
with pm.Model() as model:
    alpha = pm.Normal('alpha', 0., 5., shape=K)
    beta = pm.Normal('beta', 0., 5., shape=K)
    v = norm_cdf(alpha + beta * x_lidar)
    w = pm.Deterministic('w', stick_breaking(v))
```

We have defined `x_lidar` as a `theano` `shared` variable in order to use `pymc3`’s posterior prediction capabilities later.

While the dependent density regression model theoretically has infinitely many components, we must truncate the model to finitely many components (in this case, twenty) in order to express it using `pymc3`. After sampling from the model, we will verify that truncation did not unduly influence our results.

Since the LIDAR data seems to have several linear components, we use the linear models

\[ \begin{align*} \mu_i\ |\ x & = \gamma_i + \delta_i x \\ \gamma_i & \sim N(0, 10^2) \\ \delta_i & \sim N(0, 10^2) \end{align*} \]

for the conditional component means.

```
with model:
    gamma = pm.Normal('gamma', 0., 10., shape=K)
    delta = pm.Normal('delta', 0., 10., shape=K)
    mu = pm.Deterministic('mu', gamma + delta * x_lidar)
```

Finally, we place the prior \(\tau_i \sim \textrm{Gamma}(1, 1)\) on the component precisions.

```
with model:
    tau = pm.Gamma('tau', 1., 1., shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=tau, observed=std_logratio)
```
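To see what this likelihood evaluates, the sketch below computes the conditional mixture density for one observation in plain Python, given the weights already evaluated at that \(x\). The function names and argument values are illustrative assumptions, not the internals of `pm.NormalMixture`.

```python
import math

def normal_pdf(y, mu, tau):
    """Density of N(mu, 1 / tau) at y, where tau is the precision."""
    return math.sqrt(tau / (2.0 * math.pi)) * math.exp(-0.5 * tau * (y - mu) ** 2)

def mixture_density(y, x, w, gamma, delta, tau):
    """Density of y given x for components with means gamma_i + delta_i * x."""
    return sum(w_i * normal_pdf(y, g_i + d_i * x, t_i)
               for w_i, g_i, d_i, t_i in zip(w, gamma, delta, tau))

# Two hypothetical components with equal weight at this value of x.
density = mixture_density(y=0.0, x=1.0,
                          w=[0.5, 0.5],
                          gamma=[0.0, 1.0], delta=[1.0, -2.0],
                          tau=[1.0, 1.0])
print(density)
```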

We now draw samples from the dependent density regression model.

```
SAMPLES = 20000
BURN = 10000
THIN = 10
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(SAMPLES, step, random_seed=SEED)

trace = trace_[BURN::THIN]
```

`100%|██████████| 20000/20000 [01:30<00:00, 204.48it/s]`
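The slice `trace_[BURN::THIN]` discards the first `BURN` draws and then keeps every `THIN`-th draw. A quick check on a plain list of draw indices (a stand-in for the trace, not a `pymc3` object) shows how many draws survive.

```python
SAMPLES, BURN, THIN = 20000, 10000, 10

# Indices of the draws that survive burn-in removal and thinning.
kept = list(range(SAMPLES))[BURN::THIN]
print(len(kept), kept[0], kept[-1])
```

So each chain contributes 1,000 of its 20,000 draws to the posterior summaries that follow.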

To verify that truncation did not unduly influence our results, we plot the largest posterior expected mixture weight for each component. (In this model, each point has a mixture weight for each component, so we plot the maximum mixture weight for each component across all data points in order to judge if the component exerts any influence on the posterior.)

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.bar(np.arange(K) + 1 - 0.4,
trace['w'].mean(axis=0).max(axis=0));
ax.set_xlim(1 - 0.5, K + 0.5);
ax.set_xlabel('Mixture component');
ax.set_ylabel('Largest posterior expected\nmixture weight');
```

Since only three mixture components have appreciable posterior expected weight for any data point, we can be fairly certain that truncation did not unduly influence our results. (If most components had appreciable posterior expected weight, truncation may have influenced the results, and we would have increased the number of components and sampled again.)
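The reduction `trace['w'].mean(axis=0).max(axis=0)` used in this check averages over posterior samples first and then takes the maximum over data points. A nested-list sketch with made-up weights (two samples, two data points, two components; the function name is hypothetical) makes the order of operations explicit.

```python
def largest_expected_weight(w_samples):
    """w_samples[s][n][k] is the weight of component k for data point n in sample s."""
    n_samples = len(w_samples)
    n_points = len(w_samples[0])
    n_components = len(w_samples[0][0])
    largest = []
    for k in range(n_components):
        # Posterior expected weight of component k at each data point.
        expected = [sum(w_samples[s][n][k] for s in range(n_samples)) / n_samples
                    for n in range(n_points)]
        # Keep the largest expected weight across data points.
        largest.append(max(expected))
    return largest

w_samples = [
    [[0.75, 0.25], [0.25, 0.75]],  # sample 0
    [[0.25, 0.75], [0.25, 0.75]],  # sample 1
]
print(largest_expected_weight(w_samples))
```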

Visually, it is reasonable that the LIDAR data has three linear components, so these posterior expected weights seem to have identified the structure of the data well. We now sample from the posterior predictive distribution to better understand the model’s performance.

```
PP_SAMPLES = 5000
lidar_pp_x = np.linspace(std_range.min() - 0.05, std_range.max() + 0.05, 100)
x_lidar.set_value(lidar_pp_x[:, np.newaxis])
with model:
    pp_trace = pm.sample_ppc(trace, PP_SAMPLES, random_seed=SEED)
```

`100%|██████████| 5000/5000 [01:18<00:00, 66.54it/s]`

Below we plot the posterior expected value and the 95% posterior credible interval.

```
fig, ax = plt.subplots()
ax.scatter(df.std_range, df.std_logratio,
c=blue, zorder=10,
label=None);
low, high = np.percentile(pp_trace['obs'], [2.5, 97.5], axis=0)
ax.fill_between(lidar_pp_x, low, high,
color='k', alpha=0.35, zorder=5,
label='95% posterior credible interval');
ax.plot(lidar_pp_x, pp_trace['obs'].mean(axis=0),
c='k', zorder=6,
label='Posterior expected value');
ax.set_xticklabels([]);
ax.set_xlabel('Standardized range');
ax.set_yticklabels([]);
ax.set_ylabel('Standardized log ratio');
ax.legend(loc=1);
ax.set_title('LIDAR Data');
```

The model has fit the linear components of the data well, and also accommodated its heteroskedasticity. This flexibility, along with the ability to modularly specify the conditional mixture weights and conditional component densities, makes dependent density regression an extremely useful nonparametric Bayesian model.

To learn more about dependent density regression and related models, consult *Bayesian Data Analysis*, *Bayesian Nonparametric Data Analysis*, or *Bayesian Nonparametrics*.

This post is available as a Jupyter notebook here.

]]>I have been intrigued by the flexibility of nonparametric statistics for many years. As I have developed an understanding and appreciation of Bayesian modeling both personally and professionally over the last two or three years, I naturally developed an interest in Bayesian nonparametric statistics. I am pleased to begin a planned series of posts on Bayesian nonparametrics with this post on Dirichlet process mixtures for density estimation.

The Dirichlet process is a flexible probability distribution over the space of distributions. Most generally, a probability distribution, \(P\), on a set \(\Omega\) is a measure that assigns measure one to the entire space (\(P(\Omega) = 1\)). A Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is a measure that has the property that, for every finite disjoint partition \(S_1, \ldots, S_n\) of \(\Omega\),

\[(P(S_1), \ldots, P(S_n)) \sim \textrm{Dir}(\alpha P_0(S_1), \ldots, \alpha P_0(S_n)).\]

Here \(P_0\) is the base probability measure on the space \(\Omega\). The precision parameter \(\alpha > 0\) controls how close samples from the Dirichlet process are to the base measure, \(P_0\). As \(\alpha \to \infty\), samples from the Dirichlet process approach the base measure \(P_0\).

Dirichlet processes have several properties that make them quite suitable to MCMC simulation.

The posterior given i.i.d. observations \(\omega_1, \ldots, \omega_n\) from a Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is also a Dirichlet process with

\[P\ |\ \omega_1, \ldots, \omega_n \sim \textrm{DP}\left(\alpha + n, \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}\right),\]

where \(\delta\) is the Dirac delta measure

\[\begin{align*} \delta_{\omega}(S) & = \begin{cases} 1 & \textrm{if } \omega \in S \\ 0 & \textrm{if } \omega \not \in S \end{cases} \end{align*}.\]

The posterior predictive distribution of a new observation is a compromise between the base measure and the observations,

\[\omega\ |\ \omega_1, \ldots, \omega_n \sim \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}.\]

We see that the prior precision \(\alpha\) can naturally be interpreted as a prior sample size. The form of this posterior predictive distribution also lends itself to Gibbs sampling.
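The link to Gibbs sampling is easier to see with the predictive rule written as a sampler. The sketch below uses only the standard library and draws a sequence one value at a time from the posterior predictive distribution given the earlier draws (a scheme usually called the Blackwell-MacQueen Pólya urn). The function name `polya_urn_draws`, the seed, and the choices \(\alpha = 2\) and \(P_0 = N(0, 1)\) are illustrative assumptions, not part of the analysis in this post.

```python
import random


def polya_urn_draws(n, alpha, base_sampler, rng):
    """Draw n values sequentially from the DP posterior predictive."""
    draws = []
    for i in range(n):
        # After i draws, a fresh value from P0 has probability alpha / (alpha + i);
        # otherwise one of the i earlier draws is repeated, chosen uniformly.
        if rng.random() < alpha / (alpha + i):
            draws.append(base_sampler(rng))
        else:
            draws.append(rng.choice(draws))
    return draws


rng = random.Random(0)  # arbitrary seed
urn_draws = polya_urn_draws(200, 2.0, lambda r: r.gauss(0.0, 1.0), rng)
n_distinct = len(set(urn_draws))
print(n_distinct, "distinct values among", len(urn_draws), "draws")
```

Because earlier values are repeated with positive probability, far fewer than 200 distinct values appear, which previews the discreteness property discussed next.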

Samples, \(P \sim \textrm{DP}(\alpha, P_0)\), from a Dirichlet process are discrete with probability one. That is, there are elements \(\omega_1, \omega_2, \ldots\) in \(\Omega\) and weights \(w_1, w_2, \ldots\) with \(\sum_{i = 1}^{\infty} w_i = 1\) such that

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i}.\]

- The stick-breaking process gives an explicit construction of the weights \(w_i\) and samples \(\omega_i\) above that is straightforward to sample from. If \(\beta_1, \beta_2, \ldots \sim \textrm{Beta}(1, \alpha)\), then \(w_i = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j)\). The relationship between this representation and stick breaking may be illustrated as follows:
  - Start with a stick of length one.
  - Break the stick into two portions, the first of proportion \(w_1 = \beta_1\) and the second of proportion \(1 - w_1\).
  - Further break the second portion into two portions, the first of proportion \(\beta_2\) and the second of proportion \(1 - \beta_2\). The length of the first portion of this stick is \(\beta_2 (1 - \beta_1)\); the length of the second portion is \((1 - \beta_1) (1 - \beta_2)\).
  - Continue breaking the second portion from the previous break in this manner forever.

  If \(\omega_1, \omega_2, \ldots \sim P_0\), then

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i} \sim \textrm{DP}(\alpha, P_0).\]

We can use the stick-breaking process above to easily sample from a Dirichlet process in Python. For this example, \(\alpha = 2\) and the base distribution is \(N(0, 1)\).

`%matplotlib inline`

`from __future__ import division`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from statsmodels.datasets import get_rdataset
from theano import tensor as T
```


`blue = sns.color_palette()[0]`

`np.random.seed(462233) # from random.org`

```
N = 20
K = 30
alpha = 2.
P0 = sp.stats.norm
```

We draw and plot samples from the stick-breaking process.

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
x_plot = np.linspace(-3, 3, 200)
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
label='DP sample CDFs');
ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');
ax.set_title(r'$\alpha = {}$'.format(alpha));
ax.legend(loc=2);
```

As stated above, as \(\alpha \to \infty\), samples from the Dirichlet process converge to the base distribution.

```
fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
K = 50
alpha = 10.
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
l_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
label='DP sample CDFs');
l_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
l_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');
l_ax.set_title(r'$\alpha = {}$'.format(alpha));
l_ax.legend(loc=2);
K = 200
alpha = 50.
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
r_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
label='DP sample CDFs');
r_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
r_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');
r_ax.set_title(r'$\alpha = {}$'.format(alpha));
r_ax.legend(loc=2);
```

For the task of density estimation, the (almost sure) discreteness of samples from the Dirichlet process is a significant drawback. This problem can be solved with another level of indirection by using Dirichlet process mixtures for density estimation. A Dirichlet process mixture uses component densities from a parametric family \(\mathcal{F} = \{f_{\theta}\ |\ \theta \in \Theta\}\) and represents the mixture weights as a Dirichlet process. If \(P_0\) is a probability measure on the parameter space \(\Theta\), a Dirichlet process mixture is the hierarchical model

\[ \begin{align*} x_i\ |\ \theta_i & \sim f_{\theta_i} \\ \theta_1, \ldots, \theta_n & \sim P \\ P & \sim \textrm{DP}(\alpha, P_0). \end{align*} \]

To illustrate this model, we simulate draws from a Dirichlet process mixture with \(\alpha = 2\), \(\theta \sim N(0, 1)\), \(x\ |\ \theta \sim N(\theta, (0.3)^2)\).

```
N = 5
K = 30
alpha = 2
P0 = sp.stats.norm
f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3)
```

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
theta = P0.rvs(size=(N, K))
dpm_pdf_components = f(x_plot[np.newaxis, np.newaxis, :], theta[..., np.newaxis])
dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, dpm_pdfs.T, c='gray');
ax.set_yticklabels([]);
```

We now focus on a single mixture and decompose it into its individual (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))
ix = 1
ax.plot(x_plot, dpm_pdfs[ix], c='k', label='Density');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix, 0],
'--', c='k', label='Mixture components (weighted)');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix].T,
'--', c='k');
ax.set_yticklabels([]);
ax.legend(loc=1);
```

Sampling from these stochastic processes is fun, but these ideas become truly useful when we fit them to data. The discreteness of samples and the stick-breaking representation of the Dirichlet process lend themselves nicely to Markov chain Monte Carlo simulation of posterior distributions. We will perform this sampling using `pymc3`.

Our first example uses a Dirichlet process mixture to estimate the density of waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park.

`old_faithful_df = get_rdataset('faithful', cache=True).data[['waiting']]`

For convenience in specifying the prior, we standardize the waiting time between eruptions.

`old_faithful_df['std_waiting'] = (old_faithful_df.waiting - old_faithful_df.waiting.mean()) / old_faithful_df.waiting.std()`

`old_faithful_df.head()`

| | waiting | std_waiting |
|---|---|---|
| 0 | 79 | 0.596025 |
| 1 | 54 | -1.242890 |
| 2 | 74 | 0.228242 |
| 3 | 62 | -0.654437 |
| 4 | 85 | 1.037364 |

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting, bins=n_bins, color=blue, lw=0, alpha=0.5);
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_ylabel('Number of eruptions');
```

Observant readers will have noted that we have not been continuing the stick-breaking process indefinitely as indicated by its definition, but rather have been truncating this process after a finite number of breaks. Obviously, when computing with Dirichlet processes, it is necessary to only store a finite number of its point masses and weights in memory. This restriction is not terribly onerous, since with a finite number of observations, it seems quite likely that the number of mixture components that contribute non-negligible mass to the mixture will grow more slowly than the number of samples. This intuition can be formalized to show that the (expected) number of components that contribute non-negligible mass to the mixture approaches \(\alpha \log N\), where \(N\) is the sample size.
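To get a feel for the \(\alpha \log N\) growth rate, the sketch below compares it with the exact expected number of distinct values among \(N\) draws, \(\sum_{i = 0}^{N - 1} \frac{\alpha}{\alpha + i}\), which follows from the posterior predictive rule given earlier (the next draw is new with probability \(\alpha / (\alpha + i)\) after \(i\) draws). The function name and the values \(\alpha = 2\), \(N = 272\) are illustrative assumptions.

```python
import math


def expected_n_components(alpha, n):
    """Expected number of distinct values among n draws from DP(alpha, P0)."""
    # The (i + 1)-th draw is a new value with probability alpha / (alpha + i).
    return sum(alpha / (alpha + i) for i in range(n))


alpha_demo, n_demo = 2.0, 272  # illustrative values
exact = expected_n_components(alpha_demo, n_demo)
approx = alpha_demo * math.log(n_demo)
print(round(exact, 2), round(approx, 2))
```

Both numbers are small compared with the sample size, which is the sense in which a finite truncation costs little.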

There are various clever Gibbs sampling techniques for Dirichlet processes that allow the number of components stored to grow as needed. Stochastic memoization is another powerful technique for simulating Dirichlet processes while only storing finitely many components in memory. In this introductory post, we take the much less sophisticated approach of simply truncating the Dirichlet process components that are stored after a fixed number, \(K\), of components. Importantly, this approach is compatible with some of `pymc3`’s (current) technical limitations. Ohlssen, et al. provide justification for truncation, showing that \(K > 5 \alpha + 2\) is most likely sufficient to capture almost all of the mixture weights (\(\sum_{i = 1}^{K} w_i > 0.99\)). We can practically verify the suitability of our truncated approximation to the Dirichlet process by checking the number of components that contribute non-negligible mass to the mixture. If, in our simulations, all components contribute non-negligible mass to the mixture, we have truncated our Dirichlet process too early.
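One rough way to sanity-check the truncation rule is to look at the expected mass left over after \(K\) breaks. Since each break leaves a fraction \(1 - \beta_j\) with \(E[1 - \beta_j] = \alpha / (1 + \alpha)\), the expected leftover mass is \((\alpha / (1 + \alpha))^K\). The sketch below compares that value with a simulation for \(\alpha = 2\) and \(K = 13\) (the smallest integer with \(K > 5 \alpha + 2\)). This is a check on the expectation only, not a reproduction of the argument in Ohlssen, et al., and the function names and seed are assumptions made for illustration.

```python
import random


def expected_tail_mass(alpha, k):
    """Expected stick length remaining after k breaks with Beta(1, alpha) proportions."""
    return (alpha / (1.0 + alpha)) ** k


def simulated_tail_mass(alpha, k, n_sims, rng):
    """Monte Carlo estimate of the same quantity."""
    total = 0.0
    for _ in range(n_sims):
        remaining = 1.0
        for _ in range(k):
            # Each break keeps a fraction (1 - beta) of what was left.
            remaining *= 1.0 - rng.betavariate(1.0, alpha)
        total += remaining
    return total / n_sims


alpha_demo, k_demo = 2.0, 13
analytic = expected_tail_mass(alpha_demo, k_demo)
simulated = simulated_tail_mass(alpha_demo, k_demo, 20000, random.Random(0))
print(analytic, simulated)
```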

Our Dirichlet process mixture model for the standardized waiting times is

\[ \begin{align*} x_i\ |\ \mu_i, \lambda_i, \tau_i & \sim N\left(\mu_i, (\lambda_i \tau_i)^{-1}\right) \\ \mu_i\ |\ \lambda_i, \tau_i & \sim N\left(0, (\lambda_i \tau_i)^{-1}\right) \\ (\lambda_1, \tau_1), (\lambda_2, \tau_2), \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 5) \times \textrm{Gamma}(1, 1)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

Note that instead of fixing a value of \(\alpha\), as in our previous simulations, we specify a prior on \(\alpha\), so that we may learn its posterior distribution from the observations. This model is therefore actually a mixture of Dirichlet process mixtures, since each fixed value of \(\alpha\) results in a Dirichlet process mixture.

We now construct this model using `pymc3`.

```
N = old_faithful_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    # stick-breaking weights: w_i = beta_i * prod_{j < i} (1 - beta_j)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta)[:-1]]))
    component = pm.Categorical('component', w, shape=N)

    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda', 0, 5, shape=K)
    # the precision lambda * tau is passed by keyword so it cannot be read as a standard deviation
    mu = pm.Normal('mu', 0, tau=lambda_ * tau, shape=K)
    obs = pm.Normal('obs', mu[component], tau=lambda_[component] * tau[component],
                    observed=old_faithful_df.std_waiting.values)
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied log-transform to tau and added transformed tau_log to model.
Applied interval-transform to lambda and added transformed lambda_interval to model.
```

We sample from the posterior distribution 20,000 times, burn the first 10,000 samples, and thin to every tenth sample.

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, lambda_, tau, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))
    trace_ = pm.sample(20000, [step1, step2])

trace = trace_[10000::10]
```

` [-----------------100%-----------------] 20000 of 20000 complete in 139.3 sec`

The posterior distribution of \(\alpha\) is concentrated between 0.4 and 1.75.

`pm.traceplot(trace, varnames=['alpha']);`

To verify that our truncation point is not biasing our results, we plot the distribution of the number of mixture components used.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We see that the vast majority of samples use five mixture components, and the largest number of mixture components used by any sample is eight. Since we truncated our Dirichlet process mixture at thirty components, we can be quite sure that truncation did not bias our results.

We now compute and plot our posterior density estimate.

```
post_pdf_contribs = sp.stats.norm.pdf(np.atleast_3d(x_plot),
trace['mu'][:, np.newaxis, :],
1. / np.sqrt(trace['lambda'] * trace['tau'])[:, np.newaxis, :])
post_pdfs = (trace['w'][:, np.newaxis, :] * post_pdf_contribs).sum(axis=-1)
post_pdf_low, post_pdf_high = np.percentile(post_pdfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
color=blue, lw=0, alpha=0.5);
ax.fill_between(x_plot, post_pdf_low, post_pdf_high,
color='gray', alpha=0.45);
ax.plot(x_plot, post_pdfs[0],
c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pdfs[::100].T, c='gray');
ax.plot(x_plot, post_pdfs.mean(axis=0),
c='k', label='Posterior expected density');
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

As above, we can decompose this density estimate into its (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
color=blue, lw=0, alpha=0.5);
ax.plot(x_plot, post_pdfs.mean(axis=0),
c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0)[:, 0],
'--', c='k', label='Posterior expected mixture\ncomponents\n(weighted)');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0),
'--', c='k');
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

The Dirichlet process mixture model is incredibly flexible in terms of the family of parametric component distributions \(\{f_{\theta}\ |\ \theta \in \Theta\}\). We illustrate this flexibility below by using Poisson component distributions to estimate the density of sunspots per year.

`sunspot_df = get_rdataset('sunspot.year', cache=True).data`

`sunspot_df.head()`

| | time | sunspot.year |
|---|---|---|
| 0 | 1700 | 5 |
| 1 | 1701 | 11 |
| 2 | 1702 | 16 |
| 3 | 1703 | 23 |
| 4 | 1704 | 36 |

For this problem, the model is

\[ \begin{align*} x_i\ |\ \lambda_i & \sim \textrm{Poisson}(\lambda_i) \\ \lambda_1, \lambda_2, \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 300)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

```
N = sunspot_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1, alpha, shape=K)
    # note: the stick-breaking weights are registered under the name 'beta' (not 'w'),
    # which is the name the plotting code below reads from the trace
    w = pm.Deterministic('beta', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta[:-1])]))
    component = pm.Categorical('component', w, shape=N)

    mu = pm.Uniform('mu', 0., 300., shape=K)
    obs = pm.Poisson('obs', mu[component], observed=sunspot_df['sunspot.year'])
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied interval-transform to mu and added transformed mu_interval to model.
```

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))
    trace_ = pm.sample(20000, [step1, step2])
```

` [-----------------100%-----------------] 20000 of 20000 complete in 111.9 sec`

`trace = trace_[10000::10]`

For the sunspot model, the posterior distribution of \(\alpha\) is concentrated between one and three, indicating that we should expect more components to contribute non-negligible amounts to the mixture than for the Old Faithful waiting time model.

`pm.traceplot(trace, varnames=['alpha']);`

Indeed, we see that there are (on average) about ten to fifteen components used by this model.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We now calculate and plot the fitted density estimate.

`x_plot = np.arange(250)`

```
post_pmf_contribs = sp.stats.poisson.pmf(np.atleast_3d(x_plot),
trace['mu'][:, np.newaxis, :])
post_pmfs = (trace['beta'][:, np.newaxis, :] * post_pmf_contribs).sum(axis=-1)
post_pmf_low, post_pmf_high = np.percentile(post_pmfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);
ax.fill_between(x_plot, post_pmf_low, post_pmf_high,
color='gray', alpha=0.45)
ax.plot(x_plot, post_pmfs[0],
c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pmfs[::200].T, c='gray');
ax.plot(x_plot, post_pmfs.mean(axis=0),
c='k', label='Posterior expected density');
ax.legend(loc=1);
```

Again, we can decompose the posterior expected density into weighted mixture densities.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);
ax.plot(x_plot, post_pmfs.mean(axis=0),
c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['beta'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0)[:, 0],
'--', c='k', label='Posterior expected\nmixture components\n(weighted)');
ax.plot(x_plot, (trace['beta'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0),
'--', c='k');
ax.legend(loc=1);
```

We have only scratched the surface in terms of applications of the Dirichlet process and Bayesian nonparametric statistics in general. This post is the first in a series I have planned on Bayesian nonparametrics, so stay tuned.

]]>Survival analysis studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This post shows how to fit and analyze a Bayesian survival model in Python using `pymc3`.

We illustrate these concepts by analyzing a mastectomy data set from `R`’s `HSAUR` package.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
from pymc3.distributions.timeseries import GaussianRandomWalk
import seaborn as sns
from statsmodels import datasets
from theano import tensor as T
```


Fortunately, `statsmodels.datasets` makes it quite easy to load a number of data sets from `R`.

```
df = datasets.get_rdataset('mastectomy', 'HSAUR', cache=True).data
df.event = df.event.astype(np.int64)
df.metastized = (df.metastized == 'yes').astype(np.int64)
n_patients = df.shape[0]
patients = np.arange(n_patients)
```

`df.head()`

| | time | event | metastized |
|---|---|---|---|
| 0 | 23 | 1 | 0 |
| 1 | 47 | 1 | 0 |
| 2 | 69 | 1 | 0 |
| 3 | 70 | 0 | 0 |
| 4 | 100 | 0 | 0 |

`n_patients`

`44`

Each row represents observations from a woman diagnosed with breast cancer who underwent a mastectomy. The column `time` represents the time (in months) post-surgery that the woman was observed. The column `event` indicates whether or not the woman died during the observation period. The column `metastized` represents whether the cancer had metastasized prior to surgery.

This post analyzes the relationship between survival time post-mastectomy and whether or not the cancer had metastasized.

First we introduce a (very little) bit of theory. If the random variable \(T\) is the time to the event we are studying, survival analysis is primarily concerned with the survival function

\[S(t) = P(T > t) = 1 - F(t),\]

where \(F\) is the CDF of \(T\). It is mathematically convenient to express the survival function in terms of the hazard rate, \(\lambda(t)\). The hazard rate is the instantaneous probability that the event occurs at time \(t\) given that it has not yet occurred. That is,

\[\begin{align*} \lambda(t) & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t\ |\ T > t)}{\Delta t} \\ & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t \cdot P(T > t)} \\ & = \frac{1}{S(t)} \cdot \lim_{\Delta t \to 0} \frac{S(t) - S(t + \Delta t)}{\Delta t} = -\frac{S'(t)}{S(t)}. \end{align*}\]

Solving this differential equation for the survival function shows that

\[S(t) = \exp\left(-\int_0^t \lambda(s)\ ds\right).\]

This representation of the survival function shows that the cumulative hazard function

\[\Lambda(t) = \int_0^t \lambda(s)\ ds\]

is an important quantity in survival analysis, since we may concisely write \(S(t) = \exp(-\Lambda(t)).\)
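As a quick numerical illustration of \(S(t) = \exp(-\Lambda(t))\), the sketch below integrates a hazard rate with the trapezoid rule and compares the result with a closed form. The Weibull hazard \(\lambda(t) = \frac{k}{s} \left(\frac{t}{s}\right)^{k - 1}\), whose cumulative hazard is \((t / s)^k\), along with the shape, scale, and evaluation time, are assumptions chosen only for this example; they are unrelated to the mastectomy data.

```python
import math

SHAPE, SCALE = 1.5, 10.0  # hypothetical Weibull parameters


def hazard(t):
    """Weibull hazard rate lambda(t)."""
    return (SHAPE / SCALE) * (t / SCALE) ** (SHAPE - 1.0)


def cumulative_hazard(t, n_steps=10000):
    """Trapezoid-rule approximation of the integral of the hazard over [0, t]."""
    h = t / n_steps
    total = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, n_steps):
        total += hazard(i * h)
    return total * h


t_demo = 7.0
surv_numeric = math.exp(-cumulative_hazard(t_demo))
surv_closed = math.exp(-(t_demo / SCALE) ** SHAPE)
print(surv_numeric, surv_closed)
```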

An important, but subtle, point in survival analysis is censoring. Even though the quantity we are interested in estimating is the time between surgery and death, we do not observe the death of every subject. At the point in time that we perform our analysis, some of our subjects will thankfully still be alive. In the case of our mastectomy study, `df.event` is one if the subject’s death was observed (the observation is not censored) and is zero if the death was not observed (the observation is censored).

`df.event.mean()`

`0.59090909090909094`

Just over 40% of our observations are censored. We visualize the observed durations and indicate which observations are censored below.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.hlines(patients[df.event.values == 0], 0, df[df.event.values == 0].time,
color=blue, label='Censored');
ax.hlines(patients[df.event.values == 1], 0, df[df.event.values == 1].time,
color=red, label='Uncensored');
ax.scatter(df[df.metastized.values == 1].time, patients[df.metastized.values == 1],
color='k', zorder=10, label='Metastized');
ax.set_xlim(left=0);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(-0.25, n_patients + 0.25);
ax.legend(loc='center right');
```

When an observation is censored (`df.event` is zero), `df.time` is not the subject’s survival time. All we can conclude from such a censored observation is that the subject’s true survival time exceeds `df.time`.

This is enough basic survival analysis theory for the purposes of this post; for a more extensive introduction, consult Aalen et al.^{1}

The two most basic estimators in survival analysis are the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function. However, since we want to understand the impact of metastasis on survival time, a risk regression model is more appropriate. Perhaps the most commonly used risk regression model is Cox’s proportional hazards model. In this model, if we have covariates \(\mathbf{x}\) and regression coefficients \(\beta\), the hazard rate is modeled as

\[\lambda(t) = \lambda_0(t) \exp(\mathbf{x} \beta).\]

Here \(\lambda_0(t)\) is the baseline hazard, which is independent of the covariates \(\mathbf{x}\). In this example, the covariates are the one-dimensional vector `df.metastized`.

Unlike in many regression situations, \(\mathbf{x}\) should not include a constant term corresponding to an intercept. If \(\mathbf{x}\) includes a constant term corresponding to an intercept, the model becomes unidentifiable. To illustrate this unidentifiability, suppose that

\[\lambda(t) = \lambda_0(t) \exp(\beta_0 + \mathbf{x} \beta) = \lambda_0(t) \exp(\beta_0) \exp(\mathbf{x} \beta).\]

If \(\tilde{\beta}_0 = \beta_0 + \delta\) and \(\tilde{\lambda}_0(t) = \lambda_0(t) \exp(-\delta)\), then \(\lambda(t) = \tilde{\lambda}_0(t) \exp(\tilde{\beta}_0 + \mathbf{x} \beta)\) as well, making the model with \(\beta_0\) unidentifiable.
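The argument can be checked with a few lines of arithmetic. In the sketch below, the baseline hazard value, the coefficients, and the shift \(\delta\) are arbitrary hypothetical numbers; the point is only that the two parameterizations produce the same hazard rate for every covariate value.

```python
import math


def hazard_rate(baseline, beta0, beta, x):
    """lambda = lambda_0 * exp(beta_0 + x * beta) at a fixed time t."""
    return baseline * math.exp(beta0 + beta * x)


baseline, beta0, beta, delta = 0.02, 0.3, 0.5, 1.7  # hypothetical values
pairs = []
for x in (0, 1):
    original = hazard_rate(baseline, beta0, beta, x)
    # Shift the intercept by delta and rescale the baseline hazard by exp(-delta).
    shifted = hazard_rate(baseline * math.exp(-delta), beta0 + delta, beta, x)
    pairs.append((original, shifted))
    print(x, original, shifted)
```

Since the data can only inform the product \(\lambda_0(t) \exp(\beta_0)\), no amount of data can separate the two factors.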

In order to perform Bayesian inference with the Cox model, we must specify priors on \(\beta\) and \(\lambda_0(t)\). We place a normal prior on \(\beta\), \(\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2),\) where \(\mu_{\beta} \sim N(0, 10^2)\) and \(\sigma_{\beta} \sim U(0, 10)\).

A suitable prior on \(\lambda_0(t)\) is less obvious. We choose a semiparametric prior, where \(\lambda_0(t)\) is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints \(0 \leq s_1 < s_2 < \cdots < s_N\). With this partition, \(\lambda_0 (t) = \lambda_j\) if \(s_j \leq t < s_{j + 1}\). With \(\lambda_0(t)\) constrained to have this form, all we need to do is choose priors for the \(N - 1\) values \(\lambda_j\). We use independent vague priors \(\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2}).\) For our mastectomy example, we make each interval three months long.

```
interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
```
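To make the piecewise-constant form concrete, the following sketch evaluates a baseline hazard and its cumulative hazard at an arbitrary time. It uses only the standard library; the endpoints in `bounds`, the values in `rates`, and both function names are hypothetical and are not the quantities estimated later in this post.

```python
from bisect import bisect_right

bounds = [0, 3, 6, 9]        # hypothetical endpoints s_1 < s_2 < ... < s_N
rates = [0.10, 0.05, 0.02]   # hypothetical lambda_j, one per interval


def baseline_hazard(t):
    """lambda_0(t) = lambda_j for s_j <= t < s_{j + 1}."""
    if not bounds[0] <= t < bounds[-1]:
        raise ValueError("t must lie inside the partitioned time range")
    return rates[bisect_right(bounds, t) - 1]


def cumulative_baseline_hazard(t):
    """Integral of lambda_0 over [0, t]: rate times time spent in each interval."""
    total = 0.0
    for j, rate in enumerate(rates):
        overlap = min(t, bounds[j + 1]) - bounds[j]
        if overlap <= 0:
            break
        total += rate * overlap
    return total


print(baseline_hazard(4), cumulative_baseline_hazard(4))
```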

We see how deaths and censored observations are distributed in these intervals.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(df[df.event == 1].time.values, bins=interval_bounds,
color=red, alpha=0.5, lw=0,
label='Uncensored');
ax.hist(df[df.event == 0].time.values, bins=interval_bounds,
color=blue, alpha=0.5, lw=0,
label='Censored');
ax.set_xlim(0, interval_bounds[-1]);
ax.set_xlabel('Months since mastectomy');
ax.set_yticks([0, 1, 2, 3]);
ax.set_ylabel('Number of observations');
ax.legend();
```

With the prior distributions on \(\beta\) and \(\lambda_0(t)\) chosen, we now show how the model may be fit using MCMC simulation with `pymc3`. The key observation is that the piecewise-constant proportional hazard model is closely related to a Poisson regression model. (The models are not identical, but their likelihoods differ by a factor that depends only on the observed data and not the parameters \(\beta\) and \(\lambda_j\). For details, see Germán Rodríguez’s WWS 509 course notes.)
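The parenthetical claim can be checked numerically for a single subject and a single interval. In the sketch below, `d_obs` is a death indicator, `t_obs` is the time at risk, and `lam` is the hazard rate in that interval; these values and the function names are hypothetical. The constant-hazard log-likelihood contribution is taken to be \(d \log \lambda - t \lambda\), the standard form for an exponential survival time that may be censored.

```python
import math


def exp_loglik(d, t, lam):
    """Constant-hazard log-likelihood contribution of one interval."""
    return d * math.log(lam) - t * lam


def poisson_loglik(d, t, lam):
    """Log-probability of d under a Poisson distribution with mean t * lam."""
    mu = t * lam
    return d * math.log(mu) - mu - math.lgamma(d + 1)


d_obs, t_obs = 1, 2.5  # hypothetical observation
diffs = [poisson_loglik(d_obs, t_obs, lam) - exp_loglik(d_obs, t_obs, lam)
         for lam in (0.01, 0.1, 1.0)]
print(diffs, math.log(t_obs))
```

The difference equals \(d \log t\) whatever the rate is, so both likelihoods lead to the same inferences about the parameters.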

We define indicator variables based on whether the \(i\)-th subject died in the \(j\)-th interval,

\[d_{i, j} = \begin{cases} 1 & \textrm{if subject } i \textrm{ died in interval } j \\ 0 & \textrm{otherwise} \end{cases}.\]

```
# integer interval index, so that it can be used to index the arrays below
last_period = np.floor((df.time - 0.01) / interval_length).astype(int)
death = np.zeros((n_patients, n_intervals))
death[patients, last_period] = df.event
```

We also define \(t_{i, j}\) to be the amount of time the \(i\)-th subject was at risk in the \(j\)-th interval.

```
# strict inequality: a subject observed exactly until an interval's left endpoint spends no time in it
exposure = np.greater.outer(df.time, interval_bounds[:-1]) * interval_length
exposure[patients, last_period] = df.time - interval_bounds[last_period]
```
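The bookkeeping behind \(d_{i, j}\) and \(t_{i, j}\) may be easier to follow with explicit loops on a tiny example. The three follow-up times and event indicators below are made up for illustration, and the sketch uses plain Python lists instead of the vectorized NumPy code above.

```python
import math

interval_length = 3
times = [2, 7, 10]    # hypothetical follow-up times in months
events = [1, 0, 1]    # 1 = death observed, 0 = censored
n_intervals = 4       # intervals [0, 3), [3, 6), [6, 9), [9, 12)

death = [[0] * n_intervals for _ in times]
exposure = [[0.0] * n_intervals for _ in times]

for i, (t, d) in enumerate(zip(times, events)):
    # Index of the last interval in which subject i was observed.
    last = int(math.floor((t - 0.01) / interval_length))
    for j in range(last):
        exposure[i][j] = float(interval_length)              # full intervals survived
    exposure[i][last] = float(t - last * interval_length)    # partial final interval
    death[i][last] = d                                       # death, if any, is recorded there

for row_d, row_t in zip(death, exposure):
    print(row_d, row_t)
```

Each row of `exposure` sums to that subject's follow-up time, and each row of `death` contains at most a single one.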

Finally, denote the risk incurred by the \(i\)-th subject in the \(j\)-th interval as \(\lambda_{i, j} = \lambda_j \exp(\mathbf{x}_i \beta)\).

We may approximate \(d_{i, j}\) with a Poisson random variable with mean \(t_{i, j}\ \lambda_{i, j}\). This approximation leads to the following `pymc3` model.

`SEED = 5078864 # from random.org`

```
with pm.Model() as model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)

    sigma = pm.Uniform('sigma', 0., 10.)
    tau = pm.Deterministic('tau', sigma**-2)
    # precisions are passed by keyword; 10**-2 is the precision of N(0, 10^2)
    mu_beta = pm.Normal('mu_beta', 0., tau=10**-2)
    beta = pm.Normal('beta', mu_beta, tau=tau)

    lambda_ = pm.Deterministic('lambda_', T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
```

We now sample from the model.

```
n_samples = 40000
burn = 20000
thin = 20
```

```
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 39.0 sec`

`trace = trace_[burn::thin]`

We see that the hazard rate for subjects whose cancer has metastasized is roughly 1.6 times the rate of those whose cancer has not metastasized.

`np.exp(trace['beta'].mean())`

`1.645592148084472`

`pm.traceplot(trace, vars=['beta']);`

`pm.autocorrplot(trace, vars=['beta']);`

We now examine the effect of metastasis on both the cumulative hazard and on the survival function.

```
base_hazard = trace['lambda0']
met_hazard = trace['lambda0'] * np.exp(np.atleast_2d(trace['beta']).T)
```

```
def cum_hazard(hazard):
    return (interval_length * hazard).cumsum(axis=-1)


def survival(hazard):
    return np.exp(-cum_hazard(hazard))
```

```
def plot_with_hpd(x, hazard, f, ax, color=None, label=None, alpha=0.05):
    mean = f(hazard.mean(axis=0))
    percentiles = 100 * np.array([alpha / 2., 1. - alpha / 2.])
    hpd = np.percentile(f(hazard), percentiles, axis=0)
    ax.fill_between(x, hpd[0], hpd[1], color=color, alpha=0.25)
    ax.step(x, mean, color=color, label=label);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))

plot_with_hpd(interval_bounds[:-1], base_hazard, cum_hazard,
              hazard_ax, color=blue, label='Had not metastasized')
plot_with_hpd(interval_bounds[:-1], met_hazard, cum_hazard,
              hazard_ax, color=red, label='Metastasized')

hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);

plot_with_hpd(interval_bounds[:-1], base_hazard, survival,
              surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], met_hazard, survival,
              surv_ax, color=red)

surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');

fig.suptitle('Bayesian survival model');
```

We see that the cumulative hazard for subjects whose cancer had metastasized increases more rapidly initially (through about seventy months), after which it increases roughly in parallel with the baseline cumulative hazard.

These plots also show a pointwise 95% posterior credible interval for each function, computed from the 2.5th and 97.5th percentiles of the posterior samples. One of the distinct advantages of the Bayesian model fit with `pymc3` is the inherent quantification of uncertainty in our estimates.
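The helper `plot_with_hpd` above obtains its bands from `np.percentile`, which yields equal-tailed intervals. A highest posterior density interval is instead the narrowest interval containing the requested posterior mass, and the two can differ for skewed posteriors. The following is a minimal sketch of that distinction for one-dimensional samples; `hpd_interval` is a hypothetical helper written for this illustration, and the gamma draws are synthetic rather than taken from the trace.

```python
import numpy as np


def percentile_interval(samples, alpha=0.05):
    """Equal-tailed interval, as used for the bands in the plots above."""
    return np.percentile(samples, [100 * alpha / 2., 100 * (1. - alpha / 2.)])


def hpd_interval(samples, alpha=0.05):
    """Narrowest interval containing a (1 - alpha) fraction of the samples."""
    sorted_samples = np.sort(samples)
    n = len(sorted_samples)
    n_in = int(np.ceil((1. - alpha) * n))  # number of samples inside the interval
    # Width of every window of n_in consecutive sorted samples
    widths = sorted_samples[n_in - 1:] - sorted_samples[:n - n_in + 1]
    start = np.argmin(widths)
    return np.array([sorted_samples[start], sorted_samples[start + n_in - 1]])


rng = np.random.default_rng(1)
skewed = rng.gamma(2., 1., size=5000)  # synthetic right-skewed "posterior"

pct = percentile_interval(skewed)
hpd = hpd_interval(skewed)
print(pct, hpd)
```

For a right-skewed sample like this one, the narrowest interval sits closer to zero than the equal-tailed one; for roughly symmetric posteriors the two nearly coincide.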

Another of the advantages of the model we have built is its flexibility. From the plots above, we may reasonably believe that the additional hazard due to metastasis varies over time; it seems plausible that cancer that has metastasized increases the hazard rate immediately after the mastectomy, but that the risk due to metastasis decreases over time. We can accommodate this mechanism in our model by allowing the regression coefficients to vary over time. In the time-varying coefficient model, if \(s_j \leq t < s_{j + 1}\), we let \(\lambda(t) = \lambda_j \exp(\mathbf{x} \beta_j).\) The sequence of regression coefficients \(\beta_1, \beta_2, \ldots, \beta_{N - 1}\) forms a normal random walk with \(\beta_1 \sim N(0, 1)\), \(\beta_j\ |\ \beta_{j - 1} \sim N(\beta_{j - 1}, 1)\).
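To build some intuition for this random walk prior, it can be simulated directly with NumPy as a cumulative sum of standard normal increments. This is only a sketch of the prior, not of the fitted model; `n_paths` and `n_intervals` are hypothetical values chosen for the illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

n_paths = 20000   # hypothetical number of simulated prior draws
n_intervals = 10  # hypothetical number of intervals

# beta_1 ~ N(0, 1) and each later coefficient adds an independent N(0, 1)
# increment, so each path is a cumulative sum of standard normal draws
increments = rng.normal(0., 1., size=(n_paths, n_intervals))
beta_paths = increments.cumsum(axis=1)

# Under this prior the variance of beta_j grows linearly in j
prior_var = beta_paths.var(axis=0)
print(prior_var.round(1))
```

The prior therefore ties neighboring coefficients together while leaving later coefficients increasingly free to drift away from zero.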

We implement this model in `pymc3` as follows.

```
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    beta = GaussianRandomWalk('beta', tau=1., shape=n_intervals)

    lambda_ = pm.Deterministic('h', lambda0 * T.exp(T.outer(T.constant(df.metastized), beta)))
    mu = pm.Deterministic('mu', exposure * lambda_)

    obs = pm.Poisson('obs', mu, observed=death)
```

We proceed to sample from this model.

```
with time_varying_model:
    step = pm.Metropolis()
    time_varying_trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 56.7 sec`

`time_varying_trace = time_varying_trace_[burn::thin]`

We see from the plot of \(\beta_j\) over time below that initially \(\beta_j > 0\), indicating an elevated hazard rate due to metastasis, but that this risk declines over time, with \(\beta_j < 0\) eventually.

```
fig, ax = plt.subplots(figsize=(8, 6))

beta_hpd = np.percentile(time_varying_trace['beta'], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]

ax.fill_between(interval_bounds[:-1], beta_low, beta_high,
                color=blue, alpha=0.25);

beta_hat = time_varying_trace['beta'].mean(axis=0)

ax.step(interval_bounds[:-1], beta_hat, color=blue);

ax.scatter(interval_bounds[last_period[(df.event.values == 1) & (df.metastized == 1)]],
           beta_hat[last_period[(df.event.values == 1) & (df.metastized == 1)]],
           c=red, zorder=10, label='Died, cancer metastasized');
ax.scatter(interval_bounds[last_period[(df.event.values == 0) & (df.metastized == 1)]],
           beta_hat[last_period[(df.event.values == 0) & (df.metastized == 1)]],
           c=blue, zorder=10, label='Censored, cancer metastasized');

ax.set_xlim(0, df.time.max());
ax.set_xlabel('Months since mastectomy');
ax.set_ylabel(r'$\beta_j$');
ax.legend();
```

The coefficients \(\beta_j\) begin declining rapidly around one hundred months post-mastectomy, which seems reasonable, given that only three of the twelve subjects whose cancer had metastasized and who lived past this point died during the study.

The change in our estimate of the cumulative hazard and survival functions due to time-varying effects is also quite apparent in the following plots.

```
tv_base_hazard = time_varying_trace['lambda0']
tv_met_hazard = time_varying_trace['lambda0'] * np.exp(np.atleast_2d(time_varying_trace['beta']))
```
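The two hazard computations rely on slightly different NumPy broadcasting. In the first model there is one coefficient per posterior draw, so it is reshaped into a column and broadcast across intervals; in the time-varying model there is already one coefficient per draw and interval, so the product is elementwise. The sketch below illustrates this with small random arrays standing in for the traces; the sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws, n_intervals = 4, 3  # small hypothetical sizes

lambda0 = rng.gamma(1., 1., size=(n_draws, n_intervals))  # stand-in for trace['lambda0']
beta_scalar = rng.normal(size=n_draws)                    # one coefficient per draw
beta_walk = rng.normal(size=(n_draws, n_intervals))       # one per draw and interval

# Column vector of shape (n_draws, 1), broadcast across the interval axis
met = lambda0 * np.exp(np.atleast_2d(beta_scalar).T)

# Already (n_draws, n_intervals), so the product is elementwise
tv_met = lambda0 * np.exp(np.atleast_2d(beta_walk))

print(np.atleast_2d(beta_scalar).T.shape)  # (4, 1)
print(met.shape, tv_met.shape)             # (4, 3) (4, 3)
```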

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.step(interval_bounds[:-1], cum_hazard(base_hazard.mean(axis=0)),
        color=blue, label='Had not metastasized');
ax.step(interval_bounds[:-1], cum_hazard(met_hazard.mean(axis=0)),
        color=red, label='Metastasized');

ax.step(interval_bounds[:-1], cum_hazard(tv_base_hazard.mean(axis=0)),
        color=blue, linestyle='--', label='Had not metastasized (time varying effect)');
ax.step(interval_bounds[:-1], cum_hazard(tv_met_hazard.mean(axis=0)),
        color=red, linestyle='--', label='Metastasized (time varying effect)');

ax.set_xlim(0, df.time.max() - 4);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(0, 2);
ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
ax.legend(loc=2);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))

plot_with_hpd(interval_bounds[:-1], tv_base_hazard, cum_hazard,
              hazard_ax, color=blue, label='Had not metastasized')
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, cum_hazard,
              hazard_ax, color=red, label='Metastasized')

hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylim(0, 2);
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);

plot_with_hpd(interval_bounds[:-1], tv_base_hazard, survival,
              surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, survival,
              surv_ax, color=red)

surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');

fig.suptitle('Bayesian survival model with time varying effects');
```

We have really only scratched the surface of both survival analysis and the Bayesian approach to survival analysis. More information on Bayesian survival analysis is available in Ibrahim et al.^{2} (For example, we may want to account for individual frailty in either our original or time-varying models.)

This post is available as an IPython notebook here.
