(**Author’s note**: many thanks to Robert (@atlhawksfanatic on Twitter) for pointing out some subtleties in the data set that I had missed. This post has been revised in line with his feedback. Robert has a very interesting post about how last two minute refereeing has changed over the last three years; I highly recommend you read it.)

I recently found a very interesting data set derived from the NBA’s Last Two Minute Report by Russell Goldenberg of The Pudding. Since 2015, the NBA has released a report reviewing every call and non-call in the final two minutes of every NBA game where the teams were separated by five points or less with two minutes remaining. The data set extracts each play from the NBA-distributed PDFs and augments it with information from Basketball Reference to produce a convenient CSV. The Pudding has published two very interesting visual essays using this data that you should definitely explore.

The NBA is certainly marketed as a star-centric league, so this data set presents a fantastic opportunity to understand the extent to which the players involved in a decision impact whether or not a foul is called. We will also explore other factors related to foul calls.

`%matplotlib inline`

```
import datetime
from warnings import filterwarnings
```

```
from matplotlib import pyplot as plt
from matplotlib.ticker import FuncFormatter
import numpy as np
import pandas as pd
import pymc3 as pm
from scipy.special import expit
import seaborn as sns
```

```
blue, green, red, purple, gold, teal = sns.color_palette()
million_dollars_formatter = FuncFormatter(lambda value, _: '${:.1f}M'.format(value / 1e6))
pct_formatter = FuncFormatter(lambda prop, _: "{:.1%}".format(prop))
```

`filterwarnings('ignore', 'findfont')`

We begin by loading the data set from GitHub. For reproducibility, we load the data from the most recent commit as of the time this post was published.

```
DATA_URI = 'https://raw.githubusercontent.com/polygraph-cool/last-two-minute-report/1b89b71df060add5538b70d391d7ad82a4c24db2/output/all_games.csv'
raw_df = (pd.read_csv(DATA_URI,
                      usecols=['committing_player', 'disadvantaged_player',
                               'committing_team', 'disadvantaged_team',
                               'seconds_left', 'review_decision', 'date'],
                      parse_dates=['date'])
            .where(lambda df: df.date >= datetime.datetime(2016, 10, 25))
            .dropna(subset=['date'])
            .drop('date', axis=1))
raw_df['review_decision'] = raw_df.review_decision.fillna("INC")
raw_df = (raw_df.dropna()
                .reset_index(drop=True))
```

We restrict our attention to decisions from the 2016-2017 NBA season, for which salary information is readily available from Basketball Reference.

`raw_df.head()`

| | seconds_left | committing_player | disadvantaged_player | review_decision | disadvantaged_team | committing_team |
|---|---|---|---|---|---|---|
| 0 | 102.0 | Al-Farouq Aminu | George Hill | CNC | UTA | POR |
| 1 | 98.0 | Boris Diaw | Damian Lillard | CC | POR | UTA |
| 2 | 64.0 | Ed Davis | George Hill | CNC | UTA | POR |
| 3 | 62.0 | Rudy Gobert | CJ McCollum | INC | POR | UTA |
| 4 | 27.1 | CJ McCollum | Rodney Hood | CC | UTA | POR |

We have only loaded some of the data set’s columns; see the original CSV header for the rest.

The response variable in our analysis is derived from `review_decision`, which contains information about whether the incident was a call or non-call and whether, upon post-game review, the NBA deemed the (non-)call correct or incorrect. Below we show the frequencies of each type of `review_decision`.

```
ax = (raw_df.groupby('review_decision')
            .size()
            .plot(kind='bar'))

ax.set_ylabel("Frequency");
```

The possible values of `review_decision` are:

- `CC` for correct call,
- `CNC` for correct non-call,
- `IC` for incorrect call, and
- `INC` for incorrect non-call.

While `review_decision` provides information about both whether or not a foul was called and whether or not a foul was actually committed, this analysis will focus only on whether or not a foul was called. Including whether or not a foul was actually committed introduces some subtleties that are best left to a future post.

In this dataset, the “committing” player is the one that a foul would be called against, if a foul was called on the play, and the other player is “disadvantaged.”

We now encode the data. Since the committing player on one play may be the disadvantaged player on another play, we `melt` the raw data frame to have one row per player-play combination so that we can encode the players in a way that is consistent across columns.

```
PLAYER_MAP = {
    "Jose Juan Barea": "JJ Barea",
    "Nene Hilario": "Nene",
    "Tim Hardaway": "Tim Hardaway Jr",
    "James Ennis": "James Ennis III",
    "Kelly Oubre": "Kelly Oubre Jr",
    "Taurean Waller-Prince": "Taurean Prince",
    "Glenn Robinson": "Glenn Robinson III",
    "Otto Porter": "Otto Porter Jr"
}

TEAM_MAP = {
    "NKY": "NYK",
    "COS": "BOS",
    "SAT": "SAS"
}
```

```
long_df = (pd.melt(
        (raw_df.reset_index(drop=True)
               .rename_axis('play_id')
               .reset_index()),
        id_vars=['play_id', 'review_decision',
                 'committing_team', 'disadvantaged_team',
                 'seconds_left'],
        value_vars=['committing_player', 'disadvantaged_player'],
        var_name='player', value_name='player_name_')
        # fix inconsistent player names
        .assign(player_name=lambda df: (df.player_name_
                                          .str.replace(r'\.', '')
                                          .apply(lambda name: PLAYER_MAP.get(name, name))))
        .assign(team_=lambda df: (df.committing_team
                                    .where(df.player == 'committing_player',
                                           df.disadvantaged_team)))
        # fix typos in team names
        .assign(team=lambda df: df.team_.apply(lambda team: TEAM_MAP.get(team, team)))
        .drop(['committing_team', 'disadvantaged_team', 'team_'], axis=1))

long_df['player_id'], player_map = long_df.player_name.factorize()
```

`long_df.head()`

| | play_id | review_decision | seconds_left | player | player_name_ | player_name | team | player_id |
|---|---|---|---|---|---|---|---|---|
| 0 | 0 | CNC | 102.0 | committing_player | Al-Farouq Aminu | Al-Farouq Aminu | POR | 0 |
| 1 | 1 | CC | 98.0 | committing_player | Boris Diaw | Boris Diaw | UTA | 1 |
| 2 | 2 | CNC | 64.0 | committing_player | Ed Davis | Ed Davis | POR | 2 |
| 3 | 3 | INC | 62.0 | committing_player | Rudy Gobert | Rudy Gobert | UTA | 3 |
| 4 | 4 | CC | 27.1 | committing_player | CJ McCollum | CJ McCollum | POR | 4 |

After encoding, we pivot back to a wide data frame with one row per play.

```
df = (long_df.pivot_table(index=['play_id', 'review_decision', 'seconds_left'],
                          columns='player', values='player_id')
             .rename(columns={
                 'committing_player': 'committing_id',
                 'disadvantaged_player': 'disadvantaged_id'
             })
             .rename_axis('', axis=1)
             .reset_index()
             .assign(foul_called=lambda df: 1 * (df.review_decision.isin(['CC', 'IC'])))
             .drop(['play_id', 'review_decision'], axis=1))
```

In addition to encoding the players, we have included a column (`foul_called`) that indicates whether or not a foul was called on the play.

`df.head()`

| | seconds_left | committing_id | disadvantaged_id | foul_called |
|---|---|---|---|---|
| 0 | 102.0 | 0 | 300 | 0 |
| 1 | 98.0 | 1 | 124 | 1 |
| 2 | 64.0 | 2 | 300 | 0 |
| 3 | 62.0 | 3 | 4 | 0 |
| 4 | 27.1 | 4 | 6 | 1 |

In order to understand how foul calls vary systematically across players, we will use salary as a proxy for “star power.” The salary data we use was downloaded from Basketball Reference.

```
SALARY_URI = 'http://www.austinrochford.com/resources/nba_irt/2016_2017_salaries.csv'
salary_df = (pd.read_csv(SALARY_URI, skiprows=1,
                         usecols=['Player', '2016-17'])
               .assign(player_name=lambda df: (df.Player
                                                 .str.split('\\', expand=True)[0]
                                                 .str.replace(r'\.', '')
                                                 # fix inconsistent player names
                                                 .apply(lambda name: PLAYER_MAP.get(name, name))),
                       salary=lambda df: (df['2016-17'].str
                                                       .lstrip('$')
                                                       .astype(np.float64)))
               .assign(log_salary=lambda df: np.log10(df.salary))
               .assign(std_log_salary=lambda df: (df.log_salary - df.log_salary.mean()) / df.log_salary.std())
               .drop(['Player', '2016-17'], axis=1)
               .groupby('player_name')
               .max()
               .select(lambda name: name in player_map)
               .assign(player_id=lambda df: (np.equal
                                               .outer(player_map, df.index)
                                               .argmax(axis=0)))
               .reset_index()
               .set_index('player_id')
               .sort_index())
```

Since NBA salaries span many orders of magnitude (LeBron James’ salary is just shy of $31M, while the lowest-paid player made just more than $200K), we will use log salaries, standardized to have mean zero and standard deviation one, in our model.
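As a quick sanity check of this transformation (using hypothetical salary values spanning the league's range, not the actual data), standardizing base-10 log salaries works as follows:

```python
import numpy as np

# Hypothetical salaries spanning roughly the league's range (not real data)
salaries = np.array([200_000., 1_500_000., 7_000_000., 31_000_000.])

log_salary = np.log10(salaries)
std_log_salary = (log_salary - log_salary.mean()) / log_salary.std()

# After standardization the log salaries have mean zero and unit standard
# deviation, so a one-unit change in the regressor corresponds to one
# standard deviation on the log-salary scale
print(std_log_salary.mean(), std_log_salary.std())
```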

`salary_df.head()`

| player_id | player_name | salary | log_salary | std_log_salary |
|---|---|---|---|---|
| 0 | Al-Farouq Aminu | 7680965.0 | 6.885416 | 0.848869 |
| 1 | Boris Diaw | 7000000.0 | 6.845098 | 0.797879 |
| 2 | Ed Davis | 6666667.0 | 6.823909 | 0.771080 |
| 3 | Rudy Gobert | 2121288.0 | 6.326600 | 0.142129 |
| 4 | CJ McCollum | 3219579.0 | 6.507799 | 0.371293 |

We also produce a data frame associating players to teams, along with some useful per-player summaries.

```
team_player_map = (long_df.groupby('team')
                          .player_id
                          .apply(pd.Series.drop_duplicates)
                          .reset_index(level=-1, drop=True)
                          .reset_index()
                          .assign(name=lambda df: player_map[df.player_id],
                                  disadvantaged_rate=lambda tpm_df: (df.groupby('disadvantaged_id')
                                                                       .foul_called
                                                                       .mean()
                                                                       .ix[tpm_df.player_id]
                                                                       .values),
                                  disadvantaged_plays=lambda tpm_df: (df.groupby('disadvantaged_id')
                                                                        .size()
                                                                        .ix[tpm_df.player_id]
                                                                        .values))
                          .fillna(0))
```

`team_player_map.head()`

| | team | player_id | disadvantaged_plays | disadvantaged_rate | name |
|---|---|---|---|---|---|
| 0 | ATL | 114 | 8.0 | 0.000000 | Kyle Korver |
| 1 | ATL | 115 | 13.0 | 0.538462 | Dwight Howard |
| 2 | ATL | 116 | 44.0 | 0.272727 | Paul Millsap |
| 3 | ATL | 117 | 60.0 | 0.283333 | Dennis Schroder |
| 4 | ATL | 181 | 25.0 | 0.200000 | Kent Bazemore |

Throughout this post, we will develop a series of models for understanding how foul calls vary across players, starting with a simple beta-Bernoulli model and working our way up to a hierarchical item-response theory regression model.

Before building models, we must introduce a bit of notation. The index \(i\) will correspond to a disadvantaged player and the index \(j\) corresponds to a committing player. The index \(k\) corresponds to a play. With this notation \(i(k)\) and \(j(k)\) are the index of the disadvantaged and committing player involved in play \(k\), respectively. The binary variable \(y_k\) indicates whether or not a foul was called on play \(k\). All of our models use the likelihood

\[y_k \sim \textrm{Bernoulli}(p_{i(k), j(k)}).\]

Each model differs in its specification of the probability that a foul is called, \(p_{i, j}\).

One of the simplest possible models for this data focuses only on the disadvantaged player, so \(p_{i, j} = p_i\), and places independent beta priors on each \(p_i\). For simplicity, we begin with uniform priors, \(p_i \sim \textrm{Beta}(1, 1).\)

Even though this model is conjugate, we will use `pymc3` to perform inference with it, for consistency with subsequent, non-conjugate models.

```
n_players = player_map.size
disadvantaged_id = df.disadvantaged_id.values
foul_called = df.foul_called.values
obs_rate = foul_called.mean()
```

```
with pm.Model() as bb_model:
    p = pm.Beta('p', 1., 1., shape=n_players)

    y = pm.Bernoulli('y_obs', p[disadvantaged_id],
                     observed=foul_called)
```

Throughout this post, we will use the no-U-turn sampler for inference, tuning the sampler’s hyperparameters for the first two thousand samples and subsequently keeping the next two thousand samples for inference.

```
N_TUNE = 2000
N_SAMPLES = 2000
SEED = 506421 # from random.org, for reproducibility
```

We now sample from the beta-Bernoulli model.

```
def sample(model, n_tune, n_samples, seed):
    with model:
        full_trace = pm.sample(n_tune + n_samples, random_seed=seed)

    return full_trace[n_tune:]
```

`bb_trace = sample(bb_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
2%|▏ | 4890/200000 [00:02<02:09, 1508.62it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,554.3
100%|██████████| 4000/4000 [01:25<00:00, 34.83it/s]
```

We use energy plots to diagnose possible problems with our samples.

```
def energy_plot(trace):
    energy = trace['energy']
    energy_diff = np.diff(energy)

    fig, ax = plt.subplots(figsize=(8, 6))

    ax.hist(energy - energy.mean(), bins=30,
            lw=0, alpha=0.5,
            label="Energy")
    ax.hist(energy_diff, bins=30,
            lw=0, alpha=0.5,
            label="Energy difference")

    ax.set_xticks([])
    ax.set_yticks([])

    ax.legend()
```

`energy_plot(bb_trace)`

Since the energy and energy difference distributions are quite similar, there is no indication from this plot of sampling issues. For an in-depth treatment of Hamiltonian Monte Carlo algorithms and convergence diagnostics, consult Michael Betancourt’s excellent paper *A Conceptual Introduction to Hamiltonian Monte Carlo*.

We will use the widely applicable information criterion (WAIC) and binned residuals to check and compare our models. WAIC is a Bayesian measure of out-of-sample predictive accuracy based on in-sample data that is quite closely related to leave-one-out cross-validation. It attempts to improve upon known shortcomings of the widely-used deviance information criterion. (See *Understanding predictive information criteria for Bayesian models* for a review and comparison of various information criteria, including DIC and WAIC.) WAIC is easy to calculate with `pymc3`.

```
def get_waic_df(model, trace, name):
    with model:
        waic = pm.waic(trace)

    return pd.DataFrame.from_records([waic], index=[name], columns=waic._fields)
```

`waic_df = get_waic_df(bb_model, bb_trace, "Beta-Bernoulli")`

```
/opt/conda/lib/python3.5/site-packages/pymc3/stats.py:145: UserWarning: For one or more samples the posterior variance of the
log predictive densities exceeds 0.4. This could be indication of
WAIC starting to fail see http://arxiv.org/abs/1507.04544 for details
""")
```

We see that the WAIC calculation indicates difficulties with the beta-Bernoulli model, which we will soon confirm.

`waic_df`

| | WAIC | WAIC_se | p_WAIC |
|---|---|---|---|
| Beta-Bernoulli | 6021.441722 | 66.302246 | 238.505387 |

In addition to the WAIC value, we get an estimate of its standard error (`WAIC_se`) and the number of effective parameters in the model (`p_WAIC`). The number of effective parameters is an indication of model complexity.

The second diagnostic tool we use on our models are binned residuals, which show how well-calibrated the model’s predicted probabilities are. Intuitively, if our model predicts that an event has a 35% chance of occurring and we can observe many repetitions of that event, we would expect the event to actually occur about 35% of the time. If the observed occurrences of the event differ substantially from the predicted rate, we have reason to doubt the quality of our model. Since we generally can’t observe each event many times, we instead group events into bins by their predicted probability and check that the average predicted probability in each bin is close to the rate at which the events in that bin are observed.
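The idea can be sketched with simulated data (all values below are synthetic, generated from a model that is calibrated by construction, so its binned residuals should be near zero):

```python
import numpy as np

rng = np.random.RandomState(42)

# Simulate 10,000 events from a model that is calibrated by construction:
# each event's true probability equals its predicted probability
p_pred = rng.uniform(0., 1., size=10_000)
y = rng.binomial(1, p_pred)

# Group the events into ten equal-width bins by predicted probability
bins = np.linspace(0., 1., 11)
bin_ix = np.digitize(p_pred, bins[1:-1])

# Within each bin, compare the observed event rate to the mean prediction
residuals = np.array([y[bin_ix == i].mean() - p_pred[bin_ix == i].mean()
                      for i in range(10)])

# For a well-calibrated model, every binned residual is close to zero
print(np.abs(residuals).max())
```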

The binned predictions and residuals for the beta-Bernoulli model are shown below.

```
BINS = np.linspace(0, 1, 11)
BINS_3D = BINS[np.newaxis, np.newaxis]

def binned_residuals(y, p):
    p_3d = p[..., np.newaxis]
    in_bin = (BINS_3D[..., :-1] < p_3d) & (p_3d <= BINS_3D[..., 1:])

    bin_counts = in_bin.sum(axis=(0, 1))
    p_mean = (in_bin * p_3d).sum(axis=(0, 1)) / bin_counts
    y_mean = (in_bin * y[np.newaxis, :, np.newaxis]).sum(axis=(0, 1)) / bin_counts

    return y_mean, p_mean, bin_counts

def binned_residual_plot(bin_obs, bin_p, bin_counts):
    fig, (ax, resid_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))

    ax.scatter(bin_p, bin_obs,
               s=300 * np.sqrt(bin_counts / bin_counts.sum()),
               zorder=5)
    ax.plot([0, 1], [0, 1], '--', c='k')

    ax.set_xlim(0, 1)
    ax.set_xticks(np.linspace(0, 1, 5))
    ax.xaxis.set_major_formatter(pct_formatter)
    ax.set_xlabel("Mean $p$ (binned)")

    ax.set_ylim(0, 1)
    ax.set_yticks(np.linspace(0, 1, 5))
    ax.yaxis.set_major_formatter(pct_formatter)
    ax.set_ylabel("Observed rate (binned)")

    resid_ax.scatter(bin_p, bin_obs - bin_p,
                     s=300 * np.sqrt(bin_counts / bin_counts.sum()),
                     zorder=5)
    resid_ax.hlines(0, 0, 1, 'k', '--')

    resid_ax.set_xlim(0, 1)
    resid_ax.set_xticks(np.linspace(0, 1, 5))
    resid_ax.xaxis.set_major_formatter(pct_formatter)
    resid_ax.set_xlabel("Mean $p$ (binned)")

    resid_ax.yaxis.set_major_formatter(pct_formatter)
    resid_ax.set_ylabel("Residual (binned)")
```

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, bb_trace['p'][:, disadvantaged_id])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

In these plots, the dashed black lines show how these quantities would be related for a perfect model. The area of each point is proportional to the number of events whose predicted probability fell in the relevant bin. From these plots, we get further confirmation that our simple beta-Bernoulli model is quite unsatisfactory, as many binned residuals exceed 5% in absolute value.

Below we plot the posterior mean and 90% credible interval for \(p\) for each player in the data set (grouped by team, for legibility), along with the player’s observed foul called percentage when disadvantaged. The area of the point for observed foul called percentage is proportional to the number of plays in which the player was disadvantaged.

```
def to_param_df(player_df, trace, varnames):
    df = player_df

    for name in varnames:
        mean = trace[name].mean(axis=0)
        low, high = np.percentile(trace[name], [5, 95], axis=0)

        df = df.assign(**{
            '{}_mean'.format(name): mean[df.player_id],
            '{}_low'.format(name): low[df.player_id],
            '{}_high'.format(name): high[df.player_id]
        })

    return df
```

`bb_df = to_param_df(team_player_map, bb_trace, ['p'])`

```
def plot_params(mean, interval, names, ax=None, **kwargs):
    if ax is None:
        fig, ax = plt.subplots(figsize=(8, 6))

    n_players = names.size

    ax.errorbar(mean, np.arange(n_players),
                xerr=np.abs(mean - interval),
                fmt='o',
                label="Mean with\n90% interval")

    ax.set_ylim(-1, n_players)
    ax.set_yticks(np.arange(n_players))
    ax.set_yticklabels(names)

    return ax

def plot_p_params(rate, n_plays, league_mean, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()

    n_players = rate.size

    ax.scatter(rate, np.arange(n_players),
               c='k', s=20 * np.sqrt(n_plays),
               alpha=0.5, zorder=5,
               label="Observed")
    ax.vlines(league_mean, -1, n_players,
              'k', '--',
              label="League average")

def plot_p_helper(mean, low, high, rate, n_plays, names, league_mean=None, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()

    mean = mean.values
    rate = rate.values
    n_plays = n_plays.values
    names = names.values

    argsorted_ix = mean.argsort()
    interval = np.row_stack([low, high])

    plot_params(mean[argsorted_ix], interval[:, argsorted_ix], names[argsorted_ix],
                ax=ax, **kwargs)
    plot_p_params(rate[argsorted_ix], n_plays[argsorted_ix], league_mean,
                  ax=ax, **kwargs)
```

```
grid = sns.FacetGrid(bb_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_p_helper,
         'p_mean', 'p_low', 'p_high',
         'disadvantaged_rate', 'disadvantaged_plays', 'name',
         league_mean=obs_rate);
grid.set_axis_labels(r"$p$", "Player");

for ax in grid.axes:
    ax.set_xticks(np.linspace(0, 1, 5));
    ax.set_xticklabels(ax.get_xticklabels(), visible=True)
    ax.xaxis.set_major_formatter(pct_formatter);

grid.fig.tight_layout();
grid.add_legend();
```

These plots reveal an undesirable property of this model and its inferences. Since the prior distribution on \(p_i\) is uniform on the interval \([0, 1]\), all posterior estimates of \(p_i\) are pulled towards the prior expected value of 50%. This phenomenon is known as shrinkage. In the extreme case of players who were never disadvantaged, the posterior estimate of \(p_i\) is quite close to 50%. For these players, the league average foul call rate, shown as a dashed black line on the charts above, would seem to be a much more reasonable estimate of \(p_i\) than 50%.
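Because the uniform prior is conjugate, this shrinkage is easy to see in closed form: under a \(\textrm{Beta}(1, 1)\) prior, a player with \(k\) fouls called in \(n\) disadvantaged plays has posterior mean \((k + 1) / (n + 2)\). A minimal sketch with hypothetical counts:

```python
def beta_bernoulli_posterior_mean(fouls_called, plays, a=1., b=1.):
    """Posterior mean of p under a Beta(a, b) prior after observing
    `fouls_called` successes in `plays` Bernoulli trials."""
    return (a + fouls_called) / (a + b + plays)

# A player never observed as disadvantaged is pulled all the way to the
# prior mean of 50%, far from the league-average foul call rate
print(beta_bernoulli_posterior_mean(0, 0))    # 0.5

# A heavily-observed player (hypothetical counts) is barely shrunk at all
print(beta_bernoulli_posterior_mean(15, 60))  # ≈ 0.258
```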

There are several possible modifications of the beta-Bernoulli model that can cause shrinkage toward the league average. Perhaps the most straightforward is the empirical Bayesian method that sets the parameters of the prior distribution on \(p_i\) using the observed data. In this framework, there are many methods of choosing prior hyperparameters that make the prior expected value equal to the league average, therefore causing shrinkage toward the league average. We do not use empirical Bayesian methods in this post as they make it cumbersome to build the more complex models we want to use to understand the relationship between salary and foul calls. Empirical Bayesian methods are, however, an approximation to the fully hierarchical models we begin building in the next section.
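For concreteness, here is a sketch of one such empirical Bayesian recipe (which, again, we do not use in this post): fitting beta hyperparameters by the method of moments, so the prior mean matches the observed average rate. The per-player rates below are hypothetical.

```python
import numpy as np

def beta_method_of_moments(rates):
    """Choose Beta(alpha, beta) hyperparameters whose mean and variance
    match the sample mean and variance of per-player foul call rates."""
    mean = rates.mean()
    var = rates.var()
    common = mean * (1 - mean) / var - 1
    return mean * common, (1 - mean) * common

# Hypothetical per-player observed foul call rates (not the real data)
rates = np.array([0.15, 0.2, 0.22, 0.25, 0.3, 0.35, 0.28, 0.24])
alpha, beta = beta_method_of_moments(rates)

# By construction the fitted prior's mean equals the observed average rate,
# so posterior estimates shrink toward the league average instead of 50%
print(alpha / (alpha + beta), rates.mean())
```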

A hierarchical logistic-normal model addresses some of the shortcomings of the beta-Bernoulli model. For simplicity, this model focuses exclusively on the disadvantaged player and assumes that the log-odds of a foul call for a given disadvantaged player are normally distributed. That is,

\[ \begin{align*} \log \left(\frac{p_i}{1 - p_i}\right) & \sim N(\mu, \sigma^2) \\ \eta_{k} & = \log \left(\frac{p_{i(k)}}{1 - p_{i(k)}}\right), \end{align*}\]

which is equivalent to

\[p_{i(k)} = \frac{1}{1 + \exp(-\eta_k)}.\]

We address the beta-Bernoulli model’s shrinkage problem by placing a normal hyperprior distribution on \(\mu\), \(\mu \sim N(0, 100).\) This shared hyperprior makes this model hierarchical. To complete the specification of this model, we place a half-Cauchy prior on \(\sigma\), \(\sigma \sim \textrm{HalfCauchy}(2.5)\).

```
with pm.Model() as ln_model:
    μ = pm.Normal('μ', 0., 10.)
    Δ = pm.Normal('Δ', 0., 1., shape=n_players)
    σ = pm.HalfCauchy('σ', 2.5)
    p_player = pm.Deterministic('p_player', pm.math.sigmoid(μ + Δ * σ))

    η = μ + Δ[disadvantaged_id] * σ
    p = pm.Deterministic('p', pm.math.sigmoid(η))

    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

Throughout this post we use an offset parameterization for hierarchical models that significantly improves sampling efficiency. We now sample from this model.
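Before doing so, note that the offset parameterization is legitimate because of a simple identity: if \(\Delta \sim N(0, 1)\), then \(\mu + \Delta \cdot \sigma \sim N(\mu, \sigma^2)\), so the two parameterizations describe the same prior. A quick simulation sketch (the seed, sample size, and parameter values are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(0)
mu, sigma = -1.2, 0.7  # arbitrary illustrative values

# Centered parameterization: sample the log-odds directly
centered = rng.normal(mu, sigma, size=1_000_000)

# Offset (non-centered) parameterization: sample standard normals,
# then shift and scale them deterministically
delta = rng.normal(0., 1., size=1_000_000)
offset = mu + delta * sigma

# Both samples have (approximately) the same mean and standard deviation
print(centered.mean(), offset.mean())
print(centered.std(), offset.std())
```

The sampler benefits because, in the offset form, the standard-normal draws are independent of the hyperparameters, which removes the funnel-shaped posterior geometry that frustrates Hamiltonian Monte Carlo.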

`ln_trace = sample(ln_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
8%|▊ | 16985/200000 [00:13<02:23, 1273.77it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,520.3
100%|██████████| 4000/4000 [04:12<00:00, 15.82it/s]
```

The energy plot for this model gives no cause for concern.

`energy_plot(ln_trace)`

We now calculate the WAIC of the logistic-normal model, and compare it to that of the beta-Bernoulli model.

`waic_df = waic_df.append(get_waic_df(ln_model, ln_trace, "Logistic-normal"))`

```
def waic_plot(waic_df):
    fig, (waic_ax, p_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))

    waic_x = np.arange(waic_df.shape[0])

    waic_ax.errorbar(waic_x, waic_df.WAIC,
                     yerr=waic_df.WAIC_se,
                     fmt='o')
    waic_ax.set_xticks(waic_x)
    waic_ax.xaxis.grid(False)
    waic_ax.set_xticklabels(waic_df.index)
    waic_ax.set_xlabel("Model")
    waic_ax.set_ylabel("WAIC")

    p_ax.bar(waic_x, waic_df.p_WAIC)
    p_ax.xaxis.grid(False)
    p_ax.set_xlabel("Model")
    p_ax.set_ylabel("Effective number\nof parameters")
```

`waic_plot(waic_df)`

The left-hand plot above shows that the logistic-normal model is a significant improvement in WAIC over the beta-Bernoulli model, which is unsurprising. The right-hand plot shows that the logistic-normal model has roughly 20% of the number of effective parameters of the beta-Bernoulli model. This reduction is due to the partial pooling effect of the hierarchical prior. The hyperprior on \(\mu\) causes observations for one player to impact the estimate of \(p_i\) for all players; this sharing of information across players is responsible for the large decrease in the number of effective parameters.

Finally, we examine the binned residuals for the logistic-normal model.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, ln_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

These binned residuals are much smaller than those of the beta-Bernoulli model, which is further confirmation that the logistic-normal model is preferable.

Below we plot the posterior distribution of \(\operatorname{logit}^{-1}(\mu)\), and we see that the observed foul call rate of approximately 25.1% lies within its 90% interval.

```
ax, = pm.plot_posterior(ln_trace, ['μ'],
                        alpha_level=0.1, transform=expit, ref_val=obs_rate,
                        lw=0., alpha=0.75)

ax.xaxis.set_major_formatter(pct_formatter);
ax.set_title(r"$\operatorname{logit}^{-1}(\mu)$");
```

With this posterior for \(\operatorname{logit}^{-1}(\mu)\), we see the desired posterior shrinkage of each \(p_i\) toward the observed foul call rate.

`ln_df = to_param_df(team_player_map, ln_trace, ['p'])`

```
grid = sns.FacetGrid(ln_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_p_helper,
         'p_mean', 'p_low', 'p_high',
         'disadvantaged_rate', 'disadvantaged_plays', 'name',
         league_mean=obs_rate);
grid.set_axis_labels(r"$p$", "Player");

for ax in grid.axes:
    ax.set_xticks(np.linspace(0, 1, 5));
    ax.set_xticklabels(ax.get_xticklabels(), visible=True)
    ax.xaxis.set_major_formatter(pct_formatter);

grid.fig.tight_layout();
grid.add_legend();
```

The inferences in these plots are markedly different from those of the beta-Bernoulli model. Most strikingly, we see that estimates have been shrunk towards the league average foul call rate, and that players that were never disadvantaged have posterior foul call probabilities quite close to that rate. As a consequence of this more reasonable shrinkage, the range of values taken by the posterior \(p_i\) estimates is much smaller for the logistic-normal model than for the beta-Bernoulli model. Below we plot the top- and bottom-ten players by \(p_i\).

```
fig, (top_ax, bottom_ax) = plt.subplots(nrows=2, sharex=True, figsize=(8, 10))

by_p = (ln_df.drop_duplicates(['player_id'])
             .sort_values('p_mean'))

p_top = by_p.iloc[-10:]
plot_params(p_top.p_mean.values, p_top[['p_low', 'p_high']].values.T,
            p_top.name.values,
            ax=top_ax);
top_ax.vlines(obs_rate, -1, 10,
              'k', '--',
              label=r"League average");
top_ax.xaxis.set_major_formatter(pct_formatter);
top_ax.set_ylabel("Player");
top_ax.set_title("Top ten");

p_bottom = by_p.iloc[:10]
plot_params(p_bottom.p_mean.values, p_bottom[['p_low', 'p_high']].values.T,
            p_bottom.name.values,
            ax=bottom_ax);
bottom_ax.vlines(obs_rate, -1, 10,
                 'k', '--',
                 label=r"League average");
bottom_ax.xaxis.set_major_formatter(pct_formatter);
bottom_ax.set_xlabel(r"$p$");
bottom_ax.set_ylabel("Player");

fig.tight_layout();
bottom_ax.legend(loc=1);
bottom_ax.set_title("Bottom ten");
```

The hierarchical logistic-normal model is certainly an improvement over the beta-Bernoulli model, but both of these models have focused solely on the disadvantaged player. It seems quite important to understand the contribution of not just the disadvantaged player, but also of the committing player in each play to the probability of a foul call. Item-response theory (IRT) provides generalizations of the logistic-normal model that can account for the influence of both players involved in a play. IRT originated in psychometrics as a way to simultaneously measure individual aptitude and question difficulty from test-response data, and has subsequently found many other applications.

We use IRT to model foul calls by treating disadvantaged players as analogous to test takers and committing players as analogous to questions. Specifically, we will use the Rasch model for the probability \(p_{i, j}\) that a foul is called on a play where player \(i\) is disadvantaged by committing player \(j\). This model posits that each player has a latent ability \(\theta_i\) that governs how often fouls are called when they are disadvantaged, and a latent difficulty \(b_j\) that governs how often fouls are not called when they are committing. The probability that a foul is called on a play where player \(i\) is disadvantaged and player \(j\) is committing is then a function of the difference between the corresponding latent ability and difficulty parameters,

\[ \begin{align*} \eta_k & = \theta_{i(k)} - b_{j(k)} \\ p_k & = \frac{1}{1 + \exp(-\eta_k)}. \end{align*} \]

In this model, a player with a large value of \(\theta_i\) is more likely to get a foul called when they are disadvantaged, and a player with a large value of \(b_j\) is less likely to have a foul called when they are committing. If \(\theta_{i(k)} = b_{j(k)}\), there is a 50% chance a foul is called on that play.
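A minimal numerical sketch of the Rasch call probability (the \(\theta\) and \(b\) values below are hypothetical, not estimates from the data):

```python
from scipy.special import expit  # the inverse-logit function

def rasch_p(theta, b):
    """Probability a foul is called when a player with latent ability
    theta is disadvantaged by a committing player with difficulty b."""
    return expit(theta - b)

# Equal ability and difficulty give a 50% call probability
print(rasch_p(0.5, 0.5))   # 0.5

# A high-theta disadvantaged player against a low-b committing player...
print(rasch_p(1., -1.))    # ≈ 0.88
# ...and the reverse matchup
print(rasch_p(-1., 1.))    # ≈ 0.12
```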

To complete the specification of this model, we place priors on \(\theta_i\) and \(b_j\). Similarly to \(\eta\) in the logistic-normal model, we place a hierarchical normal prior on \(\theta_i\),

\[ \begin{align*} \mu_{\theta} & \sim N(0, 100) \\ \sigma_{\theta} & \sim \textrm{HalfCauchy}(2.5) \\ \theta_i & \sim N(\mu_{\theta}, \sigma^2_{\theta}). \end{align*} \]

```
with pm.Model() as rasch_model:
    μ_θ = pm.Normal('μ_θ', 0., 10.)
    Δ_θ = pm.Normal('Δ_θ', 0., 1., shape=n_players)
    σ_θ = pm.HalfCauchy('σ_θ', 2.5)
    θ = pm.Deterministic('θ', μ_θ + Δ_θ * σ_θ)
```

We also place a hierarchical normal prior on \(b_j\), though this prior must be subtly different from that on \(\theta_i\). Since \(\theta_i\) and \(b_j\) are latent variables, there is no natural scale on which they should be measured. If every \(\theta_i\) and \(b_j\) is shifted by the same amount, say \(\delta\), the likelihood does not change. That is, if \(\tilde{\theta}_i = \theta_i + \delta\) and \(\tilde{b}_j = b_j + \delta\), then

\[ \tilde{\eta}_{i, j} = \tilde{\theta}_i - \tilde{b}_j = \theta_i + \delta - (b_j + \delta) = \theta_i - b_j = \eta_{i, j}. \]

Therefore, if we allow \(\theta_i\) and \(b_j\) to be shifted by arbitrary amounts, the Rasch model is not identified. We identify the Rasch model by constraining the mean of the hyperprior on \(b_j\) to be zero,

\[ \begin{align*} \sigma_b & \sim \textrm{HalfCauchy}(2.5) \\ b_j & \sim N(0, \sigma^2_b). \end{align*} \]
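The shift-invariance that motivates this constraint is easy to verify numerically (the latent values below are arbitrary):

```python
import numpy as np

rng = np.random.RandomState(1)

theta = rng.normal(size=5)   # hypothetical latent abilities
b = rng.normal(size=5)       # hypothetical latent difficulties
delta = 3.7                  # an arbitrary common shift

eta = theta[:, np.newaxis] - b[np.newaxis, :]
eta_shifted = (theta + delta)[:, np.newaxis] - (b + delta)[np.newaxis, :]

# The shift cancels, so every eta (and hence every likelihood term) is
# unchanged: the model cannot distinguish the two parameterizations
print(np.allclose(eta, eta_shifted))  # True
```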

```
with rasch_model:
    Δ_b = pm.Normal('Δ_b', 0., 1., shape=n_players)
    σ_b = pm.HalfCauchy('σ_b', 2.5)
    b = pm.Deterministic('b', Δ_b * σ_b)
```

We now specify the Rasch model’s likelihood and sample from it.

`committing_id = df.committing_id.values`

```
with rasch_model:
    η = θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))

    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

`rasch_trace = sample(rasch_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -3,729.5: 11%|█▏ | 22625/200000 [00:21<02:44, 1079.84it/s]Median ELBO converged.
Finished [100%]: Average ELBO = -3,037.1
100%|██████████| 4000/4000 [02:29<00:00, 26.74it/s]
```

Again, the energy plot for this model gives no cause for concern.

`energy_plot(rasch_trace)`

Below we show the WAIC of our three models.

`waic_df = waic_df.append(get_waic_df(rasch_model, rasch_trace, "Rasch"))`

`waic_plot(waic_df)`

The Rasch model represents a moderate WAIC improvement over the logistic-normal model, and unsurprisingly has many more effective parameters (since it added a nominal parameter, \(b_j\), per player).

The Rasch model also has reasonable binned residuals, with very few events having residuals above 5%.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, rasch_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

For the Rasch model (and subsequent models), we switch from visualizing the per-player call probabilities to the latent parameters \(\theta_i\) and \(b_j\).

```
μ_θ_mean = rasch_trace['μ_θ'].mean()
rasch_df = to_param_df(team_player_map, rasch_trace, ['θ', 'b'])
```

```
def plot_params_helper(mean, low, high, names, league_mean=None, league_mean_name=None, ax=None, **kwargs):
    if ax is None:
        ax = plt.gca()

    mean = mean.values
    names = names.values

    argsorted_ix = mean.argsort()
    interval = np.row_stack([low, high])

    plot_params(mean[argsorted_ix], interval[:, argsorted_ix], names[argsorted_ix],
                ax=ax, **kwargs)

    if league_mean is not None:
        ax.vlines(league_mean, -1, names.size,
                  'k', '--',
                  label=league_mean_name)
```

```
grid = sns.FacetGrid(rasch_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_params_helper,
         'θ_mean', 'θ_low', 'θ_high', 'name',
         league_mean=μ_θ_mean,
         league_mean_name=r"$\mu_{\theta}$");
grid.set_axis_labels(r"$\theta$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

```
grid = sns.FacetGrid(rasch_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_params_helper,
         'b_mean', 'b_low', 'b_high', 'name',
         league_mean=0.);
grid.set_axis_labels(r"$b$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

Though these plots are voluminous and therefore difficult to interpret precisely, a few trends are evident. The first is that there is more variation in the committing skill (\(b_j\)) than in the disadvantaged skill (\(\theta_i\)). This difference is confirmed in the following histograms of the posterior expected values of \(\theta_i\) and \(b_j\).

```
def plot_latent_distributions(θ, b):
    fig, (θ_ax, b_ax) = plt.subplots(nrows=2, sharex=True, figsize=(8, 6))

    bins = np.linspace(0.9 * min(θ.min(), b.min()),
                       1.1 * max(θ.max(), b.max()),
                       75)

    θ_ax.hist(θ, bins=bins, alpha=0.75)
    θ_ax.xaxis.set_label_position('top')
    θ_ax.set_xlabel(r"Posterior expected $\theta$")
    θ_ax.set_yticks([])
    θ_ax.set_ylabel("Frequency")

    b_ax.hist(b, bins=bins, color=green, alpha=0.75)
    b_ax.xaxis.tick_top()
    b_ax.set_xlabel(r"Posterior expected $b$")
    b_ax.set_yticks([])
    b_ax.invert_yaxis()
    b_ax.set_ylabel("Frequency")

    fig.tight_layout()
```

`plot_latent_distributions(rasch_df.θ_mean, rasch_df.b_mean)`

The following plots show the top and bottom ten players in terms of both \(\theta_i\) and \(b_j\).

```
def top_10_plot(trace_df, μ_θ=0):
    fig = plt.figure(figsize=(16, 10))
    θ_top_ax = fig.add_subplot(221)
    b_top_ax = fig.add_subplot(222)
    θ_bottom_ax = fig.add_subplot(223, sharex=θ_top_ax)
    b_bottom_ax = fig.add_subplot(224, sharex=b_top_ax)

    # necessary for players that have been on more than one team
    trace_df = trace_df.drop_duplicates(['player_id'])

    by_θ = trace_df.sort_values('θ_mean')
    θ_top = by_θ.iloc[-10:]
    θ_bottom = by_θ.iloc[:10]

    plot_params(θ_top.θ_mean.values, θ_top[['θ_low', 'θ_high']].values.T,
                θ_top.name.values,
                ax=θ_top_ax)
    θ_top_ax.vlines(μ_θ, -1, 10,
                    'k', '--',
                    label=(r"$\mu_{\theta}$" if μ_θ != 0 else "League average"))
    plt.setp(θ_top_ax.get_xticklabels(), visible=False)
    θ_top_ax.set_ylabel("Player")
    θ_top_ax.legend(loc=2)
    θ_top_ax.set_title("Top ten")

    plot_params(θ_bottom.θ_mean.values, θ_bottom[['θ_low', 'θ_high']].values.T,
                θ_bottom.name.values,
                ax=θ_bottom_ax)
    θ_bottom_ax.vlines(μ_θ, -1, 10,
                       'k', '--',
                       label=(r"$\mu_{\theta}$" if μ_θ != 0 else "League average"))
    θ_bottom_ax.set_xlabel(r"$\theta$")
    θ_bottom_ax.set_ylabel("Player")
    θ_bottom_ax.set_title("Bottom ten")

    by_b = trace_df.sort_values('b_mean')
    b_top = by_b.iloc[-10:]
    b_bottom = by_b.iloc[:10]

    plot_params(b_top.b_mean.values, b_top[['b_low', 'b_high']].values.T,
                b_top.name.values,
                ax=b_top_ax)
    b_top_ax.vlines(0, -1, 10,
                    'k', '--',
                    label="League average")
    plt.setp(b_top_ax.get_xticklabels(), visible=False)
    b_top_ax.legend(loc=2)
    b_top_ax.set_title("Top ten")

    plot_params(b_bottom.b_mean.values, b_bottom[['b_low', 'b_high']].values.T,
                b_bottom.name.values,
                ax=b_bottom_ax)
    b_bottom_ax.vlines(0, -1, 10,
                       'k', '--')
    b_bottom_ax.set_xlabel(r"$b$")
    b_bottom_ax.set_title("Bottom ten")

    fig.tight_layout()
```

`top_10_plot(rasch_df, μ_θ=μ_θ_mean)`

We focus first on \(\theta_i\). Interestingly, the top ten players for the Rasch model include many more top-tier stars than those for the logistic-normal model, including John Wall, Russell Westbrook, LeBron James, and Dwight Howard. Turning to \(b\), it is interesting that while the top and bottom ten players contain many recognizable names (LaMarcus Aldridge, Harrison Barnes, Kawhi Leonard, and Ricky Rubio), the only truly top-tier player present is Anthony Davis.

As basketball fans know, many factors other than the players involved influence foul calls. Very often, sufficiently close NBA games end with intentional fouls, as the losing team attempts to stop the clock and force another offensive possession. Therefore, we expect to see an increase in the foul call probability as the game nears its conclusion.

```
n_sec = 121
sec = (df.seconds_left
         .round(0)
         .values
         .astype(np.int64))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
(df.groupby(sec)
   .foul_called
   .mean()
   .plot(c='k', label="Observed foul call rate", ax=ax));
ax.set_xticks(np.linspace(0, 120, 5));
ax.invert_xaxis();
ax.set_xlabel("Seconds remaining in game");
ax.yaxis.set_major_formatter(pct_formatter);
ax.set_ylabel("Probability foul is called");
ax.legend(loc=2);
```

The plot above confirms this expectation, which we can use to improve our latent skill model. If \(t \in \{0, 1, \ldots, 120\}\) is the number of seconds remaining in the game, we model the latent contribution of \(t\) to the log odds that a foul is called with a Gaussian random walk,

\[ \begin{align*} \lambda_0 & \sim N(0, 100) \\ \lambda_t & \sim N(\lambda_{t - 1}, \tau^{-1}_{\lambda}) \\ \tau_{\lambda} & \sim \textrm{Exp}(10^{-4}). \end{align*} \]
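To get a feel for the shapes this prior can express, we can simulate a few walks directly with numpy (the precision values below are purely illustrative; the model learns \(\tau_{\lambda}\) from the data):

```python
import numpy as np

# simulate Gaussian random walk draws λ_0, ..., λ_120 for a few step precisions
n_sec = 121
rng = np.random.RandomState(42)

walks = {}
for τ_λ in [10., 100., 1000.]:
    λ0 = rng.normal(0., 10.)                         # λ_0 ~ N(0, 100)
    steps = rng.normal(0., 1. / np.sqrt(τ_λ),        # λ_t ~ N(λ_{t-1}, 1/τ_λ)
                       size=n_sec - 1)
    walks[τ_λ] = np.concatenate([[λ0], λ0 + np.cumsum(steps)])
```

Smaller precisions produce wigglier curves; larger precisions produce smoother, nearly flat ones. The posterior concentrates on the smoothness that best matches the observed foul-rate curve.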

This prior allows us to flexibly model the shape of the curve shown above. If \(t(k)\) is the number of seconds remaining during the \(k\)-th play, we incorporate \(\lambda_{t(k)}\) into our model with

\[\eta_k = \lambda_{t(k)} + \theta_{i(k)} - b_{j(k)}.\]

This model is not identified until we constrain the mean of \(\theta\) to be zero, for reasons similar to those discussed above for the Rasch model.

```
with pm.Model() as time_model:
    τ_λ = pm.Exponential('τ_λ', 1e-4)
    λ = pm.GaussianRandomWalk('λ', tau=τ_λ,
                              init=pm.Normal.dist(0., 10.),
                              shape=n_sec)

    Δ_θ = pm.Normal('Δ_θ', 0., 1., shape=n_players)
    σ_θ = pm.HalfCauchy('σ_θ', 2.5)
    θ = pm.Deterministic('θ', Δ_θ * σ_θ)

    Δ_b = pm.Normal('Δ_b', 0., 1., shape=n_players)
    σ_b = pm.HalfCauchy('σ_b', 2.5)
    b = pm.Deterministic('b', Δ_b * σ_b)

    η = λ[sec] + θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

We now sample from the model.

`time_trace = sample(time_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1.0533e+05: 15%|█▌ | 30331/200000 [00:32<02:52, 985.58it/s] Median ELBO converged.
Finished [100%]: Average ELBO = -3,104.5
100%|██████████| 4000/4000 [04:06<00:00, 16.24it/s]
```

The energy plot for this model is worse than the previous ones, but not too bad.

`energy_plot(time_trace)`

`waic_df = waic_df.append(get_waic_df(time_model, time_trace, "Time"))`

`waic_plot(waic_df)`

We see that the time remaining model represents an appreciable improvement over the Rasch model in terms of WAIC.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, time_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

The binned residuals for this model also look quite good, with very few samples appreciably exceeding a 1% difference.

We now compare the distributions of \(\theta_i\) and \(b_j\) for this model with those for the Rasch model.

`time_df = to_param_df(team_player_map, time_trace, ['θ', 'b'])`

`plot_latent_distributions(time_df.θ_mean, time_df.b_mean)`

The effect of constraining the mean of \(\theta\) to be zero is immediately apparent. Also, the variation in \(\theta\) remains lower than the variation in \(b\) in this model. We also see that the top and bottom ten players by \(\theta\)- and \(b\)-value remain largely unchanged from the Rasch model.

`top_10_plot(time_df)`

Basketball fans may find it amusing that under this model, Ricky Rubio is no longer the worst player in terms of \(b\).

While this model has not done much to change the rank-ordering of the most- and least-skilled players, it does enable us to plot per-player foul probabilities over time, as below.

```
fig, (θ_ax, b_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))

(df.groupby(sec)
   .foul_called
   .mean()
   .plot(c='k', alpha=0.5,
         label="Observed foul call rate",
         ax=θ_ax));

plot_sec = np.arange(n_sec)
θ_ax.plot(plot_sec, expit(time_trace['λ'].mean(axis=0)),
          c='k',
          label="Average player");

θ_best_id = time_df.loc[time_df.θ_mean.idxmax()].player_id
θ_ax.plot(plot_sec, expit(time_trace['θ'][:, θ_best_id].mean(axis=0)
                              + time_trace['λ'].mean(axis=0)),
          label=player_map[θ_best_id]);

θ_worst_id = time_df.loc[time_df.θ_mean.idxmin()].player_id
θ_ax.plot(plot_sec, expit(time_trace['θ'][:, θ_worst_id].mean(axis=0)
                              + time_trace['λ'].mean(axis=0)),
          label=player_map[θ_worst_id]);

θ_ax.set_xticks(np.linspace(0, 120, 5));
θ_ax.invert_xaxis();
θ_ax.set_xlabel("Seconds remaining in game");
θ_ax.yaxis.set_major_formatter(pct_formatter);
θ_ax.set_ylabel("Probability foul is called\nagainst average opposing player");
θ_ax.legend(loc=2);
θ_ax.set_title(r"Disadvantaged player ($\theta$)");

(df.groupby(sec)
   .foul_called
   .mean()
   .plot(c='k', alpha=0.5,
         label="Observed foul call rate",
         ax=b_ax));

b_ax.plot(plot_sec, expit(time_trace['λ'].mean(axis=0)),
          c='k',
          label="Average player");

b_best_id = time_df.loc[time_df.b_mean.idxmax()].player_id
b_ax.plot(plot_sec, expit(-time_trace['b'][:, b_best_id].mean(axis=0)
                              + time_trace['λ'].mean(axis=0)),
          label=player_map[b_best_id]);

b_worst_id = time_df.loc[time_df.b_mean.idxmin()].player_id
b_ax.plot(plot_sec, expit(-time_trace['b'][:, b_worst_id].mean(axis=0)
                              + time_trace['λ'].mean(axis=0)),
          label=player_map[b_worst_id]);

b_ax.set_xticks(np.linspace(0, 120, 5));
b_ax.invert_xaxis();
b_ax.set_xlabel("Seconds remaining in game");
b_ax.legend(loc=2);
b_ax.set_title(r"Committing player ($b$)");
```

Here we have plotted the probability of a foul call while being opposed by an average player (for both \(\theta\) and \(b\)), along with the probability curves for the players with the highest and lowest \(\theta\) and \(b\), respectively. While these plots are quite interesting, one weakness of our model is that the difference between each player’s curve and the league average is constant over time. It would be an interesting and useful extension of this model to allow the player offsets to vary over time. Additionally, it would be interesting to understand the influence of the score on the foul call rate as the game nears its end. It seems quite likely that, in close games, the winning team is much less likely to commit fouls while the losing team is much more likely to commit intentional fouls.

We now plot the per-player values of \(\theta_i\) and \(b_j\) under this model.

```
grid = sns.FacetGrid(time_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_params_helper,
         'θ_mean', 'θ_low', 'θ_high', 'name',
         league_mean=0.,
         league_mean_name="League average");
grid.set_axis_labels(r"$\theta$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

```
grid = sns.FacetGrid(time_df, col='team', col_wrap=2,
                     sharey=False,
                     size=4, aspect=1.5)
grid.map(plot_params_helper,
         'b_mean', 'b_low', 'b_high', 'name',
         league_mean=0.,
         league_mean_name="League average");
grid.set_axis_labels(r"$b$", "Player");
grid.fig.tight_layout();
grid.add_legend();
```

Our final model uses salary as a proxy for “star power” to explore its influence on foul calls. We also (somewhat naively) impute missing salaries to the (log) league average.

```
std_log_salary = (salary_df.loc[np.arange(n_players)]
                           .std_log_salary
                           .fillna(0)
                           .values)
```
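For reference, `salary_df` was assembled earlier in the post; its `std_log_salary` column is (roughly) each player’s log salary standardized across the league, along these lines (a hedged sketch with hypothetical salary figures; the column names are assumptions about the earlier construction):

```python
import numpy as np
import pandas as pd

salaries = pd.Series([1.5e6, 5.2e6, 26.5e6, 9.7e6])  # hypothetical player salaries

# standardize log salary: zero mean, unit standard deviation across the league
log_salary = np.log(salaries)
std_log_salary = (log_salary - log_salary.mean()) / log_salary.std()
```

Because the column is standardized, filling missing values with zero imputes the (log) league-average salary, which is exactly what the `fillna(0)` above does.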

With \(s_i\) denoting the \(i\)-th player’s standardized log salary, our model becomes

\[ \begin{align*} \theta_i & = \theta_{0, i} + \delta_{\theta} \cdot s_i \\ b_j & = b_{0, j} + \delta_b \cdot s_j \\ \eta_k & = \lambda_{t(k)} + \theta_{i(k)} - b_{j(k)}. \end{align*} \]

In this model, each player’s \(\theta_i\) and \(b_j\) parameters are linear functions of their standardized log salary, with (hierarchical) varying intercepts. The varying intercepts \(\theta_{0, i}\) and \(b_{0, j}\) are endowed with the same hierarchical normal priors as \(\theta_i\) and \(b_j\) had in the previous model. We place normal priors, \(\delta_{\theta} \sim N(0, 100)\) and \(\delta_b \sim N(0, 100)\), on the salary coefficients.

```
with pm.Model() as salary_model:
    τ_λ = pm.Exponential('τ_λ', 1e-4)
    λ = pm.GaussianRandomWalk('λ', tau=τ_λ,
                              init=pm.Normal.dist(0., 10.),
                              shape=n_sec)

    Δ_θ0 = pm.Normal('Δ_θ0', 0., 1., shape=n_players)
    σ_θ0 = pm.HalfCauchy('σ_θ0', 2.5)
    θ0 = pm.Deterministic('θ0', Δ_θ0 * σ_θ0)
    δ_θ = pm.Normal('δ_θ', 0., 10.)
    θ = pm.Deterministic('θ', θ0 + δ_θ * std_log_salary)

    Δ_b0 = pm.Normal('Δ_b0', 0., 1., shape=n_players)
    σ_b0 = pm.HalfCauchy('σ_b0', 2.5)
    b0 = pm.Deterministic('b0', Δ_b0 * σ_b0)
    δ_b = pm.Normal('δ_b', 0., 10.)
    b = pm.Deterministic('b', b0 + δ_b * std_log_salary)

    η = λ[sec] + θ[disadvantaged_id] - b[committing_id]
    p = pm.Deterministic('p', pm.math.sigmoid(η))
    y = pm.Bernoulli('y_obs', p, observed=foul_called)
```

`salary_trace = sample(salary_model, N_TUNE, N_SAMPLES, SEED)`

```
Auto-assigning NUTS sampler...
Initializing NUTS using advi...
Average ELBO = -1.0244e+05: 15%|█▌ | 30947/200000 [00:31<02:52, 981.37it/s] Median ELBO converged.
Finished [100%]: Average ELBO = -2,995.3
100%|██████████| 4000/4000 [04:44<00:00, 6.87it/s]
```

The energy plot for this model looks a bit worse than that for the time remaining model.

`energy_plot(salary_trace)`

The salary model also appears to be a slight improvement over the time remaining model in terms of WAIC.

`waic_df = waic_df.append(get_waic_df(salary_model, salary_trace, "Salary"))`

`waic_plot(waic_df)`

The binned residuals continue to look good for this model.

```
bin_obs, bin_p, bin_counts = binned_residuals(foul_called, salary_trace['p'])
binned_residual_plot(bin_obs, bin_p, bin_counts)
```

Based on the posterior distributions of \(\delta_{\theta}\) and \(\delta_b\), we expect to see a fairly strong relationship between a player’s (standardized log) salary and their latent skill parameters.

```
pm.plot_posterior(salary_trace, ['δ_θ', 'δ_b'],
                  lw=0., alpha=0.75);
```

The following plots confirm this relationship.

`salary_df_ = to_param_df(team_player_map, salary_trace, ['θ', 'θ0', 'b', 'b0'])`

```
fig, (θ_ax, b_ax) = plt.subplots(ncols=2, sharex=True, figsize=(16, 6))
salary = (salary_df.loc[np.arange(n_players)]
                   .salary
                   .fillna(salary_df.salary.mean())
                   .values)

θ_ax.scatter(salary[salary_df_.player_id], salary_df_.θ_mean,
             alpha=0.75);
θ_ax.xaxis.set_major_formatter(million_dollars_formatter);
θ_ax.set_xlabel("Salary");
θ_ax.set_ylabel(r"$\theta$");

b_ax.scatter(salary[salary_df_.player_id], salary_df_.b_mean,
             alpha=0.75);
b_ax.xaxis.set_major_formatter(million_dollars_formatter);
b_ax.set_xlabel("Salary");
b_ax.set_ylabel(r"$b$");
```

It is important to note here that these relationships are descriptive, not causal. Our original intent was to use salary as a proxy for the difficult-to-quantify notion of “star power.” These plots suggest that the probability of getting a foul called when disadvantaged and not called when committing are both positively related to salary, and therefore (a bit more dubiously) to star power.

Importantly, \(\theta_i\) and \(b_j\) should no longer be interpreted directly as measuring latent skill, which is presumably intrinsic to a player and not directly dependent on their salary. In fact, it seems at least plausible that NBA scouts, front offices, players, and agents are somewhat able to appreciate these latent skills, place a value on them, and thereby price them into contracts. It would be interesting future work to refine this model to give an econometric answer to this question.

Since we shouldn’t interpret \(\theta_i\) and \(b_j\) as measures of latent skill in this model, we will not plot their per-player distributions.

We set out to explore the relationship between players involved in a play and the probability that a foul is called, along with other factors related to that probability. Through a series of progressively more complex Bayesian item-response models, we have seen that

- foul call probability does vary appreciably across both disadvantaged and committing player,
- there is more variation in the committing player’s latent skill at avoiding a foul call than in the disadvantaged player’s latent skill at drawing one,
- the amount of time remaining in the game is strongly related to the probability of a foul call, and
- there is a positive relationship between player salary and the probability that a foul is called when they are disadvantaged and not called when they are committing. With a bit of a leap, we can say that the probability a foul is called is at least loosely related to the “star power” of the players involved.

In this post we have only scratched the surface of Bayesian item-response theory. For a more in-depth treatment, consult *Bayesian Item Response Modeling*.

This post is available as a Jupyter notebook here.

My last post showed how to use Dirichlet processes and `pymc3` to perform Bayesian nonparametric density estimation. This post expands on the previous one, illustrating dependent density regression with `pymc3`.

Just as Dirichlet process mixtures can be thought of as infinite mixture models that select the number of active components as part of inference, dependent density regression can be thought of as an infinite mixture of experts that selects the active experts as part of inference. Their flexibility and modularity make them powerful tools for performing nonparametric Bayesian data analysis.

```
%matplotlib inline
from IPython.display import HTML
```

```
from matplotlib import animation as ani, pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import seaborn as sns
from theano import shared, tensor as tt
```

```
plt.rc('animation', writer='avconv')
blue, *_ = sns.color_palette()
```

```
SEED = 972915 # from random.org; for reproducibility
np.random.seed(SEED)
```

Throughout this post, we will use the LIDAR data set from Larry Wasserman’s excellent book, *All of Nonparametric Statistics*. We standardize the data set to improve the rate of convergence of our samples.

```
DATA_URI = 'http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat'
def standardize(x):
    return (x - x.mean()) / x.std()

df = (pd.read_csv(DATA_URI, sep=' *', engine='python')
        .assign(std_range=lambda df: standardize(df.range),
                std_logratio=lambda df: standardize(df.logratio)))
```

`df.head()`

| | range | logratio | std_logratio | std_range |
|---|---|---|---|---|
| 0 | 390 | -0.050356 | 0.852467 | -1.717725 |
| 1 | 391 | -0.060097 | 0.817981 | -1.707299 |
| 2 | 393 | -0.041901 | 0.882398 | -1.686447 |
| 3 | 394 | -0.050985 | 0.850240 | -1.676020 |
| 4 | 396 | -0.059913 | 0.818631 | -1.655168 |

We plot the LIDAR data below.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df.std_range, df.std_logratio,
           c=blue);
ax.set_xticklabels([]);
ax.set_xlabel("Standardized range");
ax.set_yticklabels([]);
ax.set_ylabel("Standardized log ratio");
```

This data set has two interesting properties that make it useful for illustrating dependent density regression.

- The relationship between range and log ratio is nonlinear, but has locally linear components.
- The observation noise is heteroskedastic; that is, the magnitude of the variance varies with the range.

The intuitive idea behind dependent density regression is to reduce the problem to many (related) density estimates, conditioned on fixed values of the predictors. The following animation illustrates this intuition.

```
fig, (scatter_ax, hist_ax) = plt.subplots(ncols=2, figsize=(16, 6))

scatter_ax.scatter(df.std_range, df.std_logratio,
                   c=blue, zorder=2);
scatter_ax.set_xticklabels([]);
scatter_ax.set_xlabel("Standardized range");
scatter_ax.set_yticklabels([]);
scatter_ax.set_ylabel("Standardized log ratio");

bins = np.linspace(df.std_range.min(), df.std_range.max(), 25)
hist_ax.hist(df.std_logratio, bins=bins,
             color='k', lw=0, alpha=0.25,
             label="All data");
hist_ax.set_xticklabels([]);
hist_ax.set_xlabel("Standardized log ratio");
hist_ax.set_yticklabels([]);
hist_ax.set_ylabel("Frequency");
hist_ax.legend(loc=2);

endpoints = np.linspace(1.05 * df.std_range.min(), 1.05 * df.std_range.max(), 15)
frame_artists = []

for low, high in zip(endpoints[:-1], endpoints[2:]):
    interval = scatter_ax.axvspan(low, high,
                                  color='k', alpha=0.5, lw=0, zorder=1);
    *_, bars = hist_ax.hist(df[df.std_range.between(low, high)].std_logratio,
                            bins=bins,
                            color='k', lw=0, alpha=0.5);
    frame_artists.append((interval,) + tuple(bars))

animation = ani.ArtistAnimation(fig, frame_artists,
                                interval=500, repeat_delay=3000, blit=True)
plt.close();  # prevent the intermediate figure from showing
```

`HTML(animation.to_html5_video())`

As we slice the data with a window sliding along the x-axis in the left plot, the empirical distribution of the y-values of the points in the window varies in the right plot. An important aspect of this approach is that the density estimates that correspond to close values of the predictor are similar.

In the previous post, we saw that a Dirichlet process estimates a probability density as a mixture model with infinitely many components. In the case of normal component distributions,

\[ y \sim \sum_{i = 1}^{\infty} w_i \cdot N(\mu_i, \tau_i^{-1}), \]

where the mixture weights, \(w_1, w_2, \ldots\), are generated by a stick-breaking process.

Dependent density regression generalizes this representation of the Dirichlet process mixture model by allowing the mixture weights and component means to vary conditioned on the value of the predictor, \(x\). That is,

\[ y\ |\ x \sim \sum_{i = 1}^{\infty} w_i\ |\ x \cdot N(\mu_i\ |\ x, \tau_i^{-1}). \]

In this post, we will follow Chapter 23 of *Bayesian Data Analysis* and use a probit stick-breaking process to determine the conditional mixture weights, \(w_i\ |\ x\). The probit stick-breaking process starts by defining

\[ v_i\ |\ x = \Phi(\alpha_i + \beta_i x), \]

where \(\Phi\) is the cumulative distribution function of the standard normal distribution. We then obtain \(w_i\ |\ x\) by applying the stick breaking process to \(v_i\ |\ x\). That is,

\[ w_i\ |\ x = v_i\ |\ x \cdot \prod_{j = 1}^{i - 1} (1 - v_j\ |\ x). \]
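Before expressing this in theano, a quick numeric check of the construction for a single value of \(x\) may be helpful (numpy only; the \(\alpha_i\) and \(\beta_i\) values here are arbitrary illustrations):

```python
import numpy as np
from scipy.special import ndtr  # Φ, the standard normal CDF

# arbitrary illustrative values for a four-component truncation
alpha = np.array([0.5, -0.2, 0.1, -1.0])
beta = np.array([1.0, 0.3, -0.7, 0.2])
x = 0.25

v = ndtr(alpha + beta * x)                              # v_i | x = Φ(α_i + β_i x)
w = v * np.concatenate([[1.], np.cumprod(1 - v)[:-1]])  # stick-breaking weights
```

Note that the truncated weights sum to less than one; the leftover mass, \(\prod_i (1 - v_i)\), belongs to the infinitely many components beyond the truncation.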

For the LIDAR data set, we use independent normal priors \(\alpha_i \sim N(0, 5^2)\) and \(\beta_i \sim N(0, 5^2)\). We now express this model for the conditional mixture weights using `pymc3`.

```
def norm_cdf(z):
    return 0.5 * (1 + tt.erf(z / np.sqrt(2)))

def stick_breaking(v):
    return v * tt.concatenate([tt.ones_like(v[:, :1]),
                               tt.extra_ops.cumprod(1 - v, axis=1)[:, :-1]],
                              axis=1)
```

```
N, _ = df.shape
K = 20

std_range = df.std_range.values[:, np.newaxis]
std_logratio = df.std_logratio.values[:, np.newaxis]

x_lidar = shared(std_range, broadcastable=(False, True))

with pm.Model() as model:
    alpha = pm.Normal('alpha', 0., 5., shape=K)
    beta = pm.Normal('beta', 0., 5., shape=K)
    v = norm_cdf(alpha + beta * x_lidar)
    w = pm.Deterministic('w', stick_breaking(v))
```

We have defined `x_lidar` as a `theano` `shared` variable in order to use `pymc3`’s posterior prediction capabilities later.

While the dependent density regression model theoretically has infinitely many components, we must truncate the model to finitely many components (in this case, twenty) in order to express it using `pymc3`. After sampling from the model, we will verify that truncation did not unduly influence our results.

Since the LIDAR data seems to have several linear components, we use the linear models

\[ \begin{align*} \mu_i\ |\ x & \sim \gamma_i + \delta_i x \\ \gamma_i & \sim N(0, 10^2) \\ \delta_i & \sim N(0, 10^2) \end{align*} \]

for the conditional component means.

```
with model:
    gamma = pm.Normal('gamma', 0., 10., shape=K)
    delta = pm.Normal('delta', 0., 10., shape=K)
    mu = pm.Deterministic('mu', gamma + delta * x_lidar)
```

Finally, we place the prior \(\tau_i \sim \textrm{Gamma}(1, 1)\) on the component precisions.

```
with model:
    tau = pm.Gamma('tau', 1., 1., shape=K)
    obs = pm.NormalMixture('obs', w, mu, tau=tau, observed=std_logratio)
```

We now draw samples from the dependent density regression model.

```
SAMPLES = 20000
BURN = 10000
THIN = 10

with model:
    step = pm.Metropolis()
    trace_ = pm.sample(SAMPLES, step, random_seed=SEED)

trace = trace_[BURN::THIN]
```

`100%|██████████| 20000/20000 [01:30<00:00, 204.48it/s]`

To verify that truncation did not unduly influence our results, we plot the largest posterior expected mixture weight for each component. (In this model, each point has a mixture weight for each component, so we plot the maximum mixture weight for each component across all data points in order to judge if the component exerts any influence on the posterior.)

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.bar(np.arange(K) + 1 - 0.4,
       trace['w'].mean(axis=0).max(axis=0));

ax.set_xlim(1 - 0.5, K + 0.5);
ax.set_xlabel('Mixture component');
ax.set_ylabel('Largest posterior expected\nmixture weight');
```

Since only three mixture components have appreciable posterior expected weight for any data point, we can be fairly certain that truncation did not unduly influence our results. (If most components had appreciable posterior expected weight, truncation may have influenced the results, and we would have increased the number of components and sampled again.)

Visually, it is reasonable that the LIDAR data has three linear components, so these posterior expected weights seem to have identified the structure of the data well. We now sample from the posterior predictive distribution to get a better understanding of the model’s performance.

```
PP_SAMPLES = 5000

lidar_pp_x = np.linspace(std_range.min() - 0.05, std_range.max() + 0.05, 100)
x_lidar.set_value(lidar_pp_x[:, np.newaxis])

with model:
    pp_trace = pm.sample_ppc(trace, PP_SAMPLES, random_seed=SEED)
```

`100%|██████████| 5000/5000 [01:18<00:00, 66.54it/s]`

Below we plot the posterior expected value and the 95% posterior credible interval.

```
fig, ax = plt.subplots()

ax.scatter(df.std_range, df.std_logratio,
           c=blue, zorder=10,
           label=None);

low, high = np.percentile(pp_trace['obs'], [2.5, 97.5], axis=0)
ax.fill_between(lidar_pp_x, low, high,
                color='k', alpha=0.35, zorder=5,
                label='95% posterior credible interval');
ax.plot(lidar_pp_x, pp_trace['obs'].mean(axis=0),
        c='k', zorder=6,
        label='Posterior expected value');

ax.set_xticklabels([]);
ax.set_xlabel('Standardized range');
ax.set_yticklabels([]);
ax.set_ylabel('Standardized log ratio');
ax.legend(loc=1);
ax.set_title('LIDAR Data');
```

The model has fit the linear components of the data well, and also accommodated its heteroskedasticity. This flexibility, along with the ability to modularly specify the conditional mixture weights and conditional component densities, makes dependent density regression an extremely useful nonparametric Bayesian model.

To learn more about dependent density regression and related models, consult *Bayesian Data Analysis*, *Bayesian Nonparametric Data Analysis*, or *Bayesian Nonparametrics*.

This post is available as a Jupyter notebook here.

I have been intrigued by the flexibility of nonparametric statistics for many years. As I have developed an understanding and appreciation of Bayesian modeling, both personally and professionally, over the last two or three years, I have naturally developed an interest in Bayesian nonparametric statistics. I am pleased to begin a planned series of posts on Bayesian nonparametrics with this post on Dirichlet process mixtures for density estimation.

The Dirichlet process is a flexible probability distribution over the space of distributions. Most generally, a probability distribution, \(P\), on a set \(\Omega\) is a measure that assigns measure one to the entire space (\(P(\Omega) = 1\)). A Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is a measure that has the property that, for every finite disjoint partition \(S_1, \ldots, S_n\) of \(\Omega\),

\[(P(S_1), \ldots, P(S_n)) \sim \textrm{Dir}(\alpha P_0(S_1), \ldots, \alpha P_0(S_n)).\]

Here \(P_0\) is the base probability measure on the space \(\Omega\). The precision parameter \(\alpha > 0\) controls how close samples from the Dirichlet process are to the base measure, \(P_0\). As \(\alpha \to \infty\), samples from the Dirichlet process approach the base measure \(P_0\).

Dirichlet processes have several properties that make them quite suitable for MCMC simulation.

The posterior given i.i.d. observations \(\omega_1, \ldots, \omega_n\) from a Dirichlet process \(P \sim \textrm{DP}(\alpha, P_0)\) is also a Dirichlet process with

\[P\ |\ \omega_1, \ldots, \omega_n \sim \textrm{DP}\left(\alpha + n, \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}\right),\]

where \(\delta\) is the Dirac delta measure

\[\begin{align*} \delta_{\omega}(S) & = \begin{cases} 1 & \textrm{if } \omega \in S \\ 0 & \textrm{if } \omega \not \in S \end{cases} \end{align*}.\]

The posterior predictive distribution of a new observation is a compromise between the base measure and the observations,

\[\omega\ |\ \omega_1, \ldots, \omega_n \sim \frac{\alpha}{\alpha + n} P_0 + \frac{1}{\alpha + n} \sum_{i = 1}^n \delta_{\omega_i}.\]

We see that the prior precision \(\alpha\) can naturally be interpreted as a prior sample size. The form of this posterior predictive distribution also lends itself to Gibbs sampling.
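This posterior predictive is easy to sample from directly: with probability \(\frac{\alpha}{\alpha + n}\) draw a fresh value from \(P_0\), and otherwise resample one of the existing observations uniformly (the Pólya urn scheme). A minimal sketch:

```python
import numpy as np

def dp_posterior_predictive(obs, alpha, base_rvs, size, seed=None):
    """Sample from the DP posterior predictive via the Pólya urn scheme."""
    rng = np.random.RandomState(seed)
    n = len(obs)
    samples = np.empty(size)
    for k in range(size):
        if rng.uniform() < alpha / (alpha + n):
            samples[k] = base_rvs(rng)    # fresh draw from the base measure P_0
        else:
            samples[k] = rng.choice(obs)  # resample an existing observation
    return samples

obs = np.array([-0.3, 1.2, 0.7])
samples = dp_posterior_predictive(obs, alpha=2.,
                                  base_rvs=lambda rng: rng.normal(),
                                  size=1000, seed=5)
```

With \(n = 3\) observations and \(\alpha = 2\), roughly \(\frac{3}{5}\) of the draws repeat an observed value, illustrating how \(\alpha\) acts as a prior sample size.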

Samples, \(P \sim \textrm{DP}(\alpha, P_0)\), from a Dirichlet process are discrete with probability one. That is, there are elements \(\omega_1, \omega_2, \ldots\) in \(\Omega\) and weights \(w_1, w_2, \ldots\) with \(\sum_{i = 1}^{\infty} w_i = 1\) such that

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i}.\]

- The stick-breaking process gives an explicit construction of the weights \(w_i\) and samples \(\omega_i\) above that is straightforward to sample from. If \(\beta_1, \beta_2, \ldots \sim \textrm{Beta}(1, \alpha)\), then \(w_i = \beta_i \prod_{j = 1}^{i - 1} (1 - \beta_j)\). The relationship between this representation and stick breaking may be illustrated as follows:
- Start with a stick of length one.
- Break the stick into two portions, the first of proportion \(w_1 = \beta_1\) and the second of proportion \(1 - w_1\).
- Further break the second portion into two portions, the first of proportion \(\beta_2\) and the second of proportion \(1 - \beta_2\). The length of the first portion of this stick is \(\beta_2 (1 - \beta_1)\); the length of the second portion is \((1 - \beta_1) (1 - \beta_2)\).
- Continue breaking the second portion from the previous break in this manner forever. If \(\omega_1, \omega_2, \ldots \sim P_0\), then

\[P = \sum_{i = 1}^\infty w_i \delta_{\omega_i} \sim \textrm{DP}(\alpha, P_0).\]

We can use the stick-breaking process above to easily sample from a Dirichlet process in Python. For this example, \(\alpha = 2\) and the base distribution is \(N(0, 1)\).

`%matplotlib inline`

`from __future__ import division`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from statsmodels.datasets import get_rdataset
from theano import tensor as T
```

`Couldn't import dot_parser, loading of dot files will not be possible.`

`blue = sns.color_palette()[0]`

`np.random.seed(462233) # from random.org`

```
N = 20
K = 30
alpha = 2.
P0 = sp.stats.norm
```

We draw and plot samples from the stick-breaking process.

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
x_plot = np.linspace(-3, 3, 200)
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))

ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
        label='DP sample CDFs');
ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');

ax.set_title(r'$\alpha = {}$'.format(alpha));
ax.legend(loc=2);
```

As stated above, as \(\alpha \to \infty\), samples from the Dirichlet process converge to the base distribution.

```
fig, (l_ax, r_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
K = 50
alpha = 10.
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
l_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
          label='DP sample CDFs');
l_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
l_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');
l_ax.set_title(r'$\alpha = {}$'.format(alpha));
l_ax.legend(loc=2);
K = 200
alpha = 50.
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
omega = P0.rvs(size=(N, K))
sample_cdfs = (w[..., np.newaxis] * np.less.outer(omega, x_plot)).sum(axis=1)
r_ax.plot(x_plot, sample_cdfs[0], c='gray', alpha=0.75,
          label='DP sample CDFs');
r_ax.plot(x_plot, sample_cdfs[1:].T, c='gray', alpha=0.75);
r_ax.plot(x_plot, P0.cdf(x_plot), c='k', label='Base CDF');
r_ax.set_title(r'$\alpha = {}$'.format(alpha));
r_ax.legend(loc=2);
```

For the task of density estimation, the (almost sure) discreteness of samples from the Dirichlet process is a significant drawback. This problem can be solved with another level of indirection by using Dirichlet process mixtures for density estimation. A Dirichlet process mixture uses component densities from a parametric family \(\mathcal{F} = \{f_{\theta}\ |\ \theta \in \Theta\}\) and represents the mixture weights as a Dirichlet process. If \(P_0\) is a probability measure on the parameter space \(\Theta\), a Dirichlet process mixture is the hierarchical model

\[ \begin{align*} x_i\ |\ \theta_i & \sim f_{\theta_i} \\ \theta_1, \ldots, \theta_n & \sim P \\ P & \sim \textrm{DP}(\alpha, P_0). \end{align*} \]

To illustrate this model, we simulate draws from a Dirichlet process mixture with \(\alpha = 2\), \(\theta \sim N(0, 1)\), \(x\ |\ \theta \sim N(\theta, (0.3)^2)\).

```
N = 5
K = 30
alpha = 2
P0 = sp.stats.norm
f = lambda x, theta: sp.stats.norm.pdf(x, theta, 0.3)
```

```
beta = sp.stats.beta.rvs(1, alpha, size=(N, K))
w = np.empty_like(beta)
w[:, 0] = beta[:, 0]
w[:, 1:] = beta[:, 1:] * (1 - beta[:, :-1]).cumprod(axis=1)
theta = P0.rvs(size=(N, K))
dpm_pdf_components = f(x_plot[np.newaxis, np.newaxis, :], theta[..., np.newaxis])
dpm_pdfs = (w[..., np.newaxis] * dpm_pdf_components).sum(axis=1)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x_plot, dpm_pdfs.T, c='gray');
ax.set_yticklabels([]);
```

We now focus on a single mixture and decompose it into its individual (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))
ix = 1
ax.plot(x_plot, dpm_pdfs[ix], c='k', label='Density');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix, 0],
        '--', c='k', label='Mixture components (weighted)');
ax.plot(x_plot, (w[..., np.newaxis] * dpm_pdf_components)[ix].T,
        '--', c='k');
ax.set_yticklabels([]);
ax.legend(loc=1);
```

Sampling from these stochastic processes is fun, but these ideas become truly useful when we fit them to data. The discreteness of samples and the stick-breaking representation of the Dirichlet process lend themselves nicely to Markov chain Monte Carlo simulation of posterior distributions. We will perform this sampling using `pymc3`.

Our first example uses a Dirichlet process mixture to estimate the density of waiting times between eruptions of the Old Faithful geyser in Yellowstone National Park.

`old_faithful_df = get_rdataset('faithful', cache=True).data[['waiting']]`

For convenience in specifying the prior, we standardize the waiting time between eruptions.

`old_faithful_df['std_waiting'] = (old_faithful_df.waiting - old_faithful_df.waiting.mean()) / old_faithful_df.waiting.std()`

`old_faithful_df.head()`

| | waiting | std_waiting |
|---|---|---|
| 0 | 79 | 0.596025 |
| 1 | 54 | -1.242890 |
| 2 | 74 | 0.228242 |
| 3 | 62 | -0.654437 |
| 4 | 85 | 1.037364 |

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting, bins=n_bins, color=blue, lw=0, alpha=0.5);
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_ylabel('Number of eruptions');
```

Observant readers will have noted that we have not been continuing the stick-breaking process indefinitely as indicated by its definition, but rather have been truncating this process after a finite number of breaks. Obviously, when computing with Dirichlet processes, it is necessary to only store a finite number of its point masses and weights in memory. This restriction is not terribly onerous, since with a finite number of observations, it seems quite likely that the number of mixture components that contribute non-negligible mass to the mixture grows more slowly than the number of samples. This intuition can be formalized to show that the (expected) number of components that contribute non-negligible mass to the mixture approaches \(\alpha \log N\), where \(N\) is the sample size.

There are various clever Gibbs sampling techniques for Dirichlet processes that allow the number of components stored to grow as needed. Stochastic memoization is another powerful technique for simulating Dirichlet processes while only storing finitely many components in memory. In this introductory post, we take the much less sophisticated approach of simply truncating the Dirichlet process after a fixed number, \(K\), of components. Importantly, this approach is compatible with some of `pymc3`’s (current) technical limitations. Ohlssen, et al. provide justification for truncation, showing that \(K > 5 \alpha + 2\) is most likely sufficient to capture almost all of the mixture weight (\(\sum_{i = 1}^{K} w_i > 0.99\)). We can verify the suitability of our truncated approximation in practice by checking the number of components that contribute non-negligible mass to the mixture. If all components contribute non-negligible mass to the mixture in our simulations, we have truncated the Dirichlet process too early.
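To see why this truncation rule is reasonable, note that the stick remaining after \(K\) breaks is \(\prod_{i = 1}^{K} (1 - \beta_i)\), so the expected mass not captured by the first \(K\) components is \(\left(\frac{\alpha}{1 + \alpha}\right)^K\), since \(\mathbb{E}[1 - \beta_i] = \frac{\alpha}{1 + \alpha}\) for \(\beta_i \sim \textrm{Beta}(1, \alpha)\). The following small check (an illustration added here, not part of the original analysis) evaluates this expected leftover mass at the suggested truncation point:

```python
import numpy as np

# The stick remaining after K breaks is prod_{i<=K} (1 - beta_i); by
# independence its expected mass is (alpha / (1 + alpha))**K.
def expected_leftover_mass(alpha, K):
    return (alpha / (1. + alpha)) ** K

for alpha in [1., 2., 5.]:
    K = int(np.ceil(5 * alpha + 2))  # Ohlssen et al.'s rule of thumb
    print(alpha, K, expected_leftover_mass(alpha, K))
```

For each of these values of \(\alpha\), the expected leftover mass is below 0.01, consistent with the claim that \(\sum_{i = 1}^{K} w_i > 0.99\) for \(K > 5 \alpha + 2\).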

Our Dirichlet process mixture model for the standardized waiting times is

\[ \begin{align*} x_i\ |\ \mu_i, \lambda_i, \tau_i & \sim N\left(\mu_i, (\lambda_i \tau_i)^{-1}\right) \\ \mu_i\ |\ \lambda_i, \tau_i & \sim N\left(0, (\lambda_i \tau_i)^{-1}\right) \\ (\lambda_1, \tau_1), (\lambda_2, \tau_2), \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 5) \times \textrm{Gamma}(1, 1)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

Note that instead of fixing a value of \(\alpha\), as in our previous simulations, we specify a prior on \(\alpha\), so that we may learn its posterior distribution from the observations. This model is therefore actually a mixture of Dirichlet process mixtures, since each fixed value of \(\alpha\) results in a Dirichlet process mixture.

We now construct this model using `pymc3`.

```
N = old_faithful_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta)[:-1]]))
    component = pm.Categorical('component', w, shape=N)
    tau = pm.Gamma('tau', 1., 1., shape=K)
    lambda_ = pm.Uniform('lambda', 0, 5, shape=K)
    mu = pm.Normal('mu', 0, lambda_ * tau, shape=K)
    obs = pm.Normal('obs', mu[component], lambda_[component] * tau[component],
                    observed=old_faithful_df.std_waiting.values)
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied log-transform to tau and added transformed tau_log to model.
Applied interval-transform to lambda and added transformed lambda_interval to model.
```

We sample from the posterior distribution 20,000 times, burn the first 10,000 samples, and thin to every tenth sample.

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, lambda_, tau, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))
    trace_ = pm.sample(20000, [step1, step2])

trace = trace_[10000::10]
```

` [-----------------100%-----------------] 20000 of 20000 complete in 139.3 sec`

The posterior distribution of \(\alpha\) is concentrated between 0.4 and 1.75.

`pm.traceplot(trace, varnames=['alpha']);`

To verify that our truncation point is not biasing our results, we plot the distribution of the number of mixture components used.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We see that the vast majority of samples use five mixture components, and the largest number of mixture components used by any sample is eight. Since we truncated our Dirichlet process mixture at thirty components, we can be quite sure that truncation did not bias our results.

We now compute and plot our posterior density estimate.

```
post_pdf_contribs = sp.stats.norm.pdf(np.atleast_3d(x_plot),
                                      trace['mu'][:, np.newaxis, :],
                                      1. / np.sqrt(trace['lambda'] * trace['tau'])[:, np.newaxis, :])
post_pdfs = (trace['w'][:, np.newaxis, :] * post_pdf_contribs).sum(axis=-1)
post_pdf_low, post_pdf_high = np.percentile(post_pdfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
        color=blue, lw=0, alpha=0.5);
ax.fill_between(x_plot, post_pdf_low, post_pdf_high,
                color='gray', alpha=0.45);
ax.plot(x_plot, post_pdfs[0],
        c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pdfs[::100].T, c='gray');
ax.plot(x_plot, post_pdfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

As above, we can decompose this density estimate into its (weighted) mixture components.

```
fig, ax = plt.subplots(figsize=(8, 6))
n_bins = 20
ax.hist(old_faithful_df.std_waiting.values, bins=n_bins, normed=True,
        color=blue, lw=0, alpha=0.5);
ax.plot(x_plot, post_pdfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0)[:, 0],
        '--', c='k', label='Posterior expected mixture\ncomponents\n(weighted)');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pdf_contribs).mean(axis=0),
        '--', c='k');
ax.set_xlabel('Standardized waiting time between eruptions');
ax.set_yticklabels([]);
ax.set_ylabel('Density');
ax.legend(loc=2);
```

The Dirichlet process mixture model is incredibly flexible in terms of the family of parametric component distributions \(\{f_{\theta}\ |\ \theta \in \Theta\}\). We illustrate this flexibility below by using Poisson component distributions to estimate the density of sunspots per year.

`sunspot_df = get_rdataset('sunspot.year', cache=True).data`

`sunspot_df.head()`

| | time | sunspot.year |
|---|---|---|
| 0 | 1700 | 5 |
| 1 | 1701 | 11 |
| 2 | 1702 | 16 |
| 3 | 1703 | 23 |
| 4 | 1704 | 36 |

For this problem, the model is

\[ \begin{align*} x_i\ |\ \lambda_i & \sim \textrm{Poisson}(\lambda_i) \\ \lambda_1, \lambda_2, \ldots & \sim P \\ P & \sim \textrm{DP}(\alpha, U(0, 300)) \\ \alpha & \sim \textrm{Gamma}(1, 1). \end{align*} \]

```
N = sunspot_df.shape[0]
K = 30
```

```
with pm.Model() as model:
    alpha = pm.Gamma('alpha', 1., 1.)
    beta = pm.Beta('beta', 1., alpha, shape=K)
    w = pm.Deterministic('w', beta * T.concatenate([[1], T.extra_ops.cumprod(1 - beta[:-1])]))
    component = pm.Categorical('component', w, shape=N)
    mu = pm.Uniform('mu', 0., 300., shape=K)
    obs = pm.Poisson('obs', mu[component], observed=sunspot_df['sunspot.year'])
```

```
Applied log-transform to alpha and added transformed alpha_log to model.
Applied logodds-transform to beta and added transformed beta_logodds to model.
Applied interval-transform to mu and added transformed mu_interval to model.
```

```
with model:
    step1 = pm.Metropolis(vars=[alpha, beta, w, mu, obs])
    step2 = pm.ElemwiseCategoricalStep([component], np.arange(K))
    trace_ = pm.sample(20000, [step1, step2])
```

` [-----------------100%-----------------] 20000 of 20000 complete in 111.9 sec`

`trace = trace_[10000::10]`

For the sunspot model, the posterior distribution of \(\alpha\) is concentrated between one and three, indicating that we should expect more components to contribute non-negligible amounts to the mixture than for the Old Faithful waiting time model.

`pm.traceplot(trace, varnames=['alpha']);`

Indeed, we see that there are (on average) about ten to fifteen components used by this model.

`n_components_used = np.apply_along_axis(lambda x: np.unique(x).size, 1, trace['component'])`

```
fig, ax = plt.subplots(figsize=(8, 6))
bins = np.arange(n_components_used.min(), n_components_used.max() + 1)
ax.hist(n_components_used + 1, bins=bins, normed=True, lw=0, alpha=0.75);
ax.set_xticks(bins + 0.5);
ax.set_xticklabels(bins);
ax.set_xlim(bins.min(), bins.max() + 1);
ax.set_xlabel('Number of mixture components used');
ax.set_ylabel('Posterior probability');
```

We now calculate and plot the fitted density estimate.

`x_plot = np.arange(250)`

```
post_pmf_contribs = sp.stats.poisson.pmf(np.atleast_3d(x_plot),
                                         trace['mu'][:, np.newaxis, :])
post_pmfs = (trace['w'][:, np.newaxis, :] * post_pmf_contribs).sum(axis=-1)
post_pmf_low, post_pmf_high = np.percentile(post_pmfs, [2.5, 97.5], axis=0)
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);
ax.fill_between(x_plot, post_pmf_low, post_pmf_high,
                color='gray', alpha=0.45);
ax.plot(x_plot, post_pmfs[0],
        c='gray', label='Posterior sample densities');
ax.plot(x_plot, post_pmfs[::200].T, c='gray');
ax.plot(x_plot, post_pmfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.legend(loc=1);
```

Again, we can decompose the posterior expected density into weighted mixture densities.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(sunspot_df['sunspot.year'].values, bins=40, normed=True, lw=0, alpha=0.75);
ax.plot(x_plot, post_pmfs.mean(axis=0),
        c='k', label='Posterior expected density');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0)[:, 0],
        '--', c='k', label='Posterior expected\nmixture components\n(weighted)');
ax.plot(x_plot, (trace['w'][:, np.newaxis, :] * post_pmf_contribs).mean(axis=0),
        '--', c='k');
ax.legend(loc=1);
```

We have only scratched the surface in terms of applications of the Dirichlet process and Bayesian nonparametric statistics in general. This post is the first in a series I have planned on Bayesian nonparametrics, so stay tuned.

Survival analysis studies the distribution of the time to an event. Its applications span many fields across medicine, biology, engineering, and social science. This post shows how to fit and analyze a Bayesian survival model in Python using `pymc3`.

We illustrate these concepts by analyzing a mastectomy data set from `R`’s `HSAUR` package.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
from pymc3.distributions.timeseries import GaussianRandomWalk
import seaborn as sns
from statsmodels import datasets
from theano import tensor as T
```

Fortunately, `statsmodels.datasets` makes it quite easy to load a number of data sets from `R`.

```
df = datasets.get_rdataset('mastectomy', 'HSAUR', cache=True).data
df.event = df.event.astype(np.int64)
df.metastized = (df.metastized == 'yes').astype(np.int64)
n_patients = df.shape[0]
patients = np.arange(n_patients)
```

`df.head()`

| | time | event | metastized |
|---|---|---|---|
| 0 | 23 | 1 | 0 |
| 1 | 47 | 1 | 0 |
| 2 | 69 | 1 | 0 |
| 3 | 70 | 0 | 0 |
| 4 | 100 | 0 | 0 |

`n_patients`

`44`

Each row represents observations from a woman diagnosed with breast cancer who underwent a mastectomy. The column `time` represents the time (in months) post-surgery that the woman was observed. The column `event` indicates whether or not the woman died during the observation period. The column `metastized` represents whether the cancer had metastized prior to surgery.

This post analyzes the relationship between survival time post-mastectomy and whether or not the cancer had metastized.

First we introduce a (very little) bit of theory. If the random variable \(T\) is the time to the event we are studying, survival analysis is primarily concerned with the survival function

\[S(t) = P(T > t) = 1 - F(t),\]

where \(F\) is the CDF of \(T\). It is mathematically convenient to express the survival function in terms of the hazard rate, \(\lambda(t)\). The hazard rate is the instantaneous rate at which the event occurs at time \(t\), given that it has not yet occurred. That is,

\[\begin{align*} \lambda(t) & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t\ |\ T > t)}{\Delta t} \\ & = \lim_{\Delta t \to 0} \frac{P(t < T < t + \Delta t)}{\Delta t \cdot P(T > t)} \\ & = -\frac{1}{S(t)} \cdot \lim_{\Delta t \to 0} \frac{S(t + \Delta t) - S(t)}{\Delta t} = -\frac{S'(t)}{S(t)}. \end{align*}\]

Solving this differential equation for the survival function shows that

\[S(t) = \exp\left(-\int_0^t \lambda(s)\ ds\right).\]

This representation of the survival function shows that the cumulative hazard function

\[\Lambda(t) = \int_0^t \lambda(s)\ ds\]

is an important quantity in survival analysis, since we may concisely write \(S(t) = \exp(-\Lambda(t)).\)
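As a quick sanity check of the identity \(S(t) = \exp(-\Lambda(t))\), consider an exponentially distributed lifetime, which has constant hazard. The following snippet (an illustration added here, not part of the original analysis) compares the closed form against `scipy`'s survival function:

```python
import numpy as np
from scipy import stats

# For an Exponential(rate) lifetime, lambda(t) = rate, so the cumulative
# hazard is Lambda(t) = rate * t and S(t) = exp(-rate * t).
rate = 0.5
t = np.linspace(0., 10., 101)
S_closed_form = np.exp(-rate * t)
S_scipy = stats.expon.sf(t, scale=1. / rate)  # scipy uses scale = 1 / rate
print(np.allclose(S_closed_form, S_scipy))
```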

An important, but subtle, point in survival analysis is censoring. Even though the quantity we are interested in estimating is the time between surgery and death, we do not observe the death of every subject. At the point in time that we perform our analysis, some of our subjects will thankfully still be alive. In the case of our mastectomy study, `df.event` is one if the subject’s death was observed (the observation is not censored) and is zero if the death was not observed (the observation is censored).

`df.event.mean()`

`0.59090909090909094`

Just over 40% of our observations are censored. We visualize the observed durations and indicate which observations are censored below.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.hlines(patients[df.event.values == 0], 0, df[df.event.values == 0].time,
          color=blue, label='Censored');
ax.hlines(patients[df.event.values == 1], 0, df[df.event.values == 1].time,
          color=red, label='Uncensored');
ax.scatter(df[df.metastized.values == 1].time, patients[df.metastized.values == 1],
           color='k', zorder=10, label='Metastized');
ax.set_xlim(left=0);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(-0.25, n_patients + 0.25);
ax.legend(loc='center right');
```

When an observation is censored (`df.event` is zero), `df.time` is not the subject’s survival time. All we can conclude from such a censored observation is that the subject’s true survival time exceeds `df.time`.

This is enough basic survival analysis theory for the purposes of this post; for a more extensive introduction, consult Aalen et al.^{1}

The two most basic estimators in survival analysis are the Kaplan-Meier estimator of the survival function and the Nelson-Aalen estimator of the cumulative hazard function. However, since we want to understand the impact of metastization on survival time, a risk regression model is more appropriate. Perhaps the most commonly used risk regression model is Cox’s proportional hazards model. In this model, if we have covariates \(\mathbf{x}\) and regression coefficients \(\beta\), the hazard rate is modeled as

\[\lambda(t) = \lambda_0(t) \exp(\mathbf{x} \beta).\]

Here \(\lambda_0(t)\) is the baseline hazard, which is independent of the covariates \(\mathbf{x}\). In this example, the covariates are the one-dimensional vector `df.metastized`.

Unlike in many regression situations, \(\mathbf{x}\) should not include a constant term corresponding to an intercept. If \(\mathbf{x}\) includes a constant term corresponding to an intercept, the model becomes unidentifiable. To illustrate this unidentifiability, suppose that

\[\lambda(t) = \lambda_0(t) \exp(\beta_0 + \mathbf{x} \beta) = \lambda_0(t) \exp(\beta_0) \exp(\mathbf{x} \beta).\]

If \(\tilde{\beta}_0 = \beta_0 + \delta\) and \(\tilde{\lambda}_0(t) = \lambda_0(t) \exp(-\delta)\), then \(\lambda(t) = \tilde{\lambda}_0(t) \exp(\tilde{\beta}_0 + \mathbf{x} \beta)\) as well, making the model with \(\beta_0\) unidentifiable.
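A quick numerical illustration of this unidentifiability (added here for concreteness, with arbitrary made-up values): shifting the intercept by \(\delta\) while rescaling the baseline hazard by \(\exp(-\delta)\) produces exactly the same hazard.

```python
import numpy as np

# Shifting beta_0 by delta and scaling lambda_0 by exp(-delta) leaves the
# hazard lambda(t) = lambda_0(t) * exp(beta_0 + x * beta) unchanged.
lambda0, beta0, beta, x, delta = 2.0, 0.5, 1.2, 1.0, 0.7
h_original = lambda0 * np.exp(beta0 + x * beta)
h_shifted = (lambda0 * np.exp(-delta)) * np.exp((beta0 + delta) + x * beta)
print(h_original, h_shifted)
```

Since the data can never distinguish these two parameterizations, the intercept must be absorbed into the baseline hazard.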

In order to perform Bayesian inference with the Cox model, we must specify priors on \(\beta\) and \(\lambda_0(t)\). We place a normal prior on \(\beta\), \(\beta \sim N(\mu_{\beta}, \sigma_{\beta}^2),\) where \(\mu_{\beta} \sim N(0, 10^2)\) and \(\sigma_{\beta} \sim U(0, 10)\).

A suitable prior on \(\lambda_0(t)\) is less obvious. We choose a semiparametric prior, where \(\lambda_0(t)\) is a piecewise constant function. This prior requires us to partition the time range in question into intervals with endpoints \(0 \leq s_1 < s_2 < \cdots < s_N\). With this partition, \(\lambda_0 (t) = \lambda_j\) if \(s_j \leq t < s_{j + 1}\). With \(\lambda_0(t)\) constrained to have this form, all we need to do is choose priors for the \(N - 1\) values \(\lambda_j\). We use independent vague priors \(\lambda_j \sim \operatorname{Gamma}(10^{-2}, 10^{-2}).\) For our mastectomy example, we make each interval three months long.

```
interval_length = 3
interval_bounds = np.arange(0, df.time.max() + interval_length + 1, interval_length)
n_intervals = interval_bounds.size - 1
intervals = np.arange(n_intervals)
```

We see how deaths and censored observations are distributed in these intervals.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.hist(df[df.event == 1].time.values, bins=interval_bounds,
        color=red, alpha=0.5, lw=0,
        label='Uncensored');
ax.hist(df[df.event == 0].time.values, bins=interval_bounds,
        color=blue, alpha=0.5, lw=0,
        label='Censored');
ax.set_xlim(0, interval_bounds[-1]);
ax.set_xlabel('Months since mastectomy');
ax.set_yticks([0, 1, 2, 3]);
ax.set_ylabel('Number of observations');
ax.legend();
```

With the prior distributions on \(\beta\) and \(\lambda_0(t)\) chosen, we now show how the model may be fit using MCMC simulation with `pymc3`. The key observation is that the piecewise-constant proportional hazard model is closely related to a Poisson regression model. (The models are not identical, but their likelihoods differ by a factor that depends only on the observed data and not on the parameters \(\beta\) and \(\lambda_j\). For details, see Germán Rodríguez’s WWS 509 course notes.)

We define indicator variables based on whether or not the \(i\)-th subject died in the \(j\)-th interval,

\[d_{i, j} = \begin{cases} 1 & \textrm{if subject } i \textrm{ died in interval } j \\ 0 & \textrm{otherwise} \end{cases}.\]

```
# integer index of each subject's last observed interval
last_period = np.floor((df.time - 0.01) / interval_length).astype(int)
death = np.zeros((n_patients, n_intervals))
death[patients, last_period] = df.event
```

We also define \(t_{i, j}\) to be the amount of time the \(i\)-th subject was at risk in the \(j\)-th interval.

```
exposure = np.greater_equal.outer(df.time, interval_bounds[:-1]) * interval_length
exposure[patients, last_period] = df.time - interval_bounds[last_period]
```

Finally, denote the risk incurred by the \(i\)-th subject in the \(j\)-th interval as \(\lambda_{i, j} = \lambda_j \exp(\mathbf{x}_i \beta)\).

We may approximate \(d_{i, j}\) with a Poisson random variable with mean \(t_{i, j}\ \lambda_{i, j}\). This approximation leads to the following `pymc3` model.

`SEED = 5078864 # from random.org`

```
with pm.Model() as model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    sigma = pm.Uniform('sigma', 0., 10.)
    tau = pm.Deterministic('tau', sigma**-2)
    mu_beta = pm.Normal('mu_beta', 0., 10**-2)
    beta = pm.Normal('beta', mu_beta, tau)
    lambda_ = pm.Deterministic('lambda_', T.outer(T.exp(beta * df.metastized), lambda0))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
```

We now sample from the model.

```
n_samples = 40000
burn = 20000
thin = 20
```

```
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 39.0 sec`

`trace = trace_[burn::thin]`

We see that the hazard rate for subjects whose cancer has metastized is roughly 1.65 times the rate of those whose cancer has not metastized.

`np.exp(trace['beta'].mean())`

`1.645592148084472`

`pm.traceplot(trace, vars=['beta']);`

`pm.autocorrplot(trace, vars=['beta']);`

We now examine the effect of metastization on both the cumulative hazard and on the survival function.

```
base_hazard = trace['lambda0']
met_hazard = trace['lambda0'] * np.exp(np.atleast_2d(trace['beta']).T)
```

```
def cum_hazard(hazard):
    return (interval_length * hazard).cumsum(axis=-1)

def survival(hazard):
    return np.exp(-cum_hazard(hazard))
```

```
def plot_with_hpd(x, hazard, f, ax, color=None, label=None, alpha=0.05):
    mean = f(hazard.mean(axis=0))
    percentiles = 100 * np.array([alpha / 2., 1. - alpha / 2.])
    hpd = np.percentile(f(hazard), percentiles, axis=0)
    ax.fill_between(x, hpd[0], hpd[1], color=color, alpha=0.25)
    ax.step(x, mean, color=color, label=label);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], base_hazard, cum_hazard,
              hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], met_hazard, cum_hazard,
              hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], base_hazard, survival,
              surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], met_hazard, survival,
              surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model');
```

We see that the cumulative hazard for metastized subjects increases more rapidly initially (through about seventy months), after which it increases roughly in parallel with the baseline cumulative hazard.

These plots also show the pointwise 95% high posterior density interval for each function. One of the distinct advantages of the Bayesian model fit with `pymc3` is the inherent quantification of uncertainty in our estimates.

Another of the advantages of the model we have built is its flexibility. From the plots above, we may reasonably believe that the additional hazard due to metastization varies over time; it seems plausible that cancer that has metastized increases the hazard rate immediately after the mastectomy, but that the risk due to metastization decreases over time. We can accommodate this mechanism in our model by allowing the regression coefficients to vary over time. In the time-varying coefficient model, if \(s_j \leq t < s_{j + 1}\), we let \(\lambda(t) = \lambda_j \exp(\mathbf{x} \beta_j).\) The sequence of regression coefficients \(\beta_1, \beta_2, \ldots, \beta_{N - 1}\) forms a normal random walk with \(\beta_1 \sim N(0, 1)\) and \(\beta_j\ |\ \beta_{j - 1} \sim N(\beta_{j - 1}, 1)\).

We implement this model in `pymc3` as follows.

```
with pm.Model() as time_varying_model:
    lambda0 = pm.Gamma('lambda0', 0.01, 0.01, shape=n_intervals)
    beta = GaussianRandomWalk('beta', tau=1., shape=n_intervals)
    lambda_ = pm.Deterministic('h', lambda0 * T.exp(T.outer(T.constant(df.metastized), beta)))
    mu = pm.Deterministic('mu', exposure * lambda_)
    obs = pm.Poisson('obs', mu, observed=death)
```

We proceed to sample from this model.

```
with time_varying_model:
    step = pm.Metropolis()
    time_varying_trace_ = pm.sample(n_samples, step, random_seed=SEED)
```

` [-----------------100%-----------------] 40000 of 40000 complete in 56.7 sec`

`time_varying_trace = time_varying_trace_[burn::thin]`

We see from the plot of \(\beta_j\) over time below that initially \(\beta_j > 0\), indicating an elevated hazard rate due to metastization, but that this excess risk declines over time, with \(\beta_j\) eventually becoming negative.

```
fig, ax = plt.subplots(figsize=(8, 6))
beta_hpd = np.percentile(time_varying_trace['beta'], [2.5, 97.5], axis=0)
beta_low = beta_hpd[0]
beta_high = beta_hpd[1]
ax.fill_between(interval_bounds[:-1], beta_low, beta_high,
                color=blue, alpha=0.25);
beta_hat = time_varying_trace['beta'].mean(axis=0)
ax.step(interval_bounds[:-1], beta_hat, color=blue);
ax.scatter(interval_bounds[last_period[(df.event.values == 1) & (df.metastized == 1)]],
           beta_hat[last_period[(df.event.values == 1) & (df.metastized == 1)]],
           c=red, zorder=10, label='Died, cancer metastized');
ax.scatter(interval_bounds[last_period[(df.event.values == 0) & (df.metastized == 1)]],
           beta_hat[last_period[(df.event.values == 0) & (df.metastized == 1)]],
           c=blue, zorder=10, label='Censored, cancer metastized');
ax.set_xlim(0, df.time.max());
ax.set_xlabel('Months since mastectomy');
ax.set_ylabel(r'$\beta_j$');
ax.legend();
```

The coefficients \(\beta_j\) begin declining rapidly around one hundred months post-mastectomy, which seems reasonable: of the twelve subjects whose cancer had metastized, only three lived past this point and died during the study.

The change in our estimate of the cumulative hazard and survival functions due to time-varying effects is also quite apparent in the following plots.

```
tv_base_hazard = time_varying_trace['lambda0']
tv_met_hazard = time_varying_trace['lambda0'] * np.exp(np.atleast_2d(time_varying_trace['beta']))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.step(interval_bounds[:-1], cum_hazard(base_hazard.mean(axis=0)),
        color=blue, label='Had not metastized');
ax.step(interval_bounds[:-1], cum_hazard(met_hazard.mean(axis=0)),
        color=red, label='Metastized');
ax.step(interval_bounds[:-1], cum_hazard(tv_base_hazard.mean(axis=0)),
        color=blue, linestyle='--', label='Had not metastized (time varying effect)');
ax.step(interval_bounds[:-1], cum_hazard(tv_met_hazard.mean(axis=0)),
        color=red, linestyle='--', label='Metastized (time varying effect)');
ax.set_xlim(0, df.time.max() - 4);
ax.set_xlabel('Months since mastectomy');
ax.set_ylim(0, 2);
ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
ax.legend(loc=2);
```

```
fig, (hazard_ax, surv_ax) = plt.subplots(ncols=2, sharex=True, sharey=False, figsize=(16, 6))
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, cum_hazard,
              hazard_ax, color=blue, label='Had not metastized')
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, cum_hazard,
              hazard_ax, color=red, label='Metastized')
hazard_ax.set_xlim(0, df.time.max());
hazard_ax.set_xlabel('Months since mastectomy');
hazard_ax.set_ylim(0, 2);
hazard_ax.set_ylabel(r'Cumulative hazard $\Lambda(t)$');
hazard_ax.legend(loc=2);
plot_with_hpd(interval_bounds[:-1], tv_base_hazard, survival,
              surv_ax, color=blue)
plot_with_hpd(interval_bounds[:-1], tv_met_hazard, survival,
              surv_ax, color=red)
surv_ax.set_xlim(0, df.time.max());
surv_ax.set_xlabel('Months since mastectomy');
surv_ax.set_ylabel('Survival function $S(t)$');
fig.suptitle('Bayesian survival model with time varying effects');
```

We have really only scratched the surface of both survival analysis and the Bayesian approach to survival analysis. More information on Bayesian survival analysis is available in Ibrahim et al.^{2} (For example, we may want to account for individual frailty in either our original or time-varying models.)

This post is available as an IPython notebook here.

Tags: Bayesian Statistics, PyMC3

Outside of the beta-binomial model, the multivariate normal model is likely the most studied Bayesian model in history. Unfortunately, as this issue shows, `pymc3` cannot (yet) sample from the standard conjugate normal-Wishart model. Fortunately, `pymc3` *does* support sampling from the LKJ distribution. This post will show how to fit a simple multivariate normal model using `pymc3` with a normal-LKJ prior.

The normal-Wishart prior is conjugate for the multivariate normal model, so we can find the posterior distribution in closed form. Even with this closed form solution, sampling from a multivariate normal model in `pymc3` is important as a building block for more complex models that will be discussed in future posts.

First, we generate some two-dimensional sample data.

`%matplotlib inline`

```
from matplotlib.patches import Ellipse
from matplotlib import pyplot as plt
import numpy as np
import pymc3 as pm
import scipy as sp
import seaborn as sns
from theano import tensor as T
```

`Couldn't import dot_parser, loading of dot files will not be possible.`

`np.random.seed(3264602) # from random.org`

```
N = 100
mu_actual = sp.stats.uniform.rvs(-5, 10, size=2)
cov_actual_sqrt = sp.stats.uniform.rvs(0, 2, size=(2, 2))
cov_actual = np.dot(cov_actual_sqrt.T, cov_actual_sqrt)
x = sp.stats.multivariate_normal.rvs(mu_actual, cov_actual, size=N)
```

```
var, U = np.linalg.eig(cov_actual)
angle = 180. / np.pi * np.arccos(np.abs(U[0, 0]))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * var[0]), 2 * np.sqrt(5.991 * var[1]), angle=-angle)
e.set_alpha(0.5)
e.set_facecolor('gray')
e.set_zorder(10);
ax.add_artist(e);
ax.scatter(x[:, 0], x[:, 1], c='k', alpha=0.5, zorder=11);
rect = plt.Rectangle((0, 0), 1, 1, fc='gray', alpha=0.5)
ax.legend([rect], ['95% true credible region'], loc=2);
```

The sampling distribution for our model is \(x_i \sim N(\mu, \Lambda)\), where \(\Lambda\) is the precision matrix of the distribution. The precision matrix is the inverse of the covariance matrix. The support of the LKJ distribution is the set of correlation matrices, not covariance matrices. We will use the separation strategy from Barnard et al. to combine an LKJ prior on the correlation matrix with a prior on the standard deviations of each dimension to produce a prior on the covariance matrix.

Let \(\sigma\) be the vector of standard deviations of each component of our normal distribution, and \(\mathbf{C}\) be the correlation matrix. The relationship

\[\Sigma = \operatorname{diag}(\sigma)\ \mathbf{C} \operatorname{diag}(\sigma)\]

shows that priors on \(\sigma\) and \(\mathbf{C}\) will induce a prior on \(\Sigma\). Following Barnard et al., we place a standard lognormal prior on each of the elements of \(\sigma\), and an LKJ prior on the correlation matrix \(\mathbf{C}\). The LKJ distribution requires a shape parameter \(\nu > 0\). If \(\mathbf{C} \sim LKJ(\nu)\), then \(f(\mathbf{C}) \propto |\mathbf{C}|^{\nu - 1}\) (here \(|\cdot|\) is the determinant).
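As a quick numpy check of this identity (the values of \(\sigma\) and \(\mathbf{C}\) below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Hypothetical standard deviations and correlation matrix for illustration.
sigma = np.array([0.5, 2.0])
C = np.array([[1.0, 0.3],
              [0.3, 1.0]])

# Separation strategy: Sigma = diag(sigma) C diag(sigma).
Sigma = np.diag(sigma) @ C @ np.diag(sigma)

# The resulting covariance matrix has marginal standard deviations sigma
# and correlation matrix C, as expected.
assert np.allclose(np.sqrt(np.diag(Sigma)), sigma)
assert np.allclose(Sigma / np.outer(sigma, sigma), C)
```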

We can now begin to build this model in `pymc3`.

```
with pm.Model() as model:
    sigma = pm.Lognormal('sigma', np.zeros(2), np.ones(2), shape=2)
    nu = pm.Uniform('nu', 0, 5)
    C_triu = pm.LKJCorr('C_triu', nu, 2)
```

There is a slight complication in `pymc3`’s handling of the `LKJCorr` distribution; `pymc3` represents the support of this distribution as a one-dimensional vector of the upper triangular elements of the full correlation matrix.

`C_triu.tag.test_value.shape`

`(1,)`

In order to build the full correlation matrix \(\mathbf{C}\), we first build a \(2 \times 2\) tensor whose values are all `C_triu` and then set the diagonal entries to one. (Recall that a correlation matrix must be symmetric and positive definite with all diagonal entries equal to one.) We can then proceed to build the covariance matrix \(\Sigma\) and the precision matrix \(\Lambda\).

```
with model:
    C = pm.Deterministic('C', T.fill_diagonal(C_triu[np.zeros((2, 2), dtype=np.int64)], 1.))
    sigma_diag = pm.Deterministic('sigma_mat', T.nlinalg.diag(sigma))
    cov = pm.Deterministic('cov', T.nlinalg.matrix_dot(sigma_diag, C, sigma_diag))
    tau = pm.Deterministic('tau', T.nlinalg.matrix_inverse(cov))
```

While defining `C` in terms of `C_triu` was simple in this case because our sampling distribution is two-dimensional, the example from this StackOverflow question shows how to generalize this transformation to arbitrarily many dimensions.
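For reference, in plain numpy (outside of `theano`) the same expansion for an arbitrary dimension can be sketched as follows; the helper name `corr_from_triu` is mine, not part of `pymc3`:

```python
import numpy as np

def corr_from_triu(c_triu, n):
    """Expand a vector of strictly-upper-triangular entries into a full
    n x n correlation matrix (symmetric, with unit diagonal)."""
    C = np.eye(n)
    iu = np.triu_indices(n, k=1)
    C[iu] = c_triu
    C[(iu[1], iu[0])] = c_triu  # mirror into the lower triangle
    return C

# For n = 3 there are n (n - 1) / 2 = 3 free correlations.
C = corr_from_triu(np.array([0.2, -0.1, 0.5]), 3)
assert np.allclose(C, C.T)
assert np.allclose(np.diag(C), 1.0)
```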

Finally, we define the prior on \(\mu\) and the sampling distribution.

```
with model:
    mu = pm.MvNormal('mu', 0, tau, shape=2)
    x_ = pm.MvNormal('x', mu, tau, observed=x)
```

We are now ready to fit this model using `pymc3`.

```
n_samples = 4000
n_burn = 2000
n_thin = 2
```

```
with model:
    step = pm.Metropolis()
    trace_ = pm.sample(n_samples, step)
```

` [-----------------100%-----------------] 4000 of 4000 complete in 5.8 sec`

`trace = trace_[n_burn::n_thin]`

We see that the posterior estimate of \(\mu\) is reasonably accurate.

`pm.traceplot(trace, vars=['mu']);`

`trace['mu'].mean(axis=0)`

`array([-1.41086412, -4.6853101 ])`

`mu_actual`

`array([-1.41866859, -4.8018335 ])`

The estimates of the standard deviations are certainly biased.

`pm.traceplot(trace, vars=['sigma']);`

`trace['sigma'].mean(axis=0)`

`array([ 0.75736536, 1.49451149])`

`np.sqrt(var)`

`array([ 0.3522422 , 1.58192855])`

However, the 95% posterior credible region is visually quite close to the true credible region, so we can be fairly satisfied with our model.

```
post_cov = trace['cov'].mean(axis=0)
post_sigma, post_U = np.linalg.eig(post_cov)
post_angle = 180. / np.pi * np.arccos(np.abs(post_U[0, 0]))
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * post_sigma[0]), 2 * np.sqrt(5.991 * post_sigma[1]), angle=-post_angle)
e.set_alpha(0.5)
e.set_facecolor(blue)
e.set_zorder(9);
ax.add_artist(e);
e = Ellipse(mu_actual, 2 * np.sqrt(5.991 * var[0]), 2 * np.sqrt(5.991 * var[1]), angle=-angle)
e.set_alpha(0.5)
e.set_facecolor('gray')
e.set_zorder(10);
ax.add_artist(e);
ax.scatter(x[:, 0], x[:, 1], c='k', alpha=0.5, zorder=11);
rect = plt.Rectangle((0, 0), 1, 1, fc='gray', alpha=0.5)
post_rect = plt.Rectangle((0, 0), 1, 1, fc=blue, alpha=0.5)
ax.legend([rect, post_rect],
          ['95% true credible region',
           '95% posterior credible region'],
          loc=2);
```

Again, this model is quite simple, but will be an important component of more complex models that I will blog about in the future.

This post is available as an IPython notebook here.

Tags: Bayesian Statistics, PyMC3

Recently, I have been learning about (generalized) additive models by working through Simon Wood’s book. I have previously posted an IPython notebook implementing the models from Chapter 3 of the book. In this post, I will show how to fit a simple additive model in Python in a bit more detail.

We will use a LIDAR dataset that is available on the website for Larry Wasserman’s book *All of Nonparametric Statistics*.

`%matplotlib inline`

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import patsy
import scipy as sp
import seaborn as sns
from statsmodels import api as sm
```

```
df = pd.read_csv('http://www.stat.cmu.edu/~larry/all-of-nonpar/=data/lidar.dat',
                 sep=' *', engine='python')
df['std_range'] = (df.range - df.range.min()) / df.range.ptp()
n = df.shape[0]
```

`df.head()`

| | range | logratio | std_range |
|---|---|---|---|
| 0 | 390 | -0.050356 | 0.000000 |
| 1 | 391 | -0.060097 | 0.003030 |
| 2 | 393 | -0.041901 | 0.009091 |
| 3 | 394 | -0.050985 | 0.012121 |
| 4 | 396 | -0.059913 | 0.018182 |

This data set is well-suited to additive modeling because the relationship between the variables is highly non-linear.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
ax.set_xlim(-0.01, 1.01);
ax.set_xlabel('Scaled range');
ax.set_ylabel('Log ratio');
```

An additive model represents the relationship between explanatory variables \(\mathbf{x}\) and a response variable \(y\) as a sum of smooth functions of the explanatory variables

\[y = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_k(x_k) + \varepsilon.\]

The smooth functions \(f_i\) can be estimated using a variety of nonparametric techniques. Following Chapter 3 of Wood’s book, we will fit our additive model using penalized regression splines.

Since our LIDAR data set has only one explanatory variable, our additive model takes the form

\[y = \beta_0 + f(x) + \varepsilon.\]

We fit this model by minimizing the penalized residual sum of squares

\[PRSS = \sum_{i = 1}^n \left(y_i - \beta_0 - f(x_i)\right)^2 + \lambda \int_0^1 \left(f''(x)\right)^2\ dx.\]

The penalty term

\[\int_0^1 \left(f''(x)\right)^2\ dx\]

causes us to only choose less smooth functions if they fit the data much better. The smoothing parameter \(\lambda\) controls the rate at which decreased smoothness is traded for a better fit.

In the penalized regression splines model, we must also choose basis functions \(\varphi_1, \varphi_2, \ldots, \varphi_k\), which we then use to express the smooth function \(f\) as

\[f(x) = \beta_1 \varphi_1(x) + \beta_2 \varphi_2(x) + \cdots + \beta_k \varphi_k(x).\]

With these basis functions in place, if we define \(\mathbf{x}_i = [1\ x_i\ \varphi_2(x_i)\ \cdots \varphi_k(x_i)]\) and

\[\mathbf{X} = \begin{bmatrix} \mathbf{x}_1 \\ \vdots \\ \mathbf{x}_n \end{bmatrix},\]

the model \(y_i = \beta_0 + f(x_i) + \varepsilon\) can be rewritten as \(\mathbf{y} = \mathbf{X} \beta + \varepsilon\). It is tedious but not difficult to show that when \(f\) is expressed as a linear combination of basis functions, there is always a positive semidefinite matrix \(\mathbf{S}\) such that

\[\int_0^1 \left(f''(x)\right)^2\ dx = \beta^{\intercal} \mathbf{S} \beta.\]

Since \(\mathbf{S}\) is positive semidefinite, it has a square root \(\mathbf{B}\) such that \(\mathbf{B}^{\intercal} \mathbf{B} = \mathbf{S}\). The penalized residual sum of squares objective function can then be written as

\[ \begin{align*} PRSS & = (\mathbf{y} - \mathbf{X} \beta)^{\intercal} (\mathbf{y} - \mathbf{X} \beta) + \lambda \beta^{\intercal} \mathbf{B}^{\intercal} \mathbf{B} \beta = (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta)^{\intercal} (\mathbf{\tilde{y}} - \mathbf{\tilde{X}} \beta), \end{align*} \]

where

\[\mathbf{\tilde{y}} = \begin{bmatrix} \mathbf{y} \\ \mathbf{0}_{k + 1} \end{bmatrix} \]

and

\[\mathbf{\tilde{X}} = \begin{bmatrix} \mathbf{X} \\ \sqrt{\lambda}\ \mathbf{B} \end{bmatrix}. \]

Therefore the augmented data matrices \(\mathbf{\tilde{y}}\) and \(\mathbf{\tilde{X}}\) allow us to express the penalized residual sum of squares for the original model as the residual sum of squares of the OLS model \(\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}\). This augmented model allows us to use widely available machinery for fitting OLS models to fit the additive model as well.
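We can verify this equivalence numerically with a small sketch on synthetic data (the matrices below are random placeholders, not the LIDAR model): the minimizer of the penalized normal equations \(\left(\mathbf{X}^{\intercal} \mathbf{X} + \lambda \mathbf{S}\right) \beta = \mathbf{X}^{\intercal} \mathbf{y}\) coincides with the OLS solution on the augmented matrices.

```python
import numpy as np

rng = np.random.default_rng(42)
n, k = 50, 4
X = rng.normal(size=(n, k))
y = rng.normal(size=n)

# Any matrix B yields a positive semidefinite S = B^T B.
B = np.triu(rng.normal(size=(k, k)))
S = B.T @ B
lam = 0.5

# Direct minimizer of the penalized RSS solves (X^T X + lambda S) beta = X^T y.
beta_pen = np.linalg.solve(X.T @ X + lam * S, X.T @ y)

# OLS on the augmented matrices gives the same coefficients.
X_aug = np.vstack([X, np.sqrt(lam) * B])
y_aug = np.concatenate([y, np.zeros(k)])
beta_aug, *_ = np.linalg.lstsq(X_aug, y_aug, rcond=None)

assert np.allclose(beta_pen, beta_aug)
```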

The last step before we can fit the model in Python is to choose the basis functions \(\varphi_i\). Again, following Chapter 3 of Wood’s book, we let

\[R(x, z) = \frac{1}{4} \left(\left(z - \frac{1}{2}\right)^2 - \frac{1}{12}\right) \left(\left(x - \frac{1}{2}\right)^2 - \frac{1}{12}\right) - \frac{1}{24} \left(\left(\left|x - z\right| - \frac{1}{2}\right)^4 - \frac{1}{2} \left(\left|x - z\right| - \frac{1}{2}\right)^2 + \frac{7}{240}\right).\]

```
def R(x, z):
    return ((z - 0.5)**2 - 1 / 12) * ((x - 0.5)**2 - 1 / 12) / 4 - ((np.abs(x - z) - 0.5)**4 - 0.5 * (np.abs(x - z) - 0.5)**2 + 7 / 240) / 24

R = np.frompyfunc(R, 2, 1)

def R_(x):
    return R.outer(x, knots).astype(np.float64)
```

Though this function is quite complicated, we will see that it has some very convenient properties. We must also choose a set of knots \(z_i\) in \([0, 1]\), \(i = 1, 2, \ldots, q\).

```
q = 20
knots = df.std_range.quantile(np.linspace(0, 1, q))
```

Here we have used twenty knots situated at percentiles of `std_range`.

Now we define our basis functions as \(\varphi_1(x) = x\) and \(\varphi_{i}(x) = R(x, z_{i - 1})\) for \(i = 2, 3, \ldots, q + 1\).

Our model matrices \(\mathbf{y}\) and \(\mathbf{X}\) are therefore

`y, X = patsy.dmatrices('logratio ~ std_range + R_(std_range)', data=df)`

Note that, by default, `patsy` always includes an intercept column in `X`.

The advantage of the function \(R\) is that the penalty matrix \(\mathbf{S}\) has the form

\[S = \begin{bmatrix} \mathbf{0}_{2 \times 2} & \mathbf{0}_{2 \times q} \\ \mathbf{0}_{q \times 2} & \mathbf{\tilde{S}} \end{bmatrix},\]

where \(\mathbf{\tilde{S}}_{ij} = R(z_i, z_j)\). We now calculate \(\mathbf{S}\) and its square root \(\mathbf{B}\).

```
S = np.zeros((q + 2, q + 2))
S[2:, 2:] = R_(knots)
```

```
B = np.zeros_like(S)
B[2:, 2:] = np.real_if_close(sp.linalg.sqrtm(S[2:, 2:]), tol=10**8)
```

We now have all the ingredients necessary to fit some additive models to the LIDAR data set.

```
def fit(y, X, B, lambda_=1.0):
    # build the augmented matrices
    y_ = np.vstack((y, np.zeros((q + 2, 1))))
    X_ = np.vstack((X, np.sqrt(lambda_) * B))
    return sm.OLS(y_, X_).fit()
```

We have not yet discussed how to choose the smoothing parameter \(\lambda\), so we will fit several models with different values of \(\lambda\) to see how it affects the results.

```
fig, axes = plt.subplots(nrows=3, ncols=2, sharex=True, sharey=True, squeeze=True, figsize=(12, 13.5))
plot_lambdas = np.array([1.0, 0.1, 0.01, 0.001, 0.0001, 0.00001])
plot_x = np.linspace(0, 1, 100)
plot_X = patsy.dmatrix('std_range + R_(std_range)', {'std_range': plot_x})
for lambda_, ax in zip(plot_lambdas, np.ravel(axes)):
    ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
    results = fit(y, X, B, lambda_=lambda_)
    ax.plot(plot_x, results.predict(plot_X));
    ax.set_xlim(-0.01, 1.01);
    ax.set_xlabel('Scaled range');
    ax.set_ylabel('Log ratio');
    ax.set_title(r'$\lambda = {}$'.format(lambda_));
fig.tight_layout();
```

We can see that as \(\lambda\) decreases, the model becomes less smooth. Visually, it seems that the optimal value of \(\lambda\) lies somewhere between \(10^{-2}\) and \(10^{-4}\). We need a rigorous way to choose the optimal value of \(\lambda\). As is often the case in such situations, we turn to cross-validation. Specifically, we will use generalized cross-validation to choose the optimal value of \(\lambda\). The GCV score is given by

\[\operatorname{GCV}(\lambda) = \frac{n \sum_{i = 1}^n \left(y_i - \hat{y}_i\right)^2}{\left(n - \operatorname{tr} \mathbf{H}\right)^2}.\]

Here, \(\hat{y}_i\) is the \(i\)-th predicted value, and \(\mathbf{H}\) is the upper left \(n \times n\) submatrix of the influence matrix for the OLS model \(\mathbf{\tilde{y}} = \mathbf{\tilde{X}} \beta + \tilde{\varepsilon}\).

```
def gcv_score(results):
    X = results.model.exog[:-(q + 2), :]
    n = X.shape[0]
    y = results.model.endog[:n]
    y_hat = results.predict(X)
    hat_matrix_trace = results.get_influence().hat_matrix_diag[:n].sum()
    return n * np.power(y - y_hat, 2).sum() / np.power(n - hat_matrix_trace, 2)
```

Now we evaluate the GCV score of the model over a range of \(\lambda\) values.

`lambdas = np.logspace(0, 50, 100, base=1.5) * 1e-8`

`gcv_scores = np.array([gcv_score(fit(y, X, B, lambda_=lambda_)) for lambda_ in lambdas])`

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(lambdas, gcv_scores);
ax.set_xscale('log');
ax.set_xlabel(r'$\lambda$');
ax.set_ylabel(r'$\operatorname{GCV}(\lambda)$');
```

The GCV-optimal value of \(\lambda\) is therefore

```
lambda_best = lambdas[gcv_scores.argmin()]
lambda_best
```

`0.00063458365729550153`

This value of \(\lambda\) produces a visually reasonable fit.

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.scatter(df.std_range, df.logratio, c=blue, alpha=0.5);
results = fit(y, X, B, lambda_=lambda_best)
ax.plot(plot_x, results.predict(plot_X), label=r'$\lambda = {}$'.format(lambda_best));
ax.set_xlim(-0.01, 1.01);
ax.legend();
```

We have only scratched the surface of additive models, fitting a simple model of one variable with penalized regression splines. In general, additive models are quite powerful and flexible, while remaining quite interpretable.

This post is available as an IPython notebook here.

Tags: Statistics, Python

For all the hype about big data, much value resides in the world’s medium and small data. Especially when we consider the length of the feedback loop and total analyst time invested, insights from small and medium data are quite attractive and economical. Personally, I find analyzing data that fits into memory quite convenient, and therefore, when I am confronted with a data set that does not fit in memory as-is, I am willing to spend a bit of time to try to manipulate it to fit into memory.

The first technique I usually turn to is to only store distinct rows of a data set, along with the count of the number of times that row appears in the data set. This technique is fairly simple to implement, especially when the data set is generated by a SQL query. If the initial query that generates the data set is

`SELECT u, v, w FROM t;`

we would modify it to become

`SELECT u, v, w, COUNT(1) FROM t GROUP BY u, v, w;`

We now generate a sample data set with both discrete and continuous features.

`%matplotlib inline`

`from __future__ import division`

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from patsy import dmatrices, dmatrix
import scipy as sp
import seaborn as sns
from statsmodels import api as sm
from statsmodels.base.model import GenericLikelihoodModel
```

`np.random.seed(1545721) # from random.org`

`N = 100001`

`u_min, u_max = 0, 100`

`v_p = 0.6`

```
n_ws = 50
ws = sp.stats.norm.rvs(0, 1, size=n_ws)
w_min, w_max = ws.min(), ws.max()
```

```
df = pd.DataFrame({
    'u': np.random.randint(u_min, u_max, size=N),
    'v': sp.stats.bernoulli.rvs(v_p, size=N),
    'w': np.random.choice(ws, size=N, replace=True)
})
```

`df.head()`

| | u | v | w |
|---|---|---|---|
| 0 | 97 | 0 | 0.537397 |
| 1 | 79 | 1 | 1.536383 |
| 2 | 44 | 1 | 1.074817 |
| 3 | 51 | 0 | -0.491210 |
| 4 | 47 | 1 | 1.592646 |

We see that this data frame has just over 100,000 rows, but only about 10,000 distinct rows.

`df.shape[0]`

`100001`

`df.drop_duplicates().shape[0]`

`9997`

We now use `pandas`’ `groupby` method to produce a data frame that contains the count of each unique combination of `u`, `v`, and `w`.

```
count_df = df.groupby(list(df.columns)).size()
count_df.name = 'count'
count_df = count_df.reset_index()
```

In order to make later examples interesting, we shuffle the rows of the reduced data frame, because `pandas` automatically sorts the values we grouped on in the reduced data frame.

```
shuffled_ixs = count_df.index.values
np.random.shuffle(shuffled_ixs)
count_df = count_df.iloc[shuffled_ixs].copy().reset_index(drop=True)
```

`count_df.head()`

| | u | v | w | count |
|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 |
| 2 | 35 | 0 | 0.358156 | 9 |
| 3 | 19 | 1 | -0.760298 | 17 |
| 4 | 40 | 1 | -0.688514 | 13 |

Again, we see that we are storing 90% fewer rows. Although this data set has been artificially generated, I have seen space savings of up to 98% when applying this technique to real-world data sets.

`count_df.shape[0] / N`

`0.0999690003099969`

This space savings allows me to analyze data sets which initially appear too large to fit in memory. For example, the computer I am writing this on has 16 GB of RAM. At a 90% space savings, I can comfortably analyze a data set that might otherwise be 80 GB in memory while leaving a healthy amount of memory for other processes. To me, the convenience and tight feedback loop that come with fitting a data set entirely in memory are hard to overstate.

As nice as it is to fit a data set into memory, it’s not very useful unless we can still analyze it. The rest of this post will show how we can perform standard operations on these summary data sets.

For convenience, we will separate the feature columns from the count columns.

```
summ_df = count_df[['u', 'v', 'w']]
n = count_df['count']
```

Suppose we have a group of numbers \(x_1, x_2, \ldots, x_n\). Let the unique values among these numbers be denoted \(z_1, z_2, \ldots, z_m\) and let \(n_j\) be the number of times \(z_j\) appears in the original group. The mean of the \(x_i\)s is therefore

\[ \begin{align*} \bar{x} & = \frac{1}{n} \sum_{i = 1}^n x_i = \frac{1}{n} \sum_{j = 1}^m n_j z_j, \end{align*} \]

since we may group identical \(x_i\)s into a single summand. Since \(n = \sum_{j = 1}^m n_j\), we can calculate the mean using the following function.

```
def mean(df, count):
    return df.mul(count, axis=0).sum() / count.sum()
```

`mean(summ_df, n)`

```
u 49.308067
v 0.598704
w 0.170815
dtype: float64
```

We see that the means calculated by our function agree with the means of the original data frame.

`df.mean(axis=0)`

```
u 49.308067
v 0.598704
w 0.170815
dtype: float64
```

`np.allclose(mean(summ_df, n), df.mean(axis=0))`

`True`

We can calculate the variance as

\[ \begin{align*} \sigma_x^2 & = \frac{1}{n - 1} \sum_{i = 1}^n \left(x_i - \bar{x}\right)^2 = \frac{1}{n - 1} \sum_{j = 1}^m n_j \left(z_j - \bar{x}\right)^2 \end{align*} \]

using the same trick of combining identical terms in the original sum. Again, this calculation is easy to implement in Python.

```
def var(df, count):
    mu = mean(df, count)
    return np.power(df - mu, 2).mul(count, axis=0).sum() / (count.sum() - 1)
```

`var(summ_df, n)`

```
u 830.025064
v 0.240260
w 1.099191
dtype: float64
```

We see that the variances calculated by our function agree with the variances of the original data frame.

`df.var()`

```
u 830.025064
v 0.240260
w 1.099191
dtype: float64
```

`np.allclose(var(summ_df, n), df.var(axis=0))`

`True`

Histograms are fundamental tools for exploratory data analysis. Fortunately, `pyplot`’s `hist` function easily accommodates summarized data using the optional `weights` argument.

```
fig, (full_ax, summ_ax) = plt.subplots(ncols=2, sharex=True, sharey=True, figsize=(16, 6))
nbins = 20
blue, green = sns.color_palette()[:2]
full_ax.hist(df.w, bins=nbins, color=blue, alpha=0.5, lw=0);
full_ax.set_xlabel('$w$');
full_ax.set_ylabel('Count');
full_ax.set_title('Full data frame');
summ_ax.hist(summ_df.w, bins=nbins, weights=n, color=green, alpha=0.5, lw=0);
summ_ax.set_xlabel('$w$');
summ_ax.set_title('Summarized data frame');
```

We see that the histograms for \(w\) produced from the full and summarized data frames are identical.

Calculating the mean and variance of our summarized data frames was not too difficult. Calculating quantiles from this data frame is slightly more involved, though still not terribly hard.

Our implementation will rely on sorting the data frame. Though this implementation is not optimal from a computational complexity point of view, it is in keeping with the spirit of `pandas`’ implementation of quantiles. I have given some thought to how to implement linear time selection on the summarized data frame, but have not yet worked out the details.

Before writing a function to calculate quantiles of a data frame with several columns, we will walk through the simpler case of computing the quartiles of a single series.

`u = summ_df.u`

`u.head()`

```
0 0
1 48
2 35
3 19
4 40
Name: u, dtype: int64
```

First we `argsort` the series.

`sorted_ilocs = u.argsort()`

We see that `u.iloc[sorted_ilocs]` will now be in ascending order.

`sorted_u = u.iloc[sorted_ilocs]`

`(sorted_u[:-1] <= sorted_u[1:]).all()`

`True`

More importantly, `n.iloc[sorted_ilocs]` will have the count of the smallest element of `u` first, the count of the second smallest element second, etc.

```
sorted_n = n.iloc[sorted_ilocs]
sorted_cumsum = sorted_n.cumsum()
cdf = (sorted_cumsum / n.sum()).values
```

Now, the \(i\)-th location of `sorted_cumsum` will contain the number of elements of `u` less than or equal to the \(i\)-th smallest element, and therefore `cdf` is the empirical cumulative distribution function of `u`. The following plot shows that this interpretation is correct.

```
fig, ax = plt.subplots(figsize=(8, 6))
blue, _, red = sns.color_palette()[:3]
ax.plot(sorted_u, cdf, c=blue, label='Empirical CDF');
plot_u = np.arange(100)
ax.plot(plot_u, sp.stats.randint.cdf(plot_u, u_min, u_max), '--', c=red, label='Population CDF');
ax.set_xlabel('$u$');
ax.legend(loc=2);
```

If, for example, we wish to find the median of `u`, we want to find the first location in `cdf` which is greater than or equal to 0.5.

`median_iloc_in_sorted = (cdf < 0.5).argmin()`

The index of the median in `u` is therefore `sorted_ilocs.iloc[median_iloc_in_sorted]`, so the median of `u` is

`u.iloc[sorted_ilocs.iloc[median_iloc_in_sorted]]`

`49`

`df.u.quantile(0.5)`

`49.0`

We can generalize this method to calculate multiple quantiles simultaneously as follows.

`q = np.array([0.25, 0.5, 0.75])`

`u.iloc[sorted_ilocs.iloc[np.less.outer(cdf, q).argmin(axis=0)]]`

```
2299 24
9079 49
1211 74
Name: u, dtype: int64
```

`df.u.quantile(q)`

```
0.25 24
0.50 49
0.75 74
dtype: float64
```

The array `np.less.outer(cdf, q)` has one column for each element of `q`, containing the comparison of `cdf` to that quantile; `argmin(axis=0)` then finds the first location where each comparison fails, which is the location of that quantile. The following function generalizes this approach from series to data frames.

```
def quantile(df, count, q=0.5):
    q = np.ravel(q)
    sorted_ilocs = df.apply(pd.Series.argsort)
    sorted_counts = sorted_ilocs.apply(lambda s: count.iloc[s].values)
    cdf = sorted_counts.cumsum() / sorted_counts.sum()
    q_ilocs_in_sorted_ilocs = pd.DataFrame(np.less.outer(cdf.values, q).argmin(axis=0).T,
                                           columns=df.columns)
    q_ilocs = sorted_ilocs.apply(lambda s: s[q_ilocs_in_sorted_ilocs[s.name]].reset_index(drop=True))
    q_df = df.apply(lambda s: s.iloc[q_ilocs[s.name]].reset_index(drop=True))
    q_df.index = q
    return q_df
```

`quantile(summ_df, n, q=q)`

| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |

`df.quantile(q=q)`

| | u | v | w |
|---|---|---|---|
| 0.25 | 24 | 0 | -0.688514 |
| 0.50 | 49 | 1 | 0.040036 |
| 0.75 | 74 | 1 | 1.074817 |

`np.allclose(quantile(summ_df, n, q=q), df.quantile(q=q))`

`True`

Another important operation is bootstrapping. We will see two ways to perform bootstrapping on the summary data set.

`n_boot = 10000`

Key to both approaches to the bootstrap is knowing the proportion of the data set that each distinct combination of features comprises.

`weights = n / n.sum()`

The two approaches differ in what type of data frame they produce. The first we will discuss produces a non-summarized data frame with non-unique rows, while the second produces a summarized data frame. Each of these approaches to bootstrapping is useful in different situations.

To produce a non-summarized data frame, we generate a list of locations in `summ_df` based on `weights` using `numpy.random.choice`.

```
boot_ixs = np.random.choice(summ_df.shape[0], size=n_boot, replace=True,
                            p=weights)
```

`boot_df = summ_df.iloc[boot_ixs]`

`boot_df.head()`

| | u | v | w |
|---|---|---|---|
| 1171 | 47 | 1 | -1.392235 |
| 9681 | 3 | 1 | 0.018521 |
| 6664 | 13 | 1 | 1.941207 |
| 8343 | 13 | 0 | 0.655181 |
| 3595 | 95 | 1 | 0.972592 |

We can verify that our bootstrapped data frame has (approximately) the same distribution as the original data frame using Q-Q plots.

`ps = np.linspace(0, 1, 100)`

`boot_qs = boot_df[['u', 'w']].quantile(q=ps)`

`qs = df[['u', 'w']].quantile(q=ps)`

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.u, boot_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
```

```
fig, ax = plt.subplots(figsize=(8, 6))
blue = sns.color_palette()[0]
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.w, boot_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
```

We see that both of the resampled distributions agree quite closely with the original distributions. We have only produced Q-Q plots for \(u\) and \(w\) because \(v\) is binary-valued.

While at first non-summarized bootstrap resampling may appear to counteract the benefits of summarizing the original data frame, it can be quite useful when training and evaluating online learning algorithms, where iterating through the locations of the bootstrapped data in the original summarized data frame is efficient.

To produce a summarized data frame, the counts of the resampled data frame are sampled from a multinomial distribution with event probabilities given by `weights`.

`boot_counts = pd.Series(np.random.multinomial(n_boot, weights), name='count')`

Again, we compare the distribution of our bootstrapped data frame to that of the original with Q-Q plots. Here our summarized quantile function is quite useful.

`boot_count_qs = quantile(summ_df, boot_counts, q=ps)`

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((u_min, u_max), (u_min, u_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.u, boot_count_qs.u, c=blue, alpha=0.5);
ax.set_xlim(u_min, u_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(u_min, u_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $u$');
ax.legend(loc=2);
```

```
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot((w_min, w_max), (w_min, w_max), '--', c='k', lw=0.75,
        label='Perfect agreement');
ax.scatter(qs.w, boot_count_qs.w, c=blue, alpha=0.5);
ax.set_xlim(w_min, w_max);
ax.set_xlabel('Original quantiles');
ax.set_ylim(w_min, w_max);
ax.set_ylabel('Resampled quantiles');
ax.set_title('Q-Q plot for $w$');
ax.legend(loc=2);
```

Again, we see that both of the resampled distributions agree quite closely with the original distributions.

Linear regression is among the most frequently used types of statistical inference, and it plays nicely with summarized data. Typically, we have a response variable \(y\) that we wish to model as a linear combination of \(u\), \(v\), and \(w\) as

\[ \begin{align*} y_i = \beta_0 + \beta_1 u_i + \beta_2 v_i + \beta_3 w_i + \varepsilon, \end{align*} \]

where \(\varepsilon \sim N(0, \sigma^2)\) is noise. We generate such a data set below (with \(\sigma = 0.1\)).

`beta = np.array([-3., 0.1, -4., 2.])`

`noise_std = 0.1`

`X = dmatrix('u + v + w', data=df)`

`y = pd.Series(np.dot(X, beta), name='y') + sp.stats.norm.rvs(scale=noise_std, size=N)`

`y.head()`

```
0 7.862559
1 3.830585
2 -0.388246
3 1.047091
4 0.992082
Name: y, dtype: float64
```

Each element of the series `y` corresponds to one row in the uncompressed data frame `df`. The `OLS` class from `statsmodels` comes quite close to recovering the true regression coefficients.

`full_ols = sm.OLS(y, X).fit()`

`full_ols.params`

```
const -2.999658
x1 0.099986
x2 -3.998997
x3 2.000317
dtype: float64
```

To show how we can perform linear regression on the summarized data frame, we recall that the ordinary least squares estimator minimizes the residual sum of squares. The residual sum of squares is given by

\[ \begin{align*} RSS & = \sum_{i = 1}^n \left(y_i - \mathbf{x}_i \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]

Here \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\) is the \(i\)-th row of the original data frame (with a constant added for the intercept) and \(\mathbf{\beta} = [\beta_0\ \beta_1\ \beta_2\ \beta_3]\) is the row vector of regression coefficients. It would be tempting to rewrite \(RSS\) by grouping the terms based on the row their features map to in the compressed data frame, but this approach would lead to incorrect results. Due to the stochastic noise term \(\varepsilon_i\), identical values of \(u\), \(v\), and \(w\) can (and will almost certainly) map to different values of \(y\). We can see this phenomenon by calculating the range of \(y\) grouped on \(u\), \(v\), and \(w\).

`reg_df = pd.concat((y, df), axis=1)`

`reg_df.groupby(('u', 'v', 'w')).y.apply(np.ptp).describe()`

```
count 9997.000000
mean 0.297891
std 0.091815
min 0.000000
25% 0.237491
50% 0.296838
75% 0.358015
max 0.703418
Name: y, dtype: float64
```

If \(y\) were uniquely determined by \(u\), \(v\), and \(w\), we would expect the mean and quartiles of these ranges to be zero, which they are not. Fortunately, we can account for this difficulty with a bit of care.

Let \(S_j = \{i\ |\ \mathbf{x}_i = \mathbf{z}_j\}\), the set of row indices in the original data frame that correspond to the \(j\)-th row in the summary data frame. Define \(\bar{y}_{(j)} = \frac{1}{n_j} \sum_{i \in S_j} y_i\), which is the mean of the response variables that correspond to \(\mathbf{z}_j\). Intuitively, since \(\varepsilon_i\) has mean zero, \(\bar{y}_{(j)}\) is our best unbiased estimate of \(\mathbf{z}_j \mathbf{\beta}^{\intercal}\). We will now show that regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\) gives the same results as the full regression. We use the standard trick of adding and subtracting the mean and get

\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(\left(y_i - \bar{y}_{(j)}\right)^2 + 2 \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) + \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2\right). \end{align*} \]

As is usual in these situations, the cross term vanishes, since

\[ \begin{align*} \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right) \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) & = \sum_{i \in S_j} \left(y_i \bar{y}_{(j)} - y_i \mathbf{z}_j \mathbf{\beta}^{\intercal} - \bar{y}_{(j)}^2 + \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right) \\ & = \bar{y}_{(j)} \sum_{i \in S_j} y_i - \mathbf{z}_j \mathbf{\beta}^{\intercal} \sum_{i \in S_j} y_i - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = n_j \bar{y}_{(j)}^2 - n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} - n_j \bar{y}_{(j)}^2 + n_j \bar{y}_{(j)} \mathbf{z}_j \mathbf{\beta}^{\intercal} \\ & = 0. \end{align*} \]

Therefore we may decompose the residual sum of squares as

\[ \begin{align*} RSS & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m \sum_{i \in S_j} \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \\ & = \sum_{j = 1}^m \sum_{i \in S_j} \left(y_i - \bar{y}_{(j)}\right)^2 + \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2. \end{align*} \]

The important property of this decomposition is that the first sum does not depend on \(\mathbf{\beta}\), so minimizing \(RSS\) with respect to \(\mathbf{\beta}\) is equivalent to minimizing the second sum. We see that this second sum can be written as

\[ \begin{align*} \sum_{j = 1}^m n_j \left(\bar{y}_{(j)} - \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 & = \sum_{j = 1}^m \left(\sqrt{n_j} \bar{y}_{(j)} - \sqrt{n_j} \mathbf{z}_j \mathbf{\beta}^{\intercal}\right)^2 \end{align*}, \]

which is exactly the residual sum of squares for regressing \(\sqrt{n_j} \bar{y}_{(j)}\) on \(\sqrt{n_j} \mathbf{z}_j\).

```
summ_reg_df = reg_df.groupby(('u', 'v', 'w')).y.mean().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_reg_df['n'] = n
```

`summ_reg_df.head()`

| | u | v | w | y | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | -2.173984 | 14 |
| 1 | 48 | 1 | -0.993981 | -4.174895 | 7 |
| 2 | 35 | 0 | 0.358156 | 1.252848 | 9 |
| 3 | 19 | 1 | -0.760298 | -6.612355 | 17 |
| 4 | 40 | 1 | -0.688514 | -4.379063 | 13 |

The design matrices for this summarized model are easy to construct using `patsy`.

```
y_summ, X_summ = dmatrices("""
I(np.sqrt(n) * y) ~
np.sqrt(n) + I(np.sqrt(n) * u) + I(np.sqrt(n) * v) + I(np.sqrt(n) * w) - 1
""",
data=summ_reg_df)
```

Note that we must remove `patsy`’s constant column for the intercept and replace it with `np.sqrt(n)`.

`summ_ols = sm.OLS(y_summ, X_summ).fit()`

`summ_ols.params`

`array([-2.99965783, 0.09998571, -3.99899718, 2.00031673])`

We see that the summarized regression produces the same parameter estimates as the full regression.

`np.allclose(full_ols.params, summ_ols.params)`

`True`
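The code above relies on `df` and `shuffled_ixs` defined earlier in the post. As a self-contained illustration of the identity we just derived (with made-up data, using plain NumPy instead of `patsy` and `statsmodels`), we can verify directly that weighting the group means by \(\sqrt{n_j}\) reproduces the full fit:

```python
import numpy as np

rng = np.random.RandomState(42)

# four distinct feature rows (intercept, u, v) and their multiplicities n_j
z = np.array([[1., 0., 0.],
              [1., 1., 2.],
              [1., 2., 1.],
              [1., 1., 0.]])
counts = np.array([50, 30, 80, 40])

# uncompress to the full design matrix and simulate noisy responses
X = np.repeat(z, counts, axis=0)
beta = np.array([1., -2., 0.5])
y = X @ beta + rng.normal(scale=0.1, size=X.shape[0])

# full OLS fit
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)

# summarized fit: regress sqrt(n_j) * ybar_j on sqrt(n_j) * z_j
edges = np.concatenate([[0], np.cumsum(counts)])
y_bar = np.array([y[a:b].mean() for a, b in zip(edges, edges[1:])])
w = np.sqrt(counts)
beta_summ, *_ = np.linalg.lstsq(z * w[:, None], y_bar * w, rcond=None)

print(np.allclose(beta_full, beta_summ))  # True: the two fits agree
```

Both problems share the normal equations \(\left(\mathbf{z}^{\intercal} \operatorname{diag}(n) \mathbf{z}\right) \mathbf{\beta} = \mathbf{z}^{\intercal} \operatorname{diag}(n) \bar{y}\), so the solutions coincide up to floating-point error.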

As a final example of adapting common methods to summarized data frames, we will show how to fit a logistic regression model on a summarized data set by maximum likelihood.

We will use the model

\[P(s = 1\ |\ \mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{x} \gamma^{\intercal})}.\]

As above, \(\mathbf{x}_i = [1\ u_i\ v_i\ w_i]\). The true value of \(\gamma\) is

`gamma = np.array([1., 0.01, -1., -2.])`

We now generate samples from this model.

`X = dmatrix('u + v + w', data=df)`

```
p = pd.Series(sp.special.expit(np.dot(X, gamma)), name='p')
s = pd.Series(sp.stats.bernoulli.rvs(p), name='s')
```

`logit_df = pd.concat((s, p, df), axis=1)`

`logit_df.head()`

| | s | p | u | v | w |
|---|---|---|---|---|---|
| 0 | 1 | 0.709963 | 97 | 0 | 0.537397 |
| 1 | 0 | 0.092560 | 79 | 1 | 1.536383 |
| 2 | 0 | 0.153211 | 44 | 1 | 1.074817 |
| 3 | 1 | 0.923609 | 51 | 0 | -0.491210 |
| 4 | 0 | 0.062077 | 47 | 1 | 1.592646 |

We first fit the logistic regression model to the full data frame.

`full_logit = sm.Logit(s, X).fit()`

```
Optimization terminated successfully.
Current function value: 0.414221
Iterations 7
```

`full_logit.params`

```
const 0.965283
x1 0.009944
x2 -0.966797
x3 -1.990506
dtype: float64
```

We see that the estimates are quite close to the true parameters.

The technique used to adapt maximum likelihood estimation of logistic regression to the summarized data frame is quite elegant. The likelihood for the full data set follows from the fact that, given \(u\), \(v\), and \(w\), \(s\) is Bernoulli distributed with

\[s_i\ |\ \mathbf{x}_i \sim \operatorname{Ber}\left(\frac{1}{1 + \exp(-\mathbf{x}_i \gamma^{\intercal})}\right).\]

To derive the likelihood for the summarized data set, we count the number of successes (where \(s = 1\)) for each unique combination of features \(\mathbf{z}_j\), and denote this quantity \(k_j\).

```
summ_logit_df = logit_df.groupby(('u', 'v', 'w')).s.sum().reset_index().iloc[shuffled_ixs].reset_index(drop=True).copy()
summ_logit_df = summ_logit_df.rename(columns={'s': 'k'})
summ_logit_df['n'] = n
```

`summ_logit_df.head()`

| | u | v | w | k | n |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0.425597 | 5 | 14 |
| 1 | 48 | 1 | -0.993981 | 7 | 7 |
| 2 | 35 | 0 | 0.358156 | 8 | 9 |
| 3 | 19 | 1 | -0.760298 | 12 | 17 |
| 4 | 40 | 1 | -0.688514 | 9 | 13 |

Now, instead of each row representing a single Bernoulli trial (as in the full data frame), each row represents \(n_j\) trials, so we have that \(k_j\) is (conditionally) Binomially distributed with

\[k_j\ |\ \mathbf{z}_j \sim \operatorname{Bin}\left(n_j, \frac{1}{1 + \exp(-\mathbf{z}_j \gamma^{\intercal})}\right).\]

`summ_logit_X = dmatrix('u + v + w', data=summ_logit_df)`

As I have shown in a previous post, we can use `statsmodels`’ `GenericLikelihoodModel` class to fit custom probability models by maximum likelihood. The model is implemented as follows.

```
class SummaryLogit(GenericLikelihoodModel):
    def __init__(self, endog, exog, n, **kwargs):
        """
        endog is the number of successes
        exog are the features
        n is the number of trials
        """
        self.n = n
        super(SummaryLogit, self).__init__(endog, exog, **kwargs)

    def nloglikeobs(self, gamma):
        """
        gamma is the vector of regression coefficients

        returns the negative log likelihood of each of the observations for
        the coefficients in gamma
        """
        p = sp.special.expit(np.dot(self.exog, gamma))

        return -sp.stats.binom.logpmf(self.endog, self.n, p)

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwargs):
        # wrap the GenericLikelihoodModel's fit method to set default start parameters
        if start_params is None:
            start_params = np.zeros(self.exog.shape[1])

        return super(SummaryLogit, self).fit(start_params=start_params,
                                             maxiter=maxiter, maxfun=maxfun, **kwargs)
```

`summ_logit = SummaryLogit(summ_logit_df.k, summ_logit_X, summ_logit_df.n).fit()`

```
Optimization terminated successfully.
Current function value: 1.317583
Iterations: 357
Function evaluations: 599
```

Again, we get reasonable estimates of the regression coefficients, which are close to those obtained from the full data set.

`summ_logit.params`

`array([ 0.96527992, 0.00994322, -0.96680904, -1.99051485])`

`np.allclose(summ_logit.params, full_logit.params, rtol=10**-4)`

`True`

Hopefully this introduction to the technique of summarizing data sets has proved useful and will allow you to explore medium data more easily in the future. We have only scratched the surface on the types of statistical techniques that can be adapted to work on summarized data sets, but with a bit of ingenuity, many of the ideas in this post can apply to other models.

This post is available as an IPython notebook here.

Tags: Statistics, Python

Ordinary least squares (OLS) is, without a doubt, the most well-known linear regression model. Despite its wide applicability, it often gives undesirable results when the data deviate from its underlying normal model. In particular, it is quite sensitive to outliers in the data. In this post, we will illustrate this sensitivity and then show that changing the error distribution results in a more robust regression model.

We will use one of the data sets from Anscombe’s quartet to illustrate these concepts. Anscombe’s quartet is a well-known group of four data sets that illustrates the importance of exploratory data analysis and visualization. In particular, we will use the third dataset from Anscombe’s quartet. This data set is shown below.

```
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import pymc3
from scipy import stats
import seaborn as sns
from statsmodels import api as sm
```

```
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5])
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73])
```

It is quite clear that this data set exhibits a highly linear relationship, with one outlier (when `x = 13`). Below, we show the results of two OLS models fit to this data. One is fit to all of the data, and the other is fit to the data with the outlier point removed.

`X = sm.add_constant(x)`

`ols_result = sm.OLS(y, X).fit()`

```
x_no_outlier = x[x != 13]
X_no_outlier = X[x != 13]
y_no_outlier = y[x != 13]
```

`no_outlier_result = sm.OLS(y_no_outlier, X_no_outlier).fit()`

One of the ways the OLS estimator can be derived is by minimizing the mean squared error (MSE) of the model on the training data. Below we show the MSE of both of these models on both the full data set and the data set without the outlier.

```
def mse(actual, predicted):
    return ((actual - predicted)**2).mean()
```

```
mse_df = pd.DataFrame({
        'Full data set': [
            mse(y, ols_result.predict(X)),
            mse(y, no_outlier_result.predict(X))
        ],
        'Data set without outlier': [
            mse(y_no_outlier, ols_result.predict(X_no_outlier)),
            mse(y_no_outlier, no_outlier_result.predict(X_no_outlier))
        ]
    },
    index=['Full model', 'No outlier model'])
mse_df = mse_df[mse_df.columns[::-1]]
```

`mse_df`

| | Full data set | Data set without outlier |
|---|---|---|
| Full model | 1.250563 | 0.325152 |
| No outlier model | 1.637640 | 0.000008 |

By simple visual inspection, we suspect that without the outlier (`x = 13`), the relationship between `x` and `y` is (almost) perfectly linear. This suspicion is confirmed by this model’s very small MSE on the reduced data set.

Unfortunately, we will usually have many more than eleven points in our data set. In reality, the outliers may be more difficult to detect visually, and they may be harder to exclude manually. We would like a model that performs reasonably well, even in the presence of outliers.

Before we define such a robust regression model, it will be helpful to consider the OLS model mathematically. In the OLS model, we have that

\[y_i = \vec{\beta} \cdot \vec{x}_i + \varepsilon_i.\]

Here, \(y_i\) is the observation corresponding to the feature vector \(\vec{x}_i\), \(\vec{\beta}\) is the vector of regression coefficients, and \(\varepsilon_i\) is noise. In the OLS model, the noise terms are independent with identical normal distributions. It is the properties of these normally distributed errors that make OLS susceptible to outliers.

The normal distribution is well-known to have thin tails. That is, it assigns relatively little probability to observations far away from the mean. Students of basic statistics are quite familiar with the fact that approximately 95% of the mass of the normal distribution lies within two standard deviations of the mean.

We find a robust regression model by choosing an error distribution with fatter tails; a common choice is Student’s t-distribution.
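To put a number on “thin” versus “fat” tails (a quick aside with `scipy`, not in the original post), we can compare the mass each distribution places beyond three standard units:

```python
from scipy import stats

# two-sided tail mass beyond +/- 3 under each distribution
normal_tail = 2 * stats.norm.sf(3.)
t5_tail = 2 * stats.t.sf(3., df=5)

print(normal_tail)  # about 0.0027
print(t5_tail)      # about 0.03
```

The t-distribution with five degrees of freedom puts roughly ten times as much probability beyond three standard units as the normal does, which is exactly what makes it forgiving of outliers.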

Below we define this model using `pymc3`.

```
with pymc3.Model() as model:
    # Regression coefficients
    alpha = pymc3.Uniform('alpha', -100, 100)
    beta = pymc3.Uniform('beta', -100, 100)

    # Expected value
    y_hat = alpha + beta * x

    # Observations with t-distributed error
    y_obs = pymc3.T('y_obs', nu=5, mu=y_hat, observed=y)
```

Here we have given our t-distributed residuals five degrees of freedom. Customarily, these models will use four, five, or six degrees of freedom. It is important that the number of degrees of freedom, \(\nu\), be relatively small, because as \(\nu \to \infty\), the t-distribution converges to the normal distribution.
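This convergence is easy to see numerically (an illustrative aside): the tail mass of the t-distribution shrinks toward the normal’s as the degrees of freedom grow.

```python
from scipy import stats

# one-sided tail mass P(T > 3) for increasing degrees of freedom
normal_tail = stats.norm.sf(3.)
t_tails = {nu: stats.t.sf(3., df=nu) for nu in (5, 30, 1000)}

print(normal_tail, t_tails)  # tail mass decreases toward the normal value
```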

We now fit this model using no-U-turn sampling.

```
with model:
    step = pymc3.NUTS()
    trace_ = pymc3.sample(3000, step)

burn = 1000
thin = 2
trace = trace_[burn::thin]
```

` [-----------------100%-----------------] 3000 of 3000 complete in 8.7 sec`

The following plots summarize the posterior distribution of the regression intercept (\(\alpha\)) and slope (\(\beta\)).

We now plot the robust model along with the previous models.

```
alpha = trace['alpha'].mean()
beta = trace['beta'].mean()
```

We see right away that, although the robust model has not completely captured the linear relationship between the non-outlier points, it is much less biased by the outlier than the OLS model on the full data set. Below we compare the MSE of this model to the previous models.

```
robust_mse = pd.Series([mse(y, alpha + beta * x), mse(y_no_outlier, alpha + beta * x_no_outlier)], index=mse_df.columns)
robust_mse.name = 'Robust model'
mse_df = mse_df.append(robust_mse)
```

`mse_df`

| | Full data set | Data set without outlier |
|---|---|---|
| Full model | 1.250563 | 0.325152 |
| No outlier model | 1.637640 | 0.000008 |
| Robust model | 1.432198 | 0.032294 |

On the data set without the outlier, the robust model has a significantly larger MSE than the no outlier model, but, importantly, its MSE is an order of magnitude smaller than that of the full model.

This post is available as an IPython notebook here.

Tags: Statistics, PyMC

Maximum likelihood estimation is a common method for fitting statistical models. In Python, it is quite possible to fit maximum likelihood models using just `scipy.optimize`. Over time, however, I have come to prefer the convenience provided by `statsmodels`’ `GenericLikelihoodModel`. In this post, I will show how easy it is to subclass `GenericLikelihoodModel` and take advantage of much of `statsmodels`’ well-developed machinery for maximum likelihood estimation of custom models.

The model we use for this demonstration is a zero-inflated Poisson model. This is a model for count data that generalizes the Poisson model by allowing for an overabundance of zero observations.

The model has two parameters, \(\pi\), the proportion of excess zero observations, and \(\lambda\), the mean of the Poisson distribution. We assume that observations from this model are generated as follows. First, a weighted coin with probability \(\pi\) of landing on heads is flipped. If the result is heads, the observation is zero. If the result is tails, the observation is generated from a Poisson distribution with mean \(\lambda\). Note that there are two ways for an observation to be zero under this model:

- the coin is heads, and
- the coin is tails, and the sample from the Poisson distribution is zero.

If \(X\) has a zero-inflated Poisson distribution with parameters \(\pi\) and \(\lambda\), its probability mass function is given by

\[\begin{align*} P(X = 0) & = \pi + (1 - \pi)\ e^{-\lambda} \\ P(X = x) & = (1 - \pi)\ e^{-\lambda}\ \frac{\lambda^x}{x!} \textrm{ for } x > 0. \end{align*}\]

In this post, we will use the parameter values \(\pi = 0.3\) and \(\lambda = 2\). The probability mass function of the zero-inflated Poisson distribution is shown below, next to a normal Poisson distribution, for comparison.

```
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from scipy import stats
import seaborn as sns
from statsmodels.base.model import GenericLikelihoodModel
np.random.seed(123456789)
```

```
pi = 0.3
lambda_ = 2.
```

```
def zip_pmf(x, pi=pi, lambda_=lambda_):
    if pi < 0 or pi > 1 or lambda_ <= 0:
        return np.zeros_like(x)
    else:
        return (x == 0) * pi + (1 - pi) * stats.poisson.pmf(x, lambda_)
```
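As a quick sanity check (an aside; the pmf is restated so the snippet is self-contained), this pmf sums to one over its support and assigns zero the probability \(\pi + (1 - \pi)\ e^{-\lambda}\) derived above:

```python
import numpy as np
from scipy import stats

pi, lambda_ = 0.3, 2.

def zip_pmf(x, pi=pi, lambda_=lambda_):
    # zero-inflated Poisson pmf, restated for self-containment
    return (x == 0) * pi + (1 - pi) * stats.poisson.pmf(x, lambda_)

support = np.arange(100)
print(zip_pmf(support).sum())  # sums to 1 (up to truncation of the support)
print(zip_pmf(np.array([0])))  # equals pi + (1 - pi) * exp(-lambda)
```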

First we generate 1,000 observations from the zero-inflated model.

```
N = 1000
inflated_zero = stats.bernoulli.rvs(pi, size=N)
x = (1 - inflated_zero) * stats.poisson.rvs(lambda_, size=N)
```

We are now ready to estimate \(\pi\) and \(\lambda\) by maximum likelihood. To do so, we define a class that inherits from `statsmodels`’ `GenericLikelihoodModel` as follows.

```
class ZeroInflatedPoisson(GenericLikelihoodModel):
    def __init__(self, endog, exog=None, **kwds):
        if exog is None:
            exog = np.zeros_like(endog)

        super(ZeroInflatedPoisson, self).__init__(endog, exog, **kwds)

    def nloglikeobs(self, params):
        pi = params[0]
        lambda_ = params[1]

        return -np.log(zip_pmf(self.endog, pi=pi, lambda_=lambda_))

    def fit(self, start_params=None, maxiter=10000, maxfun=5000, **kwds):
        if start_params is None:
            lambda_start = self.endog.mean()
            excess_zeros = (self.endog == 0).mean() - stats.poisson.pmf(0, lambda_start)
            start_params = np.array([excess_zeros, lambda_start])

        return super(ZeroInflatedPoisson, self).fit(start_params=start_params,
                                                    maxiter=maxiter, maxfun=maxfun, **kwds)
```

The key component of this class is the method `nloglikeobs`, which returns the negative log likelihood of each observed value in `endog`. Secondarily, we must also supply reasonable initial guesses of the parameters in `fit`. Obtaining the maximum likelihood estimate is now simple.

```
model = ZeroInflatedPoisson(x)
results = model.fit()
```

```
Optimization terminated successfully.
Current function value: 1.586641
Iterations: 37
Function evaluations: 70
```

We see that we have estimated the parameters fairly well.

```
pi_mle, lambda_mle = results.params
pi_mle, lambda_mle
```

`(0.31542487710071976, 2.0451304204850853)`

There are many advantages to buying into the `statsmodels` ecosystem and subclassing `GenericLikelihoodModel`. The already-written `statsmodels` code handles storing the observations and the interaction with `scipy.optimize` for us. (It is possible to control the use of `scipy.optimize` through keyword arguments to `fit`.)

We also gain access to many of `statsmodels`’ built-in model analysis tools. For example, we can use bootstrap resampling to estimate the variation in our parameter estimates.

```
boot_mean, boot_std, boot_samples = results.bootstrap(nrep=500, store=True)
boot_pis = boot_samples[:, 0]
boot_lambdas = boot_samples[:, 1]
```

The next time you are fitting a model using maximum likelihood, try integrating with `statsmodels` to take advantage of the significant amount of work that has gone into its ecosystem.

This post is available as an IPython notebook here.

Tags: Statistics, Python

Logistic regression is perhaps one of the most commonly used tools in all of statistics. While I have mathematically understood and used logistic regression for quite some time, it took me much longer to develop intuition for it. It was only a few months ago that I learned about the connection between logistic regression and utility theory.

Suppose a person must decide between two options, \(Y = 0\) or \(Y = 1\). Utility theory posits that each option has an associated utility, \(U_0\) and \(U_1\), respectively. The person chooses the option with the larger utility. My intuition for logistic regression comes from understanding that it arises from a specific utility model.

In logistic regression, we are interested in modeling the effect of a covariate, \(X\), on the person’s choice. In this post, we will work with the logistic regression model

\[\begin{align*} P(Y = 1 | X = x) & = \frac{e^x}{1 + e^x}. \end{align*}\]

This model is shown graphically below.

```
from __future__ import division
from matplotlib import pyplot as plt
import numpy as np
from scipy.stats import gumbel_r
def logistic_probability(x):
    exp = np.exp(x)
    return exp / (1 + exp)
```

In terms of utility theory, we assume that each person’s utilities are related to \(X\) by

\[\begin{align*} U_0 | X & = \beta_0 X + \varepsilon_0 \\ U_1 | X & = \beta_1 X + \varepsilon_1 \end{align*}.\]

Here, \(\beta_i X\) represents the observed utility of option \(Y = i\) (\(i = 0, 1\)), and \(\varepsilon_i\) represents the random fluctuation in the option’s utility. Logistic regression arises from the assumption that \(\varepsilon_0\) and \(\varepsilon_1\) are independent with Gumbel distributions.

The Gumbel distribution has the density function

\[\begin{align*} f(\varepsilon) & = e^{-\varepsilon}\ \exp(-e^{-\varepsilon}). \end{align*}\]

To show that this utility model gives rise to logistic regression, we see that

\[\begin{align*} P(Y = 1 | X = x) & = P(U_1 > U_0 | X = x) \\ & = P(\beta_1 x + \varepsilon_1 > \beta_0 x + \varepsilon_0) \\ & = P((\beta_1 - \beta_0) x > \varepsilon_0 - \varepsilon_1). \end{align*}\]

It is useful to note that the difference of independent identically distributed Gumbel random variables follows a logistic distribution, so

\[f(\Delta) = \frac{e^{-\Delta}}{(1 + e^{-\Delta})^2},\]

where \(\Delta = \varepsilon_1 - \varepsilon_0\).
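(This distributional fact is easy to confirm by simulation, a quick sketch not in the original post: the empirical CDF of the difference of two independent Gumbel draws matches the standard logistic CDF.)

```python
import numpy as np
from scipy.stats import gumbel_r, logistic

rng = np.random.RandomState(1)

# difference of two independent standard Gumbel draws
delta = (gumbel_r.rvs(size=200000, random_state=rng)
         - gumbel_r.rvs(size=200000, random_state=rng))

# compare the empirical CDF to the standard logistic CDF at a few points
for point in (-2., 0., 2.):
    print((delta < point).mean(), logistic.cdf(point))
```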

Therefore

\[\begin{align*} P(Y = 1 | X = x) & = P((\beta_1 - \beta_0) x > -\Delta) \\ & = \int_{-(\beta_1 - \beta_0) x}^\infty \frac{e^{-\Delta}}{(1 + e^{-\Delta})^2}\ d\Delta. \end{align*}\]

Making the substitution \(t = 1 + e^{-\Delta}\), with \(dt = -e^{-\Delta}\ d\Delta\), we get that

\[\begin{align*} P(Y = 1 | X = x) & = \int_1^{1 + \exp((\beta_1 - \beta_0) x)} t^{-2}\ dt \\ & = \left.t^{-1}\right|_{1 + \exp((\beta_1 - \beta_0) x)}^1 \\ & = 1 - \frac{1}{{1 + \exp((\beta_1 - \beta_0) x)}} \\ & = \frac{{\exp((\beta_1 - \beta_0) x)}}{{1 + \exp((\beta_1 - \beta_0) x)}}. \end{align*}\]

Recalling that

\[\begin{align*} P(Y = 1 | X = x) & = \frac{e^{x}}{1 + e^{x}}, \end{align*}\]

we must have that \(\beta_1 - \beta_0 = 1\).

The fact that there are infinitely many solutions of this equation is a subtle but important point of utility theory, that the difference in utility is both location and scale invariant. We will prove this statement in two parts.

- For any \(\mu\), let \(\tilde{U}_i = U_i + \mu\) for \(i = 0, 1\). Then \[\begin{align*} \tilde{U_1} - \tilde{U_0} & = U_1 + \mu - (U_0 + \mu) = U_1 - U_0, \end{align*}\] so \(\tilde{U_1} - \tilde{U_0} > 0\) if and only if \(U_1 - U_0 > 0\), and therefore the difference in utility is location invariant.
- For any \(\sigma > 0\), let \(\tilde{U}_i = \frac{1}{\sigma} U_i\) for \(i = 0, 1\). Then \[\begin{align*} \tilde{U_1} - \tilde{U_0} & = \frac{1}{\sigma}\left(U_1 - U_0\right), \end{align*}\] so \(\tilde{U_1} - \tilde{U_0} > 0\) if and only if \(U_1 - U_0 > 0\), and therefore the difference in utility is scale invariant.

Together, these invariances show that the units of utility are irrelevant, and the only quantity that matters is the difference in utilities. Due to the location invariance in utilities, we may as well set \(\beta_0 = 0\), so \(\beta_1 = 1\) for convenience. Our utility model is therefore

\[\begin{align*} U_0 & = \varepsilon_0 \\ U_1 & = x + \varepsilon_1 \end{align*}.\]

To verify that this utility model is equivalent to logistic regression in a second way, we will simulate \(P(Y = 1 | X = x)\) by generating Gumbel random variables.

```
N = 500
eps0 = gumbel_r.rvs(size=(N, n))
eps1 = gumbel_r.rvs(size=(N, n))
U0 = eps0
U1 = xs + eps1
simulated_ps = (U1 > U0).mean(axis=0)
```

The red simulated curve is reasonably close to the black actual curve. (Increasing `N` would cause the red curve to converge to the black one.)

Aside from being an important tool in econometrics, utility theory helps shed light on logistic regression. The perspective it provides on logistic regression opens the door to generalization and related theories. If the random portions of utility, \(\varepsilon_1\) and \(\varepsilon_0\), are normally distributed instead of Gumbel distributed, the utility model gives rise to probit regression. For a thorough introduction to utility/choice theory, consult the excellent book *Discrete Choice Models with Simulation* by Kenneth Train, which is freely available online.
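The probit connection can be checked the same way as the Gumbel simulation above. With independent standard normal errors, \(\varepsilon_0 - \varepsilon_1 \sim N(0, 2)\), so \(P(Y = 1\ |\ X = x) = \Phi\left(x / \sqrt{2}\right)\); a minimal sketch:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.RandomState(2)
N = 200000
x = 1.  # an arbitrary covariate value

# utilities with independent standard normal noise instead of Gumbel noise
U0 = rng.normal(size=N)
U1 = x + rng.normal(size=N)

# the simulated choice probability matches the probit probability
print((U1 > U0).mean(), norm.cdf(x / np.sqrt(2)))
```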

Tags: Statistics, Econometrics