# Probabilistic Programming for Sports Analytics¶

## @AustinRochford¶

### arochford@monetate.com • austin.rochford@gmail.com¶

• Founded 2008, web optimization and personalization SaaS
• Observed 5B impressions and \$4.1B in revenue during Cyber Week 2017

## Agenda¶

• Sports analytics
• Heuristics and uncertainty
• First steps
• Batting average
• Save percentage
• NBA foul calls
• Future work

# Sports Analytics¶

## Heuristics and Uncertainty¶

The basic blueprint for predicting a playerâ€™s upcoming season from his past is the Marcel method, introduced by prominent baseball and hockey analyst Tom Tango back in 2005. Itâ€™s a three-step process that

1. starts with a weighted average of a playerâ€™s three most recent seasons;
2. regresses that playerâ€™s performance toward the league mean, based on games played (weâ€™ll explain why and how in a moment); and
3. applies an age adjustment to account for developing rookies and declining veterans. With regard to the first step, Tango proposes a weighting of 5-4-3, while I prefer a 4-2-1 approach that assumes that every seasonâ€™s data has twice the predictive power as the season previous.

Vollman, Rob, Awad, Tom, Fyffe, Iain. Hockey Abstract Presents Stat Shot. ECW Press. Kindle Edition.

# First Steps¶

## Batting Average¶

In [10]:
mlb_df = load_data(get_data_url('2018_batting.csv'), 'Name', ['AB', 'H'])

In [11]:
mlb_df.head()

Out[11]:
name ab h
player_id
abreujo02 Jose Abreu 499 132
acunaro01 Ronald Acuna 433 127
In [13]:
def hierarchical_normal(name, shape, Î¼=None):
if Î¼ is None:
Î¼ = pm.Normal(f"Î¼_{name}", 0., 5.)

Î” = pm.Normal(f"Î”_{name}", shape=shape)
Ïƒ = pm.HalfNormal(f"Ïƒ_{name}", 2.5)

return pm.Deterministic(name, Î¼ + Î” * Ïƒ)

In [14]:
with pm.Model() as mlb_model:
Î· = hierarchical_normal("Î·", n_player)
ba = pm.Deterministic("ba", pm.math.sigmoid(Î·))

hits = pm.Binomial("hits", batter_df['ab'], ba, observed=batter_df['h'])

In [16]:
with mlb_model:
mlb_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [Ïƒ_Î·, Î”_Î·, Î¼_Î·]
Sampling 3 chains: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 6000/6000 [00:49<00:00, 120.10draws/s]

In [17]:
az.plot_energy(mlb_trace);

In [18]:
(az.rhat(mlb_trace).max().to_array().max())

Out[18]:
<xarray.DataArray ()>
array(1.01)

### Mike Trout¶

In [20]:
fig

Out[20]:

### Mike Trout vs. Bryce Harper¶

In [22]:
fig

Out[22]:
In [24]:
fig

Out[24]:
In [25]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, trout_ix]).mean()

Out[25]:
0.96666666666666667

### Bryce Harper vs. Rhys Hoskins¶

In [27]:
fig

Out[27]:
In [28]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, hoskins_ix]).mean()

Out[28]:
0.45700000000000002

### Uncertainty in batting average¶

In [32]:
ax.figure

Out[32]:

## Save Percentage¶

In [33]:
nhl_df = load_data(get_data_url('2017_2018_goalies.csv'), 'Player', ['SA', 'SV'])

In [34]:
nhl_df.head()

Out[34]:
name sa sv
player_id
allenja01 Jake Allen 1614 1462
andercr01 Craig Anderson 1768 1588
anderfr01 Frederik Andersen 2211 2029
appleke01 Ken Appleby 55 52
bernijo01 Jonathan Bernier 1092 997
In [36]:
with pm.Model() as nhl_model:
Î· = hierarchical_normal("Î·", n_goalie)
svp = pm.Deterministic("svp", pm.math.sigmoid(Î·))

saves = pm.Binomial("saves", nhl_df['sa'], svp, observed=nhl_df['sv'])

In [37]:
with nhl_model:
nhl_trace = pm.sample(nuts_kwargs={'target_accept': 0.9}, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [Ïƒ_Î·, Î”_Î·, Î¼_Î·]
Sampling 3 chains: 100%|â–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆâ–ˆ| 6000/6000 [00:21<00:00, 276.80draws/s]


### Sergei Bobrovsky¶

In [41]:
fig

Out[41]:

### Uncertainty in save percentage¶

In [44]:
ax.figure

Out[44]: