Probabilistic Programming for Sports Analytics

Bayesian Data Analysis Meetup • April 18, 2019 • New York City

@AustinRochford

About Monetate

  • Founded 2008, web optimization and personalization SaaS
  • Observed 5B impressions and $4.1B in revenue during Cyber Week 2017

Agenda

  • Sports analytics
    • Heuristics and uncertainty
  • First steps
    • Batting average
    • Save percentage
  • NBA foul calls
  • Future work

Sports Analytics

Heuristics and Uncertainty

The basic blueprint for predicting a player’s upcoming season from his past is the Marcel method, introduced by prominent baseball and hockey analyst Tom Tango back in 2005. It’s a three-step process that

  1. starts with a weighted average of a player’s three most recent seasons;
  2. regresses that player’s performance toward the league mean, based on games played (we’ll explain why and how in a moment); and
  3. applies an age adjustment to account for developing rookies and declining veterans. With regard to the first step, Tango proposes a weighting of 5-4-3, while I prefer a 4-2-1 approach that assumes that every season’s data has twice the predictive power as the season previous.

Vollman, Rob, Awad, Tom, Fyffe, Iain. Hockey Abstract Presents Stat Shot. ECW Press. Kindle Edition.

First Steps

Batting Average

In [10]:
mlb_df = load_data(get_data_url('2018_batting.csv'), 'Name', ['AB', 'H'])
In [11]:
mlb_df.head()
Out[11]:
name ab h
player_id
abreujo02 Jose Abreu 499 132
acunaro01 Ronald Acuna 433 127
adamewi01 Willy Adames 288 80
adamja01 Jason Adam 0 0
adamsau02 Austin L. Adams 0 0
In [13]:
def hierarchical_normal(name, shape, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 5.)
    
    Δ = pm.Normal(f"Δ_{name}", shape=shape)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)
    
    return pm.Deterministic(name, μ + Δ * σ)
In [14]:
with pm.Model() as mlb_model:
    η = hierarchical_normal("η", n_player)
    ba = pm.Deterministic("ba", pm.math.sigmoid(η))
    
    hits = pm.Binomial("hits", batter_df['ab'], ba, observed=batter_df['h'])
In [16]:
with mlb_model:
    mlb_trace = pm.sample(**SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:49<00:00, 120.10draws/s]
In [17]:
az.plot_energy(mlb_trace);
In [18]:
(az.rhat(mlb_trace).max().to_array().max())
Out[18]:
<xarray.DataArray ()>
array(1.01)

Mike Trout

In [20]:
fig
Out[20]:

Mike Trout vs. Bryce Harper

In [22]:
fig
Out[22]:
In [24]:
fig
Out[24]:
In [25]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, trout_ix]).mean()
Out[25]:
0.96666666666666667

Bryce Harper vs. Rhys Hoskins

In [27]:
fig
Out[27]:
In [28]:
(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, hoskins_ix]).mean()
Out[28]:
0.45700000000000002

Uncertainty in batting average

In [32]:
ax.figure
Out[32]:

Save Percentage

In [33]:
nhl_df = load_data(get_data_url('2017_2018_goalies.csv'), 'Player', ['SA', 'SV'])
In [34]:
nhl_df.head()
Out[34]:
name sa sv
player_id
allenja01 Jake Allen 1614 1462
andercr01 Craig Anderson 1768 1588
anderfr01 Frederik Andersen 2211 2029
appleke01 Ken Appleby 55 52
bernijo01 Jonathan Bernier 1092 997
In [36]:
with pm.Model() as nhl_model:
    η = hierarchical_normal("η", n_goalie)
    svp = pm.Deterministic("svp", pm.math.sigmoid(η))
    
    saves = pm.Binomial("saves", nhl_df['sa'], svp, observed=nhl_df['sv'])
In [37]:
with nhl_model:
    nhl_trace = pm.sample(nuts_kwargs={'target_accept': 0.9}, **SAMPLE_KWARGS)
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:21<00:00, 276.80draws/s]

Sergei Bobrovsky

In [41]:
fig
Out[41]:

Uncertainty in save percentage

In [44]:
ax.figure
Out[44]: