Probabilistic Programming for Sports Analytics¶

Bayesian Data Analysis Meetup • April 18, 2019 • New York City¶

@AustinRochford ¶

About Me¶

Chief Data Scientist @ Monetate • PyMC3 contributor¶

@AustinRochford • austinrochford.com • github.com/AustinRochford ¶

arochford@monetate.com • austin.rochford@gmail.com ¶

About Monetate¶

Founded 2008, web optimization and personalization SaaS
Observed 5B impressions and $4.1B in revenue during Cyber Week 2017

monetate.com/about/careers

Agenda¶

Sports analytics
- Heuristics and uncertainty
First steps
- Batting average
- Save percentage
NBA foul calls
Future work

Sports Analytics¶

Heuristics and Uncertainty¶

The basic blueprint for predicting a player’s upcoming season from his past is the Marcel method, introduced by prominent baseball and hockey analyst Tom Tango back in 2005. It’s a three-step process that

starts with a weighted average of a player’s three most recent seasons;

regresses that player’s performance toward the league mean, based on games played (we’ll explain why and how in a moment); and

applies an age adjustment to account for developing rookies and declining veterans. With regard to the first step, Tango proposes a weighting of 5-4-3, while I prefer a 4-2-1 approach that assumes that every season’s data has twice the predictive power as the season previous.

Vollman, Rob, Awad, Tom, Fyffe, Iain. Hockey Abstract Presents Stat Shot. ECW Press. Kindle Edition.

Bayesian Bagging to Generate Uncertainty Intervals: A Catcher Framing Story

First Steps¶

Batting Average¶

In [10]:

mlb_df = load_data(get_data_url('2018_batting.csv'), 'Name', ['AB', 'H'])

In [11]:

mlb_df.head()

Out[11]:

	name	ab	h
player_id
abreujo02	Jose Abreu	499	132
acunaro01	Ronald Acuna	433	127
adamewi01	Willy Adames	288	80
adamja01	Jason Adam	0	0
adamsau02	Austin L. Adams	0	0

In [13]:

def hierarchical_normal(name, shape, μ=None):
    if μ is None:
        μ = pm.Normal(f"μ_{name}", 0., 5.)
    
    Δ = pm.Normal(f"Δ_{name}", shape=shape)
    σ = pm.HalfNormal(f"σ_{name}", 2.5)
    
    return pm.Deterministic(name, μ + Δ * σ)

In [14]:

with pm.Model() as mlb_model:
    η = hierarchical_normal("η", n_player)
    ba = pm.Deterministic("ba", pm.math.sigmoid(η))
    
    hits = pm.Binomial("hits", batter_df['ab'], ba, observed=batter_df['h'])

In [16]:

with mlb_model:
    mlb_trace = pm.sample(**SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:49<00:00, 120.10draws/s]

In [17]:

az.plot_energy(mlb_trace);

In [18]:

(az.rhat(mlb_trace).max().to_array().max())

Out[18]:

<xarray.DataArray ()>
array(1.01)

Mike Trout¶

In [20]:

fig

Out[20]:

Mike Trout vs. Bryce Harper¶

In [22]:

fig

Out[22]:

In [24]:

fig

Out[24]:

In [25]:

(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, trout_ix]).mean()

Out[25]:

0.96666666666666667

Bryce Harper vs. Rhys Hoskins¶

In [27]:

fig

Out[27]:

In [28]:

(mlb_trace['ba'][:, harper_ix] < mlb_trace['ba'][:, hoskins_ix]).mean()

Out[28]:

0.45700000000000002

Uncertainty in batting average¶

In [32]:

ax.figure

Out[32]:

Save Percentage¶

In [33]:

nhl_df = load_data(get_data_url('2017_2018_goalies.csv'), 'Player', ['SA', 'SV'])

In [34]:

nhl_df.head()

Out[34]:

	name	sa	sv
player_id
allenja01	Jake Allen	1614	1462
andercr01	Craig Anderson	1768	1588
anderfr01	Frederik Andersen	2211	2029
appleke01	Ken Appleby	55	52
bernijo01	Jonathan Bernier	1092	997

In [36]:

with pm.Model() as nhl_model:
    η = hierarchical_normal("η", n_goalie)
    svp = pm.Deterministic("svp", pm.math.sigmoid(η))
    
    saves = pm.Binomial("saves", nhl_df['sa'], svp, observed=nhl_df['sv'])

In [37]:

with nhl_model:
    nhl_trace = pm.sample(nuts_kwargs={'target_accept': 0.9}, **SAMPLE_KWARGS)

Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (3 chains in 3 jobs)
NUTS: [σ_η, Δ_η, μ_η]
Sampling 3 chains: 100%|██████████| 6000/6000 [00:21<00:00, 276.80draws/s]

Sergei Bobrovsky¶

In [41]:

fig

Out[41]:

Uncertainty in save percentage¶

In [44]:

ax.figure

Out[44]: